From YouTube: How Elastic IPFS content is discoverable in the DHT - @ShogunPanda, @alanshaw - Content Routing 1
Description
How Elastic IPFS content is discoverable in the DHT - presented by @ShogunPanda and @alanshaw at IPFS þing 2022 - Content Routing 1: Performance - https://2022.ipfs-thing.io
A: Hi everyone, I'm Alan, and Paolo and I are double-teaming this talk, so it's going to be good fun. This talk is about how all of the provider records for Elastic IPFS content are discoverable in the DHT, so we're talking about all of the content that was ever uploaded to web3.storage and nft.storage.
A: First of all, let's talk a little bit about why we're talking about discovering content. The problem we have is that we're at scale: nft.storage and web3.storage have over 90 million uploads between them. That's over one and a half billion blocks. We have over 40 IPFS nodes in clusters trying to write provider records to the DHT. It's just a lot of stuff, and so we bought ourselves some time by turning on the accelerated DHT.
A
When
that
became
available,
we
we
switched
our
data
store
from
badger
to
flat
fs
for
to
eke
out
that
extra
performance.
We
also
recently
considered
actually
just
changing
our
providing
strategy
to
just
roots,
but
we
kind
of
didn't
really
want
to
do
that,
and
luckily
we
wouldn't
have
to
there's
also
wider,
as
I
think
will
touched
upon
as
well.
A
There's
wider
concerns
about
being
a
good
citizen
on
the
ipfs
network,
like
the
number
of
records,
we're
expecting
other
peers
to
store,
as
well
as
like
the
bandwidth
costs,
to
kind
of
send
in
send
them
to
them
and
continually
do
that.
It's
quite
it's
quite
a
lot.
A
I
think
will
said
it
was
like
four
meg
per
pier
of
just
provider
records
that
doesn't
seem
cool,
and
so
what
we
found-
and
it's
not
super
surprising-
is
that
at
some
point
there's
just
too
many
cids
for
a
single
node
to
provide
to
the
dht,
and
it's
also
kind
of
annoying
to
pinpoint
that
tipping
point.
A
If
you've
got
like
popular
content
and
your
node's
gonna
be
really
busy
with
reading
stuff
as
well
like
it
depends
on
things
stupid
things
like
systems
like
the
system
file
system
type
the
like,
if
you
rated
your
disks,
if
you
got
like-
and
it
varies
from
disk
to
disk
and
stuff
like
likes
like
that-
which
makes
it
really
difficult
to
know
when
you've
put
too
much
stuff
on
your
node,
but
in
general,
just
don't
let
them
get
too
big.
So
this
is
a
it's
not
super
easy
to
see
on
the
screen.
A
Unfortunately,
but
it's
a
graph
of
the
percentage
of
dht
records,
found
that
say
that
this
particular
peer
has
this
content
and
it's
over
a
period
of
about
a
month
earlier
this
year
and
it's
for
one
of
the
oldest
nodes
in
the
cluster.
That
node
has
got
about
70
terabytes
of
data
on
it,
and
it's
not
doing
great
they're,
very
low
low
percentages
of
found
provider
records
in
the
dht
and
just
so
you're
clear,
like
we're
checking
for
for
content.
A
We
know
is
stored
on
this
node,
so
we're
expecting
to
see
a
dht
record
on
the
dht,
the
y-axis.
Only
goes
up
to
55,
so
yeah
that
node
is
not
having
a
good
time.
A
So,
conversely,
with
like
20
terabytes
of
data,
things
are
a
bit
better,
still
not
great.
It's
got
some
brief
periods
of
awesomeness,
but
it's
mostly
bad
and
you
can
sort
of
see
this
slow
decline.
As
the
disk
fills
up,
it
gets
worse
and
worse
at
being
able
to
provide
stuff
to
the
dht,
so
there
you
are
so
for
comparison.
This
is
elastic
ibfs.
This
is
when
this
is.
This
is
happy
days.
A
This
is
when
the
and
nodes
came
online
and
started
schlepping
up
all
of
the
advertisements
that
we'd
been
writing
and
just
to
be
clear.
This
is
all
of
the
data
like
we
have
or
everything
and
it's
we
went
from.
Nearly
zero
percent
found
dht
records
to
like
almost
100
all
the
time,
and
it's
stayed
this
way
ever
since,
and
it's
continually
slurping
up
stuff.
This
is
like
hundreds
of
terabytes
of
data
billions
and
billions
of
cids
and
yeah.
So
it's
things
are
good
at
the
moment.
A
That's
really
cool,
but
how
does
elastic
ipfs
achieve
this?
Well,
it
makes
use
of
the
indexer
nodes
and
they
are
purpose-built
to
map
cids
to
content
providers
for
the
scale
of
the
file
coin
network
and
we're
we're,
I
guess,
relatively
small
comp
compared
to
the
the
whole
of
the
file
coin
network,
but
essentially
indexer
nodes
work
by
you
ask
them
like
who
has
a
cid
and
it
tells
you
who
has
it
pretty
simple
right.
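As a concrete sketch of that request/response, here is roughly what an indexer lookup can look like over HTTP. cid.contact is the public network indexer; the response shape below is simplified for illustration and should not be read as the authoritative schema.

```typescript
// Minimal sketch: ask a network indexer who provides a CID.
// The response shape is a simplified illustration, not the
// authoritative schema.

interface ProviderResult {
  Provider: { ID: string; Addrs: string[] }; // peer ID + multiaddrs
}

interface FindResponse {
  MultihashResults: { ProviderResults: ProviderResult[] }[];
}

async function findProviders(cid: string): Promise<ProviderResult[]> {
  const res = await fetch(`https://cid.contact/cid/${cid}`);
  if (!res.ok) throw new Error(`indexer lookup failed: ${res.status}`);
  const body = (await res.json()) as FindResponse;
  return body.MultihashResults.flatMap((r) => r.ProviderResults);
}

// Usage: print the peer IDs that claim to have the content.
findProviders('bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi')
  .then((providers) => providers.forEach((p) => console.log(p.Provider.ID)));
```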
A
Obviously,
someone
has
to
tell
the
index
of
nodes
before
you
ask
them
who
who
has
it
and
there
are
actually
multiple
ways
to
tell
an
index
of
node.
I
mean
two
ways
that
you
have
a
cid,
and
so
paulo
is
going
to
tell
you
specifically
how
elastic
ipfs
does
it.
B
Hello
again
folks,
so
as
alan
was
anticipating,
we
have
two
ways
to
ingest
data
to
the
indexer
node
using
the
direct
api
they
provide.
The
first
one
is
based
on
lib
p2p
and
gossip
sub,
while
the
other
one
is
a
two
phases:
http
based
one,
let's
analyze
the
first
one
leap,
b2p
and
gossip
sub
right
now.
Let
me
ask
you
one
question:
how
much
money
do
you
have?
B
So
will
impact
costs
as
well,
so
we're
gonna
throw
a
lot
of
money
out
of
nothing
basically,
but
there's
even
even
more,
even
if
even
lsc
you
have
infrit
money,
there
are
technical
limits
that
cannot
be
bypassed
very
easily.
The
main
one
is
that,
as
francisco
was
anticipating
this
morn
this
morning,
is
that
eipfs
from
the
outside
is
just
I
regular
node
like
every
anybody
else's
one
peer
id,
which
seems
to
be
just
a
machine,
but
actually
our
tons
of
machine
right.
B
The
implication
of
that
is
is
that
there
is
an
optimization
in
the
lib
p2p
stack.
Is
that
says
that
if
you
try
to
connect
to
a
destination
to
the
same
destination
from
two
different
sources
which
which
share
the
same
peer
id
lib
p2p
will
try
to
merge
the
streams
in
a
single
connection
and
drop
the
other
one?
Therefore,
the
communication
will
be
dropped
and
not
effective.
We
cannot
support
that.
So
let
me
introduce
you.
B
A
very
old
friend
which
is
http,
that's
lightweight
is
is
everywhere
everywhere
is
stateless
and
the
the
long-living
connection,
which
is
basically
http
live,
are
opt-in.
So
if
you
want
to
use
it
whether
this
also
applies
on
the
client
on
the
server,
the
server
can
refuse
a
keep
a
live
connection
from
a
client
if
they
want
to.
So
that's
the
ideal
situation
from
cloud
environments
because
it's
very
lightweight
and
you
can
drop
as
soon
as
you
need
it.
So
we
use
that
right
now.
B
How
do
we
use
that?
The
ingest
api
of
the
indexer
nodes
requires
us
to
provide
the
advertisements
and
entries
data
over
http?
That's
it!
That's
all
they
ask
us.
Moreover,
we
need
to
maintain
an
additional
head
link
to
the
latest
advertisement
that
that
has
been
published,
because
the
entire
idea
behind
this
indicator
node,
is
that
there
are
blockchain
approach
to
this
thing,
so
you
can
reconstruct
to
the
hell
to
to
the
from
the
edge
to
the
tail
of
the
cube
anytime.
B
So
if
you're
not
this
gone
drops
you
do
your
disks
damage
and
so
forth.
You
can
reconstruct
from
the
beginning.
Now,
I'm
not
saying
you
can
really
do
that
because,
in
order
to
reconstruct
billions
of
records,
you
might
probably
will
take
your
centuries,
but,
theoretically
speaking,
you
could,
if
you
want
to
really
waste
a
lot
of
time
there
anyway,
once
you
have
all
the
data
available,
the
http,
all
you
need
to
know
is
that
make
a
put
request
to
that.
B
Slash
in
the
ingest,
slash
announce
route
and
which
basically
will
signal
the
index
or
not
to
say
look.
I
have
some
data,
please
get
it
now.
B: This last thing is also the reason why, for those who attended Francisco's talk this morning, right after publishing data through web3.storage you cannot immediately fetch it using the DHT: we cannot predict when the indexer node will actually fetch the new advertisements and therefore publish them on the DHT. That's something we can't control. Usually it happens fast.
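A minimal sketch of that announce step follows. The /ingest/announce path comes from the talk itself; the indexer URL and the request body shape (head advertisement CID plus fetch addresses) are assumptions for illustration, not the exact wire format.

```typescript
// Sketch of the announce step described above: one PUT that tells the
// indexer node "I have new advertisements, come fetch them".
// INDEXER_URL and the body shape are illustrative assumptions.

const INDEXER_URL = 'https://indexer.example.com'; // hypothetical endpoint

async function announce(headAdvertisementCid: string, providerAddrs: string[]) {
  const res = await fetch(`${INDEXER_URL}/ingest/announce`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    // Illustrative payload: the CID of the latest advertisement and
    // where to fetch the advertisement chain from over HTTP.
    body: JSON.stringify({ Cid: headAdvertisementCid, Addrs: providerAddrs }),
  });
  if (!res.ok) throw new Error(`announce failed: ${res.status}`);
}
```

The indexer then fetches the chain on its own schedule, which is why new uploads are not instantly resolvable: the provider can announce, but cannot force an immediate ingestion.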
B: Yes, yeah. So what the indexer node does on its side is what we said here: one, fetch /head and see what the latest advertisement now is; two, fetch this advertisement; three, analyze the entries which this advertisement points to; then repeat, repeat, repeat, until you meet an advertisement that you have already processed, which is basically the old head of the queue. And then you're done: all the data is now available in the DHT, via a mechanism that will be explained later on; I don't want to spoil anything.
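In pseudocode, that walk might look like the following minimal sketch, where fetchHead, fetchAdvertisement, and indexEntries are hypothetical helpers standing in for the real HTTP fetches and indexing logic.

```typescript
// Sketch of the indexer-side sync walk described above. The helpers are
// hypothetical stand-ins for HTTP requests against the provider.

interface Advertisement {
  cid: string;          // CID of this advertisement
  previous?: string;    // CID of the previous advertisement in the chain
  entriesCid: string;   // CID of the entries file it points to
}

declare function fetchHead(): Promise<string>;                    // GET /head
declare function fetchAdvertisement(cid: string): Promise<Advertisement>;
declare function indexEntries(entriesCid: string): Promise<void>;

async function sync(lastProcessedHead?: string): Promise<string> {
  const head = await fetchHead();                 // 1. what is the latest ad?
  let cursor: string | undefined = head;
  // Walk backwards until we hit the advertisement we already processed
  // (the old head of the queue), indexing entries along the way.
  while (cursor && cursor !== lastProcessedHead) {
    const ad = await fetchAdvertisement(cursor);  // 2. fetch the advertisement
    await indexEntries(ad.entriesCid);            // 3. analyze its entries
    cursor = ad.previous;                         // then repeat
  }
  return head; // becomes lastProcessedHead for the next sync
}
```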
B: Now let's take a brief look at the different files involved. In this case, the /head file is a very simply structured file. For the sake of this talk, what you really care about is line number three, which says what the CID of the current latest published advertisement is.
B: That's it. There's also a signature and a public key, but we don't really care about those here.
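As a sketch, reading the head comes down to a single GET. The host and the field names here are assumptions for illustration; per the description above, the file carries the head CID plus a signature and public key.

```typescript
// Sketch: fetch the /head file and pull out the CID of the latest
// published advertisement. Field names are illustrative.

const PROVIDER_URL = 'https://ads.example.com'; // hypothetical provider host

interface HeadFile {
  head: string;   // CID of the latest advertisement (the line we care about)
  pubkey: string; // used to verify the signature; ignored in this sketch
  sig: string;
}

async function latestAdvertisementCid(): Promise<string> {
  const res = await fetch(`${PROVIDER_URL}/head`);
  if (!res.ok) throw new Error(`head fetch failed: ${res.status}`);
  const head = (await res.json()) as HeadFile;
  return head.head;
}
```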
B: Then we have the advertisement files, which carry much more information. On line 2 and, what is it, line 18, there is information about the current provider: in this case, that's the peer ID of the Elastic IPFS cluster and its addresses, which is a single address in this case. And then on line 7 there is the link to the entries file this advertisement points to.
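For illustration, the advertisement and entries files can be modeled roughly like this. The field names echo the IPNI advertisement schema, but treat this as a sketch rather than the normative format; all values are placeholders.

```typescript
// Rough model of the two files discussed above. Field names echo the
// IPNI advertisement schema but are a sketch, not the normative format.

// An advertisement: who provides the content, plus a link to the entries.
interface Advertisement {
  PreviousID?: string;  // link to the previous advertisement in the chain
  Provider: string;     // peer ID of the Elastic IPFS cluster
  Addresses: string[];  // its multiaddr(s); a single address in this case
  Entries: string;      // CID of the entries file this ad points to
  Signature: string;    // advertisements are signed, like the head
}

// An entries file: the batch of multihashes being advertised.
interface EntriesFile {
  Entries: string[];    // multihashes of the blocks (10,000 per batch here)
}

const example: Advertisement = {
  PreviousID: 'baguqeera...previous', // truncated, illustrative
  Provider: '12D3KooW...elastic',     // truncated, illustrative
  Addresses: ['/dns4/elastic.example/tcp/443/wss'], // illustrative
  Entries: 'baguqeera...entries',     // truncated, illustrative
  Signature: '...',
};
```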
B: Now, as Alan said, we have a lot of blocks, and a lot of new blocks get added every day. If we mishandle the concurrency, first of all the queue will explode very easily, before we even realize it; and second, if we lose any advertisement, that advertisement will be lost forever, because nobody will ever be able to find it: it's not advertised by anyone, so there is no provider associated with it. For the same reason, there are concurrency constraints, so we process everything in two stages.
B: The first one is that we take all the new blocks that we receive from the indexing part, which are in an SQS topic, and we group them together in batches of ten thousand. And since this is SQS, we don't care about the ordering: we know that each CID will end up in one entries file, and we don't care which one. Once we compute this entries file, we enqueue its CID in another topic, another SQS topic.
B
This
sqs
topic
is
fed
to
another
lambda
which
will
group
these.
All
this
entries
file
in
another
batch
of
ten
thousand
create
sorry
we'll
we
will.
There
will
be
an
execution
for
each
10
000
entries
each
entries
will
become
an
advertisement
file
and
gets
uploaded
an
internal
in
memory.
We
keep
the
sequence
of
these
advertisements
at
the
end
of
the
day,
we
update
the
ad
only
once
with
the
latest
advertisement
that
we
have
processed
now.
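A heavily simplified sketch of that two-stage pipeline follows, with the AWS plumbing (S3 uploads, SQS hand-off, head update) reduced to hypothetical helpers; the function names are placeholders, not the real Lambda names.

```typescript
// Heavily simplified sketch of the two Lambdas described above.
// storeObject / enqueue / updateHead are hypothetical stand-ins for the
// real S3 and SQS calls; batches arrive up to 10,000 messages at a time.

declare function storeObject(kind: string, body: unknown): Promise<string>; // returns CID
declare function enqueue(topic: string, message: string): Promise<void>;
declare function updateHead(advertisementCid: string): Promise<void>;

// Stage 1: a batch of up to 10,000 multihashes -> one entries file.
async function entriesLambda(multihashes: string[]): Promise<void> {
  const entriesCid = await storeObject('entries', { Entries: multihashes });
  await enqueue('entries-topic', entriesCid); // hand off to stage 2
}

// Stage 2: a batch of up to 10,000 entries files -> one advertisement each,
// chained in memory; the head is updated once per execution.
async function advertisementPublisherLambda(
  entriesCids: string[],
  previousHead: string,
): Promise<void> {
  let previous = previousHead;
  for (const entries of entriesCids) {
    previous = await storeObject('advertisement', {
      PreviousID: previous,
      Entries: entries,
    });
  }
  await updateHead(previous); // single head update, then announce
}
```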
B: The result of this is that we reduce the order of magnitude of the problem by ten thousand, which means that, in theory, to handle one million uploads, sorry, one billion uploads per day, we just need a thousand executions of this second Lambda, and we are able to do that pretty easily. That's it.
B: Each block becomes part of an entries file, each entries file is mapped to an advertisement, and each advertisement is bound to one execution of the second Lambda, what we call the advertisement publisher Lambda; it's all executed once. But in the first case we were grouping manually into batches of ten thousand; in the second case, ten thousand is a limit given by AWS: you can receive up to ten thousand messages per event delivered from SQS to a Lambda, and that's why we get to ten thousand. But each advertisement maps to one entries file. I mean, if you look back at the earlier slide, you can see on line seven that an advertisement maps to a single entries file. That's it.
B: Finally, this is an overall view of what I just said: the diagram of the architecture. It all starts on the top right, where there is the SQS multihashes topic. Then there is the first Lambda execution, which outputs to S3 and to SQS, which then goes to the second Lambda. That will eventually also update the advertisement head on S3 and finally, only once per execution, notify the indexer node, which at some point later in the future will say: okay, let me fetch everything from that moment.
A
Okay,
so
once
it's
in
once
it's
in
the
indexer
node,
how
does
how
does
how
does
it
become
available
on
the
dht?
So
so,
for
that
we
need
to
talk
a
little
bit
about
the
the
hydra
nodes
which
we,
I
think
people
have
already
already
talked
about
a
little
bit,
but
this
is
kind
of
a
bit
more
in
depth.
So
anyway,
how
does
it
become
available
on
the
dht?
Well,
the
hydro
nodes
are
just
ipfs
nodes
and
they're
designed
to
help
the
dht.
A
They
position
themselves
such
that
when
you
query
the
dht
you're
you're,
pretty
likely
to
run
into
one
and
they're
just
everywhere
and
they
kind
of
help
there
to
help
out.
The
kind
of
the
key
thing
is
that
they
share
a
data
store.
So
if
one
hydra
head
knows
about
something,
then
all
of
them
do
just
like
a
real
hydra.
They
share
the
same
belly
of
data
with
many
hits
and
so
yeah.
The
the
cool
thing
about
them
is
that
they
prefetch
provider
records.
A
So
when
someone
asks
asks
who
has
the
bear,
the
the
and
the
hydra
nodes,
don't
know,
then
they'll
respond
and
say
I
didn't.
I
don't
know,
but
behind
the
scenes,
what
they'll
do
is
they'll
actually
query
the
network
and
find
that
information.
So
next
time
they
can
be,
they
can
tell
whoever
asks
what
the
actual
answer
is
so
yeah
it's
this
key
kind
of
like
getting
provided
records
and
caching
them
so
that
next
time
they're
available
and
also
being
available
all
over
the
network.
A
So
the
difference
now
is
that
the
hydras
have
been
updated
and
now,
when
hydra
is
asked
about
that,
bear
it
actually
delegates
that
query
to
a
network
indexer.
A
If
it
doesn't
already
know
the
answer,
obviously,
and
then
so,
the
network
indexer
is
able
to
say
the
elastic
provider
has
that
teddy
bear,
and
so
by
virtue
of
the
hydra
heads
being
prevalent
on
the
network.
You'll
almost
always
get
an
answer
without
having
to
explicitly
issue
provider
records
to
the
dht
yeah.
So
here
we
go,
send
it
on.
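The behavior described here amounts to something like the following lookup path on a Hydra head: shared datastore first, then delegate to the indexer and cache the answer. All names are hypothetical, and the real Hydras are written in Go, so this is just an illustration of the logic.

```typescript
// Sketch of the Hydra lookup path described above: answer from the
// shared datastore if possible, otherwise delegate to a network indexer.
// All names are hypothetical stand-ins.

interface ProviderRecord { peerId: string; addrs: string[] }

declare const sharedStore: {          // the datastore shared by all heads
  get(cid: string): Promise<ProviderRecord[] | undefined>;
  put(cid: string, records: ProviderRecord[]): Promise<void>;
};
declare function queryIndexer(cid: string): Promise<ProviderRecord[]>;

async function findProviderRecords(cid: string): Promise<ProviderRecord[]> {
  // If any head has seen this CID, every head knows it.
  const cached = await sharedStore.get(cid);
  if (cached) return cached;

  // Otherwise delegate the query to the network indexer and cache the
  // answer so the whole hydra can serve it next time.
  const records = await queryIndexer(cid);
  await sharedStore.put(cid, records);
  return records;
}
```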
A
Yeah,
I
think
it
gets
cached
still
yes.
So
so,
if
there's
a
lot
of
requests,
it
eventually
is
going
to
catch
all
the
hot
set
in
hybrids.
Yes,
yeah,
there's
a
cash
yeah
yeah!
I
didn't
do
the
implementation
in
hydrogen.
I
don't
can't
speak
to
it
like
if
that's
then
cash
in
hydros,
yeah,
we'll
nodding
so
yeah.
A
Okay,
so
the
requests
that
hydra
make
to
the
indexer
nodes
are
actually
what
we
call
reframe
messages.
It's
a
spec
for
transport,
agnostic,
request,
response
messages
and
what
the
hydras
use
is
the
find
providers
method,
but
this
spec
is
kind
of
part
of
a
bigger
kind
of
delegated
routing
protocol
that
the
hydras
are
sort
of
participating
in,
so
that
that's
cool
and
then
so.
A
This
is
sort
of
ends
up
like
this,
where
the
our
elastic
bfs
implementation
is
basically
telling
the
network
indexes
what
cids
it
has
and
then
the
ipfs
nodes
on
the
network
are
able
to
find
that
information
via
the
hydras.
At
the
moment
this
is
kind
of
a
temporary
situation.
A
The
the
the
index
and
nodes
will
eventually
like
we
talked
about
earlier,
the
queryable
via
ipvs,
directly
and
and
other
ways
which
I'm
not
entirely
in
that
situation
to
explain
about.
But
this
is
how
elastic
ipfs
currently
works
and
gets
all
of
the
data
that
we
have
ever
available
on
the
dht
for
for
people
to
to
query
and
respond
to
and
have
nearly
100
provide
a
record
coverage
for
all
of
our
stuff
cool.
Thank
you.