From YouTube: The Rise of Elastic IPFS - @alanshaw - Connecting IPFS
Description
The Rise of Elastic IPFS - presented by @alanshaw at IPFS þing 2022 - Connecting IPFS - https://2022.ipfs-thing.io
A: Hi, I'm Alan. This is the talk about the rise of Elastic IPFS. So, as you've just learned a little bit, Elastic IPFS is a new open-source IPFS implementation that runs in the cloud. It separates read and write pipelines to allow it to scale massively.
A: If you're interested in the Elastic IPFS architecture and you're watching this video in the future, then you should look at Francisco's talk that he just gave. And also, like I said, there's a deep dive into the provider subsystem later today in the content routing performance track with me and Paulo.
A: This talk, though, is the story of how we got our initial implementation out of the door and into production. So let's go. Let's first talk a little bit about what this all kind of hinges on.
A: So what we have is this tool, which is continuously collecting data on the uploads that both nft.storage and web3.storage receive. The gist of it is: we pick a CID that we know we're storing, we pick a peer that we know is storing that data, we do some checks on that information, and then we graph it on this nice Grafana panel.

A: These are the headline stats. At the top left there's DHT provider records. This is the percentage of time where we ask the DHT whether there is a provider record saying that this CID is being provided by this peer, and you can see that at this period of time we were not doing so well, at about 50% availability.
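That provider-record probe is easy to reproduce against any IPFS node. Below is a minimal sketch of the DHT side of the check, assuming Node 18+ (for global fetch) and a local Kubo daemon exposing its RPC API on the default port; the CID and peer ID are hypothetical placeholders, and this is not the actual checkup code.

```js
// check-provider-record.mjs: does the DHT hold a provider record for this
// CID that points at the peer we expect? (Sketch only.)
const cid = 'bafybeiexamplecid'              // hypothetical: a CID we believe we store
const expectedPeer = '12D3KooWExamplePeer'   // hypothetical: the peer meant to provide it

const res = await fetch(
  `http://127.0.0.1:5001/api/v0/dht/findprovs?arg=${cid}`,
  { method: 'POST' }
)

// The RPC streams newline-delimited JSON query events; Type 4 ("Provider")
// events carry the peers that hold provider records for the CID.
let found = false
for (const line of (await res.text()).split('\n')) {
  if (!line.trim()) continue
  const event = JSON.parse(line)
  if (event.Type === 4) {
    found ||= (event.Responses ?? []).some(p => p.ID === expectedPeer)
  }
}
console.log(found ? 'DHT provider record found' : 'no provider record for that peer')
```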
A
Top
top
right
here
is
bit
swap
availability.
So
what
we
do
is
we
make
a
p2p
connection
to
the
peer
that
we
know
is
meant
to
be
storing
that
cid
and
we
ask
it
using
a
bitswap.
Have
message:
do
you
have
this
cid
of,
and
so
it's
meant
to
say
yes,
doesn't
always
happen,
that's
bad!
That's!
But
anyway,
that's
that
that's
that
panel
we
also
chart
like
checks
per
second
but
connection
errors.
So
what
can
happen
is
when
we're
trying
to
connect
to
that
pier.
A
We
might
experience
a
connection
error
because
it
is
very
busy
or
it's
broken
or
down,
and
that's
no
good
and
that's
often
the
reason
why
this
is
not
100
connection.
Error
is
also
bad
at
this
period
of
time,
and-
and
so
this
bottom
right
panel
is
bit
swap
round
trip
time.
A: That's the actual time it takes to send that HAVE message and receive a response to it. So those are the kind of headline stats.
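The Bitswap side of the check needs a real libp2p connection rather than a plain HTTP call, so here is only the rough shape of it. `connectToPeer` and `sendWantHave` are hypothetical stand-ins for the libp2p dial and the Bitswap WANT-HAVE/HAVE exchange that checkup performs; this is a sketch, not checkup's implementation.

```js
// Probe one (cid, peer) pair for availability and round-trip time.
async function checkBitswap (cid, peerMultiaddr) {
  let conn
  try {
    conn = await connectToPeer(peerMultiaddr)   // hypothetical: libp2p dial
  } catch (err) {
    // Feeds the "connection errors" panel: peer busy, broken, or down.
    return { connectionError: err.message }
  }
  const start = Date.now()
  // Hypothetical: send a WANT-HAVE entry for the CID and wait for the
  // matching HAVE / DONT_HAVE response on the Bitswap stream.
  const response = await sendWantHave(conn, cid)
  return {
    have: response.have,        // feeds "Bitswap availability"
    rttMs: Date.now() - start   // feeds "Bitswap round trip time"
  }
}
```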
A: You can get a real overview of all of the peers that we're running in all of our clusters for nft.storage, and from there you can drill down into per-peer metrics. And so this is rainbow mode. Rainbow mode is not meant to be a thing: it indicates that our peers are acting very erratically, and that is not something you want from your production infrastructure.
A: So this becomes more useful when you actually filter by a particular peer. I've just selected random ones here; they're not all the same. This top left one was having a really bad time on connections, and maybe it got restarted, and then you can see it's doing a lot better.
A: This one: we were finding provider records for this particular peer for every CID that we checked. Well, not every CID, but most of them, and then it started to have a bit of a bad time. Bitswap for this one was doing okay, then had a really bad time, and then maybe it got restarted again. You kind of get the idea; this is drilling down into peers, as you can see from here.
A: This is a really good indication of when there's a peer that's currently in trouble, that's struggling in some way, because the checks are telling us that bad things are going on. Anyway, you get the idea: all of those metrics that you see in that Grafana are specific to the data that we're storing.
A: So these are CIDs that we know have been uploaded to web3.storage and nft.storage, so they're specific to us. But this all hinges on the ipfs-check tool, which is a generic, public, open-source API that is available for anyone to use, and anyone can run it themselves.
A: We actually run our own one as well, and so you can just go and put in your CID and your peer and check whether good things are happening, and it will show you the results afterwards. It's made by Adin, because he's a wizard, and we lean on it really heavily for those stats, so thank you! Okay, so anyway: how we got to production. Checkup gave us the tools to determine how Elastic IPFS was performing.
A: Basically, as you learned in the previous talk, Elastic IPFS works off of S3 buckets. The CAR files that people upload go straight into S3 buckets, and it reads the blocks directly out of those CAR files.
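To make that concrete, here is a minimal sketch of reading one block straight out of a CAR file in S3 with the AWS SDK, assuming an index entry that records which CAR a block lives in and at what byte offset and length; the bucket, key, and entry shape are hypothetical, not Elastic IPFS's actual schema.

```js
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'

const s3 = new S3Client({ region: 'us-east-1' }) // hypothetical region

// entry = { bucket, key, offset, length }: a hypothetical index record
// locating one block inside a CAR file.
async function readBlock (entry) {
  const res = await s3.send(new GetObjectCommand({
    Bucket: entry.bucket,
    Key: entry.key,
    // Fetch only this block's bytes, not the whole CAR file.
    Range: `bytes=${entry.offset}-${entry.offset + entry.length - 1}`
  }))
  return Buffer.from(await res.Body.transformToByteArray())
}
```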
A: In our dot-storage APIs at the moment, we already write to S3 buckets, and we've been doing that forever for disaster recovery, just in case our Cluster decided to blow up: we've still got some kind of extra backup of all of the data that we could restore from if anything happened. But that turned out to be really good, because it meant that we could get Elastic IPFS up and running, ingest all our existing data and any new data that was coming in, without putting Elastic IPFS on the critical path for either of those products. So, yeah.
A: That turned out to be really good. Anyway, these are the things that happened, and the graphs that show the resolution of those things. First of all, when the implementation was done, when Paulo and the team finished building it, we did this kind of sanity step zero, an "is this thing a goer?" check. This was a really naive one-to-one connection: just try and transfer something over it.
A: So it's not really typical of IPFS, because potentially you'll be able to get stuff over Bitswap from multiple peers, but this is just: is it actually going to be usable? And just to note, we do expect Elastic IPFS to be a little bit slower than go-ipfs, because we're trading off network I/O, where we're fetching stuff from S3 buckets and the indexes from DynamoDB, against disk I/O, like SSD disk.
A: You can read from that really, really fast. So anyway, this is a speed test from before we did any optimizations; it was just fresh out of the door, and we found it was, you know, reasonable. It was usable, and so we were like: okay, right, let's continue. And so there are optimizations that we've done and that are still to come, which I'll talk about a little bit later.
A: We noticed that it was all the way up here, around six milliseconds, which is a little bit slower than regular go-ipfs (Kubo). We managed to almost halve that round-trip time, and it's now a tiny bit slower than go-ipfs on a good day, but still consistently better than a lot of the peers in our clusters at the moment. So what did we do to optimize it?
A: Yeah, that's the only one. I actually don't know what happened, so, thanks. All right, anyway.
A: So the next thing that happened to us was that the indexer nodes came online, and this was amazing to watch, because I was camping in a field at the time, just sat on my mobile on Grafana, refreshing it. Within a few days they had read all of the advertisements that we'd generated, and effectively they'd indexed the majority of all data ever uploaded to web3.storage and nft.storage. That's terabytes of data; that's more than 1.5 billion CIDs in total.
A: And once it got up there, it basically stayed up there ever since, which is incredible. I don't know, maybe we can go back to when you looked at rainbow mode; this is the same sort of thing that was happening. This is the same graph, but it was like that everywhere, whereas we're now consistently up in the high 90s to 100 percent.
A: The cool thing about this is that once the indexer nodes had indexed all of this data, it meant that those provider records were essentially available on the DHT, so people started discovering Elastic IPFS, and that put it under load, which is rad. But then we saw connection errors. We started seeing these in the graphs, and we were like: oh no, what's happened?
A
This
is
where
we
fixed
that,
but
we
we
in
node
there's
this
massive
foot
gun
if
you've
got
an
event
emitter
and
you
omit
an
event
called
error
and
don't
listen
for
that
event.
Then
it
just
takes
down
the
whole
process
and
turns
out
we'd
missed
that
in
one
of
the
connections
that
we
were
making,
and
that
was
what
was
causing
that.
So
this
is
where
we
fixed
it.
I
only
started
seeing
it
when
we
started
getting
loads
of
traffic
and
then,
after
that
that
was
fixed.
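The foot gun is easy to demonstrate: in Node, an 'error' event with no listener attached becomes an uncaught exception, and attaching a handler is the whole fix. A minimal illustration:

```js
import { EventEmitter } from 'node:events'

const conn = new EventEmitter()

// Without this listener, conn.emit('error', ...) below throws an uncaught
// exception and takes down the entire process.
conn.on('error', err => {
  console.error('connection error:', err.message)
})

// With the listener attached, the error is logged instead of being fatal.
conn.emit('error', new Error('peer reset the connection'))
```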
A: The thing we did realize was that we weren't currently graphing in checkup the time between a user uploading a CAR file and that data being available on the IPFS network, and by that I mean that people can connect to an IPFS peer and transfer it via Bitswap. That's important to us because, as soon as people can transfer it via Bitswap, it's available on the gateways, essentially, and a lot of our read traffic is through gateways.
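A sketch of that measurement, reusing the hypothetical `checkBitswap` helper from the earlier sketch: record when the CAR upload completed, poll the peer until Bitswap reports the root CID, and graph the difference.

```js
// Hypothetical sketch of the "time to availability" metric, in milliseconds.
async function timeToAvailable (rootCid, peerMultiaddr, uploadedAt) {
  for (;;) {
    const { have } = await checkBitswap(rootCid, peerMultiaddr)
    if (have) return Date.now() - uploadedAt
    await new Promise(resolve => setTimeout(resolve, 500)) // arbitrary poll interval
  }
}
```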
A
So
this
is
an
important
metric
for
us
to
us
to
be
tracking,
and
so
we
did
some
changes,
and
now
we
can
graph
this
that,
like
this,
is
currently
at,
like
the
actual
time
to
index,
a
car
varies
based
on
like
the
size
of
a
car,
but
also
the
number
of
blocks.
The
number
of
blocks
in
the
block
size
as
well
can
change
so
it
like
it's
diff
different
this.
This
is
our
old.
This
is
the
old
value.
A: We made some changes recently to DynamoDB, which we already talked about, where we now bulk-write, which I think basically halved this, or something like that. So essentially, this means that CAR files that get uploaded are available on the gateways in, you know, less than a second-ish, depending on the size and the number of blocks in the CAR.
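With AWS SDK v3 the bulk write looks roughly like the sketch below; BatchWrite accepts at most 25 items per request, so the entries are chunked. The table name and item shape are hypothetical, not Elastic IPFS's real schema.

```js
import { DynamoDBClient } from '@aws-sdk/client-dynamodb'
import { DynamoDBDocumentClient, BatchWriteCommand } from '@aws-sdk/lib-dynamodb'

const db = DynamoDBDocumentClient.from(new DynamoDBClient({}))

// entries: hypothetical index records, e.g. { multihash, carKey, offset, length }
async function writeIndexEntries (entries) {
  for (let i = 0; i < entries.length; i += 25) { // BatchWrite limit: 25 items
    await db.send(new BatchWriteCommand({
      RequestItems: {
        // 'blocks' is a hypothetical table name.
        blocks: entries.slice(i, i + 25).map(item => ({
          PutRequest: { Item: item }
        }))
      }
    }))
    // Production code would also retry any UnprocessedItems in the response.
  }
}
```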
A: Cool. So yeah, network availability: we did that. Then this was the second kind of hit: we realized that there were still some connection errors, and they were causing these little bumps in our Bitswap no-responses, so things were trying to connect and not getting responses. The reason, we found, was that we actually depend on a native dependency called sodium-native.
A: It's used by noise to do the connection encryption, and what was happening was that, under load, that native dependency was somehow triggering a race condition in Node 16 and just pulling the whole process down. So the whole Kubernetes container had to be restarted, essentially, which is not good when you need a Bitswap connection that stays open for a long period of time to send stuff. So this is where we fixed that.
A: It turns out, and this is the first time that's ever happened to me, that the internet said it had been fixed in Node 17.
A
So
we
we
took
a
punt
on
that
and
upgraded
to
node
18,
because
that's
the
next
long-term
support
version
in
in
our
bit
swap
peers
and
it
fixed
it,
and
I
can't
believe
it
worked.
I
can't
believe
it
worked,
but
it
did-
and
I
was
so
happy.
That's
me
happy
so
yeah.
So
then
then,
like
we
didn't
see
that
error
ever
again,
so
which
is
right
and
so
that
at
this
point
we
were
feeling
pretty
confident
in
elastic
ipfs.
A
We
had
consistently
good
metrics
on
the
checkups
for
for
kind
of
a
few
weeks
and
we
decided
to
do
like
a
soft
deploy
in
in
cloudflare
workers.
What
you
can
do
is
you
can
you
can
essentially
keep
the
worker
alive
and
running
without
blocking
on
sending
a
response
to
the
user?
So
previously
what
we
were
doing
is
we
were
uploading
cars
to
cluster.
A
We
were
also
uploading
them
to
s3,
at
the
same
time,
waiting
for
both
both
of
those
tasks
to
finish
and
then
responding
responding
to
the
to
the
user,
and
so
we
switched
that
round
so
that
we
uploaded
to
s3.
We
responded
to
the
user
and
in
the
background
we
still
upload
to
cluster.
So
it's
also
also
in
cluster,
and
this
happened.
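In a module-style Worker, that pattern hinges on `ctx.waitUntil`, which keeps the Worker alive past the response. A minimal sketch, where `uploadToS3` and `uploadToCluster` are hypothetical helpers rather than the real dot-storage code:

```js
export default {
  async fetch (request, env, ctx) {
    const car = await request.arrayBuffer()

    // Critical path: the S3 write that Elastic IPFS reads and indexes from.
    await uploadToS3(env, car)                 // hypothetical helper

    // Background: keep Cluster populated without blocking the response.
    ctx.waitUntil(uploadToCluster(env, car))   // hypothetical helper

    return new Response('upload accepted', { status: 202 })
  }
}
```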
A: We saw a massive reduction in the amount of time it takes to upload stuff, and a huge reduction in variance as well, which is incredible. Everyone was very happy and got excited, and yeah, it meant that there was just a whole lot less fire to deal with, and we had a very good time. So the only thing left to do now is to take Cluster out of the picture completely for uploads.
A
We
still
use
it
for
pinning
service
apis,
so
people
still
gonna,
pin
cids
to
us
and
we
need
cluster
to
actually
go
and
fetch
stuff
from
the
network.
So
cluster
can
basically
do
what
it
does
best
like
it.
A
It
is
for
pinning
things
and
that's
what
it
does:
pinning
data
finding
and
fetching
data
from
the
ipfs
network
and
storing
it
and
so
uploads
actual
the
car
file
uploads
can
can
can
go
straight
in
straight
into
s3,
be
indexed
by
elastic
ipfs
and
be
available
like
that,
and
so
as
we
need
to
just
do
the
hard
deploy
and
and
then
also
the
optimization.
So
these
are
some
of
the
optimizations
that
we've
done
and
are
thinking
of
we've
already
done.
A: We've already added an LRU cache to the Bitswap peers, so any blocks that have been seen very recently can be served without going to DynamoDB for the index information, or going to S3 to get the block from the CAR file.
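A minimal sketch of that cache: JavaScript Maps iterate in insertion order, which is enough for a simple LRU. `fetchBlock` is a hypothetical stand-in for the DynamoDB-index-plus-S3 path shown earlier.

```js
const MAX_ENTRIES = 1000 // hypothetical capacity
const cache = new Map()

async function getBlock (cidStr) {
  if (cache.has(cidStr)) {
    const block = cache.get(cidStr)
    cache.delete(cidStr)     // re-insert to mark as most recently used
    cache.set(cidStr, block)
    return block
  }
  const block = await fetchBlock(cidStr) // hypothetical: DynamoDB index + S3 range read
  cache.set(cidStr, block)
  if (cache.size > MAX_ENTRIES) {
    // Evict the least recently used entry (the oldest insertion).
    cache.delete(cache.keys().next().value)
  }
  return block
}
```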
A: We also need to take advantage of data proximity. Like I said, we always put the data in the backup bucket, and I think midway through Elastic IPFS being built,
A: we decided that we'd just use that bucket, rather than have a separate bucket that we upload to. But the Elastic IPFS stuff is all deployed in the west, and the bucket is in the east, so whenever the peers need to serve data, they need to get it from the other side of America. That's time, but it's also money for us, so yeah, we will fix that pretty soon. We can also have multi-region peers.
A
Currently,
all
of
our
bit
swap
peers
are
in
the
same
region,
so
we
could
put
them
in
multiple,
multiple
region,
regions
and
also
have
people
connect
to
them
in
a
place,
that's
closer
to
them
than
the
other
side
of
the
world,
which
would
be
rad
yeah
by
byron.
So
this
is
request,
optimization
yeah.
A: If we're being asked for multiple CIDs and they're in the same CAR file, rather than making separate requests for each block, what we could do is make one request that covers that whole range. Maybe we get some junk in the middle, but that might be faster than making separate requests for each block.
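A sketch of that coalescing: compute one byte range covering all the wanted blocks in a CAR, issue a single ranged GET, and slice the blocks back out, tolerating the junk in between. The index entries and client are the same hypothetical shapes as in the earlier sketch.

```js
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'

const s3 = new S3Client({ region: 'us-east-1' }) // hypothetical region

// entries: hypothetical index records for blocks in the SAME CAR file,
// each { bucket, key, offset, length }.
async function readBlocks (entries) {
  const start = Math.min(...entries.map(e => e.offset))
  const end = Math.max(...entries.map(e => e.offset + e.length)) // exclusive

  // One request covering the whole span instead of one request per block.
  const res = await s3.send(new GetObjectCommand({
    Bucket: entries[0].bucket,
    Key: entries[0].key,
    Range: `bytes=${start}-${end - 1}`
  }))
  const bytes = Buffer.from(await res.Body.transformToByteArray())

  // Slice each block back out of the single response.
  return entries.map(e =>
    bytes.subarray(e.offset - start, e.offset - start + e.length)
  )
}
```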
A: And then we could also take advantage of data locality. So if you've uploaded a CAR file with a DAG in it, and then someone starts to Bitswap it, it's likely they're going to want the other blocks in that same CAR file, because it's the same DAG. So instead of serving each individual block, we could just preload that whole CAR into the cache, so that as the Bitswap session progresses, as they ask for the root and then ask for more, we've already got that stuff to serve to them straight away.
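A sketch of that preloading idea, walking the CAR's blocks with @ipld/car; `cache` is the LRU from the earlier sketch, and the bucket and key are hypothetical.

```js
import { CarBlockIterator } from '@ipld/car'
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'

const s3 = new S3Client({ region: 'us-east-1' }) // hypothetical region

// On the first want hitting a CAR, warm the cache with every block in that
// CAR so the rest of the Bitswap session is served from memory.
async function preloadCar (bucket, key, cache) {
  const res = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }))
  // In Node, res.Body is a readable stream (async iterable) of the CAR bytes.
  for await (const { cid, bytes } of await CarBlockIterator.fromIterable(res.Body)) {
    cache.set(cid.toString(), bytes)
  }
}
```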
A: So we might be able to do something there. And also, yeah, we'd love to maybe switch to R2, or use R2 from Cloudflare, because of the free egress; S3 costs money for egress. And that's all of the optimizations, and that's the end of my talk. You've missed it!
B: So those graphs that you showed still won't have the second point, right?
B: So when you're scanning your CAR file, you're not skipping blocks anymore, you're just batching: reading each block and writing them in bulk, right?
C: One range query, in order to just get the first one for that particular multihash. One of the cool things that we've talked about is that we should just generate CARv2 indexes for everything, ever, and then once we see that a bunch of CIDs are in a couple of different CARs, we should just grab those CAR indexes, and I bet they're all in one, actually, and then you just do one big range request.
B: The check monitoring tool that you used for evaluating when you were ready to switch to Elastic provider: I know it's using ipfs-check under the hood, but the packaging of that into a "hey, I want to monitor this set of CIDs" for an infrastructure operator, is that something that's open source?
A: Yeah, it's called checkup. It's in a GitHub repo in the web3-storage org, and you just set environment variables, like the Cluster API URL that you want to use. We've changed it, obviously, to also include Elastic provider, so it is specific to our setup in that we have Cluster and Elastic provider, but other than that, anyone can use it. It's open source and available.