From YouTube: 5 Billion Blocks - Alan Shaw
Description
Alan's talk will give a high-level overview of the infrastructure the DAG House team has built for serving massive amounts of IPFS content to thousands of users around the world.
Let's talk about IPFS at scale. This talk is called "5 Billion Blocks", for most definitions of billion, because that's approximately how many blocks we store right now for our users. That's roughly a petabyte of data on IPFS and Filecoin, across nearly a hundred thousand user registrations. So that's quite rad. And those users have made nearly 100... I've rounded these all up to make it fun.
We built nft.storage in two weeks for this hackathon called NFTHack. I think it was that hackathon, I'm not sure, but anyway, ETHGlobal was putting it on, and we built this thing for it. At the time the idea was just to create the easiest way for developers to onboard data onto Filecoin, and that turned out to be pretty popular.
IPFS is a really good place for that, for NFTs, for reasons we can talk about after this talk if you like. So anyway, we started with IPFS Cluster. Cluster got us a long, long way: it stored over 25 million pins for us, and it still stores pins for us at the moment. And at the time, what else were we going to do? This was like two years ago: whatever IPFS implementation do you reach for?
A
Do
you
reach
for
and
what
we
wanted
was
like
a
multi-tenant
system
with
like
redundancy
putting
data
onto
multiple
nodes
so
that
they
weren't
like
we
didn't
lose
it
because
we
didn't
want
to
do
that.
That
was
a
big
thing
about
nfts
was
that
people
were
putting
stuff
on
on
ipfs,
assuming
it
was
like
forever
storage,
not
realizing
exactly
that.
A
If
you
don't
keep
your
node
running
or
put
it
on
somewhere
and
on
a
node
that
does
keep
running,
then
it's
not
going
to
continue
to
be
on
the
internet
anyway,
ibfs
cluster,
an
amazing
product,
and
we
made
good
big
use
of
it
and
it
got
us
so
so
far,
but
it
wasn't
easy.
We
learned
the
hard
way
how
to
make
ipfs
cluster
scale
massively
and
like
small
things
like
a
data
store
choice
and
we
started
with
badgerdb,
because
I
think
I
think
that's
the
default.
That's why we started with it, anyway, and that was fine for a while, quite a long while, but then eventually we had to switch to flatfs (I don't know, there wasn't an emoji for flatfs). So anyway, that gave us way better performance characteristics, so that was fun. We also started with a cluster of just a few nodes with really big disks, and then we had to switch to a cluster of many nodes with small disks.
A
Basically,
if
you
let
ibfs
nodes
get
too
big,
then
you
find
that
performance
sort
of
tends
to
tail
off,
but
that
came
with
it
with
its
own
kind
of
challenges
like
just
managing
them.
All.
Like
upgrades
upgrades
for
ipfs
for
cluster
I,
remember,
we
had
like
a
block
store
upgrade
quite
recently
and
it
was
going
to
be
super
invasive
to
our
users,
because
we
can't
we
have
upload
okay.
So we have about five to ten uploads a second, so we can't really stop the cluster and do a datastore migration.
So actually it was way easier for us to just create a new cluster and then copy the stuff across, over the course of months and months and months, but we eventually got there, yeah, because we can't stop the world for that sort of thing. And we were also relying quite heavily on the PL netops team, because we built it in two weeks with a team of one.
A
Can
you
spin
us
up
a
class
they're,
like
yeah
sure,
no
problem,
and
then
you
know
that
cluster
turned
into
a
you
know
a
cluster
of?
Maybe
three
nodes
I
think
to
a
cluster
of
about
50
nodes
and
then
netops
were
on
pager
Duty
and
we
were
like
sorry.
This
has
taken
up
some
of
your
time,
so
well
got
a
bit
guilty
for
that.
But
anyway,
let's
talk
about
garbage
collection,
yeah!
No,
you
can't
garbage
collect.
A
We
can't
really
unpin
everything,
because,
as
a
multi-tenant
system,
we
can't
really
guarantee
that
someone
else
isn't
also
uploading
the
same
data
and
if
we
could
unpin,
then
it
would
just
take
forever
to
garbage
collect.
These
are
Big
notes,
lots
of
nodes
with
with
lots
of
data
on
them
and
it
and
we
could
take
them
out
of
rotation
but
and
then
garbage
collects
and
then
put
them
back
in.
But that's a really manual process to be performing on your live production infrastructure, and it's kind of, yeah, not great: a bit error-prone, a bit scary. So our solution to that is to not garbage collect, and then it's just busy.
It's busy, busy, busy all the time. We've got CAR file uploads and pin requests coming in the whole time: write, write, write, write, write. And then for each one of those writes it's provide, provide, provide to the DHT, and then periodically we have reprovides, which try to re-provide the whole of all of the blocks that this particular node is storing.
A
And
then
the
cue
of
that
is
so
long
that
that
you
know
the
the
the
the
provider
records
expire
before
they
even
get
onto
the
DHT
and
that's
fun,
and
then
obviously,
then
there's
reading
via
bit
Swap
and
and
that's
not
just
external
traffic.
That's
like
cluster
cluster,
actually
bit
squats
between
its
peers,
because
when
you
upload
stuff
to
it,
it
goes
on
to
one
of
them
and
then
to
get
that
replication.
A
It
bits
what's
between
them,
so
yeah
so
busy
working
busy
and
that's
why
it
gets
hot
and
tired
and
then
yeah,
especially
if
you
have
popular
content.
If
someone
uploads
something
that's
super
popular,
then
that
node
is
busy
forever
we're
observing
that
content,
yeah,
so
busy
busy
notes,
and
so
we
built
elastic
ipfs
to
help
sort
of
alleviate
some
of
these
issues.
A
I'm
not
going
to
talk
loads
in
depth
about
how
it
works,
because
there
is
actually
another
talk
a
little
bit
later.
That
goes
into
how
I've
elastic
ibfs
Works
in
depth,
but
I'm
going
to
Breeze
through
it
super
quick,
just
just
because
I
think
it's
really
interesting,
but
the
essentially
elastic
the
the
computers
that
are
like
accepting
data.
The
rights
are
not
the
same
computers
that
are
actually
reading
data
as
well.
We've
separated
those
two,
those
two
pipelines
and
elastic
IVF-
is
free
and
open
source
on
the
internet.
You should go to GitHub and check it out; if you search for "elastic ipfs" you will likely find it. But how does it work? Well, we accept CAR files, which are serialized DAGs. Both NFT.Storage and web3.storage accept CAR uploads. We also accept files, but we like to encourage the web3 way of doing things, so people know the CIDs before they send them to us. And anyway, currently they go into workers in the cloud.
A
That's
the
little
cogs
they
are
like
like
like
lambdas,
but
that
scales
up
really
nicely
so
that
they
can
accept
uploads.
Really,
it's
a
good
big
good
concurrency
and
the
workers
put
car
files
in
a
simple
storage
bucket,
so
we
put
car
files
in
the
bucket
and
we're
actually
moving
towards
a
system
where
the
workers
just
give
the
user
a
signed
URL
and
they
upload
directly
to
the
bucket.
So
we
don't
even
need
to
go
through
the
workers
anymore.
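To make that concrete, here is a minimal sketch (not the team's actual code) of handing a client a pre-signed upload URL, assuming an S3-compatible bucket and the AWS SDK v3; the bucket name and key layout are made up:

```ts
// Hypothetical sketch: issue a pre-signed PUT URL so the client can upload
// a CAR file straight to the bucket, without proxying bytes through a worker.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "us-east-1" });

// carCid is assumed to be computed client-side (the "web3 way"): the client
// already knows the CID of the CAR before asking where to upload it.
export async function createUploadUrl(carCid: string): Promise<string> {
  const command = new PutObjectCommand({
    Bucket: "example-car-bucket",     // hypothetical bucket name
    Key: `${carCid}/${carCid}.car`,   // hypothetical key layout
  });
  // The URL is valid for 15 minutes; the client PUTs the CAR bytes to it.
  return getSignedUrl(s3, command, { expiresIn: 900 });
}
```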
A
We
don't
avoid
that
that
problem
of,
like
proxying
the
content
and
the
cost
associated
with
that,
and
so
this
is
where
elastic
ibfs
comes
in,
because
it
gets
informed
that
there's
a
new
car
in
the
bucket
and
it
can
be
any
bucket-
doesn't
have
to
be
that
one
and
just
as
long
as
elastic
ivfs
can
actually
read
that
that
car
file,
then
it's
all
good,
so
it
gets
told
of
it
and
then
elastic
ipfs
indexes
the
blocks
in
it.
So we store the block CIDs, the byte offset within the CAR file, and also the CAR file that they're actually in, so we know where to look for them. And Elastic IPFS has these specialized IPFS nodes that run in there, called Bitswap peers, because that's all they do: they just do Bitswap. It's made up of an auto-scaling, load-balanced Kubernetes cluster, and nodes in the IPFS network connect to Elastic IPFS and send a Bitswap want list.
The Bitswap peers consult the index, they find out where those blocks are, like which CAR file and what offset, and then they send a message back with the blocks that were requested. And it does that by making range requests to the CAR files directly: it reads directly from the CAR files in the bucket, making range requests to serve the blocks. And that is it. But not quite.
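As a rough illustration of that read path (the index record shape and the fetchBlockIndex helper are assumptions for this sketch, not Elastic IPFS's real schema), a Bitswap peer could resolve a wanted CID to a CAR location and pull just those bytes with an HTTP range request:

```ts
// Hypothetical sketch: look a CID up in an index of { carKey, offset, length }
// records, then read only those bytes out of the CAR object in the bucket.
interface BlockLocation {
  carKey: string;  // which CAR object in the bucket holds the block
  offset: number;  // byte offset of the block within that CAR
  length: number;  // block length in bytes
}

// Assumed helper: however the index is actually stored, it answers this question.
declare function fetchBlockIndex(cid: string): Promise<BlockLocation | undefined>;

const BUCKET_URL = "https://example-car-bucket.example.com"; // hypothetical

export async function getBlock(cid: string): Promise<Uint8Array | undefined> {
  const loc = await fetchBlockIndex(cid);
  if (!loc) return undefined; // not indexed here, so nothing to send back

  const res = await fetch(`${BUCKET_URL}/${loc.carKey}`, {
    headers: { Range: `bytes=${loc.offset}-${loc.offset + loc.length - 1}` },
  });
  if (res.status !== 206) throw new Error(`range request failed: ${res.status}`);
  return new Uint8Array(await res.arrayBuffer());
}
```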
It's also worth mentioning that when we index the CAR files, we send that information to indexer nodes, and this is how all five billion blocks are discoverable on the DHT. So yeah, we use indexer nodes, and indexer nodes are purpose-built to map CIDs to content providers, and that's built for the scale of the Filecoin network.
A
Our
tiny
amount
of
data
in
comparison
to
the
capacity
of
the
Falcon
network
network
is
should
hopefully
they
should
be
able
to
handle
that
so
far
they
have
and
so
yeah.
Essentially,
you
can
ask
an
indexer
node,
who
has
a
CID
and
it
will
tell
you,
provided
someone
else
has
already
told
it
who
has
it
before
before
you
ask
them,
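For instance, asking "who has this CID?" could look something like the sketch below, assuming an IPNI-style indexer that exposes a GET /cid/<cid> find endpoint (cid.contact is one public indexer that speaks this protocol); this is illustrative, not necessarily the exact API the team calls:

```ts
// Hypothetical sketch: query a network indexer for provider records for a CID.
const INDEXER_URL = "https://cid.contact"; // one public IPNI indexer

export async function whoHas(cid: string): Promise<unknown | null> {
  const res = await fetch(`${INDEXER_URL}/cid/${cid}`);
  if (res.status === 404) return null; // nobody has announced this CID yet
  if (!res.ok) throw new Error(`indexer query failed: ${res.status}`);
  // The JSON body lists provider records, i.e. peers that previously told
  // the indexer they can serve this content.
  return res.json();
}
```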
And so this graph, this is my second favorite graph, is the reason why we need indexer nodes. This is one of our nodes.
So we've got this tool called Checkup, which basically takes a CID sample, a CID that we know is stored on a particular node, and then asks the DHT who has it. And if it finds that that node is in a provider record in the DHT, then the chart is good, and if it doesn't, then the chart is bad, like this. The chart only goes up to 55%.
55% of the time is the maximum amount of time, and this is way below that for most of the time. So: bad, Bad News Bears.
So this is my favorite graph, and this is Elastic IPFS when the indexer nodes were turned on. You can see we had basically nothing, because Elastic IPFS doesn't do any providing to the DHT. This is annoying, and so any spikes here are literally just because someone else happens to have that content on the internet, I think... no, actually it's not.
Oh, I don't really know, anyway, it doesn't matter. So this is when it got turned on: we went from zero to 100% and it has basically stayed at 100% ever since, so really good.
And, comparatively, this is one node in a cluster of 50; this is one-fiftieth of our data, and this is all of our data. So super cool, that's Elastic IPFS. Oh yeah, and this is a new thing we released about a week or two ago, so I wanted to quickly talk about it if I've got any time left. I don't even know... 12 minutes? Okay, I've got some time, this isn't long. So, this is Freeway.
We can talk afterwards about the name, why we named it that, but it's a new thing that we have called Freeway, and it's an IPFS gateway that's backed by CAR files, which is why I put tons of car emojis on there, and it's why I'm concerned about the emoji rendering on my slides. So yeah, it's an IPFS gateway that's backed by CARs, and they're the same CAR files that our users upload to our service. That's right!
So let's recap on gateways, using the best graphic for gateways I know of, from the js-ipfs website. A gateway is an IPFS node, and it provides access to the wider IPFS network from a centralized point; it's essentially an HTTP interface to IPFS. So HTTP requests come in asking for the file data for a particular CID, and IPFS goes and finds the data and then exports a regular file from the UnixFS blocks that it finds in the IPFS network. And so, in the case of Elastic IPFS...
It kind of looks something like this. This is like the ipfs.io gateway: you ask it, the HTTP request comes in, "give me the data for this CID". It does some bitswapping with Elastic IPFS; meanwhile, Elastic IPFS is doing its read-the-blocks-from-the-bucket thing, and then it all comes back, all the way through. And so what does Freeway do?
A
Is
it
cuts
out
that
middleman
and
we're
so
we're
still
serving
ipfs
content
addressed
data,
but
we
don't
go
over
lib,
P2P
or
bit
swap
to
retrieve
it,
and
this
kind
of
thing
is
is
is
possible
because
we
actually
run
our
own
gateways.
We
run
nft,
storage.link
and
w3s
Dot
link
and
these
gateways
are
actually
special
gateways
and
they
they
race,
multiple
other
gateways
and
give
the
the
response
back
so
freeway
doesn't
need
the
discovery
element
that
ipfs
has.
A
If
there
are
other
gateways
in
the
race
that
do
have
that
Discovery
element,
and
so
freeway
can
literally
just
return
404
for
things
that
it
doesn't
have
yeah
and
but
but
the
the
fun
part
is
it's
quite
likely
to
have
the
things
that
people
are
requesting
through
those
gateways
because
they're
probably
uploaded
it
to
our
service,
so
they're
going
to
use
our
gateways
and
then
so
yeah
so
quite
often
serves
the
data.
A
That
is,
that
needs
to
be
served
and
it's
blazingly
vast
and
and
it's
great
and
yeah.
So
we
have
data
in
in
yeah,
so
there
we
go.
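A minimal sketch of that racing idea (the Freeway URL below is a placeholder; ipfs.io and dweb.link are just examples of public gateways that do discovery) might look like this:

```ts
// Hypothetical sketch: race several gateways for the same CID and take the
// first successful response. Freeway can safely 404 for content it doesn't
// hold, because another gateway in the race may still find it.
const GATEWAYS = [
  "https://freeway.example.com", // placeholder for a Freeway deployment
  "https://ipfs.io",
  "https://dweb.link",
];

export async function raceGateways(cid: string): Promise<Response> {
  const attempts = GATEWAYS.map(async (gw) => {
    const res = await fetch(`${gw}/ipfs/${cid}`);
    if (!res.ok) throw new Error(`${gw} returned ${res.status}`);
    return res;
  });
  // Promise.any resolves with the first fulfilled attempt and only rejects
  // if every gateway fails.
  return Promise.any(attempts);
}
```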
But how does it work? Okay, how does it work, real quick, real quick, I'm over time... am I? So we have a bucket full of CARs. How does Freeway know which CAR file to export from? Because there's blocks in the CARs, and we don't know which CAR the block is in.
A
How
do
we
know
Dude
Where's,
My
Car,
the
dudeware
is
a
link
index
and
it
links
root
data
cids
to
the
car
files
where
that
dag
can
be
found.
So
we
consult
the
the
we
consult,
dudeware
and
dudeware
says
it's
in
these
cars
and
then
we
also
store
a
index
in
the
bucket,
and
this
tells
us
the
byte
offsets
of
the
blocks
that
are
in
that
particular
car
and
we
store
a
car
V2
index.
If you know what a CARv2 index is: it's the multihash sorted index. I don't know why I put that in the slides; if you know, you know. Anyway, it's a CARv2 index, but it's not a CARv2: we don't actually change our CARv1s into CARv2s, we just store side indexes, so actually we could inflate them into a CARv2 if we wanted to.
A
If
we
want
to
just
read
the
first
one
read
the
second,
but
but
we
don't,
because
it's
actually
he's
better
to
believe
me,
it's
better
to
store
them
side
by
side
anyway,
so
we
have
indexes
next
to
the
class
in
the
bucket
and
yeah.
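To picture those two lookups, they could be shaped roughly like this (illustrative TypeScript types only; the real key layout and field names may differ):

```ts
// Hypothetical shapes for Freeway's two lookups.
// 1. DUDEWHERE: root (data) CID -> the CAR files that contain its DAG.
interface DudeWhereEntry {
  rootCid: string;   // the CID a user requests from the gateway
  carCids: string[]; // CAR files in the bucket that hold that DAG
}

// 2. A CARv2-style side index stored next to each CAR: block -> byte offset.
interface CarSideIndexEntry {
  multihash: string; // multihash of a block inside the CAR
  offset: number;    // byte offset of that block within the CAR file
}
```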
So, all together, the flow looks like this: a request comes in for a particular CID. We consult Dude Where's My Car, so we know the CARs that it's in.
Then we read the CARv2 index, so we know where all the blocks are, and at that point we have all the information we need, and we can literally just do a UnixFS export directly from the bucket, using byte range requests to extract the blocks that we need to serve, in the order that we need them. And we do some clever things like batching range requests: when the blocks are close together, we'll make fewer requests by reading multiple blocks with one request. There you go. Okay, rad, so that's very cool.
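As a sketch of that batching step (the gap threshold and record shape are assumptions, not the real implementation), coalescing nearby blocks into a single range read could look like this:

```ts
// Hypothetical sketch: group blocks whose bytes sit close together in the CAR
// so each group can be served with one HTTP Range request.
interface BlockRange { cid: string; offset: number; length: number }

const MAX_GAP = 1024 * 1024; // assumption: merge reads separated by < 1 MiB

export function batchRanges(blocks: BlockRange[]): BlockRange[][] {
  const sorted = [...blocks].sort((a, b) => a.offset - b.offset);
  const batches: BlockRange[][] = [];
  for (const b of sorted) {
    const current = batches[batches.length - 1];
    const last = current?.[current.length - 1];
    if (last && b.offset - (last.offset + last.length) <= MAX_GAP) {
      current.push(b); // close enough: read with the same Range request
    } else {
      batches.push([b]); // too far away: start a new ranged read
    }
  }
  return batches;
}
// Each batch becomes one ranged read of the CAR object, spanning from the
// first block's offset to the end of the last block; the individual blocks
// are then sliced out of that single response before the UnixFS export.
```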
So who are we? We are the House of DAGs, where we keep the DAGs; that's where they live, in the house. And our alter ego is DAG House, which is our German rave night on Saturdays, and this is our DAG house in the Merkle Forest. This is where we live. So thanks very much.