Description
Paolo will walk us through Elastic IPFS.
First of all, let me briefly introduce myself: I am Paolo, a Staff Engineer at NearForm and a Node.js core member. You can find me on Twitter and GitHub at those handles. The tiny little blue dot on the right-hand side is where I come from within Italy. For the record, the rest of Italy does not acknowledge the existence of my region; they're gonna regret it somehow, but that's fine. Anyway, let's get started.
You all know about IPFS. What IPFS is about is a protocol, I mean a set of protocols, actually, designed to preserve and grow humanity's knowledge by making the web upgradable, resilient and more open. On top of IPFS, web3.storage was built, which is a service that makes storing files on IPFS very simple, even for non-technical people: with its magical web UI, you store a file and you retrieve it when you want. Pretty easy.
Now, since we are here talking about it, something was going wrong, right? Of course, there were challenges in the original architecture of web3.storage, and they had what I call the most wonderful problem that a company can experience: we cannot handle our own growth. The previous architecture was not able to handle the growth we were seeing; in this specific case, it could not handle the amount of new uploads per day. The biggest issue was that trying to add new nodes to the system was very expensive and not effective, because it took literally two days, I mean 48 hours, for a node to be added to the system, due to the DHT bootstrap phase. In short, that was the biggest challenge.
That was the problem. So the Protocol Labs people reached out to us with a very simple question: how can we use cloud services to make the service horizontally scalable with no limits? I mean, that's a pretty small requirement, not hard at all, you know, just the bare minimum, right? Unfortunately, yes, that was exactly what they asked us.
So these were the goals we immediately established. First and foremost, obviously, we needed to handle the growth: that's the first one. The second one was to be cloud-based, because of course, if you want to scale with no limits, the cloud is your only option; otherwise it's hard to dynamically add new nodes when there is a burst or whatever, so the solution has to scale pretty fast. And the last one, which is something we established in order to have a simple and lean architecture: we went stateless.
If you want to have a closer look, what is very interesting to know is that the architecture is divided into three different and completely independent subsystems, so they can work in isolation from each other. The architecture has been designed with replication in mind, so it can be replicated across many nodes, regions and so forth, and, as I said earlier, you can add and remove nodes at any time without any penalty.
Now, one thing that I want to clarify before going on is that what I'm going to tell you for the rest of my presentation is focused on AWS, but I want to make clear that Elastic IPFS is not an architecture based on AWS. It's a cloud-based architecture, which means that you can easily replicate it on, let's say, Google Cloud, Azure or even on-premises. In short, what you need is a computing system like Lambda (so serverless computing), a shared database, an object storage system and a queue system. That's all we use.
AWS is just a reference implementation for now, like, let's say, Kubo was for IPFS at the beginning: there were go-ipfs and js-ipfs, and they were just reference implementations. We are not locked into AWS; we could technically redeploy somewhere else. Of course we would have to adapt part of the code, but it is a very small part. So in theory it is a cloud-agnostic architecture, and we have a reference implementation on AWS.
We are not saying you basically have to pay money to Amazon; whether it is Google or another provider is up to you. That's the idea. The reason why we initially chose AWS is that some parts of Protocol Labs were already on AWS, so we chose to stay on the same cloud for now. It was just a convenience choice.
Now, first of all, I would say that we are using a shared database and, of course, we put data inside it. I take it you all know what a CAR file is, a Content Addressable aRchive: basically a stream-optimized file format for storing blocks in IPFS.
Then, of course, we store information about the blocks. As you can see, in this case we don't really care about the block contents, because we don't process them at all; we just store the information. So we basically store the multihash, the creation date (that is, when we first indexed the block) and the type. That's it, pretty simple.
So, basically, each block has a one-to-many relationship, because it might be included in many CAR files. For each of them we store the block multihash, the CAR file path, the length and the offset. Remember these last two pieces of information, because they will come back later when we talk about how we serve this data back to the IPFS network.
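To make that concrete, here is a minimal sketch of the two item shapes just described. The interface and attribute names are my own illustration, not the actual E-IPFS schema; what matters is the split between per-block metadata and the one-to-many block-to-CAR locations.

```typescript
// Hypothetical item shapes; real table and attribute names may differ.
interface BlockItem {
  multihash: string;  // key: the block multihash
  createdAt: string;  // creation date, i.e. when the block was first indexed
  type: string;       // the block type
}

interface BlockToCarItem {
  multihash: string;  // partition key: the block multihash
  carPath: string;    // sort key: path of the CAR file containing the block
  offset: number;     // byte offset of the block inside the CAR file
  length: number;     // byte length of the block
}
```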
So, let's analyze the very first subsystem in Elastic IPFS, which is the indexing subsystem. Once again, this is the diagram overview, but we don't really care about it.
This is the flow, and that's what we care about. It's pretty simple: at the end of the day, summarizing a lot, the indexing flow is just about opening an S3 file, reading it sequentially and, for each block, writing the block information into DynamoDB and also enqueueing the block multihash in SQS.
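As a rough sketch of that flow, assuming the AWS SDK v3 and the @ipld/car reader, an indexing function could look like the code below. The table name, queue URL and message format are placeholder assumptions, and the real indexer also records each block's offset and length inside the CAR file, which I omit here for brevity.

```typescript
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { DynamoDBClient, PutItemCommand } from "@aws-sdk/client-dynamodb";
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";
import { CarBlockIterator } from "@ipld/car";

const s3 = new S3Client({});
const dynamo = new DynamoDBClient({});
const sqs = new SQSClient({});

// Hypothetical names: table and queue are assumptions, not the real ones.
const BLOCKS_TABLE = "blocks";
const MULTIHASHES_QUEUE_URL = process.env.MULTIHASHES_QUEUE_URL!;

export async function indexCar(bucket: string, key: string): Promise<void> {
  // Open the CAR file on S3 and read it sequentially, block by block.
  const { Body } = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
  const blocks = await CarBlockIterator.fromIterable(Body as unknown as AsyncIterable<Uint8Array>);

  for await (const { cid } of blocks) {
    const multihash = Buffer.from(cid.multihash.bytes).toString("base64");

    // Store the block information; the block bytes are never processed.
    await dynamo.send(new PutItemCommand({
      TableName: BLOCKS_TABLE,
      Item: {
        multihash: { S: multihash },
        createdAt: { S: new Date().toISOString() },
        type: { N: String(cid.code) },
      },
    }));

    // Enqueue the multihash for the publishing subsystem.
    await sqs.send(new SendMessageCommand({
      QueueUrl: MULTIHASHES_QUEUE_URL,
      MessageBody: multihash,
    }));
  }
}
```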
Initially, the indexing Lambda was idempotent, which means that if you execute it several times with the same input, you will get the same output. We initially designed it like that because we were terrified of CAR files with millions of blocks inside, and we said: if something goes wrong, we don't want to start from the beginning, especially because Lambdas have a maximum execution time of 15 minutes, so if the file is too big we might not finish fast enough.
So we said: okay, let's make it idempotent, and then let's see what happens. What happened is that, in order to be idempotent, you have to continuously read and write state to DynamoDB to track the progress of the Lambda, and this was absolutely killing performance. So we had to give up on idempotency and remove it in order to handle the upload rate. Fun fact: we realized that idempotency was completely useless, because the CAR file format is so well designed that we hardly have any failures.
We are close to a zero failure rate, so we can go straight through, publish, and that's it; we don't have to do anything else. So idempotency was completely useless, we dropped it, and now the indexing Lambda is a write-only Lambda, which is able to execute and index very fast, especially because, after dropping the reads and writes of progress state, we also embraced DynamoDB batching techniques. We could go 25 times faster, which was a massive performance gain.
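For illustration, this is roughly what the batching looks like with the AWS SDK v3. BatchWriteItem accepts up to 25 put requests per call, which is where a 25x gain over one PutItem per block comes from; the table name and the retry loop are simplified assumptions.

```typescript
import {
  DynamoDBClient,
  BatchWriteItemCommand,
  AttributeValue,
  WriteRequest,
} from "@aws-sdk/client-dynamodb";

const dynamo = new DynamoDBClient({});

export async function putItemsBatched(
  table: string,
  items: Record<string, AttributeValue>[]
): Promise<void> {
  // DynamoDB allows at most 25 put/delete requests per BatchWriteItem call.
  for (let i = 0; i < items.length; i += 25) {
    let requests: WriteRequest[] = items
      .slice(i, i + 25)
      .map((Item) => ({ PutRequest: { Item } }));

    // Keep retrying whatever DynamoDB reports back as unprocessed (throttling, etc.).
    while (requests.length > 0) {
      const res = await dynamo.send(
        new BatchWriteItemCommand({ RequestItems: { [table]: requests } })
      );
      requests = res.UnprocessedItems?.[table] ?? [];
    }
  }
}
```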
Now, one very crucial component of IPFS is the Kademlia DHT, which is used to keep shared information about content discoverability and peer discoverability. Unfortunately, the DHT, the way it is designed, is a huge pain for Elastic IPFS, for two reasons. First of all, Elastic IPFS, despite being a cluster within AWS or any cloud provider for what matters, claims to the outside world to be a single node, so we only have one peer ID for the entire cluster.
Second, since we are on the cloud, we cannot maintain long-living connections: when we can, they're expensive, and when we cannot, we simply cannot, by definition. So it's impossible to participate in the gossipsub used by the Kademlia network. The consequence is that, on the DHT side, the Elastic IPFS system is not self-sufficient in providing the entire experience, and we have to rely on other technologies created by Protocol Labs people, like the network indexer, which I will call the indexer node for the rest of the talk, and Hydra nodes, created by Protocol Labs for other purposes. So, let's talk about Hydra nodes.
Actually, these Hydra nodes, when I realized how they work, turned out to be a very nice piece of architecture, because they cleverly exploit the characteristics of the DHT. They are basically nodes that are put there only to make sure that, whenever you make a search on the Kademlia network, very quickly you run into one of the Hydra nodes.
That's why the name: it is a single entity with several heads, basically, that's the idea. So when you start searching, you hit one of the Hydra nodes. All the Hydra nodes share the same database, basically a shared storage, so they can cache all the searches made on the DHT, and eventually they can also access third-party systems, like the indexer node, to fetch information for content they have not seen so far, without relying on the DHT itself.
Now, one thing that I want to clarify is that when I say that Hydra nodes are everywhere in the network, I'm not saying that they are physically present everywhere, as in there always being a Hydra node physically close to you. You have to remember that I'm talking about a neighborhood concept within the Kademlia DHT, which in other words means that all the nodes in the Kademlia network are assigned to a bucket.
All the nodes in a given bucket are considered to be neighbors, so Hydra nodes are created in order to be present in each possible bucket of the Kademlia network, and that's how the magic happens. I was astonished when I understood how it worked; it was simply amazing. And then we have the last piece which is missing for E-IPFS to work properly, which is the indexer node, or the network indexer: I think it actually got renamed, but I will use the old name.
Basically, it's a system that is capable of deterministically and immediately mapping a CID to content providers. You can talk to it via libp2p or via a plain old HTTP API. Now, guess which one we are using: for the reasons I gave earlier, we were forced to use the HTTP one because, as I already said, E-IPFS is on the cloud, so libp2p topics and gossipsub are not available to us.
Also, all of our nodes expose the same peer ID, and if it happens that multiple sources connect to the same destination, then, because of an optimization in libp2p, the destination will assume that one of the connections is stale, because that pattern usually means the source reconnected from, let's say, another Wi-Fi network or whatever, and it will drop one of the connections. So this makes communication impossible when it comes from different sources. That's why we could not use libp2p and we chose the HTTP API.
Another interesting part about the indexer node is that it reverses the control of downloading the data. Basically, you don't directly upload data to the indexer node; rather, you upload the new advertisement and entry data somewhere, and they must stay available forever over HTTP at that destination.
You have to make sure that these advertisements are strictly ordered, in a blockchain-like manner. So you update the head of the chain, and you link each advertisement to the previous one, back to the tail of the chain, which is the very first advertisement. Then, when you're done and you're pretty confident that the indexer node can download the new data, you make a simple HTTP PUT request to its /ingest/announce endpoint, and that's it. At some point, the indexer node, according to its own schedule, will download the new data, which, remember, is served over plain HTTP.

In our case we leverage this: we don't have a specific HTTP server to serve these advertisements to the indexer node; rather, we basically use a public S3 bucket and we let Amazon do the job of HTTP hosting for free. So we don't even have to manage that part, and we could basically remove one component.
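Here is a very rough sketch of that flow under the assumptions just described: the advertisement chain and its head pointer live as plain objects in a public S3 bucket, and the indexer node is then notified over HTTP. The bucket name, indexer URL, field names and announce payload are all illustrative assumptions; the real advertisement and announce formats are defined by the indexer node, not by this sketch.

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const BUCKET = "advertisements-bucket";                              // hypothetical public bucket
const INDEXER_ANNOUNCE = "https://indexer.example/ingest/announce";  // hypothetical URL

interface Advertisement {
  previous: string | null; // link to the previous advertisement (null for the tail)
  provider: string;        // multiaddress of the content provider
  entries: string;         // link to the file listing the advertised multihashes
}

export async function publishAdvertisement(ad: Advertisement, adKey: string): Promise<void> {
  // Upload the advertisement, then move the head pointer to it. Both stay
  // available over plain HTTP because the bucket is public.
  await s3.send(new PutObjectCommand({ Bucket: BUCKET, Key: adKey, Body: JSON.stringify(ad) }));
  await s3.send(new PutObjectCommand({ Bucket: BUCKET, Key: "head", Body: JSON.stringify({ head: adKey }) }));

  // Tell the indexer that new data is available; it will crawl from the head
  // back to the last advertisement it has already processed, on its own schedule.
  await fetch(INDEXER_ANNOUNCE, {
    method: "PUT",
    body: JSON.stringify({ head: `https://${BUCKET}.s3.amazonaws.com/head` }),
  });
}
```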
So, as I said, when the indexer node is ready to download the new data, it will connect to this HTTP server, start by downloading the head of the chain, and then iteratively fetch the entries and the previous advertisement, up to the point where it either reaches an advertisement it has already processed or the tail of the chain, which is, once again, the very first advertisement. And that brings us to some sample files, which I'm not sure you can actually read... yeah, you can kind of read them.
This sample is the head of the chain. If you look at line three, there is a link to the current advertisement; bear with me, that's what is written there. I will share the slides online so you can double-check, but that's what it says. Then, in the advertisement file, you can see on line two the multiaddress of the content provider, and on line seven there is the link to the entries file. Remember that when I say link, I literally mean a path on an HTTP server.
But actually, when we designed this system, we ran into another problem, which is the concurrency problem, and also the data volume problem, because IPFS and web3.storage see a steady upload rate per day in the order of millions. Now, if we make a rough estimate of a thousand blocks per CAR file, that leads us to billions of blocks uploaded per day, which is horrible and not really sustainable if you go one by one.
Moreover, since you have to provide these files in a strictly ordered fashion, if we go one by one we cannot really make a billion updates per day on any computer; this would kill us. Finally, if we introduce concurrency without carefully thinking about the implications, we might lose an entire branch of updates if two concurrent Lambdas update the head at the same time, because there is a race condition on the head, and we cannot allow that. Now, what was the solution for that?
Well, we did what we do at NearForm, which is not to panic. We sat down and we said: okay, there must be a solution, right? There was a solution. The solution was the usual divide et impera approach; in other words, we tried to reduce the size of the problem. What was the solution, put very simply? We chose to use two Lambdas in a multi-stage process.
The important part is that, while the grouping Lambda has no concurrency limit, the publishing Lambda has a strict concurrency limit of one, so only one instance can execute at a time. But when it executes, given the way we have grouped the data, we are actually publishing 10,000 advertisements at the same time. That's the trick.
Now, if you do some rough calculations, you quickly get to the point that one billion blocks, grouped by 10,000, make a hundred thousand advertisements per day to publish. If you divide that by the number of seconds in a day, which is 86,400, it means you just have to make about one call to the indexer node per second. That's it. We could probably scale this even further, because we can go much faster than that if we max out the performance of the Lambdas and so forth, but as it is we can easily publish a billion blocks per day with no problem at all.
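The back-of-the-envelope math from the slide can be double-checked in a few lines:

```typescript
// Rough check of the publishing rate described above.
const blocksPerDay = 1_000_000_000; // roughly one billion blocks uploaded per day
const groupingFactor = 10_000;      // blocks grouped together before publishing
const secondsPerDay = 86_400;

const advertisementsPerDay = blocksPerDay / groupingFactor;      // 100,000
const announcesPerSecond = advertisementsPerDay / secondsPerDay; // ~1.16

console.log(advertisementsPerDay, announcesPerSecond.toFixed(2));
```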
Last one now, bear with me: we indexed the data, we published it to the IPFS network, and now we actually have to serve the data, right? That brings us to the last subsystem, which is the peer subsystem. Once again, this is the overview; let's just skip it. What are the characteristics of the peer subsystem, which is delegated to be contacted by a lot of people from all around the world?
Well, the trick was to have a fully automatic EKS cluster, fronted by an Elastic Load Balancer, which basically does the balancing for the WebSocket connections. That's it, that was the trick.
The second part of the trick is that our Bitswap is a read-only system and therefore stateless, which means you can scale nodes up and down as you want, without any penalty. Another thing is that the nodes are very lean, because when you ask us for a block, we check on DynamoDB whether we have it. If we don't have it, we directly respond that we don't have it; we don't first try to fetch it from the external network.
If we have it, we immediately serve it by contacting S3 over HTTP; I mean, we're not using the AWS SDK, we're actually hitting the bucket via plain HTTP. We leverage HTTP byte ranges, and that's why we care about the offset and the length: we can immediately pick the exact bytes in the file that we need and serve them from memory to the client, without even storing them on the node's file system.
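A minimal sketch of that serving path: a plain HTTP Range request against the CAR file's public URL, using the offset and length stored in the block-to-CAR table. The URL layout is an assumption; the byte-range logic is the part described in the talk.

```typescript
// Fetch a single block straight out of a CAR file over plain HTTP.
export async function fetchBlockBytes(
  carUrl: string,  // public HTTPS URL of the CAR file on S3
  offset: number,  // byte offset of the block inside the CAR file
  length: number   // byte length of the block
): Promise<Uint8Array> {
  const res = await fetch(carUrl, {
    // Range is inclusive of both ends, hence the -1.
    headers: { Range: `bytes=${offset}-${offset + length - 1}` },
  });
  if (res.status !== 206) {
    throw new Error(`Expected a partial content response, got ${res.status}`);
  }
  return new Uint8Array(await res.arrayBuffer());
}
```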
Moreover, the second trick was that the peer subsystem speaks Bitswap, but a simplified version of it. Once again, since we are read-only and we don't fetch external data, we could remove most of the more complex parts of the Bitswap protocol, like the ledgers, the want-list tracking and so forth. We receive a want-list, we immediately process it, we respond, and that's it; we don't do anything else. Very simple, and that's why we are able to scale up and down very quickly.
Now, you all have this question: how does E-IPFS perform? Well, my good friend Alan already showed you this slide, which talks about 200 terabytes of data. I challenge you to spot when E-IPFS was actually deployed in that graph; you can easily see it, right? We went from close to zero to close to a 100% hit rate for the system in Bitswap. But there is also a second slide, this one, which is the average indexing time.
On the right-hand side of the graph, it's not that there is no sample data to show anymore; it's just that the left-hand side is so big that the right-hand side, where the performance has improved, pretty much disappears from the graph, because we went from an average indexing time of three minutes to an average indexing time of a few milliseconds.
And finally, if you want the raw numbers: after six months in production, we have close to 100 million CAR files, 16 billion blocks and 25 billion block-to-CAR links, and we are serving them with the performance we have just seen.
Now, what can we learn from this journey? I told you the title was related to love, and that's because of one of the nicest acronyms in the Unix world: the KISS rule. For those who don't know it, that acronym stands for Keep It Simple, Stupid, which means that while there are a lot of very nice and complex protocols out there, if you try to completely embrace, implement and replicate all of them, you will just go crazy. There's no way you can do that.
Second thing: HTTP is not dead and probably never will be. I mean, I'm biased, because I maintain the HTTP stack in Node.js, so I'm in Node core, so I'm biased. But the thing is that new protocols are fashionable and they're very nice, and Bitswap works awesomely fine, but when you want to stay simple, fast and performant, HTTP is often your choice. It still works fine after, I don't know, 20 or 30 years now, right? I mean, I'm getting old, but anyway.
Finally, democracy is good. As you remember, I told you that this is a cloud-based architecture which just happens to be on AWS at the moment, temporarily, because we are going to be cloud agnostic, as Alan said. For us it was also good to know that in the future we might be on another cloud, or even on a mixed-cloud approach, because that's perfectly legit. And why not?
One last thing, which is also something for people like me who like to over-engineer stuff: remember to keep your eyes on the stars, so keep dreaming, but keep your feet on the ground, so don't lose the practical approach, otherwise you will soon get lost. This one is from Theodore Roosevelt, so we are talking about 80 years ago, but it's still up to date: you can still apply it every day, and it will still apply in the future.
Very quickly, I want to thank NearForm for sending me here. We are a global consulting company, deeply into JavaScript, so if you happen to run npm, 88 percent of the time you are using our stuff; you can't escape, I'm sorry. And even though we are more than three hundred people now, we are always looking for new talent. So if you want to engage, you can reach out to me, or to Cody and Matt, who are outside. And that's it. Do you have any questions for me?