From YouTube: Linux2ipfs - Joroppo // IPFS Implementations Workshop
Description
In this May 2022 IPFS implementations workshop, we heard a number of first-hand accounts from builders & thinkers working on and with IPFS, taking stock of the current implementation story, and imagining what’s possible through shared effort on a protocol with such broad applicability.
Hi everyone, I'm Jorropo. I work at Protocol Labs as a developer advocate for IPFS, and I will tell you the story of something I've been building for IPFS, which is linux2ipfs.
The main part of the story is not actually linux2ipfs itself; it's mainly how it came to happen and what lessons you can take if you want to make your own custom project.
So the first problem I have is: I'm a Linux user.
That means that, basically, when someone downloads the data, they reshare it to other people. Right now, the way the Debian and Ubuntu repositories work is that they have a lot of mirrors, which are web servers around the world. That's very expensive, so obviously the developers do not pay for most of them; most of them are run by universities that have a few gigabits of bandwidth free and host one. IPFS would allow, instead of having lots of scattered mirrors, one global network that everyone can use, which is very fast in total.
So mirrors are basically just web servers with the files, and the first step for me was getting the actual data for the packages, to be able to create my own repository. The first thing I tried was wget because, well, I have web servers, and wget downloads from web servers. It didn't work very well, mainly because I couldn't find an easy way to do incremental updates.
The repo itself is very big, but it's not getting updated completely every day; most of the time, only a few gigabytes are added per day, and ideally I want to download only the new gigabytes, as otherwise I would be wasting my bandwidth. So I didn't go the wget route. Instead, I went with rsync, which is a very well-known tool for doing that, and that's my command, that's my script.
It downloads Debian, Ubuntu and Termux, and that's basically it. Now I have the data, and I have an issue, which is that it is very big. That depends on your definition of very big, but I basically tried... oh yes, sorry, my setup.
For context, this is a NAS which I built out of recycled hardware. Only the HDDs are new; everything else is very old hardware, and that means that when I tried adding the data to IPFS, it was very slow.
It took more than a week, and it wasn't getting faster; every time, it took more than a week, and the repos update more often than that, so I would always be catching up and serving very old data. If you assume you want to download security patches, for example, having them a week late can be devastating. So that was not a good solution.
In my explanation I will give you some tips about writing fast software. The first one, the most important one, is: measure. Really.
You might think you know why your software is slow, and I can tell you from experience: no, you don't. It's incredible how many times just profiling showed me that my assumptions about what was slow and what was fast were wrong. That's the most important rule: find your bottleneck by benchmarking and profiling.
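As a generic illustration of that advice (not from the talk), Go ships a sampling profiler in its standard library; a minimal CPU-profiling sketch, where `hotLoop` is a made-up stand-in for whatever code you suspect is slow:

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
)

// hotLoop is hypothetical busy work standing in for the code under test.
func hotLoop() uint64 {
	var acc uint64 = 1
	for i := 0; i < 5_000_000; i++ {
		acc = acc*6364136223846793005 + 1442695040888963407 // LCG step
	}
	return acc
}

func main() {
	f, err := os.Create("cpu.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	pprof.StartCPUProfile(f) // sample where CPU time actually goes
	defer pprof.StopCPUProfile()
	fmt.Println(hotLoop() != 0)
}
```

Inspect the result with `go tool pprof cpu.pprof`; the flame graph often contradicts your intuition about where the time went.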
There are many tools you can use, but basically the conclusion I came to was that IPFS was not playing well with my IO. The main issue, I believe, was my disks, which are hard drives, and the problem with hard drives is that they have poor random access rates. I'm not actually sure this was the issue, but that's what I tried to fix.
The first rule is: only do the work you need to do.
That's the first rule for going fast, and the way I think IPFS fails at this, for my precise case, is that it's a very complete thing: you can use it to build a lot of applications. The main issue I had with the way I was building is that it's kind of built around a daemon model: you have a single service that communicates over an API, which is often HTTP, and that had lots of overhead and prevented some optimizations I wanted to do.
So to do that, I created linux2ipfs, which is a custom tool I've made. It's really small, and the main requirement was that it had to be easy to experiment with: it's one single main file, about a thousand lines of code, and it only has one job, which is taking the files on the hard drive and uploading them to an IPFS node, which right now is Estuary, for convenience reasons.
For example, linux2ipfs doesn't know how to do Bitswap, it doesn't know libp2p, it doesn't know networking, but I don't really care: I want it to talk to a node, and that it does really well.
So, the way it works: the main part of it is the recursive function. That's really the hard bit of the code, and it's mainly what you would think; you would code it something like this.
I give it a path on my hard drive and it will scan it, see what it is, and depending on the type it's going to do a different thing. So basically, with the small example on the bottom right, let's assume it's adding this directory: it would first start here at the top and create an empty directory block. Then it will recurse, and when the recursion at this step is finished,
we enter foo, we recurse into foo, we enter bar, we recurse into bar. Then we take care of baz and hi, which are the files, so we write them to the output CAR file, and once baz is finished we can go to hi. Then, when hi is finished, we go back up the stack, so bar, foo and the top, and we have finished and added everything. I don't want to spend much time on this, because it's really the canonical way of implementing it.
The second rule of going fast is cheating. First, one issue where IPFS was not going fast: as I told you, the Linux repos update by a few gigabytes per day. However, the way go-ipfs handles this is that it hashes everything again, and if it finds the same hash, it's not going to re-store it. The issue is that this means I have to hash everything, and I have to remind you, this is old hardware with a terrible CPU.
I don't want to do that. So the way I get around this is by cheating a bit. Most file systems report modification times, and that allows me to compare a file's modification time with the last time I updated it; if it's older, I know the content hasn't changed, so I just reuse the CID I got the previous time. I have an old.json file which stores all my old CIDs, and I use that to skip adding files that have not changed.
That's also good for bandwidth: because I don't have to re-upload the whole repo to Estuary every time, I just upload the new files.
The third rule of going fast is using your resources to their full potential.
It's not really that they go fast, but they have very low latency, while hard drives have terrible latency. The issue I was running into is that most of the time I was waiting for the disk to seek to a certain place: the way a hard drive works is that it has a head that seeks around the disk, and most of the time was spent doing this. The way you can fix this is with reflinking: instead of copying things back and forth between my files and my destination, I reflink the data. That's a feature of newer Linux kernels and of btrfs,
the file system I use, which allows me to create a copy-on-write copy. That means the data is not actually copied when the copy is created; there is only a tiny block on the disk which says that this file, the new destination, points to the data of the old file. If modifying the shared data just modified the block on the disk, it would modify it for both files, which I don't want, because then the data would be corrupted. So what they do instead is copy-on-write:
if someone writes to that shared piece of data, the file system creates a copy of it and writes to the copy instead. That's where most of the speed comes from: because I skip copying the data, I only have to read it, and I only write a few megabytes of metadata for the terabytes of data I have.
The second thing is that with IPFS, I could not find an easy way to send the data to Estuary while I was traversing it, so I had to first do the ipfs add, which takes weeks, and then upload to Estuary, which could potentially also take weeks. So what I did is what the small representation on the left shows.
We have the traversing module, and basically it's going to send the data in 32-gigabyte chunks: we create a 32-gigabyte file and send it to the sending module, and they both work in parallel. The traversing module doesn't have to be completely finished; it can send smaller chunks, and that allows the two to work in parallel. That's also very nice because the total time is basically that of whichever of them is slowest, since the other one just runs in the background.
So right there it's also gaining a lot of speed. The issue I had doing that is that I had to split my DAG into multiple blocks, and that didn't just work: Estuary wants full DAGs, and full DAGs means that if I send a directory, I also have to send all the files and the subdirectories, and all the sub-files in the subdirectories, and so on all the way down the DAG.
So the main difference is that when I send the DAG to Estuary, I'm not giving it the true CID; I'm only giving it raw CIDs. A raw CID is just some bytes: it doesn't have any encoding, the data is treated as-is. The way it works is that I send the actual bytes, the same serialized bytes, but with the raw CID, so the hash matches. Estuary is getting the fake root, which splits things into smaller blocks of 32 gigabytes.
But if someone wants to use the real root, which is far bigger, it still works, because when a person downloading the real files asks Estuary for them, it actually doesn't care that the codec is wrong; all it checks is that the hash matches. That way I can get around splitting my files into smaller chunks. And that's not only about speed: it's also because Estuary has a limit of 32 gigabytes, due to the Filecoin sector limits.
This way I'm also getting around the Filecoin size limit, and I can split the data across multiple Filecoin sectors to store it.
The last thing I did to greatly improve performance is parallel chunking. Instead of having the traversal do one block at a time, I now spawn a lot of different chunkers, and that's on one single file. It means that instead of working on one two-megabyte block of the file at a time, I can work on 32 times 2 megabytes, and that's very good at generating big queue depths.
Basically, since I'm now asking the kernel for a lot more data, roughly 32 times more, the kernel has much more leeway in which data it can give me first, and it is able to optimize the seek pattern of the hard drive far more efficiently. That gives me better performance, because I am getting more performance out of the hard drives by asking more of them. So now I have an architecture kind of like this:
I have the traversal source that creates chunking jobs, and the chunkers and the traversal send the blocks to the sending module, and all of this works in parallel, which helps going faster.
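The traversal-to-sender overlap can be sketched with goroutines and a channel. This is a toy version with the upload stubbed out and tiny segments (32 GiB in the talk); the point is that the two stages run concurrently, so total time is roughly the slower stage, not the sum:

```go
package main

import (
	"fmt"
	"sync"
)

// segment is a stand-in for one chunk of CAR data ready to upload.
type segment struct {
	id   int
	data []byte
}

// runPipeline streams n segments from the traversal stage to the
// sending stage through a channel; both run at the same time.
func runPipeline(n int) int {
	segments := make(chan segment, 4) // small buffer decouples the stages
	var wg sync.WaitGroup
	wg.Add(1)
	sent := 0
	go func() { // sending module: uploads while the traversal still runs
		defer wg.Done()
		for s := range segments {
			_ = s.data // here: POST the segment to the pinning service
			sent++
		}
	}()
	for i := 0; i < n; i++ { // traversal module: emit segments as they fill
		segments <- segment{id: i, data: []byte("car bytes")}
	}
	close(segments)
	wg.Wait()
	return sent
}

func main() {
	fmt.Println("segments uploaded:", runPipeline(10))
}
```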
And the final result of all of this work: before, on go-ipfs, an add would take one week, while adding the same data with linux2ipfs took one hour and 30 minutes. That's for a full flush, where I have to add all the data again; if I do an incremental update,
I can improve on that number even more. Basically, that's very good. For reference, this test was adding the Debian repository, which is 1.8 terabytes.
So I think that's pretty good. One neat thing is that this is actually faster than the write speed of my disks; my disks can only do... I don't remember the actual number, sorry. But because I'm using reflinking and I don't have to do the copy, linux2ipfs creates a fake copy, a copy-on-write copy, and since it's not actually moving the blocks on the disk, it's able to go faster than the actual write speed, because it's not writing.
So again, the first rule: not doing work. Since I'm not doing the copies, it's faster. As for future plans, I want to make it actually usable. It doesn't currently work for hosting Linux data, mainly because I have an issue with symlinks in the gateway: when apt, the Debian package manager, tries to fetch one, it doesn't really understand it. Basically, the package manager and the IPFS gateway have different ideas of what should happen with the same symlink; they don't agree, and it doesn't work. I also want to make it faster.
There are multiple points that could be improved, and all of them follow the zeroth rule of making code faster, measure: these are things I measured that are actually slowing me down, my current bottlenecks. Starting is slow, so I could accelerate the traversal. I could start parallel uploads, because the pinning service is very slow; in my case that's the main bottleneck. And BLAKE2, because if the pinning service were fast enough, then my CPU would be the slow part.
All of these could improve it, plus a better UX and removing bugs, because it's buggy; I don't actually recommend you use it. The main point of this is not to tell you "hey, go use linux2ipfs, it's great." It's not: it's buggy code that works for me because I wrote it, but I'm not really sure anyone should use it. The main point is that this took me about five days of work to make, doing the same optimizations, in Go.
Since linux2ipfs has very few features, I can spend more time polishing them in the same amount of time, and all of this is possible thanks to the docs and specs. Really, the hardest part is understanding what you want to do; once you kind of know what an IPLD block is and why you want to do X and Y, reading the spec and implementing it is really easy, I think. Once you get past that first understanding hurdle, actually writing your own implementation is surprisingly easy.