Description
This talk was given at IPFS Camp 2022 in Lisbon, Portugal.
Today, we're going to be discussing some of what the team has been doing. Our team is codenamed ProbeLab, because we're doing lots of probing in the IPFS and libp2p networks, and some of the results that we have are what we're going to be presenting today.
So, let's get started. What we want to do is data-driven protocol design and optimization. There is a lot that someone can learn by looking into the actual data from the network: how the protocols actually perform compared to their specification, and compared to what we think at programming time will happen in the real world.
Our kind of motto is that you can't really improve what you don't measure, because you don't know where the bottlenecks are, and you should measure what you think you have just improved. If you apply an optimization to a protocol and push it out to the network, and the network starts transferring bytes around, then unless you have the right measurement and monitoring infrastructure, you can't really say whether the optimization that you had planned and were expecting is actually the improvement that is seen as performance in the network.
So that's what we're trying to do now. The end goal of doing measurements in the network is not really the measurements just for their own sake; they are not an end in themselves. The end goal is basically to identify the problems, quantify how much room we have for improvement and, finally, design protocol optimizations. And of course, as with every other protocol in the IPFS stack and the IPFS ecosystem, which is open source, we want our results to be open source as well.
We get the results, and then we have to do some analysis to see if they are what we expected them to be, if intuitively they make sense. If they don't, of course, we need to revisit step two or five; if they do show what we were expecting, then all good, we move on to the next study. So, the methodologies that we have used so far.
There are three complementary methodologies: one is through crawling, via several crawlers that exist in the IPFS network, which we've written or enhanced from existing ones; another is through probes, which are basically controlled nodes in the network; and the third is through logs.
For continuous network monitoring we have been using the mighty Nebula crawler, whose author is right here and will be talking later today. It's a very useful crawler that we kind of adapted, building on the features it inherited from previous crawlers, in order to get some more information from the network that was important to us. The crawler has been running since last summer; it's now perhaps above 10,000 crawls, and if not, it's close to that, and we did some analyses.
Some of the results I'm going to present are from October 2021, which is the blue stripe down there. Among other things, we have found that we've seen more than 200,000 peers, more than two million addresses and about half a million IP addresses. We've seen that the network is handling more than one billion requests per week, as it should say there, and we've seen that IPFS nodes exist in more than 150 countries and 2,700 autonomous systems, which kind of verifies the fact that IPFS is a distributed network.
What the crawler also does, though, is monitor uptime, and that was important for us because we didn't just want to see how many peers are in the network, but how much churn there is in the network. So, churn is when a node joins the network and then leaves at some point: what's the time between those two events, and of course, after leaving, when does it come back again? These are important questions for P2P networks.
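To make that concrete, here is a minimal sketch in Go (not the Nebula crawler's actual code) of how session lengths could be computed from observed join and leave timestamps; the `Session` type and the sample data below are assumptions made purely for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Session is a hypothetical record of one observed peer session:
// when the crawler first saw the peer online and when it noticed it gone.
type Session struct {
	PeerID string
	Join   time.Time
	Leave  time.Time
}

// medianUptime returns the session length below which half of all observed
// sessions fall, i.e. the time after which 50% of peers have left.
func medianUptime(sessions []Session) time.Duration {
	durations := make([]time.Duration, 0, len(sessions))
	for _, s := range sessions {
		durations = append(durations, s.Leave.Sub(s.Join))
	}
	sort.Slice(durations, func(i, j int) bool { return durations[i] < durations[j] })
	return durations[len(durations)/2]
}

func main() {
	now := time.Now()
	// Toy data standing in for what a crawler would collect over many crawls.
	sessions := []Session{
		{"peerA", now.Add(-90 * time.Minute), now.Add(-20 * time.Minute)},
		{"peerB", now.Add(-3 * time.Hour), now.Add(-2 * time.Hour)},
		{"peerC", now.Add(-45 * time.Minute), now},
	}
	fmt.Println("median session length:", medianUptime(sessions))
}
```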
So, some quiz time: who knows what the peer churn in the IPFS network is? Who can fill in this gap: 50 percent of peers leave the IPFS network after how many hours?
Two hours? Okay, any other takers? Twelve? Six? Not twelve, definitely not twelve, unfortunately. [laughs] Half an hour? Okay, you're stretching it, but you're not far away. Fortunately or unfortunately, it's about one hour, so those two are the closest. Of course, this varies between different implementations of IPFS and also different versions of IPFS, but yeah, there is lots of churn in the network, at least from what we have observed.
So, the cloud dependency: IPFS is a decentralized network, but how many of the nodes run on centralized cloud infrastructure?
Twenty? Okay, getting closer, but be ready to be surprised. It's about three percent of nodes that run on centralized cloud infrastructure, from our measurements, which we double-checked and triple-checked because, yeah, it was a little bit surprising to us as well. But it's definitely good news if you ask me, because it means that the community is putting up its own infrastructure to host IPFS nodes, which is great. One final one to do.
Now, this is a representation of the DHT routing tables in the IPFS DHT. There are somewhat fancier representations, but basically, yeah, it's dots, each connected to others, although the connections are not shown here; it's the routing table representation of nodes in the network, which looks pretty cool. I mean, there are some things that we also cannot really understand and we need help with, but that's why we're here.
Cool, okay. So I'm going to do a brief overview of how the IPFS system works, at least through the lens of the IPFS DHT, because this is going to influence a little bit what we're going to talk about next. What happens in IPFS when you have a document and you want to share it with others through IPFS is that you take the document and you hash it.
As you know, that produces the CID, and then what you do is create a small file that includes the CID and, if you're the provider, your own address information: the peer ID and the IP address. You put that into a small file and you send it to the IPFS DHT, and then the IPFS DHT does its magic, going and finding one node in the network, which, let's assume, is that one.
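To illustrate those two steps, here is a rough Go sketch, assuming go-cid, go-multihash and the go-libp2p Kademlia DHT: hash the content into a CID, then announce a provider record for it. Host and DHT setup, bootstrapping and error handling are omitted; this is not kubo's actual publish path, just the shape of it.

```go
package sketch

import (
	"context"

	"github.com/ipfs/go-cid"
	dht "github.com/libp2p/go-libp2p-kad-dht"
	mh "github.com/multiformats/go-multihash"
)

// cidForBytes hashes raw content and wraps the digest in a CIDv1; this is
// the identifier that ends up inside the provider record.
func cidForBytes(data []byte) (cid.Cid, error) {
	digest, err := mh.Sum(data, mh.SHA2_256, -1)
	if err != nil {
		return cid.Undef, err
	}
	return cid.NewCidV1(cid.Raw, digest), nil
}

// provide announces to the DHT that this node holds the content behind c.
// The DHT stores the provider record (CID plus our peer ID and addresses)
// on the peers closest to the CID; the content itself never leaves us.
func provide(ctx context.Context, d *dht.IpfsDHT, c cid.Cid) error {
	return d.Provide(ctx, c, true)
}
```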
Next, what you do is send that CID to your friends that you want to access the content. So they have the CID, and what they do is use a protocol called Bitswap, basically asking their immediately connected peers whether they happen to have that CID. If the answer is positive, all good: they get the file and call it a day.
If not, the request goes to the DHT, and what the yellow node does is send what is called the provider record, that small file that you stored there, to your friend. At that point, your friend has got the CID, of course, which they already knew, but also your address information: the peer ID and the multiaddress, basically the IP address. So they establish a connection and they get the file. That's what is happening at a high level. Now, there are two points worth highlighting here.
One is that you don't upload the file itself to the DHT; you just upload the provider record, which then points to your machine, and others can come and take the file from you, at least in the base version, unless we add other bits and pieces on top. The second one is that, as you know, once you get the file you can hash it again and verify that it is the content you asked for.
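A minimal sketch of the DHT side of that retrieval path, assuming the same go-libp2p stack as above: look up provider records for a CID and dial the first provider found. The Bitswap broadcast to already connected peers is a separate subsystem and is not shown, and this is illustrative rather than kubo's retrieval code.

```go
package sketch

import (
	"context"
	"fmt"

	"github.com/ipfs/go-cid"
	dht "github.com/libp2p/go-libp2p-kad-dht"
	"github.com/libp2p/go-libp2p/core/host"
)

// findAndConnect walks the DHT for provider records of c and connects to the
// first provider it hears about; fetching the blocks would then happen over
// Bitswap against that peer.
func findAndConnect(ctx context.Context, h host.Host, d *dht.IpfsDHT, c cid.Cid) error {
	for prov := range d.FindProvidersAsync(ctx, c, 1) {
		// prov is a peer.AddrInfo: the peer ID plus the multiaddresses taken
		// straight from the provider record stored on the DHT.
		if err := h.Connect(ctx, prov); err != nil {
			return err
		}
		fmt.Println("connected to provider", prov.ID)
		return nil
	}
	return fmt.Errorf("no providers found for %s", c)
}
```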
So that's it at the high level. Now, there are some opportunities that we've seen while playing around and understanding all the different steps that are involved. We figured out that the provide process, so the very first step there, when you basically want to put the provider record to the DHT, is very slow. It takes tens of seconds, in some cases more than 100 seconds. So the hypothesis there was that there is some bottleneck in the IPFS DHT provide process, which we wanted to prove.
Through measurements, we have found that, although the overall process takes tens or hundreds of seconds, the nodes that we are finding (that red node there and all the rest; we're not finding one, we're finding much more than one, we're finding 20) are all being found mostly within less than half a second.
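As a sketch of how anyone could observe that gap on their own node, simply putting a stopwatch around the standard Provide call shows the end-to-end publish time, which can then be compared against how quickly the closest peers were discovered; this is just illustrative timing code, not the instrumentation used in the study.

```go
package sketch

import (
	"context"
	"log"
	"time"

	"github.com/ipfs/go-cid"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

// timedProvide wraps the standard DHT Provide call with a stopwatch so the
// total publish duration can be logged and compared with lookup latency.
func timedProvide(ctx context.Context, d *dht.IpfsDHT, c cid.Cid) error {
	start := time.Now()
	err := d.Provide(ctx, c, true)
	log.Printf("provide for %s took %s (err: %v)", c, time.Since(start), err)
	return err
}
```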
So this leads us to think that something must be going wrong there, because it takes an order of magnitude longer to complete the process when we could be as fast as doing it in less than one second. So we're working on this; there is a lot of documentation and there are presentations on that, and it's ongoing work on how to improve it.
What I want to say by that is that, as you start playing with the network and through measurements, you can find out important details and important optimizations about the project, about IPFS or any of the projects that you're working on. Now, the second one is the lookup latency, in particular the DHT lookup latency.
The hypothesis there is that, if we break down the content routing process, which is composed of many, many steps as we've seen in the previous slide, then we'll identify every bottleneck that exists, and that's very good, because by applying optimizations there we can then improve the performance. So what we did is a controlled experiment.
We spun up several different nodes that are controlled by us, published a unique CID from one of them, and then went and requested it from all of the rest. That went around, because other nodes then published CIDs and the rest requested them, and we repeated that several times. In the end, there were more than 3,000 CIDs published and almost 15,000 CIDs retrieved; that's the kind of sample size that we are talking about.
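Conceptually, the experiment loop looks like the sketch below; the `Node` interface with `PublishUniqueCID` and `TimeRetrieval` is a hypothetical stand-in for our actual tooling (which is linked from the reports), and only the overall shape of the experiment is intended to be accurate.

```go
package sketch

import (
	"context"
	"time"

	"github.com/ipfs/go-cid"
)

// Node is a hypothetical handle on one of the controlled IPFS nodes.
type Node interface {
	PublishUniqueCID(ctx context.Context) cid.Cid               // publish fresh, never-seen content
	TimeRetrieval(ctx context.Context, c cid.Cid) time.Duration // resolve and fetch, return elapsed time
}

// Measurement is one latency sample: which node published, which requested.
type Measurement struct {
	Round, Publisher, Requester int
	Duration                    time.Duration
}

// runExperiment has every node take a turn publishing a fresh CID while all
// other nodes time how long it takes them to resolve and retrieve it.
func runExperiment(ctx context.Context, nodes []Node, rounds int) []Measurement {
	var results []Measurement
	for r := 0; r < rounds; r++ {
		for i, publisher := range nodes {
			c := publisher.PublishUniqueCID(ctx)
			for j, requester := range nodes {
				if i == j {
					continue // a node does not request its own content
				}
				d := requester.TimeRetrieval(ctx, c)
				results = append(results, Measurement{Round: r, Publisher: i, Requester: j, Duration: d})
			}
		}
	}
	return results
}
```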
What we found out is that around 80 percent of requests from an EU-based node have been resolved, at least through the DHT part of the procedure that I described, in less than 500 milliseconds, and 50 percent of all the requests got through the DHT part of the resolution process in less than one second. So what does this tell us? Where is the opportunity here?
If you look at the middle picture, that is just the DHT walk duration, the DHT part of the process, and this is what I was talking about above. But if you look at the one on the left as you're looking at it, you see that everything is basically shifted by about one second. So there is one second there, which is something like 100 percent on top of the overall time; it should be brought down, because it decreases performance significantly.
Now, this is the Bitswap step, so that's the opportunity here. The Bitswap process that we mentioned in the beginning, where you go and ask all your immediately connected peers, might take a lot longer, and unless it is very successful, it's going to delay everything by 100 percent of the time; it's going to double and triple the time that you need as a normal user, as a normal client, to resolve content from the DHT.
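The control flow behind that extra second is roughly "broadcast first, walk the DHT only afterwards". The sketch below illustrates why an unsuccessful broadcast adds its whole timeout on top of the DHT latency; `askConnectedPeers` and `walkDHT` are hypothetical stand-ins for the Bitswap broadcast and the DHT lookup, passed in as functions just to keep the sketch self-contained.

```go
package sketch

import (
	"context"
	"time"

	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p/core/peer"
)

// resolve illustrates the broadcast-then-DHT pattern: if none of the already
// connected peers answer within bitswapTimeout, that whole wait is paid on
// top of the DHT walk for every piece of content they cannot serve.
func resolve(
	ctx context.Context,
	c cid.Cid,
	bitswapTimeout time.Duration,
	askConnectedPeers func(context.Context, cid.Cid) (peer.AddrInfo, bool), // hypothetical Bitswap broadcast
	walkDHT func(context.Context, cid.Cid) (peer.AddrInfo, error), // hypothetical DHT provider lookup
) (peer.AddrInfo, error) {
	bctx, cancel := context.WithTimeout(ctx, bitswapTimeout)
	defer cancel()
	if prov, ok := askConnectedPeers(bctx, c); ok {
		return prov, nil // a connected peer had it: fast path, no DHT needed
	}
	// The broadcast failed, so only now does the DHT walk start; the caller
	// pays bitswapTimeout plus the DHT latency instead of the DHT latency alone.
	return walkDHT(ctx, c)
}
```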
So, if our hypothesis is correct, it means that we can see a great improvement in the resolution process. This is an ongoing study that we have: we're trying to figure out how successful Bitswap is, whether it succeeds a lot or not, and what that means for the average user. So yeah, these are two of the studies that we did based on measurements. Again, what I want to highlight is that, as you dig more, you find more optimizations.
How well instrumented is every routing table, so that it can point to the right other nodes if you ask them for something, a CID or a peer ID, in the network? Great study, very detailed report; there you'll understand how the DHT works in great detail, which is not very easy. Then, provider record liveness: again, we're going to have a talk in a little bit about that.
The research hypothesis there was that, if provider records do not stay alive, then the content that is published in the network is not reachable, which of course is terrible. If you have a storage and retrieval network like IPFS and you publish content and then suddenly you cannot find it, then it's not great news. So, great results there, very encouraging. Did I have a third one? I think I had a third one... yeah, I skipped it.
Okay, so that was the third one. That pretty much concludes what I wanted to say, as in, roughly, what we have been doing. There are many more studies, so you can see some of the results that I mentioned, but also many more, at this URL on ipfs.network. There are weekly reports there, very detailed and very interesting; they talk about the geolocation of users, rotating peer IDs and the churn of the network, of course, and a lot more, so go check it out.
You can read pretty much all of what I said in this recent paper that we have. You can find it online; of course, I'm going to share the slides, and this is the CID, which is on the IPFS network, so you can find it through that. So, you can get involved. You can find lots of what we're doing on our Notion page, which is also linked from here. We have funding available through the radius.space platform.
You can go there and apply. You can follow most of what we're doing in the GitHub repository network-measurements, which is where we put our reports and where we put the requests for measurements, as we call them, and there are many, many that are open. Of course, you can go and work on some of them, or you can add your own ideas. So we are looking for things like: why are these black lines there? Okay, let's figure that out.
You can also find us on the IPFS Discord under #probe-lab; that's where the team is mostly chatting, so yeah, join that. On the ProbeLab Notion page you can also see the board of the projects that we have, with our current projects that are in progress, those that are done, the next ones that we're going to work on, and so on.
You can use some of the tools we're developing to do your own research. One of the teams that is present here right now developed this telemetry tool that you can put on your IPFS node (if you're running one, you should), and then you can start getting statistics out of this node, which is very useful. I think there is documentation on how to set up flashy Grafana dashboards as well, but that's the easiest part; how to do the telemetry is the most important one.
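The telemetry tool itself is linked from our resources; purely as an illustration of the idea (this is not that tool's code), exposing a node-side metric for Prometheus and Grafana to pick up looks roughly like this in Go, where the metric name and port are arbitrary choices made for the sketch.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// lookupLatency is an example histogram a node could feed with the duration
// of every DHT lookup it performs; Grafana then plots it over time.
var lookupLatency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "ipfs_dht_lookup_seconds",
	Help:    "Duration of DHT lookups performed by this node.",
	Buckets: prometheus.DefBuckets,
})

func main() {
	// In a real node this would be called after each lookup completes.
	lookupLatency.Observe(0.42)

	// Expose the /metrics endpoint that a Prometheus server scrapes.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```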
What else? We'll also ask you to get involved in another study that is going to be primarily presented on Sunday, on libp2p day. The libp2p team, as you might know, has developed a NAT hole punching approach; NAT hole punching is one big problem that has not been solved in peer-to-peer networks, and the libp2p team now does have a solution. We're going to be running a study where we're going to ask users, anyone in the community, to download a binary and run it, and we're not going to get any of your personal information, of course.
What this is going to do is instrument that node and run measurements between your node and some of the nodes that libp2p is running, to see whether NAT hole punching can work through your home network, which is great; it's really going to improve performance a lot if we manage to get that right. So we're doing the measurement study for the libp2p team, and the experiment is going to run later on, in December.
Of course, we're going to make the results publicly available afterwards, so you're going to be able to use them, yeah, analyze them, publish your own papers or publish your own blog posts and reports.
Sorry, it's too far away. Okay, I'll share the slides in the Slack channel of this particular track, so that you can get them from there. Cool. So that's it from my side. I'm very excited about the rest of the program; we're going to talk about some things I touched upon, but also some others that I have not talked about. So yeah, let's welcome our next speakers.