From YouTube: Measuring the Web3.0 Stack
Description
Permissionless networks are challenging to design and operate. Developing methodologies to get insights into the performance of nodes in the network is an essential operational procedure. In this talk we are going to go through the steps we have taken so far to measure IPFS, and the Web 3.0 stack more generally, and the results we have gathered. We are also going to point to directions we plan to pursue in the near future and invite the community to get involved!
Hi everyone, very nice to be here, and thanks for the great intro, Bailey. Indeed, my last name is a little bit unpronounceable; my name is Yiannis Psaras, I'm a research scientist at Protocol Labs, and today I'm going to be talking to you about measuring the Web3 stack.
A
So
when
it
all
started
was
a
great
workshop
that
we
organized
back
in
june,
which
was
called
the
idf
or
diff
more
simply,
and
it
focused
on
decentralizing
the
internet
with
ipfs
and
filecoin.
We
had
great
great
sessions
with
top
researchers
and
scientists
from
around
the
world.
You can find the link to the GitHub repository out there. That was a crawling tool: we started doing something that was long overdue, crawling the network more systematically in order to find out what's actually going on inside the nuts and bolts of IPFS. The crawling tool runs roughly every 30 minutes.
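As a rough illustration (not the actual Nebula code), a periodic crawl scheduler could look like the minimal Go sketch below; `crawlOnce` is a hypothetical stand-in for a full DHT walk that returns the set of peer IDs found.

```go
package main

import (
	"fmt"
	"time"
)

// crawlOnce is a hypothetical stand-in for a full DHT crawl; a real
// crawler (e.g. Nebula) would walk the routing tables of every
// reachable peer and return the peer IDs it discovered.
func crawlOnce() map[string]bool {
	// ... dial bootstrap peers, walk the DHT, collect peers ...
	return map[string]bool{"12D3KooWExample": true}
}

func main() {
	// Run a crawl roughly every 30 minutes, as described in the talk.
	ticker := time.NewTicker(30 * time.Minute)
	defer ticker.Stop()

	for ; ; <-ticker.C {
		start := time.Now()
		peers := crawlOnce()
		fmt.Printf("crawl finished in %s, found %d peers\n",
			time.Since(start), len(peers))
		// Persist results here so later crawls can measure churn.
	}
}
```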
Just after crawling the network we noticed some anomalies, and that could have been due to a number of reasons: it could have been because we had very unstable nodes, because nodes would rotate their multiaddresses, or because the Hydras were having some problems; as you know, the Hydra boosters are prevalent around the IPFS DHT network.
A
We've
also
seen
that
many
peers
rotated
their
multi
addresses
to
the
point
of
having
a
different
multi
address,
coming
from
the
same
id
address
more
than
5
000
times
within
a
space
of
one
week.
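To make that figure concrete: detecting rotation like this can be as simple as counting distinct multiaddresses per peer ID across a week of crawl observations. A minimal sketch follows, assuming each crawl yields (peer ID, multiaddress) pairs; the names are illustrative, not from our codebase.

```go
package main

import "fmt"

// observation pairs a peer ID with a multiaddress seen in one crawl.
type observation struct {
	peerID    string
	multiaddr string
}

// countRotations returns, per peer ID, how many distinct
// multiaddresses were observed across all crawls.
func countRotations(obs []observation) map[string]int {
	seen := map[string]map[string]bool{}
	for _, o := range obs {
		if seen[o.peerID] == nil {
			seen[o.peerID] = map[string]bool{}
		}
		seen[o.peerID][o.multiaddr] = true
	}
	counts := map[string]int{}
	for id, addrs := range seen {
		counts[id] = len(addrs)
	}
	return counts
}

func main() {
	obs := []observation{
		{"peerA", "/ip4/1.2.3.4/tcp/4001"},
		{"peerA", "/ip4/5.6.7.8/tcp/4001"},
		{"peerB", "/ip4/9.9.9.9/tcp/4001"},
	}
	for id, n := range countRotations(obs) {
		// A peer exceeding some threshold (e.g. 5,000 distinct
		// addresses in a week) would be flagged as rotating.
		fmt.Printf("%s: %d distinct multiaddresses\n", id, n)
	}
}
```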
So, as I said, we did something that was long overdue: we had to design a more systematic approach to doing measurements on the Web3 stack. That includes IPFS, it includes Filecoin, and it might include other networks in the future.
What we now have is tooling that can do continuous measurements on the peers of the network in a fully transparent way, and we have an architecture that is basically split into three different parts.
Our crawler, Nebula in this case, is now dockerized and can run in several different places, several different points on the globe.
So we decided to take two main directions. The first is to measure the churn rate of the IPFS network, which is something that was not very clear before: there were measurements, and statements you find online, but nothing to tell you how much churn there actually is. That's something very important, because it drives the design and the protocol settings for the DHT, and not only that.
It also goes into defining other parameters of the network. The second direction is to measure the latency of the whole cycle, from publishing content to the IPFS network to retrieving that content as a client. And of course we've got several future directions we're already working on; this is just a sneak peek of what we're doing right now. The results I'm going to present we're soon going to make fully public in a nice way online, in reports but also on separate websites.
We have also started work on extending our studies to the Filecoin network: the Lotus DHT, the retrieval network of Filecoin, and so on. So there is lots more to come beyond what I'm going to present in the next few minutes. Starting from the churn rate in the IPFS network, our results show that the churn rate in IPFS is quite big.
If you dig into this graph, you're going to see that around 60 percent of the DHT server peers stay online for one and a half hours or less. Similarly, if you go a little bit further down, you see that 80 percent of the DHT server peers stay online for three hours or less. We consider this to be quite a high churn rate.
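The two numbers quoted are just points on the CDF of session lengths. A minimal sketch of how such fractions fall out of the raw data, assuming you already have one uptime duration per observed session:

```go
package main

import (
	"fmt"
	"time"
)

// fractionUnder returns the share of sessions whose uptime is at most
// the given threshold -- i.e. one point on the session-length CDF.
func fractionUnder(sessions []time.Duration, threshold time.Duration) float64 {
	if len(sessions) == 0 {
		return 0
	}
	n := 0
	for _, s := range sessions {
		if s <= threshold {
			n++
		}
	}
	return float64(n) / float64(len(sessions))
}

func main() {
	// Toy data; real sessions come from consecutive crawl snapshots.
	sessions := []time.Duration{
		20 * time.Minute, 50 * time.Minute, 2 * time.Hour,
		90 * time.Minute, 5 * time.Hour,
	}
	fmt.Printf("<= 1.5h: %.0f%%\n", 100*fractionUnder(sessions, 90*time.Minute))
	fmt.Printf("<= 3h:   %.0f%%\n", 100*fractionUnder(sessions, 3*time.Hour))
}
```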
It's almost as high as the churn reported for the BitTorrent DHT about 20 years ago. On the bright side, we realized that the settings we have for the IPFS DHT are set very sensibly and manage to provide lots of resilience to the network. These are results about resilience that I'm not going to talk about today, but we are going to publish them in the near future.
We wanted to understand churn in the IPFS network a little bit more, so we decided to see what percentage of nodes is stable and what percentage is kind of coming and going. We also wanted to see how often nodes go offline: once a node comes online, for how long does it stay before we can assume it might go offline again and therefore not be reachable?
So we split the overall node population. In one of the experiments, as you see here, the crawler ran for about two and a half days, a little bit less than that, and from this experiment we found that about 14 percent of nodes are always online and very stable. There is also a very tiny percentage that we saw in the initial crawl of the network but never saw online again.
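A sketch of how the node population could be split from a series of crawl snapshots; the three buckets mirror the always-on, dangling, and never-seen-again categories above. The assumed representation (one set of online peer IDs per crawl round) and all names are illustrative.

```go
package main

import "fmt"

// classify splits peers into "always-on" (online in every round),
// "one-shot" (seen only in the first round and never again) and
// "dangling" (everything else: peers that come and go).
func classify(rounds []map[string]bool) (alwaysOn, oneShot, dangling []string) {
	appearances := map[string]int{}
	for _, round := range rounds {
		for id := range round {
			appearances[id]++
		}
	}
	for id, n := range appearances {
		switch {
		case n == len(rounds):
			alwaysOn = append(alwaysOn, id)
		case n == 1 && rounds[0][id]:
			oneShot = append(oneShot, id)
		default:
			dangling = append(dangling, id)
		}
	}
	return
}

func main() {
	rounds := []map[string]bool{
		{"a": true, "b": true, "c": true},
		{"a": true, "b": true},
		{"a": true, "d": true},
	}
	on, once, dang := classify(rounds)
	fmt.Println("always-on:", on, "one-shot:", once, "dangling:", dang)
}
```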
We then went on to ask: for how long do nodes go offline, and for how long do they stay online? In this graph you can see the node counts on the y-axis and reliability on the x-axis. Let me explain for a minute what this x-axis means. The amount of time this experiment was running for, which is about 53 hours, basically translates to about 3,200 minutes.
If we go to the first spike there, which is just above 2,000 nodes, we see that those nodes stay in the network for one percent of the time. One percent of the experiment duration is about half an hour (one percent of roughly 3,200 minutes is about 32 minutes), and that's a very large share of the nodes. We can see how the other nodes go from then on.
A
Obviously,
the
node
counts
go
down,
and
so,
for
example,
we
see
that
here
we've
got
about
400
nodes
in
the
20
online
time
mark,
which
means
it's
about
10
hours,
so
about
20
of
the
nodes
in
the
ipfs
dhd
stay
in
the
network
for
10
hours
or
less,
at
which
point
they
go
offline
and
then
might
come
online
later
on.
So we started thinking that we need to dive deeper: what is causing this? We asked the questions you see in this slide: whether nodes are running on unreliable home machines; whether nodes that run on those home machines are turned off at night, which is a very normal thing for a regular user to do; and, a question we're still gathering results for, whether nodes are rotating their peer IDs, which could be intentional or unintentional, for reasons such as bugs and so on.
So we went and measured those three different types of nodes; the "all nodes" category basically includes both the always-on nodes and the dangling nodes, and we're trying to see what infrastructure they're running on. We found that the blue slice here is 10 percent of nodes that run on DigitalOcean, about 3 percent run on AWS, and a tiny percentage runs on Azure. And then there is a very big percentage, 85.8 percent, which is unknown: these are home machines of users, or they could be cloud environments that have not made their IP addresses public.
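To give an idea of where the "unknown" bucket comes from: classification like this typically just matches each peer's IP address against the published CIDR ranges of each cloud provider, roughly as in the sketch below. The ranges shown are placeholders, not the real provider lists.

```go
package main

import (
	"fmt"
	"net"
)

// Provider ranges would normally be loaded from the published
// IP-range files of each cloud; these CIDRs are placeholders.
var providers = map[string][]string{
	"DigitalOcean": {"203.0.113.0/24"},
	"AWS":          {"198.51.100.0/24"},
	"Azure":        {"192.0.2.0/24"},
}

// classifyIP returns the provider owning the address, or "unknown"
// for anything outside the known ranges (home machines, or clouds
// that do not publish their addresses).
func classifyIP(ip net.IP) string {
	for name, cidrs := range providers {
		for _, c := range cidrs {
			_, network, err := net.ParseCIDR(c)
			if err == nil && network.Contains(ip) {
				return name
			}
		}
	}
	return "unknown"
}

func main() {
	for _, addr := range []string{"203.0.113.7", "8.8.8.8"} {
		fmt.Println(addr, "->", classifyIP(net.ParseIP(addr)))
	}
}
```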
So we don't really know whether that is cloud infrastructure or not. We then split those into always-on nodes and dangling nodes. For the nodes that are always online, the percentage that runs on cloud infrastructure is a little bit larger than the one I just presented above, and this is to be expected: when nodes run on cloud infrastructure, it's much more likely that they stay online for longer. Still, however, for the dangling nodes on the right-hand side, 73 percent run mostly on home machines.
This gives us a very good footing to say, at least, that the churn rate can be attributed to unreliable home machines, although we still have experiments ongoing to find out more. And there is a great result out of this: the IPFS infrastructure, a decentralized storage and delivery network, does not, for the most part, run on centralized cloud infrastructure. This supports the fact that there is no single point of failure, which is the vision for all these technologies.
On the left-hand side, this graph is showing the points in time when nodes go offline related to daytime, and on the right-hand side we can see nodes that go offline during nighttime.
We see that these two look pretty similar and, based on the correlation pattern that we picked, we conclude that it is actually not the case that peers go offline during nighttime. That was for a specific location, obviously; in this case it was for nodes based in Hong Kong, because we obviously have to take the local time of day into account.
So we need to differentiate between different time zones. Now, moving on quickly to the latency of the whole cycle, from publishing content to retrieving content in IPFS: we wanted to break this down into several different steps, and obviously the first one is the content publish time. In this graph we can see the distribution of content publish times, with the count on the y-axis.
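For the curious, measuring this step amounts to little more than timing the provide operation end to end. A minimal sketch, with `provide` standing in for whatever publish API the node exposes (a hypothetical name, used here only for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// provide is a hypothetical stand-in for publishing a provider
// record for the given CID to the DHT.
func provide(cid string) error {
	time.Sleep(120 * time.Millisecond) // pretend network work
	return nil
}

func main() {
	start := time.Now()
	if err := provide("bafyExampleCID"); err != nil {
		panic(err)
	}
	// One sample of the content-publish latency distribution.
	fmt.Println("publish took", time.Since(start))
}
```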
These results are literally in the oven right now and we are going to be releasing them pretty soon, but this is to give you an idea of the level of detail that we're going into. Moving on to retrieval latency, we have the DHT walk that is depicted in this picture. Again, on the y-axis we have the counts, as in the number of items that we have tried to retrieve from the network, and on the x-axis we have the latency. We need to carry out more of those experiments.
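The retrieval-side plot is built the same way, just over many samples. A sketch that buckets latencies of a hypothetical `findProviders` call into a coarse histogram (all names and the random latencies are illustrative):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// findProviders is a hypothetical stand-in for a DHT walk that
// locates providers for a CID and returns when the first is found.
func findProviders(cid string) time.Duration {
	return time.Duration(rand.Intn(1500)) * time.Millisecond
}

func main() {
	// Bucket latencies into 250ms bins, mirroring the plotted histogram.
	const bin = 250 * time.Millisecond
	hist := map[int]int{}
	for i := 0; i < 1000; i++ {
		lat := findProviders("bafyExampleCID")
		hist[int(lat/bin)]++
	}
	for b := 0; b < 6; b++ {
		fmt.Printf("%4dms-%4dms: %d\n", b*250, (b+1)*250, hist[b])
	}
}
```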
But for now it looks like, at least on the retrieval side, things have improved a lot compared to the state of things a few months ago, and this is a huge win for the developers, both PL-supported and from the open-source community, who are making all those improvements to the protocol stack.
We have also tried to extract some interesting insights by going into further detail. For the provider records where we see that the PUT operation is failing, we wanted to figure out why it is failing and which agent versions are causing this. We see in this graph that, out of about 1,300 provider record PUT attempts, these are the agent versions that have failed: it's about 36 percent on go-ipfs nodes, 31 percent on hydra-booster nodes, and so on.
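Attribution like this is a straightforward group-by over the failed attempts. A minimal sketch, assuming each failed PUT is recorded with the remote peer's agent-version string (the function name and toy input are illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

// tallyByAgent groups failed provider-record PUTs by the remote
// peer's agent version and reports each version's share.
func tallyByAgent(failedAgents []string) {
	counts := map[string]int{}
	for _, a := range failedAgents {
		counts[a]++
	}
	versions := make([]string, 0, len(counts))
	for v := range counts {
		versions = append(versions, v)
	}
	sort.Strings(versions)
	for _, v := range versions {
		share := 100 * float64(counts[v]) / float64(len(failedAgents))
		fmt.Printf("%-20s %5.1f%%\n", v, share)
	}
}

func main() {
	// Toy input; real data comes from the crawler's error logs.
	tallyByAgent([]string{
		"go-ipfs/0.9.1", "go-ipfs/0.9.1", "hydra-booster/0.7.4",
	})
}
```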
This gives us a very good insight for going in and debugging what is going on there, which is a very important thing to do if we want to improve the performance of the network and make it brighter and better.
With this I'm going to finish here. I would like to mention that we have lots more coming very shortly, so follow up and get in touch; we're more than happy to collaborate with more people from the community to dig deeper, find more insights, and actually try to build more robust, higher-performance, and resilient protocols.