IPFS IPFS Camp 2022 - Measurement & Performance, 30 Oct 2022


From YouTube: Lessons from IPFS workloads and implications for Multi-Level DHTs - Pedro Akos Costa

Description

This talk was given at IPFS Camp 2022 in Lisbon, Portugal.

A

First of all, thank you for the invitation to talk about our work here. I'm Pedro, and I'm going to talk about the work we have been doing with PL for the past year. So, first of all, introductions: I am a computer science

A

PhD student at the NOVA University of Lisbon. I am advised by Professor João Leitão on the topic of enabling generic computations at the edge, and my interests span peer-to-peer and distributed systems and algorithms in general. Our collaboration with PL started in January of last year: we first worked on multi-level DHTs, and then we continued with measurements on IPFS.

A

This is the outline for this presentation, and I'll start by giving you a bit of background. As I said, we started by looking at multi-level DHT designs, and what we wanted to do at that time was to group nodes that share a given property.

A

Here you can see a Kademlia DHT representation that you already saw in Gee's presentation. If nodes are given a certain property, let's say these four colors here, then you can see that the nodes are organized randomly in the DHT.

A

So what we did was assign an identifier to each property and add this identifier to the nodes that share that property. That is, we change the beginning of the node identifier to be the property identifier. When you run this on the Kademlia DHT, nodes that share the property are grouped together, like this. We actually built two different mechanisms for this.
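The grouping effect of a property prefix can be illustrated with a minimal sketch. It assumes 256-bit identifiers derived with SHA-256, an 8-bit property identifier, and made-up peer names; none of these widths or names come from the talk, and the helper functions are hypothetical.

```python
import hashlib

ID_BITS = 256      # assumed identifier width
PREFIX_BITS = 8    # assumed width of the property identifier


def node_id(peer_name: str) -> int:
    """Plain identifier: SHA-256 of a (hypothetical) peer name."""
    return int.from_bytes(hashlib.sha256(peer_name.encode()).digest(), "big")


def prefixed_id(peer_name: str, property_id: int) -> int:
    """Overwrite the top PREFIX_BITS of the identifier with the property identifier."""
    suffix_bits = ID_BITS - PREFIX_BITS
    suffix_mask = (1 << suffix_bits) - 1
    return (property_id << suffix_bits) | (node_id(peer_name) & suffix_mask)


def xor_distance(a: int, b: int) -> int:
    """Kademlia distance metric: bitwise XOR read as an integer."""
    return a ^ b


# Twelve made-up peers spread over four made-up properties ("colors" 0..3).
peers = {f"peer-{i}": i % 4 for i in range(12)}
ids = {name: prefixed_id(name, prop) for name, prop in peers.items()}

# The two peers closest to peer-0 share its property, because a shared prefix
# zeroes the most significant bits of the XOR distance.
target = ids["peer-0"]
closest = sorted(
    (name for name in ids if name != "peer-0"),
    key=lambda name: xor_distance(ids[name], target),
)[:2]
print(closest)
```

Without the prefix, the closest peers would be determined by the hash alone and so would be effectively random with respect to the property.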

A

What I just showed you is the soft partitioning mechanism. We presented a paper on it at DINPS, a workshop co-located with ICDCS this year, and if you want to see the paper, it's online at that link. But this raised a question: which property should we use that will benefit the system? Well, we want to minimize the latency and maximize the throughput of the DHT, and our hypothesis was: let's use geolocation. We thought that content is mostly requested within the same geographical area; think about news outlets, for example.

A

This would be straightforward, correct? So, to see if this was actually the case, we analyzed the IPFS workload. What we wanted to know is what content is being requested and from where, and how many providers are providing the requested content and where they are. For this, we analyzed the content requested through ipfs.io, one of the most popular IPFS gateways, and then we searched for the providers of the requested CIDs on the IPFS gateways.

A

I mean, sorry, on the IPFS DHT. To get the requested content, we had two weeks of logs from ipfs.io, from the 7th to the 22nd of March of this year. From this we have a ton of HTTP requests, but we only considered HTTP GET requests that had a status code of 200 or 300, because this means that the gateway actually managed to fetch the content from the network and, with high probability,

A

let's say, the content is still there when we go to look for it. This filter left us with roughly half of the requests on the gateway, and we got a bit more than four million different CIDs. To get the providers, we built a very simple Go libp2p program that goes and fetches all the providers of a given CID: not just the 20 providers that would be normal in the API call, because we wanted to get everyone.
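The log filtering step described above can be sketched as follows. The record layout, field names, and sample values are hypothetical, since the real ipfs.io log format is not shown in the talk, and the sketch assumes that "200 and 300" refers to the 2xx and 3xx status classes.

```python
# Hypothetical, simplified gateway log records (made-up values).
log_records = [
    {"method": "GET", "status": 200, "cid": "cid-a"},
    {"method": "GET", "status": 404, "cid": "cid-b"},
    {"method": "POST", "status": 200, "cid": "cid-d"},
    {"method": "GET", "status": 301, "cid": "cid-c"},
    {"method": "GET", "status": 200, "cid": "cid-a"},
]


def is_successful_get(record: dict) -> bool:
    """Keep GET requests whose status is in the 2xx or 3xx class (assumed reading)."""
    return record["method"] == "GET" and 200 <= record["status"] < 400


kept = [record for record in log_records if is_successful_get(record)]

# The distinct CIDs are what the provider lookup would then be run against.
requested_cids = {record["cid"] for record in kept}

print(len(kept), "requests kept,", len(requested_cids), "distinct CIDs")
```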

A

Unfortunately, we could only find providers for 45 percent of the CIDs; that is, out of all the CIDs, we found providers for only 45 percent of them. This means that we found a bit more than 55,000 different providers and, unfortunately, about half of those didn't have any multiaddress information, so we could not extract more information from those providers. In the end, we gathered all the IP addresses that we got from both the requested content and the providers.

A

We ran them through MaxMind's GeoLite databases and got a continent-level geolocation, as you can see on this map. You can see more details about this methodology, if you want, on our Notion page at this huge link. Okay, so let's get into our results. I'm going to try to answer these five questions.

A

The first one is: how many requests per day are being successfully processed by the gateway? For this, I plotted here the requests per hour over our observed period. You'll see that on the 14th, around the middle, we had a sudden drop. This was probably due to a failure at the gateway; we don't have logs from that period. But what we see is that requests are mostly steady, so IPFS is always working.

A

At least the IPFS gateway is always working, always fetching content.

A

If you break this down by continent, you can see these two lines in the middle, the orange one and the red one, which represent content requests originating from North America and Asia. We see that the requests are mostly divided between these two continents, while the rest of the continents have very little expression. The reason for this is unknown from our data, because there can be many reasons for it.

A

It can be that requests from other continents are just not pushed to that gateway, or it can be that these continents don't make any requests to IPFS. I think we need a bit more data and investigation for this. Now, the second question: what is the popularity distribution of CIDs?

A

That is, how many times is a CID requested on this gateway? Here I have a distribution plot where on the x-axis you can see the number of CIDs, while on the y-axis you have the frequency of requests. To understand this a bit better: that point over there is the most popular CID, with more than, I think, 100,000 requests made to it. This represents a single CID. And that point over

A

there is the least popular CIDs, and it represents two million different CIDs. What we see here is that we have very few CIDs that are highly popular, while you have a lot of CIDs that are not popular at all, being requested fewer than 10 times.

A

So you could see this as a Zipf distribution of the workload of the system. If you break this down by continent, you see that the distribution keeps almost the same shape that we saw before, and a fun fact is that it is roughly proportional to the number of requests being made from each continent.
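A popularity distribution of this kind can be derived from a list of requested CIDs with two counting passes. The sketch below uses a tiny made-up request list, not the gateway data, and the variable names are illustrative.

```python
from collections import Counter

# Made-up request stream: one entry per request, holding the requested CID.
requests = ["cid-a"] * 6 + ["cid-b"] * 2 + ["cid-c", "cid-d", "cid-e"]

# Pass 1: how many times each CID was requested.
requests_per_cid = Counter(requests)

# Pass 2: how many CIDs share each request frequency.
cids_per_frequency = Counter(requests_per_cid.values())

# In a heavy-tailed (Zipf-like) workload, the low frequencies hold most CIDs
# and the high frequencies hold very few.
for frequency, cid_count in sorted(cids_per_frequency.items()):
    print(f"{cid_count} CID(s) requested {frequency} time(s)")

most_popular_cid, most_popular_count = requests_per_cid.most_common(1)[0]
```

Plotting `cids_per_frequency` on log-log axes is a common way to check visually whether the tail is roughly a straight line, as a Zipf-like distribution would be.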

A

Okay, next: how many providers are there for each CID? Dennis already touched on this, so we're going to see almost the same thing, I guess. Here I have a CDF plot with the CIDs on the y-axis, and on the x-axis we have the number of providers that each CID has, which is labeled replicas. What we see here is that a bit more than 40 percent of all CIDs have only a single provider.
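For readers unfamiliar with this kind of plot, here is a minimal sketch of how an empirical CDF of providers per CID is computed. The provider counts are made up and the `cdf` helper is hypothetical.

```python
from collections import Counter

# Made-up number of providers ("replicas") found for each CID.
providers_per_cid = {"cid-a": 1, "cid-b": 1, "cid-c": 2, "cid-d": 5, "cid-e": 1}


def cdf(values):
    """Return (value, fraction of items <= value) pairs in ascending order."""
    counts = Counter(values)
    total = len(values)
    running = 0
    points = []
    for value in sorted(counts):
        running += counts[value]
        points.append((value, running / total))
    return points


replica_cdf = cdf(list(providers_per_cid.values()))

# The first point gives the share of CIDs with a single provider.
single_provider_share = replica_cdf[0][1]
print(replica_cdf)
```

The same helper applies unchanged to the number of CIDs per provider.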

A

This is, I think, a smaller percentage than the one Dennis showed, right? We also see that around 15 percent of all CIDs are provided by two providers, and the percentage of CIDs that are provided by more providers continues to drop as we increase the number of providers. It's not shown here, but it's very

A

interesting: the CIDs that are provided by the most providers are not actually the CIDs that were most requested. In fact, the CIDs that were most requested had only, I think, one or two providers. Okay, next: how many CIDs does a provider provide? Yeah, it's a weird phrasing, but it doesn't matter. Here, again, I have another CDF plot.

A

On the y-axis we have the providers, and on the x-axis we have the number of CIDs per provider. What you see here is, again, something I think Dennis also touched on.

A

We see that sixty percent of providers provide only a single CID, while less than 10 percent of providers provide more than 10 CIDs, and even less than five percent provide more than 100 CIDs. I wanted to say something more on this graph, but I guess what you see here is that you have a lot of CIDs that are provided by only a few,

A

very few providers, as Dennis said. If you break this down by continent, what we see is that these providers are mostly located in Europe and North America, again as you saw in the last presentation. Okay, last question, and really what got us to do this work: is there actually any geographic locality of requests?

A

To answer this, we generated this heat map. To get it, we joined both data sets, the requested data and the provider data, on the requested CID. The rows are the origins of the requests, and the columns are the locations of the providers. To read this, as an example, that cell

A

there means that 53 percent of all requests originating from Europe had providers in North America. The data is normalized by the total number of requests made by each continent. If you want to see whether there is locality here, you look at this diagonal: if there were any locality, you would see it hotter, that is, you'd see a higher percentage of requests there, which is not what you see, right? Instead, what we see is that requests are mostly concentrated in Europe and North America, which hold the most significant portion of providers.
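The normalization behind such a heat map can be sketched as follows. The requests, continent codes, and variable names are made up for illustration. The sketch assumes that each request is paired with the set of continents where its CID has providers, and that each cell is divided by the number of requests from the row's continent.

```python
from collections import Counter

# Made-up joined data: (continent of the request, continents holding providers
# of the requested CID). A CID can have providers on several continents.
joined_requests = [
    ("EU", {"NA"}),
    ("EU", {"NA", "EU"}),
    ("EU", {"AS"}),
    ("EU", {"NA"}),
    ("NA", {"NA"}),
    ("NA", {"EU", "NA"}),
    ("AS", {"NA"}),
    ("AS", {"EU"}),
]

# Row totals: number of requests made from each continent.
requests_per_origin = Counter(origin for origin, _ in joined_requests)

# Raw cells: requests from `origin` whose CID has a provider in `provider`.
cell_counts = Counter(
    (origin, provider)
    for origin, provider_continents in joined_requests
    for provider in provider_continents
)

# Normalize each cell by the total number of requests from the row's continent.
heat = {
    (origin, provider): count / requests_per_origin[origin]
    for (origin, provider), count in cell_counts.items()
}

# Locality would show up as high values on the diagonal (origin == provider).
diagonal = {c: heat.get((c, c), 0.0) for c in requests_per_origin}
print(diagonal)
```

Because a CID can have providers on more than one continent, the rows of a matrix built this way do not have to sum to one.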

A

So most providers are on these continents.

A

Okay, so what does this mean for multi-level DHT designs and for the DHT in general? What we saw is that content is provided by only a few providers, and that this content is mostly provided by peers in North America and Europe. In this sense, we saw that there is no actual geolocality of requests, and this may lead to load-balancing issues, if they are not already in place, because you have only very few providers doing all the work in the DHT, in the network.

A

So what this suggests is that we should somehow remove load from highly popular providers. A way to do this would be to have incentives to reprovide content. We should also develop new strategies to perform load balancing on a DHT, and this can be through novel multi-level DHT designs or through something completely different from a DHT.

A

But in fact, to be sure which direction we should take, what we actually need to do is continue monitoring, to understand how the system evolves and how big this issue is in IPFS. We have actually taken some steps towards this. I think Yiannis also already mentioned, in the beginning, our work on telemetry, which will also be discussed a bit tomorrow in one of the breakout sessions, so have a look into it. And that sums up my talk. I would like to again thank.

A

We actually have a paper submitted with these results. If you are interested in this work, please follow the discussion on Notion and GitHub; there are a lot more links for everything. And if you are interested in the work that we do at NOVA, be sure to follow us. Thank you.