From YouTube: Thunderdome - Ian Davis
Description
This talk was given at IPFS Camp 2022 in Lisbon, Portugal.
Hi, my name is Ian Davis, and today I'm going to talk about Thunderdome. Thunderdome is named after the very popular 1980s hit film Mad Max Beyond Thunderdome, and just like in that film, Thunderdome is a gladiatorial arena where, instead of fights to the death, we compare different IPFS gateway configurations and implementations and battle them against one another to see how their performance compares.
That's what Thunderdome does. It runs experiments on demand, captures all the metrics from those experiments and exports them to Grafana so that we can view the results. It currently supports testing IPFS gateway implementations with HTTP traffic, but there's nothing in the architecture that stops it from testing other things; that's just what our focus has been on so far.
It's designed to run for long periods, like several days. Typically you're not going to get any good results from this kind of experiment over one or two hours; you really need a soak test for this kind of thing. Given an experiment definition, it starts all the necessary infrastructure, applies the load, and then monitors and captures the results.
So I'm going to talk a little bit more about what actually goes into Thunderdome and what it consists of. At its simplest, an experiment is just a bunch of containers which contain IPFS instances, and we fire traffic at them. Today those are HTTP requests, because we're trying to test the performance of the gateways, but in the future there could be other kinds of load. It could be Bitswap requests, or a different kind of request for a different protocol, for a different kind of software under test.
An experiment defines a particular request rate and a concurrency. The concurrency is basically the number of requests that are in flight at any one time. Each target, which is one of the container instances, is configured differently, and the experiment describes how those targets are configured.
It may be just that one config setting differs between each target, so you're doing a comparison with different values for a particular config setting, or it may be that each target has a different base image and so has different versions of the software, or entirely different pieces of software, being compared against one another.
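As a rough sketch of what an experiment captures, consider something like the following in Go. The field names and image tags here are hypothetical, not Thunderdome's actual schema, but they show the shape of the information: a request rate, a concurrency limit, a duration, and a list of targets that differ in exactly one dimension.

```go
package loadgen

import "time"

// Target is one container instance under test. In a real experiment the
// targets differ by exactly one thing: a base image or a single setting.
type Target struct {
	Name  string            // label used in the Grafana dashboards
	Image string            // container image, e.g. a specific Kubo version
	Env   map[string]string // per-target config overrides (hypothetical field)
}

// Experiment describes the load and the targets it is applied to.
type Experiment struct {
	Name        string
	RequestRate int           // requests per second sent to every target
	Concurrency int           // maximum requests in flight at any one time
	Duration    time.Duration // soak tests run for hours or days
	Targets     []Target
}

// Example: compare two Kubo versions under identical load for a day.
var compareKubo = Experiment{
	Name:        "kubo-0.15-vs-0.16-rc",
	RequestRate: 20,
	Concurrency: 100,
	Duration:    24 * time.Hour,
	Targets: []Target{
		{Name: "kubo-v0.15", Image: "ipfs/kubo:v0.15.0"},
		{Name: "kubo-v0.16-rc", Image: "ipfs/kubo:v0.16.0-rc1"},
	},
}
```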
Now, for the requests, in Thunderdome we take a feed from the public gateway that Protocol Labs runs, ipfs.io. We ship some of those logs through a service called Loki, which is run by Grafana, and then we have a component called skyfish which bridges Loki with some Amazon queue services, and dealgood then listens to those queues. So every time an experiment starts up, it subscribes to this queue of incoming requests and then forwards them on to the targets under test. The reason we do it that way is that skyfish can control the level of incoming requests. It can smooth that out, because the incoming requests from the public gateways can be a little bit bumpy, depending on where we're getting those requests from or what the distribution of requests is, and we want to make sure that it is as controllable as possible and that dealgood receives requests at a steady rate.
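The smoothing idea is simple to sketch. This is not skyfish's or dealgood's actual code, just a minimal Go illustration of turning a bursty stream of requests into a steady one; the Request type is a placeholder.

```go
package loadgen

import "time"

// Request is a placeholder for a logged gateway request (path, headers, etc.).
type Request struct {
	Path string
}

// smooth reads requests from a bursty input channel and forwards them at a
// fixed rate, so the targets see a steady load regardless of how unevenly
// the public gateway traffic arrives.
func smooth(in <-chan Request, out chan<- Request, perSecond int) {
	ticker := time.NewTicker(time.Second / time.Duration(perSecond))
	defer ticker.Stop()
	for req := range in {
		<-ticker.C // wait for the next send slot before forwarding
		out <- req
	}
}
```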
Now, all the components in the system are instrumented with metrics, and we use the Grafana agent to pull metrics out of the containers and out of the ECS machines that run the containers, but also out of dealgood, out of skyfish and out of the queues. Basically, all of this gets pumped into Grafana via Prometheus, and then we can build interesting visualizations of the results from those metrics.
So for the containers that we run as targets, all they have to do is expose a Prometheus exporter endpoint, and the Grafana agent will scrape those metrics and send them up to Grafana Cloud. Dealgood and skyfish implement the same kind of thing and export their own metrics: for dealgood it's the request and response timings, and for skyfish it's the behaviour of the incoming requests and the stability of that request feed.
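To make that concrete, here is a minimal Go sketch of what "exposing a Prometheus exporter endpoint" means, using the standard Prometheus client library. The metric name is made up for illustration and is not necessarily what dealgood exports.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A hypothetical histogram of request durations, in the spirit of the
// request/response timings that dealgood records for each target.
var requestDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "gateway_request_duration_seconds", // illustrative name
		Help:    "Time taken to serve a gateway request.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"target", "status"},
)

func main() {
	// The Grafana agent scrapes this endpoint and forwards the samples
	// to Grafana Cloud, where the dashboards are built.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":2112", nil)
}
```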
So we can build visualizations like this. This is the default timeline view of an experiment, and across here we can see things like, at the top left, the two numbers showing what our expected request rate was and what the actual request rate was. Then we categorize things like the number of good responses, the number of dropped requests, or the time to first byte, which is what we're particularly interested in for the gateway implementations.
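Time to first byte can be measured from the client side with Go's net/http/httptrace package. This is a generic sketch of the technique, not necessarily how dealgood measures it.

```go
package loadgen

import (
	"context"
	"io"
	"net/http"
	"net/http/httptrace"
	"time"
)

// ttfb issues a GET request and returns the time from the start of the
// request until the first byte of the response arrives.
func ttfb(ctx context.Context, url string) (time.Duration, error) {
	start := time.Now()
	var firstByte time.Duration

	trace := &httptrace.ClientTrace{
		GotFirstResponseByte: func() { firstByte = time.Since(start) },
	}

	req, err := http.NewRequestWithContext(httptrace.WithClientTrace(ctx, trace), http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	// Drain the body so the connection can be reused for the next request.
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return 0, err
	}
	return firstByte, nil
}
```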
This particular graph shows a poorly behaving experiment, which I'll come to in a little bit, and this is what we call the summary view. What we need to do when we're running Thunderdome experiments is to really take a long-term view.
There's no point looking at 30 minutes of metrics; we really want six-hour, 12-hour or even 24-hour averages. In this case we're seeing six-hour averages of various metrics, including time to first byte, and you can see some of them standing out as particularly slow.
I'll come to the reason for that in a short while. We also have another dashboard, which basically uses the metrics from skyfish to show us the incoming requests coming from our public gateways and how we've forwarded those on to each experiment in turn, just to make sure the experiments are getting the fair number of requests they've asked for. If they don't, obviously that's going to affect the outcome of the experimental results.
There are some challenges to this kind of infrastructure. Although it's simple on the surface, in that all we're doing is spinning up some containers and sending traffic to them, IPFS and P2P software, by its very nature, wants to connect to lots of different things and discover connections all the time, so nodes like to chat to their neighbours. But what an experiment is trying to do is isolate these things, because we want to have a fair test.
So we've got three or four copies of the same software with different configuration settings. We don't really want those cross-talking, because we don't want a block to appear in one node and then be instantly retrievable from another node just because it happens to be peered with it. So we do things like isolating the targets from one another using network ACLs, and that works pretty well.
What we really want to do is isolate at the peer level, so we're tracking some IPFS work where there are proposals for having rules around which peers can be connected to, so that we could block particular peers. What we'd do there is block all the peers in every experiment that we run, and probably block access to our own gateways as well, just to be sure that we're not getting any kind of cross-contamination that might affect the performance.
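For illustration, here is roughly what peer-level blocking looks like with go-libp2p's connection gater, assuming a recent go-libp2p and a denylist built from the other targets' peer IDs and our own gateways. This is the generic libp2p mechanism, not a specific Kubo feature or the proposal mentioned above.

```go
package isolation

import (
	"github.com/libp2p/go-libp2p/core/control"
	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/peer"
	ma "github.com/multiformats/go-multiaddr"
)

// denyGater refuses any outbound dial or inbound connection involving a
// peer on the denylist, e.g. the other targets in the experiment and our
// own public gateways, to avoid cross-contamination between nodes.
type denyGater struct {
	denied map[peer.ID]struct{}
}

func (g *denyGater) blocked(p peer.ID) bool {
	_, ok := g.denied[p]
	return ok
}

func (g *denyGater) InterceptPeerDial(p peer.ID) bool                 { return !g.blocked(p) }
func (g *denyGater) InterceptAddrDial(p peer.ID, _ ma.Multiaddr) bool { return !g.blocked(p) }
func (g *denyGater) InterceptAccept(network.ConnMultiaddrs) bool      { return true }

func (g *denyGater) InterceptSecured(_ network.Direction, p peer.ID, _ network.ConnMultiaddrs) bool {
	return !g.blocked(p) // checked once the handshake has revealed the peer ID
}

func (g *denyGater) InterceptUpgraded(network.Conn) (bool, control.DisconnectReason) {
	return true, 0
}

// Wire it into a node with: libp2p.New(libp2p.ConnectionGater(&denyGater{denied: deny}))
```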
Initially we've been focused on proving Thunderdome, ensuring there's no bias and making sure we can scale these request streams properly. We've picked some easy experiments as a baseline, and we have recently run an experiment to compare the latest version of Kubo against previous versions.
I'm also doing some experiments around altering the delay, when we're fetching blocks, between when we go out to the DHT versus when we just request blocks from peers. Our baseline experiments we call tweedles, after the old Tweedledum and Tweedledee poem by Lewis Carroll. The idea is that we have two identical instances of Kubo: they've got the exact same configuration and they receive the same load.
They should perform identically, and that's our test to see whether there's any bias in the platform, and on the whole, over the long term, they do perform identically. But of course this is a dynamic system, so each instance will have a different peer list, because they each collect different peers, so in the very short term there are sometimes variations in performance: requests may come in that only one of these otherwise identical nodes can service, just because it happens to have collected a particular peer.
The slide says two pairs, but we actually took three of each of the later versions: version 0.14, 0.15 and the 0.16 release candidate. We fired those up with the same configuration and ran two experiments, one at 10 requests per second and one at 20 requests per second, looking for potential regressions or improvements between those versions. At 10 requests per second there really wasn't any difference that we could spot.
Basically all the instances behaved roughly identically in terms of time to first byte and the amount of resources they were using. But at 20 requests per second we saw a lot of changes, and you can see here that the top of these graphs is version 0.14, I think the middle is 0.15, and at the bottom is the new release candidate. You can see that some of the earlier versions have some very pathological behaviour in terms of time to first byte, I mean very extreme numbers, and high numbers of goroutines.
Throughput was down, and there was quite high heap usage as well in this particular case, and over time you can see that there's a distinct difference between these instances. You can see that some of them were dropping quite a number of the requests being sent to them; they're supposed to be handling 20 per second, and some of them were dropping up to 10 per second at times.
The number of goroutines for some of these was very high, around 60,000, whereas they would normally run at around 20-odd thousand, and you can see a bunch of them running at 20,000, but some were elevated. Some had elevated CPU utilization, the same ones that had the high goroutine counts, and we also saw a high heap as well. Some of those machines just pegged 100% CPU throughout the whole of the experiment, and we see the same ones in the Bitswap metrics.
What we also see is elevated peer counts on some of these slow-performing instances: the peer counts are up at around 20,000, 10,000 and 12,000. Now, the way these are configured, they have a high water mark of about five thousand peers, so they should hover around that level.
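The high water mark here is the libp2p connection manager setting, which Kubo exposes under its Swarm.ConnMgr config. In go-libp2p terms the setup looks roughly like this; the numbers are illustrative, matching the roughly 5,000 high water mark mentioned above.

```go
package isolation

import (
	"time"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/p2p/net/connmgr"
)

// newGatewayHost builds a libp2p host whose connection manager trims the
// peer set back towards LowWater once it grows past HighWater, so a
// healthy node should hover around the high water mark, not far above it.
func newGatewayHost() (host.Host, error) {
	cm, err := connmgr.NewConnManager(
		3000, // LowWater: trim back down to this many peers
		5000, // HighWater: start trimming once we exceed this
		connmgr.WithGracePeriod(time.Minute), // new connections are briefly exempt from trimming
	)
	if err != nil {
		return nil, err
	}
	return libp2p.New(libp2p.ConnectionManager(cm))
}
```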
That's kind of typical of a setup for a public gateway. So we found that the poor-performing instances have very high numbers of peers, and their Bitswap wantlists are also high, even though they're not actually receiving or sending that much more Bitswap traffic.
But what we also saw was that the new release candidate did seem to avoid falling into this state, and you can see on the right-hand side that the Bitswap wantlist sizes were reasonable, around 50 most of the time for these three instances, and the number of peers was exactly what we expected for this kind of experiment.
As a follow-up, you can actually read the summaries if you've got access to Notion, where there is a public page. We followed up with a comparison of various commits between the two versions, and we narrowed it down to a point at which Kubo does appear to become more stable at high volumes of usage, and we think that's down to some changes in the routing logic.
We've run various other experiments recently. For example, we ran an experiment where we blocked access to these particular nodes from the public gateways, and the reason for that is that, because we're taking a feed of traffic from the public gateways, I wanted to test a theory.
The theory was that, if we're peered with those gateways, we may just retrieve blocks that have already been retrieved by a gateway 30 minutes previously. So this experiment was to test whether there was really any bias from being peered with any particular gateways, because obviously the instances can discover the gateways that have these blocks, and it turns out there isn't really any evidence for that.
In fact, the Kubo nodes actually performed slightly better when they were not peered with the gateways, which is probably down to the fact that the gateways run at very high utilization and, in fact, are a little bit slower at servicing requests.
We also have a bunch of experiments looking at the Bitswap provider delay settings, comparing whether we want to change the number of peers, whether we use the accelerated DHT client, and the size of the datastore, and also some very long delay settings for these. And we're currently running an experiment to look at what the optimal high water mark for the peer set is for a typical gateway node.
It seems that, as we saw before, a very high number of peers can also cause problems with performance, because there's a lot of servicing that has to go on in looking after those peers and maintaining those connections. So in fact it may be better to have a smaller peer set, even though that's kind of counter-intuitive for a heavily trafficked gateway, which is designed to fetch blocks as soon as possible for a very wide range of diverse requests.
For the future of Thunderdome, we have a kind of short-term roadmap. What we're going to be doing is automatically testing every Kubo release candidate against the previous release.
We're just going to set that up to be automated, so we'll always have that data and it will inform the release process, and we can hopefully identify performance regressions, or, even better, if we can find performance improvements or stability improvements, then that would be great as well. Then we have some work to do around making it even easier to run experiments. Currently, the only people who can build and run experiments are the people who own the infrastructure, which is currently the Thunderdome team.
What we want is for our collaborators to be able to define an experiment, run it on the infrastructure independently, and then gather the results themselves.
And then, of course, we want to test other gateway implementations, not just Kubo: to test with Iroh and js-ipfs, and anything else really. There's some cross-matching work to do around metrics, to make sure we're measuring the same things from each of these types of software, so that will take some time next year. Then, longer term, we want to measure other kinds of software, not just IPFS gateways: really anything that can accept traffic, respond to it in some way, and produce metrics that we can use.
I think that fits this kind of model. It could be sending Bitswap requests, or it could be a completely different kind of software; it could be anything where we can define what the load is and how to measure the application of that load.
Also, we want to decouple from AWS. Currently we rely on a couple of AWS components: we use ECS for our container system and we're using the queue services, and we want to allow this infrastructure to be deployed anywhere, so that anyone can quickly spin up tests of their own software, or our software, or whatever they want to do. And, of course, we want to run more experiments, get more data, and understand IPFS infrastructure in a better way.
So if you've got an experiment, come and find me. I'm Ian Davis; you can find me on the Filecoin Slack in the ProdEng channel, or on the IPFS Discord in the ProbeLab channel, or you can raise an issue on our GitHub repo, which is github.com/ipfs-shipyard/thunderdome, and soon you'll be able to run your own experiments.
But right now you have to ask me, and I will configure things for you on your behalf. And of course it's all open source, so you can take what we've done, build on it, learn from it, run it yourself, all those kinds of things. So that was Thunderdome. Thanks very much.