From YouTube: [Swarm Mini Summit] Observability in Swarm
Description
Anton Evangelatov - EF
Observability means making a complex system as transparent as possible to those who operate it. In this talk we will explain how we approach observability and monitoring within Swarm, what systems and tools we use, and discuss how we utilise them to bring simplicity, transparency and visibility to our production and staging environments.
Biography - Anton Evangelatov
Anton Evangelatov is a software engineer at the Ethereum Foundation, specialising in distributed systems, currently working on the Swarm project and as Ethereum Foundation DevOps. Prior to joining the Ethereum Foundation, Anton worked at a number of startups across Switzerland and Austria.
So, if you really want secure access to swarm, you should be running your own node, and hopefully this will also help you understand what your swarm node is doing once you actually run it on your own system. First, a bit of history of swarm from the perspective of a developer and someone who has been working on it: I joined the team in October 2017, so one year ago, and at the time the state of the code base was basically that we had no canonical master branch.
There were multiple feature branches, mostly conflicting with one another. The swarm gateways were running POC, but it wasn't really clear which commit or which version that was, and we had no canonical git repository. We had the ethersphere repository, but POC 2 was actually a specific commit in the ethersphere repository. So it was a bit difficult to understand which code was running on the gateway and what was actually happening within that node.
So basically it was a black box that you had to restart every now and then, and you didn't really have much visibility into what was happening. You would hit the gateway, you could upload content, you could download it, but every now and then it would get really slow, and you wouldn't really know what was happening.
We then decided to introduce a release process, which we now have, so now you get a new swarm version every two weeks, and you pretty much know that if something that used to work fine is broken, a problem has been introduced into the code base over the last two weeks. It's not difficult to find what happened, because of the small amount of changes. The last goal was to simplify deployment, so that our users know what version the majority of the nodes in the network are running.
Obviously we're not the only ones running swarm; the swarm gateways are just a subset of the nodes in the network, but as a new user of swarm, that's probably the place you're going to go. If you want to try swarm, that's the lowest barrier to entry right now. So what I'm going to talk about in this talk is mostly how we increase visibility.
How do we know what's happening in a swarm node when we run one? Observability can be defined by these three pillars: aggregation of logs, aggregation of metrics and statistics, and distributed tracing.
There is a really nice blog post on how Twitter approached that, which you can read. They explain how they went from a monolith to a distributed environment and the challenges that came with that. With swarm, we've always been in a distributed environment, so all of it is pretty much applicable right from the start.
For us, that means you need to understand how your requests go from one node to another and make sense of all the protocols that are running. There is another really nice blog post, from Peter Bourgon, who has a nice diagram of what he understands by observability. He has metrics, which are aggregate statistics that all nodes emit; we want to aggregate them and get a full view across all distributed processes. We also have tracing, which is pretty much request-scoped.
Let's say that you upload something or you try to get something from swarm: you're interested in your specific request to the network. Logging is similar to tracing, but in terms of events. For logs we are using the go-ethereum log package, which is pretty much a modification of the log15 library. It's a structured logging library for Go; nothing complicated there. It was already part of the code base.
It was just that we were getting logs only on standard out, so we were thinking of a way to aggregate the logs from all the machines, so that we have one common view and easy access to our logs. That's pretty clear. There is an example of how you would do it within the go-ethereum code base, and, yep, swarm is part of that.
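As a rough illustration (this is not the exact snippet from the slides; the message and key-value pairs are made up), structured logging with the go-ethereum log package looks like this:

```go
package main

import (
	"errors"
	"time"

	"github.com/ethereum/go-ethereum/log"
)

func main() {
	// Structured logging: a message plus arbitrary key-value context pairs.
	log.Info("chunk delivered",
		"ruid", "4d65822107fcfd52", // hypothetical request identifier
		"took", 42*time.Millisecond,
	)

	// By our definition, an error is something a developer must look at.
	log.Error("store failed", "err", errors.New("chunk validation failed"))
}
```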
We attach a unique identifier to requests that go into swarm, and then we store it in the context. From there on you can propagate it to any of the internal subsystems, and you know that a given log belongs to a specific request.
And how do we use that? This is just a simple screenshot of one of the systems we use. We search for a specific request identifier, and then we can see the whole path of the request within our network.
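A minimal sketch of that idea, with a hypothetical ruid key and helper functions (the real swarm code base has its own types and helpers for this):

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
)

// ruidKey is a private context key for the request identifier.
type ruidKey struct{}

// withRUID stores a fresh, unique request identifier in the context.
func withRUID(ctx context.Context) context.Context {
	return context.WithValue(ctx, ruidKey{}, fmt.Sprintf("%016x", rand.Uint64()))
}

// ruid extracts the request identifier so log lines can be tagged with it.
func ruid(ctx context.Context) string {
	if v, ok := ctx.Value(ruidKey{}).(string); ok {
		return v
	}
	return "unknown"
}

func fetchChunk(ctx context.Context) {
	// Every subsystem receiving ctx tags its logs with the same identifier,
	// so the whole path of one request can be found with a single search.
	fmt.Println("ruid", ruid(ctx), "msg", "fetching chunk")
}

func main() {
	ctx := withRUID(context.Background())
	fetchChunk(ctx)
}
```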
It's helpful for debugging. Let's say you run your own swarm node and you're wondering why you're getting results that you don't expect: that would be the first thing to check, basically taking your unique identifier and seeing where your request has gone within the code base.
So what do we log? Obviously errors, and when I say errors, I mean something that a developer must look at; that's the definition we use. We also log warnings: something that might happen, but is not necessarily an error.
We also log a lot of request- and response-specific information, but still, I want to emphasize the errors: everything that comes out as an error is something that we should be looking at and trying to get out of the codebase, because it's probably a bug. For aggregation we use OK Log, which is a very simple log management solution.
We just deploy it within our cluster and aggregate all the logs that we receive from the different nodes. That's what it looks like: you can pretty much just search and run different queries against the system.
For metrics we are using the go-metrics library; again, it's part of the code base. It's a fork of the famous Coda Hale metrics library. Some of you probably come from a Java background, or not, but it's a really popular Java metrics library.
The way to instrument the code with it is to just create, let's say, a counter and increment it. And here's an example of how you can measure latency: you record the time, you set your timer, and you update it when the event you're trying to measure has finished. It's quite simple.
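A short sketch of both patterns using the go-metrics API from the go-ethereum code base (the metric names are made up):

```go
package main

import (
	"time"

	"github.com/ethereum/go-ethereum/metrics"
)

func handleGet() {
	// A counter: how many times this code path has run.
	metrics.GetOrRegisterCounter("api.get.count", nil).Inc(1)

	// A timer for latency: record the start time, then update the
	// timer once the event being measured has finished.
	start := time.Now()
	defer metrics.GetOrRegisterTimer("api.get.time", nil).UpdateSince(start)

	// ... the actual work being measured ...
	time.Sleep(10 * time.Millisecond)
}

func main() {
	handleGet()
}
```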
Metrics help us detect issues within current and new versions of swarm, and they also show us the performance of our nodes.
It's very handy, when you deploy a new version, to compare the metrics you got up until that point with those after the deployment, so that you can see whether you have introduced regressions, or whether you have fixed bugs that you had in the past. That's something we use heavily.
So what do we measure with the metrics? Obviously infrastructure metrics, that's clear: memory usage, CPU, disk usage. And, most importantly, the application metrics: the errors that I talked about, warnings, different info counters.
We also measure application-specific things like the number of peers per node and all the different message exchanges between peers. That's where the interesting information comes along: it helps you visualize what a specific node is doing at a given time with respect to the different protocols it's running. In swarm, that would be the bzz protocol, the stream protocol, PSS or, yep, anything else. This is what it looks like in the dashboard; it's not very visible right now, but these are our dashboards.
The things that you see here are the different nodes that we're running. It's a handy way to visualize all the errors on a set of nodes; you can create alerts for them and monitor a large deployment of your network. In a developer's use case that would be only one node, let's say your local node, and you can visualize what your local node's state is at a given time. That's again the infrastructure side, and that's an example of application monitoring.
We also run this when we are doing simulation tests; that specific screenshot is from a run of the PSS simulation tests. At the time we were comparing different transport models, and you can see that at some point the metrics become flat, on the right side, where we're measuring the number of calls to specific PSS handlers.
It's pretty clear that at some point we just flatten out and we're not incrementing the counters anymore, which tells you that your process is not doing anything at that point. This might be expected or unexpected, depending on the simulation you're running; it basically gives you visibility over your system. The last pillar is distributed tracing. That's a system that we introduced within the swarm team: we decided to use OpenTracing.
OpenTracing is basically vendor-neutral APIs and instrumentation libraries; they have standardized the semantics of what they mean by span, trace, etc. There are different client libraries, and we're using the Go one. There are also popular tracers supported: those are the systems that actually aggregate the traces from your processes and produce a nice visualization so that you can make sense of your traces. I'll go quickly over the OpenTracing data model. Traces are defined implicitly by their spans, and a trace can be thought of as a DAG of spans. This is what it looks like.
You have one root span, and then you have a lot of children of that span. Let's say your request comes into the system: you create your root span, and then within the internal subsystems of your process you just attach spans to the parent root span. That's what the relationships between spans look like in a single trace. It's also useful to think of it in the temporal dimension.
Generally, the root span is going to have the full length of a request, whereas its child spans are going to finish within the root span, unless you fire up asynchronous jobs from, let's say, your root request. So that's the data model of OpenTracing. How do we use this in swarm? We have instrumented our code, and this is a screenshot of an HTTP GET request for a specific root chunk.
You can see the full length of the request and the internal systems it goes through; I'm not sure that's visible for you, but basically the root span is the HTTP GET file. Then we can see that we're hitting the API package, and then we go into the chunk reader, which is responsible for fetching the individual chunks that the root hash is built of.
Here is another example where we actually issue an HTTP GET request to a specific swarm node and trace it within our network, of course, so we can see which other nodes a specific request is hitting. Let's say you don't have all the chunks cached on your local node: you can visualize which other nodes you are hitting in order to retrieve the chunks for that specific content. Obviously that throws privacy out of the window, but it's there for debugging purposes.
We don't expect people to be running this in production, and even if they are, this is obviously just giving you information on the nodes that you control, and you're pretty much free to do whatever you want with your nodes. That's something you should keep in mind: when you are using a public gateway, you might actually be losing your privacy, so you should be running your own nodes.
How does that work in terms of Go, adding the instrumentation? Here's just a simple example: we have a handler function for one of the protocol messages, in this case the offered hashes message. When we handle that message, the only thing we do is define a span and extract the root span from the context. So we're attaching the handle-offered-hashes span to the root span, which is kept in the context, and the context is propagated to the other internal subsystems so that they have access to this instrumentation.
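Sketched with opentracing-go, the handler pattern looks roughly like this (the message type and span name are stand-ins, not the exact swarm code):

```go
package main

import (
	"context"

	opentracing "github.com/opentracing/opentracing-go"
)

// OfferedHashesMsg stands in for the real protocol message type.
type OfferedHashesMsg struct{ Hashes []byte }

func handleOfferedHashes(ctx context.Context, msg *OfferedHashesMsg) {
	// Extract the root span from the context and attach this
	// handler's span to it as a child.
	var opts []opentracing.StartSpanOption
	if parent := opentracing.SpanFromContext(ctx); parent != nil {
		opts = append(opts, opentracing.ChildOf(parent.Context()))
	}
	span := opentracing.StartSpan("handle.offered.hashes", opts...)
	defer span.Finish()

	// Propagate the context, now carrying this span, to the internal
	// subsystems so they can attach their own child spans.
	ctx = opentracing.ContextWithSpan(ctx, span)
	_ = ctx
}

func main() {
	handleOfferedHashes(context.Background(), &OfferedHashesMsg{})
}
```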
That's how you get the relationships between the different nodes. So how do you use all of that if you're a developer and you want to develop on swarm? The simplest way to do the tracing is to run a so-called aggregator. We talked about OpenTracing: you can run, for example, Jaeger or any other compatible tracer.
With a simple docker command you can start it on your local machine, and then you just have to start swarm with tracing enabled, that's the tracing flag. You need to give the endpoint, which in this case is localhost and the 6831 port, and you also have to give your node a name, in this case 'my local swarm'.
So the question is: is there any visualization that is better than the tracing, and is there something in the works? No, I'm not aware of such tools. Tracing together with the stats is how you get visibility over your systems. If someone in the audience knows of anything that adds more value, I'd be very interested in hearing about it, but that's the state of the art, as far as I know, at the moment.