Description
Weekly incubation engineering video for APM.
https://about.gitlab.com/handbook/engineering/incubation/monitor-apm/
History issue tracker link - https://gitlab.com/gitlab-org/incubation-engineering/apm/apm/-/issues/2
Hello, Joe Shaw here, engineer in the Incubation department for APM.
Another weekly update video. It's been a mixed week: looking at historic information in GitLab, talking to people, and doing a bit of research around potential time-series databases, or general column-storage databases, that we might be looking at using to store all this performance data. I've also been looking specifically at the Datadog agent: how we can set that up, use it as an agent, and create our own version of the Datadog API, a simplified version that we can use in production.
This first page I'm showing you here is one of the issues I've got open. It will stay open and I'll keep adding to it. It's a historic issue, project, and epic tracker for the parts of the organization that I want to keep a tab on. A lot of this stuff is probably over a year old now.
So, whilst I am talking to people in the business, a lot of this information is too old for people to recall easily; they'll only remember the salient points, so it's useful for me to dig into the details. I've managed to search out and track down some epics and issues that are quite pertinent to what we're doing here, an open MR in a runbook, and some more internal docs that I don't feel belong in the general reading list in the handbook.
Still, it's quite useful; there are one or two that are specific to research, for example on Grafana usability and things like that. One example of these issues, an epic actually, was for dogfooding the monitoring. You can see a discussion start where they're talking about it and looking at the actual dashboards we were looking to dogfood into GitLab. This was a year ago; you can see these are all comments from early 2020, and it eventually gets closed off.
But you can see the sort of discussion that was going on there, and likewise, jumping off from that, you can see things like existing issues that have been created around things like Datadog.
A
So
looking
at
agent
metric
collection
here
and
digging
down
into
this
there's
some
interesting
stuff,
I
don't
think
anything
ever
got
fully
completed
here
or
even
started,
but
it's
all
interesting
information
anyway,
and
a
lot
of
this
was
focused
on
being
able
to
monitor
ci
jobs,
for
example
within
different
environments.
So
if
you're
running
your
own
gitlab
runner
for
ci,
how
you
monitor
make
sure
that's
still
performing
well,
and
you
can
do
all
that
with
prometheus
right
now.
But if you've already got a Datadog integration, then you might want to be able to do that directly, so there was some work going on there, and I do believe that Datadog integration for the runner was completed recently, in one of the recent releases of GitLab. So that's great. Before I move on to talk a little bit about why we're looking at the Datadog agent specifically, I'll obviously get people asking me: why not jump straight to using the OpenTelemetry project, which is a great community project?
It's been brought together and stands on a number of other projects that have folded into it; for example, I think OpenMetrics was one that's been folded in, and I'm not sure about OpenTracing. They're working on things like how you standardize the metric data structures, trace data structures, and log data structures, because they all tend to be very similar, and then how you can create an agent and things like that on top. You can see this in the actual documentation for OpenTelemetry.
Where was I going to look? You can see that the kind of reference architecture there would be very similar to something that we would look at using: you've got an OpenTelemetry collector with data being pushed into it, a very simple single-agent architecture.
For example, if you look at the Go language instrumentation, it's at a very, very early stage: release candidates, alpha releases, and no logging implemented there at all.
We want people to be able to get going very, very quickly with this, whereas I've seen, for example, that GitHub started to use OpenTelemetry with Ruby, and based on a recent blog post it looks like they've had to put a lot of work in to get that going. A business of that size and technical capability can do that, whereas a lot of small to medium-sized businesses simply can't.
One of the things we want to keep in mind is trying to pick a solution that might be able to solve all areas of observability, as opposed to just solving for time-series data. By time series I usually just mean metric data, where you've got a floating-point number, tags, and timestamps.
There are a lot of databases that support that. We're also looking at log data, which can be full-text-searchable data, and trace data, which, whilst it's all time-span and timestamp indexed, is a very different data structure. So we're looking to have that level of flexibility.
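To make the distinction concrete, here's a rough sketch of the three data shapes just described. The type and field names are mine, purely for illustration, not from any spec or library:

```python
from dataclasses import dataclass, field
from typing import Optional

# All three signals are timestamp-indexed, but the payloads differ,
# which is why a pure time-series store doesn't automatically fit all of them.

@dataclass
class MetricPoint:
    name: str            # e.g. "system.disk.in_use"
    timestamp: float     # unix seconds
    value: float         # a single floating-point sample
    tags: dict = field(default_factory=dict)

@dataclass
class LogEntry:
    timestamp: float
    message: str         # free text, wants full-text search
    tags: dict = field(default_factory=dict)

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # spans form a tree, unlike metrics and logs
    start: float
    duration: float           # a time *span*, not a point
    tags: dict = field(default_factory=dict)
```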
Now, I stumbled upon QuestDB here via this blog post looking at benchmarking, and there's a benchmarking suite, the Time Series Benchmark Suite, which I think was created and is run as open source by Timescale, another time-series database built on top of Postgres. This was an interesting one, because in the sort of throughput tests they're doing you can see ClickHouse comes out as a very strong contender versus things like InfluxDB and TimescaleDB, which are very good time-series databases, don't get me wrong.
Timescale, being based on Postgres, has some limitations there in terms of scalability, and InfluxDB is another open-core model: it scales vertically really, really well on single nodes, but when you want to do horizontal scaling and replication you then have to get an enterprise licence. ClickHouse, theoretically (I haven't tested this yet), should be able to scale horizontally.
I had also seen an issue when I was looking to evaluate QuestDB: they haven't actually implemented horizontal, multi-server replication yet. But it is an interesting one, and it looks like a very performant time-series database; it's not clear yet whether it would be useful for the other sorts of metrics as well, so I'll leave that for the time being.
Here's another article, looking at various time-series workloads. One thing that comes out of this, which we need to do some more research around with ClickHouse in particular, is simple queries: ClickHouse actually comes out as a lot less performant than a lot of other solutions when querying small amounts of data; you can see it's one of the lowest contenders there. But when you then switch to very, very heavy queries, because of the nature of the database, it starts to perform much, much better.
So we need to get an idea of what sort of access patterns we're going to get from users building dashboards and visualizing the data. If they are of this sort of nature, then it would be quite useful; and from my instinct, I think it's going to be the case that we're going to have large volumes of writes, which ClickHouse is very good for and can ingest with very high throughput, and a lower volume of complex reads, where there's selection over lots of the columns, merging tags together, and things like that. So my hope at the moment is that this would be a good fit. Another blog that's interesting is Uber's logging blog entry, where they talk about migrating from Elasticsearch to ClickHouse as a solution for their logs.
You can see the architecture diagram there, which is very similar to the sort of architecture we might be looking at for the APM solution: Kafka acting as a set of brokers to handle back pressure and perform message passing, those logs being stored in ClickHouse, and views built on top of that. They go into a lot of detail and talk about the schema they're using and some of the optimizations they then put in place to be able to provide the data at the scale of Uber.
That's a very good sign that they're able to use it for that method. It means that if we can just use ClickHouse for all these different types of observability data in one place, then from an infrastructure point of view it's much less of a headache than having loads of different databases.
So, very quickly: here's a Datadog agent in a docker-compose file that I'm setting up, and I'm using MockServer, which is another project, to essentially mock out HTTP requests: it captures them and allows you to plug in your own responses and things like that. It's good for just being able to analyze and trace HTTP requests, because, as we'll see, the Datadog agent uses a bunch of endpoints, some of which are documented, but some aren't, and we need to look at what it's doing there.
As you can see, we're setting that up with various URL overrides, which is a bit awkward: you've got this main Datadog URL, but we also have to override it separately for different areas that we haven't turned on, otherwise the default is just to go to datadoghq.com if you don't override it. That's a little bit frustrating, but it's not a big deal. You can set these up in config, or as command-line args and environment variables.
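For reference, a compose file along the lines of what's described might look like this. This is my own hypothetical sketch, not the file from the video: the service names and ports are made up, `DD_DD_URL` is the agent's main endpoint override as I understand it, and the per-product override variables aren't shown (check the agent docs for the full list):

```yaml
version: "3"
services:
  mockserver:
    image: mockserver/mockserver
    ports:
      - "1080:1080"
  datadog-agent:
    image: datadog/agent
    environment:
      DD_API_KEY: "dummy-key"              # the agent insists on having one
      DD_DD_URL: "http://mockserver:1080"  # main intake override
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc:/host/proc:ro
      - /sys/fs/cgroup:/host/sys/fs/cgroup:ro
```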
We pass it the Docker socket (the Unix socket) so that it can read Docker state information; it automatically detects that and reads it. We also pass the host's /proc filesystem, which is the standard Linux way of interrogating performance counters for the actual hardware, and the cgroup information as well, which relates to the way containers run and organize themselves.
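As an aside on why /proc is needed: CPU counters, for example, live in /proc/stat as cumulative jiffies, and utilisation has to be derived from deltas between reads. A small illustrative sketch (not the agent's actual code):

```python
def parse_cpu_line(line: str) -> dict:
    """Parse an aggregate 'cpu ...' line from /proc/stat into named counters."""
    fields = line.split()
    names = ["user", "nice", "system", "idle", "iowait",
             "irq", "softirq", "steal"]
    return dict(zip(names, (int(v) for v in fields[1:1 + len(names)])))

def busy_fraction(prev: dict, curr: dict) -> float:
    """CPU utilisation between two samples: 1 - (idle delta / total delta)."""
    idle = (curr["idle"] + curr["iowait"]) - (prev["idle"] + prev["iowait"])
    total = sum(curr.values()) - sum(prev.values())
    return 1.0 - idle / total
```

Reading the same file twice a few seconds apart and feeding both samples through `busy_fraction` gives the kind of percentage gauge the agent reports.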
So if we just stand that up, you can see those setting up there, and we should see logs from the agent starting to stream out. If we jump into the MockServer dashboard, we start to see requests coming in. We've got an initial validation request there to validate our API key. All we're doing with all these responses is returning a 200 with a body of "OK", and the agent seems to just not care, in fact.
Initially it was all 404s and it still just carried on, so it doesn't seem to mind much what you send back; it just keeps sending metrics at the endpoint. There are these intake endpoints that I haven't had a chance to break down yet, which are not documented. So there's lots of stuff documented for the Datadog APIs, which you can see here, but that's not documented, and there are these check_run ones that I'm not sure about either.
The main thing we're interested in at the moment is the series posts here. I think it's set to send every 10 to 15 seconds by default, and it's sort of a batch request: everything builds up in memory and then it does a flush out to this series endpoint. Let's wait for another one.
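The buffer-and-flush pattern just described can be sketched as follows. This is an illustration of the behaviour observed, not the agent's code; the 10-15 second figure is what the capture suggests, not a documented value:

```python
import time

class SeriesBuffer:
    """Accumulate metric points in memory; flush them as one batched payload."""

    def __init__(self, flush_interval: float, send):
        self.flush_interval = flush_interval
        self.send = send          # callable that receives the batched payload
        self.points = []
        self.last_flush = time.monotonic()

    def add(self, metric: str, value: float, tags=None):
        self.points.append({
            "metric": metric,
            "points": [[int(time.time()), value]],
            "tags": tags or [],
        })
        self.maybe_flush()

    def maybe_flush(self):
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        if self.points:
            self.send({"series": self.points})  # one batched request
            self.points = []
        self.last_flush = time.monotonic()
```

In use, `send` would be an HTTP POST to the series endpoint; here it can be any callable, which also makes the batching easy to test.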
These are tuples it sends over: you've got a Unix timestamp and a value, which in this case would be the percentage. You've got some tags formatted here, and a device name, which is the name of one of the disks on my machine. It says it's a gauge metric type, for this particular device. The interval is not relevant here, because it's a gauge; that would matter if you've got a calculated rate or something like that. And there's the source of the data. So you get these sorts of metrics coming in from the system.
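Putting those fields together, the payload shape seen in the capture can be reconstructed like this. Field names follow Datadog's v1 series API as far as I can tell; treat this as a sketch of what was observed, not a spec:

```python
def gauge_series(metric, timestamp, value, host, device=None, tags=None):
    """Build one series entry as the agent appears to send it."""
    return {
        "metric": metric,
        "points": [[timestamp, value]],  # tuples of (unix timestamp, value)
        "type": "gauge",
        "interval": None,   # only meaningful for rates, not gauges
        "host": host,
        "device": device,
        "tags": tags or [],
    }

payload = {"series": [
    gauge_series(
        metric="system.disk.in_use",
        timestamp=1614700000,
        value=0.42,                 # e.g. 42% of the disk in use
        host="my-laptop",           # hypothetical host name
        device="disk1",             # one of the disks on the machine
        tags=["device_name:disk1"],
    ),
]}
```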
So that's as expected. I haven't got to the point of being able to visualize those, unfortunately; I was hoping to get that done this week. That's one of the first things I'll be looking at next week: getting a basic backend with Grafana. It might not be ClickHouse, because it might be a bit complicated to set that up.
A
Initially,
it
might
be
simpler
time
series
database
and
then
we
can
just
get
click
house
set
up
straight
away,
I'll
get
the
something
like
grafana
set
up
straight
away
on
that
so
anyway,
there
we
go
I'll,
leave
it
there.