Description
Weekly update for SEG in APM.
Update issue - https://gitlab.com/gitlab-org/incubation-engineering/apm/apm/-/issues/6
Hello there. Joe Shaw, Single Engineer Group for APM. This is my weekly update video, slightly delayed, but better late than never. At the moment we're starting to put issues together for the videos themselves.
I've just sorted my headphones out, that's better. So, I showed you the Datadog agent sandbox a bit last week. That project has come along quite a long way: it allows us to test the agent in isolation, and we've added some things to isolate the network. It's in Docker Compose, and we've added CoreDNS so we can capture any external traffic.
So in that respect we can see if the agent is sending any outbound requests that we wouldn't normally know about, because while we can see the expected requests through the mock servers that we previously set up, we can't be sure, without auditing all the code in depth, that there aren't other requests going on. I'll show you what I mean by that.
So we've got the Datadog agent there in this compose file. You should be able to see this better than usual now, because I've actually got it zoomed in, which I noticed I hadn't before. At the top, as you can see, we've got a DNS proxy set up using CoreDNS, and the agent is using that DNS proxy IP.
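For reference, the pattern described here (a CoreDNS container acting as the DNS server for the agent container, so every lookup gets logged) could be sketched in a compose file like this. The service names, subnet, and pinned IP are illustrative, not the actual sandbox config:

```yaml
services:
  coredns:
    image: coredns/coredns
    command: ["-conf", "/etc/coredns/Corefile"]
    volumes:
      - ./Corefile:/etc/coredns/Corefile:ro
    networks:
      sandbox:
        ipv4_address: 172.28.0.53   # fixed IP so the agent can point at it

  datadog-agent:
    image: gcr.io/datadoghq/agent:7
    dns:
      - 172.28.0.53                 # every DNS lookup goes via CoreDNS
    networks:
      - sandbox

networks:
  sandbox:
    ipam:
      config:
        - subnet: 172.28.0.0/24
```

A minimal Corefile that logs every query before forwarding upstream would be:

```
.:53 {
    log
    forward . 8.8.8.8
}
```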
And in fact, let's ping Datadog in there and see if the request goes through. What you'll see is that the only other requests not being captured by this compose environment and going to one of our services are Datadog requests to NTP (Network Time Protocol) servers, which is a good idea and to be expected: it keeps the agent in synchronization. So even if the host machine has the wrong time settings, it can reach out, grab some NTP settings and synchronize with those servers. So that's fine; I don't think we need to worry about that going through there. What have we got next?
Yes, so we've added into this environment a ClickHouse data store, with a Golang service on top of it to capture the series data. It's a really simple setup.
I'll show you the series Go agent here. It's just in one Golang file, and you'll see that ClickHouse has very much a sort of SQL-like dialect; in most cases it's almost identical.
We create a metrics database on startup here, and a series table in that database. It's really just a flattened representation of the series data we get from Datadog. Not much thought has gone into this; it's something that we will redesign and run some benchmarks against to get a better data model in place, because this is completely denormalized and not a very efficient way of doing it. So you've got timestamp, host, metric, value, and an array of tags there.
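As a rough sketch, the flattened table described would look something like this; the column names are my guess at the shape, not the actual schema:

```sql
CREATE DATABASE IF NOT EXISTS metrics;

-- One row per data point; tags stored as a plain array,
-- no explicit primary key, just ordered by timestamp.
CREATE TABLE IF NOT EXISTS metrics.series (
    timestamp DateTime,
    host      String,
    metric    String,
    value     Float64,
    tags      Array(String)
) ENGINE = MergeTree()
ORDER BY timestamp;
```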
I'm using the MergeTree engine, which is the default engine in ClickHouse and which I need to spend some more time looking into. We don't bother with a primary key; we're just ordering all the records by timestamp, which will create a timestamp index. And this is some code to flatten the incoming requests and store them, as you can see.
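The flattening step could be sketched like this; it's a minimal reconstruction, not the actual agent code, and the JSON field names follow Datadog's public v1 series submission shape, where each point is a `[timestamp, value]` pair:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// seriesPayload mirrors the Datadog v1 series submission body.
type seriesPayload struct {
	Series []struct {
		Metric string       `json:"metric"`
		Host   string       `json:"host"`
		Tags   []string     `json:"tags"`
		Points [][2]float64 `json:"points"` // [unix timestamp, value]
	} `json:"series"`
}

// row mirrors one record in the flattened ClickHouse table.
type row struct {
	Timestamp int64
	Host      string
	Metric    string
	Value     float64
	Tags      []string
}

// flatten turns a series payload into one row per data point.
func flatten(body []byte) ([]row, error) {
	var p seriesPayload
	if err := json.Unmarshal(body, &p); err != nil {
		return nil, err
	}
	var rows []row
	for _, s := range p.Series {
		for _, pt := range s.Points {
			rows = append(rows, row{
				Timestamp: int64(pt[0]),
				Host:      s.Host,
				Metric:    s.Metric,
				Value:     pt[1],
				Tags:      s.Tags,
			})
		}
	}
	return rows, nil
}

func main() {
	body := []byte(`{"series":[{"metric":"system.io.w_s","host":"myhost",` +
		`"tags":["device:sda"],"points":[[1650000000,12.5],[1650000010,13.0]]}]}`)
	rows, err := flatten(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(rows), rows[0].Metric, rows[1].Value)
}
```

Each flattened row then maps directly onto an INSERT into the series table.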
With the ClickHouse data source, which isn't built in (you have to add it as a plugin), you get quite a simple way of setting things up. It will try to detect a timestamp column that it will want to sort by. You don't get much in the way of query editing capability, but it gives you some template variables that do automatic expansion, like this time filter here.
So if we look at the actual query, the time filter is actually converting epoch times to dates and checking that the timestamp is within those. And here what we're doing is filtering this data down to the system disk write time percentage. Let me just refresh this; I've noticed that with this plugin the auto-refresh doesn't seem to be working. So, the last five minutes.
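The kind of query shown could look like this. The macro syntax follows the Grafana ClickHouse data source's `$__timeFilter`, and the metric name is a placeholder, not necessarily the exact one on screen:

```sql
SELECT timestamp, value
FROM metrics.series
WHERE $__timeFilter(timestamp)         -- plugin macro: expands to time bounds
  AND metric = 'system.io.w_time_pct'  -- placeholder metric name
ORDER BY timestamp

-- The macro expansion is roughly:
--   timestamp >= toDateTime(1650000000) AND timestamp <= toDateTime(1650000300)
```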
A
So,
let's
see
if
we
can
actually
run
some
run,
a
stress
ng
test
here,
I'll
keep
that
off
for
30
seconds,
so
I'm
just
going
to
run
a
hard
drive
test
with
32
workers
and
it's
just
doing
sort
of
I
o
sync
buffer
syncs
to
disk.
Hopefully
this
will
get
some
changes
in
those
disk
metrics.
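The command for that sort of run, assuming standard stress-ng flags, would be something like:

```
stress-ng --hdd 32 --timeout 30s
```

`--hdd 32` starts 32 disk write/sync workers and `--timeout 30s` stops everything after 30 seconds; the exact sync behaviour can be tuned further with `--hdd-opts`.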
It's always a bit of an issue when you're doing these sorts of disk tests on a machine with a lot of RAM: often things just get memory-mapped and you don't see any changes to the disk. So we can just wait for that; I set it for 30 seconds, and there it is. You can see it's done some operations there.
If we have a look at this, you can see a big spike there, relatively. If we look at that earlier time frame, they were sort of normal percentages, and then if we look at the last five minutes you can see that big spike. Now, what I don't really understand in this data is why a value that is supposed to be a percentage is higher than 2000.
I don't know why; that is something I need to look into, something to do with the Datadog agent, I'm not quite sure. You can see that big write spike there, and it's coming back down, and there we go: it was roughly, was it 2,036? Yeah, so roughly over 30 seconds to a minute there. So you can see that's working fine, and if we took that spike off, you can see the normal data set there in terms of the query.
We're grouping by host and tags there, so I've concatenated the tags together. You can see for this data set we've got my hostname, and we've got device names for the individual disk devices. And you can see the only ones registering anything are the real disk devices; there are lots of loop logical volumes that aren't doing anything at all, if you look along the bottom there. Right, so that's that.
So that's Grafana configured. We did notice that there are no Docker stats being collected, other than some basic ones. I'm hoping this is just a cgroups v2 issue and that it would otherwise work. I need to test the agent out in Kubernetes, really, just to make sure that it is getting all the metrics that I would expect.
Yes, and configuring the agent: because it's set up with lots of different endpoints, if you want to configure everything (like process logs, the APM stuff, the AppSec stuff) you would expect these environment variable overrides to have some level of consistency. But there are some inconsistent bits, like the logs URL not allowing you to specify a protocol in the actual URL and having that set somewhere else.
A
Things
like
that
also
certain
things
that
aren't
consistent
in
terms
of
this
sort
of
schema
that
you
would
expect-
and
these
are
documented,
but
it
it
just
makes
setting
this
up
a
bit
more
awkward.
We
might
be
able
to
solve
that
by
providing
our
own
image,
for
instance,
some
items
of
interest
that
came
up,
there's
a
an
article
here
about
talking
about
observability
as
a
braid
of
data
instead
of
a
classic
three
pillars,
so
you're
weaving
them
together
with
contacts
that
are
coming
from
that
trace
information.
This is something that a lot of products are starting to offer now, and it's something that we'll definitely look to provide in the future. Also, one thing that keeps coming up is people asking why we aren't using the OpenTelemetry Collector, because it is becoming, I wouldn't say the standard, but it's getting a lot of traction and there are a lot of blogs about it. And I would say the OpenTelemetry Collector does have a Datadog exporter, so in theory, as long as we keep the data model in ClickHouse (or whatever time-series database we use) generic, it won't matter, because this exporter would still work. You could still use OpenTelemetry; you wouldn't need a Datadog agent installed, so that would be great. So, up next: I need to dig further into ClickHouse, and we've started doing that.
We've been looking at some logs and at some of the drivers, and there's a time-series database benchmarking tool that we'll look at a bit further. What I also want to do next week is create an APM architectural design to share in GitLab, so we can get some early feedback on it and start to look at setting up infrastructure, general design assumptions, and things like that.