Description
Update issue - https://gitlab.com/gitlab-org/incubation-engineering/apm/apm/-/issues/32
Hello, Joshua here, full-stack engineer in the Incubation Engineering department. I'm working on APM: application performance monitoring, management, and observability for the GitLab DevOps platform. This is my weekly update video; you can see my demo update issue here, covering the last week.
One solution could have been to send that high-cardinality data at a much lower rate in the agent, but I actually found it was an issue whereby I wasn't deduplicating or grouping the data correctly, based on the entire unique tag set coming through in the series data from Datadog.
This meant we were losing chunks of data: it was a sort of ordering or race condition where certain bits of data would be dropped while others would be kept. I've fixed that and added more documentation around it, so I'll show you some of the docs that I've added to our project here.
So there are better Datadog agent docs now; I thought I'd better put these in.
While I was doing that work, I added documentation around the series endpoint I'm talking about here. There's an example POST, information about how we handle metrics and custom metrics, and information around the validation rules that we expect based on the Datadog API; I've got tests around all of these now to make it a bit more robust. More specifically, there's documentation on how we convert that series format into the measurement format that we use in the ClickHouse database.
I've also got some information here about how those are converted, in more detail.
So it's easier to refresh on that information. The key bit here is that it's grouped on this unique set of the host, timestamp, measurement, and the unique tag set, which is the bit that I was missing before, and that transfers into this Go struct here.
I can show you a running database now, just to give an example of the sort of cardinality of data we've got in there. This is running on my machine at the moment.
You can see I've done a count of metrics there: individual metric measurements, each with a measurement name, a tag set, and a set of field values associated with it. You can see there's quite a lot there, a significant amount; that's because it generates a lot of data.
If we look at the distinct measurements in metrics, these are all the different unique top-level measurements that we're getting, and there's a lot of stuff.
So there are a lot of Kubernetes metrics there, and system metrics like CPU, memory, and things like that. For each one of these you're getting an insert with a unique tag set, and individual fields; some of them have just one field, some have multiple fields associated.
You can see the CPU metrics there, as an example. This one doesn't have any tags at all (these are the tag fields here), but it does have a lot of different fields, so in some ways it's relatively high cardinality. Those are all fields associated with one CPU record, and these are the array values that are stored with it.
But as an example of a data source with a high-cardinality tag set, what I'm going to do here is get the accumulated CPU records, arrayZip the tag keys and values together, and just pull one out.
What you can see here (and this is just one record, the tag set for that record) is that you've got a lot of different fields associated with it: container_id, container_name, display_name, docker_image, image_name, image_tag, kube_container_name, kube_namespace, pod_name, pod_phase, short_image. So there are a lot of fields, and a lot of different variations of this sort of thing, and I was seeing way less data than I expected. I managed to fix that issue.
If I actually turn the limit off there, we get 2,700 rows, which is more like what we would expect to have in that data set. So that's good.
Back to the other thing that we were doing: we were getting the Datadog endpoints, and general time-series endpoints, integrated with GitLab, so that we could tie together an API key provided by an agent with a GitLab project and an optional environment. I showed you a bit of this work last week; we've made good progress with it and it's nearly functional.
Just as a reminder, we've got some documentation here as part of this issue, talking about adding these environment variables as global tags (in the case of Datadog, but this will work with other agents as well). What we actually discovered while doing this is that with Datadog these don't end up on every request: there's actually an initial intake request that ends up with these host tags specified.
So we need to be able to strip those out of that request and pair them up with the API token. The last bit we need to do with that is pair it together with the access rights and store it in some sort of session storage. For the solution, I'm thinking of having a simple Redis cache in there for now, so this stuff can have a one-way hash on it, and we'll store that information keyed on the API key and the host name.
So the uniqueness constraint would be that for whichever agent is running with a particular API key and host name, that combination would be limited to a specific GitLab project and environment.
What they can simply do is generate another API key if they've got host names that are going to clash, and that means you can have the exact same host name but associated with a different project. I think this makes sense, because a lot of the time you want unique host names within a particular project or cluster anyway, and it would make sense not to have duplicate host names because that would make the data much harder to explore.
So I think that constraint is acceptable in this case.
I showed you this example of the GitLab API before; I've updated it for the intake endpoint that we were talking about. What we've also got implemented is that when we successfully get a project back to attach to a metric set, we check that the permissions part of the response is set and that it has an access level equal to or above 30, which means the API token is associated with a developer-level account; the assumption there is that a developer-level account in GitLab is sufficient to write this data.
So that's good, that's implemented now. What was the next bit? The other part of that is for local development: we created a GitLab stub service, so we don't have to have a GitLab instance running, or well-known keys configured in GitLab.com or whatever, for local dev. I've just created a stub for the endpoints we need. Again, I've documented this a bit better as part of the documentation, so you can see the local stub endpoints here: these are the ones we've implemented, and these are the ones we need.
We need to get the project's ID, and get a project by ID to work out what the permissions are. Optionally we get the environment, so we can validate that environment variable. Then we use the version endpoint, which is part of the standard GitLab API, to do basic API key validation; so when the agent does a verification request for the token, we can just check against that endpoint as well, which makes it nice and simple. I'm trying to think if there's anything else; I think that mostly covers what we've been up to.
So I fully expect that we'll get this project-level integration and validation working within the next week, and I should be able to demo that this week. That's everything from me for now. Thank you very much for watching.