Description
Weekly demo issue link - https://gitlab.com/gitlab-org/incubation-engineering/apm/apm/-/issues/8
My current update: I've further documented the Datadog endpoints for the agent that we're looking at using before APM. I've documented the initial metrics for the metrics endpoint, which is used to capture time series data and some basic events. This is the first one we're looking to tackle as part of our first iteration, to get this data in and store it.
I'll just show you here. So this is the first environment variable, or bit of config, that this part of the agent supports. There are lots of other parts, like process and log monitoring, things like that, which we aren't going to analyse at the moment, until we need to, so I'm documenting this. I've started putting this documentation in this issue, and we've got the sort of requests that we expect, like validation requests, there, and links to parts of the documentation.
There are some potentially slightly unusual, non-documented parts of the agent here. There's this intake endpoint that we need to have a look at a bit more, where there are a couple of requests. For example, it sends through a lot of metadata about the host that it's running on: information about the hardware specs, the operating system, host names, install method, all sorts of useful stuff that we could use for auditing.
We probably aren't looking to capture this. What we might do for the time being is just log this information out, so we can check logs and see what's going on, but we probably won't save it initially, unless we actually need to refer back to it, or we need it for end users to be able to make sense of what agents are being used, for example. There's an initial one there.
It shows you, for example, what other processes are running in this environment, and then the second one gives you a lot more information. I've yet to really break this down; it seems to give container error states and things like that. Then we get a check run endpoint, which is events about service checks that the agent is performing. For example, here we're blocking external traffic while we're testing this agent.
So this NTP in-sync check is failing with this somewhat cryptic status code, which I believe means critical in this case, because it goes zero, one, two, three: zero is OK, one is warning, two is error, three is critical. Other checks, for example here, like the actual agent itself, are OK, and so on and so forth. So that might be useful, but again, we won't look to take that straight away. What we want to focus on initially is the series data, which is where we actually have things like system metrics.
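The check statuses above can be captured in a small mapping. This sketch follows the interpretation given in the video (0 = OK, 1 = warning, 2 = error, 3 = critical); it's worth verifying against the agent's own constants before depending on it.

```go
package main

import "fmt"

// ServiceCheckStatus models the numeric status codes described above.
// The mapping here is the video's interpretation, not a verified list
// of the agent's constants.
type ServiceCheckStatus int

const (
	StatusOK ServiceCheckStatus = iota // 0
	StatusWarning                      // 1
	StatusError                        // 2
	StatusCritical                     // 3
)

// String renders a status code for logs and debugging.
func (s ServiceCheckStatus) String() string {
	switch s {
	case StatusOK:
		return "ok"
	case StatusWarning:
		return "warning"
	case StatusError:
		return "error"
	case StatusCritical:
		return "critical"
	default:
		return fmt.Sprintf("unknown(%d)", int(s))
	}
}

func main() {
	// The failing NTP in-sync check in the demo reported status 3.
	fmt.Println(ServiceCheckStatus(3)) // prints "critical"
}
```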
I've shown you this in a previous video, so I won't go into that at this point. So that's that bit of documentation, and that's going to feed into an initial iteration.
A
So
this
initial
iteration
will
be
looking
at
how
we
implement
a
gateway
service
to
do
authentication
against
get
with
access
tokens
handle
that
series
endpoint
and
the
storage
as
well.
An OpenTelemetry gateway service implements that API initially. We talked for a little while about using Kafka, but GitLab already uses Google's Pub/Sub service for the SaaS-only side of the business, in particular handling logs and pushing those into Elastic at the moment, so we're going to reuse Pub/Sub, because there's a lot of knowledge about it already within the teams.
So here we're going to ingest that data. We're still evaluating ClickHouse, but it's looking like a strong potential candidate for the actual data store here, so I'll get on to that in a second. In terms of being able to visualise and interact with this data, we're looking at putting Grafana on top of it, with a query proxy and its own database, allowing us to embed that in GitLab itself. There are some notes that I've put in here; I don't think I need to go into any particular details at the moment. And we've had a good conversation with people in infrastructure, for example, about how we might go about setting this up as its own service.
It's got quite nice service boundaries, so we don't have to host it in the same environment as GitLab.com. We can have it in a separate Google Cloud project and, obviously, a separate Kubernetes cluster, for example, so that will make it easier to isolate and understand the costs around running this service. So that's that. The other thing I'm looking at right now is the schema evaluation for ClickHouse.
I started this last week as well. I've expanded upon it a bit: put a few more links in, a bit more of an introduction. I'm in the process of documenting some other time series databases, because what we want to do initially, since we're looking at time series for this very first iteration, is just compare and contrast a few of the other examples.
I'm going to use this Time Series Benchmark Suite (I'll drag the link over here), which I think was created by InfluxDB initially and has now been taken up by TimescaleDB, to do a bunch of comparisons between time-series-specific databases. You can see ClickHouse is in here, and TimescaleDB and a few others. I don't want to benchmark against all of them; what I want to do is compare a potential, more generic schema that we might use against the ClickHouse schema that this benchmark uses, because that is a very hard-coded, very specific schema for certain series data: one schema for CPU, one schema for memory, for example. We don't really want to do that, so I'm going to see if we can get equivalent, or decent enough, performance with a more generic schema. I'll also be comparing those results against some of the other competitors, based on whether we think they're applicable or not.
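To make the "generic schema" idea above concrete: instead of one table per metric (CPU, memory, ...), the generic shape is roughly one wide table keyed by metric name plus tags. The DDL and row shape below are a hypothetical sketch, not the schema under evaluation.

```go
package main

import "fmt"

// genericSeriesDDL is a hypothetical sketch of one generic table for
// all series, as opposed to the benchmark's hard-coded table per
// metric. The actual ClickHouse schema is still being evaluated.
const genericSeriesDDL = `
CREATE TABLE samples (
    metric     LowCardinality(String),
    tags       Map(String, String),
    timestamp  DateTime64(3),
    value      Float64
) ENGINE = MergeTree
ORDER BY (metric, timestamp);
`

// Sample is the matching in-process row shape: any metric from any
// source fits the same four columns.
type Sample struct {
	Metric    string
	Tags      map[string]string
	Timestamp int64 // unix milliseconds
	Value     float64
}

func main() {
	s := Sample{
		Metric:    "system.cpu.idle",
		Tags:      map[string]string{"host": "web-1"},
		Timestamp: 1650000000000,
		Value:     97.5,
	}
	fmt.Printf("%s{host=%s} %v\n", s.Metric, s.Tags["host"], s.Value)
}
```

The benchmark question is then whether this shape, with `metric` leading the sort key, holds up against the per-metric tables the suite ships with.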
So I've been looking in here, for example. If I preview this (I'm editing it at the moment): looking at comparing these databases, we want replication, high availability, flexibility, a decent amount of maturity and, ideally, a multi-model database.
So it's not just time series: we can put all the data in there, so we can work with logs when we get there, and work with traces. Only a few of those candidates come out of that, so I'll be doing a comparison with a few of them to make sure we're going in the right direction. Obviously, we don't have time to compare every single database on the market, but it's a bit of due diligence to back up what we've decided to do. I'll be continuing that for the rest of today and next week, probably. We've also forked the time series benchmark project, so we can make relevant changes there; any fixes for any issues we find, we'll contribute back to the upstream project. So yeah, that's it for now, and you can see up next we'll continue with that and continue to refine the infrastructure stuff.
So that's it for now. Thank you for watching, and I'll see you next week.