Description
Weekly APM Single Engineer Group update video.
Demo issue - https://gitlab.com/gitlab-org/incubation-engineering/apm/apm/-/issues/19
So that's where we are: looking at storing metrics, logs, and traces, initially via the Datadog agent. A bit of history on that: we've previously done some analysis of the Datadog agent in a sandbox environment.
That was to make sure it's doing the sort of things we expect, and to get a bit of validation around it. We've also performed quite an in-depth evaluation of ClickHouse, benchmarking it for metric storage, and we've designed a flexible metric schema for ClickHouse as well; you can see that in the previous videos. We benchmarked that too, with some good results.
So what I've been working on in the last week is trying to push forward the infrastructure for the APM project. It will run slightly separately from GitLab initially, because it integrates with the GitLab API, and we can run it in its own Google Cloud project.
That way we can monitor it more easily. We've now managed to get incubation members the ability to create cloud sandboxes, so we can create Google Cloud and AWS sandboxes to build and test our software if we need to. I'll be using one of those sandboxes to set up a test environment for this, and we've now got a fairly good handle on the sort of access requests we need to make to get proper staging and production environments set up.
I'm going to do that once I've spent some time iterating on the infrastructure, to make sure it works as I expect it should.
That's using the newly developed ClickHouse storage schema. It's a very initial version; it needs a lot of work to harden it and make it acceptable for production use at GitLab, so I need to work on that as well.
We've got an open issue here, which is the overall iteration, linked to the submit backend there, and then some cross-references. So we've got the metrics design that we've completed, and some of the components we're looking at.
What we've done so far in this project is set it all up to be Kubernetes-first, so it's all a cloud-native setup. Even in development we're using minikube, so that when we move through environments into testing, staging, and so on, it should be a lot easier for us to accomplish that; we won't be building all of it from scratch. That's also the way I'm testing the application locally.
We merged in a Helm deployment with ClickHouse. For ClickHouse we've gone down the route of using the Altinity ClickHouse operator. I've seen quite a lot of mentions of it for a while, and it seems to be used successfully by a few different organizations.
It's got a lot of documentation, and it should allow us to create a ClickHouse deployment on Kubernetes where we can manage shards, replicas, data migrations, and things like that quite well. There's a lot of functionality in there: it's got built-in configuration for monitoring and maintenance tasks, and you can do a lot of complex setups with it. So it looks quite useful; I'm using it in development as well, to set up the basic case.
What else do we do? The development environment also sets up the Datadog agent, so we can test that locally. That's running in the minikube instance, and we've also got a test version of Grafana in there as well to run with it.
This issue is the one where we're tracking the actual series endpoint implementation, so we've got some details of that here. We're working on it in another merge request here, for the draft gateway service. I'll show you that running; it runs from this project, and it's a simple installation via Helm that we can run locally through minikube.
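For context, a series endpoint like the one the gateway implements would accept submissions shaped like the Datadog v1 `/api/v1/series` payload. This is only a sketch of building such a payload; the metric name and tags are made up for illustration, and the field names come from Datadog's public API docs, not from this video:

```python
import json
import time

def build_series_payload(metric, value, tags=None, metric_type="gauge"):
    """Build a Datadog-style v1 series payload: a list of series, each
    with a metric name, (timestamp, value) points, a type, and tags."""
    return {
        "series": [
            {
                "metric": metric,
                "points": [[int(time.time()), value]],
                "type": metric_type,
                "tags": tags or [],
            }
        ]
    }

# Example submission the agent might send to the gateway:
payload = build_series_payload("system.cpu.user", 12.5, tags=["env:dev"])
print(json.dumps(payload, indent=2))
```

A backend receiving this would unpack each series entry into rows of the metric storage schema.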
So here's the k9s interface, where we can look at our local cluster. Looking at the pods in the cluster here: you've got a ClickHouse cluster, a single-shard, single-replica cluster just for development, as you can see. There's the ClickHouse operator, we've got our dev gateway, and we've got the Datadog agent running, with the main agent and some of its components in there.
I've been able to reuse quite a few of those components, and we've also got kube-state-metrics running there, which Datadog uses to get Kubernetes metrics. It's all deployed in quite a simple way: all the images are built automatically and deployed, so it's quite easy to use.
I still need to set up CI to do some builds on there as well. As a quick example, if we refresh, this is the data from that ClickHouse database with the newly designed metrics schema: system CPU and user CPU being rendered in here, and you can see the queries down here.
Excuse me; you can see the queries in place down there that grab and display the data. There we go: grabbing the measurement system CPU and getting an average of the system field names from there. This kind of splits the data out into a nice format for us. So that's good, that's positive! Like I said, there's a lot of work that still needs doing there, just to productionize that service, and you can see all the code for it in this project.
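To illustrate what that averaging query is doing, here's a rough simulation in Python. The row layout (one row per measurement/field/value sample) and the column names are my assumptions based on the flexible schema described in the video, not the actual ClickHouse schema:

```python
from collections import defaultdict

# Hypothetical rows in a flexible metric schema: one row per
# (measurement, field_name, value) sample.
rows = [
    ("system_cpu", "system", 10.0),
    ("system_cpu", "system", 20.0),
    ("system_cpu", "user", 30.0),
    ("system_cpu", "user", 50.0),
]

def average_by_field(rows, measurement):
    """Mimic `SELECT field_name, avg(value) ... WHERE measurement = ?
    GROUP BY field_name` over the sample rows."""
    sums = defaultdict(lambda: [0.0, 0])
    for m, field, value in rows:
        if m == measurement:
            sums[field][0] += value
            sums[field][1] += 1
    return {field: total / count for field, (total, count) in sums.items()}

# Splits the data out per field name, as the dashboard query does.
print(average_by_field(rows, "system_cpu"))  # {'system': 15.0, 'user': 40.0}
```

Filtering on the measurement and grouping by field name is what splits a single wide measurement like system CPU into separate per-field series for display.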
I've added some interesting links there, including the operator. So next week I need to continue improving that implementation: adding some test cases, making sure it works for different combinations of metrics, doing some exploration with it, and also continuing to iterate on the test infrastructure.