Description
Weekly demo issue link - https://gitlab.com/gitlab-org/incubation-engineering/apm/apm/-/issues/9
Hello, I'm Joe Shaw. I'm a full stack engineer in the Incubation Engineering department at GitLab, and my focus is on application performance monitoring: how we take that idea and build our own solution for it. So what I'm going to show you, as you can see now, is an issue for my weekly demo.
It's a fairly quick one this week. I've been out a couple of days, and I've been doing a lot of useful Postgres training as well, which has actually been very handy for this work, since it involves doing a bit of Postgres work with TimescaleDB.
So I'll just go through to the issue here, where we've got the evaluation. There are a few links in there for anyone that's interested. What we've started looking at is benchmarking ClickHouse against a few different databases that are prominent in this area, particularly ones that are supported by the Time Series Benchmark Suite, which I think I discussed previously. It's a project created, I believe, by InfluxDB and now maintained by TimescaleDB. Here's a breakdown that you can see of all the supported databases that you can run the benchmarking tool against, and the tool supports a variety of workloads.
The one we're interested in is the DevOps workload. It will simulate metrics for CPU, disk, memory, kernel metrics, all sorts of stuff like that, and give you quite a few tables. The benchmark they have will set up a database, create tables, simulate data, and set up workers to insert that data. It also creates a bunch of queries of varying complexity, from basic selects through to quite complex group-by operations as well.
I will document all those queries in here as well, once I've done the analysis. So, in this sort of matrix I've set up here, what we're looking for is a database that is multi-modal, not just time series, because we want to use this data store for other observability data as well. We want it to be able to handle things like traces and complex data structures, and we really need replication, and probably partitioning as well as things scale up. We also want at least medium to high query flexibility.
What I'm talking about there is, you know, how complicated the queries you can write are: whether you can do complex groupings and joins, things like that. There's also a level of maturity, and I've really based that on how long the project has been running and the amount of traction: likes on the repository code, discussions on Stack Overflow, and things like that as well.
So you can see from that grid a few stand out, ClickHouse being one of them that we're interested in, CrateDB another, MongoDB, and of course TimescaleDB. Timescale itself is metrics-focused in terms of time series, but it is multi-modal because it's built on top of Postgres, so you can put anything you can put in a Postgres database in there. MongoDB is a JSON store database, very flexible, so we'll see how that performs.
CrateDB is one I'm less familiar with, but it does claim to be multi-modal, so I'm just going to keep it in there for the time being. The other ones I've excluded for that reason, and for levels of maturity and things like that. So I'm just working through a checklist now. We're using the Time Series Benchmark Suite here, which we forked from the original one from Timescale. In terms of things we've changed in there, we've got our own evaluation script, which just utilizes all the built-in scripts.
I've had to make some tweaks, fixes, and slight modifications to get it working the way I want. The script orchestrates, for each of the databases, the load, the setup, the number of workers, things like that, and there's a Compose file there that sets everything up. So yeah, we're running it in Docker, I think with Compose behind it. It just keeps it simple, but also it's containerized, which is how we would be running it in production anyway.
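To make the orchestration concrete, here's a minimal sketch of the kind of loop our evaluation script runs. It assumes TSBS's standard per-database binary names (`tsbs_generate_data`, `tsbs_run_queries_<db>`) and flag names as I recall them from the suite's docs; treat the exact flags as an assumption, not a transcript of our actual script.

```python
# Hypothetical sketch of the evaluation loop: build the TSBS command lines
# for each database under test. Flag names are assumptions based on the
# upstream TSBS documentation.

def generate_data_cmd(db_format: str, scale: int, seed: int = 123) -> list[str]:
    """Build the argv for simulating devops data for one target database."""
    return [
        "tsbs_generate_data",
        "--use-case=devops",
        f"--format={db_format}",
        f"--scale={scale}",
        f"--seed={seed}",  # fixed seed so every database sees the same data
    ]

def run_queries_cmd(db: str, workers: int) -> list[str]:
    """Build the argv for replaying generated queries with N workers."""
    return [f"tsbs_run_queries_{db}", f"--workers={workers}"]

if __name__ == "__main__":
    # One database at a time, as in the real runner.
    for db in ["clickhouse", "timescaledb", "cratedb"]:
        print(" ".join(generate_data_cmd(db, scale=100)))
        print(" ".join(run_queries_cmd(db, workers=8)))
```

In practice each argv would be handed to `subprocess.run`, with the Compose file bringing the target database up first.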
So there are a few databases I'm setting up there, and it will bring these up one at a time. I have Prometheus and cAdvisor running to collect general background and container metrics. If you're not familiar with cAdvisor, it's what is used in the kubelet in Kubernetes to collect pod metrics. The runner itself is containerized as well and orchestrates the rest, so that just makes it simple to run.
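For a rough picture of what that setup looks like, here's a hypothetical Compose sketch, one database service plus Prometheus, cAdvisor, and the containerized runner. Image names and mounts are illustrative, not copied from our actual repository.

```yaml
# Illustrative sketch only; service names, images, and volumes are assumptions.
services:
  clickhouse:
    image: clickhouse/clickhouse-server
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
  cadvisor:
    image: gcr.io/cadvisor/cadvisor
    volumes:
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
  runner:
    build: .
    depends_on:
      - clickhouse
```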
We can override these environment variables and control that. In terms of the data that we have there, just quickly showing you, it spits out a lot of these data files here for different runs and different queries.
If we just have a look in one of these, this is ClickHouse. It gives you a brief, very simple description of the double-groupby-all query. I haven't looked at that query yet, so I'm not sure what it's doing. There's some information about the run and its duration, and you've got percentiles in there, which is very useful, so we'll probably look at the 95th percentile for comparing, which is fairly typical, and there's a query rate throughput there as well.
So what I want to do with this is just write a quick script, maybe a Python notebook, something like that, that will gather all these and make some graphs, so we can do a quick comparison across them all. And then, after that, what we want to do, back on here, let me find the issue, is run that comparison, initially just between TimescaleDB and ClickHouse.
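The comparison step I have in mind might reduce to something like the following: average the p95 samples for each database and query type and print them side by side. The numbers and query names here are made up purely for illustration.

```python
from statistics import mean

# Illustrative input: p95 latencies (ms) per database and query type,
# as would be collected from the result files. Values are invented.
results = {
    "clickhouse":  {"double-groupby-all": [12.3, 11.8], "single-select": [1.2, 1.1]},
    "timescaledb": {"double-groupby-all": [15.1, 14.9], "single-select": [0.9, 1.0]},
}

def summarize(results: dict[str, dict[str, list[float]]]) -> dict[str, dict[str, float]]:
    """Average the p95 samples for each (database, query) pair."""
    return {db: {q: mean(v) for q, v in qs.items()} for db, qs in results.items()}

# Print a simple side-by-side comparison table.
for db, qs in summarize(results).items():
    for query, p95 in qs.items():
        print(f"{db:12s} {query:20s} {p95:6.2f} ms")
```

The notebook version would feed the same summary into a bar chart per query type rather than a printed table.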
TimescaleDB is the only other sort of obvious candidate, and then we'll add CrateDB in there as well, to make sure we've, you know, covered some obvious choices and can back up any decision we make. We'll also make sure we analyze CPU, memory, and disk usage for each one, and present a comparison in this issue. Then hopefully I can make a sensible choice based on that, and we can start building the first iteration.