Description
This video goes over recent weekly changes in this incubation group, and also includes details of the ClickHouse vs TimescaleDB benchmarks.
Benchmarking details - https://gitlab.com/gitlab-org/incubation-engineering/apm/apm/-/issues/4
Weekly demo issue - https://gitlab.com/gitlab-org/incubation-engineering/apm/apm/-/issues/11
Handbook page - https://about.gitlab.com/handbook/engineering/incubation/monitor-apm/
Hello there, Joe Shaw here, full stack engineer working in the incubation engineering department on application performance management: monitoring, observability, all that kind of stuff. Our aim is to build an observability stack into GitLab as part of the SaaS solution for GitLab. So we're looking at agent monitoring, collecting metrics, application performance metrics, and how we store them, query them, visualize them, all that kind of stuff. Here's my weekly demo update and the weekly issue that we create to track that; you can see that here.
So what have I been up to? I finished running benchmarks against ClickHouse and TimescaleDB. I've documented it, I've got the results in, so I'll show you those. That's in this issue here.
So just a quick recap on this: we're looking at ClickHouse because there are a number of other companies that have been using it for things like logs. Yandex, who I think created it, use it for a lot of really large scale analytics data for their sites.
It's horizontally scalable. This issue, which I've shown before in these videos, has a bit of background and a few links to some interesting blogs, like Uber's logging platform built on ClickHouse. We wanted to compare it against some other databases, primarily TimescaleDB, because we already run quite large Postgres instances.
So there's already a lot of familiarity there, and Timescale is built on top of Postgres. We wanted to see how it would compare, and whether it would be worth supporting a new database to do this, or whether it might be worth sticking with Postgres with Timescale, because Timescale would also give us that multi-model capability, where we could store other sorts of data in there, you know, traces and things like that, as well as the time series data specifically. (My cat's climbing around on some stuff back there.)
They didn't work out of the box initially, so I haven't put any more time into getting those ones running, but I will come back to that at some point, so that one's still left unchecked here. I added some more documentation here: the outline of the benchmark run and the sort of variables we're using. For example, the use case, cpu-only or devops (it's sort of DevOps data we'd be looking at), and the number of workers that we run for these tests.
So we've got a hundred hosts as an example here; those would be the sort of machines generating the metrics, and that results in just over 250 million individual metrics, just for the cpu-only case. You can see a hundred hosts is the base case that we're looking at there. It's quite a significant number of metrics.
We scale that up to 4,000 hosts, which, if I remember correctly, results in something like 10 billion metrics being generated and stored; that's roughly linear scaling, 40 times the 100-host base case's 250 million. So, you know, a significant number, but not impossible to get to. If you've got a lot of customers running lots of agents in, say, big clusters, you could easily get up to that sort of level fairly quickly.
The benchmark does a load run, where it loads all that data in and we store metrics as we go along, then runs all the different queries we generate, stops, cleans up, gets some aggregate metrics from Prometheus that we've stored, and then goes on to the next database, with some delays in between to let things settle down. We clean up volumes and things like that, so we're starting with completely fresh databases.
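As a rough illustration of that flow, here's a minimal sketch of the per-database loop in Python (the real orchestration is a bash script; the compose service names, wrapper scripts, and delay here are all assumptions):

```python
import subprocess
import time

DATABASES = ["clickhouse", "timescaledb"]  # targets under test (assumed names)

def run(cmd):
    # Helper: run a shell command and fail loudly if it errors.
    subprocess.run(cmd, shell=True, check=True)

for db in DATABASES:
    # Start the database (and monitoring) services for this target.
    run(f"docker compose up -d {db}")
    # Load phase: ingest the generated data set.
    run(f"./load_{db}.sh")            # hypothetical load wrapper
    # Query phase: run each query type against the loaded data.
    run(f"./run_queries_{db}.sh")     # hypothetical query wrapper
    # Collect aggregate container metrics from Prometheus before teardown.
    run("./collect_prometheus_metrics.sh")  # hypothetical
    # Tear down containers AND volumes so the next database starts fresh.
    run("docker compose down -v")
    # Let disk and CPU settle before the next run.
    time.sleep(60)
```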
In the background I've added some collapsed sections in here, for example the example schemas that we end up with for the devops stuff. I've outlined those here so people can have a look at them and see what we end up with. The one that we end up targeting most is the cpu one here, and you can see the schema there; there's a separate tag schema that's used in this benchmark too.
So this is a ClickHouse-specific one. The Timescale one's very similar, although in the background Timescale is storing lots of tables that are generated as segments, which are then turned into these hypertables that end up being the things you actually query. So you can see some examples.
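To make the shape of that concrete, here's a minimal sketch of what a cpu table with a separate tags table might look like in ClickHouse DDL, issued via the clickhouse-driver Python client (the table and column names are assumptions for illustration, not the exact schema from the issue):

```python
from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client("localhost")

# Separate tags table: one row per unique host/tag combination.
client.execute("""
CREATE TABLE IF NOT EXISTS tags (
    id         UInt32,
    hostname   String,
    region     String,
    datacenter String
) ENGINE = MergeTree() ORDER BY id
""")

# Metrics table: one row per reading, referencing the tags row by id.
client.execute("""
CREATE TABLE IF NOT EXISTS cpu (
    created_at   DateTime,
    tags_id      UInt32,
    usage_user   Float64,
    usage_system Float64,
    usage_idle   Float64
) ENGINE = MergeTree()
ORDER BY (tags_id, created_at)
""")
```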
There's a bit of example data, a snippet there, so you've got some fairly typical tags (hosts, regions, data centers), a bunch of CPU metrics there, some disk metrics there. It's in this sort of CSV-like format that we use; actually, I think it's the InfluxDB wire format this is stored in, because the time series benchmarking tool was created by InfluxDB first and has now been taken on by Timescale, as we've mentioned before.
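For reference, InfluxDB's line protocol looks roughly like this: a measurement name, comma-separated tags, then field values and a nanosecond timestamp (a made-up sample; the exact on-disk format TSBS emits per target database may differ):

```
cpu,hostname=host_0,region=eu-west-1,datacenter=eu-west-1b usage_user=58.1,usage_system=2.7,usage_idle=24.3 1451606400000000000
```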
My assumption, when running these tests against Timescale, is that most of the effort has been put into the Timescale implementation here, to make it stable and performant, so I'm not going to try and tune that, and I'm using the recommended way of setting up Timescale.
So, you know, I'm assuming that's going to be about as good as it gets with Timescale, whereas the other ones, like the ClickHouse one and the other databases, have been contributed, and I've had to make some patches and things like that to get them working properly. So we can assume there's perhaps some tuning that could be done there.
I've documented the queries as well that we run as part of the benchmarks, and put some descriptions in. So, for example, the cpu-max-all-1 query: get the maximum of all the CPU metrics for a particular random host in a random eight-hour segment, grouped by one hour. And I've put the queries in here, so you can actually see an example query that's running there.
It's grabbing everything from the cpu table there, saying where the host tag matches that particular host we've got, looking in that particular time range (an eight-hour time range), grouping by hour and ordering by hour. That's an example of one query, and I've documented a few of these below. The way the benchmark runs is to run each query type for each use case that we're testing.
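A minimal sketch of that kind of query in ClickHouse SQL, reusing the assumed table names from the schema sketch above (again illustrative, not the exact benchmark query):

```python
from clickhouse_driver import Client

client = Client("localhost")

# cpu-max-all style query: max of each CPU metric for one host,
# over an eight-hour window, bucketed into one-hour groups.
rows = client.execute("""
    SELECT
        toStartOfHour(created_at) AS hour,
        max(usage_user)   AS max_usage_user,
        max(usage_system) AS max_usage_system,
        max(usage_idle)   AS max_usage_idle
    FROM cpu
    WHERE tags_id = (SELECT id FROM tags WHERE hostname = 'host_42')
      AND created_at >= toDateTime('2016-01-01 08:00:00')
      AND created_at <  toDateTime('2016-01-01 16:00:00')
    GROUP BY hour
    ORDER BY hour
""")
for hour, max_user, max_system, max_idle in rows:
    print(hour, max_user, max_system, max_idle)
```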
So let's have a look at some of the results. We ran this a few times, mainly because we ran out of disk space, even after I'd done some tidying up in the scripts to try and minimize the impact of running the tests and save on some disk space.
We ended up having to increase the disk space there. It's a Google Cloud VM; we were running it on an n2-standard-16 VM, so 16 vCPUs and 64 gig of memory. That's kind of a recommended setup for a lot of these time series databases: lots of relatively low-powered CPU cores and a large amount of RAM to deal with the queries. We ended up giving it 500 gig of persistent SSD, as we want to try and get the best performance out of this.
They could all run on an attached typical spinning disk, but SSD is the recommended way to run these. Given the uname output there for Linux, this is an Ubuntu image, an Ubuntu LTS version, and you can see the kernel version. We run this using a bash script that orchestrates the benchmark, and it uses Docker Compose in the benchmarking tool that we've added to start the various services. Everything runs there, which makes it easy to get logs out of services and things like that.
We don't have to worry about running tmux sessions and things like that to capture logs. Four benchmark workers are used, because we don't want to create contention on those 16 CPUs; we want to dedicate most of the CPU time to the actual database.
So let's have a look. Metric rate when we're loading: higher is better. You can see that ClickHouse performs a lot better there in all of these cases. The use case being, say, cpu-only 100, that means the cpu-only case with a scale of 100, or 100 hosts if you like; that's the generated data set. Then 1000 and 4000 (that's the really big data set), and then devops, which is all the tables, not just the cpu one.
We can see how much storage these databases are using on disk, which is another important factor, and you can see ClickHouse really outperforms Timescale there, by almost a factor of 10 in all cases, so a significantly lower amount. You can see here (this is in bytes) the size is nearly 300 gig with Timescale, and much, much less space with ClickHouse.
So it turns out it was Timescale causing the benchmark tests to fail with the 200 gig disk, and that's why we had to increase it. Then we can start actually looking at these queries. What I've grabbed from the data set here is the 95th percentile latency, aggregated over the 1,000 queries that run for each of these query types. So you've got the actual test case here, so again cpu-only with 100 hosts, and here's the query name. Each of these has been run in a different session, but I've grouped them by query,
so we can see what's going on. This is all, by the way, generated using a Jupyter notebook with Python; the benchmark doesn't do this live. The benchmark data is stored on disk and I pull it from the VM, and there's a notebook that I've added to that TSBS repo that we forked that does this data analysis. We can quite easily change it to look at different percentiles; I could put, you know, the 50th in there as well, for example, and we might do that at some point.
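To give a feel for that analysis step, here's a sketch of the kind of pandas aggregation such a notebook might do (the file and column names are assumptions; the actual notebook lives in the forked TSBS repo):

```python
import pandas as pd

# One row per query execution, pulled from the benchmark output on the VM.
df = pd.read_csv("query_results.csv")  # hypothetical export

# 95th percentile latency per (database, use case, query type), aggregated
# over the 1,000 executions of each query type.
p95 = (
    df.groupby(["database", "use_case", "query_type"])["latency_ms"]
      .quantile(0.95)
      .unstack("query_type")
)
print(p95)

# Looking at a different percentile (say, the median) is a one-line change:
p50 = df.groupby(["database", "use_case", "query_type"])["latency_ms"].quantile(0.50)
```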
So again, ClickHouse performing really well there, all sort of 10 to 20 milliseconds (and this is in milliseconds), while Timescale goes over a second in a number of cases. I'll scroll down to a case where it doesn't perform that well, just so there is one: this group-by-order-by-limit is a more complex query, and the tables that I set up in ClickHouse aren't really optimized for this query; they don't have the sort of proper indexes on them.
However, having said that, while it is significantly larger, by an order of magnitude in some cases, it's not drastically slow, and it's certainly not over a second, whereas if you look at some of these Timescale ones where it's failed, for example the devops case here, ClickHouse is still clocking in at nearly three seconds, but Timescale is taking well over 20 seconds at the 95th percentile. That's, you know, five percent of queries worse, 95 percent better; for the vast majority of cases ClickHouse outperforms on these queries.
This is another one, this lastpoint one, where it's querying the very last point in a data set; that's a mixed bag. Some perform better, some perform worse, but again nothing drastically over a second, and when you're querying these quite complex time series data sets, running, you know, complex analysis,
you can expect things to take a while sometimes. But it tends to perform really well: again, better in the vast majority of cases, with sub-10-millisecond results, which is a really good result for this volume of data and the complexity of the queries, so yeah, very positive. And then, what we do during these tests, during the loading and during the actual query execution, is monitor via cAdvisor and store in Prometheus all the CPU stats
we can get for the containers, plus memory usage. So we plot here the CPU usage rate and memory usage in bytes, which is a metric that contains all the different lower-level memory stats, like caches, buffers, resident set size, things like that. And you can see the CPU usage rate in a lot of these plots is quite similar.
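Those plots are driven by standard cAdvisor metrics scraped into Prometheus; here's a sketch of pulling the same two series over Prometheus' HTTP API (the address, container label, and time window are assumptions):

```python
import requests

PROM = "http://localhost:9090"  # assumed Prometheus address

def query_range(promql, start, end, step="15s"):
    # Fetch a range of samples via Prometheus' HTTP API.
    resp = requests.get(
        f"{PROM}/api/v1/query_range",
        params={"query": promql, "start": start, "end": end, "step": step},
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

start, end = 1_600_000_000, 1_600_003_600  # placeholder benchmark window

# Per-container CPU usage rate (in cores) as exposed by cAdvisor.
cpu = query_range(
    'rate(container_cpu_usage_seconds_total{name="clickhouse"}[1m])', start, end
)
# Total memory usage in bytes (includes caches, buffers, RSS, etc.).
mem = query_range('container_memory_usage_bytes{name="clickhouse"}', start, end)
```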
What you'll see as I scroll down is that, while the CPU is fairly consistent between the two (often ClickHouse's average is a bit lower, but its variation, the standard deviation, seems to be a lot higher), the memory usage in bytes with ClickHouse is again a lot lower than Timescale's. Let me scroll down a bit more and you can see Timescale,
for example, this one here, getting well up to 60 gig of memory, which is getting close to the actual host machine's memory capacity, whereas ClickHouse stays well, well below that. Again, that devops 4,000 one is creeping above 60, so it may well be that in these cases the actual VM is under-provisioned for Timescale. Obviously, ClickHouse performs far better on this size of VM, and we've documented some limitations there, you know, workers and databases on the same machine.
It's not realistic, but, you know, it keeps the complexity of network latency out of these tests, and four workers are always used; it might be interesting to see if we change that, you know, try some other utilization. The tests only target cpu in most cases, which is a bit simplistic, but you can see it's a very good outcome for ClickHouse here.
Here's, for example, the sort of naive schema that we're going to put in there first and get working, just to see how this compares against the existing benchmarking. Because ClickHouse is not that well known, and I don't have any sort of production experience with it, I don't have a good intuition when it comes to designing the tables, as I might with a typical relational database. For example, it's a column-storage database, so the indexing and things like that work quite differently to, say, something like Postgres.
So I need to do a few experiments, just to see if what I'm doing makes sense from a performance point of view and I'm not doing something completely ridiculous. But luckily now, if I get this merged into that project, I can just run these tests again, and the analysis I can run really quickly.