Description
Incubation Engineering APM weekly issue - https://gitlab.com/gitlab-org/incubation-engineering/apm/apm/-/issues/13
Hello there, Joe Shaw here, incubation engineer in the Incubation Engineering department at GitLab, working on Application Performance Management — monitoring and observability — to try and create our own observability stack within GitLab. This is my weekly demo update.
If you're viewing this from the issue, you can see it in front of you, and you can subscribe to the weekly list of videos in this issue; each time I put a video up, I'll add a new related link. So this week I wanted to mostly focus on the metrics schema work, where I was trying to build a generic schema to capture observability metrics in ClickHouse.
While doing that, I quickly realized I probably wouldn't make enough progress to show anything in this demo, so I switched back to running some more benchmarks, which I was going to do anyway, with MongoDB and CrateDB.
Previously we benchmarked ClickHouse against TimescaleDB, because that was an obvious competitor for us: a time-series database that is multi-modal and flexible, and fits with what GitLab is already doing with Postgres.
Out of the selection that I identified as part of the Time Series Benchmark Suite that we're using, MongoDB and CrateDB would be other candidates that fit that profile as well. So, as tracked in this issue, I tried to run those benchmarks.
As previously with TimescaleDB and ClickHouse, things didn't work out of the box with the Time Series Benchmark Suite quite how I wanted them to, but I managed to fix things up. I have previous experience with MongoDB, so I was able to patch that up and get it running; with CrateDB, unfortunately, I wasn't.
There were parts of the Golang interface implementation that were just missing, and various scripts that were missing too. So I've decided to drop CrateDB, and unless there's any particular objection, I won't be going back to it.
So I carried on with MongoDB, and it runs the same as last time, so I've linked back to the previous benchmarks. We're running on a VM in Google Cloud with 16 CPUs and 64 GB of RAM, using the cpu-only test suite, because while the devops suite does put more stress on the system, it's all relative.
The devops suite takes a lot longer to run, and we found that the cpu-only case was a reasonable subset of the full devops cycle, so to try and speed things up I ran with that.
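For reference, the cpu-only run is driven by the Time Series Benchmark Suite tooling. This is a rough sketch of the kind of invocation involved; the flag values (scale, time range, query type, worker count) are illustrative assumptions rather than the exact parameters of this run, so check the TSBS README for the precise flags:

```shell
# Generate one day of cpu-only data for 100 simulated hosts (illustrative scale)
tsbs_generate_data --use-case=cpu-only --seed=123 --scale=100 \
  --timestamp-start=2023-01-01T00:00:00Z \
  --timestamp-end=2023-01-02T00:00:00Z \
  --log-interval=10s --format=clickhouse | gzip > /tmp/clickhouse-data.gz

# Load it into a local ClickHouse server
gunzip < /tmp/clickhouse-data.gz | tsbs_load_clickhouse --host=localhost --workers=8

# Generate and run a batch of queries (double-groupby-1 as an example query type)
tsbs_generate_queries --use-case=cpu-only --seed=123 --scale=100 \
  --timestamp-start=2023-01-01T00:00:00Z \
  --timestamp-end=2023-01-02T00:00:01Z \
  --queries=1000 --query-type=double-groupby-1 --format=clickhouse \
  | gzip > /tmp/clickhouse-queries.gz

gunzip < /tmp/clickhouse-queries.gz | tsbs_run_queries_clickhouse --host=localhost --workers=8
```

The equivalent `tsbs_load_mongo` / `tsbs_run_queries_mongo` binaries cover the MongoDB side of the comparison.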
So here are the results for ClickHouse against MongoDB. The metric ingest rate when loading the database is much higher; MongoDB might seem pretty fast on its own, but ClickHouse still managed to beat it there.
These are the 95th-percentile latencies for the queries we're running, and you can see ClickHouse performs better initially, then a lot better, and better still as the queries get more and more complex. In this case here — the group-by / order-by / limit-1 query, which is a very complex query — the MongoDB version actually timed out.
I couldn't get any results out of it for that, and you can see the vast difference there. Again, similar there, and similar there. There are a couple of cases — again, single group-bys, very simple queries — where MongoDB actually outperformed ClickHouse, but we're talking sub-10-millisecond latencies in those cases anyway. So for most of the queries, ClickHouse performs a lot better.
We couldn't run the cpu-only 4,000-host test, which generates data for 4,000 simulated hosts. MongoDB would load the data, but querying it came to an absolute standstill, and I think the memory on the host was the issue. Jumping down to CPU and memory usage:
Those are kind of as expected. ClickHouse uses more CPU, but it manages to utilize the server resources a lot better: it's using all the cores to do its work. I found when monitoring it that MongoDB was using far fewer cores, and I don't know if that's a setup issue that I've caused, although having done some reading, I do think it's something to do with the type of queries being used and their complexity.
MongoDB stores the tags and the measurements in one record, in a kind of strange format where there are lots of empty values, which appear to be the gaps between them; I guess the idea is that you can then just index into this events list. It does make the document very large, though, and it's clearly not having the best effect on the queries in use there. So it's not the best solution, and I think we can safely say that we can rule out MongoDB.
For the time being, anyway. You can see that the memory usage of ClickHouse is lower there as well. So that's great, I'm happy with that, and I'm happy to move on now and stop with this benchmarking, unless I see any comments or anything that indicates I've missed something obvious. I'll list some limitations here: it's an old MongoDB driver, and, like I say, it doesn't utilize all the VM cores.
But my assumption has to be that the people who created the MongoDB benchmarks, and the document format for them, were probably better suited to that than I am, and would be better at building a benchmark for it. So I have to assume, and trust, that it's going to be as good as anything I could ever create, and I need to focus on getting my ClickHouse implementation done. So, back to the issue: I'm still working on this metrics schema.
I've got an ongoing merge request for that. I started with a naive schema, and I'm building the implementation for it just so I can see how it performs in the most basic case, before I go on to understand how to refine it and make the queries work better. It might have been that the naive schema worked fine, but my little tests so far have shown that it doesn't, anyway.
So that's fine. On our previous weekly issue we got a very helpful comment where someone was suggesting improvements to the naive schema, which is really useful. So one thing I need to do is double-check the documentation for the various things that have been suggested there, like custom codecs and the cardinality of measurements.
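For context, ClickHouse column codecs are declared inline in the table DDL. A minimal sketch of the kind of thing being suggested — the table and column names here are illustrative assumptions, not the schema from the merge request:

```sql
-- Illustrative only: DoubleDelta tends to suit monotonically increasing
-- timestamps, while Gorilla suits slowly changing float samples; both can
-- be chained with a general-purpose codec such as ZSTD.
CREATE TABLE metrics_codec_sketch (
    ts    DateTime64(9) CODEC(DoubleDelta, ZSTD),
    name  LowCardinality(String),
    value Float64 CODEC(Gorilla, ZSTD)
)
ENGINE = MergeTree
ORDER BY (name, ts);
```

Whether these codecs actually help depends on the shape of the data, which is exactly what the cardinality questions above are getting at.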
I looked through some issues, for example around the primary key, and I figured I should sort out the primary key and ORDER BY clause here, at least for the naive schema. So I went ahead and used that suggestion to try and improve its performance in the most basic case, so that it wasn't just really bad for no good reason, and I'll look into some of the other aspects of the table design.
What I intend to do is, rather than having a record created per field in a measurement, have all the measurements end up in the same record using arrays, which is a design I've seen used in a lot of other systems built on ClickHouse as well. So I'm going to have a go at that.
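To make the per-field versus per-record distinction concrete, here is a minimal ClickHouse DDL sketch of the two shapes. The table and column names are illustrative assumptions, not the schema from the merge request:

```sql
-- Naive shape: one row per individual metric value
CREATE TABLE metrics_naive (
    ts    DateTime64(9),
    name  LowCardinality(String),   -- metric name, e.g. 'cpu.usage_user'
    tags  Map(String, String),      -- label set identifying the series
    value Float64
)
ENGINE = MergeTree
ORDER BY (name, ts);                -- sorting key doubles as the primary key

-- Array shape: one row per measurement, with all of its fields
-- packed into parallel arrays
CREATE TABLE metrics_arrays (
    ts           DateTime64(9),
    measurement  LowCardinality(String),   -- e.g. 'cpu'
    tags         Map(String, String),
    field_names  Array(LowCardinality(String)),
    field_values Array(Float64)
)
ENGINE = MergeTree
ORDER BY (measurement, ts);
```

The array shape trades row count for wider rows: a measurement with ten fields becomes one row instead of ten, at the cost of needing array functions (or ARRAY JOIN) at query time.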
Coming back to the weekly issue: I also stumbled across the ClickHouse YouTube channel, which I didn't know about. It has some quite useful introductory videos, and some useful in-depth videos as well.
So I've been going through a bit of that, which is quite handy: it's more of an introductory approach to learning it, rather than going straight to the documentation, which I have been looking at but which isn't as easy to get into. And the "up next" is basically the same as it was last week.