Description
APM Single Engineer Group handbook page - https://about.gitlab.com/handbook/engineering/incubation/monitor-apm/
Weekly demo issue - https://gitlab.com/gitlab-org/incubation-engineering/apm/apm/-/issues/36
As a quick recap, what we're looking at is how we store various APM data sources inside ClickHouse, so that we can use ClickHouse as a unified backend for the APM data. We've made some good progress on that with metrics, and what we want to do next is move on to logs and how we can go about storing those; we've found some decent articles in the past about how we might do that.
From last time, we've merged in the project and environment integration, whereby you can configure the Datadog agent, which is the only agent we're supporting at the moment for the sake of simplicity, with a GitLab API key, a project ID and, optionally, an environment ID. The gateway will authenticate and authorize those data sources, and in our ClickHouse backend we then store the relevant project ID and optional environment ID.
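To make that concrete, here is a minimal sketch of what the gateway's authorization step amounts to, assuming a plain HTTP middleware; the interface, the header names used for the project and environment IDs, and the handler names are placeholders of mine, not the actual gateway code.

```go
// Hypothetical sketch only; the interface and header names are placeholders,
// not the real gateway implementation.
package gateway

import (
	"context"
	"net/http"
)

// Authorizer checks a GitLab API key against a project and optional environment.
type Authorizer interface {
	Authorize(ctx context.Context, apiKey, projectID, environmentID string) error
}

// RequireAuth rejects payloads whose credentials don't resolve to a valid
// project (and, if given, environment) before anything is written to ClickHouse.
func RequireAuth(auth Authorizer, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		apiKey := r.Header.Get("DD-API-KEY")                      // API key header sent by the Datadog agent
		projectID := r.Header.Get("X-Gitlab-Project-Id")          // placeholder name
		environmentID := r.Header.Get("X-Gitlab-Environment-Id")  // placeholder name, may be empty

		if apiKey == "" || projectID == "" {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		if err := auth.Authorize(r.Context(), apiKey, projectID, environmentID); err != nil {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		// Downstream writes can now tag each row with the authorized project ID
		// and optional environment ID.
		next.ServeHTTP(w, r)
	})
}
```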
One of the last things we had to do for this merge request was to go through a bunch of testing. We have integration tests for it and a lot of unit tests, but I also wanted to do some manual testing, just for my own sanity, to make sure things were working as expected. So here's the test matrix we've got, based on the validity of the API key and its role,
whether the project exists, is invalid or is valid, whether the environment exists, is invalid or is empty, and what we should expect to see from the API when the agent is talking to our gateway: whether we get responses like 403s, or whether we see the data being stored in the ClickHouse database. That all checked out, so that's merged.
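That same matrix translates naturally into a table-driven test. The sketch below is purely illustrative, with made-up cases, expected statuses and a stubbed sendSeries helper, not the project's actual integration tests.

```go
// Illustrative table-driven version of the manual test matrix; the cases,
// expected statuses and the sendSeries stub are placeholders.
package gateway_test

import (
	"net/http"
	"testing"
)

func TestIntakeAuthorizationMatrix(t *testing.T) {
	cases := []struct {
		name        string
		apiKey      string // valid / invalid / insufficient role
		projectID   string // valid / invalid
		environment string // valid / invalid / empty
		wantStatus  int
		wantStored  bool
	}{
		{"valid key, valid project, no environment", "valid-key", "1", "", http.StatusAccepted, true},
		{"valid key, valid project and environment", "valid-key", "1", "production", http.StatusAccepted, true},
		{"valid key, invalid project", "valid-key", "999", "", http.StatusForbidden, false},
		{"valid key, invalid environment", "valid-key", "1", "missing", http.StatusForbidden, false},
		{"invalid key", "bad-key", "1", "", http.StatusForbidden, false},
		{"key with insufficient role", "guest-key", "1", "", http.StatusForbidden, false},
	}

	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			status, stored := sendSeries(t, tc.apiKey, tc.projectID, tc.environment)
			if status != tc.wantStatus {
				t.Fatalf("status = %d, want %d", status, tc.wantStatus)
			}
			if stored != tc.wantStored {
				t.Fatalf("stored in ClickHouse = %v, want %v", stored, tc.wantStored)
			}
		})
	}
}

// sendSeries would post an agent payload through the gateway and report the
// HTTP status plus whether a row landed in ClickHouse; stubbed here.
func sendSeries(t *testing.T, apiKey, projectID, environment string) (int, bool) {
	t.Helper()
	_ = apiKey + projectID + environment
	return http.StatusAccepted, true
}
```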
I also found that newer versions of the Datadog agent enable compression by default; in this case it's deflate compression using zlib, though gzip compression can also be enabled. So I put in a relatively quick fix to handle that, and it also let us add something we'd missed: the limits the Datadog API specifies, a 3.2 MB limit on normal, non-compressed payloads and a 62 MB limit on the decompressed body size of compressed payloads. It's a relatively simple set of changes, I've got tests for them, and that's merged in, so the agents now work fine with compression enabled.
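The change amounts to picking the decompressor from the Content-Encoding header and enforcing the two limits. Here's a sketch under my own naming, with the 3.2 MB and 62 MB figures mentioned above, not the exact gateway code.

```go
// Sketch of Content-Encoding handling with the Datadog API limits applied;
// not the exact gateway code.
package gateway

import (
	"compress/gzip"
	"compress/zlib"
	"fmt"
	"io"
	"net/http"
)

const (
	maxUncompressedBytes = 3_200_000  // 3.2 MB limit on plain, non-compressed payloads
	maxDecompressedBytes = 62_914_560 // 62 MB limit on the decompressed size of compressed payloads
)

// readBody returns the (possibly decompressed) request body, enforcing both limits.
func readBody(w http.ResponseWriter, r *http.Request) ([]byte, error) {
	switch enc := r.Header.Get("Content-Encoding"); enc {
	case "":
		// Plain payloads: cap at 3.2 MB.
		return io.ReadAll(http.MaxBytesReader(w, r.Body, maxUncompressedBytes))
	case "deflate", "gzip":
		var (
			zr  io.ReadCloser
			err error
		)
		if enc == "deflate" { // zlib, the default in newer Datadog agents
			zr, err = zlib.NewReader(r.Body)
		} else {
			zr, err = gzip.NewReader(r.Body)
		}
		if err != nil {
			return nil, err
		}
		defer zr.Close()
		// Compressed payloads: cap the decompressed size at 62 MB.
		data, err := io.ReadAll(io.LimitReader(zr, maxDecompressedBytes+1))
		if err != nil {
			return nil, err
		}
		if len(data) > maxDecompressedBytes {
			return nil, fmt.Errorf("decompressed payload exceeds %d bytes", maxDecompressedBytes)
		}
		return data, nil
	default:
		return nil, fmt.Errorf("unsupported Content-Encoding %q", enc)
	}
}
```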
So that's done. The main diversion we ended up taking came from problems with my test environment. A lot of the work I've been doing with it, because I hadn't had time to fully automate things, involved manual deployments via the Helm chart, tweaking things and testing that way, and it was becoming quite slow to deal with. I also noticed that when I tore down the ClickHouse database with replicas and tried to rebuild it, I just couldn't get it to work.
That leads on to the next merge request, which is pretty much ready to merge now; I just need to do a little more testing. With it, the environment automation is set up in a much better way: we've got a Helm setup for the test environment that we override with values hidden in CI variables and the like. One of the key things this does
is that the migrations themselves no longer rely on a particular database existing. It's all configuration-driven now, which means we can have one ClickHouse cluster, potentially with multiple shards or replicas, and different databases for different environments, whereas before it assumed a single database to connect to, an "apm" database.
That makes it much more flexible and means we can have temporary environments without creating whole new ClickHouse clusters, which is good. We also now have a GKE (Google Kubernetes Engine) deployment automated from the CI pipeline, which we didn't have before; I was doing that myself, so that's much better. And we've wrapped up the migrate CLI that we were using. We're using a pre-existing migration tool,
a Golang tool, go-migrate, and just to make things easier and to share a lot of the settings, we've wrapped it in our own CLI. That way, when we build things like the ClickHouse DSNs, the connection strings, we can share them between the tools, and it has a bunch of default settings. As I'll show, quite a few changes were required there, and one big change, as I mentioned, is that the migrations no longer create the database, which is good.
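In outline, the wrapper just builds the ClickHouse DSN from shared, environment-driven configuration and hands it to golang-migrate. The sketch below is a simplification with placeholder config fields and environment variable names, not the real CLI.

```go
// Simplified sketch of wrapping golang-migrate so the ClickHouse DSN comes
// from shared, environment-driven config; field and variable names are placeholders.
package main

import (
	"fmt"
	"log"
	"net/url"
	"os"

	"github.com/golang-migrate/migrate/v4"
	_ "github.com/golang-migrate/migrate/v4/database/clickhouse"
	_ "github.com/golang-migrate/migrate/v4/source/file"
)

// Config is shared between the gateway and the migration CLI so both build
// the same connection strings.
type Config struct {
	Host     string
	Port     int
	Database string // per-environment database, no longer hard-coded to "apm"
	Username string
	Password string
}

// DSN builds a golang-migrate compatible ClickHouse URL from the config.
func (c Config) DSN() string {
	q := url.Values{}
	q.Set("username", c.Username)
	q.Set("password", c.Password)
	q.Set("database", c.Database)
	q.Set("x-multi-statement", "true")
	return fmt.Sprintf("clickhouse://%s:%d?%s", c.Host, c.Port, q.Encode())
}

func main() {
	cfg := Config{
		Host:     os.Getenv("CLICKHOUSE_HOST"), // illustrative env vars
		Port:     9000,
		Database: os.Getenv("CLICKHOUSE_DATABASE"),
		Username: os.Getenv("CLICKHOUSE_USER"),
		Password: os.Getenv("CLICKHOUSE_PASSWORD"),
	}

	m, err := migrate.New("file://migrations", cfg.DSN())
	if err != nil {
		log.Fatal(err)
	}
	// Apply any pending migrations; the database itself must already exist.
	if err := m.Up(); err != nil && err != migrate.ErrNoChange {
		log.Fatal(err)
	}
}
```

Because the DSN comes from configuration, pointing a temporary environment at its own database on the shared cluster is just a matter of changing the database value.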
The database is bootstrapped first, before the migrations run. For example, here's the config map in the Helm deployment that creates a database based on a configured value, and that runs once the first replica is set up. This caused a lot of problems and was one of the key issues when I restarted or tore down the cluster and started again, because previously we'd had this in the migrations.
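In effect, the bootstrap step just runs a CREATE DATABASE IF NOT EXISTS for the configured name before any migrations. Below is a minimal illustration using the clickhouse-go client; the real setup does this from the Helm config map, and the address and variable names here are placeholders.

```go
// Minimal illustration of bootstrapping the per-environment database before
// migrations run; the real deployment does this via a Helm config map.
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/ClickHouse/clickhouse-go/v2"
)

func main() {
	conn, err := clickhouse.Open(&clickhouse.Options{
		Addr: []string{os.Getenv("CLICKHOUSE_ADDR")}, // e.g. "clickhouse:9000"; illustrative
	})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Database name comes from trusted, per-environment configuration.
	db := os.Getenv("CLICKHOUSE_DATABASE")
	query := fmt.Sprintf("CREATE DATABASE IF NOT EXISTS %s", db)
	if err := conn.Exec(context.Background(), query); err != nil {
		log.Fatal(err)
	}
}
```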
So that's a lot simpler now. And just to quickly show you, here's the test deployment job, and it's very simple because we already have a Helm chart that we work with. We have a test-env override file that has a set of defaults for that environment and the ClickHouse setup for it; there are no secrets in there or anything like that.
That's all hidden and pre-configured within the cluster, so there are references to existing secrets, and we override a kubeconfig file with protected CI environment variables, along with things like the Google application credentials, which are hidden in CI. So in theory it's all nice and secure, we can reuse this process for the other deployment tiers, it's linked to a test environment, and so on. So that's nice and simple.
It's easy enough to reason about, so that was that. The last thing I wanted to do was point out quite a nice article I saw on the PostHog blog about ClickHouse materialized columns and some performance optimizations. I've seen a lot of this material presented elsewhere, but it's a nice post wrapping it all up, looking at how you set up materialized columns and also how you use ClickHouse flame graphs. A flame graph is a visualization I've used before when debugging things like Go programs, but I'd never known about, or been able to use, that kind of tooling within a database engine.
That's really nice and would be very useful. The article then goes into adding the materialized columns, and this is something I think would be very useful for us with our metric schema: we'll likely want to add materialized columns for array attributes that are used frequently, for example.
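For reference, a materialized column is an extra column whose value ClickHouse computes from existing data at insert time, so hot attributes don't have to be re-extracted on every query. Here's a minimal sketch of what that could look like, with a made-up table name and extraction expression rather than our actual metric schema.

```go
// Made-up example of adding a materialized column for a frequently queried
// attribute; the table name and extraction expression are placeholders.
package main

import (
	"context"
	"log"

	"github.com/ClickHouse/clickhouse-go/v2"
)

func main() {
	conn, err := clickhouse.Open(&clickhouse.Options{
		Addr: []string{"localhost:9000"}, // illustrative address
	})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// The column is populated from the existing attributes payload at insert
	// time, so queries filtering on it avoid re-parsing the raw data each time.
	const ddl = `
		ALTER TABLE metrics
		ADD COLUMN IF NOT EXISTS host_name String
		MATERIALIZED JSONExtractString(attributes, 'host')`
	if err := conn.Exec(context.Background(), ddl); err != nil {
		log.Fatal(err)
	}
}
```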
I thought it was a nice article, and that's it for me. I'm off next week, but after that I'll be looking more closely at logging and the Loki implementation, and potentially doing a bit of benchmarking with that. So that's it from me for now. Thank you very much.