Description
Weekly demo issue - https://gitlab.com/gitlab-org/incubation-engineering/apm/apm/-/issues/20

Hello there, Joshua here from the incubation engineering department. I'm a full stack engineer for the application performance monitoring and management solution.

So, following on from what we were doing last week, we're trying to get a test environment with automated deployments through, so that we can get a staging environment together with a good, hardened implementation with ClickHouse and the gateway in. That way we can get through the hurdles of getting it into production as well, because there's a certain amount of rigour required for that in GitLab, which is good, of course.

So I've been looking at how we set up ClickHouse generally: things like authentication, security, and also replication and sharding.

Replicated tables require ZooKeeper to be installed alongside ClickHouse, so that the ClickHouse replicas can communicate shared state, such as which tables should exist in the cluster. You then have to define a path within ClickHouse - actually an absolute path, as in this example - where it stores things like the shard information for that replica, and you have to do that within ClickHouse to make it work.
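
As a rough sketch (the table and column names here are only illustrative, not our real schema), the engine arguments of a replicated table are exactly that ZooKeeper path plus a replica name:

```sql
-- Illustrative only: the first argument is the ZooKeeper path used for
-- shared state, the second is the name of this replica.
CREATE TABLE metrics_local
(
    timestamp DateTime,
    name      String,
    value     Float64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/01/metrics_local', 'replica-01')
ORDER BY (name, timestamp);
```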

You can also use the Distributed table engine to create a sort of abstraction on top of those tables, and that allows you to just operate on the distributed table rather than having to synchronise and query the individual replicas, which makes it really easy to query against. It has a slightly different syntax; I'll show you what we've got here shortly. In terms of migrations and things, there's an example of what you might call a local replicated table.

It's created on a specific cluster using a ReplicatedMergeTree, so it gets replicated across each replica in that cluster, and then we can also create a Distributed table engine which mirrors that metrics_local table and allows us to span queries across those tables. That's really useful: all we have to do then is query this one metrics table rather than distribute queries across all the replicas manually.
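
A minimal sketch of that pairing, assuming the local table above and an example cluster name (the `{shard}` and `{replica}` placeholders are the macros described next):

```sql
-- Distributed table that mirrors metrics_local and fans queries out
-- across every replica in the cluster.
CREATE TABLE metrics ON CLUSTER 'my_cluster' AS metrics_local
ENGINE = Distributed('my_cluster', currentDatabase(), metrics_local, rand());

-- Reads and writes then go through the one distributed table.
SELECT count() FROM metrics;
```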

To help us do that, the Altinity ClickHouse operator that we've been using provides macros, which are defined as part of ClickHouse's configuration, and it manages those. So as you increase the number of replicas or shards, the operator keeps these macros up to date for whenever you're creating a local table.
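
Those macros end up in the generated ClickHouse configuration, roughly of this shape (the values here are made up; the operator generates the real ones per replica):

```xml
<!-- Sketch of the per-replica macros the operator maintains. -->
<clickhouse>
    <macros>
        <installation>apm</installation>
        <cluster>my_cluster</cluster>
        <shard>0</shard>
        <replica>chi-apm-my-cluster-0-0</replica>
    </macros>
</clickhouse>
```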

It makes the table specific to the installation, to the cluster we're running in, to the specific shard, database and table, and to the name of the replica, so that works. Then we can create the distributed table on top of that, and we do that with ours as well. Notice, however, that within ours, in that creation, we don't actually have to specify those paths, because we've configured a set of default properties for our installation, so that we don't have to keep repeating those same keys over and over again. You can see that if we have a look at the values file - we're deploying all of this with Helm - so have a look at the values file here: in the ClickHouse section we've got this extra defaults XML file that gets merged in, and you've got that default replica path.
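
That extra file is just ClickHouse server configuration that gets merged in; a minimal sketch of the two defaults (the path layout is illustrative):

```xml
<!-- Sketch: with these defaults set, ReplicatedMergeTree tables can be
     created without spelling out the ZooKeeper path and replica name. -->
<clickhouse>
    <default_replica_path>/clickhouse/tables/{shard}/{database}/{table}</default_replica_path>
    <default_replica_name>{replica}</default_replica_name>
</clickhouse>
```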

You can see there that it will get used for any replicated tables, along with the default replica name; again, that's a macro that gets expanded depending on which node you're on. The other element of this which is important to get working is migrations. I'll just say, before I move on to migrations, that we are not worrying too much about sharding at the moment. Sharding becomes necessary when the amount of data we've got outgrows a particular server.

I don't really need to worry about that just now; it's something I need to monitor, obviously. But replication is useful because it effectively gives you stability: you've got effectively a backup running in the cluster, so if anything goes down you can load balance onto the other server, and things like that. So that's useful. The other thing we've been doing, as well as getting that working, is working out how we do database migrations with ClickHouse.

Again, it's important to have migrations in place, because we don't want to have to go into any environment and manually run scripts against databases to upgrade them, and things like that. That's not very good practice, and ideally we want a seamless upgrade: whenever we upgrade via the Helm chart, the migrations get run and the application keeps working. ClickHouse doesn't have as mature a migrations and management ecosystem as, say, Postgres does.

In GitLab we already use the Rails migrations system to migrate the database - there's a lot of work going into that and it's very mature - but it's not really applicable to what we're doing here; we need our own strategy. So we've picked up this tool called go-migrate (golang-migrate), which I've seen various recommendations for. It's both a CLI and a library, written in Go, that will do database migrations for you.

It supports a number of different migration sources, as you can see here; we're just using file system sources, and our migrations get stored in a ConfigMap that goes in with the release. You can see the basic CLI usage there: you give it the path to the migration files, point it at a database, and you can bring them up. Likewise, you can roll them back, and that's proved successful.
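
Roughly what that CLI usage looks like (the connection string here is just an illustrative placeholder, not our actual endpoint):

```shell
# Apply all pending migrations from the ./migrations directory.
migrate -path ./migrations -database 'clickhouse://localhost:9000?database=default&x-multi-statement=true' up

# Roll the most recent migration back.
migrate -path ./migrations -database 'clickhouse://localhost:9000?database=default&x-multi-statement=true' down 1
```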

It's taken a little bit of time to get that working, so as part of the branch we're working on we've got the migration strategy documented here as well, with some links to the various things we're using. The migrations have a particular name format: they've got a date-time stamp in there, so that if there are multiple people working on them, they shouldn't clash.
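
For example, a pair of up/down migration files might be named like this (hypothetical names, following the timestamp-prefixed convention):

```
20230116120000_create_metrics_distributed.up.sql
20230116120000_create_metrics_distributed.down.sql
```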

I showed you an example of some of those that we can see in this project: for example, you've got the migration creating the distributed table there, and you've also got the down migration for that, which drops the table on the cluster. We've got some tests for those as well, and we're going to put in continuous integration that, every time we push, will do a full up-and-down migration to test that it all hangs together.
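
As a sketch, the down migration for the distributed table is essentially just a cluster-wide drop (table and cluster names illustrative):

```sql
-- Down migration: remove the distributed table from every node in the cluster.
DROP TABLE IF EXISTS metrics ON CLUSTER 'my_cluster';
```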

The way that we actually get these migrations running is via Helm hooks, and in here there's a brief description of how that works. On install, we use a post-install hook to run the migrations; on upgrade, we use a pre-upgrade hook to run effectively the same script. So whenever we are rolling out a new version of this Helm chart, it gets the latest set of migrations baked into it - we effectively just copy them into the chart.
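
A minimal sketch of how such a hook can be wired up in a chart (the hook annotation values are standard Helm hook names; the rest - job name, image, paths, database URL - is illustrative, not our actual template):

```yaml
# Sketch: a Job template annotated so Helm runs it as a migration hook.
apiVersion: batch/v1
kind: Job
metadata:
  name: clickhouse-migrations
  annotations:
    "helm.sh/hook": post-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: migrate
          image: migrate/migrate   # illustrative image
          args: ["-path", "/migrations", "-database", "$(DATABASE_URL)", "up"]
```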

The problem here is that the rollback migration didn't work as I expected, and, long story short, I had to be quite inventive about how that should work, so that effectively the migrations take us back that way. That was a bit awkward to work out; I was hoping I'd be able to use rollbacks in a much simpler way, but unfortunately it's not implemented that way, and there are a couple of issues around that that have been created. So I'll just do a quick demo of that. We've got a scaffold to run against our cluster, so let's just do a run there, and if I open k9s you'll start seeing the pods coming up.

There's the ClickHouse cluster highlighted there; that's coming up, along with various services that are starting as that cluster comes into existence. One of them is the migrations pod here. It's getting a CrashLoopBackOff initially because the host doesn't exist yet, but fortunately we're in Kubernetes, so it handles that nicely: it will do a number of restarts until it can connect to the cluster and run the migrations. You can see various things coming online, and our cluster is there running.

As I was telling you earlier, you can see in this pod the macros that have been set up that are specific to our replicas: the replica identification values there, the shard, the actual name of the replica, for example, and the ZooKeeper instance that we've connected to.

If we go back to the pods, we should see that on the third attempt there the migration worked. Let's have a look at the logs - there you go: it didn't find any applied migrations, then it ran them all and got migrated to that particular version. If you look back at the ConfigMaps, this is the one that we'll be looking at, and this is a ConfigMap that was saved.

It says it migrated from nothing - because this was a fresh install - and that it migrated to the specific version. If we were to do subsequent migrations, we'd get more of these ConfigMaps with that basic metadata in. I've tested it a few times and it seems to do the job, so that's quite useful.

So what else is there? A couple of useful links that a few people have pointed me to. Firstly, there's a Jaeger ClickHouse plugin for storage, so I will have a look at that at some point. It's got a schema definition for storing spans and traces for the Jaeger interface, so that looks useful. Another one is this Vector project by Datadog, which I've been having a look at as well.

It looks like quite an interesting way of building data pipelines for observability data. I think you can think of it as sort of like Fluentd, except that Fluentd is just looking at log data.

Vector looks at any sort of observability data, with a very pluggable architecture: there are lots of different data sources, transformation mechanisms, and sinks, and as part of the sinks you've got things like Datadog metrics, for example. So if anyone were using this, they could in theory put our endpoint into the configuration and ship metrics from different systems into our system, which is great. So that's a very interesting project.
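
Just to illustrate the shape of a Vector pipeline (the sink endpoint here is a made-up placeholder, not our gateway, and the exact sink type would depend on the protocol we end up exposing):

```toml
# Sketch: scrape host metrics and ship them to a remote-write endpoint.
[sources.host]
type = "host_metrics"

[sinks.apm_gateway]
type     = "prometheus_remote_write"
inputs   = ["host"]
endpoint = "https://apm.example.test/api/v1/write"   # placeholder endpoint
```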

Okay, up next I'm going to continue trying to get this testing infrastructure working in the test environment, before I get it into a staging and hopefully a production environment before long. So I'll continue iterating on that. Okay, that's all from me - goodbye.