From YouTube: Clickhouse proof of concept for applied ML
A: Hello and welcome to this video. Today we're going to be talking about our journey to create services that talk to ClickHouse, getting us up and running in Kubernetes. My name is Stephen Brainer; I'm a senior back-end engineer with the Applied ML team.
A: The goal was an end-to-end flow: Python and Golang services writing into ClickHouse, with the results visualized in other software (we use Grafana for this), so we could get an end-to-end picture of how the whole ClickHouse process would work when deployed in Kubernetes in a simplified fashion, and then export what we learned to our existing applications.

Problem statement: we need to move some of our data off Postgres, because we wanted better throughput for our analytics.
A: Postgres is a great database. It's wonderful for CRUD-based operations and your standard OLTP workloads; online transactional processing is what that acronym stands for, if you use those terminologies. But we want to move towards more of an analytics database that we could use for getting distributions of model performance, doing analytics, and a series of other things.
A: So for that we decided to use ClickHouse. ClickHouse has some wonderful benefits in that it is a column-oriented database instead of a row-oriented database, which allows for really, really fast processing of analytics and ad hoc queries.
A
It
has
an
incredibly
rich
table
engine
which
we
will
not
be
covering
in
depth
in
this
meeting,
which
allows
you
to
be
incredibly
flexible
with
how
you
process
your
queries
in
that
way,
it's
very
good
for
that.
It
is
not
good
as
a
transactional
processing
database
similar
to
postpress.
So
if
you
have
to
do
a
bunch
of
updates
click,
us
is
not
your
jam,
wouldn't
recommend
it
for
that,
there's
also
a
series
of
really
well
supported
baked
in
functions.
B: So how did we actually iterate? Well, before anything else, we did some research.
A: Yeah, so the first step was spinning up a ClickHouse instance on Kubernetes. We leveraged the Altinity Kubernetes operator for this, which makes the whole process significantly easier. Instead of having to maintain all of the Kubernetes manifests yourself, it provides you a templated system which allows you to spin it up. I would highly recommend that anyone spinning up ClickHouse on Kubernetes use that operator, as opposed to most any other approach.
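For reference, the operator turns the whole deployment into one small custom resource that it expands into StatefulSets, Services, and ConfigMaps. A minimal sketch (the resource name and namespace here are hypothetical, not values from the talk or the repo):

```yaml
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "poc-demo"          # hypothetical name
  namespace: "clickhouse"   # hypothetical namespace
spec:
  configuration:
    clusters:
      - name: "demo"
        layout:
          shardsCount: 1    # single-node layout for a PoC
          replicasCount: 1
```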
A: We tried a number of different database engines and table engines to create a database of dogs that tracked things like breed, height, and weight, and then a service that generated random data and just spammed ClickHouse to see if we could hit caps for inserts, which we were able to do; more on that in the documentation attached to the repository. Once we finished that, we were able to visualize all the data in Grafana, using the ClickHouse plugin and a pretty rudimentary knowledge of SQL.
C: This is an example dashboard; we're going to come back and talk through how it got here. Let's start with our Kubernetes manifest. This is the Kubernetes manifest that defines the Grafana dashboard and the underlying tools you need. There's a ConfigMap to define all the plugins we need.
C
We
are,
of
course,
leveraging
click
house
in
this
demo,
so
the
click
house
plugin,
is
in
here
that
dashboard,
you
saw
we're
going
to
see
again
in
a
moment,
is
related
to
click
house
data
quality
testing.
There's
a
persistent
volume
claim
which
allows
you
to
store
data
for
your
grafana
dashboards
super
useful.
If
you
want
to
ever
visualize
it
again.
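The PersistentVolumeClaim behind Grafana's storage is a small piece of YAML; a generic sketch (the name and size are placeholders, not the repo's actual values):

```yaml
# PVC backing Grafana's data directory, so dashboards survive pod restarts.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-storage   # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi        # placeholder size
```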
C: Lifecycle failure probes and resource limits are currently not fully set. And this is the Service that allows us to visualize and look at the dashboard itself. You can get this set up on your local machine if you connect to a cluster. You'll notice here I am connected to our GKE recommender-viewer cluster, and if I port-forward in the appropriate namespace to port 3000, I can now go over here, open localhost:3000, and get Grafana.
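Assuming the Grafana Service is named `grafana` and listens on 3000 (the names here are guesses, not taken from the repo), that port-forward step looks like:

```shell
# Tunnel local port 3000 to the Grafana service in its namespace.
kubectl port-forward --namespace <grafana-namespace> svc/grafana 3000:3000
# Now http://localhost:3000 serves the Grafana UI.
```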
C
The
data
analysis
dashboard
that
I
mentioned
earlier
is
probably
probably
on
the
front
page.
You
can
jump
into
it
and
just
scroll
through
it
and
take
a
look
at
our
digital
quality
checks
and
a
bunch
of
data
checks
that
this
actually
came
with
the
click
house
plugin
for
grafana.
C: The plugin ecosystem will actually come with default dashboards. If you just want to explore ClickHouse directly via SQL, you can do that by going through the Explore tab on the left; it jumps in here. I like the SQL editor, but, pardon me, you can use the query builder if you are not a SQL-happy person.
C
Check
find
all
of
our
tables,
we
go
and
here's
a
little
demo
table
that
we
have
that.
We
talked
about
later
that
we
talked
about
in
a
different
video.
There
you
go.
That
is
everything
about
how
to
set
up
and
where
to
find
all
the
information
about
the
plc
demo
for
click
house
and
our
usage
of
grafana.
B: Besides that, we also added persistent volumes, because you don't want the pods destroyed and, with that, the whole database destroyed. So don't forget to add persistent volumes for your data. And since we have both Python and Golang in our repositories, we also decided to copy the toy service, rewriting the Python in Go.
B
So
the
standard
way
using
this
function,
but
we
could
also
use
sql
open,
just
create
a
standard
sqli
connection,
which
is
quite
nice
because
first
is
standard
library
and
second,
because
if
we
ever
want
to
replace
click
house
in
the
future,
we
don't
have
to
replace
the
whole
code
base.
We
can
just
replace
this
part
and
the
other
bits
that
depend
on
the
this
sql
connection
can
pretty
much
stay,
as
is.
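The point of going through sql.Open is that application code only touches the generic database/sql interfaces, so the driver is the only ClickHouse-specific piece. A sketch of that shape (the table and column names reuse the toy example, the DSN is a placeholder, and a real run needs a registered ClickHouse driver such as clickhouse-go):

```go
package main

import (
	"database/sql"
	"fmt"
	"strings"
)

// insertQuery builds a parameterized INSERT; it knows nothing about
// which database sits behind the *sql.DB handle.
func insertQuery(table string, cols []string) string {
	placeholders := strings.TrimSuffix(strings.Repeat("?, ", len(cols)), ", ")
	return fmt.Sprintf("INSERT INTO %s (%s) VALUES (%s)",
		table, strings.Join(cols, ", "), placeholders)
}

// insertDog goes through the generic interface only; swapping ClickHouse
// out later means changing the sql.Open call, not this code.
func insertDog(db *sql.DB, breed string, heightCm, weightKg float64) error {
	_, err := db.Exec(insertQuery("dogs", []string{"breed", "height_cm", "weight_kg"}),
		breed, heightCm, weightKg)
	return err
}

func main() {
	// With a ClickHouse driver imported for its side effects, opening the
	// connection would look something like (DSN is a placeholder):
	//   db, err := sql.Open("clickhouse", "clickhouse://localhost:9000/default")
	fmt.Println(insertQuery("dogs", []string{"breed", "height_cm", "weight_kg"}))
}
```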
B: So this is how we insert. Now, one risk here is that these parameters have to be in the right order.
B: One thing that's good to know with this advanced practice: we can use the ch struct tag to define how the fields will be named in the actual insert. Without it, this would be saved as capital D, capital S, capital L, but that is not how we want to save our database entries; instead, we want to use snake case. All we have to do is provide this ch tag, and it will be saved in the right format.
A: Related to the persistent volume problem, there is one other caveat when you're working with ClickHouse: if, as we did, you set up your table engine as an in-memory table, it will not be persisted to disk, and your persistent volume will not help you. We made that mistake so you don't have to; now you know the pitfall.
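In table-engine terms, the difference is just the ENGINE clause in the DDL (the table and columns here reuse the toy example, not the real schema):

```sql
-- Memory engine: rows live only in RAM; a pod restart loses them
-- even with a persistent volume attached.
CREATE TABLE dogs_mem (breed String, height_cm Float64) ENGINE = Memory;

-- MergeTree: rows are written to disk under the ClickHouse data path,
-- which is what the persistent volume actually preserves.
CREATE TABLE dogs (breed String, height_cm Float64)
ENGINE = MergeTree ORDER BY breed;
```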
B: All right, so this is pretty much how far we got with the PoC. We also uploaded some data we had lying around: that was 5.15 gigabytes of data, almost 130,000 entries, and we could upload it in 5 to 10 seconds. So we can see that it's quite an efficient engine. As for what we didn't implement in the PoC, there are things we may want to come back to in the future.
A: We also looked at user management and thought that was a problem worth solving at some point, but it does not have to be solved right away for this sort of demonstration, as the Altinity operator does do a lot of that for you. However, it does so in a very static way, so we may have to address that in future. We also looked at various proxy solutions for balancing load across multiple clusters and across multiple insertions, which again is not part of the PoC but is worth looking at.
A
Some
of
those
proxy
solutions
also
allow
for
more
granular
user
management.
So
we
will
be
circling
back
to
take
a
look
at
those
in
future,
so
yeah
key
takeaways.
We
had
a
wonderful
time
doing
this,
it
was
click
house
is
an
incredibly
powerful
system.
We
were
not
quite
aware
of
how
many
different
options
they
are
and
we
were
not
fully
able
to
to
traverse
all
the
way
down
all
of
the
options
truly
impressive
system
that
we
were
happy
to
build
up.
A: We got a basic ClickHouse instance up, got some very chatty random-data-generating services running, and got a visualization going in Grafana. Much of what we've done here can be lifted. There are still, as we mentioned earlier, things around secrets, users, and config control; the sort of things that bridge between ClickHouse-specific and Kubernetes-specific concerns, which I think other teams would want to take on board if they were to take it to this stage. Or wait a little while, and we'll probably hammer those out in our next iteration.
A
So
there
is
one
sort
of
caveat
when
you're
dealing
with
data
insertion
using
an
atomic
database
engine
if
you're
not
familiar
with
the
difference
in
a
database
engine
and
a
table
engine
is
there's
a
doc
in
our
docs
that
links
mostly
to
the
click
house
docs.
So
you
can
go
directly
there
for
that
or
read
ours.
A
When
you're
dealing
with
an
atomic
database
engine,
you
have
to
set
insert
limits.
The
initial
theory
is
that
you
should,
where
the
initial
thought
perhaps
is
that
you
might
want
to
set
that
limit
as
high
as
possible,
so
you
can
insert
as
much
data
as
quickly
as
possible.
A
There
are
also
drawbacks
to
that,
because
click
house
internally
will
have
to
be
inserting
and
sorting
up
this
data.
So
you
want
to
keep
that
within
a
relatively
optimal
margin
for
your
insert
period
and
otherwise
register
period
and
also
the
batch
sizes,
avoiding
dropping
and
avoiding
issues
and
avoiding
really
lagging
out
click
house.
So
there
are
some
limitations
to
atomic
engines.
There
are
definitely
some
guarantees
in
that.
You
have
atomic
transactional
guarantees
on
inserts
on
database
renames
or
very
table
renames,
and
several
other
things
there.
So
there's
trade-offs.
A
There
read
the
docs,
be
mindful
juxtaposing
atomic
engines
versus
ordinary
engines
for
the
database
as
a
database
engine
there
is
advantageous
using
an
ordinary
engine.
Despite
the
fact
you
don't
get
the
atomic
guarantees
you
get
higher,
insert
volumes
and
faster
inserts
if
you're
just
uploading
a
bulk
bit
of
data
and
want
to
do
some
work,
as
was
the
case
when
we
were
loading.
Some
of
the
v2
model
data.