From YouTube: OpenShift Commons Briefing: Building Vitess Operator with Operator SDK - Dan Kozlowski (PlanetScale)
Description
Dan Kozlowski (PlanetScale)
from August 17th, 2018 OpenShift Commons Operator Framework SIG Meeting
- What is Vitess?
- Why build an operator?
- Common CRD patterns
- Potential beyond provisioning
First thing I wanted to do is go over a little bit of what Vitess is before I jump into our operator. What we did here is we built a few versions of Kubernetes operators, so I called this the tale of two operators, but there are actually four of them, and I'll get into why we did what we did, as well as a few lessons we learned and a few challenges that we're currently facing. But first things first.
So what is it? It splits a single database into a number of shards. Each of those shards has a number of replicas, and it gates each MySQL process with a Vitess process that we call a vttablet. Then it provides a gateway, which looks like a normal MySQL server to a client. So at the end of the day, you have a gateway layer that is horizontally scalable, you have a sharding layer behind it that can have as many shards as you want, and you have a control plane.
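Since the gateway speaks the MySQL wire protocol, any stock MySQL client works against it. A minimal sketch of that point (the host, port, database name, and query below are placeholders, not values from the talk):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // stock MySQL wire-protocol driver
)

func main() {
	// The gateway speaks the MySQL protocol, so no Vitess-specific client
	// is required; the DSN below is a placeholder.
	db, err := sql.Open("mysql", "user:password@tcp(vtgate.example:3306)/commerce")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var count int
	// The gateway routes the query to the right shard(s) behind it.
	if err := db.QueryRow("SELECT COUNT(*) FROM customers").Scan(&count); err != nil {
		log.Fatal(err)
	}
	fmt.Println("customers:", count)
}
```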
This was developed at Google. It was developed to run on Borg, which is the inspiration for Kubernetes, so this has been running for years in something kind of like Kubernetes. So when we took a look at really making it run well in Kubernetes, it was a pretty natural fit; it has now been developed for almost eight years to work in that environment.
Everything has been designed to fit together and scale independently, and there are a lot of different components that need a lot of different configuration options based on how you run them, and a lot of that can be abstracted away into a very simple cluster architecture. Also, because it has all these various components in various layers, you need to do more than just bring up containers. You can get a mostly working Vitess cluster just by using straight Kubernetes manifests, but there are a few steps that have to be taken before you spin up certain components, and then after you spin up certain components, to really make sure that everything can talk to each other. So in that case it's a prime candidate for an operator, since it's really close to a straight Kubernetes deploy with just a little bit of stuff on top, and in fact Vitess was used to demo a few other operator technologies.
We looked around at what we wanted to do, and these are really the reasons why we chose to go and make yet another operator. When we are talking about running a database in production, we wanted to do more than just provision it; we also wanted to handle lifecycle events. So if you update your version of MySQL, we don't just want to bring down all your database servers and then turn them all back on: there's a sequence of events we want to follow, and a rolling deploy wouldn't necessarily do things in the right order, so we knew we needed a little logic there.

We also wanted to be able to handle what I have called the deep state, and that is things like the schema inside your database. To a lot of the other technologies there is no visibility into what's actually going on inside the database, so when it comes to things like doing DDL transformations, if we wanted to assist in that, we were going to need something that had very fine-grained control and was very extensible.

The last reason why we decided to go with the Operator SDK was code reuse. The way we have our project architected, we have a component that takes configuration from Kubernetes via the Operator SDK and passes it off to a non-Kubernetes-specific engine, which then creates the resources.
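A hedged sketch of that split, with every name invented for illustration rather than taken from the actual PlanetScale code: a thin Kubernetes-facing layer translates the CRD spec into a plain config and delegates to an engine with no Kubernetes dependencies, so the engine can be reused outside the operator.

```go
package main

import "fmt"

// ClusterConfig is a Kubernetes-agnostic description of a Vitess cluster.
type ClusterConfig struct {
	Cells     []string
	Keyspaces map[string]int // keyspace name -> shard count
}

// Engine provisions clusters from plain configs; it never sees a CRD.
type Engine interface {
	Provision(cfg ClusterConfig) error
}

// reconcile is the only Kubernetes-aware piece: in the real operator this
// is where the Operator SDK hands over the CRD, which gets translated
// into a ClusterConfig and passed to the engine.
func reconcile(cfg ClusterConfig, e Engine) error {
	return e.Provision(cfg)
}

// logEngine is a stand-in engine that just reports what it would build.
type logEngine struct{}

func (logEngine) Provision(cfg ClusterConfig) error {
	fmt.Printf("provisioning %d cell(s) and %d keyspace(s)\n", len(cfg.Cells), len(cfg.Keyspaces))
	return nil
}

func main() {
	cfg := ClusterConfig{Cells: []string{"us-east-1a"}, Keyspaces: map[string]int{"main": 64}}
	if err := reconcile(cfg, logEngine{}); err != nil {
		fmt.Println(err)
	}
}
```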
We used the most basic state reconciliation that there is, in that we just used the name collision protection that Kubernetes offers to make sure that we're not recreating things. Elements in our operator had deterministic names; if we tried to create them a second time, we just got an "already exists" error and went about our business. Because of that, we were able to pretty much eliminate all the internal state, and that was one of our first design goals: this operator should be completely stateless. It should come up, be given a CRD, and just provision that cluster directly.
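A minimal sketch of that idea (not the actual operator code, and using the client-go call style of the era; newer client-go versions also take a context and options): create objects under deterministic names and treat a name collision as proof the work was already done.

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deterministicName makes the collision check a cheap idempotency test:
// the same cluster/cell/shard always maps to the same object name.
func deterministicName(cluster, cell string, shard int) string {
	return fmt.Sprintf("%s-%s-%d", cluster, cell, shard)
}

// ensureDeployment creates the Deployment only if it is missing. An
// "already exists" error means an earlier reconcile created it, so the
// operator needs no internal record of what it has already done.
func ensureDeployment(client kubernetes.Interface, ns string, d *appsv1.Deployment) error {
	_, err := client.AppsV1().Deployments(ns).Create(d)
	if apierrors.IsAlreadyExists(err) {
		return nil // name collided: the object was already provisioned
	}
	return err
}

func main() {
	d := &appsv1.Deployment{ObjectMeta: metav1.ObjectMeta{
		Name: deterministicName("super-awesome-cluster", "us-east-1a", 0),
	}}
	_ = d // wiring a real clientset is omitted from this sketch
}
```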
So this was the CRD that we came up with, and it's a direct translation of what a Vitess cluster looks like. In Vitess there's this concept of a cell, which is a failure domain: if we were running in AWS, this would be an availability zone or even a region. Then keyspaces are actually your databases, and keyspaces have shards, which is the actual splitting of your entire database into various parts, and each shard has replicas.
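As a hypothetical illustration of that hierarchy (the field and type names below are invented, not the operator's real API), the spec nests cells, keyspaces, and shards directly:

```go
package v1alpha1

// VitessClusterSpec mirrors the cell -> keyspace -> shard hierarchy.
type VitessClusterSpec struct {
	Cells []CellSpec `json:"cells"`
}

// CellSpec is a failure domain, e.g. an AWS availability zone or region.
type CellSpec struct {
	Name      string         `json:"name"`
	Keyspaces []KeyspaceSpec `json:"keyspaces"`
}

// KeyspaceSpec is a logical database split across shards.
type KeyspaceSpec struct {
	Name   string      `json:"name"`
	Shards []ShardSpec `json:"shards"`
}

// ShardSpec is one slice of the keyspace, run as several MySQL replicas,
// each fronted by a vttablet.
type ShardSpec struct {
	Name     string `json:"name"`
	Replicas int32  `json:"replicas"`
}
```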
That was our first version of the operator and the CRD, and it actually works really well. We wrote some Python scripts to generate this CRD; this one is a 64-shard cluster, and it can do several hundred thousand queries per second. Provisioning it is just a matter of having the physical resources up and running in your Kubernetes cluster, and then you can get a very high performance database off of it. So that was version one of the operator: we built that and deployed it.
If you look at the Prometheus operator, it uses what I will call the selector pattern, where you're actually defining service selectors: you can put in as many different service selector manifests as you would like, and then you just have the Prometheus operator use match labels to actually provision the Prometheus server. So we went and incorporated that into the Vitess operator.

Now we had a number of CRDs that had these selectors on them, and there was much rejoicing, because that solves a number of problems: you can have different components that are defined independently of each other. I can now change my cells without changing my cluster.
A
So
that
was
the
first
improvement
we
raid,
the
second
one
was
sort
of
once
we
did
that
we
realized.
We
had
only
moved
the
problem
around
a
little
bit.
You
still
had
to
do
a
lot
of
definition
to
get
a
Vitesse
cluster
up
and
running
so
for
improvement.
Number
two.
We
decided
well
how
about?
We
also
add
defaults.
So,
instead
of
having
to
define
all
of
your
CR
DS,
you
can
give
some
defaults
and
let
the
operator
to
find
the
CR
DS
for
you
sort
of
do
parameter.
This was a really good exercise for us when writing the second CRD, because it made us be very explicit about how we were going to configure things and where the configuration points were. Once we had to define the configuration points, we could go through and make sure that we had flexibility in actually defining the cluster: we could say, if we wanted to, that half of the cluster is defined explicitly and half of it is expanded from defaults.
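A hypothetical sketch of that defaulting step (all names invented): only the fields a user leaves unset are filled from the defaults, so an explicitly defined keyspace and a defaulted one can live in the same spec.

```go
package main

import "fmt"

// Defaults holds the cluster-wide values used for anything left unset.
type Defaults struct {
	ShardsPerKeyspace int
	ReplicasPerShard  int
}

type KeyspaceSpec struct {
	Name     string
	Shards   int // 0 means "use the default"
	Replicas int // 0 means "use the default"
}

// applyDefaults fills only the fields the user did not set explicitly.
func applyDefaults(ks KeyspaceSpec, d Defaults) KeyspaceSpec {
	if ks.Shards == 0 {
		ks.Shards = d.ShardsPerKeyspace
	}
	if ks.Replicas == 0 {
		ks.Replicas = d.ReplicasPerShard
	}
	return ks
}

func main() {
	d := Defaults{ShardsPerKeyspace: 64, ReplicasPerShard: 2}
	// One keyspace fully explicit, one expanded from defaults.
	explicit := applyDefaults(KeyspaceSpec{Name: "orders", Shards: 8, Replicas: 3}, d)
	defaulted := applyDefaults(KeyspaceSpec{Name: "main"}, d)
	fmt.Println(explicit, defaulted)
}
```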
We wanted to build that in, so that made us go through and think about how we were going to be able to do that. At the end of that whole process, we ended up with a cluster definition that can look like this: the 64-shard cluster from our initial CRD actually gets collapsed down into this. This would be a cluster definition with 64 shards, two replicas per shard, in every cell that was labeled super-awesome-cluster, and here I only have one defined.
A
The
third
improvement
is
when
we
actually
haven't
made
yet-
and
this
is
our
current
challenge-
that
we're
having
the
improvement
that
we
are
currently
targeted.
Working
on
is
reducing
the
amount
of
calls
we
have
to
the
kubernetes
api.
It
turns
out
that
when
we
get
these
extremely
large
clusters,
we
found
that
we
were
making
a
lot
of
calls
to
the
sdk
functions.
So
sdk
creates,
as
well
as
a
lot
functions,
to
get
the
state
of
running
containers
to
make
sure
that
we
weren't
duplicating
things
and
then
internal
to
the
test.
So we wanted to make sure that we could reduce that, and that's currently what we're in the process of doing. We're trying to figure out how we can make sure that we're reconciling correctly, so that when an outage occurs or something happens to the resources, they get reprovisioned in short order, but without also bringing down the API server because we're making so many API calls. That is the third improvement that we are currently working on, and I know on the mailing list there has been some chatter about this specific topic, since in addition to just concerns about stability, there are also concerns about cost and other things to consider when we are really instrumenting these operators that are making a lot of automated calls back to the API masters.

So those are the improvements that we made for version two, and then, looking at version three, we're evaluating
whether we could do some of these things with the Operator SDK that sort of go beyond what you could do with just straight Kubernetes deployments. Online schema transformations are a big one for us: a lot of people may be familiar with the tool gh-ost from GitHub, which works really well with Vitess, and with an operator it's a great opportunity to make that even more seamless, so that your schema transformations can be completely automated.
A
We
would
want
to
be
able
to
deploy
in
different
data
centers
around
the
world
and
make
that
as
easy
as
possible.
So
right
now
the
path
the
people
I'm
looking
at
is
write
multiple
kubernetes
clusters
and
then
just
orchestrating.
On
top
of
that,
and
since
we're
running
effectively
a
go
program,
we
have
options
on
how
to
coordinate
across
the
kubernetes
deployments.
Another great thing, which I actually think I just saw in the OLM operator, is what I've called here chat integration, but really it's human intervention. When events occur, like upgrades or lifecycle events, don't do them without approval. Especially when you consider databases and where they sit in your application stack, the idea of requiring human approval is something that a lot of people are insistent upon.
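A hypothetical sketch of gating on approval (the annotation key and types are invented for illustration): the reconcile loop refuses to act on a pending upgrade until an approval marker, set for example via a chat or ticketing integration, appears on the resource.

```go
package main

import "fmt"

// approvalAnnotation is an invented marker a human (or a chat bot acting
// on their behalf) sets to authorize a pending lifecycle event.
const approvalAnnotation = "example.com/upgrade-approved"

type Cluster struct {
	Annotations    map[string]string
	CurrentVersion string
	DesiredVersion string
}

// reconcileUpgrade performs the upgrade only after a human has approved it.
func reconcileUpgrade(c *Cluster) {
	if c.CurrentVersion == c.DesiredVersion {
		return // nothing to do
	}
	if c.Annotations[approvalAnnotation] != "true" {
		// Surface the pending action (e.g. post to chat) and wait.
		fmt.Println("upgrade pending human approval, not acting")
		return
	}
	fmt.Printf("approved: upgrading %s -> %s\n", c.CurrentVersion, c.DesiredVersion)
	c.CurrentVersion = c.DesiredVersion
}

func main() {
	c := &Cluster{Annotations: map[string]string{}, CurrentVersion: "5.7", DesiredVersion: "8.0"}
	reconcileUpgrade(c) // blocked: no approval yet
	c.Annotations[approvalAnnotation] = "true"
	reconcileUpgrade(c) // proceeds
}
```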
So again, because we do have a lot of integration points, we can do a chat integration, or we could do ticketing system integrations, so that you can actually give approval before events are taken. Then the last thing, which is sort of Vitess-specific, is automated resharding. Since it's a sharded database, you can expand the size of the cluster as big as you would like, and that's something that we can also do in the operator. That's got problems with other styles of operators, because it is a multi-step, sometimes multi-day thing that happens.
But with the Operator SDK, we can maintain state in a global lock server or in some external entity, so we can do those big multi-day lifecycle events. So that is the story of our two operators, of sort of our path through creating them. Vitess is an open source product, so if you want to take a look at it, it's vitess.io, and we hang out on Slack at that second link.
Just to clarify, I'm curious: did you actually see any problems with the operator bringing the API server down?
No.
Hey, this is Shawn, just a quick note. I think the controller-runtime design story in the Operator SDK might be an issue worth watching for you. I think this should help with that problem, because I believe controller-runtime will start to use an informer cache to do get and list requests, which will probably significantly help in this regard. But that's just a guess from a guy, so I don't really know if that's true.
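A hedged sketch of the informer-cache idea in plain client-go (controller-runtime wires the equivalent up automatically; the namespace is a placeholder): reads are served from a watch-driven local cache instead of hitting the API server on every reconcile.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// run lists pods through a shared informer's lister: after the initial
// sync, reads come from a local watch-driven cache, not the API server.
func run(client kubernetes.Interface, stop <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podLister := factory.Core().V1().Pods().Lister()

	factory.Start(stop)            // start the watches
	factory.WaitForCacheSync(stop) // block until the cache mirrors the server

	// Served from the cache: no API-server round trip on every reconcile.
	pods, err := podLister.Pods("vitess").List(labels.Everything())
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("cached view: %d pods\n", len(pods))
}

func main() {
	// Building a kubeconfig-backed clientset is omitted from this sketch;
	// pass one to run() along with a stop channel to try it.
}
```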
Do we use the etcd operator? Well, it depends. For the etcd cluster that's associated with each cell, which is the failure domain, we just provision that with the etcd operator: we ask that you load the etcd operator before you load our operator, so that we can just provision that etcd CRD. The global one is a little bit different: since Vitess spans Kubernetes clusters, the etcd operator doesn't quite work there, so we just have some custom code to spin up an etcd cross-region cluster.
So the etcd global lock server is a coordination point for multi-cluster communication. Vitess knows how to run in completely isolated data centers if there doesn't need to be any synchronization between them, but that global lock server does need to be accessible by each cluster and does need to maintain the same state for all of them. That's why we provision that outside of the etcd operator, as in the early demos that we did.