A: Off we go, I want to thank everyone for joining us. Welcome to today's CNCF live webinar, "How We Manage Thousands of Clusters with Minimal Effort Using Gardener." I'm Libby Schultz and I'll be moderating today's webinar. I want to introduce our speakers today: Smarth, a software engineer, and Hardik, a software developer, both at SAP. A few housekeeping items before we get started: during the webinar you're not able to speak as an attendee. In addition, please join our CNCF public Slack channel, #cncf-online-programs, to continue the conversation later and to address any questions you had that we didn't get to. This is an official webinar of the CNCF and, as such, is subject to the CNCF Code of Conduct. Please do not ask any questions that would be in violation of that Code of Conduct, and please be respectful to all of your fellow participants and presenters.
B: Thanks a lot, everyone, for joining. Let me first introduce myself: I'm Hardik, a software developer on Gardener on Metal. Previously I was working on Gardener, mainly on the machine management and autoscaling side of Gardener, and otherwise I am also once in a while active in the Cluster API community and the autoscaling community in general.
B: Okay, so let's get started. First things first: what's the motivation? This webinar is of course about Gardener, so a brief word about it. It's basically an open-source initiative by SAP: a fully managed control plane as a service that offers homogeneous clusters on potentially any cloud provider, and it is fully customizable and scalable. We have actually been at thousands of clusters, for real, recently, and this webinar is about giving a glimpse of the what, the why and the how, and we will be doing something interesting along the way.
C: Yes, so managing thousands of Kubernetes clusters at scale is not a cakewalk, and over these three to four years Gardener has evolved to be so robust and scalable that we have actually made managing thousands of clusters a cakewalk. Gardener primarily runs everywhere: it runs on our own infrastructure and it also runs on other cloud providers, and the experience it gives with respect to the versions or the features offered is pretty homogeneous, even though the support covers various clouds and even our own infrastructure.
B: Yes, and to communicate the idea a bit more effectively, this is what we are going to do: we are going to host a hypothetical, highly consumable application. We are going to call it the Botanist Quest, and we will host it on a platform that would really need to be something robust.
C: So we will be assuming our roles and doing a role play, where Hardik will be the founder of this application he spoke about, the Botanist Quest, which is hypothetical, and I will be the product manager for this Botanist Quest. This webinar is basically going to be a set of arguments and brainstorming between the two of us to design this Gardener from scratch, and to convince you that such a robust platform can practically exist for applications like Botanist Quest and also for other critical applications of yours.
B: Thank you. So, hey Smarth, shall we then start planning on taking Botanist Quest to new heights already?
B: So this is what I have: one very nice Kubernetes cluster with three dedicated control plane machines, and it's basically serving a bunch of beta users. I have already got really good feedback for the application, we are going to launch the general release very soon, and our initial set of target new users would be around 500 or so.
B: So in essence we have a beta app hosted on one cluster, and the plan is that we simply scale this to five. We would have five clusters with dedicated control plane machines, all of them hosting the Botanist Quest in a hybrid model. They would also be running on different cloud providers, and yeah, that's the situation at the moment.
C: That is good, but five dedicated clusters might not be enough scaling, because as per my research Botanist Quest is getting a lot of traction in the market, and I see that something similar to Pokémon Go might potentially happen with Botanist Quest too. That is, you might have planned for an expected influx of 5x in the worst case, but your traffic might hit 250x, which is what happened with Pokémon Go, right? So you might have to scale massively and across geographical locations.
C: You mentioned multi-cloud, so are you planning to take the managed clusters across different cloud providers? Because if you are doing so, there probably won't be a homogeneous experience when it comes to managing the clusters, and as for transparency of the control plane, I doubt we would get that either. I also want you to employ our own infrastructure that we have at different locations.
B: Yeah, I get it. So basically my team is then going to replicate the installation. We would have 30 Kubernetes clusters, we'll use this awesome foo-bar tool that we have been using, it will be multi-cloud, yes, and all of them will again have three dedicated control plane machines, so it will be super reliable and I think it should work like a charm.
B: And what I would also do is divide the clusters across different regions, so that the customers in different regions are better served, and that should, I think, be good enough.
C: Yeah, that does look good, but 30 clusters with three dedicated control plane nodes each isn't really as cool as it appears to be. You know why? The first reason is that the control plane nodes are never fully utilized; they are always underutilized. So for 30 clusters you will end up with 90 control plane nodes, and these will only drive your cost up faster than your team can scale up your clusters. And not just that: have you even considered the operational complexities that your team might face?
C: What if some cluster runs into volume mount issues, some cluster has an API-unreachable issue, and some other cluster has some other issue? If these things start happening simultaneously, the team will go haywire. And more than that, how have you planned to manage the tracking of these clusters, their config files, cloud credentials, etc.?
B: Yeah, those are actually good points. Let's take a first look, or rather let's ponder what we already know. First of all, we know that dedicated control plane machines are usually underutilized in most setups, as you say. The second point is also very well known: the beauty of Kubernetes is that the control plane and the workload are somewhat decoupled, so they don't necessarily always have to run together. And the third and more important point: the control plane components themselves are actually full-fledged workload applications; they can probably be treated as workload themselves. Okay, so what can we infer from this information, and maybe innovate to address your concerns?
B: Yes, it is the Kubeception idea. So, in order to improve the resource utilization of the control plane nodes, I think we will spawn one Kubernetes cluster manually and call it a management cluster, and then we use that management cluster to host the control planes of the other clusters. For the visualization, what you see on the screen is what I would like to propose.
B: Yeah, sure, let's double down and take a bit closer look. Essentially the control plane of each of what we'll call the child clusters would have its own dedicated namespace, so that's the first level of isolation. Of course we don't want them to mess with each other, so we can also isolate them using network policies. That would be the baseline idea. And then another thing which I would really want is that we actually use Kubernetes here and not reinvent the wheel.
B: So I would use Deployments and StatefulSets and such battle-tested in-built controllers to deploy the API server, etcd, kube-scheduler and components like that, and this essentially should reduce the blast radius, by effectively having to manage only one management cluster instead of the 30 clusters of the previous case.
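To make the hosted-control-plane idea concrete, here is a minimal, self-contained Go sketch of how the per-cluster namespace and the set of hosted components might be laid out. The naming scheme and component list are illustrative assumptions, not Gardener's actual manifests.

```go
package main

import "fmt"

// component describes one hosted control-plane piece and how it would be
// deployed inside the management cluster. Names are illustrative only.
type component struct {
	name     string
	kind     string // "Deployment" or "StatefulSet"
	replicas int
}

// controlPlaneNamespace returns the dedicated namespace that isolates one
// child cluster's control plane from all the others.
func controlPlaneNamespace(project, cluster string) string {
	return fmt.Sprintf("shoot--%s--%s", project, cluster)
}

// hostedControlPlane lists the in-built controllers we would lean on instead
// of dedicated machines: stateless parts as Deployments, etcd as a StatefulSet.
func hostedControlPlane() []component {
	return []component{
		{name: "etcd", kind: "StatefulSet", replicas: 1},
		{name: "kube-apiserver", kind: "Deployment", replicas: 2},
		{name: "kube-controller-manager", kind: "Deployment", replicas: 1},
		{name: "kube-scheduler", kind: "Deployment", replicas: 1},
	}
}

func main() {
	ns := controlPlaneNamespace("demo", "botanist")
	for _, c := range hostedControlPlane() {
		fmt.Printf("%s/%s as %s with %d replica(s)\n", ns, c.name, c.kind, c.replicas)
	}
}
```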
C: Oh, that appears to be pretty efficient, but I think it only addresses the cost issue, right? The excessive underutilized control plane machines were migrated as workload onto one single management cluster. But when it comes to the lifecycle of these control planes and the lifecycle of the underlying machines, I think we are back to square one.
C: In my opinion, you must take care of the lifecycle of these hosted control planes and of the workload machines of the child clusters more efficiently, because traditional Kubernetes does not have the domain knowledge it would need to manage these better, right?
B: Well, agreed on that. I think we are circling back to the main issue, so let's again step back and look at it once more.
B: So, first of all, this actually looks like a natural candidate for the controller or operator pattern. Just to reiterate, an operator is basically a Go controller, or a Kubernetes controller, which also comes with additional domain knowledge to manage its own resources. What we have here is basically the control plane abstracted as pods, and what we could do is represent this control plane with dedicated CRDs. So, in essence:
B: Let's do it this way: we have our control plane pods, and we represent them using a Cluster CRD. This Cluster CRD would have all the knobs and necessary configuration options that decide the whole lifecycle of a given cluster, and in the same way it will help me trigger the cluster creations.
B: We can also introduce CRDs for machines, and it could look like MachineDeployment, MachineSet and Machine, analogous to Deployment, ReplicaSet and Pod. The way a deployment controller always ensures that a certain number of replicas of a pod are running, and does very fine-grained rolling updates of the pods, we could implement similar functionality for the machines. So we have a MachineDeployment, which would basically help us do the right kind of rolling updates and so on. And yes, with this kind of abstraction, let's call it the machine API, we get seamless autoscaling as well, because with such abstracted, dedicated CRDs the higher-level functionalities, or the higher-level automations, become really easy. Just imagine the cluster autoscaler making use of this, and we get auto scaling for free for all cloud providers and even bare metal, and so on.
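A hedged sketch of what such a machine API could look like as Go types, mirroring Deployment → ReplicaSet → Pod with MachineDeployment → MachineSet → Machine. Field names are made up for illustration and are not the actual machine-controller-manager API.

```go
package main

import "fmt"

// RollingUpdate mirrors the knobs a Deployment offers, applied to machines.
type RollingUpdate struct {
	MaxSurge       int // extra machines allowed during an update
	MaxUnavailable int // machines that may be down during an update
}

// MachineDeployment is the top-level object an operator (or autoscaler) edits.
type MachineDeployment struct {
	Name         string
	Replicas     int
	MachineClass string // reference to a provider-specific machine template
	Update       RollingUpdate
}

// MachineSet pins one revision of the template, like a ReplicaSet does for pods.
type MachineSet struct {
	Name     string
	Owner    string // owning MachineDeployment
	Replicas int
}

// Machine represents a single VM (or bare-metal host) backing one node.
type Machine struct {
	Name  string
	Owner string // owning MachineSet
	Node  string // node object it backs once registered
}

func main() {
	md := MachineDeployment{
		Name: "botanist-worker", Replicas: 3,
		MachineClass: "aws-m5-large",
		Update:       RollingUpdate{MaxSurge: 1, MaxUnavailable: 0},
	}
	fmt.Printf("%s keeps %d machines of class %s\n", md.Name, md.Replicas, md.MachineClass)
}
```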
C: That is really good. So, essentially, you are saying that the machines and the clusters that we are trying to deal with will now be treated as first-class citizens of Kubernetes, along with a cluster controller manager in place.
B: Yeah, and here is also a detailed visualization. So essentially we would also have the cluster controller manager: a controller which takes care of both kinds of CRDs. This controller would be running in the management cluster, and upon creation of the Cluster CRD the control plane of the child cluster would be deployed first by the controller, and then in turn the same controller also deploys the machine CRDs.
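A minimal Go sketch of the reconcile order just described: on a Cluster object, bring up the hosted control plane in its dedicated namespace first, then create the machine objects. All function and type names here are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Cluster is the CRD-style object a user creates in the management cluster.
type Cluster struct {
	Name    string
	Region  string
	Workers int
}

// reconcileCluster follows the order discussed: control plane first, machines second.
func reconcileCluster(c Cluster) error {
	ns := "shoot--" + c.Name
	if err := deployControlPlane(ns); err != nil {
		return fmt.Errorf("control plane for %s: %w", c.Name, err)
	}
	if err := deployMachines(ns, c.Workers); err != nil {
		return fmt.Errorf("machines for %s: %w", c.Name, err)
	}
	return nil
}

// deployControlPlane would create etcd, kube-apiserver, scheduler and
// controller-manager as workloads in the dedicated namespace.
func deployControlPlane(namespace string) error {
	if namespace == "" {
		return errors.New("namespace must not be empty")
	}
	fmt.Println("deploying hosted control plane into", namespace)
	return nil
}

// deployMachines would create a MachineDeployment with the requested replicas.
func deployMachines(namespace string, replicas int) error {
	fmt.Printf("creating MachineDeployment with %d replicas in %s\n", replicas, namespace)
	return nil
}

func main() {
	if err := reconcileCluster(Cluster{Name: "botanist", Region: "eu-west-1", Workers: 3}); err != nil {
		fmt.Println("reconcile failed:", err)
	}
}
```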
B: Okay, you might be surprised, but I have already prepared it; I think I'm just too fast. So let's look at the demo. We have three terminals here: two for the management cluster and one for my workload cluster. I want to quickly see the shoot — I'm going to call it a shoot because it's my cluster, it's my Botanist Quest. So for this one shoot cluster, which is called bq-demo-cncf, I have a dedicated namespace, and I would expect that this namespace hosts all of the control plane components that are necessary. Got it, let's take a look into it, and we already see that here, on top of the essential control plane components like the API server, scheduler and controller manager, I also have a few other controllers. I have introduced something for the machine API as well, a separate controller which we call the machine controller manager, and then also the autoscaler as a separate controller.
B: So here is a glimpse where we see that in the spec section there are multiple sections. Hibernation is one of my favourites — it saves so much cost for us. We can also configure all kinds of Kubernetes-related stuff fully transparently via the spec, and here we go, we see the worker section. The worker section is basically what we just discussed: based on the information that I give here, I say minimum is three, maximum is five.
B: I give very fine-grained information like maxUnavailable and maxSurge, which should be respected during a rolling update, and then this controller fetches this information and prepares the right kind of machine deployments. Let's take a look at what we have in terms of machine deployments, machine sets and machines.
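A small sketch, assuming illustrative field names, of how a worker-pool section with min/max and maxSurge/maxUnavailable could be translated into a machine deployment by such a controller.

```go
package main

import "fmt"

// WorkerPool is what the shoot spec's worker section carries (illustrative fields).
type WorkerPool struct {
	Name           string
	MachineType    string
	Min, Max       int
	MaxSurge       int
	MaxUnavailable int
}

// MachineDeploymentSpec is what the controller derives from the pool.
type MachineDeploymentSpec struct {
	Name           string
	MachineClass   string
	Replicas       int // starts at Min; the autoscaler may move it up to Max
	MaxSurge       int
	MaxUnavailable int
}

// toMachineDeployment copies the fine-grained rolling-update knobs through,
// so node rollouts respect the same guarantees a Deployment gives pods.
func toMachineDeployment(p WorkerPool) MachineDeploymentSpec {
	return MachineDeploymentSpec{
		Name:           p.Name,
		MachineClass:   p.MachineType,
		Replicas:       p.Min,
		MaxSurge:       p.MaxSurge,
		MaxUnavailable: p.MaxUnavailable,
	}
}

func main() {
	pool := WorkerPool{Name: "worker-pool-1", MachineType: "m5.large", Min: 3, Max: 5, MaxSurge: 1, MaxUnavailable: 0}
	fmt.Printf("%+v\n", toMachineDeployment(pool))
}
```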
B: So we see that we have a machine deployment with three replicas, representing one machine set with three replicas, and then three actual machine objects. Okay, nice. Let's also take a quick look inside the machine deployment spec and see what the API actually contains.
B: Here we see again, for consistency, the replicas. This allows us to do the rolling update; it also allows us to do the recreate strategy, where instead of a rolling update it would really delete the machines one by one, the way the deployment controller does. And we have a reference to the machine class, and the node template to sync the labels and other metadata back and forth between the node object and the machine objects, because essentially the machines are really dynamic.
B: And yeah, what more could we do? I think we could quickly make a change, because I claimed it should take care of the lifecycle. Let me actually make a very small change in the shoot. Before that, I would watch the machine deployments and I would also watch the nodes of my workload cluster. So three nodes correspond to the three machine objects that are there in my management cluster.
B: With a minimal change — and this is actually the power of the declarative approach — once I make this change, my controller, which is running in the backend, is going to reconcile this particular change. During the reconciliation it is going to update the machine deployments, machine sets and so on, because there is a change in the worker section: the machines previously running were xlarge, and now they should be running on 2xlarge. But the catch, or the magic, is that it should not be done abruptly, because we don't want to handle only the infrastructure here; we also want to take care of the pods running on it. So, because I set maxUnavailable to zero and maxSurge to one, it created a new machine, and it will actually wait for the new machine to join before proceeding.
C: Okay, this looks good. So essentially every machine is backing a node object that is actually attached or registered to the cluster, correct?
B: Yes, that's true, and the node object is basically backed by the virtual machine or the real machine. Also, one thing: what we see here is infrastructure-related stuff, but if you had a pod running on one of the machines, and that pod has an SLA saying that the particular machine it is running on should not be deleted unless there is another replica or there are enough replicas, then this controller is smart enough — or, I would say, this controller uses the drain feature in such a way — that such pod SLAs and pod disruption budgets are also very properly taken care of. And with a bit of fast-forwarding in between, we already have all three brand-new nodes available, placed one by one.
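A sketch of the one-by-one replacement just demonstrated, under the assumption maxSurge=1 and maxUnavailable=0: create the new machine, wait for it to become Ready, drain the old node respecting PodDisruptionBudgets, then delete it. The helpers below are placeholders, not real Gardener calls.

```go
package main

import "fmt"

// node stands in for a workload-cluster node backed by one machine.
type node struct{ name string }

// rollOneByOne replaces machines with maxSurge=1 and maxUnavailable=0, as
// described in the talk: a new machine is created and must be Ready before
// the old one is drained (respecting PodDisruptionBudgets) and deleted.
func rollOneByOne(old []node, newMachineType string) {
	for i, n := range old {
		replacement := createMachine(fmt.Sprintf("new-%d", i), newMachineType)
		waitUntilReady(replacement)
		drainRespectingPDBs(n) // evict pods only as fast as their PDBs allow
		deleteMachine(n)
	}
}

func createMachine(name, machineType string) node {
	fmt.Printf("creating %s of type %s\n", name, machineType)
	return node{name: name}
}

func waitUntilReady(n node) { fmt.Println(n.name, "has joined and is Ready") }

func drainRespectingPDBs(n node) { fmt.Println("draining", n.name, "without violating PDBs") }

func deleteMachine(n node) { fmt.Println("deleting", n.name) }

func main() {
	rollOneByOne([]node{{"old-1"}, {"old-2"}, {"old-3"}}, "m5.2xlarge")
}
```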
C: This already looks good to me. So it is honoring the infrastructure SLAs as well as the SLAs for the applications that are running as pods within that infrastructure, and it looks pretty good to me to see the infrastructure being handled as custom resource objects behind a particular API. This is really nice.
C: Okay, this is all good, but I see that all the control planes of these shoots, as you call them for Botanist Quest, are sort of hosted on a single management cluster as workload, while the shoots themselves are distributed across regions. So I see a potential issue of cross-region latency here. Alongside this, if I have one management cluster and it is hosting several control planes of several shoots, it should hit an upper limit, correct? After a certain number of clusters whose control planes are hosted there, maybe it cannot accommodate more, so you might want to scale the management cluster, right? So can you explain to me how you're handling these two aspects?
B: Those are again good points, and honestly I see a very straightforward solution to this. See, we have one management cluster, and if latency is the problem, then we could simply replicate this management cluster into the regions, and that should basically solve the problem. Although it looks like, instead of one management cluster, we are now having more management clusters, all of them would basically be autoscaled.
C: Okay, so if I have to rephrase what you just told me: we are going to replicate the management clusters and host the control planes of the shoots in the geographic vicinity of those distributed management clusters, correct?
C: Good, this idea is nice, so the cross-region latency is probably handled here. But with time, you see, with increasing workloads and an increasing number of shoots, the density of shoots will increase and we might have to scale out the management clusters to a pretty large number as well. So how do you plan to manage these management clusters?
B: Okay, yeah, that's also a valid argument, and you've already got me quite entangled in it. I don't want to fall back to square one, so let's take a look at it again from what we have discussed so far. In phase one we had plenty of clusters, each with a dedicated control plane, in one location, and then we saw the problem that we have a lot of resources being underutilized.
B: So we decided to move a few of the clusters to different locations, in different regions, and this worked pretty well, but again this situation has its own set of problems: we again have plenty of control planes running at plenty of locations. So we then said, let's go to phase two and introduce a management cluster.
B: Latency is again a bit of trouble, so we replicate the management clusters into the geographic vicinity — we moved all of the clusters to their different regions. This worked well, but again we fall into the same problem: we could possibly have plenty of management clusters, so the way we had to manage plenty of shoot clusters, we now also have to manage plenty of management clusters. To be honest, looking at the pattern — you can see the recursive approach — I would go bold and introduce another cluster.
B: So essentially now I have a ManagementCluster CRD, which takes care of my management clusters, and I also have the shoot CRDs and the machine CRDs. The machine CRDs can also be used for the management clusters in general, because it's completely recursive, and our cluster controller manager runs at the top level, in the super management cluster.
C: Okay, I think this proposal is also pretty good. So if I understand this correctly, the Cluster CRD and the Machine CRD that we were speaking about on the previous slide — that Cluster CRD will now be applied, or created, in the super management cluster, because the cluster controller manager is also running there, and even these management clusters will be represented as a ManagementCluster CRD in our super management cluster, correct?
C: Yep, I think this looks like a sophisticated design that kind of convinces me that we can now manage thousands of clusters. To actually answer the question — can we now already manage thousands of clusters? — we probably want to look at the flow of adding one new cluster to this ecosystem and see if there are any unknown unknowns, right?
B: Yeah, certainly, I would not be so quick to judge. Let's take a quick look at what we have so far and what we can do. Currently we create a cluster object in the super management cluster, okay, and that will be processed by the cluster controller manager.
B: Then we are manually assigning this cluster to one of the management clusters — okay, that sounds fuzzy. The third step is that the cluster controller manager reconciles this cluster object and creates the control plane in the dedicated namespace — that's good. And at the end, the cluster controller manager of course takes care of the rest of the lifecycle of this control plane, maintaining it.
B: Yeah, there seems to be a similarity here between what we do with the control plane and what Kubernetes itself does; I think Kubernetes actually does something very similar at a fundamental level with pods.
B: Yes, let's take a step back and look at Kubernetes — at what Kubernetes does with the pods. Essentially we have a kube-apiserver, and then we have the scheduler and the controller manager. The scheduler's job, although it's really important, is in a sense just to assign a node: it updates the nodeName field on the pod, and that's its job. The kube-controller-manager, of course, takes care of certain other lifecycle aspects.
B: I also know that there is a kubelet on each of the nodes, so when a pod is introduced it's assigned to a node, and then the respective kubelet will fetch the definition and create the pod, or the container. That's actually a déjà vu moment. So let's see what we have.
B: We have a cluster aggregation server — or let's say a dedicated API server — which is going to host our cluster CRDs and such. We already have the first cluster controller manager, and this controller manager is creating control planes — okay, that's something that could be improved. On the other hand, I also see that we already have the control planes running on the management cluster. So what's actually missing here?
B: I think there are mainly two components which are really at the core of Kubernetes and could also be really helpful to us. I can already think of a cluster scheduler — a cluster scheduler which assigns a cluster to a particular management cluster the way a kube-scheduler does — and a "clusterlet", where the respective clusterlet would fetch the definition of the cluster object, spin up the control plane, and then do the rest of the business logic that we need. And the interesting phenomenon, something I would really like to call out explicitly, is that with the introduction of the clusterlet we are actually separating out of the cluster controller manager the whole business logic that was related to deploying the control planes, and this really, really helps us with scaling.
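A minimal sketch of the cluster scheduler idea: like kube-scheduler for pods, it only fills in a placement field on the shoot object, picking a seed with free capacity in the same region; the clusterlet on that seed then does the actual work. The types and the capacity heuristic are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Seed is a management cluster that can host shoot control planes.
type Seed struct {
	Name     string
	Region   string
	Capacity int // how many more control planes it can still take (illustrative)
}

// Shoot is the end-user cluster object; SeedName is empty until scheduled.
type Shoot struct {
	Name     string
	Region   string
	SeedName string
}

// schedule mimics what kube-scheduler does for pods: it only fills in the
// placement field; the clusterlet on that seed does the actual deployment.
func schedule(s *Shoot, seeds []Seed) error {
	var best *Seed
	for i := range seeds {
		cand := &seeds[i]
		if cand.Region != s.Region || cand.Capacity == 0 {
			continue
		}
		if best == nil || cand.Capacity > best.Capacity {
			best = cand
		}
	}
	if best == nil {
		return errors.New("no seed with free capacity in region " + s.Region)
	}
	s.SeedName = best.Name
	return nil
}

func main() {
	seeds := []Seed{{Name: "seed-eu-1", Region: "eu", Capacity: 40}, {Name: "seed-eu-2", Region: "eu", Capacity: 85}}
	shoot := Shoot{Name: "botanist", Region: "eu"}
	if err := schedule(&shoot, seeds); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(shoot.Name, "scheduled to", shoot.SeedName)
}
```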
B: And now let me stretch a bit and name it. For the Botanist Quest, let's give a name to this small design we have prepared, and let's name it Gardener. So I would introduce Gardener here, and what is the design of Gardener? Gardener's design is exactly what we just saw on the previous slide: we have a Gardener API server.
B: Then we have a Gardener scheduler and a Gardener controller manager, which do the same as what we just discussed, and then the gardenlet, which in a sense is what gives us the scale: on each of the seed clusters we have one gardenlet hosted, which is responsible for managing the control planes that are going to run on that particular seed cluster. So if you look at the mapping, then this Gardener scheduler maps to the kube-scheduler,
B: the gardenlet is of course the kubelet, and the hosted control plane is the pod. Wow, this looks fascinating. And you know what, just to add: this is the core of the Gardener design — it maps onto the design pattern of Kubernetes itself, as we see, so we can really reuse the skills; in effect, a Kubeception model of turtles
B: all the way down. Along with the requirement of delivering fully managed Kubernetes as a service, this step by step actually led us to this architecture. Initially, our requirements for running the Botanist Quest application on Kubernetes motivated the whole platform, but now it seems this platform is not only for us.
C: That is good. So what we started for ourselves looks like it has become valuable for all other potential users too. This is good. Going forward, I really like the design already, but you know, we don't want to miss one important aspect, because now we are making it available to the majority of customers that might find potential usage for this.
C: So with increasing adoption we might want to support even more cloud providers, and this may force us to switch to different operating systems, different network plugins, and different other aspects of cluster management. So, with the ever-evolving cloud native ecosystem, our system also has to be completely extensible.
B: That's really a great point, and I can't agree more with you. I would say extensibility should be at the very core of any good design, and Gardener supports a very neat and Kubernetes-native extension model. Each extension point is essentially a provider-specific controller, very similar to how extensibility is designed for the cloud controller manager in Kubernetes, for example. A very simple example would be the cloud providers themselves,
B: where Gardener would basically declare a neat Golang interface for a provider, and then the provider would have to implement that particular interface. The interface would contain the bare minimum of functions needed for Gardener — for fully managed Kubernetes — to run on a particular provider. One simple example would be gardener-extension-provider-aws, which targets AWS.
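A hedged Go sketch of what such a bare-minimum provider contract might look like, with a stand-in for something like gardener-extension-provider-aws. The interface and its methods are illustrative, not the actual Gardener extension API.

```go
package main

import (
	"context"
	"fmt"
)

// InfrastructureSpec carries the provider-agnostic request (illustrative fields).
type InfrastructureSpec struct {
	Cluster string
	Region  string
}

// Provider is the bare-minimum contract a cloud provider extension would
// implement so the platform can run fully managed Kubernetes on it.
type Provider interface {
	// ReconcileInfrastructure creates or updates networks, subnets, and so on.
	ReconcileInfrastructure(ctx context.Context, spec InfrastructureSpec) error
	// DeleteInfrastructure tears everything down again.
	DeleteInfrastructure(ctx context.Context, spec InfrastructureSpec) error
}

// awsProvider stands in for a provider-specific extension controller.
type awsProvider struct{}

func (awsProvider) ReconcileInfrastructure(_ context.Context, s InfrastructureSpec) error {
	fmt.Printf("ensuring VPC and subnets for %s in %s\n", s.Cluster, s.Region)
	return nil
}

func (awsProvider) DeleteInfrastructure(_ context.Context, s InfrastructureSpec) error {
	fmt.Printf("removing infrastructure for %s\n", s.Cluster)
	return nil
}

func main() {
	var p Provider = awsProvider{}
	_ = p.ReconcileInfrastructure(context.Background(), InfrastructureSpec{Cluster: "botanist", Region: "eu-west-1"})
}
```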
B: This approach recursively builds on Kubernetes' support for various other providers as well. So that's the theory, but let's actually look at a beautiful outcome of a well-defined extension model — or, I would say, the power of well-defined extension models. This is basically one simple, single Gardener installation where a large number of clusters are being managed on different cloud providers. Basically, Gardener is the super management cluster.
B: The workload machines of the seed clusters are deployed on different cloud providers, in different regions, as the case fits, and then, to have the least possible latency, the workload machines of the actual end-user clusters — let's call them the shoot clusters — are deployed in the same region, with their control planes hosted as workload on the management clusters. This is what I have, and it looks as if it can actually handle a large number of clusters, if it actually works and not only on paper.
B: Okay, so let me show you the demo of what I just talked about. What we see on the screen is the Gardener dashboard, for a better user experience — of course, everything can also be done from the terminal. What we see is basically clusters, secrets, members and some utilities, and we already saw a demo of the bq-demo-cncf cluster, where there are a few different sections that can be configured for a given cluster, and you have a chance to jump directly into a terminal from there.
B: So, what you just saw in the overview — the essence of all of it is also in the YAML file here. You can change the YAML files, or declare everything in the YAML files as well. Let's actually try to create a new cluster and see what the flow looks like.
B: So I'm going to create one on AWS; I'm going to call it botanist, and the version is 1.20. I can set different purposes — let's call it an evaluation purpose. I'm going to use a standard AWS secret for that, just an access key and so on. For the worker pools I would basically use m5.large — I can choose other worker sizes, but let's use m5.large — with the Garden Linux operating system and Docker, and I would keep min and max at one and two. Maintenance: I would keep it as it is.
B: So only in this maintenance window would the cluster be rolled, and not at any random time. And then, of course, my personal favourite is hibernation, where I would say every day at 5 pm my cluster should be hibernated, because this is just an evaluation cluster — I would actually save a lot of cost by bringing down all the machines and the control plane every day at 5 pm. Of course, the same can be done with the YAMLs. Let's not wait long — go ahead and create the cluster.
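The same creation flow, captured as a minimal Go struct literal with the values just entered in the dashboard. Field names and the hibernation/maintenance formats are assumptions for illustration, not the real Shoot API.

```go
package main

import "fmt"

// ShootSpec captures (with illustrative field names) what the dashboard form
// collected before creating the cluster.
type ShootSpec struct {
	Name              string
	Provider          string
	KubernetesVersion string
	Purpose           string
	Worker            Worker
	HibernationCron   string // when to scale control plane and machines down
	MaintenanceWindow string // the only time window in which the cluster may be rolled
}

// Worker is one worker pool of the shoot.
type Worker struct {
	MachineType string
	Image       string
	Min, Max    int
}

func main() {
	shoot := ShootSpec{
		Name:              "botanist",
		Provider:          "aws",
		KubernetesVersion: "1.20",
		Purpose:           "evaluation",
		Worker:            Worker{MachineType: "m5.large", Image: "garden-linux", Min: 1, Max: 2},
		HibernationCron:   "0 17 * * *", // every day at 5 pm
		MaintenanceWindow: "220000+0000-230000+0000",
	}
	fmt.Printf("creating shoot %q on %s with %s workers (%d-%d)\n",
		shoot.Name, shoot.Provider, shoot.Worker.MachineType, shoot.Worker.Min, shoot.Worker.Max)
}
```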
B: What I also see is a tracker: it says "create processing". The tracker keeps me up to date on what's happening right now, with some detailed messages, and it says it's deploying the external domain. Let's now look at the backend. At the Gardener API server I would expect to see another shoot cluster, which we just created via the dashboard — and I already see botanist as a new shoot cluster, and it says the creation is processing.
B: Yeah, I see a message which clearly states that it has been scheduled to the seed, which is AWS. It has a default, and it can also be plugged with different kinds of scheduling strategies if you want, the way we do with the kube-scheduler. Perfect — if it's assigned to the AWS seed, then there is the gardenlet running on the AWS seed cluster, or the AWS management cluster, and here I would expect this gardenlet to have at least started doing something.
B: At least it should have fetched the definition and started to create the control plane pods, and I already see that there is a namespace referring to our botanist cluster, and it has already started processing that cluster. I think it's doing something in the background; we'll get to know that in the next terminal. Then I can also take a quick look at the Gardener controller manager now, since from the diagram
B: I know that the controller manager is responsible for taking care of other lifecycle aspects of my shoot cluster, and I see that hibernation and maintenance are basically sub-controllers of it and they are already taking care of those aspects of my cluster. And now the most important, or the most interesting, aspect of the whole system — let's see whether we get it right.
B: Okay, so I see that there are already API server and etcd pods, and I think more pods are coming. We also have a nice logging and monitoring setup with Loki, and on the other hand I would later on also be interested in looking at the workload cluster. But essentially, if you try to recall what we just learned in the diagram: from the API server it has reached the Gardener scheduler, from the scheduler to the gardenlet, and from the gardenlet, in parallel, to the Gardener controller manager, and now on the management cluster
B: we see something happening, and the dashboard clearly shows that the creation is ongoing. I think it should take five to seven minutes or so on the infrastructure, so I would suggest we take a look at some other key features, Smarth, meanwhile, while the cluster is being created.
B: Good. So, to add to the whole thing: what we saw was day one, in general. Creating a cluster, even creating thousands of clusters, is still okay, but what's really more fascinating is what's going to happen on day two, or day three, and so on. We already have customers, or people, who would generally create lots and lots of workloads on their clusters, and in such cases we don't want our API server to die.
B: We don't want our other control plane components to be exhausted. So what's there to save us is basically the horizontal and the vertical pod autoscalers. What they do is really fascinating: they autoscale the control plane pods both vertically and horizontally, as the situation demands. And then we have etcd backup and restore. This is our savior for disaster recovery, where a sidecar container keeps taking snapshots of etcd.
B: Let's say every one hour — this is perfectly configurable — it takes a full snapshot, and then it takes delta snapshots every few seconds, and at any point in time, if things go south, it would restore the entire cluster using the snapshots taken previously, giving us a kind of point-in-time recovery with a loss of only a few seconds of data.
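A small Go sketch of the backup cadence described — full snapshots on a long interval, delta snapshots on a short one; the snapshot functions are stubs, not the real etcd backup-restore sidecar, and the intervals are shortened so the example terminates quickly.

```go
package main

import (
	"fmt"
	"time"
)

// runBackups ticks off the cadence described in the talk: full snapshots on a
// long interval, delta snapshots on a short one. Real uploads are stubbed out.
func runBackups(full, delta time.Duration, stop <-chan struct{}) {
	fullTick := time.NewTicker(full)
	deltaTick := time.NewTicker(delta)
	defer fullTick.Stop()
	defer deltaTick.Stop()
	for {
		select {
		case <-fullTick.C:
			takeFullSnapshot()
		case <-deltaTick.C:
			takeDeltaSnapshot()
		case <-stop:
			return
		}
	}
}

func takeFullSnapshot()  { fmt.Println("full etcd snapshot uploaded") }
func takeDeltaSnapshot() { fmt.Println("delta snapshot of recent writes uploaded") }

func main() {
	stop := make(chan struct{})
	// In the talk: full every hour, deltas every few seconds. Shortened here
	// so the example finishes quickly.
	go runBackups(2*time.Second, 500*time.Millisecond, stop)
	time.Sleep(3 * time.Second)
	close(stop)
}
```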
B: The next one is also my favorite, where Gardener goes one step beyond: it actually does automatic seed provisioning.
B: Now, if you look at the design, we assumed that there are always X number of management clusters available. But what if we have a sudden increase in the number of clusters, and someone just creates 1000 more clusters, and we don't have enough capacity in the existing management clusters? Gardener also offers such features, where a new management cluster is automatically added.
B: I would say that this is because the cluster autoscaler basically always scales our super management cluster, the management clusters and the actual shoot clusters, and it works in a cloud-agnostic fashion, as we discussed, because it builds on the machine API — Gardener's machine API. It basically only has a common-denominator requirement: if a cloud provider has a create-machine and a delete-machine API implemented, which is like the bare minimum, then the autoscaler would be able to do its job of autoscaling.
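A hedged sketch of that common-denominator contract: an autoscaler that only ever talks to a create-machine/delete-machine interface stays cloud agnostic. The interface is illustrative, not the actual machine-controller-manager driver API.

```go
package main

import (
	"context"
	"fmt"
)

// MachineDriver is the lowest common denominator the autoscaling machinery
// needs from any cloud (or bare-metal) provider, as described in the talk.
type MachineDriver interface {
	CreateMachine(ctx context.Context, name, machineClass string) error
	DeleteMachine(ctx context.Context, name string) error
}

// fakeDriver stands in for a provider implementation, for example one for AWS.
type fakeDriver struct{}

func (fakeDriver) CreateMachine(_ context.Context, name, class string) error {
	fmt.Printf("provisioning %s as %s\n", name, class)
	return nil
}

func (fakeDriver) DeleteMachine(_ context.Context, name string) error {
	fmt.Printf("terminating %s\n", name)
	return nil
}

// scaleUp shows how an autoscaler can stay cloud agnostic: it only ever talks
// to the driver interface, never to a specific cloud SDK.
func scaleUp(ctx context.Context, d MachineDriver, count int) {
	for i := 0; i < count; i++ {
		_ = d.CreateMachine(ctx, fmt.Sprintf("worker-%d", i), "m5.large")
	}
}

func main() {
	scaleUp(context.Background(), fakeDriver{}, 2)
}
```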
C: That is pretty cool. Probably we want to re-look at the cluster creation state here — looks like it is done. That's beautiful.
C: Wow, that was really good. And I already see that we have a set of adopters who are running Gardener and managing thousands of clusters with ease. Obviously at SAP we use Gardener internally for development purposes and also for production workloads, and it is utilized by software developers and all the lines of business across the globe. Gardener creates, hibernates, scales and deletes hundreds of clusters on a daily basis.
C: Gardener is operated by a central platform team, and its widespread usage within SAP leads to synergies in total cost of development and reduced cost of operations in a multi-cloud environment. Gardener is also in use by other cloud providers, such as Fi-Ts, who have extended Gardener into their metal-stack, StackIT, and T-Systems.
C: 23 Technologies applies Gardener's multi-provider feature for Gaia-X, a federated European sovereign cloud initiative. PingCAP, the makers of TiKV and TiDB, run their commercial database-as-a-service offering on top of Gardener landscapes, and many of the same reasons apply: running critical applications and systems of record requires you to have complete access to your control plane, and each component of Gardener is completely independently
C: consumable, which is why it has also witnessed some nice innovations, wherein adopters have powered Raspberry Pi with Kubernetes using Gardener's machine API implementation, and Gardener also often sees external contributions from adopters to support next-generation use cases, such as spot instances for Kubernetes nodes. And, interestingly, the innovation does not stop in the infrastructure domain.
C: As a user, you just need to annotate or apply a custom resource in your cluster to consume these value-added but managed services. So think about it: Gardener has the minimal architecture that is needed to provide all types of related services — think Istio, Linkerd, or think Crossplane — in a managed way. So, for us and our community, Gardener is more than just a Kubernetes-cluster-as-a-service.
B: Hey, and yeah, of course, towards the end, the bullseye question, or the million-dollar question: what's the relation with Cluster API? We know what Cluster API is — it's a great community project which has a very similar purpose — and we are often asked about it. So, in general, with the latest Cluster API specification it is possible to delegate the specifics of control plane management to a separate control plane provider.
B: That's the extension model of Cluster API. There is already a battery included, which is the kubeadm control plane provider, and that works pretty well with dedicated control plane machines. But with the whole concept of control plane controllers being an extension, what we are planning to do is basically to have another control plane provider, which is going to be the Gardener control plane provider.
B: I can probably answer that question. The question is: doesn't replication of management clusters across regions again increase the cost factor? Certainly, and I think we answered that already in the flow. Having more management clusters has basically two effects: one is the complexity of managing more clusters, and the other one is the cost. The first one is of course targeted by the super management cluster, and the second one is targeted by having a very well-defined autoscaling of the management clusters themselves.
B: This is a very, very interesting question. This question falls slightly outside the bucket of Gardener, but we do have an ecosystem where we want not only to take care of the cluster management but also to take care of the lifecycle of the applications which are going to run on top of the clusters. You can probably find that under the Gardener organization, or I could share the link later — just follow up on it.
A: Hey, thank you all very much, everyone — thanks for joining us. Again, the recording and slides will be up later today on the website. Thank you again for attending another CNCF live webinar, thank you to our presenters, and everyone have a great day; we'll see you next time.