Description
Multi-cluster management is hard. Technology, teams, and culture clash in a race to deliver clusters and applications in a secure and compliant way. Red Hat Advanced Cluster Management for Kubernetes (RHACM) provides the capabilities to address common challenges that administrators and site reliability engineers face as they work across a range of public and private cloud environments. Clusters and applications are all visible and managed from a single console—with security policy built-in.
B
Good morning, good afternoon, good evening, wherever you're hailing from. Welcome to another episode of Red Hat Advanced Cluster Management Presents. I am joined by the team behind RHACM, one of my favorite products here at Red Hat, because it does such an amazing job at multi-cluster management. So I'm going to hand it over to Scott Behrens to tell us what we're talking about today.
C
I'm a product manager and I love to solve problems in the multi-cluster management space. That's why we're here: to talk about what we're doing with RHACM and management at the edge, a very exciting topic. I'm also going to introduce my colleague Brad. He's new to the team, but his focus is in this telco edge-scale space. Brad, go ahead and introduce yourself.
C
Just soaking in the sunshine. In terms of the actual technical horsepower and the real brains behind this, I'm going to turn it to Hal, and Hal can introduce himself, and we'll pass it around the team.
E
Hello, my name is Hal. I've been on the ACM team for a long time; I was there since the initial POC of the product. I've done a lot of random things on the team: I helped design the cluster lifecycle bits with the Hive integration, I'm also on the CI/CD team, and now I'm focused on getting ACM to scale and helping expand into the edge arena.
A
Hi everybody, my name is Chris Doan, and I've been with ACM for quite a long time, but I'm actually from the SRE squad. Somehow Hal was able to wrangle me onto this far-edge effort, and I try to contribute wherever I can. But yeah, glad to be here. I'll pass it on to Alex.
H
Yep, hi, I'm Alex Cross. I'm the one member of the team that's actually on a different team: I'm on the telco 5G performance and scale team based in Raleigh. I'm actually on my second tour of duty with Red Hat, which makes me a boomerang employee.
C
And we understand that the notion of a cluster being this gigantic thing, with hundreds of nodes and just a large footprint, a multi-tenant cluster, still exists, and we do see that, but less and less. We're starting to see smaller clusters. We have new topologies coming out, like compact clusters, with a shared three-master, three-worker kind of scenario, and then, as that gets smaller, you see something like single-node OpenShift.
C
We don't even really want to call that a cluster; maybe that's a whole different debate over a pitcher of beer. But we're in this space where we need to have a smaller footprint, and you need to be able to manage it. So there has to be enough tooling, enough componentry in place to manage that thing out on the edge, and that's what we call single-node OpenShift. That's been introduced as part of the 4.8 release that's coming out, and our team has been working with it day and night.
C
So let's just say, arbitrarily: you've got to finish this in 10 hours, and you have to be able to deploy a thousand of them. Ready, set, go. How would you solve that? What we're here to talk about today is some of the growing pains, some of the learnings, some of the stories that we've gone through, and why we have the gray hair we do, to get to the point that we're at, which is incredibly awesome: we can deploy a thousand clusters.
E
First of all, a little bit of background. Scott came to me with this around the end of last year, and I was like, you want what now? For a little piece of history: at that point we had only tested ACM, or were only able to test ACM, up to 50 clusters, and that was with us still begging and borrowing clusters to be managed by ACM.
E
With the resources that we have, we were only able to test ACM managing up to 50 clusters for a short period of time. So, understanding that a thousand is an order of magnitude higher than 50, I was like, okay, that sounds fun. Let's do it!
E
Let's do it, right? So there are a couple of early lessons that we learned that I think are just fascinating and generally applicable for any web app that we build. The first thing I tried after Scott approached me was: okay, I'm going to go stand up an OpenShift cluster and see how many namespaces I can create, because in ACM every single cluster has its own namespace, which serves as our RBAC border to contain the resources that the managed cluster can access.
E
So, let's see how OpenShift responds to a thousand or two thousand namespaces. We started with a simple script, looping through creating namespaces, and I found that, what the heck, after about two thousand namespaces the control plane crashes. At that time I started to panic a little bit: oh crap, am I setting myself up for something that's not doable? So we learned our first lesson, and let me show you a little bit of a graphic about this.
E
I took the weekend, did some reading, and ran across this document, a comparison of the different storage types on AWS. At that time we were mostly testing on AWS, because that's just the most highly available resource that we have. gp2 is the default storage that we use, and one thing I found out is that it's got a burst budget, meaning that when you first provision it...
E
It performs fantastically, right? IOPS comparable to io1. But as we exhaust that burst budget, the IOPS tank. By default, I think we carry 300 GB of storage, and that's only about 900 IOPS after we exhaust the burst budget. So that's what we saw. The first lesson here, when you build a cloud-native application that deploys on a cloud provider, is that there's a hidden wall. You wonder why your wonderfully built application doesn't scale? Storage.
E
Storage, I guess, should be the first thing that we take a look at. Once we replaced our storage with io1, with a reasonably high IOPS, like 3,000, I ended up being able to work with tens of thousands of namespaces without a problem.
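Hal's numbers line up with the published gp2 behavior: baseline throughput is 3 IOPS per GiB, and a fresh volume can burst to 3,000 IOPS only until an initial bucket of 5.4 million I/O credits drains. A small stdlib Go sketch of that arithmetic (the 300 GiB size is the figure from the episode; the constants are from AWS's gp2 documentation):

```go
package main

import "fmt"

// gp2Baseline returns the baseline IOPS for a gp2 volume: 3 IOPS per GiB.
func gp2Baseline(sizeGiB float64) float64 { return sizeGiB * 3 }

// burstMinutes returns how long a freshly provisioned gp2 volume can
// sustain its 3,000-IOPS burst at full load, draining the initial
// 5.4 million I/O credit bucket at (burst - baseline) credits per second.
func burstMinutes(sizeGiB float64) float64 {
	const bucket, burst = 5.4e6, 3000.0
	return bucket / (burst - gp2Baseline(sizeGiB)) / 60
}

func main() {
	fmt.Printf("300 GiB gp2 baseline: %.0f IOPS\n", gp2Baseline(300))
	fmt.Printf("burst window at full load: ~%.0f minutes\n", burstMinutes(300))
}
```

At 300 GiB the baseline is 900 IOPS, and a fully loaded volume exhausts its burst in roughly 43 minutes, which matches the "performs fantastically, then the IOPS tank" experience.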
E
And you did that.
E
Now, the checkbook actually didn't help that much, because AWS throttles your API. You can't actually create a thousand clusters with a snap of your fingers on AWS, because certain APIs are very limited, like the ones where you create DNS zones; those are really limited.
E
Yes, exactly. Hi, I'm a seasoned performance engineer, I'm here to help. Wonderful. So I would like to pass it to Alex to talk a little bit about how we ended up approaching this problem.
H
Sure. So when I joined to help here with ACM, it was probably around November or so. Having been seasoned with the scale lab and the hardware that we have available at Red Hat for scale testing and whatnot, I pretty much knew what our capabilities were. The first test that was asked of me was: let's just see how many OpenShift clusters we can get off of a certain chunk of hardware. The first iteration of that actually involved OpenStack.
H
So we had requested some hardware, and we got around 32 nodes or so. We deployed OpenStack on top of that, deployed a hub cluster, and I tried to deploy as many spoke clusters as I could, after sizing everything down as small as I possibly could. We actually only got about one hub cluster plus 55 spoke clusters, so we really got no further for testing than what Hal had done in the past, begging and borrowing everybody's clusters he could find out of AWS and whatnot, which...
C
Was also a fun exercise, as we were cobbling together clusters from every different line of business that we could find. Oh, you've got three, you've got two, here's five over here. Anyway, that's a story for another day. But Alex, you're on the right path; you were the shining light that figured out how to actually get these resources in place.
H
About that time, that's when I was playing with ACM as well, and seeing that when you manage clusters, you're creating a namespace for each one. I had obviously heard the goal there of a thousand clusters being managed by ACM, and my first thought was, well, shoot, that's a thousand namespaces right there, and previous performance testing of OpenShift has stressed the namespace dimension, of course.
H
OpenShift and Kubernetes are multi-dimensional in terms of where the scalability limits are. You could create a ton of namespaces, and if they don't have a ton of other resources in them, that might work fine. It's really multi-dimensional, so you really have to test with a real environment, closer to what a customer would actually have deployed.
H
We had actually asked for more nodes once we got to the 55 clusters, and throughout that we also had to manage through various infrastructure issues: this cluster is not working, or this build was not working. But anyway, we improvised, we added more hardware, and then we actually got to the point where we could decrease the size of the OpenShift clusters themselves. So, rather than 55 full clusters of three masters and two worker nodes each, we shrunk that down into SNO clusters. At that point in time, however, because we were still using OpenStack, we ran into other scaling issues, issues that had nothing to do with how SNO, or single-node OpenShift, was supposed to be represented as a far-edge cluster.
H
One of those was that OpenStack would still create a bootstrap node, so we had to plan the capacity of our cloud around that. But after we worked through all that, with about 64 nodes, carving off a few pieces of hardware that may have failed, we actually got up to about 320 clusters, and that's when we started hitting the scaling limits of what we were doing there, with having that infrastructure-as-a-service layer, OpenStack.
H
So that's when we made our last pivot, which has been our most recent test: we actually just removed the OpenStack layer and went completely bare metal. We made our hub cluster completely bare metal. We had to do something similar to what I mentioned about passing through NVMe: the NVMe is right there for the hub cluster, but you still have to allocate it, so what we do is use Ignition configuration to get etcd mounted on the NVMe.
H
We also use the NVMe on our worker nodes to serve as local storage; that's how we solved storage for our bare-metal cluster. And then for our spoke clusters, for management, we actually have just pure RHEL with libvirt hypervisors. Depending on which piece of hardware we have from the lab, we can fit up to 17, or seven, depending on the sizing that we saw previously with the capacity analysis that we did on the hardware.
C
A lot of gymnastics, a lot of head banging and head scratching to get to that point. That was, what, a couple of months of just learning what we have access to and how we can maneuver basically more clusters into a more dense test...
H
Bed, yes. And one of the other big things with that pivot, when we changed over and started to use SNO on libvirt, is that we actually had new technologies integrated into OpenShift at the time. So, instead of having ACM working with Hive to provision clusters, there's now ACM with the assisted installer with Hive, boom, and that's...
C
The next generation of technology that's coming out. It's actually already available at cloud.redhat.com as a SaaS, and it is a tech preview offering to start carving out bare metal in your data center, with discovery ISOs and all this magic. But that pivot point is key, because now we're bringing that technology into the on-prem space. So, Alex, talk us through what that looked like and how the team responded.
H
Yeah. So the biggest savior there: one of the other scaling limits that I neglected to mention a little bit earlier was that when we were on top of OpenStack, we had to do much more planning for the hub cluster, and not just etcd on NVMe. Hive would create an installation pod, and that pod required 800 megabytes of memory, almost a gig. So if we wanted a high concurrency of installations, we had to create enough worker nodes to host all of those pods.
H
In addition to that, it would actually download an image file that it would then serve, and that would consume ephemeral disk space. So we had to plan around memory and ephemeral disk space on those nodes. In reality, though, all the installation is happening on the remote machine, so why can't it just happen there?
H
Well, thankfully, we had the assisted installer, and that's what got us there. That really shaved down the resources for our hub cluster, and actually, once we moved to bare metal, where we had originally planned for extra nodes, we ended up getting extra hypervisors instead. That's what allowed us to end up scaling up to greater than a thousand clusters, with roughly a hundred or so nodes in the lab.
E
Yeah, the amazing guidance from Alex enabled us to just go ahead and test our system, to see where the pain points are. We found a lot of design choices and implementation choices that we made that can be improved, as well as this new assisted-installer technology that helps us address the spike of resource utilization during cluster provisioning time.
E
So the customer doesn't have to plan for excess resources that only get used during provisioning time and afterwards just sit there doing nothing, which is wasteful.
E
Yeah. Well, one of the interesting assumptions: sometimes people think scalability is kind of linear. Not actually true. Occasionally there's just one of these points where, at this number, stuff just completely disintegrates, and these are the things that we are really not able to see until we have the resources, until we have the actual clusters to play with. But Han is one of my favorite software developers (please follow him on GitHub, it's awesome), and there are a lot of things...
E
There are a lot of lessons that we learned during this journey, and that helped us tune our operators to break through these bottlenecks.
C
And this is a big moment, because this is the mentor sharing kudos with the protege. I remember seeing Han, who's been growing under your leadership, Hal, and seeing him take off in this space. So take it away.
G
There you go, yeah, cool. Thank you, Hal. I did prepare some slides this morning, and, as I mentioned, this release we achieved the thousand-cluster goal and we learned a lot. As I mentioned, my job is basically developer; I've been working on controllers for several releases, and this release...
G
They just kept crashing when we had a thousand clusters, and the reason is basically out of memory. It's pretty easy to just increase the memory limit, but that is not elegant, and that's not the solution we want. So we did some investigation, and there are some gotchas there that I want to share. And another thing is about performance.
G
I mean, the speed was too slow, and it's always possible that we can refactor our logic, but again, there are some very easy solutions we can choose, and I also want to share those. First, let's talk about the memory: we were getting OOM-killed, and after some investigation it turns out it's because of the cache. Some background about the cache: if you're using a Go client to contact Kubernetes, and here we are using controller-runtime, most of the Go clients...
G
They actually have a cache in the background. So when you're using a client to do a watch (the Kubernetes design: you watch some resource, and when it changes you do some reconciliation), there are actually background goroutines doing the caching, and they will save every change in the cache. Yeah, and...
G
If you're doing a list or a get with the client, they actually also use a cache: they will copy everything into the cache and save it all. That's something we didn't fully appreciate before. We actually knew, but we didn't realize how badly it can affect our performance, especially if you are just getting one result. Like, you get one secret in the cluster, you just want one Get call, and in the background it will cache every secret in the cluster.
G
So that's not something we want. We figured out the solution, and it's pretty easy: since the cache is the problem, and we actually don't need to cache everything, we just cache the results we care about. For secrets, in each namespace we only have one secret we care about; we don't need every secret in the cluster. So we just don't cache anything we don't need.
G
So, first, I want to recommend that if you can, you choose a namespace-scoped client. Second, if you can use labels to select the resources and reduce the cache, just use some labels. And the third thing: don't ever cache all the secrets of the whole cluster. That's a lot of memory, and I will show some examples; we had most of our controllers crash because they were caching the secrets.
G
The secrets get super large in these clusters. And another exciting piece of news just happened this week: controller-runtime, which is a very popular library for controllers, released 0.9, and in this release there's a builder with options. With this configuration, a user can easily configure the cache to add label selectors, or use any selectors they want, to cache only what they need. So this is...
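As a pseudocode-style sketch of the release Han mentions: controller-runtime v0.9 added `cache.Options.SelectorsByObject`, so a manager's cache can be limited to objects matching a selector instead of caching every Secret in the cluster. The label key below is a hypothetical example, not ACM's actual label, and the snippet is an illustrative fragment rather than a complete program:

```go
// Sketch only: restrict the manager's cache so Secrets are cached
// only when they carry a specific label, instead of every Secret.
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

func newManager() (ctrl.Manager, error) {
	// Hypothetical label: only Secrets carrying it get cached.
	sel := labels.SelectorFromSet(labels.Set{"example.io/watched": "true"})
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		NewCache: cache.BuilderWithOptions(cache.Options{
			SelectorsByObject: cache.SelectorsByObject{
				&corev1.Secret{}: {Label: sel},
			},
		}),
	})
}
```

The same `cache.Options` struct also accepts a `Namespace` field, which corresponds to Han's first recommendation of namespace-scoping the client.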
C
So here, this is awesome, because, Chris, I know we get on your show and it's a bunch of smoke and mirrors, but this is real, bona fide development stuff. Right now I'm just in awe of this team and the way they brought this together. Sorry, I didn't mean to throw you off your game, but yeah.
G
It's okay; we're also really excited about this. And here's the example of one of our controllers. It caches secrets, and because we didn't realize we should use labels or any other technique, we just cached everything. So this is before, when we were caching everything. You can see we have a thousand clusters, and on OpenShift each cluster will create a namespace, and in each namespace there will be three service accounts. Each service account...
G
They each have two secrets: one for the service account token and the other for the Docker config. All those secrets, that's like 6,000 secrets, add up; they can be several hundred megabytes, let alone the other secrets actually created for each component or controller. We don't actually need those secrets, so we just used the...
G
Filter with labels, and boom, we just reduced memory by 500 megabytes. Now it's just about nothing; before it was 500 megabytes, and now, nothing. So, considering we have a couple of controllers caching secrets, we actually reduced memory by several gigabytes. That's a lot for us, and we are super happy with this result. And another thing is about performance.
G
Basically, performance was too slow, because we have a thousand clusters, and we should know that there's no one-size-fits-all solution for performance tuning. Sometimes the only solution is just refactoring, but sometimes it can be very easy, because there is always some configuration available. The examples here are all for controller-runtime. First, there are the client QPS and burst settings. The QPS applies when you're using the Kubernetes client to do a get or a list, or mainly an apply, update, or patch.
G
Something like that. The QPS is the limit, and the default is just 20; the burst is a buffer on top of it, defaulting to 30, so you can momentarily do up to 30. If you are doing a lot of requests through controller-runtime, you will see a lot of throttling messages in the logs, and at that point maybe you can consider just scaling up the QPS and see if it helps solve the problem. The example is...
G
We apply a thousand manifests to one cluster, the hub cluster, and we want to apply them in one reconcile. Because of the QPS, it took like 30 or 40 seconds for one reconcile, which is super slow, and after we changed the QPS to 200, it's just several seconds; it's super fast now. Another thing is the workqueue rate limiter: every time you watch a resource, it will trigger a reconcile, and there's a rate limiter there whose default is 10.
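Han's timings are consistent with simple token-bucket arithmetic: issuing N requests through a limiter with rate q and burst b takes roughly (N - b) / q seconds. A stdlib Go sketch (the 1,000 manifests are the figure from the episode; QPS 20 and burst 30 are the client defaults he cites, and burst 300 alongside QPS 200 is an assumed proportional bump):

```go
package main

import "fmt"

// estimate returns the rough wall-clock seconds needed to issue n API
// requests through a token-bucket rate limiter with the given qps and
// burst: the first `burst` requests are free, the rest pay 1/qps each.
func estimate(n, qps, burst float64) float64 {
	if n <= burst {
		return 0 // absorbed entirely by the burst buffer
	}
	return (n - burst) / qps
}

func main() {
	const manifests = 1000
	fmt.Printf("QPS 20,  burst 30:  ~%.0fs per reconcile\n", estimate(manifests, 20, 30))
	fmt.Printf("QPS 200, burst 300: ~%.0fs per reconcile\n", estimate(manifests, 200, 300))
}
```

That gives roughly 50 seconds per reconcile at the defaults versus a few seconds at QPS 200, in line with the 30-to-40 seconds versus "several seconds" Han observed.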
G
If you think it may help you speed up your controller, then maybe you can tune that one. Another one is MaxConcurrentReconciles; the name of this configuration explains everything. Basically, you can add concurrency. The default is always 1, so you're running one thread, and if your task is very time-consuming and can be done in parallel, I think this configuration can be helpful. Our example is that we apply a manifest when we are importing clusters. By importing...
G
Actually, we are just applying some manifests on the remote clusters, and after we apply the agent on a remote cluster, the agent comes up and gets the cluster imported. So we apply the manifests on the remote clusters, on every one of the thousand clusters, and because we only have one thread, it happens linearly. And because a remote cluster can be super busy, it takes a long time, and it adds up. We have a really fresh example here, from last month.
G
We ran an experiment with a thousand clusters. The orange line means the cluster install is complete, the SNO, single-node OpenShift, install finished. And after every cluster is finished, we expect our controller to automatically import the cluster so that ACM can manage it. So the green line is managed, but the managed process is just applying manifests, so we were expecting it to be super fast.
G
It shouldn't take very long. But let's see: the cluster install only takes three hours, while the import actually takes four and a half hours. There's a one-and-a-half-hour difference, which we didn't expect, and after some investigation we found it's because we only have concurrency 1, and also because these are remote clusters that are just finishing install and have a lot going on. So when we apply a manifest, it takes a while, like 10 or 20 seconds, and that adds up, because we only have a single thread.
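The gap Han describes is what you would predict from serial execution: a thousand applies at 10 to 20 seconds each, on one thread, is hours of wall-clock time, while concurrent workers divide it by the worker count. A stdlib Go sketch (15 seconds per apply and 10 workers are illustrative assumptions; in controller-runtime the knob is the `MaxConcurrentReconciles` controller option):

```go
package main

import "fmt"

// wallClockHours estimates the total wall-clock hours to run `tasks`
// reconciles of `secsEach` seconds spread over `workers` concurrent
// workers, each handling a near-equal share of the tasks in sequence.
func wallClockHours(tasks, workers int, secsEach float64) float64 {
	perWorker := (tasks + workers - 1) / workers // ceiling division
	return float64(perWorker) * secsEach / 3600
}

func main() {
	fmt.Printf("1 worker:   ~%.1f hours\n", wallClockHours(1000, 1, 15))
	fmt.Printf("10 workers: ~%.1f hours\n", wallClockHours(1000, 10, 15))
}
```

With one worker the estimate is about 4.2 hours, close to the unexpected 1.5-hour overhang Han measured on top of the 3-hour install window; ten workers bring the same load to roughly 25 minutes.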
G
Yeah, that's super cool, and we're super happy. Let me draw some conclusions. Refactoring is always good if you have time, but we don't. So before you try to increase the memory limit, maybe think about the cache; and before you refactor, maybe think about QPS and concurrency. That's everything I want to share. Thank you.
E
Yeah, and clearly in the community that uses controller-runtime, the cache problems have definitely been observed, or else we wouldn't have seen that change to implement the filtered cache come up. It's serendipity; it just happened at exactly the same time we needed it. Han would probably have gone and contributed it himself, if...
E
But he was too slow; he didn't get that PR in in time. But it's just wonderful now. The last graph Han brought up shows how fast we were able to provision clusters. Holy crap, that was a thousand clusters within three hours. That would not have been achievable without a significant amount of resources if we were using infrastructure provisioning...
E
Sorry, installer-provisioned infrastructure, IPI, which is essentially what you do when you run the OpenShift installer, in conjunction with Hive, just because of the sheer amount of resources that we would need to pre-provision and pre-prepare in order to achieve the concurrency that we need.
E
Well, we mentioned the assisted installer already, and Crystal has a really well-written document that kind of describes what the magic is here that makes it different, and that reduces the resources we need to prepare, and that allowed us to achieve this thousand-plus cluster provisioning in three...
D
Yeah, no, that was beautiful. The DC compute dropped there; you could see it go from yellow to green on the graph, and that was just a beautiful thing. I know there's a lot of hard work behind the scenes, lessons learned, and it's just optimizing the bits before they're out there, so great, phenomenal work. Scott, would that stuff show up in cost management, you know, knowing it's a SaaS offering?
C
Yeah, I don't know if they've connected the dots. That's a great idea, though. If it's in AWS, I think they would probably already have that. You're talking about OpenShift 4.8, which is not GA yet.
E
Oh, sorry. Han showed how we were able to provision a thousand clusters within three hours, and I just wanted to spend some time to let Crystal kind of show us what's going on, what the magic is here that allowed us to achieve that.
C
So, from the SRE perspective, Chris, you've had your eye on metrics, data gathering, usage graphs, all that kind of stuff. Enlighten us: what are we missing out on here?
A
About data gathering: as we've been doing these tests, we've always been collecting the metrics for our provisioning time, like the graph that Han showed. I think one of the things that we'll have to roll back into the release is that we're generating these metrics today, but it would be even better if these metrics were captured and stored within our platform so that we can query them.
A
I think we query a bunch of metrics today already, but these metrics aren't that easily accessible, so that could be one set of metrics that we could roll into the product. And if we can roll them into the product, then, in my mind, if customers want to replay the work that we've done here in their own environment, they could re-qualify our results, and that could bolster their confidence in our platform, as we present or make public, for example, the automation that we constructed to get to this point. That could, or should, also be made public, so that customers...
A
It's really high; it's three percent failures or issues.
A
That may be attributed to the environment. We are using a scale-lab environment, but these are still virtual bare metal as well, so there could be some nuances in the environment that lead to some...
C
We intend to deliver that as a dev preview in version 2.3, coming in July. So that gets us to this point of: I'm deploying clusters, and I'm going to come back and say, so what? Okay, you did some good work, but so what? I want to manage at the edge, and I need tools to do that. I need policy, I need compliance, I need to be able to configure something centrally. So: policy.
A
That's kind of what the slide that Han was showing covers as well: the fact that we can provision these SNO, single-node OpenShift, clusters using the assisted installer. But then the next part is that we actually import those managed clusters into the hub, and once you have the managed clusters imported, that opens up the window for the rest of our RHACM capabilities: policy management and application lifecycle management. Focusing on policy, that's the day-two configuration that you were mentioning.
A
As long as the configurations are controlled by OpenShift operators, you can pretty much define any kind of policy to modify or constrain those behaviors, and by creating a policy you can distribute it across those ten thousand, or one thousand, managed clusters. You can, and I'm jumping the gun there, consistently keep your fleet consistent.
C
You mentioned operators, but this would be, you know, Kubernetes resources, really anything that you can describe within a piece of YAML. You can now start to define it as a desired-state model across your fleet, and in this case these could be dev clusters that operate differently from prod clusters, and those might operate differently on the west coast versus the east coast. Excellent.
F
So the magic in that is that policy comes from what RHACM deploys, known as the GRC framework, and it's had a lot of great work done to it, in that it's not only scalable across all these thousand managed clusters, but it's also able to deploy all these policies very fast and very efficiently.
F
So, in that sense, you get to manage your clusters and know whether they're compliant or non-compliant very, very fast. In our initial testing, we found that we started off with 100 policies deployed over these thousand SNO clusters, and it took about 90 minutes to propagate all those policies to all the managed clusters.
F
So you have about a hundred thousand objects from that, but after some tuning done by the rest of our team, and some efficiency scaling, we were able to get that down to about 10 minutes for propagating all those policies. This is a huge improvement; shout out to Ian, who's on our team as well.
F
He's the one who did the QPS tuning for that, the same tuning that Han mentioned earlier. So, with that performance in mind, it's an incredible improvement over something that was already very scalable in the first place, and that's kind of the magic of it: it was built up in the first place to be scalable, and then from there it was moved towards something more efficient.
C
The fine-tuning of the configuration to quickly ingest that policy definition and ensure compliance across the end-to-end fleet. And, like I think Han pointed out, that was a one-line change, and then we did the concurrency magic. So show me a picture; do you have something that kind of describes the journey that you went through with that policy?
F
Here we go. So this is kind of our initial findings document.
F
As you can see, when we first tested this out, we created a hundred policies on our hub, which then propagate to all of the managed SNO clusters, about a thousand or so, and that took about 1.5 hours. With this testing we wanted to see, (a), how long it would take the policy to propagate, and then, (b), once we switch it from inform to enforce, how long it would take to show up as compliant from all the managed clusters. That's why you see the bullet points.
C
So the difference there is subtle, but let's hit that for just a second. One of the things that our customers have told us is that they love the ability to check, and kind of use an audit type of framework, to see what is compliant and non-compliant, and we call that inform. So there's an inform mode, which is a YAML verb that says: just inform you of what's going on in terms of the compliance spec. But you're telling me you can actually enforce, so change that verb to enforce, and now I can make changes, right?
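The "verb" being described here is the `remediationAction` field on an RHACM `Policy` resource. A minimal sketch (the names and namespace are illustrative):

```yaml
apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: example-policy          # illustrative name
  namespace: policies
spec:
  remediationAction: inform     # audit only: report compliant/non-compliant
  # remediationAction: enforce  # flip this one verb and RHACM remediates
  disabled: false
  policy-templates:
    - objectDefinition:
        apiVersion: policy.open-cluster-management.io/v1
        kind: ConfigurationPolicy
        metadata:
          name: example-config
        spec:
          remediationAction: inform
          severity: low
```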
F
But of course this was just the initial investigation, and I have a graph right here that kind of shows the testing that we did for that. And as you can see, the amount of time: 1.5 hours to propagate, and then 1.5 hours to fully switch from inform to enforce. But after the efficiency QPS tuning, it dropped down to 10 minutes for each of those things, so 10 minutes to propagate initially, and then 10 minutes when switching from inform to enforce. And I do not have a picture of that right now, but take my word for it.
E
E
C
How many different clusters do I have to log into, and where do I need to set the context? I'm like: no, no, no, that's the problem we're solving, is that you don't have to jump into context on all these different clusters. We provide one interface for you to do all of that, to set those controls from one spot. And I forget, I think it was Chris Doan who was mentioning the GitOps part of this, where these policies are actually stored.
C
In a repository, you know, and being able to have a code source and a source of truth for what that policy should look like, and then designating that policy as what you want to distribute to the fleet and what they should all be compliant towards. I mean, that part of this story is the super powerful part: I don't have just one model or one way to introduce a policy. I have multiple ways; I can kubectl apply it.
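A minimal GitOps-flavored sketch of that workflow (the repo URL and directory layout are hypothetical): the policy YAML lives in Git as the source of truth, and applying it on the hub is a plain CLI operation, after which RHACM handles propagation to the placed clusters:

```shell
# Hypothetical repo layout: policies/ holds the Policy, PlacementRule,
# and PlacementBinding YAML that define what the fleet must comply with.
git clone https://example.com/org/fleet-policies.git
cd fleet-policies

# Applying on the hub is enough; RHACM propagates to the placed clusters.
oc apply -f policies/
```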
C
E
I really wanted Crystal to spend some time showing us the magic of assisted installer, and what's the difference between that and IPI, but I don't know if we have enough time for that. I...
C
Yeah, well, you know, we were put on this planet to help create clusters, right? We want OpenShift to be everywhere, and so IPI, installer-provisioned infrastructure, was the first model, the first tool that we really started with. Chris, you know more about that than anybody, so take it away. Tell us, tell us that.
E
Story. Crystal, not Chris Doan, sorry, the two names get a little close to each other. Oh, go ahead!
A
Yeah, it was Crystal, right, Hao.
C
F
Yeah, so assisted installer is the service that, like Alex mentioned, came in at the right time, at the right moment, that kind of helped funnel all these things that we're doing. And as I mentioned before, and Hao has mentioned before, they were using IPI with Hive in order to actually create all these clusters that they wanted to scale, you know, being at the far edge.
F
They found SNO clusters, but of course that came with a lot of disadvantages, like they mentioned: the ephemeral storage that was needed, or just the extra memory that was needed. So assisted installer kind of came in and was able to take on all of these installation procedures that are required to run on SNO clusters and move that away from the hub. So that way, you don't need to have these extra storage spaces, and you don't need to plan for any extra concurrency failures that would happen with IPI.
F
You just have the assisted installer kind of take it over to the cluster you want to provision and run everything on its own, therefore kind of increasing the success rate of these clusters, because with IPI there were failures due to unexpected, you know, memory issues. But with the assisted installer we got so many more clusters, and it was able to help us kind of provision all these thousand SNO clusters. So another thing with assisted installer, and I think, Hao, this is what you wanted me to show, was, I'll...
F
Give you a sneak preview, exclusively for this. This will come out in a doc, probably a little different, in our official RHACM docs. But this is the sneak preview of how, or what, assisted installer comes with, which is fantastic. Assisted installer enables something called zero touch provisioning, ZTP for short. So with zero touch provisioning, we just have these five simple steps. That is, once you put in everything you need to configure for assisted installer, and the configuration is fairly simple, but once that's going, it has this great feature, zero touch provisioning.
F
That is where the assisted installer takes over and provisions your cluster for you. So for your managed cluster, you don't need to actually go into the managed cluster at all, or onto the actual machine, to do anything. It just handles everything for you. And these are the five steps, which hopefully are very simple. So first, it generates the discovery ISO, which is an image used to boot the managed cluster, which you can see on the right side.
F
F
It boots this ISO for you, and then afterwards, once it's successfully booted, it will report hardware information back to your hub cluster. And when your hub cluster is aware of all the hardware information, it will then proceed to install OpenShift Container Platform on the bare metal machine, thus giving you the SNO cluster, with the single node running on that bare metal machine and then OpenShift on top of it. And then, after that, you have OCP. When it finishes installing, the hub will then, or the hub as Red Hat...
F
Advanced Cluster Management, will then take on that new single node OpenShift cluster as a managed cluster, and from there, that's where you get all the good stuff from RHACM, which is all the deployments of the add-ons and all the management that we previously talked about, with policy application, etc. So that's kind of the basic flow of ZTP, and of course all you need to do is log into your hub cluster, just do the provisioning, and let assisted installer take it away for you.
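The hub-side trigger for this flow is a handful of custom resources consumed by the assisted installer. A minimal, illustrative `InfraEnv` sketch, the resource that yields the discovery ISO mentioned above (names and namespace are hypothetical, and the full flow also involves `ClusterDeployment`, `AgentClusterInstall`, and `BareMetalHost` resources):

```yaml
apiVersion: agent-install.openshift.io/v1beta1
kind: InfraEnv
metadata:
  name: sno-site-1              # illustrative name
  namespace: sno-site-1
spec:
  clusterRef:
    name: sno-site-1            # points at the ClusterDeployment for this SNO
    namespace: sno-site-1
  sshAuthorizedKey: "ssh-ed25519 AAAA... admin@hub"   # placeholder key
  pullSecretRef:
    name: pull-secret           # image pull secret on the hub
```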
E
This really does abstract away a lot of complexity from how to provision a cluster on a bare metal machine, right? Before, you had to set up a provisioning network, to hold a provisioning server, to host a bootstrap; the setup was complicated. This one bootstraps in place: you don't need anything external, right? This is it. It reaches out, boots the machine, it forms a cluster, done. ACM manages it. You deliver whatever configuration that you want to.
C
Deliver your compliance model; from that point forward, it's under management. And you make it sound so easy, Crystal. Your team has worked beautifully too. I know this has been development under, you know, the pressure of creation, and you've created a diamond here out of the rough. But seeing your team work together with assisted installer, with metal, ZTP...
C
All the componentry that's come together into this package that ACM is delivering, it's, it's awesome. It's just really awesome; the way this team has performed has been brilliant. So anyway, I should stop sharing the kudos here in the last stretch. But, Chris, do we have any questions that have come forward?
E
Well, a presentation for that much information, it is kind of hard, and we are, it's still kind of under investigation a little bit, UX improvements, but we function pretty darn well with a thousand clusters at this moment. Now, we did just scope it to a couple of components and a couple of features for now; like, we focus heavily on policy, just because Scott says so.
E
Really true, that's really true. Right, right: monitoring and alerting, so that no one has to actually stare at the dashboard for a thousand clusters to figure out what's going on, like the centralized monitoring.
C
D
It's a good cast of 20, 28 or so by my count, right, on the far edge squad. And anyone we want to name drop and thank, I know even today we saw Emily demoing some new, some new stuff, and Randy, George, and others. Crystal, anyone who you want to name drop on the squad, or Hao, here, it's this special moment here as we're debuting some of this great stuff.
H
D
The additional three, but it's, I'm just proud to be part of the far edge effort, with the scale and performance, and then connecting with all the other components, as Scott was mentioning: observability and GRC. Thank you.
C
That's a great story that we're pushing for. So we'll have the bits in there as a dev preview; we'll be moving towards tech preview in the fall, and I think by that time, who knows, maybe that number is bigger. Maybe we'll be back here.
E
The general improvement that we have done to ACM is generally applicable, so in 2.3 you should expect ACM to use less memory and less resources in general, leaner and meaner.
B
Yes, fantastic work, team, seriously. Thank you so much. I can't wait to hear more about this journey and just pushing the edge further, if that makes sense.