Description
Like Uncle Ben said, “With availability-centric topology and incredible scale comes great upgrade complexity”, or something like that. Join Gurney, Joydeep, and their guests Ian Miller and Jun Chen, members of Red Hat’s Telco Engineering team, to discuss Topology-Aware Lifecycle Management (TALM). TALM aims to tame the complexity of upgrades and configuration changes across a fleet of edge appliances by integrating topology awareness into lifecycle operations.

Gurney: Hey folks, welcome to the Cloud Multiplier. I am here, as always, with my co-host Joydeep Banerjee, and today we are glad to welcome two guests from Red Hat's Telco Engineering team. We have Ian and Jun here today, so welcome to the show, folks. Today we're going to be talking about (and this is the brief bit I know ahead of time) taming upgrades at massive scale. So we're going to be talking about upgrades and configuration changes, and all the problems you face with those big changes at great scale. It'll be really interesting today. I've been promised some cool demos, but before we kick off: Ian, Jun, do you want to tell us a little bit about yourselves? Give us the intro, talk about what you've been doing. We'll start with Ian, I guess; you're at the top of my screen.

Ian: Sure. I'm really glad to be able to join you guys today. My name is Ian Miller and I work here at Red Hat in the Telco 5G RAN group, where we're working on bringing OpenShift out to the 5G RAN edge of the network. I've been involved in networking within the telco space for quite a while, and I'm just really excited to be bringing that to OpenShift and helping OpenShift succeed in that area.

Gurney: That is amazing. How about you, Jun? Tell us about yourself a bit.

Jun: Sure, yeah. Jun Chen. I recently joined the same group as Ian a few months ago. Prior to that I'd been working with telco customers for a long time, so my last few months have been all about this TALM thing. Glad to have a chance to showcase it.

Gurney: That is awesome. Well, welcome, folks. Someone in chat has already said "long live telco cloud," so a lot of the issues we're going to talk about today are definitely telco scale. I've peeked around their open source repo a little bit so far; speaking of which, I'll go ahead and drop that in chat. But before we get into that, we have our usual pile of off-topic topics. To start with, I guess we'll go around the room. Joydeep?

Joydeep: Yeah, I'm about 75 pages in, so this is my ritual: when I finish my work every evening, I close the laptop and open the book, and usually I cannot make it more than five pages at a time, because you read something and then, did I understand that? You have to think about it, and then you have to refer back to something else. It's fascinating! Someday, Gurney, I'll take this to the stream and say: okay, this is a dump of a causal model in our space, guys.

Gurney: I'm going to do that. That is awesome. Since our last stream (I told Joydeep a little ahead of time, because we were nerding out a bit about it), for the first time in my life, having been in the industry for not all that long, I've started playing around with... well, I originally started, and this is me, a person who works at Red Hat, saying this, on Ubuntu and Debian-based distros in college. Then we had a little bit of work on various Raspbian-based Raspberry Pis, a little bit of embedded computing as we all did, and I've kind of settled into Fedora lately, as it seems a lot of folks have. Because I was shocked to find, you know, I'm going to use Fedora, I work at Red Hat, all of us RPMs, and then I discovered that most of the communities for the devices and other things in my home and in my computing just had very, very good defaults for Fedora. So it worked flawlessly on my laptop. But now I've picked up a device that runs Arch Linux, which has been interesting, and it's not just a normal one. A couple of my co-workers have a Steam Deck as well, and it runs a consumer-facing distro based on Arch, which has been very interesting. I found out one other Red Hatter was a maintainer in the Arch community, and said, yeah, that was kind of an interesting surprise: that Valve told us, "We're going to use Arch for our new handheld that we're going to put in normal consumers' hands," and then trusted them to use this Arch Linux-based distro with a pretty thick layer of UI over top of it. So that's been pretty wild. Joydeep, have you gone through the journey of building and installing Arch? I'm told it's a reading comprehension test, basically.

Gurney: Oh yeah, I do; it's in this little case. I won't show it off too much on stream, because most people can look it up or already know. I just find it very interesting that at this point we've reached the point where we have Android and iOS out there in the wild, we have Windows and macOS (and macOS is a Unix system)...

Gurney: Yes! Hey, there we go, we found our segue. I think this goes really well into saying: okay, so you've started putting Linux on everything. We've started putting Linux everywhere. Maybe we put it on a cell tower. Maybe we put it on every cell tower. Maybe we make a flavor of RHEL, and a flavor of Kubernetes, that runs really well on a small device on a cell tower.

Ian: Yeah, good question, and a great segue. At those kinds of scales there are a lot of different issues that start to pop up, so we'll dive in here. For just a moment, I'm going to give a nod to what Joydeep was saying; you were asking about things we're watching right now. Well, it's hard to peel myself away from the images coming out of the James Webb Space Telescope recently. When not focused on that, we are deep into scaling up at the edge within the telco environment, so happy to talk about that as well.

Ian: So, like you said, when you start scaling up, dealing with managing a fleet of clusters that numbers in the thousands or tens of thousands, that kind of scale starts to bring some of its own unique challenges, and certainly I could not even begin to run down the list of all the issues you may run into. But within what we've been doing in the telco space, there are some really interesting challenges we've had to tackle around lifecycle events: the various things that are going to go on over the course of the life of your cluster, and trying to manage them within an environment that has really demanding needs for uptime and availability, and that is really sensitive to any sort of disruption to the operational environment. When cell phone service goes offline, nobody's happy.

Ian: So there's a lot of sensitivity around that. We started working on something we'll talk more about here, called the Topology-Aware Lifecycle Manager. This is an operator we've developed that can be used to help address some of these issues with managing lifecycle events, or changes, or potentially disruptive things that happen to clusters at scale. A lot of that sensitivity comes in for various different reasons: there may be service level agreements with that...

Gurney: I think the service level agreement may have been breached there. This is interesting, because coming into the show my computer did lock up and we had to reset, so clearly we're running in the same zone here. I can hide Ian real quick, so we'll pivot; I'll go ahead and branch a little bit, and we'll see when Ian's connection comes back. Joydeep, I can watch him on the side. He has a network connection of 0 out of 10.

Joydeep: You know, he was saying, and perhaps Jun can speak to this, something very interesting: my cell phone connection, when I'm talking, might be dropped if some stupid stuff is going on, and the stuff that Jun and you guys are working on can prevent that? Are you kidding me?

Gurney: Yeah. So we take computing and networking topology; we may have a primary and a backup, an A and a B, a blue or a green, on a tower site. We may also have two towers, is what you're saying, that cover each other, have some coverage overlap, and can take over for each other. So we're not just topology-aware for computing: we're taking computing topology and distributing it over the physical topology of the land by putting it on the cell towers. That's amazing.

Gurney: Right, that's interesting. Amazing! Oh yeah, I also wanted to add one note for the audience: I actually linked their open source repo in chat, but it is under a different name than TALM. TALM is used a lot: Topology-Aware Lifecycle Manager... manager? Management? Jun, is it manager or management? Manager. Got it.

Joydeep: Yeah, I guess the other question, Jun, is: is this related only to telco, or only to 10,000 or 100,000 clusters? I mean, what if I have, let's say, 20 clusters, important clusters, in which I'm running my production? Gurney runs some of these, right? You run the infrastructure for us, and if it's not working, Gurney, we won't cut you slack.

Jun: Yeah, sure. This started with a more telco-focused requirement, but the work is truly generic: wherever you need to manage a relatively large number of clusters.

Gurney: Awesome. Wherever topology can play a role. We've got Ian back. Ian, all you missed was a little bit of kickoff, a little bit of intro, and we talked about mapping computing topology onto the physical topology of a network of cellular towers, and we also...

Ian: Yeah, great, sounds like we're good. So, I was just talking about uptime and service disruption?

Ian: My apologies for that, but I'll just pick up from there. I'm glad you touched on that: the issues we're talking about really are not a telco-specific thing. It's really whenever you've got large-scale topology, lots of clusters, and whether it's SLAs that you need to ensure you're meeting, for obviously contractual reasons, or it may be that your operations team, through prudence and experience over time, has said: you know what, doing an upgrade of my entire fleet of clusters simultaneously, at the same moment, is probably not the best idea. And so you've got these different things...

Ian: ...that say: I want to be able to have a higher level of control over these lifecycle events. And certainly when you're talking about a large scale like that, automation really is key. So we're looking to bring some tools that build on other existing tools, and bring in these things that allow us to manage these lifecycle events in a topology-aware way. So it probably makes sense to talk a little bit about what we mean by topology.

Ian: Clearly we're talking about thousands of clusters, tens of thousands of clusters, being managed, so we've got large scale. But those clusters may also have some sort of service-level overlap, whether that's a logical overlap between the clusters or, in the case of cell service, some amount of geographical overlap, and you want to make sure that, hey, if I'm going to go upgrade a cluster...

Ian: ...if I'm going to do something that's potentially disruptive, let me not take all of the cell phone towers in Manhattan down simultaneously. Let's at least share that between Manhattan and Philadelphia, or wherever, and have some sort of geographical awareness. Or, like I said, if you've got logical service overlap, you may want to make sure that logically you've got some redundancy built into your system, and so you may want to not take down two clusters that are logically overlapped. So yeah, topology can span across scale, but also service availability as well, and...

Gurney: I imagine that could impact someone else's service availability as well, because I remember a project I worked on a while ago where there was a concern of: okay, if we run this as a managed application (that was the use case), if we run this in this data center in this region of this cloud platform, we run this number of them on the same physical networking interface. So I can imagine, if you decided to take all of Manhattan down for an upgrade at the same time...

Ian: Yeah, exactly, and actually that's a great segue, because one thing I haven't started to dive into is: what are some of these disruptive events that we're really intending to manage? One of those would definitely be an upgrade of the base operating system and OpenShift, and sometimes that content is not small.

Ian: The update may be fairly large, and if you're dealing with bandwidth constraints, that topology may need to take into account that you need to manage how many clusters are sharing links. So you may build that into your topology, and you certainly don't want to overwhelm the servers that are serving up that data. So you want to be able to not only do it in a topologically aware way, but also do it in progressive waves, so that you're not doing more than some limit that you've tested...

Ian: ...that you know you can support on whatever your content delivery servers are. So there are a lot of different ways that topology can be sliced up, and one of the things we tried to do within TALM is to not bake in knowledge of what those different mechanisms would be, but to provide the set of tools that puts that into the user's hands and says: you get to define what topology looks like, you get to define what a progressive rollout of changes looks like, you get to define how this will stage through and work its way through, whether you're staging change set one followed by change set two, or doing things simultaneously.

Ian: So we tried to build some tooling within TALM that allows users to do that. One of the key use cases we were focused on when we were doing this (I've named it already) is OpenShift upgrades, and making sure that when you do an OpenShift upgrade, that may potentially be a disruptive event for that cluster.

Ian: If you have a highly available cluster, it's certainly far less disruptive, but not zero risk either. And again, that comes back to your operations team: maybe it's not disruptive, but that doesn't mean they necessarily want to roll out that upgrade simultaneously to the entire network.

Ian: So there are a lot of different reasons why this comes into play. But a lot of times, when we're dealing at this scale out at the edge, the kind of clusters we're dealing with are single-node OpenShift, and within that context you do have a service-disruptive event when you're doing an OpenShift upgrade. So again, lots of different reasons, but OpenShift upgrades were certainly one of them. OLM operator updates are another; again, non-zero...

Ian: ...non-zero risk, may or may not be disruptive, but again the kind of thing that we want to be able to roll out. And within the context of operator updates, there's a really good built-in mechanism within OLM operators that allows you to subscribe to a registry and automatically keep in sync with that registry. So the ability to work through and pull those updates is built in, but within an environment like the telco environment, you may not want to do all of your operator updates simultaneously, and so TALM provides some of the functionality there.

Joydeep: What this reminded me of was my prior life working for an entertainment company. The thing that I knew is: technically, we can be ready to push something out, but then the business guys have real knowledge, which we had no clue about, which really goes to decide whether you push it out or not. What you were telling us struck the same chord: you are providing flexibility, I guess through APIs, for the user to customize however they want to.

Ian: Yeah, exactly. There are a lot of great features in OpenShift that allow this functionality, and a lot of great features within ACM that support and enable a lot of this. The gap we were trying to fill is: give the user the tools to say, we can time it when we want to time it.

Ian: We can batch it the way we want to batch it, and we can roll it out in a controlled manner that allows us to meet whatever our operational constraints are, whatever that industry may be. There are real reasons why they may not want to do a whole lot of things simultaneously, and so we're trying to give that additional set of tooling.

Ian: Tooling that builds on that great base and says: here's the additional functionality that you need, in an operational sense, to go forward in your network and be able to do the kind of updates you're looking to do. I guess the last one I'll mention is that really any configuration change could potentially be a risk. It doesn't even have to be what we consider the major lifecycle events; really any configuration change could potentially be something that a customer may want to manage this way.
A
Okay,
the
the
question
chat
very
related.
I
need
I
I
should
not
have
put
a
caption
on
related
before
we
move
on
is
a
question
about
using
satellite
as
a
local
repo,
so
they're
talking
about
the
openshift
upgrade
repo
server.
I've
worked
with
a
very
little
bit.
I
don't
know
if
you've
done
any
work
with
that
to
to
get
that
content
closer
to
the
edge
where
you're
going
to
actually
run
that
upgrade
or
not.
I
assume
that's
a
complimentary
tool.

Ian: Yeah, that's a great question, and again it gets into the complexities of what topology means. In bandwidth-constrained networks, you may want and need to move functionality further out toward the edge of the network, and certainly move that content out toward the edge of the network. Jun, maybe I can hand over to you a little bit here.

Ian: There are some primary use cases that we focused on within TALM, but then there's some additional functionality as well that comes along with TALM that allows things like this. So in terms of moving, or pre-caching, content further out at the edge of the network: Jun, can I hand over to you on that?

Jun: Sure. Yeah, so we have this built-in feature where we can look at the upgrade you want to achieve...

Jun: We can make all the clusters involved pre-download all the artifacts that are required, so that when you actually enable this upgrade to start, all these clusters already have the artifacts local, right on the node.

Jun: We know that at the edge we often have limited bandwidth or flaky connections, which is not good for bulk downloads. So it's important that, before we start the upgrade, we've already prepped this relatively risky step. That's what we do for this.

Joydeep: So I guess, Jun, what you're talking about here is, again, real-world systems. You have to complete the maintenance within a certain time; you have to complete the upgrade within a certain time. So you only start the upgrade once you've made sure all the prereqs, like downloading and things like that, are done, and you are allowing those to be done prior. Yeah.
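
To make the pre-caching flow concrete, here is a minimal, hypothetical sketch of such a CR, assuming the field names exposed by the upstream cluster-group-upgrades-operator (preCaching, enable, remediationStrategy); treat it as illustrative and check the repo's README for the exact schema.

```yaml
# Hypothetical sketch: pre-cache upgrade artifacts before enabling the rollout.
# Field names assume the upstream ran.openshift.io/v1alpha1 API; verify
# against the cluster-group-upgrades repo before using.
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: upgrade-with-precache
  namespace: default
spec:
  preCaching: true        # pull required images onto each node ahead of time
  enable: false           # nothing rolls out until this is flipped to true
  managedPolicies:
    - ocp-upgrade-policy  # illustrative policy name
  clusterSelector:
    - fleet=shuttles      # illustrative label selector
  remediationStrategy:
    maxConcurrency: 3     # at most three clusters per batch
    timeout: 240          # overall rollout timeout, in minutes
```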

Ian: Yeah, you'll probably hear us say the word "progressive" quite a few times during the course of this, because it really is about enabling, rather than one rapid, big, monolithic thing happening across your fleet, breaking it up into chunks. I've talked a good bit about breaking it up into logical chunks...

Ian: ...for overlap, in that sense. And what Jun was just describing is breaking it apart into chunks time-wise, allowing different phases of the change to be done in two separate events. As Joydeep said, you may have certain windows of time where you're allowed to make those changes, or allowed to do those things, based on your SLAs or whatever it happens to be. So that feature of TALM allows you to do that pre-caching and then initiate the upgrade.

Joydeep: Just one physical question, Gurney, Ian, Jun. You guys are talking about these single-node OpenShift telco things. Are they those small boxes we see while driving by, mounted on a tower in no man's land sometimes? Are you talking about those kinds of things?

Ian: Yeah, there are a lot of different areas where those servers can be deployed. It could be right there at the cell phone tower, at the base station, distributed out at the edge; you've seen how many towers there are, so you get a sense of the kind of size and scope of what we may be talking about here. And then progressively further back into the network, there are a lot of different places where OpenShift has some real fantastic ability to address problems, so it definitely does span...

Ian: ...a lot: the edge all the way back toward the core of the network. And depending on where you are within that network, different cluster topologies (single node versus compact clusters versus a larger-scale full cluster) can come into play as well.

Gurney: I guess, and you're probably about to go straight into a demo where you have this, so it might be good timing, I'm curious: does TALM do some work to discover the topology and understand some of these constraints for the user? Or does the user define it and say: TALM, this is what my network, what my fleet, looks like, beyond the things that you can determine or discern?

Ian: Great question, and a good segue. So again, TALM is a tool, and it's something that builds on top of other components of the solution here.

Ian: As you're doing a lifecycle event, some sort of progressive rollout that you want to do... let me throw up a slide, and if I haven't answered your question, you can definitely double down on it; happy to continue to dive deeper.

Ian: Hopefully it's reasonably legible here. I wanted to throw this slide up to try to give a sense of where TALM fits within the broader pieces of the solution. As I mentioned before, TALM is an operator; it runs on the hub cluster and builds on features that are available within Advanced Cluster Management (ACM). And really, the unit that TALM uses for rolling out changes to the network is policies, so the user has the chance to describe what they want...

Ian: ...the end state of their network to look like within policy. I won't dive super deep into policy, because I know you just had a great session on this within the last few weeks. So if folks haven't heard that, I'll throw in the plug: a great deep dive into policy is available in the show archives. But the unit of work is policy.

Ian: So the user here can describe, whether it's an OpenShift upgrade, or a change to configuration, or an OLM operator update, they can describe that in a policy, and then TALM has the ability to say: all right...

Ian: ...let's take that, let's look at the set of clusters that it's bound to, and let's start to progressively roll that out through the network. There are a lot of different ways that can manifest itself, and, like I said, across different lifecycle events, but that's the unit: TALM iterates over those policies, and it'll do them in order; you can actually specify an order to the policies that you want it to remediate. And then you say: across this large set of clusters...

Ian: ...I want you to go roll it out five at a time, ten at a time, five hundred at a time. Whatever that increment, or wave size, is, it'll do that many concurrently, and then it'll move on to the next set, and the next set. So to answer your question, Gurney: it's a combination of the placement rules and the placement bindings that go along with policies, along with cluster labels that allow you to do some selection, that let you define how you want to roll things out.
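
As a rough sketch of the building blocks Ian describes, here is what the ACM placement machinery looks like: a PlacementRule that selects clusters by label, and a PlacementBinding that ties a Policy to that selection. The names (fleet=shuttles, config-policy) are invented for illustration.

```yaml
# Sketch of ACM placement: select managed clusters by label and bind a
# Policy to that selection. All names are illustrative.
apiVersion: apps.open-cluster-management.io/v1
kind: PlacementRule
metadata:
  name: shuttles-placement
  namespace: default
spec:
  clusterSelector:
    matchLabels:
      fleet: shuttles            # pick the managed clusters by label
---
apiVersion: policy.open-cluster-management.io/v1
kind: PlacementBinding
metadata:
  name: shuttles-binding
  namespace: default
placementRef:
  name: shuttles-placement
  kind: PlacementRule
  apiGroup: apps.open-cluster-management.io
subjects:
  - name: config-policy          # the Policy carrying the desired state
    kind: Policy
    apiGroup: policy.open-cluster-management.io
```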
A
That
makes
sense,
so
basically
you
build
you,
you
you
build
via
the
building
blocks
of
policy
and
labeling
and
and
all
of
these
other
constructs
and
you're
able
to
build
a
structure
that
says
here's
what
my
network
looks
like
and
then
you're
allow
you're
able
to
build
actions
that
you
want
to
carry
out
on
that
network.
So
it's
kind
of.
A
A

Ian: So let's dive through an example and see if that helps. Apologies: I can't fit quite as much on the screen and make it legible simultaneously, so we'll see how this goes.

Ian: I've scripted this out a little bit to simplify, but I'll talk through the steps and show some different pieces here. The first thing I'm going to do is apply a couple of policies that are going to describe my changes within the network. You'll notice on the left side of my screen here, in the red: these are the actual sites.

Ian: I actually have five of them configured up on this hub cluster, and we're going to roll out a set of changes to those five. The first thing that's happening here (I'll zoom in a little bit) is just creating two policies. You'll see the first policy here and then the second one here, and it creates the policy and the associated placement rules and placement bindings. Sorry, there we go. All right, so it applied those to the hub cluster in the bottom right here.

Ian: This is a view of the hub cluster, and you can actually see the policies applied here: two inform-based policies. That is one of the key things to what TALM is doing: we create all of our policies as inform-based policies, so they don't take immediate effect in changing the clusters that are out in the network, but you do get that immediate visibility. So if I jump over into ACM, this is the ACM policy governance view.

Ian: Let me zoom in a little bit; hopefully that makes it a bit more readable. Again, you can see these five clusters, and you can see that there are two policies that are not compliant, because these describe a change that I want to make but haven't made yet. On the left side here, under config, you'll see the two policies: one is creating a ConfigMap and the other one's creating a Secret. Trivial changes, but good for demonstration.
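
For reference, an inform-based policy like the first one in the demo might look roughly like this; the structure follows the standard ACM policy framework, with all names invented for illustration. With remediationAction set to inform, it only reports compliance and changes nothing on the managed clusters.

```yaml
# Sketch of an inform-based ACM policy requiring a ConfigMap to exist.
# It reports non-compliance but pushes no changes until enforced.
apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: config-map-policy
  namespace: default
spec:
  remediationAction: inform     # visibility only; nothing is changed
  disabled: false
  policy-templates:
    - objectDefinition:
        apiVersion: policy.open-cluster-management.io/v1
        kind: ConfigurationPolicy
        metadata:
          name: config-map-policy-config
        spec:
          remediationAction: inform
          severity: low
          object-templates:
            - complianceType: musthave
              objectDefinition:
                apiVersion: v1
                kind: ConfigMap
                metadata:
                  name: demo-config
                  namespace: default
                data:
                  example: "hello"
```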

Ian: So you can see, under these clusters: no ConfigMap, no Secret. We're basically sitting in a state where we've described the change but not rolled it out yet. So the next thing I want to do is apply a ClusterGroupUpgrade CR. This CR is what describes to TALM what you want to do; Jun's going to give a deeper walkthrough of what's in there, but there are two high-level things I want to point out.

Ian: We list off the policies that we want it to remediate, so you can see here the ConfigMap and the Secret policy, and we tell it what clusters we want it to apply to; I'm just doing it by label here. All of these clusters appear to be named after space shuttles, so the label fleet=shuttles is common to all of them.

Ian: So I'm basically saying: I want to update all five of these, but I want to do at most three at a time. When I created that ClusterGroupUpgrade CR, you can see here that it has enable set to false, so it's giving us a status saying: hey, the upgrade has not started yet. The next thing I need to do is go enable it (that's just a simple patch to that ClusterGroupUpgrade CR), and now TALM is actually remediating those clusters.
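
A sketch of the ClusterGroupUpgrade CR being described, again assuming the upstream ran.openshift.io/v1alpha1 schema; the names and exact selector syntax are illustrative.

```yaml
# Sketch of the demo's CGU: two ordered policies, label-based selection,
# batches of three, created in the disabled state.
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: shuttles-cgu
  namespace: default
spec:
  enable: false              # created disabled; status reports "not started"
  managedPolicies:           # remediated in this order
    - config-map-policy
    - secret-policy
  clusterSelector:
    - fleet=shuttles         # all five shuttle clusters
  remediationStrategy:
    maxConcurrency: 3        # at most three clusters per batch
```

The "simple patch" Ian mentions would then be something along the lines of `oc patch clustergroupupgrade shuttles-cgu -n default --type merge -p '{"spec":{"enable":true}}'`.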

Ian: Let me zoom in on this screen here a little bit. You'll see that, in addition to the inform policies, we now have enforce copies of those, and this is how it's actually pushing those changes out to the network: it's taking those, in this case, three clusters at a time. I'll jump back here to the ACM view, where it's a little easier to see. You can see it's remediating those three clusters; the first policy is now done, and the second one is about to be done.

Ian: The reason it says four here is, remember, we have an inform and an enforce copy, and that enforce copy will disappear. So the first batch of three is done; it's now moved on to the second batch, the two remaining clusters. It's remediating those, and in about the next 20 seconds or so those will complete, and all five clusters will have the change. You can see the ConfigMap here has been populated based on the first policy.

Ian: That's because TALM has completed its work, and you can see it moved to the state "upgrade completed." And, I didn't mention this at the beginning: TALM will label the clusters before and after, to let you know what's going on, so you can actually track status through those labels as well. So that was a super fast run-through, and I know I jumped around the screens a little bit.

Ian: Apologies for the jumping, but as you can see, we went from non-compliant to fully compliant across the entire fleet of clusters, and we did it in two batches, as TALM progressively rolled that out. I'll pause there.

Ian: Yeah, so TALM will create those batches itself. The way TALM is built today, you get to define what gets included in the set, and I did it by the fleet label. So imagine the scenario that you had, where I had some amount of overlap, say evens versus odds, and I don't want evens and odds to be offline simultaneously.

Ian: I could easily build a ClusterGroupUpgrade CR that said: go roll this out progressively, 50 at a time, 500 at a time, on all of the even nodes. Then, when the even nodes are complete, you can hand off and go ahead and do an update of all of the odds, again 50 or 500 at a time. By doing that, you get the ability to say: I don't want to have these overlapping services down simultaneously.

Ian: Yeah. So everything this operator is doing is in units of the policies that you create. Logically, what it's doing is saying: you've created one or two or even a dozen inform-based policies, and now you're instructing TALM to go out and enforce those inform-based policies. So rather than flipping a switch in the policy, saying "enforce," and having it apply simultaneously everywhere, it's going to slowly roll that out, at the rate and in the batch size that you've defined. And again, you have control, by labels, over which sets of clusters, because those policies may apply to the entire fleet of 10,000 clusters, but using labels to select within TALM...

Ian: ...you can do a subset of that and say: maybe I only want to do a hundred clusters out of my 10,000 initially, and let that soak for a week as a canary set. It'll roll that out, and then you can say: that's been successful for this week or this month; now I want to roll it out to the rest of the 10,000, 500 at a time.
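
The canary pattern Ian describes could be expressed either by labeling a small subset and targeting it with its own CR, or, assuming the upstream API's remediationStrategy.canaries field, by naming the canary clusters directly so they are remediated before any other batch:

```yaml
# Hypothetical sketch: remediate the named canary clusters first, then the
# rest of the selection in batches. Field names assume the upstream API.
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: fleet-rollout
  namespace: default
spec:
  enable: true
  managedPolicies:
    - ocp-upgrade-policy
  clusterSelector:
    - fleet=production
  remediationStrategy:
    canaries:                 # soak these before touching the rest
      - cluster-canary-1
      - cluster-canary-2
    maxConcurrency: 500
    timeout: 480
```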

Joydeep: Right. So technically, in policy-speak terms for Advanced Cluster Management, what you're stating is: for the initial policy that you create, if you really want to make sure that a ConfigMap exists, you just create an inform policy for that, which will basically report that the ConfigMap doesn't exist. Then TALM will pick up and ensure that the ConfigMap is indeed created, at the time and pace you set in the API.

Gurney: To describe my shock: we have literally had this problem before. We've asked the question: okay, we need to do a dark rollout of an update, we want to do that for some percentage, and we want to make sure we don't have increased API error rates before we set it live. I've even worked with tech along these lines; Joydeep, you may have worked with it some too. There is a certain package that I'm remembering, for a UI...

Ian: But I did want to talk about a couple of other things. We mentioned operator updates as well, and operators are a little bit unique in that they have updates available in a registry. When you go update that registry, we want to be careful that the update in the registry does not immediately propagate out. So operators have the ability to be set into a manual mode, where updates are only applied when they're told to, and TALM has some features built into it that allow operators to be handled specially, and for TALM to actually act like a user...

Ian: ...going down to that cluster and saying: I want to approve the operator update on this particular cluster. Jun, I'm going to kick this off here in a moment; do you want to give a little deeper dive on how TALM deals with operators, and how it's actually doing the work around operator approvals?

Jun: Yeah. For operator upgrades, and even for OCP upgrades, TALM can look into the policy and recognize that these policies are for upgrades, and do specific things. One example is the pre-cache example.

Jun: We already talked about that: we will look at the versions and do the download beforehand. Another example is that when there is an operator upgrade, TALM will monitor the subscription status on those clusters and do the manual approval that's normally done by an operator, once we reach what's called the "upgrade pending" status. So that's the logic for operator upgrade policies.
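
The OLM mechanism Jun is referring to is the Subscription's install plan approval mode; this part is standard OLM rather than TALM-specific. With Manual approval, a new version waits in an unapproved InstallPlan until someone, or TALM acting on the user's behalf, approves it.

```yaml
# Standard OLM Subscription with manual InstallPlan approval: new versions
# from the catalog wait unapproved instead of installing automatically.
# Channel and namespace values are illustrative.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: cluster-logging
  namespace: openshift-logging
spec:
  channel: stable
  name: cluster-logging
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual   # updates sit in UpgradePending until approved
```

Approving by hand would look something like `oc patch installplan <name> -n openshift-logging --type merge -p '{"spec":{"approved":true}}'`; per Jun, TALM watches the subscription status and performs this approval itself when the rollout calls for it.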

Ian: So what's happening in this particular demo is TALM is working through: we had installed the 5.3.8 version of cluster logging, and we told it through a policy to go upgrade to 5.4.2, and it's working its way through, again in batches of a maximum size of three, updating those operators. Again, apologies, it's hard to see text on a screen here, but you can see in some cases...

Ian: ...this "endeavor" one has already been upgraded to 5.4.2, and the last two sites are in the process of being updated right now. They'll move to 5.4.2 as soon as TALM recognizes the upgrade-pending state and switches the manual approval to true. There we go, it just did it, and you can see that immediately that operator is now updating. So again, I just wanted to demonstrate that TALM deals with OpenShift upgrades, configuration changes, and operator updates as well.
A
Okay
yeah
this
is
this-
is
a
generic
enough
tool
that
I
can.
I
can
use
a
policy
to
teach
it
that
I
need
you
to.
I.
I
need
this
to
look
like
this
and
and
to
enact
that
change.
I
need
you
to
change
this
setting.
So
that's
what
it's
doing
there.
It's
toggling
that
manual
yeah
exactly
yeah,
that's
wild.
A
We
did
have
a
question.
I
wanted
to
surface
I'll
I'll
splash
it
up
here,
but
how
can?
How
can
you
convince
your
team
and
your
management
to
do
frequent
updates,
there's
a
push
and
there's
resistance
to
perform
upgrades?
I
know
we've
seen
this.
I
am
part
of
this,
sometimes
as
a
person
who
operates
a
bunch
of
open
shift
infrastructure,
any
incentives
to
update
frequently.
Why
should
we
not
miss
out?
I
I
think
I
think
this
is
a
good
place
to
say
it
sounds
like
from
my
perspective.
A
Talum
is
the
best
tool
in
the
world,
for
if
I
have
redundant
infrastructure
I
can
actually
have
a
blue
green
environment
and
I
can
bring
blue
green.
I
can
bring
blue
up
to
date
using
this
tool,
and
then
I
can
wait
and
see
how
that
behaves
and
then
green
can
come
along
with
it
once
things
are
healthy
once
this
has
proofed
that
out.

Ian: Yeah, I feel like that could be a show fully in and of itself: how do we convince folks to do more frequent updates and keep as current as possible? It's a great topic. Relative to TALM, and relative to what we've been talking about here: one of the ways you convince people to do updates faster is to make it safer, to reduce risk.

Ian: People's desire to not update is, in general, a risk/reward equation. So if we can provide tools and mechanisms to lower that risk, I certainly think that's part of the puzzle. I won't go so far as to say it's the whole puzzle, but I think it does come down to reducing risk.

Gurney: And I think caching that content helps too. For me, an OpenShift upgrade can take like two hours. If I can cache that content, if I can make sure those updated images are there and that upgrade doesn't have to pull a bunch of content, it happens even faster. That means my window for something to go wrong is so much tighter, so much slimmer, and that's amazing.

Joydeep: Yeah, and this does strongly incentivize you to do upgrades more frequently, because, as Ian mentioned, it takes care of certain things and makes it a little bit more solid, but depending on...

Ian: It helps to shift that risk/reward balance, because on the other side of that question is: why would I want to upgrade? And with the constant set of CVEs and security threats, there is real motivation to want to do updates, to stay current on the latest versions of things, so that security flaws, issues, holes, whatever, are closed. So if you can provide tools that lower the risk, operationally, of rolling those out, it helps to shift that balance a bit.

Gurney: The old risk/reward chart.

Gurney: I hope that answers the question; please shout out in chat if you have any other questions. Thank you for that. That was awesome.

Ian: Yeah, certainly happy to take more questions if you guys have more; that's great. But I did want to put this up here: I promised that we would do a little bit of a deep dive into how TALM is configured and some of the options that are here. So Jun, maybe I can hand off to you and let you do a bit of a...

Jun: ...deeper dive in here? Yeah, sure. I think we've covered most of them. The first part is the generic name and namespace. Then, starting with the spec: the actions part we briefly mentioned. That's another nice feature, where you can label your target clusters at different points of the upgrade process, like before or after, so that you can easily see which ones are in flight and which ones are completed. So that's the actions part. Then the cluster selector...

Jun: ...we talked about. There's that enable flag. The other thing I want to mention is that this enable part is really important, because, for example, the pre-caching and all the other validation (verifying the managed policies do exist, and that you do have clusters matching the labels) can all be done beforehand. Before you actually flip this enable flag, you know everything, as much as we can: everything's downloaded and your policies are in good shape.

Jun: Then there's the list of the policies, and we enforce them in order. The other thing I want to mention is that, within each batch...

Jun: ...we progress each cluster independently. It's not like we do policy one on all the clusters before we move on to the next one; within the batch, they can actually go at their own pace. Then the last one is the remediation strategy, where we define the batch size, essentially, and the overall timeout. I think that's it.
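
Pulling Jun's walkthrough together, here is a fuller, still hypothetical sketch of the spec, covering the fields discussed: the actions labels, cluster selection, the enable flag, the ordered policy list, and the remediation strategy. Field names and nesting again assume the upstream API; verify against the repo before relying on them.

```yaml
# Hypothetical annotated CGU spec, matching Jun's walkthrough of the fields.
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-example
  namespace: default
spec:
  actions:
    beforeEnable:
      addClusterLabels:
        upgrade: in-progress     # mark clusters as in flight
    afterCompletion:
      addClusterLabels:
        upgrade: done            # mark clusters as completed
  clusterSelector:
    - fleet=shuttles             # which clusters are in scope
  enable: false                  # validation and pre-caching can run first
  managedPolicies:               # enforced in this order on each cluster
    - config-policy-1
    - config-policy-2
  remediationStrategy:
    maxConcurrency: 3            # batch size
    timeout: 240                 # minutes; exceeding it yields "upgrade timed out"
```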

Joydeep: So this is the API. Let me play it back. What you're defining here is: hey, I have config policy one and config policy two, so first roll out config one and then roll out config two, and roll it out three clusters at a time in parallel. And we select the clusters per the cluster label, fleet equal to shuttles. And then you are instructing: before you start, apply these labels, and after you complete, apply these labels. That's the API. Fantastic.

Jun: Yeah. So if, for some reason, your clusters can never become compliant with one of, or all of, the policies within this timeout period, then the CGU status will say "upgrade timed out."

Ian: It allows you to keep from getting stuck. Talking at the scale of ten thousand, you might have a cluster go offline or something like that, and you don't want that to hold up the rollout. So there are timeouts built in that allow you to come back and deal with those clusters after the fact, and figure out what went wrong, whether it went offline or whether there was an issue.

Gurney: That way you know what you need to remediate. But also, at the scale of, like, 10,000 items, I can imagine there's probably one or two or ten or a hundred that are down at any point in time for some reason. We're talking physical hardware sitting out in the wild, so there's a decent chance you're going to have some level of acceptable outage at any point in time.

Ian: Yep, exactly. This is designed to get the maximum amount of work done that it can, to move your entire fleet forward based on these policies, but it's not trying to re-implement anything about policies. Policies are a really good tool for managing state and for describing how you want things, and they've got the visibility, so you can see which clusters are compliant and which ones aren't. This is just an adder on top of that.

Gurney: Amazing. That is awesome. And then, oh yeah, the other important question that I saw; I'm stealing Joydeep's question here, by the way. A little back room: Joydeep and I have random thoughts beforehand that we make sure we write down; I have them typically in the shower. The good shower thought is: does GitOps fit well into this paradigm? So I've pushed a change, everything is driven by GitOps, I have my fleet of clusters that's all driven by GitOps.

Ian: You hit the nail on the head. I'm slightly embarrassed that you said GitOps before I did; we're all about GitOps. Again, we're dealing with scales of thousands and tens of thousands of clusters; that's got to be manageable in a really rigorous way, and GitOps is a fantastic way to do that. That's a whole topic on its own, but yes, we haven't supplanted anything in those flows.

Gurney: So I can make my GitOps change and not, you know, finger-check a change to 10,000 clusters at the exact same time, with no remediation, no timeout, or anything.

Gurney: Yeah, so, yes, exactly: git blame to see who to blame for this one. That canary column, I'm really curious about that; it's very interesting that I can have that defined. You push a change, a PR comes in, it runs through CI, it gets merged, and then you still have one extra protective layer of canary in that rollout to see if something goes wrong, so your ops team won't blare up immediately and say: oh no, everything's wrong.

Ian: Yeah, exactly. Jun highlighted a couple of things here, these labels, for example, that are not core and central to the story of progressively rolling out the state and the changes, but these are really features that go with that additional theme of: let's make this really usable in an operational environment; let's put some additional tools in here that make it easier for the people that are using this.

Ian: I think we've got one more slide here that talks about a few of those other features, and so, at least briefly, I wanted to touch on these. Or, Jun, I think you can do a much better job touching on these than I can.

Jun: So yeah, I think we've touched on pretty much all of that. Chaining: that's where we do blue/green, one CR for blue and one CR for green, and chain them together, so that the one doesn't start until the other one completes. And we talked about the sequencing and the ordering...

Jun: We do enforce the policies, on every cluster, in the same order as they show up in the CR. And then we talked about the pre/post actions, and pre-caching. And I think you hit this really well too: you make a change in the policy, but it doesn't take effect, yet you can see which clusters will be impacted if we do make it happen. So I think we covered this one pretty well.

Jun: Yeah. And okay, maybe this is just one use case in my mind: if you make a change, you expect maybe this set of clusters, say half of your clusters, is supposed to go non-compliant. But you made a mistake, and you see all of them go non-compliant; then you know right away, this is something I need to take another look at.

Ian: That's huge. The other one is: maybe you made a mistake, there was a typo, and that managed to get its way through. You have the opportunity to review that in the inform policy prior to pushing it to your clusters. If it were an enforce-based policy, the moment you hit git push and that got synced to the hub, it's going to start going live immediately. So again, you have the ability to see the scope and impact of all of your changes.

Gurney: So A goes through waves, B goes through waves. So I can have a region that has an A and a B that are labeled separately, a blue and a green, and I can do blue, and then, if blue's up and successful and meets these criteria, I can do B in a chain afterward. And the same for regions: maybe I update northeast, and then I go southeast, and then I do central, and we just go through these regions as a chained series of upgrades.

Jun: Yes. One more note on chaining: it definitely can be used the way you just described, but it can also be used on the same set of clusters, where you want to group your change sets into two pieces and you want the first group to be applied to all the clusters before you start the second piece on any one of them. So that's another dimension.

Gurney: So if I have three SRE teams doing rollouts for three different applications: app A is going to make a change through that sequence, they're on call, okay, it completes, and then B starts their rollout and makes their change. Or even dependent changes...

Ian: That is awesome, yeah. Another really practical example: I want to go do all of my OpenShift upgrades, and then I want to follow that with my operator updates; I want to do them in that order. So yeah, a lot of different...

Ian: ...ways to bend it to the type of operational scenarios that you want.

Ian: You can really tailor the way that the change occurs, tuned to what you want to happen.

Ian: So yeah, we've covered a lot of ground. There are certainly more aspects of TALM that we can dive into, and we're always happy to take questions. But we kind of mentioned it up top, and I did want to mention it again: TALM is building on top of a lot of really good technologies, and we really benefit from those.

Ian: Obviously we've talked a lot about policy and ACM and the tools that ACM is providing. Within our use cases, it intersects incredibly well with the initial deployment of clusters and the assisted service, and we've gotten fantastic help from our integration and field teams, who give feedback on this and help work through some of these real operational constraints.

Ian: So a lot of really good technology, and a lot of good folks, went into building this out. And I can't leave off one of my favorites: testing these things at scale. That sheds some really interesting light on how valuable it can be, and also on where things start to rattle, and it gives us the opportunity to tune those up and really have this roll out at scale and deal with large-scale fleets like that.

Ian: Over 2,000 clusters is the typical scale environment that we've been working in. So we've rolled this out to thousands, and certainly then you can start to scale this out more horizontally as well, so you can get to those really large-scale deployments by replicating this out.

Gurney: So, the last question I always ask. We're right about at time, so we'll wrap up; it has been a pleasure having you. But the most important two questions. First, how can people get their hands on this? I have OKD, I have OCP. I'm guessing, from the installing section of your readme, that your git repo is the best place to go, right?

Ian: Yeah, you can absolutely do that. This is a project in the upstream repos; I think you posted the link up front, and if not, we can certainly drop a link here to the cluster-group-upgrades repository on GitHub. We certainly have the downstream versions as well within Red Hat, built out as an operator, and it just builds on top of all that great work in ACM and the policy engines.

Ian: It'll be published as the Topology-Aware Lifecycle Manager, I believe.

Gurney: Amazing. Okay, well, I think that wraps us up, unless you guys have any parting thoughts. The only other one is: send us an email at cloudmultiplier at redhat.com if you have any questions, want any links to that downstream repo, or have follow-ups for Ian and Jun; we will loop them in on those emails if they come in.

Ian: We certainly enjoyed it. These issues are a lot of fun to dig into, and folks are always welcome to reach out to us; we're glad for questions and comments, and would love that.

Gurney: Thanks for joining us. It's been magnificent. Okay, we're going to roll the intro as an outro once again; I think we're going to stick with it at this point, it's just habit. We'll see everyone in two weeks with a fresh new topic that isn't at the top of my mind, or the tip of my tongue, right now, but we do have it scheduled already. So see everyone in two weeks. Thanks!