Ceph Orchestration Weekly, 7 Feb 2022

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Ceph Orchestrator Meeting 2022-02-07

Description

No description was provided for this meeting.
If this is YOUR meeting, an easy way to fix this is to add a description to your video, wherever mtngs.io found it (probably YouTube).

A

So there's uh two topics in there and they're they're, both things I added them a little bit ago um and they're both upgrade related and they're from um paul kuzner.

A

If you guys know, him is uh being an uh like a large scale, cluster doing some testing for us kind of like the the posi experiment, but it's exactly, I think larger than that, and one thing he's trying to test uh is the upgrade procedure, and so both these points are things sort of related to that um some decisions we have to make uh maybe optimize the upgrade procedure or some larger clusters um I'll start playing the first one.

A

um So the osds have a dependency of the mod map, um which normally that's that's good. You know we want that to do that, but it causes an issue during the upgrade um where, after you upgrade the monitors, uh all the osds now need to be reconfigured, because the bottom map changes and that's if you have like a lot of osds like in this case it was like 3, 900 osds.

A

It really slows everything down because it's just sitting there re-configuring osds for like over an hour and then that time it doesn't give you any updates on the upgrade or what's going on there. So it's like a bad user experience for that. um It also just generally slows everything down, probably more than it needs to uh there's. The question here would be guess what we want to do about it.

A

um The one idea I have was under there is that maybe we could just push it off until the upgrade is over. um I'd have to look into it a bit more. I'm thinking that! Maybe if we do that, like it, wouldn't do anything bad, like the oc, didn't get the new mod knob immediately, we were able to just upgrade them all first, because they, when we upgrade them, they will get the new mom up anyway.

A

um There would just be like a brief period while we're doing the crash demons well, where they would not have.

B

They wouldn't have the new map so I'll admit to knowing very little about this topic. But um what are the downsides?

B

You know if we, if we kind of make a list of what are the you know the downsides of not doing this versus doing it or doing it some other way, it'll might help just make the decision for us.

A

Yeah, I mean that's what I was trying to figure out. I mean this topic, just sort of came up last night. um I don't see a serious downside. I guess that's kind of why I was asking in here.

A

I'm not sure if anyone in here would know of a downside or if there even is one to not having the mod map but like the newer mod map figured for the osds for a little bit longer, so they are going to get the new one anyway, when we, when we upgrade them, which should just be like in between so the only time period where they wouldn't have.

A

It is basically in between we put the monitors and we upgrade the osd's, which is just the crash, upgrade, I think, all right, um and instead, what we're doing is we're just going through and reconfiguring all of them immediately just so that we can then later redeploy them all again when we upgrade them, but it does feel like uh it feels like a bit of a time waste- and I said it's a bad user experience, because the upgrade status or anything doesn't show any of this happening.

B

A

B

If everything happens quickly, I think your intuition is probably correct concern I have is say that it is taking a long time for some reason um you know osdx gets upgraded on tuesday, but it doesn't until wednesday that osdy gets upgraded.

B

So, like imagine, if the time scale was stretched out, if it's still not a problem, then I think your idea sounds great. The other thing, I would say, is just maybe ping, someone on the osd development side to double check.

A

Yeah, I thought about um being like mia or team asking them. um What danger is and also maybe looking at the way uh ansible did their upgrades because I think they did it. They just upgraded individual roles, one at the time, um and so obviously, in that case, then, if the monitors change, they wouldn't necessarily have all the osds immediately being reconfigured or anything. They just then upgrade the osds one rule at a time.

A

So I think it actually would be sort of similar to what they were doing there, but I do want to check one yeah, but I just want to bring it up here. First see if anybody uh had anything just because it is, I guess it changed the way our reconfiguring system is going to work and uh bridge gonna work not a big deal, but I think overall would be a positive change.

A

I just want to collect people's thoughts if they have any.

B

All right yeah, I have nothing to add to you, know like my kind of devil's advocate questions there.

A

Right um in that case, I guess we'll just go forward with that um I'll see if I can get that stuff sort of working um and the second topic's also an upgrade topic. It's related pretty much the same thing um now this one's for more granular upgrades, and this is also an issue for um like larger clusters.

A

Essentially, if you have like somebody who has a really huge cluster, they may not like the idea of I'm just going to like set this command up, and then it just runs for like two days or something um so some people have a desire to sort of have the upgrades be maybe buy, hosts or buy like demon type or something, and that way they can just upgrade a few of the demons at a time make sure everything's working okay before they keep going.

A

um It would be a step away from the way our. Currently we do our upgrades, because, right now we just do them all at once and there's a bunch of things. You have to be really careful about doing that.

A

um I guess the topic I had sort of for here is: I guess: if we're going to do this sort of regular stuff and what level we want to do it um so the options that we sort of have are, and we can combine them as well, is um you do it by demon type which is sort of someone?

A

However, it already works, um which is like you just do the managers and the monitors and stuff um we could just say like you- can provide just one of those types and we'll do it uh the one thing we need to be careful about doing something like that is we still have to enforce the ordering. We also want the managers before the monitors, so we say, like you, can provide the command saying just do the managers just do the monitors, but you haven't done the earlier types. First, we're going to probably block it.

A

I think we have to do that. um The other options are buy hosts so say. Maybe you want to only upgrade the managers. You also only want to upgrade the manager on like this host race for managers doesn't make much sense. It's just one of them, probably, but maybe only the osd's on one host or something.

B

Yeah well, it makes sense too, because you might say say you have your cluster split up by like rack or something like just organizationally like on monday. I'm gonna do rack a tuesday, I'm gonna do rack b or something like that yeah. um The question I had when I read the pad originally right before the meeting I was curious, is um like say, you're running the command today, your time, you know your bullet says something like you know, this long running command might take days.

B

um The funny thing that popped into my head was uh some tooling, like uh I've used like rh pkg, um I've seen some stuff from the pulp project. Where you know, if your command line command is running and you know say your ssh gets connected or you control c. Is there like a reattach where you can actually kind of reattach to what like the server might be doing? Is it all there's most of the logic in the command?

B

This is a kind of a phase one to this you could, you could kind of make the upgrade process like resumable or reattachable before even needing to like break it up into phases. I hope this makes sense.

A

You talk about like kind of doing it synchronously and letting them like run it for a bit and see what's happening and then stop it.

B

Yeah, that's that's one way um I was thinking more like uh you were just saying like already today, the command sort of has phases.

B

If you knew like you're in phase a you know, I don't know if the logic is kept inside the manager itself or if it's more in the command line, tool, you're running, say it's in the server and the command line command was disconnected if you could say, reconnect to the server and continue polling status of the upgrade. You'd know: oh okay, it's in phase a it's doing months or whatever. Oh it's phase b, it's doing, osds, etc, etc.

A

B

And that way you then have the same state you could use for the ordering enforcement. You were just talking about.

C

B

C

It's maintained in the manager, so I think that's like a pause resume on the upgrade is what jones described it.

A

Which we already have that's what I was wondering here. We do have a pause. It's just that it's kind of hard to use it properly because um you have to like go watch the logs, see what it's doing and then like you'd have to like stop it at the exact moment. It redeploys, like the last of this type or something.

B

Yeah so like creating, like, I don't know from back of a better word like phases for each upgrade step. You could say: okay, you know we're in phase a um when phase a is complete. Then you can move on to phase b. But if you're, you know whatever is driving the process at the at the user level says: oh, I'm still in phase a, I can't push the system into phase b.

B

Yet um again, I'm kind of speculating, because I don't know enough about the or architecture of the system, but um I'm doing a poor job. I think, but I'm trying to share a user experience from some other things that I've seen, which are pretty nice, where you can so normally get back in.

D

Normally not great process expect that in a large cluster, in the face that takes most of the time is those the upgrades, because normally monitors and managers only have a few of them. It's a quick operation but upgrading those ds that is the the longest operation, is.

A

That correct yeah, that's the slowest one.

D

So I mean from the point of view of user experience,.

D

Like it makes more sense to um to give them more control on this space, this specific phase, because the most time consuming.

A

Yeah, I think it's. What we're kind of hoping for with the the hosts sort of you know do it by host is that you can just upgrade. Although a season one host.

C

But we still need to enforce that by type so the ordering matters, so we have to be um cognizant of the migrations that are occurring in the manager, so you have to do one complete service on all of the hosts before you can progress, yeah.

A

That's what I was thinking is: if we did do this, we would have to make sure that all of the demons of the previous types are upgraded first and then we say all right: it's okay! Do this um so you're like no like right now. The way that upgrade works like it like leaves the function. It comes back and it checks like the measures and mods again and then it like eventually gets to the right demon type that it needs to upgrade.

A

It would kind of work like that, but essentially we would just stop at the whatever one they specified and if there's any previous ones that need to get upgraded. We just immediately like back out, tell them excellent.

D

So, like normally in a cluster, we normally have several different services running, for example, nfs or rgw, and then in our line we have. um We have other demons giving service to this to these services a little bit redundant.

D

So do you mean right now we we are upgrading at service level or we are doing it at the most level.

A

By we do it well, we do it one day at a time, but it's by demon type like all the managers and all the monitors and all the crash demons, etc.

A

um So, like say, if you have multiple services, specifying your osd's, like that part, isn't really brought into it at all, we're still going to do all the osds.

A

But yeah, so it's really just by type rather than the actual service. Like underlying that so say you had a bunch of rgws. They were in different services. We would just get to a point where, like our run, upgrade rgw's now it wouldn't matter which service they're in we just their rgw circuit kind of get upgraded. We don't.

E

Care about the surface, we just do it. That's the demon, liver.

C

So adam remind me: is there any parallelization that's occurring here or is this I think it's sequentially one demon at a time we.

A

That does the check for check okay to stop, and it just tries to add as many of them as possible to it and then once it maxes out it just upgrades all of them in parallel.

C

Got it okay, because that might be the other problem by doing it per house that may actually be slower because of the failure domain. You may.

A

C

To stop one honesty at a time or host.

A

Yeah I mean it could be slower, but I mean when we're doing like they buy. I don't expect doing it by host to ever make it faster. It's really like what like the aim is. I think the aim is just like people feel a bit better about it. They're like I, can just upgrade the demons on this host and like that gets done, maybe in like 20 minutes rather than running one command, that it sits there for like a day.

A

um So I don't think it would ever speed it up, even if we had it like somehow perfectly in parallel. So we'd do like all the osds on this host at once, or something.

B

Yeah, I think it's a late, latency versus throughput kind of thing you feel like you're, getting feedback. The perceived latency is better.

A

Yeah it'll always start on this throughput, because I mean our command. That just does everything is always going to be at least as fast, if not faster than whatever else we can do. It's just that. um It's a bit more responsive. I guess. Basically what john was saying. The latency goes down.

A

That's desirable for some people doing upgrades.

A

A

Yeah, I mean that's basically the idea of it. It's just we were seeing if we do that. Basically, like john, was saying to decrease the latency, you could just say I want up here just the osu's in this host and you can see that completes and you can just go into the next one. um It's a bit less intimidating, because the problem is, I guess what paul was just worried about. Is people are going to get a bit jumpy if you give them a huge cluster?

A

And then you report like nothing really for a long time, they might run something that they shouldn't be running or something, whereas if you give them this command, that's going to finish in like 10 minutes, like just upgrading, say like a bunch of osd's in one host, and even if it is a bit slower overall, do the whole upgrade as long as that command is returning or finishing um fairly quickly, then it could still be an improvement and if they don't want to use that they want to just do it as fast as possible.

A

We still leave the option to just do the whole upgrade.

B

Yeah custom people getting jumpy is something I've definitely seen like live. I was at a customer site, one time you know it was like calm down just wait and it's tough to make people. You know patient yeah.

C

I, I guess, there's also the possibility they want to do this during like periods of downtime like they don't want an upgrade running during peak.

B

Periods yeah, it's like, oh, our maintenance window starts at like uh 8 p.m and then runs to say 3am and then you don't really want a lot of churn happening.

B

So you can split up your maintenance window over a couple of days right.

D

Yeah- and this is already achievable by the powers command- I guess, but it isn't.

D

You know you can pause that great and continue using the cluster, or this is not possible.

A

You have a gun, I couldn't.

D

Yeah I was asking if this this behavior is already achievable by using the pause command. You can pause that grade any time it's using the cluster and then like in the next maintenance window. So you resume the upgrade.

E

And this continues.

A

Yeah, I guess if you only were worried about just having it not doing anything any upgrade while it's like in your downtime and you wanted to start again later, you could do that with the pause and particular. I still think it could be desirable people to just say like I want to upgrade like this thing today and like just let that go and then do something else tomorrow, rather than having it positive. Some random points.

B

The only other wrinkle I would add to that is- uh and this is just partly my own ignorance, so feel free to inform me, like um you know how many versions different are people allowed to uh you know upgrade or how many different versions of stuff running in one single cluster. Can we tolerate? um So that's something we will probably want to keep in mind, especially if they're here we're creating a wider window.

B

So if we know that you know, oh, you know, 16.5 and 16.6 stuff work perfectly well together great, but if we start to worry that, oh well, you know the upgrades taken so long that now 16.7 is out. um uh I I yeah, I'm not I've kind of lost my train of thought, but I think you get what I'm trying to say. Yeah.

A

I don't think that'll be a problem. I mean it's going to be like if you're on a big cluster anyway, it's going to be pretty slow. It's going to be, like I said, like a whole day. Maybe where you're just have some of your osd's are on the new version. Some of them aren't right um and I, as long as they're, up into a stable tag, which I hope they would, if they're out on a huge cluster. um I don't there's any worry of like the image changing or something like there's no problem with like.

A

Oh, they have this one on 16.7 and like something else comes out that shouldn't affect anything uh not as much about that.

B

Yeah I just I've seen clusters where people were. These were kubernetes clusters, though um they were using like the latest tag, so they could accidentally consume um the uh you know, versions that they weren't really wanting to necessarily opt into. They were just kind of opting into it by default.

B

Eventually, the um the cluster ansible openshift stuff changed that behavior, so the latest tags weren't being used, but there were people in the field that had them for a long time.

A

Yeah yeah, I hope people upgrading their big clusters are using stable tags and stuff. We do offer a bunch of stable text.

C

Well, so that's exactly why we convert the latest tag into a digest and then we enforce that right, just across all of the um container nodes and so until everybody's consistent. They can't you know this.

D

Is done at the beginning of that grade right right, so it gets the latest or whatever version we calculate the hash code and we use it everywhere. So we don't get messed up versions.

E

C

That's the very first thing: we pull the latest image computer digest and then we set that in the config store and enforce that across all of the payments during upgrade.

A

I thought it was specifically for people using latest tags as well.

D

Does this make sense right now? Does it make sense to provide that grade like at service level, instead of uh upgrading everything.

A

um That was actually one of the questions it didn't. We really talked about that part of it, but that was actually one of the things we were talking before about, like oh by demon type by host, whatever um one of the things paul asked about was if we should do it by service, um because so many people could maybe they specify their own services for like how they want to organize like their osu, something, maybe they want to upgrade just the ones in that service.

A

um There could be a world where people want to do that.

C

So I encountered something kind of like this, where I wanted to update prometheus um just for a security fix, but I didn't want to upgrade the entire cluster um to do this, which forced me to set a config key and then um do a redeploy that instance. Instead, I'm going to get around that.

D

And I guess, if we do it like per servers, we can get like some stats at the beginning of the impacted, especially number first, that will be upgraded, the the more the number of usds the longest it will take. So this way you can provide some kind of feedback to the user, that some kind of driver and to say, okay, if you upgrade this service, so be aware that you will be upgrading this number of usds.

A

And there's an idea, I don't think we have anything like that. A dry run for upgrade, but we're like this is how many demons will get upgraded. If you do this yeah. That would be something interesting uh sort of like a tangential, maybe useful thing.

C

The other thing to keep in mind is we: we um build everything in the same container, so like ganesha ice, because everything's all the exact same container image and digest so we're still restricted by that order, where the managers must come first, he can't play upgrade ganesha by itself.

B

It wouldn't make sense correct me, am I wrong? It's not everything, though some of the stuff around monitoring stack is doesn't come from ceph container.

C

B

C

Yeah, those are different. We hard code, the version for those inside the um manager.

A

If, if we are going to do like this, um I know obviously we're saying we're definitely going to for seth type demons enforce this ordering it's super important to make sure nothing breaks. um But for like we played monitoring stock, demonstrating a special case like if you say I want to upgrade prometheus like I'll. Just do that, for you sort of a side topic almost not really super important in terms of large-scale upgrades, but yeah.

B

I mean plus, you've got the whole generic container service feature feature.

A

Yeah right now, you wouldn't do it like mike said you at this time I'll change. The container image then just do a redeploy. It isn't too bad, um but I guess it could be. Maybe a use for upgrade.

C

I think the complexity is too high for the that narrow use case right now. Maybe we just document that case um it seems like for this thing it's mostly useful for osd types, but not really any of the other services osds that are really the problem. We just want to target a host.

B

Yeah, you could certainly just start there not support it anywhere else, and then, if there was a use case in the future, try and extend it.

C

Yeah, because we definitely want to keep the scheduling um as simple as possible here.

A

Okay, so when we wait to the osd part, we want to be able to limit it by host or service.

A

How do you want to get to that point like say if you want to do an upgrade, you want to get to that point where you have the ost and you're doing that.

D

In a normal uh I mean probably different customers have different configurations, but in general, from your experience, you managed to have like a lot of osds for hosts, or you may have different configuration. We have like fewer stays, one host a lot of them. Another host I mean, does it make sense to do it per host or like per batch or dispatch like generously or 100 or whatever.

A

um I'm not sure about just limiting by number, because I don't know like what the ordering is going to be on that, like, I don't think we guarantee anything in terms of ordering. So like you're saying I'm going to limit just upgrade 100 osds we could, but I'm not sure it just seems odd, because you don't know which osc's are going to get upgraded. You just say gotta upgrade a random 100 osds, not the other ones. It feels a bit weird to me.

A

And if you wanted to just upgrade a single osd technically you can do that. The demon redeploy.

C

I think a lot of those comes back to what john had mentioned, where you want to upgrade a rack, and then you want to validate and confirm that everything's sane before you progress the next step or the next rack is really part of the desire here.

A

Do we think that host and service are enough to specify those.

C

I think so, and then we can use our. We persist, the migration last migration.

C

um So if you were to do this say like on a really small service like nfs, it wouldn't really make a ton of sense, but for osds this would.

A

Yeah, because the way I was kind of thinking about this, I guess remember, we've gotten to now. Is you kind of love all three of them almost and then you just like for the? If you really want to do it granularly you just do the first three types take the manager in one of their crash by the type when you get to those ds. Maybe you start going to more detail. You're like I want to use one by host. I want to do this one by the service or something um yeah yeah.

C

Yep, I think I think, in the um upgrade stuff we just have a bunch of validations to ensure the migration order. Actually is saying stuff like this, but when we reach like say this step like they start an upgrade on osds for host 3 and when that completes, then we just simply pause the upgrade. Let them resume somehow in the next upgrade step.

A

Yeah yeah, there is a stoppage we haven't gone into. Yet is what we do we hit this and it's not actually the full upgrade. Do we just totally kill the upgrade? We stop it or do we just put in a positive.

D

Right now we don't have any kind of rollback mechanism, so if it fails.

E

We just stopped great.

A

I guess you could downgrade afterwards, uh I'm not sure I'd like that. That should be fine, regardless of if we pause it or if we stop it, because if it's paused, you can just technically.

D

Sessions with sebastian he mentioned that downgrading could be very, very weird, especially if you are changing the monitor stuff in the monitor it end up in some very bad state in the cluster. That's why we don't support right now, downstate.

B

Yeah, usually a lot of software, you know downgrade, isn't supported or it's very, very risky, because you don't know what new metadata say got written into your door or whatever usually helps. If you're like a developer, can do it because you know. Oh, I know that there no metadata change has happened in you know: release 13.4.

A

Yeah, I think this is also one of the things that's nice about having the more granular upgrade is if you upgrade like one thing and you're like this didn't work. At least you have a smaller issue and if you just let it go, and it was putting bad versions on everything.

E

C

I think the one thing we need to be careful of is that, right now we have a status like through status about the upgrade status. Things like this. We don't want them to do a partial upgrade and then forget about it.

E

A

We just um yeah the progress bar would have to get out of adjusting a bunch yeah. We definitely guess we could clear it when we get to the point where we like did what they wanted to do like they want to just do like their monitors. We clear it after they're done.

A

um The bar will look weird, though, because, like we still we'd have to format it to say, like. Oh we're only going to upgrade this many demons, but we could definitely work around all that stuff. I think an upgrade status wouldn't be too bad. If we like know exactly what's going to get upgraded.

B

Right, that's what I was trying to imply with that ux talk at the earlier on was like. uh What do you do? You know? What's the workflow look like? Can you just no okay, there's a status command great, we run it and it gives you some sort of status report about the overall upgrade, but if you're only asked for a partial, you know what does that look like? Does it even make sense to have a progress bar or should you have? You know just bullet points? You know I've done.

B

You know bonds and managers, but nothing else. Yeah.

A

So right now what the overwatch upgrade status has it tells you which image you're going to it'll tell you which services are fully upgraded like if all the managers are on the right version. It'll say like the manager's service is done.

A

um It'll tell you how many of the total number of set demons have been upgraded um and if there's like a status there like it gives like a generic message field that, like usually it'll, say like oh we're, pulling this image on this host right now or we're currently upgrading like this type of demon or something.

A

uh So if we do do this, we're going we'd have to adjust that to probably like safely the number of demons that says, gonna upgrade like change the uh that number to say we're only gonna upgrade like this, many of them over many of this type or just be a bit more specific there. There have to be some some changes to the upgrade status, uh and I guess the progress bar is always one of the parts where it's smaller.

A

uh So it looks like it's more done. If, like, if you're, only upgrading five demons with this, then it shouldn't look like it's one percent done after you've done four of them. It would be if you did the whole cluster.

A

All right, uh one thing: I guess I don't know if he analyzes this go back to it. So if we do get reached this point where we finish what they told us to do like say they just said they already the managers. These want. The monitors done, there's like five of them and we do all five of them.

A

Do we just pause everything or do we fully say we're gonna stop the upgrade and you have to start a new one. Basically to I feel, like the upgrade start command is where we're going to add these arguments in, to like, say like specify the demon type or the host or something I don't know. If it's better just to stop the upgrade or if we should pause it leave it there, just let them!

A

I don't know what a resume would mean. That's the one that would be worried about is. If we just did the monitors, we had to then add to resume basically say like what demon types I want to do now or something I feel like it might be better to keep it all and start and then just stop the upgrades when we reach whatever it is. They told us.

A

Upgrade initial thought anyway. Anyone else I.

C

Think that's kind of just important to implement because it's on.

E

The resume it's going.

C

To be awful confusing, you could pause it type or pause again with a different type kind of weird.

A

Yeah, that's what I was thinking so.

D

Do we have any mechanism right now? Like imagine, I have cluster accidentally. I uh I powered up a note which has like old version of code of images in it. So how would it detect this case and do we upgrade the images automatically in this case or not.

A

Here's the thing you have an old version of an image downloaded on the host.

D

Yeah, for example, you switch off some hosts with the which has old code version, and then you upgrade cluster and you switch on this host what's happening in this game,.

A

Right now you get at the end of failing when we get you get hit offline host. um That is a good question as well. I think it's actually a totally different topic than what we're on it's just what we should do, how we should handle offline hosts during upgrades, we block the whole upgrade and say you have to have all the hosts online, because it's sort of dangerous, then again one could go offline in the middle of the upgrade.

A

I think right now, if you were to say, upgrade the other host and then you added add this quest post back to the cluster. If the query was still going on, it would just start upgrading the demons on that host um if their, if upgrades already over- and they won't do anything- you have to run the upgrade again to tell it to go upgrade that.

E

Host, so you will just see some error somewhere indicating that there is some form mismatch on this host.

A

Yeah, I think I think it will point out that there's a version mismatch at least the staff demons. I think that you do get a warning in this f status about this.

A

Yeah, I guess we should probably that as a health warning, there's not an upgrade in progress, and we have that for uh deaf demons of different types, which probably should raise a health ring yeah. I don't think we check that explicitly right now. I probably should.

A

D

A

Stop the upgrade, so I guess um sort of finalize here he said we're good with doing it by the demon type, the host the service type. You probably want all three for making it easier for osd service, get exactly what you want.

A

Whenever we hit that point, we just stop the upgrade and we'll let them start it again with new arguments, whatever they want up right now um and we have to make sure we strictly enforce the ordering so that there's always manager mon whatever the migration order as mike, I think, posted um and.

A

Oh yeah, and we want to make some, maybe a dry run command or something for a bit more transparency. So you know exactly what an upgrade is going to do so, as you do stuff or upgrade osd with a certain service type, then you do a dry run. We'll say like these: are the demons in that service that are not on the right version? This is.

A

All right uh does that sound good to everyone, or is there still something I'm missing discuss.

C

About that, out of curiosity, how does um rook handle this type of scenario? Is there any insight you can give us at that point.

A

I don't really remember how rook does their upgrading honestly um there's been a whole thing about like work with manager, rook and stuff, so I don't even think if there is an upgrade thing in manager, rook they're, probably using it right now.

A

I don't know I'd have to ask uh travis, I guess they're doing there. Actually um lane is in here, yes, yeah um rook's upgrades are handled pretty much on the the daemon level um with the osd's we do check the like. um We get the like batch of osds that we can update in parallel.

A

But that's that's kind of the only real, crazy thing we do, otherwise we yeah. We know that we need to upgrade if we see that, like there's like a cluster versions command that we issue that gives us like json output of like what versions of what demons are running in the cluster and if those are not all running the same imported version that we have great to do it's how we detect that we need to upgrade all right and what what's kind of like the actual user interface of doing that.

A

Do you just tell it like? I want to do an upgrade now and it just goes and figures everything out. uh Yeah I mean effectively. It's just update the uh the ceph image in the the kubernetes manifest for the subcluster, and it just goes and does okay, that's almost kind of similar to what we have right now for our set podium upgrade where we're just using the specify an image. We want to be the new image and then it just goes and does everything.

A

We probably actually are already right now sort of in line with work, but I think for some big clusters. It would be nice to have granular stuff.

A

Yeah we you get asked about a granular stuff, occasionally, but mostly we're just like that's so much more complexity that it is usually not worth it to tackle like at this point, all right, that's good.

A

Okay, um explain: uh is there anything else, um okay points, I think we're missing here.

A

All right, I think, we've kind of come to an agreement on what has to happen there I'll make sure to update the other pad and stuff we said after, and that was the last topic I had in there. So does anyone have any other topics you want to bring up this week.

A

Okay, in that case, I will see you guys all next week.