From YouTube: Ceph Developer Monthly 2021-05-05
A: All right, looks like we've got a quorum; why don't we get started. So welcome to CDM for May. Today we've got two topics on the agenda. The first one is optimizing the OSDMap for client use. Kefu, do you want to take it away?
B: Yeah, let me start with the motivation. I explained it briefly over the mailing list, but I'll repeat it in this meeting.

B: So the problem is that, in a very typical use case on a fast cluster, for example an all-flash Ceph cluster, a rebalance or recovery completes very quickly, and the rebalance leaves the cluster with a lot of OSD maps. But some clients didn't get a chance to pick up these updated maps before that, so they're far behind on these OSD map updates.
B: So there are a couple of alternatives here.

B: And the second option, which is proposed by me, is to use finer-grained locks in the monitor, in hopes of having more parallelism in the monitor when it's serving, for example, the OSD map.

B: I think that's one use case; probably there are more, because we did this before, to update the mapping of PGs to OSDs. But Greg sent an email to explain his concern: inherently, the monitor needs to access the single RocksDB data structure both for writing, to update RocksDB, and for reading from it in the pre-processing stage. So to use a finer-grained lock is very scary.
C: So, do you mind if I ask a couple of questions? We're specifically concerned about the scalability of the monitor serving OSD maps to clients, right?

B: Yes.
C: Then this is not about locking; it's just a read. Greg's worried about trying to parallelize updates, which is not relevant here, and that also means that finer-grained locking is irrelevant here.

C: Secondly, we don't get faster with more monitors. So I disagree with your characterization that increasing throughput at the monitor is the more general solution. The more general solution is to reduce the work the monitor has to do in the first place, and in this case I think this is even easier than the original proposal makes it sound.
C: The solutions here are to modify the client to notice when it doesn't need the intervening maps and make a point of not asking for them, and secondly, to change the OSD map request pipeline on the monitor to be more parallel. That should be straightforward; I don't think there's any way that it could cause problems. The only exception would be that the monitor needs some kind of widely available, internally published notion of the most recent map, but even that can be allowed to be somewhat out of date.

C: Won't work? No, no! No, we can read them out of RocksDB; we don't have to hold them in memory. It's just that if a client requests, quote, "the most recent", unquote, map, we need to know what that is. That's the only piece of shared mutable state that this theoretical parallel OSD map serving process needs to worry about, and even that's allowed to be out of date, so it doesn't even need to be perfect.
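(A minimal sketch of the shared-state shape being described here, assuming nothing about the actual ceph-mon internals: the update path publishes the latest committed epoch through an atomic, any number of serving threads read it without coordinating with the writer, and staleness is tolerated by design. The types are stand-ins, not real monitor code.)

    #include <atomic>
    #include <cstdint>
    #include <map>
    #include <mutex>
    #include <string>

    using epoch_t = uint32_t;

    class MapStore {
      std::atomic<epoch_t> latest_committed{0};  // published by the update path
      std::map<epoch_t, std::string> db;         // stand-in for RocksDB
      std::mutex db_mutex;                       // stand-in for the kv store's own locking

    public:
      // Update path: commit the new map first, then publish its epoch.
      void commit(epoch_t e, std::string encoded_map) {
        {
          std::lock_guard<std::mutex> l(db_mutex);
          db[e] = std::move(encoded_map);
        }
        latest_committed.store(e, std::memory_order_release);
      }

      // Serving path: many threads may run this concurrently.  A slightly
      // stale epoch is fine; the client will catch up on its next request.
      std::string get_latest() {
        epoch_t e = latest_committed.load(std::memory_order_acquire);
        std::lock_guard<std::mutex> l(db_mutex);
        return db[e];
      }
    };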
A: And which part of the OSD map serving is expensive? Is it encoding the map, or is it fetching from RocksDB? We have the in-memory cache already.

C: I presumed that it was actually dominating the primary monitor box, so the goal here isn't so much to reduce the cost; it's to get it the hell out of the monitor update bottleneck.
D: Right. I think we've seen clusters where it actually is just that the map's big, and it takes a long time to encode and decode from the various bufferlists we have it in.

C: The xinfo, I believe, is always uninteresting to clients, if I recall correctly. Yeah, I think there's some other stuff too. That's true. So that's four distinct things we can do to solve the problem, but I think the biggest, most important one is that clients that don't need intervening maps shouldn't ask for them.
D: That's a thing we can do, but it doesn't help us with existing clients in the field, and I'm actually not sure it's the most important one in these degenerate cases, because a lot of the time when this is a problem, it's because there's a lot of cluster change happening and the clients are trying to get writes through.

D: Yes, yes, yes. But we can look at trying to parallelize answering OSD map requests, but...
A: What about: when does the client request the map from the monitor, versus getting the map automatically from the OSDs when it tries to talk to them?

A: I guess I thought it was only the case that it asked the monitor for the map if it wasn't currently doing I/O to OSDs, or was very out of date. Does anybody remember exactly?
C: For the client to know it's out of date in the first place, someone has to have told it. So in the case that it's an OSD that told it, the OSD is supposed to give it the maps it's missing at the same time. So we would need to know under what conditions this is happening to the client. Is it that the client sent a request to the monitor?

C: There are different ways to ask the monitor for a map; the one you're describing is only one of the behaviors that the monitor has. So it could be that the client remembers the map it last had, and every time it wakes up it always requests all the maps back to that map, whether it needs them or not.
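(A hypothetical sketch of the client-side fix being discussed; the threshold and names are illustrative, not existing Objecter/MonClient code. On wakeup, a client that is far enough behind asks for one full map at the latest epoch instead of the whole run of incrementals.)

    #include <cstdint>

    using epoch_t = uint32_t;

    // Assumed threshold; a real client would make this configurable.
    constexpr epoch_t kMaxIncrementals = 100;

    enum class MapRequest { Full, Incrementals };

    MapRequest choose_request(epoch_t have, epoch_t cluster_latest) {
      if (have == 0 || cluster_latest < have)
        return MapRequest::Full;            // no map yet, or inconsistent state
      if (cluster_latest - have > kMaxIncrementals)
        return MapRequest::Full;            // too far behind: skip ahead
      return MapRequest::Incrementals;      // normal cheap catch-up
    }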
B: You said it was Simon? Yes, he was actually the author of the PR I mentioned, in fact.
B: Okay, so let me summarize the discussion a little bit. The first thing we might need to do is to understand the behavior of the client: why it would need to ask for the incremental maps, even though we offer a protocol in which the client is able to ask for the full map when it is far behind the cluster's latest epoch.
B: Okay, I see you updating the Etherpad. So if this is the existing behavior of the kcephfs client, we need to fix it so that the client can ask for the latest map. That's very likely what is proposed in the pull request, but it's trying to add a setting to do this.
B: Okay, that's why I was wondering: why would he go through all the trouble of updating the monitor side instead of addressing the problem from the client? Funny, because he also updated the client to get his problem fixed. Anyway, I will talk to him offline to find out more details, and I'll update you guys over the mailing list.
E: Yeah, before we move to the next topic: I was just thinking the idea of the lighter-weight OSD map is also not bad, generally. We should probably explore whether that's needed or not, and how far we can get with that.
A: It depends on what it's using today, whether it's decoding those sections like the xinfo or it isn't.
D: I think there are some things it needs, but if the client doesn't need anything out of the extended OSD attributes, we could set the extended OSD attributes version to zero or one, or whatever is the minimal size, while maintaining the rest of the map at its current state.
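(A rough sketch of what that could look like; the real logic would live in OSDMap's encoder and these types are stand-ins. When encoding for a client, emit a structurally valid but empty extended-attributes section, so decoders don't break and the bytes cost almost nothing on the wire.)

    #include <cstdint>
    #include <cstring>
    #include <vector>

    struct XInfo { /* per-OSD extended attributes: down_stamp, laggy hints, ... */ };

    struct Encoder { std::vector<uint8_t> buf; };

    void encode_u32(Encoder& enc, uint32_t v) {
      uint8_t tmp[4];
      std::memcpy(tmp, &v, 4);
      enc.buf.insert(enc.buf.end(), tmp, tmp + 4);
    }

    void encode_xinfo(Encoder& enc, const std::vector<XInfo>& xinfo,
                      bool for_client) {
      if (for_client) {
        encode_u32(enc, 0);  // empty vector: field present, contents dropped
        return;
      }
      encode_u32(enc, static_cast<uint32_t>(xinfo.size()));
      // ... encode each entry for OSD-to-OSD traffic (elided) ...
    }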
B: I think we can just audit the behavior of the other clients; I think that's what kind of triggered this. We can then resume the work to understand what it is interested in, and if it's only interested in the first part, I think we can just skip the second part when serving it. But we cannot differentiate the client by looking at its entity, because, as Ilya mentioned...
B: For example, the ceph CLI uses a client, but it wants the full version of the OSD map. That might be an exception, but we have no way to tell whether it's the CLI or a kcephfs client.
C: ...our ability to correct this in the short term. But as far as I know, this is only a modestly sized problem, and if we need to do something drastic in the short term we can do that too. But we should definitely address creating a cut-down version of the OSD map with only the things the client cares about. Also, splitting the OSD map into a version encoded for OSDs and a version encoded for clients will actually make maintenance easier.
C: So I think we should do essentially all of the things discussed here, just perhaps not at the same time.
A: Yeah, I wouldn't say everything's highest priority, but we've seen one real problem, which is solved by the latest client change, and the further optimizations can improve the scalability of the monitors.
B: I'm trying to figure out the exact steps we should follow if we want to implement the client version of the OSD map. Before that, I want to have your opinions on these exact steps. If we want to go this way... I'm updating the Etherpad with the steps at the very end.
D: I think what you've outlined is a good approach. It's just that the details will depend on exactly what things are going to be in both the client and the server-side versions of the map. So identifying that very carefully, and then concluding "yes, the solution works" or "no, there's something a little trickier that we need to deal with", is, I think, the way to go.
C: Sorry, "versions" was the wrong word here; I meant different encoding types, yeah.
C: The monitors don't always have a cache, right? There's no reason to relate them; we would just cache two different copies. If we add that as a requirement, it creates dependencies between fields updated in the client OSD map and the ones updated in the OSD server's version. I don't think it's worth it.
C: That might actually not be true. It's possible that we might introduce a field that's interesting for both clients and servers, where the client is new enough to get the field but the server isn't, because we're in an upgrade situation. So it'll often be true, but I don't think it's strictly true.
B: Thank you, guys. I think that's it.
A: All right, so the next topic is... unless there's more on this one.
D: I was just going to make a note that Sam brought up a thing about using LRUs for answering OSD map subscription requests, and I'm actually not sure if that's a problem or not, but I would want to look.
A: In any case, it sounds like that's a lower-impact change than the other pieces that we discussed.
A: All right, so let's move on to the second topic, then. This is about adding the ability to blocklist entities by a label or by generation number, or some mechanism for blocklisting things across a whole site, but still being able to bring the site back.
I: Yeah, sure. So we sort of discussed this in the December CDM, so I'll start with the motivation here, which is covered in more detail in the mail sent to the mailing list.

I: So this is basically for disaster recovery, where there are two Kubernetes clusters, or in other words two client clusters, Ceph client clusters in that sense, that share a common Ceph storage endpoint, and applications that have RBD images or CephFS subvolumes mounted on the client clusters.
I: The intention here is that the recovery point objective is zero, that there is no data loss, because they're writing to the same Ceph cluster. So as long as you cut out one client cluster and then move the workload to the other client cluster, it should theoretically be able to continue from where it left off.

I: So that's where the requirement kind of comes in from. So initially we were looking at the existing blocklist features and whether we can reuse them.
I: In this context there are a few gotchas. We've been discussing this with Jason and Patrick; blocklisting did not fit, and plus we actually want a whole set of clients cut out and fenced from the cluster, not necessarily a single one.

I: So the volumes could still be mounted, or subvolumes could still be mounted, and I/O could still be going on, so we needed something broader to fence off an entire cluster from access to the Ceph cluster. Okay, so we were discussing that, and Jason came up with an idea, saying: okay, why don't we just...
I: ...knock out the credentials for the CephX identity that's being used to mount images from client cluster A, and that way we'll take care of everything: that cluster will be completely fenced until that particular CephX identity is reinstated or its capabilities are restored.

I: So there was a problem here. We did try that, and it basically works, but the problem is that there could be outstanding tickets which are still valid, and hence they also need to be invalidated or revoked.
I: And then Patrick also added a few other cases where there was some stuff... I don't know these cases very well; he's added the trackers in there. But the ceph-mgr and CephFS interactions needed certain glob-based fencing of certain identities. I'm not sure about the use case, so I don't want to talk about that.
I: I'd probably move it up. So, based on that, I wrote up this particular feature request, which is around blocklisting a CephX entity: either by a CephX entity name, or by a label, and all tickets lower than a particular generation number; that's for the third requirement, which I'm not going to talk about.
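(None of this exists in Ceph as described; purely as a hypothetical sketch, the requested feature boils down to a map from CephX entity name or label to a generation cutoff, against which every presented ticket is checked.)

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    struct IdentityBlocklist {
      // "client.cluster-a" or a label -> minimum still-valid ticket generation.
      std::unordered_map<std::string, uint64_t> min_valid_generation;

      // Fence an identity: every ticket issued at or before current_gen dies.
      void fence(const std::string& who, uint64_t current_gen) {
        min_valid_generation[who] = current_gen + 1;
      }

      // Unfence by erasing the entry, which brings the site back.
      void unfence(const std::string& who) { min_valid_generation.erase(who); }

      bool is_fenced(const std::string& who, uint64_t ticket_gen) const {
        auto it = min_valid_generation.find(who);
        return it != min_valid_generation.end() && ticket_gen < it->second;
      }
    };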
I: So I just wanted to resync on this and kind of understand... The last time we discussed this there was no problem around feasibility, but then again we can discuss this with everybody, so I just wanted to bring it up again. So the first one is the ability to blocklist an entity, as in the tracker.
D: Sorry, I mean, we're talking about a geographically distributed set of stuff here. Kubernetes clusters one and two are going to be at different places, sending routed traffic to a Ceph cluster in a third place, or maybe one of those places. So it's going to be going through routers that can just drop the traffic.
C: No, no, no, hang on. That's not what that kind of blocklist is for. Okay, so we have two clients, for simplicity purposes: the "from" and the "to", right? The sequence of operations is: blocklist "from", swap to "to", start I/O on "to". What this guarantees is that, when the first I/O arrives on "to", anything that had been issued from "from" is either committed or not, and after the very first read that site two does, it won't see any further changes.
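(A minimal sketch of that fencing sequence using librados; blocklist_add is spelled blacklist_add on older releases, and the address string and setup here are illustrative, not a definitive recipe.)

    #include <rados/librados.hpp>
    #include <string>

    // Fence the old site, then (and only then) let the new site start I/O.
    int failover(const std::string& from_client_addr) {
      librados::Rados cluster;
      int r = cluster.init2("client.admin", "ceph", 0);
      if (r < 0) return r;
      cluster.conf_read_file(nullptr);
      if ((r = cluster.connect()) < 0) return r;

      // 1. Blocklist the 'from' client's address (0 = default expiry).
      if ((r = cluster.blocklist_add(from_client_addr, 0)) < 0) return r;

      // 2. Now the 'to' site may start I/O: its very first read cannot
      //    race with late writes from the fenced client.
      cluster.shutdown();
      return 0;
    }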
D: Yes, it's solvable, but it's a lot of work, and for a very specific use case. And maybe I'm wrong, and if we go away and look at it then the answer will be no. But, Sam, you're concerned about the cutoff point: I understand the theoretical concern you're raising, but that's just not a problem. It is, like, super not a problem: the queues aren't that long at the places that we would be blocking them from, and if we can just...
I: Only now our specification is changing to blocking a node in the Kubernetes cluster, fencing a node off, because that's what's being developed now. And that still has a mapping that we need to do, between the node identity that Kubernetes thinks of as the node and the actual addresses that the node is using to map volumes and subvolumes.
D: If someone writes this and submits patches for it, I'll look at them carefully, and if they seem maintainable then that's fine, but I don't think anyone's going to do it quickly.
I: So I have absolutely no idea about the complexity that's involved here, let me be honest. So yeah, we never got around to how difficult this is to do. If you're going to say that it is tough, then... are there other alternatives that you can think of here that can make this happen?

C: For that, I'm with Greg. This is, I don't know, a person-year, a person-year and a half, before...
A: Only blocklisting the entire cluster is much, much simpler than trying to do a blocklist and unblocklist of much finer entities and handling the manager failover kind of things.
I: Right, so among the three, the first one is what the Kubernetes side of things needs. But what you're saying is, even that...
E: So I just want to ask you a basic question. We already have the ability to blocklist clients right now. What if we have something in the middle? Let's say we don't have this ideal feature of blocklisting an entity or whatever; do we have something in the middle that takes care of this one-client-versus-multiple-clients case? For RADOS it is just going to be issuing multiple blocklists, for N clients versus one client.
I: So one of the prerequisites there is to know all the clients. And it's kind of funny, because the Kube cluster that you're actually trying to fence can still continue doing what it does. It could at least go and mount a volume on another node in its cluster, just because it thinks it's up and alive, whereas a higher-level management layer has actually taken the decision to move the workloads to the other cluster.
I: So, if I'm not wrong, blocklisting is for active connections, right? I mean, it's not for future connections. So it could just move from node one to node two, just because node one in that cluster is inactive, whereas, like I said, from the higher-level orchestration entity's view the entire cluster is inactive and the workloads have actually been moved off to an entirely different cluster.
E: Yeah, so basically the knowledge of the clients is not... you know, it can change.
A: Wouldn't there be other kinds of resources in the cluster that you'd want to prevent the applications from using?
I: Okay, so there are two layers here. One layer is a global traffic manager, which actually routes application traffic. So there is intra-cluster application traffic, where applications can talk to each other; but then, for applications that need disaster recovery...
I: The intention is to actually have a global traffic manager which routes traffic to these various applications, but that will be flipped: instead of going to cluster A it flips over to B, and the applications would be moved as a set. So, in Kubernetes terms, a stateless application that uses another stateful application, which basically means it has mounted storage: they will be restarted on cluster B, as I said, and the global traffic manager, the GTM, would be rerouted to cluster B.
C: It's the same problem for all of them. So fundamentally we're talking about block-level requests to RBD, right, or file-level requests to CephFS; it doesn't really matter, either way we're dealing with those via Ceph. However, if there were a more generalized S3-style available service that an application were using for its persistence, you'd have the same problem: you need to stop the previous, defunct version of the clients from updating that shared resource.
I: Multi-cluster Kubernetes workloads are not something that's prime time yet. The Kubernetes multi-cluster SIG is not necessarily...
C: So that raises a question: how does this work within a Kubernetes cluster? If you're migrating a stateful application from one part of a cluster to another part of the same cluster, why isn't this a problem then?
I: One of them is that it just waits five minutes for a node, which is the configured default timeout before the node is deemed dead, and an additional six minutes for the volume to detach.

I: But there was no mechanism in Kubernetes to actually inform storage that a particular node is dead and that it needs to be fenced.
I: So this has been a problem, and the spec for this is being written as we speak; I think it was rolled out a week and a half back. With it, the control plane will be notified that a particular node is dead, and that will subsequently notify the storage orchestrator, which will... yes.
D: It is possible to implement client applications and pods that handle this correctly, whereas it is not currently possible to implement Kubernetes multi-cluster setups that handle this correctly; I think that is the answer. How do you do the former? Yeah, a client can do things in its pods, like RBD volumes being fenced using the existing RBD fencing mechanisms.
D: What I would feel more comfortable with is that we could block, like, IP ranges, with some kind of epoch that... I...
C: ...better ones. But I think we have two different problems here. One is that we don't know the instance list, for whatever reason. I think we need to solve that problem via whatever means; it doesn't seem conceptually hard, and we can use the current unscalable commands that already exist to just do it, orthogonally.
C: Right, but that's not the immediate problem. The immediate problem is that this feature doesn't work in Kubernetes. It's not that it's not scalable; it's that it's not possible.
I: Initially we were thinking about the IPs, you know, just getting all the node IPs and things like that, but I'm not so sure. Within a Kubernetes cluster, nodes and pods having unique IPs is guaranteed, but across Kube clusters I'm not so sure it's guaranteed. So now the point is, if you blocklist a range of IPs and the other cluster uses those... I don't know whether that will work.
D: Okay, well, having told you that the earliest you could do any version of these things that you talked about is in a year: maybe we could do IP-range blocklisting much more quickly, if you're convinced it was useful. And I mean, you know, however much range is needed, like a slash-16 or whatever; we would probably...
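(At the time of this discussion, range blocklisting was a proposal, so the following is an illustration rather than existing Ceph code: the core of the idea is just matching a client address against a blocklisted CIDR range, IPv4 only here for brevity.)

    #include <arpa/inet.h>
    #include <cstdint>
    #include <string>

    // True if addr falls inside network/prefix_len,
    // e.g. ("10.1.2.3", "10.1.0.0", 16) -> true.
    bool in_blocklisted_range(const std::string& addr,
                              const std::string& network, int prefix_len) {
      in_addr a{}, n{};
      if (inet_pton(AF_INET, addr.c_str(), &a) != 1) return false;
      if (inet_pton(AF_INET, network.c_str(), &n) != 1) return false;
      uint32_t mask =
          prefix_len <= 0 ? 0 : htonl(~uint32_t{0} << (32 - prefix_len));
      return (a.s_addr & mask) == (n.s_addr & mask);
    }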
I: Yeah, the case of two separate Ceph clusters is a different one, where we don't need to fence off in this fashion, because, for example, both RBD mirroring and CephFS mirroring have a notion of primary and secondary.
I: The only use case that I know of is an external Ceph cluster shared by multiple OpenShift clusters... sorry, no: the only use case that I know of is an external Ceph cluster shared by an OpenShift cluster. This whole multiple-clusters thing is something we're attempting to do, so I don't have an answer to your question; that's what I'm saying.
I: Okay, so this CephX identity feature is going to take a long time. We probably need to... as I said, we probably need to figure out...
A: Or you can start with the basics of trying to use the existing blocklist commands, even trying to get the list of instances.
I: Okay, so it is the host that mounts. We're verifying one or two things; a couple of days back we had another meeting, and we're verifying one or two things around that, but it's the host that mounts, and for the pods it's just bind-mounted into the pods.
I: So we're not so worried about pod IPs yet, but we need to confirm that.
D: Yeah, so a lot of the RBD tooling that you're probably referring to is set up to do specific instances, because it was designed for OpenStack, where things were mounting as part of a QEMU/KVM process. But you can also just blocklist an IP, and it's stuck blocklisted. I think it might be a default 24-hour timeout or whatever, but you know, either you can refresh it, or it's stuck blocklisted until you unblocklist it.
D: So if you know the number of hosts and their IPs, that works. I don't know to what scale, but a large enough scale for many of the Kubernetes clusters that I've heard of being deployed.
I: So I'll keep that in mind. I mean, this is a little down the line; I'm just...
E: Yeah, I think the main thing to remember is: what is the scale we are looking at, and what is the timeline, right? We don't need to come up with the perfect solution in six months; we can come up with a perfect solution in, like, you know, two years. But the point is: what is the minimum viable product you're looking for, and in what time frame?
E: If you can get those straight to us, then we'll probably be able to guide better. At this point we're like: okay, this is a possible solution and this is the best solution, but we can pick something in the middle.
A: Thanks for bringing it up. Any other topics folks wanted to discuss?
B: Over the CDM... oh, you know, regarding collecting the slow requests in the manager. In the last discussion we mentioned that sending all the cluster log through could hurt the performance of the monitor; in the case when the cluster is suffering from a performance issue, we don't need to burden it with more load from processing the cluster log.
B: So last time our solution was to use the health protocol to collect a summary of the health reports, let the manager aggregate the summary, and then enable the manager to collect the details from, for example, the OSDs and the MDS daemons for the exact slow requests, so that the ceph CLI could collect all of them, or the dashboard could ask the manager for more details on demand.
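(A sketch of that split, with illustrative types rather than the actual ceph-mgr interfaces: daemons push only a small summary over the health channel, the manager keeps per-daemon counts, and the expensive per-op details are fetched only when the CLI or dashboard asks.)

    #include <algorithm>
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct SlowRequestSummary {
      uint32_t count = 0;       // how many ops are currently slow
      double oldest_age = 0.0;  // age in seconds of the oldest one
    };

    class SlowRequestAggregator {
      std::map<std::string, SlowRequestSummary> by_daemon;  // "osd.3" -> summary

    public:
      // Called for each daemon's periodic health report.
      void on_health_report(const std::string& daemon,
                            const SlowRequestSummary& s) {
        by_daemon[daemon] = s;
      }

      // Cheap cluster-wide view, suitable for health output.
      SlowRequestSummary total() const {
        SlowRequestSummary t;
        for (const auto& [d, s] : by_daemon) {
          t.count += s.count;
          t.oldest_age = std::max(t.oldest_age, s.oldest_age);
        }
        return t;
      }

      // Which daemons to query for full per-op details, on demand only.
      std::vector<std::string> daemons_to_query() const {
        std::vector<std::string> out;
        for (const auto& [d, s] : by_daemon)
          if (s.count > 0) out.push_back(d);
        return out;
      }
    };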
B: But when looking into this problem in depth, I realized that there are some problems we need to solve. The first one is: what if the active manager instance fails? Because before, we were using RocksDB for persisting this cluster log, but the manager does not have this facility at the moment.
A: But yeah, I think the basic point is this: we don't need all this information for debugging, and so we're trying to summarize it, and I think part of that also means we don't even need it to persist, necessarily. It's helpful for understanding what went wrong in the moment.
A: But if the manager happens to go down, I don't think we need to worry about trying to collect this information again, because if the requests are still happening then they'll still be sent again, and if they're not, then there's not much more diagnosis that we can do at that point.
E: A few data points, you know, in different places; that should be a good starting point. Even at this point we log so many slow requests, but we're probably not looking at everything; we're probably looking at when it was initiated and how long it lasted, right? Yeah.
A: That reminds me: I actually had an idea when we were investigating a recent issue, where we were able to tell what was blocking the slow requests based on the dequeue latencies.
A: To make it easier to diagnose what ended up blocking things... right now it's very difficult to tell from the existing information what is causing a slow request. If we had a sort of cache of what the recent ops were for each shard, then when the dequeue latency has a spike, we could record that in the cluster log, or record it somewhere, at least in the OSD log, and find out what the most recently processed ops were, since they were probably the culprits of that large latency.
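(A sketch of that idea, as an illustration rather than existing OSD code: each shard keeps a small ring of recently dequeued ops, and on a dequeue-latency spike the ring is dumped to the log, since those ops are the likely culprits.)

    #include <array>
    #include <chrono>
    #include <cstddef>
    #include <iostream>
    #include <string>

    class RecentOpRing {
      static constexpr std::size_t N = 32;  // assumed ring size
      std::array<std::string, N> ops;       // compact op descriptions
      std::size_t next = 0;

    public:
      // Record every op as it is dequeued on this shard.
      void record(std::string op_desc) { ops[next++ % N] = std::move(op_desc); }

      // On a latency spike, log what this shard processed most recently.
      void maybe_dump(std::chrono::milliseconds dq_latency,
                      std::chrono::milliseconds threshold) const {
        if (dq_latency < threshold) return;
        std::cerr << "dequeue latency spike (" << dq_latency.count() << " ms);"
                  << " most recently processed ops:\n";
        for (const auto& op : ops)
          if (!op.empty()) std::cerr << "  " << op << "\n";
      }
    };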
E: Yeah, I don't think we want to keep all of them. We can keep a count of how many of them there are, right? Like, just the count of how many slow requests over time we have seen, if it's like...