From YouTube: Ceph Developer Summit Quincy: CephFS
Description
00:00 - Quincy open issues
00:59 - Dashboard and FS/NFS
14:05 - mds_memory_target
18:19 - FScrypt integration
24:32 - async rmdir/mkdir and link/rename
30:55 - recursive unlink
42:14 - MDS rolling upgrades
50:00 - CephFS mirroring metrics (in line with RBD mirroring metrics)
A
Welcome, everybody, to the CephFS Quincy CDS. The agenda is in the chat.
A
First stop: we have the dashboard and FS/NFS current status and next steps. Did Ernesto put that up?
B
So please let me know when you're seeing my screen. Yep, you see it. Okay, thanks. So, this is for those of you not familiar with the dashboard, or with the file-system-specific component of the dashboard.
B
This is how it looks. I'm running a vstart cluster, so some things will be slightly different compared to a cephadm one, but I think it will basically work for the purposes of this demo. Basically, in vstart there's this default file system, so currently this is what you get in the dashboard. I've also enabled a couple of MDSs, so we have an active and a standby one.
B
That's currently the info, plus the pools associated with this file system, and we have this chart. I think there was a proposal to remove it because we have the Grafana ones, but this is coming from the information in the manager, so it's essentially free to get, so we may leave it. Apart from that, we have the clients; right now there is no client connected here. And then the directory listing.
B
So now we're able to see a list of directories, and we also have the possibility of creating snapshots. The snapshot comes with a pre-built, suggested name based on the date. For the directories we also have the ability to set quotas, so basically we can set quotas there, and that's mostly it regarding the operations for the file system.
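(For reference, the dashboard operations described above correspond to plain CephFS primitives. A rough illustration, not the dashboard's actual code: with the file system mounted, a directory snapshot is just a mkdir inside the hidden `.snap` directory, and quotas are set through the `ceph.quota.*` virtual extended attributes. The mount point and names below are assumptions for the example.)

```python
import os

# Assumed CephFS kernel mount and target directory (hypothetical paths).
mount = "/mnt/cephfs"
dirpath = os.path.join(mount, "projects", "demo")

# Create a snapshot of the directory: a mkdir inside its hidden ".snap" dir.
# The date-based name mimics the suggested name the dashboard pre-fills.
os.mkdir(os.path.join(dirpath, ".snap", "snap-2021-06-15"))

# Set quotas on the directory via CephFS virtual extended attributes:
# maximum bytes and maximum number of files under this directory.
os.setxattr(dirpath, "ceph.quota.max_bytes", b"10737418240")  # 10 GiB
os.setxattr(dirpath, "ceph.quota.max_files", b"100000")
```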
B
There is the Grafana dashboard as well. What else is missing here? That would be it for pure CephFS. If we go to NFS, which we didn't demo for RGW, we may use this for demoing that as well. This is how it looks; this one was created yesterday, so with cephadm it would be a bit different. I configured this.
B
One point that has also been recently investigated: it seems like we're able to create NFSv3 exports, though I'm not sure if, from the dashboard perspective, the NFSv3 mount point is there, and I'm not sure if the Ganesha daemon, we haven't tested that yet, is fully working with this config. I'm also not sure how this works with the volume manager or the new nfs module, so we'll need to double check that. But basically, you can create an export from here.
B
D
B
I think it will probably be worth having a follow-up discussion on this, because there are different, well, upstream/downstream discussions around this topic. So at least where it's possible to do that, we can simply disable or hide it, or whatever. It's just for our convenience.
E
Yeah, and also, I would probably specifically want to be using NFSv4.1 or greater. 4.0 is actually quite a different protocol, and it may be fine on 4.0, but it's not well tested, and one thing it doesn't give you: if you use 4.0, then your grace periods have to last the full length. You can't lift the grace period early; if the server goes down, it has to come back up. And that applies to any server in the cluster as well.
B
Okay, I think for this we're still directly creating, correct me if I'm wrong, the export files in RADOS, right? We are not using the nfs or the volume module for this, right?
C
In that cluster that you have, I think it's user-defined here? Yeah, well, in this case it's user-defined, yeah. I think the orchestrator was the only thing this worked with before, but now we also allow user-defined exports, so they can set it this way. But this is one of the debates: whether we definitely drop this for Quincy or not.
A
I put it in the agenda: there's a tracker ticket for the NFS export update to integrate the dashboard with the nfs plugin. There's actually quite a bit of work that needs to be done, but it's all laid out in that ticket.
B
Okay, I will resume this and just create a sample there, for example. This form will create this directory in CephFS. We can, for example, drop the NFSv3 option, and then we lose the tag option; we put another name for the export, select this option, and that's it. So we have the other export created here, and if we go to the file system, the directory is there, so there's the sample directory created there.
B
Currently we don't support the creation of files here. I don't really know the reason for that, but we may start adding more fine-grained operations, maybe also dealing with objects. It's something we haven't considered yet, but as we progress towards more fine-grained operations we may include that as well, just as some way of checking that there are some objects. I think that's all of it. Not sure if you have any questions about this, or anything you would miss here from the CephFS or NFS perspective.
A
I had a few comments. The snapshotting of directories is interesting. What would be really cool is if we could get that to hook into the new snap_schedule module so that you can manipulate the schedules from within the dashboard.
A
So yeah, you've already got being able to snapshot individual directories with arbitrary names; that's great, but I think the next step there would be the scheduling. On the performance metrics: I'm trying to recall where exactly the MDS is sending those metrics. I think it's in the manager report message that periodically gets sent out; that hasn't been touched since John Spray was working on this.
A
What we have now, though, are the new metrics that are reported for cephfs-top. Venky's not here, but I think the vision is: we have the cephfs-top tool for the command line interface, and then the dashboard would also consume the same metrics to display them in the browser.
A
So I think that would be the direction we would want to go next: you'd be able to get the client listings and various information about the clients, like how many caps they're using and how effectively they're using their capabilities, like how well their caches are being used, and things like that.
G
So, I don't want to divert, but I'd like to register interest here: I work on Manila and OpenStack.
G
And I noticed that the way the dashboard handles exports is more up-to-date, or a different method than we use, in particular for Ganesha.
G
We use a D-Bus mechanism to update the exports rather than using a watcher on the RADOS URL, even though we stick the URL in a RADOS object. So I'm wondering if we should be updating; we've recently updated to use the CephFS subvolume module rather than the old volume client library, and we have an interest in staying up to date with what's happening here.
G
I don't want to divert the conversation about the dashboard. The UI aspects of it are primarily of interest to us from a read-only cloud administrator perspective, rather than in terms of updating exports, all of which is API automation for us, but underneath there's clearly a different API. I've played with the dashboard and the orchestrator with Pacific a bit, with cephadm, and they're both more up-to-date than what we're doing.
G
Maybe it's a question mainly for Jeff, Patrick and Ramana, but just letting you know: I noticed what's happening here, and if we should be changing, let us know and we'll work with you on it.
G
And the other aspect of this is that you can deploy Ganesha active-active with the orchestrator, and that's complicated. For us, if we lose a node, systemd is not able to handle that migration to another node today, and we had talked about Kubernetes and so on, but simpler than that problem right now is just the export maintenance issue, and that may be the place to start.
A
Yeah, the nfs plugin right now also does all the orchestration handling for setting up NFS clusters. But I think there is interest in making it work with a statically defined set of NFS servers, which is something both the dashboard, from a legacy-support perspective, and also Manila are interested in, but that work hasn't been scoped out yet.
G
B
Okay, so I think that's all from our side. If you have any extra feedback, please feel free to reach out to us, and we may also discuss that later. Thank you. Thanks, folks.
A
Next on the agenda is the MDS memory target.
A
So the tracker ticket is in the agenda. Unfortunately the PR associated with this ticket has atrophied quite a bit; it needs to be revived, but it's a fairly simple PR that just dynamically modifies the MDS cache memory limit in response to the total memory usage of the MDS. The target would be defined by this new mds_memory_target variable, analogous to the other ones in the OSD.
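(A minimal sketch of the idea behind that PR, with made-up names and a deliberately simplistic policy: periodically compare the MDS's total memory usage against an overall target and nudge the cache limit down or up so the process converges toward it. The real change lives in the MDS itself; this only illustrates the feedback loop.)

```python
def adjust_cache_limit(rss_bytes: int,
                       memory_target: int,
                       cache_limit: int,
                       min_limit: int = 128 * 2**20,
                       step: float = 0.05) -> int:
    """Return a new mds_cache_memory_limit nudged toward mds_memory_target.

    Hypothetical helper: if total MDS memory (rss_bytes) exceeds the target,
    shrink the cache limit; if there is ample headroom, let it grow back,
    never dropping below min_limit.
    """
    if rss_bytes > memory_target:
        # Over target: shrink the cache limit based on the overshoot.
        overshoot = rss_bytes - memory_target
        new_limit = cache_limit - max(int(cache_limit * step), overshoot // 2)
    elif rss_bytes < memory_target * 0.9:
        # Comfortably under target: allow the cache to grow slowly.
        new_limit = int(cache_limit * (1 + step))
    else:
        new_limit = cache_limit
    return max(new_limit, min_limit)
```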
A
F
One of the to-do items for cephadm, also, is to have it set these limits and be able to scale them automatically to the size of the node or whatever it is. So it'd be nice if it was using a consistent setting across all the different daemon types, but that's very much a nice-to-have. That's all. Yep.
D
E
F
It's something that Mark built for BlueStore. If you have a pool of memory that's consumed by multiple caches, you can set various policies around how you would like that memory to be used. So, like, these caches get at least this much memory, and then above that they scale at this ratio, and then if it's above a certain point this one gets all of it, or whatever it is. You know, sort of tiered.
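(A loose illustration of the tiered policy just described, not the actual PriorityCache code: each cache declares a minimum reservation, the remainder is handed out proportionally to per-cache ratios, and anything still left over goes to one designated cache. Names and numbers are assumptions.)

```python
def assign_memory(total: int, caches: dict) -> dict:
    """Toy version of a tiered cache-memory policy.

    `caches` maps name -> {"min": bytes, "ratio": float}, ratios summing to
    <= 1.0. Every cache first gets its minimum, the remainder is split by
    ratio, and the final leftover is given to the last cache, mimicking the
    "above this point, one cache gets all of it" tier described above.
    """
    out = {name: spec["min"] for name, spec in caches.items()}
    remaining = total - sum(out.values())
    if remaining <= 0:
        return out
    for name, spec in caches.items():
        out[name] += int(remaining * spec["ratio"])
    leftover = total - sum(out.values())
    out[list(caches)[-1]] += max(leftover, 0)
    return out

print(assign_memory(4 * 2**30, {
    "inode_cache": {"min": 256 * 2**20, "ratio": 0.6},
    "cap_cache":   {"min": 64 * 2**20,  "ratio": 0.3},
}))
```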
F
So it's helpful if you have multiple consumers. In this case we just have the MDS cache as sort of a single knob, so I'm not sure it would be appropriate, but it would be if, for example, we were independently scaling memory consumed by caps versus inodes.
D
F
A
I think the one challenge with integrating with the priority cache is that just dropping cache entries is not as simple as it is in BlueStore. It might require a client cap recall, which probably complicates it.
D
E
At least in the near term, if you have to do a cap revoke, you're probably doing memory allocations and making it even worse, right? So, I mean, is this really a bug per se? It almost looks like, if we're not hitting the mds_cache_memory_limit correctly, maybe our math is just off.
F
D
A
All right, next on the agenda is fscrypt integration. Jeff, would you like to go through where that's at right now?
E
Yeah, I've got the file names piece pretty much done, and we've got the alternate name feature that we need; Zhang picked that up, started it and then finished it up. We need to be able to give entries a secondary name in case they're very long, because we can't just encrypt and hash them at that point, or rather, we can't just encrypt and base64-encode them at that point.
E
Because we want to keep all the file names under NAME_MAX, it's a little complicated, but effectively, yes, I've got the file name portion of it done.
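(The NAME_MAX problem is easy to see with a little arithmetic: fscrypt encrypts the name, padded to a multiple of its padding size, and the ciphertext then has to be base64-encoded to stay a legal file name, which inflates it by roughly 4/3, so near-255-byte names no longer fit and need the alternate-name mechanism. A back-of-the-envelope sketch, not the exact encoding the kernel uses; the padding value is an assumption.)

```python
import math

NAME_MAX = 255   # usual Linux limit on a single path component
PAD = 16         # assumed fscrypt name-padding granularity

def encoded_len(name_len: int) -> int:
    """Rough length of an encrypted + base64-encoded file name."""
    padded = math.ceil(name_len / PAD) * PAD   # ciphertext length
    return math.ceil(padded / 3) * 4           # base64 expansion (~4/3)

print(encoded_len(150))  # ~216: still fits under NAME_MAX
print(encoded_len(200))  # ~280: too long; needs the alternate (secondary) name
```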
E
What I'm stuck on now is the fact that the MDS is what handles truncates, and for content encryption, when we set the size to a particular length, the MDS will come back and truncate the thing down to that byte. But that could be in the middle of a crypto block, and at that point we can't decrypt the tail of the file.
E
We have to ensure that we somehow teach the MDS to round up to the end of the next crypto block as we do this. There are a couple of different approaches we could take, and I think Greg, Greg Farnum, is going to help out with some of this. In any case, we've got a meeting scheduled for later today just to discuss where to go with this.
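(The round-up itself is simple arithmetic; the hard part is plumbing it through the MDS. A sketch, assuming a 4096-byte content-encryption block size purely for illustration: a truncate to an arbitrary length is rounded up to the next block boundary so the last block stays decryptable, while the requested logical size still has to be tracked separately.)

```python
FSCRYPT_BLOCK = 4096  # assumed content-encryption block size for illustration

def rounded_truncate_size(requested_size: int) -> int:
    """Round a truncate target up to the end of its crypto block.

    Truncating mid-block would leave a partial ciphertext block that the
    client could no longer decrypt, so the stored data is cut at the block
    boundary and the logical (requested) size is remembered separately.
    """
    if requested_size % FSCRYPT_BLOCK == 0:
        return requested_size
    return (requested_size // FSCRYPT_BLOCK + 1) * FSCRYPT_BLOCK

print(rounded_truncate_size(10000))  # -> 12288, the end of the third 4 KiB block
```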
E
So I've got it about halfway done, and the rest of it doesn't look too bad, sans the part where we actually have to do that. We also have a code path in the Ceph client that handles uncached I/O, depending on caps.
E
For instance, it will just write through synchronously to the server whenever it needs to go to the OSDs, whenever it needs to do a write or a read. If we do a write in that situation, we need to be able to do a read-modify-write cycle, because we may have to slurp in the beginning and end of a crypto block to make sure we can handle those.
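(A rough sketch of the read-modify-write alignment involved, with hypothetical helper names: an unaligned synchronous write on an encrypted file is expanded to crypto-block boundaries, the partial head and tail blocks are read and decrypted, the new bytes are spliced in, and the whole aligned extent is re-encrypted and written back. This is only to illustrate the shape of the problem, not the client's actual code.)

```python
BLOCK = 4096  # assumed crypto block size for illustration

def aligned_extent(offset: int, length: int) -> tuple[int, int]:
    """Expand a byte range to crypto-block-aligned start and end offsets."""
    start = (offset // BLOCK) * BLOCK
    end = ((offset + length + BLOCK - 1) // BLOCK) * BLOCK
    return start, end

def write_encrypted(file, offset: int, data: bytes) -> None:
    """Sketch of a synchronous (uncached) write on an encrypted file.

    `file.read_decrypted` and `file.write_encrypted` are hypothetical
    stand-ins for the client's read/decrypt and encrypt/write paths.
    """
    start, end = aligned_extent(offset, len(data))
    # Slurp in the existing aligned extent so the partial head/tail blocks
    # can be decrypted, then splice the new bytes into the plaintext buffer.
    buf = bytearray(file.read_decrypted(start, end - start))
    buf[offset - start:offset - start + len(data)] = data
    # Re-encrypt and write the full aligned extent back out.
    file.write_encrypted(start, bytes(buf))
```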
E
So we've got some code that does all this; it's pretty complicated, and I'm hoping I can clean it up a little bit before it's ready for merge. I'm hoping that maybe by the summer we'll have something that's ready to go, but we need some MDS support for the truncate. The file name portion actually looks pretty good now; it's working fairly decently.
E
Luis Henriques says that he was occasionally hitting a problem with it, and he was trying to track it down, but I haven't been able to reproduce the problem. So anyway, that's where we are. I don't have a lot more to discuss here.
A
Yeah, I think as far as the direction we're going to go with handling truncate right now, the preferred solution is having two sizes per file: one is the actual size that we're already using, and the other is a size that's used for the purposes of truncate, so the MDS will not truncate off the last few bytes that we need. Both sizes will be protected by the same locks. Or maybe, sorry, I have that reversed.
A
E
We could even take that field that the crypto-enabled clients use and encrypt it too, so we could cloak the full size of the file if we wanted. The catch there is that when you do the crypto, the crypto blocks have to be at least 16 bytes, so we would be consuming an extra word in the inode somewhere to do this.
D
A
Also, for those who aren't aware of this project, we didn't really introduce it: fscrypt is in the kernel tree, and it's a generic library for file systems to use to encrypt files and file names within the file system, currently supported by ext4. I think it was originally a project for Android, and we're now trying to use it in Ceph.
A
E
Yeah, the neat thing about this is that it allows you to set keys as an unprivileged user. So if you've got, say, a VM or something that you're using it for, you're not sharing the keys with the MDS at all.
E
So someone like a hosting provider could give you a chunk of space on a CephFS, and you can then use the fscrypt code to encrypt all that data. So even though it's hosted in a public place, you still have a lot of protection for the data.
E
D
A
All right, let's move on to the next agenda topic: asynchronous rmdir/mkdir and link/rename.
A
D
A
A
But the problem is that once it reaches a mkdir or an rmdir, that becomes a barrier for future operations. So it would be nice to make all those system calls completely asynchronous, so that you can rm -rf a whole subtree and it almost instantly completes, and then you can do an fsync to ensure that it's actually durable.
A
Neither untar nor rm -rf actually do those fsyncs, so from the user's perspective it would be an instant change. That would be the direction we'd like to go. So the next step is getting rmdir and mkdir to definitely be asynchronous, and then link and rename is sort of a nice-to-have stretch goal, because those are operations that are used by rsync, for example, in some configurations.
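(From the application's point of view the model is the one POSIX already allows: metadata operations may complete asynchronously, and a program that actually cares about durability has to fsync the parent directory. A tiny sketch of what "rm -rf, then make it durable" looks like under that model, standard Python, nothing CephFS-specific:)

```python
import os
import shutil

def remove_subtree_durably(path: str) -> None:
    """Remove a subtree and make the removal durable.

    With asynchronous unlink/rmdir, rmtree() returns almost immediately;
    only the explicit fsync on the parent directory forces the deletions
    to be persisted. Plain rm -rf and tar do not issue this fsync, so for
    them the change would look instant from the user's perspective.
    """
    parent = os.path.dirname(path.rstrip("/"))
    shutil.rmtree(path)                       # may complete asynchronously
    fd = os.open(parent, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)                          # durability barrier for the removals
    finally:
        os.close(fd)
```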
A
So with that said, Jeff, did you have any thoughts on where we are with that? Is there any current code, or do you know if you'll have time to work on it? Putting you on the spot.
E
Yeah, I don't have any current code. The hard part is how we handle recursion, you know, recursive unlinks and stuff, right? The catch here is this:
E
when you start doing these things asynchronously, you kind of lose control over the ordering. So it's harder to ensure that, at the point where we might have deleted all the dentries in a directory on the client and then go to do an rmdir, that rmdir doesn't somehow get ordered before all the unlinks on the MDS.
E
So we have to make sure that doesn't happen, because that rmdir won't work if there's still a file in the directory. That's the hard part, and that's why we have made rmdir a synchronization point. Maybe we need to revisit that and think about how to do it; I don't know a way right off hand to make that simple. I haven't really figured out anything.
A
We already have that issue, though. Like, if I have two applications that are unlinking files in a directory, and then finally, at least from the client's perspective, all the files have been deleted, and then one of the applications does an rmdir, isn't that rmdir already ordered against the other unlinks in flight?
E
I'd have to think about it. I don't know, I mean it's worth experimenting with; we could probably draft something, throw it together, and see if we can make it work. It's not too hard to fix the client to do this. There are some guard rails in there, barriers basically, that keep you from proceeding until all the dentries have synchronously been deleted from the MDS.
E
We could remove that and see how it goes. I hit problems with this before, but again, the catch is that when you issue these things, the unlinks come back very quickly and the calls are still very much in flight; they may not have been transmitted yet.
E
So it's not too hard to imagine that the rmdir you're calling ends up getting the mutex for the socket before the unlink gets its chance to be sent. We have a lot of competing mutexes and stuff too, so it's hard to know which way it's going to fall out. Anyway, when I did some experimenting a while back I hit problems where you get "directory not empty".
E
D
A
Yeah, I think right now this project is going to sit until we have resources available to work on it. Maybe Xiubo will have time; he's developed an interest in the kernel client.
A
All right, the next one is recursive unlink, which is kind of similar. This is an idea to just add an actual RPC to the MDS that does a recursive unlink of a subtree. One of the main target use cases we have, at least in an immediate sense, is the volumes plugin being able to recursively delete an entire subvolume without the manager having to list and rm the entire subvolume in an asynchronous fashion.
A
It could just shoot the RPC off to the MDS and let the MDS chew through it over time. Venky is planning to work on this, hopefully for this release. I think actually supporting it will be a little weird for the MDS, or require some changes, because the MDS assumes that every directory that's actually in the stray directory is already empty.
A
So it needs to be taught to deal with directories that still have entries in them, and also subdirectories, etc. And this will be explicitly not POSIX compliant, because link counts will necessarily not be updated for
A
the entire recursive subtree; you wouldn't be able to iterate through all the files to update the link counts until the MDS actually performs the unlink. To make this available across all of our drivers, and not just as a special-purpose RPC in libcephfs, I was thinking we would also plumb in support by maybe having a hidden .trash directory that you can move things into, in both the FUSE and kernel clients, that would be translated into that recursive unlink operation.
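(To make the client-side plumbing concrete: the idea, still only a proposal, is that moving a directory into a hidden, hypothetical `.trash` directory at the root of the mount would be translated by libcephfs and the kernel client into the recursive-unlink RPC. From an application's point of view that could look as simple as a rename; everything below, including the `.trash` name, is an assumption for illustration.)

```python
import os

def recursive_unlink(mountpoint: str, path: str) -> None:
    """Sketch of the proposed user-visible interface for recursive unlink.

    Renaming a subtree into a hidden ".trash" directory (hypothetical name
    from the discussion) would be translated by the client into a single
    recursive-unlink RPC; the MDS then chews through the subtree over time
    instead of the caller listing and unlinking every entry itself.
    """
    trash = os.path.join(mountpoint, ".trash")
    os.rename(path, os.path.join(trash, os.path.basename(path.rstrip("/"))))

# e.g. deleting an entire subvolume without the mgr walking the tree:
# recursive_unlink("/mnt/cephfs", "/mnt/cephfs/volumes/group/subvol0")
```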
F
Just a random thought: since you mentioned that .trash, would it make sense to have that be like a proper trash feature? Like, if you rename something into that directory it'll just stay there until the MDS decides to delete it based on some policy.
A
Yeah, I don't think we really thought about that, but that could be a very good add-on feature: having special stray directories that are only unlinked after the entries have been there long enough. We probably wouldn't call it trash, though.
H
So there's some discussion in the tracker ticket, but hard links are going to be really irritating to deal with on a recursive unlink. I'm not sure if there are other issues, but that's the most immediate one that pops up in my head.
A
H
A
Wait, all right, so are we talking about handling, like, the subtree authorities after doing an...
H
A hard-linked file: you have a file in /foo/bar, and there's a hard link outside to one of those files, and then you issue a recursive delete on foo/bar. You need to, like, move the file up or out somewhere; it can't just sit in the existing purge queue and then go away, because something else might be pointing at it.
H
H
And I'm saying that, before we start on this, that needs to be figured out.
H
I don't remember what invariants we hold around the link counts and their locations, because, like, the current backtrace might be out of date, and so we look at it and say, oh, it's in the recursively deleted section, we can just throw it away, but it actually got moved out.
F
H
F
H
H
H
A
Yeah, we haven't thought about those details quite fully, Greg, to be honest, but that's a good point. In my head, logically it's the same as just renaming a subtree somewhere special, and the MDS just happens to be unlinking those entries. So yeah, we're not really sure.
D
Just curious, this kind of brings up another idea in my head: does CephFS generally run a scrub or fsck periodically in the background currently?
D
Okay, I was just thinking that if there are structures we end up needing to clean up in some way, like tracking indexes of hard links or something like that, that could potentially be a way to do it.
A
That is a problem right now: the stray directory can grow unbounded. If you have a huge subtree of hard-linked files, and one of those subtrees gets deleted, and the MDS never touches the remote entries, it'll never do the reintegration, so the stray directory just keeps growing. So it may make sense to just have the MDS periodically go through the entire file system so that it can do that reintegration if it's not driven naturally by client I/O.
F
Well, it seems like if we add a schedule for scrub, so it runs once a week or whatever it is, then this would sort of happen organically as well.
A
All right, the next topic is MDS rolling upgrades. This has been a kind of troublesome part of CephFS for a while now. In fact, we have a rather complicated upgrade procedure for CephFS that users should follow, involving pretty much turning off all MDSs except the one active on a file system, reducing max_mds to one, and then actually doing the upgrades of all the MDS daemons.
A
And then finally restoring max_mds. The reason for turning off all the standbys has been that if there's been any change to the compat set for the file system, which is uniform across all the file systems, then all of your MDS daemons commit suicide, which is fairly scary, because you have aborts in all of your MDS logs.
A
A patch set makes it so that the MDSs no longer do that, no longer abort, if the compat set on the file system changes. We now store the compat set for each file system, and the MDS daemons report a compat set in all of their beacons describing what they support; what an MDS supports doesn't change for its entire instance. The monitors will do standby promotion only if the MDS is compatible with whatever the file system's compat set is. Furthermore, they'll only upgrade the file system compat by promoting a standby that has a newer incompat in its compat set if it meets certain requirements, namely that the file system has only one rank, max_mds one, and standby-replay is disabled. If that's true, then they'll actually do the promotion, and then finally the file system's compat/incompat will also be updated.
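(A condensed sketch of the promotion rule just described, with simplified, illustrative types rather than the actual MDSMap structures: a standby may only be promoted if its compat set covers the file system's, and the file system's compat is only upgraded through a newer standby when the file system is down to one rank and standby-replay is off.)

```python
def can_promote(standby_compat: set, fs_compat: set) -> bool:
    """A standby is eligible only if it supports the fs's incompat features."""
    return fs_compat.issubset(standby_compat)

def promote(standby_compat: set, fs: dict) -> bool:
    """Toy model of the monitor-side promotion/upgrade rule.

    `fs` is a dict with "compat" (set of incompat feature names), "max_mds"
    and "standby_replay"; all names here are illustrative assumptions.
    """
    if not can_promote(standby_compat, fs["compat"]):
        return False
    if standby_compat > fs["compat"]:
        # Standby knows newer incompat features: only allow the upgrade
        # when the fs has a single rank and standby-replay is disabled.
        if fs["max_mds"] != 1 or fs["standby_replay"]:
            return False
        fs["compat"] = set(standby_compat)  # record the upgraded compat
    return True
```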
A
In CephFS we only use the incompat feature set within the compat sets, so that's the only thing I'm talking about here. That works pretty well, and it's something I want to backport to Pacific; I promised it would be in Pacific for Rook, because tearing down all the standbys, turning them off, is a rather onerous upgrade procedure for the orchestrators to have to deal with, and this gets us pretty far toward making that process a little easier.
A
The next step in supporting rolling upgrades is to make it so you can have mixed versions of the MDSs, mixed-version actives, on a file system. I think supporting that is going to require some significant changes to the MDSMap; some of those things I've laid out in the ticket. There have also been some efforts to version a lot of the messages that go back and forth between the MDSs, so that we can note that something is a newer version, and then I think there needs to be some work to gate features with flags in the file system, so that MDSs don't make breaking changes to the metadata until all the MDSs are on the same version, similar to what we do with the monitors and OSDs.
A
It's a much more significant undertaking. The compat set changes are much simpler and get us pretty close, but the rolling upgrades are something I would like to get done for Quincy. So that's where we are currently. Any thoughts on that?
F
Just before thinking about the mixed-version thing, with the changes that you have so far I want to make sure that the cephadm upgrade procedure is updated accordingly. But it occurs to me, something you mentioned that I didn't realize before, that you also need to have no standby-replay.
F
A
F
A
F
I don't know, because when you're doing standby-replay, do you have to decide which rank each standby is replaying for, or does the cluster sort of automatically figure that out?
A
F
Okay, and what do they do if you have, say, four MDSs and you only have two standbys and you have that option turned on?
A
F
D
A
All right, next on the agenda is CephFS mirroring metrics in line with RBD mirroring metrics.
F
A
So unless someone has something they'd like to discuss about it, I think we'll just table that, and the same with exporting client metrics too.
F
D
E
F
The idea there is to have as much alignment as possible between the metrics that rbd-mirror is presenting and what cephfs-mirror presents, so that they're hopefully as close to identical as possible.
D
A
I know Venky's been working with some folks on ensuring that. All right, the last topic on the agenda is fencing.
A
So the target scenario for this, or at least one of the target scenarios, is that we have two Kubernetes clusters and one Ceph cluster, and if an entire Kubernetes cluster becomes unavailable for whatever reason, you want to be able to blocklist all of the active instances from that cluster. So the open question is: how do we...
A
How do we express that with how we currently do blocklisting? We have Sean here, he was kind of leading this session... no, he's not here. So there are a few different ways we can do this, or that we've thought about doing this, and one was to have a tag of some kind on client credentials.
A
Maybe add something to the caps list that indicates what kind of availability zone it's associated with and so on, and then you'd be able to blocklist an entire set of auth credentials, or the instances that are using those auth credentials, based off of that tag. I don't recall the exact details, but there were discussions on the mailing list that indicated that wouldn't be quite workable.
A
In particular, I think one of the main challenges is that you need to be able to not blocklist new instances, so that if a new instance comes from that cluster it is not blocklisted. So you perhaps need some kind of generation ID or epoch for how you blocklist an entire group based off of a tag, and then I think the development discussions kind of stalled there.
A
I'm not sure if there have been any new pushes for supporting this. I know it's a feature that's been desired by the ceph-csi folks for Rook deployments, but I haven't heard a lot about it recently.
A
D
I've heard about the desire for it from the folks working on ceph-csi more recently, so I think it's still something that they're interested in.
D
But nothing concrete is going on.
A
A
So I guess, depending on interest, we should have some discussions, Josh, about how this would look.
D
Yeah, it sounds like we could get the interested folks together at some point and try to figure that out. The generation concept sounds reasonable, but I'm not sure exactly how that would look.
A
You just blocklist all the instances that existed before a certain OSD epoch, and each instance has an epoch associated with it, based on when it first appeared, the birth of the instance itself.
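(A conceptual sketch of that generation/epoch idea, with entirely hypothetical names: fencing a site records the OSD-map epoch at fence time, every client instance carries the epoch at which it was born, and only instances born at or before the fence epoch get blocklisted, so new instances from the recovered site come back cleanly.)

```python
from dataclasses import dataclass

@dataclass
class ClientInstance:
    addr: str
    site: str
    birth_epoch: int   # OSD-map epoch when this instance first connected

def instances_to_blocklist(instances: list[ClientInstance],
                           fenced_site: str,
                           fence_epoch: int) -> list[ClientInstance]:
    """Return the instances a site-wide fence would blocklist.

    Only instances from the fenced site that already existed at the fence
    epoch are blocklisted; an instance from the same site that is (re)born
    after the fence epoch is left alone, which is the property the simple
    tag-based scheme was missing.
    """
    return [c for c in instances
            if c.site == fenced_site and c.birth_epoch <= fence_epoch]
```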
A
H
Is it actually important to them to be able to blocklist only a specific site, or can we just force all the clients to redo their sessions?