From YouTube: Ceph Developer Summit Quincy: RADOS
Description
00:00 - Beginning
00:23 - Dashboard and rados: overview of current status and next steps
34:11 - Review crash telemetry panels: https://pad.ceph.com/p/telemetry-crashes
48:05 - Immutable Content Optimizations
1:10:37 - Common mgr pool
1:46:14 structured confvals to generate document (and c++ source code): https://pad.ceph.com/p/confva-yaml-doc
2:02:08 - automated auth key rotation: https://pad.ceph.com/p/auth-key-rotations
https://ceph.io/cds/ceph-developer-summit-quincy/
A: So we've got quite a few topics already on the agenda. Should we begin?

A: All right, I'm just going to follow the order in which the topics are listed. If anybody wants a topic to be brought up earlier or later, let me know. Okay, so the first topic is the dashboard and RADOS: an overview of current status and next steps.

A: This, I believe, will be led by Ernesto and Alfonso.

B: I think you're going to run the demo; we're basically going to walk around the support for RADOS-specific stuff in the dashboard, and the things to come. We will mainly try to get your feedback and suggestions, and also learn about the things that are new to RADOS that you see as key for usability and that an operator should have in the dashboard.
C: Here we have the services that are currently running, and if we click on a service we can see the daemon type, the ID, the container ID and some other details, the status and other information that could be relevant. The plan here is to collect feedback on what we should be showing, or on other relevant info that could be important, that kind of thing.

C: Right now I have no OSDs, so I will proceed to create some. I can go to the physical devices list. This is the current way of adding OSDs: you try to apply some filters in order to proceed. So, for example, I will go for the type and select HDD, so I have applied a filter.
B: Regarding this specific workflow, we recently got a request from downstream QE about being able to specify the path for the filter in the drive group. I think that's supported by the specification, but I'm not sure how useful it is. Are there specific use cases where we can foresee that a user would need to filter by a specific path, apart from the other fields that we are exposing right now?

E: I think that depends a bit on what you want to do. If you have a day-one situation and you want to create a lot of OSDs on a lot of hosts, then specifying the path might be a bit problematic. On the other hand, if you just want to add a few OSDs on one specific new host or so, then having the path would make sense. So I think we kind of need both.
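The two filtering styles being contrasted here could be sketched roughly as follows. This is an illustrative model only; the field names (`rotational`, `min_size_gb`, `paths`) are simplified stand-ins and not the exact Ceph service-spec schema:

```python
# Illustrative sketch of drive-group-style device filtering. Field names
# are simplified stand-ins, not the real Ceph OSD service spec schema.

def select_devices(inventory, spec):
    """Return the devices on a host matching a drive-group-like filter.

    inventory: list of dicts like {"path": ..., "rotational": ..., "size_gb": ...}
    spec: dict that may constrain "rotational", "min_size_gb", or exact "paths".
    """
    selected = []
    for dev in inventory:
        # An explicit path list (the day-two "this one disk" case) wins.
        if "paths" in spec:
            if dev["path"] in spec["paths"]:
                selected.append(dev)
            continue
        # Otherwise apply broad attribute filters (the day-one "many hosts" case).
        if "rotational" in spec and dev["rotational"] != spec["rotational"]:
            continue
        if "min_size_gb" in spec and dev["size_gb"] < spec["min_size_gb"]:
            continue
        selected.append(dev)
    return selected


inventory = [
    {"path": "/dev/sda", "rotational": False, "size_gb": 500},
    {"path": "/dev/sdb", "rotational": True, "size_gb": 4000},
    {"path": "/dev/sdc", "rotational": True, "size_gb": 8000},
]

# Broad filter: all HDDs (rotational devices) on the host.
hdds = select_devices(inventory, {"rotational": True})
# Targeted filter: exactly one known device path.
one = select_devices(inventory, {"paths": ["/dev/sdc"]})
```

The point of the discussion is that the broad filter scales to many hosts, while the path filter only makes sense when the operator already knows the one device they mean.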
E: They probably won't understand them and will make mistakes. On the other hand, using the path on a lot of hosts is also going to be problematic.

E: Yeah, I don't have a clear answer; that's the problem I have.

D: But I guess in my mind the case is: there's a specific drive on a specific host that I want to go create an OSD out of, and it seems like that's a slightly different workflow.

D: It's almost like you want to go to the device inventory list, you see a disk that's available, and you want a button that says "create an OSD out of this disk", and it just does it, right? That's what's going to happen when you've replaced a disk or you've swapped something or whatever; you just want to go "okay, where's the disk? There it is."

H: By the way, I noticed something like sda listed in some examples here. I think you want to be careful about using persistent names for the disks as well. Things like sda or sdb are liable to change with hardware changes or across reboots, potentially. I'm not sure if this is a split that needs to be considered in cephadm or the dashboard.
C: Okay, so we showed the way that the dashboard has implemented the OSD specification. In my cluster I should add more hosts in order to allow more OSD creation, but the thing is: we have to find ways to simplify this OSD creation.

C: Okay, so we will take note of the suggestion. Do we have an official document for collecting all the feedback, I mean for all the components, or should each component maintain its own document in order to have the minutes of all this?

B: I'm writing it down in a notepad, but I can put it in the pad, though perhaps it's going to be a bit messy if we put everything there.
B: I think that's okay. I perhaps just wanted to go back to OSDs, because I don't think you mentioned, for example, the flag settings for OSDs, and also the recovery profile; those things are maybe worth mentioning.

C: Options... I don't know. From our understanding, right now this is already available to the user, but I don't know if...

C: We have put some info there, but I don't know if it's enough. Are the brief descriptions that we have alongside this information sufficient, or should we put more info for the user? Or should we consider that the info we have is enough, for example at...
A: Yeah, essentially you can override some of the defaults, but under the hood it auto-tunes all the values at the moment. With Pacific we also have the new mClock-based profiles, which we could add to the dashboard, which will always have a default. But if the user wants to change the profile, it should just be a select from a list, and we'll do everything under the hood. So all the recovery sleep and max backfills settings should just go away.

B: Yeah, okay, so that's only for recovery, right, not for scrubbing? There were no profiles previously.

A: Yeah, with the mClock scheduler change it applies to scrub as well, and some scrub settings, let's just say that. So that's what I said: let's have a separate meeting to decide which of those settings should not appear when we're using the mClock scheduler.
D: Just a separate comment on the UI. It's a little bit confusing to me that there's this dropdown with cluster-wide stuff, but it's on a page that has all these OSDs listed. So when a dialog pops up, it's not obvious whether it came up because you clicked on something on an OSD or whatever. I wonder if it would make more sense if these cluster-wide options were just different tabs.

E: Useful information and advanced information.

B: Yeah, I understand: having a kind of advanced mode, or basic and advanced modes, so you can narrow down the amount of information you're displaying. Yeah.
A: We can have some more iterations over these. I think for the whole OSD configuration, how we display stuff and what is important and what is not, we can have separate discussions. In the interest of time, we have 13 topics and we're almost half an hour into the meeting. What other major topic areas do you want to cover, apart from the OSD section in the dashboard, Ernesto?

C: Yeah, we were just showing the cluster area of the dashboard. If there is any missing feature that you, from the RADOS team, think should be here in the dashboard, and you have noticed that something is not there that should be, that can be great feedback for us. If we finish the walkthrough quickly, we have the CRUSH map viewer.
C: You also have the manager modules, so you can, for example, select a module from the dashboard. You have the details about its settings, and you can click on edit, and then, if I want, I can edit some settings, and this will be reflected like a `ceph config dump`. For example, if I want to set a secret key and an access key, I can do it in a graphical manner in order to get access to the object gateway. Then you update, and it gets updated, and if you do a `ceph config dump` you will see the changes. In the list of the manager modules we can see which ones are enabled and which ones are always-on, for example. I don't know if we should add any other relevant info here about the modules, or if we are okay here.

B: Yeah, but you have to select the erasure coded pool.
J: I think, for the sake of time, there are a lot of items on the list today, and we got on the agenda because we basically don't think enough people use this and give us feedback. So overall we'd love to see you come to the dashboard meeting, use this on occasion, and give us the feedback we need. Otherwise this product won't be what our customers need, let alone our engineers. I'd love to see everyone use this on a regular basis.

J: So overall, please send feedback. Please help us with the features that we require moving forward; it's an absolute must. Otherwise it will not be successful. At some point Sebastian's going to talk about the dashboard at SUSE and their successes; hopefully at our next dev meeting there'll be some interesting discussions about SUSE.
D: And just for the benefit of everyone here, I see on the calendar there's a Ceph dashboard stand-up almost daily, but it's at like 3:30 a.m. my time. Is there another dashboard stand-up that's at a later time?

B: We have the bi-weekly thing now. I'm not sure what time that is for you, but it's at three p.m. European time, so we can try that; I think in the past you joined it.

B: So we may try to look for a slot where we can all gather, okay?

A: I get the idea, so let's meet more often; we can discuss how. Okay, thanks, all right.
A: The second topic here is about reviewing the crash telemetry panels. Yuri, do you want to lead this? Sure.

M: So yeah, I linked another etherpad that has a few links; there's a link to a Sentry dashboard. First of all, I don't know if everybody knows: we're collecting telemetry data, some basic data about the cluster, data about crashes, and sometimes data about the disks of the clusters, and now we're going to focus specifically on the crash data collected.

M: So the Sentry link is the first one. I don't know if you guys can...

M: ...access that; you need to have the Ceph membership on GitHub, both the team and the...

M: No, no, it's okay. It might be even faster; I'm still having network problems, so I can try if you want.

M: Cool. So right now, only the last 30 days of the data that we collected from crashes could be imported to Sentry.
M: There is an issue with importing the older events, and there's another issue with the latest seven days, but we're handling that. The main idea is to integrate the review of these crashes daily, when we're doing bug scrubbing, so that we actually get value out of it. So, Neha, Josh, if you want to suggest ideas of how we can actually integrate looking at them daily...

M: There is one thing that's still in progress, which is to enhance the daily emails. Once we know where exactly to link them to, we will have in the daily emails a summary of the latest 24 hours and 14 days with the new crashes that were reported through telemetry, and then we can link to either Sentry or the other Grafana instance and decide how to proceed with opening correlating tracker issues for them.
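The 24-hour/14-day summary described here amounts to tracking when each crash fingerprint was first seen. A minimal sketch, with an invented record shape (the real telemetry schema differs):

```python
import datetime

# Hedged sketch of the daily-email summary: count crash fingerprints first
# seen in the last 24 hours and the last 14 days. The record fields
# ("fingerprint", "timestamp") are illustrative, not the real schema.

def summarize_new_crashes(reports, now):
    """Track the first-seen time per fingerprint, then bucket by recency."""
    first_seen = {}
    for r in reports:
        fp = r["fingerprint"]
        if fp not in first_seen or r["timestamp"] < first_seen[fp]:
            first_seen[fp] = r["timestamp"]
    day = datetime.timedelta(days=1)
    new_24h = [fp for fp, t in first_seen.items() if now - t <= day]
    new_14d = [fp for fp, t in first_seen.items() if now - t <= 14 * day]
    return {"new_24h": sorted(new_24h), "new_14d": sorted(new_14d)}


now = datetime.datetime(2021, 6, 1)
reports = [
    {"fingerprint": "a", "timestamp": now - datetime.timedelta(hours=3)},
    {"fingerprint": "b", "timestamp": now - datetime.timedelta(days=5)},
    # "b" reported again recently, but it is not *new* in the last 24h:
    {"fingerprint": "b", "timestamp": now - datetime.timedelta(hours=1)},
    {"fingerprint": "c", "timestamp": now - datetime.timedelta(days=30)},
]
summary = summarize_new_crashes(reports, now)
# "a" is new in the last 24 hours; "a" and "b" within 14 days; "c" is old.
```

The key design point is using the first-seen time, so a long-known crash reported again today does not show up as "new".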
D: Yeah, I mean, ideally we'd have this nicely linked; the Redmine plug-in for Sentry is pretty useless, so I think we probably need to write something ourselves, just so we can link these to tickets. But even without that, I'm hoping we can incorporate this into our bug scrub routine, so that, in addition to looking at the tracker issues, we also look here. Some we can obviously ignore; you know, here's a random issue...

D: That one had only one event, but if we sort by events and find things that users are actually hitting... like here's an RGW issue that is apparently affecting two different users, and it looks like it's just crashing repeatedly on this particular thing. You can tell what versions are being affected.

D: Here's some more stuff: the version, the frequency over time. Ideally we'd have this linked to the trackers. Until then, maybe we can come up with something like: we ignore it once it's solved or resolved, or... I don't know; it's kind of hard without actually having it linked. I guess you can't have a custom state.
A: At least the way I use Sentry at the moment is not to do bug scrub, but when I see issues that can be tracked using Sentry, I do link them back to the tracker; or, if I open a tracker, I link the Sentry event associated with it, so anybody looking later can go and see what the frequency is, or when it started occurring, and all that kind of stuff.

A: I think it would be useful if we could make this, you know, the unique failures that are happening over a week; this could be part of our bug scrub routine.

D: Yeah, the bug scrubs are usually once a week, which isn't ideal, which is why I guess those daily emails will probably still be important. But it seems like at least once a week we should be checking this, just to see what crashes people are still hitting on the latest stable point release, because that should be guiding our work.
A: That is a clear data point of what we need to focus on, or whether there is a regression in particular. I think the question we had was when you start tracking; we currently have these 14 days of crashes, so we could make that more useful and have separate sections.

M: Okay. Maybe it's worth mentioning that the entire database is browsable through the other dashboard, and there are lots of search options there as well. So if there are new crashes reported on the latest point release, you can also see if they were reported prior to that.

M: Yes, okay, cool. So the other dashboard has all sorts of panels with statistics according to versions and other parameters as well. And I just wanted to focus on maybe the most interesting part: you can see how many new crashes there are. We call them fingerprints, because we try to group together similar crashes.
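The "fingerprint" idea (grouping similar crashes into one signature) can be sketched as normalizing the backtrace before hashing it, so reports differing only in addresses or line numbers collapse together. The real telemetry code is more involved; this only illustrates the principle:

```python
import hashlib
import re

# Hedged sketch of crash fingerprinting: strip the volatile parts of each
# backtrace frame (addresses, line numbers), then hash the normalized
# frames so similar crashes share one signature.

def fingerprint(backtrace_frames):
    normalized = []
    for frame in backtrace_frames:
        frame = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", frame)  # mask addresses
        frame = re.sub(r":\d+", ":LINE", frame)             # mask line numbers
        normalized.append(frame)
    return hashlib.sha256("\n".join(normalized).encode()).hexdigest()[:16]


# Same code path, different addresses and line numbers:
crash1 = ["OSD::do_op() at osd.cc:1234", "handler 0xdeadbeef"]
crash2 = ["OSD::do_op() at osd.cc:1299", "handler 0xfeedface"]
# Both reports normalize to identical frames, hence one fingerprint.
```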
M: So, for example, in the last 30 days you can see that there are about 300 new fingerprints that were never seen before. So obviously these are all 15.x...

M: Well, it's not that accurate, because if all of a sudden there's a cluster that started reporting and they're running Mimic, we will see new crash fingerprints for 13.x here. So these will be outliers.

M: You can specifically filter a certain point release here and focus just on that. The "minor affected" column will show you all the minor versions affected by that specific crash signature, and that's on purpose, because if all of a sudden you see something weird on a certain point release, you can see all the other versions that were affected by it as well.
M: And then you can do some more drill-downs. You can see when it was first reported; you can see the actual backtraces of it.

M: You can see just the crash ID, and here you have the actual trace that wasn't filtered, and all of the captured details from that crash. So it's easily browsable. And by the way, if you see a certain function that catches your attention, you can just click on it and you'll see all of the fingerprints that actually contain it, and you can keep filtering by more functions on the stack traces that share the same one.

M: So it would be nice if it were used more often.

A: No, we can hear you fine. Okay, this is it; anything else you want to cover, or are you good?
A: The next topic is about BlueStore block cache improvements, but I do not see any of the interested parties on the call. So maybe we will hold that for now, and if Igor, Adam or Mark join later, we can cover it.
O: Can you hear me? Yes? Cool. So we'll try to keep it short, because it's kind of a little bit off topic. There are three of us. First, Massimo is from the University of Pisa, and his focus is on streaming data and high-performance computing. There is also Nicolas Dandrimont, who is here listening, and he is working for Software Heritage, which is an initiative to collect all the source code available in the world, produced by humanity, and keep it safe.

O: The same way archive.org does. So the problem that we faced... and there is me, Loïc, and I'm actually from a company which is called Easter-eggs. That's not a joke; it's the actual company name.

O: It also shares commonalities with EOS, which is the storage system of CERN, in which they store much bigger objects, but they have billions of them.
O: So we tried to use Ceph for these small objects, and there are tens of billions of them. The problem that we faced is twofold, really. First, it's space amplification.

O: That is, we want to save space, so we use an erasure-coded pool, but unfortunately there is a space amplification that in Ceph grows to over 35% of the total storage.
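Back-of-the-envelope arithmetic shows why small objects amplify so badly under erasure coding. The numbers below are assumptions for illustration (a 4+2 EC profile and a 4 KiB BlueStore allocation unit), not measurements from the Software Heritage cluster:

```python
import math

# Rough model of small-object amplification in an erasure-coded pool:
# each of the k data chunks and m parity chunks is rounded up to the
# on-disk allocation unit. Profile and allocation size are assumptions.

def raw_bytes_used(obj_size, k=4, m=2, alloc_unit=4096):
    """Raw bytes consumed by one object in a k+m EC pool with allocation rounding."""
    chunk = math.ceil(obj_size / k)                        # logical bytes per data chunk
    chunk_on_disk = math.ceil(chunk / alloc_unit) * alloc_unit
    return chunk_on_disk * (k + m)                         # data + parity chunks


small = 3_000                 # a 3 KB artifact (a source file, say)
actual = raw_bytes_used(small)      # 24576: six 4 KiB allocations
ideal_ratio = (4 + 2) / 4           # what 4+2 EC should cost: 1.5x
actual_ratio = actual / small       # ~8.2x for the tiny object

big = 100 * 1024**2           # a 100 MiB packed blob
big_ratio = raw_bytes_used(big) / big   # back to exactly 1.5x
```

This is the motivation for packing: once millions of tiny objects live inside one large blob, the allocation rounding becomes negligible and the overhead returns to the nominal EC ratio.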
O: So no matter how hard we try, there is a very significant space amplification, because the objects are too small. And the second problem is enumeration; that is what Software Heritage is about: the users of Software Heritage...

O: So, when trying to figure out how to help with both problems, we stumbled upon the solution that LinkedIn found in their system, which is called Ambry. Essentially, what they did is group the objects together; that's a fairly simple solution. They take millions of objects, put them in one big 100-gigabyte container, and that allows them to operate more efficiently.
O: All this wouldn't work at all if these were read-write objects; it's not flexible enough. But the commonality of all these workloads is that they are immutable objects, and so we can leverage this property by packing them together, and that actually works; there is no drawback. So, for Software Heritage, what we're doing this year is to use Ceph and pack the objects into an RBD image, so that we get the benefit of everything that we just talked about.
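The Ambry-style packing being described can be sketched in a few lines: immutable small objects concatenated into one large blob, with an index at the front mapping each name to an (offset, length) pair. The on-disk format here is invented for illustration; the real designs (Ambry, Haystack, the Software Heritage work) differ in detail:

```python
import json
import struct

# Minimal sketch of pack-with-index for immutable small objects.
# Blob layout (invented): [4-byte index length][JSON index][payload bytes].

def pack(objects):
    """objects: dict name -> bytes. Returns one self-describing blob."""
    index, payload, offset = {}, b"", 0
    for name, data in sorted(objects.items()):
        index[name] = (offset, len(data))
        payload += data
        offset += len(data)
    index_bytes = json.dumps(index).encode()
    return struct.pack(">I", len(index_bytes)) + index_bytes + payload

def lookup(blob, name):
    """Read one object back using the index plus a single ranged read."""
    (index_len,) = struct.unpack(">I", blob[:4])
    index = json.loads(blob[4:4 + index_len])
    offset, length = index[name]
    start = 4 + index_len + offset
    return blob[start:start + length]


blob = pack({"sha1:aa": b"print('hi')", "sha1:bb": b"# readme"})
```

Enumeration becomes one sequential scan of the index, and a lookup is one ranged read, instead of one RADOS object (and its per-chunk allocation overhead) per tiny item; immutability is what makes the append-once, never-edit layout safe.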
O: The thing is, it's on top of Ceph, and it will be fine, but it's yet another piece of software to solve this fairly old problem. The first time Facebook published something about this specific workload was about 10 years ago, and nowadays you would think that distributed storage natively provides something that solves the problem, so that you don't have to do anything; but it's not the case.

O: So there we go: what if Ceph provided something that would help this use case, so that Software Heritage, Facebook, LinkedIn just need to use Ceph out of the box and it just works? There is no space amplification, and if you want to enumerate all the objects, it goes as fast as if you were doing an rsync on 100 gigabytes' worth of volumes. I don't know if it's possible, and before I go into the ideas that we got, I would like to probe everyone here.
H: I think one of the major aspects of why this is difficult is the listing piece at the RADOS level. Since Ceph is distributing objects across the cluster, you're treating each of them as an individual RADOS object and trying to do listing across all of those. It's a very different sort of design than when you're talking about packing a bunch of objects together: you're no longer hashing them according to their name; you're trying to group them together in some way.

O: And yes, you heard that right.
I: Well, Zipper is part of an extension system being developed for RGW that allows filtering, basically stackable drivers and things. I think what we're imagining is a filter driver for RGW, but I think in general RGW is where we would want to put this. We're definitely interested in incorporating packing support, and we see some challenges in the...

D: I think one of my questions is how this is used. In your case, when you prepare these big RBD images, is there this whole sort of offline phase where you pack it all together, build an index, put the index at the front, and then import that into...

O: Exactly.
D: Once this is in Ceph, you want the index at the front so you can do fast listing of the names of objects and look up a specific object that's in this big blob. That's a little bit challenging to do if you're sort of slowly accumulating these over time. But yeah, I think it would be helpful to have a set of constraints to better understand how it would be used, like what the...

D: What are the ingestion workloads like? Is it a trickle of files that are randomly accumulated, or is it a serial import of a linear sweep of a file system, or whatever? And then also, what are the subsequent workloads, like batch export or static file serving or so on? I think understanding what those workloads look like, and what the expectations are, would help drive the design.
O: Yeah, I suppose that's part of the problem, because it varies. I focused on the workload of Software Heritage, which is fairly easy to understand, because they collect elements and then stack them, and then you can say, after 10 million objects or 100 gigabytes: okay.

O: Now I close this, and this is immutable, and we add an index, as you said. Then, when you read, you can look up the index, and if it's a perfect hash table you have fairly quick access to the individual object. And then there is the mirroring part, which is again fairly easy, because you just want to send the 100-gigabyte object to someone else. Now, if it's implemented as an RGW filter driver, as suggested, would that allow for that kind of order-of-magnitude change?

O: Would it be feasible to stack them into gigabytes and gigabytes of a single object? That's what I don't quite see.
I: Well, I'm not sure that the filter driver explains all of it, but RGW might be using that filter in order to do listings, or to incorporate edits. There's also... and this is valuable; your post reminded us to think about this more. We have other...

I: There are other workloads that have been proposed for object packing with S3 object storage; historically these came in as a group, and they are used in hyperscale deployments, I'm not sure which ones, but where you have lots of smaller S3 objects and you'd like to pack them into some intermediate form. That probably is not the kind of object scale that your proposal has, but we might be able to reuse some machinery.
D: I mean, we have had some discussions around RADOS pools that have different access semantics, where they might be immutable, or where, when you create a new object, you know that it has a unique name or something like that, because those would affect the way that the replication and recovery algorithms work. But I think those are different, because they don't really address the problem of tiny objects and bulk import and export and so on.

O: So we can actually skip the first idea, because we already talked about it. The second idea we can actually skip too, I think, because it would be more relevant in RGW and the bulk reads.
O: I guess that's where I'm a little hesitant, because when I thought about it, the benefit of packing, let's say, at least one million objects together, having at least 10 gigabytes' worth of something that you can mirror, seemed like a good way to change the order of magnitude of the problem, and I still don't see if RGW can provide that kind of change, if that makes any sense.

D: I don't think Zipper does by itself, but Zipper is a way to do inline edits of the requests, to redirect them to do something specific. But my sense is this probably involves... I mean, I want you to jump in here too; we've talked about this a bit... the object manifest that RGW has allows for objects to have alternate structures that can be arbitrary, and an async...
I: Another piece probably would be asynchronous processing after ingest, to move these into their packed form, or to do later compactions if we make edits; your system never makes edits, but others might. And Zipper, I think, gives you inline access to the underlying target, to make more flexible access patterns inside RGW. If we see that we're looking for something that has a more complicated alternate access path, other than our bucket index, then we have some...
P: Yeah, well, Zipper is not going to solve everything magically; it's just the framework where everything will need to be created. That's one thing. Just a side note: that access pattern, where you have multiple separate objects and then you can read everything in one go, reminded me of the Swift large-object implementation.

P: So RGW already does something like that, where you can pack multiple objects as one logical entity, but that wouldn't be the way to go.

I: Because yeah, that's the static-large-object sort of implementation, but it does do that. And the pieces are... as in your workflow, the pieces are sort of the inverse of it; in Swift object storage, for these large objects, the pieces are themselves objects that something uploaded, and then it uploaded a rule for combining them, right?
R: I think the broader concern is that most of these projects have very different ratios of data to metadata, and of how many disk IOs they can afford per packed object to write it or to read it. And we don't have a great place for building the giant memory caches that something like Haystack tended to have to amortize the lookup times. I mean, RGW is the place to build something like this in the Ceph ecosystem.

R: But you need to look very carefully at whether objects are pre-packed before RGW gets them, or whether we can afford to write a bunch of four-kilobyte objects into RGW and then have RGW read them out, rewrite them into a packed form and delete them; and whether lookups can afford to first...

R: ...do the generic RGW object lookup to find the packed location, then look at the index of that packed object, and then do the actual read. That's my concern with something like this, especially if we're trying to do a generic implementation that we expect to work across a large range of anything. Because I think the Haystack paper had a lot of math to demonstrate what their constraints were and that their design hit them.
O: True. Unfortunately, we don't have the implementation, but yeah, it did. That's okay, yeah, sorry.

D: One last comment: again, I think it would be really helpful to have just something written down to better understand what that ingest workload is. For example, if you're ingesting a project at a time, or a tarball of a particular version of software or whatever, so the files are all localized within a tree or a branch of the hierarchy, all grouped together already, then that's easier to graft into some larger view of reality; whereas if you're just getting random files spread across all different parts of the namespace, all at once...
O: And so there was one last item: it's the ability to stream the objects out of Ceph. The general idea is that there seems to be an ecosystem developing of layers that transform database updates, for instance, into streams, like Kafka does, and people are developing software to analyze these streams to do various things, and they are interested in having different backends from which they can read objects.

O: So currently there is an ecosystem, and they know Postgres, they know MySQL, they know other backends; they don't know Ceph. And I was wondering, and that's where Massimo is most interested: it could maybe be RGW that allows plugging in this kind of backend, so it's made available as a stream of objects somewhere, which would help again with mirroring; and again, the speed of this mirroring would matter.

O: Sure, okay. Well, unless Nicolas or Massimo have anything else to add, that's all I had.

O: Well, thank you very much, and we'll be looking forward to examining how RGW does that or could do that.
A: All right, the next topic is about Ceph manager improvements. This would be Josh and me, probably.

H: Sure. So we talked about a number of shorter-term improvements at the CDS in November, so today I wanted to focus more on the larger-scale improvements, like scaling out the manager and dealing with scalability issues in general.

H: So there are kind of two categories here, but let's start with scaling out the manager in general.
H: This gets us out of the problem of using sub-interpreters in Python, which aren't actually supported anymore by Cython, and it isolates the modules from each other, so that if one has a problem it doesn't take down another one. Plus, it lets us easily, say, restart a particular module, or reload a particular module, without affecting the others.

H: It does probably make the deployment model a little bit more complex, because you need something like cephadm to understand that it needs to deploy a bunch of processes and manage them, and you need to do all of the same kind of failure detection and restarting that you would for a single manager process, but with many processes.
H: But the bright side is that these are all stateless, so any time one goes down it can be restarted; it's not a very complex operation, and it doesn't require much complex recovery. And the key piece that you need there is a proxy layer, like HAProxy, which we already often deploy with the dashboard, to be able to stand in front of whatever endpoints we may be exposing; but that's something that we already need.

H: ...this information periodically, say every 5 seconds, 10 seconds, 60 seconds. That's plenty of load.
B: Yeah, for a single user it's every five seconds. We're right now working on a caching layer for the dashboard, so we don't really need to call the manager API so often; we can keep the data inside the module itself.
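The module-side caching being described boils down to a time-to-live cache in front of the manager API. A minimal sketch, with illustrative names (the actual dashboard caching work lives in the tracker issue mentioned below):

```python
import time

# Hedged sketch of a module-side TTL cache: remember a fetched value for a
# few seconds so repeated dashboard requests don't each hit the manager API.

class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable clock, handy for testing
        self._store = {}            # key -> (expiry_time, value)

    def get_or_fetch(self, key, fetch):
        """Return a cached value if still fresh, else call fetch() and cache it."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = fetch()
        self._store[key] = (now + self.ttl, value)
        return value


calls = []
def expensive():                    # stands in for a manager API call
    calls.append(1)
    return {"osds": 12}

fake_now = [0.0]
cache = TTLCache(ttl_seconds=5, clock=lambda: fake_now[0])
cache.get_or_fetch("osd_map", expensive)   # miss: performs the real call
cache.get_or_fetch("osd_map", expensive)   # hit: served from the cache
fake_now[0] = 6.0
cache.get_or_fetch("osd_map", expensive)   # TTL expired: fetches again
# Three requests, but only two real API calls.
```

As the next exchange notes, a structure like this could be shared across modules rather than being dashboard-only.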
A: So this caching structure could be a shared caching structure across modules, which doesn't necessarily need to be only for the dashboard.

B: Yeah, that's what we are planning right now. I will share the link to the tracker issue here. In fact, that's what Pere is going to do; I'm going to explore that part, so we can generalize it and also include that side, if we want to pursue it.
C
Yeah, we started doing some profiling first in order to apply this cache, because it can be done at several levels, not only at the manager level or the API level or whatever; it can be at several layers.
B
Well, right now there are some existing caches. For example, in the HTTP controllers there's this view cache, which is caching things like the RBD images or the pools. And apart from that, I think we have some caching in the front end as well, where we're also keeping the browser from polling the back end. But I think in the case of the dashboard we will probably need this kind of layered approach, because there are multiple places where the dashboard can have multiple users.
H
H
What this doesn't address is modules that might potentially need to scale beyond a single process. I'm not sure that we have any of those currently. I think the one that has been a bottleneck is Prometheus, and maybe a better approach there would be to change the way we're reporting metrics data, to give it directly to Prometheus itself from each of the nodes, instead of funneling it all through a single, or even a few, mgr processes. I'm not sure how much value we're getting out of that extra processing there.
D
D
Because I'm wondering: if the approach we take to scale the manager is to basically separate out modules into separate manager processes, then that limits the ceiling of what we can get to the number of manager modules, right? If we have 10 manager modules, then at most we can scale 10 times bigger. But I think the reality is that we don't actually have that many; only a few of the manager modules are actually ailing.
D
H
Yeah, the other ones that have been problematic are the progress module and the insights module, because of that non-polling behavior. So if you address them, make them poll instead of processing every update, I think they won't be an issue.
G
D
Yeah, I guess it seems like if we address those piecemeal, then you can count on one hand the remaining modules that are problematic, and so I'm wondering if the cost-benefit of the complexity of breaking these things out is going to be limited; the benefit will be limited to that one hand. I guess that makes sense.
H
I don't think the HTTP runtime is an issue per se; that's only used by a few modules, not by everything.
H
I don't think it requires a rewrite, but using more than one process in Python is possible.
D
G
G
D
H
C
Yeah, for example, creating a lot of OSDs, a lot of RBDs, thousands of buckets, for example, for the dashboard. I can imagine several tests retrieving that info; for example, retrieving thousands of buckets, 10,000 for example, or 1,000 RBD images, because we are doing listings and we are trying every time to improve the way we retrieve the info.
C
I don't know; for example, an upstream consumer like CERN. It would be great, at least; I don't know if the stress that CERN puts on Ceph is something more realistic. I don't know what the exact figures are that CERN is using for Ceph, how many OSDs or how many RBDs they have, but if we can retrieve this data we could create some more realistic tests.
A
H
A
Q
G
D
Of fake OSDs that are sending manager reports, whatever it is. I guess my high-level comment, or thought, is basically that it seems like there is some lower-hanging fruit, where there are existing issues that, if independently addressed, would make things perform better even with the current architecture, like the notifies, yeah.
H
D
And if we just address those independently, we can, you know, keep splitting up the manager in our back pocket. You know, don't close that door, but hopefully not need it; I'm not sure, yeah.
H
I agree, it's not the first thing to do. I think I definitely want to fix the progress module and insights notification consumption first, but we've already seen the balancer and other modules using a lot of CPU, so it might be time to think about that splitting out, even on a single host, into multiple processes.
D
It might be that the balancer module is a good example, where it doesn't actually have to talk to other manager modules. It only...
H
D
Talks to the manager, so it could be in a separate interpreter, in a separate, well, I don't know; it might be easier to break that one out than the others, I guess. That's what I was getting at, but the Python interpreter constraints are sort of bizarre.
H
G
H
A
A
A
I guess one more point before we move to the auto scaler. I wanted to bring this up: we had this discussion about using a common pool for some of the modules, like insights and device health and stuff. I think that's still a good idea; we'd just have one manager module pool instead of, you know, different pools, and inside it each is doing its own thing.
D
D
A
Okay, cool. For the autoscaler...
A
N
Sure. So me and Josh have been working on creating a new behavior for how the autoscaler would work. Just a bit of background on the problem with the old autoscaler: it starts out with a minimum number of PGs per pool, and it scales up when the pools get used more.
N
This is a problem for out-of-the-box Ceph users if they're creating pools in a large cluster: they will have low performance at the start, because there will be the minimum number of PGs, and it would only scale when there is pressure, assuming they don't know how much usage the cluster in general is going to get.
N
So the new algorithm, in high-level terms, starts out with a full complement of PGs; that is, how many PGs the pools in general should be using. It depends on the number of OSDs and the mon target PG per OSD. So let's say there are four OSDs; four times 100 would be 400, assuming the replication size of each pool is one.
N
The full complement in this case would be 400, and let's say there are four pools. When you start out with four pools, each pool would get 400 divided by four; that's a hundred PGs each, and I guess if you round it to a power of two that's 128 PGs each. But let's say pool one started using 50% of the capacity of the space. So how it works is that pool one would get...
N
Fifty percent of the full complement of PGs, which in this case is two hundred; rounded to a power of two that's 256. And the rest of the pools, pools two, three and four, would get the remaining 200 divided by three, which is about 67 each; rounded to a power of two, each would get 64 PGs. So that's just how it works, assuming, let's say, the bias is one. The problem with this that I encountered is that, if you can see in the...
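The arithmetic just described can be sketched as follows. This is an illustration of the idea, not the actual pg_autoscaler code, and it assumes a nearest-power-of-two rounding rule, which matches the numbers in the example (100 to 128, 200 to 256, 67 to 64):

```python
def nearest_pow2(n):
    """Round to the nearest power of two (assumed rounding rule)."""
    if n < 1:
        return 1
    lo = 1 << (n.bit_length() - 1)   # largest power of two <= n
    hi = lo * 2
    return lo if (n - lo) <= (hi - n) else hi

def distribute_pgs(num_osds, usage_ratios, target_pg_per_osd=100, replica_size=1):
    """Sketch of the 'full complement' idea from the discussion: the
    whole PG budget is handed out up front, each active pool's share
    driven by its fraction of used capacity, and the unclaimed
    remainder split evenly among the idle pools."""
    budget = num_osds * target_pg_per_osd // replica_size
    used = sum(usage_ratios)
    idle = [r for r in usage_ratios if r == 0]
    out = []
    for r in usage_ratios:
        if r > 0:
            share = budget * r                      # proportional share
        else:
            share = budget * (1 - used) / len(idle) # even split of the rest
        out.append(nearest_pow2(int(share)))
    return out

# 4 OSDs x 100 target PGs; pool 1 using 50% of capacity, pools 2-4 idle:
print(distribute_pgs(4, [0.5, 0, 0, 0]))  # -> [256, 64, 64, 64]
```

With all four pools idle, the same function reproduces the initial even split of 128 PGs each.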
N
There's a problem for RGW with this new autoscaler. When you start out, the pool for the device health monitor starts up with 128 PGs, and when you're creating the RGW pools, the device health monitor pool did not scale down in time. Therefore it hits the mon max PG per OSD cap. One of the solutions that I've been working on to fix that is to create a pg_num_max; it's another feature.
N
So basically just capping the number of PGs each pool is allowed, similar to how pg_num_min works. We could possibly do that on the device health monitor pool. That's where I wanted to discuss more what we should do about that.
D
H
So we do have ways that CephFS and RGW create their metadata pools, and they're already applying certain settings to them, so we could apply similar kinds of max settings for those metadata pools at least. I was thinking maybe this max should be a percentage of the PG budget rather than a fixed absolute number, though; on a very large cluster you might want to have more parallelism for your metadata.
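A minimal sketch of how such a cap could compose with pg_num_min, including the percentage-of-budget variant floated here. The names and signature are illustrative, not the final interface:

```python
def clamp_pg_num(suggested, pg_num_min=None, pg_num_max=None,
                 budget=None, max_budget_fraction=None):
    """Clamp the autoscaler's suggested pg_num (illustrative sketch):
    an optional absolute pg_num_max, an optional max expressed as a
    fraction of the cluster's PG budget, and a pg_num_min floor."""
    n = suggested
    if max_budget_fraction is not None and budget is not None:
        n = min(n, int(budget * max_budget_fraction))  # budget-relative cap
    if pg_num_max is not None:
        n = min(n, pg_num_max)                         # absolute cap
    if pg_num_min is not None:
        n = max(n, pg_num_min)                         # floor wins last
    return n

# A metadata pool capped at 5% of a 400-PG budget instead of a hard 128:
print(clamp_pg_num(128, pg_num_min=8, budget=400,
                   max_budget_fraction=0.05))  # -> 20
```

The fraction-based cap scales up with the cluster automatically, which is the advantage over a fixed number discussed above.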
A
Yeah, that definitely sounds better than a hard cap. As for what the max should be, yeah, what you said is right. I mean, if we know what the application is going to be, maybe we can determine those caps based on whether it's an RGW data pool or metadata pool or whatever. But if it's just a generic pool the user is trying to create, for those it'll be hard to determine what those caps should look like, or we just have to go with something.
H
These days, the data pools, you kind of want them to expand so they fill the whole cluster. It's only the pools that aren't going to get much data, that can't use that level of parallelism, where you don't want to have that many PGs.
G
D
Seems like there's no substitute for actually having some information about what the pool is going to be. If we can somehow induce a user to set the target ratio on a pool, then we're going to pick the right number the first time, and then we're not going to have to split or merge. But in the absence of that information we basically have two options.
D
I think actually the scaling up is more conservative, but starting with lots of PGs and scaling down is probably more likely to not move data, because in general the pools are going to get created at the beginning of the cluster, before there's anything in there, so the PGs will fill up in place.
D
H
H
Yeah, I think when we were hitting this in testing, it actually wasn't even a pool that was being created with a particular pg_num, so it was just using the default low number. But because the autoscaler had already kind of used up the entire budget, and because of the rounding, it ended up being close enough to the cap that we just went a little bit over.
D
G
D
D
D
I guess the one other sort of elephant in the room is that this was such a shock when it happened while I was testing the Pacific upgrade, because as soon as the manager upgraded, which was like the very first step of the upgrade, suddenly PGs went crazy, right? There was a bunch of splitting and whatever, everything went nuts. And so I think on upgrade, I don't know that we want to...
H
D
Yeah, or even an autoscaler profile or something. So there's the conservative profile that has the current behavior, where things start small and scale up, and there's a new profile that we introduce that has this new behavior, but isn't the default for upgraded clusters; it is for new clusters, or something like that. Yeah.
G
A
A
H
So the thing coming up related to the autoscaler was that we have some of these warnings, like the warning about object skew, and I'm not sure if there are maybe a few others like this, that we introduced before the autoscaler existed, to try to warn people about PGs being imbalanced, or having too few or too many, or something like that.
H
H
Yeah, we've seen this one with object skew happen, I think, just because there was something that was used a lot more than another pool, or something like that. I forget what the scenario was.
A
A
D
H
G
R
D
Yeah, I wonder if probably just a pass over what telemetry is collecting, making sure it has all of the relevant inputs that the autoscaler would be using.
D
But it could have a list of pools with, e.g., the min and max, the number of objects, the number of bytes, the number of whatever, all the sort of relevant inputs that would get fed into the autoscaler algorithm, minus the names and application tags or whatever.
H
D
Yeah, there might be some refactoring needed with the autoscaler and the balancer code, so that you could actually feed a telemetry input into the algorithm and see what it would do.
D
A
You're talking about the balancer, right? But the osdmaptool has an option to simulate some of this. If you feed it an existing osdmap, it will tell you what kind of balancing it would do, a dry run; it does that already, David added that at some point.
A
A
A
Anything else on this topic? Yeah, okay, on to the next one. This is about avoiding having cluster log messages go through Paxos and be stored in the mon.
A
E
H
This, but yeah, I guess fundamentally we've run into this a few times, where we've had a lot of extra detail coming into the cluster log, like slow request reports, causing the monitor to store them in the database and the database to eventually fill up, since we weren't matching the ingest rate with the deletion rate.
H
T
Oh yeah, sure. So the change that's been merged: basically what it does is dynamically change the amount of logs that are trimmed. Before this change we used to have an upper bound specified by paxos_service_trim_max; now, according to the log ingest rate, we just change the max accordingly.
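The behavior described can be sketched like this. It is an illustration of the idea (trim at least as fast as new entries arrive), not the actual LogMonitor change, and all names are made up:

```python
def trim_target(first, last, max_keep, base_trim_max, ingested_since_last):
    """How many log entries to trim this round (illustrative sketch):
    instead of a fixed paxos_service_trim_max upper bound, scale the
    per-round trim with the ingest rate, so trimming can never fall
    permanently behind ingestion."""
    backlog = (last - first) - max_keep   # entries beyond what we keep
    if backlog <= 0:
        return 0
    # Trim at least the fixed bound, and at least what just came in.
    return min(backlog, max(base_trim_max, ingested_since_last))

# A burst of 2000 slow-request messages raises the per-round trim
# above the old fixed bound of 500:
print(trim_target(first=0, last=10000, max_keep=500,
                  base_trim_max=500, ingested_since_last=2000))  # -> 2000
```

With a low ingest rate the function degrades to the old fixed-bound behavior, which is the point of making the max dynamic rather than replacing it.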
T
A
I guess, I mean, what this PR does is much better than what we had earlier, where somebody had to manually go change some setting, after the monitor DB had already filled up, to maintain the ingest rate versus the trimming rate. But I guess the bigger question is: is there a need for all of this to go through Paxos and be stored in the DB at all? Like, what is the historical significance of it?
D
The historic reason was just to have a consistent view of what the cluster log contained, and because everything was persisted through Paxos it was easy. I mean, I kind of like the simplicity of everything that the monitor stores always being consistent and always going through...
D
Paxos. It feels like if there are specific issues with that, then either it doesn't belong in the monitor at all, or we need to make it work right; like the ingest versus trimming, that's something that we just need to fix, right? That was really just, yeah.
D
I suspect that a fairly big part of the problem, maybe not, yeah, one part of the problem, is also just that the way the LogMonitor is implemented is not efficient at all. It has a log summary class or something that has the last 100 entries, and basically that entire structure is rewritten on every commit. The way it's persisted is just totally stupid, and that was just because it was expedient, I think.
L
But I think the way we are using the cluster log is wrong, because I think the cluster log is for human-readable messages which are very critical and which should get the attention of the administrator immediately, which of course call for human intervention at that very moment, instead of for some slow operations. I think the slow operations should be sent to a manager module, for example the alerts module you created.
L
It could be repurposed for collecting, for subscribing to the slow messages. For example, when the alerts module notices that there is some slow operation going on, it is supposed to collect the details from the daemons which have slow messages and log them into a local...
L
A local database, which is not hosted by the Ceph cluster. And later on, when the administrator logs on to the system using the dashboard or something, he or she will notice that something went wrong, and can correlate the data along with the timestamp and figure out what was going on by looking into the log or whatever, instead of looking at the cluster log. The cluster log is for when the system is going wrong or something big happens.
H
So the thing I like with the slow ops, though, is that that information is very helpful in the cluster log, because it's a fairly simple place to match up what is going on with the cluster state. I don't think we need to have all the details of every single slow request, which is what we currently do.
L
H
D
G
A
S
I would say that you need to know which PGs are experiencing slow ops, so even just a stat line in the PG stats would do the job; you don't even need to list the slow requests at all. Yeah, aggregated: just, this PG, as of this OSD map epoch or as of this reporting interval, is seeing slow ops. That's it.
S
A
I guess the idea is that if you don't have the ability to do live debugging, or to capture the state of the OSD at that moment, having the initial information of, you know, where the initial slow ops were or why they were there, in the cluster log, as an afterthought, has often been useful.
S
C
G
S
H
D
L
L
S
I mean, the managers could offer a general-purpose sampled logging mechanism without needing to go through the monitor's Paxos system. I'm saying we could have a logging system that isn't the cluster log, that serves the same purpose the cluster log currently does, but without screwing up the monitors.
G
D
H
D
I think it's nice because there's a `log last` command that dumps the last few messages, and actually one of the things that I wanted to do was change it. When you do `ceph -w` you can give it the channel name; I wanted to combine that with the `log last` command so that you could do an equivalent of a tail -f: give it the channel, the number of recent entries, and then also block and poll or whatever to follow the log.
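The tail -f idea could look roughly like this sketch, where `fetch_recent()` stands in for a hypothetical call along the lines of `ceph log last <n> <channel>`; the sequence numbers and polling loop are assumptions for illustration:

```python
import time

def follow_cluster_log(fetch_recent, poll_interval=0.0, max_polls=3):
    """Sketch of a tail -f over the cluster log: poll for recent
    entries, remember the last sequence number seen, and yield only
    the new entries. A generator, so the caller can stop anytime."""
    last_seq = -1
    for _ in range(max_polls):
        for seq, line in fetch_recent():
            if seq > last_seq:
                last_seq = seq
                yield line
        time.sleep(poll_interval)

# Simulated log source that grows between polls:
log = [(0, "osd.3 slow ops"), (1, "pool created")]
def fetch_recent():
    return list(log)

gen = follow_cluster_log(fetch_recent)
seen = [next(gen)]                     # first entry from the first poll
log.append((2, "health ok"))           # new entry arrives
seen += [line for line in gen]         # drain the remaining polls
print(seen)  # -> ['osd.3 slow ops', 'pool created', 'health ok']
```

A real implementation would more likely block server-side instead of polling, but the dedup-by-sequence-number shape would be the same.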
D
R
H
R
H
L
G
G
H
D
A
It'd be nice to have the summary, you know, as what Sage was suggesting: when it started happening and how many slow ops, yeah.
H
H
A
Okay, I think we're at the top of the hour. Let's move on to the next topics, maybe we'll have...
A
L
L
L
The first place is the legacy options file, legacy_config_opts.h, and another place is options.cc, which is where we can read it from using get_val. And the third place is the rst file, which is rendered into a Sphinx document. So I think the better way to do it is to add an option in a single place.
L
I propose a solution: we can add the option in a YAML file, with a predefined scheme which is flexible enough that we can even write some simple C++ code in it, and structured enough that we can use a Python script to extract the interesting information from it and generate the rst file. And the more interesting thing is that we could generate different option versions from a single source-of-truth YAML file.
L
For example, if a certain option is only consumed by the OSD, we can extract the partial options which are read by ceph-common, and the subset of options which are only of interest to the OSD; that could potentially reduce the memory footprint a little bit. And that's a subject that we've been thinking about anyway, to split Message.h into some smaller pieces, right, because some messages are never seen by an OSD.
H
L
F
G
A
D
One thing we probably want to think about, if we're going to the point where we're generating the rst docs: the other place where options exist is in pybind/mgr, whatever module.py. Yes.
D
L
Yes, we can do that.
L
G
D
I
H
L
I think we could start with some basic things. The first step is to convert options.cc to a YAML file and whip up a Python script to generate the .cc file and the legacy .h file from this YAML file, and then expand this script so it can generate the subset of the options and render it, using for example a Jinja template, to rst, and render that with Sphinx. Later on we can expand it by adding more labels and schemes.
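A minimal sketch of that pipeline, with a plain dict standing in for the YAML file and string formatting standing in for the Jinja template so the example is dependency-free. The schema fields and option entries are illustrative, not the final scheme:

```python
# The option definitions would live in a YAML file; a plain list of
# dicts stands in here so the sketch needs no dependencies.
OPTIONS = [
    {"name": "osd_max_backfills", "type": "uint", "default": 1,
     "services": ["osd"],
     "desc": "Maximum number of concurrent backfills per OSD."},
    {"name": "mon_max_pg_per_osd", "type": "uint", "default": 250,
     "services": ["mon", "mgr"],
     "desc": "Maximum PGs per OSD before a health warning."},
]

RST_TEMPLATE = """``{name}``

:Type: {type}
:Default: ``{default}``
:Description: {desc}
"""

def render_rst(options, service=None):
    """Render the option subset for one service to rst (the real
    script would use a Jinja template and feed the result to Sphinx)."""
    picked = [o for o in options
              if service is None or service in o["services"]]
    return "\n".join(RST_TEMPLATE.format(**o) for o in picked)

def render_cc(options):
    """Emit Option() constructor calls, the generated options.cc side."""
    return "\n".join(
        'Option("{name}", Option::TYPE_{t}, Option::LEVEL_ADVANCED)'
        '.set_default({default}),'.format(t=o["type"].upper(), **o)
        for o in options)

print(render_rst(OPTIONS, service="osd"))
```

The same per-service filter that drives the docs could drive the generated C++ subsets mentioned above, so both outputs come from the one source of truth.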
H
H
Yeah, thinking of our existing configuration documentation, there's a lot of expository text about the general category of options and then the options themselves. So if we do the filtering, we'd want to be able to keep that general text.
A
D
Yeah, so this was ignored for a long time, because we didn't have encrypted sessions between daemons and the monitor, and so any sort of automated key rotation would just start sending keys over the wire, which would actually be worse. But now we have messenger v2 secure mode, so no more excuses. I think that there are two big challenges.
D
The first is that we basically need a two-phase commit, because we need to update the key both in the monitor database and on the client, and if we update only one and not the other and the system restarts or something like that, then something can't authenticate, so we have to be a little bit careful. And then the other challenge is that it's one thing to do this such that you update the key and then restart the daemon, but that's kind of disruptive; it'd be nice to be able to update the keys without restarting the thing, the OSD for example, especially when you have warm caches and all the rest of it. So, at a high level...
D
There's, I think, one really big decision point here, and there are two options for doing that two-phase thing. Either the monitor keeps track of both the old key and the pending new key, and either key works for an interim period while the client is being updated, which means that on the monitor all the auth paths need to be updated so that if a client tries to authenticate with either the old key or the new key, they'll both work, which is sort of hairy.
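Option one can be sketched as a tiny state machine. This is purely illustrative; the real cephx paths are far more involved:

```python
class AuthEntity:
    """Toy model of 'option one': the monitor keeps both the old key
    and a pending new key, and accepts either during the interim, so
    a client being updated never hits a window where nothing works."""

    def __init__(self, key):
        self.keys = {key}          # currently accepted keys

    def begin_rotation(self, new_key):
        self.keys.add(new_key)     # phase 1: both keys valid

    def finish_rotation(self, new_key):
        self.keys = {new_key}      # phase 2: retire the old key

    def authenticate(self, key):
        return key in self.keys

mon = AuthEntity("old-secret")
mon.begin_rotation("new-secret")        # client update in flight
assert mon.authenticate("old-secret")   # a restart mid-rotation still works
assert mon.authenticate("new-secret")
mon.finish_rotation("new-secret")       # client confirmed the new key
assert not mon.authenticate("old-secret")
```

This makes concrete why a crash between the two phases is safe in option one: at every point at least one of the keys the client might hold is accepted.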
D
D
R
D
I think the goal is compliance. This is something that users ask for, and it also just seems like a bad security practice to have a key that's been in use for, like, three years, that's been sitting on a drive, and who knows whether it's been exposed during that period or not.
P
So that's not related to the daemon key rotation, the internal key rotation; a completely different topic.
D
P
You mean in cephx; there are the daemons themselves, you know, you have a service key.
D
H
I mean, the other stuff that would be separate would be the disk keys, like the LUKS encryption keys. Is there some kind of versioning? I thought there was some kind of version of LUKS where you could rotate those keys by having a key that's encrypted with another key, and you rotate the encryption key or something like that.
D
Yeah, I mean, we do have key encryption keys. Obviously you can't change the actual encryption key, but right.
G
D
Could be rotated, so probably a similar mechanism would need to be used in that case too. That one's probably a little bit easier, because there's only one place where the dm-crypt is started up, and so identifying the path that needs to try two different keys would be a little bit simpler. But I think even then it's a similar question of which...
P
What the OSDs and the clients authenticate against are the session keys, and these are based on the rotating keys, the temporary ones, that are every 30 minutes.
D
Yeah, I think we would automate the OSD keys and MDS keys and manager keys; those would be the ones that would be easier to automate. Clients are harder, because you don't necessarily have control over how the client is being used, although I think it might be possible to do that too.
D
There's a, I have a question down here about that. But right, so I guess there are these two options, right: either you have the old key and the new key both available on the monitor, and the client will use whatever key it has, and if it's old it'll work and if it's new it'll work; or you flip it around, and you have...
D
I'm kind of leaning toward option one, having it done on the monitor, because that's sort of the way we can narrow down the cases, and it might even be that the idea of having multiple keys associated with the same auth entity might actually be a good thing in general, too. I'm not really entirely sure.
D
Maybe not, maybe that's silly, but it seems like it's a little bit tidier that way. But that's sort of the first question to answer. The next is around, maybe not a question, but the next real challenge is making the MonClient and the auth protocol such that you can...
D
Actually do a key rotation, so that when you're renewing your ticket you could try using the new key and it would transition seamlessly from the old key, so that an existing session would be able to transition from an old key to a new key without being interrupted. And that, I think, needs a more careful read of the auth code to see exactly how it would be implemented. I don't think Delia is here.
D
Unfortunately. But we probably need to brainstorm on that, figure out exactly how it would work, and that's assuming that we actually want to be able to do this without restarting daemons. If we think it's tolerable to just, if you rotate every six months, for example, restart daemons as you do it, then you can do sort of a simpler solution.
H
D
Yeah, so I think it's a good idea to teach the MonClient, basically, and the auth protocol or whatever, how to transition from an old key to a new key while it's online. And so there are some hypothetical scenarios here: the client would generate a new key and then install it on the monitor, and then it would re-authenticate using the new one, and then the monitor, perhaps as soon as it sees that...
D
D
There are a few things. There's sort of another question of how you actually update the key on the client. Like, would you want cephadm to update the keyring file, or would you want the process, the MonClient, for example, that you just told to go do this sort of online key change, to rewrite its own keyring file? And in the case of BlueStore, you also have to remember that the keyring file is actually on a tmpfs.
D
D
And then the last challenge is around kernel clients, because that's sort of a separate implementation and it's a little bit more awkward. Because, you know, mount.ceph probably needs to be able to accept, if you go with the two keys on the client, it needs to have those keys; but the kernel is not going to rewrite its keyring file, so you're going to have to have some other helper utility or whatever.
D
That runs on the client host, that pokes sysfs or whatever to tell ceph to go change its key, and then updates wherever the keyring file is stored, in /etc/ceph or whatever it is. So it'll be a little bit tricky. We need to make sure that we standardize the way that those kernel clients are managed; right now I think mount.ceph doesn't take a traditional keyring file, which has sort of been a feature ticket.
D
Q
No, it does now, Sage. Oh, it does? It actually reads the config file, the Ceph config, and then uses the monitor IPs and also the default keyring. Oh good.
D
D
D
D
D
I was going to say, I think that probably the practical implication of this, having the daemon rewrite its keyring file, is just the way that the permissions are set up; we need to make sure that the file is writable by the process that runs it, or whatever.
D
Q
We don't allow access to /etc, or you can get read-only access to /etc, but I don't think we do that with cephadm. In fact, this is kind of off-topic, but I'm not sure what kind of protections we have in the system for the files we generate with cephadm. Maybe that's something we should look at, because there's a divergence there.
D
D
In general, daemons don't use /etc/ceph at all. The only time cephadm uses /etc/ceph is when you use the shell command, and that's mostly just so that, if there is an /etc/ceph on the host, that's sort of the default cluster that the cephadm shell will bring up, so you don't have to pass all the arguments all the time.
D
D
Okay, well, I guess the action items are probably just, Neha, you could check the source and see if there are any additional constraints, and then I'm going to lean towards seeing if option one is feasible. If there is a small enough number of paths that the monitor can do it, I think that's probably going to make a little bit more sense, and then...