From YouTube: Ceph Orchestrator 2023-02-28
Description
Join us weekly for the Ceph Orchestrator meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contrib...
What is Ceph: https://ceph.io/en/discover/
A: I know this first one was your thing, John. Do you want to intro this?

C: Yeah, sure.
C: So I have been trying to chip away at the backlog of unit tests for the various functions in cephadm, and as I've gone along I've been avoiding certain functions, especially the command functions, because they involve a lot of mocking. Now, I'm not going to suggest that we avoid all of them, but I'm starting to wonder if there is much value in getting unit test coverage for those when we have integration test coverage with teuthology, and it's all a big exercise in writing a lot of mocks. So I think there's a lot of value in trying to cover some of the simpler, lower-level functions with unit tests, but for these higher-level ones, the command functions, I'm not as convinced now as I was when we made the list that there's a lot of value in doing those. So I'm looking for feedback.
C: We could take them off the list, but what I'm suggesting is that if you look at one and say "this is not worth covering," maybe just put a comment saying "not going to cover" and then add a `# pragma: no cover` in the code for now, because we can always remove the no-cover marker later. That way, when we run the coverage report after running the tests, we know what we haven't gotten to yet versus what we explicitly decided not to do for now.
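The convention being proposed can be sketched roughly like this; the function names here are hypothetical illustrations, not actual cephadm code:

```python
def write_repo_file(path, contents):
    """A simple, low-level helper: cheap to unit-test, so we cover it."""
    with open(path, "w") as fh:
        fh.write(contents)


def command_adopt():  # pragma: no cover
    """Explicitly excluded from unit-test coverage for now.

    Testing this would mostly be an exercise in mocking container
    commands, and it is already exercised by teuthology integration
    tests. Remove the pragma later if we decide to unit-test it.
    """
    raise NotImplementedError("placeholder for a high-level command function")
```

With the pragma in place, coverage.py excludes the marked function from the report, so the remaining uncovered lines are exactly the ones nobody has gotten to yet, which is the distinction being suggested here.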
A: Yeah, I'd have to go through some of these. I know some of these are super short and just call one other function anyway, right?
C: Also, yeah, I think when we spoke yesterday I said something like: if the function is just twiddling repo files, that might be one we can cover, because all it's doing is writing files to a directory. But some of these other ones that are just "I'm going to call 800 container commands," it just becomes an exercise in mocking everything, and what does that prove?
A: All right, yeah, that makes sense to me. I mean, we can mark some of them like that as we go through them and say, "I don't think this is one we should do," and I guess...
A: We should try to see, when we do that, whether there are integration tests for them, or ones that involve them in some way, just to make sure they're still tested. Say I bring up this `ls` one: I know that one, because we use the ls command to get the daemons on the host at all times, so that one's constantly being used, and you can be pretty confident it generally works. Maybe there's a specific case where it could be broken, but generally it should work. Whereas some of these other ones, I don't know that we use them as much, so we might have to introduce integration tests for them later.
A: Okay, good point, we should do that at least, because this document is really just for tracking the unit tests; that's the point of it. So I think we should still mark it, as you said: we're not going to write unit tests for this one.
A: Let me go through later and check for the integration tests. But as far as the process of getting the unit tests done before we pick up this file goes, okay, we can just say we're not going to do it.
A: I'm fine with that. I think it's just a matter of going through each one, obviously offline, not here. Yeah.
C: Yeah, and the other thing is, there's definitely stuff in this list that's not... sorry, there are definitely things in the file that are not in this list. I was just working on `call`, and `call` is not in this list, for whatever reason, so I should add it to the end of the list. But yeah, if you're just looking at it and go, "oh, I want to write some coverage for XYZ," and it's not in the list...
C: Yeah, don't worry. For `check_units`, for instance, I didn't put a PR number; I got lazy and didn't finish marking that one up, yeah.
A: All right, that makes sense; I think everything you said is back here, yeah, all that.
A: That was an easy one. Yeah, I mean, I don't think there's going to be a lot of pushback on not writing unit tests for functions where unit tests aren't useful.
C: And going back, like I said: if you look at one and go, "oh, this is not bad, it's not low value, not just a mocking exercise"...
A: Unfortunately, there's a lot of that, because this tool is meant for interacting with the host system and such. But yeah, we keep finding a handful of things that are good to test, so we'll eventually get to the end of this.
A: We can go back at the end and look at the things we skipped, maybe revisit them, just to give them a once-over.
A: All right, yeah, I don't think there's going to be any pushback on that. I think we're all going to sort of agree that we can definitely mark things as skippable if they're mostly just making a few too many calls out to other things.
D: I think as we do that, we may have to go through and check off where the functional tests or the integration tests exist, like whether it's in the main run or somewhere else, because for some of these, like revoke SSH key, I'm not sure where we have coverage for things like that. Good point; we'd probably have to confirm that that's explicitly tested somewhere, and that the output expected by the manager is the same.
C: So the coverage report that's generated by the unit tests is generated by just the unit tests. If you really, really want to, we could mess around with the whole infrastructure, and you can get a regular Python program to generate coverage, but it's a big pain, in my opinion. It might be an investment for the future, but it's not a small task, in my opinion. Okay.
C: You would basically have to enable the cephadm that gets run on the teuthology nodes to generate coverage in a directory somewhere, and then run a coverage report on that afterward. Like I said, it's doable: pytest is invoking the coverage module, but the coverage module can be used on its own. It's just an infrastructure job. Okay.
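The point that coverage collection is not tied to pytest can be illustrated with the standard library's `trace` module, which records executed lines the same way; coverage.py's own API (`coverage.Coverage()` with `.start()`, `.stop()`, and `.report()`) can be driven from a plain Python program along the same lines. This is only an illustration of the idea, not the teuthology wiring being discussed:

```python
import trace


def classify(x):
    # Toy function standing in for code run outside any test framework.
    if x > 0:
        return "positive"
    return "non-positive"


# Collect line-execution counts for a normal (non-pytest) run.
tracer = trace.Trace(count=True, trace=False)
result = tracer.runfunc(classify, 5)

# results().counts maps (filename, lineno) -> execution count, which is
# the raw data a coverage report is built from.
executed_lines = {lineno for (_fname, lineno) in tracer.results().counts}
```

A real deployment would persist those counts per run and merge them afterward, which is the "infrastructure job" part.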
A: And maybe in the future, once we have a lot of stuff done, we can go look at that. I think we still have a bunch of things that we know need to be tested and aren't tested at all; they don't even have any coverage. All right, unless you have more stuff: once we're closer to the end, we can revisit that and see if we can figure out where some of these things are covered.
A: We at least know that for the ones on this list that are getting marked as tested, we know where they came from and that there's an explicit test for them.
B: All right, do you want to go into the `call` function stuff?
C: Yeah, I'll try to be a little faster with this. We've got two sets of patches. They both build on top of an existing PR that I have, where I wrote tests for the `call` function. While I was writing test coverage for the `call` function, I noticed that the timeout parameter just doesn't work.
C: You can tell it to time out after five seconds, and the thing will just sit there happily for 60 seconds, waiting for the process to exit. I took a detour into learning a little bit more about asyncio, because I'd really never used it in anger before. The two patch series are alternative versions of the same fix; one of them adds an additional feature.
C: Yeah, and for what it's worth, in the commit message on the second branch, I do note that the tee option is slower by about two seconds on the slowest-running test.
C: So it's not the worst thing in the world if we really like the feature, but there is also a performance difference, which I felt like pointing out.
A: I know exactly what we talked about; I ended up linking John the exact tracker issue for that. I don't have it on hand now, okay, but he saw it; what we ended up figuring out is that that was what was happening.
C: The issue is that `await` really says "block on this and don't move forward until it's all done." So what we really need to do is process the standard out and standard error while waiting for the thing to exit. I had an intermediate version, which I'm not showing here because this is simpler, where the tee helper continued to exist and it basically did a gather of the two tees plus the wait-for. If you do that, it actually does work, but then I further simplified the patch to use `communicate`.
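A minimal sketch of the simplified fix being described, assuming nothing about cephadm's actual `call` signature: the two key points are that `communicate()` drains stdout and stderr while waiting for exit, and that `wait_for()` wraps the whole operation so the timeout really fires.

```python
import asyncio


async def run_with_timeout(cmd, timeout):
    # Start the child with both output pipes captured.
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        # communicate() reads stdout/stderr concurrently while waiting for
        # the process to exit, so a chatty child can't deadlock us on a
        # full pipe buffer, and wait_for() bounds the whole thing, so the
        # timeout actually triggers instead of being awaited past.
        out, err = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        raise
    return proc.returncode, out, err


rc, out, err = asyncio.run(run_with_timeout(["echo", "hello"], 5))
```

Sequencing the awaits the other way (waiting for exit first, reading the pipes after) is exactly the pattern that makes the timeout dead code.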
E: Yeah, but for the first change you introduced, it would be good to make sure that it works for very, very long output, because, yes, that case happens.
C: Adam, can you scroll to this? Yeah, it's right there, line 623. So, I don't know if this number is insufficient; we can certainly bump it up some more, but this thing tries to write, whatever, a hundred thousand lines... I can't count the zeros on Adam's screen... a million, okay, thanks.
C: So it tries to generate that many lines of output in a simplistic way, and that test is the one that takes either three or six-ish seconds. Hopefully that's sufficient, I don't know, unless you have a case where a command is generating more output than that.
E: I remember, and that was in that cluster, the dense cluster. If my memory is good, the line was 160-something kilobytes.
A: But that was a separate thing that's already been fixed elsewhere, and we never fully figured out what was causing the issue Rhett was talking about. It could have been something with the thing timing out; the fact that the wait-for was before the gather meant that when it timed out, it never fully returned. Or it could have been something with there being too much output. We don't actually know which one it was, the timeout or the output, that was causing the problem.
C: If we had kept the tee function, then yeah, maybe the buffer could fill up waiting for a long, long, long line, but if you've already resolved that, I don't think these patches change any of it.
C: What I can do before filing the PR is take the long-running test and bump it up even more, let it run for whatever, 30 minutes... well, I wouldn't; I don't want a unit test that runs longer than this one. It already feels slow: when you run the tests you go, "oh, what's going on?" Oh yeah, it's the five-second test.
C: I don't really want to increase it beyond this in the test itself, but I could do it by hand and say, "oh, by the way, I bumped it up to a million, or whatever it is, in lines and bytes, and checked whether it chokes on that." Just do it by hand, yeah. Would that be sufficient?
A: Yeah, that's probably as good as we're going to get with unit tests. All right, so, I think so.
C: Right, because that was one of the first things Adam warned me about when I mentioned that I'd found that `call`'s timeout wasn't working. So, if you don't mind, I just want to rewind a little bit to the main question I have, which is: would the group prefer that we add a feature of being able to partially log timed-out commands, or do we prefer the simpler, possibly better-performing option?
C: Because, again, we don't have this feature, but I kind of went ahead and implemented it anyway, because I wanted to. I'm leaning towards not doing it now, but I wanted to give everyone the opportunity to chime in and say, "oh, actually, we really want to see why a command hung, so we want that logging."
A: I'm leaning towards just using `communicate`. Yeah, I was thinking that even the things we normally run that were hanging only log some things eventually that might give some information, like that volume stuff, but I don't know if that even shows up as output; I don't think it ever ended up in any of the stuff he was doing before. So I don't know how much we'd actually get. And this, on top of being faster, is also probably a bit safer, since we're using a proper library routine from somebody who knows what they're doing with this. Our tee works okay, but it is still something we're sort of doing ourselves and have to make sure keeps working, and all that.
C: Yeah, that's generally my feeling now as well, and we can always come back and add the logging back in later if we decide we do need it.
A: Yeah, and as you said, we already removed it, because it didn't exist at one point: I ended up removing it and didn't even notice at the time. So I would say it's probably, no, definitely, not worth it for now, unless we find a specific use case or an easier way to do it, I guess, yeah.
C: The only thing I'll add, and this doesn't matter whichever version we pick, is that when we try to time out the process and we kill it and then wait for it, if the process is in D-state or something, we still get stuck there. There's not much we can do about that at this time, short of massively rewriting `call`, which I don't think we want to do, but I just figured I'd mention it, because it came up in the discussion yesterday as well.
A: Yeah, I mean, eventually we do kind of want to do that, because I think the end goal is that whenever something fails for whatever reason, like it hangs or something, we want to be able to raise a health warning in the manager, which would require coming out of this bind eventually. Because the manager is just waiting for this to finish, we would have to implement some sort of timeout there as well, at the manager's level: either have that call somehow time out and not get stuck on this, or have this be able to return at some point, beyond just being able to kill the process.
C: Okay, maybe stick that in the doc and I can refer back to it later. Again, I kind of lean towards doing the simpler thing now, but if that's a real desire to have, I think the easiest thing might be to pass some sort of callback to `call` and say, "if you time out, do this." That might be the simplest alternative.
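That callback idea could look something like the following; everything here (the signature, the parameter name) is hypothetical, not the real `call`:

```python
import asyncio
from typing import Callable, Optional


async def call(cmd, timeout, on_timeout: Optional[Callable[[list], None]] = None):
    # Hypothetical variant of the timeout handling where the caller decides
    # what happens on timeout (e.g. the manager raising a health warning)
    # by passing a callback, instead of call() hard-coding a policy.
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        return await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        if on_timeout is not None:
            on_timeout(cmd)  # caller-supplied reaction to the hang
        proc.kill()
        await proc.wait()
        raise
```

The appeal is that `call` stays ignorant of manager-level concerns; the cost is one more parameter on an already busy function.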
C: I keep... I have a tendency to go off on tangents. The main point for this meeting is just gathering feedback on approach one versus approach two. So if anyone has any strong opinions, let me know; otherwise, I'll probably go with approach one.
C: A vote for one? All right, that is what I'll go with, then. So I'll follow up my existing PR with another PR that actually fixes the function, and I'll delete the speculative test that's in the middle of the screen right now. Okay.
C: Okay, and then the last one, and hopefully I can stop talking and stop wasting everyone's time: while I was looking at the asyncio docs, I noticed that there is a newer... I think it's called a handler... I forget... a watcher. Newer kernels have a concept called pidfds, and the Python docs even call the pidfd-based one a "Goldilocks" child watcher, as in sort of the ideal one.
C: It's only supported on newer kernels. I checked this morning and proved to myself that the RHEL or CentOS 8 kernel does not support it, but I do believe 9 does, which I've not tested, and Fedora definitely does. So it might be nice to have something optionally enable this, so do a probe of the kernel version, and that is what I linked to in the gist: my hacky script showing how we might enable that alternative watcher for newer kernels and newer Pythons.
C: Yeah, so basically I first check to see if the kernel is new enough, and then... oh, I already see a mistake, but that's okay; this is just a hacky demo.
C: So it basically looks to see if the kernel version is new enough, and if it is, it tries to use the watcher, and if that name doesn't exist, the AttributeError case, we just use whatever the default is. Okay, I'm not saying I would do it this way in actual, proper cephadm, but this is a partial cut-and-paste of the pieces of cephadm I needed, (a) to prove to myself that it was workable, and (b) to show that we could conditionally do it.
C: I was actually hoping that when we instantiated the pidfd child watcher, it would raise a specific exception itself, like "your kernel version is too old," which, unfortunately, it didn't do, so I did the uname-checking stuff. If there's a simpler way to do that, let me know; I just hacked that together.
C: What I did is I ran this script on my laptop, which is Fedora 36 or 37, it doesn't really matter, and then I also ran it on a CentOS 8 VM. I ran it on the base version of Python there, and then I ran a Fedora 37 container on that version of CentOS, so that we were basically using the latest Python and an old kernel. And yeah, that successfully used the native asyncio runner but skipped using the pidfd watcher, because it successfully detected that it was on an old kernel.
A: In this case, are you importing this from somewhere, or are you just using Python 3.9? I'm asking because of things I know about the threaded child watcher, which...
C: I see, right. So if I run the script on the Python version that comes with CentOS 8, I see the print "using fallback async run," because it's whatever, Python 3.8, I think. Then, if I run it in the Fedora 37 container, it runs line 37 and goes, "oh, your kernel's too old, I'm not going to execute the if block." And then, when I run it on my desktop, it is able to print "enabled asyncio pidfd child watcher."
A: I'm thinking about it outside of the kernel version, right, because I know for the threaded child watcher we have right now, we just have that code copied into the binary, right? It's not like it's actually importing it, because it's on...
C: Well, I think it tries to do the import and then otherwise falls back to the copy-and-pasted version, because...
C: Yes, so that's the `sys.version_info` check, which is checking the version of Python. I personally like to try to import things first rather than use explicit version checks, but whatever. What am I trying to say: the kernel is different from the Python version. For the kernel version, I don't see another way to do it right now; for the Python version, I opted for a try/except rather than an "is my version X" check. I hope that all makes sense.
C: Oh, sorry, I wasn't planning on doing the copy-and-pasting. I was basically going to say: if the version of Python is new enough and the version of the kernel is new enough, use the built-in one.
C: It doesn't actually fix any bugs; this is just purely an efficiency thing. The threaded child watcher has to create threads to watch the child. The pidfd one basically creates a special FD to monitor the state of the child and sticks it in the epoll, select, or whatever underlying mechanism the asyncio loop is using. Okay.
A: What I was thinking, I'm not sure... I guess the only problem is inside of our containers: they're built on CentOS 8, and probably will be for a while, but I don't imagine that's still going to be an issue for the builds we use anyway. I guess we'd have to wait until they start building the containers using something newer, and yeah.
C: Makes sense. Yeah, I don't know; I'm not totally on top of what Red Hat downstream or RHCS downstream is doing, but I thought things were moving towards 9, and at some point others will do something similar with their enterprise stuff. So yeah, eventually we'll have kernels that are newer, or more widespread.
E: The threaded one, the other child watcher: I'm reading the binary's code and I see that we are using a special class for it.
A: You know, it sounds good, especially if we're not going to do anything weird with it; we're just going to import it and, if it's usable, we'll use it. It seems like it's a slightly better version.
B: All right. You can go on to the last one.
A: So the idea of this pull request is that we want to split up host draining into two separate things: draining the daemons off a host and draining the config and keyring files we deploy on the host are slightly different things, so you can do one but not the other. To do that, we need to introduce a new label that will mark the configs and keyrings as being drained. Yesterday it seemed like people were a bit confused by the naming and such, and since these things are really hard to change after they're in an actual release, we wanted to talk about it a little bit here.
A: So I have some doc changes in the pull request here that sort of explain what the labels do. This part is talking about what drain will do: drain will add both labels by default, `_no_schedule`, which stops it from putting daemons on that host and gets them all removed, and `_no_conf_keyring`, which will stop it from putting any config and keyring files on the host and remove any that are there. And then down here...
A: There's an extra part that explains what this label does, but it basically just piggybacks off the `_no_schedule` label description up here and then says that this one applies to configs and keyrings. And then there's also the naming of the flag, the flag that makes it not put this label on the host, so that you only drain the daemons but not the configs and keyrings: this `--keep-conf-keyring` flag.
A: Yeah, we're basically just talking about how we want to name these things to make them a bit less confusing, more so for users, I guess, than for us, even though it's nice for us to be able to understand them easily as well. It's most important that how these things are named makes sense to users, and...
E: Do users normally handle these labels? I mean, these are like internal labels, for safety, and...
C: I have a somewhat silly question: `_no_conf_keyring` doesn't make sense on its own without `_no_schedule`, right? Like, one is kind of downstream of the other.
A: It's not that it can't exist on its own, but most of the time I expect them to be paired up together, because you're usually draining hosts because you want to get rid of them or remove them from the cluster.
A: The keyring management in general is... if we can find the doc for that as well, we can look at it a little bit, but basically it's for any keyring you want. So, for example, if you need some client keyring for some client daemon on a host, for whatever reason, you could have cephadm deploy that on every host that matches label X, for example, or every host that matches, I don't know, any placement, basically; you just put some placement on it.
A: Okay, and so I guess you could use this one to say: I added it on all these hosts, but on this one host, right now, I don't want it to get those configs and keyrings. And you could put the label on manually, but again, I expect it to mostly just be used as part of the drain.
A: As a topic, for sure, yeah. There's this part that talks about the keyrings and such and what it does. Basically, you can put pretty much any keyring there.
A: And then it'll just deploy that keyring into the /etc/ceph directory on all the hosts that match the placement. Again, if you're like, "for whatever reason, I don't want this host to have the keyrings anymore," you could put the label on, and it would remove them from that specific host while still putting them on the other ones that match the placement. You can use it that way.
A: If you mean before this change: `_no_schedule` would basically cover both of these cases. They were both encompassed under the one `_no_schedule` label, and so `_no_schedule` would cause it to remove the daemons and the configs and keyrings.
A: So this pull request is splitting that up into two things, and the reason behind it all is that somebody wanted to be able to drain hosts of their daemons but leave the keyrings. They wanted to do that on some of their nodes; I don't remember exactly why.
A: Because they want to remove all the daemons from the host, and the easiest way by far to do that is to drain the daemons, so they do the host drain. But that would also remove the config and keyring files, so then they were forced to manually edit their placements to exclude that host, or something, which is a pain. So this would allow them to keep the keyrings they want on the host while still getting rid of all the daemons.
A: The label is how we do it internally, because we basically need to know which hosts we need to drain, whatever we're doing.
A: Yeah, so this is the drain host function. I don't want to go into super depth on this stuff, but it's not very long. Oh, and it doesn't matter... it also marks things: OSDs are special, so we have to mark those as "to be drained" as well, and then it prints out a list of all the daemons on the host. So there are some extra little things, but as far as handling the non-OSD daemons goes...
A: Yeah, and even with OSDs, it'll try to start draining them, but obviously the OSD removal process is a bit more in-depth than for the other daemons; that one has to do some other stuff. But for non-OSDs, it'll just put that label down, and then, when it does the normal apply-spec stuff...
A: ...it handles it that way. And the client keyrings actually use the exact same placement system for putting those down, and because they do, they're also affected, and so they would also get removed in exactly the same fashion. Because, I think, literally what we do when we're calculating where to put those is...
A: We just make a placement here, yeah, and then we just make it a mon spec, so we basically fake that we're placing a daemon, a mon daemon, but actually this is the placement we're going to use for the client keyring stuff. So it's exactly the same system; it's just that what this pull request is sort of doing is using a different list of things to figure out which hosts are draining and which ones are available and such, because this is a...
A: Yeah, I mean, I'm confident all this stuff works, at least with this pull request; it's just a matter, again, of what we want to call these things. Like, does `_no_conf_keyring` work any better as a label to say "we're going to remove all the configs and keyrings," and then `--keep-conf-keyring` for the flag? I know that's confusing, because they're so similar in name and they're opposites.
E: What happens... okay, when you don't have the flag, it means that you have to remove, sorry, you have to keep the files, right?
A: If you don't put the flag, then we put both the labels on and everything gets removed. If you do put the flag, it puts on only this label and not that label, and then the configs and keyrings will be left, but the daemons will still get removed.
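The flag-to-label mapping just described fits in a few lines; the label and flag names follow the pull request discussion, while the helper itself is hypothetical:

```python
NO_SCHEDULE = "_no_schedule"
NO_CONF_KEYRING = "_no_conf_keyring"


def labels_for_drain(keep_conf_keyring: bool):
    # Draining always stops daemon scheduling and removes the daemons; the
    # --keep-conf-keyring flag only controls whether the config/keyring
    # files are drained as well.
    labels = [NO_SCHEDULE]
    if not keep_conf_keyring:
        labels.append(NO_CONF_KEYRING)
    return labels
```

Written out this way, the naming tension is visible: the flag is phrased positively (keep) while the label it suppresses is phrased negatively (no), which is the confusion raised earlier.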
A: I guess so; it would then put the configs and keyrings back on the host. I was sort of just considering that as very unlikely, because I don't know why you'd just leave the host around with `_no_schedule` on it and not remove it. There's no...
E: Yeah, like, if the only secondary risk is what John commented, then I'd like more guarding in migrations; in the worst case, you will end up with some host with only config files but nothing brand new on it, yeah.
C: An admonition saying something like: these labels are meaningful to the system, and they're, how do I say it, subject to change over time, so be careful if you're playing around with them manually. Because this is more upstream; downstream we can just say "don't do this," but upstream you should just warn people, like, "hey, be aware that the exact semantics may change between versions."
A: All right. I could also note, since you bring it up, the other thing, which is about removing the labels: I had a note saying, like, be careful about adding these labels and just leaving them around on hosts you don't want to remove, because things could change. But also, if you want to cancel draining a host, you'll want to remove these labels.
A: And a good point, yeah. All right, I've got a lot of warnings in general, or I guess a lot in this section, saying, like, be careful with these, and I think up here I'll add something that says: if you want to stop the draining or whatever, you can remove these labels.
E: Yeah, I think in general we should avoid giving the user the possibility to touch internal labels, and instead provide commands for whatever they need, because this way we reduce the risk of something going bad because they are manipulating internal labels.
A: I mean, we could always go through and add commands for all the other ones as well, but I think that's, I don't know, it'll take a few more pull requests and things. So I think maybe the best start for this one is adding some sort of warning about the labels, yeah.
E: Just to be consistent: this way, if the user knows "please don't touch any label that starts with an underscore," then nobody will go and try to do anything with these labels. Just have consistency, like "don't touch them at all"; we can't have a mix of "okay, you can't touch this one, but you can touch that one." That could be very confusing to the user.
A: All right, and if we do go in that direction, I wouldn't mind having commands for, I don't know, something like the admin label, or no-autotune, or something.
A: But I think I'll start with the warning here, and then that'll be some sort of follow-up work, yeah.
E: Actually, adding these new commands shouldn't be a very, very big task, because it's very trivial; it's just the label. Okay.
D: That makes sense; let's see what we want to do here.
A: All right, in that case, yeah, we're a little over time, so we'll end it here, and I'll see you all next week.