From YouTube: CDS Reef: Orchestrator
Description
Apologies for the legibility of this recording.
The Ceph Developer Summit for Reef is a series of planning meetings around the next release and some community planning.
Schedule: https://ceph.io/en/news/blog/2022/ceph-developer-summit-reef/
A: This is the brief planning session for the orchestrator. We have our initial topics here in this other pad.
A: So the topics that are on there right now are the things that the team came up with; we were talking last week about some things we knew we wanted to get in. So we have those as starting points. We'll discuss those first, and then, if anyone has any other topics they want us to consider trying to get in for Reef, we'll do that.
B:

A:

B:

A: Beautiful, yeah, I can even actually link in the other pad. Let me think about it; the pull request, let me go find it real quick. Okay, yeah. So it was mostly done, and then what ended up happening is we got too close to the Quincy release and we didn't think we were going to get it in for the release, and it seemed like too big of a change to be introducing in a minor release, so it got pushed back.
A: All right, yeah, I'll share my screen so we can just look at it there.
A: Yeah, so we'll start with going through this list, and then anything else people want to bring up for what we should do for Reef, we'll do that after. Starting with the agent stabilization stuff: as an overview for people who don't know what that is, the agent is essentially a non-containerized daemon that would be running on each host, which cephadm would automatically deploy.
A
If
you
like
a
naval
setting
and
essentially
the
point,
is
that
it
speeds
up
our
our
metric
gathering
so
like
they
developed
the
demons
from
the
hosts
the
devices
all
those
things
we
have
to
gather
those
and
right
now.
The
way
it
works
is
part
of
our
service
loop,
every
single,
like
10
minutes
or
so
it'll
go
and
we'll
refresh
these
things.
A: So that means sshing into the hosts and gathering them, and it's pretty slow once you get to a large number of hosts, especially because gathering the state of all the daemons can be slow in general, and the ceph-volume commands we have to run, like getting all the devices, are slow.
A: We have this whole agent feature where it'll gather that for us and send it over HTTP to the manager, so we don't have to spend so much time in the serve loop doing that. It seems to let us get those metrics more frequently, and it makes the serve loop run faster, so we can get to the actual operations quicker instead of having to spend so much time refreshing all of that information.
A: I think, if the host goes offline and comes back up, there is some issue where the manager would report the agent as down when it wasn't really down; just, in general, a bunch of smaller things that have to be worked into it.
A
We
have
to
handle.
So
we
want
to
do
that
stuff,
then,
for
the
additional
features,
I
think
we
want
to
do
stuff,
like
you
know,
deploy
demons
with
it,
because
it
has
a
copy
of
this
fading
binary,
and
so
theoretically,
it
should
be
able
to
do
all
the
sort
of
actions
we
would
do
on
the
host
normally
like
deploying
demons
and
gathering
information
and
all
that
stuff.
A: That was just one of the big things we wanted to do for Reef: try to actually incorporate this a bit more and get it properly tested and everything, because right now it's sort of experimental and we just have it disabled by default. We do have some tests for it in teuthology at least, but it needs to be fleshed out. It also needs to be documented; there's no documentation either.
D: Yeah, so that's one thing we could possibly add to the list, if you want: just containerize it. It would probably need to be a privileged container, but once you're a privileged container you can always reach outside of your container context, and it might simplify the packaging and delivery.
D: I'll admit I'm a containerize-all-the-things sort of person, but yeah, we can discuss it as an individual topic later. All right, yeah. It's obviously...
A: We can at least come up with, or talk about, the idea that once these other things about it are set up properly, it could be a container that could work. So those are the main things for that. This one is something we were sort of planning to do; we wanted to do it originally as part of Quincy.
A:

C:

A: Oh yeah, yeah, for detecting the offline host.
C:

A: Yeah, because right now, when the agent was on, the way we were detecting offline hosts with it was essentially just that it didn't report; we would have that in there, which is, I guess, sort of similar to a heartbeat, but we wanted a more explicit heartbeat. So there's the normal reporting, and there's a heartbeat that should come in no matter what; if we don't get that, then we know that it's offline. So we want to do something like that as well; that is at some point here.
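The heartbeat scheme described above can be sketched roughly like this (hypothetical Python, purely illustrative; the class and method names are invented and this is not cephadm's actual implementation): the manager notes the time of each agent heartbeat and only flags a host offline once a heartbeat is overdue.

```python
import time

# Hypothetical sketch of explicit heartbeat tracking; not cephadm's real code.
class HeartbeatTracker:
    def __init__(self, grace_seconds=30.0):
        self.grace = grace_seconds
        self.last_seen = {}  # host -> timestamp of last heartbeat

    def heartbeat(self, host, now=None):
        # Called whenever the agent on `host` checks in.
        self.last_seen[host] = time.time() if now is None else now

    def offline_hosts(self, now=None):
        # Hosts whose heartbeat is overdue are considered offline;
        # hosts we have never heard from are not flagged here.
        now = time.time() if now is None else now
        return sorted(h for h, t in self.last_seen.items() if now - t > self.grace)

t = HeartbeatTracker(grace_seconds=30.0)
t.heartbeat("host1", now=100.0)
t.heartbeat("host2", now=100.0)
t.heartbeat("host1", now=125.0)
print(t.offline_hosts(now=140.0))  # → ['host2']
```

The grace period trades detection speed against false positives when a host is merely slow to report.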
A: All right, that was all I had for this topic. Did anyone have anything else they want to talk about with the agent for now?
A: No? All right, I'll move on to the next thing we have on here, which is the serve loop transparency. So right now, the way that a lot of the cephadm notifications work when something goes wrong is that it has these events; there's a bit more to this topic than the other one, but they go together. There's daemon events and there's service events, and then there's obviously just the random logging that the serve loop does.
A
We
found
that
all
these
things
are
sort
of
hard
for
people
to
access
they're,
not
really
like
intuitive,
like
something
goes
wrong
in
the
server
it
might
log
an
event
or
like
demon
manager
whatever,
and
you
specifically
go
in.
You
run
something
like.
A: ...something like that, and then, if there were any events for that daemon where things went wrong, they would show up there. But we found people aren't going to check that by default, especially because there's a lot of services and a lot of daemons.
A: So even though we have this information stored somewhere, it's hard to find currently, and we want better ways of sharing this sort of information with people. And on top of just the events for the daemons and the services, there's also just, in general, what the serve loop is doing. Say, like we were talking about earlier, we have to refresh all the daemons, and we have to refresh all the devices and things, and then after that we can go do x, y and z; really, what those things are.
A
What
we're
doing
in
the
background
should
there
should
be
some
way
for
the
user
to
see
what
that
is.
Look
it
up
for
a
lot
of
the
debugging
cases.
A
So
like
say,
if
you
had
something
that
was
hanging
like
we
had
that
issue
before,
where
there
was
some
stuff
volume
command
hanging
that
we
were
trying
to
run
at
least
be
able
to
see
like
oh,
this
is
what
it's
trying
to
do
right
now.
This
must
be
hanging
somewhere,
make
the
debugging
a
bit
easier
and
also
some
sort
of
history
to
these
things.
A: So you could just see these are the actions that it ran; say, in the last serve loop, when it ran through, it did these things. That could be useful for debugging, because then, if something that you wanted to happen isn't happening, you would know why.
A
So
you
want
to
have
stuff
like
that.
I'm
thinking
some
cli
commands
that
give
you
again
the
history
of
what
server
was
doing
recently,
just
like
short
snippets,
maybe
something
to
say
what
we're
doing
currently
stuff
like
that
and
then
some
way
to
make
the
demon
the
service
events
more
visible.
A: Maybe we just need something like a CLI command just for viewing events in general, and it'll just list out all the known events we have recently. So we see all the errors that have happened for all the daemons, and then you can sort through that to find what you're looking for, rather than having to specifically go look for a certain daemon and then check if that daemon has an event; it's just easier to look through the existing events that happened.
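A rough sketch of that general events view (hypothetical names; not an actual cephadm interface): one bounded, time-ordered store of daemon and service events that can be listed in a single pass and filtered afterwards, instead of being queried one daemon at a time.

```python
from collections import deque

# Hypothetical sketch of a global, bounded event store; not cephadm's real code.
class EventStore:
    def __init__(self, max_events=1000):
        # Oldest events fall off automatically once the cap is reached.
        self.events = deque(maxlen=max_events)

    def record(self, subject, level, message):
        self.events.append({"subject": subject, "level": level, "message": message})

    def list_recent(self, level=None):
        # One flat listing of everything known, oldest first,
        # optionally filtered by severity.
        return [e for e in self.events if level is None or e["level"] == level]

store = EventStore(max_events=3)
store.record("osd.3", "ERROR", "deploy failed")
store.record("mgr.a", "INFO", "refreshed daemons")
store.record("nfs.foo", "ERROR", "port already in use")
store.record("osd.5", "INFO", "redeployed")  # evicts the oldest event
print([e["subject"] for e in store.list_recent(level="ERROR")])  # → ['nfs.foo']
```

The cap keeps the store cheap to hold in the manager; filtering happens on read, so one listing serves both "show me everything" and "show me only errors".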
A: Yes, that's one of the big topics. I guess these two points sort of go together: we want to make the debugging and everything easier.

A: We found that people have some trouble finding some of the problems when things do go wrong.
F: All right, on that, yeah: where do the serve loop logs go? Where can I currently find them? Are they in the ceph manager logs?
A: Yeah, I think they all go under the ceph manager logs currently. Normally, when I look at them, I'll usually watch them live, or I'll just do something like, if you do like a...
A: ...ceph log last, I'll do something like this: there's a command, and that'll just show the last, well, it doesn't have to do too much, I think there's a cap on it, but it'll show the last x number of debug logs cephadm spit out, which will usually be all the stuff from the serve loop. You can see them there, and then also...
A
I
think
they
go
into
the
manager
they're,
just
in
general
like
what
I'm
talking
about
here,
though,
obviously
the
logs
will
still
exist,
but
we
want
to
have
something
a
bit
more
concise,
it's
easier
to
look
at
and
see
like
what
the
circle
is
doing
in
general,
so
like
the
debug
logs,
if
you
want
like
a
more
detailed
view
of
what's
going
on,
but
just
I
think
that
overview
would
be
super
helpful.
A
So
maybe
we
even
want
like
an
easier
command
to
get
our
last
number
of
debug
vlogs.
That
was
possible.
We
could
hold
a
certain
number
of
them
good
people
as
well.
F:

A: Yeah, I think cephadm does have its own log level, but again, I think the problem isn't necessarily setting the log level; it's just that people don't know where to find our logs. We have to do something to make them a bit more transparent.
F:

A: Yeah, I think there is a config setting; I don't remember it exactly off the top of my head, but it's like mgr slash cephadm slash whatever it is.
A
And
the
energy
might
be
fall
under
just
the
generic
manager
log
level,
that's
what
I
was
thinking
of
yeah,
so
I
might
not
just
be
part
of
one
of
the
other
ones,
but
we
should
probably
maybe
even
consider
adding
one
of
those
just
specifically
for
cepheidium.
If
that's
something
that
actually
have
is
a
new
thing
to
do,.
A: But yeah, we'll add that to the list of things we want to do when making this more visible: have some way to customize cephadm's logs specifically, so you can set the level easier and you can find them a bit easier as well.
A: Hey, I didn't have anything else on this, or, I guess, yeah, I want to open the floor now, if anyone has anything they want to say about the logging or the transparency in the serve loop or anything; maybe something they've run into that's a bit hard to find, or any ideas for how we do the debugging, or anything like that.
F: Yeah, with respect to the service events: I mean, we can filter by service name and service type. I do that when I try to figure out more logs for NFS servers, for example, and maybe we need better documentation upstream for the config settings that you mentioned, Adam, or maybe it's already there.
A:

F: Yeah, we need to make sure that for each service, you know, you mention in each one of those services that you can use these cephadm commands and filter by service name and service type. Yeah.
A
Yeah,
I
know
I
don't
know
I
think
they're
in
the
troubleshooting
section
somewhere,
but
I
think
they
should
be
a
bit
more
visible
as
well.
I
think
that's
probably
part
of
the
problem
why
people
seem
to
not
find
these
is
they're
not
documented
in
like
some
more
forward-facing
spots,
where
they
probably
should
be
because
they're
kind
of
important
we
should
do
that
as
well.
I
still
add
that
points
here.
A: Another one; this is another topic on here that's actually pretty much in the same vein of transparency and everything. I hit the wrong button, sorry; that fires the previous...
B: One second, I put my video on; I had it on mute. Just a curiosity question: is there a way to make everything that comes back sort of actionable? You know, try to bring it down into, like, go do this to take care of the problem, versus just leaving it sort of opaque, you know what I mean? Try to do it in such a way that everything has an action: you know, this happened...
B:

A: I mean, in theory; I guess the hard part would be knowing what to do in that situation, but we could definitely, when we do know what to do, give a recommendation. Because the way it works right now is we just sort of log an event, and so we could just as easily log some other information about the event there.
B
Yeah,
because,
typically
in
the
past
I
mean
to
make
a
you
know
product,
you
know
easy
to
use
and
usable
I
mean
you've
really
got
to
try
to
get
most
things
actionable,
and
you
know
if
you
fall
into
like
a
one
percent
category
of
things,
that
you
know
you
just
throw
your
hands
up
and
call
support.
I
mean
that's,
not
bad,
I
mean
you
know
it's
better
than
probably
where
we're
at
today
a
lot
of
our
actions
today,
a
lot
of
the
events
that
occur
just
like
you
scratch
your
head.
B:

E: You enter some command and you get an error code, and it comes back and it says you need to do x before you do y, or something like that; is that what you're talking about? Yeah, yeah. You know, a good example of that to me is, if you use the command line, git has a habit of telling you exactly what you need to do when you do something wrong.
E: I would have to agree with that, but the problem with that then becomes being sure you know what the user is trying to do. Yeah, yeah, and that is the problem: there's an error, but you don't know what the user is trying to do, unless you came back and said you could either do a, b, or c. Yeah, yeah, something like that. I don't know, but I do agree that some helpful hints on error would be nice.
F: I think one of the complexities is that some of the commands do the deployment of daemons asynchronously. For example, when you create an NFS cluster or you create a CephFS volume, the command returns, but it shoots off asynchronous deployments.
F:

A: Yeah, that was one of the things I kind of wanted with this point, with the history and everything; maybe we could log, well, not log, but something more easy to access than logs, that we just actually tried to deploy this daemon, like, this many seconds ago, or whatever.
D: So one thing I would recommend, maybe not copying, but kind of taking for inspiration: when I debug problems with Kubernetes, the kubectl describe command is extremely helpful. It brings a lot of useful information about the state of a resource into one place, so that might solve the problem you were just bringing up there, which is, like, what's the state of this asynchronous thing?
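As a toy illustration of that describe-style aggregation (all field names here are invented; this is neither kubectl's nor cephadm's real output): pull a service's spec, current status, and most recent events together into one report.

```python
# Hypothetical sketch of a kubectl-describe-style summary for a service;
# the structure and field names are invented for illustration.
def describe(service, status, events):
    lines = [f"Name:    {service['name']}",
             f"Type:    {service['type']}",
             f"Status:  {status['running']}/{status['size']} running"]
    lines.append("Events:")
    for ev in events[-3:]:  # only the most recent few events
        lines.append(f"  {ev['level']:5} {ev['message']}")
    return "\n".join(lines)

report = describe(
    {"name": "nfs.foo", "type": "nfs"},
    {"running": 1, "size": 2},
    [{"level": "INFO", "message": "scheduled 2 daemons"},
     {"level": "ERROR", "message": "daemon nfs.foo.1 failed to start"}],
)
print(report)
```

The value is that spec, status, and recent events land in one view, so an asynchronous deployment that half-failed is visible at a glance instead of requiring three separate queries.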
A: Yeah, we can take a look at that; I don't know.
A:

B: But if you do have a context of something at, you know, a larger level that you're trying to accomplish, and it fires off a whole bunch of asynchronous events, I mean, somehow you do have to keep track of that, know the history; and at the end of the day, if that larger context fails because of one of these other asynchronous events you did, then I would think that somehow you've got to make that connection. I mean, again...
B
If
that's
all
part
of
the
conversation
I
agree
100
I
mean
you
know
you
absolutely
have
to
be
able
to
do
that
and
hopefully
make
sense
of
it,
and
then
from
that
you
know
what
is
your?
What's
your
course
of
action.
As
a
result,
I
mean,
depending
on
maybe
availability
model.
Maybe
there's
absolutely
nothing.
You
have
to
do
let
the
system
recover
eventually
or
if,
if
the
service
didn't
this
thing,
you
tried
to
do
a
larger
context
doesn't
work.
Then
you
have
to
you
know.
B:

A: Yeah, I think all these things sort of tie together, where we need to be able to describe the asynchronous service, or what's going on there; we need to be able to properly display errors when things do go wrong; and then we want to, if possible, give some recommendations for how to fix it. Doing all those things together would improve the UX a lot, because you could see the history of what happened.
C: Do we somehow have something to get the history of commands that were shot at cephadm?
A: It would, I guess, as long as they're only orchestrator commands; it's kind of hard to do it when it's not an orchestrator command, like if they just change the log level or just do some random radosgw-admin thing. We can't really see that, but we could try to track which orch commands they ran, if we thought that was a thing people would want, to know what they ran. I guess it could be useful, actually, if they need debugging and we want to see what they did; yeah, exactly.
D: Rather than the commands, maybe log state changes, up to some limit maybe, because you don't necessarily want to capture everything; but I don't think you want to log random, say, read-only commands. I've used a system that did that in the past, and it was just noise.
A: Yeah, I think, for the decorators for the CLI, there already is, like, a read versus a write one. I don't know if they've all been applied correctly, but there is something in place for differentiating them.
A: So maybe we could at least look there and see that, for the write ones in particular, we could maybe log those somewhere, so we know what people are doing, which they could share.
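A small sketch of that read/write split (hypothetical decorator names; the real cephadm CLI decorators differ): tag command handlers as read or write, and record only the write commands in a bounded history, which also addresses the read-only-noise concern raised earlier.

```python
from collections import deque
from functools import wraps

# Hypothetical sketch: record only mutating (write) commands, so read-only
# queries don't flood the history with noise. Not cephadm's actual decorators.
command_history = deque(maxlen=100)

def write_command(name):
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            command_history.append(name)  # audit writes only
            return fn(*args, **kwargs)
        return wrapper
    return decorate

def read_command(fn):
    return fn  # reads are deliberately not recorded

@write_command("orch apply nfs")
def apply_nfs(spec):
    return f"applied {spec}"

@read_command
def list_daemons():
    return ["nfs.foo.0"]

apply_nfs("nfs.foo")
list_daemons()
print(list(command_history))  # → ['orch apply nfs']
```

Bounding the deque keeps the audit trail from growing without limit, which is the same "up to some limit" idea mentioned in the discussion.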
I: There's the generic audit log that exists for general ceph commands; I'm not sure if the cephadm commands are showing up there as well, but that would be a good place to include them, if they're not there already.
A: All right, yeah, that's a little less...
A:

B:

C:

B: Is that something that would be helpful, or is that something that's just a little much? Are you replaying, like, all the commands they ran? Not all, necessarily, but a group of, you know... I mean, almost like a history kind of thing, you know, a click kind of thing. I don't know, I just... is it the right level to do it at? Do you do it at a higher level? I don't know, I'm just bringing that up: is it helpful at this level?
C:

A: Yeah, maybe; cephadm might not be the place you want to do that. It's like, we could get the history of the commands really easily there, and then maybe people could use it for making their automation, or for doing the recreation; I don't know if you want to implement something where cephadm actually recreates the commands or whatever.
C:

A: Right, well, yeah, that was a lot of good stuff; I'm gonna move forward with that. Yeah, as I mentioned, I'll look at more of this generic logging stuff and make sure the cephadm-related stuff is all in there; I'll have to give that a closer look.
A
If
we
could
get
some
way
to
get
those
out
and
just
use
them
at
least
get
all
those
fdm
related
ones.
Maybe
we
can
store
them
somewhere
temporarily
or
whatever
just
so.
We
can
have
an
easier
time
debugging
when
something
goes
wrong,
figure
out
what
people
are
up
to
I'll,
be
good.
A: All right, I don't have anything else I want to say on this sort of debugging stuff; basically, this whole section here we were just talking about: where to view failures, what we should be tracking, what people are doing, and things like that.
I: I'm maybe out of the loop here, but I know there was some kind of events framework that cephadm was using to try to track, like, what had happened in the past. Is that still useful for failure tracking in this kind of sense?
A: Yeah, so we were talking about that a little bit earlier; it's, like, one of these ones. So basically, when you look at, like, an orch ps or orch ls output, as long as you give it a format that isn't the plain one, if that daemon or service has had an event happen, like something has gone wrong there, it'll show up. What we were saying earlier is that the only problem with that is people tend not to find them, or not look there, like...
A
It
might
just
be
like
too
hard
to
find,
and
also
just
like
the
level
of
granularity
where
you
have
to
specifically
go
out
of
your
way.
To
like
look,
I
want
to
check
if
something
happened
with
this
service
might
be
too
hard.
We
might
need
some
more
general
way
of
finding
all
these
events,
so
they
are
there,
but
it's
just
they
don't
seem
to
be
super
visible
right
now
we
wanted
to
increase
the
visibility
as
part
of
this
okay.
A: All right, let's move on to the next thing we have on here. So this one, again, is actually just in the same vein of what we were just talking about, which is an upgrade history.
A
So
one
of
the
things
that
hasn't
been
great
with
the
way
the
upgrade
works
is
when
it
ends
the
whole,
like
upgrade
status,
that
you
can
kind
of
use
to
track,
how
it's
going
the
whole
time
just
goes
away,
and
it
says
it's
not
in
progress
anymore,
but
if
we
had
an
upgrade
history
sort
of
thing
going,
then
we
could,
as
you
after
you
see
an
upgrade,
no
longer
says
it's
in
progress
anymore.
Instead
of
just
hoping
that
it's
all
worked,
and
it's
all
good,
you
could
check
the
upgrade
history.
E:

A: This would be like a command line thing, because the idea is we'd just have, like, a log of something pretty quick: like, we just completed an upgrade of these daemons to this version at this time. You could see sort of what's happened recently; yeah, I'm sure this stuff could be found in the logs already.
A
But
again,
it's
just
super
hard
to
find
right
now,
so
we
want
to
have
a
command
line
to
just
say:
you'd
say
if
you've
done
multiple
upgrades
as
long
as
there's
not
like
too
many
of
them,
you
could
just
store
them
somewhere
and
that
way
you
could
see
what
upgrades
have
happened
in
the
cluster.
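A minimal sketch of such an upgrade-history record (field names invented for illustration; where cephadm would actually persist this was left open in the discussion): append one small summary per completed upgrade to a bounded list, so the information survives after the live upgrade status disappears.

```python
from collections import deque

# Hypothetical sketch: keep a short, bounded history of completed upgrades,
# rather than losing all status information once an upgrade finishes.
class UpgradeHistory:
    def __init__(self, keep_last=5):
        self.records = deque(maxlen=keep_last)

    def record(self, target_version, daemons_upgraded, started, finished):
        self.records.append({
            "target_version": target_version,
            "daemons_upgraded": daemons_upgraded,
            "started": started,
            "finished": finished,
        })

    def summary(self):
        # One short line per upgrade, oldest first.
        return [f"{r['started']} -> {r['finished']}: "
                f"{r['daemons_upgraded']} daemons to {r['target_version']}"
                for r in self.records]

h = UpgradeHistory(keep_last=5)
h.record("17.2.0", 12, "2022-04-01T10:00", "2022-04-01T11:30")
h.record("17.2.1", 12, "2022-06-10T09:00", "2022-06-10T10:05")
print(h.summary()[-1])  # most recent upgrade
```

Keeping only the last few records matches the later question in the discussion about how far back the history needs to go.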
A: So we're just trying to make things that happen asynchronously more visible, because upgrade is another one of those things that we do in the background; you have to fire it off and then it just goes, and it would be nicer to just have some better tracking of what happened with those things. And there is some work already going on; like I said, it'll end up in Reef at some point, and it's planned to go into earlier versions as well, to make it so you can split the upgrade up.
A
But
even
though
it
doesn't,
it
doesn't
take
away
from
the
usefulness
of
having
a
history
like
this,
where
you
could
have
some
way
of
seeing
all
the
upgrades
that
happened.
So
you
know
like
when
this
was
upgraded
and
when
it
finished
and
everything
that
could
be
useful
for
people.
A
So
does
that
have
anything
you
want
to
say
about
that
upgrade
history?
What
could
be
useful
to
be
there?
What
things
should
be
tracked?
Anything
like
that.
A: Yeah, that's sort of the idea we were going for: just some way to see what those things were.
J:

A: Yeah, it'll tell you what type of daemon it's upgrading, but it won't tell you which one in particular; yep, that's what I've noticed. We could even add... I don't think I have a section in here for just the upgrade status.
A:

C:

A: Well, it's not something that's going to happen super frequently either, typically, I'd think.
E:

A: I mean, it could be for either; it just wasn't... it was very generic.
E
Yeah,
I'm
more
apt
to
upgrade
within
a
you
know
within
let's
say
nautilus
or
something
like
that.
Then
I
am,
I
wouldn't
very
often
go.
I
mean
a
major
branch
jump.
I
mean
that
would
be
something
that
you.
I
would
think
that
would
be
every
couple
years,
but
within
the
branch
upgrades
are
more
common
when
bug
fixes
come
out
and
you
upgrade.
A:

E:

C:

E:

A: Yeah, I mean, I don't think it's going to be very much information; it'll probably be, like, a few lines per upgrade. I'm not sure exactly about the technical side of where we're going to put it and everything yet, but it shouldn't be too much, so I don't think it will matter where it is, really.
A
Love
to
see
yeah,
but
most
of
those
tests
about
here,
just
sort
of
what
needs
to
be
in
there.
How
about
like
how
far
back
you
want
to
go.
That's
another
thing,
because.
A
Is
it
useful
to
know
back
like
three
or
four
upgrades?
You
only
need
to
know
the
last
like
two
like
stuff,
like
that,
like
high
level
stuff.
A
Yeah
all
right
how
many
else
want
to
say
about
the
upgrade
history.
A
All
right-
and
we
just
talked
about
the
status
improvements,
so
we
go
to
the
binary
factoring.
I
mentioned
this
earlier.
There's
a
pull
request
here.
It's
linked.
Anyone
wants
to
look
at
it.
It's
been
outstanding
for
a
pretty
long
time,
now,
sort
of
what
happened
with
that
is.
A: It was almost done close to the Quincy release, but we sort of agreed that we wouldn't be able to get it done in time for the Quincy release, and it was too big of a change to be a minor release thing, because it would sort of change how you get access to the binary and everything, and so we wanted that to be something...
A: ...that's only done on a major release, so we sort of pushed it back. Basically, what it allows us to do is: right now the binary is one really large file, like 8,000 lines or so, and it makes it hard to work with a lot of the time. It's sort of stuck in that form for technical reasons with how we use it right now, but the work in this pull request here was giving us a way to split it up and still make use of it properly.
D:

A:

D:

A: Yeah, yeah, so the idea was to try to get back to that sort of late in the year and see if we can get it really ready, like, right before Reef comes out, and get that in; then do some refactoring in the month or two before the Reef release, and then we can have Reef come out with this new refactored version, and however we're going to distribute that and everything will be something new for that release. Then Quincy and earlier will just use the old system of just the one big script file.
A
That
was
the
idea.
There's
something
we've
been
trying
to
do
for
a
while.
Now
this
goes
back
to
I
think
when
pacific
was
the
newest,
I
don't
know
even
when
before
pacific
came
out,
I
think
we
already
saw
some
discussion
being
able
to
refactor
this
there's
just
something
that's
never
been
able
to
get
in
one
reason,
another
thing
and
it's
kind
of
hard,
because
it's
a
big
change
that
it
sort
of
needs
to
be
a
major
release.
Thing.
A: I won't really go into that stuff right now; that is something we definitely plan to do for Reef. So, does anyone have anything they want to say about the binary refactoring? I know that this is kind of a random topic that's more technical.
A
And
that
case
we
can
keep
going
here,
and
so
this
point
here
is
more
of
a
generic
points
about
just
continuing
to
sort
of
do
what
something
I'm
supposed
to
do,
which
is
simplify,
workflows,
make
them
a
bit
easier,
and
so
this
was
just
sort
of
a
starting
list
on
things
we
plan
to
do
that,
for
so
we
have
here
setting
the
monitoring
stack
images,
anyone
who's
changed
their
monitoring,
stack
images
by
default
or
from
the
defaults
sort
of.
A
I
know
this
process
fdm,
where
you
have
to
sort
of
change
the
config
option
and
then,
if
you
use
the
config
option,
you
have
to
then
go
like
redeploy
the
demon.
So
it's
not
like
a
super
complex
process
or
anything,
but
it's
one
of
those
things
that
it
would
be
pretty
straightforward
to
automate.
Just
have
one
command
said
like
I
want.
A
My
monitoring
stack
this
like
my
prometheus
demons
on
this
image,
then
we
should
just
be
able
to
handle
that
for
you
and
set
the
image
and
make
sure
the
demons
get
redeployed
and
all
that
so
there's
no
real
reason
for
us
to
force
you
to
do
all
these
extra
steps.
We
should
be
able
to
sort
of
automate
that
for
you,
so
that's
something
we
want
to
do.
A
One
of
those
processes
when
automate
and
rw
multi-site
stuff
is
something
that
almost
happened
a
long
time
ago,
and
then
it
didn't
happen,
something
we're
going
to
want
to
bring
back
up.
A
Hopefully
now
that
we
have
some
more
time
to
look
into
it,
just
having
some
way
to
make
the
deployment
of
rw
multi-site
easier,
fm
does
have
the
power
to
run
those
like
windows,
gateway,
admin,
commands
and
everything.
A
So
like
say,
all
those
steps
were
like
setting
up
the
realm
in
the
zones
and
all
that,
if
we
gave
it
with
a
good
format
for
like
a
proper
yaml,
you
could
provides
to
specify
what
everything
you
need
for
multi-sites,
theoretically
cfdm
code
sort
of
handle
that,
for
you,
make
sure
those
things
get
created
correctly
and
set
up
some
of
that
for
you.
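Purely as an illustration of what such a spec might look like (this yaml layout is hypothetical, not an actual cephadm spec format), it could name the realm, zonegroup, and zone alongside the usual service placement:

```yaml
# Hypothetical multi-site spec sketch; not an actual cephadm service spec.
service_type: rgw
service_id: myrgw
placement:
  count: 2
spec:
  rgw_realm: myrealm          # realm to create if it doesn't exist yet
  rgw_zonegroup: myzonegroup
  rgw_zone: east              # zone this cluster's gateways will serve
```

The orchestrator would then be responsible for issuing the corresponding radosgw-admin calls in the right order, rather than the user running them by hand.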
A: So that's one of the workflows we want to automate; we'll probably be trying to work with the RGW team and get some of that stuff in.
B: Well, you know, my question: why the CLI? Why not just work with the dashboard and put it in the dashboard, add value there? You know, I wouldn't do it in the CLI, personally, but I know that's probably not a popular statement.
H:

A: Yeah, I mean, I don't want to do this as, like, an alternative to the dashboard. Honestly, if it worked properly in the CLI, then the dashboard could even make use of that to some degree; it's not like there's anything we do in cephadm that the dashboard can't make use of at all. So I feel like having it here wouldn't take away from that, but...
B:

E:

A: Getting that in; I know some people have wanted that, yeah. And again, we tried to do it before and we hadn't gotten it working; I think it got removed a while ago. And then, even if we just got this working here, that's the sort of work that could help get it working in the dashboard as well.
A
It's
not
like
it
would
only
help
percept
video
and
we
could
sort
of
either
they
can
make
use
of
it
directly
or
we
could
sort
of
I'll
be
sort
of
the
ideas
from
one
place
to
another
if
we
figure
out
a
good
format
for
doing
that
stuff.
A
So
I
think
maybe
there
could
even
be
some
collaboration
there
when
setting
this
stuff
up
to
make
sure
it's
working
in
both
places.
So
yeah
again
it's
something
we
want
to
do
so
I'll
open
it
up
again.
If
you
want
to
say
anything
about
either
these
workflows
any
comments
on
these
in
particular,
or
if
there's
anything
else,
you
think
that
it
would
be
nice
if
it
was
automated.
That's
not
currently
automated
something
like
that.
So
anyone
wants
to
say
that
stuff.
A
It
doesn't
sound
like
we
have
anything
there,
so
in
that
case
yeah
over
the
course
of
the
next
few
months
as
well.
If
anyone
comes
up
with
anything
and
reach
out
to
me
on,
like
the
rc
channel
or
anything
that
says,
focus
triggers
one
or
anywhere
else
and
pose
those
ideas
there,
and
we
can
see
if
we
can
look
into
adding
stuff
like
that.
A: Right, and so the last point we have on here so far is this disconnected environment stuff. So we have some basic documentation and whatever for how you handle disconnected environments, but this point is basically just that we want to give this a bit more attention over the Reef release: do a bit more testing around it in those environments and all that. I know we have some documents that tell you you should set up a private registry and whatever, and there might be a config option or two...
A
We
tell
you
to
change,
but
there's
not
a
lot
of
like
testing
around
us,
and
I
I
think
maybe
we
could
there's.
Maybe
one
of
these
things
again
we
can
automate
certain
parts
of
it
and
we
have
on
here
is
the
server
sub
points.
This
came
up.
A
Didn't
really
want
to
do
that,
but
for
the
other
things
in
general,
making
sure
that
as
long
as
you
have
the
registry
set
up
that
have
a
discontent
environment
that
bill
is
built
on
top
of
that
private
registry
is,
is
a
smooth
process,
because
we
know
a
lot
of
people
use
these
disconnected
environments.
So
it
would
be
good
to
make
sure
all
that
stuff
works.
Well.
A
Yeah,
so
there's
not
a
lot
of
specifics
there
right
now,
it's
more
of
a
big
idea.
I
guess
what
you
want
to
do
so.
Does
anyone
have
any
comments
on
that
one?
They
want
to
say
disconnected
environments,
things
that
they've
found
that
are
difficult
with
cepheidium
dispensed
environments
or
things
that
can
be
improved
there.
They
have
used
them.
A
All
right,
it
doesn't
sound
like
it,
so
in
that
case
yeah
something
we're
doing
anyway.
So
if
you
guys,
you
do
use
protective
stuff,
you
look
out
for
that
stuff
in
our
hopefully
have
a
few
small
improvements
making
that
a
bit
smoother,
but
that
was
the
last
topic
we
had
come
up
with
so
far.
A
B
Totally, okay. Just an overall comment: this is actually an interesting list, because it looks like it's more about stabilization. There's nothing major new; I'd say it's all about refinement, stabilization, making the product more usable. I think it's all good, and I think that's a good sign of where we're at. One of the things I guess would be missing, in my mind, from that kind of stabilization:
B
We don't have anything about performance and scale here. There's work we need to do there to try to drive that to a different level and see where we can go. What happens when you get to huge node counts, huge drive counts, whatever? The only thing is, that's another piece of stabilization.
B
I know we have a little bit of an overall assessment of where we're at. There's nothing shockingly major new here; it's all about polishing and refinement, making it a more stable, better product. So I guess that's sort of the interesting point as well, I think.
E
Yeah, exactly. I will admit that I'm topped out; I'd call myself more of a small-scale guy, because I topped out at about 40 data nodes. But certainly some sort of performance metrics, I don't know.
A
Yeah, the trouble recently is we've been trying to schedule some of that testing, and you need the hardware to do stuff like that. On a couple of occasions we've been able to get some good hardware for that; there was the Pawsey lab we got access to for a while.
E
Real-world clusters to test with, yes, that's an absolute problem. I will say my biggest deployment is at a customer site, and I can't mess with the customer's system.
E
It's an operational system, so yeah. And I can't afford, I'll say, to duplicate their system.
A
Yeah, but that's something we have gotten some work on, and there's a presentation on it. I don't know if that presentation has come out yet; I don't remember. There's a whole set of slides and everything on our scale testing and all this.
J
Yeah, I think it has a clear section for what we've done with gibba and Pawsey and the other stuff that Paul also did. It's going to be ready once the Quincy release goes out.
A
Okay, yeah, so I guess there will be a presentation around the Quincy release of the tests we've done so far with scale performance, sort of what we're able to achieve currently and where we start running into problems and those things. But it is definitely something, yeah; once things are functionally stable,
A
the performance at large scales is something we want to look at for cephadm, and the problem is always finding consistent ways to test it and everything. Also, I think part of the scale performance stuff ends up being not cephadm specifically but just the manager itself. I know when we did the Pawsey testing, Sage did a bit of refactoring on the lock handling in the manager, so maybe some changes there as well
A
if we want to improve performance at that point. So those are all things left to start looking into as things get a bit more stable. Yeah, that'll be good; it's definitely a point we should start looking at. If we're going to be talking about stabilization and stuff, then yeah, scale performance is something that should become a big topic.
A
All right, does anyone have anything else? I know we're sort of nearing the end of the timeslot, I believe.
K
Yeah, I have a question. I remember we talked about this a while ago: is there any update on the scheduling of services or daemons for cephadm? Right now it's kind of static, right? I remember we identified some piece of work for the dashboard too, basically to somehow balance the resources, or the daemons of the services, across the cluster.
K
Scheduling, yeah; maybe it was with Sebastian. Basically, right now we have the placement, which is kind of static. You can still use the count and the labels, which ensure some kind of high availability, but for avoiding hotspots and distributing a lot of the different services across the cluster, I remember we were talking about some kind of smart, dynamic scheduler for cephadm. Basically, based on CPU and memory, it would try to balance the services.
A
Yeah, basically the idea was just that, all things being equal, if there are multiple hosts we could choose from for where to put something, we would actually check their resources and everything and make our decisions based off of that. But honestly, I don't think a lot of work has happened so far. It is something we can start looking into for Reef, because it was something we planned some time ago to start looking at, and I think we have the means of gathering all the host metrics.
A
Now we have that gather-facts and everything that we didn't have a year ago; that could all be used to make these sorts of calls. So it's definitely something we can start looking into for Reef, but yeah, if I'm being honest, I don't think much work has happened on that recently.
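The resource-aware placement idea discussed above could look something like the following toy sketch. To be clear, this is not cephadm's actual scheduler (cephadm today places daemons from static count/label placement specs); the `HostFacts` fields and `pick_host` helper are hypothetical, just illustrating how gathered host metrics could break ties between eligible hosts.

```python
# Toy sketch of a resource-aware placement decision, assuming host
# metrics like those from cephadm's gather-facts are available.
from dataclasses import dataclass


@dataclass
class HostFacts:
    """Hypothetical per-host metrics snapshot."""
    name: str
    cpu_load: float       # e.g. load average normalized by core count
    mem_free_bytes: int   # free memory reported by the host


def pick_host(candidates, mem_needed):
    """Among hosts already matching the placement spec, prefer the one
    with the lowest CPU load that still has enough free memory."""
    eligible = [h for h in candidates if h.mem_free_bytes >= mem_needed]
    if not eligible:
        return None  # nowhere suitable to place the daemon
    return min(eligible, key=lambda h: h.cpu_load)


hosts = [
    HostFacts("host1", cpu_load=0.9, mem_free_bytes=2 << 30),
    HostFacts("host2", cpu_load=0.2, mem_free_bytes=8 << 30),
    HostFacts("host3", cpu_load=0.1, mem_free_bytes=1 << 30),
]
# host3 is least loaded but lacks memory, host1 lacks memory too,
# so a daemon needing 4 GiB would land on host2.
print(pick_host(hosts, mem_needed=4 << 30).name)  # → host2
```

All else being equal, this simply replaces "first matching host wins" with "least-loaded matching host wins", which is the kind of tie-breaking the discussion is pointing at.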
A
All right, thanks for the question. Does anybody have anything else to talk about in the last minute or so?
A
All right, in that case, thanks everybody for coming. We have sort of a good list of topics here, and I think some good discussion. So that's our Reef planning.
A
Yeah, thanks everyone for coming, and enjoy your day. See you guys later.