From YouTube: Ceph Orchestrator Meeting 2021-09-07
A
Hello, hi everyone, and welcome to this week's orchestrator weekly. Looking at the topics, we have a presentation from Adam regarding the cephadm agent. Adam, do you want to start?
B
Yeah, I'll share my screen. So I'm going to present a little bit about the agent. I think most people in here actually already know most of this stuff, but anyway I'm going to go over it: sort of why we're doing it, what the architecture is going to be, and some of the important issues with it.
B
So to start: why? Why do we need this agent? The main thing here is scalability and performance. For one, SSH is currently the only way we really communicate with any of the hosts, and we've seen that that's pretty slow. We have our serve loop, and if we go through and try to SSH into every single host to do everything, it just takes too long, and we can't cover everything with parallelization.
B
Right now we're just parallelizing the metadata gathering. In the future there could be more: maybe if you want to deploy a lot of OSDs, you want to do that in parallel; just tell all the agents what they have to do and they can go do it. You don't have to worry about going through each host individually, one at a time, in the serve loop. It's also a push model, so it saves the manager the work of having to explicitly go and gather everything.
B
All the manager has to do is sit back with its HTTP server up, and it'll get the metadata sent to it. The other reason we need this is responsiveness; specifically we're talking about NFS here. If I remember right, NFS needs...
B
We
need
to
know
if
it's
down
within
a
minute,
and
so
if
we
have
the
serve
loop-
and
it
only
goes
off
every
few
minutes-
that's
already
too
slow,
and
even
then,
if
it
went
off
once
a
minute,
it
still
takes
a
while
to
go
to
sh
in
the
host
and
gather
the
metadata.
So
it's
just
it's
too
slow.
B
We can't do any HA for NFS with the current architecture, so that's why we need some sort of agent on the host to make things faster. Okay, so the basic architecture here is that the manager itself is going to have an HTTP endpoint; we're using CherryPy for that. Because we're worried about scalability and stuff, we want to be able to take HTTP requests from a lot of different places, so we really want a nice library here.
B
Have
our
own
http
server
we
have
to,
you,
know
worry
about
debugging
and
everything
this
one's
already
built
have
to
implement
it
in
here
and
then
for
the
host
themselves.
The
agent
will
be
a
non-containerized
system
d
unit
that
allows
it
to
run
commands
directly
on
the
host
really
easily.
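A rough sketch of what a CherryPy endpoint like that could look like (this is illustrative, not the actual cephadm code; the class name, path, and port are all invented):

    import cherrypy

    class AgentEndpoint:
        # Hypothetical handler for metadata POSTed by the agents.
        @cherrypy.expose
        @cherrypy.tools.json_in()
        def data(self):
            payload = cherrypy.request.json    # JSON body sent by an agent
            host = payload.get('host')
            # ... authenticate the sender, check freshness, store metadata ...
            return 'ok'

    cherrypy.config.update({'server.socket_host': '0.0.0.0',
                            'server.socket_port': 7150})  # port is a guess
    cherrypy.quickstart(AgentEndpoint())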
B
So
we'll
see
the
other
slides,
like
all
the
metadata
easily
gather
it's
easy
to
just
run
on
the
host
without
it
being
in
a
container
and
what
they'll
be
doing
is
we'll
be
sending
all
the
things
it
gathers
over
http
to
the
manager
on
that
server
that
it
has
waiting
here
and
then
for
messages
from
the
manager
to
the
agent.
We
have
a
raw
socket.
We
don't
really
want
to
have
to
have
an
http
server
running
on
every
single
host
agent,
so
we
just
have
a
socket
to
communicate.
B
It
should
be
enough
there
you
see
here,
we've
set
it
up,
so
we
can
today
receive
a
variable
and
json
string
which
basically
lets
us
send
whatever
you
want
to.
It
will
help
in
the
future.
If
you
want
to
extend
the
functionality,
that's
what
we'll
do
for
now
we're
just
getting
metadata,
because
we're
worried
about
the
responsiveness
and
scalability
stuff
right
now.
The
biggest
scale
problem
is
actually
gathering
metadata
in
the
serve
loop.
It
just
takes
a
while
because
most
other
things
you
want
to
do
once
in
a
while.
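A minimal sketch of the agent's side of that socket, assuming a simple read-until-close framing for the variable-length JSON (the port and framing are assumptions):

    import json
    import socket

    # Hypothetical agent-side listener for manager -> agent messages.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(('0.0.0.0', 7151))   # port is a guess
    srv.listen(1)

    while True:
        conn, _addr = srv.accept()
        chunks = []
        while True:
            data = conn.recv(4096)     # keep reading the JSON payload
            if not data:
                break
            chunks.append(data)
        conn.close()
        msg = json.loads(b''.join(chunks).decode('utf-8'))
        # ... act on the message, e.g. store a new counter value ...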
B
So, the list of daemons: this is the important stuff. It's one of the slowest things that runs; the 'ls' is super slow, and so the agent will do that and send it to the manager, so the manager already has it ready. Then we have the networks and host facts in here; this is just some information about the host that could be useful, and networks is pretty useful. And the ceph-volume output: this is helpful too, for the disks.
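Putting those pieces together, the report each agent pushes is presumably a JSON document along these lines (every field name here is invented for illustration):

    # Illustrative shape of an agent metadata report (all keys assumed).
    report = {
        'host': 'vm01',
        'counter': 42,                   # freshness counter, described later
        'keyring': '<agent keyring>',    # lets the manager authenticate it
        'ls': [{'name': 'mon.vm01', 'status': 'running'}],  # daemon list
        'networks': {'10.0.0.0/24': ['10.0.0.5']},          # host networks
        'facts': {'hostname': 'vm01', 'arch': 'x86_64'},    # host facts
        'volume': [],                    # ceph-volume inventory of the disks
    }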
B
We
don't
even
refresh
that
very
often
so
this
will
make
that
faster,
we'll
have
more
up-to-date
info
on
the
discs
on
the
host
and
in
the
future.
I
think
we
can
even
set
up
so
if
the
disks
change
we'll
immediately,
have
the
agent
send
more
data,
so
we'll
be
really
responsive
on
that
stuff
and
maybe
more
in
the
future.
So
right
now
it's
just
metadata
stuff,
but
we've
talked
about
the
possibility
of
once.
This
is
a
stable
thing.
B
Maybe we'd want it to deploy daemons; it could help with pulling up a lot of OSDs or whatever, but that's future work. We don't want to go there until this and some other stuff works.
B
So, we're talking about having a secure channel here, because we're doing things over HTTP and also the raw socket. The things we're worried about are making sure the messages are encrypted and making sure we're authenticating who's sending them. If we have messages that can't be sniffed by anyone, so nobody else can read them at all, and we also know exactly who they're coming from and who we're sending things to, then we have a pretty secure channel of communication.
B
In this case we have two channels we're worried about. There are the HTTP messages, which is things going from the agent to the server, like when we send metadata up there; we do that with a POST request, and we have to make sure that's secure. And then there's the raw socket: the agent itself has the raw socket, and the manager needs to be able to send information to that socket, so again, it needs to be secure.
B
So first we set up HTTPS, so we have encryption. Unfortunately, it doesn't seem like there's any native two-way authentication in CherryPy for SSL; that would be the ideal way to do it, but we can't quite do that. We'll get back to that in a second.
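For reference, one-way SSL in CherryPy is just configuration, roughly like this (the file paths are placeholders):

    import cherrypy

    # One-way TLS for the manager's CherryPy endpoint (paths are placeholders).
    cherrypy.config.update({
        'server.ssl_module': 'builtin',
        'server.ssl_certificate': '/var/lib/ceph/mgr/server.crt',
        'server.ssl_private_key': '/var/lib/ceph/mgr/server.key',
    })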
B
We generate a root cert, and we give that to the agent when we're deploying it; that's done over SSH, so we're not as worried about encryption or anything there. Then, since the agent has the root cert, it's able to use that to verify the manager: it can see that the cert the manager is using is one that's signed by this root cert, and it'll also check the hostname on the certificate the manager has. That way it knows that whoever it's sending data to is, in fact, the actual manager.
B
It's
not
just
some
random
person
and
the
manager
also
verifies
the
agent
like.
So
we
don't
have
two-way
authentication,
so
we
can't
verify
some
sort
of
ssl
cert
on
the
agent
side,
so
we
have
a
different
way
of
doing
it.
So
we
do,
is
we
generate
a
key
ring
for
the
agent
and
then,
when
the
agent
sends
metadata
back
to
the
manager?
It
includes
that
key
ring
and
we
verify
that
the
agent
on
that
host
is
supposed
to
have
that
exact
hearing.
B
If it doesn't, we just discard whatever it sends; but if it does, we can use that metadata and be sure it's from someone reliable. On the other side, the raw socket: this is just a raw socket, so we don't have any problems with not being allowed to use two-way authentication, and that's what we're doing there. The nice thing about this is that it covers both the encryption and the authentication.
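The manager-side keyring check can be as simple as a constant-time comparison against the keyring that was generated for that host's agent; a sketch, with invented names:

    import hmac

    # Hypothetical check: does the keyring in the report match the one we
    # generated for this host's agent when we deployed it?
    def agent_is_authentic(expected_keyrings, host, report):
        expected = expected_keyrings.get(host, '')
        provided = report.get('keyring', '')
        return hmac.compare_digest(expected, provided)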
B
So it's a similar thing: we have that root cert on the manager that we generated, but this time we're actually generating a cert for the agent itself. So, on top of passing the root cert to the agent, we're also passing this newly generated cert to the agent, and now the manager and the agent both have their own certificates, so they can verify each other's certs. The manager will verify the agent has a certificate signed by that root cert, and the agent can verify the manager's certificate is also signed by that root cert.
B
Since we're the ones making the root cert, and we're the only ones passing that stuff around, we can say that if both sides have a cert signed by our newly created root cert, then we're pretty sure this is a legit request from someone in the cluster. And again, because it's SSL, the encryption is already all covered. Okay.
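Python's ssl module supports exactly that kind of mutual verification on a raw socket; a sketch of the agent (server) side, with placeholder paths and an assumed port:

    import socket
    import ssl

    # Require the peer (the manager) to present a cert signed by our root.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain('/etc/ceph/agent.crt', '/etc/ceph/agent.key')
    ctx.load_verify_locations('/etc/ceph/root.crt')
    ctx.verify_mode = ssl.CERT_REQUIRED   # this is what makes it two-way

    srv = socket.create_server(('0.0.0.0', 7151))   # port is a guess
    tls_srv = ctx.wrap_socket(srv, server_side=True)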
Another big topic we had was metadata integrity; specifically, here it's about things being out of date.
B
Before, what we were doing in our serve loop was gathering the metadata first and then saying "apply specs" or whatever. Now we're doing that all asynchronously, so we have to be concerned, when we go through the serve loop and want to apply specs, with whether the metadata is actually up to date and reliable.
B
That can be an issue, because if it's not up to date, you could double-deploy a daemon. Say you've deployed a monitor on a host, and then the serve loop started up again; maybe you didn't have new metadata from the host yet, so you might still think there's no monitor there, and you could try to deploy it again. You don't want to have double daemons going on hosts. I think it's more of a problem with...
B
I don't know if it's mons specifically; it's probably just a problem in general with extra daemons getting deployed. Our solution to this is essentially just a counter. We were originally thinking of something like a Lamport clock, but I looked at it some more, and this problem is a bit simpler than what a Lamport clock covers. With a Lamport clock...
B
...you have a full queue with counter values in it, and you can use it to verify or access distributed resources across a bunch of different hosts. In our case we really only have one issue, which is verifying, for two specific events on two specific hosts, which order they happened in. So it's super simple, and that means we can use a counter. The way the counter essentially works is that the manager is in control of actually incrementing the counter at any point.
B
So it has this counter value, and whenever it changes the daemons that are on any given host, it updates that counter value. From that point on, it will only consider the host's metadata up to date if it receives metadata with that counter value.
B
So we just send that new counter value to the agent, and what the agent will do is, whenever it's about to start gathering metadata, it will see what its counter value is and make sure it attaches that to the message.
B
That way, if the manager gets metadata with the new counter value, it knows that not only did the agent see that counter value and put it in, but that it saw the counter before it even started gathering the metadata, which guarantees that the metadata is newer than the last time we deployed daemons on that host. That's essentially the same verification we have in our current setup, so we're not losing any sort of metadata integrity there, which is good, because that was one of the big problems with this asynchronous push-model system.
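Sketched out, the manager-side bookkeeping could look like this (the class and method names are invented):

    # Hypothetical freshness tracking for incoming agent reports.
    class HostCache:
        def __init__(self):
            self.counters = {}   # host -> counter the manager last pushed

        def bump(self, host):
            # Called whenever the manager changes the daemons on `host`;
            # the new value is then sent down to that host's agent.
            self.counters[host] = self.counters.get(host, 0) + 1
            return self.counters[host]

        def is_up_to_date(self, host, report):
            # A report only counts as fresh if the agent saw the latest
            # counter *before* it started gathering metadata.
            return report.get('counter') == self.counters.get(host, 0)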
Offline hosts: this is another thing.
B
So essentially, if it's been 50 seconds and we haven't gotten a message from an agent, then that host will get considered offline. If that happens, then we know: hey, we have to move a daemon around, if you want to have some sort of HA system. And what we do in that case is simply schedule a redeploy of the down agent.
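A rough sketch of that offline check, using the 50-second figure from the talk (the names are invented):

    import time

    OFFLINE_AFTER = 50.0   # seconds without an agent message

    # Hypothetical scan over the manager's last-contact timestamps.
    def find_offline_hosts(last_contact):
        now = time.monotonic()
        return [host for host, ts in last_contact.items()
                if now - ts > OFFLINE_AFTER]

    # For each host found here we would schedule a redeploy of its agent,
    # which doubles as a liveness probe: if the SSH for the redeploy
    # succeeds, the host wasn't actually offline.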
B
The reason we do that is it helps with two things. One, it can verify the host is actually offline, because we will try to SSH in to redeploy the agent; so if the host is in fact not offline, we'll be able to tell, because we'll SSH in and see that, and if the host is offline, then we won't, and we'll know for sure it's offline.
B
I'm just going to mention here: this is still a work in progress. I have this pull request open; I'm not going to open it right now, but it's there, if people are interested in going and looking at it or reviewing it themselves. It just went through a teuthology run last night; there are some problems with SSL on Ubuntu, so there are still things to be ironed out. I'm going to give a small demo; hopefully this cluster is actually up.
B
Yeah, this is really where it helps. You can see here we have three agents deployed; I have three hosts, vm00, vm01 and vm02, and there are some daemons on these hosts, just the default daemons you have with the monitoring stack stuff on. And you can see, if I run this again, these already all got reset; they're not all in sync anymore. That's just because of whenever they last had to send data; they really only get up to 20-something seconds before they'll send something back.
B
You see, it's always kept to a super small amount of time in between refreshes. That way we can always tell things are up to date; we can see things faster, and we'll be able to detect offline hosts faster. And that's really all there is to show for this stuff, I guess, besides the debug logs you can see here.
B
Anyway, if you can't read this: it says it refreshed the daemons on host vm01, there's a message about how we received up-to-date metadata from host vm01, and then there's just this automatically printed message about an HTTP message being sent. But basically that's all there is to show; again, because there's no new functionality here, we can just see that the refresh time is super low, which is the whole goal of this, and that the agents are able to run. As I said, this is on a CentOS set of VMs.
B
There seem to be problems with Ubuntu currently, but that'll get ironed out, and I think that's really it, unless anyone has any questions they want to ask.
A
B
I think it should be able to handle it, because CherryPy can have a lot of different threads running at once, and it doesn't take very long to handle the individual metadata once it's there; the actual processing of it seems really quick. I think the gathering is the slow part. So I want to say it should be able to handle that stuff, at least a lot better than it's currently being handled with the SSH and stuff, but I think it'll be okay. Obviously it's going to have to get tested.
A
So when we run out of the one-minute timeout in a large cluster, are we ending up in a thrashing situation where we are creating even more load? Imagine the manager is locked up for a minute or so, for whatever reason, I don't know.
B
I mean, my hope would just be to avoid that at all, just have it not take that much longer. I think we could maybe work with adjusting how fast we time out, if that's a problem. Also, in the future, we want to get it so the agent can run a bit faster and can push off without anything being slowed down, but there's no real way around it.
B
I think, if you're going to have something like this, if you put the timeout in and it takes too long, then it has to time out; there's nothing else really to do. So I think what we really have to do is work towards making it so that the agent can always get its request accepted in that timeout range.
B
Yeah, they do have to get reconfigured. I thought about that as some sort of future work; it's possible to do that reconfiguration over the HTTP or the socket. Right now it's just doing the SSH thing we have built in, because that's what we have; I didn't want to try to implement that in this basic version, but it's definitely something to go for afterwards. If it's just a simple reconfigure, it shouldn't need to change that much, if I just want to change, say, the target IP of the manager.
D
We could also imagine, yeah, we could also imagine that the agents learn what the standby managers are every time they send in their data, so that if they're having trouble connecting, they could try one of the standbys, and then they'd sort of seamlessly transition.
B
D
That could work, it seems. Going back to the thrashing question, it seems like maybe the way to avoid that is... say the issue is that the manager thinks five of the agents have timed out.
D
So I have a list of these five agents, and then I iterate over them, and each iteration is like this 20-second process of doing the slow SSH connection or whatever it is, and then you end up redeploying agents that are no longer slow. Maybe structure that loop so that we check and see if there is a slow agent, and if so, we take one of them and redeploy it.
B
D
B
Yeah, the client can verify the server with that library.
A
D
I guess one last thought: since there's a cert for both the client and for the server, I wonder if the agent cert piece of it can be used in place of, like, generating a cephx key for each of the agents.
B
The
only
reason
I
thought
I
needed
the
key
rings
is
because
the
cherry
pie
server
doesn't
do
the
two-way
like
for
the
thought
into
the
raw
socket.
I
can
just
use
the
two
authentication
and
the
cell
search,
but
when
I
want
to
verify
the
agent
from
the
manager
side
when
it
sends
metadata
over,
I
can't
it
doesn't
seem
like.
I
can
use
a
verified
search.
D
B
D
B
E
Yeah, sorry, this is Cory, jumping in. I just wanted to kind of get some tips and stuff, I guess, for how you guys imagine this being done. From my standpoint, I guess, what I've looked at so far and how I see it is: basically, you guys already have some utility functions for determining whether an OSD is safe to destroy.
E
So I imagine I just look for ones that are safe to destroy, have a queue of them, and destroy them as possible, based upon the requirements of maintaining the replication factor and stuff. Then use your standard commands, the same ones that end up being used if you were to do it manually, to replace them, watch for them to be drained, and then I guess it would take care of zapping them as well.
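The "safe to destroy" check is already exposed as a Ceph command, so a queue like that could poll it; a minimal sketch (the queue itself is hypothetical):

    import subprocess

    def safe_to_destroy(osd_id):
        # `ceph osd safe-to-destroy` exits non-zero while destroying the
        # OSD would still risk data availability.
        res = subprocess.run(['ceph', 'osd', 'safe-to-destroy', str(osd_id)],
                             capture_output=True)
        return res.returncode == 0

    # Hypothetical repave queue: only destroy OSDs as they become safe.
    pending = [3, 7, 11]
    ready = [osd for osd in pending if safe_to_destroy(osd)]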
E
D
Evacuated devices, or dead devices, in which case you wouldn't have to do anything.
D
F
A
There is a device id; maybe we can craft a specific drive group for that specific device id.
D
I don't know how common this scenario is, but repaving OSDs is exactly the user scenario we're trying to cover here. It seems like, if we just said that, in order for this to work properly, you have to have a drive group defined that will cover the devices as they become available, then that would sort of make our life easier and would also nudge users.
A
Currently, what was your use case again? I think it was changing min_alloc_size, right?
E
Good, but yes, the big use case for us is to change the min_alloc_size across a bunch of clusters that are pretty big. And then we just had another use case that we ended up doing manually, where we wanted to start using DB/WAL on some OSDs that weren't previously configured to use DB/WAL on NVMes.
A
G
A
Do you want to at least have a look at that Firestarter bluestore Ansible playbook, in order for us to avoid running into the same issues? Yeah, I'm sure one of...
A
What else do we need? To add some persistent state for us to track what the current state of repaving OSDs is?
D
A
D
Actually, so maybe the gap, then: the drain also has this thing where there are two modes, one where you're going to reuse the OSD id and one where you're not. Maybe we need to have a way so that, when we're zapping and the drive group applies, it knows that it should reuse the OSD id.
A
Reusing the OSD id is pretty simple: when creating those OSDs, we are just searching for destroyed ids.
D
So
one
other
concern
I
have
is
that
if
you
have
a
drive
group
that
says
let's
say
you
have
servers
that
have
like
eight
hard
disks
and
one
ssd
or
two
ssds
or
whatever
they
get
split
up
into
db
wall
partitions
and
you
do
and
if
everything's
empty-
and
you
have
the
drive
group
that
says,
use
this
for
wall
on
this
for
data
or
whatever
the
volume
like
figures
out,
that
the
ssd
should
be
divided
eight
ways.
D
But
if
you
delete,
if
you
zap
like
one
of
the
osds,
so
one
of
the
data
devices
and
you
delete
one
of
the
lvs
for
the
the
db
wall,
we
need
to
make
sure
that
that
volume
is
or
whatever
yeah
that
is
smart
enough
to
like
know,
to
recreate
the
db
lb.
That's
the
right
size.
E
A
In the cephadm test suites we aren't testing that; we are relying on the ceph-volume test suites to properly cover it. Guillaume, do you know if that's properly tested?
A
G
D
E
So, as far as the command to kick this off, and sorry if you guys talked about this when my headphones were disconnected, but do you imagine a new command, like a "ceph osd repave" kind of thing, and then they specify some kind of wildcard syntax? Or, I don't know, what do you imagine the input to select the set of OSDs that should be repaved? I guess, or just a whole host spec, something like that?
D
Maybe just, yeah, that might be just a good starting point, right? I think back when we talked about this, like two years ago...
D
The strategy we thought of was that, if you're going to repave... the complicated part is when you have these hybrid OSDs, where you have the SSD and four hard disks that are, whatever, using the SSD as the DB. You basically want to repave that whole set of devices, the SSD and the paired hard disks, all as a unit.
D
You'd want to destroy all those OSDs and then redeploy that whole thing as a unit. Maybe that isn't necessary, because ceph-volume is smart enough that you could do them one at a time, which would be nice, but there are probably cases where you do want to do all of them. For example, maybe there's stuff... well, I don't even know if it supports that disk setup well enough that we should worry about it.
E
D
A
E
Yeah, yeah, I think so. Let me, yeah, I'll take it back and kind of sketch it out more and stuff, and then I'll come back with more questions for next time, probably, or things that might need a second pair of eyes on something. But that was really helpful, thank you.
E
A
D
So the grace period is 90 seconds. I would say that we probably want to do the failover by the midpoint of that, if we can, so that they'll have plenty of time to go through their thing.
B
Yeah, a little bit of work needs to be done so it actually pushes every 20 seconds, because right now it's 20 seconds plus the time the gathering takes, which I want to fix in a follow-up. But it should be every 20 seconds, unless...
G
D
Pretty close to 50 seconds, and probably about a minute before that is probably still good enough in a healthy environment. So maybe we can just go with that, and we can make the agent interval tunable too, so if somebody has really tight requirements and wants the best performance or something, they could have the agent run faster.
B
We also have a config setting right now for the cluster overall, but if you also wanted to deploy individual ones with different settings, say NFS hosts wanted to have slightly faster ones or something, that should be doable.
A
Okay, the next topic I had on my list was testing reboots. We had a bad issue where mons got removed from the monmap when rebooting hosts, and that's pretty bad, and we really should avoid running into the same issue again. You mentioned that we have a power cycle thrasher and a kernel thrasher that we could use to test reboots.
D
Yes, the thrasher; and I think the code actually is in ceph_manager, it does reboot nodes.
D
Oh, it power cycles nodes, which should have the same effect. So we could actually do a full-on power cycle, or we could SSH in and run reboot, but I haven't checked to see whether there's any trick you have to do to, like...
D
B
A
Another thing that did pop up last week was log aggregation. There is a demand for cephadm to kind of make it possible to aggregate logs from all over the cluster, possibly for support cases; also, I don't know.
A
I was kind of a bit hesitant to add functionality to the cephadm manager module to aggregate logs from all over the cluster, because they are huge, or might be huge.
A
D
A
We're doing it in quarterly already: we run SSH on the remote host, aggregate all the daemon logs, and put them into a zip file.
D
Seems like maybe then you'd want something like a cephadm gather-all-logs, and then, like, get the key from the manager.
D
A
C
I think that there are external tools that are specialized in analyzing logs; that's another issue. Okay, so I think that it could be useful, for example for integration tests, to have all the information after the test; but in running clusters or in production systems, I think that having everything, or having just an aggregation of the logs, is not going to be useful.
E
F
D
B
E
On our side, we have log aggregation set up with Promtail and Loki, and I know there are other alternative solutions, ELK stacks and stuff, but that seems to be a pretty straightforward setup, and those tools already exist. So...
E
At least from our standpoint, it's easy enough to use those external tools for the log aggregation and for making the logs searchable and indexed and stuff, for all of these purposes, and I'm not sure what advantage there is to having specific support for it, I guess, besides maybe being convenient and easier to set up right away.
D
They're not mutually exclusive, right? Like, some users will want a full-blown ELK stack, but having like a gather-logs command might also be...
D
A
Yeah, it's... I don't know, I really don't like going into the realm of gathering log files. It feels to me that cephadm is really the wrong tool for the job, and it feels that we're going to invest a lot of time to implement this for a rather limited benefit.
D
Well, I think just having a simple command isn't a whole lot; I mean, it's narrowly scoped, it's not a whole lot of effort. It doesn't have to be part of cephadm; you could imagine just writing a quick little script that does the same thing. But taking that functionality and putting it inside the cephadm CLI tool at least seems nice, because I think lots of people will...