From YouTube: Ceph Orchestrator Meeting 2022-03-22
Description
Join us weekly for the Ceph Orchestrator meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
A: I was muted, yeah. So we can get started. I have a lot of topics today; maybe we'll add one later. The first thing we have on here is service discovery, and I think you added that one. Do you want to introduce it?
B: That's part of this pull request for adding some new endpoints for Prometheus service discovery. The idea is to be able to get the current scrape configuration over HTTP from outside. This way you can use it for the Prometheus we deploy in the cluster or for an external Prometheus the customer configures, and we get the same configuration for both cases.
B: So in this request, if you look at the changes, you will see that we are adding a bunch of endpoints. Right now the URLs are basically hard-coded, and Ernesto from the dashboard team commented that it would be good, at least, to think about having something for service discovery.
B: This way we can publish our endpoints there, and it will be much easier for clients inside the cluster to discover the different services we are providing and where they are listening.
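For context, Prometheus supports service discovery over HTTP (http_sd_configs), which consumes a JSON list of target groups. Below is a minimal sketch of the kind of payload such an endpoint could serve; the addresses, ports, and label names are illustrative, not the actual endpoints from the pull request under discussion.

```python
# Sketch of a Prometheus HTTP service-discovery payload. All values here
# are illustrative placeholders, not the pull request's real endpoints.
import json

def sd_payload() -> str:
    # http_sd_configs expects a JSON array of target groups, each with
    # "targets" (host:port strings) and optional "labels".
    return json.dumps([
        {
            "targets": ["10.0.0.1:9283"],  # e.g. the active mgr's exporter
            "labels": {"instance": "ceph-mgr"},
        },
        {
            "targets": ["10.0.0.2:9100", "10.0.0.3:9100"],
            "labels": {"instance": "node-exporter"},
        },
    ])
```

A Prometheus scrape job would then point http_sd_configs at the URL serving this payload, so internal and external Prometheus instances consume the same configuration.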
B: And to be honest, we didn't come to a conclusion in that discussion.
B: The problem with that, I remember now, is that to implement this kind of feature in a good way you need name resolution in the cluster, and right now, as far as I know, we are using IPs in cephadm. For example, in the endpoints we publish right now, we publish IPs and ports, and this normally points to the active manager.
A: Okay, so until we have proper name resolution we can't have this sort of more generic feature.
A: Yeah, it's supposed to be that you have one virtual IP, all the requests always go to that one, and then HAProxy handles everything else from there, routing them where they should go.
A: Yeah, which is something we had discussed doing a while back, the possibility of maybe doing that, but it never really came together. There was no reason not to; I think it was just a lot of work, so we didn't have a reason to prioritize it.
B: So you don't have to worry about that. And I think that, while not related to this, it could also help solve scalability issues we already have with the manager, just with the metrics reporting.
B: Right now we only have redundancy at the manager level: we have active and standby, but we don't have load balancing.
A: Multiple active managers, yeah, and then it gets complicated: you have to make sure they're consistent between each other, that they all know everything. There's a whole mechanism for getting the monitors to work like that, so that would be really hard, and I think it would be for the managers as well.
A: Well, in that case, it doesn't seem like we can really implement the generic version right now; that's more of a long-term project. It might be something we aim for in a future release, maybe.
A: Yeah, we would have to start with just plain generic manager HA or something, and then move on to the extra stuff with multiple active managers or whatever. Well, I guess that's more for the metrics thing; it wouldn't be needed for what you're doing with service discovery, which could just use the virtual IP.
A: Okay, that's good. So just to wrap that up: I guess in the meantime we're okay with doing things the way you're doing them right now, using this one endpoint.
A: Yeah, I still want to see if we can get Paul Cuzner to take a look at this; I'll have to see if I can get in contact with him. Other than that, I think the strategy in general is fine for now, and we can maybe change it once we have the work on the manager done. All right, good. Next topic, or do you have something else?
B: Nothing else from me, thank you.
A: The next topic we have here is an update on the Rook test failure. I assume this is your thing, Joseph.
C: Yeah, so this issue came up a while ago, actually, and I remember when it first came up I thought it was just a rados issue, because it was the first occurrence of it. Turns out it's been happening a lot in the rados suite too, so Neha asked me to take another look. What I've decided to do is just remove all the orchestrator commands being tested from the suite, basically remove the orchestrator from this test suite, and see if it still breaks. The idea behind this is that the Rook orchestrator isn't being maintained at the moment, so it's okay to remove it from the test suite, because we can just assume it's failing.
C: Oh yeah, no, just for the Rook test suite in the Ceph QA. The idea is to remove that, because it looks like right now it's failing due to the Rook orchestrator being tested. If we remove that, we might get a better idea of why this is failing.
C: I think it's just better to clean up the tests so that this doesn't happen in the future.
A: Okay. I do know there's a whole discussion with Travis about the future of the Rook manager module and what's going on there. I don't know if they're using it right now, but if nobody's using it at the moment, then I guess we can remove it from testing; nobody's going to fix it if something's broken. Yep, I guess that's it.
A: All right, thanks. Okay, we don't have it on here, but I thought maybe we could talk a little more about HA NFS. Did you have a chance to test with all three of those pull requests we were talking about last week?
D: I did, actually, and the combination of all three worked out much better this time. What I found is that the offline host detection is fairly quick when I unplug the network cable. However, detecting that the node has come back online is fairly slow; that again takes about 10 minutes.
D: The other interesting thing is that the NFS daemon was rescheduled. However, it took a bit longer than I was hoping, more in the time frame of about a minute and a half to two minutes, and in either case I crossed the grace period.
D: But when I did finally redeploy the NFS daemon, I had clients connected with hard mounts, just reading and writing to a file, and they were able to resume. It took a little longer than one might hope, but all in all the failover time was a few minutes, which is certainly better than what we were doing before. The other issue I encountered is that one of my MDS daemons was co-located on the same box that went down with the NFS, and it was not rescheduled, so I had to manually work around that. In a very small cluster, say two or three nodes, it seems like we need to reschedule some of these other stateless services as well, because there's definitely a dependency there between NFS and MDS.
A: All right. I think even at this point, with the work from that one pull request that's supposed to move the NFS daemons, if we just add the MDS to the list it has, it would do that, and then we just have to also have it checked for that offline host.
D: Right, and to be clear, all of my testing was without the agent. When I turned the agent on, I had trouble with the agent reconnecting when the node comes back online, so I just avoided that whole branch of the code.
A: Yeah, I don't think the agent would make it much faster, because with that extra loop for the hosts I don't think it's going to beat that time. It helps with things like checking whether daemons are down, but that's not really what we care about in this case; it's really just host-offline detection. So I guess the first question is: can we make it faster somehow?
D: You know, the more I was thinking about it, we're really just playing with timeouts and polling, and I think the only way to do better is to implement a proper heartbeat, and for that we might need to use the agent or something like that.
D: Yeah, I agree. And the other thing to be clear about: this is using the timeout through asyncssh. I don't think we have a remoto-based solution, so do we want to approach the backport of asyncssh to Pacific, or...?
A: First I want to see if I can do it with remoto, because the one thing we really need is that keepalive on the SSH requests; maybe we can do something similar for remoto. I was going to see first if that's possible, and only backport asyncssh as a last resort if there's no way to do it with remoto.
A: I'd like to avoid it if possible, and I'll have to check whether remoto has a keepalive request that lets us do something like that.
A: Yeah. It was originally written with the idea that the agent changes were necessary for the offline host detection, but we've moved away from that now, so there's no need to have it all as one thing. Honestly, the agent parts can just be removed; I don't think we're going to need them, at least right now. Even if we do the host detection using the agent, it sounds like we're going to try this heartbeat strategy.
A: Yeah, we don't want the agent stuff in the backports, right. All right, it sounds like it's almost there, just a little bit slow, but it was able to reconnect eventually.
D: Yeah. In the initial test I did, where it redeployed just the NFS daemon, clients reconnected within a minute or two, which of course exceeded the grace period, but it still worked out okay. The case with the MDS was much longer; I think it took five minutes or more before the clients were able to re-establish a connection. But in both instances it did actually work out; the clients were just hung in the meantime, since they were mounted.
A: Yeah, just a little longer than we'd like. With the MDS one, at least, we could probably just add something similar for the MDS, where we just move them. It should be even easier: I don't think we have to fence anything, we just have to move it. I'd have to check that, actually.
E: Yeah, Michael, how was the MDS set up? Was it active/standby, or were there two active MDS daemons? What was the setup of the MDSs?
D: Well, it was a fairly small cluster, I think only three nodes, so I had one active and one standby. That's why one went offline: I lost one of them.
A: Yeah, I guess we should just try that, because that would be a more proper test, and if it works that way, we can maybe push off the MDS HA stuff and say anyone who's very serious about this and has a large cluster will be okay, and we'll fix it for the tiny clusters. And then there's the other part of it, which was the issue of not having enough hosts to reschedule them on.
D: It didn't work, no. I had a separate conversation about this elsewhere; I believe it has a lot to do with the NFS protocol and maintaining consistency between the ranks. So if we have, say, two NFS daemons, we need to bring back two daemons with the same ranks; the NFS service can't continue with just one of them present.
D: We have to redeploy all of them. And if you watch the grace DB, even when one of the ranks goes down, it's not reflected in the grace DB, because nothing is actively manipulating it; that only happens during an add or remove. So as far as the other NFS daemon is aware, its peer is still active, even though it isn't.
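For reference, the grace database being described is the one NFS-Ganesha's rados_cluster recovery backend stores in RADOS, and it can be inspected with the ganesha-rados-grace tool. A minimal sketch follows; the pool and namespace values are placeholders for whatever the deployment actually uses.

```python
# Hedged sketch of dumping the NFS-Ganesha grace DB; pool and namespace
# are placeholders, not values taken from the cluster discussed here.
import subprocess

def dump_grace_db(pool: str, namespace: str) -> str:
    # "ganesha-rados-grace dump" prints the current and recovery epochs
    # and each node's flags (N = needs grace, E = enforcing grace).
    return subprocess.check_output(
        ["ganesha-rados-grace", "--pool", pool, "--ns", namespace, "dump"],
        text=True,
    )

# print(dump_grace_db(".nfs", "mycluster"))
```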
A: Yeah. So do you think we should try to get that in right now as well, or is it something you could work around? If you have enough hosts to do it on, it should work.
D: Yeah, it's not convenient in a small cluster, but I think the best thing we could do is just try to reschedule and then raise a health warning or something if we can't, and maybe also consider fixing the port conflict issue so we can co-locate two ranks on the same server.
A: All right, and we can probably get that in fairly soon and have it as the solution for the time being: if you have enough hosts it'll work, and if you have your MDS elsewhere it'll almost work. Apparently it's still a little bit slow; it needs to be about 30 seconds faster, and maybe we can see if there are ways to do that. I'm trying to think of where all the time would come from. Worst case it maybe takes 40-something seconds to detect the host offline, and then we have to redeploy, potentially multiple NFS daemons, though probably only one, since only one host went offline. I want to add up how it gets to a minute thirty or two minutes and see if there's anywhere we can shave some time off.
D: I would be interested to see whether, if we knew for sure it didn't need a daemon refresh or device refresh or anything, we could get it to happen within a minute thirty.
A: Let's see if we can narrow down what we'd need to be able to do that. But yeah, that's tough. I hadn't even thought about the fact that the refresh can take so long that it doesn't matter how fast you do everything else.
A: Yeah, you'll probably have it up in a couple of minutes at least, that is, if it has to do all the refreshes. If you got fortunate with the timing, so it didn't need to refresh anything and it detected the host in the minimum time, I think you could still do it in a minute or so.
A: There's a pull request that's open with a thread that just loops through and checks every 20 seconds, and on top of that there's a keepalive on the request; I think it's 21 seconds.
B: Let me try to link the pull request, you know the one I'm talking about. I remember this pull request you posted; you have a timeout of seven seconds.
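For reference, asyncssh supports this kind of keepalive natively. A minimal sketch follows, assuming asyncssh 2.x; the 7-second interval echoes the timeout just mentioned, and the probe count of 3 is an assumption chosen to illustrate how a roughly 21-second detection window could arise, not the actual values in the pull request.

```python
# Hedged sketch of asyncssh connection keepalive (assuming asyncssh >= 2.x).
# The interval and probe count are illustrative, not the PR's real values.
import asyncio
import asyncssh

async def run_with_keepalive(host: str) -> None:
    # keepalive_interval sends a probe on the connection every N seconds;
    # after keepalive_count_max unanswered probes, asyncssh drops the
    # connection and raises ConnectionLost, which a caller can treat as
    # "host offline".
    async with asyncssh.connect(
        host,
        keepalive_interval=7,    # probe every 7 seconds
        keepalive_count_max=3,   # give up after ~21 seconds of silence
    ) as conn:
        result = await conn.run("true", check=True)
        print(result.exit_status)

# asyncio.run(run_with_keepalive("ceph-host-1"))
```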
A: Then, when the thread goes to check, at worst it'll hang for the 21 seconds, say if the host went offline right before the check. Where's the other one, the one that does the...
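Putting those numbers together as a rough sanity check; the redeploy figure below is an assumption, not a measurement from the testing discussed here.

```python
# Rough worst-case decomposition from the numbers discussed above: a
# 20-second check loop plus a 21-second keepalive hang gives the
# "40-something seconds" detection estimate mentioned earlier; the
# redeploy time is an assumed placeholder.
poll_interval = 20       # seconds between offline-host checks
keepalive_hang = 21      # worst-case wait before a dead host is noticed
detection_worst = poll_interval + keepalive_hang   # ~41 s to detect
redeploy_estimate = 60   # assumed time to reschedule the NFS daemon
print(f"worst case ~{detection_worst + redeploy_estimate} s end to end")
```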
A: It handles all of the refresh stuff we were talking about: it refreshes the devices, the daemons and so on for each host, and that stuff takes a really long time. Among other things it runs ceph-volume inventory, which is slow, so if you have say 50 hosts, that's going to take forever, and every action we want to do in the serve loop gets delayed because we're waiting for that. That's where the agent comes in.
A: I mean, the solution we're trying right now is supposed to be for Pacific, which won't have the agent, and also the agent just isn't stable yet. We don't want our offline host detection to be reliant on an unstable component.
A: Yeah, once the agent has heartbeats and can be the more general mechanism, we could potentially even remove the offline host watcher thread that's being implemented. It should be faster, and it would handle both the refresh stuff and the offline host detection. So it would be good to be able to do all of that quicker.
A: So we could actually repurpose that thread if the agent had a proper heartbeat and we were confident in it. That could be future work, more for the next releases. I guess that's sort of where we are with HA NFS: it sort of works, but it's a little bit slow. If we ignore the MDS case, we're talking about two minutes to get from an offline host back into a working state, assuming you have enough hosts to reschedule the NFS daemons on. That's an okay state, better than where we were a few weeks ago.
A: So we'll see if we can backport those things, and then we'll have a decent solution in Pacific, and then we can try to work on a faster one over time.
A: We could just handle it the same way as the NFS one, I think, and then it would sort of be all right: we just reschedule the MDS daemons the way we do the NFS ones, because right now we're not moving them at all. That's part of the problem: we're relying on the standby MDS to come up after the active one goes down, which seems to take a while.
E: I'm on the CephFS team, so I work with that side of things. Maybe setting up this active/standby will help, and there are also config settings to lower the failover time; we can look at those and figure out what the default settings are. We could also think about having multiple active MDSs, whatever makes sense. All of this is targeted for Pacific, right, and for the OpenStack use case?
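The specific settings aren't named in the meeting; two commonly cited knobs for shortening MDS failover are standby-replay and the MDS beacon grace, sketched below with illustrative values. The filesystem name is a placeholder, and these are examples of the kind of tuning meant, not necessarily the team's recommendation.

```python
# Hedged sketch of two MDS failover knobs; values and the "myfs" name are
# illustrative only.
import subprocess

def ceph(*args: str) -> None:
    subprocess.check_call(["ceph", *args])

# standby-replay keeps the standby MDS tailing the active MDS's journal,
# which typically shortens takeover time after a failure.
ceph("fs", "set", "myfs", "allow_standby_replay", "true")

# mds_beacon_grace is how long an MDS may miss beacons before the mons
# mark it laggy and promote a standby; lower means faster failover but
# more risk of spurious failovers under load.
ceph("config", "set", "global", "mds_beacon_grace", "15")
```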
A: Yeah. So if you have two NFS daemons, they use the same port, so we can't put them on the same host right now, which means we're limited. Say you have a setup with three NFS daemons, and we want to reschedule them when one of the hosts goes offline, but the placement only allows three hosts. Then we can't do anything, because we can't put two daemons on the same host and there's nowhere else to put them. If we fix that so the port conflict doesn't happen anymore, we could, say, put two of them on one host and one on the remaining host, and the service could still stay up even when there aren't enough hosts for one each. Again, that's future work we want to do in the short term.
A: All right, that was the last topic I had in mind. Does anyone have anything else they want to talk about here?