From YouTube: Ceph Orchestrator Meeting 2022-06-21
Description
Join us weekly for the Ceph Orchestrator meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
C
That's basically the PoC that we made with keepalived: one keepalived, one virtual IP, one Ganesha daemon, without the haproxy layer. There is a description of the PoC in the document, and you can see that it is working as expected. So at least for the solution that we discussed a couple of weeks ago, regarding the one-VIP-one-Ganesha setup, this should be useful.
C
We didn't have the chance to test this kind of solution with multiple virtual IPs and multiple Ganesha daemons, but at least on the OSP side, having this PoC is the same as the current solution that we have today. So we can update the tracker, probably adding these findings and this PoC, and we can talk more about the cephadm changes that we probably need.
C
No, no, no, it's pretty much a manual process. I made a private change; if you look at the second link at the bottom of the document, there is a change that I made in cephadm. I'm not too familiar with the ingress daemon, but I know that you can deploy two different containers, one for keepalived, one for haproxy.
C
What I did is add a key within the cluster using the manager set_store and get_store functions, right. So if you detect that the ingress mode is direct (we can have a boolean or something like that), you can just skip the haproxy, returning keepalived as the primary daemon instead of haproxy, and then leave everything else the same. But I'm not sure it's the right approach, so feel free to look at it as an inspiration.
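
A minimal sketch of the store-key approach described above; set_store and get_store are the standard ceph-mgr module key/value helpers, but the key name here is an assumption for illustration, not the name used in the PoC branch:

    def set_ingress_mode(mgr, service_name: str, mode: str) -> None:
        # Persist the chosen mode ('direct' = keepalived only,
        # 'haproxy' = keepalived + haproxy) in the mgr KV store.
        mgr.set_store('nfs_ingress_mode/' + service_name, mode)

    def is_direct_mode(mgr, service_name: str) -> bool:
        # The ingress service reads this back and, if 'direct',
        # skips generating the haproxy daemon entirely.
        return mgr.get_store('nfs_ingress_mode/' + service_name) == 'direct'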
A
Yeah, and it's really just what we need to do. We can figure out the best way to deploy just the keepalived, but the more important part, as far as I'm concerned, is what we need to actually be doing and how we're doing it.
C
Yeah, and the main change for this commit is related to the keepalived config. That should be different if you don't have any haproxy, like the check script, something like that, to keep the container alive.
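
A sketch of the keepalived.conf difference C mentions, assuming illustrative script paths (this is not cephadm's actual template): without haproxy there is nothing to health-check, so the track script must not probe haproxy.

    def keepalived_conf(virtual_ip: str, interface: str, direct: bool) -> str:
        # Illustrative config generator, not cephadm's real template.
        if direct:
            # Direct mode: nothing to health-check, so use a trivial
            # always-true script just to keep VRRP (and the container) alive.
            check_script = '/usr/bin/true'
        else:
            # Proxy mode: fail the VIP over if haproxy dies
            # (hypothetical helper-script path).
            check_script = '/usr/libexec/check_haproxy.sh'
        return f'''\
    vrrp_script check_backend {{
        script "{check_script}"
        interval 2
    }}
    vrrp_instance VI_0 {{
        interface {interface}
        virtual_ipaddress {{ {virtual_ip} }}
        track_script {{ check_backend }}
    }}
    '''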
C
Yeah yeah, because in the previous commits we had the existing ingress service, which is generic: it's the same for RGW and NFS. We need to have two different classes, because at this point, if you are going to use the NFS direct mode, which means only keepalived, you need to make sure that the config is different.
A
Yeah, I mean, what we'll probably have to do is have it be able to specify whether you want the ingress to be just the keepalived or not. And then we probably don't want this generic one, because I don't think we're going to remove the NFS HA with the haproxy; I think we're just going to add this as another option.
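
A hedged sketch of what that option could look like in a service spec; the keepalive_only field is an assumed name for the switch A describes, not an option that exists at the time of this discussion:

    # Hypothetical ingress spec with a keepalived-only switch; the
    # 'keepalive_only' field name is an assumption for illustration.
    ingress_spec_yaml = """
    service_type: ingress
    service_id: nfs.mynfs
    spec:
      backend_service: nfs.mynfs
      virtual_ip: 192.168.100.100/24
      keepalive_only: true   # deploy only keepalived, no haproxy
    """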
C
Yeah yeah, right, and in theory this is another option: it's not just removing the HA and the haproxy for NFS. It's just looking at the store key within the manager and seeing if you need to go with the direct mode with keepalived, or have a haproxy plus keepalived.
C
There is no way, as far as I know, in the ingress daemon to look at the backend. So you first have NFS, and then you have the ingress daemon, and in the ingress daemon you are able to specify the backend type, so you just know if it's NFS or RGW.
A
So, beyond the additional keepalived capability on top of the NFS, was there any other stuff you had to do on the host? There's no, like, iptables stuff or anything like that you had to do before?
C
They are translated into the unit files. You know we have this unit.run script; what we did is duplicate some information in the systemd unit and then run an additional daemon. So cephadm wasn't aware of this additional daemon, of course, because we did it manually, but it was just yet another gateway, a Ganesha gateway for this NFS.
C
Yeah, we just duplicated the unit.run file and we created a different mount point for the ganesha.conf, with a different bind address, just to demonstrate that you can run a Ganesha daemon bound to the virtual IP. Of course, cephadm is not aware, because we manually duplicated the working directory.
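
For reference, the bind-address change C describes lands in ganesha.conf; Bind_addr is a real NFS_CORE_PARAM option, while the address below is illustrative:

    # The duplicated daemon's ganesha.conf differed only in where it
    # binds; Bind_addr is a standard NFS_CORE_PARAM option.
    ganesha_conf_extra = """
    NFS_CORE_PARAM {
        Bind_addr = 192.168.100.100;   # the virtual IP owned by keepalived
    }
    """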
C
Yeah yeah, this is just to demonstrate that we can do that. Then there is some work to do on the ingress daemon to run only keepalived, but that's just half of the work; the other half, the other 50 percent, is basically having Ganesha bound to the virtual IP owned by keepalived. So we need a connection between the ingress daemon and the Ganesha instance, for sure.
A
I wonder if it's worth even making one spec for this. I don't know what you'd call it yet, but some sort of spec that you can set all these things in, and so we'll make, like, you know.
A
Right now this would be two different services, but technically, if you wanted to, you could have one spec cause it to deploy two different services, and we could use that as a way of deploying this, where the manifest will know that it's the special one that needs to use this different bind address and the keepalived, or the ingress will know that it's a special one that needs only the keepalived.
C
Yeah, it probably makes sense. I mean, we can inherit from the existing classes and create a different spec with additional parameters that we can fill. So we can keep the same behavior from the old components, adding specific extra values that can be used to achieve this kind of behavior.
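
A minimal sketch of the inheritance C suggests; the class and field names are assumptions for illustration, not the shipped cephadm spec classes:

    from dataclasses import dataclass

    @dataclass
    class IngressSpecLike:
        # Stand-in for the existing generic ingress spec.
        backend_service: str
        virtual_ip: str

    @dataclass
    class DirectIngressSpec(IngressSpecLike):
        # Extra parameter layered on top: the default preserves the old
        # haproxy+keepalived behavior; True enables the direct mode.
        keepalive_only: bool = False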
A
Yeah, anyway, so it sounds like the overall thing that has to happen is: we need to be able to deploy the ingress with just the keepalived with a modified config, and then we need to be able to deploy NFS with a modified config, bound to a different address.
C
That's the tricky part, because you need this connection between the ingress daemon and Ganesha, while before, in the proxy model, it's the reverse: you have the backend deployed, and then you have the ingress daemon pointing to a specific backend. Now you need to avoid losing this connection between the ingress daemon and Ganesha. I mean, you need the virtual IP up and running on the host before running the Ganesha daemon, because you need to bind the process to that IP address.
C
Yeah, the order for me was different: first keepalived, and then Ganesha bound to the virtual IP owned by it.
A
All
right,
and
did
you
ever
end
up
testing
putting
this
on
two
different
hosts?
Does
it
keep
live?
Do
you
have
to
be
in
the
same
house
as
the
nfs.
A
Yeah, I mean, I'll have to read through this in more detail, but it sounds like it's mostly just modifying the ingress to deploy the keepalived, messing with the NFS Ganesha config a little bit, and then we have to sort of tie them together, so that they're put in the right order and they know about each other a little bit.
D
I was going to say, I think the hard part would be making sure that whenever the replacement Ganesha daemon is brought up, wherever it's brought up, we need to have keepalived there, right? We didn't have that restriction previously, because of the way haproxy and keepalived worked; we didn't have to care about it. All we did was edit the backend section of the haproxy file to add the new IP of the Ganesha server.
D
But now we need to make sure that wherever we have the replacement daemon brought up, the Ganesha daemon brought up, we have keepalived there. So there's a...
D
I
think
so
that,
because.
C
In
the
same
os
people
like
yeah
yeah,
I
don't
think
so,
because
if
you
have
this
option
the
non-local
binds,
which
is
something
that
we
had
in
our
environment.
In
theory,
you
can
bind
process
to
a
non-local
ep,
which
is
what
we
what
we
have
today
in
osp.
So
in
theory,
they
shouldn't
be
on
the
same
host,
regardless
of
the
test
that
we
did.
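
The option C refers to is the net.ipv4.ip_nonlocal_bind sysctl; with it enabled, a process can bind to an address the host does not currently own. A quick check:

    def nonlocal_bind_enabled() -> bool:
        # net.ipv4.ip_nonlocal_bind = 1 lets a process bind to a
        # non-local IP, so Ganesha need not share a host with the VIP.
        with open('/proc/sys/net/ipv4/ip_nonlocal_bind') as f:
            return f.read().strip() == '1'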
D
Yeah yeah, I assumed that it had to be on the same host, yeah.
A
Well,
we'll
have
to
figure
that
one
out
I
could
try
to
actually
start
working
on
it,
maybe
to
have
the
defaults
to
put
them
on
the
same
mouse
anyway,
but
we'll
have
to
figure
out
at
some
point
whether
that's
a
real
restriction
or
not,
but
other
than
that.
This
stuff
looks
really
cool
I'll,
have
to
look
through
it
in
more
detail,
but
it
sounds
like
it's
mostly
just
going
to
be
figuring
out
the
implementation
detail
behind
the
scenes
for
stuff
video
sounds
like
you
guys
have
figured
out
what
really
needs
to
happen.
A
That's what I was looking for; I don't have any other questions on it right now myself, so I'll just spend some time and go through it. Does anyone else have anything they want to ask Francesco or Ramona about how this works, or anything?
A
All right. As I said, I'll let you look through this; it looks like it's really well done. I'll see about how we can start implementing it, and we'll have to figure out that one outstanding question about whether keepalived and NFS need to be co-located or not. Other than that, it looks like we have somewhere to go moving forward.
A
In that case, we can keep going here.
A
There's a tracker for the... okay, that's just HA NFS. Well, I guess we're on to this Rook NFS one: initial issues with node failures. You want to talk about it?
E
And this is more of a question, so thanks. I mean, last week we were talking internally with the Rook team, and especially Blaine, and he was saying that there was an issue where, if you had a Ganesha cluster that you were able to stand up with Rook, and you basically had one of the Ganesha servers suffer a node failure, the cluster would actually be inoperable.
E
So
I
just
wanted
to
like
dig
in
and
see
if
this
was
a,
this
issue
is
being
tracked
somewhere
or
if
we
should
be,
you
know
trying
to
reproduce
that
or.
A
So you could essentially specify a placement that gave, like, more room for daemons than you actually put down, and then if one of them went down, we would move the failed daemon somewhere else, or we would just deploy another one somewhere else, and we also do some fencing and stuff around it. That's what the ranks and everything make sure of, that it's all good. So I think we did see the same issue where, if one of them goes down and you have more of them, it...
A
It
doesn't
work
like
he
doesn't
work
with
the
reduced
number
of
them
you
needed
to
have
all
of
them
up,
but
in
cepheum
our
solution
to
that
was
just
essentially
to
fail
over
the
broken
one
somewhere
else,
and
this
is
just
get
the
cluster
back
to
the
right
size
so
that
it
could
take
reads
and
writes
again.
A
Yeah, so the way it works in cephadm, with the placements and everything, is: say you had a label on, like, five hosts, like 'nfs' or something, and you tell it to put them on those hosts, but you also say you want a count of three. So we would pick three of the five hosts and put them there, and then, if one of the hosts goes down... or yeah, usually the NFS daemon itself shouldn't necessarily be failing, because we try to restart it anyway with systemd.
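
The placement A describes, written out as a cephadm service spec (the label and count values are illustrative):

    # 'label: nfs' selects the eligible hosts (say five carry the
    # label); 'count: 3' makes cephadm pick three of them. On a host
    # failure cephadm redeploys the daemon on another labelled host
    # and reassigns the ranks.
    nfs_spec_yaml = """
    service_type: nfs
    service_id: mynfs
    placement:
      label: nfs
      count: 3
    """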
A
But if one of the hosts goes down, then they're only going to have, like, two. Cephadm is running, though.
A
It will then pick one of the other of the five hosts and put an NFS daemon there instead, and then it'll change all of the ranks and everything so they all work again, and then they would become operable again. And at least in, like, a small cluster, where you don't have other things or operations in the way of what we do before we do the redeploys, it would seem like it would take a minute or two and then you'd be able to do reads and writes again. At least from when Mike was assessing it before, it seemed like it was working all right.
E
Makes sense. And are we aware whether this issue is still there on the Rook side?
A
I haven't been tracking it really on the Rook side; we were just concerned about the cephadm one when we were implementing it. But I guess, yeah, if Rook doesn't have any implementation for some sort of failover like that, then yeah: if one of the NFS hosts goes down, then I don't think you'd be able to read and write until someone does something to deploy another one somewhere else.
B
Like
ask
a
related
question
it.
What
is
it
about
the
nfs
protocol-
or,
I
guess
maybe
just
nfs
ganesha-
that
this
is
the
the
intended
behavior
of
the
cluster
like
in
yeah?
I
mean
like
with
ceph
with
pretty
much
any
clustered
kind
of
software.
You,
you
assume,
there
are
outages,
and
you
assume
that
they,
you
know
there
are
times
when
they
can't
be
brought
back
up
is
like.
Is
this
something
that
should
realistically
be
an
issue
brought
up
with,
like
the
anavest
ganesha
in
the
fest
county,
show
like
project.
A
I
thought
about
that
as
well.
I
don't
remember
why
I
feel
like
someone
asked
them,
but
I
don't
remember
what
they
said:
I'm
not
really
a
great
person
to
sort
of
answer
why
it's
like
that.
A
I
know
I'd
also
been
really
confused
when
we
found
out
that
it
worked
that
way
that
we
actually
couldn't
just
use
the
standby
ones
that
we
had
and
just
have
them
read
them
right.
There
yeah,
you
have
to
ask
somebody,
I
guess
knows
fs
a
little
bit
better,
I'm
not
sure
if
anyone
here
has
that
expertise
or
not.
D
I
mean
we
can
ask,
we
can
email
jeff
for
that
question.
I
think
all
of
them
are
active
actors,
so
we
have
ganesha
as
a
concept
of
standbys,
so
the
the
the
remaining
active
ones
are
put
in
grace
period
and
I
think
they're
waiting
on
the
replacement
demon
to
come
up
within
within
a
few
minutes.
So
if
that
doesn't
happen,
I
don't
know
why.
D
I
I
expect
the
grace
period
to
be
lifted
and
the
other
two
ganesha
servers
to
start
serving
the
existing
clients,
but
but
for
the
clients
that
were
connected
to
the
previous
server
that
went
down,
I'm
not
sure
what
happens
there.
So
we
can.
We
can
ask
jeff
that
question
because
he
architected
this
active,
active
ganesha
solution,
where
the
radar's
back
in
the
coordinates
the
grace
period,
radar,
spool
based
packet.
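
The grace database D mentions can be inspected with nfs-ganesha's ganesha-rados-grace tool; the pool and namespace values below are illustrative:

    import subprocess

    # Dump the RADOS-backed grace database ('dump' is a real
    # ganesha-rados-grace subcommand; pool/namespace are examples).
    subprocess.run(
        ['ganesha-rados-grace', '--pool', '.nfs', '--ns', 'mynfs', 'dump'],
        check=True,
    )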
D
Were you observing that the existing clients of the servers that didn't go down, even those clients, weren't being served after one of the hosts went down?
A
Is
that
a
question
I
think
there
was
yeah
one
of
them
on
the
rook
team,
but
you
guys
when
you
were
testing
it.
Did
you
see
like
if
you
kept
trying
to
write
to
these
nfs
servers
you
see
after,
like
the
grace
period
expired,
did
you
see
like
new
rights
or
raids?
Do
those
work
or
is
it.
B
So I know that it is something we have to be concerned about, but it's not something that I've gone out of my way to try to test, given that we're only focusing on single-NFS-server development.
A
Okay,
all
right,
I
don't
think
I'd
say
it
would
probably
be
mike
fritz,
actually
the
person
I
would.
I
would
ask
about
that,
but
he's
on
vacation
right
now.
I
think
for
the
next
week
or
two
he
was
one
who's
doing
a
lot
of
the
aha
nfs
testing
stuff.
A
I
don't
know
exactly
what
scenario
he
was
testing.
I
know
he
was
testing
with
with
reads
and
writes
that
were
in
progress
already
and
seeing
what
happened
with
those
ones
and
making
sure
those
would
eventually
continue
if
we
like
did
the
failover
stuff
back
on
earlier.
I
don't
know
if
he
tried
with
new
reads
and
rights
after
the
period
ended.
I
mean
there
was
no
failover.
I
don't
know
if
that
situation
ever
came
up.
E
Yep,
absolutely
no
and
thank
you
for
this
discussion
because
you
actually
made
me
think
on
a
different
direction
and
I
think
I'd
seen
some
of
jeff's
blogs
that
actually
talked
about
this
issue.
So
we
might
be
able
to
track
it
from
there
and
create
a
tracker
if
we
can't
do
if
we
don't
have
the
failover
bits
in
rook.
Yet.
B
Yeah, and could you also just include me if you find any resources like that? I'm also curious, yeah.
A
Yeah, I mean, I'd also be interested if they end up learning some stuff about how that works, because I don't know the details either. Pretty good for everyone, I think.
A
All right, so that was the last topic we have on the agenda right now for today. Does anyone have any other topics they want to bring up here?