From YouTube: Ceph Crimson/SeaStore 2021-05-19
A: So what do we have on there again now? I would definitely like to look into the cephadm agent a bit more, trying to figure out what we need to do, whether we need to do it at all, what the gains are, and then, if yes, how, and what it means for the architecture.
A: So yeah, the cephadm agent. The problem is that the reconciliation loop that we have in cephadm doesn't scale, right?
A: Anything we are going to do is bound by creating SSH connections, executing stuff on the host, and then going to the next host. I mean, we are doing it in parallel with ten hosts at a time, but it's still going to be slow, especially for the dashboard, for users that want to have up-to-date information. And if we want to have fast failover within a few seconds, then we are also going to need to improve here.
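A minimal sketch of the loop shape being described, assuming a caller-supplied `refresh_host` coroutine standing in for the real SSH work (the function name and batch size are illustrative, not cephadm's actual code):

```python
import asyncio
from typing import Awaitable, Callable

PARALLELISM = 10  # refresh a fixed batch of hosts at a time

async def reconcile(hosts: list[str],
                    refresh_host: Callable[[str], Awaitable[dict]]) -> dict:
    """One refresh pass over every host, at most PARALLELISM at a time.

    `refresh_host` stands in for the expensive part: open an SSH
    connection, execute commands on the host, collect the state.
    Even with a batch of 10, total time grows linearly with the host
    count: 1000 hosts / 10 at a time = 100 sequential rounds of SSH setup.
    """
    sem = asyncio.Semaphore(PARALLELISM)

    async def bounded(host: str) -> tuple[str, dict]:
        async with sem:
            return host, await refresh_host(host)

    return dict(await asyncio.gather(*(bounded(h) for h in hosts)))
```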
A: Yes, so where are the biggest gains that we can get, and what should the architecture look like? I know that the current architecture of this cephadm agent, or daemon, is to provide a simple API endpoint that is scraped by the cephadm manager module. But I do not think this is going to be the best way to leverage this cephadm agent, mainly because we are still going to have a reconciliation loop that goes to every host, and instead of creating an SSH connection to every host we are creating an HTTP or HTTPS connection to every host. So we are gaining a lot less, at least from an algorithmic-complexity perspective: it's still O(n).
A: …be much more performant, and then we can actually decide what we want to do with the cephadm agent or daemon.
B: I was just going to say, it'll be a little bit weird: it'll be unclear what the manager should do if it hasn't heard from the agent. Should it go and poke at it, to make sure it has an up-to-date manager address? Or maybe both push and pull work, so it can pull if it needs to, if it hasn't heard from it recently.
A: Yeah, in general the loop that we have right now is just that simple. In any case, we are gaining a lot of new failure modes. Right now in cephadm, the reconciliation is really dependent on the order of things, right? I guess.
B: My suggestion would be that we should be thinking ahead, so that we're not preventing ourselves from doing something later, but we should initially focus on making the reconciliation loop, just all the refresh parts.
B: First have the agent maintain all the node state (the current containers, the devices, and the facts) and just have that be the thing that's being pushed to the manager in a scalable way. Then the manager can still actually take action using the existing loop, which I think is okay: if you're deploying a thousand OSDs it'll be slow, but that's less of an issue, I guess. Yes, yeah.
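A rough sketch of the kind of per-host state the agent would aggregate and push; the field names and helper arguments are illustrative, not cephadm's actual schema:

```python
import json
import time
import urllib.request

def build_payload(hostname: str, daemons: list, devices: list, facts: dict) -> dict:
    # Illustrative shape only: the agent aggregates what the manager's
    # refresh loop currently gathers over SSH.
    return {
        "host": hostname,
        "timestamp": time.time(),  # lets the manager spot stale reports
        "daemons": daemons,        # e.g. the output of `cephadm ls`
        "devices": devices,        # e.g. a ceph-volume style inventory
        "facts": facts,            # e.g. `cephadm gather-facts` output
    }

def push(manager_url: str, payload: dict) -> None:
    # In practice this would be HTTPS with proper client authentication,
    # as discussed below; plain urllib keeps the sketch self-contained.
    req = urllib.request.Request(
        manager_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()
```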
A: First, let's imagine we had that race condition already. Imagine you did the reconciliation loop and you want to deploy a new manager on host one, and then we have a race between the agent and the cephadm manager module.
A: The next thing that happens is the agent goes over all the daemons and aggregates the information about which daemons are running on one specific host.
A: The next thing that happens is that the manager daemon deploys a new manager on host one and, finally, the agent pushes the information about its aggregated state into the manager module, and suddenly the manager module kind of forgets that it ever deployed a new manager on host one.
A: It doesn't help if the agent caches the information: then we have a race condition between the cache of the agent, then the manager deploys a new daemon, and then the agent…
C: But I think that probably this is just an implementation detail. Okay, it's something that is going to happen, and probably what we need to do is to avoid any kind of change in the manager when you have pending operations. Or you put in some way to signal that you are doing an operation and it's not possible to change the manager until this operation has finished. Or, for example, block the manager from taking new operations until this condition has finished. So I think that we have several ways to do that. I think that what is most important at this moment is to have a high-level view of what we need, what the responsibilities of the agent are, and what model is going to be used.
C: I think my vision is that having the cephadm daemon running on each host and communicating the information to the manager, this push model, not the pull model, is the right way to do that in order to reach scalability. And we are going to move this part of the complexity that we have now in the orchestrator into the daemon model, the cephadm daemon, because the cephadm daemon has, for example, the running daemons…
A: Yeah, but does it really simplify the orchestrator? I don't think so. It's going to be more complicated. I guess it's worth it, but we have to be super careful.
A: And that's why I really do not want to enable the current daemon implementation, because it's prone to that race, and I know we had that race already between…
B: Yeah, yeah. I mean, I think that particular race we can resolve with some variation of a Lamport clock, so that we know if the information that's being recorded is older than whatever. Yes, so I think we can set that aside and we'll get to it. When the agent does a push, is it just going to use the manager CLI, basically just issue a CLI command?
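A minimal sketch of the Lamport-clock idea, assuming a hypothetical per-host counter that the manager bumps on every change it makes to a host, so that agent reports carrying an older value can be discarded as stale:

```python
from dataclasses import dataclass, field

@dataclass
class HostClocks:
    """Hypothetical per-host logical clocks, a variation on a Lamport clock."""
    clocks: dict[str, int] = field(default_factory=dict)

    def on_manager_action(self, host: str) -> int:
        # The manager deployed/removed a daemon on `host`: advance the clock
        # and pass the new value along with the remote action.
        self.clocks[host] = self.clocks.get(host, 0) + 1
        return self.clocks[host]

    def accept_report(self, host: str, report_clock: int) -> bool:
        # Only accept an agent report that has observed the latest change;
        # anything older is stale and would "forget" a fresh deployment.
        return report_clock >= self.clocks.get(host, 0)
```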
A: We can make it an API endpoint, right, with proper client authentication; that would work.
B: Yeah, the other thing with that is that then it needs to run inside the container. I guess this is a dumb question, but this should run outside the container, right? Because… yes, stuff on the host, right. Yes, yes, yes. So it's probably going to be basically the cephadm binary; it'll be cephadm agent.
A: If you have all hosts simultaneously trying to push information to the manager, the manager is getting overloaded.
A: I think it's kind of easy to cope with that, right? If it turns out that the connection times out, then we have to back off for a random number of seconds or minutes and then…
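A minimal sketch of that randomized backoff (the parameters and function name are illustrative):

```python
import random
import time

def push_with_backoff(push_fn, max_attempts: int = 5,
                      base: float = 1.0, cap: float = 300.0) -> bool:
    """Retry `push_fn` with jittered exponential backoff, so that all the
    agents do not hammer the manager again at the same moment."""
    for attempt in range(max_attempts):
        try:
            push_fn()
            return True
        except (ConnectionError, TimeoutError):
            # Sleep a random amount up to the current ceiling ("full jitter").
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    return False
```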
C: Okay, so I think that something like the hardware is basically static. So if we just have the timestamp, in order to see whether we have very old information or not, we can deal with that. Maybe in the case of the daemons we need to think a little bit more, but basically it is: there is the list of daemons that must be running on this host, and these are the daemons that the host has communicated are running.
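A sketch of that desired-versus-reported comparison with a timestamp freshness gate; the report shape follows the illustrative payload above, and the threshold is arbitrary:

```python
import time

STALE_AFTER = 300.0  # seconds; an arbitrary freshness threshold

def diff_host(desired: set[str], report: dict) -> tuple[set[str], set[str]]:
    """Compare the daemons that should run on a host against what the
    agent last reported, trusting only sufficiently fresh reports."""
    if time.time() - report["timestamp"] > STALE_AFTER:
        raise RuntimeError("agent report is stale; fall back to a pull")
    running = set(report["daemons"])
    to_start = desired - running  # scheduled but not reported as running
    to_stop = running - desired   # reported as running but no longer wanted
    return to_start, to_stop
```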
A: Yeah, but still, it's prone to races, and having a solution that works 99 percent of the time is going to create a lot of headaches. If we have multiple monitors deployed on a single host, but you only want to have one monitor, and stuff like that, right? So we have to be sure that we are not prone to…
B: There'd have to be basically some coordination: whenever we call cephadm on a remote host, we'd have to pass along a timestamp that gets recorded somewhere on that local host, and the agent, whenever it does something, would have to record that, check it, adjust it, and so on. So, yeah.
B: I mean, I haven't looked at the current exporter code at all, but is there anything valuable there, or can we just rip it out and re-implement it?
B: Because, I mean, one of the things that I keep putting on my list and meaning to fix, and then going and looking at it and then not doing it: right now there's a list-networks command that I rely on to get information about ethernet interfaces and subnets and stuff, and then there's also a gather-facts command that was part of the exporter, I think, but I can't actually tell what uses it.
A: Yeah, Paul, was it? They're pretty demanding when it comes to introducing a new way of doing the same thing, kind of, yeah.
A: And now we can actually leverage a lot of things: the integration with systemd is good, so all the basic stuff is there and we can deliver it. So we have a head start when it comes to introducing the push model.
A: gather-facts actually ends up in the manager module already, so we just have to expose it; it's there. And we should probably add it to ceph orch host ls, I guess, but other than that, it's already there.
A: I think with a push model, when we have a failed daemon, we could achieve failover within a few seconds. Yep.
A: I mean, it's pretty much solved by systemd already: if we have a daemon that's failing constantly, then the systemd unit is going to be in an error state.
B: Oh, this all predates systemd by many…
B: But also, it wasn't necessarily just related to a single daemon restarting. I guess it did, but…
B: So I wonder, the tasks would basically be: add an endpoint…
A: No, no, no container. We can't put it into a container, because at some point it would be great to have the capability to actually deploy daemons on that host with the agent, and if we want to do that, we can't put it into a container.
B: Yeah, I mean, I think the agent will have to run the CLI inside the container, like a shell, but the agent itself will actually be running on the host, yeah.
B: I mean, if we make the agent check in at regular intervals, and then, if it doesn't check in, we do a pull attempt, and if the pull fails, then we mark the host offline, which is basically, I think, more or less what it does now, right? If we fail the pull and we hit…
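A sketch of that push-first, pull-fallback liveness check (the interval, names, and probe callable are illustrative):

```python
import time
from typing import Callable

CHECKIN_INTERVAL = 60.0  # seconds; illustrative

def host_status(last_seen: float, pull_host: Callable[[], None]) -> str:
    """Push first, pull as a fallback, and only then mark the host offline."""
    if time.time() - last_seen < 2 * CHECKIN_INTERVAL:
        return "online"        # the agent checked in recently
    try:
        pull_host()            # active probe of the host (e.g. SSH or HTTP)
        return "online"        # reachable, but the agent's push is lagging
    except (ConnectionError, TimeoutError):
        return "offline"
```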
B: And the deploy process could be… well, down here I said that one of the tasks would just basically be a CLI command that will explicitly, imperatively run the gather-facts code and phone it home.
B: So that could be the thing that the pull (not the push) does. Or it could do that, I guess, but it's still a little different.
B: I mean, I guess the failure event that we'd be worrying about is if, for some reason, the endpoint isn't responding: there's a firewall, or something is blocking you from being able to post to whatever the CherryPy endpoint, or whatever it's going to be.
A: If someone has a broken firewall configuration, then as soon as we have a manager failover, it can suddenly be that the agent can no longer access the new manager, because the firewall rule only provides access to that single manager.
A: Imagine we have a five-minute timeout, where agents need to push information every five minutes, and we have a load spike on the monitor that prevents the manager from accepting connections for a period of maybe five to six minutes; at that point we are marking all hosts as offline.
C: I think that means having some kind of control over the load of the endpoint, in order to see whether you are in a situation of saturation or not. And depending on that, just checking: if you are saturated, maybe it makes no sense to try to connect to the different hosts to see if they are alive or not, because you are in a situation where you know that you are not processing requests.
B: I mean, I don't know what library we use for the REST endpoint, but we can make it only accept a single connection at a time and just have them retry, wait, or whatever, so that'll throttle it a little bit.
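If that endpoint were CherryPy (which comes up later in the discussion), one way to sketch that throttling is at the server level; the handler and config values here are illustrative, not the actual cephadm endpoint:

```python
import cherrypy

class AgentIngest:
    @cherrypy.expose
    @cherrypy.tools.json_in()
    def report(self):
        payload = cherrypy.request.json  # the agent's pushed host state
        # ... hand the payload off to the manager's host-state cache ...
        return "ok"

# Throttle at the server level: a single worker thread plus a small accept
# queue, so surplus agents queue briefly or fail and retry with backoff.
cherrypy.config.update({
    "server.thread_pool": 1,
    "server.socket_queue_size": 10,
})
cherrypy.quickstart(AgentIngest())
```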
A: Should we make it so that, instead of a version, simply said, we use the hash?
A: Depending on your definition of quicker, yes.
B: We'll probably have a similar issue to the one we had before, where it's not just the version that's running, but also which version is deployed.
C: I think that we can avoid a lot of problems if we are very strict with the information that we are passing, and keep this information in the same version; then it doesn't matter too much what the version of the sender is. Maybe, but…
C: Well, we have a list of tasks, very high-level tasks, so it could be nice, at least for me: I need more explanation of these tasks in order to be clearer about what the things are that we need to do. And I think that maybe it is good to try to assign these tasks to different people on the team, and let's see when we can start with that.
C: Okay, maybe. I think that Sage or Sebastian, you are the best people to clarify the details of the tasks. So, well, I think that maybe we can try to start this assignment and see if we can start to work on that.
B: I have a question about the REST endpoint part. I seem to remember there's some weird thing where we're using CherryPy both for Prometheus and for the dashboard, or something like that, and it doesn't like having two instances of the… like, some static variables or something, yeah.
A: To avoid it, yes, we should…
A: Does it actually have to live in the manager, or can the cephadm daemon live in its own container?
A: …the manager, yeah. We're still thinking, I think. Really, we are close to the top of the hour.