From YouTube: Ceph Orchestrator Meeting 2022-01-25
A: The first one is our timeouts. We discussed this a bit before; we have that issue. I don't have the tracker linked — actually, is that the right tracker? Yeah, that is the right tracker, so the RBD tracker is linked in the other one as well. Take an example of this issue in the manager module: if any of these SSH commands, like a ceph-volume command or something, ends up hanging, then the entire serve loop hangs permanently, so you have to restart it.
A: So there was some discussion about introducing timeouts to our SSH commands, but we never finalized anything, because it ends up being a bit tricky. If you try to set, say, a global timeout, it's hard to find a good spot for it: if you make it too long, it doesn't really do very much.
A: If you have to wait 20 minutes for this extended timeout, then you just loop back to the serve loop and it does the same thing again, so it's almost never doing anything anyway. But you can't make it too short either, because some commands, like a deploy command that actually pulls an image at the start, are going to take a few minutes. So either we'd have to do the sort of thing where different commands have different timeouts, or we'd have to have a really long timeout and just say it's okay to have it.
A: Or raise a health warning and have it be idle most of the time — something, one of those options, I don't know. Does anyone have any thoughts on that one?
B: I just want to mention that, because it's not totally orthogonal to have global versus per-command — okay, it's a global default.
C: There was one thing, actually: if I remember correctly, there is a 15-minute timeout in the SSH protocol that's super hard to get rid of.
C: Yeah, here — okay, here I found it, if you can actually see this one.
C: "Offline host hangs the serve loop for 15 minutes" — found by Daniel half a year ago, and it's super hard to get rid of this specific problem. It's kind of a weird thing, because we are persisting the SSH connections in our SSH cache, and as soon as we try to reuse an existing open SSH connection to a host that is no longer there, we're going to hang for 15 minutes, and there is basically no way to avoid that 15-minute hang.
C: But as this problem is with the SSH protocol and with the SSH implementation, and not with the Python binding, I very much think that we are still prone to the very same problem.
C: Where we know that a host is offline, we should probably first reset the connection if we suspect that the host is offline; otherwise we end up in this 15-minute thing.
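
A minimal sketch of the idea being floated here — drop and reopen a cached SSH connection for a host we suspect is offline instead of reusing it. This is not cephadm's actual connection cache; the class and field names are hypothetical, and asyncssh is assumed as the SSH library.

```python
import asyncio
import asyncssh


class ConnCache:
    """Hypothetical per-host SSH connection cache."""

    def __init__(self) -> None:
        self._conns: dict[str, asyncssh.SSHClientConnection] = {}

    async def get(self, host: str, suspected_offline: bool) -> asyncssh.SSHClientConnection:
        conn = self._conns.get(host)
        if conn is not None and suspected_offline:
            # Drop the cached connection rather than reusing it and risking
            # the long hang on a host that silently went away.
            conn.abort()
            conn = None
        if conn is None:
            # Bound the reconnect attempt so an unreachable host fails fast.
            conn = await asyncio.wait_for(asyncssh.connect(host), timeout=10)
            self._conns[host] = conn
        return conn
```
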
A: That's one thing we were doing, actually — on offline hosts we'd always still run check-host.
C: I know that the agents have an open port. Can we just — if we know that there is an agent — and Adam, correct me if my information is outdated — can we just connect to that open port on the agent, if we know that the agent should be there, and only if we cannot connect to that host then...?
A: Yeah, we could. The reason I've been avoiding that is — I guess it's fine to try that first and call check-host when it fails. I was worried about relying on the agent being a stable thing, like, if it's there it'll be working. But if we're just saying that whenever it fails we'll call check-host normally, and otherwise we'll just ping it, I think that would at least guarantee it's online.
C: Like I said, that was the reason to remove it: it doesn't give you any information. Ping might fail just because the ping protocol is disabled by a firewall, so there.
A: I kind of meant to send a message to it — just implement that a little smaller, because right now it's actual updates and stuff you'd send there. Send something like an empty JSON and just see if we get something back, because it does respond. We can do that, and at least if that works we can guarantee it's online and we're fine, and if it doesn't work, maybe reset the connection and try check-host.
A: I mean, it could be a starting point, though. If there is an agent up, we can try to use its port to verify the host is online, and then if that fails for whatever reason and we don't get anything back, we can be a bit more cautious and try to reset the connections instead of just running a normal check-host.
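
A rough sketch of that fallback, assuming a plain TCP probe: the agent endpoint, port lookup, and message format here are hypothetical, not the real cephadm agent protocol, but it shows the shape of "send an empty JSON, see if anything comes back, otherwise reset and fall back to check-host."

```python
import asyncio
import json


async def host_reachable_via_agent(addr: str, agent_port: int, timeout: float = 5.0) -> bool:
    """Send a tiny payload to the agent's listening port and see if anything comes back."""
    try:
        reader, writer = await asyncio.wait_for(
            asyncio.open_connection(addr, agent_port), timeout)
        writer.write(json.dumps({}).encode())  # "an empty json or something"
        await asyncio.wait_for(writer.drain(), timeout)
        reply = await asyncio.wait_for(reader.read(1024), timeout)
        writer.close()
        return bool(reply)
    except (OSError, asyncio.TimeoutError):
        # No answer: be more cautious, reset the cached SSH connection and
        # run a normal check-host instead of assuming the host is fine.
        return False
```
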
A: That would be a good starting spot, yeah. I guess if we already have a timeout there, then introducing some other timeouts is probably not going to help us too much. What we could do is, if we could recognize that the timeout actually happened, we could raise a health warning — that's one thing we're not doing.
E: So I think this touches on the other subtlety of this issue, because there's the issue of detecting that the host is offline, but in the case of the RBD failure the host was online, just in a livelocked state. We were attempting to inventory devices and were blocked in an uninterruptible state, so the process would never return. So the connection was valid, but it was simply holding the global cephadm lock, and when something like this occurs there's no real good indication of what the orchestrator is doing on that host.
E: It just appears to be hung or stuck, and I think in those cases we need some sort of trigger to say, you know, this is the operation we're attempting to perform, and raise a health warning — but that's not occurring. I've seen a similar thing when image pulls fail in the background: they'll just silently fail over and over and over on an individual host without progressing in the serve loop, which creates the appearance that the manager is hung. But it's really not.
C: Just today I looked into the locking of cephadm, and it turns out that we block indefinitely trying to get hold of the cephadm global lock — indefinitely, there is no timeout.
B: This is a little bit more ambitious, maybe, or just my own ignorance: are these commands fully synchronous, as in, for thread X or for host X the system only does this one thing until the command returns with a response? Or are there any async components where it's like, I've started operation X on host Y but I haven't gotten a response yet?
C: I mean, internally we're using the asyncssh library. We don't lose control of the — we're not losing control. We could... I think — Melissa, do you know if the asyncssh library supports some kind of timeout when doing SSH calls?
F: Yeah, I think there is a timeout. If the timeout expires before the process exits, there's a timeout error that it can raise, and it also returns an error if the process exits with a non-zero status. So it can return a timeout error.
A: When I remember looking at this, there are two basic asyncssh calls we're doing: there's the actual connect call, and then there's the run command. I think the connect one is probably okay — it's probably the one that still has a timeout and works like that. I think it's the run command that has the opportunity to hang forever. I think I looked at the asyncssh documentation before, and the default is just no timeout for those, whenever we actually execute commands.
A: We assume we have a good connection, and even if the connection actually is good, it will just last forever. But I'm pretty sure you could put timeouts on that. I think, when I was looking at it before, you get to set up some sort of SSH config object, and you could pass it in there and it would do something like that.
F: Yeah, I think you have to specify the timeout if you want there to be one. I don't think we specified it, at least not in the config — if it even uses the SSH config for the timeout.
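
A minimal sketch of bounding a remote command, assuming asyncssh as discussed. asyncio.wait_for is used as the bound here (newer asyncssh versions also take a timeout argument on run() itself); the function name and the 300-second value are illustrative, not anything cephadm currently does.

```python
import asyncio
import asyncssh


async def run_with_timeout(host: str, cmd: str, timeout: float = 300.0) -> str:
    # Bound the connect separately so an unreachable host fails fast.
    conn = await asyncio.wait_for(asyncssh.connect(host), timeout=30)
    try:
        # If the remote command hangs (e.g. a stuck ceph-volume inventory),
        # this raises asyncio.TimeoutError instead of blocking the serve loop.
        result = await asyncio.wait_for(conn.run(cmd, check=True), timeout=timeout)
        return result.stdout
    finally:
        conn.close()
```
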
A: Yeah, I don't think we currently do. It would be nice if we had a timeout for those commands — it could be a pretty lenient timeout, because again there are some things that are slow — but if it could at least raise a health warning saying that it timed out and which command failed, it would be an improvement.
A: And if that works, we could just throw that onto the function that runs cephadm, put it there, and use that to test whether it works properly.
A: I think these run commands are the risky ones; right now we don't handle it at all.
A: I think this RBD issue is not the only time this has happened. I'm pretty sure this was what was happening on the Gibba cluster at one point: the serve loop seemed to be hanging, and there was one host that had some hardware issues, so the ceph-volume call was failing there, I'm pretty sure. And then downstream there was similar testing where there was a similar issue as well — again it was a ceph-volume command hanging.
A: So it seems like we have two options here. Obviously there's what Sebastian just linked — that timeout for run commands might work — and there's what was just posted there, which is that we actually have a global timeout arg in cephadm.
A: Yeah, I can invite you to that, or get you on the Ceph community calendar — I think that's where I normally find this one — after this.
A: Okay, yeah, we're discussing options — I'm looking more at a global level right now with the timeouts. So there are two options: asyncssh has a timeout option for run commands that could work, and cephadm also actually has a built-in timeout argument.
G: And — go ahead — just so I understand: the idea is to rely on SSH for the timeout?
A: Then we also have what Michael Fritch posted: the cephadm binary has a built-in timeout arg that we've already set up. So we'd include that in all of our cephadm commands as a base timeout. It's possible we'd do the same thing there too: let it time out and then raise a health warning.
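
For illustration, a sketch of that second option as the manager module might assemble the remote command — the idea being that cephadm's own timeout argument terminates the work on the host itself rather than only cutting the SSH session. The exact flag spelling and the 900-second default shown here are assumptions taken from this discussion, not confirmed values.

```python
# Hypothetical helper: prefix every remote cephadm invocation with the
# binary's built-in timeout argument.
def build_remote_cmd(subcommand: list, timeout_secs: int = 900) -> list:
    return ['cephadm', '--timeout', str(timeout_secs)] + list(subcommand)


# e.g. build_remote_cmd(['ceph-volume', 'inventory'])
#   -> ['cephadm', '--timeout', '900', 'ceph-volume', 'inventory']
```
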
B: I was going to say, one way to think about the problem too: I did not see in the documentation whether the asyncssh timeout terminates the command or not. So if the connection is killed but, say, the process is still running — maybe it's got some kernel-level locks from running LVM commands or whatever — it's still there running on the host, whereas possibly the timeout option to cephadm might actually be able to terminate it.
G: In general I think that, regardless of whether asyncssh supports the timeout or not, we should not rely on some underlying component — in this case asyncssh. If there is some timeout, it should come from cephadm and be propagated to the underlying component, to make it more explicit.
E: The second part of this question is: what's a reasonable timeout, and are we walking down the path of implementing a crash-loop backoff like Kubernetes? What is our strategy here?
A: My thought for the initial version of this was just to have a fairly long one, and we'd raise a health warning saying something is very wrong here. It wouldn't be great — obviously this would be a pretty slow timeout, whatever it is — but at least we'd actually raise a health warning after, say, 10-15 minutes, and then you know something's wrong. And we also wouldn't have a fully hung serve loop; we would still be able to do things other than whatever it was trying to do there.
A: That was my initial thought of what we would do.
G: Are you referring to the one from yesterday? Because I looked at it, and that bug was basically this: cephadm is stuck forever waiting for that lock. So that was probably one particular case, but we have more cases like this, so maybe it depends on each specific use case, and the timeout value will be different.
A: Yeah, I mean, we have another example case in the other pad — there's a link to a tracker issue.
B: Do we have a very rough approximation of what the longest-running success cases are? Is it five minutes? Is it 10 minutes? Is it 20 minutes? That kind of scale. It doesn't need to be super exact, but it can help create an upper bound on what you would want for a success case.
E: I think the challenge is that many of the operations are actually quite fast — within a few minutes — but we do know that inventorying devices, especially on a dense node with ceph-volume, can take quite a long time. So hardware variability — yeah, that's it.
G: I mean, not printing it, but if there is somehow a way to get at the progress — some events and feedback.
B: If we know it might take a long time, it might be better to kick off an async process and then poll the results, but that's a bigger problem than the timeout thing — it's somewhat orthogonal, so I'm still in favor of the timeout. It's just that further down the road we might consider making some of these long-running background tasks async — not at the level of asyncssh, but like creating a systemd job or something.
C: Yeah, that's feeding into my point — okay, great. The agent already does the ceph-volume inventory asynchronously, so from the cephadm manager module's perspective it is asynchronous.
A: Yeah, it's not on by default right now, so it's not getting as much use — some people aren't making use of it yet — but it is.
A: It gathers the ceph-volume inventory stuff and also the daemons on the host, because that's also a fairly slow command — not as slow, we're talking maybe 10 seconds or something, or longer on some, like 10-20 seconds — but still fairly slow as far as durations go.
A: Yeah, and I think we're all kind of agreed that we want to try this cephadm timeout option; we just need, I guess, a good reproducer to test this on.
E: Another option is we could artificially create a stuck lock — yeah, artificially hold the global lock.
A: But yeah, I think if we can get some way to reproduce that, and someone can test that this timeout actually works and it comes back, then we can implement the health warning based off of that — that'll be a good starting point at least. As for the actual value we set the timeout to, it sounds like the easy way to start is to set it pretty high by default.
A: Maybe we can try to collect some information on how long these commands actually take, but it's always going to be hard to get a max time on things like these inventory commands and these image pull commands.
A: Yeah, so I guess: go with a configurable timeout based on this cephadm timeout flag, assuming that all works — so test that — and then we raise a health warning, and as part of the health warning we include information about the option, in case somebody is on a cluster with really slow internet, or maybe they have too many devices or too many disks on a host and it takes too long. And I guess the default for that can be a bit lower — a few minutes, five minutes or so — and they can raise it if they need to.
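
A sketch of what that could look like in the manager module, assuming the standard MgrModule health-check and module-option interfaces; the option name, check name, and five-minute default are hypothetical placeholders drawn from this discussion, not existing cephadm settings.

```python
from mgr_module import MgrModule, Option


class Module(MgrModule):
    MODULE_OPTIONS = [
        # Configurable, fairly lenient default that users on slow hardware can raise.
        Option('cephadm_command_timeout', type='secs', default=300,
               desc='time a remote cephadm command may run before a health warning'),
    ]

    def warn_command_timeout(self, host: str, cmd: str) -> None:
        # Surface which command timed out on which host instead of hanging silently.
        self.set_health_checks({
            'CEPHADM_COMMAND_TIMEOUT': {
                'severity': 'warning',
                'summary': f'cephadm command timed out on host {host}',
                'count': 1,
                'detail': [f'{cmd!r} exceeded cephadm_command_timeout; '
                           'raise the option if this host is just slow'],
            }
        })
```
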
G: Just a question: normally cephadm is launched manually, and sometimes from the Ansible playbooks.
A: So there are a few Ansible playbooks that do very specific tasks: they do some pre-flight stuff, they install some things, they do a purge or remove the cluster. Most of the use of it is going to be from the cephadm manager module, where it deploys it and then runs individual commands with it.
A: Yeah, I wouldn't even worry about it right now if anyone runs into a problem with it. I mean, I don't know what you would do if the purge times out — it's just going to fail anyway, and they'd probably end up in a similar spot: they're going to tell us what's hanging.
A: That sounds good for that topic. I think we said we're going to have a configurable timeout using the cephadm timeout flag, and if that works we'll raise a health warning telling people to maybe raise the timeout, or that something's wrong.
A: If we have an agent, we can check its port: we can send it a quick message, and if we get an acknowledgement back, we know the host is online and we're all good. If it fails, then we're a bit more cautious: we try to reset our connection and then maybe do a normal check-host or something, only if we have to.
D: The question I had is: how quickly would this procedure detect the offline host? Because for the NFS service we wanted to detect that the NFS host failed within on the order of 30 seconds.
A: Or a minute. So if it works, and we reset the connection before running check-host, it's fairly fast. I think the biggest risk in this situation is if, say, the serve loop was already running and it called check-host first, before we got a chance to detect it when one of the agents comes back or anything from one of those threads — then that one would be on the normal timeout.
A: But I'm not sure what to do about that other than just not caching SSH connections, because there's always going to be a risk that it could go offline at the exact moment right before we run this. You never know exactly what's going to happen — for any SSH command we do, there's a risk it'll do that.
C: List-networks and cephadm device ls — having to create four different fresh SSH connections, one for each cephadm command, is extremely expensive, but even for—
A: Could we — if that's the only spot where we're doing a bunch of runs in a row, and I don't know if it is — could we batch it at the start of that and keep the cache through those four commands, then reset it at the end? Or would that still be too much of a performance drag?
A: Yeah — basically everything can be done with SSH, because the agent is still optional, but right now the agent is responsible for collecting metadata on the host. It collects the daemons and the inventory stuff and returns it back to the manager, and then while the agent is active we usually don't run those over SSH; we just avoid it.
A: Over SSH, yeah — but we still need SSH for certain things; we can't deploy daemons with the agents or anything, so.
A: I mean, the reason we're not doing it right now is just because it's not considered a stable, reliable component necessarily — yes, there are a lot of changes going in — yeah, okay. I think it would probably be at least another year, even if we were to get all of this feature parity in; we'd have to make sure it's stable before we'd actually consider getting rid of SSH for anything.
A: Yeah, so we need those two functionalities still there, and technically we could do everything else. But again, that's all just this initial setup stuff you should in theory only have to do once — and I guess during upgrades as well, because you'd have to go upgrade the agents and deploy them on a new binary or whatever — but these would be uncommon operations, little one-offs, so it wouldn't be a performance thing at all at that point.
A: Yeah, I guess we can try to look into that in the future and see if we can get some more feature parity with the agent too, so it can do what SSH can do minus those two things we're talking about. In the meantime, implement this timeout stuff and try to use the agent port for some offline host detection if we can. We still haven't necessarily solved the worst-case offline detection for now, but I'm not sure there is a great solution currently.
C: This is going to be my last orchestrator weekly.