From YouTube: Ceph Orchestrator Meeting: 2021-12-07
A: We can have a look at the agenda. Okay, we don't have anyone from the Rook team here, so we are not going to have any Rook status update, but I don't think that's going to be super interesting. I think Travis gave a status update last week in the Rook weekly call, but you said you're up to date, right? Up to speed.
D: So it's going to be completely different from what we want, what we need? Okay, so it was decided last week to go with a new operator. We are going to start from scratch, and our intention is to present the operator to the TopoLVM project in a couple of weeks. Okay, the situation at this moment is to start with the basic functionality that is required for SNO systems, okay, for single-node deployments.
D: That is a very simple configuration, but there is a definition of the cluster, the LVMCluster CRD, where we can include devices, a selection of these devices, and a selection of the nodes where the devices are going to be used to create persistent physical volumes, and after that, with a storage class, we use them as persistent volumes in Kubernetes with the operator.
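
A rough sketch of the kind of custom resource being described here, selecting devices and nodes and exposing a storage class on top. The API group, version and field names are purely illustrative assumptions (the operator did not exist yet at this point), written as a Python dict mirroring the YAML shape:

```python
# Illustrative only: hypothetical shape of an LVMCluster-style CR; none of these
# field names are taken from a shipped operator.
lvm_cluster = {
    "apiVersion": "lvm.example.io/v1alpha1",      # hypothetical group/version
    "kind": "LVMCluster",
    "metadata": {"name": "single-node-lvm"},
    "spec": {
        "deviceClasses": [
            {
                "name": "nvme",
                # nodes whose devices may be consumed
                "nodeSelector": {"kubernetes.io/hostname": "sno-node-0"},
                # devices on those nodes that become LVM physical volumes
                "deviceSelector": {"paths": ["/dev/nvme0n1", "/dev/nvme1n1"]},
                # a storage class is layered on top so PVCs can request LVs
                "storageClassName": "lvm-nvme",
            }
        ]
    },
}

if __name__ == "__main__":
    import json
    print(json.dumps(lvm_cluster, indent=2))
```
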
D: So this is the current state. Okay, it would be nice to collaborate with Cybozu, that is the company where most of the developers working on TopoLVM are, okay, it would be nice, but I think that is almost impossible, because the current operator using TopoLVM is going to follow a completely different path from what we need. Okay, so we are going to start from scratch.
D: Yeah, it's very difficult, it's very difficult just to try to mimic the functionality that we already have; okay, in order to create OSDs a lot of things need to be done, and there are a lot more moving pieces. Okay, so it's difficult; it's not as simple as just making the changes in the operator.
A: I don't think we have access to the Pawsey cluster anymore. I think that was taken back over by the Pawsey folks, and they're going to, you know, build their production Ceph cluster. I think they told us that they're going to use ceph-ansible and Pacific, at least if I remember correctly, that they're going to use ceph-ansible on Pacific, sprinkling in their custom scripts for things that are too slow in Ansible, yep. Here we see how slow ceph-ansible is.
A: So you've... you set it up already?
B: Yep, yeah, it's 40 nodes, a thousand of these.
B: It takes that one NVMe, carves it up into 25 LVs and then uses the NVMe loopback on top of that, so they look like really small NVMes, basically. But the system is up and seems to be working. It's nice and stable.
B: I think we should focus on trying to stress it. Probably the main thing that tripped up the Pawsey stuff is when we had hosts down, so I don't actually know what the command is to make IPMI or whatever it is turn nodes off and on, but those hooks exist. We'd have to check with David, but we should learn what that is and then try turning nodes off, and then make sure that the system is well behaved from a user-experience point of view, make sure it can still schedule when hosts are down.
A: How long did it take to set up a thousand OSDs? Do you remember?
B: I only realized after the fact that for the first two days or so it wasn't using the agent; I didn't have the agent turned on. So now the agent's on.
B: I didn't realize that at the time. It might be a good opportunity to go and test the OSD drain, like the zap option: take a host, drain all the OSDs on one host, hit the zap thing, and make sure it goes and correctly zaps them and then reprovisions them. There is a drive spec or whatever in there that will use all devices or whatever.
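
For reference, the test described above roughly maps onto existing orchestrator commands; a minimal sketch (hostname and device path are placeholders):

```python
import subprocess

HOST = "host01"           # placeholder hostname
DEVICE = "/dev/nvme0n1"   # placeholder device path

def ceph(*args):
    """Run a ceph CLI command and return its stdout."""
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout

# Drain every daemon, including OSDs, off the host.
print(ceph("orch", "host", "drain", HOST))

# Watch the OSD removal queue until it is empty.
print(ceph("orch", "osd", "rm", "status"))

# Zap a device so the drive-group/OSD spec can reprovision it.
print(ceph("orch", "device", "zap", HOST, DEVICE, "--force"))
```
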
B: So it is configured right now to just turn every SSD into an OSD. So that's pretty easy.
B: Okay, yeah.
A: Did you encounter things where cephadm was just stuck and doing nothing?
B: So that's good. The monitoring stack is deployed, but I haven't tried logging into the dashboard or anything. So I think this is a perfect environment for Ernesto, though, to play with the dashboard. I think the scale is large enough that it should exercise pagination and all that stuff pretty well.
B: Yeah, I mean the main takeaway, or not the main takeaway but the last big item that was affecting the Pawsey thing, was where the manager kept freezing.
B: I did eventually track that down to the progress module, and there's a pull request that is open right now, going through testing, that makes that much more efficient, but I haven't tested it on this gibba cluster yet. Basically, as soon as there are PGs that are peering or whatever, so that there's one of those global recovery events going on, then every five seconds the progress module will do a pg dump, which was an obscene amount of metadata about every single PG, of which like three fields are being used.
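
To put the cost in perspective: a full pg dump returns per-PG metadata for every PG, while a consumer like this only needs a couple of fields. A minimal sketch of keeping just those fields (the 'pg_map'/'pg_stats'/'pgid'/'state' keys follow the usual `ceph pg dump --format json` layout, but treat them as assumptions for your version):

```python
import json
import subprocess

def pg_summary():
    """Keep only the handful of per-PG fields actually needed, instead of
    holding the whole pg dump blob every five seconds."""
    raw = subprocess.run(["ceph", "pg", "dump", "--format", "json"],
                         capture_output=True, text=True, check=True).stdout
    dump = json.loads(raw)
    stats = dump.get("pg_map", {}).get("pg_stats", [])
    return [{"pgid": s.get("pgid"), "state": s.get("state")} for s in stats]

if __name__ == "__main__":
    pgs = pg_summary()
    recovering = [p for p in pgs if "recover" in (p["state"] or "")]
    print(f"{len(recovering)} of {len(pgs)} PGs currently in recovery")
```
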
A: What really strikes me is that we have seen problems in the progress module for about two years now, and things are popping up again and again and again.
B: I think we need to restructure it a little bit, so that there is the progress infrastructure that allows other modules to register progress events as they see fit, and then the part of the progress module that tries to generate events for RADOS recovery and the like, because it's that RADOS-recovery part, trying to understand what's going on, that's really problematic and slow, and the overall infrastructure is, I think, mostly fine.
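
A minimal sketch of the split being proposed: a cheap registry that other modules push events into, kept apart from the expensive piece that inspects recovery state and synthesizes events. All class and method names here are hypothetical, not the existing mgr module API:

```python
import time
import uuid

class ProgressRegistry:
    """Hypothetical 'infrastructure' half: stores only what callers report."""
    def __init__(self):
        self.events = {}

    def register(self, message):
        ev_id = str(uuid.uuid4())
        self.events[ev_id] = {"message": message, "progress": 0.0,
                              "started": time.time()}
        return ev_id

    def update(self, ev_id, progress):
        self.events[ev_id]["progress"] = progress

    def complete(self, ev_id):
        self.events.pop(ev_id, None)

class RecoveryEventGenerator:
    """Hypothetical 'expensive' half: derives events from PG/recovery state,
    kept separate so its cost cannot stall the registry."""
    def __init__(self, registry):
        self.registry = registry
        self.event_id = None

    def tick(self, pgs_recovering, pgs_total):
        if pgs_recovering and self.event_id is None:
            self.event_id = self.registry.register("Global recovery event")
        if self.event_id is not None:
            self.registry.update(self.event_id, 1.0 - pgs_recovering / pgs_total)
            if pgs_recovering == 0:
                self.registry.complete(self.event_id)
                self.event_id = None
```
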
B: Yeah, I agree. I mean, the way it's written is kind of weird: the module is providing the service to other modules and then passing through to the monitor for status. It feels like it should be part of the overall module environment or interface or whatever, but yeah. The problem is that the progress module combines both the infrastructure and the CPU-intensive attempts to generate events, yep.
B: That, and Prometheus: I think the Prometheus module was probably... I think that was also causing some problems on the Pawsey cluster, I'm guessing. At least, that's what we thought was happening, so I think...
B: Turn Paul loose on this system and try to stress the Prometheus stuff as much as possible, so probably increase the rate at which we have things reporting in, maybe, or make sure that there's extra scraping information, I don't know, whatever it is; find some way to measure how well or how poorly Prometheus is behaving. Let's see if we can push it over the edge.
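
One cheap way to measure "how well or how poorly Prometheus is behaving" is to time the mgr exporter endpoint directly; a small sketch, assuming the default exporter port 9283 and a placeholder hostname:

```python
import time
import urllib.request

METRICS_URL = "http://mgr-host.example:9283/metrics"   # placeholder host

def time_scrape(url=METRICS_URL):
    """Time one scrape of the mgr prometheus exporter and report payload size."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=60) as resp:
        body = resp.read()
    return time.monotonic() - start, len(body)

if __name__ == "__main__":
    for _ in range(5):
        seconds, size = time_scrape()
        print(f"scrape took {seconds:.2f}s, {size / 1024:.0f} KiB")
        time.sleep(15)   # a typical scrape interval
```
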
A: But just trying to fetch the information out of the progress module and verifying that it doesn't take over 100 seconds, or that it returns proper data every few seconds instead of...
B: I think somebody still needs to go through this document basically item by item and follow up, either with a tracker ticket or whatever; basically translate that into cards on the Trello board or something, so that we actually make sure that we follow up on all the issues.
C: And I've also asked Sebastian, you know, what are the things that are going to affect products at scale. What do we expect from this, from a usability perspective, from a customer perspective? What is acceptable? Where are they going to have angst, from the point of view of timings or usability and that kind of stuff? So I'm not really sure, based on everything I've read, where that is.
C: Is that a fair statement? Or, you know, I'm just not sure, reading all of it. It's all good stuff, we learned a lot, but what does it really mean at the end of it all, with all the fixes?
B: Yeah, I mean, somebody needs to come up with a narrative around it. I think we should do a blog post that basically summarizes what happened: Pawsey gave us access, thank you very much, blah blah blah, what the environment looked like, and then what we found. I would start with that big list of all the tracker tickets and try to organize that a little bit.
A: Anyone raising a hand?
A: We have to talk about what's going on in cephadm when a host goes offline.
A: Yes, I think the behavior right now is that it marks the host as offline.
A: But I think we keep retrying; we keep running the check-host functionality, but everything else, I think, is skipped if the host is offline, so we are waiting for check-host to succeed in order to mark the host online again, and as soon as it is online again, every other feature is going to be enabled.
A: That's not implemented in cephadm at this point. Is anyone an expert in how Kubernetes does it?
F: I might be able to answer the question. I also was distracted by someone messaging me, so I missed the initial discussion, sorry.
F: That's a good question. I don't know the details of it super well. I think it does heartbeating between the kubelet daemons that run on each host, and if a kubelet daemon can't be reached, Kubernetes assumes that it's down.
F: Yes, yeah, so there are definitely things that can happen in Kubernetes failure scenarios where there's a network partition but not a hardware failure, where an application might be running in the same, quote-unquote, cluster twice, if it's still running on the node that's partitioned or the nodes that are partitioned.
F: I'm not really sure how Kubernetes handles the scenario of coming back online once the partition is resolved. I assume that it just goes through, sees that whatever resource created it has more running things than it should, and removes one of them; I'm not sure how it prioritizes which one.
F: I don't think so. I think Kubernetes assumes that applications follow the twelve-factor application model, is that what it's called, that applications are stateless, and yeah. So any fencing that needs to happen in Kubernetes, I think, has to be orchestrated by the application itself or some sort of coordinating application.
G: There is a timeout, around five minutes after the node is considered down, where it will start moving pods on its own, but there's definitely no fencing or limiting of applications.
F: Is this on the topic of NFS specifically?
A: To evacuate a host in order to have failover for NFS, which is not implemented in cephadm; it was briefly discussed at the standup a few days ago, yeah. But if we want to implement NFS failover properly when a host goes offline, we really should think about how we want to deal with that problem in general and not just for NFS.
F: Yeah, I think that's a good point. I've been looking at NFS failover, that's what I did with my day of learning, and I don't know that I have any really great takeaways other than it's pretty complicated in Kubernetes. It definitely needs some sort of application to...
B: It comes down to... that's basically the equivalent of the OSD down-out interval, where in RADOS, after the OSD is down for a certain amount of time, we mark it out and trigger a rebalance. Similarly in cephadm, if a host is not reporting, we don't necessarily want to instantly do something; there needs to be some time period where the host has been down for more than 20 or 60 seconds or something before we start rescheduling stuff.
B: But I think it's easiest to think of it on a per-daemon-type basis and decide what it should do, because, for example, if a manager is on a host that went down, we can instantly reschedule it, right? There's basically this whole category where the manager, RGW and the MDS are all kind of stateless.
B: So if that's the case, you'd have a down-to-reschedule interval or whatever.
B: For two reasons: because there's a service, so for all the services that the manager provides we have a Kubernetes Service that routes for that port, and it internally implements this load-balancer thing. Previously the problem was that the standby managers and the active manager all answer, it's just that the standbys send a redirect, and so you'd load-balance between the real thing and a redirect to an internal IP that would break, and so that wouldn't work, and so we basically could only go to the one.
B: Nowadays we have an option where you can turn off that redirect behavior, so that the standby manager won't do anything, so maybe now for Rook we could run multiple managers, if the load balancer is smart enough that the Service only routes traffic to the one that happens to be answering on that port. But I'm not sure if that works, right? We don't know if we've ever verified that it behaves the way we want it to.
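
If memory serves, the "turn off that redirect" option refers to the dashboard module's standby behaviour setting; treat the exact option name as an assumption and verify it against the running version:

```python
import subprocess

# With "error" instead of the default "redirect", standby managers answer with an
# error page rather than redirecting to the active manager's internal IP, which
# plays better with an external load balancer or Kubernetes Service.
subprocess.run(
    ["ceph", "config", "set", "mgr", "mgr/dashboard/standby_behaviour", "error"],
    check=True,
)
```
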
B: When do we reschedule? I think right now it's as soon as the host goes down, or maybe, I don't know. It seems like if we have a concept of what the interval is before we reschedule, it feels like that's...
B: Like, it could be that the monitor interval is one day or something: if it's down for a day and it still doesn't come back up, then we go create a new monitor and update ceph.conf everywhere, something like that. Usually we aren't in a big hurry there, whereas for other things, like NFS for example, we want to make it like 20 or 30 seconds.
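
A minimal sketch of the scheme being described: gate rescheduling per daemon type on how long the host has been down, with the thresholds below taken as illustrative values from this conversation rather than real defaults:

```python
import time

RESCHEDULE_AFTER = {
    "mgr": 0,            # stateless: move immediately
    "rgw": 0,
    "mds": 0,
    "nfs": 30,           # fail over within tens of seconds
    "mon": 24 * 3600,    # only re-create a mon after a day of downtime
}

def should_reschedule(daemon_type, host_down_since, now=None):
    """True once the host has been down longer than the daemon's threshold."""
    now = time.time() if now is None else now
    threshold = RESCHEDULE_AFTER.get(daemon_type, 600)   # default: 10 minutes
    return (now - host_down_since) >= threshold

# A mon on a host that went down 90 seconds ago stays put, while an NFS daemon
# on the same host is already eligible to move.
down_since = time.time() - 90
assert not should_reschedule("mon", down_since)
assert should_reschedule("nfs", down_since)
```
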
A: But we have to be sure that in case of flaky hosts we are not ending up in a thrashing situation again, where we are switching over between hosts so fast that we are just constantly rescheduling things around, yeah.
B: The way that we dealt with that in RADOS, we had the same problems, like back in Bobtail we had thrashing OSDs going up and down and up and down, and so we basically have a sort of backoff, so that if an OSD keeps getting marked down, every successive time it tries to come back up it has to wait longer, and so there's sort of a backoff dampening effect.
B: And I mean, the way that the scheduling works, it's sort of sticky, so that if it's just one host that's flapping, the first time it flaps we'll reschedule everything somewhere else, and it can continue to flap, but nothing's going to happen, because everything's just going to stay, or all the other services are going to stay where they were.
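
A small sketch of the backoff/dampening idea described above, where each successive flap makes the host wait longer before it is trusted with daemons again; the numbers are illustrative, not anything cephadm implements today:

```python
class FlapDamper:
    def __init__(self, base=30.0, factor=2.0, cap=3600.0):
        self.base, self.factor, self.cap = base, factor, cap
        self.flaps = {}   # host -> consecutive flap count

    def record_flap(self, host):
        self.flaps[host] = self.flaps.get(host, 0) + 1

    def record_stable(self, host):
        self.flaps.pop(host, None)

    def holdoff(self, host):
        """Seconds the host must stay up before scheduling daemons there again."""
        n = self.flaps.get(host, 0)
        if n == 0:
            return 0.0
        return min(self.base * self.factor ** (n - 1), self.cap)

damper = FlapDamper()
for _ in range(4):
    damper.record_flap("host01")
print(damper.holdoff("host01"))   # 240.0 seconds after four consecutive flaps
```
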
B: I think it's just a matter of changing the scheduling code so that it's not just whether it's down, but how long it's been down, and so there's a threshold that gets crossed, yep, that part. And then the second part, which is like the extra dimension of all this, is: what about maintenance mode?
B: What does maintenance mode do? Because I don't know that maintenance mode is any different from any of these, right? Like, if you have a node that has a manager and you put it in maintenance mode, we should reschedule immediately, right? There's no...
B: ...no reason not to move a manager if it's in maintenance mode, and if it has a monitor then probably we should wait a day. I guess I'm wondering if there is any real difference between maintenance mode and anything else at all.
B: We don't actually have to do anything in cephadm, like removing things; it's just cleanup, I guess.
B: I have a slightly different way to look at this. The huge pain in the butt that I had with Pawsey was that there was a host that was offline, and it was just constant noise in the log, and everything was unresponsive, and it was just annoying until I just removed the node. I couldn't mark it or put it in maintenance, though; the command wouldn't let me, and so I ended up just having to remove it.
B: Basically, I think the fix is to make maintenance mode work; it's the equivalent of marking a host down or telling cephadm to stop touching it.
A: I think it stops the systemd units on the host via cephadm.
A: Which it can't do if the host is offline.
A: We still have to refactor the maintenance mode, because the maintenance mode and the offline status are overriding each other, yeah.
B: Yeah, it seems like it should basically force it into an offline state, so we don't touch the host, and there should be an option for whether or not to stop everything first, or to just update our state and stop touching the host.
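
For reference, the existing maintenance commands look like the sketch below (the hostname is a placeholder); the discussed "don't try to stop anything first" behaviour would need something like a skip-stop option, which is hypothetical and does not exist at this point:

```python
import subprocess

HOST = "host01"   # placeholder hostname

def ceph(*args):
    subprocess.run(["ceph", *args], check=True)

# Existing commands: stop the daemons and flag the host so cephadm leaves it
# alone, then bring it back out of maintenance.
ceph("orch", "host", "maintenance", "enter", HOST, "--force")
ceph("orch", "host", "maintenance", "exit", HOST)

# Hypothetical: entering maintenance for an already-offline host without trying
# to stop anything first would need a new flag; no such option exists today.
```
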
A: Adam, yeah, exactly. Adam, if we have an agent on a host, do we still regularly run check-host on that host?
E: I don't remember what the triggers are for check-host. I know if a host goes offline it's going to keep doing that. I think it does it anyway on a regular interval, so I think check-host runs regardless.
E: But as for the maintenance stuff with the agent, yeah, if we have this thing where we skip stopping everything, then the agent would still be reporting if you put it in maintenance mode like that. But the way I was sort of thinking of this maintenance mode stuff with skipping the stopping is that you'd only really want to use that if the host was already sort of offline or something was wrong with it, because if it wasn't, why not just stop everything anyway?
E: So if it was already offline, then it just wouldn't be reporting anyway, so I don't think it should really be a big deal; and if you decide to skip stopping everything, then it'll keep reporting, but I don't see why that would really matter.
B: Otherwise we don't get any updated stats until you go and touch every node, which is maybe not the best way to do it. I had a couple of ideas there. One option is that the new manager pokes all the agents right away; because all those agents have a port that they're listening on, as long as they record that state somewhere, then when the new manager comes up it could just do a really quick REST query or whatever to all of them and just give them the new reporting IP.
B: So that's maybe faster, but it's still the manager that has to go iterate over all of them and poke them. Another option would be for the agent to ask the mon for the active manager if it hasn't...
B: Like, if the agent hasn't been able to phone in, it'll just do a ceph CLI command to ask for whatever the new address is, and that could be something like ceph mgr services, something like that; there could just be some metadata in the manager map or whatever that indicates what the right reporting IPs are.
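
A sketch of that fallback: `ceph mgr services` already returns a JSON map of module name to URL, so the agent could ask the cluster where to report; the "cephadm" key below is hypothetical, since the agent endpoint would first have to be published there:

```python
import json
import subprocess

def find_agent_endpoint():
    """Ask the cluster for the active manager's service endpoints and pick out
    the (hypothetical) cephadm agent endpoint."""
    out = subprocess.run(["ceph", "mgr", "services"],
                         capture_output=True, text=True, check=True).stdout
    services = json.loads(out)
    return services.get("cephadm")   # e.g. "https://mgr-host:7150/" (illustrative)

if __name__ == "__main__":
    print("reporting endpoint:", find_agent_endpoint())
```
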
B: For this, I mean, it needs mon access or something similar to that. I mean, we can give it a capability for that; it would need, yeah, it would need a cephx key.
B: Maybe that isn't the best option, but I guess the nice thing about it is that after the manager fails over it wouldn't necessarily have to do any work; it would just wait a few seconds and they'd just start reporting in again. Then we would only have to go redeploy agents if...
E: And the other one, I think the first option, would also probably be fairly fast, because when we do the poking thing it usually starts up its own little thread that goes and does that, so I think you could just start up a bunch of threads really fast, they'd all go, so it might be fast enough, and also it wouldn't have to wait for the responses or whatever.
A: Don't the standby managers redirect to the active manager? They do, don't they, redirect to the active manager?
A: But at least we could send the agent all the known managers at a time, and then we could say...
B: It doesn't quite work like that. The way that the normal manager works is that every manager basically starts as what's called a standby manager, which is basically a client of the monitor, subscribing to cluster updates or whatever, and based on whoever gets elected as the active manager, it starts up the whole manager infrastructure, and that whole manager is the thing that binds to its own endpoints and ports and all that stuff, and so you don't know how to talk to it, at least using that protocol.
B: Yes, so they do do that, and on the standbys only certain modules bind to ports, but they bind to generally a fixed port, which is why managers normally have to run on separate hosts. If you run them on the same host, then you have to disable that standby stuff, because otherwise they'll have a port conflict. But even then, we made it so that the agent stuff doesn't bind to a fixed port.
B: It binds to a dynamic port so that you could have multiple clusters with that, because right now, if you run multiple clusters on the same host, you have to be careful that they don't both have active managers, or the manager services are all configured on custom ports, which is what vstart does, but it's a little bit tedious, and that seems like too much of a lift to require just for the agent to work at all.
E: Yeah, we could try something like that, and then we just have to... right now it has a dependency on the IP of the active manager, so I guess we have to change it to be a dependency on the list of all the manager IPs, because as you move where the standby managers are as well, you still have to tell it where all the different managers are, yeah.
B: And I guess that would mean that the standby cephadm modules would also need to try to bind to a port, and would need to share that port with the active manager, so that those could be shared with agents. It seems like one of those things that will help, but is it necessary? It's like an optimization, because the standby manager isn't always going to be online in time to go do this or whatever. So this is only...
A: Do the agents start an open port at this time, or are they just clients? I think they're just clients, right, Adam?
E: They have a port, they listen on a port, because the agent has to be able to know when... so it has a counter. It uses sort of a pseudo Lamport clock, and each update bumps the number, so that you can tell whether the metadata that comes back is up to date. So we had to be able to implement that, yep, perfect.
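
A minimal sketch of that pseudo-Lamport-clock idea: every report carries a counter that only moves forward, so stale metadata can be told apart from fresh metadata regardless of when it arrives; the names are illustrative, not the actual agent code:

```python
class AgentMetadata:
    """Agent side: bump the counter with every new report."""
    def __init__(self):
        self.counter = 0

    def publish(self, payload):
        self.counter += 1
        return self.counter, payload

class ManagerView:
    """Manager side: ignore anything older than what was already received."""
    def __init__(self):
        self.last_counter = -1
        self.payload = {}

    def receive(self, counter, payload):
        if counter <= self.last_counter:
            return False           # stale or duplicate report
        self.last_counter, self.payload = counter, payload
        return True

agent, mgr = AgentMetadata(), ManagerView()
first = agent.publish({"daemons": 10})
second = agent.publish({"daemons": 11})
assert mgr.receive(*second)        # newest report accepted
assert not mgr.receive(*first)     # older report rejected as stale
```
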
E: Yeah, maybe I'll just have a generic update-config-like dependencies thing, because it has all these files where it stores the various values or things; I could just have it update all of them. We just send the updated JSON blob with, yeah, I think it is a JSON file, it's just one JSON file that has a list of the things it has, but you just update them real quick, yeah. It's a reconfig basically, right, yeah. It would just be a reconfig over HTTP instead of SSH, yeah.
A: Perfect. Have a great Thursday... have a great Tuesday, sorry, and see you next week.