Ceph Orchestration Weekly, 20 Jul 2021

From YouTube: Ceph Orchestrator Meeting 2021-07-20

A

Okay, let's start. Hi everyone, and welcome to today's orchestrator meeting. We have two.

B

Is the meeting recorded?

B

Okay, perfect. I was asked to provide the links to the meeting.

A

It is recorded, but it is recorded using the Ceph upstream BlueJeans account, and I have no access to it. Do you? No.

A

I pinged Mike a few days ago and haven't got an answer. But if you have a way to access those recordings, that would be awesome.

C

I will check it out. They used to get posted to YouTube, but...

A

Yes, they are posted to YouTube. Mike is really good at it, but they get posted maybe two weeks later.

A

That works for everyone else, except for people who are engaged in the development here, because an orchestrator meeting from two weeks ago is like getting last week's news.

A

Yeah, thank you, Daniel. It's even linked in the Etherpad. If you have a look at the Etherpad, there is a link; basically, there is a recordings page.

A

Anyway, Miguel, maybe you can ping Mike again. Okay, let's see what else we have on the agenda. Now we have that override issue; that's pull request 42378.

A

That's about disabling the admin account, and in itself that's good. But changing the Jinja2 template, or recommending that users change the Jinja2 template, is problematic because the change is going to be saved persistently. If we roll out a new release with a new template, the user is not going to benefit from it. It could even mean that the deployed daemons are broken.

A

Which means that whenever you upgrade, you have to make sure that the default template is still compatible with the overridden template that you have used. So what do we want to do?

A

I think we discussed it in a stand-up last week. We have two options.

A

Option one is to put all switches into the YAML spec file that we are using. The alternative is to continue using the Jinja2 template.

A

The downside of putting all the switches into the YAML file is that we are going to end up like a Helm chart. And the downside of continuing to change the Jinja2 template is the Ceph upgrade.

A

I don't know, what do you guys think?

E

Oh, I was just going to say there's also the downside that this forces the user to understand the internals of the orchestrator. Some of these variables are kind of magic. That's unfortunate.

A

So we have good downsides for both approaches, but I guess that's it, right? Is there any other way we can use?

B

I think there may be another possibility, which is to have separate places for standard templates and for the user's customized templates. If the user selects a customized template, it lives outside the official folder. Then there is no possibility of overriding the default template that we deploy with the container, or with the deployment of cephadm.

A

That's not the biggest issue, because a user can always reset the overridden template and let it fall back to the default that we are shipping. That's always a possibility.

A

It's just that when doing a Ceph upgrade where we ship a new template in a new manager container, that new default is never going to end up in the deployed daemons, because the user has a config-key setting overriding the template: not the default template, but the template in use. It's stored in the config store in the monitor, so it's safe, right?

A

We are not forcing users to override the default template; that would be really evil. It is possible to store them in config-key, but nevertheless we have the downsides of the upgrade issue and of exposing internals.
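
The upgrade problem described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not cephadm's actual code: the function resolve_template, the template name services/example.conf.j2, and the template contents are all hypothetical. It only shows why a persistently stored override keeps shadowing whatever default a new release ships.

```python
from typing import Dict

TEMPLATE = "services/example.conf.j2"  # hypothetical template name


def resolve_template(name: str, defaults: Dict[str, str],
                     overrides: Dict[str, str]) -> str:
    """Return the stored override if there is one, else the shipped default."""
    return overrides.get(name, defaults[name])


# Defaults as shipped by two releases (contents are made up).
defaults_old = {TEMPLATE: "port = 8080\n"}
defaults_new = {TEMPLATE: "port = 8080\nnew_option = on\n"}

# The user's override is stored persistently and survives the upgrade.
overrides = {TEMPLATE: "port = 9090\n"}

before_upgrade = resolve_template(TEMPLATE, defaults_old, overrides)
after_upgrade = resolve_template(TEMPLATE, defaults_new, overrides)

# Only once the override is removed does the new default take effect.
after_reset = resolve_template(TEMPLATE, defaults_new, {})
```

In this sketch before_upgrade and after_upgrade are identical, which is exactly the complaint: the new default is never rendered until the override is reset.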

A

Okay, Mike, do you have a better idea? Do you have an answer?

E

I wish I did. But the thing is, exposing the internals, we'd kind of be doing that in YAML anyway, right? I mean, the HTTP port will become http_port. They kind of become a one-to-one mapping.

E

In my mind, it's almost like there's more to it than that. We need templating for some of these things, but we also need a way for users to define their own custom configuration somehow.

E

Just custom config files generically, maybe as part of the YAML, as a YAML string, I don't know. But that's harder to templatize.

A

Okay, at some point we introduced that spec sub-object in those YAML files, right? I'm writing something down in the Etherpad here. So if you have a service with service_type rgw, and you have some additional properties in the YAML file, in a spec object within the YAML, and there we have rgw_realm.

A

That's a bad example, but we could make all those additional spec attributes accessible from the template, which means that we would not need to keep everything explicitly in cephadm. Instead, we could make it possible for users to write additional arbitrary fields to the spec file, plus make it possible to override the template file. Users would then be able to, first, change the template and, second, use custom fields in the spec file to override the template. I don't know, does it make sense?
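
A rough sketch of that idea follows: every field under the spec sub-object is handed to the template as a variable, so a user can add a field and reference it from an overridden template without the orchestrator knowing about it in advance. The field custom_log_level and the template text are hypothetical, and Python's string.Template stands in for Jinja2 so that the sketch runs with the standard library only.

```python
from string import Template
from typing import Any, Dict

# A service spec as it might look after loading the YAML file.
# `custom_log_level` is a hypothetical user-defined field.
service_spec: Dict[str, Any] = {
    "service_type": "rgw",
    "service_id": "example",
    "spec": {
        "rgw_realm": "myrealm",
        "custom_log_level": "debug",
    },
}

# A user-overridden template that refers to the custom field.
template_text = "realm = $rgw_realm\nlog_level = $custom_log_level\n"


def render(text: str, spec: Dict[str, Any]) -> str:
    """Render `text`, exposing every field of the spec sub-object."""
    context = dict(spec.get("spec", {}))
    # substitute() raises KeyError if the template uses an undefined field.
    return Template(text).substitute(context)


rendered = render(template_text, service_spec)
```

The design point is that the orchestrator does not need to enumerate the fields: whatever sits under spec becomes available to the template, and a template that references a missing field fails loudly instead of rendering a broken config.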

E

I was thinking along that path. I mean, it could be like a files or a template section in the spec, but the complexity of the spec continues to grow when we do this. And I think there are some things in the orchestrator that are hard to templatize, like the HTTP start port for HAProxy, things like that. But maybe they're covered elsewhere.

A

We could. I mean, Jinja2 is extremely expressive. We can do a lot with Jinja2.

E

But I do somewhat like the concept that these are all in the same place, so you can dump the YAML for a service and you have all of the declarative context for that service.

A

Sage, did you get the topic that we're discussing right now?

F

No, I just joined. I'm on my phone, so I can't see the agenda. What are we talking about?

A

Okay, yeah. We are talking about the fact that the user is not really able to benefit from new default templates if they change the template in a config-key setting.

F

Oh yeah, I think I put something in the pad. I think this is a problem, because you can't see the defaults, you can't see the current value, and you can't see how it differs. I think we should just add a whole new set of CLI commands that will list the templates, let you fetch the current value of a template, fetch the default value, diff the current value with the default, and set the template or reset it to the default.
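
The diff command proposed here could be little more than a unified diff between the shipped default and the stored value. The sketch below uses Python's difflib; the function name diff_template, the template name, and the template contents are hypothetical, and it does not imply that such a cephadm command exists.

```python
import difflib


def diff_template(default_text: str, current_text: str, name: str) -> str:
    """Return a unified diff from the shipped default to the current value."""
    lines = difflib.unified_diff(
        default_text.splitlines(keepends=True),
        current_text.splitlines(keepends=True),
        fromfile=f"default/{name}",
        tofile=f"current/{name}",
    )
    return "".join(lines)


# Made-up template contents for illustration.
default_text = "port = 8080\nuser = admin\n"
current_text = "port = 9090\nuser = admin\n"

report = diff_template(default_text, current_text, "example.conf.j2")
unchanged = diff_template(default_text, default_text, "example.conf.j2")
```

An empty result means the template was never customized, so the same helper could also drive a "list customized templates" command.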

A

That doesn't solve the problem that the user is not going to benefit from new default templates.

F

Yeah, it doesn't. But after the upgrade they can then go look at this, or it could be that when we upgrade, we take the diff and then apply it to the new one. I think there's no good answer here. It's like when you upgrade a Debian package and you've modified a configuration file: it always prompts you. Do you want to take the new one, keep yours, or look at the difference and maybe apply the diff?

F

I don't know exactly what the options are, but which one makes sense will depend on what the change to the template is, right?

B

I don't think that an end user who is modifying a template is going to be very happy if we replace it with something new or upgrade it. I think what the user probably wants is to always have the same template, and to have that template change only if the changes are made by himself.

F

Yeah, I mean, it seems like at a minimum we need to have the tools to be able to see your template versus the default and the diff.

F

But probably the end result is just that after you do the upgrade you should, as the next step, go look at the diffs, re-examine your templates, and see if they have changed.

A

I'm not really a fan of.

F

Like, what else could you do?

A

I don't know. Adam, do you have a better idea?

D

Not really. Diffing all the templates sounds really complicated, but it seems like, other than that, we'd just have to have a million different options in the YAMLs.

A

We could even allow users to do a three-way diff.

F

I mean, the upgrade is another issue, because it's not interactive, so you can't prompt them for what they want to do with their template if it happens to change. But you could log something, maybe, or generate some report somewhere so they can see what the difference is, or somehow flag it. But even independent of the upgrades, it seems like at a bare minimum you have to have the ability to see what the template should be without having to go look in Git or inside the container image to copy a file to start with, right? I should be able to see what the templates are that can be changed. I should be able to see what the current value is. I should be able to see what the default is. I don't know, that feels like a minimum if we want to invest any effort in them being able to customize these templates.

F

Otherwise we just say you're on your own: go read the source code every time you upgrade to see if there's a problem, because this isn't supported. I think we can do better than that. It feels like commands like these would be pretty easy to implement and document, and if we just say "after an upgrade, go recheck the diffs for your templates if you customized them", then that would be enough.

A

Do we know when a user set a config-key setting?

F

We can get a notification of that, yeah. But for these templates, I think we should just not make them do that, and instead have a command to set the template separately, right?

A

It would be great to know whether the default template changed before the user set the custom template, or whether the default template got a newer version after the user set a custom template.

F

You'd know which is newer from that information, right?
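
One way to get that information, sketched under assumptions: when a custom template is set, also record a digest of the default it was based on. After an upgrade, a mismatch between the recorded digest and the digest of the new default means the default changed after the user customized it. The record layout and the function names are hypothetical.

```python
import hashlib
from typing import Dict


def digest(text: str) -> str:
    """Stable fingerprint of a template's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def set_custom_template(custom: str, current_default: str) -> Dict[str, str]:
    """Store the override together with the digest of the default it replaces."""
    return {"custom": custom, "base_digest": digest(current_default)}


def default_changed_since(record: Dict[str, str], new_default: str) -> bool:
    """True if the shipped default differs from the one the override was based on."""
    return record["base_digest"] != digest(new_default)


# Made-up template contents for illustration.
old_default = "port = 8080\n"
new_default = "port = 8080\nnew_option = on\n"

record = set_custom_template("port = 9090\n", old_default)
stale_after_upgrade = default_changed_since(record, new_default)
stale_without_upgrade = default_changed_since(record, old_default)
```

A non-interactive upgrade could use a check like this to log a warning or generate a report for exactly the templates that need a second look.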

A

Yeah. Or we just invest time and put everything into the template that we can think of or that a user demands, and then we just keep adding everything.

F

I mean, we could do some of that anyway, but it's never going to be enough, right? There's always going to be something that they either don't want, or do want, that we don't support.

F

It seems easier to focus on the things that are important and then have an escape hatch, backdoor or whatever.

A

The backdoor is already there. This just makes the backdoor a tiny bit more user friendly.

A

Yeah, I don't know. Daniel, do you have a better option? A good idea?

C

Not really. It seems like all the options have a downside either way.

A

Anyway, let's see. We don't need to decide that here today. We can postpone it: we can move it to the pull request discussion and then think about it for maybe a day or two. Okay, so another topic that we have today is that the manager is stuck for 15 minutes in the serve loop. Daniel, do you want to just briefly explain what happens and why?

C

Yeah, I'll put the tracker in the chat.

C

So basically, if you have an offline host, or a host that goes offline, the next time the serve loop comes around, it's going to try to run refresh hosts and daemons, which will try to run gather facts, which runs cephadm's gather-facts. At that point, cephadm hasn't realized yet that the host is offline. So it goes through a bunch of functions, and eventually it calls some remote functions that try to either create a connection to that host or use an existing one.

C

If the timing is right, it will have an existing connection to that host even though the host is actually offline. Then it's going to try to SSH into the host and run the gather-facts command, and the connection it's using isn't going to work because the host is offline. Then there's a 15-minute period where it's trying to do that, and eventually it will error out.

C

But during those 15 minutes the serve loop is stuck. It will error out after the 15 minutes and mark the host as offline, and everything kind of goes back to normal. But for 15 minutes it's stuck. In the tracker, I put a bunch of details about which functions and which lines are causing it, what's happening, and a log of it happening.

C

I don't really know if there's a great solution. I know Melissa is working on changing the whole SSH backend, so maybe her asyncssh work might be able to resolve this, or the agent that Adam is working on. I don't know if that would make a big difference here or not, but yeah, that's the gist of the problem.

A

I think the agent is going to help, because we're just not going to call the remote host that often.

A

It's still going to be a problem when we deploy, but for just getting a refresh of the status, I really think the agent is going to help here.

D

Is there some way we can just add a timeout to this? 15 minutes is a really long time to get stuck. If we could lower that a lot, that would help. I feel like this is really rare timing: the host has to go offline right when we already have a connection and are about to do something. If we could have that be maybe only a minute, it wouldn't even really be a big deal, because of how rare it is.

A

Can we lower the timeout? How?

C

Well, currently there's no timeout. I'm not exactly sure where that 15 minutes is coming from. It's very consistently 15 minutes. I'm assuming it has something to do with the SSH or remote stuff. I don't know; I assume we could probably add a timeout to it, but I'm not sure.
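
A minimal sketch of such a timeout, assuming the remote call is available as a coroutine. The gather_facts function below is a stand-in that merely sleeps to simulate a hung connection; it is not cephadm's function, and the timeout values are arbitrary.

```python
import asyncio
from typing import Optional


async def gather_facts(host: str, delay: float) -> str:
    """Stand-in for a remote call; `delay` simulates a slow or dead host."""
    await asyncio.sleep(delay)
    return f"facts for {host}"


async def refresh_host(host: str, delay: float, timeout: float) -> Optional[str]:
    """Return the facts, or None if the call does not finish within `timeout`."""
    try:
        return await asyncio.wait_for(gather_facts(host, delay), timeout=timeout)
    except asyncio.TimeoutError:
        # The caller can mark the host offline and move on to the next one.
        return None


healthy = asyncio.run(refresh_host("host1", delay=0.0, timeout=1.0))
hung = asyncio.run(refresh_host("host2", delay=5.0, timeout=0.05))
```

With a bound like this, one unresponsive host costs the loop at most the timeout instead of the full 15 minutes.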

A

There is a TCP keepalive setting that we could use.
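
For reference, this is roughly what enabling TCP keepalive looks like on a plain Python socket. It is only an illustration of the setting: the idle, interval, and count values are arbitrary, the three tuning options are guarded because they are not available on every platform, and how an SSH library exposes keepalive is a separate question.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 60,
                     interval: int = 10, count: int = 3) -> None:
    """Ask the kernel to probe an idle connection and drop it if the peer is gone."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning knobs exist on Linux; guard them for other platforms.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Keepalive probes are only sent on an otherwise idle connection, so this helps notice a dead peer between commands rather than bounding a command that is already in flight.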

A

Or do we really need to care? Or does everyone just want to look into the agent and see how things improve with the agent?

C

Well, yeah, I mean, if the agent makes this problem go away, that's fine. Like I said, it's not the end of the world. The manager is stuck for 15 minutes, but after those 15 minutes it does resolve itself and mark the host as offline, and the problem doesn't continue. But there is a window of 15 minutes where you're stuck.

A

I would just wait for the agent and see if it improves things, honestly.

D

I mean, even regardless, shouldn't this be super uncommon? You have to have made a connection just as the host goes offline, and then it has to be about to do something. Hosts don't go offline super often, and then it has to be very specific timing when one does. Unless you're purposely testing hosts going offline, you wouldn't see this very often.

D

I don't know how big of a deal it's really going to end up being in practice.

C

Yeah, I mean, I've been able to reproduce it very consistently, but obviously I'm manually putting hosts offline for that to happen. So I don't know exactly how it would behave in a real-world cluster.

C

Based on what I figured out, it seems like this should happen almost every time, though. The only way a host gets marked offline is through a function in the remote code that tries to create the connection; if it fails to do that, that's when the host gets marked offline. Usually there's already an existing connection to the host, so it's just going to use that and bypass the part of the code where it could fail and mark the host offline.

D

I'll have to mess with it a little bit. I'm going to look at it myself and see how often it happens.

D

Yeah, I'll message you later. I haven't reproduced it personally, so I have to look at it. But if it actually seems like something that happens almost every time a host goes offline, we probably should see if we can put a timeout in, and then hopefully it will just go away when we replace the SSH library.

A

All right. Melissa, do you know if asyncssh supports the TCP keepalive setting of SSH?

G

Yeah, I think it does. I was reading something, and it does support the keepalive function, or the TCP thing. And, if it's related, in my implementation of the asyncssh stuff for running the commands, if the connection is broken, it returns an asyncssh error, and then I just reset the connection. So I don't know if that would fix that.
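
The reset-on-error approach, together with the earlier point that a host is only marked offline when creating a connection fails, can be sketched like this. FakeConnection and ConnectionCache are hypothetical stand-ins, not asyncssh or cephadm classes, and a plain ConnectionError stands in for the library's error.

```python
from typing import Callable, Dict, Optional, Set

# Simulated reachability of hosts; flipped below to mimic a host going offline.
reachable: Dict[str, bool] = {"host1": True}


class FakeConnection:
    """Stand-in for an SSH connection to `host`."""

    def __init__(self, host: str) -> None:
        self.host = host

    def run(self, command: str) -> str:
        if not reachable[self.host]:
            raise ConnectionError(f"connection to {self.host} is broken")
        return f"{self.host}: ran {command}"


def connect(host: str) -> FakeConnection:
    if not reachable[host]:
        raise ConnectionError(f"cannot connect to {host}")
    return FakeConnection(host)


class ConnectionCache:
    def __init__(self, connector: Callable[[str], FakeConnection]) -> None:
        self._connector = connector
        self._conns: Dict[str, FakeConnection] = {}
        self.offline: Set[str] = set()

    def run(self, host: str, command: str) -> Optional[str]:
        # Try the cached connection first, then one fresh connection.
        for _ in range(2):
            try:
                conn = self._conns.get(host)
                if conn is None:
                    conn = self._conns[host] = self._connector(host)
                result = conn.run(command)
                self.offline.discard(host)
                return result
            except ConnectionError:
                # Reset: drop the stale connection so the next try reconnects.
                self._conns.pop(host, None)
        self.offline.add(host)
        return None


cache = ConnectionCache(connect)
first = cache.run("host1", "gather-facts")      # connects and caches
reachable["host1"] = False
second = cache.run("host1", "gather-facts")     # stale connection, reconnect fails
offline_after_failure = "host1" in cache.offline
reachable["host1"] = True
third = cache.run("host1", "gather-facts")      # host is back, reconnects
```

Whether this fixes the stuck loop still depends on how quickly the broken connection raises its error, which is the concern raised next.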

A

It depends a bit on how fast it is, right, and how fast you are getting information about it, if it takes 15 minutes for you to get notified or...

A

Yeah, anyway, I think we shouldn't do anything about it right now and just wait for the other SSH library or the agent and see if things improve then. If not, let's revisit it. Is that okay for you?

A

That's what we had on the agenda for today. Anything else?

A

See you next week, and have a nice day.