Kubernetes SIG Node, 22 Aug 2023

From YouTube: Kubernetes SIG Node 20230822

Description

SIG Node weekly meeting. Agenda and notes: https://docs.google.com/document/d/1Ne57gvidMEWXR70OxxnRkYquAoMpt56o75oZtg-OeBg/edit#heading=h.adoto8roitwq

GMT20230822-170502_Recording_640x360.mp4

A

Hello, hello. Today is August 22nd, 2023. It's the SIG Node weekly meeting. Welcome, everybody. We have a very short agenda today, so let's dive into it. The first item is a question from maratha.

B

Yes. As I said there, in 1.28 we added support for stateful pods with user namespaces; previously we only supported stateless pods. I was wondering if we can make a blog post, or who I should talk to to make that happen. Monroe already mentioned that it seems okay.

C

I think it will be useful, because we want people to try it out and see whether there are any issues.

D

Right, and we can also mention the in-progress work that Sasha is pursuing on the art side.

B

Right. So what are the next steps to make the blog post happen?

A

So this is a call for a blog post for 1.28. I think the deadline passed a long time ago, but at the same time they don't mind having new posts. I'd ask somebody from SIG Docs, or alternatively send a PR to the website and try to find somebody from SIG Docs to approve it. I don't think anybody will mind.

A

We still have time, because release blog posts are kind of spaced out; we don't post everything on a single day. So I would suggest starting with a short PR. If you want to discuss the text of the post, you may start with a Google Doc, since it's just easier to collaborate. Once you have it, let's find somebody from SIG Docs to approve it. If you have trouble finding somebody, let me know and I will help you navigate.

B

Perfect, okay, I'll do that. Thank you very much.

A

Yeah, thank you for driving that. It's a very interesting feature.

A

Okay, the next item is struck through; I think it was approved. Peter, do you want to talk about the image work?

E

Yeah. I did a Doodle poll to find the timing for the image garbage collection rework working group, and that would be on Wednesdays at, I guess, 11, or no, 12 Pacific.

E

So I was wondering if we could add that to the SIG Node calendar. I don't think I have permissions to do that, so I would need someone's help.

A

Okay, can you send me your address? I will give you permissions.

E

Yes, I'll shoot it to you. Cool, that's all.

A

Meeting notes, so people are using meeting yes, yeah.

E

Absolutely, yeah, and I'll post it in the channel and on the mailing list as well. I'm thinking about just having a preliminary meeting tomorrow. It probably won't be very busy: just meet and go over goals and maybe some desires for the working group. So it'll probably be pretty quick, but I figure it gets the ball rolling. I will post that where appropriate. Thank you.

A

Thank you. Okay, next item. I'm not sure who added the next item.

F

I'm sorry, that was me. I submitted this a while back and I was talking with Peter on the issue. It's sort of a philosophical question: where does the responsibility lie between node problem detector and the kubelet with respect to detecting some issues?

F

So the root cause of this was that I've seen a few times where file systems can go read-only, and not just because of physical failures. XFS has a really weird issue where you can no longer write files, even though you apparently have plenty of free space and plenty of inodes.

F

This happens if you get excessive free-space fragmentation. When it does, the kubelet basically stays ready and everything looks great, except all your pods are failing because they can't actually write to any file systems whatsoever, and the kubelet can't write to the file systems either, so it's just a pretty bad failure scenario for users. So with this change I was looking at adding something to the kubelet that just writes to the pod data directory every few seconds. If that fails, then something has gone massively wrong. We can mark the node...

G

It's not ready.

F

And resolve it. Then Peter was saying, well, maybe it's more a responsibility of node problem detector, which, yeah, maybe it is. I'm really just looking for guidance: what's the line between detecting issues internally with the kubelet versus node problem detector?
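The write probe described above could look roughly like the following. This is a minimal sketch, not the actual kubelet change: the function name `can_write`, the probe file prefix, and the use of Python are illustrative assumptions, and the real proposal would live in the kubelet's Go code and target the pod data directory.

```python
import os
import tempfile


def can_write(directory: str) -> bool:
    """Return True if a new file can be created, written, synced and removed."""
    try:
        # Create a *new* file each time: a probe that only rewrites an
        # existing file could miss failures that hit file creation.
        fd, path = tempfile.mkstemp(prefix=".write-probe-", dir=directory)
    except OSError:
        return False
    try:
        os.write(fd, b"ok")
        os.fsync(fd)  # push the data to disk rather than the page cache
        return True
    except OSError:
        return False
    finally:
        os.close(fd)
        try:
            os.unlink(path)
        except OSError:
            pass  # best-effort cleanup; the probe result is already decided


if __name__ == "__main__":
    # Hypothetical usage: probe the system temp directory once.
    print(can_write(tempfile.gettempdir()))
```

A caller would run this every few seconds, as suggested in the discussion, and treat a failure as a signal that the node cannot serve pods that need to write.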

G

Maybe I can answer this question. I think both, right? In this case the kubelet cannot perform normally, this is not normal behavior, so it should mark itself. In our system, when we designed it, both the kubelet and node problem detector can mark the node not ready. When the kubelet marks it, that is based just on the prerequisites the kubelet needs to operate normally.

G

Basically, if the node configuration is not ready, all those kinds of things, it will mark the node not ready. Then there is the rest of the stuff, like kernel issues.

G

Next, the file system going read-only. A lot of the time that is caused by the kernel: for example, the kernel remounts the file system read-only.

G

The kernel log will then say that the file system is read-only, right? So those kinds of problems can be detected by node problem detector through those kinds of messages. At that moment it may not be the kubelet itself that cannot write in certain cases, so the kubelet sees nothing new, but the jobs cannot access or write their data.

G

So this is why, in those cases, node problem detector is supposed to detect those problems. But one challenge I know of is that today we haven't really kept node problem detector up to date as the kernel evolved. For example, we already started to switch to cgroup v2. So there are a lot of problems, because node problem detector's rules are based on the earlier production experience we had with Kubernetes, and after these eight or nine years a lot of things have changed. So today I think node problem detector needs a little bit more work.

G

I think I've been mentioning this to the SIG Node community: we need to keep that up to date. I'm not sure whether node problem detector today can detect this read-only file system issue.

F

Yeah, it'll look for some logs that say read-only, but if the file system goes read-only for some other reason that doesn't cause that log to appear, then it won't catch it. So okay, that makes sense: the kubelet for its own prerequisites, and then NPD for sort of everything else that occurs while it's running.
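The gap described here can be illustrated with a small sketch of log-based detection. The pattern string, the function name, and the sample log lines are assumptions for illustration, not node problem detector's actual configuration. The point is only that a detector matching a remount message sees nothing when writes fail without the kernel logging that message.

```python
import re

# Assumed example of a kernel message logged on a read-only remount.
READ_ONLY_PATTERN = re.compile(r"Remounting filesystem read-only")


def logs_report_read_only(lines):
    """Return True if any log line matches the read-only pattern."""
    return any(READ_ONLY_PATTERN.search(line) for line in lines)


if __name__ == "__main__":
    # Hypothetical log samples: one with the remount message, one without.
    remount_log = ["EXT4-fs (sda1): Remounting filesystem read-only"]
    quiet_log = ["unrelated message while writes are failing"]
    print(logs_report_read_only(remount_log))
    print(logs_report_read_only(quiet_log))
```

In the second case the file system may still be unwritable, which is why a direct write check and a log check can disagree.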

G

Yeah. In this case the kubelet should say not ready, and node problem detector should also say this node is not ready. They should observe the same thing, because the kernel knows the file system has transitioned to read-only. So in this case both are expected.

A

Mm-hmm. Yeah, one question will be how to recover from that state. If the kubelet checks the prerequisites and marks itself not ready or unhealthy, how will it get back if the system recovers by itself?

F

So in the situation that I've seen, there is no great way to recover other than to roll your node and start over.

G

That's right. Once it is in this bad state, you just have to reboot the node. So this is why we only report the status; then the jobs are supposed to be rescheduled and the node should be recovered by reboot. This is why we built auto repair. But auto repair is not done at the node level. It can only be done at the global level, because you need to make sure the scheduler and API server know that the node is not ready and the work should be rescheduled to other nodes.

G

That's why we are doing that.

F

But in reality that doesn't happen currently: the node stays ready, the pods stay there and just fail, and there is no auto recovery.

G

Yeah, so that's why I think both the kubelet and node problem detector didn't do that job in this case. Okay, so the node problem detector fix should be straightforward.

G

The kubelet fix should be straightforward too, but I'm a little bit surprised that node problem detector didn't detect this problem. There must be something in this area where we didn't put in the right pattern, because we match against those patterns in the logs.

F

So the actual issue that I saw was not one where the file system is mounted read-only. It's just a file system you can no longer write to.

G

Got it so having to have the log too so yeah.

A

Sorry, thank you. I'll keep my comment as a general rule: in general, if we check for prerequisites in the kubelet, we need to make sure that either it's unrecoverable for sure, and we're 100% sure about it, or there is a way to recover by itself and get back into a ready state.
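One way to picture this rule is a readiness tracker that flips to not ready after repeated probe failures and flips back as soon as a probe succeeds. This is a hedged sketch under assumptions: the class name `ReadinessTracker`, the `failure_threshold` parameter, and the recovery policy are all hypothetical and are not taken from the kubelet.

```python
class ReadinessTracker:
    """Track readiness from a stream of probe results, with a recovery path."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.ready = True

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result and return the current readiness."""
        if probe_ok:
            # A successful probe is the way back: reset and report ready.
            self.consecutive_failures = 0
            self.ready = True
        else:
            self.consecutive_failures += 1
            # Require several failures in a row to avoid flapping on a blip.
            if self.consecutive_failures >= self.failure_threshold:
                self.ready = False
        return self.ready


if __name__ == "__main__":
    tracker = ReadinessTracker(failure_threshold=2)
    for result in (False, False, True):
        print(tracker.observe(result))
```

If the condition can never clear on its own, as in the XFS case discussed earlier, the recovery branch is simply never taken and an external repair such as a reboot is still needed.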

A

Okay, we've reached the end of our agenda items. Is there anything else anybody wants to talk about?

G

Oh, I thought today we had someone's proposal. Yeah, I didn't see that on our agenda.

A

Going once, going twice. Thank you, everybody. A short and useful meeting. Let's start 1.29 strong, but also don't forget to take the rest of your vacation and enjoy the rest of the summer.
