From YouTube: Kubernetes SIG Node 20220427
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A: Hello, it's April 27, 2022, and this is the SIG Node CI subgroup meeting. Welcome, everybody.
B: And it would be nice for us to get people looking at them, so I already went and opened PRs for the Fedora swap and Ubuntu swap jobs.
A: They're hard to troubleshoot, but there are different failures. I think CRI-O has this eviction problem, and containerd has some too, so there are two different evictions. I don't remember; my memory doesn't serve me well right now. And the flaky one, I'm not sure; I never looked at it. Peter, are you on the call? Do you think you can start looking into that?
D: Hello, I'm here. Yeah, I can try to take a look. This week and next week are kind of swamped, trying to cut stuff for CRI-O 1.24, but definitely after that, and hopefully before that I'll also poke my team and see if we can get any assistance as well.
D: And thanks a bunch to Danielle for getting started and taking a look at some of the CRI-O jobs. I really appreciate it.
A: Okay, so I think, yeah, this test, I looked at it; it's just a broken image. I have a PR for that, but we are in a code freeze. This other one I haven't looked at; I don't remember.
A: Yeah, so I looked at the soak test, and I have a PR for that. The NPD verification doesn't work when the test has been running for more than 24 hours, because it looks for events that were supposedly kicked off at the beginning of the test, and if the beginning of the test was more than 24 hours ago, it will not find them. So I have work in progress for that. But then the soak test also has a different problem: over time it accumulates so much of something that the disk gets overloaded.
A: So we start getting these problems, and it's basically because it is a soak test: it soaks the cluster for a very long time. Maybe there's some configuration we can adjust. I really don't want to turn it off, because we don't have any other soak tests, but we need to get it back to life. The NPD part is figured out, though; I have a work-in-progress PR for that. Oh, sweet, yeah.
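The 24-hour window problem described above can be sketched as a quick shell check. This is illustrative only, not the actual NPD verification code; the 30-hour test age is a made-up example:

```shell
# Illustrative sketch of why the NPD verification fails on long soak runs:
# the check only looks back 24 hours, so events emitted at test start fall
# outside the window once the test has run longer than that.
now=$(date -u +%s)
test_start=$((now - 30 * 3600))     # suppose the soak test started 30h ago
lookback=$((24 * 3600))             # the verification's 24h event window
if [ $((now - test_start)) -gt "$lookback" ]; then
  echo "startup events are outside the lookback window"
fi
```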
A: Yeah, I remember, what was the name, Queen? The person who joined last week said that he will take a look at this flaky test, but it will take him time to do that. So let's see if there will be progress.
A: Yeah, eviction. I think we need to find an owner for eviction. We're looking internally in our team and in GKE Node, but if anybody else has cycles now, let's try to take a look. I think this is the longest-failing test, and it's an unknown issue, an unknown problem.
B: I mean, the problem with the eviction tests is mostly known; what's unknown is how to fix it.
B: The eviction tests have the problem of interacting with each other and with anything else happening on the host. So if we actually want to make them reliable, we need to do something that's going to mostly involve potentially rewriting a lot of them. For the disk ones, I've been thinking about trying to switch to using a RAM disk or a disk image or something, rather than the host disk.
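One way to realize the disk-image idea is a file-backed loopback filesystem the test can fill without touching the host disk. A rough sketch, with illustrative paths and sizes that are not from the test suite (the formatting and mount steps, commented out, need root):

```shell
# Create a small file-backed image the eviction test could fill instead of
# the host disk; formatting and loop-mounting it would follow, as root:
dd if=/dev/zero of=/tmp/evict-scratch.img bs=1M count=512 status=none
#   mkfs.ext4 -q /tmp/evict-scratch.img
#   sudo mount -o loop /tmp/evict-scratch.img /mnt/evict-scratch
echo "image size: $(stat -c %s /tmp/evict-scratch.img) bytes"
```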
A: Yeah, and I remember now: David Porter mentioned that some disk problems may be caused by a long, slow filling up of a disk, and David is on the call, so maybe you can comment.
E: Yeah, I mean, I had a little bit of time. I unfortunately got sidetracked by something else, but I did a little bit of debugging of the eviction tests, and what I found, at least, was that the failing one, the local storage eviction one, is failing because it's trying to fill up the whole disk, and the disk is pretty big; I think it's like 30 gigs or something like that.
E: 40 gigs, actually, but it was filling up very, very slowly, like 10 megabytes a second or something like that, so it was maybe just timing out before it even got a chance to fill the disk. That was my investigation; I added my comments on the bug. So maybe we just need to speed it up. I mean, maybe there are other problems there too, no question, but at least one problem was that it seemed to be filling up the disk very, very slowly.
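A quick back-of-the-envelope check of those numbers, assuming the roughly 40 GB disk and 10 MB/s fill rate mentioned above:

```shell
# At ~10 MB/s, filling a ~40 GB (40960 MB) disk takes over an hour,
# which plausibly exceeds the test timeout.
disk_mb=$((40 * 1024))
rate_mb_s=10
echo "$((disk_mb / rate_mb_s / 60)) minutes to fill the disk"   # → 68 minutes
```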
A
Okay
yeah,
if
you'll
comment
on
the
bucket
will
be
helpful,
and
maybe
you
can
allocate
smaller
disk
machines
for
that.
A: Perfect, thank you, Daniel and David. Yeah, I mean, even if you find a quick fix for that and it doesn't fail for a while, it's a good first step. We really need to get into a green state, because our next step, as Danielle pointed out, is to improve coverage, right?
A: So it's really hard to start improving coverage if we cannot get to green for so long.
A: Danielle, are you there? Do you want to talk about this thread that you started on reliability and maintainability? Are there any action items you want to start immediately?
B: But yeah, we've had a lot of fairly scary escaped bugs in the last few releases that it would be nice to not keep repeating, because right now it's really hard to land any code in the kubelet without breaking something else. And the CI subgroup has been a great start to making that less likely to happen.
B
If
tests
fail,
you
can
now
generally
assume
it's
because
the
test
failed
for
like
legitimate
reasons,
but
now
we
need
to
sort
of
like
move
forward
and
increase
the
coverage
and
also
in
some
cases
you
know,
write
code
to
actually
be
testable
and
stuff.
F: Hey, Francesco here. I have a couple of open-ended questions. I'm not sure this is the right place and time, but let me just mention them, and we can probably elaborate offline; I would be happy to. First of all: yes, absolutely, I want to help. In the last couple of months I unfortunately did not have enough time, but things should be better now, so yeah, sign me up. The thing is: isn't adding tests, and adding the necessary refactoring to make the code testable, ultimately the same reviewer-bandwidth issue as new features?
B: Yeah, and that's part of why I brought this up in the general SIG meeting yesterday. I want us to treat this with the same importance that we would treat, you know, KEPs, and actually, as we're considering which KEPs we're going to accept for 1.25, to build in some bandwidth for reviewing and improving maintenance PRs, because they are really hard to land today. A PR adding tests to the container manager sat open since February without review until yesterday, and I'm hoping we can get this actually treated as a priority, especially now that Derek is back.
F: Okay, thanks. I will just say that I would actually go as far as requiring some bandwidth for that; I mean statically allocating some bandwidth for these PRs and making that an actual goal. This is probably what you said already, but I'm really reinforcing it for the next couple of cycles, because I agree we are in a bad state and we should improve.
B: And also to try to motivate people, you know, encourage people to write tests before merging code. The amount of changes to behavior that we land that don't break or change any tests is, quite frankly, terrifying.
A: Yeah, I agree on all counts. I think you did a good job getting us to where we are now. In the past we always stumbled into this trap: you want to increase coverage and do something, but everything is red, so how do you even operate when so many things need to be improved? And even today, Francesco, I don't know about your bandwidth, and you said you have problems with bandwidth; I hope you still have time to debug this device plugin test.
A: It was just disabled, and then we said it's hard, and I don't want to pin it on you, but without this test coverage we have very little visibility into the quality of that code.
A: Yeah, and I don't want to say it's on you specifically; I just want to demonstrate this problem, where bandwidth really is an issue. I think we're getting into better shape bandwidth-wise, so we have more people with time to investigate things and keep things healthy.
A: Danielle, did you think of any ways to enumerate the areas where we need better test coverage?
B: If this is a discussion where people have things they are interested in and care about, I'm hoping that actually motivates something to change. But there's a lot of stuff where, even if we have happy-path testing, we don't test what happens when, say, there are gRPC issues talking to a CRI. I've had customer production bugs escalated to me where it's basically: some weirdness happened, and we couldn't reproduce it.
B: So I can't say what weirdness happened talking to a CRI at, like, the wrong time in the kubelet's loop, breaking a bunch of container state. I could never figure out exactly what happened; I couldn't reproduce it. Somewhere in the pod loop the CRI returned garbage and the kubelet just broke. So a lot of failure-case testing, where we currently have only happy-path testing, and also potentially different types of failure testing, would be nice, given that we don't really have any today; the more important part is adding coverage where we have none.
B: There are also a lot of cases today where it's hard to tell what is broken versus what is expected behavior, because we don't have tests defining what the expected behavior is, and there's no specification for the kubelet aside from, like, the conformance tests, and they are nowhere near complete enough for that.
A: Yeah, I can definitely count a few examples of those as well. For example, graceful termination: we are still kind of deciding whether the readiness probe is supposed to run during graceful termination or not. We see bugs filed for both situations, for both cases; it's quite annoying, and we need to decide what we want this to be in Kubernetes. Yeah, that's another example.
A: Anything else on this topic? I think the next steps will be, Danielle, as you said: we promise to start the mail thread and get more activity around that, so we will probably discuss it a lot during this meeting and our main SIG meeting. And Francesco, thank you for bringing up the reviewer bandwidth problem; we definitely need to increase that and improve things like that.
F: Just one quick note, because the resource management area was mentioned, which I agree, by the way, could use some more testing. We may want to make sure we involve Kevin Klues from NVIDIA, who is very much an expert in this area, and I think he would be happy to help and assist us. So just make sure he's in, because Derek was mentioning the area and I wanted to mention it, so make sure he's involved.
A: And especially with the couple of KEPs we keep discussing around device allocation and the device plugin model.
A: Okay, thank you. I think we can quickly go through the dashboard.
A: I think we listed all the test failures in Jim's message, but did you notice anything new, anything that was missing?
B: Yeah, sorry, I am not having a good day. Yeah, I will; feel free to cc me on anything that needs review.
A: Yeah, looking at this, it feels like there are so many things that don't relate to this group, so I hope the triage will be super fast.
A: Oh, it's a product change during code freeze, so I think we can mark it as done, because once the branch is open, it will be merged automatically.
A: So this one, we don't need anymore.
A: Okay, [crosstalk] okay, I see the cAdvisor end-to-end fails.
A: Oh, that's weird: did you do anything? No? All right, so he fixed it. He's a new chad manager.
A: So this one is waiting on author; yeah, it's me experimenting.
A: And this one needs a review, because I think we're removing some code there, like skipper code that skips execution.
A: And we're done with this review, so I think we're done with the test part of the meeting. We will go to bug triage right now; if you are not here for bug triage, feel free to drop off. For bugs we only have 10, actually 9, because it also counts the pinned one.
A: What logs would help here? Kubelet logs wouldn't have this information, right?
E
Yeah,
I
don't
think
so,
maybe
like
something
like
a,
maybe
if
they
ran
see
advisor
manually
and
then
compared
it
to
what
like
the
df.
You
know
I
just
on
the
linux.
If
you
do
like
df-h
and
show
all
the
file
systems
and
stuff
and
disks
mounted
try
to
see
why
supervisor
does
not
detect
the
disc
that
linux
is
exactly.
E
See
what's
wrong,
I
mean
yeah.
I
think
it
should
if
they
bump
hublot,
they
think
they
should
see
it
because
it'll
maps
the
same
verbosity
so.
E
So
maybe
maybe
a
couple
things
we
can
suggest
one
is
to
bump
up
the
the
kubelet
verbosu,
maybe
like
b4
or
something
like
that
b6
and
then
add
their
logs.
Second
suggestion
might
be
to
run
c
advisor
manually
and
compare
the
output,
as
reported
from
c
advisor
to
like
df-h
or
something
like
that,
which
lists
all
of
the
file
systems.
E
Maybe
they
have
some
weird,
I
know
see
if
there's
some
weird
stuff,
where,
like
zfs
and
other
weird
file
systems
that
are
like
standard,
ext4
and
stuff.
So
maybe
something
like
that.
E
Look
at
the
f
h,
for
example,
seem
like
that.
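The triage steps suggested above might look roughly like this as commands (the 'df -h' flag is standard coreutils; '--v' is the kubelet's klog verbosity flag; the exact kubelet invocation is illustrative):

```shell
# 1. List the filesystems and mounts the kernel sees, to compare against
#    what cAdvisor reports for the same node:
df -h

# 2. Bump kubelet log verbosity and re-check its logs; shown only as the
#    flag one would add to the kubelet invocation:
#      kubelet --v=6 ...
```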
A: Yeah, a very badly formatted log message; I'm trying to understand where the failure is.
A: Can you remind me, rphillips?
A: Yeah, I think this... yeah, you already replied to this.
A: Are you looking for somebody to assign it to, or just to keep it in the backlog?
A: Thank you; west coast, less code is better. Okay, then I think we've done all the bugs. Any other topics for today? Does anybody have anything to discuss?