From YouTube: Kubernetes SIG Node 20210901
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A
So yeah, let's start with this first item. Mike, Mike's here.
B
Sure,
yes,
I'm
here,
okay,
so
there
was
an
issue
with
no
problem
detector
for
a
particular
job
that
pushes
images
it
was
failing
because
there
was
a
cloud
in
it.
Yaml
file
missing
the
repository.
B
I created it, but it's still failing, for a particular reason. If you go to the issue, then there's... I don't think I linked it there, but yeah. Basically the error is "gcc not found". I dug into it a little bit more, and it appears that the CGO_ENABLED flag is true.
C
I thought most stuff we built on Debian, for eventual, like, distroless stuff.
B
Yeah, right now I'm using the... we have, like, syntactic sugar on top of the Google Cloud images, which is this particular image from Kubernetes.
B
Oh, you know, it used to work, yeah, yeah. I was also wondering because I noticed that there are some images already, I mean, there are some versions which are pushed.
C
Yeah, so that is a fairly annoying PR that deflakes a bunch of the storage eviction tests. It's been open for a while; it just needs some reviews.
C
This mostly had the issue of not actually solving any problems, unfortunately.
A
Yeah, for the GPU tests, do we need to restore it? I believe Francesco wanted to understand, like: first, do you want to test the device plugin with GPUs, right, and second, do we need GPU tests at all?
C
I mean, so we had the existing broken GPU tests, which I now have at least running and nearly have working, but they were broken both in terms of, like, the way we configured the host was broken, and, like, the actual logic in the test was broken.
F
Yeah, so this is fallout from Clayton's refactor of the pod lifecycle, and static pods are not being recreated correctly.
F
It changes the hash that gets generated, and the side effect is that static pods that get created with controllers will get restarted. And so I don't know if any controllers actually do that, but that would be the fallout from this one. But it does fix the issue and seems to be working pretty well within our CI so far as well.
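As a rough illustration of the mechanism being described here (a hedged sketch, not the kubelet's actual code): a static pod's effective UID is derived from a hash over its manifest, so if a fix changes what feeds into that hash, the UID changes and the pod looks like a new pod that has to be recreated. The function and inputs below are made up for illustration.

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// staticPodUID sketches the general idea, not the kubelet's implementation:
// a static pod's UID is derived from a hash over its manifest plus the node
// name, so any change to the hashed inputs produces a new UID, and a new UID
// makes the pod look like a different pod that must be recreated (the
// restart side effect mentioned above).
func staticPodUID(manifest []byte, nodeName string) string {
	sum := md5.Sum(append(manifest, nodeName...))
	return fmt.Sprintf("%x", sum)
}

func main() {
	before := staticPodUID([]byte("static-pod-spec-v1"), "node-a")
	after := staticPodUID([]byte("static-pod-spec-v2"), "node-a")
	fmt.Println(before != after) // true: changed inputs, changed UID
}
```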
G
I had a good question about this one, actually. So, like, what I still don't understand is: before the refactor, how did it work, I guess? Because did the refactor change what was taken into account for the UID, or, kind of, how does the refactor affect that, I guess?
F
So the refactor, I think, manifested the problem further. I think this was a problem prior to the refactor as well.
F
My local bench testing shows it fixes those issues completely, but it's running through OpenShift CI right now, and upstream CI.
C
Might be worth adding an e2e for that? Yes.
F
A PR? Yeah, yeah, I can. We're trying to get a release done for this week, so yeah, hopefully I can put it into a separate PR.
A
I think it's a follow-up from the regression, right?
G
Thank you again, yeah. This is, yeah. Yes, Ryan, just to be clear on this one, we wanted... I think I left a comment. One question I had about this one: there was, like, an existing scheduling test that already had a job, so I'm just wondering if maybe we should modify that instead of creating new tests, just because for this we will need to make sure the scheduling folks are on board, create new jobs, etc.
G
Okay, yeah, yeah, because, I mean, it looked like they already had a job or test that did something very, very similar. The only difference is that the pods didn't have any resource requests, so that's why, like, it didn't reproduce the issue. So I think if we just copy and paste their existing tests and only add resource requests, I think it would be the same.
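A minimal sketch of what that change could look like, assuming the existing scheduling test's pod is reused and only resource requests are added; the name, image, and values below are illustrative, not the actual e2e code.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newRequestingPod builds the same kind of pod the existing scheduling test
// uses, with the one difference discussed above: explicit resource requests,
// which is what should make the issue reproduce. Name and image are
// placeholders.
func newRequestingPod(name string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "pause",
				Image: "registry.k8s.io/pause:3.9", // illustrative image
				Resources: corev1.ResourceRequirements{
					// The only addition relative to the existing test.
					Requests: corev1.ResourceList{
						corev1.ResourceCPU:    resource.MustParse("100m"),
						corev1.ResourceMemory: resource.MustParse("64Mi"),
					},
				},
			}},
		},
	}
}

func main() {
	pod := newRequestingPod("resource-request-pod")
	fmt.Println(pod.Spec.Containers[0].Resources.Requests.Cpu().String())
}
```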
A
Great, okay. Issues to do.
D
Like, we maybe can get rid of this one, because I use the, like, overall GCE alpha features job for, like, pull requests rather than a node-specific one, like there's...
D
The corresponding job runs, I think, like, all of the e2e stuff with all the feature gates on. So that's why I just want to leave that job.
D
I feel like that test has been failing for a while on various different things. When did it start failing?
E
It's dragoncell. Thank you.
A
Oh, you're here? Perfect, thank you for taking a look.
A
Here as well, yeah.
A
Cool, okay, and yeah, I think it's assigned, so that's set.
A
Okay, I think we're done with the unassigned triage. We can get more in depth once we've caught up on everything.
A
Yeah, we did it last time and we cleared quite a few.
A
But this is the bug for the API server.
D
I mean, David's comment about the... or, I think it was from David, that the API server can't tell the difference between a random anonymous user and the kubelet talking to something anonymously, is true. So I think it's a feature request.
A
Again, removing it from the board.
A
Okay, remember we have, like, two or three bugs about kubelet restart: people are updating certificates for something, they restart the kubelet, and at startup everything is not ready.
D
I don't think that it assumes that containers are...
H
But it prevents the service from forwarding.
D
Oh, the other thing, too, is they could potentially... if they add a delay to the probe, then won't that also solve this problem? It won't mark it as not ready until, I think, it...
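For reference, a hedged sketch of the probe-delay workaround being suggested, built with the core/v1 Probe type; the delay, threshold, and /healthz endpoint are illustrative values, not recommendations.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// readinessWithDelay shows the shape of the suggestion: give the readiness
// probe an initial delay and a failure threshold so a short kubelet restart
// window does not immediately flip the pod to not-ready.
func readinessWithDelay() corev1.Probe {
	var p corev1.Probe
	p.InitialDelaySeconds = 30 // do not probe immediately after (re)start
	p.PeriodSeconds = 10
	p.FailureThreshold = 3 // tolerate a few failures before marking not-ready
	p.HTTPGet = &corev1.HTTPGetAction{
		Path: "/healthz",
		Port: intstr.FromInt(8080),
	}
	return p
}

func main() {
	p := readinessWithDelay()
	fmt.Printf("initialDelaySeconds=%d failureThreshold=%d\n",
		p.InitialDelaySeconds, p.FailureThreshold)
}
```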
A
Yeah, I think this behavior is expected when you restart for an unknown reason. When you restart to update a certificate, maybe... I mean, maybe we need to allow a certain graceful...
D
Well, whether or not, like... I mean, I don't think the timing is the matter. I don't think you can make any assumptions in terms of how long your kubelet's been offline or something like that. You could checkpoint. So, for example, when the kubelet shuts down, you could dump the state of everything known to disk and then pull it back in and be like: oh well, this thing was ready previously, so I'm going to assume it's still ready until proven otherwise. But, like, I think that's really fraught.
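A toy sketch of that checkpointing idea, purely illustrative and not anything the kubelet actually implements: persist which pods were ready at shutdown, reload at startup, and treat them as ready until a probe says otherwise. All names and the file path are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// readinessCheckpoint is a hypothetical structure illustrating the
// "checkpoint readiness on shutdown" idea from the discussion above.
type readinessCheckpoint struct {
	ReadyPods map[string]bool `json:"readyPods"`
}

// save writes the checkpoint to disk, e.g. on kubelet shutdown.
func save(path string, cp readinessCheckpoint) error {
	data, err := json.Marshal(cp)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o600)
}

// load reads the checkpoint back, e.g. on kubelet startup.
func load(path string) (readinessCheckpoint, error) {
	var cp readinessCheckpoint
	data, err := os.ReadFile(path)
	if err != nil {
		return cp, err
	}
	return cp, json.Unmarshal(data, &cp)
}

func main() {
	path := "/tmp/readiness-checkpoint.json"
	_ = save(path, readinessCheckpoint{ReadyPods: map[string]bool{"default/web-0": true}})
	cp, _ := load(path)
	// On startup, previously-ready pods would be assumed ready until a probe
	// proves otherwise, which is the part called "really fraught" above.
	fmt.Println(cp.ReadyPods["default/web-0"])
}
```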
D
I don't think it'll actually help. I think, fundamentally, the issue is people want things to magically work in ways that, like, the architecture will limit them from working. If we make this always work by default, then we cause broken behavior in other cases, and then we get a bug about that. It's not possible to make everything magically work the way that people want it to, I think.
A
Yeah, but it's... I think we need to triage it into...
D
If it can't be reproduced, then we shouldn't mark it triage/accepted. We should mark it, I think, triage/not-reproducible.
D
There's a triage label for that, is there? Yeah, needs...
D
It has a documentation label on it, so it might be something... I might have a comment on here. So it sounds like something I would do.
D
I think this is a misunderstanding of the pod lifecycle, and so, like, we should document this so people stop asking about it and getting confused.
D
For this one, if we put it in the needs-information column, Sergey, can we make sure that we also put a triage/needs-information label on there? Because that's what I use to manage them.
D
I think that there's, like... we have a few issues on sysctls floating around right now. I think it would be good to have a sort of more unified strategy for how we want to deal with them. This is clearly an issue affecting a bunch of people trying to use Kubernetes, but we don't really have anybody who owns that right now, and I don't think we should try to, you know, address these things sort of piecemeal, or with bits of documentation or small fixes here and there.
D
I think we just need to determine, like... I know that there was, for example, a request to, like, change the list of whitelisted sysctls, or allow-listed sysctls. I don't know if that's been taken up by anybody, but that was discussed previously. So...
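For context, a hedged sketch of how a pod requests a sysctl today: anything outside the default safe set has to be allowed per node via the kubelet's --allowed-unsafe-sysctls flag, which is the kind of allowlist change referred to above. The sysctl and image names below are just examples.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// sysctlPod builds a pod that asks for a sysctl through the pod security
// context. net.core.somaxconn is not in the default safe set, so the node's
// kubelet must explicitly allow it (--allowed-unsafe-sysctls) for this pod
// to be admitted.
func sysctlPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "sysctl-example"},
		Spec: corev1.PodSpec{
			SecurityContext: &corev1.PodSecurityContext{
				Sysctls: []corev1.Sysctl{
					{Name: "net.core.somaxconn", Value: "1024"},
				},
			},
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "registry.k8s.io/pause:3.9", // illustrative image
			}},
		},
	}
}

func main() {
	fmt.Println(sysctlPod().Spec.SecurityContext.Sysctls[0].Name)
}
```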