From YouTube: Kubernetes SIG Node 20220316
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A
Yeah, it's March 16th, 2022, and this is the weekly SIG Node CI sync. Let's get into the agenda. We have a couple of topics on the agenda today. Imran, you go first.
A
Oh okay, so I just added this item to the agenda. I think the plan was to move the lock contention flags to the config file instead of command line parameters.
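For context: the flags being discussed are the kubelet's --lock-file and --exit-on-lock-contention. A minimal sketch of what the config-file equivalent could look like, with hypothetical field names (the actual names would be settled in the API review that comes up below):

```yaml
# Sketch only: the two field names are illustrative guesses pending API
# review; the command-line flags they would replace are real.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
lockFilePath: /var/run/kubelet.lock   # today: --lock-file
exitOnLockContention: true            # today: --exit-on-lock-contention
```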
B
No, no, not like the pre-submit jobs; the test jobs for it. Are they passing now?
A
I think they got fixed. The problem was that the test was verifying that the kubelet would restart after it went down because of the lock contention. The problem is that something went wrong with the image, and the image wouldn't restart the kubelet automatically. So we just removed the check for the kubelet being restarted from the test. I think it will need to come back when we bring everything back together.
A
Especially when we switch to the config file, and we start changing the config file during the test and changing it back after the test, then we will need to make sure that the kubelet restarts. But for now it's good, so yeah. This is green; the lock contention functionality works. The restart of the kubelet is a system thing, like systemd or something, system settings that may not work.
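A rough sketch of what the test exercises, assuming the flags above and a systemd-managed kubelet (the unit name and lock path are illustrative):

```sh
LOCK=/var/run/kubelet.lock

# The kubelet is assumed to have been started elsewhere with:
#   kubelet --lock-file="$LOCK" --exit-on-lock-contention ...

# Grab the lock from a second process; the kubelet should detect the
# contention and exit.
flock "$LOCK" -c 'sleep 1'

# The node image (e.g. a systemd unit with Restart=always) is expected
# to bring the kubelet back -- the step that broke and whose check was
# removed from the test.
until systemctl is-active --quiet kubelet; do sleep 1; done
```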
B
Yeah, I can take a look. This one is an API review, so feel free to assign it to me.
A
Okay, this is done. I think once this is merged, the next step will be to change the lock contention test to use the config file and move it into the serial lane, so we don't need to run a separate lane just for this test.
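As a sketch of what moving into the serial lane means in practice: e2e specs that carry [Serial] in their Ginkgo names are skipped by the parallel lanes and picked up by a lane that focuses on that tag, roughly:

```sh
# Illustrative invocation; the real lanes encode the same focus/skip
# regexes in their prow job configs.
make test-e2e-node FOCUS='\[Serial\]' SKIP='\[Flaky\]'
```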
A
I think any test that you run, basically any end-to-end test, will be okay. One question would be whether this image will match the image that containerd uses. It's not a big problem, but in the worst-case scenario the garbage collector logic will kick in and remove the wrong image.
A
Yeah, is there a PR for the change?
A
Yeah, so answering your question: any end-to-end test will suffice; most of them test that the pod will be started, so it should be fine. The mismatch can be caught in pull request review, so you may need to double-check that the runtimes are configured with the proper image as well.
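The image in question is the sandbox ("pause") image: the kubelet exempts from image garbage collection only the pause image it is told about, so it must match what containerd is configured to use. A sketch, with an illustrative tag:

```toml
# /etc/containerd/config.toml -- the image containerd actually uses
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "k8s.gcr.io/pause:3.6"  # tag is illustrative
```

The kubelet side has to agree (e.g. --pod-infra-container-image=k8s.gcr.io/pause:3.6); if the two differ, image GC can delete the image containerd depends on, which is the "remove the wrong image" worst case above.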
B
I don't... do we have... we've got Ryan on the call. Ryan, do you know?
D
No, because the test images don't change. For the moment this is basically a new proxy in front of gcr.io, so it's not impacting the test images. This is only for the images promoted from the staging repository, and the ones migrated in the domain flip we did two years ago.
B
I just wanted to say hi. We have a couple of new names, at least to me, on the call: Tyler, Mina. Do you want to say hi and introduce yourselves to the group?
B
Hey Mina, hi. There we go.
E
Yeah, hi. I'm Mina; I work on the Container-Optimized OS team. I'm just joining these meetings so that if there is anything, any feature, that we need to develop for Kubernetes...
B
Cool. As you can see, we're mostly discussing the state of tests and the continuous integration, or CI, infrastructure for SIG Node, so stuff pertaining to the kubelet and other node-related matters.

I'm not sure... do you want to contribute to helping out with node tests? It's...
E
Sure, I'll just attend a couple of meetings and see whether it is, you know, helpful, whether it benefits us. Awesome, well... welcome.
A
I also wanted to ask everybody: do we want to start taking attendance and putting our names in the meeting notes? We had this blog post published mentioning names, and we weren't sure whether we mentioned everybody who wanted to be mentioned and who, like, visits this meeting regularly.
B
Yeah, we used to take attendance, and I don't know why we stopped; maybe because we just forgot. So I certainly have no objections.
A
Okay. It's not as well attended as the main SIG Node meeting, where it doesn't make sense to have so many names listed. So yeah, just open the document with the agenda and put your name in, if you will.
A
Thank you. So let's look at the table first, and then... so, target health is...
B
I submitted a PR to remove it. There were, like, two jobs that were both on other tabs. Okay, it's apparently...
A
Okay, yeah. We need to double-check why it's still here. But yeah, the part that's still in use there looks good; it's green! So, eviction!
A
Okay, let's move that to triage and discuss it when we get into triage. But yeah, eviction is still not being looked at. I know it got rotten and was refreshed back to alive, and it's been failing for a very long time.
B
Yeah, I know that they were on Danielle's backlog, but I don't know if they are completely fixed. I know that she had at least one PR up for them, I think.
A
You know, Danielle wants to, like, redesign some of it, but the design may take longer, so maybe there is a quick fix possible. We may need to find somebody; like, let's maybe discuss this issue.
A
NPD: we removed at least one job, because the image didn't have the proper version of Ginkgo or something like that, so the job was removed. And for the push image there is a fix in the works, but it wasn't merged because this test was failing. Hopefully it will be clean next time.
F
So I just had one question. When I was going through and doing the triage, I noticed that the right-click "file bug" button on our testgrid points to, like, the global kubernetes project. Should some of the tabs point to, like, different repos, or is the kubernetes project... that's...
B
A great question. Most of the tests that we're going to be looking at for the purposes of this meeting are based on code that's being tested in kubernetes/kubernetes, so kubernetes/kubernetes, or k/k, is the correct repo to file bugs against, yeah.

Unfortunately, this view doesn't really give a lot of information for filing them; I prefer to use the automated one. Sergey, can you go to the triage thingy? I don't know if it's working; I feel like it frequently falls over. But if you go to that triage link, you can use it to triage various test failures and sort of see where a flake started, that kind of thing. And if you click on "file bug" for a test down there, Sergey, perhaps... It links you to a bunch of example failures and, like, a bunch of other info automatically, which is really handy.
A
Yeah, this is a great integration. I know that the reliability working group wanted to do something about it, but I don't think they actually did. And I don't... will this bug be associated with the failure going forward?
B
I'm not sure, I... There's been some talk about, like, raising the reliability bar, which I continue to see come up at community meetings and whatnot, but what action people are taking there is not particularly clear, so...
A
Yeah, tooling is definitely a big part of raising the reliability bar. Like, filling out this spreadsheet is not ideal, but at least it gives you a historical view. I really enjoy this going-back functionality.
A
Okay, thanks! Thank you, Daniel, for bringing it up.
A
Yeah, I think we'll clear them out really quickly, but I think we'll get those.
A
Or just feedback on the...
B
Yeah, oh, I can take a look at this one.
A
Yeah, I remember that; I even approved it. We still need SIG Release to approve, and now it has a conflict, so...
A
But that should be fixed.
A
So there is a possibility to merge this KEP into this release. I think it's a little bit late with the soft freeze, but if it goes smoothly, I don't see a reason to block it.
A
We cannot approve it, but it's a little bit related to what we're doing. It's promoting some of the Windows tests as conformance, because containerd is now supported, and they support the HostProcess... what is it, privileged containers?
A
Yeah, but we just reviewed one from Andrew, I think.
A
Okay, this one is quite straightforward. Any takers?
A
It has to remove a bunch of documentation mentions, and on one of the PRs the Windows team was not happy with removing it. I'm not sure why.
A
Okay, this is the one that made it to conformance. If anybody wants to lgtm, it would be really appreciated.
A
Yeah, the first step is to create a test and run it for two weeks, and then... That one is under node e2e, not the cluster, and node tests cannot be made conformance, so I needed to create a cluster test that exercises that API. This first one is to create it, and then the second one will be to, like, move it to conformance after it's running successfully.
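A hedged sketch of that two-step promotion as it looks in the e2e framework; the feature and spec names below are placeholders, not the actual PR contents:

```go
// Illustrative sketch only; real specs live under test/e2e in k/k.
package e2esketch

import (
	"github.com/onsi/ginkgo/v2"

	"k8s.io/kubernetes/test/e2e/framework"
)

var _ = ginkgo.Describe("[sig-node] FeatureUnderPromotion", func() {
	f := framework.NewDefaultFramework("feature-under-promotion")
	_ = f // the framework handles per-spec namespace setup/teardown

	// Step 1: land the spec as a plain It() and let it run green in the
	// regular cluster e2e lanes for a couple of weeks.
	ginkgo.It("should behave as the API promises", func() {
		// ...exercise the API through the cluster, not the node...
	})

	// Step 2, a follow-up PR: re-tag the same spec with
	// framework.ConformanceIt, which appends the [Conformance] label
	// that conformance runs focus on:
	//
	//   framework.ConformanceIt("should behave as the API promises", func() { ... })
})
```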
B
You can click on the job runs, and each run... it's once per day, so every dot there is a day. So that gives you kind of an idea of the scale.
A
Is that... we should bring it back for sure, right? Okay, I think we need to file a bug to investigate.
A
Okay, so this is a follow-up from this out-of-memory one. I think we just need...
A
Now you moved it to bug, but I think it's a feature request to limit inodes.
A
Okay, I'm not sure. I know there are two bugs. One bug is related to cAdvisor taking too much time to traverse all these directories. The other is an inode noisy-neighbor: you can exhaust all the file descriptors and, like, inodes, and nobody can stop you. Do you know which one this is?
B
I didn't read this in depth. I mean, a deep directory tree isn't necessarily gonna exhaust inodes or cause a noisy-neighbor issue. There is definitely the part where it might take too long, but I don't think it's merely in cAdvisor; I think it's also in the container runtimes.
A
It may not even be a cAdvisor issue, just... yeah.
B
Like, I would say that if the OOM killer is triggered because of the kubelet's behavior when trying to remove a deeply nested directory, that is a bug; that shouldn't happen. An obvious, easy fix for something like this would be to bound or validate the depth of directories that it's possible to make here, so you can't, like, denial-of-service attack the kubelet this way. I don't know, I think of it more as a bug. Maybe the way that we fix it needs a feature proposal, but this strikes me as a bug.
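A hedged repro sketch of the pathological case being described, run from inside a pod's writable volume (the depth is illustrative):

```sh
# Descend as we create, so each mkdir uses a short relative path even
# though the resulting tree ends up far deeper than PATH_MAX allows in
# one path; traversing or deleting this tree afterwards is what gets
# slow or trips the OOM killer.
mkdir deep && cd deep
for i in $(seq 1 100000); do
  mkdir x && cd x || break
done
```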
A
Okay, we're taking a little bit too long here.
C
Okay, let's...
B
I feel like the kubelet having a hard, like, dependency on time being set accurately, like chronyd or whatever NTP being set correctly, as far as I'm aware, that's, like, a generally agreed-upon requirement for nodes.
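A quick sketch of checking that requirement on a node with standard tooling (assuming systemd and chronyd):

```sh
timedatectl show -p NTPSynchronized --value  # "yes" when systemd sees the clock synced
chronyc tracking                             # offset/stratum details when chronyd is in use
```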
B
They're just saying that the node won't create containers after it's already been, like... I don't know, it's unclear in this case if a reboot was involved or if they just went and, like, set the time back.
B
If a reboot's involved, I would say take the node down for maintenance and make sure it has a reasonably accurate time on it. If they are asking, like, "can I just set the time back on my node?", no, that's not supported. But in any case, I don't think it makes sense to entertain trying to support this use case.
A
Yeah, I tend to agree that correct time should be a requirement. I just wonder whether a one-minute skew is enough to trigger this issue or not. If one minute is enough to trigger it, then it probably needs to be addressed in the kubelet, because I can see how clock skew can happen and the clock gets changed one minute back.
B
The other thing that concerns me a little bit is that I don't think they've reproduced this on a recent version of Kubernetes. On the PR that you just opened, which was rebased, the person who tested it said they tested against 1.20, which isn't a supported branch. I would want to see them reproduce this after the pod lifecycle refactor, at the very minimum, because maybe things have changed since then.
A
I think this is reasonable. And yeah, the PR that you mentioned didn't have any test confirming that it was failing before, so I have a hard time accepting it.