From YouTube: Kubernetes SIG Node 20220323
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A: Hello. It's, what is it, the 23rd of March 2022. Welcome, everybody, to the weekly SIG Node CI subgroup.
A: We have quite a few items on the agenda, and then we'll go to regular triage. First... I don't know, but he can't join, so: yesterday there was a pull request that added the new registry into the list. Unfortunately, this pull request broke many CI jobs, and when I was filling in that table it wasn't clear whether a job crashed because of that or because of something else. So maybe we need to refill it a couple of days later. The reason for the breakage is this version two here, but there is still a version-one-style configuration here and here. So yeah, that's why it broke, but there is a fix PR already merged, I think.
A: Yeah, so the fix PR has merged. Hopefully many jobs will get back to green. But the main idea here is to let everybody know that we will be switching to registry.k8s.io. Right now it's just an outside redirector to a GCR registry; in the future it will be smarter. It will likely be different for different clouds, so we'll have some other smartness built into it, and in general it's great that it's not a GCR name any longer. So yeah, we're moving towards a better future.
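[Editor's note: the switch A describes, from the old GCR host to the registry.k8s.io redirector, amounts to rewriting image references. A minimal Go sketch; the two hosts are from the discussion, but the helper itself is illustrative, not the actual migration code.]

    package main

    import (
        "fmt"
        "strings"
    )

    // rewriteImage moves an image reference from the legacy GCR host to
    // the new registry.k8s.io redirector. Purely illustrative; the real
    // migration happened in the CI job configs.
    func rewriteImage(image string) string {
        const oldHost = "k8s.gcr.io"
        const newHost = "registry.k8s.io"
        if strings.HasPrefix(image, oldHost+"/") {
            return newHost + strings.TrimPrefix(image, oldHost)
        }
        return image
    }

    func main() {
        fmt.Println(rewriteImage("k8s.gcr.io/pause:3.6"))      // registry.k8s.io/pause:3.6
        fmt.Println(rewriteImage("docker.io/library/busybox")) // unchanged
    }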
B: Sure, so this was a relatively old failure. I need a review on this change. Basically, after trying to run it, I was able to reproduce the flakiness, and on this failure the Ubuntu swap job was failing due to some conditions not being met.
B: So I tried running this on the Ubuntu one. There were some failures with the timings. I just increased the timeout so that it works now, and the time differences were very, very small, not a couple of minutes, just a couple of seconds, so this additional minute should be enough.
B: No, it took enough time. I mean, my local machine was failing as well. I increased it by just a couple of seconds and it worked. We can leave one minute more. Okay.
A: Finally! I hope it will fix it and we'll forget about this for some time.
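[Editor's note: the fix B describes, adding a minute of headroom to a flaky wait, typically looks like the following. A minimal sketch assuming the test polls with apimachinery's wait.Poll; the condition and durations here are hypothetical.]

    package main

    import (
        "fmt"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
    )

    func main() {
        // Hypothetical condition: the real test checks that swap-related
        // node conditions are met; here we just become true after ~3s.
        start := time.Now()
        conditionMet := func() (bool, error) {
            return time.Since(start) > 3*time.Second, nil
        }

        // The fix described above: the old timeout was tight enough that
        // a few seconds of slowness flaked the job, so add a minute.
        err := wait.Poll(2*time.Second, 2*time.Minute /* was 1 minute */, conditionMet)
        fmt.Println("err:", err) // nil once the condition is met in time
    }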
A: Okay then, let's go through... yeah, I don't even know whether we need to go through that, because many tests started failing because of the config change. Oh yeah, actually this one is interesting. I'm not sure where this is coming from: on the sig-node-critical tab there is now this kubetest2 one that is now failing.
A: I think it's also failing because of... yeah, let's just wait for this image config to propagate and then we'll revisit this testgrid again. Maybe next time it will be better, because things that never failed before started failing, like this one; presubmits started failing, and there are new failures on this tab for sure. Anyway.
A: Okay, okay, this is [unclear].
A: Yeah, they want the feature gate turned off, and actually all the other tests wouldn't mind this feature gate being turned off, because nobody uses it. No tests are using the credential provider, except maybe one or two.
E: Yeah, I'm not... I'm not super aware of this one.
D: I took a look at it and I LGTM'd it, but...
A: Okay, this is conformance promotion. I...
A: Yeah, this is a new issue I filed. I think David added this skipping logic here, and now... I didn't put a link to testgrid.
A: Yeah, now the RuntimeClass test is constantly skipped. This skipping logic was introduced so as not to run the test when the test handler wasn't pre-installed on the machine, and I think it checks for the provider being GCE. I think this logic is faulty and we need to do something about it.
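[Editor's note: a sketch of the provider-gated skip A describes; the helper and its names are assumptions, not the actual test code. The point is that gating on the provider being "gce" skips the test everywhere else, even where the handler is installed.]

    package main

    import "fmt"

    // skipf stands in for the e2e framework's skip helper (hypothetical).
    func skipf(format string, args ...interface{}) {
        fmt.Printf("SKIP: "+format+"\n", args...)
    }

    // maybeSkipRuntimeClassTest sketches the provider-gated skip being
    // discussed; the real helper and field names in test/e2e may differ.
    func maybeSkipRuntimeClassTest(provider string) {
        // The faulty assumption: the test handler is only pre-installed
        // on GCE images, so skip everywhere else. On any other provider
        // the test is then constantly skipped, even when the handler is
        // actually present on the machine.
        if provider != "gce" {
            skipf("test handler not guaranteed on provider %q", provider)
        }
    }

    func main() {
        maybeSkipRuntimeClassTest("aws") // always skips, handler or not
    }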
A: Okay, it's...

C: Well, it's... just ping this one.
C: Yeah, I haven't checked it since I created this, but I assume that it's probably still flaking in testgrid.
C: I think Peter was looking at this one. Maybe... Peter, you're here, right? Do you have an update?
C: Yeah, what's happening with this one is that... basically, it relies on there being a certain number of pods on the node in the serial end-to-end test, in order to schedule the right number of pods for the test to work. And the problem is, it seems to keep changing how many pods, like just system pods, are hanging around on the node requesting CPU. So previously we were having an issue where sometimes pods would be requesting 50 millicpu.
C: So, like, fine, we changed it, and now we're seeing that sometimes there's like a hundred millicpu of pods, or 150, hanging around on the node, and it's like: where are these system pods coming from? I don't know; they didn't used to be there, but now they're there. Is this an issue with system pods not tearing down properly, or some other accounting issue on the node? I don't know, but that seems to be why this one occasionally flakes: it'll never schedule the pod, because there's not enough space for it.
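[Editor's note: to make the accounting concrete, the test has to subtract the resident system pods' CPU requests from the node's allocatable before sizing its own pods, and that resident amount keeps moving (50m, 100m, 150m). A minimal Go sketch of the arithmetic, using plain millicpu integers rather than the real resource.Quantity types.]

    package main

    import "fmt"

    // spareMilliCPU returns how much CPU (in millicores) a test can still
    // request on a node after the resident system pods' requests are
    // subtracted from the node's allocatable.
    func spareMilliCPU(allocatable int64, systemPodRequests []int64) int64 {
        var used int64
        for _, r := range systemPodRequests {
            used += r
        }
        return allocatable - used
    }

    func main() {
        // If the test plans around 50m of system pods but 150m are
        // actually resident, its last pod no longer fits and it flakes.
        fmt.Println(spareMilliCPU(2000, []int64{50}))      // planned: 1950
        fmt.Println(spareMilliCPU(2000, []int64{100, 50})) // actual: 1850
    }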
C: Okay, oh, if it's a conformance test... sorry, this is triggering something in my brain. Conformance tests aren't supposed to rely on events.
A: Yeah, I understand that, but I think it's [unclear]. Okay.
A: Yeah, I suggest we move to bug triage unless there is more. Oh, by the way, I wanted to update on this performance flake. This COS performance thingy flaked yesterday, and I started looking into the timestamps and what happened; I didn't find anything, and now it's green again. If it's the same again, you can look at performance really quick and double-check on that.
G: [partly unclear] The crash... a PR was put in that fixed it; not a new bug, but in core, in the kubelet.
C: I think, like, maybe somebody suggested a blog post or something like that to do this. I don't understand why we're getting so many bugs saying, like, "by the way, I set my kubelet back, like, two days in the past, and then it stopped working." Like, that's not a thing you should do.
A: Yeah, I think Matias told the scenario last week about a Raspberry Pi that loses connection and just stays stuck in time.
C: Yeah, like, I think that... I mean, maybe we should make an update to our docs saying that this isn't supported, but, like... no, just, yeah, don't do that. But the thing that I'm, like, mildly concerned about is that we're suddenly getting this onslaught of bugs being like:
C: "I changed the time of my kubelet and it didn't work, and this is a bug." And, like, I don't know, if there was a blog post or something suggesting that you do this, then we should maybe say "don't do this" too, or something. But this is, I think, the third or fourth bug I've seen saying "I changed the kubelet wall-clock time on the node by, like, a large amount, like a day, and then it didn't work properly." And I was like: that's not supported. Why?
F: Yeah, because doesn't it have a thing that, like, checks some heartbeat time? If it's further in the past than some threshold, something's wrong: "I'm not ready." Oh, that thing updated, because it was the time that changed; now we're good.
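[Editor's note: a wall-clock staleness check of the kind F describes behaves exactly this way: it trips when the clock jumps and self-heals once a fresh heartbeat lands under the new clock. A minimal Go sketch, not the kubelet's actual code.]

    package main

    import (
        "fmt"
        "time"
    )

    // stale reports whether a heartbeat timestamp is too far in the past
    // relative to the current wall clock, the style of check described.
    func stale(lastHeartbeat, now time.Time, threshold time.Duration) bool {
        return now.Sub(lastHeartbeat) > threshold
    }

    func main() {
        heartbeat := time.Now()
        threshold := 40 * time.Second

        // Normal case: the heartbeat is fresh.
        fmt.Println(stale(heartbeat, time.Now(), threshold)) // false

        // The wall clock jumps a day ahead: the node looks not ready
        // even though nothing actually failed.
        jumped := time.Now().Add(24 * time.Hour)
        fmt.Println(stale(heartbeat, jumped, threshold)) // true

        // Once a heartbeat is recorded under the new clock, the check
        // recovers: "that thing updated ... now we're good".
        fmt.Println(stale(jumped, jumped.Add(time.Second), threshold)) // false
    }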
A: Yeah, so I mean, it's a surprise for me that we even recovered in this case. Another case, that somebody else posted and that we discussed last week, was time changing backwards, and in that case some ordering of pods somehow gets broken, and because of that... we completely rely on this ordering to do some, like...
A: I don't remember what exactly it was in that bug, but this ordering basically led to an unrecoverable error for the kubelet, and they claim that it even happened with a one-minute shift back, which can be a legit change. But yeah, we probably need to get back to the bug and understand what's going on.
C: Given the version here, there was this inotify dependency that we upgraded in a bunch of versions, but I don't know if we backported it back to 1.20; I think we didn't, or if we did, it was recent. Like... Ryan, do you remember what's going on with this?
C: Yeah, I think 1.20 being out of support is the issue. I'm pretty sure this is that inotify memory leak, and I think it's just not going to get fixed in 1.20.
C: I mean, cAdvisor will use a lot of memory and CPU if you let it, but that's typically on, like... oh, there were...
F: Ten thousand blocked cAdvisor goroutines, oh...
C: Oh boy. But that is different than, potentially, this thing. I linked the issue in the chat, Sergey. This was not... I think it was backported to a bunch of supported branches, but I don't think it... yeah, it got closed because 1.20 is out of support.
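[Editor's note: for the goroutine pile-up F mentions, a Go process built with net/http/pprof exposes a goroutine profile that makes such leaks visible. A self-contained sketch; the port is illustrative, and on a real kubelet the debug endpoints sit behind its authenticated port.]

    package main

    import (
        "fmt"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* handlers
        "runtime"
        "time"
    )

    func main() {
        // Expose pprof on an illustrative local port.
        go func() { _ = http.ListenAndServe("localhost:6060", nil) }()

        // Simulate goroutines blocked forever, like the ten thousand
        // blocked cAdvisor goroutines mentioned above.
        block := make(chan struct{})
        for i := 0; i < 100; i++ {
            go func() { <-block }()
        }

        time.Sleep(100 * time.Millisecond)
        fmt.Println("goroutines:", runtime.NumGoroutine())
        fmt.Println("inspect: go tool pprof http://localhost:6060/debug/pprof/goroutine")
    }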
C: And last week it was kind of a pain in the butt for, like, PR review. I'm like: submit comment, push button, and then it's like "there was a problem with your request", or "you can't comment at this time", or, like, any number of... Oh, yesterday it was super annoying, because I kept commenting and the comments worked fine, but then, like, the hooks for CI weren't working. So I was just like: what is going on?
A: Yeah, it wasn't related to this exact Prow timeout.
C: Also sad when the latest update is just the stale bot, but that seems to be what's going on with these.
C: Yeah, like the other one that we saw: they were shifting, I think, the time into the future, and they were finding that the node went NotReady. And in this case they're shifting it into the past, and then, like, they're finding that they can't create a container, but, like, I think it recovers eventually. You can't do that.
C: Like, I've seen this sort of thing happen in, like, service environments, when I was running Kube as a service in various clouds, and there are sometimes weird timekeeping problems, and they do cause weird failure modes on clusters. But, like, you've got to keep time; that's kind of a base assumption.
A: Okay, it still needs information, and if there is any repro, even one minute back and one minute forward, then maybe we can try to look at it. Okay, I think we're done with bug triage.