Kubernetes SIG Node, 14 Oct 2021

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Kubernetes SIG Node 20211014

Description

Meeting Agenda:

https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU

A

Hello, everybody happy cubicon week. It is thursday october 14th and we are the ci subgroup for sig node. uh We have a few agenda items for today. uh I don't think that we currently have uh francesco or imran with us, so I'm gonna skip those for now. I paint them in slack just in case they might join us um mike troubleshooting container d, 1.4, canaries. What we got.

B

Sure, uh first of all, for a little more context, this derives from a job I was creating for uh having container one continuity, 1.5 calories after some suggestion to move them to another directory.

B

I noticed that the container the 1.4 canaries are failing and if you go to the template- and you click any of those you'll notice- that the failures are because the cluster is not started well, the clutter is not starting, but also when looking at the code of the of the node logic I it looks like it expects the node to already be present. It doesn't create it.

B

Oh the first one always works, uh because it's just the build, but the other two they keep failing. I see.

A

Those do look like they're failing.

B

I think I added some of the logs on the document. I don't recall.

B

Oh, it looks like somebody once removed them.

A

It's heard rational for why this should be removed.

B

Yeah, this is the old.

A

Maybe it already was removed.

B

No this this is the um around me to move my my original work to another directory, and this is the old directory.

A

I see so it's trying to remove a file that got moved.

B

A

So is there any action that we need from this group or if it's just an fyi.

B

uh It's it's an fyi, um I'm thinking like we have to create a cluster before the actual test starts. I'm not really sure about this, but that's that's what I'm thinking from from bringing the logs.

A

Sounds good well, it looks like this is approved now.

B

Yeah, I don't want to submit it, because I'm I'm expecting that this will fail as well. If we don't.

A

Cool makes sense, uh I don't know any history about this job. uh Maybe we should.

A

Cc uh dimms and danny.

A

Because I am not working on uh container d things primarily, okay, that's that uh next item memory, pressure testing with swap enabled how can we run tested machines with swap enabled ah that's a great question? um We do have some uh configs for this- uh that I comma put together in test infra.

B

A

Let me show you what.

A

I just don't remember the job name is so it's in here. I guess okay! Well, this is good enough. um There's this node args image, config swap uh so. Basically, if you take a look at this, I think it's the jobs, eating node, swap config.

A

There's a couple of configs here and basically as long as you pass this config for the nodes, um it will like use this metadata to make a swap partition. So there's one for uh ubuntu and there's one for fedora core os, um and then you can see that like, for example, this one just runs a command to make a swap.

B

Partition and turn it.

A

On uh and then similarly, uh this one just has an ignition file that turns on swap.

A

Oh, no, I think I think we were just tagging them with, like probably uh so. Currently we don't have any tests that are specific to swap. Rather, what we've been doing is we just run all of the tests in a swappy environment, um or at least like all of the standard edes, so the selectors for the job that I showed you they were just all of the standard edes.

A

We don't have any specific tests for swap yet like the idea is, while we run these tests with swap, we do want to start. Having uh like we want to be able to like, for example, have some nodes that have uh like exercise different swap configs, but that's all going to be like infrastructure configuration and not uh like test level configuration.

B

We need to like.

A

Set these values in the nodes when they boot up and then we want to run the tests so right now at least most of the memory tests uh for swap like there's no specifics today, there's no specific test selectors for swap tests that may change as we start adding new categories of tests where we both need the swappy environment and we have special tests that we want to run on swap, but currently there's no swap specific tests and the reason is in part because the swap changes were very, very limited in terms of what change in the code like it was.

A

Basically, here are some new configuration values and the actual code change itself was here's a new value to the cri which, like you know, I think, currently still the cris, don't actually support doing anything with that value. They drop it on the floor, both cryo and container d. So um that is the state of the app uh it doesn't look like we got francesco or imran, but I guess let's go up the agenda.

C

They just brought back on on slack, so francesco is not joining that we can ping him if we want and imran is sick. So I I.

A

Just replied to them.

C

And say we postponed to next week and.

A

Yeah, I think, that's fair, so why are we meeting next week, 20th.

A

Any other agenda items for today.

B

I think it's totally half in myself, cool, okay,.

A

Then, uh let's do, um let's do some quick bug, triage.

A

And uh let's maybe also take a quick look at the test board.

A

A

Okay, I guess we just have this one flake that hasn't been triashed.

D

A

A

Ryan's pr, okay, let's take ryan.

A

Someone has assigned themselves to fix this. uh Let's check governator, hopefully we'll crash. That was okay.

A

um I have to make flakes lately, but it definitely was flaking.

C

Is it a machine gun or a keyboard? Yes,.

A

That is my model m.

C

You mean a real one,.

A

Vintage, a real nice.

A

Yeah, I've used a unicomp, it's really not the same. I rescued this one from ebay like well when I was in university so a while ago uh and I oh, it was filthy when I got it uh and I cleaned it and uh yeah I've taken good care of. It still works great and yes, it's very loud. My apologies.

E

A

Okay, so uh we've triaged that one, I guess that's to do. I don't know what else there is to take a look at on this board. I guess: are there anything that has lgtm on it a couple things.

A

A

But yeah, uh I don't think, there's anything critical urgent on here or anything like that. Is there.

B

You can move that continuously 1.5 calories to waiting on other, but.

D

Just on my side,.

A

This board is so long, I'm just going to do it here.

A

Okay and I'm not gonna, I guess why does this if this fixes a flake? Why does this not have area test on it? ah Maybe we can add it to area test.

A

Let me get this onto the test board.

D

A

Okay um and then, let's take a look at bugs.

A

No new bugs this one has been triggered already. Oh and it's urgent. uh How many do we do? uh Oh sorry, uh one second, I need to be right back.

A

uh Let me stop my video for a second uh y'all can continue without me. If you want, um I don't know how long this is going to take. uh Let me stop sharing my screen for a moment.

A

uh Let me does either of you want to share your screen.

A

Sure I'll give you a co-host. Okay, give me one. Second, okay, I'll be right back.

B

One second, I'm just moving in there.

C

You don't have the co-host.

B

Yes, trying, though,.

B

Something's going on.

B

I think it's ready perfect. Yes, that's! It is this on on the I guess, this is on the main repository right.

C

You're, looking for the board, oh yeah, nice.

B

Yeah, I don't think this is on this. No.

C

No, no, it's it's the 43. I think.

C

We try to find there's.

B

Nothing anymore.

C

C

Yeah, it's either the 43 or the 49. It depends if you want the node one or the ci yeah.

B

D

Sorry, I'm back.

B

I totally accepted.

B

I guess we should just move this to waiting on offer. Yeah.

A

Which board are we looking at.

C

The ca1, I think, no.

A

uh Well, this is a pr.

C

Oh yeah we're looking at the box yeah right.

C

It still has a hold.

A

Can you find the comment where the hold was applied.

A

ah Yes, wrong board: this is the pr triage board. There's this other one.

C

Yeah, it's my fault. I get the wrong bling.

A

Oh, that's: okay,.

A

The bugs board is this: one.

A

Some background from this one is uh so in sig instrumentation. uh This is a instrumentation cap. Like the event api got redesigned a very long time ago.

B

Maybe that's I don't know something's going wrong with it. Oh okay.

A

Can you hear me now.

C

A

Oh he's left us.

A

uh I will start shooting my screen again.

A

uh I just wanted to go back to this one.

A

Not that one, that's apparently the first one, because the search engine is 383.. uh I wanted to give you a little bit of background on this one while we were staring at it. um So this is a very old cap.

A

uh We basically like the event api was rewritten and then in theory, so like this one beta in one eight and then they tried to ga in 119, uh but I argued that you can't call it ga unless you've like deprecated the old thing uh and so there's still a bunch of things that are not using the new event api uh and every time I've asked someone to step up and take ownership of this feature and please finish migrating the rest of the things uh it's been kind of hot potato, so um yeah.

A

It looks like somebody commented yesterday and removed the stale from this, uh but like uh it's as far as I'm concerned, it's in beta, it's not stable because, uh like initially it was started by merrick uh and then chelsea chen, and she has not worked on this. Basically at all, uh since it was handed over. So I think we might need to find a new owner, uh but yeah. This is kind of a backlog thing, I'm not sure, like technically it's owned by sig instrumentation, uh but there are stakeholder sigs and uh I'm not sure.

A

I I feel like this. It's a very old old. I guess it's got the old design proposal and all that so like it would have to be migrated to the new template and yada yada. I think that honestly people should just do that. They don't need a cap to do the migrations, um but we can't really call it ga until it's fully migrated. So anyways, that's just some background on what what is going on there with the event api.

A

A

Let's look at some bugs uh because prs will take a long long time. uh How did this pr get in here? Maybe schedule like it's rotten so uh let's go through. I think there were no new bugs to add to the board so.

A

Scheduler being topology unaware can cause a runaway pod creation, wow.

A

I guess this guy reopened francesco.

A

uh I think the issue is, it keeps getting close because nobody's removed the rotten label, so I'm gonna do that and then put it into triage.

A

Look at that the bug put it in the right or the bot put it in the right column. How useful okay! I will triage this.

A

uh We've got a bunch of these oh and there's another one. That's already triggered so great.

A

Okay, orphan pod found but error, not a directory occurred when trying to remove the volume stir. That sounds like storage.

A

I think this is sick storage.

A

Yeah, why would that have a file in it? That doesn't make very much sense to me. This seems like a storage thing.

A

I feel like this code is owned by.

A

D

A

Kill its memory capacity and dynamic ram in hyper-v? Okay, I don't know what dynamic ram is.

A

I don't know that we.

A

Support dynamic ram.

C

A

I'm just going to label as a feature request.

A

Okay, node lifecycle, controller. All positive mark does not ready making workhold services unavailable during network partition between master and worker nodes.

A

Okay, that's normal.

C

Why, on depots on work for eviction, I don't understand.

A

Yeah, I don't. I don't think that that makes any sense. uh I think that they don't understand the readiness behavior if there is a network partition where the entire control plane gets severed from the data plane, there's nothing wrong with the workloads like the workloads are still probably running and ready. The issue is that, from a endpoint perspective from a control plane perspective they're, not ready because it can't check in uh like that is the safe failure mode.

A

So uh when something like that happens, we like I would expect that services which require communication with the nodes would go. Haywire like the issue is that services are part of the control plane, so you can have service outages if, uh like your control, plane goes down and I think that's the issue here. The workloads are fine. uh I I people seem to.

A

I get the impression from getting a lot of these bugs and reading through them. Recently, people really believe that not ready means something other than it actually means, uh and maybe that's like a function of bad terminology, which we probably can't change now, but like readiness does not mean it doesn't mean that the thing isn't actually running. It means that from kubernetes perspective, it's accounted for everything and it's running correctly. So.

A

A

um Here I won't meet soon enough.

A

Does this explanation make sense.

C

Yeah, I think the alternative for him would be to bypass the service and just attack directly the the pod itself, but.

A

Yeah, oh I mean basically the the issue is uh so like, possibly uh so federico. I guess this must have been mark's api machinery, yeah.

B

A

This doesn't make sense why it was api machinery. I agree this is node. um I think that, like the issue here, uh is that, like it's honestly, it's more of a networking thing. It's certainly not api machinery.

C

But again yeah, I I agree. It's not the book.

A

Yeah, I don't think this is a bug.

A

Though will put sig network on here, since this does involve, but.

C

But I wonder what what would happen if you have a service mesh, maybe maybe they can still access well.

A

C

A

Don't want like if your control plane is fully severed from your worker nodes, like no matter what you do that service mesh is not going to work uh in theory, you would want to set up multiple, like failure domains uh in order to like not have a full network outage like that, but if you've got a full net workout, what are you gonna? Do.

A

Okay, so that's that uh enabling feature gate blah causes unit test failures.

A

I don't know why that would be node.

A

Because no tests are failing, doesn't look like it.

A

Okay, I'll remove that one from the board.

A

Cubelet services not coming up because inode consumption is increasing.

A

A

I don't know what's going on here.

A

This one needs more information, there's no way that we can debug this.

A

Soap path, sources never match when containers are recreated.

A

It seems like a sick storage thing.

A

A

A

A

A

Sorry here help me.

A

I hate triage accepting my own bugs, but.

A

This one is definitely a bug.

A

Okay, unable to start system d based containers, definitely seems like support.

A

Help my container is not running is not a bug.

A

Log show verify container status failed, are flooded with these entries.

A

Now someone's looked at this.

A

Peter's working on this.

A

Npd yaml failed, incento, s4, journal path, blah.

A

This seems to be a node problem. Detector thing uh mike, is there someone that I should assign this to.

B

I don't think we have any active maintainers in there.

A

Okay, that's if it's like, I feel like. If this is a node problem, detector thing it should be filed against that repo, not kubernetes kubernetes.

A

uh So yeah, I agree.

B

With the signal.

A

Not cloud provider.

A

And I think that's it hooray.

A

We have 10 minutes to spare. uh Should we go through any of these ones to see if they've been updated.

A

It looks like some of them have updates, but have they been updated since.

A

We last looked at them.

A

ah This just needs information. This is a feature request, not a bug, uh pod stuck terminating. Let me ask for more information. Oh somebody has a repro.

E

A

I feel like this is just not a supported setup.

A

This seems like an issue with the api server proxy and not the cubelet.

C

What what is the the linked issue command does not restart static pods when ap server is down.

A

Sorry, what did you say.

D

That the the just the the dimension, just just one line above.

C

Maybe it's not. It's still happening even without the proxy. Now I don't know.

A

No, he said he could only uh reproduce.

D

A

The api server proxy so.

B

A

Like it's not really a cubelet bug.

A

uh Also, I know this one has something to do with the pvc. uh We do have another issue where, if there's a network partition, uh I think clayton has a bug uh that uh about and we were looking for a reproducer, so this might actually be handy.

A

What was it I got pinged about this like yesterday. Let me see if I can get the.

A

uh I can't spell: let me just dig up this bug: clayton filed a thing saying that static pods aren't restarting if there's a partition with the api server.

A

I'd expect this to only be happening on.

A

Oh and of course somebody had already mentioned this there as well.

A

And my brain just did not process that.

C

This is exactly what I was trying to say: okay,.

A

A

I did not understand.

A

Okay, so this it looks like this was actually a bug uh and.

A

It looks like this is cryo specific.

A

A

A

It looks like it was also fixed in cryo, so I think we're good uh okay.

A

We could probably see some of those next time, but I think we've done plenty of bugs thanks folks, uh we're doing good work and I hope you have a great rest of your day and, if you're attending cubecon, I hope you have a fun cubecon.

C

D