From YouTube: Kubernetes SIG Node 20220216
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A
Good morning, good day, good evening, whatever time zone you are in. It's February 16, 2022, and it's the SIG Node subgroup meeting; welcome, everybody. We have one agenda item today from Imran and Brandon; let's talk about it.
B
Fine, yeah, cool. So the last time we talked about this, it was regarding the PR on the kubernetes repo for the e2e test for the lock contention flags. So, after addressing your feedback, I found out that the test job for it in the test-infra repo was removed.
B
I
have
mentioned
the
link
to
the
comment
comment
as
well,
which
removed
that
so
I
was
not
sure
what
would
the
next
steps
be
for
this
pr,
and
especially
the
test
job
here,
someone-
I
don't
remember
the
name,
but
that
person,
I
think.
B
Okay, and regarding the PR on the kubernetes repo for the e2e test, I have addressed your feedback. There was a bit of a mess-up where I accidentally pushed an old version of the changes, and it overrode the changes which I had made addressing your feedback, so I also addressed that.
A
Great. Can you remind me what the status is of migrating these flags, these arguments, to the config? I remember this was part of the bigger work that you've been doing.
B
So the idea is that we want these flags and we don't want them to be deprecated, so we wanted to move them to the kubelet configuration. The thing is that this is held up, because we don't have an end-to-end test suite for that.
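For context, the flags under discussion appear to be the kubelet's --lock-file and --exit-on-lock-contention. A minimal sketch of their semantics, in Python rather than the kubelet's actual Go code (the lock path and function name here are illustrative):

```python
import fcntl
import os
import sys

def acquire_lock_or_exit(lock_path):
    """Take an exclusive flock on lock_path, or exit on contention.

    Illustrative analogy for --lock-file / --exit-on-lock-contention,
    not the kubelet's real implementation.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # Non-blocking exclusive lock: raises if another process holds it.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        print("lock contention detected, exiting")
        os.close(fd)
        sys.exit(1)
    return fd

fd = acquire_lock_or_exit("/tmp/demo-kubelet.lock")
print("lock held")
os.close(fd)
```

A second process running this while the first still holds the lock would take the contention path and exit with status 1.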
B
So what this PR does right now is add the end-to-end test, and afterwards we'll, you know, add the test for the same set of flags in the kubelet as kubelet configuration. There's another PR for that which does the actual move of these flags to the kubelet configuration, which will then later be reviewed.
B
Yeah, so this does not require the serial test suite, because, you know, it is disruptive in nature: it restarts the kubelet. Also, for now we add two extra flags to the kubelet which are not required by, you know, the other test suites or other tests in the serial test suite.
A
I was just curious about the long-term plan. Oh, yeah, ideally we need to run most of the tests as a single suite; it's more efficient. Okay, thank you, I will take a look. Any questions on that topic?
C
Yes, sounds great, and this week is much better. Can we scroll down a little bit? The highlighted sig-node-critical jobs are all green now, which is recovered from last week, and the other parts are almost the same as before. For the sig-node-containerd boards, I think they are nice this week compared to last week's notes, and for sig-node-cos it's the same as last week. And for NPD...
A
Yeah, so the PR for NPD is from Varsha. I think she is working on that, but there were some comments that need to be addressed. And great progress on cgroup v2, if you haven't seen that: the problem with cgroup v2 was that the ordering of containerd installation and configuring of tests was messed up, so containerd and test installation would run in parallel and there would be a lot of races.
A
What would be installed first. So that was fixed, and it fixed many tests. The reason we haven't seen it before is that we've sometimes been passing the container runtime as "remote" and sometimes as containerd; in one case it would do some steps that may race with the containerd installation, and in other cases it wouldn't. So we just never noticed these problems before because of this mix-up in what we pass as the container runtime.
A
Okay, this is it. Any other questions or comments on testgrid health?
D
Can we go and close that one? Because I think that was an underlying issue with a bunch of stuff.
A
Yeah, thank you for asking.
D
So it's, it's like... it's also got a shruggie on it: what is going on with this PR? The reason that there are so many files is that there's a runc bump inside of it; there's a vendor update.
D
Currently... I actually just met with Bartek about this this morning. So the kubelet uses a lot of CPU, a lot, and a huge amount of it is just spent in cAdvisor, recreating the labels for Prometheus metrics: creating the data structures over and over again even though they don't actually change, and then, as a result, spending a ton of time in garbage collection for the old labels, which are the same as the new labels.
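The allocation pattern being described can be sketched as follows; this is an illustrative Python analogy with hypothetical function names, not cAdvisor's actual code:

```python
from functools import lru_cache

def labels_uncached(container, image):
    # A new label string is built on every scrape, even though the
    # inputs rarely change; each call allocates, and the old copies
    # become garbage-collection work.
    return 'container="%s",image="%s"' % (container, image)

@lru_cache(maxsize=None)
def labels_cached(container, image):
    # Same inputs return the same cached object: no per-scrape
    # reallocation and nothing new for the garbage collector.
    return 'container="%s",image="%s"' % (container, image)

a = labels_cached("kubelet", "registry.k8s.io/pause:3.6")
b = labels_cached("kubelet", "registry.k8s.io/pause:3.6")
print(a is b)  # → True: the cached call returns the identical object
```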
D
But the kubelet doesn't know that. So, anyway, Bartek is working on adding a new cached Prometheus implementation in order to collect this a little bit more efficiently, something more like kube-state-metrics, for example. This was my attempt to kind of integrate that, and so this builds, but there are issues, like no metrics are coming out. So my follow-up on this is: I need to write an e2e test that verifies that metrics are being scraped, so that the node test will fail.
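A sketch of the core check such an end-to-end test could perform, assuming a scraped Prometheus text-format body; the helper name and sample metrics here are hypothetical:

```python
def missing_metrics(scrape_body, expected):
    """Return the expected metric names absent from a Prometheus
    text-format scrape body (illustrative helper, not the real test)."""
    present = set()
    for line in scrape_body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blanks
        # The metric name is everything before the first '{' or space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        present.add(name)
    return [m for m in expected if m not in present]

sample = """# HELP container_cpu_usage_seconds_total Cumulative cpu time.
container_cpu_usage_seconds_total{container="app"} 12.5
machine_cpu_cores 8
"""
print(missing_metrics(sample, ["container_cpu_usage_seconds_total",
                               "container_memory_working_set_bytes"]))
# → ['container_memory_working_set_bytes']
```

A real test would fetch the body from the kubelet's metrics endpoint and fail if the returned list is non-empty.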
D
If that's not happening... I can't believe that we don't have this end-to-end test, but we don't; it would have prevented at least one major regression that we had with cAdvisor metrics disappearing in 1.19. So I'm going to write this end-to-end test, and then, yeah, I've just been working with Bartek and Damien Grisonnet on this. So, yeah, anyway, needless to say, you can dump it into... archive it or something; I don't think it's relevant for the test board, but...
D
He knows. Okay, the only person that I would maybe want to get a take from would be David Porter, but I don't think that we need his feedback quite yet.
A
Let me check. Okay, let's see here.
A
I hope that those are irrelevant; otherwise, yeah, maybe wait till the tests pass.
A
Yeah, this minor comment here: I'm not sure why it will fix the test. Oh, this issue, you mean? Right.
A
Here
I'll
check
offline,
but
if
it's
duplicate
it's
a
big
problem
yeah.
I
don't
know
why
this
will
fix
the
test.
But
if
it
will
it's
great.
A
Okay, let me put it out of the scope of this CI board, but see it on the main board.
A
Yeah, so we have legit PRs that need a reviewer, and we still have that one missing an approver; maybe its scope is beyond what we own, yeah. Maybe we need to take it offline and see whether we can make progress on all of them. I think this one is ready to merge, by the way, so if you can take a look.
A
Okay, anything else for the test part of the meeting? If you've looked at everything... if not, we will go to bug triage now.
A
Five days back: standalone kubelet.
D
Yeah, SIG Storage might be missing. If you look, the client is nil; I mean, it's still kind of us. I think we should triage/accept this, but I think someone in SIG...
A
Okay, so it seems that it's the same issue as this one. I wonder what kind of information we can ask for; a full kubelet log may be useful.
A
Yeah
quite
fresh
okay.
I
I
agree
it
seems
like
something
that
is
an
unpleasant
but
legit
state
of
things
I
mean
if
the
cubelet
rejects
admission.
The
job
of
the
controller
is
to
recreate
the
pod,
it's
a
very
similar
to
what
we
have
with
the
technology
measure.
It's
a
when
cubelet
rejected
mission.
You
almost
always
end
up
in
states
like
this.
I'm
not
sure
if
this
is
what
danielle
meant,
but
I
I
do
think
that,
like.
D
Oh, maybe your query just includes non-bugs, yeah. That one was a feature, yeah.
D
Yeah, do we want to triage/accept that? I mean, I guess...
D
For that one, honestly, I'm not sure. It's possible there is yet another bug in the kubelet refactor with the pod worker, or it's possible that they just did something ridiculous and the storage thing can't retry. It's hard to tell if this is a kubelet thing or a storage thing, and I was hoping that maybe SIG Storage could confirm that it's a kubelet thing.
A
Oh, you wanted to comment on this ordering that kubelet does at end of life?
D
No,
I
just
need
to
remember
to
actually
comment
on
this
one.
It's
a
sign
to
me.
I
will
try
to
remember
after
this
meeting.
A
I remember it now: we've been discussing that there is some race in taking these measurements.
D
Okay, so when you have, say, a very I/O-intensive workload, frequently what the kernel will do is take all of the stuff that it would otherwise be reading from disk and cache it in memory. That memory is not associated with the application as far as kernel accounting goes; it goes into kernel buffers. So if you have a very I/O-intensive thing on a system that doesn't necessarily have a ton of memory, or you have an app that's doing a bunch of I/O-intensive stuff...
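The accounting described above is visible in /proc/meminfo, where MemFree excludes page cache while MemAvailable estimates how much memory is actually reclaimable. A small illustrative parser over made-up sample values (the field names are the real /proc/meminfo keys; the helper itself is hypothetical):

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: kB} (illustrative)."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields[key.strip()] = int(rest.strip().split()[0])  # values in kB
    return fields

sample = """MemTotal:       8000000 kB
MemFree:         500000 kB
MemAvailable:   5200000 kB
Buffers:         300000 kB
Cached:         4600000 kB
"""
info = parse_meminfo(sample)
# "Free" looks scarce, but most of Cached is reclaimable page cache,
# so the kernel's estimate of available memory is far larger.
print(info["MemAvailable"] - info["MemFree"])  # → 4700000 (kB of reclaimable headroom)
```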
D
Well, I mean, if they're running other workloads on the node, potentially that could also be competing with Kubernetes; there's a bunch of possibilities here. To me, this is going pretty deep into application-level support. I would say, if they're using Oracle, they should maybe talk to Oracle about what they're doing here. It could also be a bug in Oracle's platform, right? We have no way of telling; we don't have access to those binaries. So those are some of the problems of, like, you know...
D
It might be helpful to explain to them that available memory is not the same thing as free memory, and that the free memory on the system is accurate. Something is causing a bunch of memory to be used by the kernel for buffers or cache, and that does have an effect on the available memory on the system.
F
There's something in cAdvisor that I don't remember exactly, but I remember there is some special logic paying attention to whether it's available or whether it's buffer/cache; I believe it's subtracted or something like that. I don't remember the exact logic off the top of my head, but it was handled somewhere.
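For reference, the subtraction being recalled here is, as far as I know, cAdvisor's working-set calculation: cgroup memory usage minus inactive file-backed (cache) pages, clamped at zero. A sketch with illustrative names:

```python
def working_set_bytes(usage_bytes, total_inactive_file):
    """Working set as cAdvisor roughly computes it: total cgroup usage
    minus inactive file-backed pages, never negative (illustrative)."""
    if total_inactive_file >= usage_bytes:
        return 0
    return usage_bytes - total_inactive_file

# A container "using" 900 MiB where 600 MiB is reclaimable file cache
# has a working set of only 300 MiB:
print(working_set_bytes(900 * 1024**2, 600 * 1024**2) // 1024**2)  # → 300
```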
A
It would be great to understand how we measure that, and maybe ask them to run the same command as we do in the kubelet, to make sure that it's not an error in Kubernetes but actually what is happening on the node: memory occupied by some other processes.