From YouTube: Kubernetes SIG Node CI 20230111
Description
SIG Node CI weekly meeting. Agenda and notes: https://docs.google.com/document/d/1fb-ugvgdSVIkkuJ388_nhp2pBTy_4HEVg5848Xy7n5U/edit#heading=h.2v8vzknys4nk
GMT20230111-180428_Recording_1848x1120.mp4
A
Hello, hello, it's January 11th, 2023, and this is the SIG Node CI meeting. Hello, everybody. We have one agenda item today that is explicit, and it's Francesco's; I just want to take it on.
B
Hello, hey, so, yeah. Let me give you folks some context here. Some end-to-end tests we have, notably the ones related to the resource managers or the Pod Resources API (which are, again, related to the resource managers: CPU manager, device manager, and so on), require some form of device plugin.
B
We have in particular one or two tests that want to check some combinations of parameters which are uncommon; the case in point is a device plugin which does not report NUMA topology information, because, as it is today, the device plugin API does not make it mandatory to report NUMA topology information, so that is still allowed.
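For context: the device plugin API makes NUMA topology reporting optional, so a device advertised by a plugin may or may not carry topology information. A minimal sketch of that shape (the types below are simplified local stand-ins mirroring the v1beta1 API in `k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1`, not the real generated types):

```go
package main

import "fmt"

// Simplified stand-ins for the kubelet device plugin v1beta1 types.
// Field names mirror the real API, but these are sketch types only.
type NUMANode struct {
	ID int64
}

type TopologyInfo struct {
	Nodes []*NUMANode
}

type Device struct {
	ID       string
	Health   string
	Topology *TopologyInfo // optional: a plugin may leave this nil
}

// hasNUMATopology reports whether a device advertised NUMA locality.
// The API does not require it, so consumers must tolerate a nil
// (or empty) Topology.
func hasNUMATopology(d Device) bool {
	return d.Topology != nil && len(d.Topology.Nodes) > 0
}

func main() {
	withNUMA := Device{ID: "dev-0", Health: "Healthy",
		Topology: &TopologyInfo{Nodes: []*NUMANode{{ID: 0}}}}
	withoutNUMA := Device{ID: "dev-1", Health: "Healthy"} // no topology: still valid

	fmt.Println(hasNUMATopology(withNUMA))    // true
	fmt.Println(hasNUMATopology(withoutNUMA)) // false
}
```

A plugin that leaves `Topology` nil is exactly the uncommon but legal configuration the tests discussed here need to exercise.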
B
So we have this need because, you know, we want to check actual features with support for any form of device plugin, and in particular we would really benefit from having a device plugin which does not report NUMA topology information, which is pretty uncommon: all the device plugins I reviewed do want to report NUMA topology information, so few fit this use case. We did some research back in time among the few device plugins available.
B
We had the KubeVirt device plugin, which, long story short, is the device plugin from the KubeVirt project that exposes /dev/kvm, a device file which is needed by virtual machines. That part is not that important; the point is that that plugin, back in time, was supported by a third party, was an existing device plugin used in production, and didn't report NUMA topology information, because /dev/kvm is a pseudo-device exposed by the Linux kernel.
B
We noticed, from another related case, that, okay, now it's broken, so we are without this device plugin, and that covers the first two points. Now the problem becomes, and this is actually an open question for this forum (I have a suggestion, but it's for awareness and as an open question for the forum): okay, what do we do now? Because, you know, just removing the tests seems the least desirable solution, so we would really benefit from a replacement device plugin.
The option I'm going to present, and then I'll shut up, is to extend the sample device plugin we already use, you know, to have a configuration option to provide or not provide NUMA locality. Why is this not obviously a good solution? Because end-to-end tests would really, really like to use a configuration as close as possible to production, and the sample device plugin is used only in tests; it's not a real device plugin, it's not something used by people who consume Kubernetes.
A
I know we had a similar conundrum with credential providers, which we are now moving out of tree. I mean, that's a simpler problem, because I think the surface, the API space, is much smaller.
A
So, since we're running most of the tests on GCP, we thought maybe we need to take the GCP credential provider, but it wasn't well accepted: people didn't want to take a dependency on a specific vendor, and the tests would not be portable after that. I mean, they're somehow portable today; by taking a dependency on a specific vendor's credential provider you'd make them less portable. So now we have a fake credential provider in-tree, in our test folder.
A
That kind of does this logic, and again this has similar concerns: it's not actually testing what people may use in production. So I guess our end-to-end test, in this sense, is mostly an integration test.
B
It's not terrible, but it's still sub-optimal, because, you know, it's something fake. But yeah, really, I'm out of options, so if anyone has an idea or a suggestion or wants to chat about it, just ping me on Slack. Otherwise...
B
You know, they have, like, /dev/zero, or... it's the device file. The specific thing there is that that device file is used by QEMU, or the hypervisor, to use the acceleration provided by the processor, mediated by the Linux kernel. But conceptually, okay. I shared the wrong link.
A
I don't have... my biggest question was: how portable was it? It sounds like it was quite portable, so...
A
Remember
we
didn't
have
any,
we
don't
have
any
real
devices
on
the
machine,
so
it's
hard
to
find
real
use
device
plugin
for
machines
that
doesn't
have
any
devices.
B
Using a real device and a real device plugin is completely fine. The only concern I have, okay, is that I need to learn the setup, but that's minor, very minor. The only real concern I have is that those machines are more expensive. So, you know, okay, we can maybe run it, but it's friction, and friction leads to lanes running less often (arguably more often than today, but still). So ideally we would have something that we can run on each PR, and the rest instead, let's say, daily or weekly, like integration, again.
E
Yeah, I was gonna ask: so there are different tiers of testing. We want this to run on PRs, and potentially we want to do integration testing too, right?
E
Could we actually mock, like, a GPU, have it report whatever it needs to report, and have the kubelet interact with that?
B
Yeah
yeah,
we
we,
we
can
quote
the
just
unquote
mock,
simpler
device
than
GPU
or
GPU,
and
and
because
we
would
just
need
to
provide
what
the
device
plugin
API
needs.
My
issue
here
is
that
is
conceptually
a
mock
or
fake.
It
depends
on
us,
but
it's
not
a
real
thingy.
So
you
know
it's
it's
something
some
signal
it's
better
than
nothing,
but
it's
not
really
end
to
end
well.
E
I was gonna suggest maybe, so, if we did that on every PR, it might be a good signal that nothing is broken, and then what if we had, like, a weekly GPU job that actually spun up a GPU and we did an integration test with that?
F
Right, we could have three jobs: one runs on a PR and only uses a fake device, and we have two periodic ones, one with the GPU and one with the fake device as well.
A
Do you want to talk about the NUMA nodes change? I remember... yeah.
D
Yeah, thanks for reviewing that; I managed to update it. There was, I think, a minor comment about the comments, so I addressed that, and it's ready for people to take a look.
A
So we will try to run the topology manager tests on bigger machines that have multiple NUMA nodes. I believe this is only for, your focus is, the topology manager suite, and it will run on bigger machines. I guess we can also discuss whether we need to increase the timeout, or how often we run this test, but right now it's...
D
...the default, as it was, because I just wanted us to have some signal and be able to compare it to what we already have.
A
Yeah, I just had a hold on it because of those minor comments. Okay, yeah, I think it's a big deal: we will be testing more and better.
A
Yeah, by the way, I saw that basically it was fixed, okay, so now I don't think it fails any longer.
A
Yeah, and it's still showing red, but most of the tests are green now, and the failing runs are not failing at the very beginning; they're failing on specific tests, so it's in much better shape now.
A
I'll check the issues in progress to start cleaning up the board, so bear with me.
A
I'll put it on the agenda. So, the document explains how to use this framework and how to write good tests.
A
I think this is one of those, like, how-to-write guides, specifically for pending pods.
A
Yeah, I think, yeah, the test... this change sounds familiar, so I would...
A
Yeah, so, as Antonio mentioned here: we shouldn't rely on events in tests, and we had a dedicated effort to remove all the dependencies on events from conformance tests. So all the conformance tests were cleaned up, and they don't have this dependency any longer.
A
It feels like some tests still have this dependency, and removing it is quite trivial in most cases: the test explicitly awaits an event, asserting that the event will be sent, when you have some other condition available, like something happened (a pod was deleted, or a pod reached some specific state). So changing it is typically easy. In this case, if anybody wants to take it, please do; for now I just put it into To Do.
A
Okay, and I think this is something related to our team, so I think it was ours.
A
I'm not sure if it exists anymore.
A
I remember somebody was trying to fix it by reading the OOM score from the proper process. We had a problem where we just assumed that the process would be the first one, or something like that, so we took the OOM score of some other, incorrect pod.
A
This is interesting. I don't think it's test-related that much; I believe it's only a very small, static pod test.
A
Yeah, so there is this interesting behavior that exists today when you delete a pod twice: you call pod deletion with a very large grace period, and then you want to kill it immediately, so you delete the pod again. The second grace period wouldn't override the first grace period, so the kubelet will wait for the whole duration of the first grace period supplied. This is a change of behavior, which is not ideal, after 1.22 or 1.23, something like that, so this PR attempts to fix it.
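The override being discussed reduces to one comparison: a follow-up deletion with a shorter grace period should shorten the pending wait, never extend it. A sketch of that rule (a hypothetical helper illustrating the desired behavior, not the actual kubelet change):

```go
package main

import "fmt"

// effectiveGracePeriod returns the grace period (in seconds) that
// should be honored when a pod is deleted again while a previous
// deletion is still pending. Desired behavior: a second request with
// a *shorter* grace period overrides the first, so deleting again
// with grace-period=1 after a long-grace delete kills the pod quickly
// instead of waiting out the original period.
func effectiveGracePeriod(current, requested int64) int64 {
	if requested < current {
		return requested
	}
	return current
}

func main() {
	// First delete asked for 300s; the user then asks for 1s.
	fmt.Println(effectiveGracePeriod(300, 1)) // 1
	// A later, longer request does not extend the pending deletion.
	fmt.Println(effectiveGracePeriod(30, 300)) // 30
}
```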
A
If anybody on this call wants to take a look at this, from the perspective of the end-to-end test at least, to make sure that the end-to-end test is written properly, it would be great; if you can review the whole thing, it'll be even better. Any takers? It was brought up at the SIG Node main meeting as well, for the code change.
A
Oh yeah, it's actually related to this. So, oh, that's good, meaning that actually, like, the tests they added are not working, which, I mean, means that the code is not working, which gives you some food for thought.
A
"Improve kubelet recovery after restart when dealing with devices"? Oh, Francesco, this is yours, right? Somehow it got into tasks in progress.
B
I think I'm gonna close this one, because, yeah, it's work in progress, and actually Swati has a better PR for a similar issue, if I remember correctly. So don't worry; I will handle that and probably close it.
A
Good. We also have issues in To Do, but we also only have 20 minutes left, so I suggest we clean up To Do next time, and now we'll switch to bug triage.
A
Are there any comments on this board before I go to bugs? No? Okay. And by the way, I will also check the performance dashboard. It still looks fine, no spikes or anything; if you're interested, just click this link and switch between CPU, memory, and runtime.
E
We haven't come to a consensus yet, but we're sort of thinking that we will be deprecating this feature, since no one has time to work on it over here at Red Hat. Maybe Google or somebody else can pick it up if they really want it, but...
A
What is this? Okay, I moved it to triage for now, because it's actually an issue, but maybe we can close it as a duplicate of the feature.
A
There's a dependency between... maybe I'm mixing up things, but I remember some of the issues were because of circular dependencies between CRI and cAdvisor, and we weren't able to collect certain metrics.
A
Okay, there are so many problems with cleaning up volumes recently; there is some activity going on there.
A
It feels similar, but the error is different, so I don't think it's this problem.
A
Let me... thank you. So, you said last time that you would take a look at the documentation; maybe you can do this?
C
Oh yeah, sorry, I forgot about it. I will take a look.
A
"Even behavior for regular and static pods upon deletion."
G
Yeah, so I just discovered this yesterday; Ryan and I were debugging something. If you try to delete a pod, especially if it's a static pod, it doesn't transition into the Terminating state immediately, but for a regular pod we do that immediately. So that's the bug.
G
The effect is, let's say you are monitoring a pod for its status and you remove it. If it's a static pod and you remove it, you would expect it to go into Terminating, like a regular pod does, but this one doesn't. Eventually it does go into the Terminating state and gets removed, but that's, like, a split second before it gets removed, instead of it immediately transitioning into Terminating.
G
So I'm not sure whether this is a SIG Node issue or a SIG CLI one. I found, I think Ryan found, some documentation which said it's just the way kubectl represents the status, but I haven't dug deeper, so I just want to keep it open at this point, whether it's a SIG Node issue or a SIG CLI issue.
G
I'm not sure how... so I'm removing the file, but is it the CLI that queries and interprets it that way? Who is responsible for printing the actual status? We need to see whether the logs also say whether the pod is running or not, or whether it's just the CLI that's not printing correctly. I don't know at this point.
A
Thanks, thank you. Containers...
A
Whatever... the error is very similar to what we see in a different bug.
A
Okay, we are out of time, and unfortunately we didn't have enough time to go through all the issues, but nevertheless we reached almost all of them. Thank you, everybody. Bye-bye.