From YouTube: Kubernetes SIG Node 20210428
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A
Hello, it's April 28, 2021. This is the SIG Node CI group and we are all here to check the status. We already went through the first two or three items. I hope this PR will fix most of the serial jobs, and then we can see the actual failures. Let's wait for progress there. I pinged approvers yesterday and some of the PRs already got approved, but we still have a few left. Okay, so you're up, Francesco, for the next topic.
B
Yeah, so this is kind of related to the serial failures, because I was investigating some tests, the end-to-end CPU manager tests. Long story short, there is a key thing I would like to address: when we run end-to-end node tests, the kubelet uses the default system directory, /var/lib/kubelet, to hold the state of the managers, for example the CPU manager and the memory manager. This means we have hidden state and, what's worse, hidden state shared between tests, which is something I would rather avoid.
B
But first I'd like to check. I will be checking with git log and asking in the channel, but first of all I will just start from here and ask: do we agree that we should try to avoid this hidden global state? And maybe someone in this meeting already knows why we do it this way; it could be just a historical artifact. And, in general, are there objections or concerns about moving away from this global hidden state?
B
Yeah. We have, for example, the CPU manager state, the memory manager state, and all the state files for the resource managers which are being tested in the end-to-end node suite, but in general the state of the kubelet ends up here. I don't want to take too much time, but cleaning up this state directory between individual end-to-end tests, first of all, defeats the purpose of having shared state if you're going to reset it anyway.
B
Why have a global shared state then? And second, this is actually harder because of how the end-to-end tests are run, so you really need some careful coordination between all the moving parts. It just seems easier to give each end-to-end test its own private state directory, so each end-to-end test has its own private state and you have isolation among tests. So unless there are objections, I'll just keep investigating this alongside all the other things.
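(A minimal sketch of the per-test isolation proposed here, assuming the test harness launches the kubelet itself and can pass it flags; the package and helper name are hypothetical, not the real e2e framework API.)

```go
// Hypothetical sketch: give each end-to-end node test its own kubelet
// state directory instead of the shared /var/lib/kubelet default.
package e2esketch

import (
	"os"
	"os/exec"
)

// startKubeletForTest launches a kubelet whose state files
// (cpu_manager_state, memory_manager_state, ...) live in a throwaway
// per-test directory, via the kubelet's --root-dir flag.
func startKubeletForTest(kubeletBin string, extraArgs ...string) (*exec.Cmd, string, error) {
	stateDir, err := os.MkdirTemp("", "kubelet-e2e-state-")
	if err != nil {
		return nil, "", err
	}
	args := append([]string{"--root-dir=" + stateDir}, extraArgs...)
	cmd := exec.Command(kubeletBin, args...)
	if err := cmd.Start(); err != nil {
		os.RemoveAll(stateDir)
		return nil, "", err
	}
	// The caller stops the kubelet and removes stateDir when the test
	// ends, so no manager state leaks into the next test.
	return cmd, stateDir, nil
}
```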
A
Yeah, I would be interested to learn more about the state, but from your description of it, you're right: between executions we definitely need to clean it up, or at least not share it.
B
Okay, I will keep investigating, asking the people who may have brought in the end-to-end tests, asking in the channel, and keep digging. It will also make the tests more reliable; for example, in other suites, in OpenShift, I'm pretty sure we delete the state file between runs, which is pretty much my point. So yeah, I'll keep this going.
C
Like, for the CPU manager and the memory manager, I know that we delete the state file anyway after each phase, so I'm pretty sure we already do it.
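(For reference, a sketch of the kind of between-phase cleanup being described, using the kubelet's actual state file names under its root directory; the function itself is illustrative.)

```go
package e2esketch

import (
	"os"
	"path/filepath"
)

// cleanupManagerState removes the resource-manager checkpoint files
// between test phases so one phase cannot see another's state.
func cleanupManagerState(kubeletRootDir string) error {
	for _, name := range []string{"cpu_manager_state", "memory_manager_state"} {
		err := os.Remove(filepath.Join(kubeletRootDir, name))
		if err != nil && !os.IsNotExist(err) {
			return err
		}
	}
	return nil
}
```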
D
I looked to see whether this is just an OpenShift issue or whether it's upstream as well, and from what I can tell it's a runc issue, so it should affect both upstream Kubernetes and OpenShift. But it has been difficult to track down, in part because there really aren't any resource utilization tests for the kubelet right now.
D
I asked SIG Scalability and they just said it doesn't exist: the API server and a bunch of other components have this, but not the kubelet. So I have filed this issue asking, maybe we can do this? It would still probably help for regression analysis.
D
We'd have to go and pull pprof profiles and that kind of thing, but it would at least make it a little bit easier to figure out where to bisect from and where a regression started, and that kind of thing. Because in this case there were some runc changes, but I still don't have a smoking gun; I've been trying to get a pprof out of a local-up-cluster setup, but we were certainly seeing it in OpenShift. I think part of the problem is that a CPU regression is really hard to reproduce on a local machine, whereas that's much easier on a large cluster that's actually running some load.
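(For anyone trying the same, one known way to pull such a profile is through the API server's node proxy, since the kubelet exposes /debug/pprof when profiling handlers are enabled; the node name below is a placeholder, and a running `kubectl proxy` plus appropriate node-proxy RBAC are assumed.)

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// Fetches a 30-second kubelet CPU profile via `kubectl proxy` running
// on its default address. Inspect the output with `go tool pprof`.
func main() {
	node := "example-node" // placeholder: substitute a real node name
	url := fmt.Sprintf(
		"http://127.0.0.1:8001/api/v1/nodes/%s/proxy/debug/pprof/profile?seconds=30", node)
	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, err := os.Create("kubelet-cpu.pprof")
	if err != nil {
		panic(err)
	}
	defer out.Close()
	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}
```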
D
There was clearly a CPU regression in the API server as of whatever that date was, at, I think, the highest percentile.
D
I mean, I don't know, I'm not worried about the API server; that's not our problem. But I was asking where I can find this for the kubelet, and it turns out it doesn't exist.
D
So this does not exist for the kubelet, but if you click on the dropdown where it says kube-apiserver, there are a lot of different components being monitored here, but not the kubelet. So I said we should also have the kubelet, and they said they would love that, but they don't have anybody to implement it, so please file an issue.
D
And if you haven't seen these dashboards previously, this is perf-dash.k8s.io, and you can view all these various stats that we collect. There are also some cluster SLOs that get collected here, and different jobs and whatnot. So you can see things like that API server CPU regression I found, which was on 1.20, but there's other stuff.
A
Okay. First of all, I'm interested myself, because performance bugs are really hard: you need to reproduce them and you need a stable environment. But then, if somebody takes this issue, where would they get started?
D
Yeah, so luckily this is the perf-tests repo, which is separate from k/k. There's this thing called ClusterLoader2, and I would assume there's probably already a pattern for all this other stuff that's scraped; it's just not set up for the kubelet. I think the perf-tests are mostly maintained by Googlers.
D
So if someone at Google wanted to pick this up, they would probably have much better luck finding people internally to help than, for example, I would, because I think a lot of that team is at Google, and a lot of them are also in Europe; a lot of them are in the Warsaw office, I think.
D
They also have an on-call in their Slack, which is, I think, a best-effort on-call rotation, and they respond to people during, I think, Europe business hours, so it does not really overlap with me being on Pacific time, but there are folks there.
A
Yeah, we can chat with the SIG Scalability folks; maybe they have somebody to help. But you said you already spoke with them.
D
Well, I just added a question in the Slack channel and they said they would be interested in somebody implementing this, and so I filed this issue based on what they said.
A
Okay, yeah. I'm just afraid that whenever we add some tests, we need to make sure we actually start looking at them; otherwise it's just work that will degrade over time.
C
Good, thank you. In general, you can just make it required in the lane. I know that one of the scale tests is required, probably for the API server, and if you have a regression and it fails for some reason, you just don't pass CI.
D
Yeah. So we had, I don't know if you were there that week, Ben Elder came from SIG Testing to chat with us about pre-submit tests, because we needed to add one for CRI-O, since there were a lot of things getting broken on CRI-O and then having to get retroactively fixed.
D
And so I know that, basically, there are a lot of issues now, especially with having removed Bazel, in terms of pre-submits being pretty expensive, because they run on every single PR: every single one of those jobs runs on every single PR. And so I know that we're basically trying to move away from having more pre-submits to having fewer pre-submits, and instead running more periodics, because the periodics, I think, are much cheaper to run.
D
And as long as we're responding to the failures we're seeing there, it's somewhat of a better way to catch things, because, you know, somebody has a typo, for example, and their thing doesn't build, and then all the pre-submits fail, so we don't get a lot of good signal in terms of where tests are actually flaking. So, anyway, I have been told that we should try to move away from pre-submits and move more towards post-submits, periodics, and all of that stuff.
D
Yeah, I don't know what jobs are populating it, I don't know what infrastructure it's running on, I don't know who owns this dashboard. I would strongly prefer not to have a separate node dashboard, and that we instead hook into all of the rest of the scalability test dashboards.
D
Because, yeah, that one is basically unmaintained; I don't know that it would be worth setting it up as a separate thing.
A
Okay, yeah, so let's take a look at triage.
A
Okay, this is the sheet, not assigned.
A
Great, okay, it doesn't need anything, perfect. I think we can cut it short today for testing, unless anybody has any more topics. Any questions?
A
Okay, let's check the calendar as well, yeah. Do we want to go through the feature board? I would just cut it short today.
D
I'm happy to either do a quick run through the board or cut it short. I've been a little bit behind.
A
Yeah, let's cut it short and try to do as much offline as possible. I did some yesterday, but not too many.
A
Okay, thank you, everybody.