From YouTube: Kubernetes SIG Node CI 20230125
Description
SIG Node CI weekly meeting. Agenda and notes: https://docs.google.com/document/d/1fb-ugvgdSVIkkuJ388_nhp2pBTy_4HEVg5848Xy7n5U/edit#heading=h.2v8vzknys4nk
GMT20230125-180539_Recording_1542x1120.mp4
A: I put this agenda item on the list. I don't have specific action items yet; it's just one of the bugs we just covered. I spoke about it during the CI Signal meeting yesterday. It's about HTTP probes. If you have too many containers (in the example we investigated, it was about 300 containers on a node, all of them doing liveness probes every second), then we can run out of resources on the node, because every HTTP probe takes a socket and waits for completion.
A: And then, after the callback, after the response is received, it keeps the socket open for another 60 seconds because of the TCP standard (the TIME_WAIT state). I'll be changing it to one second, making the problem less significant.
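To make the TIME_WAIT mechanics concrete, here is a minimal Go sketch of the kind of mitigation being described; this is an assumption about the approach, not the actual kubelet patch. Setting SO_LINGER to zero makes Close() send an RST, so a probe's socket does not sit in TIME_WAIT for 60 seconds afterwards:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	dialer := &net.Dialer{Timeout: 5 * time.Second}
	transport := &http.Transport{
		// Probe connections are one-shot; do not keep them alive for reuse.
		DisableKeepAlives: true,
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			conn, err := dialer.DialContext(ctx, network, addr)
			if err != nil {
				return nil, err
			}
			// SetLinger(0) closes with RST instead of FIN, so the kernel
			// does not hold the socket in TIME_WAIT (~60s) after the probe.
			if tcp, ok := conn.(*net.TCPConn); ok {
				if err := tcp.SetLinger(0); err != nil {
					return nil, err
				}
			}
			return conn, nil
		},
	}
	client := &http.Client{Transport: transport, Timeout: 2 * time.Second}
	resp, err := client.Get("http://127.0.0.1:8080/healthz") // hypothetical probe target
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	resp.Body.Close()
	fmt.Println("probe status:", resp.StatusCode)
}
```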
But we still have this noisy-neighbor problem of a sort, and we discovered this issue through different channels.
A: We saw people complaining about it, so that's one channel, but we couldn't understand why it was happening. We started thinking maybe some application was doing crazy things with resources, or maybe it wasn't listening on the socket, because the error was a client error: the kubelet was reporting that liveness probes fail with a connection timeout while trying to connect to the application. So naturally you would expect that the application is not responding, that the application cannot reply on the socket, or something like that.
A: But you know, it's a kubelet-side problem: the node exhausted all the sockets and it just can't create another one. And then we also discovered it in our stress tests in GKE; we've been running some of them, and surprisingly it didn't fail before, but it started failing, and I'm not sure what changed.
A
We
like
also
recognize
that
maybe
you
know
a
customer
issue,
it's
it
looks
more
and
more
like
a
infrastructure
problem,
so
we
dig
a
little
bit
different
Antonio
found
this
problem
because
he's
very
familiar
with
networking
stocks.
So
it
was
obvious.
A
I
mean
it
wasn't
obvious
for
him,
but
like
it
took
some
time
to
figure
it
out
and
then
I
I
spoke
with
scalability
team
in
the
past
six
collaborities
running
all
sorts
of
scalability
tests,
but
they
never
run
any
notes
collability
test,
so
they
don't
run
like
authors.
Collability
efforts
are
directed
to
towards
scheduler
and
API
service
collability,
so
they
want
to
make
sure
that
they
can
handle
a
lot
of
nodes
with
each
node
will
host
a
lot
of
ports,
but
every
port
that
they
run
is
collability.
A
Test
is
extremely
small
and
using
the
same
image
as
everything
else.
It
doesn't
have
any
probes
at
all
defined
so
like
all
the
primitive
premature
probe,
so
they're,
mostly
tasting
like
and
like,
maybe
sometimes
they
have
configma
but
config
map
for
the
sport
is
typically
designed
to
test
API
serious
collability
like
how
many
config
Maps
we
can
host
and
like
how
fast
they
will
be
downloaded
this
kind
of
things.
A: So this has been SIG Scalability's focus for many years now. They knew about node scalability questions, but they never dug much deeper into node efforts at all. So...
A: What I was saying is that I feel we have this lack of testing in SIG Node. Another thing we're working on is gRPC probes. For gRPC probes we also implemented a couple of conformance tests, and that's all good, the functionality is working. But now, looking at this problem, I'm thinking: what kind of scalability testing do I need to run? How many containers is too many to run?
A
Like
should
I
run
skull,
businesses
with
like
2
000
containers
like
five
thousand
Canada,
whatever
the
limit
that
they
need
to
support
yeah
and
what
kind
of
resources
we
can
exhaust
on
node
with
grpc
connections.
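For scale reasoning, it helps to see what a single gRPC probe does. Here is a minimal sketch of the standard gRPC health check a probe performs (the target address is a made-up example, and this is not kubelet's actual prober code); each such probe holds a TCP connection and an HTTP/2 stream while it waits:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Each probe dials the container's gRPC port; with hundreds of
	// containers probing every second, these connections add up per node.
	conn, err := grpc.DialContext(ctx, "127.0.0.1:50051", // hypothetical target
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()

	resp, err := healthpb.NewHealthClient(conn).Check(ctx,
		&healthpb.HealthCheckRequest{Service: ""}) // "" asks for overall server health
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("serving status:", resp.Status)
}
```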
A
No
excuse
me
yeah,
so
I
don't
have
answers
yet.
I
just
I
feel
that
there
is
a
lack
of
testing
that
we
have
you
also
like
in
like.
We
also
have
this
talk
test
that
something
that
Ryan
could
continuously.
We
have
this
test.
It's
constantly
red.
We
made
like
a
few
attempts
already
to
fix
it.
I
think
we're
getting
closer,
but
yeah
Brian
is
he.
You
know
shaking
ahead.
A
Yeah
I
mean
I,
get
it
to
the
point
when
it
I
I
knew.
The
problem
is
some
leak
and
log,
so
I,
like
I,
was
trying
to
figure
out
which
lock
I
need
to
rotate
but
I
get
to
this
point,
but
then,
like
I,
went
to
parental
leave
and,
like
I
came
back
and
it
gets
broken
again,
I
completely
passed
it
so
I.
Don't
know
like
this.
A
This
is
a
little
bit
dysfunctional
but
yeah,
something
that
we
also
need
to
do,
but
we
have
at
least
some
I
mean
we
have
a
place
to
fix.
We
don't
have
like.
We
actually
have
a
test
that
we
need
to
look
into
to
fix
it.
For
stress.
We
don't
have
anything
I.
C: I have a question for you: I see your link in the pull request here. Does this mean you started a test to do the stressing?
A: It's very... I mean, it's good enough for that: it's failing without the fix and it's not failing with the fix. But frankly speaking, I didn't even try it on Windows. Will it work on Windows or will it not? I have no idea. I mean, probably it does, because we run unit tests on Windows. Yeah, Ryan?
D: I was going to mention, I was just looking at the patch too. I think it's a really good patch.
A: Yeah, that's why I mentioned gRPC: we were thinking of GA for gRPC probes, and right now, with this scalability problem, we probably need to look deeper and try to test it at scale.
A: Yeah. So, I don't know if anybody would be interested to look into stress tests, into what kind of stress tests we want to run during functional validation. It would be interesting, and I would definitely support that. I just don't know who has energy right now, and whether we want to tackle it in this release or the next one.
A: Yeah, maybe we can start with understanding what we want out of stress tests. I think probes are the obvious thing you may want to test, but you also may want to look at how many config maps one pod can handle. I don't know what else to stress; like, how fast you can create and delete pods. Maybe you can just keep pounding with creation and deletion and see how the kubelet behaves.
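A rough sketch of that create/delete pounding idea using client-go; the namespace, pod shape, and iteration count are assumptions for illustration, not an agreed test design:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	pods := kubernetes.NewForConfigOrDie(config).CoreV1().Pods("default")

	// Keep pounding: create and immediately delete pods in a loop and
	// watch how the kubelet keeps up with the churn.
	for i := 0; i < 100; i++ {
		name := fmt.Sprintf("churn-%d", i)
		pod := &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: name},
			Spec: corev1.PodSpec{
				RestartPolicy: corev1.RestartPolicyNever,
				Containers: []corev1.Container{{
					Name:  "pause",
					Image: "registry.k8s.io/pause:3.9",
				}},
			},
		}
		if _, err := pods.Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
			fmt.Println("create failed:", err)
			continue
		}
		if err := pods.Delete(context.TODO(), name, metav1.DeleteOptions{}); err != nil {
			fmt.Println("delete failed:", err)
		}
	}
}
```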
C: I have a practical suggestion for what would help me, and probably others in here. It would make it easier for beginners to contribute if there's already a place where these kinds of things fit in, so we can add test cases. If there's not, a good first step would be to create a place where we can begin to add test cases; find a place, something like that, and tell everybody in here, so we can jump in and make test cases.
A: That's fair; once you have an example, it's easier to extend it with other examples, right? Yeah. But even thinking about it, I'm still not super clear myself: do you want a big machine and test a heavy load of pods, or do you want a smaller machine and just test the limits of the kubelet? I haven't settled it in my own mind. That's why, even to create this place, we need to do a little bit more thinking.
What exactly do we want to validate, and why? Does anyone know if anything was discussed for evented PLEG? Evented PLEG is an improvement where... today, the kubelet lists all the containers periodically from the runtime to see if there are any changes that need to be reconciled.
A: Whereas with evented PLEG, it opens a streaming connection to the container runtime, and whenever something changes, the runtime reports back through this channel, through this stream, saying there is a change on this container, please update it.
A: The goal of that KEP was to improve performance and minimize memory usage, and I was wondering whether, during that KEP, somebody was looking into stress testing it.
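For reference, here is a rough sketch of the streaming model just described, written against the CRI container events API; the runtime socket path is an assumption, and the exact generated names should be double-checked against k8s.io/cri-api:

```go
package main

import (
	"context"
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	// Connect to the container runtime's CRI socket (path is an assumption).
	conn, err := grpc.Dial("unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)

	// Instead of listing all containers every second, open one stream and
	// let the runtime push lifecycle events as they happen.
	stream, err := client.GetContainerEvents(context.Background(),
		&runtimeapi.GetEventsRequest{})
	if err != nil {
		panic(err)
	}
	for {
		ev, err := stream.Recv()
		if err != nil {
			fmt.Println("stream closed:", err)
			return
		}
		fmt.Printf("container %s: %v\n", ev.ContainerId, ev.ContainerEventType)
	}
}
```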
F: So, Sergey, we are trying to add evented PLEG CI jobs in various repositories right now. There is one that was recently added as a pre-submit job, but unfortunately it's failing due to some cluster authentication issue; we're trying to fix that. That's the node e2e job that uses evented PLEG. Then we are trying to add evented PLEG CI jobs in CRI-O, and I believe someone might add one in containerd as well, and similarly we are trying to add an evented PLEG job in OpenShift CI. So that's that.
F: Right now we have a pre-submit job. There's a small problem: when you launch it, it kind of doesn't get the right cluster to launch the job on, so the job doesn't get launched. We're trying to debug what could be going on there, but as soon as that's fixed, we should have an evented PLEG node e2e job running soon.
A: Makes sense. Also, as a side note: when looking at the probes, I discovered that today we don't start the next probe while the previous probe is still waiting for a response. That was quite surprising for me, because I thought we would keep creating new connections over and over, even if older connections were still waiting on a timeout. Yeah.
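A tiny illustration of that serialization, as an assumption about the behavior rather than the real prober code: the probe runs synchronously inside the worker's ticker loop, so the next probe cannot start while the previous one is still blocked.

```go
package main

import (
	"fmt"
	"time"
)

// runProbe blocks until the probe completes or times out.
func runProbe(id int) {
	time.Sleep(3 * time.Second) // simulate a slow or unresponsive target
	fmt.Printf("probe %d finished at %s\n", id, time.Now().Format("15:04:05"))
}

func main() {
	ticker := time.NewTicker(1 * time.Second) // periodSeconds: 1
	defer ticker.Stop()
	for i := 0; i < 3; i++ {
		<-ticker.C
		// The call is synchronous: even with a 1s period, the next probe
		// does not start until this one returns, so slow probes do not
		// pile up new connections.
		runProbe(i)
	}
}
```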
A: I'm putting this PR up as an example of where stress tests may be needed, and evented PLEG is another place where I believe stress tests will be very important, mostly because under stress we can find all sorts of race conditions and missed events. Maybe a missing event will cause some pod to not be removed, or something like that. So I was wondering whether a specific stress test was discussed and what it would look like.
F: Yeah, gotcha. I just wanted to make sure that the evented PLEG has no relation to the changes in that pull request. No?
A
Like
if
there
is
Improvement
already
said
like
cap,
that
is
intentionally
doing
stress
testing,
can
it's
part
of
our
CI?
I
would
like
to
understand
details
like
how
exactly
we
decided
to
stress.
Does
this
aspect
and
not
that
aspect
kind.
F
F: That is what will trigger a lot of events: you need to have a lot of pods, and they need to continuously change states from one state to another. That would be a good setup for that kind of test.
F: Right now we have a fair number, around 250, that we can easily cross, and we want to see whether we can go beyond that number easily. So I'm going to take the existing pod lifecycle test and then start scaling up without evented PLEG, and then with evented PLEG enabled, to see whether we see any performance difference, because there is a lot of optimization that happens in the container runtime as well. So that kind of works with the existing framework.
F: So even though the kubelet is requesting the pod statuses every second, it's not like the runtime is actually going and hitting the disk; at least in the case of CRI-O, there is a lot of caching inside and it can respond from the cache. So in that case, how would evented PLEG actually make a difference there? All of those questions need to be answered. But yeah, stress testing is required. Considering that evented PLEG was designed for improving performance, a stress test is definitely required.
A: Yeah, I wonder if we can even reach the scale, the stress on the system, where the kubelet is not able to process all the events. Oh my God, if we can get a stress test to that level... yeah.
A: Yeah, and once you have it, I would be interested to learn how it's done.
A: Thank you. Okay, if there are no more agenda items, I would love to go into our board.
A: We have some issues to triage, but I wanted to start with "waiting on author". I think we haven't looked at "waiting on author" for a while, and I cleaned it up a little bit for some obvious things.
C: Oh, it's... I put a comment on it. Divya... anyway, we'll see.
A: Okay, Peter? Peter.
E: Yeah, this one, the test is still failing. I keep not having time, or not remembering, to look at it. So I would say it's a "waiting on author" kind of situation. Okay.
A: So, someone wants to add a new feature to query logs from a node using the kubelet, so you don't need to SSH to the node to gather, I think, journal logs in this case. Jordan seems to be having a problem with the overall approach: we don't want to make the kubelet be a log-forwarding solution for the node. It's a little bit scary from a security perspective.
A: Yeah, this one. Yeah, Brian, you mentioned this, right? This is something you reviewed.
A: Yeah, since all the comments were replied to, I would put it in "needs reviewer", if anybody is interested to take a look.
A: Right. Okay.
F: So, the issue it's trying to fix here: we have Fedora CoreOS as the operating system that we use for some of our jobs, and the boot configuration (I forgot what the equivalent is called on Ubuntu, but you know, the configuration the system comes up with) is called Ignition. So far we have been manually editing those Ignition files to update things there, but this PR essentially uses a feature of Fedora CoreOS where you can define a human-readable YAML file and automatically generate the machine-readable Ignition file from it.
F: Yeah, this is the node e2e evented PLEG job that I talked about some time back, which is not able to find the GCP project, and this PR is attempting to fix that. But...
I have a couple of questions around that. If you go to... yeah, this portion of the changes, I'm not sure whether it's required or not, so I'm talking to the author directly on Slack. Maybe I should put some comment here, or a hold, because we have a periodic job which doesn't have this, and it still works fine. So I'm not sure whether this is required or not.
A
Okay,
do
you
need
this
in
the
chat
yeah.
F
F
A
A: Yeah, it's not node-specific, so I would take it out of our board.
A: David is working on that, yeah. I remember way back we had skip logic that tries to skip a test when we are outside of a pre-configured environment with a test handler, and I think...
B: This is related to the topology manager graduation to GA. One of the items that we have to have for a feature to graduate to GA is some sort of metrics, and there were none for the topology manager, so I've added two, and I've added some end-to-end tests as well.
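For context, kubelet metrics of this kind are defined through k8s.io/component-base/metrics. Here is a minimal sketch of declaring and registering such a counter; the metric name and help text are illustrative guesses, not necessarily what the PR adds:

```go
package main

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// A counter in the style of kubelet admission metrics; the name and
// help text are assumptions for illustration.
var admissionRequests = metrics.NewCounter(
	&metrics.CounterOpts{
		Subsystem:      "kubelet",
		Name:           "topology_manager_admission_requests_total",
		Help:           "Number of pod admissions handled by the topology manager.",
		StabilityLevel: metrics.ALPHA,
	},
)

func main() {
	legacyregistry.MustRegister(admissionRequests)

	// Incremented on every admission the topology manager evaluates.
	admissionRequests.Inc()
}
```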
A: Okay, that makes sense, with these metrics.
A: Yeah, this one I need to put back into triage.
A: Yeah, the bug is saying "we have an environment where it fails, please help us". That's not generally helpful; we need a little bit more information.
A: Okay, we still have a little bit of time, so let's try to understand.
A: This "one container"...
D: In the e2e exec, right up in this block that's at the bottom of the screen, there's a block that he labeled.
D
Here,
right
here
right
here,
there's
an
exec
with
GC
test
B.
A: Yeah, but I'm looking for the YAML for this, this B.
A: I'm just curious what "one container" means. Is it that the job has one container completed already?
A: It doesn't say anything about what triggered it.
A: No, for this one we don't need it, because when we call it, it sounds like the call can get stuck forever, and then we just accumulate those calls. Maybe we shouldn't call again if one call is already in progress.