From YouTube: 2021-09-16 Kubernetes SIG Scalability Meeting
Description
Agenda and meeting notes - https://docs.google.com/document/d/1hEpf25qifVWztaeZPFmjNiJvPo-5JX1z0LSvvVY5G2g/edit?ts=5d1e2a5b
A: So, hey everyone, this is the SIG Scalability meeting, and today is 16 September 2021. On today's agenda we have two points. So, the first one... I think this point has actually been on the agenda for quite a long time, and honestly, I'm not sure what its current status is right now.
B: Yeah, so I'm not sure if anyone else has started working on it, but I have some cycles to work on this. I'm pretty new to the scalability tests, though, and I'm not very familiar with the repo, so it would help if somebody could provide me some pointers. Also, for the P&F (API Priority and Fairness) test: is it going to be a new test, or are you going to extend an existing test?
A: Yeah, exactly, that's what I was actually going to propose. So do we have an idea in mind of what exactly we want to test regarding P&F?
B: For example, I would start with a basic question. In these scalability tests we have tests at different scales, right? I know we have the 5k-node tests, the 100-node tests, and so on. So the first question is: are we going to look at how P&F behaves at a very large scale, say a five-thousand-node cluster?
A: So I think that currently P&F is not actually blocking that many requests, right? And the question is, because our tests also measure the latency of calls, I'm not sure how exactly we want to test it: if we reduce, let's say, the concurrency limits in P&F, then obviously the latency will increase. So it's a trade-off. But I don't know, do you have any ideas?
C: On how we want to test it: I think it shouldn't be the same test, it should be a separate test. And I would focus not necessarily on a very large cluster, but on generating a much higher load and ensuring that the API server won't fall over, more or less, right? So it's more like a reliability test. More of a reliability test, yes.
B: So drive the concurrency to a very high limit and then see. So today, what's the usual number of requests in flight for our largest-scale test? How far do we go, how many requests are in flight at any time? Do you have any rough idea?
C: I can't remember off the top of my head, but I think it's more in the hundreds than anything else.
B: Okay.
A: So we have around 600 set. And how much do we actually reach, like 500?
B: Yeah, in the past, probably when P&F was in alpha, I tried... I did some testing, mostly a benchmark test, where I didn't know how to actually drive the concurrency up.
B: So what I did was actually add a filter that basically waits for, let's say, one second for all the requests that originate from a certain user, and basically I drove the concurrency up through that. But that requires, you know, changing the API server, so I'm not sure, because all the tests I've seen in scalability are basically end-to-end, like real tests on a real cluster, not any integration type.
C: So it's a little bit more expensive in that sense, so we should probably have a baseline with that webhook doing nothing, but it will let you simulate what you basically want.
B: So which repo is it in? Is it in the e2e test repo?
B: Okay, sounds good, thank you. Yeah, okay, so I think the first iteration I could try to look at is using an external webhook to add some artificial latency, right? So this will drive up the concurrency, and then run the test. And I also need to measure... I don't know what's there by default.
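A minimal sketch of what such a latency-injecting webhook could look like, assuming a plain HTTPS endpoint and a fixed one-second delay; the handler shape, paths, and delay value are illustrative assumptions, not the actual test setup:

```go
// Hypothetical "slow" validating admission webhook, used only to hold
// requests open and drive up apiserver concurrency.
package main

import (
	"encoding/json"
	"io"
	"log"
	"net/http"
	"time"
)

// Minimal subset of the AdmissionReview schema: just enough to echo the
// request UID back in an "allowed" response.
type admissionRequest struct {
	UID string `json:"uid"`
}

type admissionResponse struct {
	UID     string `json:"uid"`
	Allowed bool   `json:"allowed"`
}

type admissionReview struct {
	APIVersion string             `json:"apiVersion"`
	Kind       string             `json:"kind"`
	Request    *admissionRequest  `json:"request,omitempty"`
	Response   *admissionResponse `json:"response,omitempty"`
}

func handle(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	var review admissionReview
	if err := json.Unmarshal(body, &review); err != nil || review.Request == nil {
		http.Error(w, "malformed AdmissionReview", http.StatusBadRequest)
		return
	}

	// The whole point: hold each request for a while so that many
	// requests stay in flight concurrently.
	time.Sleep(1 * time.Second)

	review.Response = &admissionResponse{UID: review.Request.UID, Allowed: true}
	review.Request = nil
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/validate", handle)
	// Admission webhooks must be served over TLS; the cert paths are placeholders.
	log.Fatal(http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", nil))
}
```

Registering this via a ValidatingWebhookConfiguration for the resources under test, plus a no-op variant as the baseline mentioned earlier, keeps the delay outside the API server binary.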
A: In our default test... if you are going to develop a new one, then you will need to use, basically, our measurements that gather those results, and then you can set the appropriate thresholds that you want to use.
A: So the question is... okay, so I think that, first of all, we need to prepare a kind of test scenario, an idea of what we want to test, and then I guess we could help you with finding which parts we already have in ClusterLoader.
B: I guess we want to... I guess both. I just want to see, at a higher load, what impact P&F has. I think we have a metric that actually measures how much time a request spends exclusively in the P&F filter, and we can use that as, you know, a baseline, like "this shouldn't go higher than that", or we would come up with those goals later, I think.
D: Yes, there is a metric like that, but I'm afraid it includes the actual time spent waiting, so it will not measure the overhead, but rather, you know, how many, like, thousands of requests we have in the system. So yeah, I'm not sure if this is exactly what we need to measure. I asked because I was thinking that maybe we do not need to create, you know, some cluster loader tests or anything like that; maybe simply a Go benchmark would be a better tool for that. I have seen some examples where it has been used with success to validate where we spend CPU time, stuff like that, and then it is easier to iterate on it. So maybe this can be used for some tests, but I don't know, maybe we already have some. Yeah.
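For reference, a sketch of the kind of Go micro-benchmark being suggested, here wrapping a stand-in filter around a trivial handler; the filter and all names are placeholders, not the real P&F code:

```go
// Hypothetical micro-benchmark for per-request filter overhead.
package filter_test

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

// withFakeFilter stands in for a priority-and-fairness-style filter; a real
// benchmark would wrap the actual filter under test instead.
func withFakeFilter(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Classification and queuing logic would live here.
		next.ServeHTTP(w, r)
	})
}

func BenchmarkFilteredHandler(b *testing.B) {
	inner := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	h := withFakeFilter(inner)
	req := httptest.NewRequest(http.MethodGet, "/api/v1/pods", nil)

	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		h.ServeHTTP(httptest.NewRecorder(), req)
	}
}
```

Running it with something like `go test -bench=. -cpuprofile=cpu.out` yields a CPU profile showing where the time goes, which matches the use case described.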
B: I mean, definitely, we definitely can use some, you know, actual Go benchmark tests at a very, very micro level, but I thought this was about actually seeing what happens on a real cluster, so at a very macro level. I guess we can... both, I guess, are good ideas.
B: But does the CI already have all the machinery to run benchmark tests?
C: It has, yes.
B: Oh, okay, so we can do both, actually. Like, you know, if there is a particular module in P&F that we want to see how much CPU it spends, we can add those tests as well, and we can also...
B: The other thing is, for this test: I know that P&F ships with a set of bootstrap configurations. We can add more objects too, to see how much cost that adds to its processing, right? That could be another avenue we can pursue.
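As an illustration, "adding more objects" could mean creating extra FlowSchemas that point at an existing priority level, for example via client-go; the names and values below ("load-test-extra", "load-generator") are made up for the sketch:

```go
// Hypothetical snippet creating one extra FlowSchema for load-test traffic.
package loadtest

import (
	"context"

	flowcontrolv1beta1 "k8s.io/api/flowcontrol/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func addFlowSchema(ctx context.Context, cs kubernetes.Interface) error {
	fs := &flowcontrolv1beta1.FlowSchema{
		ObjectMeta: metav1.ObjectMeta{Name: "load-test-extra"},
		Spec: flowcontrolv1beta1.FlowSchemaSpec{
			// Reuse a bootstrap priority level so only the matching cost changes.
			PriorityLevelConfiguration: flowcontrolv1beta1.PriorityLevelConfigurationReference{
				Name: "global-default",
			},
			MatchingPrecedence: 10000,
			Rules: []flowcontrolv1beta1.PolicyRulesWithSubjects{{
				Subjects: []flowcontrolv1beta1.Subject{{
					Kind: flowcontrolv1beta1.SubjectKindUser,
					User: &flowcontrolv1beta1.UserSubject{Name: "load-generator"},
				}},
				ResourceRules: []flowcontrolv1beta1.ResourcePolicyRule{{
					Verbs:        []string{flowcontrolv1beta1.VerbAll},
					APIGroups:    []string{flowcontrolv1beta1.APIGroupAll},
					Resources:    []string{flowcontrolv1beta1.ResourceAll},
					ClusterScope: true,
					Namespaces:   []string{flowcontrolv1beta1.NamespaceEvery},
				}},
			}},
		},
	}
	_, err := cs.FlowcontrolV1beta1().FlowSchemas().Create(ctx, fs, metav1.CreateOptions{})
	return err
}
```

Creating a batch of such objects before a run and comparing request latency against a run without them would isolate the cost of FlowSchema matching.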
A: It can be, but we also create issues in our perf-tests repository. That's usually where we create all of these.
A: Okay, so I guess the question is: do we have anything more that we want to discuss right now? Because I see on the agenda that there is a release team intro, but I think there is no one here from the release team. So, I mean, we still have 15 minutes to talk about this testing, if we want.
B: I have a somewhat different question, if we're done discussing P&F.
A: Yeah, go ahead.
B: Yeah, so I know that we have the 5k test. Is it actually running on real nodes, or are these, like, kubemark nodes?
A: So what I would say is that, okay, let's say that you are making some changes to our CI. Then usually what we do is, you know, we have this kind of feature flag, and then we enable it on 100 nodes, and then we enable it on 5k nodes.
B: I see, okay.
A: And if it doesn't work, then we roll back.
A: Yes, yes, yeah. So basically, maybe I can give you a short introduction. The idea is that we are creating multiple objects, deployments, stateful sets. This generates some pod churn during the whole test, and then at the end we are deleting everything. What we are also doing is measuring different things, like out-of-memory errors, API server latency, and things like that, and basically, if any of those measurements goes above some threshold, then we fail the test.
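As a rough sketch of the pass/fail logic just described, assuming one gathered p99 latency per measurement compared against a fixed threshold; the names and numbers are illustrative, not the actual perf-tests code:

```go
// Hypothetical threshold check: fail the run if any measurement's p99
// latency exceeds its SLO.
package main

import (
	"fmt"
	"time"
)

// measurement is a stand-in for one gathered result, e.g. the 99th
// percentile of API-call latency for a given resource/verb pair.
type measurement struct {
	name      string
	p99       time.Duration
	threshold time.Duration
}

func evaluate(ms []measurement) error {
	for _, m := range ms {
		if m.p99 > m.threshold {
			return fmt.Errorf("measurement %q failed: p99 %v > threshold %v",
				m.name, m.p99, m.threshold)
		}
	}
	return nil
}

func main() {
	results := []measurement{
		{"APIResponsiveness GET pods", 350 * time.Millisecond, time.Second},
		{"APIResponsiveness LIST pods", 1200 * time.Millisecond, time.Second},
	}
	if err := evaluate(results); err != nil {
		fmt.Println("test failed:", err) // a CI job would exit non-zero here
	}
}
```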
A: So, do you have any more questions? I think that's all from my side. Okay, so I guess we can discuss it further in the issue, and I don't think we have anything more to discuss, right? So let's finish early, and everyone will have nine minutes back.