From YouTube: 2021-02-04 Kubernetes SIG Scalability Meeting
Description
Agenda and meeting notes - https://docs.google.com/document/d/1hEpf25qifVWztaeZPFmjNiJvPo-5JX1z0LSvvVY5G2g/edit?ts=5d1e2a5b
A
If it's not done, then we will just not have a recording for this meeting. Welcome everyone to another SIG Scalability meeting; today is the fourth of February. We already started discussing Abu's PR about cleaning up how the timeout is handled in the API server filter chain. The PR has merged, and we've already checked that we don't see any regression, issues or flakiness in the 5k-node test results; here's the link if anyone is interested.
B
Oh nice, so this should fix that issue with etcd connections being left hanging even after the call times out.
C
I have to double-check the whole etcd storage layer to be sure, but this PR basically makes sure that, as soon as the request is received, we have a deadline-bound context, and that context is attached to the request. So the hope is that the storage layer, from what I've seen so far, uses that same context.
C
But the follow-up task I have is to go through the etcd storage layer, the admission layer and the aggregation layer, and make sure that all of those layers are either using that context or, if they are creating a new context, creating it from the parent context. That's a follow-up task I have on my plate. I did a preliminary check on the etcd storage layer; it looks like the context is already wired through, so it should use the deadline-bound context.
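(A minimal sketch of the pattern being described here, assuming a plain net/http filter chain; `withRequestDeadline` and `storageGet` are hypothetical names for illustration, not the actual apiserver filter or storage code.)

```go
package main

import (
	"context"
	"net/http"
	"time"
)

// withRequestDeadline sketches the idea from the PR: as soon as a request is
// received, derive a deadline-bound context and attach it to the request, so
// every inner layer sees the same deadline.
func withRequestDeadline(next http.Handler, timeout time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		ctx, cancel := context.WithTimeout(req.Context(), timeout)
		defer cancel()
		next.ServeHTTP(w, req.WithContext(ctx))
	})
}

// storageGet sketches an inner layer (e.g. the etcd storage layer): it must use
// the request context, or derive any new context from it, so the deadline is
// inherited rather than starting over from context.Background().
func storageGet(parent context.Context, key string) error {
	ctx, cancel := context.WithTimeout(parent, 2*time.Second) // child of the parent context
	defer cancel()
	_ = ctx // a real implementation would pass ctx to the etcd client here
	return nil
}

func main() {
	inner := http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		if err := storageGet(req.Context(), "/registry/pods/default/foo"); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	http.ListenAndServe(":8080", withRequestDeadline(inner, 60*time.Second))
}
```

The key property is that the inner layers never create contexts from scratch; they always derive from the request's deadline-bound context, so the timeout propagates all the way down.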
C
For the filter, I have unit tests where I basically simulate timeout conditions and make sure that we get the expected result. I'm also planning to add similar tests in the etcd storage layer. These conditions are hard to simulate in an integration test, but I'm planning to add more unit tests covering these edge conditions.
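(The actual test files aren't shown in the recording; the following is a hedged sketch of the kind of unit test being described, using the standard httptest package. The handler here is a stand-in, not the real apiserver filter.)

```go
package filters

import (
	"context"
	"net/http"
	"net/http/httptest"
	"testing"
	"time"
)

// TestHandlerObservesDeadline simulates a timeout condition: the request carries
// an already-expired context, and we assert the handler sees
// context.DeadlineExceeded and responds with 504.
func TestHandlerObservesDeadline(t *testing.T) {
	handler := http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		select {
		case <-req.Context().Done():
			if req.Context().Err() == context.DeadlineExceeded {
				w.WriteHeader(http.StatusGatewayTimeout)
				return
			}
		case <-time.After(time.Second):
		}
		w.WriteHeader(http.StatusOK)
	})

	ctx, cancel := context.WithTimeout(context.Background(), time.Nanosecond)
	defer cancel()
	time.Sleep(time.Millisecond) // ensure the deadline has already passed

	req := httptest.NewRequest(http.MethodGet, "/api/v1/pods", nil).WithContext(ctx)
	rec := httptest.NewRecorder()
	handler.ServeHTTP(rec, req)

	if rec.Code != http.StatusGatewayTimeout {
		t.Fatalf("expected 504 when the deadline is exceeded, got %d", rec.Code)
	}
}
```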
C
Yeah, and with regards to that, there's also a PR that has some unit tests for cases that are very hard to simulate in a real production cluster or in an end-to-end test. For example, a request times out and the inner handler still tries to write to the response writer: what happens then?
C
Or a request is waiting in the priority-and-fairness queue and then a new request arrives: what do you expect to happen to that request? I wrote a number of unit tests that simulate these extreme edge conditions and check the expected behavior. That PR is merged, but I can share it with you.
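(As a hedged illustration of the "write after timeout" edge case just mentioned, and not the apiserver's actual timeout filter, the standard library's http.TimeoutHandler shows the same class of behavior: once the timeout fires, late writes from the inner handler fail with http.ErrHandlerTimeout.)

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// slow pretends to be an inner handler that only finishes after the timeout
	// has already fired and then tries to write to the ResponseWriter.
	slow := http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		time.Sleep(50 * time.Millisecond)
		_, err := w.Write([]byte("too late"))
		// With the stdlib timeout handler this write fails with http.ErrHandlerTimeout;
		// the apiserver's own timeout filter has to define analogous behavior.
		fmt.Println("write after timeout:", err)
	})

	h := http.TimeoutHandler(slow, 10*time.Millisecond, "request timed out")

	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/slow", nil))
	fmt.Println("client sees status:", rec.Code) // 503 from the timeout handler

	time.Sleep(100 * time.Millisecond) // let the inner goroutine attempt its late write
}
```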
A
It would be hard to test in some end-to-end manner whether the timeout is propagated, for example, but I think it shouldn't be that difficult. So the question is: do you have access to some larger cluster, for example one with a large number of objects such as pods? Then, for example, listing all the pods from etcd would take on the order of hundreds of milliseconds, or even seconds.
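(A quick, hedged way to check the order of magnitude being discussed: time a full pod LIST with client-go against whatever cluster you have access to. This assumes a kubeconfig in the default location; it is not part of any test mentioned in the meeting.)

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// Times a full pod LIST across all namespaces, to get a feel for whether the
// call lands in the hundreds-of-milliseconds (or seconds) range on a cluster.
func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	start := time.Now()
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("listed %d pods in %v\n", len(pods.Items), time.Since(start))
}
```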
A
That's actually just the migration, but we introduced modules in ClusterLoader. This is in response to feedback from many places that our load test config is really hard for a human to parse. Okay, it's not this example here, but the thing is that the load test config has grown to over 800 lines, so it's super hard to understand what is going on there, and the test is super hard to maintain: for example, to implement something new, or to fix something when we occasionally find bugs there. So the idea is that we shouldn't have to keep a test inside a single huge file; we should have modules instead. That is basically what this PR does, and here are some examples.
A
Yeah, so basically with modules we can replace some of the parts with a much shorter definition, and the nice thing is that we can create a single module and reuse it. Usually in ClusterLoader you have one part for creating something and then one for deleting it, or even creating, then modifying, then deleting; you can put all of that into a single module and just call it from the test config. My estimation was that once we have migrated everything to these modules... And we already have an external contributor, Peter, who is helping us with that, so hopefully in the next few weeks we'll have all the load tests migrated. All right, what else do we have here?
A
So the API server is very slow at that point, meaning requests can take hundreds of milliseconds, and then this post-start hook can literally take minutes, because one thing is that we are doing everything serially, and the other is that if anything fails we start over from the beginning. So we opened this issue and we also have a contributor who is working on that; first of all, we've created them...
B
Hey, Matt, a quick question about this: you said it's one get plus update call for each of the hundreds of roles and role bindings, right? Yeah, so, but when the API server starts...
A
I think for the role bindings, yeah, that's a fair point. I know that we started with adding a benchmark, so the second PR is still open, I believe. Actually, that's a good point: we should double-check whether, instead of having a separate get for each of these roles or role bindings, we can do one list. Yeah, that's probably a good idea.
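(The actual bootstrap code lives in the apiserver's RBAC post-start hook; the following is only a hedged sketch of the idea with client-go: one LIST up front instead of a GET per object, with updates reserved for entries that actually need reconciling. All names here are illustrative.)

```go
package main

import (
	"context"
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// Instead of issuing a GET for every default ClusterRoleBinding, fetch them all
// in a single LIST and build a lookup map; only objects that differ from the
// desired state would then need an UPDATE.
func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	list, err := client.RbacV1().ClusterRoleBindings().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	existing := make(map[string]*rbacv1.ClusterRoleBinding, len(list.Items))
	for i := range list.Items {
		existing[list.Items[i].Name] = &list.Items[i]
	}
	fmt.Printf("fetched %d ClusterRoleBindings with a single LIST\n", len(existing))
	// Reconciliation against the desired defaults would go here, issuing updates
	// only for entries that are missing or out of date.
}
```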
A
On the other hand, it might be tricky if there is another actor modifying these roles or role bindings, because then it might actually be hard to resolve the conflict; but yeah, anyway, that's something for API Machinery to figure out, whether that's something we could do or not. Interesting idea. So that's one thing I wanted to discuss more.
A
What's this? Oh yeah, we also have Shem, who started helping us in scalability a few weeks ago. He is doing an amazing job; he already has a lot of meaningful contributions. One of them is that he fixed the image preload feature, which stopped working once we migrated from Docker to containerd.
A
The image preload feature is a feature of ClusterLoader where you can pre-load some images onto the nodes before the test starts running. This is to better simulate production environments: we usually run our scale tests on fresh clusters with basically no images on the nodes, and if you compare an empty cluster with almost no images on the nodes to a production cluster where a lot of workloads have already run, you will see there is a huge difference in Node object size, because every time an image is downloaded onto a node, the hash of that image is stored in the Node object.
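(As a hedged side note for these minutes: the part of the Node object that grows this way is node.Status.Images. A small client-go sketch, assuming a default kubeconfig, can show how much of each Node object the recorded image list takes up.)

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// Prints, per node, how many container images the kubelet has reported in
// node.Status.Images and a rough byte count of the recorded image names.
func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	nodes, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		bytes := 0
		for _, img := range node.Status.Images {
			for _, name := range img.Names {
				bytes += len(name)
			}
		}
		fmt.Printf("%s: %d images, ~%d bytes of image names in the Node object\n",
			node.Name, len(node.Status.Images), bytes)
	}
}
```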
A
So this was actually causing a discrepancy in the results of the scale tests. To give you some numbers: in an empty cluster the Node object is maybe one or two kilobytes, while in production clusters it's easily 20-30 kilobytes. So these are huge objects, and they're really relevant to the performance of etcd, especially for operations like compaction of Node objects and things like that. So, to make the scale test more realistic...
A
...we have this feature: before we start the test, we pre-load some images to make the Node objects more realistic. It stopped working once we migrated to containerd, but Shem fixed that, and he also has other contributions, so a big thank you to Shem for doing that. Some examples of the things he's currently working on:
A
He's almost ready with fixing the load test to support clusters smaller than 100 nodes, and this is a very useful thing for new contributors who want to start their journey with performance testing but don't necessarily have access to large clusters. With Shem's PR, which I hope will merge tomorrow, you should be able to run our load test even on a one-node cluster.
A
So that's also good for anyone who wants to help with ClusterLoader features, or to add something to ClusterLoader, or to pick up a good first issue, because you will have a way to test it. He is also working on making ClusterLoader work with kind.
A
It's the same story with kind: anyone can set up a kind cluster, so that should help a lot, especially new contributors. He's also working on adding better validation of ClusterLoader configs, so before a config is actually executed we'll have a validate step. This is really great, because it has hit us a few times that ClusterLoader had no real validation of the test config.
A
Imagine a large-scale test that can take 12 hours: very often we were wasting runs because, in the middle of the test, after a few hours, it turned out something was wrong with the config and the test failed. So yeah, this is a much-anticipated feature. So again, I wanted to thank Shem, because he's doing an amazing job. And one last thing...
A
Yeah, I actually opened an issue, and a PR has already fixed it. We noticed some issues with how the liveness probe of etcd is configured, that is, the way we set the timeout seconds and the period at which the liveness probe is checked, or rather the way we check the health endpoint of etcd; it didn't really make sense, especially for larger clusters.
A
Oh yeah, the PR should be somewhere around here, because that's the issue. In particular, we had a lot of false positives, with etcd becoming unhealthy for a really short time, and what was happening was that the kubelet was killing it, and that was actually doing more harm than good: if etcd had been left alone it would have completely recovered, but when the kubelet killed it, it caused a lot of issues, because it took more time for etcd to come up again, and then...
A
...we got this thundering herd of issues afterwards, because the API server became unready for some moments, some clients disconnected and reconnected, and so on. The change is actually pretty simple, but it had really nice results in our scale tests.
A
So basically it was tweaking some arguments and also changing the way we check etcd healthiness.
A
We are still checking it; actually, we changed something, because we are no longer using an HTTP GET to check the health endpoint; we use etcdctl instead. But I believe it's more or less the same. Let me check your description... yeah, I think it summarizes the difference between calling the health endpoint and using etcdctl, so feel free to read through that.
A
I don't remember exactly; I'm not sure this was the main reason we did it, but I know that the part about changing the parameters is crucial, because it makes the liveness probe much less sensitive to single failures of the etcd health check; the kubelet will basically give etcd more time.
A
So basically now you need to have five failures in a row, whereas before it was just three. And I also think the etcdctl check is better in the sense that, if there is a lot going on on the master VM in terms of network throughput, then etcdctl probably works better than just calling the HTTP GET handler; but I'm not 100% sure about that.
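(The actual manifest lives in the etcd static pod definition; the following is only a hedged sketch of the shape of the change as described in the meeting: an exec-based etcdctl health check instead of an HTTP GET, and a higher failure threshold so the kubelet gives etcd more time. The endpoint, command flags and timing values here are illustrative, not the real configuration.)

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/yaml"
)

// Sketch of a liveness probe along the lines described: exec etcdctl against the
// local endpoint instead of an HTTP GET on /health, and require five consecutive
// failures (instead of three) before the kubelet restarts etcd.
func main() {
	probe := corev1.Probe{
		InitialDelaySeconds: 15,
		PeriodSeconds:       10,
		TimeoutSeconds:      15,
		FailureThreshold:    5, // previously 3, which made short health blips fatal
	}
	probe.Exec = &corev1.ExecAction{
		Command: []string{
			"/bin/sh", "-ec",
			"ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 endpoint health",
		},
	}

	out, _ := yaml.Marshal(probe)
	fmt.Println(string(out)) // prints the probe as it would appear in a manifest
}
```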
B
Hey, I actually had a quick question, sorry, I forgot earlier. For ClusterLoader today, are we capturing any network-latency-related metrics, or are we still not?
A
Yes. So let me see, where is it... it should be here. It was a long time ago; actually, it's a shame that we haven't done more in this area in the last two years, but basically we defined these network latency and network programming latency scalability SLIs.
A
And we started implementing them; actually, both of them are implemented in ClusterLoader, but the network programming latency one requires changes, because after Kubernetes migrated to endpoint slices the code stopped working, and we haven't had time to look into it. Previously there were Endpoints, and it worked for Endpoints, but I believe for EndpointSlices...
A
Actually, yeah, both. We also have the network latency one implemented, and it actually works, but we only just started measuring it in a very simple way, and we haven't really been looking into the results. Let me show you where the code for that is.
A
All right, so for the network latency we introduced this concept of probes, and it lives in the util images probes package; ping is basically the probe we implemented. A probe is basically a container, a pod running in the cluster, and ClusterLoader creates these probe pods. So, to measure network latency...
A
...there is a simple client and a simple server: the client pings the server, records the latency and exports it as a Prometheus metric, and then ClusterLoader has this Prometheus stack, so basically in Prometheus, if the probes are enabled, we should have... or maybe we do it differently, let me check.
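(The real probe implementation lives in the perf-tests repository; as a hedged sketch of the pattern just described, here is a minimal client that periodically pings a server, records the latency in a Prometheus histogram, and exposes it for scraping. The metric name and the in-cluster service name are made up for illustration.)

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// in_cluster_network_latency_seconds is an illustrative metric name, not the one
// the real probe exports.
var pingLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "in_cluster_network_latency_seconds",
	Help:    "Latency of pings from the probe client to the probe server.",
	Buckets: prometheus.ExponentialBuckets(0.0005, 2, 16),
})

func main() {
	prometheus.MustRegister(pingLatency)

	// Ping loop: hit the server pod and record how long the round trip took.
	go func() {
		target := "http://ping-server:8080/ping" // hypothetical in-cluster service name
		for range time.Tick(time.Second) {
			start := time.Now()
			resp, err := http.Get(target)
			if err != nil {
				continue
			}
			resp.Body.Close()
			pingLatency.Observe(time.Since(start).Seconds())
		}
	}()

	// Expose the histogram so the Prometheus stack brought up by ClusterLoader
	// (via a ServiceMonitor) can scrape it.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```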
A
I think it's like that; now let's take a look at the probes code. It should be in measurement, common, I believe, probes... oh, here are the manifests. Okay, all right, so there's a deployment for both the server and the client, and there is also a ServiceMonitor, which is a Prometheus Operator API, a custom resource for defining that the ping server should be scraped, with this interval, using the metrics port defined in the deployment.
B
So in the test, when Prometheus actually scrapes these metrics, what happens? How is it collected as a summary at the end? Are you just taking...?
A
So I believe that part is not implemented yet, but you can check it in Grafana.
A
You can basically provide a disk snapshot name, and after the test ClusterLoader will dump the Prometheus database onto that disk; later we have scripts for creating Grafana instances from this data. So basically you can check the Prometheus data for a given test, all the metrics that were collected during the test.
B
So you're running your own instance of Grafana, on your computer or somewhere else, not a long-...
A
A long-running instance, yes. So yeah, okay, anyway, you can check that in Grafana, and I can show you an example later. I believe we haven't implemented any automatic checking or defined any thresholds for it yet, but in general that's something we do elsewhere; that's the way the API call latency measurement works.
A
If you take a look at the API responsiveness SLO measurement, the Prometheus-based one, which is the only one we use right now: basically Prometheus scrapes the API server every five seconds, and we have this measurement that uses Prometheus; it connects to Prometheus and executes a query, and that's how we check whether the SLO is satisfied or not. So the idea was to do exactly the same for the network latency, and we have almost everything there.
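(The real measurement code is in ClusterLoader; this hedged sketch only shows the mechanism being described: connect to Prometheus and run a quantile query over the API server latency histogram, then compare the result to a threshold. The Prometheus address and the exact query are illustrative.)

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Address of the Prometheus instance ClusterLoader brings up; illustrative.
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// Illustrative query in the spirit of the API responsiveness check:
	// 99th percentile of API server request latency, per verb and resource.
	query := `histogram_quantile(0.99,
	  sum(rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])) by (verb, resource, le))`

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// A real measurement would walk the result vector and fail the test if any
	// sample exceeds the SLO threshold.
	fmt.Println(result)
}
```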
A
Similarly for network programming latency: we also have it implemented, and it works in a similar way, Prometheus scrapes the kube-proxies, because kube-proxy has a metric for the network programming latency, and then we have a query here to check it. So take a look at the code. Okay.
A
Yeah, all right, so we are over time. Thank you everyone for joining; hope to see you in two weeks. Bye.
Thank you, bye.