From YouTube: 2023-05-25 Kubernetes SIG Scalability Meeting
Description
Agenda and meeting notes - https://docs.google.com/document/d/1hEpf25qifVWztaeZPFmjNiJvPo-5JX1z0LSvvVY5G2g/edit?usp=sharing
A
Hey everyone, this is the SIG Scalability meeting for the 25th of May 2023, and today we have an LTTng tracing demo from Benjamin, and I give it to you... yeah, you can just start. Okay.
B
So allow me to provide some background information for those unfamiliar with the concept of tracing. You might have heard of OpenTracing or even eBPF. Essentially, what a tracer does is record events that represent what a process was doing at a certain point in time.
B
As an example, here we have a client making a request to the back end, and the back end doing a few subtasks. So the events would be the client starting a request and then ending the request, and then the web app starting the processing and then ending the processing. And with all of these events, you can create a sequence diagram that represents the critical path of your request.
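(To make the event model described here concrete, below is a minimal sketch in Go; the types and names are illustrative, not LTTng's actual format. It pairs begin/end events to recover per-task durations, which is the raw material for the sequence diagram just mentioned.)

```go
package main

import (
	"fmt"
	"time"
)

// TraceEvent is an illustrative record of the kind a tracer emits:
// which process did what, and when.
type TraceEvent struct {
	Timestamp time.Time
	Process   string // e.g. "client", "webapp"
	Task      string // e.g. "request", "task1"
	Phase     string // "begin" or "end"
}

// spanDurations pairs begin/end events to recover how long each
// (process, task) span took.
func spanDurations(events []TraceEvent) map[string]time.Duration {
	begins := make(map[string]time.Time)
	durations := make(map[string]time.Duration)
	for _, ev := range events {
		key := ev.Process + "/" + ev.Task
		switch ev.Phase {
		case "begin":
			begins[key] = ev.Timestamp
		case "end":
			if start, ok := begins[key]; ok {
				durations[key] = ev.Timestamp.Sub(start)
			}
		}
	}
	return durations
}

func main() {
	t0 := time.Now()
	events := []TraceEvent{
		{t0, "client", "request", "begin"},
		{t0.Add(2 * time.Millisecond), "webapp", "task1", "begin"},
		{t0.Add(9 * time.Millisecond), "webapp", "task1", "end"},
		{t0.Add(10 * time.Millisecond), "client", "request", "end"},
	}
	for key, d := range spanDurations(events) {
		fmt.Printf("%s took %v\n", key, d)
	}
}
```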
B
But the issue is that from one trace to another, maybe you can see here that task 1 is taking more time compared to trace A, and there's no way to know exactly why it was slower.
B
It's kind of a black box, and so kernel tracing allows you to better understand why the task was blocked. So in the second example, you can see here that, using kernel tracing, you could see that task 1 was blocked by the task 1 from trace A, because it was waiting on a futex. Basically, a kernel tracer collects system calls, so that way you can know more about your system, and using a tool like Trace Compass, you can better understand the critical path of each process.
B
You can see the CPU usage, the memory usage, the disk usage, and a lot more. So the kernel tracer that I'm using is LTTng.
B
As an example, I've traced a simple deployment of a single pod on Kubernetes, using my user-space instrumentation. You can see all the events in the life cycle of a deployment creation: you can see the Deployment being created, you can see the ReplicaSet being created, and you can see the Pod being created. So you can see all the phases to create the pod: you can see it being pulled, you can see it being started, and you can see it being killed.
B
In this pod I've created, I put a really low quota on purpose. So when you zoom in on the pod being started, you can see the small spike here that indicates that it was being throttled, which is a common case. And when you look at the control flow of your system, you can see that all my processes were preempted, which basically shows that my pod was being throttled. Also, in this case, I can look at...
B
The image pulling was really slow: it took 13 seconds. And basically, to know why it was slow, I can look at the critical path of containerd, since containerd was the process that was pulling the image.
B
In the more specific view, you can see all the times it was preempted, so it's a bit cluttered, but there's an overview.
B
Here is an overview of the critical path, and you can see the reason why it was slow. So in this case, you can see that it was preempted a lot, for example by other containerd processes; it itself was preempted a lot; it also waited a lot on the timer and a bit on the network. So yeah, essentially, Trace Compass and kernel tracing allow you to get more information on what was actually happening in the kernel.
B
You can also see the disk activity, you can see which processes were running on which CPUs at each point in time, and you can see the complete control flow of the processes. So that's essentially the information we can collect and process using LTTng and Trace Compass.
A
Yeah, this looks pretty cool. I was actually wondering, because on the lowest one you also have scaling the Deployment and the ReplicaSet: how do you get this information? Because my understanding before was that LTTng is kind of a kernel thing, but here...
B
So basically, LTTng can collect every system call being made and record it, but there's also a library called liblttng-ust that allows you to record events from user space. So what I've done is add a few calls to that library inside of Kubernetes: I've compiled my own Kubernetes, and from Kubernetes I've called the tracer to add lifecycle events like being pulled, started, killed, etc.
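(Benjamin's actual patches are not shown in the meeting; the sketch below illustrates the general approach of calling liblttng-ust from Go through cgo. It uses lttng-ust's tracef() convenience API, whereas real instrumentation would more likely define a full tracepoint provider with TRACEPOINT_EVENT to get structured fields. The package and function names are made up.)

```go
// Package podtrace sketches emitting LTTng user-space events from
// Kubernetes component code via cgo. Hypothetical; names are made up.
package podtrace

/*
#cgo LDFLAGS: -llttng-ust
#include <stdlib.h>
#include <lttng/tracef.h>

// cgo cannot call variadic C functions, so wrap tracef() in a
// fixed-arity helper.
static void emit_event(const char *msg) {
	tracef("%s", msg);
}
*/
import "C"

import "unsafe"

// EmitPodLifecycle records a pod lifecycle event (e.g. "pulling",
// "started", "killed") as an LTTng user-space trace event.
func EmitPodLifecycle(pod, phase string) {
	msg := C.CString(pod + ": " + phase)
	defer C.free(unsafe.Pointer(msg))
	C.emit_event(msg)
}
```

Events emitted this way show up under the lttng_ust_tracef provider once a user-space session is enabled, e.g. with `lttng enable-event --userspace 'lttng_ust_tracef:*'`, and can be correlated with kernel events enabled via `lttng enable-event --kernel --syscall --all`.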
A
Yeah, so this is pretty cool, because I think last time at the SIG meeting we were discussing kind of a similar thing, where we were interested in tracing KubeVirt latency of creating VMs, and something like this would also be really great, I think. What do you think, Shyam and Wojtek?
D
Yeah, I fully agree. I'm not closely involved in the tracing-related efforts that are driven by SIG Instrumentation, but I think they are integrating with OpenTracing, if I remember correctly. I think it would be good to present this at some SIG Instrumentation meeting, get their feedback, and understand why they chose OpenTracing and whether we may potentially want to revisit that. I'm not super familiar with any of those.
D
So it's hard for me to tell, but they would be good people to talk to about that too.
E
So Benjamin, you said there was a way for you to plug in these kinds of custom events, the user-space events that you're plugging in using this library, and you also said that to be able to do this, you had to go and make some code changes to the components to emit certain events and such. So I think it would be interesting to see if the same kind of, or a similar, method is possible with the OpenTracing work that Wojtek is talking about with SIG Instrumentation. For the things that you have changed here: can they also be plugged into the other solution that the community is thinking about today?
E
If not, then it's a good place to bring it up, right? Hey, I'm able to do this with LTTng, and it's the same thing with...
A
I think one more interesting thing from our perspective would be to see it for the control plane nodes, and not the workers. From a user perspective, they probably care about why the Pod was starting slowly, but from a scalability perspective I think it would also be cool to see it for the control plane.
A
A question, yeah. So my question would be: how much effort did you actually have to put in to integrate it with the node, basically, or with the kubelet and containerd?
B
So essentially, the kernel-space instrumentation you get for free: you just have to install a kernel module and you'll be able to collect every trace event that you want from the kernel. For the user-space instrumentation, in this case, the issue is that the library is in C, so I had to use cgo to call it, but otherwise it's pretty easy; it's just like a system call.
E
I know lots of users, at least a class of customers today, that do at least the kernel side of tracing using eBPF probes. I think you also mentioned that here: you can use uprobes and kernel probes, for instance, right? And I don't know how much of this... for example, Cilium, I believe, does a bunch of eBPF.
E
Yeah, yeah. So I guess, from what I sense, where the community is heading, and I might be totally wrong, is: for these kernel operations, or even user-space operations, like, let's say, changing iptables and stuff like that, or, let's say, networking configuration and such, using eBPF to gather...
E
That is something people already do today, and the OpenTracing proposal that is floating around in SIG Instrumentation... I need to go check whether that is trending towards... maybe Wojtek, you know more about it, but I suppose that's purely Kubernetes space.
E
Okay, and what you're proposing here, Benjamin, is one single solution that can do all of this. So again, I think I'm going back to the same point as previously: it would be good to check with folks the direction we've taken and why we've taken that direction.
B
So that's one issue with combining these two tracers: LTTng is able to combine both, but also eBPF adds more overhead than LTTng does.
E
The tracing you're collecting, if it's not using eBPF, is it a completely new thing, or...
B
No, so that's the thing: eBPF is computing things, but LTTng only copies things in memory; it doesn't do any computation at all, it just records events. So that's why it's really fast and efficient.
E
Okay, okay. Overall, I think this is super cool to watch as well. The other part I think we haven't talked much about is how you are visualizing this. This tool for visualizing, is it also a part of LTTng, or can it work, in general, with the open tracing standards, like, can it ingest traces?
E
Maybe we can... I guess it's a little bit down the line, but I think it would be cool for us to add this sort of visualization for the tests that we run today.
B
Trace Compass, yes. So Trace Compass is just the front end; you don't have to use Trace Compass to analyze LTTng traces. There's also a library called Babeltrace that allows you to essentially read your trace files and create your own custom analysis for them, and I'm sure you could use another front end.
B
But what's nice with Trace Compass is that there's a lot of analyses that have been built on top of it. It's a tool developed by a lab at Polytechnique Montréal, together with Ericsson.
E
Thank you. Yeah, I guess, I think maybe once you check back with SIG Instrumentation, feel free to let us know what the thinking is about it. You can use this channel to bring up any follow-ups, because, for example, we can say what sort of things we usually look for.
B
Okay, but I've actually talked a few months ago with SIG Scalability about our use cases, like before, like...
A
We had a meeting, okay.
B
I was just looking more for use cases inside of Kubernetes, like people that would appreciate, that would want to see certain things using kernel and user-space tracing.
A
Also, I'm wondering about KubeVirt: two weeks ago we had exactly a similar conversation about the issue of debugging why it takes so much time for a VM to spin up in KubeVirt, and they were actually interested in tracing. I'm not sure if they need kernel tracing, but it might be helpful.
A
So that could maybe be one use case. But from our perspective, I think we saw before some issues with pod startup latency, for example, that you presented here, right? And I think we saw it on the node level as well, so it might be useful for us, I think.
A
Yeah, like with Docker before, we had some contention on the node level, I think.
D
Yeah, I think in general we still don't have a super good understanding of where we are spending most of the time, or how the time is split inside the node, for starting or even deleting a pod.
E
Yeah, I guess where this conversation is heading, I believe, is maybe: okay, Benjamin, if this is something you want to take a stab at, and you're interested in actually getting some of these changes in, you might want to see if there are at least some parts of it that you can try integrating with our tests. I guess the changes where you have to go and add these events in the Kubernetes code, those will be...
E
All right, so I had another topic for today, if...
E
So it's about this change; I just pasted the link here.
E
I think, whoever made this change... well, I think you reviewed it. So it seems like this one is making LIST calls take multiple seats in APF, and there are, I think, a couple of customer issues we ran into recently. And I see Pratik from the AWS side has joined.
E
I'll let him speak more about it, but I think the main thing which kind of came out of it for me was: if we had tested that change using our tests today, the load and density tests that we run today, I think they wouldn't have actually exercised this, because our tests don't really create any LIST load.
D
We were testing it a little bit out of tree, internally in Google. But can you explain what the actual issues were? Because, certainly, there are things where it can be improved, but I want to understand where. Because we've also seen a bunch of cases where it actually helps a lot, where the cluster would fall down without this and survived with it.
C
So yeah, thanks. The particular edge case that we are seeing is: if a customer, or a user, is making a large number of LIST calls, and now they are upgrading from a version which didn't have this functionality, which is 1.22 (this was added in 1.23), what they are seeing is that their LIST calls are now taking a large number of seats.
C
So some of the other calls that fall in the same bucket are basically getting 429s, because there are no more seats for those calls, which was not the case in 1.22. And this started happening right after they upgraded from 1.22 to 1.23. In the metrics, we can see that the 429s and the rejected calls also increased as a result of the upgrade, but from their side there was no other increase in the load or in the LIST call pattern that we checked.
D
Yeah, I think that, in general... we've seen that too, and the reason for that is primarily that we are not tuned well enough, especially in terms of defining the capacity of the API server. So, basically, defining or increasing the number of in-flight seats in the API server, adjusting this value, is how we believe it should be solved.
E
Put more seats. I feel that probably the miss with that PR is that, when doing this, it is also sharing the bucket with other verbs, which means mutating requests, which will then be affected. Maybe we should have done that along with separating the bucket for LIST calls altogether, because in the end we actually ended up doing that.
E
For these customers, we created another bucket, which diverts the LIST traffic, and you can separately throttle that without affecting the rest. So maybe... but I also see how it may be hard to come up with such a generic bucket, with respect to how many concurrency shares and so on it should have. But do you think it makes sense that this change should have gone along with that sort of a change?
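(For reference, a dedicated bucket of the kind described here would be expressed as a PriorityLevelConfiguration plus a FlowSchema routing LIST traffic to it. The sketch below is hypothetical: the object names, share counts, and the matched service account are invented for illustration, using the flowcontrol.apiserver.k8s.io/v1beta2 types of the 1.23 era.)

```go
// Hypothetical sketch of the mitigation described above: a dedicated
// APF priority level ("bucket") for heavy LIST traffic, so such calls
// cannot exhaust the seats shared with mutating requests.
package main

import (
	"fmt"

	flowcontrol "k8s.io/api/flowcontrol/v1beta2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func listBucket() (*flowcontrol.PriorityLevelConfiguration, *flowcontrol.FlowSchema) {
	// A separate, limited priority level with its own concurrency shares.
	pl := &flowcontrol.PriorityLevelConfiguration{
		ObjectMeta: metav1.ObjectMeta{Name: "heavy-lists"},
		Spec: flowcontrol.PriorityLevelConfigurationSpec{
			Type: flowcontrol.PriorityLevelEnablementLimited,
			Limited: &flowcontrol.LimitedPriorityLevelConfiguration{
				AssuredConcurrencyShares: 10,
				LimitResponse: flowcontrol.LimitResponse{
					Type: flowcontrol.LimitResponseTypeQueue,
					Queuing: &flowcontrol.QueuingConfiguration{
						Queues: 16, HandSize: 4, QueueLengthLimit: 50,
					},
				},
			},
		},
	}
	// Route LIST requests from the offending client into that level.
	fs := &flowcontrol.FlowSchema{
		ObjectMeta: metav1.ObjectMeta{Name: "heavy-lists"},
		Spec: flowcontrol.FlowSchemaSpec{
			PriorityLevelConfiguration: flowcontrol.PriorityLevelConfigurationReference{Name: "heavy-lists"},
			MatchingPrecedence:         500,
			Rules: []flowcontrol.PolicyRulesWithSubjects{{
				Subjects: []flowcontrol.Subject{{
					Kind: flowcontrol.SubjectKindServiceAccount,
					ServiceAccount: &flowcontrol.ServiceAccountSubject{
						Namespace: "batch", Name: "list-heavy-client",
					},
				}},
				ResourceRules: []flowcontrol.ResourcePolicyRule{{
					Verbs:      []string{"list"},
					APIGroups:  []string{"*"},
					Resources:  []string{"*"},
					Namespaces: []string{"*"},
				}},
			}},
		},
	}
	return pl, fs
}

func main() {
	pl, fs := listBucket()
	fmt.Printf("PriorityLevelConfiguration %q, FlowSchema %q\n", pl.Name, fs.Name)
}
```

Applying these two objects throttles the matched LIST traffic separately, which is the "separate bucket" mitigation being discussed.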
D
Yes, so I guess it's certainly a reasonable mitigation to that problem too. I wouldn't do that by default for everyone, because in a typical use case, where the user is not actually overloading the API server in some sense, I think they generally want to share the priority level between different types of calls, I mean both mutating and reads. So I think what we have are mitigations and so on, but I'm not sure we actually should have, by default, separated that.
C
Coming back to your original comment around testing: you also mentioned that the estimation is still not perfect. So let's say in APF we go and improve the estimation algorithm, and tomorrow the maximum seats, for whatever reason, is bumped up to like 20 or 50 seats; then again we might run into this issue, right? Unless we have this kind of testing which can include this case, from either ClusterLoader or some kind of perf test.
D
Yeah, I mean, the fact that we are missing tests is something I certainly agree with. So if anyone has capacity to extend our tests or build new ones to exercise these scenarios, I'm all for it.
D
You can also talk to Abu from Red Hat; I will paste his handle here in the chat, because he was actually also looking into that and performing a bunch of tests internally at Red Hat. I think I've seen some demo or presentation or summary from him at some point, like two years ago.
D
But I can't remember exactly what those were. Certainly he was also doing those, and he may either have something still around, or may be able to help with running some of those tests, or with designing or reviewing them.
D
I will post his nickname... I will not type it right now: "polynomial", but with some zeros somewhere. Okay.
C
Or if you could point me towards the tests that I can reuse, for enhancing these tests.
G
We are working on experiments where you need to find out at what point the API server breaks under concurrent LISTs. Also, I was just curious, and it's related to the discussion: if anyone has pointers, I'd be curious to read them.
E
One thing, though: yeah, I guess we missed the release note for this particular change, but I think you mentioned this already. That is also my other call-out. I think what we needed was a release note saying this is changing in this way; you know, that can make a difference.
D
Yeah, I guess it's too late now to add it, because no one will read this release note now, given that 1.22 is already out of the support window and 1.23 will soon be out of it too. But I think updating the documentation in general is something that we still can do, and that is something that people are looking at.
E
Yeah, so, okay, sounds good. All right: so there are the testing gaps, and this idea about whether to split LIST calls by default into a separate bucket is open, we need to discuss this. But in general, actually, the fact that this change went out and, besides two customers, not many people saw this, is a good sign that it is actually the right change, and these particular cases are just corner cases.
A
Okay then, thank you, and see you in two weeks.