From YouTube: SIG - Performance and scale 2023-08-17
Description
Meeting Notes:
https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.tybh
A
Okay, moving on. Today let's do a quick catch-up on our post-v1 tracking, so I can get out of this. So let's see here, all right. First point, the PR for the graphing script. I think Lubo is working on... oh no, you're working on this, okay.
B
Yeah, sure. So this is the post-submit job that we have been talking about in the last couple of meetings. This post-submit job basically takes the data we have in the CI performance benchmark repository and plots graphs for it. The reason why it has to be a separate job is that the other periodic job only has about a week of data. So after it publishes that week's data, every week's data is present in the CI performance repository, and that's why we need this post-submit job.
B
So I think these are the strange ones, if you scroll a little bit up. We reviewed this in June and July. Last time we saw the patch counts go down a little bit, but they have come back to normal now, the patch VMI count. And if you scroll up, there is one more chart like this, looking at 10 points.
A
Okay, let's see, let's close on this one, and these ones too.

A
Yeah, that one, okay, cool. Pretty steady, it looks like, all right. This is awesome. It's going to set us up a lot, because once we get this stuff automated, we can start building some more of those tests in a few of these places, like the density cluster, and see how we can lift the results a little bit.
B
Cool. Actually, the density cluster one has some, you know, well, not anomalies, but it looks like for a lot of early July the density cluster jobs were broken.
A
Yeah, I know, I think, yes, yeah. This is kind of one of the ones where, after the new year, I wanted to see where it goes, because we basically... I think this was the rebuild. Somewhere in here Brian did the rebuild to upgrade Kubernetes to a new version, and it was completely offline during that time.
B
No, not that, the recent ones, so July 2 through 3, yeah. Yes, a little bit, right, yeah. From July 3 onwards we saw no job reports until July 15th or something.
A
Yeah, okay, I mean, the density cluster needs some more love; we just haven't... we've been focused on other things. As far as I'm concerned, what I was saying with this data is these holes here, this hole here, and you know, the one back in July. It's totally different from what we have for the periodic, where you can clearly see trends. This one, we're not getting a good trend line here.
A
That's what I mean, that's what I'm kind of thinking about. At some point we need to look at this from January forward or whatever, because I don't know how to read this data, it's too random. Makes sense? So we probably need to spend some time in the future, when we have all this stuff, to do some work on the dedicated job and see...
A
...if we can get some more, get a trend line out of this. Okay, cool, all right, that's awesome. So the script is looking good, it's working, and okay, so you just need some reviews on this one, and then we should get that in, and then it looks like we can just wire it up to this. Okay, so that covers those. It's just those two PRs, right? I think 2932 and the other one, yeah, okay. And then that gives us what we need for the job.
B
So this is a PR that came to my attention during the API review work. Basically, this PR proposes that we introduce a shadow node, a new CRD, where we will basically write everything to that shadow node, and the actual node will only have read-only access. So that addresses some security concerns by introducing a new CRD. The reason why I have it in this call is that I feel like this will introduce a lot of new performance-related... you know, not problems, but the sort of things we need to be cautious about, and I don't know, it feels like this needs a little bit more design and discussion.
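For reference, a minimal sketch of what Go types for such a CRD could look like. This is an assumption based on the description above, not the actual PR's API; the ShadowNode name and fields are illustrative only.

```go
// Hypothetical types for the proposed CRD, following common kubebuilder
// conventions; name and fields are guesses based on the discussion, not the PR.
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ShadowNode mirrors a Node: controllers write here, while the Node object
// itself stays read-only to the node's own credentials.
type ShadowNode struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Spec names the Node this object shadows.
	Spec ShadowNodeSpec `json:"spec,omitempty"`
	// Status carries fields that would otherwise be written to the Node itself.
	Status corev1.NodeStatus `json:"status,omitempty"`
}

type ShadowNodeSpec struct {
	// NodeName is the name of the Node being shadowed.
	NodeName string `json:"nodeName"`
}
```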
A
Yeah, I think so. I mean, really, they, or Lubo, could put together something in the community first. I read this one too, and I was not thinking about it in terms of scale, but I just don't understand... it just made me think that we're regretting the RBAC call, the RBAC API, but we're just working around it.
B
So for scale, here's what I'm thinking. Our list calls will be doubled, because we have to watch... well, we have to watch the node objects, because they are read-only, and in order to shadow them we will also have to watch the shadow node. So our list calls, sorry, list-watch, I call them together, will be doubled. And the writes that the system makes, any write that is made to the node API, will also have to be followed through on the shadow node API in order for them to be consistent. So even the writes and updates... I don't know if they will be doubled, but we will definitely have more load on the updates.

So these will both affect scalability in large-scale clusters, and that's why I wanted to have some kind of design and brainstorming on what we will do to validate performance and scalability once this work is taken on.
B
Yeah, I will reach out to Lubo on this PR and, you know, get some discussion going, and I'll tag you as well.
A
Well, okay, yeah, I mean, sure. I don't know what Roman said, but I think Lubo needs to take some time to do some design on this. I mean, it's been a few weeks and he hasn't responded with anything; he's quite busy with some other things. But yeah, let's try and push them in that direction, because I think it needs a little bit more thought.
A
Let me open this up. What else did we have in here? So you've got... the API reviewers, okay, the definition policy, okay. Here's our sig-scale one, okay. So in this stuff, we've got automation, which is what you're covering with those changes.
B
Okay, I need to add a section to the doc for tooling. That's something I'll pick up next, once the post-submit job is merged.
A
Cool, okay, all right, you've got a plan then. I'll take a look at this PR after the call.
B
So last time I demoed... well, last time I talked about a controller to add watch requests. That work is going on, and right now, with that POC, we can see that if you add 400 nodes, it will create 400 times 20, which is around, well, more like 8,000 watch requests, and with that the API server memory increases by around 1 to 1.5 gigabytes.
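A rough sketch of what such a POC could look like with client-go; the 400 nodes and 20 watches per node are the numbers from the discussion, while the program structure itself is assumed.

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	const fakeNodes, watchesPerNode = 400, 20 // numbers quoted in the meeting
	ctx := context.Background()
	for i := 0; i < fakeNodes*watchesPerNode; i++ {
		// Each call opens one long-lived watch stream against the API server.
		w, err := client.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
		if err != nil {
			log.Fatal(err)
		}
		defer w.Stop()
	}
	fmt.Printf("holding %d node watches open\n", fakeNodes*watchesPerNode)
	select {} // block so API server memory can be observed while watches are open
}
```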
B
In our downstream data centers, yes. So not only is the kubelet load included in that, but all the other daemon set controllers that run on the kubelet also cause watch load. This is on creating a node, you said? Yes, creating a node, so for example when you start a kubelet; here, creating a fake node.
B
Yeah. So now the question is what resources are used for that list-watch, right? It depends on different things: the kubelet uses nodes, pods, PVCs, PVs, secrets, and a couple of other things, CSI. So...
A
This is one of the things, when we look at this... this was one of our problems when estimating scale. It was two things. One is we need to know the weight, or the cost, of a list-watch request, whatever that is; we could classify it in memory, CPU, or some other unit that we make up. And then the other one was how many of these things get created when something happens in the cluster. So it looks like you've already got...
A
It sounds like you've gotten pretty close to both for just "create node", which is really interesting, because that would be a really interesting topic to talk about, especially in Kubernetes sig-scale, because I wonder if they have this level of granularity or visibility in some of their metrics. Because when I talked to them previously, there were some estimates about this stuff, and I...
B
Yeah, so I think the data coming from it is not using the tools that we have discussed here. It's using the audit logs, and it's very manual. I had to go through all the requests and, you know, spend...
B
Actually, when I say audit logs, they are shipped to Kibana, right? So I'm making a table in Kibana which gives me the API resources on the rows and the number of calls in the columns, and this is filtered by watch calls, right?

So then I have all the API resources, and all the watch calls made by different components aggregated into that table, and then I simplify it to just one call. This is filtered again by one node, and one node is making multiple calls, but in our simulation I will just use one instead of many when creating, because those watches can terminate and be restarted, and I don't want to account for that.

So the very base case is at least one call made during the creation. That's how I got to this list.
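A minimal sketch of the same aggregation done directly over an API server audit log in JSON-lines format, rather than through Kibana. The field names match audit.k8s.io/v1 events; the program itself is illustrative.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// auditEvent is a minimal subset of an audit.k8s.io/v1 Event.
type auditEvent struct {
	Verb string `json:"verb"`
	User struct {
		Username string `json:"username"`
	} `json:"user"`
	ObjectRef struct {
		Resource string `json:"resource"`
	} `json:"objectRef"`
}

func main() {
	f, err := os.Open(os.Args[1]) // path to the audit log (one JSON event per line)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	counts := map[string]int{} // "resource by username" -> watch call count
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var ev auditEvent
		if err := json.Unmarshal(sc.Bytes(), &ev); err != nil {
			continue // skip malformed lines
		}
		if ev.Verb == "watch" {
			counts[ev.ObjectRef.Resource+" by "+ev.User.Username]++
		}
	}
	for key, n := range counts {
		fmt.Printf("%-60s %d\n", key, n)
	}
}
```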
A
I think the next step would be... yeah, I was trying to classify them, because it would be interesting to know what this number is for the upstream components that you get out of the box, maybe a stock release.
A
No, we wouldn't... well, I don't think so. But the point is, it's two things: which of them aren't upstream, so maybe we can reduce those, and which ones are part of upstream components, and for those, whether we leave them, remove them, or whatever we want to do with them. And then for the ones that are, the value is that we know they're needed, right? So this is what's going to...
A
This is what's going to add up to our number, right? Whenever we do something with a node, this is the cost of it. And so the idea was that, okay, if creating a node costs something... I don't know, we'll do a numerical value, it won't even be a CPU value or anything: it costs five list-watches, maybe that's the metric. How much does updating a node cost? Maybe it costs 10 list-watches, maybe it costs none, whatever it is. What is the... how do these things...
A
What's the cost of some of these things? That's what I'm wondering, because when you start to look at what comes out of the box, let me separate the two: now we can make assumptions about what the upstream infra is doing, and then we can start to extrapolate what the really costly operations are, which we kind of already know, but now we can put a number to it. It's not just "oh, we know list-watch is costly and we know creating nodes is costly"; we know exactly how much it costs.
B
Well, that makes sense, yeah. So I think we are on track for doing that, though not upstream versus downstream; that's a good data point, I'll add that. But in terms of cost, what I did was, because we have these numbers, a 20-to-1 multiplier, I implemented that number in the watch thread. So anytime a node is created, that watch controller will create 20 watch calls, and with that implementation we created 400 nodes, and the result of that is that the memory of the control plane, which was averaging around 4 gigabytes, jumped to 5.1, 5.2 gigabytes. So immediately we saw the API server memory, the control plane memory, increase, and that's the actual gain we saw from the simulation. So the point is that, even though this is just the means, the end goal is to show that adding a node actually increases control plane resource utilization, and we saw that in the simulation.
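As a back-of-envelope check on those numbers (assuming the memory growth is attributable entirely to the watches):

400 nodes x 20 watches/node = 8,000 watch streams
(5.1 GB - 4.0 GB) / 8,000 watches ≈ 140 KB per open watch
(5.1 GB - 4.0 GB) / 400 nodes ≈ 2.8 MB of control plane memory per node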
B
Actually, you know, I think even the list-watch will vary per data center, right? Because, say we have two KubeVirt clusters, the density one and the performance one, and the density one uses a certain daemon set, you know, a PVC provisioner, a CSI provisioner, that is not used by the performance cluster. Then, because of that little change, even though everything is upstream and open source, that little change will produce a difference in this list-watch cost.
B
So you're saying that if we can have a consistent way of measuring list-watch, no matter how many components are deployed in the cluster, if we have a standardized mechanism to measure the list-watch coming per node, then we can say: okay, if you are a user and you want to estimate cost, here's the tool to measure the list-watch per node, and here's the tool where you plug in the result and simulate that. Yeah.
A
Exactly. You could have your cluster and I could have mine. I might get 18-to-1 on list-watches and you're 20-to-1. That tells me something: it tells me that I'm creating fewer events when I'm doing things with nodes, and you're creating more. There could be... so now it takes out topology, because now it's like, oh wait.
A
Without having an external tool, because, I mean, I guess you could maybe do it with Prometheus or something, without having a full Kibana setup and then having to export a graph for someone else to consume; something we can do from the command line.
B
Yeah, that makes sense. So I think there are two things, right. If we want to do it with Prometheus, we will need to look at the metrics exported by client-go, and if we are lucky there will be a metric that can help us. Well, actually, that was my problem: the reason why I had to go to the audit logs is that the client-go metrics give us an aggregation. So, for example, you will be able to know how many list-watch requests were made over a period of time, and you can find an increase or a decrease, but it does not give you the granularity of which service account is causing that, which node is making that, and which pod within that node is making that. All of that was important for me, to find out the exact source of that list-watch request.
A
This is where I'm like, okay, well, we're getting it in the audit log, so we have a print statement somewhere. I think this is one of those places where you can identify, well, this would be nice to have: you could dump this into Prometheus, and we could just have a count for this or something, and then it becomes very, or much more, accessible.
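A sketch of that "dump it into Prometheus" idea: a counter keyed by resource and username that whatever tails the audit log would increment. The metric name and labels are made up for illustration.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// watchCalls counts watch requests seen in the audit log; the metric name
// and label set are hypothetical, chosen only for this sketch.
var watchCalls = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "audit_watch_requests_total",
		Help: "Watch requests observed in the API server audit log.",
	},
	[]string{"resource", "username"},
)

func main() {
	prometheus.MustRegister(watchCalls)

	// Wherever the audit-log tailer sees a watch event, it would record it:
	watchCalls.WithLabelValues("nodes", "system:node:worker-0").Inc()

	// Expose the counts so they can be scraped or curled from the command line.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```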
B
Yeah, so that audit log is part of the Kubernetes API server. Together, right. So, I don't know, maybe we can have a sidecar that filters it. I mean, I can imagine a scenario where, let's say, you have a sidecar to the API server, a sidecar container that filters the audit logs that you want, and then you add a node, and for five minutes you let that filter go through, collect the results, and then analyze how many list-watch requests were made. We could do something like that that would be generic across all the...
A
Okay, well, I don't know, that's something to think about, because I think where we can agree is that what you have here with this list-watch, this is the important scale and performance data, and it's useful to talk about; you just have to make it a little bit easier to access. Yeah.
B
So now where we are is that, okay, we did an experiment where we added fake nodes and added fake VMIs. There was no change on creation of fake VMIs; all the change was when a node was added. So this tells us that the next thing we have to look at is the load generated on the control plane when a VMI is added by the KubeVirt stack, and to implement a simulation for that as well, just like the watcher threads.