From YouTube: SIG - Performance and scale 2023-07-13
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.tybh
A: All right, July 13th. Okay, let's start with the post-v1 tracking. So v1 was released, and I took all of the v1 items and moved them over to this issue. So we'll have mainly things in here that we need to... I like the themes in here. It's mostly documentation stuff: we have the support matrix that Fabi is working on, and there's the API review process. All these things are part of their process, so they'll continue. And then there's the SIG-scale post-v1 one, which I created. That's the other one where I think what we wanted to go with: you've got the automation, right, you've got benchmark enhancements, then we have some of the documentation, and then I think one of them covers the rendering. I don't know if it's here.
B: I already marked that issue done, because we're able to get HTML pages on the KubeVirt website. Let me share the link. Actually, can you go to the previous issue that is linked in the... yeah. This is 2705.
B: Yeah, so, yeah, this one, okay.
A: Okay, all right, so this is the link to it. Okay, let me... oh, okay, so this is the v1 release.
A: Okay, okay. So we have everything then. Yeah, so let me see. Well, this is where I would ask... so this is up to June, okay, this is release v1. So what should we expect here? Is this page not going to change, or is it where the latest stuff would be?
B: I think this is not going to change. The latest stuff will be in the weekly VMI results, yeah, and then the index.html here, yeah.
B: Okay, right, right. So an automated way to update this index.html is what we have as a pending item, right?
B: Yeah, if you go to the issue, I have it... yeah, I have it under automation, the first bullet point.
A: I've got the screenshots. Okay, okay, sounds good.
B: So a question I have is: does KubeVirt have a schedule for a minor release, or how does that get decided?
A: Did we... I don't know if it's written down anywhere, I mean.
A: Not sure. I think what's going to happen is we'll get some z-streams, but then... The next thing is, there's something I have to do: I have to create the... you know, me or Fabi will have to write the schedule. We need at least a 1.1 schedule.
B: So the next release is going to be release 1.1, and it's going to be in three months. Sorry, it will follow the Kubernetes schedule.
A: Yeah, exactly: four months, sorry, not three, four months. So it should be in the fall, when, I think, 1.29 is released.
B: ...the storage used for scraping and storing that data, so the actual utilization will go up. And what I wanted to brainstorm a little bit is, you know: is it worth it to go set up automation for this, or should we have some kind of manual process where we look at it, and only if it gets out of hand do we introduce automation? I don't know, just sharing some thoughts. Yeah.
A: Even with this performance job running per PR, at least I haven't seen anyone complain, and I haven't seen, even from the CI metrics, that it was considerably worse. I actually haven't noticed it to be worse; by worse I mean jobs taking longer, or just being queued up because of some massive resource usage by the performance jobs. I haven't seen that, so I don't think we're near the limit, at least.
A: Maybe we already need a limit, but I don't know if I would say it's worth adding, just because I don't know if we need it yet. I think it's definitely shown its value, and I think we can demonstrate that, and if the case comes that we need more resources, let's have that discussion before we make the judgment of not including this.
B: Got it, okay. So then, if we say we want to include it, the task will be: we need to set up the performance job for the release-v1 branch, so that whenever there is a new PR...
B: ...the changes get tracked, and we compare that with the actual benchmark from before the release for additional information. And we'll do that for the next three releases, right, until we deprecate this one.
A: Yeah, right, we'll hold on to all those minor-release metrics and then get rid of them.
A: I don't think we get that much; there aren't a ton of backports or anything, it's not a crazy amount usually. I think the determination of whether it's worth cutting a release is how many things get through, and it's usually not that many. So I guess the question then is: should we keep them? Because I think, technically, KubeVirt at least right now...
A: ...allows people to continue to create minor releases. I don't think people are actively doing it, or actively requesting it, but there have been some cases, like 0.50 or 0.49 or something, where people have done additional releases. I think that needs to be sorted out, and I think it needs to be sorted out together with the...
A: ...with the release support matrix, I don't know. I think what we should do here is follow whatever the CI version is for the Kubernetes provider.
A: So if we're on 1.27, that should correspond to, you know, 1.0, and 1.28 should correspond to 1.1, and then we just kind of follow that window around. In other words, whenever we eventually want to drop support for those branches inside all our repos... yeah, I guess we'd just do it in our repos. We probably wouldn't change CI, right? We'd just stop tracking it in our repos.
B: Is there a way to do that? We'll have to change CI as well, right? Because we'll have to turn those post-submit jobs off and introduce new ones.
B: Got it, okay. Yeah, then I will reflect this in the tracking issue for post-v1. So somehow we need a way to keep updating the release branches from the actual release data in the benchmark repository. Yeah, I'll rework that a little bit and add it to the issue.
A: Okay. If there's something that we need to do there: I think the release team already updates a bunch of things, by the way, whenever we go between releases. If we need to add something to that list, maybe we can talk to Daniel and be like, hey, whenever we move to this new release, remove this piece of code. I think that would help us there, sure, yeah.
B: Anything else? No, I think there are a couple in that issue. If you can go back to the issue... yeah, so I think there were a couple of suggestions around benchmark enhancements. One is including VMs with instance types and preferences data; I've created an issue for this. It might be good to invite Lee to this call sometime and have a discussion about it.
B: And another suggestion was that we should also add averages along with the P95 data and P50 data. I think the technical justification there is that if you take the distribution of creation-to-running times, and there are peaks just before the P95 and just before the P50, the average will shift, but because we are tracking the P95, the P95 will stay the same. So having the average will help us understand a little bit more about the bimodal distribution of these runtimes.
B: So while the P95 and P50 are good indicators, if we want more data points on how the actual bimodal distribution is doing, we can add the average and get that visibility.
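
(A minimal sketch of the effect described above, with made-up numbers rather than real benchmark data: if a chunk of the distribution slides to just under a tracked percentile, the mean moves while P50 and P95 hold almost still.)

```python
# Made-up creation-to-running latencies illustrating the bimodal argument:
# mass moves from ~22s to ~38s, just under the P95, so the mean jumps
# while the tracked percentiles barely change.
import numpy as np

rng = np.random.default_rng(0)

def report(name, sample):
    print(f"{name:9s} mean={sample.mean():5.1f}s  "
          f"p50={np.percentile(sample, 50):5.1f}s  "
          f"p95={np.percentile(sample, 95):5.1f}s")

fast = rng.normal(15.0, 1.5, 5_200)   # majority of runs
mid = rng.normal(22.0, 1.5, 4_200)    # a second mode
tail = rng.normal(45.0, 1.5, 600)     # slow tail that pins the P95
report("baseline", np.concatenate([fast, mid, tail]))

# Regression: the middle mode slows from ~22s to ~38s.
mid_slow = rng.normal(38.0, 1.5, 4_200)
report("regressed", np.concatenate([fast, mid_slow, tail]))
# The mean rises by roughly 7s; p50 and p95 stay close to baseline.
```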
A: Okay, so I understand. The thing that we have to do is explain it. Right now we're handing people graphs, and we have a little bit of text in the readme. So as long as we can explain what its purpose is and how people can use it when they look at it, that's fine.
B: I think so. I'm not sure if this will go into the benchmark, but at least for our own visibility we should have that graph. Then, whether we want to ship it with the benchmark or whether the explanation is enough can be a secondary decision we make during releases, right? But at least when we are looking at the graphs for any performance problem, we should consider all three: P95, P50, and the average. Okay, yeah, and then we can explain it.
B: Okay, and then the next bullet point in the issue was about documentation. I think we have already discussed that; I don't have any additional discussion points there.
A: Okay, all right, let's close this one. So, next: the blog about explaining the v1 graphs. You've got something here, it looks like, okay. Oh, is this what we...
B: I'm using the same document but a different page, okay. So I did a little bit of brainstorming as to what kind of data can go into this detailed blog post that will be helpful to readers.
B: What I came up with is these five bullet points. We want to introduce what we are doing; we want to show the benchmarks; then help a little bit on how to interpret these graphs, because these graphs have some esoteric aspects that only people involved with this can understand.
B: So we want to decode that a little bit, then explain how this can be helpful for, let's say, any other project that is using a CRD and controller. And with this we kind of need the tooling as well, so that becomes a sub-point: what tools are used in order to set this up, and how you can do it for your project.
A: Okay, yeah, I follow that.
A: I'm trying to think in terms of developer blogs. How deep do you think we should go? We say: intro, show the benchmark graphs, just tell them what they mean.
B: I think three and four, right? Okay, so in three, what I'm thinking is: we explain the graph, then show its usage by tracking one or two PRs.
B: Right. So what I want to highlight here is how this tool, or the strategy we are using, can help achieve this: find a performance degradation and the actual PR that caused it, right, the source of the degradation.
B: Yeah, so there are two options here: either we take this toward how to get similar benchmarks set up downstream, or we make it generic and say how to get similar benchmarks for your project, and then, because it's generic, it extends to people who want to do this downstream.
B: Yeah, that's what I'm a little bit unclear on, as to what would be the best transition, because there are two concerns, right? We need to be careful about the length of this blog post, and if we put a lot of things in here for, let's say, your own project with a controller and CRD, this post can get really big.
A: Maybe what we do is focus on one of them. I mean, there's an opportunity for multiple blog posts here. So maybe this one is focused; the title almost writes itself, something like "Understanding performance regressions in your KubeVirt deployments", right? We have production users; they use this stuff.
A: They likely have some custom commits on top, right? Maybe it's just one or two, but that is still valuable to know. So if you're someone who's coming across this blog, you'd be interested if you have a production deployment, if you're downstream. Now, the other part of this is what you're saying, which is point two, that KubeCon title; that one would be for the developers, the community, so maybe for that one we can try to go to the CNCF.
B: Yeah, yeah. I want to kind of focus the discussion on what content we should come up with, and select the content, because there is a lot here once we brainstorm.
A: Wait, so if we want to do a blog post about end users, right, I think it's a great spot; in Vancouver we were focused on KubeVirt, and the thing is, we could talk about this and tie it in with v1. I think there's definitely a KubeVirt blog post to be written, and I also think there's one for the CNCF. I think it's fine if we do both, really.
A: Okay, I think they're both worth doing; it's just that when we look at it combined, it becomes a lot of work, it becomes a long blog. So I think we should decide which one we'd like to at least start with. v1 was just released, so if we want to use the momentum from v1, I think that's the direction we should go toward.
A: Do this one, and don't do the general controller blog yet; we can do that maybe in August. The other thing we've got to think about is that in August the KubeCon talks get selected, and we obviously don't know whether we're going to be selected or not.
A: If it's accepted, right; we could still write the blog post to get interest in the talk, and mention the talk as part of it. So it's sort of a win-win, I think, if we look at it that way; the timeline lines up, we just have to wait till August.
B: Yeah, I think I agree with you. So we can use the momentum to publish this part of the content for KubeVirt, and then, as we get close to the talk, if it gets accepted, the material prepared here can be brainstormed and ironed out in a way that we could use some of it for the talk and some for the blog posts.
B: Okay, yeah, I think that direction makes sense. So for this one, do you think the user guide documentation will be a prerequisite?
A: I think we should; I think it would be really nice to have. Maybe what we can do to shorten our blog is to have a user guide with a lot of detail, then do a high-level overview in the blog and point to the guide for the people that are interested in more, because there are a lot of pieces here: we have Prow, we have Prometheus, we have Grafana, all of these.
B: Makes sense. So I think, with that, this user guide setup should be one of the highest-priority items among the post-v1 issues.
A: Okay, okay, all right, so I'll go to the next one. I only had KWOK here with a question, because I know you sent me some of the stuff we've been doing. Do we want to talk about this at some point? I don't know what your plans were; I think it'd be good to show it in the community at some point, sure.
B: So I can give a little bit of background on what's happening here. We've discussed this before: KWOK is a project which allows us to run resources without an actual kubelet.
B: So we can fire up a VirtualMachineInstance where the virt-controller creates the pod, which is actually not running, but its status shows Running, and it's not backed by any hardware.
B: In order to fully support the VirtualMachineInstance API, we need to fake out the virt-handler parts, because virt-handler takes the pod that is in Scheduled phase and moves it into Running phase. So I've been doing some work, and I have a short one-minute demo where you can see that you can create a fake node and you can create a fake VMI.
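
(For context, a rough sketch of what such a fake node can look like. Per the KWOK docs, nodes annotated kwok.x-k8s.io/node: fake are the ones KWOK simulates; the node name, label, and taint below are illustrative, not taken from the demo.)

```python
# Sketch: generate a manifest for a fake Node that KWOK will manage.
# The kwok.x-k8s.io/node=fake annotation is what KWOK keys off; the
# taint keeps ordinary workloads away from the simulated node.
import yaml

fake_node = {
    "apiVersion": "v1",
    "kind": "Node",
    "metadata": {
        "name": "fake-node-0",  # illustrative name
        "annotations": {"kwok.x-k8s.io/node": "fake"},
        "labels": {"type": "kwok"},
    },
    "spec": {
        "taints": [
            {"key": "kwok.x-k8s.io/node", "value": "fake", "effect": "NoSchedule"},
        ],
    },
}

# Write a manifest usable with `kubectl apply -f fake-node.yaml`.
with open("fake-node.yaml", "w") as f:
    yaml.safe_dump(fake_node, f)
```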
B: The virt-handler, sorry, the virt-launcher pod gets to Running there, the VMI goes to Scheduled phase, and then this fake extension of KWOK transitions the VMI from Scheduled to Running state, and it does this with a little bit of jitter. KWOK has functionality where you can transition from one state to another and add delay seconds and a jitter period. What I have right now is 10 seconds as the delay and 15 seconds as the jitter period.
B: So that takes care of Scheduled to Running. The second thing we need to take care of is that virt-handler also removes finalizers during cleanup. So when the fake VMI is deleted, the virt-launcher pod will be deleted, but the finalizers on the VMI will still be there. So then the KWOK VMI controller will come in instead of virt-handler; it will remove the finalizer, and the deletion will go through...
B: ...of this object. So all the KWOK controller needs as input is that it's watching the VMIs for one state, and it is going to transition them from this state to the next state. So, for example, for Scheduled to Running, the conditions I have are that .metadata.deletionTimestamp is not specified and .status.phase equals Scheduled. If it is in this phase, then the transition from Scheduled to Running, with the jitter, will kick in.
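
(A rough sketch of what that configuration can look like. KWOK models such transitions as Stage custom resources with a selector, a delay plus jitter, and the next state to apply; the kwok.x-k8s.io/v1alpha1 field names below follow the KWOK Stage API as best understood here, using the 10s delay and 15s jitter from the discussion. Treat it as a sketch to verify against the KWOK docs, not the POC's actual manifest.)

```python
# Sketch: a KWOK Stage that moves a VMI from Scheduled to Running after
# a ~10s delay with jitter up to ~15s, matching the conditions described
# above. Field names are assumptions based on the Stage API; verify them.
import yaml

vmi_scheduled_to_running = {
    "apiVersion": "kwok.x-k8s.io/v1alpha1",
    "kind": "Stage",
    "metadata": {"name": "vmi-scheduled-to-running"},
    "spec": {
        "resourceRef": {"apiGroup": "kubevirt.io", "kind": "VirtualMachineInstance"},
        "selector": {
            "matchExpressions": [
                # .metadata.deletionTimestamp is not specified ...
                {"key": ".metadata.deletionTimestamp", "operator": "DoesNotExist"},
                # ... and .status.phase equals Scheduled.
                {"key": ".status.phase", "operator": "In", "values": ["Scheduled"]},
            ],
        },
        "delay": {
            "durationMilliseconds": 10_000,        # 10s base delay
            "jitterDurationMilliseconds": 15_000,  # jitter period up to 15s
        },
        "next": {"statusTemplate": "phase: Running"},
    },
}

with open("vmi-scheduled-to-running.yaml", "w") as f:
    yaml.safe_dump(vmi_scheduled_to_running, f)
```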
B: Yeah, and then for the second example I was talking about, the input is that .metadata.deletionTimestamp is set, and the action is to remove the finalizers. So if that is the input, the fake VMI controller will remove the finalizers; if the Scheduled state is the input, it will transition it into the Running state.
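
(Under the same caveats, the cleanup counterpart might look like the following: select VMIs whose deletion timestamp is set and clear their finalizers so deletion can complete, standing in for virt-handler's cleanup.)

```python
# Sketch: the companion cleanup Stage. Once .metadata.deletionTimestamp
# is set on a fake VMI, drop its finalizers so the delete goes through.
# As above, the field names are assumptions to verify against KWOK.
import yaml

vmi_remove_finalizers = {
    "apiVersion": "kwok.x-k8s.io/v1alpha1",
    "kind": "Stage",
    "metadata": {"name": "vmi-remove-finalizers"},
    "spec": {
        "resourceRef": {"apiGroup": "kubevirt.io", "kind": "VirtualMachineInstance"},
        "selector": {
            "matchExpressions": [
                # Only VMIs that are being deleted.
                {"key": ".metadata.deletionTimestamp", "operator": "Exists"},
            ],
        },
        # Clear all finalizers so the API server can finish the deletion.
        "next": {"finalizers": {"empty": True}},
    },
}

with open("vmi-remove-finalizers.yaml", "w") as f:
    yaml.safe_dump(vmi_remove_finalizers, f)
```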
A: In this case, do you have a special deployment of KubeVirt? Because I can't imagine you have to have... actually, maybe you don't, because you have fake nodes, right? So you must have the KWOK controller watching those fake nodes, and then its extension handles the work. Okay, so I guess you don't; you just deploy KubeVirt to get the APIs, though, and you need virt-api and virt-controller running, right? Okay, and then this takes care of the fake node part, okay, right.
B: So we need the KubeVirt deployment and the KWOK deployment, and the two of them will not intersect with one another. So, for example...
A: It requires a KubeVirt deployment with the... I'm going to just call it a KubeVirt extension, right? I don't know if that's the right terminology, but basically, yeah.
B: For the next part, I'm still figuring out how to abstract things; this is just a POC. Ideally, once we prove out the value with the POC, we should be able to have a decent enough abstraction where we don't need custom controllers for the extension. Then we can go to the KWOK maintainers and say: here's how you can extend it without maintaining custom controllers. That way we will have just a few configurations of KubeVirt-specific resources and use a vanilla KWOK deployment to run fake VMIs.
B: Yeah, so the next step we're going to take is to find out the difference between fake and normal VMIs in resource utilization of the control plane with a scale test, improve it to bring it close to the actual VMIs, and then, you know, work on the...
B: ...open question as to when exactly we can bring this into KubeVirt and into our KubeVirt post-submit jobs. The answer is: I don't have a plan yet, at least not until we can find out the difference between the actual resource utilization of a VMI and the fake one. Once we have that, we can brainstorm a little bit on how we can leverage it.
B: What we can do with an out-of-the-box KubeVirt and KWOK deployment is this: we know that the virt-controller is not fake, and we know that virt-api is not fake, so we can understand the scaling and performance behavior of virt-api and virt-controller without any improvements or additional development, with just what we have today.
B: But the problem is that our metrics are not categorized by component, so we don't know, from the benchmarks or the metrics that we track, which load is generated by virt-controller and which is generated by virt-api. The only higher-level aggregation we have is memory and CPU usage, and that's where we can start for now.