From YouTube: SIG - Performance and scale 2023-03-02
Description
Meeting Notes:
https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.tybh
A
The major focus of my work is to run workloads on pods and VMs against an OpenShift cluster, and it touches all the resources related to OpenShift: storage, network, CPU.
A
And to improve it: add more workloads and more tests, so that we get more coverage on the upstream side.
C
That sounds great. So I can do a little intro to the SIG-scale meeting. We have the SIG-scale group; it's scheduled weekly, but it usually ends up being two to three times a month. As part of this meeting, what we do is look at KubeVirt from the scale and performance perspective, and we look to build any sort of testing that could possibly expose bugs.
C
We look at ways we can calculate scale and performance in KubeVirt, from all these different perspectives. We look at tooling and things like that, and also at things in Kubernetes, and we take those and look at different ways we can improve things in the community: creating pull requests, creating issues, and even dashboards and metrics, just so that we have ways to measure scale. We've basically been doing this for a little while now, I think over a year, and we've made a lot of changes over time. When we started this, we began by looking at ways we can measure; that was a major focus for us.
C
It was focused on getting alignment on metrics we could get from Prometheus that could describe for us the performance of a job that we wanted to run, for example. So we did a bunch of stuff there; I can show you.
C
What I'll do is share my screen and share the document. As we dive into this, here's a link to the document; add yourself as an attendee, and I'll walk through some of it. What we have here is these jobs; this is something we've worked on over time.
C
These performance jobs; let's see, where is the performance one... I think this will clear up some of what we've been able to accomplish and give you an idea of what our direction is. We have this prow job that does a few tests for us and measures a bunch of things. When we go to the first one: there are actually three tests that we run in here.
C
The name of that is at the bottom. Basically, what we do is create a hundred VMIs, and we do this from nothing: previously we created a cluster with, you know, make cluster-up and make cluster-sync, and then we create 100 VMIs, and what we want to do is measure that.
C
For the image, I'm not sure; I think it might be a CirrOS image that we use. I don't know, it's one of the default images that we use. I don't think it's a containerDisk; I think it's something that we...
C
What we do is trigger it through the shell; basically, it runs the tests, and they're integrated into the test suite, with Ginkgo. So I'll show you those.
C
Okay, so here's the density test; let's see if we have the image in here.
A
And once you deploy the cluster, did you install all the related operators, like KubeVirt and all this stuff, for each step, or for each bunch of tests?
C
Yep. We use the same cluster, and then in the same test we'll create a bunch, then we'll delete them, and then we'll go to the next one.
C
So we do three: we have the batch of VMIs, then we've got the batch of VMs, and then we have the VMs with a single instance type and preference. So: create 100, delete them; create 100, delete them; create 100, delete them; and after each one we measure.
C
I don't know how many nodes it is or what the infrastructure provides, off the top of my head, but because we end up deleting, we won't overload the nodes: we create 100, we delete 100, create a hundred, delete. This is the meat of it right here, and this is what ends up showing up in these results here.
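As a rough illustration, the cycle looks something like the following minimal Go sketch. createVMI, deleteVMI, and waitAllRunning are hypothetical stand-ins for the client calls the real Ginkgo suite makes; this is not the job's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical stand-ins for the KubeVirt client calls the real test suite
// makes; they only mark where the Create/Delete requests would go.
func createVMI(name string) {}
func deleteVMI(name string) {}
func waitAllRunning(n int)  {}

// runBatch mirrors the create-100/measure/delete-100 cycle: create a batch,
// wait for all of them to reach Running, record the elapsed time, then clean
// up so the next batch starts against an unloaded node.
func runBatch(count int) {
	start := time.Now()
	for i := 0; i < count; i++ {
		createVMI(fmt.Sprintf("density-vmi-%d", i))
	}
	waitAllRunning(count)
	fmt.Printf("batch of %d: %v from first create to all running\n", count, time.Since(start))
	for i := 0; i < count; i++ {
		deleteVMI(fmt.Sprintf("density-vmi-%d", i))
	}
}

func main() {
	// Three batches, as in the job: VMIs, VMs, and VMs with an instance
	// type and preference (the variants are not distinguished in this sketch).
	for i := 0; i < 3; i++ {
		runBatch(100)
	}
}
```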
C
So that's the basic idea: we create that hundred, and we do it three times. We sort of have two major pillars as part of this: we have the performance part and we have the scale work, and we look at capturing both in the job. So what you see here is a bunch of HTTP requests, right, and these look familiar.
A
And you input them as fast as possible to create this?
C
No, we have a rate control. So let me see, where is the rate control...
C
Here it is; we have this.
A
Okay, this is what we do, because we want to put real pressure, real stress, against our node and verify what will happen when the user creates them,
to investigate it. But I see that you solve it with a sleep between each VM, and I think that's not fair from a testing perspective, because it's
C
not the reality. Yeah, Ellie, I agree. I think what you're alluding to is that we're making a choice here: for us, this is what we found to be a reliable way to get to 200. But like I said, we don't expect the user to do this.
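The rate control being discussed amounts to spacing the create calls out at a fixed interval. A minimal sketch, assuming a hypothetical submitVMI wrapper and the 100-millisecond interval mentioned later in the call:

```go
package main

import (
	"fmt"
	"time"
)

// submitVMI is a hypothetical stand-in for the create call to the cluster.
func submitVMI(name string) {}

func main() {
	// Space the creates ~100ms apart instead of firing them as fast as
	// possible: less raw pressure, but consistent run-to-run behavior on
	// a shared cluster.
	tick := time.NewTicker(100 * time.Millisecond)
	defer tick.Stop()
	for i := 0; i < 200; i++ {
		<-tick.C // wait for the next 100ms slot
		submitVMI(fmt.Sprintf("vmi-%d", i))
	}
	fmt.Println("submitted 200 VMIs at a controlled rate")
}
```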
A
When we enter the node, we see that after a certain point there is intensive CPU usage, and this is something we need to continue investigating. And we have all the data now: it's a nightly CI, we show it in Grafana, the logs go to a storage bucket, so we can
C
So this is what I'm showing: in prow we have these objects that...
A
A question at the same time: when I write something, I always think about the performance perspective in general, not the functional one. So when I see this test with a for loop, it's actually more functional than performance, if you understand what I mean.
C
And I guess, to do what you're describing, you could create the manifests, render them ahead of time, and then... but...
C
When you run this step, what it'll do is make an API call to the API server, and then it will go down through KubeVirt and create the pod. So the VMI won't... there's
C
No, I understand what you're saying. This is one of the tests that Marcelo did try, and it's not one that we included in our CI. This was one that Marcelo did try out in the performance cluster, but I don't think we have it anymore, so it's a different area. What I'm showing here, specifically, is that the cluster is shared; since it's a shared resource,
C
the results can vary a little bit, so we make some caveats. The thing with this is that it does get us some information: we do get some pressure here, a little bit, and definitely the API server, I totally understand, can handle it.
C
But we want it this way, because what we want to do here is compare across pull requests. We need a simple way to compare across pull requests on a shared cluster, and we need to do it in a way that gives us some consistency. The goal here is not to apply crazy pressure and see how it holds up; not in this test.
C
What we want from this test is some consistency across different PRs and some data back; that's what we're doing here. But one thing that's important to highlight is that these are three tests that we're doing now, and we have a bunch of data that we've been gathering and using for a while. What you're talking about are a lot of tests that we have talked about before, even tried, or just one-offs that we want to do. We just haven't had the time to do this stuff. So what you're describing makes total sense to me; we just haven't had a chance to do it, and I'd like to.
A
I see, because when we run it our way, in performance, we start to see some problems against the cluster and things like that, and it makes the whole idea of the test more interesting. We found and saw issues that we cannot see in this test. Okay, yeah.
B
Each pod updates that it's ready.
C
No, it makes sense to me. We have a few other areas. We have this thing called the load generator that we built; we have a bunch of stuff in here. Where is it... I was just over it.
C
Here we don't wait. This is what we would use in the performance cluster, and this one's important, because this is the dedicated cluster: no one is sharing it, it's just for running the performance jobs. In this one we do sets of like 200, 400, 600, and we create them as fast as we can.
C
We can apply even more pressure than what we're doing now, even just with how we create things in the for loop. Maybe we can have all the objects created ahead of time in a buffer, and then just fire them all at the API server at the exact same time and measure. There's a lot of things we want to do here, and I totally hear you, but what I'm saying, Ellie, is: these tests...
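That "render ahead of time, then fire at once" idea could look roughly like the sketch below. renderManifest and submit are hypothetical stand-ins, and per the discussion the job does not do this today:

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins: build an object without touching the API server,
// and submit one object to the API server.
func renderManifest(i int) string { return fmt.Sprintf("vmi-%d", i) }
func submit(manifest string)      {}

func main() {
	// Phase 1: create all the objects ahead of time, in a buffer.
	manifests := make([]string, 600)
	for i := range manifests {
		manifests[i] = renderManifest(i)
	}
	// Phase 2: fire them all at the API server at effectively the same time.
	var wg sync.WaitGroup
	for _, m := range manifests {
		wg.Add(1)
		go func(m string) {
			defer wg.Done()
			submit(m)
		}(m)
	}
	wg.Wait()
	fmt.Println("all 600 submitted in one burst")
}
```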
C
It would be great if you could write some of them in the notes, because there are things that we want to do, and it would be good if you can enumerate them. We can discuss them, and I can help point you in the direction of where we can actually...
A
Go and implement these things, yes. In order to improve something, I need to get more details about the environment, more details about how all the things connect together and all this stuff; I don't have that yet. But for sure this is the direction to take: improve what exists, maybe add more workloads. So, given your recommendation, how do you think it's going to go?
C
Yeah, one thing that won't work: you can't just do it through Google. You have to go to the kubevirt-dev Google group and hit join while you're logged into one of your Google accounts; then you'll get access to this, you'll have write access.
C
Okay, what I'll do is put some links in here. Let me do two things. We've got our performance job that we run per PR; this is what I was talking about earlier, this is an example.
C
I think it gives us a few hours, but I think it takes us... oh, I just don't have the time on here. I don't know if it's in the job, but if it isn't, then it should be somewhere on the
C
Yeah, I don't know, I don't actually see it; there is a time somewhere. Well, the end-to-end time of the test, I guess, is important, but since we have multiple tests in here, it wouldn't be that valuable. What we have here, like I was saying before, is a bunch of data that we output. And here's another thing that I'll give you a link to. This is important, for this is how we used to measure.
C
So let me give you a link to it; it's not here.
C
To give you a high-level view of what this is: we've created a bunch of metrics, and these metrics get into Prometheus, and then what we do is scrape Prometheus whenever we run our tests.
C
It's written in Go; it's in here. The things that we capture you can find in this tool, all the things that we care about.
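For illustration, scraping a single value from Prometheus with the Prometheus Go client could look like the sketch below. The address and the example query are assumptions; the real tool captures a whole set of metrics it cares about.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed address of the in-cluster Prometheus instance.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	// Example query: 99th percentile of API server request latency.
	q := `histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket[5m]))`
	result, warnings, err := promv1.NewAPI(client).Query(ctx, q, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```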
C
We can add anything to it; I don't think we have, but you can take a look for yourself. Basically, this is some of the output. Some of the things we do, as you can see: we take p99s of things, we look at the deletion times, we look at the number of requests that get done, stuff like that, and we chart those.
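As an illustration of how a p50/p95/p99 summary like this can be derived from raw per-VMI timings, here is a generic sketch (a simplified nearest-rank percentile, not the job's actual code):

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the value at quantile q (0..1) from sorted samples,
// using a simplified floor-index nearest-rank rule.
func percentile(sorted []time.Duration, q float64) time.Duration {
	if len(sorted) == 0 {
		return 0
	}
	idx := int(q * float64(len(sorted)-1))
	return sorted[idx]
}

func main() {
	// Made-up per-VMI create-to-running times, for illustration only.
	samples := []time.Duration{
		20 * time.Second, 45 * time.Second, 90 * time.Second,
		180 * time.Second, 228 * time.Second, 310 * time.Second,
	}
	sort.Slice(samples, func(i, j int) bool { return samples[i] < samples[j] })
	for _, q := range []float64{0.50, 0.95, 0.99} {
		fmt.Printf("p%.0f = %v\n", q*100, percentile(samples, q))
	}
}
```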
C
The cluster? Yeah, so for deployment you can use the make cluster-up and make cluster-sync commands.
A
For the environment, I guess... I saw that it's Golang, so which IDE do I use for it, Visual Studio? Any IDE, to do the deployment and all this stuff locally?
C
No, no, it's pretty lightweight. It uses Docker; you just need Docker running on your localhost, and it's pretty simple: just running these two commands will get you a local running cluster.
A
In order to have this support for the cluster and all this stuff? Oh, okay, yeah.
C
To run the performance tests... I forget if we have it; it should be in make somewhere. Let me check the make recipes.
C
Do we have a performance one... yeah, we do, okay. Here we go: you do make perf-test, and this will run the... yes, it'll run this test, the one that creates them with the 100 millisecond sleep time.
A
Okay, do I need anything else to make it work locally?
A
Okay, nice, nice. So I will try to play with it. By the way, can I reach you directly, by email or something?
C
Yeah, that's fine. I'm on the kubevirt-dev Slack channel, in the Kubernetes Slack; you can reach me in that channel.
A
You talked at the beginning about the YAML, about which container image; can we see it in the test, or do I need to dig into this? Because I'm thinking about a huge VM, for example. I don't know if it's okay to run a Windows VM, because, you know, there's a license or something like that.
C
We don't. This is something we had some conversations about a while ago, but it hasn't picked up a lot of steam. It's something we can resume; we just haven't had the bandwidth to take it on. It's something that would be interesting to do, though.
A
But first I think I should start playing with it, because I'm not familiar with it. And once I actually submit something, someone will review it and then it gets committed; but we need...
C
I'm not sure, yeah; maybe something you can look at when you go through it. I mean, if there are unit tests, they'll run when you run... I think there's a make test command, not perf-test, this one.
C
I would just start with this, just to get familiar, because this is the building block of what we have. And then we have that dedicated cluster; I was going to look and see if it is up and running... oh yeah, here it is, here is the performance cluster. And okay, here's the time: there's two hours at the top. This one takes two hours to run; this is on the dedicated cluster.
C
This should be where we eventually want to target a lot of the tests that you're describing, because this is the only one where no one shares the resource; it's just for testing.
C
How many VMs is this? This looks like 200... no, 400... no, this is the 600 VM test. So you can see, this is our largest stress, or the largest stress that we can do right now. Maybe we can do larger, but yeah.
D
Okay, so when you guys have some time, I just have a question, unless there's something else on the agenda. (No, go ahead.) So I'm trying to capture some information. I'm trying to help a university that wants to build a shared campus, pretty much with virtual machines, and they want to try and understand who out there is running a massive amount of KubeVirt deployments, and I stumbled across,
D
of course, the NVIDIA one. I'm trying to find more out there, and some of the highlights that may have been found by the cluster-at-scale group. Are there any findings or any notes, or anything like that, documented anywhere?
C
This is a gap we have. We've had discussions about this for a while, about having documentation. I'm assuming you mean things like the recommendations for running at scale, and maybe the highest level of scale in the community, things like that; is that what you're looking for?
D
Like, at what point is it dumb to have more than a certain amount of VMs in a single cluster, with your control plane? Is 500 or a thousand VMs good? Once you pass the thousand number, maybe etcd requires different performance, things like that, right? So how many VMs do we want to run in a single cluster? How many nodes have we actually put into a cluster, and how many VMs?
C
Yeah, there's a good guiding document that we used a while ago from Kubernetes; let me see if I have it here somewhere in the notes. I would point to it if I could locate it, because it answers some of this. All right, here we go. The thing about this is, the way to look at this problem is: with KubeVirt it's really...
C
...it's pods, pretty much. There are the VMs, and there's the KubeVirt control plane in the middle, but one of the biggest factors is going to be Kubernetes, and this is what this document focused on and explains really well. Basically, what's described in this presentation is how different things affect the overall pressure that you apply. For example, let me see if I can find a good one based on what you're asking; there should be nodes in here.
C
So here's an example. If you have the number of pods per node at 110, let's just say, then the number of nodes you can scale to comfortably ends up being about 1300. And on the other side of this, if you have 30 pods per node, the number of nodes you can scale to comfortably is five thousand. This is from a few years ago; it's what was tested back in 2018, so five years ago.
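One way to read those two data points is that they imply roughly the same total-pod ceiling, so the trade-off is pods per node versus node count rather than either number alone; a quick check of the arithmetic:

```go
package main

import "fmt"

func main() {
	// The two configurations quoted from the 2018-era Kubernetes
	// scalability numbers:
	fmt.Println(110 * 1300) // 143000 total pods at 110 pods/node on ~1300 nodes
	fmt.Println(30 * 5000)  // 150000 total pods at 30 pods/node on 5000 nodes
	// Both land near the same ~150k-pod total, suggesting overall pod count,
	// not node count alone, is the limiting dimension.
}
```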
C
This is what was tested at the time, so I think this is along the lines of what you're looking for. And when you extend this to VMs: a VM is just a pod, right? I mean, there's the KubeVirt piece in the middle, but this will give you a sense of how it would work with Kubernetes, and it should work about the same with KubeVirt. For the most part, what we've been doing in this SIG is making sure KubeVirt keeps up with the Kubernetes scale.
D
That node quantity feels a little exaggerated, even for large companies, for a single cluster. Then does this imply that having 200 pods on 50 nodes is absolutely fine? Because I remember there were also networking limitations after 110 pods.
C
Yeah, I guess it depends on how you set up your IPs, but it should be fine. I don't know exactly; you'll probably have to test this. 200 pods, and how many nodes did you say, 500?
D
Yeah, even 50, right? Because 50 is already a very good quantity. And then, if we have, you know, 100 virtual machines per node, which is probably too dense, it's already a massive amount of virtual machines. So again, this is the scale of universities and different users; just trying to get into a sweet spot.
D
Right, okay. And are there any other limitations that are well known, I guess?
C
Yeah, this presentation actually goes into a few: some relationships between services, backends per service, namespaces, services per namespace. Here's another one that is important, and one we actually do see in our measurements a lot: pod churn. Even at NVIDIA, this is one of the biggest pressures that we see. In other words, the amount of pods you create, delete, and update per second.
C
This applies a lot of pressure. I don't know what your use case is, but if you have a lot of people creating and running workloads very quickly and at high throughput, then it will apply a lot of pressure.
C
Sure, yeah, there were a few others here, I think namespaces and stuff like that. The link, I'll copy it up to the top, just so I have it.
D
I'm assuming that once the virtual machines are created, the pressure on etcd or on the control plane drops off; is this something we see as well? And when you guys are running through these benchmark jobs, what are the things you're monitoring?
D
Is it just how long it takes to provision a VM, the provisioning time? Are you actually looking at the underlying infrastructure, the CPUs, memory and so on, as they are consumed, and the overcommit ratio as well? Are we doing an overcommit ratio like we do with things like OpenStack, like an 18-to-1 or 16-to-1, or do you guys just go one VM, one CPU?
C
So this is another area where, like Elliot talked about, these tests haven't expanded or matured to the point that we have some of the things that you're talking about. We could very easily start measuring some of the CPU and memory changes that happen on the control plane based on the amount of pressure we're applying; we don't have that.
C
We do have dashboards, especially on the performance cluster, that we were reviewing before, when we were actually going through and doing this, and they helped us find a bunch of goroutine leaks and a bunch of other stuff.
C
But it's not something that we report specifically as part of the job and check and fail our gate on. We do look at it from time to time, since each of these has a Prometheus instance that we can query, so we can see it; but we don't always look, just because we don't have it automated, is basically what I'm saying.
C
So we do that based on the number of requests we create, and then we measure, where's my... we measure the create-to-running time for this stuff and see how it changes based on the amount of pressure that we apply.
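A minimal sketch of that create-to-running measurement: record the creation time, block until the instance reports Running, and take the difference. watchUntilRunning is a hypothetical stand-in for a client watch on the VMI phase.

```go
package main

import (
	"fmt"
	"time"
)

// watchUntilRunning is a hypothetical stand-in: it would block until the
// named VMI reaches the Running phase and return the time that happened.
func watchUntilRunning(name string) time.Time {
	return time.Now()
}

func main() {
	created := time.Now() // timestamp of the create request
	running := watchUntilRunning("vmi-0")
	// The per-VMI sample that feeds the p50/p95/p99 summaries above.
	fmt.Printf("create-to-running: %v\n", running.Sub(created))
}
```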
C
No, no, this is the p99, the ninety-ninth percentile, so this is like the worst case; this is the 95th, and here's the average, the p50. So this is 228 seconds.
C
Yeah, this is 600 VMs created as fast as possible; on average, as we see, about that four-minute time.
C
Well, the way I'd look at it is that it's an average: there could be a few of them that were done in 20 seconds, and then it kind of slowly crept up, right? I guess the way I look at it is that the average of all 600 ends up being 228.
D
Because I don't know if this is 10 nodes or five, and that's a massive, massive difference. Okay, all right, just wondering. This is, of course, validation for... you know, if this university wants a development environment, which isn't the case right now, but if they want an environment that executes virtual machines on demand like this, then yeah. Okay, well, thanks for sharing, and we're past time.
C
Thanks, yeah, thanks for the questions, really appreciate it. If you've got any more questions about scale or things that are going on, please come back; we're happy to discuss more and try and solve some problems, anything you guys encounter.
D
Yeah, I do join from time to time, actually. I've had this call in my calendar for like two years, so I join about once a month or so, but I just stay quiet; most of the time it's pretty much about this dashboard. I just wanted to talk more hardware.
D
I will. Okay, thanks, thanks a lot for your time again.