From YouTube: SIG - Performance and scale 2021-09-09
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.p5py231aneed
A: Okay, all right, welcome to SIG Scale. It's September 9th; the document link is in the chat. Add your name as an attendee, please, and feel free to add topics while we discuss. Okay, so the first item on the agenda: it's an issue for profiling the KubeVirt control plane. I created this two days ago. I think Tomas is here, yeah. So basically the idea is that we have a few changes that are actually close to merging for the profiler, and we have the load generator, so this issue is just to track some of the work that we can do to profile the control plane: the different tests we can do, things like that. Tomas is going to take a look at this, but it's something that we can look at as a community. I think there's a lot of different tests that we can do.
B: Is there anyone that is reaching high numbers like us, like 10,000 concurrent users in a single pool?
A: Can you elaborate what you mean by 10,000 concurrent users? Is this 10,000 users? Like, how many VMs would you say that is? Are you sending users to, like, a single Kubernetes cluster?
B: The maximum number of VMs inside Google in a single VPC is 15,000, and that's why we have done, in a single VPC, 10 subnets. In each subnet we have up to 1,250 VMs running the OS, and on top of the OS, meaning OKD, and on top of OKD we install KubeVirt. With these clusters of 1,250, we plan to have up to 10,000 concurrent users, and that's why we are asking if it's possible, if someone else is already reaching that amount.
B: Let me just specify each VM for you, you know, just one second.

C: So the question is if anyone has, if we've ever run that many virtual machines. Or I guess I'm trying to understand what we're aiming at.
A: Yeah, how many would you say, Andre, if this was one cluster, you know, what would...

B: Let me elaborate a little bit better so you understand. Can you enter my website, ddas.global? Then you can understand.
B: We offer these flavors. If you hover the mouse on top of "basics", then you're gonna understand, okay. And what we plan to achieve is one million concurrent users, because we have today 1.5 million named users, okay. We plan to...
C: Let me stop there for a second. The limits are difficult to put hard numbers on, because that's something that's going to be specific to the hardware you're using and what that hardware is capable of, both at the node level and the control plane level. I don't know if 10,000 virtual machines is practical or not for a single cluster; I don't think we've tested at that scale yet to know. I would say, as a gauge to understand whether you're in the ballpark of something that is practical or not: look at what's been scheduled for pods on Kubernetes clusters. So, if you're looking at how far Kubernetes can scale just with pods, I would expect that we would get pretty close to that same sort of realm or ballpark with virtual machines, because in the end, we're just pods with QEMU processes running inside of them. So that's where I'd look if you're looking for hard numbers and just want to understand the limits.
B: We are under all those limits, but we are doing this in, let's say, a hard way. The users come and go, and the idea is that when the users log off, we just kill the machine. And I don't know if this is already available, like the linked clones that we have on VMware and Citrix solutions.
B: Can you elaborate how you are doing the pool? Because I saw some information.

C: Yeah, that's actually something that's been discussed, the idea of a virtual machine pool.

B: I would like to understand better, if you can elaborate, what is available today.

C: So what you're asking for is the equivalent of an AWS autoscaling group or a Google Cloud compute engine instance group.

B: That's what I'm interested in, right.

C: We don't have that yet. That's something that I would say is in the process of being designed; it's something that we keep poking at, but it's something that has yet to gain the kind of traction to actually get implemented quite yet. It's something we're interested in doing, and I think that your use case actually helps us drive that forward a little bit.
B: We have another tool that keeps the profile of the user, for you to understand.
A: Okay, while we wait for Andre, maybe we can close this first topic here. So I guess the topic, or basically the ask here, was that Tomas is looking at doing some profiling of the control plane. Here are some of the tests that we want to do, like the number of VMIs and the number of nodes, the tools that we want to use to do it, and the pattern we're going to follow. But if there are any comments about that, we can address them in the issue. Sorry, okay.
B: Sorry for that. We plan to have up to 1,250 nodes in the same, in a single cluster, I think that's the number. And so you understand, since we have several flavors: if we have type one, that is two virtual CPUs and four gigabytes of RAM, we handle on each node 64 VMs. If we have the one with 32 CPUs and 64... let me grab it here, I don't remember everything. With 16 CPUs and 32 gigabytes of RAM, we handle eight per node, for you to understand.
B: Yeah, my question is how to create, in the same cluster, like four pools that need to go up to ten thousand, these four pools.
C: Oh, you understand, we don't have that abstraction today. So the way you can do it is to create your own controller or your own API logic that's going to post a VM every time a user wants to access a virtual machine, where that VM has, I guess, the characteristics of the class or the flavor or whatever you want to call it, and then manually delete that virtual machine when you're done with it. So it'd be a one-to-one relationship between the user logging on and the VM being created, and there wouldn't be a pool mechanism today. If you want to use KubeVirt as it is right this moment, that won't exist.
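As a rough sketch of that do-it-yourself approach, a controller could build one VirtualMachine object per user session and delete it on logout. Everything below (the flavor table, the naming scheme, the sizes) is an illustrative assumption, not something specified in the meeting:

```python
# Illustrative per-session VM lifecycle: one VirtualMachine manifest per user
# login, deleted again on logout. Flavor names and sizes are made up here.
FLAVORS = {
    "type-1": {"cpu": 2, "memory": "4Gi"},    # 2 vCPU / 4 GiB, as discussed
    "type-2": {"cpu": 16, "memory": "32Gi"},  # 16 vCPU / 32 GiB
}

def vm_manifest(user: str, flavor: str) -> dict:
    """Build a KubeVirt VirtualMachine manifest for one user session."""
    f = FLAVORS[flavor]
    return {
        "apiVersion": "kubevirt.io/v1",
        "kind": "VirtualMachine",
        "metadata": {
            "name": f"desktop-{user}",
            "labels": {"session/user": user, "session/flavor": flavor},
        },
        "spec": {
            "running": True,  # start immediately; delete the object on logout
            "template": {
                "spec": {
                    "domain": {
                        "cpu": {"cores": f["cpu"]},
                        "resources": {"requests": {"memory": f["memory"]}},
                    }
                }
            },
        },
    }
```

A custom controller would POST this manifest on login and DELETE the object on logout; with no pool abstraction to lean on yet, the one-to-one session bookkeeping lives entirely in that controller.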
B: But the mechanism to clone the disk, is it available?
C: Yes, you can clone the disk using CDI. What would happen is we would use a DataVolume. The DataVolume would be associated with the virtual machine, and it does, like, smart cloning behind the scenes, depending on what your CSI driver is, meaning that you're not actually taking...

B: We plan to use Gluster and a solution called vgo, which also does deduplication of the disks, for you to know. Okay.
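For reference, the CDI clone mentioned above is driven by a DataVolume whose source points at an existing PVC; whether that becomes a CSI smart clone or a full host-assisted copy depends on the storage driver. A minimal sketch, with illustrative names and size:

```python
# Illustrative CDI DataVolume that clones a source PVC. CDI decides between
# CSI smart cloning and a host-assisted copy based on the CSI driver in use.
def clone_datavolume(name: str, source_ns: str, source_pvc: str, size: str) -> dict:
    return {
        "apiVersion": "cdi.kubevirt.io/v1beta1",
        "kind": "DataVolume",
        "metadata": {"name": name},
        "spec": {
            "source": {"pvc": {"namespace": source_ns, "name": source_pvc}},
            "storage": {"resources": {"requests": {"storage": size}}},
        },
    }
```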
A
Yeah,
andre,
you
know,
like
dave,
was
saying:
virtual
machine
pools,
isn't
is
influenced,
yet
there
is
a
design
dock
for
it.
So,
if
you
do
have
you
know
anything,
you
want
to
talk
about
with
your
use
case.
You
know
if
you
review
and
add
your
thoughts
in
there
it's
one
of
the
things
that
we
have
as
then.
The
list
of
things
is
covered
in
this
in
the
sig
that
we
want
to
get
to
eventually-
and
I
think,
having
an
additional
use
case
would
would
definitely
help
us
a
lot.
A
Okay,
all
right
thanks
andre,
so
before
we
move
on,
I
did
was
there
any
other
comments
that
people
had
about
the
first
one
like
in
about
the
profiling
before
we
move
on
to
the
third
points,
didn't
sound
like
there
was
anything,
but
if
not,
we
can
always
talk
on
the
issue.
A
Okay,
let's
move
to
number
three,
so
we're
controller
stress,
starting
with
increased
number
of
vms.
Let's
take
a
look.
We
have
a
snapshot.
F: Yes, hi. So I was doing the stress test; I was trying to stress my cluster with the intent to know when our control plane components start acting up, in the sense: do we need any HPA or any type of autoscaling for our control plane components? That was the larger idea, and I started playing with it, and yeah, I'm still playing with it.
F: But this is one of the things I noticed. So the first graph is the number of VMIs in Running. Due to the limitations of the nodes, only 120 could be in Running, but I created 1,000 VM objects. Only 120 could be scheduled and running, but in the first couple of minutes a thousand were created, and we can see that the first four graphs are different. But if you look at the bottom: virt-controller stays high on CPU, even after all the VMIs and VMs are deleted. It's like 400 percent of the currently requested CPU resources. So I think I just found something that should be changed, or that we should take a look at.
A: I see. I think this is kind of what Kevin saw in some of his graphs, where we saw that when we do the deletes... well, I don't know if it increased, I don't remember, but at least it hung around at a higher level than we expected, maybe doing the garbage collection. But I don't know if we expect this, an increase in CPU that's almost twice as much for doing the deletes.
F: I'm not sure if this has Kevin's fix in it. I mean, I was just using HCO, which was available, and the operator; I'm not sure if Kevin's latest fix is included here.
H: Yeah, the fix doesn't fix that. What Ryan just meant is that we saw that when we delete a lot of objects, resource utilization stays up for longer than the resources are being deleted, because it takes the Go processes a while to clean up their memory, and we don't know yet if we can fix that, or have to.
A: It's kind of peaking at 120, and then you said you had a lot of other VM objects lying around. I'm wondering, you know, if you're deleting a thousand of them, maybe that's taking much more CPU or something. I wonder, if you did this experiment with just the exact amount, whether there's a difference at all: if you did it with just 120, so you create 120 VMs, you have 120 running VMIs, and then you delete them.
A: Yeah, and then I think Marcel has got a change that we can get also. It would be good to even have standardized boards too, so we can always look at the same ones; that will make this easier as well.
A: Okay, cool, all right. Well, thanks for sharing, definitely something to look at. You said this is one node, right? I think, right, he says on one node you can only put 120 VMs.
H: As I said, I think I already sent you the link to the board we have here that Marcelo built for the tests. If you use that next time, maybe, yeah.
A: Okay, cool, yeah. The reason I mentioned this is because I think it has these already, just so you don't have to go through and create them; it'll make it easier for you next time.
A: Okay, all right, thanks for sharing, yeah. I'm definitely curious, like I said, to see whether the number of VMs that you have affects this at all. And maybe we need to look at something here as to why the CPU increases right around when you do the deletes; it seems a little weird.
H: So I would suspect it's because there are more API calls in the back deleting the VMIs, and that would increase the total of stuff going on. And then there is still the garbage collection in process, because both VM objects and VMI objects get garbage collected in the Go process as well. So it might just be that, yeah.
F: I think next time I'm going to keep the VMs for longer, to separate out the events.
A: And what is "template validator"? Is this like making sure that the VM templates are correct or something? Is that what this is?
F: Okay, yes, and that's what my second point is about. I also want to see how we can see the latency of webhooks: the template validator actually creates an admission webhook, and I wanted to plot a graph of the latency, but I'm not finding a good way to do it with the apiserver request duration metrics. So if you have any hints about that, it would be helpful.
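One possible angle (an assumption on my part, not something settled in the meeting) is the kube-apiserver's per-webhook histogram `apiserver_admission_webhook_admission_duration_seconds`, which breaks admission latency down by webhook `name` instead of folding it into the overall request duration. A sketch of building the corresponding PromQL; the webhook name in the usage line is hypothetical:

```python
# Build a PromQL query for the p99 latency of one admission webhook, using
# the kube-apiserver's per-webhook duration histogram (labelled by `name`).
def webhook_p99_query(webhook_name: str, window: str = "5m") -> str:
    metric = "apiserver_admission_webhook_admission_duration_seconds_bucket"
    selector = f'{{name="{webhook_name}"}}'
    return (
        "histogram_quantile(0.99, "
        f"sum by (le) (rate({metric}{selector}[{window}])))"
    )

# Hypothetical webhook name, for illustration only:
query = webhook_p99_query("virt-template-validator")
```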
C: The latency of which? Well, yeah, I'm curious there. What exactly are you trying to measure the latency of?

H: Yeah, the goal was to investigate how much load the template validator can take until we should scale it up.
F: The larger idea is to see if we need any type of scaling up with an increased load or an increased number of virtual machines.
H: Template validator looks at VMs created from OpenShift templates and validates that they comply with the template they are created from, and the same for the VMI: it also validates that the VMI that is created does not change any values that collide with the template it was created from.
C: So it's an update webhook; it cares even after the VM has been created?
H: It is only validating, as far as I know, but it still cares that you don't change the VM after it's been created, because we also upgrade the templates with our upgrades. We guarantee that the VMs created from those templates work, so we make sure the user doesn't break any VM created from such a template.
H: As far as I know, yes. There might be some values they are allowed to change; for example, we're looking at allowing them to change resource limits, I think. But in general, yeah, you shouldn't change the template that you created the VM from. Okay.
C: Interesting. Sorry, that was a tangent for a second. These VMs that are being created: do they use PVCs, and are they doing any sort of cloning or anything like that? What's the storage that they were using?
F: They can be anything; I mean, we don't limit it in any of the templates, no.
C: I'm sorry, I'm talking about the load tests that you did, where you were using VMs, yeah.
C: All right. At some point it would be nice, and I don't know if we have the right environment for this quite yet, to understand the impact of using PVCs, because there are more API calls and things like that in our control plane associated with VMs, especially VMs that have PVCs attached to them, than with just the container disk flow, because we're doing more informers and all that kind of stuff.
C: You can have a DataVolume where the source is this container disk and you put it on a stateful PVC. But if you just have a volume, just like you do in a VMI, a container disk just there, what happens behind the scenes is we create an ephemeral drive, or disk or whatever, and share some data across that.
C: Yeah, so at some point, when we're talking about the VM use case, understanding the scale of VMs and not VMIs, we need to start introducing PVCs there. But the problem with that is that we begin to be throttled by our storage provider: how quickly it can provision PVCs and things like that. That's going to be a new graph, I guess, or something like that, for us to understand what's actually within our control, because we don't have control over how quickly storage is provisioned for a VM as part of the start flow.
C: That's what we were discussing: we're not actually using persistent storage for these load tests today, and that's what I was pulling out, that we need to start doing that in order to understand the impact of it. What's happening today is we're using ephemeral storage, so it's like local storage on the node that's being provisioned on demand for these virtual machines as they land on the node, which helps.
F
There's
one
more
thing
I
noticed
in
this
graph
and
on
the
logs,
but
it's
probably
out
of
scope
of
this
discussion.
It
was
that
this
web
book
actually
just
monitors
for
virtual
machines,
but
if
you've
seen
the
template
related
to
cpu
graph,
the
the
second,
his
the
second
hill,
we
see
just
left
to
it,
yeah
that
was
the
right
to
it.
Sorry,
one
more
right
yeah,
so
that
that
one
was
when
when
the
vmis
were
actually
getting
created
out
of
the
vm
objects.
F: So I saw in the logs for the template validator that there are entries two times. I'm probably missing something, but it can also be possible that when the VM status is changing, there is another request coming in to validate the VM. I'm not sure, but I think that's what's happening.
C: That makes sense. So when the VMIs are launched, it's mutating the VM status, which means it's going through that same webhook every time we do that.
J: Are we talking about VM or VMI? VM. So the VM has the status subresource enabled; you can write at both locations, but not at the same time from a webhook. So you can use the status subresource endpoint to modify it with a patch, for instance, or the spec. But if you, for instance, do an update and modify both, it will only update the spec.
H: But for next time, when you run this test, can you keep the Grafana board running for a bit longer? I would be curious whether virt-controller memory goes down to its normal level as well.
A: Yeah, it's making its way down, but yeah, it would be nice to see. I mean, the thing that occurs to me, like you mentioned, is that this peak here lines up with when we do the deletes. What is going on? I don't know. It'd be good; I think we said we have a few tests.

C: To see what is going on here, I'll try to get the pprof thing merged soon, so then you could actually, if you wanted, do the pprof profile during that hump and maybe get some interesting results.
F: Also, does it make sense for us to have a metric for VMs and not just VMIs? I could not find one metric for the number of VMs in the cluster.
A: Yeah, it's a good topic to discuss, because I think it's been mentioned before, and I think there are definitely some use cases, and you're kind of alluding to some of them here. Like, you can imagine some latency metric: what is the latency between when you set something to running and, maybe, when the VMI is running? There's a lot of areas here; there's API calls, there's other things that are happening.

A: Even the count of the number of VMs, that would also, I think, be useful. I think it's in this area of perf and scale, so I mean, I think this is a good area that we can kind of dig into.
C: Right now, what we have is a breakdown of everything that transitions after the VMI is actually posted. What we're lacking, like you're saying, Ryan, is everything that occurs before that VMI is posted. So when we're using a VM, we don't have any sort of visibility into how long, for example, the storage provisioning takes, or just going from a VM being posted to actually posting the VMI; we don't know what that latency is.
H: But just for the number of resources, of any resource, you should be able to get that with kube-state-metrics. I don't know if they are available in OpenShift somehow, or what you need to do to get them, but I think that's the tool to go to if you just want to know how many objects there are.
A: I'll take a look, okay, yeah. But what about, like we do right now, where we count the number of VMIs in a state? I don't know what other states there are, but I mean, I think there's paused, right; for VMs there's running; I don't know what else there is. Maybe that could be valuable, so there might be some other ones too.
C: It mirrors a few of the values; I think maybe some of the conditions are mirrored, but it definitely does not have the same granularity that the VMI does. There's a lot more on the VMI than there is on the VM.
C: Is there anything else people want to bring up? Well, so, do we have any progress update on the periodics or anything like that? I know there's a task waiting on me to integrate things like perf audit and stuff like that. I'm just curious whether any work's been done in that area that we should review over the past week.
C
I
know
marcelo
initiated
some
of
the
the
first
periodic
things
like
that,
so
if
no
work's
been
done,
we're
done
for
topic
for
that
topic.
I
just
want
to
make
sure
I
wasn't
going
to
stomp
on
anybody
if
I
end
up
working
on
that
a
little
bit
in
the
next
few
days.
Okay,.
A: Okay, well, why don't we take a few minutes. Let's talk about some more metrics, because I mean, I've heard a few here just being thrown out, not just the VM metrics.

A: I also heard the volume creation one; I heard that one as well. So we can enumerate some of these. What are valuable metrics for VMs? Like, I don't know, count. What's the data we want to get? If we're reading a metric, what do we want to get from it?
C: What are some ideas? So one of the things that's kind of difficult about the VM is that we don't have a phase, so we don't have a clear transition between different states like we do with the VMI. We have to look at it a little bit differently; there's not a clear delineation between all the different possibilities. I guess what I'm getting at is that the thing I care about is understanding how long it takes for storage provisioning to occur before launching the VMI.
A
So,
what's
the
difference
between
the
storage
provisioning
with
a
vm
metric
than
just
from
the
vmi
like?
Would
this
be
like?
Would
this
be
a
something
that's
specific
to
vms,
or
is
this
like
this?
Would.
C
Be
it's
specific
to
the
ends
so
there's
a
vm
feature
similar
okay,
so
think
about
a
stateful
set
with
the
stateful
step.
Today
you
have
a
pvc
or
persistent
volume,
claim
template
and
you
specify
what
you,
what
kind
of
storage
you
want,
every
one
of
these
replicas
to
have
and
when
a
new
replica
comes
online,
new
storage
is
provisioned,
that's
specific!
For
that
replica
and
a
virtual
machine.
C: ...the creation and population of the PVC happens before the VM starts, or before the VMI is posted. So that's exercising a lot more than our control plane: there's the CDI control plane, because that's the thing that's actually going to be populating the PVC, and the storage provider itself, so the CSI, the underlying CSI storage class, that's actually creating the network storage for us, and however quickly it can do whatever is involved with populating the storage.
A: Okay, what other things do people have in mind?
B
In
mind,
the
desk
is
the
amount
of
iops
each
vm
and
the
total
we
have
on
the
the
server
or
our
node.
B
We
do
a
tricky
things
to
make
it
happen
on
our
solution,
but
you
can
understand.
I
would
like
to
better
understand.
We
can
get
these
metrics
for
measure
not
only
a
single
vm,
but
the
average
of
all
vms.
In
this
this
node,
the
total
iops
I'm
getting
okay.
B
The
other
thing
that
the
desks
use
we
use
gpu
on
on
the
guest
vms
and
that's
why
we
better,
we
need
to
measure
the
amount
of
of
processing
is,
is
having,
and
the
problem
we
find
is
also.
We
measure
the
temperature
temperature
of
the
gpu
on
on
on
the
node.
Also,
for
you
know,
in
the
current
version
that
is
without
could
be
worth
before.
You
know.
A
You
might
get
that
well
for
the
node.
I
mean
the
one
that
I
at
least
that
interests
me
like.
I
guess,
with
what
you
said
is
like
the
like
device
plug-in
latency.
A
This
would
rely,
though,
on
something
external
like
we
had.
You
have
to
have
like
if
you're,
using
the
videos
you're
using
the
nvidia's
gpu
device
plug-in
like
you
you're,
relying
on
that
latest.
I
mean
that
one
we
could.
I
mean
that
could
be
something
that
I
guess
we
record
in
here,
like
I'm
just
trying
to
think
I
mean,
because
the
device
plug-in
can
also
explode.
It's
on
metrics.
B: Let me give you the code for what we are using, just one second.
A
Yeah
I
mean,
I
think,
well
I'll
write
it
down,
because
it
is
so
like
I
mean
it's
something
to
consider,
because,
like
device
plug-in,
I
mean
there
in
terms
of
like
if
we're
measuring
this.
This
also
isn't
specific
to
vms
I
mean
I
think,
but
this
is
something
where
I
mean
we
would
want
to
know
in
terms
of
like
our
total
performance.
It
is
something
that
can
certainly
affect
the
you
know
the
gauge
like
push
it
one
way
or
another
if
we're
slow.
A
So
it
is
good
to
know,
I
guess,
trying
to
find
the
right
way
to
measuring
it
to
measure
it
is
the
challenge
so,
but
I'll
write
it
down
here.
So
there's
something
I
had
in
the
other
one
too.
I
think
it
makes
sense
to
something
just
finding
the
right
place.
A
Yes,
he
said
gpu
temperature,
something
we
can
also
look
at.
B: It's on the VM level, because you give a profile, and this profile has that temperature. For instance, the node has like four GPUs; which one is this VM using, for you to understand.
A
Yeah,
the
challenge
is
to
like
you
know,
saying
like:
where
are
you
going
to
get
this
information
like?
It's
probably
insist
this.
Is
it
so
like
that's
where,
like
you're,
not
going
to
get
that
with
vert
launcher
like
that,
that's
where
it's
a
little
bit
above
I
don't
know.
I
like
this
is
where
I'm
thinking
like
the
device
plug-in
is
going
to
have
access
to
some
of
this.
A
So
it
could
be
that
that's
the
place
where
this
goes,
but
I
mean
I
think
we
need
to
explore
this
a
little
more
because
I'm
not
so
sure
if
it.
If
we
know
the
whole
picture
here,
great
go
on
okay.
So
what
are
the
other?
So
what
some
other
ideas?
What
else
do
we
want
to
know
about
vmetrics?
A
I
I
mentioned
like
well,
I
heard,
let's
all
say
some
count
and
I
kind
of
expanded
on
the
idea
of
count
like
do
we
like
what
are
we?
How
would
I
do
what's
the
right
way
to
describe
this
like?
Is
it
just
the
number
of
vms
like?
Is
it
like?
Does
it
make
sense
to
break
down
by
like
the
number
that
are
running?
Remember
they're
not
running!
Is
that,
like
that,
be
like
a
like
a
histogram
or
something?
Is
that
a
metric?
We
can
do
that
valuable.
A: Is this like a status? How is running or not running displayed on VMs?
C: So I think there's three buckets here: there's provisioning, there's running, and then there's shutting down. Okay.
A: I wasn't sure if that was a phase, because for VMs, well, there isn't specifically a phase, but I know you could pause, and I wasn't sure if that gets reflected on the VM as well. So I guess it sounds like it's not solely running.

A: I mean, what are the current conditions that we post on the VM, and do we have a list of them?
C: The best thing we have, and it's not meant for this purpose, but I'll say it, is something called a virtual machine printable status. What it does is look at the virtual machine status and aggregate all the conditions to try to come up with a human-readable explanation for the state of the virtual machine. So that's going to be, for example, looking at all the conditions and saying this virtual machine is stopped, or this virtual machine is provisioning, or starting, or running, or paused, or whatever.
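A toy version of that aggregation, bucketing VMs into a printable-style status and counting them the way a metric might; the precedence rules here are invented for illustration and are much simpler than KubeVirt's actual logic:

```python
# Toy condition aggregation in the spirit of the printable status described
# above. The precedence rules below are illustrative, not KubeVirt's real ones.
def printable_status(vm: dict) -> str:
    status = vm.get("status", {})
    conditions = {c["type"]: c["status"] for c in status.get("conditions", [])}
    if not vm.get("spec", {}).get("running", False):
        return "Stopped"
    if conditions.get("Paused") == "True":
        return "Paused"
    if status.get("created") and status.get("ready"):
        return "Running"
    if status.get("created"):
        return "Starting"
    return "Provisioning"

def count_by_status(vms: list) -> dict:
    """Bucket VMs by printable status: the shape a count metric could take."""
    counts: dict = {}
    for vm in vms:
        key = printable_status(vm)
        counts[key] = counts.get(key, 0) + 1
    return counts
```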
C: I think when we're looking at creating a metric, we're probably going to want to kind of, perhaps, replicate some of this to understand what's occurring with the virtual machine. But it's not an official, it's not a stable, necessarily, way of doing that, because it's meant to convey information to the user, not necessarily programmatically to something that's consuming the VMI status directly.
A: Yeah, I mean, I'm not familiar with all the conditions, so it sounds like there's a lot, okay, that we could do on this, based on what people want. I mean, to me, what's the simplest? If we were to pick a few simple ones, these sound pretty reasonable: stopped, shutting down, running, provisioning.

A: Okay, so this would be: stopped, shutting down, starting...

C: That makes sense, starting.
G: What I'm worried about is that if I'm looking at it and I want to see, like, a pie chart, I want to see my VMs by status and how many I have in each state. And then I go to the metric with the phases: if I look at running, will the number that I see when I aggregate on the phase metric be the same number, or will it also include starting and shutting down?