A
The minute just rolled over, so I'll go ahead and introduce you. I know I didn't get a chance to practice your name before we started, but yeah.
B
Okay, yeah, I'll get started. Thanks, everyone, for joining. Good morning, good afternoon, good evening. After we have heard many hardcore topics, let's discuss something more relaxing.
B
Today I'm going to discuss the Kubernetes performance visualizations at NVIDIA. First, some background information: KubeVirt is a core component in NVIDIA's cloud platform stack. Basically, we use KubeVirt to provide VM resources to the cloud gaming service. When we used KubeVirt, we found an issue, which is a lack of visualizations.
B
This makes it very hard to debug and triage issues. Usually, when we have a problem, we can only use command-line tools like kubectl to get the information, which is inefficient and not intuitive. Also, our cloud team is very big, so there are multiple teams: a cloud service team, a validation team, and SRE teams. And external customers usually find it very difficult to understand how Kubernetes works and how our platform side works.
B
Some important components, like the work queue, need to be visualized.
B
At NVIDIA we have different ways to visualize our platform: we have metrics-based visualizations, we have logs, and we have traces. Specifically for KubeVirt, we chose a Prometheus- and Grafana-based monitoring stack to visualize the KubeVirt metrics. Last year, upstream introduced some very interesting metrics for us to visualize.
B
We also have a VM phase count metric. With it we can know how the VMs are distributed inside our zones, like how many of them are in the Running phase and how many are in the Pending, Scheduling, and Scheduled phases. There is also a Kubernetes work queue metric.
B
The overall layout is: the lower the panel, the more detail it contains. On the top we have a user manual panel. This is mainly for external customers, because they don't have a very good understanding of KubeVirt. Then we have a performance indicator dashboard.
B
One of the most important indicators is the VM creation time. This is basically the most important indicator for KubeVirt, because on the cloud service side, what they care about most is how fast a VM can be created.
B
We put this indicator on top. It tracks the end-to-end time from VM creation to Running, and it also contains some stats for each phase, like the average and the 75th, 95th, and 99th quantiles of the time to reach a specific phase. The graph here tracks the time it takes from the previous phase to the Running phase.
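As a minimal sketch of where those quantile stats come from, assuming Prometheus-style cumulative histogram buckets: the panel's 95th/99th quantiles are linear interpolations inside the first bucket whose cumulative count covers the requested rank, which is what Prometheus's histogram_quantile() does. The bucket bounds and counts below are invented for illustration.

```python
# Sketch: estimate a quantile from Prometheus-style cumulative
# histogram buckets, the way histogram_quantile() does.
# Bucket upper bounds (seconds) and counts are illustrative, not real data.

def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted by bound;
    the last bound may be float('inf')."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # no upper edge: return last finite bound
            # linear interpolation inside this bucket
            return prev_bound + (bound - prev_bound) * (
                (rank - prev_count) / (count - prev_count)
            )
        prev_bound, prev_count = bound, count
    return prev_bound

# e.g. time-to-Running for one phase: 1000 VMs total
buckets = [(1, 10), (5, 700), (10, 950), (30, 990), (float("inf"), 1000)]
p95 = histogram_quantile(0.95, buckets)
```

Because only the bucket edges are known, the quantile is an estimate; that aggregation is also why per-VM detail is lost, as discussed later.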
B
Basically, this panel visualizes the KubeVirt VMI phase transition time from creation seconds metric, so it displays the time it takes to reach a specific phase. In general it is a stacked bar histogram: we stack the different phases at the same timestamp together, so it is compact and intuitive for checking how long each phase takes. The most important usage is to check which phase spends the most time.
B
Then we know that phase has some issues and may need triage and a deep dive. In this graph, different colors represent different phases: yellow represents the time it takes to reach the Pending phase, light blue the time to reach the Scheduling phase, dark blue the time to reach the Scheduled phase, and green the time to reach the Running phase.
B
In this graph we find that the dark blue bar is always the longest, which means it takes the most time for the VMs to go from the Scheduling phase to the Scheduled phase. In other words, the VMs spend a long time in scheduling.
B
Basically, this graph provides very useful information on the health of each phase.
B
Actually, there is a shortcoming of the breakdown graph: the data in the transition time metric is aggregated through 95th-quantile calculations, so we lose some of the information from the metric. Information about individual VMs is lost during the 95th-quantile calculation. If you look at this graph, each bar represents the 95th quantile across all of the VM creations; it doesn't contain the information of individual VMs.
B
So this heat map panel contains many cells, and each cell represents a bucket in the histogram metric. The brighter the cell is, the bigger the number in that bucket. In this example, if we hover over this cell, we see that the count of this bucket is 1.18k, which means there are many VMs inside this bucket.
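As a rough sketch of what each heat map cell shows: the raw histogram series is cumulative, so the per-cell count is the difference between adjacent buckets, and the brightest cell is simply the largest difference. The bounds and counts below are invented for illustration.

```python
# Sketch: turn Prometheus cumulative histogram buckets into the
# per-bucket counts a heat map cell displays (brighter = bigger count).
# Bucket bounds (seconds) and counts are invented for illustration.

def per_bucket_counts(cumulative):
    """cumulative: list of (upper_bound, cumulative_count), sorted by bound."""
    cells = []
    prev = 0
    for bound, count in cumulative:
        cells.append((bound, count - prev))  # count falling in this bucket only
        prev = count
    return cells

cumulative = [(5, 300), (10, 1480), (30, 1500), (float("inf"), 1510)]
cells = per_bucket_counts(cumulative)
brightest = max(cells, key=lambda c: c[1])  # the cell drawn brightest
```

With these invented numbers, the 10-second bucket holds 1.18k VMs, so its cell would be the brightest in the column.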
B
Let's say that bucket is 10 seconds. This heat map tells us that there are many VMs with a phase transition time of 10 seconds. We also found that this other row is very bright, which means there are also many VMs with a phase transition time of, if we check the y-axis, five seconds. So you can see that previously most VMs had a transition time of five seconds, but now most VMs have a transition time of 10 seconds.
B
So this is basically very helpful. Another usage is that it can help us check for outliers. If many VMs land in the five-minute cell, or even the ten-minute cell, it means we have many outlier VMs with a very unhealthy transition time; the transition time is too long and it needs a deep dive and triage.
B
We also have VM distribution panels. One of the very important indicators is how the VMs are distributed across the phases: how many VMs are in the Running phase, how many are in the Scheduled phase, and how many are in the Scheduling and Pending phases. So if we check this graph and find that the count of running VMs is very large...
B
We know this zone is in general healthy, because most of the VMs are in the Running phase.
B
So
if,
in
some
conditions
that
we
found
that
the
the
running
vms
drops
a
lot
and
the
failed
vms
or
the
pending
vms,
the
count
of
them
increases
a
lot.
We
know
the
zone
may
be
in
an
unhealthy
state,
so
the
sru
team.
We
know
this
zone
is
not
healthy,
so
they
may
take
the
zone
offline
for
quick
maintenance
or
of
detox
deep
dive
and
triage.
B
We also have totals for succeeded and failed VMs. This can help us see how the health of the zone evolves over time. If we find that the failed VM count keeps increasing very fast, it means the zone may not be healthy and may need maintenance or something like that.
B
In general, there are many interesting use cases for this KubeVirt performance dashboard. The first typical use case is the heavy cloud gaming workload. This is the most typical and most important use case. Basically, NVIDIA's cloud gaming service often runs into high-load situations, like at night or on holidays.
B
Many
users
will
come
to
start
play
games
at
the
same
time.
In
this
case,
many
vms
will
be
created
in
very
short
time.
This
will
have
very
large
pressure
on
the
cloud
platform
side,
the
cloud
service.
Basically,
they
will
use
kibana
to
trick
track
vm
resource
informations
like
in
this
graph.
They
will
track,
they
have
a
vm
pool,
so
they
track
the.
How
many
vms
are
there
in
the
vm
post?
So
in
this
graph
we
see
that
under
some
very
high
load
situations,
the
vm
pool
will
drop
a
lot.
B
This
may
be
possibly
due
to
some
load
test
off
or
some
high
load
use
cases.
In
this
case,
the
vm
will
will
drop
a
lot
so
service
side
cloud
service.
I
use
kibana
to
track,
and
in
this
case,
on
the
platform
side,
we
need
bet
better
visualizations
as
well
to
to
triage
the
issue
on
our
side
to
make
sure
that
our
cloud
platform
side
is
not
the
why
the
problems
occurs.
B
So,
basically,
the
phase
transition
time
is
very
suitable
to
do
some
and
analyze
on
platform
side
like
so
this.
This
graph
is
captured
at
the
same
time
as
this
graph,
so
so,
on
the
server
side,
we
see
that
there's
a
vm
pool
drop
at
around
4
00
a.m.
B
So,
in
this
graph
the
platform
part
first
transition
time
breakdown
graph.
We
we
found
the
same
problems
happen
at
the
like
at
around
the
same
time.
It
also
happens
around
the
4
a.m.
We
found
that
the
the
light
blue
bar
increases
very
fast,
and
also
we
see
that
the
count
of
running
vms
decrease
a
lot.
So
this
is
inconsistent
with
the
service
side
graph,
so
how
what
this
graph
tells
us.
B
So,
since
the
the
scheduling
bar
is
very
very
long,
so
it
means
it
takes
most
time
spent
most
time
to
reach
a
scheduling
phase.
So
in
kubernetes,
if,
if
a
vmi
spent
a
long
time,
you
reach
scheduling
phase,
it
means
usually
it
is
keep
in
the
pending
phase.
So
usually
it
is
probably
due
to
the
lack
of
system
resources
like
lack
of
gpus
or
laptop
memories
that
cause
this.
B
So,
with
this
graph
we'll
quickly
know
what
may
be
probably
the
issues
and
they
can.
We
can
resolve
these
issues
like
by
adding
more
system
resources
or
cleaning
up
some
orphan
node
that
take
up
most
resources,
so
in
general
this
helped
us
to
triage
the
issues
in
the
cloud
gaming
workload
and
platform
side.
B
We also found that whenever there is a large VM deletion operation, it may cause the virt-controller to panic.
B
So
whenever
the
vert
controller
to
is
panic,
it
cannot
expose
any
matrix.
So
on
the
graph,
we
will
see
that
the
the
graph
is
interrupted,
because
the
word
controller
is
panic
and
it
cannot
expose
any
matrix.
B
So
with
this
graph,
it
is
easy,
very
easy
for
us
to
detect
some
situations
like
the
vertical
controller
panic,
and
we
also
know
why
it
is
panic
and
we,
and
if
we
check
the
log,
we
can
find
why
it
is
panic.
B
So
the
panic
for
this
is
due
to
the
deleted
final
state
unknown
object.
It's
not
properly
handled
that
causes
an
a
panic.
So
usually
we
need
to
add
some
error
handling
logic
to
make
sure
that
it
not
cause
very
severe
issues.
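For context: in Kubernetes's client-go, when an informer misses the final delete event, the delete handler receives the object wrapped in a cache.DeletedFinalStateUnknown tombstone rather than the object itself, and code that assumes the raw type can crash. The defensive unwrap looks roughly like this, transliterated to Python for illustration (the class and handler names here are stand-ins, not the real API):

```python
# Sketch of the tombstone-unwrapping pattern from client-go delete
# handlers, in Python. DeletedFinalStateUnknown / on_delete are
# illustrative stand-ins for the Go types involved.

class DeletedFinalStateUnknown:
    """Stand-in for the tombstone the informer cache emits when the
    final delete event was missed; it wraps the last known object."""
    def __init__(self, key, obj):
        self.key = key
        self.obj = obj

def on_delete(obj):
    # Without this unwrap, treating the tombstone as the resource
    # object is the kind of mistake that triggers a panic during
    # bulk deletions.
    if isinstance(obj, DeletedFinalStateUnknown):
        obj = obj.obj
    return f"cleaned up {obj['name']}"

vmi = {"name": "vmi-42"}
tombstone = DeletedFinalStateUnknown("ns/vmi-42", vmi)
```

The handler then behaves identically whether it gets the object directly or via a tombstone.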
B
So,
in
summary,
the
newly
introduced
phase
transition
matrix
provides
a
very
good
approach
to
visualize
the
covert
performance.
It
can
be
used
by
external
customers
to
quickly
overview
the
behavior
of
kubernetes.
So
they
don't
need
to
deep
dive
into
the
details
or
keep
asking
us
about
how
whether
the
the
platform
is
healthy.
You
can
use
just
use
the
indicator
panel
to
check
whether
the
zone
is
healthy.
B
It can also be used by developers to detect hidden bugs, because the Grafana dashboard is real-time, so it is very suitable for checking when and where a problem happens. Then, to dive deeper, they can check the logs in Elasticsearch to triage the problem.
B
Currently, the dashboard is staged in NVIDIA's cloud platform production environment, and it helps with visualization in the heavy cloud gaming workload scenario.
A
Thanks for your presentation; I really liked the case studies you went through there. We do have some questions in the chat. First off, Andre was asking if NVIDIA is working on live migration of VMs with GPUs.
B
Yeah, I think there may be some design and discussion ongoing, but as far as I know, we currently don't have it yet. Usually we need a maintenance window to achieve this, actually.
A
Okay. And Andre also asks whether you can monitor the GPU temperature with Prometheus and your plug-ins.
B
Yes, I think there are actually several open-source exporters, like the NVSM exporter and the SNMP exporter; I think they can be used to monitor GPU temperature, and I think it is already implemented in our platform. Basically, nvidia-smi exposes some interfaces, and there are Golang SDKs for us to write a quick exporter that retrieves the temperature information and then exposes it as Prometheus metrics. I think it is doable, yeah.
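As a sketch of that idea: nvidia-smi can emit machine-readable CSV (e.g. `nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader`), so a quick exporter only has to parse that output and render Prometheus exposition text. The metric name gpu_temperature_celsius below is made up for illustration.

```python
# Sketch: turn `nvidia-smi --query-gpu=index,temperature.gpu
# --format=csv,noheader` output into Prometheus exposition text.
# The metric name gpu_temperature_celsius is illustrative.

def parse_temps(csv_text):
    """Map GPU index -> temperature in Celsius from nvidia-smi CSV lines."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = [field.strip() for field in line.split(",")]
        temps[int(index)] = float(temp)
    return temps

def render_metrics(temps):
    """Render one gauge sample per GPU in the Prometheus text format."""
    lines = ["# TYPE gpu_temperature_celsius gauge"]
    for index, temp in sorted(temps.items()):
        lines.append(f'gpu_temperature_celsius{{gpu="{index}"}} {temp}')
    return "\n".join(lines)

sample = "0, 41\n1, 56\n"   # example nvidia-smi output for two GPUs
metrics = render_metrics(parse_temps(sample))
```

A real exporter would run the command periodically and serve this text over HTTP for Prometheus to scrape.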
A
We have a number of questions interested in how your user load works out. One of them, from Daniel, was: what is the average time that users will tolerate until there's some sort of churn?
B
If we cannot provision a VM, then we cannot start streaming, and then users cannot play games. So it is very obvious, and every user can observe it: if the game cannot start, it leads to a very long loading dialog.
B
The maximum number of concurrent VM creations, I think, is in the hundreds, and sometimes it may reach thousands, yeah. Our colleagues Ryan and Fam may have some better data as well, yeah.
A
And do you plan to make the dashboards publicly available, SJ asks?
B
I think currently it is mainly for internal use, so we may not make it publicly available yet, because it is deeply coupled with our zone configurations. So maybe it's hard for external customers to use.
A
Okay,
chris
has
a
final
question
or
a
question
I
think
in
terms
of
measuring
the
the
average
and
maximum
number
of
concurrent
requests
like
we
were
talking
about
earlier.
You
kind
of
group
those
by
minute
by
second
like,
what's
your
time
bucket.
A
Somebody else is sharing a Grafana dashboard; Marcelo was mentioning one that's useful, but.