A
Okay, so welcome everybody to the last session of the first day of KubeVirt Summit. By the way, all these sessions are being recorded, and as soon as we can edit the video we will get it up on the KubeVirt YouTube channel in case you missed anything. For our last session of the day we have Fan Zhang, who's going to be talking about...
B
Thank you. Yeah, hi everyone, thank you for joining my talk. My name is Fan. I'm a software engineer at NVIDIA, working on delivering globally deployed, massive-scale GPU cloud services as a foundation for some challenging workloads like cloud gaming, AI and machine learning, and other GPU-accelerated workloads at a large scale. So today I'm going to talk about a few bugs and findings from the VMI churn in our practice.
B
Okay, so yeah, I would like to start by talking about how KubeVirt is used in real NVIDIA projects, and something about our use cases that upstream doesn't cover. So NVIDIA is leveraging KubeVirt as the core part to build cloud native infrastructure and services on our on-premises data centers to support our global, multi-tenancy workloads, like streaming gaming.
B
We want KubeVirt in our stack to be resilient and reliable, and the workloads on it to be managed in a pure cloud native approach. So now take the streaming gaming use case as an example. The streamed games must run on a Windows virtual machine, the backend services must run on Linux virtual machines, and our infrastructure services are built on top of Kubernetes. So basically all of them must be running isolated.
B
We operate on the VirtualMachineInstance (VMI) objects directly. The workload runs in a highly intensive, dynamic manner: VMs burst with creation and deletion every minute, and most have a lifetime of no more than two hours, often even less. However, some critical services are expected to run for a long time.
B
So
some
big
qbs
in
the
skill
wait
is
bad
burst,
creation
rate
at
least
600
vms
per
minute,
and
normally
we
are
running
a
big
kubernetes
cluster
with
over
600
bare
metal
nodes
and
over
1
000
of
mis
are
running
every
minute.
B
So that's our use case. So let's move on to the bugs.
B
Okay, yeah, so here's the bug list. The first one is a VMI stale status issue. In our practice we noticed a couple of times that the launcher pod was removed from the Kubernetes cluster, but the VMI objects, the VirtualMachineInstances, were still in Running status.
B
So we looked into the logs of the Kubernetes and KubeVirt components, and we found that there were two issues behind it. The first one: virt-handler failed to sync the domain cache during a crash or termination.
B
A
reboot
rebooted
vert
handler
has
to
re-sync
with
the
api
server
and
the
rebuilt
of
the
local
informal
cache,
unlike
vmi,
which
is
a
persistent
fcd.
The
domain
informer
cache
is
completely
lost
during
the
word
handler
crash
or
termination
word
handler,
handles
resync
by
listing
launcher
circuit
files
in
the
host
path
and
adding
back
to
the
domain
format.
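A minimal sketch of that relisting step, with an assumed directory layout and socket suffix rather than KubeVirt's exact paths:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// relistLauncherSockets rebuilds a non-persisted cache after a restart by
// listing the launcher socket files left behind on the host, the way the
// talk describes virt-handler repopulating its domain informer. The base
// directory and the ".sock" suffix are illustrative assumptions.
func relistLauncherSockets(baseDir string) ([]string, error) {
	entries, err := os.ReadDir(baseDir)
	if err != nil {
		return nil, err
	}
	var sockets []string
	for _, e := range entries {
		if !e.IsDir() && strings.HasSuffix(e.Name(), ".sock") {
			sockets = append(sockets, filepath.Join(baseDir, e.Name()))
		}
	}
	return sockets, nil
}

func main() {
	socks, err := relistLauncherSockets("/var/run/kubevirt") // hypothetical path
	if err != nil {
		fmt.Println("relist failed:", err)
		return
	}
	for _, s := range socks {
		fmt.Println("re-adding to domain cache:", s) // placeholder for the informer add
	}
}
```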
B
Then, in the default worker process, it calculates the accurate VMI status based on the domain and the VMI existence.
B
If the pod is gone and the domain is unresponsive, the VMI should first be updated to the Failed status by virt-handler, and then virt-controller can take over this VMI for the subsequent finalizing and deleting. Kubernetes is a distributed system, so there might be multiple controllers running and trying to write to the same object simultaneously, or an object may be manipulated concurrently. So the Kubernetes API implements multi-version concurrency control to orchestrate the concurrent write operations. To update or re-sync an object, the controller must use the latest resource version. However, we noticed that the resource version of the VMI easily gets lost, and the current code base of virt-handler does not cover the situation when the resource version is empty. So for this issue, I'm going to add a fix to get the latest resource version and update the object in this case, and to add the get permission to virt-handler's RBAC.
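For reference, the standard client-go shape of that kind of fix looks roughly like this; a Pod stands in for the VMI and the label key is illustrative:

```go
package fixsketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateWithFreshVersion shows the standard optimistic-concurrency pattern:
// fetch the latest copy of the object inside the retry loop, mutate it, and
// let RetryOnConflict re-run the closure if another writer won the race.
func updateWithFreshVersion(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		pod, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err // includes the case where the object is already gone
		}
		if pod.Labels == nil {
			pod.Labels = map[string]string{}
		}
		pod.Labels["example.io/phase"] = "failed" // illustrative mutation
		// The Update carries the resourceVersion from the Get above, so the
		// API server's MVCC check sees a current version, never an empty one.
		_, err = cs.CoreV1().Pods(ns).Update(ctx, pod, metav1.UpdateOptions{})
		return err
	})
}
```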
B
The second issue: a pod is evicted on the reboot of the node, or if the node's NotReady status lasts more than five minutes, the pod will be evicted. Virt-handler has a periodic routine that checks the stale launcher sockets and marks unresponsive ones by creating a launcher unresponsive file in the same directory as the launcher socket. However, if the pod volume is completely removed on the host, there isn't a directory that can be written to, so the same error message happens repeatedly.
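A small sketch of that marking step, with the missing-directory guard that the bug calls for; file and path names are illustrative, not KubeVirt's actual watchdog code:

```go
package watchdog

import (
	"fmt"
	"os"
	"path/filepath"
)

// markUnresponsive drops a marker file next to a stale launcher socket, as
// the periodic health check in the talk does. The guard is the point of the
// bug: if the pod volume (and so the socket's directory) was removed along
// with the evicted pod, there is nothing to write to, and retrying the
// write forever just repeats the error.
func markUnresponsive(socketPath string) error {
	dir := filepath.Dir(socketPath)
	if _, err := os.Stat(dir); os.IsNotExist(err) {
		return fmt.Errorf("socket directory %s is gone; nothing to mark", dir)
	}
	marker := filepath.Join(dir, "unresponsive") // hypothetical marker name
	return os.WriteFile(marker, nil, 0o644)
}
```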
B
So
the
reason
for
this,
I
would
suspect
that
it
might
be
relating
to
the
ghost
record.
The
caching,
the
launcher
socket
pass
was
not
cleaned
up
properly.
B
That's
why
we
go
to
the
next
issue,
so
this
is
a
very
interesting
bug.
We
captured
this
when,
when
vmi
stuck
stuck
in
scheduled
status,
but
the
launcher
part
failed
with
an
error
showing
the
computer
come,
the
compute
container
was
terminated
whatever
this
vmware
is
recreated
or
we
tried.
The
result
was
the
same.
B
Yeah, let me talk a little bit about the ghost records as background. Each newly created virt-launcher pod needs to provide a launcher socket and register it into the /var/run/kubevirt-private ghost-records path, keyed by the VM UID, for caching. And every time virt-handler is started, it will read all the files in that path into its cache. The ghost record is used for guaranteeing that a VM's local data is cleaned up.
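As background, a hypothetical reduction of that ghost-record registration; KubeVirt persists these as files under a host path so they survive a virt-handler restart, and a map stands in here:

```go
package ghostrecords

import (
	"fmt"
	"sync"
)

// GhostRecord mirrors the structure described in the talk: one entry per
// VMI, remembering the launcher socket path and the VM UID.
type GhostRecord struct {
	SocketPath string
	UID        string
}

// Cache keys records by "namespace/name", the same key a rescheduled VMI
// with the same name will reuse.
type Cache struct {
	mu      sync.Mutex
	records map[string]GhostRecord
}

func NewCache() *Cache {
	return &Cache{records: map[string]GhostRecord{}}
}

// Add registers a launcher socket. A leftover entry under the same key but
// a different UID is exactly the stale state that blocks the bug's
// rescheduled VMI.
func (c *Cache) Add(key, socketPath, uid string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if existing, ok := c.records[key]; ok && existing.UID != uid {
		return fmt.Errorf("ghost record for %s already exists with different UID", key)
	}
	c.records[key] = GhostRecord{SocketPath: socketPath, UID: uid}
	return nil
}
```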
B
But
when,
while
word
handler,
we
initialized,
even
if
the
vmware
is
deleted
from
the
ipcd,
how
we
debug
this
issue
and
the
founder
the
root
cause.
There
are
some
clues.
First,
the
issue
happened
after
node
reported
from
an
already
situation.
B
The virt-handler pod was terminated by the kubelet during the same period as the NotReady status. So that's the first clue. Second, from the terminated compute container's log we saw it was "timeout waiting for domain to be defined", and from the virt-handler logs we saw that the error message was something like "unable to create virt-launcher client connection: ghost record already exists with a different UID".
B
This VMI is for critical services and is required to be deployed on one specific node; also the VM name and the namespace are specified. So it means the key of this ghost record is always the same, and this VMI will always be scheduled on the same node. Okay, so checking back on the timestamps: virt-handler was terminated on the node while the node was suffering from NotReady for more than five minutes. So the previous launcher pod was evicted, but the local data was not cleaned up; virt-handler did not have the chance to go through a successful cleanup process. The ghost record was a stale one. After virt-handler rebooted, every time a new VMI with the same key, namespace-slash-VMI-name, was spawned, virt-handler, using this same key, would pick up the stale ghost record in the path. So the VMI would never be processed, and the connection could not be built. That's why we saw the container fail with the timeout waiting for the connection.
B
So
looking
into
the
code
base,
I
think
the
fix
will
be
adding
a
cleanup
logic,
but
it
goes
to
record
this
could
be
done
by
extend
extending
the
logic
of
the
cleanup
when
deleting
the
old
domain
systems.
So
this.
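Sketching that direction against the hypothetical Cache shown earlier; the method name and semantics are assumptions about the fix's shape, not KubeVirt's actual cleanup code:

```go
// CleanupStale extends the Cache sketch above in the direction the talk
// proposes: when old domain state is cleaned up, also drop any ghost
// record left under the same key with a different UID, so a rescheduled
// VMI with the same namespace/name is not blocked forever.
func (c *Cache) CleanupStale(key, currentUID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if existing, ok := c.records[key]; ok && existing.UID != currentUID {
		delete(c.records, key) // stale record left by the evicted pod
	}
}
```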
B
I think this is a good example that we should be thinking about some corner cases, especially when a failure happens on the KubeVirt components: how do we handle the stale ghost records or old socket files?
B
Okay,
yeah,
okay,
so
the
editing
video
our
workload
could
be
very
intensive
and
at
a
large
scale.
So
we
are
experience
experiencing
something
that
hasn't
been
covered
upstream.
So
today,
I'm
going
to
talk
about
one
thing
is
at
the
largest
scale:
for
example,
one
thousand
vmis
deleting
a
lot
of
bmis,
can
cause
world
controller
to
panic
before
we
expand
the
root
cause.
Let
me
step
back
a
little
bit
and
look
more
abstract
on
how
kubernetes
controller
works
and
why
they
choose
the
event
event.
Here are the two ways to detect a state change for an event in the real world. One is the edge-driven trigger, which means that at the point in time the state change occurs, a handler is triggered. For example, the pod was Pending, and suddenly the pod is Running. So this is edge-triggered; it's not like polling.
B
The
second
one
is
the
level
trigger
level
triggers
means.
The
state
is
when
the
state
is
checked
that
the
regular
the
state
is
checked
at
the
regular
intervals
and
if
something
or
certain
conditions
happens
or
met,
then
the
handle
of
the
controller
is
a
trigger,
so
level
trigger
is
a
form
of
reporting.
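A toy Go sketch of the two styles, with a hypothetical getState function and event channel, makes the trade-off concrete:

```go
package triggers

import "time"

// pollLevel is level-triggered: the state is checked at a fixed interval,
// so reaction latency is bounded by the polling period.
func pollLevel(getState func() string, interval time.Duration, handle func(string), stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if s := getState(); s == "Running" { // condition checked at poll time
				handle(s)
			}
		}
	}
}

// watchEdge is edge-triggered: the handler fires on each change event
// exactly when it is delivered, with no polling interval in between.
func watchEdge(events <-chan string, handle func(string)) {
	for e := range events {
		handle(e)
	}
}
```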
B
Noticing changes depends on the interval of the polling and how fast the API server can answer. So if many async controllers run simultaneously, the system will take a longer time to reach the desired state. On the contrary, edge triggering is much more efficient with many objects.
B
Worker threads in the controllers process the events. So the Kubernetes controller is designed based on the edge trigger, which we also call event processing. Yeah, let's refresh how a Kubernetes controller works. A Kubernetes controller has two main components: the informer and the workqueue. Informers have watchers under the hood to watch for changes in the current state of Kubernetes objects and send events to the workqueue. Then the events in this workqueue are popped by the workers to process, together with the informer's cache.
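As a reference for that shape, here is the standard client-go informer-plus-workqueue wiring, with a Pod informer standing in for the VMI informer; the function name is illustrative, not KubeVirt's exact code:

```go
package controller

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// buildController wires the two halves described above: an informer that
// watches the API server, and a workqueue that worker goroutines drain.
func buildController(cs kubernetes.Interface, stopCh <-chan struct{}) workqueue.RateLimitingInterface {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	factory := informers.NewSharedInformerFactory(cs, 10*time.Minute)
	informer := factory.Core().V1().Pods().Informer()

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key) // workers pop namespace/name keys, not objects
			}
		},
		UpdateFunc: func(_, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
		DeleteFunc: func(obj interface{}) {
			// This key func knows how to unwrap DeletedFinalStateUnknown
			// tombstones; see the panic discussed below.
			if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	})
	factory.Start(stopCh)
	return queue
}
```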
B
For
example,
delete
function
is
called
when
an
existing
resource
is
deleted.
It
gets
the
final
state
of
the
resource
if
it
is
known.
Otherwise
it
will
get
an
object
of
the
object,
type
that
delete
final
state
unknown.
B
So actually we also observed that, at a larger scale, edge-triggered events like delete have a higher chance to be missed by the watcher. When the delete event is missed, a DeletedFinalStateUnknown object is added to the DeltaFIFO of the VMI informer. But virt-controller picks up the object and blindly attempts to assert it to the VMI type, which was causing our runtime panic. So that is the root cause, and the fix is easy.
B
We
added
this
search
before
in
red
controller.
Every
time
word:
controller
trying
to
assert
the
object's
type.
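The canonical form of that check, from the client-go tombstone pattern, looks like this; a Pod stands in for the VMI type, and the unchecked single-value assertion obj.(*corev1.Pod) is what would panic at runtime:

```go
package controller

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// onDelete shows the standard tombstone-aware delete handler. When a watch
// event is missed, the informer delivers a DeletedFinalStateUnknown wrapper
// instead of the real object, so the type must be checked before use.
func onDelete(obj interface{}) {
	pod, ok := obj.(*corev1.Pod)
	if !ok {
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			fmt.Printf("couldn't get object from tombstone: %+v\n", obj)
			return
		}
		pod, ok = tombstone.Obj.(*corev1.Pod)
		if !ok {
			fmt.Printf("tombstone contained unexpected object: %+v\n", tombstone.Obj)
			return
		}
	}
	fmt.Printf("handling delete of %s/%s\n", pod.Namespace, pod.Name)
}
```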
B
All of these issues have some sort of relationship to pod crashes, node NotReady, unavailable networks, and I/O problems. So we're on the way to proposing some fault injection solutions for KubeVirt. To do this, we are running some scripts to randomly inject faults, but that's not enough. We have been experimenting with and investigating some chaos engineering tools, like Chaos Mesh, to do it. Hopefully we will have another talk on this.
B
The last topic is debugging. Debugging is hard and painful, and there are many challenges. The first one, as I see it: some tricky problems are not easy to reproduce, and this is a major blocker for debugging. Second, a bug may only be fired under some particular criteria, but we don't know the root cause most of the time, so we cannot reproduce it for debugging.
B
It is also very hard to capture all the information needed to debug. For example, the component logs are not sufficient to debug why the QEMU process terminated unexpectedly. We needed to search for every piece of clue to find the root cause. So yeah, okay, that's all for my talk today. This is my contact information, so feel free to shoot me a message and talk about anything.
A
Yeah, cool, okay, awesome! If you want to stop sharing your screen, we'll see if we have any questions.
A
So,
thank
you
very
much
for
that.
The
so
folks
have
any
questions
comments
about
fawn's
experience.
B
So,
as
far
as,
if
I
remember
correctly
in
video
gpu,
because
we
are
using
the
bare
metal
virtualization
platform
that
doesn't
support
the
migration
of
the
virtual
gpu,
so
that's
not
the
case
in
our
that's,
not
the
something
we
take
care
of
in
our
platform.
Also,
in
our
use
case,
we,
when
we
support
the
vmi,
the
vmi
running
spin
out
very
quickly
and
the
lifetime
is
very
short,
so
there
isn't
a
need
to
do
the
do
the
migration.
A
Okay,
cool
actually
wants
to
know
what
kind
of
monitoring
and
alerting
you
used.
B
We
use
the
promises
and
a
lot
of
exporters
to
to
grab
the
information
and
logs
from
the
cluster
and
pointing
to
the
dashboard.
So
that's
that's
a
major
tools.
We
are
using.
B
The
gpu-
that's
that
that's
nvidia,
has
a
lot
of
tools.
Nvidia
has
a
tool
to
monitor
the
gpus.
I
think
it's
africa's
them.
You
can
get
checked
on
the
nvidia
gpu
online.
B
Well, what VMI numbers could we scale to after fixing the panic issues? Yeah, so after fixing this problem, we can support over one thousand VMIs concurrently running in the cluster, so every minute there are hundreds of VMs created and deleted.
A
Yeah,
so
chris
is
commenting
that
nasa
is
actually
using
water.
Cooling
for
their
nvidia.
Gpus
really
need
to
answer
that,
and
I
guess
I
guess
do
you
want
to
share
your
contact
information
slide
again
sure,
just
because
andre
wanted
that.