Summit 2022: Benchmarking the performance of CPU pinning using different virtual CPU topologies
B
Okay, so Marcelo will cover both the introduction and the motivation part. He will talk about the CPU pinning benefits and drawbacks, and he will also list the different scenarios of virtual CPU topology, the configuration, and the goal of each scenario. After that, I will talk about some interesting characteristics of hyper-threading.
C
Okay, so just a very short introduction. Everyone knows what CPU pinning is: we are dedicating physical CPUs, doing a one-to-one mapping of the virtual CPUs of the VM to the physical CPUs on the physical host. It's done for several reasons.
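As a rough illustration of that one-to-one mapping, assuming a libvirt-managed domain (the name vm1 and the CPU numbers here are hypothetical), virsh can pin each vCPU to exactly one physical CPU:

    virsh vcpupin vm1 0 4   # vCPU 0 -> physical CPU 4
    virsh vcpupin vm1 1 5   # vCPU 1 -> physical CPU 5
    virsh vcpupin vm1 2 6   # vCPU 2 -> physical CPU 6
    virsh vcpupin vm1 3 7   # vCPU 3 -> physical CPU 7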
C
One of the reasons is performance. We have some applications that are CPU-intensive and performance- and latency-sensitive, and CPU pinning will improve the performance of the VM, especially because it will reduce the competition for resources between the different processes, or different VMs, that are running. If all the VMs running on the same host are pinned, it can prevent some OS noise and prevent some context switching.
C
At least it will reduce the context switching a little bit for the set of physical CPUs that the VM has access to. The other motivation for CPU pinning is to isolate VMs. Public clouds are doing that: they create VMs and isolate them, not only for performance but especially for security. So CPU pinning is something that many people are using.
C
Think of a system, for example KubeVirt, that is creating a lot of VMs and has to define what the best CPU pinning is. We will talk about that, especially for the new release of KubeVirt: as Fabian introduced before, Roman actually created the new CPU pinning code in the new release. Okay, so regarding CPU pinning, something important that comes with it is the VM topology.
C
The topology is especially important when doing CPU pinning, because it will affect the performance of the VM. When you are creating the virtual topology of the VM, the VM can have, for example, virtual hyper-threads or not; without hyper-threads you only have cores in the VM. This virtual topology impacts the performance, and this is one of the motivations of our experiments and of this presentation. Okay, Lee, can you go to the next slide?
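As a rough illustration of such a virtual topology, in libvirt domain XML terms (a sketch, not the presenters' exact configuration), the <topology> element decides whether the guest sees hyper-threads or only plain cores for the same four vCPUs:

    <!-- guest sees 2 cores with 2 hyper-threads each -->
    <cpu mode='host-passthrough'>
      <topology sockets='1' cores='2' threads='2'/>
    </cpu>

    <!-- or: guest sees 4 plain cores, no hyper-threads -->
    <cpu mode='host-passthrough'>
      <topology sockets='1' cores='4' threads='1'/>
    </cpu>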
C
Okay, so given that we can have virtual hyper-threads enabled or disabled, have different numbers of virtual cores, and that even on the host we can disable hyper-threading and have only cores: what is the best topology, the best configuration for performance, that we can get out of that? So in this presentation we're going to drive through the different topologies and talk about performance.
C
Yeah, we can go to the next one. Okay, so this is the baseline, what we call perfect topology matching: the virtual topology of the VM is the same as the physical topology of the host. For example, here we have a host with one socket (it's a theoretical host), so only two cores, and each core has two hyper-threads enabled, and the virtual topology has the same configuration. This will be used as our baseline configuration to compare against the other scenarios.
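A sketch of what this perfect matching could look like in domain XML, assuming the host's sibling hyper-threads are CPUs (0,2) and (1,3); the real sibling pairs vary per machine and can be read from lscpu:

    <vcpu placement='static'>4</vcpu>
    <cpu mode='host-passthrough'>
      <topology sockets='1' cores='2' threads='2'/>
    </cpu>
    <cputune>
      <!-- guest siblings vCPU 0,1 pinned onto host siblings 0,2 -->
      <vcpupin vcpu='0' cpuset='0'/>
      <vcpupin vcpu='1' cpuset='2'/>
      <!-- guest siblings vCPU 2,3 pinned onto host siblings 1,3 -->
      <vcpupin vcpu='2' cpuset='1'/>
      <vcpupin vcpu='3' cpuset='3'/>
    </cputune>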
C
We have another scenario where on the host we disable hyper-threading, but the VM has hyper-threading enabled. We want to measure the impact when the guest OS thinks it has hyper-threads but there is no hyper-threading on the host. You can go next.
C
The next one is the opposite: we have hyper-threading on the host, but the VM topology is not aware of the hyper-threads; there are only cores in the VM. It's like a mismatch in the topology, and we want to show what the performance will be. The next one is like a bonus: since we are doing CPU pinning, it's also possible to pin CPUs from different NUMA nodes, from different sockets.
C
Everyone is probably very aware of that: each CPU will have access to more memory bandwidth, because each node has its own memory region, and also to its own last-level cache. Okay, next.
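As a hedged sketch of what such cross-node pinning looks like (the domain name vm1 and the CPU numbers are hypothetical; the real layout should be read from the host first):

    numactl --hardware      # shows which physical CPUs belong to each NUMA node
    virsh vcpupin vm1 0 0   # vCPU 0 -> CPU 0, assumed to be on NUMA node 0
    virsh vcpupin vm1 1 8   # vCPU 1 -> CPU 8, assumed to be on NUMA node 1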
Okay, the next scenario that we want to show here is what we call mismatching hyper-thread location.
C
So this is the problem that a previous presentation at KVM Forum showed, before the change that we had in CPU pinning in KubeVirt. It was more or less random (not truly random, but kind of) where it allocated the CPUs when it was doing the pinning, and then it was not matching the virtual hyper-thread topology. We will show the performance when we have that scenario. Okay, next. The next one is there to illustrate something.
C
Because we are talking a lot here about hyper-threading, and about the host enabling or disabling it, this one is to show the benefit of hyper-threading. Even though a hyper-thread, as expected, has lower performance than a full core, it's important. Consider the scenario where an application can access two cores when we disable hyper-threading on the host; if we enable hyper-threading, the application can now access (I would not say "virtual" here, because that would mix with the concept of virtual machines) what look like four cores on the host. Hyper-threading increases the performance for an application that can run more threads, and it not only allows an application to run more threads, but also allows running more VMs on a node. Just keep that in mind. Okay, the next one.
C
We want to compare the performance of the perfect matching scenario, with pinning, between plain KVM and KubeVirt. Both are using libvirt to create the VMs, so it's the same thing, same versions; however, KubeVirt is running in a Kubernetes cluster, inside a container, and we want to highlight what the performance difference is there.
B
Okay, thank you, Marcelo. I guess this is my part now. I will talk about some background on hyper-threading. I guess the first natural question to ask is: why do we need to use hyper-threading? I think this might be obvious to you. There are certainly a lot of issues related to hyper-threading, like cache thrashing, where threads are competing for those low-level caches, and some previous studies have actually shown that hyper-threading has higher latencies compared to a dedicated physical core. But it also comes with some benefits.
B
It only increases the die size by less than five percent, but with a potential gain of more than 30 percent. That means you add a small number of transistors and get more throughput. This is quite important because, traditionally, if you want to increase CPU performance by, let's say, 30 percent, you might need more than 30 percent more transistors.
B
That's actually not very power efficient, which means you might get a bigger electricity bill. Another obvious benefit is that you can run more VMs per node, as Marcelo said. For the experiments we ran the NAS Parallel Benchmarks micro-benchmarks, with their computational kernels and some pseudo-applications; they're basically doing some sort of matrix computation, tasks that use the CPU intensively.
B
I wrote a simple bash script to automate the whole task: it modifies the XML file on the fly, launches the VMs, and runs the benchmarks inside them multiple times. For each of the scenarios we ran two parallel tasks, except for one experiment where we wanted to see how much throughput is gained from hyper-threading; there we allocated two cores with hyper-threading on, compared with two cores with hyper-threading off.
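The script itself wasn't shown, but roughly this kind of loop (all file, domain, and host names are hypothetical) captures the automation described:

    #!/bin/bash
    # for each run: patch the virtual topology in the domain XML, boot the VM,
    # run one NAS Parallel Benchmarks kernel inside it, and collect the output
    for run in 1 2 3; do
      sed "s|<topology[^/]*/>|<topology sockets='1' cores='2' threads='2'/>|" \
          base.xml > vm.xml
      virsh define vm.xml
      virsh start benchvm
      sleep 60                        # crude wait for the guest to finish booting
      ssh benchvm './NPB/bin/ep.C.x' >> ep-run${run}.log
      virsh shutdown benchvm
      sleep 30                        # let the shutdown complete before redefining
    done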
B
So for the first case we ran four parallel tasks compared with two tasks, and we wanted to see how much gain we get. Here is our test bed. For the host we have 32 CPUs with 32 GB of RAM; we have two NUMA nodes on this host, with two sockets and eight cores per socket, and we are able to disable hyper-threading for the VM.
B
In most of the cases we were allocating four vCPUs and RAM, with a pre-allocated disk image for the OS. The hosts were using Ubuntu, but for the guests we were using openSUSE.
B
The reason for this is that for a previous talk at KVM Forum I collaborated with a SUSE engineer, so I was too lazy to change to a different OS and just stuck with the SUSE distribution. I hope it doesn't really make much difference, but it is actually important for us to make sure that the QEMU and libvirt versions are consistent with the ones shipped by KubeVirt, so we can have an apples-to-apples comparison.
B
And here is the result of the first comparison. We compared the baseline scenario with scenario two, where you disable hyper-threading on the host side but enable hyper-threading inside the guest. I was expecting the impact to be minimal, which is actually true for most of the test cases.
B
Similarly for this case, where we have hyper-threading disabled on the host as well as inside the guest, so we have a matching topology again: the impact is very minimal, but MG showed some interesting performance differences there.
B
Things got really interesting where, let's say, you have hyper-threading turned on in the host, but you turn off hyper-threading inside the VM guest. This is where you have a mismatch in the topology, and it is actually a real issue, because the guest scheduler is not hyper-thread aware. So there is a 50 percent chance that you have sibling contention.
B
So, as you can see, the performance drop is quite significant, up to 35 percent. Another scenario is where we pinned the CPUs to different sockets. The benefit of that is that the tasks get access to more of those lower-level caches, along with higher memory bandwidth. Since our application is quite small, we expected the memory bandwidth not to make much difference, but then you can see it for both IS (integer sort) and CG (conjugate gradient).
B
They are showing quite a big throughput difference. The reason is that they require inter-process communication when we're running those two tasks, so you need to access data from remote memory, memory from a different NUMA node, and that gives you quite a bit of performance penalty. For scenario six, we're comparing against the case of the KubeVirt issue they had in the past, where you have a totally mismatching hyper-thread allocation; we're basically forcing the siblings to compete with each other for the resources.
B
So for all the benchmarks you can see there is a significant performance drop, up to around thirty percent. Scenario seven is the case where we want to check how much throughput gain we get from hyper-threading, so we are comparing four threads on two cores with hyper-threading enabled against two threads on two dedicated cores with hyper-threading disabled. So this is the throughput gain that you get.
B
As you can see, for the EP benchmark, which is called "embarrassingly parallel", the reason you're getting a 60 percent performance gain is that it's also called perfectly parallel: the tasks require little, almost no, communication.
B
This is quite important for some tasks like image processing, because you can just process those individual frames independently, without any dependency.
B
Lastly, something most of you might be very interested in knowing: what is the difference between running a KVM VM and a KubeVirt VM? As you can see, the performance difference is really small, but we ran multiple executions and the performance difference always exists. We suspect the reason is that in KubeVirt, the Kubernetes components are running a lot of background processes, like the Kubernetes agent or containerd, which might be competing for resources with the VM CPUs.
B
So, for the final considerations: if you really want to take CPU pinning to the next level, what we suggest is that you can use either isolcpus or cpuset. With the isolcpus kernel boot parameter you can isolate those CPUs from the host scheduler, as sketched below.
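A hedged example of the isolcpus route (the CPU range is hypothetical, and grub file locations vary by distribution):

    # reserve host CPUs 4-7 for pinned vCPUs by hiding them from the host
    # scheduler; add this to the kernel command line in /etc/default/grub
    GRUB_CMDLINE_LINUX="isolcpus=4-7"
    # then regenerate the grub config and reboot, e.g.
    #   update-grub                               (Debian/Ubuntu)
    #   grub2-mkconfig -o /boot/grub2/grub.cfg    (SUSE/Fedora)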
In that way you minimize CPU preemption. Also, I found out that when you have dedicated CPU placement enabled, it also enables this kvm-hint-dedicated thing, which is some sort of paravirtualization.
B
It lets the guest be kind of aware that it is running on top of KVM. But we don't know how much performance impact this has, and I talked to one of the maintainers, and they said this thing didn't really go very well. Another thing you could do is use the isolate-emulator-thread option, which reduces the lock contention. And you can also increase the huge page size, which allows you to do faster page walks as well as reducing the TLB pressure.
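In libvirt domain XML terms, those last two tunings look roughly like this (the cpuset value is hypothetical):

    <cputune>
      <!-- keep the QEMU emulator thread off the pinned vCPUs -->
      <emulatorpin cpuset='8'/>
    </cputune>
    <memoryBacking>
      <!-- back guest RAM with huge pages to reduce TLB pressure -->
      <hugepages/>
    </memoryBacking>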
B
So there is essentially a trade-off, and you want to choose whether you want to do things more automatically or do things manually. As usual, there is no one solution that solves all the problems, so you need to make the trade-off. And lastly, I think the pinning issue is indeed fixed, which is quite good news.
A
Yeah, that would be really cool, also for real-time performance, I think.
B
I think the most meaningful benchmark is the one called EP, which is "embarrassingly parallel"; basically it is the benchmark that requires no dependencies between tasks. So this is actually quite a good example of what I said in the presentation about image processing tasks; yeah, that's one of them. And another thing is that some of them actually do require some sort of dependencies, like inter-process communication, which can be a good representative for general applications, I think.
A
Okay, it seems like, yeah, it seems to be answered. Okay, great. There is no other question coming up, so Lee and Marcelo, thank you again for the great talk, and we will be back ten minutes past four UTC time; that's in roughly 13 minutes. Okay! Thank you. Thank you. Thank you, Marcelo. Thank you, Roman.