From YouTube: Lightning Talk: Dancing with Cores: A Path for Fine-Grained Core Positioning within Kubernetes
Description
Don’t miss out! Join us at our upcoming event: KubeCon + CloudNativeCon Europe 2023 in Amsterdam, The Netherlands from April 17-21. Learn more at https://kubecon.io. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.
So first I'm going to flash up this picture. Balzattanus and I, because we started off in the HPC world, realized that the current set of kubelet default managers are potentially leaving a lot of performance on the table.
These examples come from our experience in attempting to benchmark new hardware platforms at Intel using this control plane. Now that we had this fancy new tool, we needed to prove that it was useful. Benchmarking server platforms in HPC is customary: we want numbers. We had a lot of confusion about how to do these benchmarks, so we started looking for toolkits.
So first: do we want to execute manually? No, we want reproduction to be simple. Engineers get things wrong, including us, and we wanted to automate. Batch tools play a big role in increasing the ROI of benchmarking. We wanted to enable users to study application regressions and save everyone time, so we then had to go and hunt down the tools we wanted. We could use batch scripts, but that seemed hard. We also wanted to be able to schedule benchmark execution, maybe similar to Slurm; cloud-native queuing frameworks can be a valid alternative.
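As a sketch of what a cloud-native alternative to batch scripts can look like, a plain Kubernetes Job already gives repeatable, scheduled benchmark runs. The image name and command below are hypothetical placeholders, not tooling from the talk:

```yaml
# Sketch: a benchmark run submitted as a Kubernetes Job instead of a batch script.
# Image and command are illustrative placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: bench-run
spec:
  completions: 3          # repeat the run three times for stable numbers
  backoffLimit: 0         # inspect a failed run instead of silently retrying
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: bench
        image: example.com/bench:latest       # placeholder image
        command: ["/bench", "--duration=60s"]
```

A Job like this can then be queued, retried, and repeated by the cluster without hand-written scripts.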
The first workload we looked at is the microservice-based Google microservices demo. This application is the classic three-tier app, including a front end to receive the requests from the clients, business-logic services, and a database store where we store the transactions for the customers. In our benchmarking, we evaluated the throughput of such a system on a distributed setup with four machines: we placed a load generator on a separate machine, and we had three worker machines for the front end, business logic, and transactions.
The second benchmark we used was the DeathStarBench hotel reservation system. Again, this is a microservice-based software platform that provides search and recommendation capabilities for hotels. We see here a clear split into three tiers. The difference in this case was that we had separate databases for different parts of the data model of the application, in addition to a caching layer.
It turned out that these two applications had very different scaling behavior and reacted very differently to pod placement strategies on the cluster. Hotel reservation had a clear bottleneck in the database components: if you run just one instance of the database, it turns out to be a bottleneck. Under best-effort QoS, the business logic was able to handle the increasing number of front-end requests, we observed.
To fix the bottleneck issue, we executed two instances of the workload with two database layers. Still, this was not the final optimization. We also had dual NICs, so we further isolated each workload instance on its own socket with careful network configuration. We were using Multus, but it was very hard-coded, so this isn't easily available today.
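For the CPU side of this kind of socket-level isolation, the kubelet's built-in policies can do the placement without hard-coding. A minimal sketch of a KubeletConfiguration, assuming you want a couple of cores held back for system daemons (the reserved CPU list is an illustrative value, not from the talk):

```yaml
# Sketch: kubelet settings for exclusive cores and NUMA-aligned placement.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static                 # give Guaranteed pods exclusive cores
topologyManagerPolicy: single-numa-node  # keep CPUs and devices on one socket
reservedSystemCPUs: "0,1"                # illustrative: cores kept for the system
```

With `single-numa-node`, the kubelet rejects placements that would split a pod's CPUs and devices across sockets, which is roughly what the manual configuration above achieved by hand.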
For Google microservices, the best-effort quality of service did not provide us any benefits. The workloads suffered under the noisy-neighbor problem, which we managed to fix by pinning the services and, again, isolating them on a separate socket. All unused cores on that socket we used for the remaining group of services, which were not sensitive to the cache-related issues.
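In stock Kubernetes, the pinning described here maps onto the Guaranteed QoS class: when the kubelet runs the static CPU manager policy, a pod whose integer CPU requests equal its limits receives exclusive cores. A minimal sketch (the image name is a placeholder):

```yaml
# Sketch: a Guaranteed-QoS pod; with cpuManagerPolicy=static it gets
# exclusive cores instead of competing with noisy neighbors.
apiVersion: v1
kind: Pod
metadata:
  name: pinned-service
spec:
  containers:
  - name: app
    image: example.com/app:latest   # placeholder image
    resources:
      requests:
        cpu: "4"        # integer CPUs, requests == limits => Guaranteed QoS
        memory: 8Gi
      limits:
        cpu: "4"
        memory: 8Gi
```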
And so this particular piece is a summary of why extending these primitives is so important: we went from 40% utilization to 78% utilization. This was true of both of them; both had similar performance, one was not better than the other, and we really want to be closer to 90%. So there's probably more we could do. Regarding the database as the bottleneck, we're not sure what other optimizations we can do to get that core utilization up.
These applications are performance-critical, and usually they are optimized for a certain placement model. HPC and AI applications apply pinning and affinity configuration mechanisms to get the maximum available compute out of the hardware. So currently, a lot of these applications are still using Slurm for fine-grained performance; that includes Intel internally.
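At the Linux level, the pinning that Slurm-style launchers apply per rank is simply an affinity mask; a small runnable sketch with standard tools:

```shell
# Pin a command to core 0 and print the kernel-side affinity mask it sees.
# HPC launchers (srun, mpirun) apply the same mechanism to each rank.
taskset -c 0 grep Cpus_allowed_list /proc/self/status
# prints: Cpus_allowed_list: 0
# For NUMA-aware placement (memory as well as CPUs), numactl is the usual
# tool, e.g.: numactl --cpunodebind=0 --membind=0 ./app
```

The kubelet's static CPU manager ultimately writes the same kind of mask via cgroup cpusets.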
Last but not least, we are also looking to measure other things. It's insufficient to analyze them on one to two nodes; we do this on four, but it's still very far from reality. The customers are running on hundreds or thousands of machines.
So what can we measure? We're starting to look at throughput, latency, and throughput under latency constraints, but we would also like to understand how these metrics behave at scale. Does latency go down if we add pods, or do we increase the throughput? What happens with the system: are you using all the available resources? Similar to before, we're not using all the compute available.
C: Hi, very nice talk. I have a multi-part question. When you're running your benchmarks, are the nodes that you're running on virtual instances, or are you running on bare metal?
A: So that's a different question; it depends on the hypervisor. We need to know more about what our users are doing: if people are using something like Mesos with virtual kubelet, that's going to look a little different than if they're doing it on a VMware-type platform. So we have to know more about what we're going to be running on. That is in the plan; we just haven't gotten there yet.
D: This kind of goes along more with your benchmarking questions. I'm putting this, I guess, to the Kubernetes community, but I was wondering what kind of infrastructure is in place for performance testing for Kubernetes? And if... you're shaking your head, but...
A: That was the problem, right: there isn't a lot in place. There's the Google microservices demo and there's DeathStarBench, and that's about the extent, as far as that sort of workload. There are people who've run LINPACK across Kubernetes, but those are very targeted benchmarks; they're not meant for real performance of your applications.
E: It's more like you create a large cluster of the nodes, right, and this is being done regularly, and then try to run fully synthetic, cloud-like loads: throw lots of containers on it, and this sort of stuff. There are also some benchmarks that test other dimensions, like a 100-node configuration where you try to, for example, apply a very significant pod density, or issue network traffic. But these are very synthetic benchmarks.
A: So that's really on the scheduling side; that's less on the node-resources side, but you still need something there for monitoring.
B: You can give it to a pod, and you can take it back, yep. And there are mechanisms in the Linux network stack to set these things up, to basically set up fair sharing between things, so you don't necessarily have to have an external component to monitor traffic. You can just set them up, similar to how you set up sharing for CPU cores.
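The Linux-side mechanism alluded to here is the kernel's queuing disciplines; a sketch of fair sharing between two traffic classes with tc (the eth0 device name and the rates are illustrative assumptions, and the commands need root):

```shell
# Sketch: split a 1gbit link fairly between two classes; either class may
# borrow up to the full link rate while the other is idle.
tc qdisc add dev eth0 root handle 1: htb default 20
tc class add dev eth0 parent 1: classid 1:10 htb rate 500mbit ceil 1gbit
tc class add dev eth0 parent 1: classid 1:20 htb rate 500mbit ceil 1gbit
```

This mirrors CPU shares: each class gets a guaranteed slice and can borrow unused capacity, with no external monitoring component involved.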