Cloud Native Computing Foundation KubeCon + CloudNativeCon Europe 2020 - Virtual, 4 Sep 2020

From YouTube: Monitoring GPUs at Scale for AI/ML and HPC Clusters - Bharti L Agrawal, NVIDIA

Description

Monitoring GPUs at Scale for AI/ML and HPC Clusters - Bharti L Agrawal, NVIDIA

At NVIDIA we have several large GPU K8s clusters for running deep learning training (AI/ML) workloads. On these clusters we need monitoring to support a range of user personas. First, we have the end users (AI/ML researchers) who want insight into how well their workloads used the GPUs and the system. Then we have the operations team, who would like to monitor the general health of the cluster and be alerted in real time to any issues. Finally, we have the stakeholders, who would like to see GPU utilization and saturation over time for capacity planning. These requirements cannot be satisfied by a standard “out of the box” setup. In this presentation we will show how we used a combination of open source tools to address our requirements. We will discuss various deployment, maintenance, security and scale challenges we hit and how we resolved them for monitoring GPU data.

https://sched.co/Zeoh

A

Hi everyone, welcome to my talk on monitoring GPUs at scale for AI/ML and HPC clusters.

A

My name is Bharti Agrawal. I'm a software architect at NVIDIA, working on the logging and monitoring stack in the NVIDIA Saturn V data center.

A

We manage the NVIDIA Saturn V data center using Kubernetes. This supports thousands of GPU servers, which support hundreds of users running their machine learning training jobs. These jobs use terabytes of data.

A

We also have stakeholders and other users with observability needs. How do we support monitoring at this scale?

A

For our agenda today, we will first look at what the Saturn V data center is. Then we will review the various users we have to support for observability. We will look at the size and the scale of the data center and the requirements that come out of that. Next, we will go over the stack and a couple of architectures of the stack that we set up to meet these requirements.

A

We will also cover some details of GPU metrics collection. We will look at the scale challenges we hit and how we solved for them. Finally, we will look at some example user views.

A

Let's look at the NVIDIA Saturn V data center. So what is the NVIDIA Saturn V data center, or NSV for short? At NVIDIA, we have a lot of users who need to run their artificial intelligence and machine learning jobs on GPU servers.

A

They can set up their own server to do this, but this limits them in the size of the job and the technology.

A

It would also add cost to maintain these nodes. In order to facilitate these users, our team set up clusters with hundreds of GPU nodes with the latest technologies for the users to deploy their jobs on.

A

We also added a CPU control plane to allow us to add a scheduler for the user jobs and to support observability.

A

On top of this, we added the NSV cloud control plane for the users to interact with. The users use the NSV API to schedule their jobs on the NSV cloud control plane.

A

The request is then sent to the CPU control plane, which forwards it on to the scheduler.

A

The scheduler sends metrics back to the CPU control plane. The users are also able to leverage the NGC registry, which contains containers, data sets and pre-trained models for them to use. This is great: users get the latest technology and are able to deploy their jobs and be guaranteed that they will have resources to run on. However, we need to ensure we can meet this guarantee.

A

Observability is the key to this guarantee. Let's look at the setup and the use cases we need to support for this.

A

Next, let's look at the users, the size and the scale. As we saw in the NSV data center, we have the Kubernetes cluster with GPU nodes and a CPU control plane. From the observability perspective, apart from the end users, we have various users accessing this system.

A

We have the admin user, who would like to get a view of the overall health of the cluster and get real-time alerts in Slack and PagerDuty. We have the machine learning and artificial intelligence end users, who would like to see the performance of their jobs in terms of the job telemetry and the application telemetry.

A

We have the stakeholders. These are the product managers, managers, architects and other org leaders who would like to get the overall occupancy and yield of the system for capacity planning purposes. Then we have the NSV developers. These are developers like me who work on the scheduler and the observability stack.

A

These users would like to be able to get a view into the health and resource usage of the components and be able to root cause issues.

A

How do these use cases map into requirements? For the end user, who wants to see how efficiently their jobs are running in terms of resource usage and their custom metrics, we need to collect the resource and application metrics and send them to the NSV cloud control plane. For the NSV admin user, who wants to monitor the overall health of the system, get real-time notifications on incidents and be able to root cause them effectively, we need to centralize the metrics and logs in real time.

A

The real-time alerts also need to be sent to Slack and PagerDuty. We need to ensure that the stack is highly available and can scale horizontally as nodes get added to the clusters. For the stakeholders, we want to get metrics aggregation at the job level and retain these for one year.

A

We also need to show yield and efficiency in a dashboard and weekly reports. Finally, the NSV developer: for them we need to collect and centralize the metrics and logs of the NSV components.

A

Our system needs to support multiple production clusters, each with up to 600 nodes, 250 of which contain 2,000 GPUs.

A

Along with this, we have several non-prod clusters where we need to support the same logging and monitoring stack. We use a CI/CD process for promotion to manage these. These clusters keep growing.

A

Let's go over the system requirements that come out of this.

A

For the functional requirements, we need to centrally collect the metrics for all levels: the cluster, the node, the Kubernetes system and jobs. At scale, we need to meet the SLA to allow users to get real-time metrics, that is to say less than one minute from node to system.

A

These metrics have to be forwarded to external systems. The growing data needs to be retained to support long-term views.

A

For the non-functional requirements, we need to consider data durability, that is to say there should be no data loss. The system should be highly available. The stack should scale as we add nodes, and it should be resilient.

A

We also need to keep security in mind when we send metrics to the external cloud.

A

Let's now look at the stack and the architecture.

A

These are the components our stack consists of. We decided to use Kubernetes for orchestration and lifecycle management of the system. The system uses the Kubernetes device plugin for GPU enablement under Kubernetes.

A

We developed the NSV scheduler, which is a Kubernetes custom controller. For the observability stack, we have the NVIDIA Data Center GPU Manager component for GPU metrics. We decided to use Prometheus Operator for metrics collection and alerting. This stack comes with Grafana for visualization.

A

This is native to Kubernetes and widely used, with a lot of community support behind it. It is highly performant in collecting metrics and writing to remote storage. To support in-cluster data persistence, we use InfluxDB. Thanos and SwiftStack object store are used for data durability and storage.

A

The Fluentd agent is used for log collection and for sending logs to the central logging service, which can be InfluxDB or Graylog.

A

We use the NGINX ingress controller for exposing the stack endpoints to the users. In some clusters, based on requirements, we are using Graylog instead of InfluxDB for logs.

A

This is how it all comes together in terms of the overall architecture. We use all open source components. On the left-hand side we can see how the metrics are centralized.

A

Node exporter is running on the compute worker nodes. These can be either GPU or CPU nodes. It exposes node- and GPU-level metrics. Applications can also expose service metrics at an endpoint. These metrics are scraped by Prometheus running in the CPU control plane at a 30-second interval.

A

These metrics are sent to InfluxDB for in-cluster persistence. From here they are forwarded to the metrics transporter via the InfluxDB subscription setup. This transporter then forwards them to the external cloud systems. Having the metrics transporter on a different node from the source node, with a separate VPC, allowed us to introduce a level of security to ensure that the data would not be hacked.

A

We also have Thanos sidecar running on the system, which sends these metrics to the SwiftStack object store for long-term storage and data durability.

A

Grafana is used to visualize these metrics for the NSV developers and the admin users. Alerts are evaluated by Prometheus and forwarded to Alertmanager, which then forwards them to Slack and PagerDuty.
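As a rough illustration of what "evaluated by Prometheus" involves, the sketch below mimics the pending-then-firing behaviour of an alerting rule with a `for` duration: the condition has to hold for a while before the alert is forwarded. The temperature metric, threshold and durations are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AlertRule:
    threshold: float   # condition is "value > threshold"
    for_seconds: int   # condition must hold this long before firing


def evaluate(rule: AlertRule, samples: List[Tuple[int, float]]) -> List[Tuple[int, str]]:
    """Return (timestamp, state) per sample: inactive, pending or firing."""
    states = []
    active_since = None
    for ts, value in samples:
        if value > rule.threshold:
            if active_since is None:
                active_since = ts  # condition just became true
            held_for = ts - active_since
            state = "firing" if held_for >= rule.for_seconds else "pending"
        else:
            active_since = None  # condition cleared, reset the timer
            state = "inactive"
        states.append((ts, state))
    return states


# Hypothetical GPU temperature samples, one per 30-second evaluation.
rule = AlertRule(threshold=85.0, for_seconds=60)
samples = [(0, 80.0), (30, 90.0), (60, 91.0), (90, 92.0), (120, 70.0)]
alert_states = evaluate(rule, samples)
for ts, state in alert_states:
    print(ts, state)
```

Only alerts that reach the firing state would be handed to Alertmanager for routing to receivers such as Slack or PagerDuty.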

A

On the right-hand side we see the flow for logs. A Fluentd agent is running on the nodes to collect the system logs and send them to InfluxDB.

A

An application can also add a Fluentd logging sidecar and send its logs to the same centralized service as well. The users will see this data in the NSV UI, and the stakeholders will use the ELK stack on the data lake to visualize the job metrics. Stakeholders also get a weekly yield and showback report that uses the data from the data lake.

A

Here is a variation of the architecture we set up for another cluster. The only change here is that we do not need to send metrics to external systems, so we were able to replace InfluxDB with Graylog for logs. This has scaled better for our loads.

A

We also introduced the Thanos receiver to give us better data durability.

A

As you can see, we were able to adapt our architecture for a slightly different set of requirements.

A

Let's take a quick look at the details of how we collect GPU metrics. In this diagram, you will see the GPU node at the bottom. We have several jobs running. The node exporter pod has a Data Center GPU Manager, or DCGM, exporter as a container that collects GPU metrics from the DCGM library running in the pod as well.

A

To get pod-level details, we have a pod exporter that talks to the kubelet to see which pod is using which GPU. With this integration we can get job-level metrics.
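A minimal sketch of that join, assuming per-GPU readings keyed by GPU index and a GPU-to-pod mapping of the kind a pod exporter could build from the kubelet. The metric name `gpu_utilization_percent`, the label names and the sample values are hypothetical; the output follows the Prometheus text exposition format.

```python
from typing import Dict, List


def job_level_lines(
    gpu_util: Dict[str, float],
    gpu_to_pod: Dict[str, Dict[str, str]],
) -> List[str]:
    """Attach pod labels to per-GPU samples and render exposition-format lines."""
    lines = []
    for gpu in sorted(gpu_util):
        labels = {"gpu": gpu}
        # GPUs with no pod assigned keep only the gpu label.
        labels.update(gpu_to_pod.get(gpu, {}))
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"gpu_utilization_percent{{{label_str}}} {gpu_util[gpu]}")
    return lines


# Hypothetical readings: GPU 0 is allocated to a training pod, GPU 1 is idle.
gpu_util = {"0": 87.0, "1": 0.0}
gpu_to_pod = {"0": {"pod": "train-job-1", "namespace": "team-a"}}

lines = job_level_lines(gpu_util, gpu_to_pod)
print("\n".join(lines))
```

With the pod label attached at collection time, a job-level view becomes an aggregation over that label rather than a separate lookup at query time.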

A

Prometheus, in the monitoring nodes, scrapes these metrics every 30 seconds. These then get sent via InfluxDB to the metrics transporter and from there to the cloud systems.

A

Some of the key GPU metrics are listed here.

A

To break the flow a bit, I want to show one example of a view that comes out of the first architecture we covered. This is a view of the job telemetry the end user will see. At the top, you can see the job runtime, the time the GPUs are active and the average GPU utilization across the 8 GPUs being used by this job.
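The three headline numbers in such a view can be derived from the scraped samples. The sketch below is one possible way to do it, assuming one utilization sample per GPU per 30-second scrape and treating a scrape as "active" when any GPU is above a threshold. That definition of active, and the sample data, are assumptions made for illustration.

```python
from statistics import mean
from typing import Dict, List

SCRAPE_INTERVAL_S = 30  # scrape interval mentioned in the talk


def summarize(samples: Dict[str, List[float]], active_threshold: float = 0.0) -> Dict[str, float]:
    """samples maps GPU id -> utilization percentages, one value per scrape."""
    n_scrapes = len(next(iter(samples.values())))
    # A scrape counts as active if any GPU in the job is above the threshold.
    active_scrapes = sum(
        1
        for i in range(n_scrapes)
        if any(series[i] > active_threshold for series in samples.values())
    )
    return {
        "runtime_s": n_scrapes * SCRAPE_INTERVAL_S,
        "gpu_active_s": active_scrapes * SCRAPE_INTERVAL_S,
        "avg_gpu_util": mean(mean(series) for series in samples.values()),
    }


# Hypothetical two-GPU job; the first scrape catches the job before it starts computing.
job_samples = {
    "0": [0.0, 80.0, 90.0, 70.0],
    "1": [0.0, 60.0, 100.0, 80.0],
}
summary = summarize(job_samples)
print(summary)
```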

A

Below that you can see a line graph for the average GPU metrics over time, like GPU active, tensor core active, GPU memory and GPU power. In this job, you can see that the GPU is being kept busy, with tensor core at 25%.

A

The overlay shows the breakdown of the metrics at any point in time. You can see details like the PCIe read/write bandwidth values, the NVLink bandwidth values, and the CPU and memory usage.

A

The heat map below shows GPU utilization and tensor core utilization for each GPU.

A

Let's now look at the scale challenges.

A

As you saw in the architecture diagram, we used a lot of CNCF components. These are great, as there are many solutions that exist, and independently they solve for a lot of our functional requirements. They're all cloud native technologies and have strong community support behind them.

A

This allows them to continuously evolve with the growing needs of the community. This has been a lifesaver for us: many a time, just an upgrade of the stack got us the resolution for an issue and improved the performance.

A

All the components are also easily configurable via Helm, using the stable Helm charts. However, there was a lot of trial and error involved in finding the right configuration for our requirements.

A

The community has many references for the integrations and we were able to leverage this. We took a lot of the best practices and adapted them for our loads.

A

We had to tune for size, high availability, scalability and data durability. Not only did we have to optimize the configurations, but we needed to get a good understanding of how to adapt them, as the clusters grew.

A

Let's now look at how we optimized the components for our scale. For Prometheus and other components, we had to understand the resource requests and limits that would work well for our loads, with room to grow. To size each component well, we had to understand how it used the resources. Prometheus, for example, keeps two hours of data in memory until it is compacted to disk. Prometheus size is dependent on the total number of series it is supporting. You can see some of our load numbers here.

A

At restart it also loads WAL files which, depending on the load, can be large. This impacts the restart time of Prometheus, as well as the memory usage. We did rigorous load tests to get a baseline for the size of each component. This helped us avoid frequent failures in production.
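A back-of-the-envelope sketch of why the series count drives memory: with a 30-second scrape interval and a two-hour in-memory window, every active series holds 240 samples. The series count, bytes per sample and per-series overhead below are placeholder assumptions, not measured values, and a load test remains the way to get real numbers.

```python
SCRAPE_INTERVAL_S = 30          # scrape interval mentioned in the talk
HEAD_WINDOW_S = 2 * 60 * 60     # two hours of data kept in memory


def head_samples(active_series: int) -> int:
    """Number of samples held in memory for the two-hour window."""
    return active_series * (HEAD_WINDOW_S // SCRAPE_INTERVAL_S)


def estimated_head_bytes(
    active_series: int,
    bytes_per_sample: float,
    overhead_per_series: int,
) -> float:
    """Rough estimate: sample storage plus fixed per-series cost (labels, index)."""
    return head_samples(active_series) * bytes_per_sample + active_series * overhead_per_series


# Hypothetical inputs: 1 million series, 2 bytes per compressed sample,
# 4 KiB of label and index overhead per series.
series = 1_000_000
samples_in_memory = head_samples(series)
estimate = estimated_head_bytes(series, bytes_per_sample=2.0, overhead_per_series=4096)
print(samples_in_memory, estimate)
```

Under these assumptions the per-series overhead dominates, which is consistent with sizing by series count rather than by raw sample volume.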

A

However, due to the unpredictability of production environments, we kept a close eye on the systems in production to ensure that we did not hit out-of-memory or CPU throttling errors.

A

We do regular evaluations of our system to ensure that the sizing is working well. As we add nodes to the cluster, we review these settings. To make the system highly available, we had to learn what load could be supported by Prometheus with acceptable restart times versus where we needed to add another replica. For scaling Prometheus, we rely on sizing of the instances from our load tests.

A

So far, we've been able to manage each cluster with two instances with tuning on the resources.

A

There were some trickier challenges we had. The first of these was data persistence. We needed the metrics to always be available for viewing. For this, we needed to add in-cluster persistence with replication.

A

We used InfluxDB with remote read and write to cover this. This worked well up to a certain load, after which we started seeing data gaps. To resolve this, we had to look into the code of Prometheus and the downstream components. We found that Prometheus would drop metrics when its buffer was full or if the downstream components were not accepting metrics.

A

So there were several points of failure. With the 2.8 version of Prometheus, where Prometheus introduced the WAL for remote write, this occurrence was reduced, but prior to that we had to optimize the downstream components as well as the buffer of Prometheus so metrics did not get dropped.
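The failure mode described here can be reproduced with a toy bounded buffer: when the producer outpaces the consumer, the buffer fills and new samples are discarded. This is a simplified simulation with made-up rates, not the actual Prometheus implementation.

```python
from collections import deque
from typing import Tuple


class BoundedSendBuffer:
    """In-memory send buffer that discards new samples once it is full."""

    def __init__(self, capacity: int) -> None:
        self.queue = deque()
        self.capacity = capacity
        self.dropped = 0

    def enqueue(self, sample) -> bool:
        if len(self.queue) >= self.capacity:
            self.dropped += 1  # nothing to fall back on, the sample is lost
            return False
        self.queue.append(sample)
        return True

    def drain(self, n: int) -> list:
        """Remove up to n samples, as a downstream write would."""
        sent = []
        while self.queue and len(sent) < n:
            sent.append(self.queue.popleft())
        return sent


def simulate(capacity: int, produced_per_tick: int, drained_per_tick: int, ticks: int) -> Tuple[int, int, int]:
    """Return (sent, dropped, still_buffered) after the given number of ticks."""
    buf = BoundedSendBuffer(capacity)
    sent = 0
    for tick in range(ticks):
        for i in range(produced_per_tick):
            buf.enqueue((tick, i))
        sent += len(buf.drain(drained_per_tick))
    return sent, buf.dropped, len(buf.queue)


slow_downstream = simulate(capacity=10, produced_per_tick=8, drained_per_tick=5, ticks=4)
healthy_downstream = simulate(capacity=10, produced_per_tick=5, drained_per_tick=5, ticks=4)
print(slow_downstream, healthy_downstream)
```

A write-ahead log changes the picture because unsent samples can be re-read from disk instead of being discarded, which fits the reduction in drops reported after the upgrade.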

A

I will go into some details of the downstream components in the next few slides. For data durability, we had to introduce Thanos sidecar in one instance and Thanos receiver in another. Thanos sidecar worked well for us, but Thanos receiver was new and we hit a lot of challenges with it.

A

With the updated version of Prometheus, we had to tune the max samples, max shards and capacity settings based on the sample size of the cluster. Load tests helped us get to the optimal setting for this.
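These settings appear to correspond to the `max_samples_per_send`, `max_shards` and `capacity` fields of the Prometheus remote write `queue_config`. A simplified way to reason about them, assuming each shard sends one batch at a time and waits for the response, is sketched below. The series count and send latency are hypothetical, and real throughput also depends on retries and backoff.

```python
import math


def max_throughput(max_shards: int, max_samples_per_send: int, send_latency_s: float) -> float:
    """Upper bound on samples per second if every shard sends batches back to back."""
    return max_shards * max_samples_per_send / send_latency_s


def shards_needed(ingest_rate: float, max_samples_per_send: int, send_latency_s: float) -> int:
    """Smallest shard count whose upper bound covers the ingest rate."""
    per_shard = max_samples_per_send / send_latency_s
    return math.ceil(ingest_rate / per_shard)


# Hypothetical load: 1 million active series scraped every 30 seconds.
ingest_rate = 1_000_000 / 30
needed = shards_needed(ingest_rate, max_samples_per_send=500, send_latency_s=0.25)
ceiling = max_throughput(max_shards=20, max_samples_per_send=500, send_latency_s=0.25)
print(needed, ceiling)
```

If the ceiling sits below the ingest rate, the queue backs up no matter how large the buffer capacity is, which is one reason a load test is needed to settle on values.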

A

The InfluxDB tuning for sizing, HA and scalability was similar to Prometheus. InfluxDB keeps a lot of data in memory, and restart times are long, as the container needs to read the WAL files. We also found that InfluxDB did not scale well for Prometheus data, which has a lot of labels that need to be indexed.

A

Sizing had to consider this. Due to long restart times, for HA we needed to add replicas. We added InfluxDB Relay for replication. However, InfluxDB Relay is not supported anymore, so we had to support it ourselves for our setup.

A

This also helped for the data persistence requirements. Sizing of the resources helped us support the scale. However, InfluxDB did not scale to our minimum requirement of two weeks of retention.

A

We could only support three days of retention with confidence. For getting longer retention, we introduced Thanos sidecar in one setup and moved away from InfluxDB to Thanos receiver in another.

A

As I mentioned, we had added InfluxDB in our architecture for data persistence in the cluster. Since this is a Kubernetes cluster, pods can restart for various reasons and land on new nodes. To persist data for InfluxDB, we are using hostPath persistent volumes. With pod restarts, we could easily have pods land on different nodes and lose data.

A

Our HA solution of in-cluster replication helped here. We had to ensure that the data we are sending to external systems had no gaps as well. This turned out to be a very hard issue to debug and solve for, and needed detailed investigation into the component code to understand how best to solve it. The solution involved updates to the code of the services.

A

Thanos, which we use for data durability, comes with many components.

A

Sizing for Thanos involved understanding each of these and how best to optimize them for our loads. For example, for Thanos Store we needed to set the max cache size. For Thanos receiver, we discovered that setting the TSDB min block and max block durations to 15 minutes would reduce the WAL size, which allowed us to reduce the restart times and keep Thanos receiver from using increasing amounts of memory.

A

HA was easily supported by Thanos by optimizing the replicas for the components as needed. For scale in Thanos receiver, we had to move to the gRPC version. Thanos helped us solve the issue of data durability and data persistence.

A

Data in a Kubernetes cluster is ephemeral. Thanos sidecar sends data to an object store. This component is fairly robust and gave us data durability for data persisted to disk. In Prometheus, this happens every two hours, and we use InfluxDB for the immediate time window.

A

This worked well. However, in the variant architecture that I showed, where we used Thanos receiver, we had a lot of issues with load.

A

We had to delve into the code and figure out the root cause and then see how best to solve for it.

A

Luckily, we were able to upgrade the Prometheus and Thanos versions and tune the configurations to solve for our loads. We had to do rigorous load tests to set the max shards, max samples and capacity for Prometheus remote write for our loads. Downsampling allowed us to support longer data retention and querying.
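To show why downsampling helps with long retention, here is a minimal sketch that collapses 30-second samples into fixed windows and keeps count, sum, min and max per window, similar in spirit to the aggregates a downsampling store keeps. The 5-minute window and the sample values are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def downsample(samples: List[Tuple[int, float]], window_s: int) -> List[Dict[str, float]]:
    """Group (timestamp, value) samples into fixed windows and keep aggregates."""
    buckets: Dict[int, List[float]] = {}
    for ts, value in samples:
        window_start = ts - ts % window_s
        buckets.setdefault(window_start, []).append(value)
    return [
        {
            "start": start,
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
        }
        for start, values in sorted(buckets.items())
    ]


# Ten minutes of hypothetical 30-second samples with values 0.0 through 19.0.
raw = [(t, float(t // 30)) for t in range(0, 600, 30)]
five_minute = downsample(raw, window_s=300)
print(len(raw), "raw samples ->", len(five_minute), "windows")
```

Keeping sum and count allows averages to be recomputed over any larger window, while min and max preserve the spikes that a plain average would hide.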

A

In some of our deployments, we use Graylog for logging. Graylog uses Elasticsearch and MongoDB underneath. Each had to be separately understood and tuned. Load tests helped us, but we also have a lot of battle scars. For sizing, we had to look at how to tune each component.

A

These are written in Java, so we had to optimize the heap size and the container sizes. We also learned that we had to set a CPU limit, as without it Java would default to just one CPU.

A

For high availability, initially we added replicas. Though Elasticsearch and MongoDB did fairly well with the default replicas, Graylog performance seemed to do better with more replicas. As our loads grew, we quickly hit scale issues with Graylog.

A

Just adding replicas did not help. We had to look at sizing as well. We learned there was a recommended heap size for Elasticsearch that made it perform optimally. For data persistence, we added a persistent volume for the Graylog journal. Data written to Elasticsearch was already well managed. Graylog and Elasticsearch handle data durability.
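The talk does not say which heap recommendation was followed. A commonly cited Elasticsearch guideline is to give the JVM heap no more than half of the available memory and to stay below the compressed object pointer threshold of roughly 32 GB. The helper below encodes that guideline as an assumption, with a conservative 31 GiB cap and hypothetical container sizes.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def recommended_heap_bytes(container_memory_bytes: int, max_heap_bytes: int = 31 * GIB) -> int:
    """Half of the container memory, capped below the compressed-oops threshold."""
    return min(container_memory_bytes // 2, max_heap_bytes)


def jvm_heap_flags(container_memory_bytes: int) -> str:
    """Render matching -Xms/-Xmx flags so the heap is fixed at the chosen size."""
    heap_mib = recommended_heap_bytes(container_memory_bytes) // MIB
    return f"-Xms{heap_mib}m -Xmx{heap_mib}m"


small = jvm_heap_flags(16 * GIB)    # hypothetical 16 GiB container
large = jvm_heap_flags(128 * GIB)   # hypothetical 128 GiB container
print(small)
print(large)
```

The other half of the container memory is left for the filesystem cache and off-heap use, so the container limit has to be sized together with the heap.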

A

Let's now look at some user views.

A

Here is an example of the NSV admin view. This is a capacity dashboard.

A

It shows the cluster capacity, with total nodes in the cluster and total number of nodes cordoned on the top, and below that it shows nodes cordoned by node type. On the right, it shows the availability of the different node types. With this, the admin users can get a good view into the cluster capacity.

A

This view shows us the NSV developer's view for the job controller. The pie charts at the top show counts of jobs in the different states, and below that we see the timelines for jobs in various states: running, queued, task lost and failed.

A

This gives the developers a good view of the current state of the job controller.

A

This is a stakeholder's view. The pie chart at the top left shows the jobs in different states at the moment. Then we have timelines of jobs running by different aspects: by time, by GPUs allocated, running duration, team, registry, etc.

A

And finally, we have the yield view. It shows yield by jobs (the green bars) or GPU hours (the blue line). You can see the dips in both when we have issues in our cluster. For example, when Quay.io was down, users could not download their job containers. This, in turn, caused our yield to drop.

A

Thank you for attending. I will now take questions.

B

Can you hear me?

A

Everyone, okay, so we have a few questions here.

A

Are the Grafana dashboards available anywhere for folks to look at? The ones I've shared here were done by us; they're custom dashboards. But if you download the Prometheus Operator Helm chart, it comes with some predefined dashboards that you'll get out of the box, so you should be able to install it and look at it.

A

Can we use the NSV scheduler? Is it open source? Unfortunately not. It is an internal scheduler that we are using within NVIDIA. Sorry.

A

How is the GPU usage calculated for different applications running in K8s so that their SLAs are satisfied? Basically, we are monitoring the actual utilization of the GPU. We have several profiling metrics that we grab: the percentage active for the GPU and the percentage active of the tensor core. While the job is running, we are monitoring it almost every second, but we're collecting the metric at a 30-second rate, so engineers can see what rate they're using the GPU at and optimize it.

A

So it's really up to the engineers to do the tuning. We help guide them: some of our senior ML engineers will help guide them on how to tune it better, but the application is really the one that needs to satisfy its own SLAs.

B

Let's see what else.

B

Okay, let's see if I have any more.

B

Okay, we'll wait a little bit longer and see if any more questions come.

B

Okay, so let's see there are some medium level ones that are coming in.

A

So for the slides, you can download them from the UI, I think. Here's the response that they have: click the handouts widget below to download the slide deck.

B

So that's that.

B

See, oh, would you mind sharing, okay.

B

And then you have, is GPU, okay, hold on, there are questions coming in.

A

How about if an anomaly happens and causes a job to use the GPU more than the normal level? I think that's more about just GPU usage.

B

I lost the question just a second.

A

I mean, that's really up to the engineers to resolve those types of things. We are just about setting up the scheduler to allow them to run their jobs. They can get the resources, but optimizing and tuning the application is up to them.

A

So I guess what you're asking is whether there's an infrastructure anomaly that potentially causes them to use the GPU more than the normal level.

A

I am not really clear on what that anomaly would look like. The issues that might happen at the infrastructure level would be that they may not be able to mount their cf volumes or their data sources, and that is when we would step in and try to address it. But because they're in a pod, in a container, usually they have to tell us exactly how many GPUs they want, so they're restricted to that number of GPUs. So they can't really use more than they're asking for.

A

I hope that answers your question. I'm sorry if it doesn't.

B

Answered it very well. Okay, so this is...

A

Why InfluxDB? Wasn't Thanos enough to satisfy long-term storage? At the time that we did the first install, Thanos was not quite there yet. It had just started, like six months after we did the first implementation.

A

So at that time most of the community was using InfluxDB for long-term retention from Prometheus. That's why we started with InfluxDB, and we did see scaling problems with it. So our second iteration, the second architecture that I shared with you, has Thanos in it. It doesn't have InfluxDB. We quickly upgraded to Thanos as soon as it was ready.

A

Is GPU sharing taken into account in your dashboards? GPU sharing is not yet taken into account in the dashboards. We are just talking about GPU affinity in our scheduler. We haven't really talked about GPU sharing as such. I assume you mean between multiple jobs by a team of users.

A

I think in the next upgrade we'll be introducing some of those capabilities, but that's not in place yet. Once it's there, I'm sure the dashboards will be there for that. They're not there right now.

A

Would it be possible to go granular, that is, to schedule to GPU cores and monitor many users and how they're using these cores? How about scheduling? Yeah, so we are trying to go more granular by splitting the GPU and scheduling at the GPU level.

A

The way the design is, and I think this is public, we're going to make it part of the upstream infra architecture for GPU sharing, which is to split the GPU.

A

Then the jobs will define how many of those splits they're using, and the scheduler will also be supporting that. So you should be able to do it at that level, and once that is in place, we'll be able to monitor it as well.

A

So I hope that answers all the bits of your question.

B

Thank you all for joining.

A

I think we'll be wrapping it up now.