Description
NVIDIA GPU Operator: Simplifying GPU Management in Kubernetes
https://github.com/NVIDIA/gpu-operator/blob/master/README.md
Kevin Jones (NVIDIA)
OpenShift Commons Gathering on Data Science
January 28, 2021
https://commons.openshift.org/gatherings/OpenShift_Commons_Gathering_on_Data_Science.html
To find out more about OpenShift Commons, please visit: https://commons.openshift.org
Hello, everyone. My name is Kevin Jones, and I am a product manager at NVIDIA. Today I want to walk through the GPU and Network Operators with you: a quick fifteen-minute news flash on the current state of affairs, where they are today, how they are constructed, and how you can use them on top of your Kubernetes platforms to take advantage of the hardware acceleration of GPUs and SmartNICs.
Today you have the CUDA libraries, drivers, and SDKs that are available, and you also see NVIDIA working on different frameworks to make smart systems more accessible and easier to develop applications for. So we have things like Metropolis for smart cities and retail, Clara for healthcare, Isaac for robotics, and Aerial for the telco work that is being done on 5G today.
You may also have noticed that there are different product lines at NVIDIA with a GX moniker at the end, and I'll explain those really quickly. The AGX line is about small-scale, system-on-chip embedded systems for manufacturing and robotics. The DGX line is about scale-up; remember, EGX is about scale-out and DGX is about scale-up systems. A lot of our recent supercomputer work, with the DGX SuperPODs, is what those systems are being built for.
The HGX line is about cloud service providers, allowing them to provide instances with NVIDIA GPUs in them to their end users, and EGX is the scale-out platform that I've been discussing here today.

Now I'd like to switch gears and start talking about the operators, but before we dig into each of the NVIDIA operators, I really want to touch on the Operator Framework. A while back, Red Hat acquired a company named CoreOS, and CoreOS had made many contributions to the Kubernetes ecosystem.
One of the major contributions they made was the Operator Framework, and Red Hat has continued the work on this really capable framework for deploying Kubernetes-native applications. It's a pattern for how you accomplish this, and you can write your operators in different ways as well. They can be Helm-based, they can be Ansible-based, and they can even be Go-based if you get really complex. Each of those layers ranges from simplicity in how you build them all the way up to very complex application capabilities, and the Operator Framework allows you to do deployment; it allows you to do lifecycle management as well.
The next piece is the drivers themselves. This is the component most administrators are aware of, whether they have done containerization with NVIDIA GPU drivers or run them on a standard host. The driver is really the most critical component for exposing all the capabilities of your underlying GPU to your application layer, and the goal here is to simplify provisioning of the NVIDIA driver with our operator.
Here the application is making a request for one NVIDIA GPU, and that's what Kubernetes will expose to the application.
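As a minimal sketch of what that request looks like (the pod name, image tag, and command here are illustrative, not taken from the talk), the application asks for a GPU through the nvidia.com/gpu extended resource in its pod spec:

```yaml
# Hypothetical pod spec: the container requests one GPU via the nvidia.com/gpu
# extended resource that the NVIDIA device plugin advertises to Kubernetes.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-sample
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-sample
      image: nvcr.io/nvidia/cuda:11.0-base   # illustrative CUDA base image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # schedule onto a node with one free GPU
```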
I also mentioned our Data Center GPU Manager (DCGM) exporter, which enables us to take the telemetry that's being fed back from the GPUs and expose it to Prometheus. That gives the administrators of the cluster visibility into their GPU telemetry, much the same way they're getting visibility into the CPUs that are running in the cluster. So this is a really great way to take advantage of the native tooling that's been adopted in Kubernetes and feed it with our GPU telemetry.
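As an illustrative aside (not shown in the talk), the DCGM exporter serves metrics such as DCGM_FI_DEV_GPU_UTIL over HTTP; in an operator-managed cluster the Prometheus wiring is typically handled for you, but a hand-written scrape job would look roughly like this, with the service name, namespace, and port all assumptions:

```yaml
# Sketch of a Prometheus scrape configuration for the DCGM exporter.
# Service name, namespace, and port are assumptions; 9400 is the exporter's
# usual default, but check your deployment.
scrape_configs:
  - job_name: nvidia-dcgm-exporter
    static_configs:
      - targets: ["nvidia-dcgm-exporter.gpu-operator.svc:9400"]
```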
So, with all of these core components put together and simplified in the operator, we have a great way to expose our GPUs and get the best out of them in our Kubernetes clusters.

Next, let's talk about the roadmap for the GPU Operator. We're working on a number of new features that I've listed on the slide here: things like upgrade management, where we handle driver and kernel updates and handle node reboots if we need to. Those things are improving with each release of the GPU Operator. We're also working on disconnected and air-gapped installations, whether you're restricted by a proxy when reaching certain resources or you have no internet connection at all and have to pull things from custom registries. Those types of environments are very difficult to work in, and we're trying to make it easier to use the GPU Operator there.
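As a hedged sketch of what pointing the operator at a private registry can look like (the key names and registry host below are illustrative and vary between chart versions; they are not from the talk), a Helm values override might resemble:

```yaml
# Hypothetical values override for an air-gapped install: mirror the component
# images into a private registry and point each component at it. Key names and
# the registry host are illustrative; consult the chart's values.yaml.
operator:
  repository: registry.example.internal/nvidia
driver:
  repository: registry.example.internal/nvidia
toolkit:
  repository: registry.example.internal/nvidia
```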
We're also working on security capabilities, like more granular RBAC controls via roles and role bindings for the GPU Operator. NVIDIA is very conscious of security capabilities within our hardware and software stack, and we want to make sure that customers are getting the best and most secure capabilities they can.
There is a getting started document that I've linked to here, and you can also reach out to us with any questions you have.

Now I want to switch gears and start talking about our Network Operator. If you were not aware, NVIDIA acquired a company named Mellanox. In supercomputing, Mellanox has obviously made a very good name for itself, and we want to continue that trend with Mellanox as NVIDIA's networking business unit. So with the Network Operator, we've taken a very similar approach to the one we took with the GPU Operator.
We wanted to simplify the deployment experience so that complex network deployment tasks are taken care of for you by the Network Operator. They're portable across different Kubernetes platforms, and it's a consistent deployment across those platforms. We also wanted to give you some operational efficiency gains, because we are now managing the network at the cluster level rather than at the level of individual systems. The operator itself starts to look at this as a cluster capability rather than as individual system units that have to be configured, and, lastly, we want to put network automation and administration on autopilot.
So how did we do this? When we started, you had the legacy approach, where the Mellanox OFED driver and the NVIDIA peer memory driver were configured on a Linux system by hand or by automation scripting. We then containerized them in the same way we did the device plugin for GPUs: we have a containerized Kubernetes RDMA shared device plugin, and Multus is what really gives us the ability to attach secondary networks to pods.
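For illustration (the network name, CNI type, parent interface, and subnet below are assumptions, not details from the talk), a Multus secondary network is declared with a NetworkAttachmentDefinition such as:

```yaml
# Hypothetical Multus NetworkAttachmentDefinition for a secondary network.
# The macvlan parent interface and subnet are placeholders for the sketch.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-net
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens2f0",
      "ipam": { "type": "host-local", "subnet": "192.168.100.0/24" }
    }
```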
So let's talk about the individual pieces themselves. The OFED driver container is what loads the Mellanox OFED driver into the kernel: it is pre-built for the distribution and the kernel running on the host, and we deploy it onto the nodes based on node labels.
In both the GPU Operator and the Network Operator there are Node Feature Discovery capabilities that go out and label each of the hosts with the hardware they have, as sketched below.
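For illustration (the node name is hypothetical, and the exact label keys depend on how Node Feature Discovery is configured), the PCI-based labels look roughly like this, where 10de and 15b3 are the NVIDIA and Mellanox PCI vendor IDs:

```yaml
# Illustrative node labels applied by Node Feature Discovery; the operators
# key their driver DaemonSets off labels like these.
apiVersion: v1
kind: Node
metadata:
  name: worker-0
  labels:
    feature.node.kubernetes.io/pci-10de.present: "true"   # NVIDIA GPU detected
    feature.node.kubernetes.io/pci-15b3.present: "true"   # Mellanox NIC detected
```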
We expose the container root filesystem to the host to allow kernel module compilation against updated headers, and then we load the kernel RDMA stack and the Mellanox driver stack on container start and unload them on container stop. The RDMA shared device plugin is how you can run RDMA workloads in Kubernetes.
The shared device plugin is the way we let pods perform RDMA, exposing RDMA device files to the container in a shared manner, and you can see in our example that the pod is requesting a single RDMA device and limiting itself to that one RDMA device through its pod spec.
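A minimal sketch of such a pod spec follows; the resource name is defined by the RDMA shared device plugin's configuration, so rdma/rdma_shared_device_a, the pod name, and the image are illustrative rather than quoted from the talk:

```yaml
# Hypothetical pod spec requesting one shared RDMA device.
apiVersion: v1
kind: Pod
metadata:
  name: rdma-test-pod
spec:
  containers:
    - name: rdma-app
      image: mellanox/rping-test   # placeholder RDMA test image
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]        # RDMA verbs typically need locked memory
      resources:
        requests:
          rdma/rdma_shared_device_a: 1
        limits:
          rdma/rdma_shared_device_a: 1
```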
The next piece is the NVIDIA peer memory driver container, which compiles and loads our peer memory client driver into the kernel itself: it loads the nv_peer_mem module into the kernel and unloads it when the container exits.
This is a really great capability for us to be able to do all of this for the administrators, all automated via the Operator Lifecycle Manager.
So where are we at today? As of December, we're looking at Helm deployment; we're using Node Feature Discovery (that's the NFD you see there) to label RDMA-capable nodes; and we're also working on secondary network deployment, so you actually have a secondary network that is RDMA-configured that we can take advantage of in the cluster itself.
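As a rough sketch of what that Helm deployment can look like (these value keys are assumptions based on the chart layout at the time and may have changed; check the chart's values.yaml for your release), you enable the pieces we just walked through:

```yaml
# Hypothetical Helm values for the network operator: deploy NFD for node
# labeling, the Mellanox OFED driver container, the NVIDIA peer memory driver
# container, and the RDMA shared device plugin.
nfd:
  enabled: true
ofedDriver:
  deploy: true
nvPeerDriver:
  deploy: true
rdmaSharedDevicePlugin:
  deploy: true
```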
So that's our fifteen-minute quick news flash on both the GPU and Network Operators coming from NVIDIA. I really appreciate your time today. I hope this information is useful for you and that you explore the GPU Operator and the Network Operator. Feel free to let us know if you have any issues or any feature requests that you want to add. You can find us in the upstream community working on these operator code bases, and we really look forward to hearing from you.