Description
With Artificial Intelligence and Machine Learning workloads expected to drive demand for the next wave of applications in the data center, Graphics Processing Unit (GPU) deployments using containers within an enterprise Linux environment will become more common. You will want to attend this session if you are a system administrator or developer, or are just looking to better understand how Red Hat is partnering with NVIDIA to improve the developer experience, support containerized applications, and enable GPU- and vGPU-accelerated workloads to run on Red Hat Enterprise Linux in both bare-metal and hybrid cloud environments.
A: Well, welcome everyone to "NVIDIA and Red Hat: Partners in AI/ML and Other GPU-Enabled Deployments." My name is Andre Beausoleil, and I'm a senior principal partner manager with Red Hat. Co-presenting with me today is Duncan Poole. I'll let Duncan introduce himself; he will be hosting the first few slides, and then I will take over. Thank you.
B: So the agenda is basically for me to outline the goals of the partnership; talk about what NVIDIA brings to bear in its tools and ecosystem; talk a little bit about our container efforts and container strategy, and the open source projects that we have going on that we support and work on with Red Hat. Then Andre will pick up, and he'll talk about some very specific components to this that Red Hat has been putting together.
B: For us that means, among other things, trying to simplify the install process, because if you went back about five years and tried to install on RHEL, you would probably churn through a few different versions before you figured out the magic incantation that lets you install it. Also, back in the day you used to do this using a .run file, and now NVIDIA has built RPM repos for our offerings to make it a little easier.
B: So, together, for each card we release, for each RHEL version, for each CUDA distribution we release, we have to do a sort of synchronized OS qualification, so you can imagine that this is quite the cadence of meetings. There's still a lot of room for improvement beyond that in the container space, where you really want to wrap up an application and deploy it in a commercial environment; it's been important to get more aligned with Red Hat on how we do that. So we'll talk about that, and about the other open source projects we've got.
B: So what's new? From NVIDIA, what's new is that if you were to go and try to download our driver and our CUDA, you can now use the RPM repo. We also have, within our free CUDA downloads, several new host compiler options; LLVM is directly supported. Now, if you've never used CUDA (and I'm going to guess: has anyone here used CUDA? Okay), CUDA is the compiler toolkit that lets you write code that runs on the GPU.
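To make that concrete, here is a minimal sketch of the kind of program that toolkit compiles; the kernel name scale and the array sizes are purely illustrative, and it builds with nvcc from the free CUDA downloads mentioned above:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each GPU thread scales one element of the array in parallel.
__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1024;
    float host[1024];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float *dev;
    cudaMalloc(&dev, n * sizeof(float));                      // allocate GPU memory
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);            // launch on the GPU
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[0] = %f\n", host[0]);                        // prints 2.000000
    return 0;
}
```

The <<<blocks, threads>>> launch syntax is the CUDA-specific part; everything else is ordinary C++, which is why the choice of host compiler, such as LLVM, matters here.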
B: So you get all these options. To go on: we're going to talk about Kubernetes on GPUs, and basically the trick here is that in a containerized environment you want to teach Kubernetes how many GPUs you have. Are they busy or not? Are they near a CPU? CPU affinity becomes important. All of that is a contribution that NVIDIA made, called the NVIDIA device plugin for Kubernetes.
B: Then we're going to have Andre talk about the KVM improvements that we've been doing together in GRID, and a little bit more about open source collaboration. So if you want to look at the breadth of what NVIDIA is in this market space: we have examples here of some very high-performance computational chemistry, very long-running apps that are well tuned for running on GPUs, and these are all now containerized. So you can go out and launch those apps with a container.
B: So you don't have to think so hard about compiling it, or about what the library dependencies are, and so on, to make it run. We also have intimate relationships with all the various framework developers, and if you're interested in detail on this, we have an AI learning environment online. So you can go off, pick up a framework, and run your own self-paced learning for this.
B: So the frameworks are great, and you can quickly become kind of an AI expert. But if you want to go down the path of traditional programming, we provide libraries that are already ported to use GPUs. So if you're familiar with LAPACK or any of the standard math-heavy libraries, they're directly available there; you don't actually have to port anything, you just call them from your CPU code.
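As a sketch of that "just call them" idea: cuBLAS provides the familiar BLAS routines on the GPU (the LAPACK-style solvers live in cuSOLVER), and the call site looks like any CPU BLAS call; only the buffers live on the device. The names and sizes below are illustrative; link with -lcublas:

```cuda
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// SAXPY (y = alpha*x + y) run on the GPU via cuBLAS,
// called from ordinary CPU code; no kernel is written by hand.
int main() {
    const int n = 4;
    float hx[n] = {1, 2, 3, 4}, hy[n] = {0, 0, 0, 0};

    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 2.0f;
    cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);  // y = 2*x + y on the GPU
    cublasDestroy(handle);

    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y = %f %f %f %f\n", hy[0], hy[1], hy[2], hy[3]);  // 2 4 6 8
    cudaFree(dx); cudaFree(dy);
    return 0;
}
```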
B: You don't have to write any GPU code at all. Or you can dig down and use one of a series of standards like Python or Thrust, and these are also all GPU-aware. And then finally, we have our own specialized language tools: a free Fortran and free C compiler that can be used; it works on the CPU, but it also works on the GPU, and you can compile your code directly to it. Okay.
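A short sketch of the Thrust style just mentioned; Thrust ships with the CUDA toolkit, and its STL-like algorithms run on the GPU (the values here are illustrative):

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>

int main() {
    thrust::device_vector<int> v(1000);               // vector in GPU memory
    thrust::sequence(v.begin(), v.end());             // fill with 0..999 on the device
    int sum = thrust::reduce(v.begin(), v.end(), 0);  // parallel reduction on the GPU
    printf("sum = %d\n", sum);                        // prints 499500
    return 0;
}
```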
B: So, with Red Hat on RHEL, we now provide a faster release cadence for CUDA, and this, in part, is because we have the RPM distribution model working now. We're going to put out four releases of CUDA per year, and then, as I was saying, the math libraries, especially around AI, are moving even faster. So you can pick up the improvements that we have for each new GPU that comes out from NVIDIA directly, every month. And because that sounds like a nightmare for a developer, the tooling comes along with it.
B: So you see memory checkers, visual debuggers, visual profilers, a series of libraries of various useful kinds for AI and for math, and then finally your standard languages. Okay, on the Kubernetes alignment front: we recently had our own developer conference, where Jensen Huang, our CEO, demonstrated a containerized app running on the show floor and then failing over to run on the Amazon cloud. So the fault-tolerant aspect of our containerization strategy is starting to come into play; that was just to showcase the robustness feature here.

B: And actually, on NVIDIA's developer site, if you go register as a developer, you can download our pre-built containers that include all these various frameworks, already tuned up to run on our devices. That's the whole point of it. Okay, and the benefits of containers are obviously a stable environment for install dependencies: not having to resolve, between the developer and the user, what they actually are running on. It just simplifies the whole process, and as any one of us probably knows, getting the install dependencies right on Linux is sometimes a bit of a chasing-your-tail game.
B: There's also OpenMP, and that library is actually a common runtime library for people who want to use directives. A directive is basically a comment that goes around a block of code saying "run this in parallel." OpenMP is the classic example of this and has been doing it for years; OpenACC is another one, and OpenACC is actually being implemented in GCC right now, so there's a fun little cooperation going on there.
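A sketch of such a directive, using OpenACC: the pragma is exactly the "comment around a block of code" described above, and a compiler built without OpenACC support simply ignores it and runs the loop serially (build with, for example, pgcc -acc or gcc -fopenacc):

```c
#include <stdio.h>

#define N 1000000

static float x[N], y[N];

int main(void) {
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    /* The directive: "run this loop in parallel", offloading the work
     * to the GPU when OpenACC support is enabled in the compiler. */
    #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i < N; ++i)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);   /* prints 4.000000 */
    return 0;
}
```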
B: So what is heterogeneous memory management? That's the ability for malloc to work on memory that sits on a card, and to be able to reference or dereference the pointer whether it's on the card or on the host machine. Think of the underlying paging subsystem for Linux: if the memory is on the card and you're running as a process on the host, it'll just page-fault it across and you'll run against it on the host, or the other way around.
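HMM itself is about making plain malloc behave this way; the closest thing shipping in CUDA today is managed memory, which is enough to sketch the single-pointer, fault-on-touch behavior being described (names and sizes here are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device code touches the pages, so they fault over to the GPU.
__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1024;
    int *data;
    // One allocation, one pointer, valid on both host and device.
    cudaMallocManaged(&data, n * sizeof(int));

    for (int i = 0; i < n; ++i) data[i] = 0;       // host touch: pages live on the CPU
    increment<<<(n + 255) / 256, 256>>>(data, n);  // device touch: pages migrate over
    cudaDeviceSynchronize();
    printf("data[0] = %d\n", data[0]);             // host touch again: prints 1

    cudaFree(data);
    return 0;
}
```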
B: So if you think about it, half the problem with the GPU is resolving the data dependencies and where they live, and when you have this feature in malloc available, it's a huge developer simplification. Now, you might sit there and say: yeah, but wouldn't you have thrashing going on as you reference these things from either side?
B: On the GPU side it's not shown, but those GPUs can be connected by NVLink, and NVLink is a very fast network of up to 16 GPUs. All of this capability then runs across all of those, and each GPU can have 32 gigabytes of stacked memory. So memory consistency is all supported, locking is all supported, and it just works. This is a new feature that we're putting together right now, and with that I'll pass it over to Andre.
A: Thank you, Duncan. Okay, we've just got a couple more slides to go through, and I would like to talk a little bit more about the partnership that we have with NVIDIA with regard to their GRID, that's their vGPU offering, with Red Hat Virtualization. So, a couple of things going on here: back in, I think it was 2015, we started working with NVIDIA on providing the enablement for vGPUs, and that was quite a project.
A: It required us to promote an upstream package, which was mediated devices. Mediated devices is the key enabler that allows you to set up vGPUs; in other words, take an NVIDIA Tesla or Maxwell GPU and then mediate it out, similar to what you would do for virtualization of CPUs. We were able to get that upstream and accepted in 2016, and then afterwards we worked on getting that support into RHV, Red Hat Virtualization, as well as support added in Red Hat Enterprise Linux. So those were the upstream components, which NVIDIA worked on.
A: Okay, just to give you an idea of some of the open and closed source aspects of the vGPU, again their GRID support: if you notice, we're actually installing the GRID software on the KVM host, and then, on top of what the mediated devices provide support for, we'll need to install the drivers on each of the VM guests. That's critical. Again, the vGPU guest will view the GPU as if it's a dedicated GPU, so there's no need to worry about additional management.
A: Okay, another aspect of our partnership is that last year we were able to collaborate: we worked with NVIDIA, and we worked with HPE, on a benchmark. This is the SPEC ACCEL benchmark. SPEC, for those who are not familiar, is a standards consortium; they tend to run very high-CPU, high-memory type benchmarks, which lends itself well to the HPC market. With our collaboration we were able to have a configuration of NVIDIA V100s, that's their Volta GPUs; at the time those were the fastest available GPUs.
A: So we had eight V100 GPUs in an HPE ProLiant server running RHEL, Red Hat Enterprise Linux. We were able to break a number of records, both benchmark records around throughput as well as energy efficiency. For those who are interested in the specifics of this benchmark, we have a couple of blogs available, and you can get more details with regard to the configuration there. What this speaks to, essentially, is that the partnership is leveraging two aspects: one is our ability to provide.
A: Okay, something new that we just announced around the time of the NVIDIA GTC conference, just over a month ago, was the availability of the GPU device plugin. This is something that we worked on with the Kubernetes resource management team, to get the ability to provide support for managing GPU workloads in a containerized environment. This feature is supported in OpenShift 3.9, and it's available as a technical preview feature, which means that it's not production supported.
A: Okay, the other thing that is important to our mutual customers is ensuring that we are staying ahead of security vulnerabilities. Of course, to this point, everyone here is probably familiar with Spectre and Meltdown; that's something that we all had to deal with at the beginning of the year, and with NVIDIA we were able to provide some of the patches to them so that they could test them and validate that we had mitigated any exposure that resulted from Spectre and Meltdown. So that's a value of the partnership.
A
All
right
last
but
not
least,
I'd
like
to
talk
about
our
demo,
we're
going
to
be
having
nvidia
in
our
booth
at
booth
number
725.
That's
the
AI
IOT
AI
partner,
ecosystem
booth,
it's
just
on
the
other
side.
They'll
be
running
demos
and
we'll
also
have
representatives
from
our
product
management
team,
as
well
as
technical
staff
to
address
any
questions
or
if
you
want
to
have
any
side
meetings
will
be
available
to
to
address
them
there
as
well.