Description
Deep Learning Workflows on NVIDIA GPUs on OpenShift, with Mehnaz Mahbub of Supermicro and Mayur Shetty of Red Hat.
Filmed October 28th, 2019 in San Francisco.
Mayur Shetty (Red Hat):
So the agenda for today is going to be quick. The way we've divided the talk is that I'm going to cover why containers and Kubernetes for ML workloads, and in particular why OpenShift, and I'm also going to talk about how we prepared the system to run the ML workloads. Mehnaz is going to talk about the Supermicro hardware and also the results that we collected during our exercise. Before I proceed, I would like to walk through the pipeline and the various personas involved in ML workload deployment.
First and foremost, you collect all the data. We collect data from various sources, and this is the raw data that's coming in; the persona here is the data engineer. In the next phase the data is stored in data lakes, and we have the models being tuned, tested, and trained. All of that happens in the second phase, and the persona here is the data scientist.
In the first phase you start to see some trends, but the second phase is where all the training happens. It then takes certain models, and those are deployed using the update process; the application developers are involved in this phase. The data scientists are also involved, because they want to make sure that the right models are deployed.
B
Also,
they
want
to
make
sure
if
there's
any
drift
in
data,
because
if
there's
new
data
coming
in
this
needs
to
be
some
tuning
done
and
some
retraining,
so
the
data
scientists
are
also
involved
along
with
the
app
developers
and
at
the
end,
what
you
see
is
an
intelligent
application
which
meets
some
Business.
Objects
object
is
all
across
across
all
these
phases.
One
rule
which
is
common
is
the
IT
operations
folks,
so
they're
common
across
all
this
they're
also
responsible
for
yesterday,
two
operations
involved
in
the
pipeline
so.
Now let's talk about why containers, and why Kubernetes in particular, for AI/ML workloads on a hybrid cloud. First and foremost is the agility it provides. By this I mean the automation for the platform and also for the model frameworks that the data scientists are using; all of that can be automated. There is also the autoscaling feature: with autoscaling, the data scientist does not have to rely on the IT folks to provide them the infrastructure.
They can basically just use their own tools to autoscale and get the infrastructure they need to do their work. What we've seen is that the training and the testing are all very compute intensive, so any hardware acceleration provides a key benefit to the data scientist. What we've noticed is that GPU acceleration, integration with security features, and uptime are all key value-adds for ML workloads.
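As a rough illustration of the autoscaling feature mentioned above, here is a minimal sketch of a HorizontalPodAutoscaler that a data scientist could apply on their own; the deployment name and thresholds are assumptions, not part of the project described in this talk.

```yaml
# Minimal sketch (assumed names and thresholds): scale a model-serving
# deployment between 1 and 8 replicas based on CPU utilization,
# without having to involve the IT folks.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving            # hypothetical deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving
  minReplicas: 1
  maxReplicas: 8
  targetCPUUtilizationPercentage: 70
```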
Also, you can now offer ML as a service. We could already do this with containers, but now you can offer it as a service, so that the data scientist does not have to focus on writing all the services into their application code and can rely on existing services instead. They can just go to a registry, download these services, integrate them with their applications, and benefit from that.
There are products and services which help a lot here, mainly around the automation and the CI/CD pipelines that the platform brings in. All of this boosts productivity, and last but not least there is the lifecycle management and operations that help with the deployment of AI/ML workloads. Some of you may have already seen this slide earlier: ML workloads are highly data and compute intensive, and at the same time OpenShift is a distributed platform.
Also, like Sharad mentioned earlier, these services can now sit behind load balancers and be scalable. What we mean by this is that you can add more resources as and when you need them, or even shrink resources when you don't need them, so that is a huge value-add. Also, with OpenShift, the ML workloads can now be truly portable, meaning they could be running on your private cloud or on your public cloud.
Just like containers benefit from lifecycle management, this is something that is new to the data scientist world. Data scientists can now just focus on writing their code and putting it into a Git repository, and a source-to-image kind of feature, together with the CI/CD pipeline that is integrated into OpenShift, takes that source, creates an image, and then deploys it. All the testing that needs to happen, or the testing hooks that are available in OpenShift, can be leveraged by the data scientist.
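The talk doesn't show the actual build definition; as a minimal sketch of the source-to-image flow described here, a BuildConfig along these lines would pull the data scientist's code from Git and build it into an image (the repository URL, names, and builder image are assumptions):

```yaml
# Hypothetical sketch of an OpenShift source-to-image BuildConfig:
# pull source from a Git repo, build it with a Python S2I builder image,
# and push the result to an image stream that a deployment can roll out.
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: model-service                      # assumed name
spec:
  source:
    type: Git
    git:
      uri: https://example.com/data-science/model-service.git  # hypothetical repo
  strategy:
    type: Source
    sourceStrategy:
      from:
        kind: ImageStreamTag
        name: python:3.6                   # assumed builder image
        namespace: openshift
  output:
    to:
      kind: ImageStreamTag
      name: model-service:latest
```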
Let me move on to the actual project that we did together. I want to talk about GPU-as-a-service on OpenShift. Before we even started running our benchmarks, there were some prerequisites we had to take care of. First and foremost, you have to make sure that you have the GPU drivers running on the servers which have the GPUs, and verify that things are fine.
What we did was actually collect the GPU names; we were going to use those GPU names later on for labeling the machines. The Docker that comes with RHEL already has the OCI runtime hooks, so we didn't have to do anything there. We focused on the NVIDIA container runtime hook and got that configured. Once that was done, we had our system ready to deploy Docker containers, and at this point we used oc commands and docker commands to deploy these containers on the GPU machines.
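The transcript doesn't include the actual labels or manifests; as a rough sketch of what targeting the labeled GPU node with a quick test container could look like (the label key and value, pod name, and image are all assumptions):

```yaml
# Hypothetical sketch: run a quick check pod on a node that was labeled with
# its GPU model. The label key/value, pod name, and image are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-node-check
spec:
  restartPolicy: Never
  nodeSelector:
    gpu-model: tesla-v100-sxm2       # assumed label applied to the GPU node
  containers:
  - name: smi
    image: nvidia/cuda:10.0-base     # assumed CUDA base image
    command: ["nvidia-smi"]          # list the GPUs visible inside the container
```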
The device plugin API is already enabled, so we didn't have to do anything there; we just had to focus on the device plugins themselves and make sure that the NVIDIA device plugin was running on the hosts which had the GPUs. Once we had that configured, we tested it again using the CUDA containers, and I'll show you that on the next slide. What you see there, on the last line, is the YAML file which declares that a GPU is required for this particular container.
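The YAML from the slide isn't reproduced in the transcript; a minimal sketch of that kind of manifest, with the GPU request on the last line, might look like this (the pod name and image are assumptions):

```yaml
# Hypothetical sketch of a CUDA test pod. The NVIDIA device plugin exposes GPUs
# as the extended resource "nvidia.com/gpu"; the limits entry at the bottom is
# what tells the scheduler that this container needs one GPU.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test                    # assumed name
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-test
    image: nvidia/cuda:10.0-base     # assumed CUDA image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1            # request one GPU for this container
```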
Mehnaz Mahbub (Supermicro):
Thank you, Mayur. I'm Mehnaz, by the way, from Supermicro. Before I dive into the benchmark numbers, I just want to tell you a little bit about Supermicro. Supermicro is one of the leading providers of SuperServers in the industry today. Our headquarters are in San Jose, and we also have branches in the Netherlands and in Taiwan as well. Supermicro is one of the leading manufacturers of a huge array of hardware, including servers,
networking devices, server management software, HPC, AI; you name it, and we provide the whole hardware stack for you. We also do exciting solution builds like this one, and we at Supermicro are glad to partner with Red Hat here, where we have run real-life AI workloads for the first time on top of OpenShift. So let me start with the solution reference architecture. We built a 10-node cluster, which I will show you in detail on the next slide.
For the actual OpenShift building block, we used Supermicro's famous BigTwin SuperServer, which is known for its very dense parallel compute power as well as its large memory footprint. For running the actual AI workloads, we used Supermicro's GPU SuperServer, and I will give you the actual spec details of the servers in later slides. For networking, we used our own Supermicro switches.
We employed both 10G and 100G switches for this project. And this is a summary of the software stack: for example, if you want to know which OS we used, it was RHEL 7.6, and, as Mayur mentioned, this project was done on OpenShift 3.11, along with the CUDA versions and details like that.
Coming back to the solution building blocks: on your left, you have the actual OpenShift cluster building block, which is the Supermicro BigTwin. Again, I'm not going to go into all the details of the CPUs and memory, but if you have any questions about any of the servers, please let me know and I'll get back to you on that. On your right, you have our Supermicro GPU server, which can hold up to eight Tesla V100 SXM2 GPUs, which are the actual GPUs
we used for this benchmarking as well. Again, if you have any detailed questions about the specs, I'll be happy to answer them. So, moving on to the actual hardware setup: in our Supermicro lab we created a 10-node OpenShift cluster, with the standard three master nodes, three infra nodes, and three application nodes, along with one load balancer node. One of those application nodes, as you can guess, is the GPU server where we actually ran the AI workload. The network topology is pretty straightforward.
We implemented two different layers of network, 10G and 25G. The 10G was basically for management purposes, and the 25G was implemented because, as you know, the wider the bandwidth and the lower the latency we can provide to machine learning workloads, the faster the results we are going to get. And again, this whole solution architecture, reference architecture, and network topology was built in such a way that it would be scalable to whatever scale your deep learning project might be.
For running the actual benchmark suite, we chose MLPerf. If you're familiar with the world of machine learning, you're very familiar with MLPerf. MLPerf is a wide range of benchmarks that covers a lot of the main applications of machine learning. MLPerf basically gives you a set of rules, some specific datasets, and a bunch of specific models, so that the results you produce are comparable across any hardware platform or across any framework. From the MLPerf suite
we basically chose two categories of benchmarks. The first one is object detection and the other one is machine translation. I want to talk just a little bit about the datasets we used. For object detection we used a dataset called the COCO dataset, from Microsoft, which contains around 328K images along with more than 2.5 million labeled instances in those images, and for machine translation we used an English-to-German translation dataset.
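The transcript doesn't show how the training runs were actually submitted; as a purely hypothetical sketch of how an MLPerf-style training job could be launched on the cluster and claim the GPUs on the GPU node (the image, dataset path, and names are assumptions, not the project's actual setup):

```yaml
# Hypothetical sketch only: submit an MLPerf-style training run as a batch Job
# that claims all eight GPUs on the GPU node. Image and paths are illustrative.
apiVersion: batch/v1
kind: Job
metadata:
  name: mlperf-object-detection        # assumed name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: training
        image: registry.example.com/mlperf/object-detection:latest  # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 8          # claim the eight V100s on the GPU node
        volumeMounts:
        - name: coco
          mountPath: /data/coco        # COCO dataset mounted into the container
      volumes:
      - name: coco
        hostPath:
          path: /mnt/datasets/coco     # assumed dataset location on the node
```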
So, moving on to the actual benchmarking: as I mentioned, the first one is object detection. Before I talk about the numbers, the basic metric that we're comparing here is the training time. On the very right side, you see that if you go to the MLPerf website, the only numbers they have published are mainly run on NVIDIA's DGX-1 platform. So what we have done in our lab, with the software stack that we have
C
The
hardware
stack
that
we
have
created
is
very
much
comparable
to
Nvidia's
DG
x1,
whether
its
CPU
cores
a
number
of
GPUs
GPU
memories.
Things
like
that,
so
we
have
tried
to
create
a
very
comparables
hardware
stack
to
dgx
one
so
that
we
can
compare
our
results
with
the
ML
published
results.
So,
moving
on
to
the
first
number
here,
the
first
object
detection,
which
is
the
hip
heavyweight
object,
detection,
and
that
was
actually
the
longest
training
that
we
ran
and
incredibly,
we
got
even
better
training
time
than
in
videos.
Tgx
one.
C
As
you
can
see,
ours
was
a
little
around
205
minutes
where
Nvidia's
was
around
two
hundred
and
seven
minutes
and,
as
you
know,
this
much
of
a
difference
in
timing
makes
a
huge
impact
in
the
real
life
AI
trainings
and
for
the
next
one,
which
is
the
lightweight
object.
Detection.
We
have
also
got
like
very
close
results,
and
I
will
also
explain
why
all
these
numbers
are
very
important,
even
if
they're
not
better
than
nvidias
Dziedzic.
C
So
the
next
one
movie
line
is
the
machine
translation,
as
I
mentioned
English
to
German,
and
we
have
ran
two
different
two
different
sets
of
algorithms
here,
but
again,
both
on
PI
torch.
So
the
first
one
is
the
recurrent
translation
which,
as
you
can
see,
the
training
time
is
very
close
to
dgx
wine
and
the
next
one
again
is
the
non
recurrent
translation
which,
for
which
we
got
the
exact
same
training
time
as
nvidia
DJ,
x1,
and
one
more
thing
I
do
want
to
mention.
C
C
These are, again, as Sharad mentioned earlier when he showed the cool demos, examples of the OpenShift GUI that you can play around with. Both of these dashboards were created using Prometheus and Grafana, and on your left there you can see the actual GPU usage, how much of each GPU is being used, along with the GPU memory usage, and on the other one
C
You
can
see
the
actual
GPU
temperatures
which,
when
you're
learning
training
workloads
it's
very
important,
to
monitor
the
overall
health
of
your
GPUs
as
long
as
power
usage
as
well.
So
the
open
shaped
GUI.
It
gives
you
a
really
lot
of
really
cool
to
monitor,
in
fact,
every
aspect
of
your
project.
However,
you
want
to
monitor
later
control
it.
So
this
is
one
of
the
like
really
cool
examples
of
openshift
features.
I
think
that
we
have.
C
We
were
able
to
implement
for
our
project
as
well,
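The talk doesn't go into how the GPU metrics reach Prometheus; a minimal sketch of a scrape job for a GPU metrics exporter running on the GPU node could look like this (the job name, host, and port are assumptions, not the project's actual configuration):

```yaml
# Hypothetical sketch of a Prometheus scrape job for GPU metrics.
# The exporter endpoint below is assumed; Grafana dashboards for GPU usage,
# memory, temperature, and power would then be built on top of these metrics.
scrape_configs:
- job_name: gpu-metrics
  scrape_interval: 15s
  static_configs:
  - targets:
    - gpu-node-01:9400          # assumed GPU metrics exporter host:port
```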
On this last slide, I want to talk a little bit about why these numbers matter, or what the impact is of the numbers that I just showed you. First of all, to our knowledge, this is the first real-life AI workload that was run on OpenShift. Another very important point is that the numbers that we're comparing with, from the NVIDIA DGX-1, were all run on bare metal.
There is supposed to be a little bit of a lag when you compare bare metal with a workload running on top of OpenShift. So the fact that we can match those bare-metal numbers, if not beat them in one case, like I showed you, is a huge statement by itself. Being able to match those numbers, or get close to those numbers, showcases not only OpenShift's performance;
it also shows the overall hardware performance and how well we integrate with OpenShift. The last advantage, from the hardware point of view, is cost: NVIDIA's DGX-1 is a very expensive piece of hardware, as you might be aware, and compared to that, the Supermicro hardware stack that we have developed, which is very comparable to the DGX-1, is much more cost efficient.
C
So
the
fact
that
the
customers
are
getting
getting
the
same
training
performance,
if
not
better
in
one
case,
for
a
much
in
a
much
more
cost-efficient
way,
is
another
huge
statement
on
its
own.
So
before
I
finish,
I
want
to
let
you
know
I
want
to
share
some
links
with
you
in
this
slide.
The
first
one
is
the
white
paper
that
we
have
jointly
published
with
the
Red
Hat
Red
Hat
on
Supermicro.
C
You
can
that
white
paper
has
all
the
details
of
this
project
and
all
the
numbers
and
hard
words
and
everything,
and
also
I've,
also
provided
the
get
account
information
here.
If
you
want
to
go
there,
you
can
download
all
the
data
sets
that
we
have
used
all
the
yellow
files
and
everything's
in
the
get
account
and
also
I
have
linked
the
super
micros
openshift
solution
page
here.
If
you
want
to
take
a
look
at
the
hardware,
stack
details,
so
thank
you.