Cloud Native Computing Foundation KubeCon + CloudNativeCon North America 2022, 11 Nov 2022

From YouTube: Machine Learning Using Various GPU Technology With Kubeflow. - Jihye Choi, SAMSUNG SDS

Description

Speakers: Jihye Choi
People who work in MLOps tend to agree that making the most of GPUs within a limited budget is crucial. Kubeflow is a great open source project, but it provides few components for efficient distributed training through tight coupling with the GPU ecosystem, or for maximizing GPU utilization. 1. A lightweight model uses a relatively small share of a GPU, so dedicating an entire GPU to it wastes resources. Multi-Instance GPU (MIG) on the NVIDIA A100 splits one GPU into up to 7 instances, and this presentation shows how to combine this technology with Kubeflow. 2. As model size increases, distributed training across multiple GPU servers becomes necessary for efficiency. GPUDirect RDMA is a high-performance networking technology that transfers data directly between GPU memories without CPU or system memory intervention. The presentation shares field-tested experience of improving GPU utilization and performance in Kubeflow with both technologies.

A

Okay, hello everyone, good afternoon. I'm Jihye Choi from Korea, and I'd like to thank everyone for participating in today's presentation despite your busy schedules.

A

I have been working at Samsung SDS as an infrastructure architect for about seven years, and as a cloud architect for the last three. I have a particular interest in open source, including Kubernetes and Kubeflow, and I am also running various PoCs to develop the best machine learning platform, focusing on components such as network, server, and GPU, and applying the results to our service.

A

Based on my experience as an infrastructure architect, my team is currently developing a machine learning platform based on Kubeflow. At the beginning of this year we added usability enhancements to Kubeflow on the Samsung SDS cloud platform and released a Kubeflow service, which is being updated continuously. For those of you who are not familiar with SCP: Samsung Cloud Platform is a cloud environment launched by Samsung SDS in July last year, which virtualizes and provides the various components that corporations need, such as compute, storage, and database.

A

This corporate cloud can be used conveniently as self-service, and it provides very high quality, availability, and stability. As I mentioned earlier, my team developed a Kubeflow service on SCP.

A

Through our experience of developing a machine learning platform, we've learned a lot. I assume that those of you from the fields of machine learning, AI, and MLOps would agree with me on this: the biggest lesson we learned was how limited the budget is and how important it is to make full use of the GPU within that budget. So I'd like to introduce two case studies on how we improved that part and applied it to our platform service.

A

First of all, for those who are not familiar with Kubeflow: it is a machine learning toolkit based on Kubernetes, and it is an open source project that makes it simple to scale machine learning models and deploy them to production. Kubeflow provides various components such as Jupyter notebooks, Katib, Pipelines, and training operators, which allow data scientists and machine learning engineers to work on training, hyperparameter tuning, and serving workflows.

A

However, components that enable effective distributed training by integrating with the GPU ecosystem, or that maximize GPU utilization, are rarely provided. So we ran PoCs on two GPU technologies. One is Multi-Instance GPU (MIG), provided by the NVIDIA Ampere architecture, and the other is GPUDirect RDMA. I'm going to share the PoC results for each of them today.

A

Let's first take a look at the PoC on MIG. With these two tables, we can compare the specifications of GPUs for data centers and GPUs for desktops used by experts. As you can see from the table on the left, the GPUs for data centers and servers used for AI and HPC have large resources and processing capacity. They are well suited to large-volume training.

A

Then let's look at the specifications of desktop GPUs for experts. Their memory size is four to eight gigabytes at the minimum, and their specifications are about one half to one twentieth of those of a data center GPU. This means that relatively few GPU resources are needed for light model development or inference tasks. In other words, using a whole data center GPU for such development or inference tasks is a waste of resources.

A

MIG is a technology that came out to tackle this issue.

A

MIG can split one GPU into up to seven instances, and each instance can be used as one complete GPU. MIG partitions the GPU into instances that each have their own high-bandwidth memory, cache, and compute cores. When various tasks such as different AI inference jobs, Jupyter jobs, or training jobs are executed on the same GPU without MIG, the tasks compete with each other for the same resources. With MIG, on the other hand, the tasks are executed simultaneously in different instances.

A

This secures service quality and extends accelerated computing resources to all users. MIG was particularly attractive for us in handling large data center GPUs, and we thought we could enhance machine learning efficiency by combining it with Kubeflow, so we proceeded with a technology verification. Kubeflow is not a platform that is simply loaded on the operating system layer, so we had to verify it through numerous phases: bare metal,

A

hypervisor, virtual machine, operating system, Kubernetes, and Kubeflow, in that order. Our biggest interest was in the procedure for each phase and

A

the most efficient way in terms of performance. Oh, I'm sorry.

A

I would like to walk through every step of setting up MIG in today's presentation, but let's rather focus on the lessons that we learned.

A

First, does MIG work well on Kubeflow? The official documentation thoroughly explains things up to the Kubernetes layer, and the tests on the operating system layer and the Kubernetes layer were done without any issue. However, we had to take a step further and check the Kubeflow layer. To sum up, it works very well, as you might have expected. This screen shows the Jupyter notebook test that was created on Kubeflow.

A

The screen on the left shows the GPU check command executed in the Jupyter notebook, and we can see the information for the 40-gigabyte A100 GPU and the one MIG device, with GI ID 9, that has been allocated. On the right is the screen that shows GPU information for the node where the notebook is running. One A100 GPU is divided into seven MIG devices, and the process is allocated to the device with GI ID 9. As we trace the process,

A

we can see that it is the Jupyter notebook on the MIG device. So we ran a simple MNIST training on the Jupyter notebook and, as you can see, it works very well with the MIG device. However, we faced a minor issue during this testing: when a Jupyter notebook is created, the MIG device is not detected on the open source Kubeflow dashboard.

A

So let's take a closer look at this. On the left is the screen for open source Kubeflow. When a Jupyter notebook is created, CPU, memory, and GPU resources can be selected, but for GPU only the quantity can be selected. The MIG device cannot be detected, so we had to update the notebook YAML file manually for testing, and we thought that this would be very inconvenient for developers who are unfamiliar with Kubernetes.
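
The manual update described here might look roughly like the following. This is a minimal sketch, not the speaker's actual file: it assumes the NVIDIA device plugin runs with its `mixed` MIG strategy, which advertises each MIG profile as its own resource name, and the profile (`1g.5gb`), CPU, and memory values are illustrative.

```python
import json

# Assumption: "mixed" MIG strategy, so each profile is a separate resource name.
# The profile below is illustrative; use whatever profile the node advertises.
MIG_RESOURCE = "nvidia.com/mig-1g.5gb"


def notebook_resources(cpu: str, memory: str, mig_devices: int) -> dict:
    """Build the `resources` section of a notebook container spec."""
    limits = {"cpu": cpu, "memory": memory, MIG_RESOURCE: mig_devices}
    # For extended resources, requests must equal limits, so mirror the values.
    return {"requests": dict(limits), "limits": dict(limits)}


resources = notebook_resources(cpu="2", memory="8Gi", mig_devices=1)
print(json.dumps(resources, indent=2))
```

The printed structure corresponds to what would sit under `spec.template.spec.containers[].resources` in the notebook's YAML.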

A

That was the reason why we enhanced this part on our platform. On the right is the screen for creating a notebook in the SCP Kubeflow service. As you can see, we can select the GPU type. As of now, 10-gigabyte MIG devices and the full GPU type are supported.

A

It is set up so that users can select it instantly on the dashboard without updating the YAML file, and we would appreciate it if such a MIG device function were applied to the open source community later. There is more work in the SCP Kubeflow service for utilizing resources, including MIG devices: the MIG device information of the nodes configured in the cluster can be checked on the dashboard, the amount of resources used can be reported, and the allocation of resource quota can be restricted. I think it's going to be very useful.

A

Then let's move on to the next lesson that we learned. We wanted to check whether distributed training was feasible on MIG devices. MIG technology is not suited to large-volume tasks like distributed training, but we just wanted to check whether it was possible. For those of you who are not familiar with the concept, distributed training involves one training job that uses numerous GPUs to execute training.

A

It can be executed on one node or on multiple nodes, like the one in the picture. We are going to test this DDP job in PyTorch, using CUDA, the technology specialized for NVIDIA GPUs. For task execution, each process is assigned a device through CUDA. Let's look at a specific example. On the left is a distributed training YAML, allocating four GPUs to one pod and running a total of two pods.
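
A job of this shape could be sketched as a Kubeflow `PyTorchJob` manifest built in Python. This is an illustration under assumptions, not the YAML shown in the talk: the job name and image are hypothetical, and the two pods are modeled as one `Master` replica plus one `Worker` replica.

```python
import json


def pytorch_job(name: str, image: str, gpus_per_pod: int, workers: int,
                gpu_resource: str = "nvidia.com/gpu") -> dict:
    """Build a PyTorchJob manifest with one master pod and `workers` worker pods."""

    def replica(count: int) -> dict:
        # Each call returns a fresh dict so master and worker specs stay independent.
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {"spec": {"containers": [{
                "name": "pytorch",
                "image": image,
                "resources": {"limits": {gpu_resource: gpus_per_pod}},
            }]}},
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {"pytorchReplicaSpecs": {"Master": replica(1),
                                         "Worker": replica(workers)}},
    }


# Hypothetical name and image; 4 GPUs per pod, 2 pods in total (1 master + 1 worker).
job = pytorch_job("ddp-demo", "example.com/ddp-train:latest", gpus_per_pod=4, workers=1)
print(json.dumps(job, indent=2))
```

Passing a MIG resource name as `gpu_resource` would produce the MIG variant of the same job.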

A

As you can see on the right, when we execute it, all the processes are allocated on the one node, and if we look at the execution log, we can see that the CUDA devices have been allocated properly and the task has been completed. This is DDP as it generally takes place on GPUs.
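
How processes end up on CUDA devices in such a run can be illustrated with a little arithmetic. The sketch below assumes one process per GPU with contiguous rank assignment (as a launcher such as `torchrun` would typically arrange); the helper name `place` is made up for this example.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    pod_index: int   # which pod the process runs in
    local_rank: int  # index of the process within its pod
    device: str      # CUDA device string the process would select


def place(global_rank: int, pods: int, gpus_per_pod: int) -> Placement:
    """Map a global DDP rank to its pod and local CUDA device."""
    world_size = pods * gpus_per_pod
    if not 0 <= global_rank < world_size:
        raise ValueError(f"rank {global_rank} outside world size {world_size}")
    pod_index, local_rank = divmod(global_rank, gpus_per_pod)
    return Placement(pod_index, local_rank, f"cuda:{local_rank}")


# Two pods with four GPUs each give a world size of 8.
for rank in range(8):
    print(rank, place(rank, pods=2, gpus_per_pod=4))
```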

A

Then let's run the same task with MIG devices only. This is the distributed training YAML that runs a total of two pods, allocating two MIG devices to each pod. As you can see on the right, an error occurs at runtime during CUDA device detection. Does this mean that distributed training is not feasible on MIG? To verify this accurately, we allocated one MIG device to each pod for the same task. As you can see on the right, the MIG device was now allocated properly and distributed training was executed without any issue.

A

We found out that distributed training is feasible with MIG devices, but the task is executable only when one device is allocated to each pod. Lastly, let's take a look at the performance. To sum up, for lighter models it was three to seven times more efficient on MIG devices. We executed model training in both bare metal and virtual machine environments.
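
The one-device-per-pod finding could be encoded as a simple check before a job is submitted. This is only a sketch of the rule as stated in the talk, not an official validation: the `nvidia.com/mig-` prefix assumes the device plugin's `mixed` MIG strategy, and the function names are hypothetical.

```python
MIG_PREFIX = "nvidia.com/mig-"  # assumption: "mixed" MIG strategy resource names


def mig_devices_requested(limits: dict) -> int:
    """Count MIG devices across all MIG profiles in a container's resource limits."""
    return sum(int(count) for name, count in limits.items()
               if name.startswith(MIG_PREFIX))


def check_ddp_pod(limits: dict) -> None:
    """Reject pod specs that ask for more than one MIG device."""
    count = mig_devices_requested(limits)
    if count > 1:
        raise ValueError(
            f"{count} MIG devices requested; the PoC only ran with one per pod")


check_ddp_pod({"nvidia.com/mig-1g.5gb": 1})  # passes silently
try:
    check_ddp_pod({"nvidia.com/mig-1g.5gb": 2})
except ValueError as err:
    print("rejected:", err)
```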

A

We compared the execution time of running the same model consecutively seven times on one GPU against running it simultaneously on seven MIG devices. We saw three times better performance for heavier models like RNN, and five to seven times better performance for lighter models like CNN and ResNet-18. We found this valuable enough to apply to our SCP service.
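
The comparison method can be written out as a small calculation. The timings below are invented purely to show the arithmetic; they are not measurements from the PoC.

```python
def mig_speedup(full_gpu_seconds_per_run: float, runs: int,
                mig_concurrent_seconds: float) -> float:
    """Ratio of sequential time on one full GPU to concurrent time on MIG devices."""
    if mig_concurrent_seconds <= 0:
        raise ValueError("concurrent time must be positive")
    sequential_seconds = full_gpu_seconds_per_run * runs
    return sequential_seconds / mig_concurrent_seconds


# Hypothetical numbers: one run takes 10 s on the full GPU, so seven runs take 70 s.
# Running all seven at once on seven MIG devices is assumed to take 14 s.
speedup = mig_speedup(full_gpu_seconds_per_run=10.0, runs=7, mig_concurrent_seconds=14.0)
print(f"{speedup:.1f}x")  # prints 5.0x
```

A light model that barely slows down on a one-seventh slice approaches the 7x upper bound, while a heavier model that runs much slower per slice lands closer to the lower end of the reported range.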

A

Furthermore, we found some points to consider for Kubernetes when testing, and if you're interested, you can take a look at the document, which we have written as a technology guide. So far we have gone over the PoC related to MIG technology, with which a GPU is divided into at most seven instances, so that even a single GPU can be used efficiently by numerous users or applications.

A

However, it is suitable for data scientists' model development tasks or inference tasks, and not for large-scale tasks like distributed training. It allows more efficient use of GPU resources when it is used in the right place.

A

Let's move on to the second case study. I'd like to introduce the PoC related to GPUDirect RDMA technology. As models get bigger and the amount of data increases to enhance the accuracy of deep learning, we need countless computers and efficient distributed processing. I'm going to explain GPUDirect RDMA technology for multi-node GPU distributed training and the system architecture, as well as some examples, and share the performance

A

verification results. First of all, GPUDirect RDMA is a function that allows direct access to GPU memory for GPU-to-GPU communication between remote nodes. Through the network interface, data I/O between GPU memories is processed without involving the CPU. To understand it better, let's take a look at four cases.

A

The two diagrams at the top show GPUs communicating within a single node, and the other two at the bottom show communication between GPUs on remote nodes. First, let's take a look at GPU communication within a single node.

A

Without GPUDirect peer-to-peer, the CPU must transfer the data from the GPU memory to the host memory, and then from the host memory to the second GPU's memory. With GPUDirect peer-to-peer, the data can be transferred directly from one GPU's memory to the other GPU's memory.

A

It works similarly for inter-node communication as well. Without GPUDirect RDMA, the data must be copied from the GPU memory to the host memory, and then from the host memory to the remote host. But with GPUDirect RDMA,

A

the data is transferred directly from the GPU memory: it is sent via an RDMA network adapter such as InfiniBand to the remote host, with no CPU involvement. It seems that communication will be faster without the CPU involvement.

A

So then, what do we need in order to use GPUDirect RDMA? Various environment settings are necessary for applying GPUDirect RDMA to Kubeflow. For hardware, we need network equipment that supports RDMA communication, and for our PoC we set up a bare metal environment with an InfiniBand network adapter and NVIDIA A100 GPUs. In order to detect the GPU and the network adapter, driver and toolkit settings are necessary at the operating system layer and the Kubernetes layer. The parts in light green are the modules for using the GPU, and those in dark

A

blue are the modules for using the network adapter. Let's first look at the OS layer. We have to set up the four things displayed in the diagram. When the NVIDIA driver is installed so that the GPU is recognized by the OS, and the NVIDIA Container Toolkit is set up so that it can run on Kubernetes, it is possible to use the GPU. After that, we install the OFED driver so that the InfiniBand network adapter is recognized, and install the nv_peer_mem driver, which supports RDMA communication.

A

Then the setup of the OS layer is complete. Once the OS layer is done, let's take a look at the Kubernetes layer. As you can see, we have to set up the NVIDIA device plugin and the RDMA shared device plugin in the Kubernetes layer. The NVIDIA device plugin is a DaemonSet that automatically recognizes and exposes GPUs in the Kubernetes layer.

A

It is mandatory for using GPUs in Kubernetes. The RDMA shared device plugin allows access between pods by sharing the RDMA device between remote Kubernetes nodes. With that, we are ready to use GPUDirect RDMA in the Kubernetes layer. The part in yellow is the container layer, and we used CUDA, cuDNN, and NCCL for our PoC. Then the preparation to use GPUDirect RDMA is complete.

A

Let's take a look at the sample that we tested. It is a YAML that executes distributed training of an image segmentation model, using the training operator's PyTorchJob embedded in Kubeflow. We set the parts in the red rectangles to use GPUDirect RDMA. First, setting two OS environment variables in the container image is necessary.

A

One variable is for NCCL communication using InfiniBand, and the other is for the GPUDirect RDMA level.

A

Then we designate the RDMA shared device using the custom resource set up in the Kubernetes layer in advance, and set the security context for IPC_LOCK. Then we can proceed with the test we want.
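
The three settings described above (two environment variables, the RDMA resource, and the security context) could be sketched as a container spec like the one below. It is an illustration under assumptions rather than the YAML from the talk: `NCCL_IB_DISABLE` and `NCCL_NET_GDR_LEVEL` are my guess at the two variables meant, the value `SYS` is illustrative, and the resource name `rdma/hca_shared_devices_a`, the image, and the GPU count are hypothetical placeholders that depend on how the RDMA shared device plugin is configured.

```python
import json


def gdr_container(image: str, gpus: int,
                  rdma_resource: str = "rdma/hca_shared_devices_a") -> dict:
    """Build a container spec for a training pod that uses GPUDirect RDMA."""
    return {
        "name": "pytorch",
        "image": image,
        "env": [
            # Keep NCCL's InfiniBand transport enabled (0 means "not disabled").
            {"name": "NCCL_IB_DISABLE", "value": "0"},
            # How far apart GPU and NIC may be for GPUDirect RDMA; value is illustrative.
            {"name": "NCCL_NET_GDR_LEVEL", "value": "SYS"},
        ],
        "resources": {"limits": {"nvidia.com/gpu": gpus, rdma_resource: 1}},
        # RDMA needs to pin memory, which requires the IPC_LOCK capability.
        "securityContext": {"capabilities": {"add": ["IPC_LOCK"]}},
    }


container = gdr_container("example.com/segmentation-train:latest", gpus=8)
print(json.dumps(container, indent=2))
```

This dictionary would sit under `template.spec.containers` of each replica in a `PyTorchJob`.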

A

Let's look at the log from the actual execution. The OS environment variables are set properly, and we can see that the A100 GPU and InfiniBand are set in the route. This means that InfiniBand is not disabled; in other words, we are using InfiniBand with GPUDirect RDMA. We can see it in the NCCL log, the log where training takes place, which shows that GPUDirect RDMA is in effect.

A

Then let's see the results. This is the throughput measured while increasing the number of GPUs for image classification, detection, and segmentation models and a natural language understanding model. In our PoC environment, we increased the number of GPUs from one to two, four, eight, and sixteen, and the red rectangle marks the case that uses a total of 16 GPUs. It is a comparison of performance between the green bars, which use GPUDirect RDMA, and the blue bars, which do not.

A

It differs by model, but generally we can expect the effect to be around 114 to 612 percent. Basically, as the number of GPUs increases, performance improves, but in some models multi-node performance may be undermined if a huge volume of parameter communication takes place continuously in the model. By using GPUDirect RDMA, the nodes can communicate more effectively, as we saw in the performance verification results.
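
One way to quantify how communication overhead undermines multi-node performance is scaling efficiency: measured throughput divided by the ideal linear throughput. The numbers below are made up for illustration and do not come from the PoC.

```python
def scaling_efficiency(throughput_n: float, n_gpus: int, throughput_1: float) -> float:
    """Fraction of ideal linear scaling achieved with n_gpus GPUs."""
    if n_gpus < 1 or throughput_1 <= 0:
        raise ValueError("need at least one GPU and a positive baseline")
    return throughput_n / (n_gpus * throughput_1)


# Hypothetical: 100 samples/s on 1 GPU and 1200 samples/s on 16 GPUs.
efficiency = scaling_efficiency(throughput_n=1200.0, n_gpus=16, throughput_1=100.0)
print(f"{efficiency:.2f}")  # prints 0.75
```

A value well below 1.0 suggests that communication between GPUs is the bottleneck, which is the situation where a faster data path such as GPUDirect RDMA would be expected to help most.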

A

We achieved extremely satisfying results, and by applying this we released an InfiniBand-based multi-node GPU service on SCP last month. And yes, that's it; I think this is all I wanted to share today. The presentation was based on cases from our experience, and although it was a brief explanation, completed in about 30 minutes due to the time limit, I want to mention that we have gone through countless rounds of trial and error.

A

Many of you working in this field may be, or will be, going through the same difficulties, so I hope you find my presentation a little bit helpful. I'd like to finish today's presentation here. Thank you for your time and attention, have a safe trip, and thank you.

A

And also, you can download this presentation at sched.com. Any questions?

B

A

Yes, actually we ran the PoC based on Docker, but our SCP Kubernetes service is based on containerd, so we basically ran it on containerd too.

A

Yes, it is based on containerd.

B

A

Yes, okay, and thank you.

B

B

In one of your first slides it looked like you were having trouble getting Kubeflow to recognize the MIGs, and then you mentioned that you had to apply changes. But I wasn't clear if the changes in the YAML file were only applied to the Samsung SCP service or if it was something that got committed back to Kubeflow.

A

Sorry, could you take off your mask, please, and then speak? Okay.

B

Earlier you had trouble getting Kubeflow to recognize your MIGs, your GPU MIGs. You mentioned that you had to apply changes to some YAML files so that Kubeflow could see your MIGs, but it wasn't clear if those changes were applied to the Samsung SCP platform or if they were committed back to Kubeflow.

A

Actually, we applied it to our SCP Kubeflow service.

B

Okay, so does that mean that Kubeflow may still have trouble identifying MIGs?

A

Yes.

B

Thank you.

A

Any other questions? Oh, okay.

C

When you were doing your distributed testing with MIG, you were using MIG instances to run distributed training with a single MIG instance in each container. What was the interconnect? Was it just using Ethernet as the interconnect, because of the lack of GPUDirect when you're using MIG?

A

I'm sorry, could you take off your mask, please? I'm sorry.

C

One of the first tests you ran was distributed training on MIG, right? You did it with two MIG instances per container and it failed, but with a single MIG instance per container, it worked.

A

C

What was the interconnect?

A

Actually, it is NVIDIA. It is based on the Ampere architecture, so they designed it with...

C

I'm guessing it wasn't using InfiniBand; it was falling back to Ethernet in those cases.

C

The interconnect, that is, for the distributed communications.

A

It is with InfiniBand.

C

It was using InfiniBand with a single MIG instance?

A

No, actually InfiniBand is for remote nodes, for inter-node communication. Intra-node communication doesn't need any InfiniBand network adapter. It communicates within the Ampere GPU in the single node, so it does not need any InfiniBand in a single node.

C

Yes, when you were doing the distributed test with single MIG instances, right?

A

C

A

C

So you were not requesting an RDMA HCA for those pods, or...?

A

I'm sorry. If you have any questions, you can also email me, and I will answer them. I'm very sorry. Any other questions?

A

Okay, then I think that's it. Thank you for your time again.