From YouTube: Optimizing Knowledge Distillation Training With Volcano - Ti Zhou, Baidu & William Wang, Huawei

Description

Don’t miss out! Join us at our upcoming event: KubeCon + CloudNativeCon North America 2021 in Los Angeles, CA from October 12-15. Learn more at https://kubecon.io The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.

Optimizing Knowledge Distillation Training With Volcano - Ti Zhou, Baidu & William Wang, Huawei

Knowledge distillation is a classic model compression technique: it transfers knowledge from a complex model (the Teacher) to a lightweight model (the Student). EDL uses Volcano as its scheduler to deploy the Teacher model onto an online Kubernetes GPU inference cluster, using the spare capacity of the online inference GPUs to increase the Teacher's throughput during knowledge distillation. Because Volcano can reschedule the Teacher model elastically, there is no need to worry about task failures caused by preemption from online instances during peak hours. The Teacher model can also be deployed onto fragmented cluster resources, or onto low-utilization resources such as K40 GPUs, making full use of the cluster's idle and fragmented capacity. In this talk, we explain in detail how to use Volcano to optimize elastic distillation training and present the corresponding benchmark data.
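
To make the Teacher-to-Student transfer concrete, here is a minimal NumPy sketch of a standard distillation loss: a temperature-softened KL term against the Teacher's outputs blended with a hard-label cross-entropy term. The temperature, blend weight, and toy logits are illustrative assumptions, not values or code from the talk.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend of soft-target (Teacher) loss and hard-label loss.

    student_logits, teacher_logits: (batch, num_classes)
    labels: (batch,) integer class ids
    """
    # Soft targets: KL divergence between temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12) -
                             np.log(p_student + 1e-12)), axis=-1)
    soft_loss = (temperature ** 2) * kl.mean()

    # Hard labels: standard cross-entropy against the ground truth.
    p_hard = softmax(student_logits)
    hard_loss = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12).mean()

    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Toy usage: a batch of 2 examples with 3 classes.
teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.5, 0.1]])
student = np.array([[2.0, 1.5, 0.3], [0.5, 2.0, 0.4]])
labels = np.array([0, 1])
print(distillation_loss(student, teacher, labels))
```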
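
The elastic deployment side is driven by a Volcano Job. Below is a hedged sketch of submitting Teacher inference replicas as a Volcano Job through the Kubernetes Python client; the image name, queue, namespace, and replica counts are placeholders, and the exact spec used by EDL in the talk may differ.

```python
# A minimal sketch of submitting a Volcano Job that runs Teacher-model
# inference replicas on spare GPU capacity. Image, queue, namespace, and
# replica counts are placeholders, not values from the talk.
from kubernetes import client, config

TEACHER_JOB = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "metadata": {"name": "teacher-inference"},
    "spec": {
        "schedulerName": "volcano",      # let Volcano schedule the pods
        "queue": "default",              # assumed queue name
        "minAvailable": 1,               # the job can start with one replica
        "tasks": [
            {
                "name": "teacher",
                "replicas": 4,           # scaled up/down as idle GPUs appear
                "template": {
                    "spec": {
                        "restartPolicy": "Never",
                        "containers": [
                            {
                                "name": "teacher",
                                "image": "example.com/teacher-serving:latest",  # placeholder
                                "resources": {"limits": {"nvidia.com/gpu": 1}},
                            }
                        ],
                    }
                },
            }
        ],
    },
}

def submit_teacher_job(namespace: str = "default") -> None:
    """Create the Volcano Job custom resource via the custom-objects API."""
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    api = client.CustomObjectsApi()
    api.create_namespaced_custom_object(
        group="batch.volcano.sh",
        version="v1alpha1",
        namespace=namespace,
        plural="jobs",
        body=TEACHER_JOB,
    )

if __name__ == "__main__":
    submit_teacher_job()
```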