ONNX March 2021 Community Meetup, 22 Jul 2022

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: ONNX20210324 V12 OnnxRuntimeTraining

Description

No description was provided for this meeting.
If this is YOUR meeting, an easy way to fix this is to add a description to your video, wherever mtngs.io found it (probably YouTube).

A

Hello, everyone, my name is palm. I am a senior software engineer working for microsoft, ai platform team, I'm going to share some updates around only quantum training work, as we shared previously only sometime has been designed for addressing few problems, including reducing production model latencies, making making it possible to deploy python trained models with c sharp or other programming languages.

A

There are also some needs to run the model on different kinds of devices, for example the mobile devices, as my colleagues, tom and scott introduced earlier in the online time mobile presentation, all those requirements are from influencing scenarios and only spend time solved them pretty. Well, once once we extend to the training area, we see increasing demands to change the large model efficiently.

A

The only extent runtime has been proven to be a highly performant influence engine with cross-platform support and extensible architect for either custom operators or hardware. Accelerators training feature is introduced in the past months and now is still in the previous stage and showing promising results on some of the internal models.

A

As of our design principles, we would consider online time would be a generic framework for training. Deep neural networks, similar with influence support, we would allow developers to extend with customer operators for trainings the transformer models fund, most of its applications in the field of nlp, for example, the tasks of machine translations time series prediction.

A

It has also been applied to image processing showing a competing result compared with the convolutional neural networks. On the other hand, training transformer based architects can be very expensive, especially for long sentences.

A

So those kind of models are becoming a good starting point for us to focus. Last but not least, customer demands would be super important for the production direction too.

A

The charts on this slide shows the orth training architect. As we see, data centers will still be able to stick to original trainer code built by pytorch or other frameworks. Those models would be converted to onyx model representing the model structure. We usually call it a forward graph in the training scenarios rt as a backend would take. The onyx graph in then handle the complexities, including building a training graph.

A

Do the graph, optimizations and finally run the graph efficiently and the backend is also a good place to incorporate innovations, including msr, deep speed, parasail and maxine's those kind of techniques so far, ort has the capability to run training using both data parallelism and horizontal model parallelism the sample of how a python model runs. Training with ort the flow is a bit out of date when we are working on new design recently, while most of the concepts remain to be still valid.

A

Roughly saying the python model would be converted to onyx graph. First afterwards, ot built the training graph, including mixed precision, setting up auto deep graph building graph, optimization finally set up distributed training before scaling out to multi-gpus or multi-nodes, more specifically, orts supply a loss function to the photograph, as the first step builds the gridding graph step by step, removing unnecessary computations composed adam optimizer. Finally, we got a fully training graph.

A

Let's see some results about ort chain. Acceleration.

A

Here is some proof numbers for some cases. Like the integration with office services, we could say 1.4x highest approach using rt. That means for one single training recipe ort. Could reduce almost half of the time to get the training completed beyond the listed board? Gpd2 robota xl models.

A

We also see promising speed up in our development for even larger models: 1.5 billion 2 billion 2.5 billion and even larger sizes, compared with some open source high performance sport training plantations using sim training recipes here ort brings over 22 to 25 percent speed up on the support another highlight here is the ort could run 2x of both sides and pythons on one single gpu.

A

That's the reason you could see. Half of the accumulation status needed in the in ort training than pythog beyond one single nodes. Ort scale were up well to 512, gpus that we observed in the large e100 clusters.

A

Some of the performance comes from the cooler kernel improvements we initially date for the birlage models and all the optimizations are proven to be reusable and applicable for other transformer-based models in our cases like robota, gpd2 and but yeah. We also provide good coverage for different group-based optimizations, essentially kernel fusions in place, computations and so on.

A

Memory efficiency also plays an important role on the beta performance with buffer reusing minimizing the memory. Fragmentations ot could run two axle precise than pythons for prolatch. As we mentioned earlier, similar observations applied for the gpd2 medium training. We could train it on 16 gigabytes v100, while pathogens hit outer memory issues.

A

Definitely this memory saving would benefit the large model trainings as well. The last. The last thing I want to share is how existing models come about ort for the training yeah. This is where we called it. The front-end integration.

A

The way we provide currently required fuel line of code change, including model descriptors, multi-trainer construction, replace training loops.

A

This sometimes brings quite some big overhead for the data scientists, so we have been working on a new approach called ot module to make the integration easier, as shown in the slides, only one line of changes needed in the new design, so you could always check our team rapport for the latest progress and announcements about rt module development.

A

If you have interest to try, our teaching out here are some links for any feedbacks contact, ort teams on the github. Please, thanks for your time, bye.