Description
Hugging Face has democratized state-of-the-art machine learning with Transformers and the Hugging Face Hub, but deploying these large and complex models into production with good performance remains a challenge for most organizations. In this talk, Jeff Boudier will talk you through the latest solutions from Hugging Face to deploy models at scale with great performance, leveraging ONNX and ONNX Runtime.
Jeff Boudier builds products at Hugging Face, creator of 🤗 Transformers, the leading open-source ML library. Previously, Jeff was a co-founder of Stupeflix, acquired by GoPro, where he served as director of Product Management, Product Marketing, Business Development and Corporate Development.
All right, I'm here. Hi, I'm Jeff from Hugging Face, on the product team over there. I'm super happy to be here. Thank you so much, Passant, for inviting me to talk to you today. What I want to talk about is how, using the Optimum library and its integration with ONNX Runtime, all the goodness that Ryan was just telling you about can easily be applied to Transformers models.

All right, so let's get started. It's a short talk and I'm going to breeze through everything, so if you have questions and we don't have time for questions, don't hesitate to ask me directly: I'm at jeff at huggingface.com.
Through this talk, I'm first going to take a step back and bring you into the Transformers world and how we got to where we are today, and then talk about Optimum and why we went out to build a specific library focused on accelerating Transformers models. But first, a little bit of a trivia quiz: what do Tesla, Gmail, Facebook and Bing all have in common?
So what are we trying to do at Hugging Face? Well, we're trying to make the power of those transformer models accessible to every single company in the world, through readily accessible pre-trained models and through tools to make use of it all.
For us, the initial conception starts with the advent of transfer learning and the "Attention Is All You Need" paper. This is what really changed the field of machine learning.
Our impact is greater than what you would expect from a team of now 150 people: we really represent the aggregate contribution of over 1,300 open-source contributors to our libraries, and of course we provide access to over 50,000 fine-tuned, pre-trained models, for every single machine learning task you can imagine and for every single language you can imagine, all contributed by our community.
And that focus on community, on collaboration, on making machine learning open and collaborative, has really fueled our traction. Today, Transformers is the reference toolkit to make practical use of the attention-based mechanisms of transformer models in every modality. And so now, Optimum. Why Optimum?
To take these models to real-time use cases, to something that you can use in a cost-effective way, you need to decrease the latency through three different layers of complexity: you need to work on your model, editing the graph; you need to work on accelerating the inference; and then you need to work on the hardware-specific optimizations, to get it all down to millisecond levels.
So Optimum is that bridge: the bridge between the Transformers library and the hardware, and peak hardware performance. In the same way that with Transformers we made transformer models accessible by offering a high level of abstraction and easy-to-use APIs, we want to do the same thing for hardware acceleration. And so, with Optimum, we want to offer the reference toolkit for hardware acceleration, offering these high-level APIs dedicated to production performance.
So let's focus on the ONNX Runtime package within Optimum. You can already use Optimum today to accelerate the training of your transformer models in a very easy way, and to accelerate the inference of your Transformers models in a very easy way. For training, we introduced a new trainer class called ORTTrainer: if you're familiar with the Transformers library, you're familiar with the Trainer class, and it's really an easy two-lines-of-code switch to go from the Transformers Trainer to the ORTTrainer to take advantage of all the acceleration that it provides.
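To make that switch concrete, here is a minimal sketch, assuming a recent version of Optimum and that your datasets are already prepared; exact class names and arguments may differ between Optimum releases:

```python
# Minimal sketch: swapping the Transformers Trainer for Optimum's ORTTrainer.
from transformers import AutoModelForSequenceClassification
from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments
# was: from transformers import Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
# train_dataset and eval_dataset are assumed to be prepared elsewhere.

training_args = ORTTrainingArguments(   # was: TrainingArguments(...)
    output_dir="out",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = ORTTrainer(                   # was: Trainer(...)
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```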
One of the main benefits is that through the ORTTrainer you get native integration of DeepSpeed, and that produces amazing acceleration results. In fact, I'm sharing here some preliminary figures shared by Ashwini Khade at Microsoft, benchmarking Optimum with ONNX Runtime and showing that, through these very few lines of code changes, you very easily get 10 to 40 percent acceleration in the throughput of your training, depending on the configuration and on which stage of DeepSpeed you're going to be using. So it's really powerful, but very simple.
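The DeepSpeed part is driven by the training arguments rather than by new code. A sketch, assuming the deepspeed package is installed and that you have written a DeepSpeed configuration file (the file name below is hypothetical):

```python
# Sketch: pointing the ORT training arguments at a DeepSpeed config file.
# "ds_config.json" is a hypothetical config (e.g. ZeRO stage 1 or stage 2);
# ORTTrainingArguments inherits this option from transformers.TrainingArguments.
from optimum.onnxruntime import ORTTrainingArguments

training_args = ORTTrainingArguments(
    output_dir="out",
    fp16=True,
    deepspeed="ds_config.json",
)
```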
Then, if we talk about inference, there are three main classes that I want to tell you about. The first one is ORTOptimizer: it's a simple way to simplify the graph of your model.
A
You
can
simplify
the
graph
from
your
model
by
specifying
just
the
the
the
pre-trained
model
and
the
task,
and
what
you
get
is
a
set
of
basic
optimization
like
constant
folding,
like
operator,
fusion,
that
are
going
to
be
applied
across
the
board,
and
you
also
get
advanced
optimization
that
is
specific
to
the
execution
provider
that
you
are
targeting,
whether
cp
or
cuda.
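As a rough sketch of what that looks like (the checkpoint name is just an example, and the ORTOptimizer API has changed a bit across Optimum releases, so treat this as illustrative):

```python
# Sketch: graph optimization with ORTOptimizer.
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# Export a fine-tuned Transformers checkpoint to ONNX.
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", export=True
)

optimizer = ORTOptimizer.from_pretrained(model)
# optimization_level=99 enables the graph optimizations (constant folding,
# operator fusion, ...); optimize_for_gpu=True would target the CUDA
# execution provider instead of CPU.
optimization_config = OptimizationConfig(optimization_level=99, optimize_for_gpu=False)
optimizer.optimize(save_dir="onnx-optimized", optimization_config=optimization_config)
```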
Once you have an optimized graph, you can optimize the weights. You can optimize the weights by quantizing the model, and you can do so very easily using the new ORTQuantizer class. With the ORTQuantizer class, you have access to both dynamic quantization and static quantization.
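Here is a comparable sketch for dynamic quantization, under the same caveats (static quantization would additionally require a calibration dataset):

```python
# Sketch: dynamic quantization with ORTQuantizer.
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", export=True
)

quantizer = ORTQuantizer.from_pretrained(model)
# A dynamic quantization recipe targeting AVX512-VNNI CPUs; pass is_static=True
# (plus a calibration dataset) to use static quantization instead.
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="onnx-quantized", quantization_config=qconfig)
```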
Well, with Optimum you can do the same for ONNX Runtime, and benefit from all the hardware acceleration, by switching your AutoModelForXxx class to the corresponding ORTModelForXxx class. And so again, it's a very easy change to make to benefit from all the optimizations that ONNX Runtime provides, and it's something that I'm super excited about, and that the community is super excited about.
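In code, the switch looks roughly like this; the model id is just an example, and older Optimum releases use `from_transformers=True` instead of `export=True`:

```python
# Sketch: swapping an AutoModel class for its ORTModel counterpart and using it
# in a regular Transformers pipeline.
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification
# was: from transformers import AutoModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
# was: AutoModelForSequenceClassification.from_pretrained(model_id)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Optimum makes ONNX Runtime easy to use with Transformers."))
```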
You can find it on our blog at hf.co/blog; it's called "Optimum Inference", and in there you have the whole user story: starting from a pre-trained model that is fine-tuned for question answering, then exporting it to ONNX, applying the optimization, applying the quantization, and using the ORTModelForQuestionAnswering class to get accelerated performance. You're getting a 44 percent throughput increase, or latency decrease, while conserving 99.6 percent of the original model's accuracy.
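Putting the pieces together, that question-answering story looks roughly like the sketch below; the checkpoint, directory names and `file_name` arguments are illustrative, and the blog post itself is the authoritative walkthrough:

```python
# Illustrative end-to-end sketch: export, optimize, quantize, then serve a QA model.
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering, ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig, OptimizationConfig

model_id = "deepset/roberta-base-squad2"  # example QA fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 1. Export the fine-tuned model to ONNX.
model = ORTModelForQuestionAnswering.from_pretrained(model_id, export=True)

# 2. Apply graph optimizations.
ORTOptimizer.from_pretrained(model).optimize(
    save_dir="qa-optimized",
    optimization_config=OptimizationConfig(optimization_level=99),
)

# 3. Dynamically quantize the optimized graph.
quantizer = ORTQuantizer.from_pretrained("qa-optimized", file_name="model_optimized.onnx")
quantizer.quantize(
    save_dir="qa-quantized",
    quantization_config=AutoQuantizationConfig.avx512_vnni(is_static=False),
)

# 4. Load the quantized model and run accelerated inference.
qa_model = ORTModelForQuestionAnswering.from_pretrained(
    "qa-quantized", file_name="model_quantized.onnx"
)
qa = pipeline("question-answering", model=qa_model, tokenizer=tokenizer)
print(qa(
    question="What does Optimum do?",
    context="Optimum accelerates Transformers models with ONNX Runtime.",
))
```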
So that was what I wanted to talk to you about. I invite you to check out and give a star to the Optimum library; it's on our GitHub at github.com/huggingface/optimum. Thank you so much.