ONNX June 2022 Community Meetup, 13 Jul 2022

From YouTube: High-Performance Inference for Video and Audio

Description

ORT provides the foundations for inference for Adobe's audio and video products (Premiere Pro, After Effects, Character Animator) on both Mac and Windows. In this talk, we'll discuss how ORT with the DML backend is essential in enabling high-throughput inference for audio and video workflows on Windows, and how we use ORT to enable speech to text on Mac.

Video workflows are unique because of the sheer amount of data they process; our customers frequently ingest high resolution video of which each frame may need to be passed through our models. Likewise, video workflows are inherently resource limited: the GPU is also being used for hardware decode and render at the same time.

ORT gives us the tools to build complex frameworks and workflows on top of so that we can deliver ML-based features while ensuring that we're able to provide the best experience for our customers.

Nikhil Kalra is a Sr. Computer Scientist at Adobe and is currently the engineering lead and architect for the Digital Video and Audio applied machine learning team.

A

Hi everyone. Today we're going to talk about high-performance machine learning for video and audio. My name is Nikhil Kalra, and I'm a senior software engineer at Adobe working on the applied machine learning team for video and audio.

A

Some of the products that my team works on include Premiere Pro, After Effects, Audition, and Character Animator. The machine learning workflows that we develop revolve around one central concept: we want to enable creators to spend more time on the edits and in the creative process than they spend in redundant and repetitive workflows.

A

Some of the features that we've worked on include Auto Reframe, Scene Edit Detection, Auto Color, auto captions, and transcription.

A

So what makes video workflows unique? The first is that they're inherently resource limited: the CPU and GPU concurrently power decode, render, encode, and machine learning inference. Our biggest bottleneck is transfer: that's PCI bus saturation on discrete cards and memory bus saturation on integrated cards.

A

The second is that our workflows are extremely data intensive. Common formats that our customers use include 4K at 60 frames per second or 8K at 30 frames per second; that's 6 gigabytes a second of uncompressed footage. The third is that our workflows are compute heavy, not necessarily in terms of GPU compute, but at minimum every machine learning workflow requires the render pipeline to spin up to either resize a frame or re-lay it out from a packed to a planar format.
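
As a rough check on that figure, the uncompressed data rate is width × height × frames per second × bytes per pixel. The talk does not state a pixel format, so the bytes-per-pixel values in this sketch are assumptions chosen for illustration, not the formats Adobe uses.

```python
def uncompressed_rate_bytes(width, height, fps, bytes_per_pixel):
    """Bytes per second of uncompressed video at the given format."""
    return width * height * fps * bytes_per_pixel

# Assumed pixel formats (not stated in the talk):
#   12 bytes/pixel = 3 channels x 32-bit float
#    6 bytes/pixel = 3 channels x 16-bit
rate_4k60 = uncompressed_rate_bytes(3840, 2160, 60, 12)
rate_8k30 = uncompressed_rate_bytes(7680, 4320, 30, 6)

print(f"4K60 @ 12 B/px: {rate_4k60 / 1e9:.2f} GB/s")
print(f"8K30 @  6 B/px: {rate_8k30 / 1e9:.2f} GB/s")
```

Under these assumed formats both cases come to about 5.97 GB/s, which is in line with the quoted 6 gigabytes a second. Other pixel formats would give different rates.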

A

That brings us over to our overall video processing pipeline for machine learning. First, we'll read the compressed or encoded media from disk and hand it off to our decode pipeline.

A

If the media is in a format that we have hardware decode support for, we'll hand it off to the GPU or a dedicated hardware decoder. If not, we'll fall back to the CPU and perform the decode in software. Next, we'll move on to the render portion of our pipeline, where we'll use a combination of GPU and CPU rendering in order to produce a frame for our machine learning engine.

A

Finally, we move on to the machine learning portion of our pipeline, where we take that frame and run it through our inference pipeline, running on both the GPU and the CPU.

A

All of these different pipelines mean that Premiere ends up having a lot of concurrent GPU activity running in the background on various different technologies. On the hardware decode side, we use a lot of vendor-specific frameworks, such as NVDEC, the Intel Media SDK, and the AMD media SDK.

A

On the render front, we use CUDA on NVIDIA GPUs and OpenCL on AMD and Intel GPUs, and for machine learning we use DirectX and DirectML to back our pipeline.

A

All this to say, our biggest bottleneck is transfer cost: going off of the GPU and off of the hardware into CPU memory so that all of these pipelines can work in parallel on the different underlying technologies.

A

So we use ONNX Runtime to power all of our machine learning workflows on Windows and a handful of our workflows on Mac, and we use the DirectML execution provider for GPU acceleration on Windows platforms.
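
For readers unfamiliar with execution providers, here is a minimal sketch of requesting the DirectML provider with a CPU fallback. It uses the ONNX Runtime Python binding for brevity; the talk does not say which language binding Adobe uses, and `pick_providers`, `create_session`, and the model path are hypothetical names for illustration.

```python
PREFERRED = ["DmlExecutionProvider", "CPUExecutionProvider"]

def pick_providers(available, preferred=PREFERRED):
    """Keep the preferred providers, in order, that this build actually offers."""
    return [p for p in preferred if p in available]

def create_session(model_path):
    # Imported lazily so pick_providers can be used without onnxruntime installed.
    import onnxruntime as ort

    options = ort.SessionOptions()
    # The DirectML provider documentation asks for memory-pattern
    # optimization to be disabled and for sequential execution.
    options.enable_mem_pattern = False
    options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, sess_options=options, providers=providers)

# Usage (requires a DirectML-enabled onnxruntime build and a real model file):
# session = create_session("model.onnx")  # hypothetical path
```

Listing the CPU provider last means the same code still runs, more slowly, on a machine where DirectML is unavailable.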

A

On Windows, the Creative Cloud apps target the entire Windows ecosystem as a single platform. This means two things. One, we must provide feature parity across all major IHVs that support Windows, with NVIDIA, Intel, and AMD being the big ones. Two, we must also provide functional parity across equivalent hardware from different IHVs.

A

In other words, our customers expect that equivalently classed NVIDIA hardware performs the same as equivalently classed AMD hardware, even though they may be using two different GPU technologies under the hood.

A

As a result, we chose the DirectML execution provider as our way to deploy ONNX models on all hardware in a common runtime.

A

It also makes use of vendor-specific hardware, such as Tensor Cores on NVIDIA GPUs. That's especially important for us because it leaves hardware free for other async compute workflows, including render and decode, that may be occurring simultaneously in the background.

On the performance front, ONNX Runtime has had a few historical bottlenecks that have greatly impacted our workflows.

A

The first is that there was no GPU input support. At minimum, we were required to perform a GPU-to-CPU readback in order to pass a frame buffer into ONNX Runtime for inference.

A

The second is that batched workflows inside of ONNX Runtime require a contiguous memory buffer, which forces us to make additional memory copies as we copy from the locations in which our frames were decoded or rendered into this contiguous buffer for inference.
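
To make the cost of that requirement concrete, here is a small CPU-side sketch: frames that live in separate allocations have to be copied one by one into a single contiguous buffer before they can be submitted as a batch. The `pack_batch` helper and the tiny stand-in frames are hypothetical.

```python
def pack_batch(frames):
    """Copy same-sized frames into one contiguous buffer, one after another."""
    frame_size = len(frames[0])
    if any(len(f) != frame_size for f in frames):
        raise ValueError("all frames in a batch must have the same size")
    batch = bytearray(frame_size * len(frames))
    for i, frame in enumerate(frames):
        # This per-frame copy is the extra work the contiguity requirement adds.
        batch[i * frame_size:(i + 1) * frame_size] = frame
    return batch

# Three tiny stand-in "frames", each in its own allocation.
frames = [bytes([n]) * 4 for n in (1, 2, 3)]
batch = pack_batch(frames)
print(len(batch))
```

With real 4K frames, each of these copies moves many megabytes, which is why where and when the copy happens matters.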

A

Now, with newer releases of ORT and the DirectML execution provider, we have a path forward for enabling performant workflows that don't exhibit these same bottlenecks.

A

New bindings with DML enable us to pass GPU frames as inputs into the inference pipeline, meaning that we can use CUDA-to-DirectX 12 and OpenCL-to-DirectX 12 interop in order to create these DML resources.

A

At the same time, we can now assemble batches asynchronously on the GPU using DirectX 12 copy queues. This, one, increases the available bus bandwidth for the application, because we're able to control the copy, and two, decreases the latency that we experience in our inference pipeline by allowing us to incur that transfer cost asynchronously.

A

Additionally, and this is more forward-looking, I/O binding support with DML will enable us to use output buffers and directly consume them in a DX12-based pipeline, enabling the use of machine learning-based models for GPU workflows.

A

We also hope to use I/O bindings with DML outputs in order to reduce the latency that we experience on the output side of our inference pipeline, by performing CPU readbacks, if necessary, asynchronously using DX12 copy queues.

A

We've also made additional performance optimizations by building a framework on top of ONNX Runtime to allow efficient interop with everything else that we're running in the application.

A

For us, that means, first, that render and decode always have the highest priority, because we need to produce frames for the inference engine. Second, we back off of inference whenever the GPU is busy or starved for resources, for example, if we're in a low GPU memory situation. This is increasingly common as our customers gravitate to using higher frame rate and higher resolution media for their workflows.

A

The third is that we also try to assemble our inference requests into batched workflows to reduce resource contention with the GPU and minimize driver overhead.
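
The three policies just described (render and decode first, back off when the GPU is busy, batch requests) can be sketched as a small scheduler. This is an illustration under assumptions, not Adobe's framework: `BatchingScheduler` is a hypothetical name, and the `gpu_busy` flag stands in for whatever signal the real application uses, which the talk does not describe.

```python
class BatchingScheduler:
    """Collects inference requests and releases them only as full batches,
    and only when the GPU is not reported busy."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []

    def submit(self, frame):
        self.pending.append(frame)

    def poll(self, gpu_busy):
        """Return the next batch to run, or None to keep waiting."""
        if gpu_busy or len(self.pending) < self.batch_size:
            return None
        batch = self.pending[:self.batch_size]
        self.pending = self.pending[self.batch_size:]
        return batch

    def flush(self):
        """Return whatever is left (for example at end of clip)."""
        batch, self.pending = self.pending, []
        return batch

sched = BatchingScheduler(batch_size=4)
for frame_id in range(6):
    sched.submit(frame_id)

deferred = sched.poll(gpu_busy=True)   # render/decode has priority: back off
first = sched.poll(gpu_busy=False)     # GPU free and a full batch is ready
second = sched.poll(gpu_busy=False)    # only 2 frames pending: keep waiting
rest = sched.flush()                   # drain the short final batch
```

Returning `None` rather than a partial batch is the design choice that keeps the GPU free for other work until a full batch is worth dispatching.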

A

This is what that looks like in practice.

A

By dispatching inference requests sequentially, we incur GPU driver overhead not only on the input side but also on the output side, and we incur that repeatedly over the size of our entire inference pool. On the other hand, by switching to a batched workflow, we're able to amortize the cost of GPU driver overhead over the size of the entire batch. More importantly, we're also able to delay the execution of our inference requests until we've assembled this batch, meaning that we can leave the GPU free for other work that's also occurring, such as render and decode.
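
The amortization argument can be written as a simple cost model: if each dispatch carries a fixed driver overhead, sequential dispatch pays it once per request, while batching pays it once per batch. The numbers here are hypothetical, and the model assumes the per-dispatch overhead does not grow with batch size.

```python
import math

def sequential_overhead(n_requests, overhead_per_dispatch):
    """Total fixed overhead when every request is its own dispatch."""
    return n_requests * overhead_per_dispatch

def batched_overhead(n_requests, batch_size, overhead_per_dispatch):
    """Total fixed overhead when requests are grouped into batches."""
    return math.ceil(n_requests / batch_size) * overhead_per_dispatch

# Hypothetical numbers for illustration only.
N = 3600           # inference requests
BATCH = 32         # frames per batch
OVERHEAD_MS = 0.5  # fixed driver cost per dispatch (input plus output side)

seq = sequential_overhead(N, OVERHEAD_MS)
bat = batched_overhead(N, BATCH, OVERHEAD_MS)
print(f"sequential: {seq} ms, batched: {bat} ms")
```

The model covers only the fixed overhead; the per-frame compute cost is the same in both cases and is left out.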

A

Let's take a look at Scene Edit Detection, for example. This workflow requires us to decode and render every single frame in a clip and then run each of those frames through a network to determine if there is a cut between two frames.

A

Scene Edit Detection is an extremely resource-intensive workflow. For a 1-minute 4K clip at 60 frames per second, we dispatch about 3,600 inference requests to the engine on a total of 6 gigabytes of video data. We're able to run that entire pipeline end to end, from decode to inference, in about 10 seconds, or 6 times real time, on modern high-end GPUs.
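
The figures quoted here follow from simple arithmetic, worked through below. The only assumption is that "6 gigabytes" means decimal gigabytes; the derived throughput values are implied by the quoted numbers rather than stated in the talk.

```python
clip_seconds = 60        # 1-minute clip
fps = 60                 # frames per second
processing_seconds = 10  # quoted end-to-end time
total_bytes = 6e9        # "6 gigabytes", taken as decimal GB (assumption)

frames = clip_seconds * fps                          # one inference request per frame
speedup = clip_seconds / processing_seconds          # multiple of real time
frames_per_second = frames / processing_seconds      # implied pipeline throughput
bytes_per_second = total_bytes / processing_seconds  # implied data throughput

print(frames, speedup, frames_per_second, bytes_per_second / 1e9)
```

That gives 3,600 requests and 6 times real time, matching the talk, and implies roughly 360 frames per second and 0.6 GB per second through the pipeline.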

A

None of this would have been possible without the performance enablements provided by ORT and the DirectML execution provider.

A

So, looking forward, there are a few more changes that we'd like to make. First, we want to leverage ONNX Runtime to enable machine learning-based workflows during GPU render. As more of our GPU compute transitions to DX12, this will become even easier, because we can minimize the cost by removing the overhead that's currently associated with OpenCL and CUDA to DX12 interop.

A

But more importantly, we want to transition our pipeline to look more like the pipeline on the right, where we keep as much as possible on the GPU or GPU-addressable hardware and minimize our transfers to the CPU unless they are absolutely necessary or required for user-facing features.

A

That's all I have for today. Thanks for watching.