From YouTube: What's New in ONNX Runtime
Description
This talk shares highlights of the ONNX Runtime 1.10-1.12 releases, including notable performance improvements, new features, and expanded platform support such as mobile and web.
Ryan Hill has been with the AI Frameworks team for the past 4 years, where he has mostly worked on operator kernels, C APIs, and dynamically loading execution providers. Prior to this he worked on the Office PowerPoint team, where his most widely seen work is many of the slideshow slide transitions. For fun he likes trying to use the latest C++ features and hitting internal compiler errors.
So what is ONNX Runtime? I figure many of you will already be familiar, but for those who are not, it is a runtime for ONNX models. It is cross-platform, can target multiple CPU architectures on all major platforms, and has language bindings for many popular programming languages. Currently we're doing releases roughly quarterly.
One new feature is that we added APIs to allow op kernels to be called directly from outside of a model Run call, so that they can be used like a math library. We had users adding custom ops that extended an existing op: they would copy the internal code for the op and then add a relatively small change around it. Now they can just call the ops directly.
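To make that concrete, here is a minimal sketch in Python of using a single ONNX op like a math-library call. The direct kernel-invocation APIs described here live in the native layer, so this sketch instead shows the old workaround those APIs remove, wrapping one op in a throwaway model; all names in it are illustrative.

    # Minimal sketch: evaluate one ONNX op (Erf) like a math-library call by
    # wrapping it in a single-node model. Assumes the `onnx` and `onnxruntime`
    # Python packages; this is the old workaround, not the new direct APIs.
    import numpy as np
    from onnx import TensorProto, helper
    import onnxruntime as ort

    node = helper.make_node("Erf", inputs=["x"], outputs=["y"])
    graph = helper.make_graph(
        [node], "single_op",
        [helper.make_tensor_value_info("x", TensorProto.FLOAT, [None])],
        [helper.make_tensor_value_info("y", TensorProto.FLOAT, [None])],
    )
    model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])

    sess = ort.InferenceSession(model.SerializeToString(),
                                providers=["CPUExecutionProvider"])
    print(sess.run(None, {"x": np.array([0.0, 0.5, 1.0], dtype=np.float32)})[0])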
In 1.10, we added a transpose optimizer that pushes transpose ops through the graph and cancels them out, significantly improving performance for models requiring layout transformation. While investigating some performance numbers, we noticed we could optimize away some heap allocations by applying a small-size optimization in the op kernel code that handles shapes, and in related code that was using standard vector classes sized for shapes.
To give an idea of the improvement, one test showed a drop from 70 heap allocations down to just six on each run, and another team saw their performance improve by around 100 microseconds, going from 479 down to 360. Quantization has also seen some nice improvements: you can see here a bunch of common CNN-based models running up to 50 percent faster in the latest version.
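As a pointer for anyone trying this, here is a minimal sketch of dynamic quantization with ONNX Runtime's Python quantization tooling; both file paths are placeholders.

    # Minimal sketch: dynamically quantize a float32 model's weights to int8.
    # Assumes the `onnxruntime` Python package; both paths are placeholders.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input="model.onnx",         # float32 source model
        model_output="model.quant.onnx",  # quantized output model
        weight_type=QuantType.QInt8,      # store weights as signed 8-bit
    )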
Execution providers are how we enable ONNX Runtime to perform its best on today's various hardware possibilities. Some providers are just ONNX Runtime code implementing the op kernels with a hardware API like CUDA, while others use a software library that already implements the ops optimally for particular hardware, like the TensorRT library, which uses CUDA, or OpenVINO, which targets various Intel hardware.
So why not just use a library like TensorRT directly? Well, for running models, ONNX Runtime offers a single API that gives you the flexibility to run on almost any target hardware optimally, with very low overhead. And ONNX Runtime has a complete CPU implementation, so if something isn't supported in a provider, it will fall back to the CPU version.
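From the Python API, that priority-plus-fallback behavior looks roughly like the sketch below; the provider names are real ONNX Runtime identifiers, but "model.onnx" is a placeholder, and filtering against the available providers is just a defensive choice.

    # Minimal sketch: request providers in priority order; any op a provider
    # can't handle falls back down the list toward the complete CPU version.
    import onnxruntime as ort

    available = ort.get_available_providers()  # providers in this build
    preferred = [
        "TensorrtExecutionProvider",  # TensorRT library (uses CUDA)
        "CUDAExecutionProvider",      # ONNX Runtime's own CUDA kernels
        "CPUExecutionProvider",       # complete fallback implementation
    ]
    sess = ort.InferenceSession(
        "model.onnx",  # placeholder path
        providers=[p for p in preferred if p in available],
    )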
In developing these providers, our engineers work together with the engineers from these outside companies to ensure the best results and to build an ongoing relationship where everyone benefits. As these hardware APIs and libraries continue to be updated, we continue to update our providers along with them, to ensure that we maintain top performance and support for the latest hardware.
We saw the need for a performant, low-footprint model inferencing solution on mobile devices, and released the ONNX Runtime Mobile packages about a year ago. Since then, we've continued to invest in these platforms to improve usability for mobile developers. For example, we can now do NHWC conversion at runtime. This is not mobile specific, but we first ran into it on mobile; it's needed when you have a kernel implementation that prefers a specific layout, for example when running on ARM or using NNAPI.
The converter is aware of the layout-sensitive operators and can internally replace nodes that use them with an NHWC version. It does this by wrapping the appropriate nodes with transpose operators, and then, thanks to our transpose optimizer, it removes any transposes that effectively cancel each other out.
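You can watch the optimizer do this cancellation yourself. The following is a minimal sketch: it builds a graph with two inverse Transpose nodes around a Relu, asks ONNX Runtime to write out the graph it actually runs, and checks which nodes survive; every name in it is illustrative.

    # Minimal sketch: two inverse Transposes around a Relu should cancel.
    # Assumes the `onnx` and `onnxruntime` Python packages.
    import onnx
    from onnx import TensorProto, helper
    import onnxruntime as ort

    to_nhwc = helper.make_node("Transpose", ["x"], ["t1"], perm=[0, 2, 3, 1])
    to_nchw = helper.make_node("Transpose", ["t1"], ["t2"], perm=[0, 3, 1, 2])
    relu = helper.make_node("Relu", ["t2"], ["y"])

    graph = helper.make_graph(
        [to_nhwc, to_nchw, relu], "cancel_demo",
        [helper.make_tensor_value_info("x", TensorProto.FLOAT, [1, 3, 8, 8])],
        [helper.make_tensor_value_info("y", TensorProto.FLOAT, [1, 3, 8, 8])],
    )
    model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])

    so = ort.SessionOptions()
    so.optimized_model_filepath = "optimized.onnx"  # dump the final graph
    ort.InferenceSession(model.SerializeToString(), so,
                         providers=["CPUExecutionProvider"])

    print([n.op_type for n in onnx.load("optimized.onnx").graph.node])
    # Expect the Transposes to be gone, leaving just ["Relu"]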
We tested this using a production model that exercises our new XNNPACK support, which uses NHWC. The input was also NHWC, since the model came from TensorFlow, and all of the added transpose ops canceled out, including the initial input transpose, which is ideally what happens.

For C# users: they can use ONNX Runtime in cross-platform apps, including apps targeting Android and iOS. We currently support Xamarin and are adding .NET 6 MAUI support in the next release, fingers crossed. This will be a very interesting thing for C# developers on mobile platforms.
We also added Android and iOS packages with the full ONNX Runtime builds. This will make it simpler for users getting started with ONNX Runtime in mobile scenarios: they can use any ONNX model, and all of the opsets, operators, and types that ONNX Runtime supports are included. It has a larger binary size, but this probably isn't an issue for most users, as it's still relatively small. To give an idea, a minimal build that only includes the necessary ops might be around 2.5 megabytes, while the full build is around 8 megabytes.
ONNX Runtime Web is one of our newest offerings, as we've seen growing interest in running inference directly in the browser. We used to have a side project called ONNX.js, but this wasn't ideal: we needed to maintain two separate versions of the code, a JavaScript one and the primary C++ one, and we needed to ensure that their behavior was identical.
Now we just compile the main C++ code into WebAssembly, so ONNX Runtime Web is backed by the same core code base. As a bonus, it's faster, it uses less memory, and the resulting binary is smaller; it's basically like having another target architecture to compile for. Another big achievement for all of our JavaScript users is that we introduced a JavaScript library called onnxruntime-common, which has multiple backend implementations (web, Node.js, and React Native) behind one single common API. This allows the same JavaScript code to run on all the main web platforms.
Challenges with data pre- and post-processing have been a recurring hurdle, and we also recognized that custom operators used with ONNX Runtime that weren't officially in the ONNX spec could be shared. This led us to create ONNX Runtime Extensions, a library of shareable custom ops that can be built and run alongside the core ONNX Runtime operators. These are currently focused on model pre- and post-processing work, so the user no longer has to implement it in an outside language.
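For Python users, wiring the extensions library into a session is a one-liner on the session options. A minimal sketch, assuming the onnxruntime-extensions package is installed and using a placeholder model path:

    # Minimal sketch: register ONNX Runtime Extensions' custom ops, then load
    # a model that uses them. The model path is a placeholder.
    import onnxruntime as ort
    from onnxruntime_extensions import get_library_path

    so = ort.SessionOptions()
    so.register_custom_ops_library(get_library_path())  # load shared custom ops

    sess = ort.InferenceSession("model_with_custom_ops.onnx", so,
                                providers=["CPUExecutionProvider"])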