Description
Hugging Face has democratized state-of-the-art machine learning with Transformers and the Hugging Face Hub, but deploying these large and complex models into production with good performance remains a challenge for most organizations. In this talk, Jeff Boudier will talk you through the latest solutions from Hugging Face to deploy models at scale with great performance, leveraging ONNX and ONNX Runtime.
Jeff Boudier builds products at Hugging Face, creator of 🤗 Transformers, the leading open-source ML library. Previously, Jeff was a co-founder of Stupeflix, acquired by GoPro, where he served as director of Product Management, Product Marketing, Business Development and Corporate Development.
All right, I'm here. Hi, I'm Jeff from Hugging Face, on the product team over there. I'm super happy to be here. Thank you so much, Passant, for inviting me to talk to you today. What I want to talk about is how, using the Optimum library and its integration with ONNX Runtime, all the goodness that Ryan was just telling you about can easily be applied to Transformers models.

All right, so let's get started. It's a short talk and I'm going to breeze through everything, so if you have questions and we don't have time for questions, don't hesitate to ask me directly: I'm at jeff at huggingface.com.
Through this talk, I'm first going to take a step back and bring you into the Transformers world and how we got to where we are today, and then talk about Optimum and why we went out to build a specific library focused on accelerating Transformers models. But first, a little bit of a trivia quiz: what do Tesla, Gmail, Facebook and Bing all have in common?
So what are we trying to do at Hugging Face? Well, we're trying to make the power of those transformer models accessible to every single company in the world, through readily accessible pre-trained models and through tools to make use of it all.
For us, the initial conception starts with the advent of transfer learning and the "Attention Is All You Need" paper. This is what really changed the field of machine learning.
Our impact is greater than what you would expect from a team of now 150 people: we really represent the aggregate contribution of over 1,300 open-source contributors to our libraries, and of course we provide access to over 50,000 fine-tuned, pre-trained models, for every single machine learning task you can imagine and for every single language you can imagine, all contributed by our community.
And that focus on community, on collaboration, on making machine learning open and collaborative, has really fueled our traction. Today, Transformers is the reference toolkit to make practical use of the attention-based mechanisms of transformer models in every modality. And so now, Optimum. Why Optimum?
To take these models to real-time use cases, to something that you can use in a cost-effective way, you need to decrease the latency through three different layers of complexity: you need to work on your model, editing the graph; you need to work on accelerating the inference; and then you need to work on the hardware-specific optimizations, to get it all down to millisecond levels.
So Optimum is that bridge: the bridge between the Transformers library and the hardware, and peak hardware performance. In the same way that with Transformers we made transformer models accessible by offering a high level of abstraction and easy-to-use APIs, we want to do the same thing for hardware acceleration. And so, with Optimum, we want to offer the reference toolkit for hardware acceleration, offering these high-level APIs dedicated to production performance.
So let's focus on the ONNX Runtime package within Optimum. You can already use Optimum today to accelerate the training of your transformer models in a very easy way, and to accelerate the inference of your Transformers models in a very easy way. For training, we introduced a new trainer class called ORTTrainer: if you're familiar with the Transformers library, you're familiar with the Trainer class, and it's really an easy two-lines-of-code switch to go from the Transformers Trainer to the ORTTrainer to take advantage of all the acceleration that it provides.
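To make that switch concrete, here is a minimal sketch, assuming a recent version of Optimum and that your datasets are already prepared; exact class names and arguments may differ between Optimum releases:

```python
# Minimal sketch: swapping the Transformers Trainer for Optimum's ORTTrainer.
from transformers import AutoModelForSequenceClassification
from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments
# was: from transformers import Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
# train_dataset and eval_dataset are assumed to be prepared elsewhere.

training_args = ORTTrainingArguments(   # was: TrainingArguments(...)
    output_dir="out",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = ORTTrainer(                   # was: Trainer(...)
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```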
One of the main benefits is that through the ORTTrainer you get native integration of DeepSpeed, and that produces amazing acceleration results. In fact, I'm sharing here some preliminary figures shared by Ashwini Khade at Microsoft, benchmarking Optimum with ONNX Runtime and showing that, through these very few lines of code changes, you very easily get 10 to 40 percent acceleration in the throughput of your training, depending on the configuration and on which stage of DeepSpeed you're going to be using. So it's really powerful, but very simple.
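The DeepSpeed part is driven by the training arguments rather than by new code. A sketch, assuming the deepspeed package is installed and that you have written a DeepSpeed configuration file (the file name below is hypothetical):

```python
# Sketch: pointing the ORT training arguments at a DeepSpeed config file.
# "ds_config.json" is a hypothetical config (e.g. ZeRO stage 1 or stage 2);
# ORTTrainingArguments inherits this option from transformers.TrainingArguments.
from optimum.onnxruntime import ORTTrainingArguments

training_args = ORTTrainingArguments(
    output_dir="out",
    fp16=True,
    deepspeed="ds_config.json",
)
```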
Then, if we talk about inference, there are three main classes that I want to tell you about. The first one is ORTOptimizer: it's a simple way to simplify the graph of your model.
A
You
can
simplify
the
graph
from
your
model
by
specifying
just
the
the
the
pre-trained
model
and
the
task,
and
what
you
get
is
a
set
of
basic
optimization
like
constant
folding,
like
operator,
fusion,
that
are
going
to
be
applied
across
the
board,
and
you
also
get
advanced
optimization
that
is
specific
to
the
execution
provider
that
you
are
targeting,
whether
cp
or
cuda.
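As a rough sketch of what that looks like (the checkpoint name is just an example, and the ORTOptimizer API has changed a bit across Optimum releases, so treat this as illustrative):

```python
# Sketch: graph optimization with ORTOptimizer.
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# Export a fine-tuned Transformers checkpoint to ONNX.
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", export=True
)

optimizer = ORTOptimizer.from_pretrained(model)
# optimization_level=99 enables the graph optimizations (constant folding,
# operator fusion, ...); optimize_for_gpu=True would target the CUDA
# execution provider instead of CPU.
optimization_config = OptimizationConfig(optimization_level=99, optimize_for_gpu=False)
optimizer.optimize(save_dir="onnx-optimized", optimization_config=optimization_config)
```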
Once you have an optimized graph, you can optimize the weights. You can optimize the weights by quantizing the model, and you can do so very easily using the new ORTQuantizer class. With the ORTQuantizer class, you have access to both dynamic quantization and static quantization.
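Here is a comparable sketch for dynamic quantization, under the same caveats (static quantization would additionally require a calibration dataset):

```python
# Sketch: dynamic quantization with ORTQuantizer.
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", export=True
)

quantizer = ORTQuantizer.from_pretrained(model)
# A dynamic quantization recipe targeting AVX512-VNNI CPUs; pass is_static=True
# (plus a calibration dataset) to use static quantization instead.
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="onnx-quantized", quantization_config=qconfig)
```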
Well, with Optimum you can do the same for ONNX Runtime, and benefit from all the hardware acceleration, by switching your AutoModelForXxx class to the corresponding ORTModelForXxx class. And so again, it's a very easy change to make to benefit from all the optimizations that ONNX Runtime provides, and it's something that I'm super excited about, and that the community is super excited about.
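In code, the switch looks roughly like this; the model id is just an example, and older Optimum releases use `from_transformers=True` instead of `export=True`:

```python
# Sketch: swapping an AutoModel class for its ORTModel counterpart and using it
# in a regular Transformers pipeline.
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification
# was: from transformers import AutoModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
# was: AutoModelForSequenceClassification.from_pretrained(model_id)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Optimum makes ONNX Runtime easy to use with Transformers."))
```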
You can find it on our blog at hf.co/blog; it's called "Optimum Inference", and in there you have the whole user story: starting from a pre-trained model that is fine-tuned for question answering, then exporting it to ONNX, applying the optimization, applying the quantization, and using the ORTModelForQuestionAnswering class to get accelerated performance. You're getting a 44 percent throughput increase, or latency decrease, while conserving 99.6 percent of the original model's accuracy.
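Putting the pieces together, that question-answering story looks roughly like the sketch below; the checkpoint, directory names and `file_name` arguments are illustrative, and the blog post itself is the authoritative walkthrough:

```python
# Illustrative end-to-end sketch: export, optimize, quantize, then serve a QA model.
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering, ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig, OptimizationConfig

model_id = "deepset/roberta-base-squad2"  # example QA fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 1. Export the fine-tuned model to ONNX.
model = ORTModelForQuestionAnswering.from_pretrained(model_id, export=True)

# 2. Apply graph optimizations.
ORTOptimizer.from_pretrained(model).optimize(
    save_dir="qa-optimized",
    optimization_config=OptimizationConfig(optimization_level=99),
)

# 3. Dynamically quantize the optimized graph.
quantizer = ORTQuantizer.from_pretrained("qa-optimized", file_name="model_optimized.onnx")
quantizer.quantize(
    save_dir="qa-quantized",
    quantization_config=AutoQuantizationConfig.avx512_vnni(is_static=False),
)

# 4. Load the quantized model and run accelerated inference.
qa_model = ORTModelForQuestionAnswering.from_pretrained(
    "qa-quantized", file_name="model_quantized.onnx"
)
qa = pipeline("question-answering", model=qa_model, tokenizer=tokenizer)
print(qa(
    question="What does Optimum do?",
    context="Optimum accelerates Transformers models with ONNX Runtime.",
))
```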
So that was what I wanted to talk to you about. I invite you to check out and give a star to the Optimum library; it's on our GitHub at github.com/huggingface/optimum. Thank you so much.