From YouTube: ONNX20210324 V12 OnnxRuntimeTraining
Hello, everyone. My name is Palm; I am a senior software engineer on Microsoft's AI Platform team, and I'm going to share some updates on the ONNX Runtime training work. As we shared previously, ONNX Runtime was designed to address a few problems, including reducing production model latencies and making it possible to deploy Python-trained models with C# or other programming languages.
There is also a need to run the model on different kinds of devices, for example mobile devices, as my colleagues Tom and Scott introduced earlier in the ONNX Runtime Mobile presentation. All those requirements come from inferencing scenarios, and ONNX Runtime has solved them pretty well. Once we extend to the training area, we see increasing demand to train large models efficiently.
ONNX Runtime has been proven to be a highly performant inference engine with cross-platform support and an architecture extensible with either custom operators or hardware accelerators. The training feature was introduced in the past months; it is still in the preview stage and is showing promising results on some of our internal models.
As for our design principles, we consider ONNX Runtime to be a generic framework for training deep neural networks. Similar to the inference support, we allow developers to extend it with custom operators for training. The transformer models find most of their applications in the field of NLP, for example in tasks such as machine translation and time-series prediction.
The chart on this slide shows the ORT training architecture. As we can see, data scientists will still be able to stick to their original trainer code built with PyTorch or other frameworks. Those models are converted to an ONNX model representing the model structure; we usually call it the forward graph in training scenarios. ORT, as a backend, takes the ONNX graph and then handles the complexities, including building the training graph, applying graph optimizations, and finally running the graph efficiently.
The backend is also a good place to incorporate innovations, including MSR work like DeepSpeed and Parasail and those kinds of techniques. So far, ORT has the capability to run training using both data parallelism and horizontal model parallelism. Next is a sample of how a PyTorch model runs training with ORT. The flow is a bit out of date, since we have been working on a new design recently, while most of the concepts remain valid.
Roughly speaking, the PyTorch model is converted to the ONNX graph first. Afterwards, ORT builds the training graph, including mixed precision, autodiff graph building, and graph optimization, and finally sets up distributed training before scaling out to multiple GPUs or multiple nodes. More specifically, ORT appends a loss function to the forward graph as the first step, builds the gradient graph step by step, removes unnecessary computations, and composes the Adam optimizer. Finally, we get the full training graph.
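The preview ORTTrainer frontend wrapped those steps behind one object. The sketch below is a rough approximation of that experimental API as it stood around this time; since the design was being reworked (as noted above), the exact model-description schema, option keys, and signatures here should be treated as assumptions.

```python
import torch
from onnxruntime.training import ORTTrainer, ORTTrainerOptions, optim

# `model` is the PyTorch module from the previous sketch.
# Describe graph inputs/outputs so ORT can append the loss node;
# this schema is an assumption based on the preview frontend.
model_desc = {
    "inputs": [("input", ["batch", 128]), ("label", ["batch"])],
    "outputs": [("loss", [], True)],  # True marks the loss output
}

def loss_fn(logits, label):
    # Loss that ORT appends to the forward graph before
    # building the gradient graph behind it.
    return torch.nn.functional.cross_entropy(logits, label)

opts = ORTTrainerOptions({
    "device": {"id": "cuda:0"},
    "mixed_precision": {"enabled": True},                # fp16 training graph
    "distributed": {"world_rank": 0, "world_size": 1},   # single GPU here
})

trainer = ORTTrainer(model, model_desc, optim.AdamConfig(lr=1e-4),
                     loss_fn=loss_fn, options=opts)

# Each call runs one forward + backward + Adam update inside ORT.
loss = trainer.train_step(torch.randn(32, 128), torch.randint(0, 2, (32,)))
```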
Some of the performance gains come from the CUDA kernel improvements we initially did for the BERT-large models, and all the optimizations have proven to be reusable and applicable to other transformer-based models, in our cases models like RoBERTa and GPT-2. We also provide good coverage of different graph-based optimizations, essentially kernel fusions, in-place computations, and so on.
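On the inference side, ORT exposes these fusion-style graph optimizations through session options. A minimal sketch, assuming the model file from the export example above:

```python
import onnxruntime as ort

so = ort.SessionOptions()
# ORT_ENABLE_ALL turns on all optimization tiers, including the
# extended ones that cover transformer-oriented kernel fusions.
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

sess = ort.InferenceSession("forward_graph.onnx", so,
                            providers=["CUDAExecutionProvider"])
```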
Memory efficiency also plays an important role in the better performance. With buffer reuse minimizing memory fragmentation, ORT could run roughly 2x larger batch sizes than PyTorch for BERT-large, as we mentioned earlier. Similar observations apply to GPT-2 medium training: we could train it on a 16-gigabyte V100, while PyTorch hit out-of-memory issues.
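For comparisons like these, peak GPU memory on the PyTorch side can be checked with the standard CUDA allocator statistics. A small sketch of that measurement; the training step itself is elided:

```python
import torch

# Reset the allocator's high-water mark before the run.
torch.cuda.reset_peak_memory_stats()

# ... run one full training step here ...

# Report the peak memory the allocator handed out during the step.
peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(f"peak GPU memory: {peak_gib:.2f} GiB")
```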