From YouTube: Reduced/Mixed Precision Optimization Techniques
Description
CJ Newburn (NVIDIA), Xiaoye Sherry Li (LBNL) & Cindy Rubio González (UC Davis) present a panel discussion on Reduced/Mixed Precision Optimization Techniques. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Panel Chair: Hugo Brunie
Hugo: Right now we will have a panel discussion on mixed-precision optimization, and I will be the moderator; I'm a NERSC postdoc here at LBL. Xiaoye Sherry Li is a senior scientist in the Computational Research Division at Lawrence Berkeley National Laboratory. Dr. Li has worked on diverse problems in high-performance scientific computing, including parallel computing, sparse matrix computations, high-precision arithmetic, and combinatorial scientific computing. She is the lead developer of SuperLU, a widely used sparse direct solver, and has contributed to the development of several other mathematical libraries, including ARPREC, LAPACK, PDSLin, STRUMPACK, and XBLAS. She has collaborated with many domain scientists to deploy this advanced mathematical software in their application codes.
Sherry: Thank you, Hugo, for the kind introduction. As Hugo mentioned, I actually worked quite a bit in the past on high-precision arithmetic; now the need is coming to do low precision, as low as possible. So that's interesting.

By the way, we do have quite a few high-precision libraries, and every year we still get more than a thousand downloads of them. So, depending on your application, the precision needs are certainly different.
Okay, so I will talk about our recent effort to use lower precision in sparse direct solvers. The goal is to determine how low a precision you can use, and obviously we want to be safe: in the end you want to get a correct result. Another thing is that we always want to analyze the accuracy of the numerical algorithms under mixed or lower precision.

You want to be able to tell users what guarantees you have with this algorithm. These two goals go hand in hand, and you probably already know about iterative refinement in the dense matrix context.
LU-based direct solvers can safely use mixed precision, and if you do it properly you can get the desired accuracy. The good helper function is iterative refinement.

In the example code here, the methodology is: do the expensive part in lower precision, and for the cheaper operations use higher precision to recover the accuracy lost to the lower precision.
B
So,
for
example,
in
the
dense
lu
case,
the
factorization
is
expensive.
It's
the
order
n
cube,
so
you
want
to
do
maybe
single
precision,
but
then
this
ir
iteration
is
you
compute
the
residual
after
your
first
solution
and
solve
this
correction
term
and
add
back
this
correction
term
to
your
final
solution
that
you
can
probably
iterate
a
few
times
so
in
this
ir
loop,
you
can
see
that
this
matrix,
vector
multiplication
is
cheap.
You
can
do
double
precision
and
triangular
solve
is
not
so
cheap,
it's
but
still
cheaper
than
the
factorization.
B
You
can
do
a
single
precision
and
the
double
precision
here.
The
addition
is
very
cheap
relative
to
the
others.
So
look
at
this
column
you
can
see
in
the
dense
case
you
have
most
expensive
and
it's
n
cube
all
the
rest.
The
next
expensive
ones
are
n
cube
and
square
so
that
you
have
a
lot
of
room
to
do
the
cheap
operation
many
times
before
catching
up
to
n
cube.
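As a concrete illustration of the scheme Sherry describes, here is a minimal NumPy/SciPy sketch of mixed-precision iterative refinement on a toy dense system (illustrative, not the library code):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, iters=5):
    # Expensive step, O(n^3) flops: factorize once in single precision.
    lu, piv = lu_factor(A.astype(np.float32))
    # First solution from the low-precision factors.
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        # Cheap step, O(n^2) flops: residual computed in double precision.
        r = b - A @ x
        # Triangular solves reuse the FP32 factors, O(n^2) flops.
        d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        # Very cheap step, O(n) flops: add the correction in double precision.
        x += d
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
b = rng.standard_normal(500)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))  # near double-precision level
```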
Sherry: But in the sparse case the situation is slightly different. For the standard 3D model problem, sparse LU costs O(n^2); the residual calculation is cheap, but the triangular solve is actually not so cheap: it's O(n^{4/3}). So the gap between the most expensive operation and the cheap ones is relatively small compared to the dense case.
Here is the ratio: in the dense case, expensive versus cheap is order n, so you have a lot of room to play; in the sparse case, expensive versus cheap is order n^{2/3}, so you have less room, meaning less room to do the iterative refinement. If you need many iterations, the cost catches up very quickly. Numerically, this algorithm is well understood.
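Stated as a worked ratio, using the operation counts quoted in the talk (n is the problem dimension; the sparse counts are for the 3D model problem):

```latex
% Room to amortize refinement = cost(factorization) / cost(one refinement step)
\text{dense:}\quad  \frac{O(n^{3})}{O(n^{2})}   = O(n), \qquad
\text{sparse:}\quad \frac{O(n^{2})}{O(n^{4/3})} = O(n^{2/3}).
```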
Sherry: We have already implemented this in dense LAPACK and also applied it to overdetermined least-squares problems; these were published a while ago. Recently some researchers have used an even better technique for iterative refinement: instead of the plain, simple refinement loop, you can run GMRES on this loop. It's still much faster in the dense case. All of this was demonstrated in the dense case.
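That GMRES-based variant can be sketched in a few lines (a toy SciPy example, not the published implementation; note the residual-tolerance keyword is `rtol` in SciPy >= 1.12 and `tol` in older releases):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 500))
b = rng.standard_normal(500)

# Low-precision factorization, used as a preconditioner for FP64 GMRES.
lu, piv = lu_factor(A.astype(np.float32))
M = LinearOperator(A.shape, matvec=lambda v: lu_solve(
        (lu, piv), v.astype(np.float32)).astype(np.float64))

x, info = gmres(A, b, M=M, rtol=1e-12, maxiter=100)
print(info, np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```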
Sherry: So now, moving to the sparse case. The first issue, as I mentioned, is that the gap between the expensive and cheap operations is relatively small, so you don't have much room to play. Another issue is that in the sparse case you need to do gather/scatter operations, which get in the way compared to the dense case. Here is an example from the SuperLU direct solver.
You want to factorize this matrix A into L and U. All the shaded blocks here are nonzero, and the blank ones are zeros, so you don't store those zeros and you don't operate on them. This picture shows the matrix mapped to six MPI processes, numbered zero through five, in a 2D block-cyclic mapping.
In the GPU implementation of this code, our strategy is to use the GPU in offload mode: the panel factorization stays on the CPU, and depending on the GPU memory size, if it's not big enough, we also keep part of the Schur complement on the CPU. So there is a split between CPU and GPU, and the split point is a parameter depending on how much memory you have; nowadays GPU memory is pretty big, so we can do a lot of the work on the GPU.
The main thing I want to mention here: if you look at the computational kernel, at every step in the sparse case you have to (1) gather the sparse blocks into a dense work array, (2) use this dense work array to perform GEMM, and (3) scatter the dense work array back into the remaining sparse blocks. Steps one and three don't exist in the dense case, but in the sparse case you have to do them. Using tensor cores for the GEMM, for example, is actually relatively easy; it's not so difficult.
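The three steps Sherry lists map onto code roughly like this (a minimal NumPy sketch; the block layout and names are illustrative, not the SuperLU data structures):

```python
import numpy as np

def schur_update(A_blocks, nz_rows, nz_cols, L, U, bs):
    # Step 1: gather the nonzero sparse blocks into contiguous dense arrays.
    Lg = np.concatenate([L[i] for i in nz_rows], axis=0)  # (len(nz_rows)*bs, bs)
    Ug = np.concatenate([U[j] for j in nz_cols], axis=1)  # (bs, len(nz_cols)*bs)
    # Step 2: one large dense GEMM on the work arrays; this is the part
    # that lower precision and tensor cores can accelerate.
    W = Lg @ Ug
    # Step 3: scatter the dense result back into the remaining sparse blocks.
    # Pure index arithmetic, no flops, so it sees no benefit from low precision.
    for a, i in enumerate(nz_rows):
        for c, j in enumerate(nz_cols):
            A_blocks[(i, j)] -= W[a*bs:(a+1)*bs, c*bs:(c+1)*bs]

# Tiny driver with one block column and two block rows.
bs = 2
L = {0: np.ones((bs, bs), np.float32), 2: np.ones((bs, bs), np.float32)}
U = {1: np.ones((bs, bs), np.float32)}
A_blocks = {(i, 1): np.zeros((bs, bs), np.float32) for i in (0, 2)}
schur_update(A_blocks, [0, 2], [1], L, U, bs)
print(A_blocks[(0, 1)])
```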
Hugo: Sorry, Sherry, it's been more than seven minutes. Can you wrap up in 30 seconds?
Sherry: Oh, okay, I'll pretty much stop here, so I'll just show you some results. For example, on the Summit machine we're currently using six MPI ranks and six GPUs, with each MPI rank driving one GPU. I have two sparse matrices, on the order of a million in dimension, and you can see that for the factorization time, double versus single precision, you get something like 48% to 50% faster just by moving less data: doing single precision means less communication in terms of bandwidth. And as I mentioned, gather/scatter actually takes 42% of the time in this case, and 35% in that case, so only about 50-60% of the time is in the fraction you can speed up with single precision, tensor cores, those kinds of things; the gather/scatter cannot benefit. The solve is relatively fast.
Hugo: Thank you, Sherry, for these nice results on iterative refinement.
I'm just telling the audience that you can ask your questions during the talks, and the panelists will answer once all of them have presented. Right now we have Chris J. Newburn (CJ). CJ is a principal architect who drives HPC strategy and the software product roadmap in NVIDIA compute software, with a special focus on systems and programming models for scale. Dr. Newburn is a community builder with a passion for extending the core capabilities of hardware and software platforms from HPC into AI, data science, and visualization. He's delighted to have worked on volume products that his mum used, and that helped researchers do science that previously wasn't possible. So now, CJ, on the mixed-precision tuning working group.
CJ: Thanks. I'm really delighted to have worked with Hugo and many others in a working group on the topic of reduced precision. The key thing here is that it's one thing to have cool hardware and ideas about what you could do in hardware, but what really matters in doing new science is making end-to-end connections: working with people who have the use cases, who are actually in the trenches doing the work, like all of you, trying to figure out how to apply it, and getting a dialogue going back and forth between end-user developers and our developers about how to do that. So we created the working group on this topic. First, Sherry, thanks for already covering the algorithmic-complexity part.
Another factor is really understanding where you can apply reduced precision. We decided to offer many different options in our GPU hardware, supporting FP16, FP32, and FP64, and I'll talk about some of the other variants we have. But understand, to first order: where do you need precision, and where do you need range?
Accumulators tend to want higher precision, more so than other things that people have experimented on with tools. If you find that your range is too broad, you can do some preconditioning: for linear systems, for example, you may be able to rescale to fit into the representable range where possible, and you can test the tolerance with introduced noise in that space as well.
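The rescaling idea can be sketched as a two-sided diagonal scaling before casting down (a minimal, illustrative NumPy example; production codes use more careful equilibration):

```python
import numpy as np

def scale_to_fp16(A):
    # Two-sided diagonal scaling R A C, with entries pushed toward O(1)
    # so the scaled matrix fits FP16's narrow exponent range.
    r = 1.0 / np.sqrt(np.abs(A).max(axis=1))  # row scaling (rows assumed nonzero)
    c = 1.0 / np.sqrt(np.abs(A).max(axis=0))  # column scaling
    As = (r[:, None] * A * c[None, :]).astype(np.float16)
    return As, r, c

# Solve the scaled system (R A C) y = R b, then recover x = C y.
A = np.array([[1e6, 2.0], [3.0, 1e-6]])
As, r, c = scale_to_fp16(A)
print(As)  # all entries now representable in FP16 (max magnitude ~65504)
```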
CJ: One of the things that people like John Stone have tried out is using fixed point for reproducibility. It's actually easier there: you don't have to worry about all the cases that have to be excluded, things are reversible, it can be easier for algorithm development, and there are a number of opportunities there.
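Why fixed point helps reproducibility, in a minimal sketch (the scale factor is an assumed parameter; integer addition is exact and associative, so the accumulated sum is independent of summation order):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(1_000_000).astype(np.float32)

SCALE = 2**20                         # assumed fixed-point scale for |x| ~ O(1)
fixed = np.round(x * SCALE).astype(np.int64)

# Integer addition is exact and associative: any summation order agrees.
print(fixed.sum() == fixed[::-1].sum())   # True

# Floating-point addition is not associative: order can change the result.
f1 = np.sum(x, dtype=np.float32)
f2 = np.sum(x[::-1], dtype=np.float32)
print(f1 == f2)                           # often False
```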
CJ: People have tossed out some ideas, like tuning the encodings for graphics APIs. People have thought about what you need in order to take advantage of hardware that can do a matrix multiply right in hardware, rather than just doing individual vector operations (we call them tensor cores), and you may need the right shape.
On the latest GPU, the A100 hardware, we have opportunities where you can eliminate half of the entries coming in if they're zero and mux those, for as much as a 2x performance gain. We've also looked at things with INT1 and INT8 for signal processing and so on. So there are lots of opportunities here, and you can see at the right, in teraflops, that we've gotten the best performance using an FP16 tensor core that accumulates in 32 bits.
C
If
you
didn't
accumulate
in
32
bits,
then
you
may
never
even
converge
but
larger
problem
sizes.
Things
tend
to
fall
off.
So
what
is
it
that
we
can
do
in
this
space?
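A minimal NumPy sketch of why the 32-bit accumulator matters: once an FP16 running sum grows large, each small FP16 addend falls below half an ulp and the sum stalls, while an FP32 accumulator stays accurate.

```python
import numpy as np

n = 4096
a = np.full(n, 0.1, dtype=np.float16)
b = np.full(n, 0.1, dtype=np.float16)

acc16 = np.float16(0.0)   # accumulate in FP16 (what not to do)
acc32 = np.float32(0.0)   # accumulate in FP32 (what the tensor core does)
for i in range(n):
    p = a[i] * b[i]                            # FP16 product
    acc16 = np.float16(acc16 + p)              # stalls once acc16 >> p
    acc32 = np.float32(acc32 + np.float32(p))  # stays accurate

print(acc16, acc32)   # FP16 sum stalls near 32; FP32 sum is ~40.9
```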
CJ: One of the key things: I kind of started with this, and I actually know many of you in the Berkeley community because I helped start the IXPUG effort back when I was at Intel. This is in a similar vein: gathering together people who are passionate experts to share what works, show how much it helps, share reproducible results, try things out, give and get feedback through community discussion, and be inspired by a range of different applications.
I think the group is well over a year old now, and we'd like to invite anybody who's interested to join: take your turn at spending a half hour or longer presenting some of the work you've been doing and working through it. Our next session is next Tuesday, when we'll be talking about a particular interface we have, CUTLASS, for making better use of the tensor cores. Kate Clark also just started a Slack channel on mixed precision, so you're welcome to join that.
There are a number of libraries and frameworks you can try that make it easier to use reduced precision behind the same higher-precision interfaces. In the DL frameworks, people have been working on this with something we have called AMP (automatic mixed precision): you can just throw a switch and, lo and behold, you get a whole lot more performance.
We have iterative refinement in cuSOLVER, as you were referring to, Sherry. We've done this with QUDA, which is used for Chroma and other apps, and with cuBLAS and cuTENSOR, where you can drop mixed precision in. I expect we can come back to this, perhaps in a broader discussion, but I wanted to offer some highlights of opportunities with iterative solvers, multilevel summation, and figuring out where the precision really matters.
We've had some discussions where physicists know, hey, you really need to care about the water, or this virus has a particular molecule, but it's kind of inert: you have to include it in your model, but you really don't need to worry that much about what's going on there. So treat that with lower precision than something else that's really operative. So maybe we need some sort of new forum, I don't know, for having more communication there, for being able to analyze: what is the science?
What are the physical limits? How can you work through the subsets of species that you should care about versus not? What matters in the algorithm?
How sensitive am I to variation across data sets if I'm trying to use profiling, for example, as Hugo has done with the tool we mentioned, to analyze those inner sets? What kinds of interfaces do we need, and how do you automate this? How do I measure stability in terms of the number of iterations, or whether it converges at all?
Can I measure the noise tolerance by introducing some noise and looking at the results across your validation data sets? And can we work up a catalog: here are the different interfaces for these different operations, and the different precisions for these different SKUs or hardware generations, so that developers know what's at their fingertips to work with.
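Measuring noise tolerance by injection, as CJ suggests, can look like this minimal sketch (the kernel, noise level, and tolerance are illustrative placeholders):

```python
import numpy as np

def simulate(x):
    # Placeholder for the real kernel under study.
    return np.sum(np.sin(x) ** 2)

rng = np.random.default_rng(3)
x = rng.standard_normal(10_000)
ref = simulate(x)

eps_fp16 = 2.0 ** -11        # roughly FP16 unit roundoff
rel_errs = []
for _ in range(20):
    noisy = x * (1.0 + eps_fp16 * rng.standard_normal(x.shape))
    rel_errs.append(abs(simulate(noisy) - ref) / abs(ref))

# If the worst case sits far below the validation tolerance,
# FP16-level input error looks safe for this kernel and data set.
print(max(rel_errs))
```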
Hugo: Our next speaker is Cindy Rubio González. Cindy is an associate professor of computer science at the University of California, Davis. Dr. Rubio González's work spans the areas of programming languages and software engineering, with a focus on program analysis for automated bug finding and program optimization. She is particularly interested in the reliability and performance of systems software and scientific applications. Now, on dynamic analysis for floating-point precision tuning: Cindy Rubio González.
Cindy: Hi, can you hear me? Yes. Thank you for the introduction, and thank you for the presentations before mine. It's not hard to convince everyone that floating-point arithmetic is used everywhere. Unfortunately, reasoning about floating-point programs is often difficult, given the large variety of numerical problems that can occur in these programs, and the fact that most programmers are not experts in floating point. Because of this, a common practice is to use the highest available precision, which often leads to poor performance.
The goal of our work is to develop automated techniques to assist programmers in tuning the precision of their floating-point programs. The idea is to systematically search over the types of floating-point variables to recommend a type configuration that specifies what type to use for each variable declaration.
The goal is that the resulting program should still produce an acceptable answer while being faster than the original program. Let me illustrate our program transformations with an example. Here is an excerpt of a C program that computes the arc length of a function g.
So
I
will
not
go
into
the
details,
but
I
want
to
just
point
out
that
this
program
uses
long
double
precision,
and
here
we
have
been
told
that
by
an
expert
that
there
is
a
mixed
precision
program
that
produces
an
answer
as
accurate
as
the
original
program
while
being
faster,
and
here
is
a
program,
so
the
orbital
structure
is
unchanged
aside
from
some
variable
declarations
and
function
calls.
Here, for example, variable s1 is an accumulator, so it remains in long double to avoid accuracy loss. However, the precision of the remaining variables has been lowered to either double or float. Furthermore, the call to the square root function has been replaced with the corresponding single-precision implementation, sqrtf. Unfortunately, even for small programs like these, it is infeasible to find these type configurations by hand.
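The flavor of the example can be reproduced in a few lines (an illustrative Python analogue of the C program; the real study keeps the accumulator s1 in long double while other variables and the sqrt call drop to single precision):

```python
import numpy as np

def g(x):
    # Toy integrand; the study's real g is a C function.
    return x + np.float32(0.5) * np.sin(np.float32(2.0) * x)

def arclength(n=10_000):
    h = np.float32(np.pi / n)     # step size lowered to float
    t = np.float32(0.0)
    s1 = np.longdouble(0.0)       # accumulator kept in long double
    y_prev = g(t)
    for _ in range(n):
        t = t + h                 # float32 arithmetic
        y = g(t)
        # np.sqrt on float32 stays in float32, mirroring the sqrt -> sqrtf swap.
        s1 += np.longdouble(np.sqrt(h * h + (y - y_prev) ** 2))
        y_prev = y
    return s1

print(arclength())
```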
Cindy: Some of the advantages of Precimonious are that it considers both accuracy and performance, and, because it is black-box, it works on medium-sized, non-trivial programs. It is easily configurable, because you can specify which areas of the program to focus on if you know them, and in our initial evaluations it gave speedups of up to 40%.
The downside of this technique is that it explores multiple configurations during the search, and each of them has to be evaluated, because simply lowering precision does not mean the program will be faster. Also, Precimonious only explores a subset of the search space, so the ordering of the variables affects which parts of the search space are examined. To address some of these challenges, we developed a couple of other tools.
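In spirit, the dynamic search looks like this greedy sketch (illustrative only; Precimonious itself uses a delta-debugging-based search, and `run` stands in for recompiling and re-running the program under a candidate type configuration):

```python
import numpy as np, time

def run(config, x):
    # Stand-in for recompiling and re-running the program under a type
    # configuration; two "variables" may each be float32 or float64.
    t0 = time.perf_counter()
    a = x.astype(config["a"])
    total = np.cumsum(a, dtype=config["b"])[-1]
    return float(total), time.perf_counter() - t0

x = np.random.default_rng(4).standard_normal(1_000_000)
config = {"a": np.float64, "b": np.float64}
ref, best_t = run(config, x)

for var in list(config):                       # greedily try lowering each variable
    trial = dict(config, **{var: np.float32})
    out, t = run(trial, x)
    accurate = abs(out - ref) <= 1e-6 * abs(ref)   # accuracy threshold
    if accurate and t < best_t:                # keep only if accurate AND faster
        config, best_t = trial, t
print(config)
```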
Cindy: We then worked on HiFPTuner, a white-box, hierarchical approach that groups variables based on their usage, so that we only consider type configurations that can actually lead to a speedup, focusing on configurations that do not introduce many cast operations. This still requires program profiling to find dependencies among variables, but it resulted in a considerably faster search and in configurations with higher speedups.
Despite the disadvantages I briefly discussed, the tools I presented are the state of the art in dynamic precision tuning, but there are still challenges to overcome. For example, the type configurations the tools propose rely on the inputs used during the tuning process. Is this a problem for HPC applications, or are there other correctness metrics the applications use that we could leverage to make more general guarantees about these configurations? Also, the current approaches do not scale to long-running applications, those that need to run for hours, days, or weeks, so we need to find further ways to reduce the search space, or incremental ways to apply these techniques.
Third, even though these configurations lead to a speedup, we do not know how far we are from the ideal configuration we could find. Are there other program transformations we could explore, in addition to changing variable declarations and function calls? Is there domain knowledge in the HPC application being tuned that we could use to guide the search? That would be very helpful to figure out. And finally, as a tool developer:
I think this is a great opportunity to connect with application developers and create a collaboration, to put together a collection of applications that we could use for further inspiration and for finding the bottlenecks in these tools.
As I said, we're actively working on these tools, so it would be great to connect with developers who have access to HPC applications that could benefit from tuning. I would also like to make a quick announcement about a workshop I am organizing with Ignacio Laguna from Livermore at Supercomputing. If you are working on topics related to mixed precision and correctness, we welcome your submissions.
Hugo: Thank you, Cindy. That was a nice overview of tools for searching the space of type configurations, and I think it is complementary to what we saw before, which was more in-depth mixed-precision tuning of a specific, broadly used library like SuperLU. Do we have any questions from the audience? Yes, one here: in terms of getting the solver to converge and getting performance?
Sherry: Yes, so the code is very complicated. We started with the double-precision code, double precision and double complex, and we have automatic macros to mechanize the code base and generate a single-precision code. But compared to dense code, there are just a lot more different pieces going on, much of it not related to floating point. A lot of these gather/scatter operations are not floating-point operations, so you need to take care of all this indirect addressing correctly. It's quite an engineering effort, and a lot of that engineering doesn't show up in the dense code.

We can gain some, but I think it's limited.
Hugo: Okay. And we have seen, for example, CJ, you mentioned during the working group that there has been a lot of research on optimizing specific libraries. Do you think, all three panelists, that the path to bringing mixed precision into general applications is to tune specific libraries, or can we leverage, as Cindy said, some metrics from specific applications and apply them to the general-purpose case?
CJ: Yeah, our experience in general has been to follow the bang for the buck. If you're going to make an investment, find something that is common to lots of users, so that you're solving lots and lots of people's end problems. Going after that in library form first makes it easier to justify a larger person-power investment in really making that one thing shine, and while you're doing that, you're likely to learn a bunch of things.
Cindy: I was just going to add that I completely agree that tuning the lower-level building blocks is a first step. I also think there will be optimizations specific to applications that we can always leverage, but that also makes things even more complicated, because often they are not general; they are very application-specific.
CJ: Just to plus-one that: DSLs can really be your friend. Where, again, there is broad applicability within a given domain, using a DSL, as many people at Berkeley and other places have done, and focusing there first, where you're dealing with a set of constraints you can operate within, can be very fruitful.
CJ: All right, I think it's interesting: if you look at what we did in this case, we went for FP16 first and then backed out to higher precision. The lower precisions were good enough for a lot of DL; then we backed off to the higher precisions that are really needed for a lot of HPC, and we kind of compromised with TF32.
I actually skipped a slide as I was talking about that, about where you can use that 8-bit exponent range to get kind of the best of both worlds. Is it okay if I share it? I can't talk and do this at the same time very well.
In this one, different things help in different ways. We found that going back up to doing things at higher precision was helpful, and that finding a compromise was very fruitful: with the TF32 tensor cores you get an 8-bit exponent range like FP32 and bfloat16, but 10-bit precision like FP16.
We also see opportunities, as I've mentioned, with single-bit arithmetic. There was some Gordon Bell work in 2018 that did really well with one-bit arithmetic in an adjacency to HPC, signal processing for radio astronomy. So we do think there are opportunities up and down, and I think we need to explore that whole range: starting in the middle and going up and down from there.
Cindy: I just wanted to add the aspect of correctness here. Yes, I agree, we have seen this trend from ML pushing for mixed precision, but I believe ML applications have very different requirements in terms of correctness, and that is something that usually comes up with all these automated tools: we need to know what correctness means, and these tools are driven by a subset of inputs. So how much error are we willing to tolerate?
Hugo: Yes, that is true. The correctness criteria are sometimes very difficult to get from the application developers; it's a hard problem. We have a question here, which I will ask Sherry: do we need mixed precision because of the convergence of HPC with AI workloads, or because HPC or AI workloads are more thirsty for performance than we can currently deliver? I would also reformulate the question: what pushed you to use mixed precision in SuperLU?
Sherry: Yeah, so for us the motivation is really to reduce communication. If we can use a lower precision, the memory accesses will be smaller and the communication will be reduced by half. That's the real motivation. And if you're talking about HPC and AI together, I think the TF32 format is really attractive, because it doesn't reduce the range, which is really needed for most HPC applications.
If you have only a five-bit exponent, it's really limited. Very often you can do some trick, like balancing the matrix with equilibration, but in most cases those techniques are not helpful. So five bits is too limited; instead you reduce just the mantissa to do things faster while keeping the same range. I think that's a very good compromise between both worlds, and it's also convenient for the application people.
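The trade-off Sherry describes can be checked quickly with NumPy's finfo (FP16 and FP32 are native; the TF32 and bfloat16 layouts, which NumPy lacks, are listed in comments from their published bit widths):

```python
import numpy as np

for t in (np.float16, np.float32):
    fi = np.finfo(t)
    print(t.__name__, "exponent bits:", fi.nexp,
          "max:", fi.max, "eps:", fi.eps)

# FP16:  5-bit exponent, 10-bit mantissa -> max ~6.5e4 (narrow range)
# BF16:  8-bit exponent,  7-bit mantissa -> FP32-like range, less precision
# TF32:  8-bit exponent, 10-bit mantissa -> FP32-like range, FP16-like precision
```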
Hugo: We're seeing more and more hardware floating-point formats, like TF32, maybe driven by application needs. Do you think there will be more need for specific floating-point types like this, and do you think the hardware will answer, like it did with AI?
Sherry: Maybe CJ can answer this, but my impression is that, for example with TF32, there are already tensor cores for this format, and that helps speed things up dramatically, right?
CJ: Maybe just to take that in a slightly different direction, coming back to Florina's question. One of the things we're seeing, and I'll cite HPC at the edge as an example, is from visiting people at some government labs, where it's really important to make the best use of very, very expensive equipment for taking measurements, whether it's microscopy or whatever it might be. It's very easy to get completely bad data because you didn't set the experiment up right.
It's also possible to take only a few samples of something that was really interesting and surprising, but not discover that until much later, when it was sent back to the scientist and they said, oh man, that was really cool, please do this experiment all over again. A lot of folks are telling us they really want a very quick turnaround, almost instantaneous feedback.
What that means is that we're, if you will, downloading a model used for inference into the instrumentation pipeline, and we're also doing other kinds of HPC processing; we might be taking video and doing feature extraction and so on. So one of the connections here is having the same hardware that's both really good at HPC and really good at AI, available at any scale, so you can run it out in smaller instruments near the edge.
Maybe it's at a base tower, where you're looking for a lost person or whatever, and you push out something that's really good at recognizing that person wearing a red jacket; or you can move it back into the data center. Being able to make that trade-off in the same kind of programming model, running on the same kinds of hardware that may just be available at different scales, is pretty cool, and having that sort of model there wherever it's needed throughout your whole system opens up a lot of opportunities. So that's maybe a different perspective.
Hugo: Thank you, CJ. On the question I just asked Sherry, I think both of you can also give your point of view about the convergence of HPC and AI and why we are using mixed precision. Cindy, do you want to give your view on this?
Cindy: Yes. Sorry, is this also about Florina's question, about what is driving the need for mixed precision? Okay. From my point of view, developing these tools, it seems like the main driving force has been performance, and then, by using mixed precision, we have to be aware of any additional numerical problems. That is my take.
So it's not so much about doing mixed precision because of the needs of the convergence; it's mostly about achieving as much performance speedup as we can, and then being aware of what other problems are introduced because of it.
Hugo: So I think we have a complete view on this problem. Thanks, CJ, for that perspective. Do we have another question from the audience? Okay, we will finish soon. If you have a last question or a debate you want to bring up among yourselves, go ahead; otherwise I have one more question I would like to ask.
This one is a bit specific to Sherry's application. You said it is bandwidth-bound, because it's sparse linear algebra, and that mixed precision helps you reduce data movement. We have been talking a lot about making computation go faster with tensor cores on NVIDIA hardware, but for such an application, tensor cores are maybe not the solution, because of the bandwidth aspect. Have you considered using MPI communication with reduced-precision data?
Sherry: Yes. For the single-precision code I was showing results for, that already uses MPI float for the communication.
Hugo: Maybe tricks with compressing the data?
Sherry: Yes, that's a very good question. For compression there are some tools; for example, Livermore has this tool called, I think, ZFP, a compressor. We don't know yet; we haven't experimented with its impact on the final result.
Hugo: Another axis of research. Getting back to what Cindy said: how do you define the correctness criteria for an application? For some of them it's quite obvious, when you have the significant digits of the result and your solver must converge to that many significant digits; for some others it's a bit less obvious. People can join the NERSC users Slack channel, on the mixed-precision channel, and we can talk about it further. With that, I thank all the panelists.
I want to tell the audience that we will have a last talk at the end of the day, wrapping up the two GPUs for Science days, and we will share a form with you asking what you think about these two days. Again, thank you very much, CJ, Cindy, and Sherry, for this instructive panel.