Description
Kate Clark from NVIDIA presents a talk on Mixed precision optimization for lattice quantum chromodynamics. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Session Chair: Hugo Brunie
Thank you very much for the invitation to talk here today. My apologies ahead of time: I'm a Keynote user and I tried to convert to PowerPoint this morning to get the captions, but I failed, so I'm giving this with Keynote.
So besides being caption-free, I've probably got far too many slides to talk through in this session, but I'll do my best. You'll have the slides at the end and I'm happy to answer any questions offline. Now, to start off, let's think about why we do mixed precision in HPC at all. There are, of course, many reasons.
A
Current
gpus
and
other
hardware
have
acceleration
specifically
for
low
precision,
and
but
there
are
other
reasons
as
well,
such
as
reducing
memory,
traffic,
reducing
network
traffic,
producing
memory
footprint
and
also
sometimes
you
want
to
change
the
precision
and
for
suitable
numerical
properties.
I
don't
just
mean
going
from
single
precision
to
double
precision,
but
also
using
things
in
integer
instead
of
a
floating
point,
for
example,
and
the
whole
goal
of
using
mixed
decision
is
really
to
accelerate
or
improve
the
algorithms
without
compromising
the
quality
of
science
to
accelerate
the
time
to
science.
So the case study I'll be talking about today is lattice quantum chromodynamics, lattice QCD. I'm not going to tell you much about what QCD is other than that it's the science of subnuclear physics: QCD is the theory of the strong force, which binds together the neutrons and the protons. It's a very simple theory to write down, a path integral, and it's a very beautiful theory. Unfortunately, it's a highly non-linear theory, and you can't solve it using Feynman diagrams.
That essentially means you can't use a Taylor expansion to make predictions, so you have to resort to numerical methods, and lattice quantum chromodynamics is the method of choice. Here you take infinite space-time and put it in a finite periodic box. So we end up with a four-dimensional grid with periodic boundary conditions, the original partial differential equations are turned into finite difference equations, which means stencils on a regular grid, and we have to solve problems of the form Ax = b.
So A is a large sparse matrix, and x and b are vectors. Lattice QCD is relevant in this conversation because it's a very large consumer of public supercomputing cycles: 10-plus percent of supercomputing cycles in North America, Europe and Japan, and increasingly so in China and India.
So it's traditionally been very highly optimized on every HPC platform that's been around for the last 30 years. For the US specifically, we can see why we care about this in the two pie charts on the left, which show the allocation and utilization on the Summit machine at Oak Ridge by lattice QCD. In INCITE 2019 approximately one sixth of all cycles were allocated to it, but because lattice QCD is always ready to absorb any cycles available, it actually took more like 35-40% of all cycles on Summit. NERSC is a bit more conservative in giving time to QCD physicists, more like 13% there, and you can see it's split between the applications Chroma, CPS and MILC. So it's an important application for HPC centers to optimize. There are two major steps in a lattice QCD calculation: a large strong-scaling Monte Carlo phase, and a second phase where you take all the configurations generated by that Monte Carlo Metropolis strong-scaled algorithm and analyze them.
Step one requires strong scaling, running on leadership facilities. Step two is a task-parallel job that can be done at a smaller, more easily parallelizable scale; it's much more task parallel. Both of these steps require solving this Ax = b problem in the top right here, where A is the discretized Dirac operator, or Dirac stencil. So, if you're not sure what lattice QCD is, here is the picture.
You run your simulations, you strong scale them, strong scaling the stencil problem across many GPUs or many processors, whichever your architecture of choice is, and you come up with predictions. You then compare them against experimental results, such as what comes out of the Large Hadron Collider in Geneva, and ideally you're looking for deviations between theory and experiment, because that's the sign of new physics.
So QUDA is the application I work on. QUDA is a library which accelerates lattice QCD on GPUs, it's written in modern C++ using CUDA C++, and it provides acceleration for all the major lattice QCD applications which absorb all that time on the leadership facilities. The ones I point out here are Chroma, MILC and CPS, and what we use a lot is mixed precision methods to optimize performance.
So that's what I'll talk about now, and it's very much a research tool for how to reach the exascale using algorithms and techniques. Here are some of my contributors to the library, my collaborators. The folks highlighted here are the ones who work at NVIDIA; we have quite a few former QCD physicists coming to NVIDIA.
So, mixed precision Krylov solvers are what we're really talking about today. We have this Dirac stencil and it's essentially a sparse matrix problem, and like any sparse matrix problem, or any stencil problem for that matter, it's very memory bandwidth bound. The arithmetic intensity is quite low, around one in single precision, so on any modern architecture you're going to be very memory bandwidth bound in what performance you can get. So what we have to do is maximize our arithmetic intensity, using whatever technique possible.
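As a rough way to see why this matters (an illustrative roofline estimate, not a slide from the talk): for a bandwidth-bound kernel the attainable flop rate is approximately the arithmetic intensity times the effective memory bandwidth,

    % roofline estimate for a bandwidth-bound kernel
    \mathrm{GFLOP/s} \;\approx\; I \times B_{\mathrm{eff}},
    \qquad I = \frac{\text{flops performed}}{\text{bytes moved}},

so halving the storage precision halves the bytes moved, doubles I, and, at fixed effective bandwidth, roughly doubles the attainable flop rate.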
The main technique I'm going to talk about here is using low precision: going from single precision to 16-bit precision, you immediately double your performance in terms of gigaflops, because you're memory bandwidth bound. So here's an example of some performance scaling just to demonstrate this, where we're increasing the size of the box.
A
Here,
l
was
just
the
number
of
grid
points,
we're
comparing
double
precision
in
single
precision
and
you
can
see
double
precision,
saturates,
700,
gigaflops,
single
precision
doubles
the
performance
and
that's
purely
because
of
the
memory
traffic
has
halved,
and
so
we
double
the
performance.
We
also
have
the
16
bit
fixed
point
format.
I
just
alluded
to.
We
don't
actually
use
fb16,
we
use
a
custom,
fixed
point
format
for
more
precision
because
we
want
dynamic
range
over
the
grid,
but
we
need
more
anticipates
on
each
grid
point.
So we use what some people call a block float format. Each of the grid points is actually a vector of 12 complex numbers, and we store that vector of 12 complex numbers using a 16-bit fixed-point format with a normalizer, an FP32 value, to set the scale. So over the whole grid we have the dynamic range of FP32, but locally we have 16 bits of mantissa. This gives us a combination of precision and range which helps with the stability of the numerical solvers.
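As an illustrative sketch of the idea (this is not QUDA's actual storage layout; the struct, scaling and rounding choices below are simplified placeholders):

    // Sketch of a "block float" site format: 12 complex numbers (24 reals)
    // stored as 16-bit fixed point plus one fp32 normalizer per site.
    #include <math.h>
    #include <stdint.h>

    struct BlockFloatSite {
      int16_t q[24];   // fixed-point mantissas, one per real component
      float   scale;   // per-site normalizer: fp32 dynamic range over the grid
    };

    __host__ __device__ inline void pack(const float *v, BlockFloatSite &s) {
      float amax = 0.f;
      for (int i = 0; i < 24; i++) amax = fmaxf(amax, fabsf(v[i]));
      s.scale = (amax > 0.f) ? amax : 1.f;
      for (int i = 0; i < 24; i++)
        s.q[i] = (int16_t)rintf(v[i] / s.scale * 32767.f);  // 16-bit mantissa
    }

    __host__ __device__ inline void unpack(const BlockFloatSite &s, float *v) {
      for (int i = 0; i < 24; i++) v[i] = s.q[i] * (s.scale / 32767.f);
    }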
The blue line here shows the comparison of half precision, the 16-bit fixed point, versus single and double, and you can see that each time you halve the precision of the representation you double the performance, or more or less double it; sometimes you actually increase performance by slightly more than a factor of two. That's shown on the right-hand side here with the effective memory bandwidth being achieved, which is more than the memory bandwidth of the device, in this case 900 gigabytes per second peak. That's because you have L2 reuse: there's lots of locality in the problem, and the L2 on the GPU is very good at capturing the locality of the stencil.
Now, it's all very well having a low precision stencil, but it doesn't help unless it improves your time to solution, and the time to solution must be measured against solving some high precision problem. So you want your actual solver, whether it be a CG solver, a BiCGStab solver, a multigrid solver, etc., to be good to double precision accuracy, or something close to double precision accuracy.
You have to reproduce the high precision result, and you have to be careful not to violate that; otherwise you do not end up with a free lunch.
So in the context of conjugate gradient, we found that there are four key ingredients to having a high precision solver do most of its work in low precision. Now, the traditional approach with mixed precision solvers is to use something called defect correction, or iterative refinement. You have an inner-outer solver, where you solve an inner system using low precision.
You finish that solve, say you finish the inner CG solve to 10^-4 or 10^-5 accuracy, then you calculate the outer residual in high precision, then you restart the system and do another inner solve in low precision, and you keep repeating that until you've converged.
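A minimal sketch of that defect-correction loop, assuming a generic low-precision inner solver and a high-precision operator (the names apply_A and solve_lowprec are placeholders, not QUDA's API):

    #include <vector>
    #include <cmath>

    using Vec = std::vector<double>;

    // Assumed to be provided elsewhere:
    Vec apply_A(const Vec &x);                      // high-precision mat-vec A*x
    Vec solve_lowprec(const Vec &rhs, double tol);  // inner solve of A*e = rhs in low precision

    static double norm(const Vec &v) {
      double s = 0.0;
      for (double x : v) s += x * x;
      return std::sqrt(s);
    }

    // Defect correction / iterative refinement: inner solves in low precision,
    // true residual and solution accumulation kept in double precision.
    Vec defect_correction(const Vec &b, double tol, double inner_tol) {
      Vec x(b.size(), 0.0);
      Vec r = b;                                 // r = b - A*x with x = 0
      const double b_norm = norm(b);
      while (norm(r) / b_norm > tol) {
        Vec e = solve_lowprec(r, inner_tol);     // e ~ A^{-1} r, e.g. to 1e-4
        for (size_t i = 0; i < x.size(); i++) x[i] += e[i];          // accumulate
        Vec Ax = apply_A(x);
        for (size_t i = 0; i < r.size(); i++) r[i] = b[i] - Ax[i];   // restart
      }
      return x;
    }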
Unfortunately, when you do this, you throw away the Krylov space that the solver has generated, and you end up doing many more iterations to reach convergence. So we use something different, called reliable updates, which is an old technique developed in the 90s to improve the numerical stability of BiCGStab, but it allows us to keep the Krylov space resident and do the bulk of the work in low precision while maintaining high precision accuracy. I refer you to the paper here for more details.
Even though you're doing most of the work in half precision, you need to make sure the accumulation of the solution vector is always done in double precision. Another thing we inherited from non-linear CG was to use a different form for calculating the beta coefficient, which is identical in infinite precision but is more numerically stable, self-stabilizing in fact, and improves the orthogonality of back-to-back iterations in conjugate gradient.
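For reference, the two standard, algebraically equivalent forms of the CG beta coefficient (this is my reading of the trick being referred to; the slide itself is not reproduced here) are the Fletcher-Reeves and Polak-Ribière expressions,

    \beta_k^{\mathrm{FR}} = \frac{r_{k+1}^\dagger r_{k+1}}{r_k^\dagger r_k},
    \qquad
    \beta_k^{\mathrm{PR}} = \frac{r_{k+1}^\dagger \,(r_{k+1} - r_k)}{r_k^\dagger r_k},

which coincide in exact arithmetic because successive residuals are orthogonal; in finite precision the second form damps any loss of orthogonality, which is what would make it self-stabilizing.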
What we're looking at here is a comparison of, in green, a pure double precision conjugate gradient solve, doing some number of iterations and trying to converge to 10^-7 accuracy. If you just plug in naive low precision conjugate gradient, without the recipe I just described, then you end up with the spiky black behavior. If you use the recipe described, however, you get this nice smooth red line. So you pay a small overhead in iteration count, but you can still converge to your target and get the free lunch you desire.
You can see here the condition number of the system is 10^4. As we make the condition number of the system higher, increasing it by two orders of magnitude, what you see is that the original naive conjugate gradient solver didn't converge at all: you just get this mess of convergence, the black line here, and in fact it never converged, you just get this oscillating behavior. The red line is the stabilized solver, and you'll see that, although it's not as stable as the green, pure double precision result, it still converges.
If you plug that into the actual solver, you can see the actual speedup you get from this more careful use of precision: double precision, going down to double-single, going down to double-half, where double-half gets the best overall time to solution in this MILC application. Using mixed precision here was important; these MILC calculations can run for months at a time, so this can knock weeks off a calculation.
Having gone down to 16-bit precision, it's interesting to think about how low we can go. Can we go all the way down to 8-bit and still converge that conjugate gradient solver? In the plot on the right here I'm showing the performance of double, single, half and now quarter precision. You can see quarter improved performance again on a Volta GPU, though not quite by a factor of two now, because other things are coming in, like indexing overheads, load/store instruction overheads, and things like that.
You can still see we get a nice boost in gigaflops, but when we actually look at the solver and whether it converges or not, what we see is the following. The regime of physical interest is shown in the pink area of this plot; on the x-axis I'm showing the condition number of the linear system, and this is the area we care about in terms of physics. If we look at the iteration count on a logarithmic scale for double-double, double-single, double-half and double-quarter, the first three of those all lie approximately on top of each other.
So they all have approximately similar iteration counts. Eventually double-half starts to diverge, but in the regime of physical interest we basically keep the iteration count fixed, whereas with double-quarter, where we're trying to converge to double precision accuracy while doing most of the work in quarter precision, we just do not get there. Even for very easy-to-solve problems, you cannot converge without a huge overhead.
So 8-bit is a bridge too far; 16-bit is where it's at, it would seem. Now let's switch to multigrid. Multigrid is the optimal way to solve these types of stencil problems, and that's why we care about it. It's much faster, much more numerically stable, and it's optimal in the sense that its convergence is flat with condition number.
So as you change the condition number of your system, you have no increase in the number of iterations of your solver, and it also has order-N scaling, which means that as you double your problem size your iteration count stays fixed as well. So this is the solver we really care about.
Now, multigrid in some ways is the ideal algorithm for mixed precision. Each pass of multigrid, each V-cycle if you're familiar with multigrid, only reduces the residual by an order of magnitude; it's very much an approximate solver. Very often you would wrap it as a preconditioner for an outer Krylov solver, or you do multiple passes of multigrid steps to reach your target accuracy. So each iteration of multigrid doesn't have to be that accurate, and that allows you great freedom to reduce precision dramatically inside the multigrid solver.
So the method we were looking at was to do the entire multigrid V-cycle in 16-bit precision and then wrap that in an outer GCR solver. We keep that outer GCR solver in single precision, and again there is an outer layer around that in double precision.
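As an illustrative picture of the nesting just described (the types and names below are placeholders, not QUDA's actual interface):

    // Precision layering described in the talk, as a declarative sketch:
    //   double : outermost layer (true residuals, solution accumulation)
    //   single : outer GCR Krylov solver
    //   half   : the entire multigrid V-cycle used as the GCR preconditioner
    enum class Prec { Double, Single, Half };

    struct MixedPrecisionMG {
      Prec outer_prec   = Prec::Double;  // outer residual / accumulation layer
      Prec krylov_prec  = Prec::Single;  // GCR iterations
      Prec precond_prec = Prec::Half;    // V-cycle: smoothers and coarse grids
    };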
The slide here is showing one thing you do have to be careful about with mixed precision, which is latency. The latency stays fixed, of course, as you go from double to single to half, so as your performance gets faster that latency becomes a bigger and bigger overhead. And in multigrid, as your solver coarsens you go to smaller and smaller grids, say from ten to eight to six to four to two, and eventually you've got a very small problem size.
Another interesting thing we found was in the multigrid setup. [Session chair: The time of the talk is reaching its end; could you wind up in one minute?] Yep, I'll be quick. So when we're doing coarsening in multigrid, we would like to do it in parallel, using massive parallelism, and that's difficult.
A
If
you
have
to
add
lots
of
numbers
together,
and
so
instead
we
will
use
is
integer
atomics,
because
there
we
have
determinism
and
associativity,
and
it
means
we
always
get
the
right
answer,
and
so
we
always
get
the
same
answer
all
the
time,
which
is
very
beneficial.
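As a rough illustration of that trick (not code from QUDA; the scaling and names are made up), converting each contribution to fixed point and accumulating with integer atomics gives an order-independent, reproducible sum:

    // Deterministic reduction sketch: scale each float to fixed point and
    // atomicAdd the integers; integer addition is associative, so the result
    // does not depend on the order in which threads contribute.
    #include <stdint.h>

    __global__ void reduce_fixed_point(const float *in, int n,
                                       unsigned long long *acc, float scale) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
        long long q = llrintf(in[i] * scale);            // signed fixed point
        atomicAdd(acc, (unsigned long long)q);           // wraps consistently
      }
    }
    // host side: sum = (double)(long long)(*acc) / scale;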
When you put all these things together, this optimized multigrid solver with four different precisions, double, single, half and int32 for the different parts of the solver, you see a significant speedup versus the solver we had before. Going from Titan to Summit and changing the algorithm, we had an 82x speedup altogether. I won't talk about this plot here, other than to say that 8-bit is actually stable in multigrid, unlike the case of CG, so we can use 8-bit all the way down.
You can use multigrid with 8-bit precision. Now, just wrapping up here, I wanted to highlight some of the work we're doing with tensor cores. Tensor cores are high-throughput matrix multiplication instructions on the Volta and Ampere architectures, with extremely high throughput for low precision such as FP16, and we've been deploying these instructions to accelerate some calculations on the Summit architecture.
A
What
we
see
is
that
preconditioners,
which
are
designed
to
reduce
network
communication,
which
are
too
expensive
without
tensor
cores,
are
made
possible
now
with
tensor
cores
and
as
a
result,
you
know
at
large
scales.
We
can
use
tensor
cores
to
accelerate
these
large
strong
scaling
workloads.
That would not be possible otherwise, and we turn a problem which was completely network bound into one that is compute bound. Finally, I just want to highlight the fact that we're now looking at using tensor cores throughout different aspects of the lattice QCD workflow. What we recognized, after the fact, is that many of these workloads are actually matrix multiplication, and so they are directly amenable to being reformulated onto tensor cores. So there's ongoing work to turn these matrix multiplication kernels into explicit tensor core usage, and we see significant speedups when doing that. What we expect when this work is done is that we'll have 2x the throughput of Chroma running on Summit, for example, and as noted at the beginning, that's a significant fraction of what runs on Summit now. So we can achieve a 2x speedup by using the tensor cores that are there, which is great.
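For readers unfamiliar with what explicit tensor core usage looks like at the kernel level, here is a minimal, generic CUDA WMMA example (one warp multiplying a single 16x16x16 FP16 tile with FP32 accumulation, requiring sm_70 or newer); it is a textbook illustration only, not code from QUDA or Chroma:

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    __global__ void wmma_tile_16x16x16(const half *A, const half *B, float *C) {
      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

      wmma::fill_fragment(c, 0.0f);
      wmma::load_matrix_sync(a, A, 16);   // leading dimension 16
      wmma::load_matrix_sync(b, B, 16);
      wmma::mma_sync(c, a, b, c);         // c += a * b on the tensor cores
      wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
    }
    // launch with a single warp, e.g. wmma_tile_16x16x16<<<1, 32>>>(dA, dB, dC);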
Just to conclude with my final slide before I'm kicked off: I would advise, generally, don't use double precision without considering what precision is actually required in the calculation. Judicious use of precision tuning can lead to large speedups, more than 4x is what we've seen here, and creativity is required, such as using integers where necessary instead of floating point, or using low precision as a storage format as opposed to just for computation. Some algorithms can utilize many more than two precisions; three or four precisions is sometimes the optimal approach to take, depending on the calculation.
The future work we're doing here is to recast a lot of these algorithms to make use of the tensor core functionality that is on current and future GPUs, to really give a scalable solution for the future. Okay, and with that, I thank you.