From YouTube: Using GPUs as Accelerators
Description
Max Katz from NVIDIA presents a talk on Using GPUs as Accelerators. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Session Chair: Oisín Creaner
A: In the concept of accelerated computing, you have heterogeneous nodes, which are combinations of CPUs and GPUs. Jay showed you what this might look like in the context of Perlmutter: you have a CPU that is very well optimized for serial tasks, like input and output and other calculations that are not parallel in nature, or at least have limited parallelism, whereas GPUs are optimized for highly parallel tasks, and you combine these two processors in order to solve your problem effectively. That's really the new norm that we're in, where you need to use both of these things together to solve problems in science.

A: CPUs are latency-reducing architectures, and what we mean by that is that they are optimized for serial tasks. They have very large amounts of memory, very high clock speeds, and very large caches, which help them to reduce latency. So when you're trying to access some data from memory, for example, CPUs are very good at ensuring that the time it takes to access that memory is as small as possible. But they're not good at everything: they have relatively low memory bandwidth, and if the CPU gets it wrong, if the data that you were trying to access is not immediately available, then that is costly. In terms of energy efficiency, they are relatively low efficiency, if you think about it in terms of the number of floating point operations per watt, say. Conversely, GPUs exist to hide latency, and so they have very high bandwidth memory, but low capacity memory.
A: On the other hand, GPUs have a very high amount of compute resources, and their memory accesses are high latency. When one thread accesses a particular item of data from memory, that takes a long time, but the way we help avoid that problem is by having many threads. While one thread is trying to do something, and it takes some time to fulfill that request, we have many other threads doing work in order to hide the latency of that request. So the individual performance of any one thread is relatively low compared to a CPU thread; however, by combining many threads in parallel, we can solve problems effectively. In terms of energy efficiency, GPUs are relatively high efficiency, which goes hand in hand with the fact that the individual cores that make up a GPU are relatively streamlined. They don't do a whole lot, but the things that they do, they do well when combined in parallel.
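(As a minimal sketch of that latency-hiding model, assuming CUDA: every element of the problem gets its own thread, so while some threads wait on memory the scheduler runs others. The kernel and sizes here are illustrative, not from the talk.)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one independent element; with millions of threads
// in flight, memory latency on any one of them is hidden by the others.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];  // one work item per thread
}

int main() {
    const int n = 1 << 24;  // ~16M elements: enough work to hide latency
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // tens of thousands of blocks
    saxpy<<<blocks, threads>>>(n, 3.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 5.0
    cudaFree(x); cudaFree(y);
    return 0;
}
```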
A: So GPU acceleration is the task of taking your application code, identifying the parts of it that can run on a GPU, putting those parts on the GPU, and leaving the rest on the CPU. In terms of, say, the number of lines of code, you might put a relatively small fraction on the GPU. It really depends on what your application looks like, but it might be just a few percent of the lines of code that dominate the runtime of your application.
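(The speaker doesn't spell this out, but the arithmetic behind offloading a small, hot fraction of the code is Amdahl's law. As a worked example with assumed numbers: if the offloaded part accounts for 95% of the runtime and the GPU speeds it up 10x, the overall speedup is

$$S = \frac{1}{(1 - 0.95) + 0.95/10} \approx 6.9,$$

so a few percent of the lines can deliver most of the win, while the remaining serial 5% caps what is achievable.)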
A: I want to tell you a little bit about the GPU architecture, because I think that will help make sense of this claim I'm making about GPUs being massively parallel devices, but also ones that really require massive parallelism to succeed. Jay showed you an image of what the chip looks like from the outside; this is kind of an inside view of the latest NVIDIA GPU, the A100. We announced this a couple of months ago, and this will be the GPU powering the GPU nodes on Perlmutter.
A: If you pay attention to one thing from this slide, it's the number of what we call CUDA cores; I'll explain what this term means in a moment. But you can think about it in very rough terms as the amount of compute power that the GPU has, and you probably know that the number of cores a processor has is usually proportional to its compute power in some sense. If you think about the modern CPUs that you're familiar with, the ones that might be running in your laptop or desktop, they typically have a few cores, maybe tens of cores in the higher-end server nodes.
A: That's certainly true for the high-end CPU nodes that we have running today; they run something like 10 to 100 cores. By contrast, GPUs have thousands of cores, and that allows them to solve problems massively in parallel. But the catch, and I'll show you this in a moment, is that those cores don't solve every problem.
A: Problems that aren't massively parallel aren't really optimized for running on GPUs. GPUs also have a relatively small amount of memory compared to CPUs, and so this GPU has 40 gigabytes of memory. But the GPU memory bandwidth is much higher than a typical CPU's: the typical high-end CPUs that you see today have something like 100 to 200 gigabytes per second of memory bandwidth, whereas modern GPUs using high bandwidth memory (HBM) can achieve something like a terabyte per second or more, and this trend is true across all GPU vendors.
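(If you want to see that bandwidth gap yourself, here is a minimal sketch, assuming CUDA, that times a streaming copy kernel with CUDA events and reports the effective memory bandwidth; the size and launch configuration are illustrative.)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Streaming copy: one read and one write per element.
__global__ void copy(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 26;                       // 64M floats = 256 MB
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));          // contents don't matter
    cudaMalloc(&out, n * sizeof(float));         // for a bandwidth test

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    copy<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gb = 2.0 * n * sizeof(float) / 1e9;   // read + write traffic
    printf("effective bandwidth: %.0f GB/s\n", gb / (ms / 1e3));

    cudaFree(in); cudaFree(out);
    return 0;
}
```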
A: If you look closely, you can see the die is composed of several little chunks, which have a little bit of green and blue in them. In total there are 128 of these on the die, and we take 108 of them and turn them into what we call the A100 GPU.
A: These chunks are the SMs, or streaming multiprocessors, of the GPU. A larger GPU in most cases has more SMs, and a smaller GPU has fewer SMs; the high-end GPUs today have something like 100 of these SMs. These are the individual multiprocessors that are tiled across a die to make a GPU. Inside one of them, you have exactly 32 double precision (FP64) units and 64 single precision (FP32) units. So it might be best to think of one of these SMs as the equivalent of a CPU core that you're familiar with, and when we say a term like CUDA core, what we really mean is something more like an arithmetic logic unit or a floating point unit.
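(For context, a back-of-the-envelope check with those per-SM numbers, using the A100's published boost clock of about 1.41 GHz; this arithmetic is mine, not from the slide:

$$108 \ \text{SMs} \times 64 \ \text{FP32 units} = 6912 \ \text{CUDA cores},$$

$$108 \times 32 \ \text{FP64 units} \times 2 \ \tfrac{\text{FLOP}}{\text{FMA}} \times 1.41\ \text{GHz} \approx 9.7 \ \text{FP64 TFLOP/s}.)$$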
A: So another way to think about GPUs is that they have a very large number of floating point and integer units, which makes them very good at floating point and integer math, but not so good at other problems that don't involve math. They additionally have tensor cores, which are well optimized for solving matrix multiplication problems, and they have a relatively small amount of cache memory: while CPUs may have something on the order of hundreds of megabytes of cache, you typically have much smaller amounts on GPUs.
A: On the other hand, any one of these individual SMs, or what you might call cores, can have up to 2048 threads running on it at one time. So if you take the 108 SMs that make up an A100 GPU and multiply that by the 2048 threads that can be active on each of them at one time, you could have over 200,000 threads running on the GPU at one time, and that's really what these GPUs are capable of.
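(You can reproduce that arithmetic on whatever GPU you have with a short CUDA runtime query; a minimal sketch, nothing here is specific to the talk.)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query the device and reproduce the speaker's arithmetic:
// resident threads = SM count x max threads per SM.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int resident = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    printf("%s: %d SMs x %d threads/SM = %d resident threads\n",
           prop.name, prop.multiProcessorCount,
           prop.maxThreadsPerMultiProcessor, resident);
    // On an A100: 108 x 2048 = 221,184 threads, the ">200,000" in the talk.
    return 0;
}
```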
A: So my takeaway for how to use GPUs effectively is that you have to expose massive parallelism. Again, you have to be solving problems that are relatively mathematical in nature, dominated by floating point or integer arithmetic, and you have to be able to hide the latency of any individual floating point or integer operation by combining the results of many, many cores. I've given you that number below: more than 200,000 threads can be active on a GPU at one time. This requires a qualitatively different level of parallelism to be exposed than you would typically find on CPU architectures. Even some of the more modern CPU architectures, like the KNL CPUs on Cori, didn't require this level of parallelism to be successful, but you have to use hundreds of thousands of threads to be successful on a modern GPU. What that means is that your problem needs to have at least that many degrees of freedom. So when you think about the size of your problem, whether it's the number of elements in your grid if you're doing some sort of grid calculation or something like that, you need to have something like 100,000 or a million degrees of freedom.
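(To make that concrete with an assumed example: a modest 3D grid of $128^3$ cells already has $128^3 \approx 2.1 \times 10^6$ degrees of freedom, an order of magnitude more than the roughly $2 \times 10^5$ threads the GPU can keep resident, so mapping one thread per cell comfortably saturates the device.)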
A: If you take one of these SMs, which are kind of the building blocks of a GPU, and lay them out horizontally like this, they have different memory levels in their hierarchy. They have registers, which are the actual source and destination of the computation on the GPU. These live on the actual chip, and they are what the floating point units and integer units use to do their math, so they are the highest bandwidth and lowest latency memory.
A: You also have on these SMs, or individual cores if you like to think about them that way, an L1 cache, which is the closest cache to the chip; then you have a device-wide L2 cache; and then the main global memory. If you lay those out from closest to the chip, and therefore highest in bandwidth but also smallest in capacity: registers have only a few tens of kilobytes per SM, or core, but can be accessed at much greater than a terabyte per second of bandwidth.
A: A similar thing is true for the L1 cache, which lives on each one of these SMs, or cores if you want to think about it that way: again, something like 100 kilobytes per SM, but it can be accessed at more than 10 terabytes per second. The L2 cache has a size of 40 megabytes on an A100 GPU.
A: It can be accessed at something like five terabytes per second, and then the main RAM has 40 gigabytes, so you can see it's orders of magnitude larger, but it's also slower to access. Of course, as is true for CPUs, you want to make as much use as you can of the memory that's closest to the chip.
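(One common way to exploit that fast on-chip storage is CUDA shared memory, which on recent NVIDIA GPUs is carved out of the same on-chip storage as the L1 cache. A minimal sketch, assuming CUDA; the reduction here is illustrative.)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block stages its slice of the input in on-chip shared memory and
// reduces it there, so global memory is touched only once per element.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float tile[256];            // fast on-chip storage
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // in-SM tree reduction
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = n / threads;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    block_sum<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();

    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += out[b];
    printf("sum = %f (expect %d)\n", total, n);
    cudaFree(in); cudaFree(out);
    return 0;
}
```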
A: I'll just leave a couple of tips here that you can look at later, but the main takeaway that I want to emphasize is that, just like you won't be able to saturate the compute core performance without having many threads, you also won't be able to saturate the memory bandwidth without many threads in flight.
A: So then, in summary, your main priority is to make the code faster, and that doesn't always mean making the part that's on the GPU run faster. Sometimes it means optimizing data transfer back and forth to the GPU, because the CPU and GPU have their own separate memory spaces, and sometimes it just means refactoring your CPU code to expose parallelism; a lot of codes that exist today, before they start porting to GPUs, don't have their parallelism exposed.
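(A sketch of the transfer point, assuming CUDA and a hypothetical iterative update kernel: the data makes one round trip for the whole run instead of two copies per iteration, which is often the difference between a slowdown and a speedup.)

```cuda
#include <cuda_runtime.h>

// Hypothetical per-element update applied many times.
__global__ void step(float* u, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u[i] = 0.5f * u[i] + 1.0f;
}

void run(float* host_u, int n, int iters) {
    float* u;
    cudaMalloc(&u, n * sizeof(float));
    // One transfer in...
    cudaMemcpy(u, host_u, n * sizeof(float), cudaMemcpyHostToDevice);
    for (int t = 0; t < iters; ++t)
        step<<<(n + 255) / 256, 256>>>(u, n);   // ...data stays resident...
    // ...and one transfer out, instead of two copies every iteration.
    cudaMemcpy(host_u, u, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(u);
}
```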
A: Naturally, you have to spend a lot of time, or at least some time, making that true, and that's a big part of the GPU porting process. So don't go in and immediately start writing CUDA code or OpenACC code or anything like that; first identify the parts of your application that could benefit from parallelism, and whether the parallelism is exposed in that way, and you should always use profiling tools to help you with this.
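(One way to do that with NVIDIA's tools, offered here as an illustration rather than the speaker's specific recipe: annotate the phases of your code with NVTX ranges so a timeline profiler such as Nsight Systems can show where the time actually goes. The phase functions are hypothetical stand-ins.)

```cuda
#include <nvtx3/nvToolsExt.h>   // NVTX v3, ships with recent CUDA toolkits

// Hypothetical application phases, stubbed for the sketch.
void simulate_timestep() { /* ... GPU kernels ... */ }
void write_output()      { /* ... serial I/O ... */ }

int main() {
    for (int t = 0; t < 100; ++t) {
        nvtxRangePushA("timestep");  // named range on the profiler timeline
        simulate_timestep();
        nvtxRangePop();
    }
    nvtxRangePushA("I/O");
    write_output();
    nvtxRangePop();
    return 0;
}
```

Running the binary under "nsys profile" then attributes wall-clock time to each labeled range, which tells you where to spend your porting effort.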
A: It is generally the case that you will hear lots of received wisdom about what works and what doesn't on GPUs, and I encourage you not to try to live by that wisdom, but instead to experiment yourself and then use tools like profilers to help you understand whether you're succeeding or not. So with that, I'd like to thank you, and I'm open for any questions.
B: Okay, I suppose a great question I have is... sorry, do we have an example one? So there's a question here from an attendee saying: can you give some examples of codes that are not appropriate for GPUs?
A: Well, the most common example is one where you just don't have a large enough problem to solve. Like I said, GPUs require problems that have many degrees of freedom exposed; you have to be able to utilize hundreds of thousands of threads at once. That means your problem needs to be able to expose hundreds of thousands of independent pieces of work that can be solved simultaneously. If your problem just doesn't express itself in that way, say your main application is solving matrix multiplications that are 10 by 10 and you're only doing one of them at a time, that's not appropriate for running on a GPU; it'll actually be faster to run that on a CPU.
A: Now, sometimes you can express your code in a way where you can batch operations so that many are being solved at once, and then maybe you have some chance of using the GPU effectively. But really, the main pitfall is when your code just doesn't have enough work to do to saturate the GPU, and then you'll actually end up worse off than if you just used the CPU alone.
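(The batching idea the speaker mentions is exactly what cuBLAS's batched interfaces are for. A minimal sketch, assuming CUDA and cuBLAS, that launches 100,000 of those 10x10 multiplications in one call; the sizes are illustrative and the inputs are left unfilled.)

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// One strided-batched call multiplies many tiny matrices at once,
// giving the GPU enough independent work to stay busy.
// Compile with: nvcc batch.cu -lcublas
int main() {
    const int m = 10, batch = 100000;          // 100k independent 10x10 GEMMs
    size_t bytes = (size_t)m * m * batch * sizeof(float);
    float *A, *B, *C;
    cudaMalloc(&A, bytes); cudaMalloc(&B, bytes); cudaMalloc(&C, bytes);
    // (fill A and B with real data here)

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              m, m, m, &alpha,
                              A, m, (long long)m * m,   // stride to next A
                              B, m, (long long)m * m,   // stride to next B
                              &beta,
                              C, m, (long long)m * m,   // stride to next C
                              batch);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```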
B: Thank you for that. Another question here from Konstantinos: do GPU profilers work well when dealing with multiple MPI tasks per GPU?
A: That will depend on which profiling tool you use. The NVIDIA-provided profiling tools certainly have the capability to work when using multiple MPI tasks per GPU. The ones that are currently available don't have the greatest ability to combine that in an MPI-aware way, to show you, for example, multiple ranks running at the same time; that's something we're working on that we don't have today. But they certainly can run, and you can analyze each rank independently. And if you look broader than the NVIDIA-provided products, at some of the third-party tools like Score-P, Vampir, TotalView, or other well-known third-party profiling and debugging tools, they generally have been built to run well at scale, so those are also good things to look at.
B: Great. And I think we have time for one last question; any further questions, please put them in the Q&A, but we'll have to answer them off the chat here. So this one comes from Akil.
A: Well, the NVLink connection is a proprietary interconnect that NVIDIA created that allows GPUs to talk directly to each other at very high bandwidth compared to the standard interconnect between chips on a motherboard, for example PCIe. 600 gigabytes per second is the total bidirectional throughput at which a GPU can talk to all the other GPUs in the system, but in reality that throughput is going to be divided across multiple GPUs. There are actually 12 of these links per A100 GPU, and then you split them up across whatever interconnect topology you're using.
A: So, for example, it'll look a little bit different on the Perlmutter motherboards than on other systems where there are more or fewer GPUs per node, but that 600 gigabytes per second is the total bidirectional throughput from any one GPU to all of the other GPUs in the system, and it just gets divided across the other GPUs.
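(For what that looks like from code, a minimal sketch assuming CUDA and at least two GPUs in the node: a direct device-to-device copy that, with peer access enabled on NVLink-connected GPUs, travels over NVLink instead of bouncing through host memory over PCIe.)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can GPU 0 reach GPU 1?
    printf("GPU 0 -> GPU 1 peer access: %s\n", canAccess ? "yes" : "no");

    size_t bytes = 1 << 28;   // 256 MB test buffer
    float *buf0, *buf1;
    cudaSetDevice(0); cudaMalloc(&buf0, bytes);
    if (canAccess) cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1); cudaMalloc(&buf1, bytes);
    if (canAccess) cudaDeviceEnablePeerAccess(0, 0);

    // Direct device-to-device copy; rides the GPU fabric when peered.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(buf1); cudaSetDevice(0); cudaFree(buf0);
    return 0;
}
```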
A: I can't really talk about their energy cost; it's not really something I'm prepared to meaningfully speak about. But this will be essentially the default way that you communicate between GPUs on Perlmutter, and so it's something to be aware of: you have much higher bandwidth than you might be familiar with using standard PCIe.