Description
Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/
A: Good. My name's Brent Leback; I was a member of PGI before we were bought by NVIDIA. I've held a few different positions, and these days I just consider myself part of the NVIDIA HPC SDK team.
A: So I just showed this slide. This is the HPC SDK from a high-level view. I'm not going to go over it too much, but it has the various compilers and programming models. For this section we're going to talk a little bit about porting considerations.
A: I've given these talks for 12 years or so. Starting about five years ago, when the CORAL systems were first introduced, we used to show this slide. On the left is the CPU, whether it's x86, POWER, or Arm, and it has a handful of cores; that's probably more cores now than it was five years ago, maybe they've doubled to 40 or 64, or something like that.
B: Sorry to interrupt, Brent. People are reporting the sound has become choppy, I think.
A: The CPU is connected to large, high-capacity memory; these days that's half a terabyte or a terabyte on a server-class system, relative to the GPU. The memory bandwidth between the CPU cores and their memory is fairly low. I've been around this industry for quite a while; in fact I wrote a paper in 2007 or 2008 with some people at Sandia, when multicore was first coming out, about how the memory bandwidth was not keeping up with the number of CPU cores.
A: On the right we have a typical GPU accelerator, and it has many more cores. Every generation seems to roughly double the number of GPU cores on the accelerator.
A: With more cores, the cores are simpler, and as we'll discuss in the considerations, the key is not just to fill up the cores on the accelerator but to way oversubscribe them, so you can in fact hide some of the memory latency and take advantage of the high memory bandwidth. Between the two there is an interconnect, either PCIe or NVLink on the CORAL-based machines, and that is yet another consideration we'll talk about over the course of the next two days: how to minimize the amount of traffic across that connection between the CPU and GPU.
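To make the oversubscription point concrete, here is a minimal CUDA sketch (not from the slides) of a grid-stride loop: it launches far more threads than there are physical cores, which gives the GPU enough independent work to hide memory latency. The kernel name, sizes, and launch parameters are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: y[i] = a*x[i] + y[i], written as a grid-stride loop so a
// fixed launch configuration covers any problem size and the GPU is heavily
// oversubscribed with many more threads than physical cores.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 24;                      // ~16M elements, far more than the core count
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;  // tens of thousands of blocks in flight
    saxpy<<<blocks, threads>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);                // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```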
A: So, processor counts through the years. Notice the little blue circle in the upper left-hand corner. When some of us started in HPC, most of the parallelism was via MPI. You wrote pretty much a sequential program, and at certain points you inserted MPI sends and MPI receives into your code.
A: The core may have been a 486, or maybe an SGI processor, or maybe even a Sun UltraSPARC processor. There were a few registers, and it had some instruction-level parallelism.
A: We still used high-level parallelism via MPI; it seemed to be a great model and it has survived over many, many years. We still got most of our performance improvements from manufacturing process improvements, and instruction-level parallelism got even higher.
A: There were more registers, the clocks kept getting faster, and at this point SIMD vectorization of loops became important. Compilers, including the old PGI compiler where I worked, really invested a lot of time and effort into vectorizing loops across SIMD lanes.
B
Sorry
brent
sure,
never
again
and
you're
supposed
to
show
some
diagrams
we're
only
seeing
texts
on
the
screen
right
now.
A
I
will
so
I
am
showing
how
processor
counts
have
grown
over
the
generations.
A
So
we
still
code
to
the
main
core
and
the
sequencer
right,
so
you
have
a
sequencer
that
has
branches
and
loops
and
things
now,
even
though
you
have
say
four
lanes
in
your
simdi
registers,
you
use
simdi
loads
and
stores
on
those.
You
don't
write
code
for
each
individual
lane
and
that's
kind
of
the
cpu
model.
A: Then we moved to multi-core architectures. This was maybe 15 years ago, when AMD launched the Hammer; I was trying to think of the code name, the Hammer, the AMD64 architecture. Then people started to use MPI plus OpenMP: maybe one MPI process per core, or maybe one MPI process plus OpenMP threads. This was about the time that clock rates really began to slow; we had hit maybe three gigahertz, and that was kind of the top end on x86.
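As an aside, here is a hedged sketch of that MPI-plus-OpenMP style (my own minimal example, not from the talk): one MPI rank per node or socket, with OpenMP threads sharing the work inside the rank.

```c++
// Hypothetical hybrid skeleton: launch with, e.g., one MPI rank per socket and
// let OpenMP threads fill the cores of that socket.
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
    {
        // Each OpenMP thread handles a share of this rank's local work.
        printf("rank %d/%d, thread %d/%d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```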
A
They
actually
started
to
slow
down,
but
they
added
more
so
then
pneuma
issues
began
to
crop
up.
Pneuma
p
threads
made
their
way
into
the
linux
kernel.
I
remember
how
disruptive
that
was.
A
The
memory
bandwidth
again
does
not
keep
up
with
the
compute
speed.
The
main
memory
got
bigger,
but
perhaps
you
only
had
about
two
gigabytes
per
core
on
your
on
your
server,
but
they
added
more
and
more
caches
to
each
core.
So
if
you
to
deal
with
the
latency
to
main
memory
they
added
more
cache,
then
we
moved
to
avx
512,
and
so
this
is
maybe
like
quarry
gpu.
A
Again,
a
lot
of
the
same
things
as
before,
but
just
more
and
better
avx
512
hardware
was
actually
a
little
slow
to
take
hold
and
the
initial
implementations
were
not
optimal.
A
lot
of
times
we
found
in
our
compiler.
It
was
better
to
generate
code
for
avx
256,
assuming
that
every
core
was
going
to
be
active,
it
had
heat
or
or
power
issues
still
pneuma
issues.
A
Vectorizing
compilers
are
still
very
important.
Cmd
instructions
still
in
use,
and
just
more
and
more
of
the
same
types
of
things,
software
grows
in
complexity,
relies
on
features
like
dynamic
memory,
large
heaps
large
data,
big
long
call
stacks.
Of
course,
you
still
coded
to
the
main
core
and
sequencer.
You
did
512.
A
So
this
is
this
is
what
you're
faced
with
if
you've
been
running
on
an
avx-512,
cpu
and
now
you're
moving
your
code
to
amp.
So
you
need
to
find
much
more
parallelism
in
your
code
when
you
move
this
to
the
gpu.
Otherwise,
you'll
be
under
under
utilizing
the
gpu,
so
you
know
40
different
threads
are
just
not
enough
for
64
different
threads.
A
You
know
times.
Eight
even
is
just
not
enough.
An
ampere
a100
likes
to
work
on
thousands,
tens
of
thousands
of
different
elements
all
at
the
same
time.
A
So
we
can
still
use
high-level
parallelism
via
mpaa
and
perhaps
openmp.
Now,
there's
been
a
lot
of
applications
ported
over
the
years
to
gpus
that
still
use
openmp,
maybe
one
thread
per
gpu
and
in
fact
some
well-known
applications
are
only
openmp
one
thread
per
gpu
and
then,
if
you
don't
have
enough
gpus
all
the
threads
gonna
help
out
do
some
work
sharing
and
they
have
schedulers
built
into
them.
A
So
because
the
the
gpu
is
so
large,
there
are
some
tricks
or
ideas
you
can
use
to
like
make
multiple
uses
of
the
gpu.
So
we'll
talk
about
some
of
these
things,
multiple
cuda
contexts,
multiple
streams
running
in
parallel,
there's
mig
and
mps,
which
are
some
products
provided
by
nvidia,
which
allow
sharing
the
gpu
resources
either
within
your
program
or
with
other
people.
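As a hedged illustration of the multiple-streams idea (mine, not the presenter's code), two independent kernels submitted to different CUDA streams may execute concurrently if the GPU has resources to spare; the kernel and sizes are made up.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    // Independent work submitted to two different streams can overlap on the GPU.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    scale<<<(n + 255) / 256, 256, 0, s0>>>(a, n, 2.0f);
    scale<<<(n + 255) / 256, 256, 0, s1>>>(b, n, 3.0f);
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```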
A
You
know
other
entities
on
a
gpu
synchronization
between
the
columns
in
this
diagram
is
hard.
Synchronization
down
the
column
is
a
feature
of
cuda,
that's
been
there
forever
and
you
sync
all
the
threads
in
a
single
block.
A
I
say
it's
hard,
but
it
is
solvable
and
a
lot
of
clever
people
have
solved
it,
but
it's
usually
buried
into
a
library.
It's
not
really
as
much
a
part
of
the
programming
model.
So
if
you
need
to
do
synchronization
between
the
thread
blocks
or
the
columns
in
this
diagram,
you'll
struggle
to
find
a
good
solution.
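A minimal sketch of the within-block synchronization just described (my own illustrative kernel): threads of one block cooperate through shared memory and __syncthreads(), but there is no comparably simple barrier across blocks.

```cuda
// Hypothetical block-local sum: each block reduces its slice of the input.
// __syncthreads() is a barrier only for the threads of this block; combining
// the per-block results still has to happen in a second kernel or on the host.
__global__ void blockSum(const float *in, float *blockResults, int n) {
    __shared__ float partial[256];            // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                          // all block threads have written

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();                      // wait before the next halving step
    }
    if (tid == 0) blockResults[blockIdx.x] = partial[0];
}
```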
A: The memory latency is very high and the caches on a GPU are relatively small. Now, the caches on a GPU have gotten better and better with every generation, but they are still much smaller than on a CPU, and every once in a while we run into applications that run faster on a multi-core CPU than on a GPU. It's usually because on the multi-core CPU the whole data set fits in, say, L1 cache. You will not see a lot of speedup if your data sets are that small.
A
There
is
a
shared
memory
which
jeff
I
think
touched
on,
so
it
is
a
programmer,
managed
cache
and
it's
useful
for
performance
and
to
communicate
between
cores
and
an
sm
or
in
the
column,
within
a
thread
block
in
a
column
on
the
diagram
and
massively
over
subscribing.
The
cores
is
a
key
to
performance.
So
I
don't
know
they
may
say
an
a100
has
5000
cuda
cores
in
it.
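Here is a hedged sketch of using shared memory as that programmer-managed cache (my example, not the presenter's): each block stages a tile of the input, plus halo elements, so neighboring reads come from on-chip memory instead of global memory.

```cuda
// Hypothetical 1D three-point stencil. For simplicity it assumes n is a
// multiple of blockDim.x (256); each global element is loaded once per block
// instead of up to three times.
__global__ void stencil3(const float *in, float *out, int n) {
    __shared__ float tile[256 + 2];                       // tile plus two halo slots
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;                            // leave room for the left halo

    tile[lid] = in[gid];                                  // each thread stages one element
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : in[gid];      // left halo
    if (threadIdx.x == blockDim.x - 1)
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : in[gid];  // right halo
    __syncthreads();                                      // tile fully staged before use

    out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```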
A: CUDA is a lot different. I've been around a long time, and it was kind of the first model where you coded to the core, or the lane. The CUDA runtime on the GPU handles the divergence, even though all the cores in a column in the diagram might run in lockstep.
A
You
can
say
something
like
if
I'm
thread
number
four
go
off
and
do
something
else,
and
you
don't
do
that
on
a
on
a
cpu
with
cindy
lanes,
but
you
can
do
that
in
cuda
and
that
ability
kind
of
trickles
up
into
the
other
programming
models.
So
there's
a
cost.
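A small illustrative kernel (mine, hedged) of that per-thread branching: it is legal to have one thread take a different path, but within a warp the divergent paths run one after the other, which is the cost being referred to.

```cuda
// Hypothetical divergence example: thread 4 of each block does something
// different. The hardware serializes the two paths within a warp.
__global__ void divergentWork(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (threadIdx.x == 4)
        data[i] = -1.0f;            // "if I'm thread number four, do something else"
    else
        data[i] = 2.0f * data[i];   // everyone else does the regular work
}
```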
A
There's
a
you
know:
cost
in
performance,
but
the
cuda
runtime
that
runs
on
the
gpu
just
handles
that
each
core
lane
loads
and
stores
its
own
data
and
the
os
or
the
low
level
runtime,
ideally
coalesces,
those
into
contiguous
blocks.
I've
found
over
the
years.
The
biggest
factor
in
gpu
performance
is
getting
all
the
cores
in
a
thread
block
like
a
column
of
the
cuts
on
the
diagram
to
read
consecutive
memory
locations
in
memory.
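To make the coalescing point concrete, here is a hedged sketch (names and stride are illustrative) contrasting a coalesced access pattern with a strided one.

```cuda
// Consecutive threads read consecutive addresses: a warp's loads combine into
// a few wide memory transactions (coalesced).
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Consecutive threads read addresses that are `stride` elements apart: each
// warp now touches many separate memory segments and wastes most of the wide
// memory bus.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```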
A: And finally, this is one of the things that I always tell people: overheads can really adversely affect performance. GPU programming is a kernel-based programming model. You launch kernels, they're short-lived, the kernel ends, then you launch another kernel and it ends, and every CUDA thread typically does only a small amount of work.
A
So,
unfortunately,
if
you
have
overhead,
like
indirect
memory,
access
or
other
you
know,
function
calls
where
you
have
to
set
up
argument
lists
and
then
tear
down
argument
lists
things
like
that.
That
can
actually
take
longer
for
a
thread
than
doing
the
actual
work.
So
you
need
to
think
about.
A
These
all
follow,
you
know,
follow
the
same
guidelines,
so
look
for
the
cuda
c,
plus
plus
programming
guide,
cuda
c,
plus
plus
best
practices
guide,
the
stuff
that
they
talk
about
applies
to
all
models,
so
you
know
find
ways
if
you
can
to
parallelize
sequential
code,
minimize
the
data
transfers
between
the
host
and
the
device
kind
of
back
to
that
original
diagram.
Either.
You
know
if
it's
andy,
link
or
pci
a
lot
of
times.
You
will
run
a
multi
a
large
number
of
time,
steps
on
the
gpu.
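A hedged sketch of that pattern (my own, with made-up names and sizes): copy the data to the device once, run many time steps worth of kernels on it, and copy the result back once, instead of crossing PCIe or NVLink every step.

```cuda
#include <vector>
#include <cuda_runtime.h>

__global__ void timeStep(float *field, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i] += 0.5f * field[i];       // stand-in for the real physics
}

int main() {
    const int n = 1 << 20, steps = 1000;
    std::vector<float> host(n, 1.0f);

    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);  // once

    for (int s = 0; s < steps; ++s)               // all time steps stay on the GPU
        timeStep<<<(n + 255) / 256, 256>>>(dev, n);

    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);  // once
    cudaFree(dev);
    return 0;
}
```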
A
Adjust
the
kernel
launch
configuration,
you
may
know
more
about
the
you
know:
loop,
extents,
the
size
of
each
dimension
of
your
raise
than
the
compiler
ever
does
you
can
play
around
with
the
launch
configurations
in
cuda
the
number
of
blocks,
the
number
of
threads
per
block
and
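As a hedged sketch of playing with the launch configuration (illustrative numbers only): with a grid-stride kernel the block size and block count become tuning knobs that can be changed without touching the kernel.

```cuda
#include <algorithm>
#include <cuda_runtime.h>

__global__ void scaleGridStride(float *x, int n, float a) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        x[i] *= a;
}

int main() {
    const int n = 1 << 22;
    float *x;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    // Knobs to experiment with: threads per block and total block count.
    // A grid-stride kernel stays correct for any choice.
    int threadsPerBlock = 128;   // try 64, 128, 256, 512, ...
    int blocks = std::min((n + threadsPerBlock - 1) / threadsPerBlock, 2048);
    scaleGridStride<<<blocks, threadsPerBlock>>>(x, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(x);
    return 0;
}
```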
A: And make sure global memory accesses are coalesced; I talked about that. Within a warp or a thread block, you want all of the threads to be accessing consecutive elements of memory.
A
You
know
internally,
the
gpu
has
a
very
wide
memory
bus
and
if
you
access
consecutive
areas,
you
take
advantage
of
that
wide
memory,
bus
minimize,
redundant
accesses
to
global
memory.
A
You
know
gpus
do
have
some
caches
on
them:
small
l1
and
a
small
l2,
but
you
don't
want
to
be
doing
a
lot
of
extra
reads
and
writes
and
avoid
long
sequences
of
diverged
execution
by
threads
within
the
same
warp.
So
I
don't
really
see
this
as
as
much
of
a
problem,
usually
because
of
the
problems
I'm
working
on.
A
So
finally,
you've
seen
this
slide.
I
I
just
want
to
reiterate
one
more
time
everything
that
I've
talked
about
in
this
presentation
applies
to
all
three
of
these
columns,
whether
you're
using
a
new
concurrent,
standard,
r
and
c
plus
plus
directive
based
models.
The
way
you
form
your
loops
and
the
way
the
accesses
are
done
in
the
innermost
loop
are
very
important
and
I
think
that's
all.
I.