From YouTube: 12. Hands-on demo: OpenACC, OpenMP, and CUDA -- Max Katz
Description
Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/
This demo solves the Laplace equation from math or physics using a standard method called Jacobi relaxation, which involves iteratively solving the equation: we repeat the update some number of times until we reach a particular error tolerance threshold, and then we stop.
If we look at the Fortran version, the way this works is that we have some initial setup where we set the initial conditions on the array data.
We have an outer loop which says keep going until we reach some tolerance threshold or until we run out of maximum iterations, and then it does the main update for this algorithm: at every location in the two-dimensional array, it updates that value to one-quarter times the sum of the neighboring values in the 2D array, that is, (i+1, j), (i-1, j), (i, j-1), and (i, j+1).
Those are the four neighbors of (i, j), and the update sets that point equal to the average of the neighboring values. That is how you perform this particular algorithm, but it's okay if you aren't familiar with it; you can just look at it as code and ask the question: how would I accelerate this code?
It keeps doing that until we reach some error threshold. At every iteration of the loop, after we perform the update, we store the result in a separate array, so each iteration's update is based on the values from the previous iteration.
So we first do the update and store it in Anew, based on the old data in A, and then we swap A and Anew by doing a simple loop over the entire array that sets A equal to Anew. At the end we check that we reached the tolerance to within the threshold and that we get the right answer, which is a known quantity for this case, and then we stop.
Now, that is the pure serial Fortran code, and there's an equivalent bit of C code as well, here in laplace.c, that does the same thing. It initializes the data and then runs an outer while loop: while the error is greater than the tolerance, it performs the update, setting each new value equal to the average of the neighboring values in the array, and we stop when we reach the tolerance.
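For reference, here is a minimal C sketch of the serial scheme just described, in the spirit of laplace.c; the array names (A, Anew), the sizes (N, M), and the boundary/initial conditions are illustrative assumptions rather than the exact course code.

    #include <math.h>
    #include <stdio.h>

    #define N 512
    #define M 512

    static double A[N][M], Anew[N][M];

    int main(void) {
        const double tolerance = 0.01;   /* stop once the error drops below this */
        const int    iter_max  = 1000;   /* ... or after this many iterations */
        double error = 1.0;
        int iter = 0;

        for (int i = 0; i < N; i++) A[i][0] = 1.0;   /* illustrative boundary condition */

        while (error > tolerance && iter < iter_max) {
            error = 0.0;
            /* Jacobi update: each interior point becomes the average of its four
               neighbours, computed from the previous iteration's values in A. */
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < M - 1; j++) {
                    Anew[i][j] = 0.25 * (A[i+1][j] + A[i-1][j] + A[i][j+1] + A[i][j-1]);
                    error = fmax(error, fabs(Anew[i][j] - A[i][j]));
                }
            /* Swap: copy the new values back into A for the next iteration. */
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < M - 1; j++)
                    A[i][j] = Anew[i][j];
            if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
            iter++;
        }
        return 0;
    }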
If I look around, I see I have my laplace.c in the class setup (alongside the Fortran laplace.f90). I can run nvc laplace.c, and that will compile the C version of the code; I can give it an executable name with -o laplace, and now I have a compiled version of the code. If I just execute that, you'll see that it performs this relaxation, this iterative algorithm.
After every 100 iterations of the algorithm, it prints out the current error, which is basically the difference between this iteration and the previous iteration, and we stop once we reach some tolerance threshold.
In this particular example we actually don't reach the tolerance before the end of the thousand-iteration maximum. So after a thousand iterations it has an error of about 0.0002, and that happens to be the correct answer, or at least the target answer, for this one.
That being said, of course, your result may differ within some floating-point rounding error, which will be pretty small but non-zero. So we are checking to within a certain tolerance, which is significantly larger than that floating-point round-off.
If I look at the code, porting to run on GPUs is often best thought of as an iterative process: you port a loop at a time, make sure that you get correct results, analyze the performance of that change, and keep going until as much of the calculation as possible is done on the GPU.
This is easiest to do in the directive-based models because they are incremental and non-destructive to your code. If I add something like omp target teams loop right here, the compiler will simply ignore it if it doesn't know how to deal with that directive, or if the compiler flag that turns on directive processing is not enabled, and I don't even have to change the structure of the loop in order to do this. So I can apply directives to the code one at a time.
I can add them without changing anything about the code itself or the way the algorithm works, and then see if I get the correct answer. That first loop is a little complicated, so I'm going to come back to it. Instead, I'm going to do it on this one, which is a two-dimensional loop (sketched below).
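As a sketch, reusing the illustrative names from the serial listing above, the directive sits right on top of that two-dimensional swap loop and the loop body itself is untouched:

    #pragma omp target teams loop
    for (int i = 1; i < N - 1; i++)        /* outer loop: parallelized across teams */
        for (int j = 1; j < M - 1; j++)    /* inner loop: parallelized across threads within a team */
            A[i][j] = Anew[i][j];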
I'm also going to want to turn on -Minfo=mp, which means the compiler should tell me about any places in the code where it generated OpenMP offload constructs corresponding to my directives, because I will often want to inspect whether the compiler actually did what I asked it to. The other thing I'll do is turn on -gpu=managed.
We talked about this a bit both today and yesterday. What it does is make all dynamic memory allocations, i.e. the ones that use allocate in the case of Fortran or malloc in the case of C, usable on either the CPU or the GPU (see the sketch below). I'll give it the executable name with -o laplace and then compile.
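As a rough illustration of what -gpu=managed buys you (a sketch with made-up names, not the course code): a plain malloc'd allocation can be touched inside a target region without any explicit data-movement clauses, because the allocation is placed in managed memory and migrated by the runtime.

    #include <stdlib.h>

    void zero_on_gpu(int n) {
        double *a = (double *)malloc(n * sizeof(double));
        /* With -mp=gpu -gpu=managed, this heap allocation can be accessed
           from both the host and the device. */
        #pragma omp target teams loop
        for (int i = 0; i < n; i++)
            a[i] = 0.0;
        free(a);
    }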
This -Minfo flag now gives us some information about what the compiler chose to do with my omp target teams loop statement. The first number in this line, 60, corresponds to the line of code where it found that directive. So let's double-check in the code: if I look at the bottom of my emacs window, you'll see that this does in fact correspond to line 60.
That's where my omp target teams loop is. Maybe there's a way to get this information out of vi, but good luck to those of you who use that. If I look at the information further down, it tells me even more about what the compiler chose to do based on the fact that the directive was in the code.
The first thing is that it says it is generating a GPU kernel. That's awesome: it means it was able to take the corresponding loop and turn it into a GPU kernel. It also gives me the compiler-generated name for this kernel. This will be useful later when we look at the Nsight Systems output, so we can verify that this kernel name corresponds to line 60 of the code, this particular for loop. "Generating Tesla code", by the way, just happens to be equivalent to saying we generated code targeting the NVIDIA GPU architecture.
Now, it also gives us even more nested information, which says how it chose to apply parallelism across the two levels of loops. It says that on line 61 it parallelized the loop across teams. Teams, we know, are the higher level of parallelism available to us in OpenMP. Then on line 62 it says it parallelized the loop across threads within a team, and it also tells us how many threads it chose to use per team, which is 128.
As Brent said, you don't have to think in this way; it's not required in order to generate code that runs and works on the GPU. But it is sometimes useful to at least understand that lower level of how things work, because that will be useful when you're looking at the performance of your code and thinking about whether it can be any better than it currently is.
It also tells you that it can generate multicore code because, as we've discussed, the NVIDIA compiler can do either GPU or multicore CPU parallelism for OpenMP target loops.
Finally, it is also telling you that it's generating an implicit map of A and Anew. What this is saying is that it recognizes that, for this code, A and Anew are not already present on the device. We talked about how both OpenMP and OpenACC have an explicit notion of GPU memory and CPU memory, and by default the way these models work, they expect you to map data (in the OpenMP context it's called a map) from the CPU to the GPU.
But if I use the -gpu=managed flag, then that's not necessary; one of the benefits of that flag is that all the memory management is handled for me. However, the semantics of OpenMP don't change: from OpenMP's perspective, if I'm accessing something on the device, there needs to be a map clause which says this data is available on the device. So the compiler is telling us that, in order to satisfy those semantics, it's generating a map statement.
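For comparison, here is a sketch of what the mapping would look like if you wrote it explicitly instead of relying on the implicit map (with statically sized arrays the whole array can simply be named in the clause; pointer-based arrays would need explicit array sections):

    #pragma omp target teams loop map(tofrom: A, Anew)
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < M - 1; j++)
            A[i][j] = Anew[i][j];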
What this does is record all the GPU activities, like we discussed yesterday, and print out any CUDA kernels that happened to run. In the CUDA kernel section you see that only a single kernel was used in this code, and its name is nvkernel_main__F1L60_1. This is exactly the same name that the compiler told us it was generating for that kernel.
There were a thousand invocations, and that makes sense, given that that code occurs in a loop that runs a thousand times. It also tells us the average time per call; it's listed in nanoseconds, so the numbers often tend to be a bit big, but you can see this ends up being about one millisecond per call.
I would find that it took about 3.7 seconds to run. Then, if I compile this without -mp=gpu, which basically turns off all the GPU support, I can rerun it and ask how much time that took.
It turns out the GPU code was arguably slower, or took about the same amount of time to run, as the CPU version. So that's not awesome; it suggests that we have a little bit of work to do. To see one way we could think about this, I'll go back and turn on the GPU flags now.
So here's the name of that .qdrep file, and I will advocate that there is nothing more informative when doing GPU porting work than actually looking at this profile in the user interface and seeing what happened; that really helps you understand what's going on. So what I'm going to do is copy that report file down to my local system.
Okay. Now, if I look at this profile, I see that the GPU was in use for the majority of the timeline. Let me make this a little bit bigger so it's hopefully easier to see: the GPU is in use for the majority of the timeline. This whole thing ran for about four seconds, which is pretty similar to what we said it took before, and for almost all of that time the GPU is being used.
We said yesterday that blue is when a kernel is running and red is when memory operations are occurring, so there are both memory operations and CUDA kernels happening the whole time. From a 10,000-foot level this is good: the GPU is being used. But let's now dig in a little deeper and understand this at a slightly finer level.
So now I've isolated a relatively small section of the timeline. It starts at about 1.7 seconds and ends about 15 milliseconds later, so it's a relatively short section of the timeline, and I can now see gaps in it. Blue is when CUDA kernels are running, which is GPU compute work, and when there's no blue on the timeline, that means no compute operations are happening at that time.
This is immediately a red flag to us. In almost every case, when you're running on GPUs, you're going to get the most performance out of the GPU if it is running essentially continuously; if there are big gaps in the timeline, you're not using the GPU, and that's likely to be an ineffective use of the platform that you're on. The bottom half of this row is memory operations; if I hover over any one of them, it tells me a little bit about that memory operation.
So if I look at what's going on here, I see a pattern repeated four times in this particular part of the timeline: a GPU kernel runs and, as part of that kernel operation, we transfer data from the CPU to the GPU; then there's a section of the timeline where we transfer data from the GPU back to the CPU and no GPU work occurs.
This is all part of the while loop, and this while loop has two components. It has the second loop, which we already moved to the GPU, which does the swap of the old and new arrays, and then it has this other loop, which is the one that performs the actual iterative update. Notice that I haven't yet ported this loop to the GPU, so the result is that it is going to run on the CPU. That has two consequences.
First, this loop is going to be a lot slower on the CPU than it would be on the GPU, because the CPU just doesn't have as many threads available to it; in fact, we're only running with a single thread on the CPU in this case. The other problem is that we're also transferring the data back and forth in order to support this operation.
That's going to make the CPU code slower, because it has to wait for the data to come back before it can operate on it. It's also indirectly making the GPU code slower, because that data constantly needs to come back from the CPU after this section of code in order to reach the GPU for the other section of code. So it turns out that it will actually be a lot faster overall to run both of these loops on the GPU.
This loop is a little bit more complicated than the previous one, the loop below it that we already did, and the main reason it's complicated is this second line. If I were to comment that line out and just look at the first line, this is pure compute, and every index (i, j) can be computed independently of every other index. It does depend on neighboring indices, but those are read from the old array, so each output element can still be written independently.
So I could fully parallelize over the i and j loops in this case and not have to worry about anything, and that would just be great; it would end up having the same kind of parallelism as the loop below and would be good. But it's actually a little bit more complicated, because I do have this line which computes a running tally of the error, which is the difference between the old array and the new array.
So we have to be cognizant of that and make sure that our loop construct knows that there is, in fact, a reduction operation occurring. In this case it is a max reduction, which calculates the maximum value of this error across the entire loop.
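A sketch of that update loop with the reduction spelled out, again using the illustrative names from earlier; the key addition is the reduction(max:error) clause on the directive:

    #pragma omp target teams loop reduction(max:error)
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < M - 1; j++) {
            Anew[i][j] = 0.25 * (A[i+1][j] + A[i-1][j] + A[i][j+1] + A[i][j-1]);
            /* running tally of the largest change between the old and new values */
            error = fmax(error, fabs(Anew[i][j] - A[i][j]));
        }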
Now, if you were to forget this, the compiler might do something that you expect, or it might do something that you don't expect. So first, let's ask what happens if I leave this off. If I forgot that there was a reduction here, what kind of code would the compiler generate? We can answer that, because with the -Minfo flag we will actually get output from the compiler about what's going on.
Before I analyze this, there are a couple of questions in Slack that I should answer. One question was: can I explain the difference between copy and copyin, and why, in this solution code, do we use copy for one and copyin for the other? Copy means that I want to copy in the initial value of the data and also, at the very end of the data region, copy it out,
so I can reuse it after that. When would you use copy versus copyin? Well, you'd only use copyin if you don't need the value at the end of the data region, and you would use copy if you do want the data at the end of the region. So copyin might be useful for some data with an initial value which then goes to the GPU, stays there, and you never need the data back, while copy would be more appropriate for a case where you're bringing the data to the GPU, updating it, and bringing the result back.
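A small OpenACC sketch of that distinction (illustrative function and variable names, not the actual solution code): the read-only input only needs copyin, while the array that is updated and needed back on the host uses copy.

    void axpy(const double *x, double *y, int n, double alpha) {
        /* x: host -> device at region entry only; its device copy is discarded.
           y: host -> device at entry and device -> host at exit, so the host
              sees the updated values after the region ends. */
        #pragma acc data copyin(x[0:n]) copy(y[0:n])
        {
            #pragma acc parallel loop
            for (int i = 0; i < n; i++)
                y[i] += alpha * x[i];
        }
    }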
If I were to look at the section of the timeline which shows the CPU threads, and that's this row here: every so often Nsight Systems will periodically sample the CPU and ask it what's happening at that point, and if you had written code with a deep call stack, then at every sampling point it would show you what the call stack is, and that can be used indirectly to determine what the hotspots are in your application.
The other thing you can do, which I won't talk about today, is to use NVTX, the NVIDIA Tools Extension, which allows you to manually instrument the code with human-readable strings that call out sections of the source code. You would then see those timeline sections appear in the Nsight Systems profile. I will, in fact, often do that when I'm starting to port a code from the CPU for the first time.
Before I write a single line of GPU code, I often will write NVTX regions in the code, identify where the time is being spent, and then start there, because you're going to get the most bang for your time investment if you port the code where the most time is spent. In this example that's not super useful or relevant, because it's obvious just from looking at the code where the time would be spent, but I completely agree that in general you should use that kind of instrumentation to figure out where to invest your porting effort from the start.
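A minimal sketch of what that NVTX instrumentation looks like in C; the header path below is the NVTX v3 one shipped with recent CUDA toolkits and the HPC SDK, and the region names are of course made up:

    #include <nvtx3/nvToolsExt.h>

    void run_solver(void) {
        nvtxRangePushA("initialize");    /* label appears on the Nsight Systems timeline */
        /* ... set up initial conditions ... */
        nvtxRangePop();

        nvtxRangePushA("jacobi_solve");
        /* ... the iterative update loop ... */
        nvtxRangePop();
    }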
Okay, so getting back to the code: I was asking the question, what would happen if I forgot this reduction clause? It turns out that the compiler is actually pretty smart. Notice here, first of all, the omp target teams loop, and it says "generating implicit reduction(max:error)", so it actually recognizes that a reduction is occurring here. Does this actually give us the right thing? Well, let's double-check.
If I run my laplace code, it looks like it does indeed still get the correct answer, and the same one as before. Notice also that, subjectively, it took a lot less time. What I can do is collect a profile of this run and see how much time was actually spent, and it turns out that the kernel that is now on line 61 (it was on line 60 before, but we added a directive above it) now takes only six microseconds to run on each invocation.
The new kernel that we just added on line 49, which does the iterative update, takes 18 microseconds to run, but notice that that's strikingly smaller than the one millisecond it took to run the kernel before. So we have made this code a heck of a lot faster; even though we added more work to the GPU, we actually made it faster overall, because (a) we accelerated the first loop and (b) we hopefully got rid of the data transfer operations.
I'll then open that up in my Nsight Systems viewer (report 3). Okay, so now the whole thing runs in some 700 milliseconds, and if I look at just the GPU section of the timeline, it's condensed down to, how much time is this, 100 milliseconds? That's obviously a lot faster than we had before. If I zoom in to any particular section of the timeline, what I'm looking for is hopefully smaller gaps; we can see there are still gaps, so maybe this is not as optimal as it could be. How long are the gaps? Well, in the report...
Also, each of these kernel invocations is extremely fast: whereas before a kernel was taking on the order of a millisecond to run, now we're running kernels that run on the order of 10 microseconds, so we've made the code something like 100x faster in some sense. As you can see, though, the overall runtime of the application did not get 100x faster. It still took 0.7 seconds, whereas it took a total of about four seconds before.
So even though we made the compute section of the timeline vanishingly small, we didn't change the fact that there's actually still a fair amount of setup cost to running on a GPU; there's a whole half a second here before the GPU code even runs. We talked about this yesterday; that's just an unavoidable fact of life when using GPUs. But if you were to run a bigger problem, or run more iterations of this solver, for example, you would hopefully amortize that out, and the GPU would end up paying off.