From YouTube: 7. Hands on demo: StdPar and Nsight -- Max Katz
Description
Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/
There are directories for tomorrow, covering OpenACC and CUDA; for today we just have the directory on standard language parallelism. If you look at the README, it shows we've given you two source files to look at, one in Fortran and one in C++. I recommend doing both, even if you only use one of those two languages, because there aren't really any code exercises, just practice with compiling and running.
We've given you a brief README which explains what's going on, what the prerequisites are for getting it running, and then a set of little exercises just to get practice compiling and running the code for both the Fortran and C++ cases. So I recommend going through these. Whether you're running on Perlmutter or on Summit, we've hopefully given you enough instructions for compiling and running, and in a little bit I will come back and say a little more about these exercises.
The Fortran example solves a linear system, Ax = b. It creates a matrix A and a vector b, and it's going to solve the system using the standard LAPACK operations: first an LU factorization of the matrix, and then the solve using the factored matrix A. It first initializes the matrix A with some random numbers.
It then fills in a right-hand side b, does the factorization and then the solve, and at the end it checks to make sure that we actually got the right answer, which we should. We can know, because we know what A and b are and we know the relationship between them, so we should be able to do a simple sanity check to make sure we got the right thing.
So the call to random_number that fills in the matrix A, the do concurrent operation that modifies the matrix A and fills in b, and those LAPACK operations can all be done on the GPU, with the support that the NVIDIA compiler has.
In order to do that: I'm doing this example at NERSC right now, but it's similar to how you'd do it at Oak Ridge. You can see what the requirements are for the NERSC environment. We want you to have PrgEnv-nvidia loaded on Perlmutter, which is part of the default environment, and we also want you to do module load cudatoolkit, which is not part of the default environment but is just one easy line to do. So you can see what my module environment looks like here.
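A sketch of that environment setup on Perlmutter, with the module names as given here (versions and defaults may differ by system):

    module load cudatoolkit   # PrgEnv-nvidia is already in the default environment
    module list               # check what the module environment looks like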
So this is basically the default environment, but with the additional cudatoolkit module loaded. Now, the first exercise is just to compile and run the code. The only part of the code that is not immediately standard Fortran is the calls to the LAPACK APIs, dgetrf and dgetrs. In order to satisfy those, you'll just add -lblas; if you do that, you'll use the BLAS library that ships with the NVIDIA Fortran compiler.
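A sketch of that CPU build (the source file name is a stand-in; copy the exact command from the README, which may also link -llapack, since dgetrf and dgetrs are LAPACK routines):

    nvfortran -o test_dgetrf_cpu test_dgetrf.f90 -lblas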
However, you can of course use other BLAS libraries on the system. So if you wanted to use MKL or some other BLAS library, you could absolutely do that. Either way, you get this test_dgetrf_cpu executable. Now, we've given you a sample submit script that you can use to run it; you can take a look at it. We have one for both NERSC and Oak Ridge.
Hopefully that will run pretty quickly, because it's a quick job and also it's part of this reservation. As usual, we'll get a Slurm output file which will record the output, so I can cat that. All this code does is print out either "Test passed" or "Test failed"; "Test passed" indicates that we got the expected answer from the linear system solve.
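The run step, as a sketch (the submit script name is a stand-in for the samples in the exercise directory):

    sbatch submit.sh        # sample NERSC or Oak Ridge submit script
    cat slurm-<jobid>.out   # should print "Test passed"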
Now for the GPU build: first of all, give it a different name to indicate that we're now doing the GPU build, and then we're going to need to add a couple of things. First, we need to add -stdpar. -stdpar means we want to take all the standard language constructs, in this case do concurrent, and run them on the GPU.
We're also going to want to say that we want to pick up the linear algebra and run it on the GPU. Remember, Brent said that's nvlamath, so that looks like -gpu=nvlamath. That requires the CUDA 11.4 back end, so I'm doing -gpu=nvlamath,cuda11.4. And then the last thing we do is tell it we want the CUDA libraries, with -cudalib, so that we pull in any CUDA libraries that would be relevant for this GPU support. In particular,
we're going to want the math libraries as well as the random number generator, cuRAND, in order to do that random_number call on the GPU. So that should be enough. As you can see, we had to turn on a number of options, but that's all we had to do: modify the compiler flags in order to get this example to run, and hopefully now run on the GPU.
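Putting those flags together, the GPU build looks roughly like this (file names are stand-ins; the README has the exact command):

    nvfortran -stdpar -gpu=nvlamath,cuda11.4 -cudalib -o test_dgetrf_gpu test_dgetrf.f90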
Okay, so if I cat my new output file: great, test passed. So the GPU build worked. However, it would be really nice if we could verify that this thing actually ran on the GPU. How do I know that it wasn't just falling back to the CPU, that I didn't perhaps make some mistake when I compiled and ran it? How do I know it was actually using the GPU?
We can also give it a name, so I'll just call this one test_dgetrf_gpu, and it'll automatically append the right file extension, so that we know this is the profile corresponding to that particular executable. So that's what I would do in order to collect a profile of this application.
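As a sketch, the collection command being described is Nsight Systems' command-line profiler (the exact launcher prefix depends on your system; --stats=true is the option exercise 3 asks for, mentioned below):

    nsys profile -o test_dgetrf_gpu --stats=true ./test_dgetrf_gpu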
So even though we didn't write any CUDA in this example, CUDA is being used under the hood by the compiler in order to do all the work, and you can see all of the CUDA operations that had to occur on the CPU in order to support the GPU workload, if that's something you're interested in. Then we have the CUDA kernel statistics. Kernels are the actual GPU compute work that occurs, and the output is a little bit too long, as you can see.
So it wraps over to a second row, but in each row you get the percentage of total GPU time spent in each kernel, as well as the name of the kernel, and then some other statistics: how much time was spent in that kernel in total, how many calls were made to the kernel, and the average time spent in each call of the kernel.
The name of the kernel in this case is generated by the compiler or the library, and it won't always be super intuitive to you. So, for example, this test_dgetrf_17_gpu is actually fairly informative: it's telling me that it's happening in my program test_dgetrf, on line 17, and that this is the GPU code that's being generated.
If I look at line 17 of my file, that is this do concurrent loop. So it's telling me that it's generating a GPU kernel corresponding to that do concurrent, and then a whole bunch of other kernels, with names that are less informative to you, that are generated by the linear algebra calls. So you can see that actually a fair amount of work is being generated to support this linear system solve; you don't have to worry about all of that code generation.
Somebody asked in chat about the fact that sometimes the names of the kernels get cut off if they're very long. Unfortunately, there's really nothing I can tell you to do about that in this standard-out summary. However, you can get the full name of the kernel when you actually open this profile in the user interface, which I'll show you next.
It tells you what happened, but seeing it in a timeline is much more powerful than just getting a standard-out text summary of what goes on. Okay, so at the end of the output we hopefully got information about the name of the file that was saved on the file system. It has the name I was asking for, test_dgetrf_gpu, and it has this .qdrep file extension, which is the native report format of Nsight Systems.
As I mentioned, this was renamed in the very most recent release of the tool, but it works the same way. So I'm going to go ahead and copy the name of that file, then open up a terminal, and scp that file from Perlmutter to my local system. I can just use standard scp for that: I give perlmutter, a colon, and the path to the file, and it copies it to some location on my local computer.
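That copy, as a sketch ("perlmutter" stands for your login host or ssh alias; the remote path is whatever was printed at the end of the profiling run):

    scp perlmutter:<path-to-report>/test_dgetrf_gpu.qdrep .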
Steve asks: can I repeat the command to build to enable profiling? You don't have to change any flags in the build in order to turn on profiling; those are separate activities. The GPU compilation flags are in the README here, so you can copy them: -stdpar, -gpu=nvlamath,cuda11.4, -cudalib. These are all necessary for turning on GPU support. And then the text for exercise 3 says what you need to do, with --stats=true, in order to collect the profile.
Okay, so I have copied this report file from Perlmutter to my local system. I already have the Nsight Systems user interface up, because I was showing you an example report before. What I need to do is go to File and then Open, and locate this file that I just downloaded on my file system.
The GPU activities are in this CUDA hardware row, and then all of the runtime API calls, for example the CUDA calls that I was showing you that orchestrate all of this work, are here in this row. Additionally, you can see the load on all the CPU cores that are being used, if you want to; this can be useful for understanding when and where there was any load on the CPU cores.
What I really want you to pay attention to is the CUDA hardware row, because this shows you all of the actual compute and memory work that happened on the GPU. Everywhere the row is blank, nothing was happening on the GPU, and anywhere there's any color in the row is where things were happening.
The timeline runs from the beginning at zero seconds to the end at something after six seconds, so the GPU work is actually constrained to a fairly small chunk of the timeline.
This first bit here is going to be the call to do concurrent that initializes the data. If I zoom in really far, you can see that the kernel being run here is test_dgetrf_17_gpu, so this is that do concurrent loop on line 17 of the code. By the way, to zoom in and out you hold down Control (I think it might be Command on a Mac) and then use your mouse scroll wheel if you have one, or a pinch-and-zoom motion if you're using a touchpad. I have to zoom all the way out in order to see, or I can just right-click and do Reset Zoom. Zooming all the way back out, I can see that the GPU activity that actually does the linear system solve happens at the very end of the run.
So it's constrained to a fairly narrow chunk of the timeline, and I could see the names of these kernels if I wanted to. They're not going to be super useful to you, because these are the individual kernels run by the linear algebra library, so the exact names aren't relevant, but the fact that you can see names like getrf indicates that this is the work corresponding to the linear system solve.
So if I reset, what I see is that only a very small chunk of this timeline is actually using the GPU; almost all of the time is spent setting up data or handles. This is pretty characteristic of GPUs: setting up work on the GPU is fairly expensive, initializing the GPU is expensive, allocating memory is expensive, and so, if you don't have a lot of work to do, you may be killed by these initialization costs.
So the last exercise, and I won't show you this, but I recommend you do it on your own: make this a bigger problem and then see if a longer chunk of the timeline is spent on the GPU. You might even be able to ask yourself the question: can I make this problem big enough to effectively amortize out the cost of the initialization?
Now, this particular example wasn't set up to show you excellent performance, and there are ways to write code that mitigates this behavior, but it is worth knowing that, in general, initializing data and GPU state is expensive. So you want to reuse as much memory as you can, and typically that works out to something like running a code that launches a large number of iterations or time steps, or something like that.
That way you can amortize out that initialization cost. So yeah, try making the problem bigger and see if that affects the shape of the profile.
We've also got a C++ example that Matt set up, which does a std::transform and kind of resembles something he was showing you in his lecture. What it does is create two vectors x and y, which are just arrays of size n, in this case a million, initialize them to some data, and then do a*x + y in the context of the C++ parallel algorithms.
One way to implement that is with std::transform. With std::transform, you tell it the first vector: the first pointer or location in the array, and then how many pointer offsets later to stop. You give it the second vector as well, and then the last argument is the receptacle, the output for the data.
In this case we're basically doing an in-place update of y. Then you write a lambda which basically says: I want to return y + a*x, and that's the saxpy operation. And then there's just a check at the end. So you should also verify that you can compile and run the C++ example, and you can also practice collecting a profile with Nsight Systems to verify that the GPU work actually occurred on the GPU.
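As a rough sketch of what such a file could look like (a minimal version; the names, initial values, and final check here are my assumptions, not Matt's actual source):

    #include <algorithm>
    #include <cstddef>
    #include <execution>
    #include <vector>

    int main() {
        const std::size_t n = 1000000;   // "a million" elements
        const double a = 2.0;
        std::vector<double> x(n, 1.0), y(n, 2.0);

        // y = a*x + y, in place, via the C++ parallel algorithms
        std::transform(std::execution::par_unseq,
                       x.begin(), x.end(),   // first input range
                       y.begin(),            // second input range
                       y.begin(),            // output: in-place update of y
                       [=](double xl, double yl) { return yl + a * xl; });

        // sanity check at the end: 2.0*1.0 + 2.0 == 4.0 exactly
        return std::all_of(y.begin(), y.end(),
                           [](double v) { return v == 4.0; }) ? 0 : 1;
    }

Built with nvc++ -stdpar=gpu, a parallel algorithm like this can be offloaded to the GPU.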
Any other questions? Oh: is there an option to run some parts of the code on the CPU and some on the GPU? The short answer today is not really; we don't really support a mixed mode. Certainly, in the context of OpenMP, if you use non-target regions then you may be able to combine OpenMP host threading with GPU target regions running on the GPU. But in general,
we don't support something like having multiple C++ parallel regions, like std::transforms, with some of them running on the GPU and some of them running on the CPU. That's a little bit too challenging and tricky for us to implement, and also, honestly, there are very few circumstances where you would want to do that. So today we don't support that, but if you have a really compelling use case for why you'd like to mix those things, you can always reach out to us and we'll be happy to hear you out.
Yeah, there was a question in Slack: can I just go over one more time what this whole thing is? It may help to Google the API for std::transform to go along with it, but basically, let me go through these arguments one by one.
So the first argument is what we call the execution policy. The execution policy is basically telling the compiler some statement about the relationship between the iterations of the for loop; really, what it means is telling the compiler: how should I generate code to do this loop? That can either be done serially, so I run the iterations of the loop one by one.
If you think about the transform as really representing a for loop, like this for loop above from zero to n, then a serial execution policy would basically mean: generate code which looks like this for loop, so iteration zero is before iteration one, etc. But we can also give it parallel execution policies, and in particular the one that we're using here, par_unseq, means that it is both parallel and that there is no specified relationship between particular iterations, so it can do iteration one thousand before iteration zero, or after. We are explicitly telling the compiler that this is allowed.
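To make that concrete, here is a minimal illustration (my own sketch, not the exercise source); the call is identical except for the policy argument:

    #include <algorithm>
    #include <execution>
    #include <vector>

    // y = a*x + y with an explicit execution policy.
    void saxpy_seq(double a, const std::vector<double>& x, std::vector<double>& y) {
        // seq: behave like the plain for loop -- iteration 0 before iteration 1, ...
        std::transform(std::execution::seq,
                       x.begin(), x.end(), y.begin(), y.begin(),
                       [=](double xl, double yl) { return yl + a * xl; });
    }

    void saxpy_par(double a, const std::vector<double>& x, std::vector<double>& y) {
        // par_unseq: iterations may run in parallel, in any order; this is the
        // promise that lets the compiler offload the loop to the GPU.
        std::transform(std::execution::par_unseq,
                       x.begin(), x.end(), y.begin(), y.begin(),
                       [=](double xl, double yl) { return yl + a * xl; });
    }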
The second and third arguments are the beginning and end of a particular array or iterable. The next argument is a second input, because this form of std::transform is basically combining two pieces of information, and then the argument after that is the output for the data. What it's going to do is pick a pair of values from x and y, give them to you in scalar form, xl and yl, as read-only data, and then the return value of the lambda is what I want to do with that particular combination of data from x and y.
If I have some OpenMP directives in my code but I compile it with no OpenMP options, then no, the compiler should just ignore the OpenMP pragmas.