From YouTube: Building and Running GPU Applications on Perlmutter
Description
Part of the Using Perlmutter Training, Jan 5-7, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/using-perlmutter-training-jan2022/
So we're going to walk through some examples of building and running GPU applications on Perlmutter.

So in this session we'll look at building and running an application on Perlmutter with MPI and GPUs, using CUDA as an example. We'll then have a little bit of a break, which should land at around lunchtime for those in the Eastern time zone in the US, and then we'll go into session two, which will be a slightly longer, also hands-on-oriented session, where we'll walk through a few additional scenarios, such as a little bit about some of the math libraries and using other compilers rather than only NVIDIA's.
So, first up, when you are compiling code: you'll have your normal program source code, a bunch of C, C++, or Fortran 90 CPU source code files. These can have MPI calls within them. They may use directives for using the GPU, such as OpenACC or OpenMP. Those you'll compile with the regular compilers, and more specifically, on Perlmutter and NERSC Cray systems generally, you'll be using the Cray compiler wrappers, which give you the MPI stack and some other niceties built in.
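For instance, a minimal sketch of what that looks like in practice (the source file names here are just hypothetical):

    CC  -o my_app    my_app.cpp     # C++ wrapper; MPI and other Cray libraries come in automatically
    cc  -o my_c_app  my_c_app.c     # C wrapper
    ftn -o my_f_app  my_f_app.f90   # Fortran wrapper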
This is all stuff that NERSC users who have been using Cori for a while will already be familiar with.

Then we have CUDA code, which comes in .cu files, and those you'll compile with nvcc, which is part of the NVIDIA CUDA stack. Just a kind of a tip, though: with PrgEnv-nvidia, which uses the NVIDIA CPU-side compilers, those compilers can actually read CUDA code incorporated into the same source files. To enable that, you can add the -cuda or -gpu flag at compile time, and we'll actually see that in the examples.
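Roughly, a sketch of the two routes just described (the file names are hypothetical):

    nvcc -arch=sm_80 -c kernel.cu                  # plain CUDA in a .cu file, compiled with nvcc
    CC -cuda -gpu=cc80 -o app main_with_cuda.cpp   # PrgEnv-nvidia: CUDA code embedded in a C++ source file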
So, looking at the software stacks that we'll be working with: on the left-hand side here, by default we have the PrgEnv-nvidia software stack loaded. This gives you the NVIDIA compilers, plus the Cray compiler wrappers, a few useful underlying Cray libraries such as cray-libsci, and the MPI stack.

We recommend generally, if possible, using the provided Cray MPICH rather than building your own Open MPI or whatever, because that's the one that's best optimized for our high-speed network, and it's also part of the Cray PE magic that makes the compiler wrappers do a lot of things automatically without you needing to put extraneous options in there.
So this should be fairly familiar if you've already used Cori. What's new on Perlmutter, but might be familiar if you've used GPU and CUDA applications on other systems, is the CUDA stack here, which you can get with module load cudatoolkit. That gets you nvcc, the NVIDIA CUDA compiler we talked about before, plus a bunch of libraries and tools which are all part of the NVIDIA CUDA Toolkit and are needed for GPU code. So you'll need to module load cudatoolkit when you're building things for GPUs.

So, what to actually load: for most applications, including the examples today, we recommend that you use the PrgEnv-nvidia stack, and this one is loaded by default when you log into Perlmutter, so unless you're changing something, it should already be there. To build GPU applications, which is going to be the case for Phase 1, you'll also need to load a cudatoolkit module. There are a whole slew of cudatoolkit modules available on the system that match with different versions of CUDA and different versions of the compiler, particularly the NVIDIA compiler.
The default one is generally the best one to use right off the bat, unless it doesn't work. Pretty much what you want to do is choose one that has a CUDA version that matches what your application needs, and the default one is currently the latest CUDA version available on Perlmutter, so 11.4, I think. If you're doing OpenMP or OpenACC offloading, or also using CUDA-aware MPI, which we'll cover in a little bit, you'll also need to load one of the craype-accel modules, and in particular we want craype-accel-nvidia80.
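As a minimal sketch, the module setup being described is roughly:

    module load cudatoolkit             # default version; pick a specific one if your application needs it
    module load craype-accel-nvidia80   # needed for OpenMP/OpenACC offload and CUDA-aware MPI
    # PrgEnv-nvidia itself is already loaded by default at login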
You'll see this number 80 pop up a bit. This is because the, I guess, architecture version number of our GPUs is sm_80, so that 80 is the Ampere series. If you used Cori GPU, you will have seen sm_70 come up a bit; that was the Volta generation that we had before.

All right, so let's get kind of straight into it. Hopefully people have got a session open so that they can log in to Perlmutter and run from there.
A
Git
clone
this,
unfortunately
rather
long
url,
but
what
you
can
do
is,
if
you
type
module,
show
training,
it
will
print
up
a
little
help
test
that
includes
this
url,
so
you
can
copy
and
paste
it
within
there
go
to
the
the
directory
called
cuda.
Slash
ex3
will
jump
straight.
To
example,
three
don't
forget
to
module
load,
cuda
toolkit
run,
make
and
take
a
look
at
the
output.
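So, roughly, the steps look like this (the clone URL is the long one printed by module show training, so it isn't repeated here):

    module show training               # prints the help text, including the git URL
    git clone <URL-from-the-help-text>
    cd <cloned-repo>/cuda/ex3
    module load cudatoolkit
    make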
A
Here
we
go
so
what
we'll
do
to
try
to
basically
identify
when
people
are
ready
to
move
on
if
everybody
can
use
your
zoom
session
to
raise
your
hand
to
begin
with,
and
then
when
you've
made
it
through
the
exercise
lower
your
hand
and
we'll
kind
of
watch
for
hands
that
are
raised
and
enhanced
lowered
to
see
kind
of
where
people
are
up
to.
I
will
also
create
a
breakout
room.
A
I'll
be
helping
we'll
call
it.
I
hope
if
you
have
issues
having
accounts
issues,
answers
to
promoter
issues.
A
Okay,
so
hopefully
you
should
see
now
in
your
zoom
controls,
an
option
to
jump
to
a
breakout
room
and
there's
a
breakout
room
called
help
with
connections.
I
think,
is
what
I
called
it
and
we'll
have
one
or
two
nest:
people
in
there.
A
So
if
you're
having
trouble
not
so
much
with
the
exercise,
not
perhaps
for
the
exercise,
but
but
particularly
with
just
getting
onto
pearlmatter,
if
there's
something
wrong
with
your
account,
please
use
the
breakout
room
and
then
we
can
sort
of
you
know
separate
those
challenges
from
challenges
of
using
the
exercise
itself.
A
So
I
can
see
I
can
see
quite
a
few
hands
raised,
which
is
a
good
sign.
People
are
paying
attention
and
raising
hands
and-
and
some
may
have
already
finished
it's
going
up
and
down.
We
might
make
it
sort
of
about
five
minutes
instead
of
ten
minutes
to
run
through
this.
It
should
be
a
reasonably
simple
exercise.
Hopefully.
A
I
see
a
question
in
the
chat
from
from
william
about.
Should
we
clean
it
to
hove
or
p
scratch,
you
can
clean
it
to
either
really.
Actually
it's
a
reasonably
small
set
of
examples.
A
So
for
issues
with
compilation,
if
it's
a,
if
it's
a
straightforward
question,
I
think
the
google
doc
is
going
to
be
the
way
to
go.
A
A
Hopefully,
somebody
from
nurse
will
slap
me
if
there's
a
any
questions
that
I
should
answer.
What does this tell us? All right, so up here we've got CC: the makefile is calling the C++ Cray compiler wrapper. It's all a single CUDA file in this example. The Cray wrapper is calling the NVIDIA C++ compiler underneath, and that accepts an option -gpu=cc80, where cc80 is the tag for the architecture that corresponds to our GPUs, and it's creating an executable called vec_add.
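So the compile line in the makefile is roughly of this shape; this is only a sketch, and the exact source file name (and whether -cuda appears explicitly) may differ in the repo:

    CC -gpu=cc80 -cuda -o vec_add <source-file>   # Cray C++ wrapper calling nvc++ underneath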
A
If
you
did
have
troubles
with
that,
there
is
a
already
made
executable.
Actually,
that
is
in
the
examples
directory.
That's
pointed
out
by
that
module
show
training.
A
So
you
may
have
already
poked
around
and
discovered
these
there's
a
few
more
exercises
in
that
same
directory
to
walk
through
kind
of
at
your
leisure.
We
we
looked
at
exercise
three,
which
is
mpi
plus
gpu,
exercise
one
and
exercise
two,
a
simple
gpu
kernel
only
without
mpi,
particularly
if
you
found
difficulties,
building
the
mpi
one.
A
These
might
be
a
good
place
to
start
to
solve,
solve
problems
one
at
a
time
and
the
readme
one
level
up
should
have
a
bunch
of
information,
but
we
might
have
to
check
that
somebody
did
comment
that
a
readme.md
file
they
found
was
empty.
A
Okay,
step,
then,
is
once
we
build.
It
is
to
run
it
important
things
to
remember.
This
is
a
hpc
cluster,
don't
run
on
the
login
nodes,
submit
a
batch
job.
So
in
this
case
it's
a
very
short
job,
so
so
it
might
not
be
as
critical,
but
for
any
real
work.
You
definitely
want
to
be
submitting
a
batch
job.
Also,
when
you're
in
the
batch
environment,
you've
got
the
full
yeah,
slingshot
mpi
stack,
you
know,
I
think
on
falmouth
is
going
to
be
a
little
bit
easier
than
it
was
on
corey.
A
Other
important
thing
to
remember
is
when
you're
submitting
a
job
on
perlmutter.
You
must
specify
a
gpu-enabled
account
name.
A
A
So
when
you
submit
a
job,
you'll
need
to
especially
a
and
get
that
account
when
you're
doing
your
your
kind
of
real
work
later
you'll
use
your
own
project
account
for
that.
So there are a bunch of necessary sbatch options, and with GPUs there are a couple more now than what you were familiar with on Cori. The first bunch are pretty much the same: you'll need -q, which is the QOS, and for almost everything you'll want to use the regular QOS.

You want to set a time limit; if you give it just a number, that's the time limit in minutes for Slurm. So in this example here we're saying that after five minutes Slurm is allowed to kill this job. For a real job you'll probably want a few hours; finding the right time limit is very application dependent, and it is worth experimenting to get the right number. Dash lowercase n specifies the number of MPI tasks.
A
This
is
a
lowercase,
so
there's
a
number
of
mpi
tasks
as
opposed
to
uppercase,
which
would
be
the
number
of
nodes
and
we'll
come
back
around
to
that
on
the
next
slide.
When
we're
talking
about
those
splitting
work
over
gpus
as
well
as
nodes
dash
c
sets,
the
number
of
cpus
per
task
slerm
considers
a
cpu
to
be
what
linux
considers
a
cpu,
which
is
actually
what
we
might
call
a
hyper
thread.
So
our
amd
gpu
nodes
each
have
64
cores.
A
Each
core
has
two
hyper
threads,
so
so
linux
and
therefore
slim
sees
the
node
as
having
128
cpus.
So
here
that
c32
means
that
for
a
single
mpi
task,
you're
reserving
one
quarter
of
a
node
we'll
get
to
the
next
few
in
the
next
slide.
But
importantly,
particularly
for
today
is
we
have
a
reservation
called
pearl
mother
day.
One,
and
you
know
I
think
that
actually
should
be
two
dashes
in
reservation,
but
it
should
be
right
in
the
batch
file.
I
hope
so
when
you
submit
it
we'll
go
to
this
palmetto
day.
A
One
reservation
which
you
should
have
access
to
because
of
the
entrain
three
underscore
g
repo
that
you
that
everybody
is
in.
We also want to specify how many tasks per node. Our GPU nodes have got four GPUs per node, so I guess the simplest way to use things is to have one MPI task per GPU, which is to say four MPI tasks per node. So this sort of corresponds with that -c 32, which gave us a quarter of a node for each task.
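Put together, the batch header being described looks roughly like this; it's a sketch only, and the account and reservation names are placeholders, so check the provided batch script for the real values:

    #!/bin/bash
    #SBATCH -q regular              # the QOS
    #SBATCH -t 5                    # time limit in minutes
    #SBATCH -n 4                    # number of MPI tasks (lowercase n)
    #SBATCH -c 32                   # CPUs (hyperthreads) per task: a quarter of a 128-CPU node
    #SBATCH --ntasks-per-node=4     # one MPI task per GPU
    #SBATCH --gpus-per-task=1       # the GPU request, discussed next
    #SBATCH -A <gpu-enabled-account>
    #SBATCH --reservation=<training-reservation>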
So it looks like there are quite a few comments and questions in the chat, and it looks like Ronnie and Laurie are helping answer them very quickly, so thanks for that. All right, then we come to actually running the GPU code. So, skipping the top part of the batch script up here, it finishes up with pretty much an srun command.

This should be fairly familiar for those who've used Cori before. Just another thing that you'll see when we look at the examples in a moment is that several of the examples don't have --gpus-per-task, but instead they have dash capital G. With -G you're specifying the total number of GPUs for the job; so if you've got two nodes, with eight GPUs available, that's -G 8.
A
This
is
kind
of
a
handy
shorthand,
it's
shorter
type
than
gpus
per
task
and
good
for
when
you're,
just
using
one
or
two
nodes,
when
you
start
using
larger
numbers
of
nodes,
calculating
it
out
will
get
a
little
bit
unwieldy
and
you
know
it's
a
little
harder
for
documentation
in
terms
of
you've
got
to
calculate
it
out
to
work
out
for
nodes.
So
for
for
your
larger
scale,
real
jobs,
you
probably
want
to
switch
across
to
gpus
per
task,
but
that's
what
the
dash
g
means
in
the
examples
here.
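As a sketch of the two equivalent ways of asking for GPUs on an srun line (the executable name is a placeholder):

    srun -n 8 -c 32 -G 8 ./my_gpu_app                # total GPUs for the job: 2 nodes x 4 GPUs
    srun -n 8 -c 32 --gpus-per-task=1 ./my_gpu_app   # the same request expressed per task, which scales more readably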
A
So
an
easy,
easy
thing
to
admit
is:
if
you
don't
set
gpus
per
task,
then
what
you
can
get
is
actually
a
floating
point
error
and
the
re
it's
a
floating
point
error,
not
a
sequel,
and
the
reason
for
this
is
that
basically,
the
gpu
hasn't
been
allocated
to
the
job,
we're
trying
to
run
on
a
gpu.
It
doesn't
have
one
it
trips
over.
So
if
you
get
an
error,
first
thing
to
check
is
that
you
have
all
of
the
asbestos
directive
set.
A
Okay,
so
now,
let's
go
to
another
hands-on
period
in
your
clone
of
that
directory,
you
can
go
back
to
ex3
make
if
you
haven't
already
done
that
which
hopefully
should
have.
If
you
didn't
succeed
in
building
it
before
the
module,
show,
training
or
module
load
training
will
point
you
at
a
place
where
we
actually
have
a
pre-built
executable
that
you
can
use
you
can
copy
across.
A
A
Actually
more
than
caught
up
so
we'll
have
about
sort
of
five
or
ten
minutes
for
this,
and
I
think
what
comes
after
this
is
actually
a
break,
so
we
can.
A
So
we'll
move
along
to
the
next
step.
We
actually
have
two.
We
do
have
one
more
topic
and
exercise
in
this
session
before
moving
on
to
the
second
session,
so
we
are
still
slightly
behind,
but
not
too
far,
and
we
should
catch
up
in
the
next
one.
A
So
experienced
quarry
users
will
be
familiar
already
with
the
ideas
of
affinity
and
binding,
we'll
use
those
on
that
system
as
well.
So
different
cpu
cores
have
an
affinity,
which
is
to
say
a
closeness
to
certain
memory
and
caches,
and
you
can
bind
a
thread
or
a
process
to
particular
cause
to
make
sure
that
that
thread
stays
stays
on
a
core,
that's
close
to
its
data.
A
A
So
a
similar
concept
holds
for
perlmutter
as
well,
so
the
filament
gpu
knows
you
can
figure
it
in.
What's
called
nps4,
it
stands
for
pneuma
nodes
per
socket
four,
which
basically
means
that
each
each
socket
each
cpu
or
what
you
call
it
node,
I
guess,
is
arranged
so
that
certain
cores
are
closer
to
certain
gpus
there.
There
are
four
kind
of
pneuma
nodes
on
each
gpu
node
and
each
gpu
is
closest
to
one
of
them.
A
So
this
diagram,
here
kind
of
in
a
slightly
cartoonish
way
illustrates
that
the
ccd
is
sort
of
the
unit
that
holds
a
bunch
of
cores
in
amd's
epic
architecture.
It's
divided
here
into
four
quadrants.
There
is
a
certain
amount
of
memory,
that's
closest
to
each
quadrant
and
a
single
gpu,
that's
closest
to
each
quadrant.
A
So
where
this
starts
to
matter
is
when
you
are
arranging
your
job.
Yeah
you've
got
some
gpu
tasks.
Some
of
the
work
can
happen
on
cpus.
It's
spread
over
multiple
nodes,
so
you're
going
to
want
to
have
some
sort
of
control
over
this
there's,
actually
quite
a
lot
of
several
options
that
you
can
use
around
binding
and
the
ones
here
are
sort
of
a
good
place
to
start
with
sort
of
a
you're
reasonably
sensible
default.
A
A
That's
just
cpu
bind
equals
cause,
which
is
to
say
that
a
given
task
is
locked
to
certain
cause.
It
can
move
around
in
the
hyperthreads
on
those
cores,
but
it
has.
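As a minimal sketch of the kind of srun line this corresponds to (binding flag added to the options from before; the executable name is a placeholder):

    srun -n 4 -c 32 --cpu-bind=cores --gpus-per-task=1 ./my_gpu_app   # each task pinned to its own set of cores
    # GPU-side binding has its own options too, e.g. something like --gpu-bind=closest; see the docs link mentioned next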
A
This
link
here,
josh.net
docker
jobs
affinity,
has
more
information
about
the
gpu
binding
options
and
we
can
do
a
quick
hands-on
to
try
it
out
in
ex-5
of
that
same
repo
that
you've
been
working
in.
A
So
if
you
go
in
here
make
the
code
it's
the
same
executable,
but
these
two
batch
scripts
will
do
different
things
with
the
binding,
so
one
won't
do
anything
with
the
binding
and
the
other
will
bind
it
to
the
closest
gpu.
If
you
look
at
the
outputs
of
each
it
describes
in
terms
of
pci
identifiers
which
gpu
each
task
has
available
to
it.
So let's do the same thing, just to be able to see where people are at: if everybody can raise their hand on Zoom, and then, when you've been able to run the exercise and you've seen some interesting output from it, put your hand down; and when the number of raised hands gets reasonably small, or after a few minutes, we'll continue.

So, now that we're recording again, to recap what we've talked about so far this morning: we've built and run a simple C++ application using MPI with CUDA, using the compiler wrappers for the CPU/MPI-side code and nvcc for the CUDA code.

For the rest of this session we're going to go into a few more of the edge cases. A lot of people might be saying here: this is all very well for a simple training exercise, but my application is more complicated than that. So we'll go through a few other common scenarios that people are likely to hit, and what you can do in those cases.

Some topics for here are: what about things like BLAS, LAPACK, FFTW, etc., when you're using GPUs? And what about if the NVIDIA compiler isn't suitable for, or doesn't work for, your application?
So, GPU-accelerated math libraries in CUDA: there are GPU-accelerated implementations of, or alternatives to, a lot of the common math libraries. For instance BLAS, which is sort of at the bottom of everything — there's cuBLAS, which is a CUDA equivalent, and you get that when you module load cudatoolkit. LAPACK doesn't have a direct equivalent in the NVIDIA stack, but the NVIDIA stack does include cuSOLVER, which does similar things to a lot of the LAPACK routines and includes some of the LAPACK routines directly.

It doesn't have quite the same API, though, so you do need to write your code for it. But the good news is that with the NVIDIA compiler there's an option you can add, -nvlamath, and what that does is basically add a LAPACK-equivalent, or near-equivalent, interface on top of the cu libraries.
A
I
haven't
included
a
link
here,
but
if
you
do
a
search
for
nvla
math
on
our
docs,
you
should
find
something
fftw,
there's
a
ceo
fft,
which
is
a
cuda-oriented
fft
and
cofftw,
which
is
an
fftw
interface.
To
that
there's
also
cu
sparse,
which
does
yeah
some
sparse
solvers.
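As a rough sketch of pulling those in with the NVIDIA compilers once the cudatoolkit module is loaded (the source file name is hypothetical):

    CC -cuda -gpu=cc80 -cudalib=cublas,cufft -o solver solver.cpp   # -cudalib links the named cu libraries
    # or link them explicitly, e.g. with -lcublas -lcufft -lcusolver at the link step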
So those ones are part of the NVIDIA stack that you get with module load cudatoolkit. There are also a bunch of third-party math libraries that are GPU accelerated. Two that are probably particularly useful and important are MAGMA and SLATE, which between them cover a subset of LAPACK and ScaLAPACK, and here's the link that you'll want to follow in our docs to some tips about using these libraries.

We're not going to go into too much detail about those today, but just a reminder and a quick plug for an upcoming training that Helen mentioned in the welcome this morning. This is only next week, I think: we have some training from NVIDIA about their HPC SDK. It's a hands-on training that will cover, amongst other things, these GPU-accelerated math libraries. There's a link down here for registration and info, and these slides — if that's quicker than typing it in — are available from this training event's web page at the moment.
A
Next
scenario
is
what,
if
you're
not
using
the
nvidia
compiler,
so
so
we
recommend
the
nvidia
compiler,
for
you
know
as
the
as
the
first
approach
for
most
things
for
gpu
based
applications
on
perlmatter.
It's
it's,
the
one
that
has
the
best
support
for
the
gpu.
A
You
know
tool
chain,
it's
the
one
that
we've
sort
of
done,
the
most
with
in
terms
of
you,
know,
nick's
certain
preparation
on
on
filmmatter,
it's
the
default,
and
it's
the
one
that's
loaded
by
default
when
you
log
in
so
that's
kind
of
all
for
a
reason,
so
yeah
do
try
that
first,
however,
you
know
we
have.
We
have
four
different
compiler
stacks
and
they
all
have
different
strengths
and
weaknesses
and
yeah.
You
might
find
that
for
some
applications
you
do
hit
difficulties
with
programming
video.
A
A
It's
pretty
portable,
it's
available
in
everything
which
tends
to
mean
that
it
gets
bug,
fixes
and
features,
and
so
on
fairly
quickly.
So
that's
what
we
recommend
to
the
second
alternative.
We
also
have
on
the
system.
If
you
type
module
avail
program,
you'll
see,
we
have
a
there's:
a
cray
program,
programming,
environment
and
an
amd
programming
environment.
These
two
currently
are
more
cpu
oriented
than
gpu
oriented
and
we
haven't
done
too
much
with
them
just
yet.
A
Okay,
so
oops
a
couple
of
limitations
to
be
aware
of
for
different
compiler
stacks
when
you're
using
the
gnu
compiler,
you
need
to
choose
the
right
gcc
version
for
the
cuda
version
that
you're
using
now.
The
good
news
is
that
doing
the
default
shouldn't
just
work,
so
the
the
default
cuda
toolkit
is
actually
cuda,
toolkit
21.9,
underscore
11.4,
and
that
supports
the
default
gcc,
which
is
11.2.0,
but
if
you're
using
an
earlier
toolkit
version,
you'll
also
need
to
use
an
earlier
gcc
version.
A
The
gnu
compilers
that
we
have
installed
don't
currently
support
open
mp
and
open
acc
offloading
that
is
coming
soon.
I
think
that
it's
not
there
yet
also
the
handy
trick
of
having
cuda
code
embedded
in
your
source
files
is
specific
to
the
nvidia
compiler.
So
with
the
gnu
compiler,
you
need
to
have
the
cuda
code
in
its
own
separate.cu
file.
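So with PrgEnv-gnu the build splits into two steps, roughly like this (the file names are placeholders, and you may also need to point the linker at the toolkit's library directory):

    module load PrgEnv-gnu                 # switch from the default PrgEnv-nvidia
    module load cudatoolkit
    nvcc -arch=sm_80 -c kernel.cu          # the CUDA code, in its own .cu file, compiled by nvcc
    CC -c main.cpp                         # the CPU/MPI code, compiled by the GNU C++ wrapper
    CC -o app main.o kernel.o -lcudart     # link, pulling in the CUDA runtime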
A
This
is
another
compiler
stack
that
I
didn't
mention
before,
which
is
the
llvm
one.
Llvm
is
yeah
the
clang
and
flying
stack.
A
few
of
the
other
compiler
stacks
are
actually
based
on
it
and
coming
soon
we
we
have
plans
and
development
of
a
program
llvm
as
a
nisk
supported
program
based
on
the
llvm
compiler.
A
It's
not
there.
Yet
it's
not
far
off.
I
understand
it's
currently
targeting
c
and
c,
plus,
plus
only
it
doesn't
have
a
fortran
stack
in
there.
Yet
it
should
have
support
for
sickle
and
openmp
offload
and,
as
I
said,
it's
not
available
just
yet,
but
is
expected
soon.
A
The
limitation
we
currently
have
with
the
cray
compiler
is
that
it
supports
the
v100
gpus,
which
is
the
model
before
ours,
but
it
doesn't
yet
support
the
a100
gpus
so
to
use
that
the
a100
gpus
could
still
run
v100
code,
but
it's
not
going
to
be
as
as
optimized
and
you
need
to
load
a
different
creepy
excel
module
for
that
just
one
earlier.
A
Also,
we
really
haven't
spent
much
time
testing
this
and
so
nurse's
ability
to
support.
It
is
a
little
bit
more
limited
and
aacc
kind
of
has
a
similar
issue
there
in
that
nurse
hasn't
spent
any
real
time,
and
you
know
building
up
expertise
here.
So
our
ability
to
support
it
is
still
fairly
limited.
That
will
probably
be
you
know
more
of
a
focus
come
phase
two
when
we
have
cpu
oriented
nodes.
Also,
currently
the
aocc
compiler
doesn't
have
the
offloading
support.
A
Some
errors
that
you
might
see
when
using
different
programs.
A
If
you
see
something
about
floating
point
exception,
we
mentioned
this
before
check
that
you've
actually
requested
gpus
errors
about
bind
request,
not
specify,
does
not
specify
any
of
the
devices
within
the
allocation
complains
about
binding
check
if
you
actually
requested
all
of
the
gpus
in
the
node,
if
you
only
requested
half
of
the
gpus
and
it
tries
to
bind
to
the
closest
of
my
binds,
who
might
be
attempting
to
bind
to
one
that
isn't
actually,
you
know
marked
its
allocated
to
you
and
if
you
hit
a
cannot
open
shared
object
file,
this
can
happen
if
you
built
part
of
the
code
with
one
program
and
another
part
of
the
code
with
the
different
programs.
A
You
know
they're
trying
to
access
sort
of
different
versions
of
similar
libraries
and
things
can
kind
of
get
messy.
So
it's
a
good
idea
to
do
it
and
make
it
clean,
and
you
know
make
sure
that
the
object
files
are
successfully
deleted.
After
your
swap
programs.
A
So
for
our
next
hands-on
exercise,
let's
try
it
out
back
on
perlmutter,
we'll
use
that
exercise
4
and
exercise
5
again
see
if
you
can
build
them
with
program
gnu
you
might
need
to
make
a
few
changes
to.
You
know:
make
files
and
batch
files.
A
Just
to
note
that
this
is
viable
with
exercise
four
and
exercise
five
exercise:
three,
if
you
try
to
build
with
programming,
gnu
you'll
get
some
sort
of
curious,
looking
errors,
and
it's
because
that
in
exercise,
three
we've
got
the
cuda
and
the
c-plus
plus
code
all
merged
into
the
same
source
file,
and
that
feature
is
only
supported
by
programming
nvidia.
So
you
won't
won't
be
able
to
build
that
with
program
canoe,
but
it
can
be
interesting
to
get
have
a
try.
A
Look
at
what
the
error
messages
are
and
recognize
them
for
when
it
comes
to
working
with
your
own
code.
A
So
let's
spend
10
minutes
on
this.
A
Have
a
crack
at
building
these
exercise
4
and
exercise
5
with
programming,
canoe
and
we'll
do
the
same
thing
if
everybody
can
raise
a
hand
and
lower
it
when
they're
done,
and
that
should
to
give
us
a
bit
of
an
idea
of
how
how
many
of
us
have
finished
the
exercise
and
we'll
reconvene
in
five
or
ten
minutes.
A
So
continue
if
you
are
still
stuck
on
anything,
let's
post
a
question
in
the
google
doc
or
jump
into
the
breakout
room.
What it does is present the GPU device memory as part of the same address space as the CPU main memory. This is a diagram that comes from NVIDIA's website, illustrating what this means. The CPU has a certain amount of RAM attached to it, and each GPU has a certain amount of RAM inside it as well, and naively, pre-UVA, each of these is separate: it's a separate address space, and they can't really talk to each other. Having a single address space is more like this picture over here on the right, where the memory might be in physically different places, but it's arranged as a logically contiguous block of addresses. What this means is that a CUDA-aware MPI implementation — and Cray MPICH is one of these — can send and receive messages directly from the GPU memory of one node to the GPU memory of a GPU on a different node.

As opposed to separate address spaces, where what you would need to do is actually use a cudaMemcpy — device to host — to move the memory from the GPU into main memory, send it through MPI that way, and then move it from main memory on the other end back into the GPU with a host-to-device copy.

So obviously, having CUDA-aware MPI here, being able to transfer memory directly from GPU to GPU, can save a lot of buffer copying, particularly when most of the work that your application is doing is happening on the GPU.
A
So
how
do
you
know
if
you're
using
it
one
good
tip,
is
to
use
the
ldd
command
to
have
a
look
at
the
executable
this
for
a
dynamic
executable
which
is
now
the
default
on
kalamata,
for
building?
A
Another
thing
that
can
help
is
in
showing
which
libraries
you
are
using,
and
so,
if
you
run
ldd
on
your
executable
and
you
see,
one
of
the
libraries
in
here
is
called
libor
gtl
cuda.
That
means
that
you
have
cuda
aware
mpi
available
to
you.
So
gtl
stands
for
something
like
the
new
transport
layer.
I
think-
and
this
is
this-
is
craze
library
for
providing
gpu
to
gpu
memory
transfers.
We've
got
network
network
transfers
from
node
to
node
to
actually
make
use
of
this
cuda
aware
npi.
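For example (the executable name here is just a placeholder):

    ldd ./mpi_bcast | grep gtl   # if CUDA-aware MPI is linked in, libmpi_gtl_cuda should show up in the output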
So let's give it a try. Jump back into your Perlmutter window — you might need to go back a directory to see it — but there should be a directory called cuda aware mpi. Take a bit of a look at this. This is a different application again; it's just a very simple MPI broadcast. It has a buffer in GPU memory, and it will directly transfer it from one GPU to another GPU.

We don't need to worry too much about the source code at this point; I think the focus here is really on how you build and use it, but it might be interesting to look at the source code just to see what it's doing. Most importantly, build and run it, run ldd on it, and see if you can spot that libmpi_gtl_cuda. A couple of tips: don't forget you need to switch back to PrgEnv-nvidia.
I have missed a step here. Hopefully — if I remember rightly — I think in that exercise there was also a batch script that you can run, and in fact this output will probably be in the output from that batch script. The batch script has the setting of the environment variable as well.
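The variable isn't named aloud here, but for Cray MPICH the relevant setting is normally MPICH_GPU_SUPPORT_ENABLED, so the batch script presumably contains something along these lines:

    export MPICH_GPU_SUPPORT_ENABLED=1        # tell Cray MPICH to accept GPU (device) buffers in MPI calls
    srun -n 4 --gpus-per-task=1 ./mpi_bcast   # executable name is a placeholder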
A couple of other scenarios that you may be working with: maybe you're not using CUDA. There are a couple of offload options — OpenMP and OpenACC — and different applications that you're using may use one of these inside the application.

If it's OpenMP, it will look something like this: you'll have directives, like pragma omp target teams and so on, with some map clauses in there. OpenACC is a little bit similar: you'll have directives in the code that say things like pragma acc parallel loop. OpenACC, in a way, is a higher level of abstraction.

So if your application uses OpenMP, what you'll need to add at compile time are these C or C++ flags: -mp=gpu — mp for multiprocessing — and -gpu=cc80. That's that magic 80 number again, because we're using NVIDIA Ampere. And this last one, -Minfo, is optional but quite useful: it prints a bunch of information during compilation about what the compiler is doing with the OpenMP directives and the OpenMP offloaded kernels and loops.

Similarly, when you're building OpenACC codes, you'll add a couple of options to your C or C++ flags: -acc, and again -Minfo=accel.

The -Minfo does similar sorts of things — it prints a little bit of extra information — and -acc is for OpenACC. So the difference here is slight but noticeable: for OpenMP you have a -mp option, and for OpenACC you have a -acc option.
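As a sketch, the two compile lines with the NVIDIA compilers look roughly like this (the source file names are placeholders):

    CC -mp=gpu -gpu=cc80 -Minfo       -o app_omp offload_omp.cpp   # OpenMP target offload
    CC -acc    -gpu=cc80 -Minfo=accel -o app_acc offload_acc.cpp   # OpenACC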
A
I
think
we're
going
to
do
this
in.
I
know
we
have
build
and
run
it
so
we'll
go
back
to
hands-on
jump
back
into
your
clone
of
the
training
repo
and
you
might
need
to
go
back
a
directory
again
to
find
it.
But
you
should
see
a
directory
called
openmp
dash,
open,
acc.
A
Take
a
look
in
there.
Take
a
look
at
the
readme
if
you
like
it
the
code
and
build
and
run
it
and
take
a
look
at
the
output.
So the last couple of topics are about a few extra tips around building code. One that is very easy to trip up on — powerful, but also a little bit particular — is using CMake. For the most part it should just work. We have some CMake modules available on Perlmutter, and we have a fairly recent version, 3.22.

There are currently a few issues that we've discovered when linking math libraries in the CUDA stack — particularly things like cuFFT, cuFFTW, and cuSOLVER — which is that these libraries are in a different location from the nvcc compiler itself, and CMake often trips up trying to find them. We have a tip on this in our docs, but basically the tip is to add the math libs path to your CMAKE_PREFIX_PATH with something like this command.

So this /opt/nvidia/hpc_sdk path — you'll see that if you do module show cudatoolkit, it will show the specific path. It may be different from one cudatoolkit to the next, particularly when changing CUDA versions.
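So the command is something along these lines; the exact math_libs path depends on the CUDA version, which is why checking module show cudatoolkit matters, and this one is only illustrative:

    cmake -DCMAKE_PREFIX_PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/<version>/math_libs ..   # point CMake at the CUDA math libraries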
There was a question earlier I saw in the Google doc about whether we should prefer CMake or autoconf. We don't particularly favor one over the other.

In a lot of cases you won't really have a choice: if you're using code that somebody else wrote — building a third-party library, for instance — chances are it already uses either CMake or autoconf to set up the makefiles, and so you just need to use what's there. If you're developing new code, probably CMake is the way to go, but particularly modern CMake. CMake has changed quite a lot over the years.

It's gotten better, basically, and the newer practices are generally much better to use, and more maintainable and sustainable, than the old ones. I think if you dig back through NERSC's training history and training resources, we did actually, a little while ago, have a course on using modern CMake, so it's worth taking a look at the slides and the recording of that if you're developing code.
Finally — this is still in progress; it's not quite there yet — we are working on setting up a Spack 0.17.0 module file and configuration. It's already on Cori, and it's actually there on Perlmutter too, but we don't have the module file yet; we're still testing and refining the configuration. But that should be there real soon.

It's being set up to work also with the E4S deployment. E4S — I've forgotten, actually, what it stands for, but it's part of the ECP project; it's a scientific software stack.

So we'll have that available on Perlmutter in the not-too-distant future as well, and their Spack instance is being set up to work with that, so that will hopefully make installing a lot of third-party software easier. We're not going to go into the details of how to do this today, but just a heads-up that it's coming.
A
I
seem
to
have
missed
putting
a
final
slide,
so
just
to
recap
we're
at
the
end
of
our
slides
and
exercises
for
today,
and
fortunately
we're
a
little
ahead
of
time.
A
A
What
to
do
if
you're
using
kudera,
aware
mpi
and
the
fact
that's
worth
using
a
few
pointers
towards
math
libraries,
some
tricks
and
trips
and
errors
that
might
trip
you
up
and
how
to
recognize
them
and
what
to
do
about
them.
And,
very
importantly,
there
is
the
repo
that
you
have
cloned.
A
That
will
hopefully
provide
some
examples
that
you
can
use
as
sort
of
a
starting
point
as
you
can
move
on
to
building
your
own
code
on
pearl
matter,
and
you
know
when
you
do
hit
errors,
that
these
examples
hopefully
will
help
to
narrow
down
the
steps
that
might
be
missing.
A
So
that's
all
that
we
have
for
today
I'd
like
to
think
notation
about
the
things
that
you
know.
I
provided
on
perlmutter
from
cray
and
also
helen
and
ronnie,
and
roll
and
moise
and
many
other
nurse
staff
who
have
been
answering
questions
in
the
chat
and
did
a
lot
in
developing
yeah.
These
slides
and
these
exercises.
A
So
and
of
course,
finally,
everybody
who
has
come
along
joined
their
training
and
participated
and
hopefully
found
it
beneficial
and
be
able
to
make
good
use
of
it
using
coal
miner
coming
out.