From YouTube: 04 Migrating from Cori to Perlmutter GPU Codes
Description
Part of the Migrating from Cori to Perlmutter Training, December 1, 2022.
Please see https://www.nersc.gov/users/training/events/migrating-from-cori-to-perlmutter-training-dec2022/ for the training day agenda and presentation slides.
Hello everyone, my name is Moz, and I, along with Steve and Helen, will be presenting about the GPU nodes that we have on the Perlmutter system. First, as an outline or overview of this presentation: we'll have a quick look at the GPU nodes and their hardware configuration, and then we'll move on to the types of programming environments that are available on them.
And finally, we have some hands-on exercises, which are set up as self-guided, but before you start working on them on your own I'll do a quick walkthrough of them. There are several concepts explained there that are new to these nodes and were not available on the Cori system or on the Perlmutter CPU nodes, so I'll try to walk through them with some examples. Once we are done with that, you can try them on your own, because if you do, you learn better. So, the GPU nodes.
We have about 1,500 GPU nodes on the Perlmutter system, and each of these nodes contains one AMD Milan CPU. This is the very same CPU that is also present on the Perlmutter CPU nodes.
The difference is that on the CPU nodes we have two of these, while on the GPU nodes we have one. Each Milan CPU has 64 cores, where each core has two hardware threads, so the scheduling system will see a total of 128 compute elements (CPUs) on each GPU node. Along with that, the distinguishing factor is the GPUs.
We have four NVIDIA A100 GPUs, and each of these GPUs has 40 GB of HBM (high-bandwidth memory) and is capable of performing up to 9.7 teraflops of double-precision floating-point operations. Each pair of these GPUs on the node is connected with an NVLink connection, while the CPU and GPUs communicate through a PCIe Gen 4 bus. So it's a highly performant node. With that, we move on to the programming environment.
The programming environment is very similar to what is available on the CPU nodes, except that some specific modules are available, or are loaded, on the GPU nodes when you have to build your code for the GPUs. Apart from that, everything is the same as Eric explained, so all the module tricks discussed earlier still work when you are targeting the GPU nodes. In fact, the code will typically be built on a login node, so everything remains the same there.
I'll try to talk about the differences that you must be aware of when you're building for the GPU nodes. Now, if you log in to Perlmutter in your terminal and do a module list, something like this will show up, and you can see that by default we have this gpu module (number 18 in the list) loaded. When this is loaded, the environment has been configured for GPU codes.
If you are looking to build CPU code, then you will have to unload this, but by default this module will always be there, and we are assuming that you're building for the GPUs. What this module does is load some additional modules, for example the cudatoolkit and craype-accel-nvidia80 modules. These modules are required for building for the GPU, and by default you can see that the environment is the GNU environment.
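As a minimal sketch of what was just described (module names as given in the talk; the exact way to switch to a CPU-only build may differ, so check module avail on the system):

```bash
module list          # 'gpu', 'cudatoolkit', and 'craype-accel-nvidia80' appear by default
module unload gpu    # only if you need a CPU-only build environment
module load gpu      # restore the GPU build environment
```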
There is a feature known as CUDA-aware MPI, and if you want to utilize it (we'll get into the details of that later), you want to make sure that you have the gpu module loaded. Once you have everything set up, it's recommended that you use the compiler wrappers to build. Eric talked about the compiler wrappers in detail, and I'll also give a few examples of that now.
By default, the compilers that are loaded are the GNU-based ones, and you can access whichever programming environment or compilers are loaded through the compiler wrappers. For example, if I have the GNU compilers loaded, I can check with the compiler wrapper CC (the capital CC) by running it with --version; you'll see that the underlying compiler is g++. Similarly, for the C language you can run the lowercase cc, and you'll see that GCC is the underlying compiler.
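A quick sketch of those checks:

```bash
CC --version    # with the default PrgEnv-gnu, reports g++
cc --version    # reports gcc
```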
Let's say that you want to use a different compiler, for example PrgEnv-nvidia, because you want to use the NVIDIA compiler. Then you load the PrgEnv-nvidia module, and you can see that some changes happen in the environment; if you run CC --version again, you'll see nvc++ showing up now.
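For example, something along these lines (loading the new programming environment swaps out the default one):

```bash
module load PrgEnv-nvidia   # replaces the default PrgEnv-gnu
CC --version                # now reports nvc++
```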
This is the NVIDIA compiler. Do not use nvc++ directly; always use the compiler wrappers. We'll see why this is important when we go into the hands-on exercises.
On Perlmutter, we support several GPU programming models; we have support for almost everything available out there. Certain programming environments are better suited for certain programming models, so this is a list of our recommendations. If you're working with a CUDA code, it's recommended that you use PrgEnv-nvidia, naturally, because CUDA is NVIDIA's proprietary model. But let's say that you have an application that uses the GNU compilers; you can still use those, but then you will have to do a separate compilation.
You'll have to make sure that your CUDA code is built using the CUDA toolkit, that is, the nvcc compiler. We have a hands-on exercise about this as well, and I'll point it out when we get to that. Kokkos is presented as a C++ library, so anything that supports C++ and has the backend support for the GPUs will work; you will be able to use that programming environment for it.
Then we have OpenMP offload. If you are looking for portability across different types of GPUs, this is one of the most recommended programming models: if you have a code in OpenMP offload, it will work on NVIDIA, AMD, or even Intel GPUs. On Perlmutter, we support it using PrgEnv-nvidia, and there are also options in PrgEnv-cray that allow you to build code that uses OpenMP offload; the same applies to the OpenACC model on Perlmutter.
To summarize the last three or four slides: if you have a source code, it does not matter what it contains, whether GPU directives, MPI, or plain CPU code, we recommend that you build it using the compiler wrappers instead of the underlying compilers. For example, if you have a C++ code, you use the capital CC compiler wrapper; if you have a C code, you use cc; and if you have a Fortran code, you use the ftn wrapper.
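As a sketch (the source file names here are placeholders):

```bash
CC  main.cpp -o app    # C++ source goes through the CC wrapper
cc  main.c   -o app    # C source goes through cc
ftn main.f90 -o app    # Fortran source goes through ftn
```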
This is also regardless of the programming environment that you are using. It doesn't matter whether you have PrgEnv-gnu, PrgEnv-cray, or PrgEnv-nvidia: you always stick with the compiler wrappers, because they will take care of the compiler that is to be used and the libraries that are to be linked in. Now, the exception would be a CUDA kernel.
If you have a CUDA kernel inside a .cu file, then the common method is to use nvcc. But on Perlmutter we have PrgEnv-nvidia, which contains the nvc++ compilers, and you can even build the CUDA code with them directly; you just have to pass the appropriate flag. We also have a hands-on example that explains this nicely.
Now, with this, we move on to the hands-on exercises, and that is where the bulk of the information will come from. You may already be aware of this link: there is a GPU directory in this repo, and if you cd into it you'll see a long README. That README is basically a kind of lab manual for this training.
For these exercises, you can read through it; I would suggest that you open it in a separate window and open all the code in your terminal, then read through the README and try to follow the steps in the exercises. It's not an assignment; you don't have to do anything on your own. You will be perfectly fine if you just follow the steps, and I would highly suggest that you open up the Makefile and look into it.
So, what's covered in these exercises: we start off with a simple CUDA code and make it more complicated as we move forward. We add MPI into the mix and try to build the MPI-plus-CUDA code with different types of programming environments. Then we talk a bit about the CUDA-aware MPI example, where you are able to communicate between two GPUs across nodes directly, and then about GPU affinity, just like Eric explained CPU affinity.
As I mentioned, there are two important files in each exercise: a Makefile and a batch.sh file. The batch.sh file will mostly be very similar across exercises, except when we talk about affinity; the Makefile is the most important one. Typically in a training you would think that a source code file is more important, but since here we are trying to learn how to use the programming environments and how to build your code, the Makefile takes precedence over the code.
The source code will be almost the same in all the examples. The batch file is basically used for launching a job in an efficient manner: if you get a dedicated node and try to run on it interactively, you will basically be wasting time, because you'll spend a lot of it reading the files and building. But if you run through the batch system, it will be much easier. So there are different options here; these may not reflect exactly what's written in the batch files
that are included in the examples, but the overall concept and the terms are pretty much the same. The -q option specifies the QOS (quality of service), or the queue, that you want your job to go into. The capital -N is the number of nodes that you're requesting, -t is the time in minutes (for example, this is five minutes), and -n is the number of MPI tasks.
That is the total number of tasks, across all nodes, that will be launched. And -c is the number of CPUs that you're requesting per task. Now, be mindful that for Slurm a "CPU" is a hardware thread. For example, at the start I mentioned that each Milan CPU has 64 cores and each core has two hardware threads, so in total we have 128 compute elements, and Slurm sees each compute element as a CPU.
So -c is the total number of CPU elements that you're requesting per task. Then we have the number of tasks per node, which is self-explanatory, and then the number of GPUs per task. So basically here you have four tasks per node and one GPU per task.
So that's a total of four GPUs that you're requesting. For the purpose of this exercise, you will be using ntrain2 as your allocation account, and your reservation would be pm_gpu_1. Also, it's important to use the constraint gpu, because if you don't set that, you're not requesting a GPU node; for the CPU nodes you'd replace this with cpu. So for all these examples, make sure that you're requesting a GPU node.
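Putting those options together, here is a hedged sketch of such a batch script (the QOS name, -c value, and executable name are illustrative placeholders; the scripts in the repo are authoritative):

```bash
#!/bin/bash
#SBATCH -q regular              # QOS / queue (placeholder)
#SBATCH -N 1                    # number of nodes
#SBATCH -t 5                    # wall time in minutes
#SBATCH -n 4                    # total number of MPI tasks
#SBATCH -c 32                   # logical CPUs (hardware threads) per task
#SBATCH --ntasks-per-node=4     # tasks per node
#SBATCH --gpus-per-task=1       # one GPU per task, four GPUs in total
#SBATCH -C gpu                  # request GPU nodes (use -C cpu for CPU nodes)
#SBATCH -A ntrain2              # training allocation account
#SBATCH --reservation=pm_gpu_1  # training reservation, as given in the talk

srun ./app                      # './app' is a placeholder executable
```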
These are some useful runtime environment variables. A lot of the time when you're working with GPU code, especially OpenMP offload type code, because OpenMP also runs on the CPU, it's important to know whether the example you ran actually used the GPUs. So if you set these variables to the given values, you will know what's actually happening.
For example, if you set this one to 2, it will tell you about the data transfers that are happening, that is, when data transfer happens between the CPU and the GPU. There are multiple options; try to explore them, and this will also help you understand whether your code ran on the GPU or is running on the CPU.
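The slide lists the exact variables; with the NVIDIA compilers, one such knob is NVCOMPILER_ACC_NOTIFY, whose bitmask value 2 reports host-device data transfers (value 1 reports kernel launches):

```bash
export NVCOMPILER_ACC_NOTIFY=2   # print a line for each CPU<->GPU data transfer
srun ./app                       # './app' is a placeholder executable
```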
In exercise one, which is the simplest of all, we have a simple CUDA kernel, and it has been placed in two different files: one file is named .cu, the other is named .cpp. We first try to build the .cu file, which contains a CUDA kernel that runs on the GPU, with the nvcc compiler, a dedicated compiler for the CUDA language, and then we do the same exercise using the CC wrapper.
Now, it will rarely happen that you have a code that only contains CUDA and sits all in the same file. Typically, your application will be large and distributed across multiple files, and as a good practice people try to keep all their GPU code in a separate file, which actually makes things easier. So let's say that you have an application that makes extensive use of CUDA and all the CUDA kernels are located in a separate file named kernels.cu.
If, in that scenario, you want to use a different compiler, you can use this method: you build your CUDA code with nvcc and then link it with whatever compiler you like; it can even be GCC or the GNU compilers, and you will still be able to link it. This example covers that.
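A minimal sketch of that separate-compilation flow (file names follow the example above; with the gpu module loaded, the wrapper is expected to pull in the CUDA runtime paths at link time):

```bash
nvcc -c kernels.cu -o kernels.o   # compile the CUDA kernels with nvcc
CC   -c main.cpp   -o main.o      # compile the host code with the wrapper
CC   kernels.o main.o -o app      # link everything together
```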
In the third example, we include MPI in the mix. So in this example, we have MPI and CUDA code
all in the same file. The best way to build that, because nvcc will not recognize MPI here, is to use the compiler wrapper that comes with PrgEnv-nvidia. This is again one of those cases where it will only work if you use this particular module and the compiler wrapper, because then both MPI and CUDA will be linked in.
In the fourth example, we come back to separate compilation. Here we have the MPI library and the CUDA kernels, but the CUDA kernels are located in a separate file, and you can again use any programming environment for this, because you're building the CUDA code separately using nvcc and then linking it using the compiler wrappers, and the compiler wrappers can come from any of the compilers. It's recommended that you use the compiler wrappers.
You could still build the code if you were using, say, g++ directly, but then you would need to link in a bunch of libraries; for example, the CUDA runtime (cudart) library is one of those you'll be needing, and since you don't know where it's located and which paths you need to include, it's always recommended to use the compiler wrappers, because they will take care of everything for you and also make the compilation line look much simpler.
Before we go to the next exercise, this is a slide copied over from Eric's slide deck, and it's basically telling you the compute elements that are available. Look at the rightmost column; it's the CPU on the Perlmutter GPU nodes. As was mentioned before, the Perlmutter CPU nodes have two of those sockets while the GPU nodes have one, so everything is halved: the total physical cores are halved, while the logical CPUs per physical core, which is the hardware-thread count,
stay the same, because they are located within the core. The total logical CPUs per node is also halved, and the number of NUMA domains is also halved, so we have four NUMA domains, while the Perlmutter CPU nodes have eight NUMA domains.
Before we go into GPU affinity, I'll try to get this out of the way: the affinity for the CPU cores is still the same as on the CPU nodes, because nothing has changed here. So it's recommended that you assign the correct number to the -c option, or the --cpus-per-task option
if you write it in the longer format. To compute the correct number, you can use this equation; it's pretty simple: -c = 2 * floor(64 / K), where 64 is the total number of hardware cores that you have on the node and K is the number of tasks per node. So if you had 64 tasks per node, like in this example, the term inside the braces would become one and the answer would be two, and 2 is then the number of hardware threads that you're assigning to each MPI task.
So it's important to get this right, because you want to make sure that you're utilizing the resources well and not pushing too many MPI tasks onto a single core when you have more available, and one way to enforce that is to add the --cpu-bind=cores option. Next, we will look at GPU affinity.
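A worked instance of the formula above, assuming four MPI tasks per node (the executable name is a placeholder):

```bash
# 2 * floor(64 / 4) = 32 logical CPUs per task
srun -n 4 -c 32 --cpu-bind=cores ./app
```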
As described before, on a GPU node we have four NUMA domains, and each NUMA domain contains memory that is faster to access from within that NUMA domain; across different NUMA domains, things become slower. Similarly, just as each domain has its own memory, each domain is assigned a GPU. So this is what it looks like on the GPU node: we have four NUMA domains, and each NUMA domain gets a GPU.
Now, let's say that you had an MPI task residing in NUMA domain 0, and it was assigned, or was trying to communicate with, a GPU that was assigned to NUMA domain 3. It would then have to take a longer path, and things would slow down. So it is important that MPI tasks are assigned GPUs that are closest to them, and we'll see how to do that. In exercise 5, we have two batch scripts.
There is a "regular" batch script and a "closest" batch script. The regular batch script is just the regular way of running things: we don't specify any affinity, and what will happen is that every rank, or every MPI task, will be able to see all the GPUs that are available on the node. In a typical application, when each rank is able to see multiple GPUs, we assign one GPU to each rank in a round-robin fashion. Those of you who have been actively developing or porting
their codes to GPUs will be aware of this. Now, that approach does not really care whether the GPU being assigned to a certain task is the closest to it or whether it lies in another NUMA domain. To make sure that you're getting the closest GPU, you have to specify the flag --gpu-bind=closest, and this is demonstrated in the "closest" batch script example.
Now, when you run the code without the GPU affinity set, you'll see a printout like this. You can see that rank 1 is able to see four GPUs and that it assigns itself a GPU in a round-robin fashion, and we print out the PCI ID of that GPU to differentiate which GPU is being assigned to which rank. But it is not really clear whether it's the closest one, and you can see that every rank can see every GPU.
But if you relaunch the same thing with the "closest" GPU affinity set, you will see that each rank is able to see only one GPU, and that is the GPU that is closest to it. Now, how do we know that it is the closest? When you run this example, some information about the node topology will also be printed out, and we can use that information to verify that we are getting the closest GPU.
Similarly, rank 4, which resides on core 32, has been assigned GPU number 41 as well. Now, let's see where these cores actually reside, cores 32 and 33. From this we can see that cores 32 and 33 reside in NUMA node 2, and if we go to NUMA node 2, we can see that the PCI bus ID of the GPU that has been assigned to that node is also 41.
And then we have CUDA-aware MPI. NVIDIA has this thing known as UVA, or Unified Virtual Addressing. What it does is allow the program to see the CPU and GPU memory in a single virtual address space, and what this makes possible is direct communication between two GPUs. Let's say you have two nodes, for example node 1 and node 2, and you want GPU 1 on node 1 to send a message to GPU 1 on node 2.
Typically, what would happen is that this message would first be sent to the CPU memory on the sending node, then to the CPU of the target node, and then to the target GPU. So that's a longer path. But if you have the CUDA-aware MPI option available, you can send the message directly from one GPU to the other on a remote node, and that is a facility we have available on Perlmutter.
The Cray MPICH build that we have targets this; it utilizes this underlying technology, and that gives you the performance that direct communication is capable of. Now, how to use this: as I showed before, by default a gpu module will be loaded.
This module does a lot of things: it loads other modules and also makes sure that certain environment variables are set up for this kind of CUDA-aware MPI to take place. If you have this module loaded, you don't have to do anything else; you just build your code as usual, and everything will be taken care of.
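One such runtime variable in Cray MPICH is MPICH_GPU_SUPPORT_ENABLED; the gpu module is expected to set it for you, so exporting it manually is only a fallback:

```bash
export MPICH_GPU_SUPPORT_ENABLED=1   # enable GPU-aware communication in Cray MPICH
```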
Sometimes you may run into issues, and the first thing you would want to check is whether your executable was CUDA-aware-MPI capable. You can simply check the libraries that were linked in: run ldd on the executable, or on the library that you're using, and you should see this library linked in.
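For example, a hedged check along these lines (the library to look for is Cray's GPU Transport Layer, e.g. libmpi_gtl_cuda; confirm the exact name against the slide):

```bash
ldd ./app | grep gtl   # expect something like libmpi_gtl_cuda.so in the output
```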
If it is, that means the executable was built for CUDA-aware MPI, and then there may be some other issue, and you can check with us about that.
Example six tries to explain this concept. It is a simple example that shows you how to do this: it sends a message to a GPU located on a remote node and tries to verify that the message was received correctly. Then, in exercise
seven, we explore the OpenACC and OpenMP offload methods. These are two programming models other than CUDA that you can use to target GPUs, and this exercise contains the same kernel that we previously had in CUDA in the other example, just rewritten in OpenACC and OpenMP, so you can compare and contrast the three kernels and see how the three models differ from each other. Here we have tried to keep the OpenACC and OpenMP codes in the same file; we've separated them using ifdef statements.
So you can easily compare and contrast between those two. Now, for OpenMP you can use PrgEnv-cray as well, and both OpenACC and OpenMP can be built using PrgEnv-nvidia; that is the recommended environment if you want to target these two.
But if you have a serious dependence on the Cray programming environment, then you will have to go with OpenMP. Now, in order to build a code for OpenMP offload, you need to pass the -mp=gpu flag to the CC wrapper that's contained in PrgEnv-nvidia, and if you want to build for OpenACC, you pass the -acc flag, as sketched below.
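As a sketch under PrgEnv-nvidia (kernel.cpp is a placeholder source file):

```bash
CC -mp=gpu kernel.cpp -o app_omp   # OpenMP target offload
CC -acc    kernel.cpp -o app_acc   # OpenACC
```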
Also within exercise 7, there is an example that explains how to use OpenACC in Fortran using the ftn wrapper, so you can look at that as well if it is of interest to you. That is all from my end. Thank you very much. I think we have some time left, so you can try to walk through the hands-on exercises, and we will be available here to answer your questions.