From YouTube: Migrating from Cori to Perlmutter: GPU Codes
Description
Migrating from Cori to Perlmutter: GPU Codes
Presenters: Muaaz Awan, Stephen Leak, Helen He, User Engagement Group
Training: Migrating from Cori to Perlmutter, March 10, 2023
Okay, so my name is Muaaz, and I'll be going over some of the tips and tricks that you may need if you're trying to port your code to the GPU nodes in particular. Before I begin, I would like to thank Steve and Helen for their contributions to these slides and materials. I think most of the things that you need to get started on Perlmutter have been covered by Eric very nicely.
This presentation mostly describes the particulars of the GPU nodes, for example the programming environment and the architecture of the nodes. Later we have a few hands-on exercises, but before you get to them yourself, I'll do a walkthrough of them in the time I have for this presentation, and I'll try to convey some of the concepts that we are trying to explain through those hands-on exercises.
So, as Jack described in the morning, we have 1792 GPU nodes on Perlmutter, and all of them have the same architecture except for one small difference: out of these 1792 nodes, 1536 have the 40 GB variant of the A100 GPUs, that is, the size of the HBM is 40 GB on each GPU, while 256 of these nodes have the 80 GB variant.
Each of these GPU nodes also contains a host processor, a CPU, which is the AMD Milan (EPYC). It contains 64 cores and, as Eric described, these are hardware cores; each core contains two logical CPUs, or hyperthreads. So you have a total of 128 logical CPUs on a Perlmutter GPU node, and each GPU node contains four NVIDIA A100 GPUs. The slight difference between the GPUs, as described above, is that 256 nodes have GPUs with 80 GB of HBM each.
All the GPUs on a GPU node are connected via NVLink connections, so it's like an all-to-all connection, and the CPUs and GPUs communicate via the PCIe Gen4 bus. Each node also contains four Slingshot 11 NICs, as described over here, and these are connected to the CPUs via PCIe Gen4 as well. I think there was also a question in the first presentation about what the 256 GB of DDR4 is.
That is the RAM that you have on the nodes, and it is separate from the GPUs' memory: each GPU has its own 40 GB or 80 GB of high-bandwidth memory that you can utilize, and on the host side we have 256 GB of DDR4 memory available as well. With this, let's move on to the programming environment. When you log into Perlmutter, everything is set up by default for the GPU nodes.
You won't need to make any changes, and you can check that by listing the modules that you have loaded: you'll see that a module called gpu is loaded, and what this module does is make sure that all the required environment variables and compiler wrappers are set up for GPU builds. You'll also notice that you have the cudatoolkit module and the craype-accel-nvidia80 module loaded. These are required if you want to use the GPU-specific features of the node.
The default programming environment is the GNU one, so if you log into Perlmutter and do a module list, this is what you will see: the gpu module and the GPU-specific modules are loaded along with the GNU programming environment.
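As a rough sketch, an abbreviated module list on login looks something like this (module versions change over time, so treat the output as illustrative rather than verbatim):

    $ module list
    # Abbreviated, illustrative output:
    #   PrgEnv-gnu             (the default programming environment)
    #   cudatoolkit            (CUDA toolkit for GPU builds)
    #   craype-accel-nvidia80  (targets the A100 GPUs)
    #   gpu                    (sets up GPU-related environment variables)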
But if you want to use the NVIDIA programming environment, you will have to change to it, since the default is PrgEnv-gnu. Now let's talk a bit about the compiler wrappers: the compiler wrappers are something that makes things really easy for you.
What you want to do is use the compiler wrappers, so that regardless of which programming environment is loaded, the compiler wrapper will make the call to the right compiler and link all the required libraries. For example, if the GNU programming environment is loaded, we are basically working with the GNU compilers.
If you use the capital CC compiler wrapper, which is used for C++ applications, you will see that underneath it's basically using the g++ compiler, and if you use the lowercase cc wrapper, it would be the gcc compiler underneath. Now let's say you want to use the NVIDIA compilers: you swap the programming environment to NVIDIA by doing module load PrgEnv-nvidia, and then you can check the compiler version through the compiler wrapper, and you would see that the nvc++ and nvc compilers are now being used.
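As a minimal sketch, the swap and the check look like this (the version output is abbreviated here):

    module load PrgEnv-nvidia   # replaces PrgEnv-gnu with the NVIDIA environment
    CC --version                # the C++ wrapper now reports nvc++
    cc --version                # the C wrapper now reports nvc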
This is in our documentation, and a similar image was shown by Jack in the morning: these are all the programming models that we have available and the programming environments that make them available. For example, if you load PrgEnv-nvidia, you will be able to build your program with CUDA, OpenACC, OpenMP, Kokkos, and RAJA as well.
There is also an experimental programming environment, the NERSC programming environment, or LLVM, which is in the pipeline; I think you will have access to it soon, and the experimental version can already be accessed. You can check our documentation for that. This one provides, I think, the widest coverage: it even provides coverage for the HIP and SYCL programming models, which are the programming models used by the AMD and Intel GPUs.
So if you code in HIP or SYCL, you will be able to run on AMD and Intel devices respectively, while both of these can also be run on the NVIDIA GPUs that are available at NERSC.
Once you have decided which programming model you want to go with, we have recommendations for the programming environment to use. I'm guessing a lot of applications are already at the point where you have decided the type of programming model you want to use. Most of the programming models are supported by the NVIDIA programming environment.
For CUDA and Kokkos, we also recommend the GNU programming environment: for CUDA you typically want to go with NVIDIA or GNU, and Kokkos works with NVIDIA and GNU as well. For OpenACC and standard C++ library parallelism, you would want to go with the NVIDIA compilers.
With this, let's move on to the hands-on exercises. There are a few concepts covered there that I want to walk through before you start doing the hands-on exercises. The repo at this link contains two directories, one for the GPU examples and another for the CPU examples; here I'll just go through the GPU examples. Once you move to the GPU directory, you'll see there is a README file that is basically sort of a lab manual that you can walk through.
It contains instructions on how to build and run and what the expected output would be. There are some optional exercises that you can try out as well. We tried to touch almost everything that's basic for the GPU nodes: for example, we touch the three programming models that you can use to run on the GPU and build them using different programming environments and compilers. In particular, we touch CUDA, OpenACC, and OpenMP. These examples are pretty simple.
Your codes may be more complicated. If that is the case, you can always reach out to us, and we can help you with the more complicated things.
Two examples are for CUDA-aware MPI and GPU affinity. CUDA-aware MPI gives you the ability to communicate between two GPUs directly, so data from one GPU buffer can be transmitted directly to a remote GPU buffer. And just like the CPU affinity that Eric covered, there is GPU affinity: how you can bind your ranks to GPUs to get optimal performance.
Each of the exercise directories contains a Makefile, a batch script (an sbatch script), and some source files. The Makefile contains the steps to build the example, and the sbatch script contains the instructions to run that code. You can use the sbatch script directly, or you can just get an interactive node and run the executable.
The execution line is right in there. The sbatch script for GPU nodes will look very similar to what you have been using on Cori or what you will be using on the Perlmutter CPU nodes. There are a few changes that I'll point out. One major thing would be the change in the number of CPUs per task: on the CPU nodes, you have 256 logical CPUs, that is, 128 hardware cores, while on the GPU nodes we have half that number.
We have 64 hardware cores and 128 logical CPUs. Now, for Slurm, one CPU is one logical CPU, that is, one hyperthread. So if you have, let's say in this example, eight ranks in total on two nodes, you have four ranks per node; that means you'll need to assign 32 CPUs per rank out of the 128 in total. And let's say that, instead of four, you had 64 ranks per node; then you would set -c to 2.
That's because you want one core, or two logical CPUs, mapped to one MPI rank. The other two things you want to focus on when running on the GPU nodes are the GPUs per task, that is, the number of GPUs that you want to have available per task, and the constraint, which should be set to gpu, because otherwise you will not specifically be requesting a GPU node, at least for the scope of this training.
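Here is a minimal batch-script sketch for the two-node, eight-rank case above; the account name and the executable are placeholders:

    #!/bin/bash
    #SBATCH -A <account>          # placeholder: your project account
    #SBATCH -C gpu                # request GPU nodes specifically
    #SBATCH -q regular
    #SBATCH -N 2                  # two nodes
    #SBATCH --ntasks-per-node=4   # four ranks per node
    #SBATCH -c 32                 # 32 logical CPUs per task (128 / 4)
    #SBATCH --gpus-per-task=1     # one GPU per task

    srun ./my_gpu_app             # placeholder executable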
When you're building or running on the GPU, there can be some confusion: for example, if you have an OpenMP code, you may not be sure whether it's running on the GPU or on the CPU. There are some debug environment variables that can be very helpful here.
For example, if you're working with the GNU compilers, setting this variable will tell you when a kernel is launched or when a data transfer takes place from CPU to GPU, or vice versa. Similarly, when working with the NVIDIA compilers, you have finer control over what you want to debug and which events you want to be alerted about. So make use of these variables; they can be very helpful, because sometimes you don't want to use the profiler directly, since it has a larger overhead. You just want to run your executable, and this basically prints all the information you need to the console.
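As a hedged sketch, these are the usual variables for each compiler family (check the compiler documentation for the exact semantics):

    # GNU OpenMP offload: verbose runtime diagnostics, including kernel launches
    export GOMP_DEBUG=1
    # NVIDIA compilers: report offload events; per the NVIDIA HPC SDK docs,
    # 1 = kernel launches, 2 = data transfers, 3 = both
    export NVCOMPILER_ACC_NOTIFY=3
    ./my_gpu_app   # placeholder executable; diagnostics go to the console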
With this, let's move to exercise one. Exercise one is just a simple CUDA kernel, and we demonstrate it using two different types of files: there is a .cu file and there's a .cpp file. If you have the CUDA API calls within a .cu file, which is the standard extension for CUDA files, it will be detected by all the NVIDIA compilers.
So if you use nvcc, which is the CUDA compiler, it will obviously detect it; but even if you use the nvc++ compiler, it will also be detected without any specific flags being passed. If, however, you are using the other extension, .cpp, and you want the NVIDIA compilers to pick up that it's a CUDA file, then you want to pass a specific flag for that: if you're using nvc++, that would be the -cuda flag.
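A minimal sketch of the two build paths (file names are placeholders):

    nvcc -o ex1 kernel.cu          # .cu is auto-detected as CUDA
    nvc++ -o ex1 kernel.cu         # nvc++ also recognizes the .cu extension
    nvc++ -cuda -o ex1 kernel.cpp  # .cpp needs -cuda for nvc++ to treat it as CUDA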
In exercise two, we show separate compilation. CUDA code in particular can only be built with the nvcc compiler; I mean, it's specifically recommended. I think LLVM can also do that, but it's recommended that you build it with nvcc, as it can get you the best performance. Now let's say you want to build your main application with a different compiler, say the GNU compilers; then you would want to use separate compilation, that is, you first build your kernels separately.
In example three, we show how you can use MPI along with CUDA. This one is a much simpler example where everything is located within a .cu file. You can simply use one of the NVIDIA compilers, in particular nvc++, through the compiler wrapper, and it will just detect what language it is, because it's a .cu file. So it will be much simpler, but the interesting thing is that the compiler wrapper will also be able to link the required libraries for MPI.
But the more realistic case is that you have your CUDA kernels in one file and your MPI or main host code in a separate file, and you make calls to your CUDA kernels from that host file. That can be built using separate compilation again, and you can use your choice of host compiler here, but make sure that you use the compiler wrappers, because otherwise the required libraries will not be linked.
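A minimal separate-compilation sketch (file names are placeholders; depending on your environment you may also need to add -lcudart when linking):

    nvcc -c kernels.cu -o kernels.o   # build the CUDA kernels with nvcc
    CC -c main.cpp -o main.o          # host/MPI code through the compiler wrapper
    CC main.o kernels.o -o app        # the wrapper links the MPI libraries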
Before we move on to the next examples, which touch on CPU and GPU affinity, let's have a quick recap of how to set the number of CPUs per task. You're aware of Cori Haswell and Cori KNL; on the Perlmutter CPU nodes, as was described before, we have 128 physical cores, while on the GPU nodes we have half of that. If we want to set the -c flag, that is, the number of CPUs per task, there is a simple formula we can use.
You take 64, which is the number of CPU hardware cores that you have on the node, divide it by the number of MPI tasks per node, round the result down, and then multiply it by two: -c = 2 * floor(64 / tasks_per_node). For example, if we had 64 tasks per node, the term inside the brackets would be one; multiply that by two, and that gives you two.
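Applied to the earlier four-tasks-per-node case, the same formula gives -c 32; here is a sketch of the corresponding run line (the executable is a placeholder):

    # -c = 2 * floor(64 / 4) = 32 for four tasks per node
    srun -N 2 --ntasks-per-node=4 -c 32 --cpu-bind=cores ./my_gpu_app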
So that is the two CPUs per task you would set in that case. You already saw that, in order to make sure we are getting the best out of our nodes, we want to set --cpu-bind=cores, which binds your tasks to cores. But there is another thing on the GPU nodes, and that is GPU binding.
If you do not set GPU binding, then all your ranks will have access to all the GPUs. But GPUs are mapped to particular NUMA nodes, and a NUMA node is like a region, a physical region, inside your node; it was very nicely described by Eric, so I won't go into the details. On your GPU nodes you have four NUMA nodes, and each NUMA node is tied to a particular GPU.
GPU.
A
Now,
let's
say
that
your
you
have
a
rank
that
is
bind
to
pneuma
node
0
and
it
tries
to
access
a
GPU
that
is
located
on
Newman
node
2.
Then
they
are
physically
far
away.
So
there
will
be
a
penalty
for
that
So.
To
avoid
that
sort
of
thing
we
recommend
GPU
binding
so
that
your
gpus,
so
that
your
gpus
are,
as
you
know,
in
space,
they
are
set
closer
to
your
MPI
tasks,
and
that
is
what
we
are
going
to
explore
in
this
example.
So
in
this
example,
we
have
two
batch
scripts.
One is the regular one and the other is the GPU-binding one. The regular one has no GPU binding, and you can see that your execution line would look something like this, where we are just binding the MPI ranks to cores and not setting any GPU binding. Each rank will print out all the GPUs that are visible to it and the GPU that is assigned to it; in this case we are assigning GPUs in a round-robin fashion.
So let's have a look at this highlighted example of rank 1, which is bound to core 16. If you look at this node map, you can see that core 16 is located in NUMA node 1, and NUMA node 1 has the GPU with PCI address 82 tied to it. So ideally we would want the rank on core 16 to have access to this GPU.
So this is not ideal, and you can also see that it can still see the other GPUs as well, and any of those could have been assigned to it; it totally depends on how you map them programmatically. But if you use the GPU-bind flags, this is what the output would look like: over here, in this particular run, I'm setting the GPU-bind flag to closest.
That will map each rank to the GPU that is physically closest to it, and you can see that in this run the rank on core 16, which is now named rank 2, can only see one GPU, and it is the GPU at PCI address 82. You can see that it's located in NUMA node 1, and that is the same NUMA node where core 16 is located.
So just setting this simple flag will get you some performance improvement, because now your GPUs are physically closer to the ranks. Try this out; there are different settings that you can use, and you can check more on the SchedMD website. --gpu-bind=closest is one option, but there are multiple options; you can even do a custom mapping of GPUs to ranks.
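As a sketch, the two run lines from this exercise look roughly like this (the binary name is a placeholder):

    # No GPU binding: every rank can see all four GPUs on its node
    srun -n 8 -c 32 --cpu-bind=cores ./gpu_affinity

    # Bind each rank to the physically closest GPU
    srun -n 8 -c 32 --cpu-bind=cores --gpus-per-node=4 --gpu-bind=closest ./gpu_affinity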
The other interesting feature is CUDA-aware MPI. With the Unified Virtual Addressing technology, which basically allows the GPU device memory to appear as part of the same address space as the CPU memory, as shown in the figure on the right, we can make a direct transfer of messages from one GPU to the other. So, basically, if you want to send some data from a buffer on one GPU to a remote GPU, that can be done directly, and it will bypass the communication going through the CPU memory, which is the typical route. CUDA-aware MPI allows you much faster communication this way.
Example six demonstrates how to do that. Basically, you just use the gpu module, which will already be loaded in your environment, and build your example as you normally would. Then you can verify that it is actually linking the GTL library by checking the list of libraries that have been linked; that basically indicates that your code is going to make use of the CUDA-aware MPI capabilities.
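A hedged sketch of that check on a Cray MPICH system (the binary name is a placeholder; the runtime variable comes from the Cray MPICH documentation):

    export MPICH_GPU_SUPPORT_ENABLED=1   # required at run time for GPU-aware MPI
    ldd ./cuda_aware_mpi | grep gtl      # the GTL (GPU Transport Layer) library should appear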
In the example that we have, there are two ranks: one will send some data to a remote GPU on the other rank, and the other rank will read it and print it to the screen. You can go through the code and see how that's done programmatically, and that it's actually being done in the example.
The last example has two parts. Here, with a simple example, we describe how you can use and build for OpenACC and OpenMP. These are other programming models, more portable in nature than the CUDA programming model, which is very specific to NVIDIA. So if you plan on running on different architectures, these are some programming models that you can look into; OpenMP has, I think, the widest support.
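A minimal build sketch with the NVIDIA compilers (flags per the NVIDIA HPC SDK; file names are placeholders):

    nvc++ -acc -o ex_acc ex_acc.cpp      # OpenACC offload to the GPU
    nvc++ -mp=gpu -o ex_omp ex_omp.cpp   # OpenMP target offload to the GPU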
With that, we should move on to the hands-on section. We will be around, and if you have any questions, please reach out to us. Thank you very much.