Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/
So I've added this slide again: I will present standard languages from a Fortran point of view, and Matt Stack will present standard languages from C++, and we're going to try to divide it up into 15 minutes each. I think, you know, Helen, we can take questions in between or at the end; it doesn't really matter.
So some of this will be review; Jeff went over some of these things. For this training we decided to start from the high level, which is the leftmost column here, and work to the right. Sometimes I think of things in the other order: I start from a CUDA point of view and work my way left. Part of that is my job; part of my job is to find new features in CUDA, things that we can expose at a higher and higher level. But either way, over the next two days you'll get exposed to all three of these columns, and the libraries underneath as well.
So for this section we're just going to talk about standard languages in Fortran.
Everyone is aware of what we're doing; we have people on the Fortran standards committee, and it's mainly just a language thing in the language spec. The intent of do concurrent is to say: this is a parallel loop. The iterations of this loop can run in any order, and they can run on any type of parallel hardware.
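To make that concrete, here is a minimal sketch of a do concurrent loop; the saxpy-style routine and all the names in it are illustrative, not taken from the slides:

    ! Minimal DO CONCURRENT sketch (illustrative, not from the slides).
    ! The iterations are independent, so the compiler is free to run them
    ! in any order, on any parallel hardware.
    ! Compile for GPU with something like: nvfortran -stdpar=gpu saxpy.f90
    pure subroutine saxpy(n, a, x, y)
       integer, intent(in)    :: n
       real,    intent(in)    :: a, x(n)
       real,    intent(inout) :: y(n)
       integer :: i
       do concurrent (i = 1:n)
          y(i) = y(i) + a * x(i)
       end do
    end subroutine saxpy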
So it's a little bit of a different format than the old Fortran do loop, and it has locality specifiers, so you can declare whether a variable used within the body is local, local_init, shared, or default.
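As a quick sketch of those specifiers (the variable names are mine, not from the slides):

    real    :: a(1000), b(1000), tmp
    integer :: i
    ! Each iteration gets its own uninitialized copy of tmp;
    ! a and b are explicitly shared across iterations.
    do concurrent (i = 1:1000) local(tmp) shared(a, b)
       tmp  = 2.0 * a(i)
       b(i) = b(i) + tmp
    end do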
A
This
is
one
of
the
features,
as
jeff
mentioned,
that
dash
standard
par
tries
to
make
as
much
of
the
data
in
your
program
as
possible
to
be
managed.
Data,
cuda
managed
data
and
what
cuda
managed
data
is
is
data
that
the
driver
and
os
is
responsible
for
paging
back
and
forth
between
the
gpu
and
cpu,
similar
to
how
virtual
memory
works
on
a
cpu.
I guess I need to speed up here a little bit, so just some examples: do concurrent in MiniWeather. MiniWeather is an application out of, I think it's actually out of Oak Ridge, from Matt Norman. We have ported that code to do concurrent, and the upper left here shows do concurrent with some local arrays (d3_vals, stencil) and also some local variables. So within a do concurrent you can have other do loops.
You can do just normal Fortran operations, and if you compile with -Minfo you'll see that we parallelize the do concurrent loops across threads and blocks, and then some of the other loops are run sequentially within that, which, it turns out, is a pretty good schedule for this.
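A sketch of that pattern (the array and variable names are illustrative, not the actual MiniWeather source):

    ! The outer DO CONCURRENT is mapped across CUDA threads and blocks;
    ! the inner ordinary DO loop runs sequentially within each thread.
    do concurrent (k = 1:nz, i = 1:nx) local(s, val)
       val = 0.0
       do s = 1, 4                          ! small stencil loop, kept sequential
          val = val + coef(s) * state(i+s-1, k)
       end do
       flux(i, k) = val
    end do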
Starting in 21.11, and I believe Jeff mentioned this as well, we support the reduce clause in do concurrent. On the top line on the left you can see do concurrent with reduce, and the reduce is a reduction on the variables mass and te. So this gives you a lot of capability. Our compiler would sometimes find reductions automatically, but it's good to have a specification that actually supports that, and it doesn't hurt to add them even if our compiler can find them automatically, because other compilers may not be able to do so.
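A sketch of the reduce clause (the variables mass and te follow the slide; the arrays are illustrative):

    mass = 0.0
    te   = 0.0
    ! Sum-reductions over mass and te, computed safely in parallel.
    do concurrent (k = 1:nz, i = 1:nx) reduce(+: mass, te)
       mass = mass + rho(i, k)
       te   = te   + total_energy(i, k)
    end do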
So we have some current limitations. This is, you know, fairly new work, newer than our OpenMP compiler. Some of the GPU programming models we've been working on for over 10 years; this is fairly new, it's been around about a year.
The Fortran spec requires functions and subroutine calls in do concurrent to be pure.
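As an illustrative sketch (the names are mine, not from the talk), a routine has to be declared pure before the loop body can call it:

    ! PURE promises no side effects, so this is callable from DO CONCURRENT.
    ! (It would live in a module or a CONTAINS section.)
    pure real function damp(x)
       real, intent(in) :: x
       damp = x / (1.0 + abs(x))
    end function damp

    ! ... later, inside the parallel loop:
    do concurrent (i = 1:n)
       y(i) = damp(x(i))
    end do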
So if you call certain things, you may get messages that, oh, you're calling a subroutine that is not pure, and we may at some point change that in our compiler. But today, that's also another place where we are a little fuzzy according to the spec: we follow the OpenACC and OpenMP defaults for scalars and arrays within the body of the do concurrent, so scalars are firstprivate, or local, by default, and arrays are shared by default.
Do concurrent lacks control over GPU scheduling, which we have found useful over the years: things like forcing a loop to run sequentially inside of a region, or offloading a serial kernel. There's no control equivalent to OpenACC's gang/worker/vector, which we'll talk about tomorrow. And then there's interoperability with CUDA, okay.
So we like all of our models to interoperate. We'd like do concurrent to work with OpenACC and OpenMP, and we would like it to interoperate with CUDA as well, CUDA Fortran in this case. So we still need to mark some of our standard device functions as pure; there are certain things that are just kind of part of CUDA that you can't call in a do concurrent. We do support atomics, because that was important, so we made that change.
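As a hedged sketch of what that atomics support looks like in practice, assuming the CUDA Fortran atomicadd intrinsic is the extension being used inside do concurrent (my example; the exact spelling and flags, something like -stdpar=gpu -cuda, may vary):

    ! Hedged sketch: concurrent iterations increment shared histogram bins
    ! safely via atomicadd (a CUDA Fortran intrinsic, used here as an
    ! NVIDIA extension; it returns the previous value).
    use cudafor
    integer :: hist(nbins), old, i
    do concurrent (i = 1:n)
       old = atomicadd(hist(bin(i)), 1)
    end do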
But there are some other functions, you know, low-level CUDA functions, that would be nice, and we're not there yet. We don't have any control over the stream which the offload region runs on, and we are not yet interoperable with CUDA Fortran device-attributed data. We would like to be able to declare CUDA Fortran device data and use that in a do concurrent. Now, these are all extensions, so it would be non-portable, but it would make the programming model much more powerful if you needed it.
This is a duplicate slide; I'm not going to get into this too much. I just added the title of a paper that came out at the end of last year from a person named Ron Caplan, who's been using our compiler for many years, and he wrote a really nice paper called "Can Fortran's do concurrent Replace Directives for Accelerated Computing?"
I'm not going to get into this slide too much; this was presented at the last GTC. There are people working with do concurrent: some kernels out of NWChem have been moved to do concurrent, and they found the performance was, you know, basically on par with OpenMP or OpenACC on the GPU. And a group from GAMESS used do concurrent to port a portion of the GAMESS code; it was a pretty simple port.
A little bit about other libraries and ways that you can use standard Fortran; Jeff mentioned this matmul. So one thing that I do as part of my job is to create Fortran interfaces to CUDA libraries, and while I was writing the interfaces for cuTENSOR it occurred to me that cuTENSOR solves a lot of the Fortran array intrinsic problems, for matmul, reshape, and spread. And so we just added some capability:
if you use the cutensorEx module, we recognize cases that cuTENSOR can run in a single kernel, like this matmul, and just offload that. And you will never be able to, you know, write handwritten code that performs as well as the cuTENSOR matrix multiply in the library. So, if you can take advantage of that, and this is standard Fortran, you'll get really good speedups.
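A sketch of that pattern, assuming the cutensorEx module from the HPC SDK; the array names and the compile line are illustrative:

    ! Standard Fortran matmul offloaded through cuTENSOR.
    ! Compile with something like: nvfortran -gpu=managed -cudalib=cutensor ex.f90
    program matmul_cutensor
       use cutensorex            ! overloads matmul and friends
       integer, parameter :: n = 2048
       real(8), allocatable :: a(:,:), b(:,:), c(:,:)
       allocate (a(n,n), b(n,n), c(n,n))
       call random_number(a)
       call random_number(b)
       c = matmul(a, b)          ! recognized and run as a single cuTENSOR kernel
    end program matmul_cutensor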
You've seen this slide before; the blog on the bottom is an article I wrote, "Bringing Tensor Cores to Standard Fortran". These are just representative of the types of operations that we can recognize and turn into cuTENSOR calls under the hood.
One quickly here: one project that we've done, sort of in collaboration with NERSC, is a library called nvLAmath, using some of the same techniques that we've used in other areas. What we wanted to do was to write our own wrappers around some of the cuSOLVER functionality, so NERSC identified 30 or 40 important LAPACK calls for them. Of course, DGETRF is usually the most important; it does LU factorization.
If you call cuSOLVER directly, you know, you have to go through a little bit of a set of steps: you get the handle, you figure out what workspace sizes are needed by cuSOLVER, you allocate that workspace, you call the cuSOLVER version of DGETRF, and you deallocate the work.
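A hedged sketch of that sequence in CUDA Fortran (the cusolverDn module and routine names follow the HPC SDK interfaces, but the exact argument lists may differ):

    use cudafor
    use cusolverdn
    type(cusolverDnHandle) :: h
    real(8), device, allocatable :: a_d(:,:), work_d(:)
    integer, device :: ipiv_d(n), info_d
    integer :: istat, lwork

    istat = cusolverDnCreate(h)                                  ! get the handle
    istat = cusolverDnDgetrf_bufferSize(h, n, n, a_d, n, lwork)  ! query workspace size
    allocate (work_d(lwork))                                     ! allocate the workspace
    istat = cusolverDnDgetrf(h, n, n, a_d, n, work_d, ipiv_d, info_d)  ! LU factorize
    deallocate (work_d)                                          ! free the workspace
    istat = cusolverDnDestroy(h)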
If you compile that, you get about 3.3 teraflops on a V100. And then, GPU with nvLAmath: if you compile with the option -cudalib=nvlamath, we will, secretly kind of, pull in a module that redefines the interfaces to DGETRF and does the wrapper work for you.
So you don't need to make any of these changes in the source of your legacy Fortran applications, and, you know, the time is basically the same; we're not saying that this is faster.
It does basically the same work, but it just hides that from you.
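In other words, the legacy call site stays as-is; a sketch (the compile line is illustrative):

    ! Unchanged legacy LAPACK call; compiling with something like
    !   nvfortran -cudalib=nvlamath app.f90
    ! pulls in the module that reroutes it through cuSOLVER on the GPU.
    call dgetrf(n, n, a, n, ipiv, info)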
So, some possible future work: we'll probably look at adding some non-standard or NVIDIA-specific capabilities to do concurrent, some of the things I mentioned. We'd like to do some more F90 intrinsic function support, similar to what we have for matmul, reshape, and spread; pack and merge would be very nice. And, you know, once Perlmutter comes up and we get some feedback from NERSC users, we may add some more supported routines to nvLAmath.