From YouTube: 11. Concluding Remarks, Interoperability -- Brent Leback
Description
Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/
I think, in fact, that at the last hackathon BerkeleyGW had OpenACC and OpenMP in the same program, so you could decide between the two at runtime. A lot of people use CUDA when they really need it, and maybe that's like one percent of the code, but it could be a large percent of the overall execution time.
So what do we mean by that? First, different programming models can appear in the same source file; sometimes that's easier than others. In Fortran we have control over all of the compilers: the CPU compiler, the OpenMP compiler, the OpenACC compiler, and the CUDA Fortran compiler. So interoperability within a file is actually pretty easy for us. With nvcc it's a little more difficult.
Really only nvcc can compile GPU code. As Max showed, global functions are the entry points into a kernel. There are also device functions, which are commonly used; Max didn't go into that, but global kernels can call device functions.
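To illustrate, here is a minimal CUDA Fortran sketch of that relationship; the names scale_val and scale_all are made up for this example:

    module kernels
      use cudafor
    contains
      ! device function: callable only from GPU code, not from the host
      attributes(device) real function scale_val(x)
        real, value :: x
        scale_val = 2.0 * x
      end function

      ! global kernel: the host-callable entry point, which may call
      ! device functions such as scale_val
      attributes(global) subroutine scale_all(a, n)
        real :: a(*)
        integer, value :: n
        integer :: i
        i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
        if (i <= n) a(i) = scale_val(a(i))
      end subroutine
    end module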
Another definition: objects from different programming models can be linked together into the same program, and one programming model can use data declared, defined, or initialized in a different programming model. We've shown some examples of that.
You know, using OpenACC or OpenMP to manage the data, and then getting the device pointers to call into CUDA libraries with that. A little harder part of interoperability is: can I generate a kernel using OpenACC or OpenMP and call device functions which are written in CUDA? Our support for that is pretty good, though there are a few places where it's lacking.
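The data-management pattern looks like this; it's a sketch, and my_cuda_saxpy is a hypothetical CUDA C routine standing in for any CUDA library entry point:

    ! OpenACC owns x and y; host_data exposes their device addresses
    integer, parameter :: n = 4096
    real :: x(n), y(n)
    !$acc data copyin(x) copy(y)
    !$acc host_data use_device(x, y)
    call my_cuda_saxpy(n, 2.0, x, y)   ! x and y arrive as device pointers
    !$acc end host_data
    !$acc end data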
When I talked about standard parallelism yesterday, I noted that currently inside DO CONCURRENT we have problems calling functions that aren't PURE, and all of our CUDA functions, for the low-level CUDA things you might want to do, are not marked as PURE yet. That's a problem we're trying to solve for interoperability.
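In standard Fortran terms, the rule at issue is that any procedure referenced inside DO CONCURRENT must be PURE, which is exactly what today's CUDA interfaces don't declare. A minimal sketch:

    program pure_demo
      implicit none
      integer :: i
      real :: a(8) = 1.0
      do concurrent (i = 1:8)
         a(i) = twice(a(i))   ! fine: twice is PURE
         ! a call to an impure routine (e.g. a CUDA device API)
         ! here would be rejected
      end do
      print *, a
    contains
      pure real function twice(x)
        real, intent(in) :: x
        twice = 2.0 * x
      end function
    end program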
So that's important. And Max didn't show this yesterday, excuse me, in the previous talk, but in the chevron configuration when you launch kernels, an optional argument is the stream number. That is useful, in fact, for people who are trying to get speed-of-light performance out of their kernels.
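In CUDA Fortran that looks roughly like this, reusing the hypothetical scale_all kernel from the earlier sketch; the fourth chevron argument is the stream:

    use cudafor
    use kernels
    real, device :: a_d(8192)
    integer(kind=cuda_stream_kind) :: stream
    integer :: istat
    istat = cudaStreamCreate(stream)
    ! <<<grid, block, dynamic shared memory bytes, stream>>>
    call scale_all<<<64, 128, 0, stream>>>(a_d, size(a_d))
    istat = cudaStreamSynchronize(stream)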
Just to prove that you can use all these models in the same program, as kind of a toy or a joke, I wrote a five-line Fortran program and compiled it different ways, turning on the flags -mp, -acc, and -cuda. You can see that we can create a program that accepts OpenMP, OpenACC, and CUDA Fortran in any combination, with the sentinels shown in the upper left.
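Something in the spirit of that toy, though not the exact slide code: each sentinel is honored only when its flag is on and is an ordinary comment otherwise, and with -cuda the same file could also declare device data or kernels.

    program toy
      implicit none
      integer :: i
      real :: a(4), b(4)
      !$omp parallel do       ! active with -mp
      do i = 1, 4
         a(i) = i
      end do
      !$acc parallel loop     ! active with -acc
      do i = 1, 4
         b(i) = 2 * i
      end do
      print *, a, b
    end program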
You can also use ifdefs in the same program: OpenMP and OpenACC are defined to set these macros (_OPENMP and _OPENACC), and I believe nvcc and CUDA Fortran define the _CUDA macro.
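Guarding code with those macros looks like this, assuming the file is preprocessed (a .F90 suffix or the -cpp flag):

    subroutine which_model()
    #if defined(_CUDA)
      print *, 'built with -cuda (CUDA Fortran)'
    #elif defined(_OPENACC)
      print *, 'built with -acc (OpenACC)'
    #elif defined(_OPENMP)
      print *, 'built with -mp (OpenMP)'
    #else
      print *, 'plain CPU build'
    #endif
    end subroutine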
So, as I mentioned before, if you have CUDA C, you should probably use nvcc to compile that; use our nvc, which is our C compiler, or nvc++ to compile OpenMP or OpenACC. Calling CUDA libraries with host-side interfaces does not require nvcc. So for the libraries like cuFFT and cuRAND that we showed, you're running those on the device, but you do not need nvcc.
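For example, a host-side cuFFT call from Fortran might look like this; a sketch assuming the cufft module's interfaces mirror the C API, compiled with something like nvfortran -cuda -cudalib=cufft:

    program fft_demo
      use cudafor
      use cufft
      implicit none
      integer, parameter :: n = 1024
      complex, device :: d_sig(n)   ! data lives on the device
      integer :: plan, istat
      d_sig = (1.0, 0.0)
      istat = cufftPlan1d(plan, n, CUFFT_C2C, 1)              ! host-side API
      istat = cufftExecC2C(plan, d_sig, d_sig, CUFFT_FORWARD) ! runs on the device
      istat = cufftDestroy(plan)
    end program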
We provide the -cuda and -cudalib options just to make compiling and linking easier. One thing to be aware of when you mix CUDA compiled with nvcc and our compilers, nvc or nvc++: there's this notion of relocatable device code, called RDC.
Our compilers nvc and nvc++ turn that on by default, because we figure that for HPC applications people are usually calling lots of functions from different files. nvcc, the CUDA compiler, doesn't feel like making that assumption, and for performance, no-RDC can actually perform slightly better than RDC.
A few more optimizations can occur in the code generation during the various phases of assembly, so be aware of that. There are options on all compilers to turn relocatable device code on and off.
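For reference, the toggles look something like this; check each compiler's documentation for the exact spellings:

    nvfortran -cuda -gpu=nordc main.cuf    (RDC on by default; nordc turns it off)
    nvcc -rdc=true kernels.cu              (RDC off by default; -rdc=true turns it on)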
The people at NERSC who are attending some of our calls know that some of our slides show C++ standard parallelism interoperability with pragma-based data directives.
It's hard, because a lot of the C++ parallel algorithms are just handled in header files with metaprogramming, and the compiler doesn't really get a chance to interject itself to help out with, say, looking up into the present table, which I've mentioned a few times, to know how to get the device address for a corresponding host array. So again, there's more work for us to be doing here as well.
In Fortran, similar to C++, I mentioned that our Fortran compiler, nvfortran, compiles all the models. Across languages, Fortran calling C is pretty well defined, as is CUDA Fortran calling CUDA C. We've got lots of examples of that in our packages, and we've had blog posts over the years on how to do that.
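A typical sketch of that cross-language glue, where saxpy_wrapper is a hypothetical C function, perhaps compiled with nvcc, that launches a kernel:

    interface
       subroutine saxpy_wrapper(n, a, x, y) bind(c, name="saxpy_wrapper")
         use iso_c_binding, only: c_int, c_float, c_ptr
         integer(c_int), value :: n
         real(c_float),  value :: a
         type(c_ptr),    value :: x, y   ! pointers passed by value, C-style
       end subroutine
    end interface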
Maybe I've left a few out there, but if you're a Fortran programmer and you're making use of some of those CUDA libraries, I really recommend you use the -cudalib option, because some of those interfaces require an extra wrapper library.
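That option takes care of the link line for you; for example, something like:

    nvfortran -acc -cudalib=cufft app.f90    (links cuFFT plus any needed wrapper library)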
Because OpenMP defines a host fallback mode, some cases which work with OpenACC plus CUDA are not quite right yet with OpenMP plus CUDA. But we're working on it, and I think we can solve that problem; we just have to get to all of the cases. And, as I mentioned yesterday, for interoperability we still need to figure out if and how we allow non-standard Fortran features in DO CONCURRENT, for things like calling CUDA functions.
If we want to, or for expressing a launch configuration. I will say, you know, we didn't add the capability to change the launch configuration to OpenACC or OpenMP just because it's fun; we did it because people have required it. Real applications need control over that. So to think that you can port a real application to DO CONCURRENT without it is just to say, well, I'm willing to give up some performance.
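In OpenACC and OpenMP, that control is spelled with clauses like these; the values here are only illustrative:

    !$acc parallel loop num_gangs(1024) vector_length(128)
    do i = 1, n
       y(i) = a * x(i) + y(i)
    end do

    !$omp target teams distribute parallel do num_teams(1024) thread_limit(128)
    do i = 1, n
       y(i) = a * x(i) + y(i)
    end do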
I have host code versions of them that you can compile just on the host, and if you want to start with that, you can insert the directives yourself if you want to try different things. Then we've provided a handful of different solutions to that. I did them all kind of quickly yesterday; some of them might be a little buggy or off, but they're basically correct. You should be able to get the gist of what we're trying to show.