From YouTube: Current State of CUDA Compilers and SDK
Jeff Larkin (NVIDIA)
So hi everyone, my name is Jeff Larkin, and I run a team at NVIDIA that manages HPC programming models and standards and HPC software, so I'm going to talk to you about our HPC software stack. I apologize that my voice comes across a little bit froggy. If the audio drops very subtly during the talk, it's probably not the connection, it's probably me muting quickly, but I'll struggle through it with you. I'll stop my video now.
First, I want you to understand our vision for how you should program to an NVIDIA platform, and I want to talk as well about how the NVIDIA platform is more than just GPUs and what all it includes. I'm going to talk about the HPC SDK, which is how all of the NVIDIA software is installed on Perlmutter and on other systems as well, and make sure you understand what that provides for you, how you get it, and how you access it. And then I'm going to talk a bit about programming for machines like Perlmutter with standards-based approaches, which means not having to rely on domain-specific APIs or things like OpenMP, OpenACC, or CUDA, but actually being able to write directly in C++ or Fortran.
So the HPC software team has four major initiatives. You'll see a couple of these again when I get to the math libraries, but first we want to make using our platform completely seamless, and that means making it simple to take advantage of all of the available hardware. I include in this being able to write a native C++ or native Fortran application that can run on the CPU or the GPU, or even take advantage of things like the GPU tensor cores. All of these great hardware features should be exposed to you in a way that is easy and seamless to use. Next, it's important that we scale up. You saw a picture of the Perlmutter node; it has four GPUs on it.
So it's pretty rare to find a GPU machine with just one GPU on the node now; machines are getting larger and larger. So first the libraries need to enable you to scale up to all of the GPUs on your node, but also to full system scale, and some of our math libraries can now scale things like multi-dimensional FFTs or linear solvers up to full system scale for you. Next is working in domain libraries, and one that we've been working in a lot recently is quantum. We identify specific domains that have very specific needs and develop libraries to support those needs. For quantum, we developed libraries to support quantum circuit simulation; for things like signal processing, we developed software packages to support that. So we identify these important domains that have individualized needs and address those needs. And, lastly, Arm is an important part of our ecosystem. Although Perlmutter doesn't have any Arm nodes, I think you've probably seen that NVIDIA does have some Arm CPUs coming, and so we need to make sure that those are as well supported as x86 is today.
But looking at the broad NVIDIA platform, how do we expect you to program to this? I want to emphasize platform, because often people think of NVIDIA as the GPU company, and we certainly have made our name with our GPUs, but we also have, coming next year, our Grace CPUs. We also have, of course, some embedded CPUs as well, and our network interconnect through Mellanox InfiniBand. So we need to provide a programming strategy that addresses all of these.
It's really a disservice to the users if each of these technologies has a distinct way to program; it makes things much more difficult to use. The foundation of all of that is our accelerated libraries, and when I say accelerated libraries, people frequently think immediately of our math libraries, which are a really important feature. We provide things like linear algebra solvers, FFTs, random number generation, tensor solvers, things like that, and we do provide very highly tuned math libraries that improve over time. But those aren't our only libraries. Core libraries are things like libcu++ and Thrust, which provide high-level data structures and algorithms.
Communication libraries provide a well-tuned MPI, NCCL, and NVSHMEM, and there are even higher-level frameworks for data analytics, AI, and quantum circuit simulation as well. This is the foundation, and for many developers you could start by writing code to use these math libraries. So if you're doing large-scale FFTs, we've got a library for you; if you're doing large linear solvers, we have a library for you. But it also enables us to layer on top of that; it gives us a firm foundation for these other programming approaches.
I've broken them down into three high-level ideas. Now, many people assume that our goal is absolutely everybody writing in some form of CUDA, whether that's CUDA C++ or CUDA Fortran. These CUDA languages are where our language innovation happens: we can co-design them with our hardware, so whenever a hardware release has new hardware features, we can expose them directly there. But it's not necessarily the right programming model for everyone; it's a language for platform specialization. If you want to take advantage of all the hardware features we have, or want to optimize your application and specialize it to the GPU that you have in front of you, CUDA C++ and CUDA Fortran give you all the tools and knobs and bells and whistles to do that. But it's certainly not the only approach to our platform.
Writing in the ISO languages means your code is parallel first: whether you're coming to a GPU platform, a CPU platform, an FPGA, or a DSP, it doesn't matter; if your platform supports ISO C++, it's going to run out of the box. And that's a really exciting thing, because it means you're not porting your application to the platform; you're ready to run on day zero. Now, sitting between these approaches are compiler directives, which provide incremental opportunities, and there are really two main ways we see directives used; at the top here is OpenACC and at the bottom is OpenMP.
One way, you could use them like I'm showing here, where you write something in, say, Fortran do concurrent or the C++ standard library, and use compiler directives to incrementally and portably improve the performance. So here I'm taking control of my data movement. That enables you to write fewer directives than you would have otherwise, because the parallelism is handled by the language, but you can still optimize things like data movement in a portable way.
The other way compiler directives are popular is for legacy applications. If you have hundreds of thousands or millions of lines of code, it's unrealistic to expect you to rewrite it all overnight.
A
compiler
directives
provided
ice
beads
of
leveraging
your
existing
code
getting
as
much
as
possible
running
in
parallel
running
on
the
GPU
and
then
once
you're
running
that
on
the
GPU,
you
could
go
and
evaluate.
Well,
maybe
certain
parts
of
my
code
should
be
refactored
and
Rewritten
in
one
of
these
standard
approaches,
or
maybe
I
have
one
solver.
A
A
Mixed
together,
nicely
you're,
not
picking
one
swim,
Lane
and
sticking
to
it,
but
in
fact,
there's
existence
of
applications
that
do
the
bulk
of
their
work
and
do
concurrent
and
and
math
libraries,
and
then
it
sprinkle
in
some
directives
here
or
there
and
maybe
even
have
a
function
and
and
Fortran
and
Cuda,
and
all
of
that
will
mix
together
nicely.
So
so
our
goal
is
to
provide
you
with
the
accelerated
standard
language
as
a
way
to
write
your
code
once
and
and
expect
it
to
run
everywhere
and
anywhere.
Now, this is supported through a software product called the HPC SDK, and we've worked very closely with the folks at NERSC, and also with HPE, the system integrator, to make sure that this works really well on your platform. And if you looked at the slide earlier, you saw that the HPC SDK was the one software package that had green bars all the way across: it supports all of the programming models available.
The HPC SDK is a completely free product. Regardless of whether you have a GPU, you can download it and install it on your personal machine. It's available on Perlmutter and on all of the major supercomputers, and it's available on all of the major clouds, so you can right away begin to use it for free everywhere. It supports x86, Arm, and OpenPOWER, so it's a portable software stack: if today you're running on Perlmutter and next week you want to run on an Arm computer, by taking the HPC SDK,
all of the software libraries that you need are built in and ready to go right on that new platform. So we support all the programming models I discussed. It comes with four compilers: nvcc is the CUDA compiler, and nvc, nvc++, and nvfortran are the C, C++, and Fortran compilers, which we call our HPC compilers. There's a huge range of libraries; I can't even list them all here.
That includes communication libraries, and what's really nice with this package is that all of these pieces are tested to make sure they work well together. This entire software stack is well tested, so no matter where you go, you can expect that it will work seamlessly. And, lastly, we do provide profilers and debuggers, because it's really important for you as a developer to be able to understand:
okay, if I'm hitting an error or I'm hitting a performance issue, why am I hitting that? So, just to say it one more time: completely free, and downloadable from our website, as a container, via Spack, and on all the clouds. And we release every odd month, and the occasional even month as well, so our next release is expected in November.
An important part of that package is the HPC compilers. Now, I know many of you that have been around NERSC and the other DOE sites for a while are familiar with the PGI compilers. Several years ago, actually almost 10 years ago now, NVIDIA purchased PGI, and we've put a ton of work into those compilers, so much so that we can't even call them PGI anymore; we call them the NVIDIA HPC compilers. So they do have that lineage, but they've come a long way.
Since then we've added features like the standard parallelism, additional programming models, and additional platforms. It supports all of our GPU platforms, and so you can automatically, in some cases, offload to GPUs with all of the programming models I discussed. It is a great CPU compiler as well; you don't need a GPU to take advantage of it, and it supports compiler directives and vectorization. And, lastly, it's multi-platform.
The other important piece of the HPC SDK is our math libraries. The first two items here I've kind of covered already, but the additional high-level initiatives are that we are building libraries to be more composable. As our GPUs have gotten larger and larger, in some cases having a single matrix that's large enough to saturate the entire GPU may not be realistic for scientific applications. So we've built composable functions where you can call the libraries from within your existing kernels, which saves you data movement costs, launch overheads, and things like that. And, lastly, we're making sure that when Grace does come, we provide the best possible performance libraries for the Arm CPUs as well.
I'll highlight two recent enhancements; in both cases these are our multi-GPU libraries. So here is cuSOLVERMp. cuSOLVER does linear solvers: things like LU, Cholesky, QR, eigensolvers, things like that. What I've highlighted here are the multi-GPU, multi-node aspects, so lower is better here. The gray bar is a community library; you can see it scales up here to about 1024 GPUs.
With the green bar you can see we've taken our cuSOLVERMp and not only improved performance but improved the scalability; out here I'm showing 4,096 GPUs on Summit. So this means, if you have large solvers that you might have used in the past, like ScaLAPACK, now we can support scaling up to all of the GPUs you have available, automatically. And we can do the same thing with FFTs. Here I'm showing where we're scaling up the problem size as we increase the number of GPUs; you can see that reaching 4,000 GPUs is a very large 3D FFT, and performance is great. We support 2D and 3D with both slab and pencil decompositions. We greatly prefer slab because, as you can see in the picture here, it gives a lot more work to the GPU, but we do support pencils as well, as well as having functions to convert between these.
So that's it for the libraries; there are several talks available on NVIDIA's website that go into greater detail about more of our libraries. To talk about the standard languages, I'm going to highlight both C++ and Fortran here.
So C++ has been, since C++17, a parallel language: C++17 introduced the parallel algorithms library. We already had a list of high-level algorithms in the C++ standard library, and C++17 added the idea of providing an execution policy, so I can say "go run this sequentially" or "hey, this is something that can actually safely be run in parallel." So it enables you to exploit both your threaded parallelism and vector concurrency. C++17 also made some guarantees on forward progress to avoid deadlock, and clarifications to the memory model to avoid race conditions. The reason I highlight these is that, at the same time, we were building those features into our hardware.
There are new features coming in C++23 that we're excited about and that we're beginning to preview in our compiler, and we're already working on C++26; later this year we will release a prototype of a feature that's not even expected in C++ until C++26. So we're very excited about making sure that C++ is a mature language for parallelism and concurrency, because that's like the tide raising all of the boats, making these features available everywhere.
One application I'll highlight is a mini-app from Lawrence Livermore called LULESH. It's a hydrodynamics mini-app, and the baseline code is C++ with OpenMP. This is one example function, and you can see this is fairly typical OpenMP: I'm supporting my CPU threads, I have work sharing of my work here, and I have an ifdef at the top to handle sequential consistency.
If we look at this code on the right, this is the exact same function, doing the exact same thing, but written using a C++ standard algorithm. You can see the code is much more compact, which should make it easier to maintain long term; it's completely ISO compliant, which means it's portable to every ISO C++ compiler; and, on top of all of that, it actually turns out to be faster too. So here I'm showing three compilers: the Intel compiler, the GNU compiler, and nvc++.
This is running on an AMD EPYC CPU; I don't recall if these are the same CPUs as on Perlmutter, but they're a similar family. You can see the performance of the OpenMP code across these compilers is comparable; if I take the time to tune all of my environment variables, I can get them even closer together, but this is the default setting.
If I look at the code that's just ISO C++, you can see across the board the code gets faster on the same CPU. And why is that? Well, I think there are some advantages the compiler has: by staying in a single, simple programming model, it has a more complete understanding of the code and can optimize better. But I think, as well, there are some performance inefficiencies in the OpenMP code that could bring it down.
Lastly, by changing one compiler flag, I can take this C++ code, with no extensions and no directives, and run it on the GPU as well. And so you can see here I'm taking the exact same pure C++ code across three different compilers and two different hardware platforms.
So LULESH is a mini-app; MAIA is a full application written at RWTH Aachen University. It's about half a million lines of code, written over the course of a long period of time, so this is not something that could be rewritten overnight.
They did, however, go to some of their individual solvers and replace the OpenMP with just straight C++ parallelism. Down here are the results for the lattice Boltzmann solver, and you can see the OpenMP and the ISO C++ are comparable in performance.
In fact, this is after they fixed a significant OpenMP performance bug that was initially in the code. And taking this code unmodified, they can run it on all of the GPUs on their node, or, using MPI, scale it up to their full system scale, so they're very excited about these results and have now begun to work through their other numerical methods to make MPI plus ISO C++ their standard going forward.
We're doing very similar things in Fortran. Fortran is a very widely used language, within the DOE in particular but also around the world, and it has a history of parallelism as well. There are three ways to do parallel programming in Fortran, and we support two of them. First are the array math intrinsics, so these are things like calling matmul or reshape on arrays; there's lots of parallelism deeply embedded in that that we can take advantage of. Second is the do concurrent loop, which is something that was added in Fortran 2008.
It was extended in 2018, and it's being extended again in 2023; we actually already support the feature that adds reductions to do concurrent loops. So you can write all of your data-parallel loops now using do concurrent, without the need for any compiler directives at all.
Lastly, there are coarrays. Coarrays can be thought of as an alternative to MPI. We don't currently support them in our compilers; we would like to eventually, but we don't have that today. One application I'll highlight for Fortran is miniWeather. This was developed at Oak Ridge National Lab. It is a teaching code, but it's used as a part of the SPEChpc benchmark suite; there's an OpenMP version, there's an OpenACC version, and there are even versions written in various C++ frameworks as well.
In terms of the code, if you are a Fortran programmer, this code will look very easy to understand. As a matter of fact, the bulk of this code is exactly the same as the OpenACC and OpenMP versions; the difference is I replaced a triply nested do loop with a do concurrent. The compiler can take this code and build it for CPU threads.
As you see here, the performance is on par with OpenMP; OpenMP actually handled the thread affinity slightly better, so the do concurrent is ever so slightly slower in this case. Or it can run on the GPU, and it's comparable with the OpenACC. So we get very good performance and portability using do concurrent.
Another application is POT3D. This is also in the SPEChpc benchmark suite, and the reason I like to highlight this one: they had an existing OpenACC code (here, lower is better), and they wanted to see how far they could get without any directives at all. So they rewrote the OpenACC code using do concurrent, and what they found was that they were about 10 percent slower than their OpenACC code, which was actually fairly acceptable to them, but they wanted to understand why, and so they dug in and determined:
well, part of the reason we can run Fortran do concurrent on the GPU is because we can use something called CUDA managed memory, which, when you allocate your data, makes it visible to both the CPU and the GPU, and under the hood it will migrate according to usage. And what they eventually found is that if they used OpenACC with managed memory, they got the same performance as the do concurrent.
So clearly this 10 percent performance loss was due to the automatic migration of data, and so they put back in some minimal OpenACC to handle and optimize the data movement, and their performance was then the same. So here they stripped away some 400 lines of directives, or something like that, and wrote all of the parallelism in Fortran and the data movement in directives.
And one more thing I'll highlight here as a recent enhancement is GAMESS. GAMESS, which you've probably heard of, is a computational chemistry application that's very widely used; it's been developed for some 40 years. Their baseline code is MPI plus OpenMP, and they have that on the CPU, and they also have that for the GPU using the offloading directives. A student at Iowa State rewrote this portion, took out the directives, and put in do concurrent, and you can see the results here.
The result was actually a pretty dramatic performance improvement over the OpenMP, and she did go through and try to further optimize the OpenMP; this was after the optimizations. So why is this? Well, OpenMP is pretty strict in what the compiler can and should do to your code when encountering these directives. Do concurrent is more descriptive: it gives the compiler a whole lot of freedom in what it can do, and so it can make smarter optimization decisions. There's a paper coming on this.
It's not available yet, but look for it next year to show these results. So, with about four minutes left, let me come to some conclusions. First, the HPC SDK is a complete and portable toolkit for HPC developers. It's available on Perlmutter via a module load, or on your own machine via a download. I encourage you to check it out, because it is portable across all of the major architectures. Second, NVIDIA supports a wide range of programming models.
I think I would actually say that we have the greatest choice of composable and mature programming models of any HPC vendor. And, lastly, I can only scratch the surface in 30 minutes here, so I picked out my four favorite talks from this past GTC and linked to them here. Full disclosure: this last one was mine, and my voice is a lot less painful to listen to in that one. So I encourage you to go back and watch these, and with that I have about four minutes left if there are any questions you'd like me to answer.
B: Thank you. Thank you, Jeff. We have a couple of questions in the chat. The first question is from Josh, and the question is: do any NVIDIA SDKs have licensing concerns or considerations that need to be noted when deploying on non-NVIDIA platforms? I did notice that on your previous slide you had mentioned NVIDIA GPU and AMD CPU.
A: Yeah, so the HPC SDK is freely available. There is a EULA; the EULA does have a carve-out so that you can distribute the necessary runtime libraries to make it possible to build and distribute your software appropriately. You could certainly, if all you have is an Intel CPU, go off and download this, and you can still build with it; if all you have is an Arm CPU, go off and build with it. So this is freely available, but yes, of course, go read the EULA. We tried to be as generous as possible: there's no charge for it, there's no renewing a license, you download it off our website, and it's free to use.
B: Thank you. Another question is: what would be the best place to suggest or request new features in CUDA math libraries?
A: For the math libraries, I would say your best bet is to float them up through the NERSC help desk; they have a direct line to our engineering and our product managers to make that happen. But you could also post on our forums as well.
B: Another question is: what is "Mp" in cuSOLVERMp and cuFFTMp? Are these available through the previously existing cuSOLVER and cuFFT APIs?
A: So "Mp", I believe, stands for multi-processor. There's cuSOLVER, which is the kind of single-GPU solver; Mp is the multi-processor version. The API does translate; I think you have to add the two letters to it, but otherwise the API is very familiar. And the reason for naming them differently is that it does make it a little bit easier when all you need is the single node, or all you need is the multi-node; it helps with the linking and distribution.
B: …
A: I believe it does. So, actually, in that POT3D code, one thing I did gloss over is that they actually did not go down to zero compiler directives; they had to leave in three: one for picking which GPU on the node they wanted, and two atomics, which were for array reductions. We have since implemented that in our compiler, so to the best of my knowledge it's now both supported in the standard and in our compiler.
B: …
A: Great question. So there's a paper coming at Supercomputing by the folks at the University of Bristol, Simon McIntosh-Smith's group, that goes into a study of this, and I believe they were able to build for Intel GPUs using Intel's compiler stack and a very small shim. The difference being that we chose to use the standard execution policy, and Intel chose to use one unique to themselves, a DPC++ execution policy. So they had a six-line-of-code shim to translate between the two, but otherwise the rest of the code was portable. I'm not aware of one for AMD GPUs at the moment, and we are, you know, in discussions with the community to support this better.