From YouTube: 2. Nvidia HPC Software -- Jeff Larkin
Description
Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/
A: Hi everybody. For those who don't know me, my name is Jeff Larkin, and I am an architect on NVIDIA's HPC software team. I want to give you an overview today of the topics that are going to be covered in greater detail through the rest of today and tomorrow: a high-level introduction to our platform, our vision for how you should program that platform, and the software that we provide to enable you.
A: To start (and I'm going to turn my video off, so I don't eat my bandwidth), I want to discuss what the NVIDIA platform is. People frequently think of us as the GPU company, because we've certainly been innovators in the GPU space from the beginning, and that has been our bread and butter over the lifetime of the company. But in the last several years we have expanded that into a broader platform.
A: First, through the acquisition of Mellanox a little over a year and a half ago, which gave us an InfiniBand networking portion of the company. And then last year Jensen, our CEO, announced the Grace CPU, so at an as-yet-unannounced time in the future we will have an Arm-based CPU as well.
A: So we need to provide you with a coherent vision for how to program all of this: not just the GPU, but also the CPU, and also the network. First I want to point out the foundation down at the bottom, our accelerated libraries. I think of these libraries as the foundation on which we build the rest of this.
A: We also provide you with communication libraries: MPI, of course, but also our collective communications library, NCCL, and our SHMEM implementation, NVSHMEM. We also provide support for various data analytics and AI packages, and then most recently we've added support for accelerating quantum simulation.
A: So this is the basis on which we build the rest of that strategy. For many people, the assumption is that we at NVIDIA believe all developers should be writing the code on the right, which is CUDA C++ (and we also have support for CUDA Fortran as well).
A: CUDA came out at a time when programming GPUs was very difficult, and we needed to provide the proper abstractions to be able to do general-purpose compute on them. CUDA remains the place where we innovate: if we introduce new features in our hardware, you can expect them to come to CUDA C++ and CUDA Fortran first. That is where we'll expose those features, and various innovations in software.
A: However, it is not the only approach to programming our platform. In fact, our goal is that most developers will come to our platform with the approaches on the far left, which we call accelerated standard languages. You'll hear some other terms here: sometimes you'll hear us say "stdpar," short for standard parallelism, but I like the term "standard language parallelism," because it encapsulates the fact that these are standard programming languages: ISO C++ here, ISO Fortran here, and then also Python, which doesn't have an ISO standard behind it.
A: So here I have a std::transform; I'm asking for it to execute in parallel, and then I'm providing the implementation. And the same thing with Fortran: do concurrent has been in the language since 2008, and we have supported it across CPUs and GPUs since 2020; it is actually supported in other compilers as well. It expresses that the iterations of the loop can be run in any order, so we can parallelize it for multicore CPUs and for GPUs. The more recent addition is cuNumeric, which provides a similar interface for Python, and I'll go into more detail on that in a few slides.
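To make the C++ side of that concrete, here is a minimal sketch of the pattern he is describing (illustrative only; the array sizes and the lambda are invented, not taken from the slides). With nvc++ the same source can target a multicore CPU (-stdpar=multicore) or a GPU (-stdpar=gpu):

```cpp
#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<double> x(1 << 20, 1.0), y(1 << 20);

    // Ask the implementation to run the transform in parallel.
    // Compiled with nvc++ -stdpar=gpu, this can be offloaded to an
    // NVIDIA GPU; with -stdpar=multicore it targets CPU threads.
    std::transform(std::execution::par, x.begin(), x.end(), y.begin(),
                   [](double v) { return 2.0 * v + 1.0; });
    return 0;
}
```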
A: By focusing your efforts on these standard languages, you can come to our platform, or any other platform, with code that is already parallel. Will it be the highest-performing code that you can achieve on our GPUs? Most likely not. If you want the absolute best possible performance, exploiting everything in the hardware, you'll still want to write it in CUDA C++ or CUDA Fortran. But your baseline will already run out of the box on day one using these standard languages.
A: Now, there is a functionality gap between these approaches, and we use directives to span that gap. As a for instance: in these approaches on the left there's no way to represent your data transfers, so we've built them on top of CUDA Unified Memory. But if you wanted to take control and further optimize the code, you can incrementally optimize it using something like OpenACC or OpenMP. We still believe the best place to start is with the standard languages.
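As a sketch of what that incremental step can look like (illustrative only; the function and array names are invented, this is not code from the talk), an OpenACC data region wrapped around an ordinary loop gives you the explicit control of transfers that the standard-language version leaves to CUDA Unified Memory:

```cpp
// Explicit data management plus loop parallelization with OpenACC.
void scale(const double* x, double* y, int n) {
    // Copy x to the device on entry, copy y back on exit.
    #pragma acc data copyin(x[0:n]) copyout(y[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] = 2.0 * x[i];
    }
}
```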
A: So what do we provide for you to program to this vision? The product we have is the HPC SDK; you will sometimes see the acronym NVHPC as well.
A: We provide compilers for C, C++, and Fortran, and then we also ship the traditional CUDA compiler, nvcc. And then we package up a ton of libraries to make you successful. Here I've listed libcu++, which is a subset of the C++ standard library for GPUs; Thrust, which is a higher-level template library for writing code that's portable across CPUs and GPUs; and CUB, our cooperative-primitives library, which provides building blocks for other algorithms.
A: We also ship our math libraries. Some you may be familiar with are cuBLAS, cuSOLVER, and cuFFT; this is not an exhaustive list, but these are some of the most important ones. And lastly, we provide communication libraries (MPI, NVSHMEM, and NCCL), and then, because you need to be able to debug and understand the performance of all this, we also provide you with debuggers and profilers.
A: You don't even need to have a GPU in the machine. It's available for free from our website; it's available in containers via NVIDIA NGC; also through Spack; and even on the various cloud providers, where we provide AMIs. We typically release somewhere around seven or eight times per year; our last release was actually just last week, 22.1.
A: Now, some people are surprised to learn that our HPC compilers are not just GPU compilers. In fact, we want them to be best-in-class CPU compilers as well, and so we provide a high degree of CPU optimization, including parallelization and vectorization, across all of the supported architectures, which are x86, Power, and Arm server-class CPUs.
A: And we recognize that many of you don't necessarily consider yourselves computer scientists or software engineers; most of you are in fact scientists first, and you've learned to program in order to enable that science through simulation. So many of you will have applications you've developed on other platforms; they're running along in that lane and you're making good progress, but you recognize the performance benefit of being able to run on GPUs. Standard language parallelism provides you with an on-ramp to get to the GPUs: code that can continue to run natively on CPUs, but also on our GPUs.
A: Then, once you're on the GPUs, you can begin to look at other optimizations, such as directives or CUDA C++ and CUDA Fortran, in order to better optimize your code and bring it farther and farther into the fast lane. So standard language parallelism provides you with a means to get onto all parallel platforms very quickly. And this is not a decision that we made overnight, to strive toward standard language parallelism; it's actually an investment we've made over the past decade.
A: We've been participating in the various ISO committees for more than the past decade, and we participate not just with the national labs but also in collaboration with our competitors in those committees.
A: So we began to support standard language parallelism in our software stack in 2020, but that seed was actually planted more than a decade earlier, and it has only been through continued engagement with the standards committees that we've developed this. We believe that by focusing on bringing concurrency and parallelism to all of these languages, standard language parallelism becomes the tide that raises all of the boats. It is available to you everywhere.
A: I'll also point out these multi-dimensional array abstractions, solely because they are a collaboration that took place between NVIDIA and the national labs, as well as the rest of the ISO committee. And we continue to drive forward with new features.
A: The first thing I'll point out is that we use the nvc++ compiler to accomplish the acceleration of these parallel algorithms, so that's the compiler you will use, whether directly or through the Cray compiler wrappers, and it provides support not only for coarse-grained parallelism but also for vector concurrency.
A: Some of this comes about through enhancements to the programming model and the compilers, but also through our hardware. Our last two hardware generations have enabled forward-progress guarantees that allow us to support the C++ execution model on our accelerators, and also the C++ memory model. C++20 enhanced the synchronization libraries, and you can see that we have actually provided support for many of these.
A: These primitives are in the libcu++ library, and we are continuing to drive forward with new features such as the senders/receivers proposal, mdspan and mdarray, range-based parallel algorithms, and so on. So this is an ongoing process. To highlight some of the successes we've had, first we'll start with a mini-app called LULESH.
A: If I show you just a snippet of the code (this is one representative function), you can see that in their baseline OpenMP version, #ifdefs are used to support serial or parallel execution.
A: You can see the introduction of a parallel pragma here to spawn the CPU threads, and an omp for here to work-share this loop. This is fairly typical OpenMP code, with the addition of these pragmas and #ifdefs, and it's the code that they run in production.
A: As a matter of fact, we worked with them to restructure the code to use solely standard C++, and so this function on the left gets transformed into this function on the right. You may have to stare at it a little bit to believe me.
A: But these are actually accomplishing the exact same thing, and you can see the code is a lot more compact and easier to read and maintain. Here we're using a std::transform algorithm, we're telling the compiler that it can execute this code in parallel, and inside of it you can see first the transformation, which is this top loop, and then the reduction, which is this bottom loop.
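The actual LULESH source isn't reproduced here, but the transform-plus-reduction shape he describes maps naturally onto std::transform_reduce. This is a hedged sketch of that pattern with invented names (elem_length, elem_speed), not the real function:

```cpp
#include <algorithm>
#include <execution>
#include <limits>
#include <numeric>
#include <vector>

// Compute a per-element quantity and reduce it to a single minimum,
// the kind of combined transform+reduction the rewritten function uses.
double min_time_constraint(const std::vector<double>& elem_length,
                           const std::vector<double>& elem_speed) {
    return std::transform_reduce(
        std::execution::par,                      // run in parallel
        elem_length.begin(), elem_length.end(),   // first input range
        elem_speed.begin(),                       // second input range
        std::numeric_limits<double>::max(),       // identity for min
        [](double a, double b) { return std::min(a, b); },   // reduce
        [](double len, double speed) { return len / speed; } // transform
    );
}
```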
A: So you can see this makes the code a lot easier to read, but because it is fully ISO standard, it's also portable to any compiler that supports ISO C++; here's a list of several that we've tried it in. And in addition to all of those benefits, I'll also point out that it's faster, too.
A: So here is the baseline code that I showed, and again, that was just one function out of the entire code. You can see that building with GCC on a 64-core AMD EPYC server, that is our baseline performance.
A: OpenMP has various inefficiencies and overheads in the programming model that we were able to eliminate using standard C++. If you switch to nvc++, you can see that improves to a 2x performance improvement, again on the same CPU and the same version of the code. But what's really powerful with this is that you can change one compiler flag, take that same standard ISO C++ code, and now you're running on an NVIDIA GPU as well.
A: Another code I'll highlight is called MAIA; it comes from, I believe, RWTH Aachen University. We worked with them on a portion of their code that uses the lattice Boltzmann method (this is the fluid-flow portion), and we accelerated it using just standard C++. So the code on the left again becomes something like the code on the right, and you can see once again that the code is more compact and completely standard-compliant.
A: In terms of performance, we actually saw a pretty sizable improvement there as well. This one I would call not typical, in that we achieved a fairly large performance improvement in the standard C++ code because of various things that were eliminated during the rewrite from OpenMP to standard C++; so that is a fairly large and atypical improvement. But what is typical here is that we can then take that same code and run it on the GPU as well.
A: They had a goal of being able to run on a variety of platforms using no external APIs, so they wrote their code using standard C++ (the parallel algorithms became available in C++17), and these were the results: they were able to run across their entire 40-core Xeon server.
A: But by rebuilding that code, changing the compiler options to target the GPUs, they could run on the GPUs as well. And I would point you to two talks: this first one is from GTC Spring, so last March, and this last one is from GTC Fall. I'm sorry, I should have updated this slide with the link, because the speaker goes into a great amount of detail about what was involved in this transformation, and he actually shows results on more than just the A100.
A: He shows several of their servers and GPUs, so I encourage you to go look up those talks; he really goes into some great detail. But to quote him, they viewed this as a paradigm shift for cross-platform CPU/GPU programming: that they could do this with solely ISO C++. Now, Fortran is still a very important language in high performance computing, and within your labs I know there are a lot of Fortran codes as well. So, beginning with the 2020 versions of our compilers...
A: ...we began accelerating various parts of the Fortran language automatically as well. To begin with, we were able to accelerate the array math intrinsics inside the runtime library: looking at things like a MATMUL and recognizing that it can be mapped to our accelerated math libraries automatically, again using CUDA Unified Memory.
A: We expanded that support six months later, in the November release, to cover do concurrent. So now you can write your loops using this Fortran 2008 feature, and the compiler can thread-parallelize them on your CPU or automatically offload them to your GPU as well.
A: We do intend to eventually support coarrays. I don't have a comment on when that will be coming, but it is something that we are actively looking at as well. And then there's a new feature that has actually been approved for Fortran (though the version of Fortran it will appear in has not yet been released), which is reductions on do concurrent loops.
A: If you're not familiar with reductions: if you do something like a summation, or find the min or the max within a loop, you have many values that you need to reduce down to just one, whether it's the sum or the min or the max or whatever. The Fortran specification did not have a way to express this.
A: You actually had to write a do concurrent loop and then use one of the math intrinsics to accomplish the reduction. But this next version of Fortran has support for a REDUCE clause, and we actually already have preview support for it in the compiler. So you can write a do concurrent with a reduction since nvfortran 21.11; recognize that for this event we're using 21.9, which doesn't have it yet, but the very next version that becomes available to you will have that support.
A: So this is what it looks like. You can see a fairly simple routine for a Laplace operator: here we are doing a stencil operation that would normally be written as an i loop and a j loop, and you can see we write a single do concurrent loop and say, iterate across all of i and all of j. The reason you combine it all together is so you can run it on your GPUs, and the performance here is actually comparable to an OpenACC version of the code.
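The slide's example is Fortran, but the same idea (collapsing the i and j loops into a single parallel iteration space) can be sketched with C++ stdpar as well. This is illustrative only; the function and array names are invented:

```cpp
#include <algorithm>
#include <execution>
#include <ranges>
#include <vector>

// 5-point Laplace stencil over the interior of an n x n grid, flattened
// into one parallel loop: the C++ analogue of the single DO CONCURRENT
// loop over both i and j shown on the slide.
void laplace(std::vector<double>& out, const std::vector<double>& in, int n) {
    const double* a = in.data();  // raw pointers capture cleanly for offload
    double* o = out.data();
    auto idx = std::views::iota(0, (n - 2) * (n - 2));
    std::for_each(std::execution::par, idx.begin(), idx.end(), [=](int k) {
        int i = k / (n - 2) + 1;  // recover the 2-D indices
        int j = k % (n - 2) + 1;
        o[i * n + j] = a[(i - 1) * n + j] + a[(i + 1) * n + j]
                     + a[i * n + j - 1] + a[i * n + j + 1]
                     - 4.0 * a[i * n + j];
    });
}
```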
A: I mentioned the math intrinsics, so let me just demonstrate here. I hope none of you have a naive matrix multiplication like this in your code; but if you do, one thing you can do is replace it with a MATMUL operation like this, and you can see, not surprisingly, that we get a pretty substantial performance benefit, because we're able to map it to our accelerated math libraries. And that's not limited to just simple things like matrix multiplication; you can see that we support a pretty broad range of these intrinsics.
A: So our goal with accelerating the standard languages is that you can take the same code, whether it's in C++ or in Fortran or even in Python, as I'll discuss in just a moment, and be able to run it across your CPUs or your GPUs. You can see here that for the compiled languages I simply had to change one compiler flag in order to retarget the code; and then there's the Python code.
A: Python has increasingly become an important language in high performance computing. (And, I'm sorry, somebody needs to go on mute; I'm hearing some typing noise.) The package in the PyData ecosystem that is used most often is NumPy. It is the common approach to writing numerical algorithms in Python, and it dates back to right around the year 2000. You can see this code down here that performs an A plus A-transpose operation.
A: That was easy during the time period when you had single-core CPUs, but it needed to be expanded outward, of course, as CPUs went multi-core. So you can see we went from creating a single small matrix to then needing to distribute that matrix across the multiple cores; here we're using something called Dask to do that.
A: Dask was eventually extended to scale not just across CPU threads but even across clusters of several nodes, as you can see here. But eventually we needed to begin to provide GPU acceleration, and we really wanted to return to the simplicity of the first approach. Our answer to that is a package called cuNumeric. This was announced at GTC Fall, and it aims to become a drop-in replacement for NumPy.
A: So you can see here that I'm not thinking about all of the GPUs on my system, and I'm not thinking about all of the nodes on my system; I'm thinking about the size of the work I need to do, here 160,000 elements, and the cuNumeric package is able to distribute this data structure across your GPUs, or even across many nodes, automatically. And so now I'm able to simply write the same A plus A-transpose, and it will distribute both the data and the work.
A: This code is productive because it has simple sequential semantics: there's no visible synchronization, no explicit distribution of work, no partitioning of data, and you can actually compose it well with the rest of your libraries and your program. But it would also be nice to get high performance, and to transparently run this anywhere you need, leveraging all the available hardware, whether it's a single GPU, many GPUs, or even a whole system. So we've been building a systems architecture called Legate, which is this layer here. Legate provides a way to handle the distribution of data structures and work transparently, and we'd like to extend that to the broader Python ecosystem, but to start we're focusing on NumPy through the cuNumeric package. And so with cuNumeric we were able to take an existing NumPy application, plug cuNumeric in place of NumPy for our data structures, and actually run this code across an entire machine full of GPUs.
A: Now, if all I wanted to do was run on one GPU, there's already a package called CuPy that could have handled that. What's exciting about cuNumeric is not just that you can run on a GPU: here I'm scaling it out to (I'm showing here) 1,024 GPUs, and that's pretty exciting, because we're able to do it with essentially no code changes.
A: Here's another code that we've demonstrated; this came out of the scikit-image library. What's notable here is that this function didn't change at all: what we did was replace the NumPy arrays with cuNumeric arrays, and we've run it. And once again (this is actually weak scaling) you can see that the throughput increases as we increase the number of GPUs.
A: So that's Python, and we would really like to see more people developing with this. I will say that cuNumeric is not NumPy-complete at this point; it is still considered alpha software. But it is really critical to our standard language parallelism strategy, and we would love to see you trying it. So now let me shift focus toward NVIDIA's performance libraries. The first thing I want to give you is the goals of our libraries: first, we want them to be seamless to use.
A: So, as we add new features to our hardware, you don't have to worry about how to make use of those features; we can build them into the libraries, and you reap the benefits.
A: Most of the names are fairly self-explanatory about what they provide. And I will point out that there's a lot of excitement about the Tensor Cores that we provide in our GPUs, beginning with the V100 and with expanded support in the A100s as well. The point of this slide is to demonstrate that for many of the operations in the libraries, you don't need to enable the Tensor Core support: they will be used for you under the hood, and you can reap the benefits of that automatically.
A: So, for instance, the cuBLAS library: cuBLAS provides a full implementation of the BLAS, plus a variety of extensions such as mixed- or lower-precision and batched APIs, and these are actually used in a lot of applications.
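As a minimal illustration of what calling into cuBLAS looks like (a sketch, not code from the talk; error handling is omitted and the matrix contents are invented):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Multiply two n x n single-precision matrices with cuBLAS.
// The talk's point: library calls like this pick up hardware features
// (e.g., Tensor Cores for the reduced-precision GEMM variants) without
// the caller having to opt in.
void gemm(int n) {
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C (column-major layout)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```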
A: And what I would like to show here is that as you utilize these libraries, we're able to take advantage of the hardware features for you. So here you can see that on GV100 we're taking advantage of the 16-bit floating-point performance using our Tensor Cores, and you can see that, automatically, out of the box, you're able to take advantage of that feature on A100 as well. Or, if you have need for different floating-point types, such as TF32, you can utilize that as well and get very, very good performance.
A: Another library of interest would be cuSOLVER. cuSOLVER provides a variety of linear solvers (LU, Cholesky, and QR), as well as symmetric and generalized eigensolvers, and we do provide support for iterative-refinement solvers, which allow you to utilize reduced precision to get full-precision results.
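For a feel of the cuSOLVER dense API, here is a hedged sketch of an LU factorization (getrf) on a matrix already resident on the device; error handling is omitted and the names are invented:

```cpp
#include <cusolverDn.h>
#include <cuda_runtime.h>

// LU-factorize an n x n double matrix dA (device pointer, column-major).
void lu_factor(double* dA, int n) {
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    // Query the workspace size the factorization needs.
    int lwork = 0;
    cusolverDnDgetrf_bufferSize(handle, n, n, dA, n, &lwork);

    double* dWork;  int* dPiv;  int* dInfo;
    cudaMalloc(&dWork, lwork * sizeof(double));
    cudaMalloc(&dPiv, n * sizeof(int));
    cudaMalloc(&dInfo, sizeof(int));

    // P * A = L * U; pivots and the status flag come back on the device.
    cusolverDnDgetrf(handle, n, n, dA, n, dWork, dPiv, dInfo);

    cudaFree(dWork); cudaFree(dPiv); cudaFree(dInfo);
    cusolverDnDestroy(handle);
}
```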
A: So you can actually take advantage, under the hood, of things like the 16-bit Tensor Core units to get very high-speed solvers, but still expect the full 64-bit precision in your results; and we provide support for multiple GPUs as well. I'll point out the automatic use of the MMA instructions for driving the Tensor Cores under the hood, so you don't need to opt into that.
A: And you can see you actually get a pretty significant performance speedup going from V100 to A100; here we're showing 2.3x, and in more recent versions it's actually higher than that. (You can ignore this part at the bottom; it just did not get stripped out of the slide.)
A: For cuSPARSE, if you're doing sparse linear algebra, you can see we have a range of capabilities available to you, and again we get very high performance. This is the speedup using our new generalized version of the solvers versus our previous solvers.
A: You can see more details about that in the release notes. For cuFFT, we do provide support for 1D, 2D, and 3D FFTs, including support for multiple GPUs; across a variety of problem sizes you can see results for one, two, four, and eight GPUs here. And then, more recently, there's support for generalized tensor contractions and reductions using the cuTENSOR library.
A: Now I want to highlight that we've begun supporting multi-node operation within several of our math libraries. One I'll point out here is cuSOLVERMp, which is able to scale not just across the GPUs in a single node but across multiple nodes, up to full system size. That support began in 21.11 and has been available since then; once again, you're running with 21.9 here.
A: For the sake of time, I'll keep skipping ahead. I want to point out that we have a rich set of what I would call core compute libraries for C++. libcu++ (libcudacxx) is a standard-library implementation that you can utilize on the GPU; Thrust is a parallel algorithms library, and the Thrust project very heavily influenced what eventually went into the C++ standard as well; and then there's the cooperative-primitives library, CUB.
A: So Thrust has very high-level classes like the host and device vectors, high-level algorithms like transform, fill, and copy, and then various iterators that you can use; and actually, in some of our early C++ examples, we were using those iterators as well. And then, if you need to do various collective communication patterns within your kernels, you can use CUB, which exposes warp-wide, block-wide, and device-wide primitives.
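A minimal Thrust sketch (illustrative; the functor and sizes are invented) showing the container-and-algorithm style just described:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>

// Functor usable on both host and device.
struct doubler {
    __host__ __device__ float operator()(float v) const { return 2.0f * v; }
};

int main() {
    // Device-resident containers; Thrust manages allocation and copies.
    thrust::device_vector<float> x(1 << 20), y(1 << 20);
    thrust::sequence(x.begin(), x.end());  // x = 0, 1, 2, ...

    // y[i] = 2 * x[i], executed on the GPU.
    thrust::transform(x.begin(), x.end(), y.begin(), doubler{});
    return 0;
}
```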
A: And here I'll point out libcu++. It is in addition to your standard library, which comes with your host compiler; you would still include any of the normal headers, say #include <vector> or #include <atomic>, to get the normal host-side standard template library. libcu++ then provides two interfaces: one that is strictly standards-compliant, a subset of the standard library under the namespace cuda::std; and then we do provide some extensions as well, which live under the cuda namespace. One such extension is the atomic, and I can point you to a presentation that covers the details of that as well.
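A small sketch of the libcu++ atomic just mentioned (illustrative only; the kernel and names are invented, and it assumes a Volta-class or newer GPU, where the C++ memory-model support described earlier applies):

```cpp
#include <cuda/std/atomic>

// A device-wide counter using the heterogeneous atomic from libcu++.
// cuda::std::atomic is the strictly standard-conforming interface;
// the cuda::atomic extension additionally takes a thread-scope
// template argument (e.g., cuda::thread_scope_block).
__device__ cuda::std::atomic<int> counter{0};

__global__ void count_matches(const int* data, int n, int target) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] == target)
        counter.fetch_add(1, cuda::std::memory_order_relaxed);
}
```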
A: Jumping ahead to the communication libraries: just like with the math libraries, we hope to provide you with the right set of communication libraries, optimized for the entire system. That also includes low-latency PGAS (partitioned global address space) programming, with things like NVSHMEM, and then optimized collectives on your system.
A: We have several libraries that provide this within the HPC SDK. HPC-X provides you with a version of Open MPI; we also have support for OpenSHMEM in that as well, plus UCX and SHARP, which are technologies that came from Mellanox. And I'll point out that NVSHMEM is a technology we support for partitioned-global-address-space messaging initiated from the CPU or the GPU. In typical MPI you'd have something like an MPI_Isend and then wait for the request to complete, and you can see that the data has to move...
A: ...through the CPU, out to the network, and back to the GPU. With NVSHMEM, the GPU can actually initiate all of this if you want: you can put messages through the network onto other GPUs, or even get data off of other GPUs, and it also interoperates well with CUDA streams. So this is a programming model that I encourage you to take a look at, and there are a variety of trainings available online for that.
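For flavor, here is a heavily hedged, minimal host-initiated NVSHMEM sketch (illustrative only; it assumes an NVSHMEM installation and its launcher, and the buffer name is invented):

```cpp
#include <nvshmem.h>

int main() {
    nvshmem_init();
    int me = nvshmem_my_pe();
    int npes = nvshmem_n_pes();

    // Symmetric allocation: every PE (typically one GPU each) gets a
    // buffer at the same symmetric address.
    int* buf = (int*)nvshmem_malloc(sizeof(int));

    // Each PE writes its rank into its right neighbor's buffer:
    // one-sided communication, no matching receive required.
    nvshmem_int_p(buf, me, (me + 1) % npes);
    nvshmem_barrier_all();

    nvshmem_free(buf);
    nvshmem_finalize();
    return 0;
}
```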
A: Approaching the end here, I'll point out our developer tools; I know Max has some talks about these coming up. We provide inside our download support for cuda-gdb, which works just like gdb but understands our compilers as well. More recently, we've begun shipping an extension to Visual Studio (excuse me, Visual Studio Code), and then there's also the Nsight Visual Studio Edition as well.
A: We also have profilers. Nsight Systems is a profiler for getting high-level information about how your code is running: are your compute and your data movement overlapped, and things like that. When you want to drill down into individual kernels, we provide Nsight Compute, and you can see here a screenshot showing the roofline analysis of a particular code. And then there's support for NVTX, which is something you will learn about later today.
A: I believe that's the NVIDIA Tools Extension, which allows you to annotate your code, so that you can look at a profile and say: okay, this is time spent in my solver, or this is time spent at other points of the code.
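A tiny sketch of that annotation style (illustrative; the range names and the surrounding function are invented):

```cpp
#include <nvToolsExt.h>

// Hypothetical solver step, annotated so each phase shows up as a
// named range on the Nsight Systems timeline.
void timestep() {
    nvtxRangePushA("solver");
    // ... solver work ...
    nvtxRangePop();

    nvtxRangePushA("io");
    // ... output work ...
    nvtxRangePop();
}
```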
A: Compute Sanitizer is a way to check, if something crashes, why: things like accessing an array out of bounds on the GPU, or using shared memory in an unsafe way. If you're familiar with cuda-memcheck, Compute Sanitizer is the next-generation cuda-memcheck. And lastly, we provide integrations for various IDEs.
A: We do ship Nsight Eclipse Edition and Nsight Visual Studio Edition, but what you may not be aware of is that we also provide support for Visual Studio Code, and that's fairly recent.
A: So, Nsight Systems is our system-level profiler that gives you, again, high-level information about how your program is doing. Nsight Compute is our lower-level tool for really understanding what is limiting your code: is it limited by compute, or is it limited by memory? And you can dig down as far as you want, even to the underlying assembly, if you really want to dig in and understand the performance of your code. Nsight Visual Studio Code Edition here, as I mentioned, is very new.
A: You can find out more information on how to get it here, or search the Visual Studio Code Marketplace, and you can see it provides ways to inspect your variables, look at your registers, and really understand and debug your code directly within your IDE, including support via SSH. Now, I will point out from my experience that the Visual Studio Code remote SSH does not work on Summit, but it will work on x86 platforms.
A: Here's a little bit of an example of cuda-gdb and Compute Sanitizer as well, scanning your code. What I'll point out here is that memcheck is an operation that allows you to look for unsafe memory accesses, memory leaks, and out-of-bounds errors, things like that. racecheck is a tool for understanding, if you're using shared memory, whether you had unsafe memory access patterns to that shared memory.
A: initcheck is one for checking for reads from uninitialized memory, and then synccheck watches for thread synchronization issues. So, as you can see, Compute Sanitizer is quite a bit more advanced than what was available in cuda-memcheck.
A: So I did want to leave just a tiny bit of time for questions; I was not very successful, but I'll put this slide back up and point out the HPC SDK as your means of getting access to everything that I've shown. And I guess there are, you know, two or three minutes left where I can answer some questions.
B: Are there any questions? You could speak up, or type your question in the Slack channel, please.
C: I've noticed that NVIDIA (or maybe not NVIDIA specifically) uses some specific directives for GPU acceleration, like target teams, for instance, and I wonder whether, if you had used those, it would perhaps have made a difference.
A: So, I don't know whether there is a version of LULESH that supports the target offload directives. There may be; given that it came from Lawrence Livermore, there very likely is. But this is the baseline code that was provided to us, which is the CPU-threaded one. It would certainly be possible to write a target teams distribute here and here and offload these loops.
A: In that way, I would not expect the performance to be as good, because of various overheads associated with doing that; but functionally it would certainly be possible to write such a code. We believe that a better approach is to use the standard C++ rather than relying on OpenMP as an additional API.
D: Hi Jeff, this is Zanzi. I have a question about your cuSOLVERMp implementation, and also you have a cuFFTMp. For the cross-node communication, what kind of library did you use?
A: I believe it's based on libfabric; I would have to confirm that. I do know that it has been tested on both Perlmutter and Summit, so we do know that it works on those. I think it uses libfabric, where each of the vendors is able to implement their own layer underneath.
D: Also, I have another question. You have shown many performance results where your tools do much better than, say, OpenMP or some offload results. When you do this standard language parallelism implementation, did you still use CUDA or something else? I'm just curious how your performance is that much better than the other approaches.
A: So, in both of these cases, no CUDA C++ is used. If you were to rewrite it in CUDA C++ or CUDA Fortran, there are some optimizations you can accomplish there that can't be done in standard language parallelism. So I would actually expect that it would be possible to tune this even further. But of course the trade-off there is in portability.
A: If your goal is to write something that you can bring to new platforms right away and expect to run in parallel right out of the box, standard language parallelism provides that. And actually, a better slide to show you would be this one.
A: Each of these codes can run out of the box on multicore CPUs or on GPUs, but if my goal is the absolute best possible performance, I would write my code with this on the right. Now, the thing I should emphasize here (I should have emphasized it in the beginning) is that you don't have to choose one of these, and you're not wed to one approach: all of these approaches compose with each other. So I can start with, you know, Fortran do concurrent...
A: ...I can selectively optimize with directives if necessary, or, if I have a portion of my code that is so performance-critical that it absolutely needs to be written in CUDA, I can do that, and all of those work together. So this is not a pick-one-and-stick-with-it situation: this is your baseline starting point, and this on the right is your absolute best performance.
B: A few more minutes; if the time runs out, I will just ask you to answer the rest in Slack. So, one question from Philip Thomas is: what is the state of the CUDA Graphs feature, particularly with respect to standard programming for C++ and Fortran? Is this feature under active development?
A: So, CUDA Graphs absolutely continues to be in active development. Right now we don't have any defined interactions between standard language parallelism and CUDA Graphs.
A: That's something that we could explore in the future, but right now they are two separate approaches. For those of you who aren't familiar with CUDA Graphs, the basic introduction would be: if you have a series of data transfers and GPU kernels that you call repeatedly within your application, rather than going through your time-step loop issuing your memory requests and then kernel after kernel after kernel, each of which has various launch overheads...
A: ...you can capture all of those into a graph that you basically pre-compile and just relaunch over and over again on the GPU, and that takes away various overheads. So we don't currently have an interface to CUDA Graphs within the standard languages; it is something that does require you to use the specialized CUDA C++.
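For those unfamiliar, the stream-capture flow he describes looks roughly like this in CUDA C++ (a hedged sketch; the kernel and names are invented, and error checking is omitted):

```cpp
#include <cuda_runtime.h>

__global__ void step_kernel(float* data, int n);  // assumed kernel

// Capture a repeated sequence of kernel launches into a CUDA graph,
// then replay it each timestep with a single launch.
void run_with_graph(float* d_data, int n, int nsteps) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    // Work issued to the stream here is recorded, not executed.
    step_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    step_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    for (int t = 0; t < nsteps; ++t)
        cudaGraphLaunch(exec, stream);  // one launch replays the sequence
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}
```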
B: Thanks, Jeff. Here's another question in the Zoom chat; I'll read it to you. It's from Ko Hearn: how does performance compare between CUDA and standard C++ and Fortran codes? That is, in the past, writing CUDA code was often the suggestion to achieve the best performance possible. Is this still the case?
A: So, if your goal is the best performance on the GPU that you have, CUDA C++ or CUDA Fortran is the way to accomplish that. There are low-level hardware features that we can't expose in standard C++, or that may take many years to get exposed in the standard. So if you really want to tune for the best performance on the GPU you have, you would write it in CUDA C++ or CUDA Fortran. Of course, the trade-off...