Description
Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/
C++ standard parallelism was introduced in C++17. It recently saw an extension in C++20, and more features are in the pipeline for the future C++23 and onward. As Jeff mentioned in our overview talk, this has been a very important feature in the standard, and it has been an investment over many years.
I would like to briefly mention some of the classic algorithms, such as sort, reduce, transform, and for_each; most are pretty self-explanatory. If you have not seen them, or are not too familiar with the algorithms of the C++ standard library: usually they take iterators and then do whatever the function name says, like sort. for_each is going to help us a lot in this case, because I think it is an easy way to visualize parallelism. for_each applies a function, given as an argument, to each element in the range defined by the iterators. You can think about it as if I handed out different elements of my container without any control over the order in which they execute.
If you use the parallel execution policy, how would I need to structure my algorithm, which is typically a lambda in this case, or a function object? It has to abide by the rules of good parallelism, as Brent mentioned, such as no data races, and that's just something to keep in mind as we go forward. Brent mentioned this as well, but I'll repeat that NVC++ standard parallelism also uses CUDA Unified Memory for memory management, meaning it is handled by the compiler.
Okay, if you are a C++ developer, you may have used cppreference in the past, or maybe every day. I pulled a screenshot to show that, after this talk, if you want to explore the options that stdpar enables, you look for declarations that take a universal reference to an execution policy as a parameter. You can see I boxed this one out, and you can also see that it is marked "since C++17", because that is the version of the standard it was introduced in.
Nvidia does not own this content, so it's not a replacement for the actual NVC++ documentation or anything. Below is a simple, semi-complete example with for_each using the parallel execution policy: we have a vector, and we are performing an algorithm on each element in that vector, which will bring the work and the data over to the GPU for execution and then bring it back to the CPU after it's done. Just a note for the future, when you want to give this a try: you'll want to #include <algorithm> and #include <execution>.
Just a few details to help you get started. When using the parallel execution policy, make sure there are no data races or deadlocks, as Brent mentioned, because this is left up to the programmer to handle. When you declare std::execution::par or std::execution::par_unseq, the two parallel options.
You are saying that there is no dependency between iterations, and the compiler just trusts you. And, as we mentioned, stdpar uses CUDA Unified Memory to handle the data transfers between CPU and GPU for us. Just a note that Unified Memory requires data to reside in heap memory: this means std::vector is all good, but std::array would not be, because it resides on the stack. And two quick notes: functions referenced do not need to be given the __device__ annotation like they are in CUDA C++.
If you were familiar with that: the compiler just automatically goes through the call stack and handles this for you, like it does the data. Also, execution on the GPU requires random access iterators, not forward iterators. And to compile using stdpar with our NVC++ compiler, use the -stdpar flag, and we have two options with that: you can use -stdpar=gpu, which is the default, or -stdpar=multicore.
And this is a very simple workflow that we can walk through. Imagine that we have a problem: there's a vector I want to sort. You can see we have a vector called vec1, and just for this example it has 10 unsorted ints. Our solution is simply to apply the standard algorithm std::sort to the vector, and you can see that we're all sorted out below. stdpar comes into play on the far right, as a potential performance improvement, very easily.
We add std::execution::par to our function call, and then later we remember to compile with the -stdpar flag, and that's pretty much all you need to get this code and this data onto the GPU. I have noted "potential" improvement here, because I will say that if you give a vector of 10 ints to a GPU and that's it, you won't exactly be breaking light speed.
If you think about it, the speedup would need to be at least worth the cost of the data transfer, as Brent touched on a little bit. But fortunately, we can just apply the same basic knowledge that Brent provided, and that we'll touch on throughout these two days, of what makes a good GPU program, and make sure our code follows those same guidelines. Just generally, there needs to be plenty of work and data to keep the GPU happy, utilizing the hardware, in order to see a performance increase.
And Jeff already gave a great overview of the cool work that the professor has done with the lattice Boltzmann simulations using stdpar. I just wanted to advertise it one more time, because the GTC talks are really good and I reference them quite a bit, and they're just down here at the bottom.