From YouTube: 10. CUDA C++ Basics
Description
Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/
So in this last session, what I would like to do is say a little bit about CUDA. We decided to do CUDA last because we wanted to demonstrate that there are plenty of ways to use the GPU that don't require you to drop down to a lower level. The main distinction between CUDA and the programming models we've discussed so far is that in CUDA you're expected to put in a little more effort as the programmer, mapping parallel threads to items of work. Think about the models we've discussed so far: parallelism in the standard languages, as well as OpenMP and OpenACC.
These models didn't require you to explicitly map parallelism to threads. What you were doing was telling the compiler what work you wanted to do and where you wanted it to happen, and the compiler figured out how to map that to parallel threads on the device. As Brent just discussed, you can take a little more control over this process as an optimization by telling OpenMP or OpenACC how many threads to use per gang or per team.
You can also tell it how many teams or gangs to use, but you aren't required to, and in any case you still aren't required to figure out what those teams and threads are doing; you're just telling the compiler how many to use as a tuning exercise.
That being said, CUDA does require a little more effort than the models we've discussed so far. I'm only going to talk about CUDA as it applies to C and C++ in this presentation, and I do call it CUDA C++ because technically that's what it is; technically there is no such thing as CUDA C. But for the most part you can use standard C in CUDA, as long as you stay within the subset of C that doesn't conflict with C++.
There are other languages that have bindings to CUDA. The most popular ones today are Python and Fortran. Both of these give you a way that is either Pythonic or Fortran-like to write CUDA, and both are models supported by NVIDIA. I'm not going to talk about them today, but if you do see CUDA Fortran or CUDA Python out in the wild, you'll be able to recognize it based on what you know of CUDA C++.
Now, the fundamental difference between the way CUDA exposes parallelism on the GPU and the previous models we've talked about is that CUDA explicitly requires you to identify kernels. Remember, the kernel is the fundamental unit of work on the GPU: the discrete bit of work which you launch on the GPU, which runs for some time and then completes, and in any given program you'll probably have many kernels that you run. In the context of a model like OpenMP or OpenACC, the compiler identified those kernels for you.
A function that can run on the GPU in CUDA has to be marked up with the keyword __global__, with two underscores on either side, as I've indicated here. This is the only way in CUDA to launch a function on the GPU, or the device as I'm calling it here; that's just jargon. It runs on the device and is launched from the host, that is, from the CPU. And that, by the way, is how these other models work under the hood at the end of the day, regardless of whether you're using OpenMP or OpenACC.
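As a minimal sketch of what that markup looks like (the kernel name and body here are placeholders, not taken from the slides):

    __global__ void mykernel(void) {
        // body of the kernel: this code runs on the GPU (the device)
    }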
So __global__ is a CUDA-specific keyword; it is an extension. It is not a standard C++ keyword, but it is a keyword you can apply in front of a function definition in order to make it a kernel that can run on the GPU. CUDA, because it is an extension to C++, is not going to be understood by every compiler.
NVIDIA provides nvcc, the CUDA compiler, which can parse this. It looks for any functions, or kernels, defined in the source code that have this __global__ attribute.
Okay, sorry about that. So, right: device functions, kernels that have this __global__ keyword, are processed by the device, or GPU, part of the compiler toolchain, and everything else that is standard C or C++ is compiled by your standard host compiler. So on Linux, it would be g++ that compiles that part of the code.
But these triple angle brackets, or triple chevrons, with a pair of numbers separated by a comma between them, are CUDA-specific syntax, which means that this function call is actually launched from the CPU onto the GPU. So any place you see these triple chevrons with a pair of numbers in between, it means: I want to launch a CUDA kernel. I want to launch a kernel on the GPU; that is, I want to actually start executing GPU work.
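For example, a minimal launch of the placeholder kernel above might look like this (one block and one thread, just to show the syntax):

    int main(void) {
        mykernel<<<1, 1>>>();     // execution configuration: 1 block, 1 thread per block
        cudaDeviceSynchronize();  // wait for the GPU work to finish
        return 0;
    }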
That means whatever the contents of my kernel are will run on the GPU, and there will be a single thread that executes it. Based on the GPU architecture discussion that Brent led yesterday, hopefully you understand already that if you are ever in a situation where you're running a kernel that launches only one thread, you're very likely using the GPU ineffectively from a performance perspective.
So how can we write a CUDA kernel that adds one vector, or array, to another? Here is what our add function will look like: it has this __global__ keyword; otherwise it's a standard C function definition. I'll describe what the new syntax means in a second, but fundamentally it will look like writing b at some index equals b at the same index plus a at the same index. So this is just adding the array a to the array b.
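A sketch of that kernel, using one block per element as described below (the names a and b follow the talk; the rest is my assumption about the slide):

    __global__ void add(int *a, int *b) {
        // each block handles one element, selected by its block index
        b[blockIdx.x] = b[blockIdx.x] + a[blockIdx.x];
    }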
So instead of having an explicit for loop, we have, in some sense, an implicit for loop, where the body of the function is run many times, and the index of any particular block can be read in the body of this function using this blockIdx.x variable. This is defined for you by the compiler and the runtime; it's part of CUDA.
A
It's
not
a
standard
c
variable.
But
if
you
do
b
of
block
index.x
equals
b
of
block
and
x
dot
x,
plus
a
of
block
index
set
x,
then
that
will
do
the
corresponding
update
of
a
given
element
of
these
of
this
array
b
at
the
index
corresponding
to
block
index.x,
which
will
be
unique
for
every
block.
That
is
running
the
body
of
this
function
and
it
will
be
zero
indexed,
at
least
in
c.
So there will be block indices from zero to N minus one, where N is the number you gave in the execution configuration. Whatever number N I choose in the execution configuration, there will be that many blocks running the body of this function, and if the length of the arrays, or vectors, is N, then this gives us exactly the number of items of work that I need to perform.
So here's what a main would look like if I am calling this function, or launching this kernel, on the GPU. I would define some arrays a and b, and then I would have to allocate memory for them. The way we're going to choose to allocate memory in this example is with the memory allocator cudaMallocManaged. cudaMallocManaged means:
A
I
want
to
allocate
the
array
a
and
b
to
have
memory
that
can
be
accessed
on
either
the
cpu
or
the
gpu.
We
talked
about
this
a
little
bit
yesterday.
This
is
the
same
strategy
that
we
use
for
standard
language
parallelism
and
can
often
use
for
openmp
and
openhc
as
well.
If
we
want
to
where
we
allocate
memory
that
is
accessible
on
either
the
cpu
or
the
gpu
and
wherever
it's
accessed
it
will
automatically
migrate
to
in
order
to
be
used
there.
A
It
is
a
little
bit
different
syntax
from
malek,
because
you
give
it
the
address
of
the
pointer
rather
than
setting
the
pointer
to
the
return
value
of
the
malloc
call,
but
it
fundamentally
does
the
same
thing
where
it
allocates
a
size
in
bytes
and
then
updates
the
pointer
to
that
location
to
that
allocation.
Once we've allocated memory, that memory can be accessed on either the CPU or the GPU. We're going to fill the arrays a and b with some data, perhaps random data, and then we're going to launch our kernel on the GPU with N blocks.
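Putting that together, a minimal sketch of the host code might look like the following (the value of N, the initialization loop, and the omission of error checking are my assumptions; the kernel is the one sketched above):

    #define N 512

    int main(void) {
        int *a, *b;
        size_t size = N * sizeof(int);

        // allocate memory reachable from both the CPU and the GPU
        cudaMallocManaged(&a, size);
        cudaMallocManaged(&b, size);

        // fill a and b with some data
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        // launch the add kernel with N blocks of 1 thread each
        add<<<N, 1>>>(a, b);
        cudaDeviceSynchronize();  // wait for the GPU to finish

        cudaFree(a);
        cudaFree(b);
        return 0;
    }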
Now, a block can also be split into parallel threads. We could update the add function to use parallel threads instead of parallel blocks, and the body of the function would look very similar, except now the index we use to determine which element of the array b we're going to update is based on threadIdx.x.
If I want to use threads within a block as the level of parallelism, then I would write 1 comma N instead of N comma 1 in the execution configuration that launches my parallel add function.
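As a sketch, the thread-based variant (same assumptions as the earlier sketches) would look like this:

    __global__ void add(int *a, int *b) {
        // each thread within the single block handles one element
        b[threadIdx.x] = b[threadIdx.x] + a[threadIdx.x];
    }

    // launched with 1 block of N threads:
    // add<<<1, N>>>(a, b);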
Now I'm going to talk about combining both blocks and threads within a block, a two-level, hierarchical parallelism, for doing this vector addition. I will explain in a second why we might do this. Why would we make it more complicated this way when either of the previous approaches I've discussed is valid? I'll come back to that. First, I want to say something about indexing of data.
Every block has a unique index from zero to N minus one, and every thread within a block has a unique index from zero to M minus one, where M is the number of threads per block; that thread index within a block is then replicated across all blocks. So if we look at block index 0, it's going to have threads 0 through 7; block index 1 will also have threads 0 through 7, and the same for block index 2 and block index 3.
So if we take an offset corresponding to the block we are in, which reflects how many threads there were in the grid prior to this block, and combine that with an offset within the block, that gives us a unique index within the grid. Again, you get threadIdx.x plus blockIdx.x times M, where M is the number of threads per block.
So with that in mind, let's highlight an element in the grid in red and ask the question: which thread in the grid would operate on that element of the array?
We advance through by threads until we reach the target element. That corresponds to blockIdx.x equal to 2, because that's the block where this red element falls, and within this group of threads it's the thread with index 5 that happens to correspond to array index 21.
If we then do the math, we can verify that if we set threadIdx.x equal to 5, take blockIdx.x equal to 2, multiply the block index by the number of threads per block, and add the two, that should equal the desired target slot in the array, 21, and in fact it does. If you do the math here, you'll see that it equals 21 as desired.
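Spelling out that arithmetic, with M = 8 threads per block as in the figure:

    index = threadIdx.x + blockIdx.x * M = 5 + 2 * 8 = 21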
Okay, so we've now written down an algorithm for how we identify which parallel threads map to which elements of the work, or of the array, and we said the last piece of information we need, other than the thread index and the block index, is how many threads there are per block. CUDA also provides a built-in runtime variable, blockDim.x, which gives the number of threads per block.
With this idea, I can update the definition of my add function to combine both parallel threads and parallel blocks. I'll define a unique index within the grid, index = threadIdx.x + blockIdx.x * blockDim.x, and then I will say b[index] = b[index] + a[index]. So it's the exact same code we used before; we're just defining the index a little differently, combining both parallel threads and parallel blocks to define the index into the array that we will use.
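A sketch of that combined version (same caveats as the earlier sketches):

    __global__ void add(int *a, int *b) {
        // unique index within the whole grid
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        b[index] = b[index] + a[index];
    }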
The number of threads per block is closely related to the number of threads per team or per gang that were launched in the directive-based models, and there are some values that are better than others; for example, 128 is a pretty good value. But it's a value that ultimately you control and have to define for yourself; the compiler does not help you.
Of course, you'd want to make sure you do a ceiling division in case N does not divide evenly by the threads per block; the example in the repo that you'll do later for the hands-on covers that. But fundamentally that's what we do.
Now notice how this is different from the previous models we showed. You are now telling the GPU exactly how many blocks and threads you want it to run, and you are required to do the work of mapping those blocks and threads to the work that you want done. CUDA doesn't have any guardrails in this sense; you are in control of exactly what the GPU does, which gives you a lot more power, but also a lot more responsibility to get it right.
So generally, when you write safe code, you'll want to put in an if condition which says: only if index is less than n, the number of elements in the array, do I make any modifications to the array. That often requires you to pass the length of the array as an argument to the kernel, to make sure you don't run off the end of it, because in C
we don't have that length information from a bare pointer unless we provide some additional context to the function. And so this is the updated ceiling division, or rounding-up division, or at least one way to do it, which guarantees you will always have enough threads to do the work: you compute (N + M - 1) / M, where M is the number of threads per block. That guarantees that you divide by M but round upward, so you'll always have at least as many blocks as needed to do all of the work.
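Putting the bounds check and the rounded-up launch together, a sketch might look like this (THREADS_PER_BLOCK is a name I'm assuming; pick whatever value you tune to):

    #define THREADS_PER_BLOCK 128

    __global__ void add(int *a, int *b, int n) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if (index < n)                      // guard against running off the end of the arrays
            b[index] = b[index] + a[index];
    }

    // launch with enough blocks to cover all n elements, rounding up:
    // add<<<(n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(a, b, n);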
And then you will be good. There will potentially be more threads launched than actually needed.
But that's okay! There probably won't be that many if n is sufficiently large; one block at the end of the grid will have a few idle threads, which is not the end of the world. So this is, in general, a safe and generic way to write your kernel launches, and the way that professional CUDA code tends to be written.
So why do we have both blocks and threads? I can answer that in two ways. One of them is to say that threads within a block have direct mechanisms to communicate and synchronize with each other, in a way that blocks don't have as easily, and that gives you a way to make interesting new kinds of algorithmic choices that rely on explicit communication or synchronization between threads, which are not available to you if you only use blocks. Of course, you could just flip the question around: well then, why do we have blocks at all?
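As a rough illustration of the kind of mechanism meant here (a sketch of mine, not from the slides, assuming the kernel is launched with 128 threads per block): threads in the same block can share data through __shared__ memory and synchronize with __syncthreads(), for example in a block-local sum.

    __global__ void block_sum(int *in, int *out) {
        __shared__ int tile[128];                 // memory visible to all threads in this block
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        tile[threadIdx.x] = in[i];
        __syncthreads();                          // wait until every thread in the block has stored its value
        if (threadIdx.x == 0) {                   // one thread then sums the block's values
            int sum = 0;
            for (int t = 0; t < blockDim.x; t++) sum += tile[t];
            out[blockIdx.x] = sum;
        }
    }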
That being said, it doesn't require that much extra work, right? We've seen that in many cases you can pretty much ignore the hierarchy, write the kernel as if there's a single dimension of parallelism, just get a single unique index in the grid, and use that to do some work. But the fact that there is a two-level hierarchy is relevant both for performance and for doing certain kinds of algorithms, and it is ultimately the same reason why OpenACC has gangs and also vector lanes within a gang.
It reflects the same design pattern in how modern parallel processors work, and so it is relevant from a performance perspective. I'm going to stop there, because you could go on about CUDA for a long time, and I really just wanted to give you a flavor of what goes on in CUDA. I will leave some links here to introductions to CUDA if you want to learn more, and in particular there's one resource that I strongly recommend.
If you want to learn more: we did an extensive training series, in conjunction with Oak Ridge and later NERSC and Argonne, on introduction to CUDA, and it's probably the best resource on the internet for learning CUDA. I'm not tooting my own horn there, because I'm not the one who gave the presentations, but my opinion is that it's one of the best sets of public presentations there are on CUDA, a great resource for you to learn from, and in fact the slides that I'm using today are adapted from that training series.
Thanks a lot, Max. Yes, I concur that the CUDA training series is very comprehensive, so it's good if we have a chance to review it, and there are hands-on exercises there as well. But Max's 30-minute training with the hands-on today will also be very helpful.
Should I determine the number of blocks based on something I know about the hardware? The answer is yes, you should. Generally speaking, you want to think about how many multiprocessors there are on the GPU and have at least as many blocks as there are multiprocessors, because ultimately every block maps to one of those multiprocessors. Brent also described yesterday the fact that the multiprocessors can themselves do parallel processing and be working on multiple blocks at a time, so actually you'll do even better than that.
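If you want to base that choice on the hardware programmatically, one way (a sketch using the standard CUDA runtime query, placed anywhere in your host code) is:

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);       // properties of device 0
    int num_sms = prop.multiProcessorCount;  // number of multiprocessors on the GPU
    // aim for at least num_sms blocks, and typically several blocks per multiprocessor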
There is also a much, much higher limit on how many blocks can be queued up at one time, which most people will never hit in practice because it's basically a billion. So it is worth knowing something about the hardware when tuning that choice; however, you're not required to, and you can choose any number up to the maximum limit that CUDA allows and get correct execution, as long as you write semantically correct CUDA code.