From YouTube: Intro to CUDA programming
Description
Dossay Oryspayev (LBNL), Muaaz Awan (LBNL), Hugo Brunie (LBNL), & Michael Rowan (LBNL) present a tutorial on Intro to CUDA programming.
A: The hardware scheduler schedules warps of threads, that is, chunks of 32 threads, onto the hardware. As we saw in yesterday morning's talk in the first session, unlike a CPU, a GPU is a latency-hiding device, and to optimally exploit the massively parallel architecture of a GPU, it needs to hide latency. To better understand what latency hiding is, let's have a look at this figure. Let's assume that at cycle one our device had three available warps that could be launched.
A: The scheduler will pick any of them and launch it. Now let's say the first warp is picked and launched, and it makes a memory request which is going to take two cycles to process. That means that in the next cycle that warp won't be able to progress, so what the device does is pick up another available warp and launch that, and the process continues until the memory request from the first warp is served and it is able to go forward again, now in cycle four.
A: If we had another available warp at that point, it would be kind of a random pick between warp one, which is also ready to go forward, and that other available warp. So what we can take away from this figure and the concept of hiding latency is that the more work the device has available, the better it will be able to hide latency. Now imagine if, after cycle one, we did not have another warp available for the next two cycles: your device and your resources would have been sitting idle. So, to make better use of resources, you want as many warps resident as possible.
A: The device supports a maximum of 64 resident warps per SM, which translates to 2048 resident threads per SM, because each warp consists of 32 threads, and if you take the product of 32 and 64 you get 2048. But there is another limit, the maximum number of blocks: the way these threads are divided across blocks is up to you, but the number of concurrent blocks per SM cannot be more than 32 for a V100 device.
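To make that arithmetic concrete, here is a minimal sketch (not from the tutorial repo; the kernel name and block size are illustrative) of how those per-SM limits constrain a launch configuration on a V100-class device:

```cuda
#include <cstdio>

__global__ void dummyKernel() { }

int main() {
    // Per-SM limits quoted above (V100-class device, assumed):
    const int maxWarpsPerSM   = 64;
    const int threadsPerWarp  = 32;
    const int maxThreadsPerSM = maxWarpsPerSM * threadsPerWarp; // 64 * 32 = 2048
    const int maxBlocksPerSM  = 32;                             // concurrent blocks

    // How the 2048 threads are divided across blocks is up to you, but with at
    // most 32 resident blocks, each block needs at least 2048 / 32 = 64 threads
    // for the SM to reach its full complement of resident threads.
    const int threadsPerBlock = 128;
    printf("%d blocks of %d threads fill one SM (block limit: %d)\n",
           maxThreadsPerSM / threadsPerBlock, threadsPerBlock, maxBlocksPerSM);

    dummyKernel<<<maxThreadsPerSM / threadsPerBlock, threadsPerBlock>>>();
    cudaDeviceSynchronize();
    return 0;
}
```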
A: So that was all about kernel configuration, and you will experience these concepts in the exercises that we have for you. To top this off, let's add another concept: memory coalescing, and this is one of the more important ones.
A: So, global memory accesses from the device are serviced in the form of memory transactions of size 32 bytes. That means that even if your thread is going to access a four-byte number, that is, an integer, whenever it accesses memory the access will be processed in the form of a 32-byte transaction.
A: A good programming practice is therefore to ensure that consecutive threads inside a warp access memory locations which are close to each other, contiguous memory locations; that is going to bundle up all the memory accesses of that warp into the least number of memory transactions possible.
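As a sketch of the difference (these kernels are illustrative, not the exercise code), compare a coalesced access pattern with a strided one:

```cuda
// Coalesced: consecutive threads of a warp read consecutive 4-byte floats, so
// the warp's 32 loads are served by a handful of 32-byte transactions.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads are `stride` elements apart, so each load can
// fall into its own 32-byte transaction and most of each transaction is wasted.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```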
A: So the top figure is what you want to do, and the bottom figure is what you don't want to do. To try these concepts out, you might want to move into the section two folder in the GitHub repo that you have and open up the README file. The README contains all the details of the exercises: how to build and run, what parameters to observe, and how to change them.
A: The first exercise is the vecAdd kernel, and the kernel for the next exercise, the memory exercise, is the vecAdd memory kernel. You can build the vecAdd file with a simple make command, and you can run it using sh run.script. For the second exercise, the memory exercise, you might want to use the other script, transactions.sh. The details about when to use which script and how to build are all in the README file.
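The build-and-run sequence described here would look roughly like this (folder and script names as given in the talk; check the README for the exact layout):

```sh
cd section2            # the section two folder of the tutorial repo (name assumed)
make                   # build the exercises
sh run.script          # run the first (vecAdd) exercise
sh transactions.sh     # run the memory-coalescing exercise
```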
A: So I will be here if you have any questions; you have about the next 25 minutes for this.
B: So session three is about debugging a CUDA code, because, yeah, we all know that coding is not just writing some lines of code and then getting great speedup; it is also about debugging the program, which always happens.
B: So what are the tools to debug? Actually, Jonathan already gave a presentation about this this afternoon, so if you remember it, this will be very easy for you.
B: Basically, printf is the one you go to when you don't have other possibilities, or when you think you can debug something very, very quickly. You do need to be careful on CUDA, though, because when you put a printf in a kernel, it's not one, two or eight threads that execute the printf; it can be several hundreds or several thousands of threads, so your terminal can be overwhelmed.
B: Yeah, there can be too many outputs in the terminal. So what you basically do is just put a conditional branch on the thread ID; here, in a one-dimensional kernel with one-dimensional blocks, we pick the master thread.
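A minimal sketch of that guard (kernel and arguments are illustrative): without the condition, every thread would print; with it, only the first thread does.

```cuda
#include <cstdio>

__global__ void myKernel(const float* data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard the printf so only the "master" thread of a 1-D kernel with
    // 1-D blocks produces output, instead of thousands of threads at once.
    if (tid == 0) {
        printf("n = %d, data[0] = %f\n", n, data[0]);
    }
}
```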
B: Another way to debug is to use cuda-gdb. It's very similar to the gdb you are used to using on a classic CPU: bt for the backtrace, for example.
B: We won't go into the details of them, but know that they exist, and if you want to be sure your program is valid, like when you use valgrind on a CPU code, you can use this tool on the CUDA code; and TotalView is, to me, the equivalent of DDT.
B: You will use the -G flag (hyphen, capital G) or -lineinfo to get the line information for the debugger, and -rdynamic gives you the symbol information on the CPU side; you must put -Xcompiler before the -rdynamic to tell nvcc that this option is for the host compiler. How will you use this cuda-gdb? You will use it by putting breakpoints, for example with b.
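Put together, a debug build might look like this (file names are illustrative):

```sh
# -G compiles device code with debug information (use -lineinfo instead for a
# lighter build that only maps instructions to source lines); -Xcompiler
# forwards -rdynamic to the host compiler for CPU-side symbol information.
nvcc -G -Xcompiler -rdynamic -o myprog myfunc.cu
```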
B: b myfunc.cu:48 means I put a breakpoint at line 48 of the file myfunc.cu, and then I run the program with run and it will stop at this breakpoint if it reaches it. You can print values: p var, or p array[0]@10.
B: That prints the first 10 elements of the array. You can control the execution, like I said, with run, next, step and continue. One thing to note is that there are no watchpoints possible in cuda-gdb, so, yeah, if you were used to using a watchpoint in CPU gdb, you cannot here. The changing of context is also a bit more complex than on CPU: when you want to change threads on CPU, it is straightforward.
B: You just give the ID of the thread. Here you have a three- to four-dimensional ID, so you can specify the ID of the thread by its device, SM, warp and lane; this is the hardware way of giving the coordinates of the thread. And there is a software way of doing it with block and thread; you can notice here that block and thread are each triples of integers, because these can be three-dimensional.
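A sketch of such a session (program name and coordinates are illustrative; the commands themselves are standard cuda-gdb):

```
$ cuda-gdb ./myprog
(cuda-gdb) b myfunc.cu:48             # breakpoint at line 48 of myfunc.cu
(cuda-gdb) run                        # stops at the breakpoint if it is reached
(cuda-gdb) bt                         # backtrace, as in plain gdb
(cuda-gdb) p var                      # print a scalar
(cuda-gdb) p array[0]@10              # print the first 10 elements of an array
(cuda-gdb) cuda device sm warp lane   # show the current focus, the hardware way
(cuda-gdb) cuda block (1,0,0) thread (3,0,0)   # switch focus, the software way
```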
B: To execute, you will obviously need to have allocated a GPU node; it must be already done. Then you srun your code; don't forget to add the --pty, which allows you to execute cuda-gdb interactively, and then --args if you have arguments for your program; if your program doesn't have arguments, you are not obliged to add --args.
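The launch line would look something like this (the executable name and arguments are placeholders):

```sh
# --pty gives an interactive terminal for the debugger; everything after
# --args is the program and its arguments (omit --args if there are none)
srun --pty cuda-gdb --args ./debug_printf arg1 arg2
```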
B: In this session you will have two files: debug_printf.cu and the other one, debug_memcheck.cu.
B: You will go through them one by one. You will have to modify these files, even if just a bit; the .hpp you don't have to modify. To compile, just use make. First you run the code with srun on the first file, and you will get the results on the left, with the correctness test that fails; your goal is to get to the result on the right, where the correctness test passes. For that, you can help yourself with what is printed on the output by the printf.
B: That's why we are saying that we're debugging with printf: the idea is to understand what is going on here and the way the test is failing.
B: Other commands you will have to use with cuda-gdb are, for example, setting a breakpoint with b and the name of a function.
B: I thank [name unclear], who made some very similar slides on debugging on GPU in February 2020. So now you can go ahead and start the exercise, and if you have any questions, we're here. Good.
C: Yeah, it's good. All right, so welcome, everyone. This is session four of the CUDA tutorial; we'll be introducing here some of the NVIDIA profiling tools that you've heard about in many of the talks today and yesterday.
C: So we'll just start by throwing up this diagram that you've seen a couple of times already; this is another incarnation of the optimization workflow diagram.
C: You might follow a process like this: you profile your application and collect some data, then you analyze this data and try to identify bottlenecks, or kernels that aren't behaving as you would like them to, and then you tweak things in your kernel and see whether the things you've changed are actually changing the application behavior in the desired way.
C: Two tools can really help you with these two steps, profiling your application to collect data and then analyzing it: Nsight Systems and Nsight Compute. The tools have somewhat different scopes: Nsight Systems can give you a cohesive picture of how your application is interacting with the various system resources available to it, while Nsight Compute is more of a targeted analysis tool that can tell you about the performance of individual kernels.
C: So first, this is a very, very quick introduction to Nsight Systems; we'll say more about these various timelines in a hands-on demonstration in a moment, but the three main timelines here are the CPU workload timeline, the OS threads timeline and the device timeline.
C: All right, I'm going to keep this view just so that you can actually see where my cursor is. So there's a timeline here that shows you the workload on your CPU cores; I'm pointing to this very thin black line.
C: In the GUI you can expand this range so you can see it more clearly, but this shows you the CPU utilization. There's another timeline that shows you how the OS threads of your application are interacting with the CPU resources and also with the GPU, and the different lines here indicate different things: the black line is telling you the CPU core utilization, and then there's the line below it; it's a little small here.
C: Let me zoom in. So there's a red bar here; if you hover over it in the application, you'll see a tooltip that will tell you which core it actually corresponds to. So the core utilization is in black, this red bar corresponds to a particular core, and below it you see the thread state: whether it's active or scheduled or stalled, things of that nature. This thread timeline in Nsight Systems has support for several APIs.
C: So you can trace commands from the CUDA API or OpenACC and several others; I don't have the full list, but there are many others that are supported. For this tutorial we're using CUDA, so you'll be able to see how these commands from the API, like the cudaDeviceSynchronize which you see in this example, are called from one of your threads.
C: The final timeline here is the device timeline, and this tells you about memory operations and the compute workload on the GPU.
C: For this example we have a Tesla V100, and there's some blue here, the height of which tells you the kernel coverage over a given time. So that is a basic orientation for Nsight Systems. One really useful feature is that you can click on things from one timeline: say, from the device row you could click on this blue bar, which again corresponds to a kernel that's running on the GPU, and then you can see where the kernel launch was called in one of your threads.
C: So here you can see additional information about the launched kernel, like the begin time, the end time and the stream that it ran on.
C: Also, the streams are available in this drop-down menu under the device row, and this sort of analysis is useful because it tells you the latency between when you launch the kernel and when the kernel is actually running on the GPU.
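That latency is visible because kernel launches are asynchronous on the host side; here is a minimal sketch (names illustrative) of the pattern whose two halves you see correlated in the timeline:

```cuda
__global__ void work(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void step(float* d_x, int n) {
    // The launch call returns immediately; the kernel starts on the GPU some
    // time later, and that gap is the launch latency seen in the timeline.
    work<<<(n + 255) / 256, 256>>>(d_x, n);
    // cudaDeviceSynchronize blocks the CPU thread until the kernel finishes,
    // which is why it appears as a long range on the OS-thread row.
    cudaDeviceSynchronize();
}
```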
C: So that is Nsight Systems. And then this is a screenshot from Nsight Compute.
C: There's this bar chart that can tell you about the SM utilization and also the memory bandwidth utilization, and these values are phrased in terms of something called a "speed of light" value. Exactly what that means is hardware-dependent, but it's meant to indicate a fraction of the peak performance on that hardware. So this SM utilization, I guess, looks like it's around three percent or so.
C: That means you're doing three percent of the compute workload that would be possible on this GPU if the hardware were being used at peak capacity, and there's a similar meaning for the memory bandwidth here. Also, if you were to click this "apply rules" here, it will automatically generate some tips about possible performance bottlenecks that are recognized according to some heuristics; it can tell you things that you should take a closer look at to improve the performance. And this is actually a really realistic case of what you might see if you're looking at an application and you just open up some random kernel: it's quite common that your kernel is neither memory-bandwidth bound nor compute bound, and in that case it might mean that it's latency bound, and this encourages you to look into further issues.
C: So I'm going to switch to a hands-on demonstration; we'll open up a few reports and just poke around them a little bit. First, I'm not on a GPU right now, but I'll just show you the commands that you could use to generate a report; we'll start with Nsight Systems.
C: You can specify stats=true; this is going to output some profiling statistics to the command line. It will also generate an SQLite database with all of the profiling information, if you wanted to use that; I've never used it before, but it's generated there. And then, lastly, you just select the application that you want to profile.
C: For example, you can try profiling the vector addition kernel that's been used in the other sessions. There are additional options here: you can specify a delay and a duration in seconds with -y and -d respectively. So if you say the delay is one, then it's going to wait one second before it starts profiling, and if you specify five for the duration, then it's going to collect data for five seconds.
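Putting those options together, a collection command might look like this (report and binary names are placeholders):

```sh
# --stats=true prints summary statistics and leaves an SQLite database next to
# the report; -y is the delay and -d the duration, both in seconds
srun nsys profile --stats=true -y 1 -d 5 -o my_report ./vecAdd
```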
C: Okay, so we see it already; let me expand this.
C: You can see the timelines that I pointed out before: there's the CPU cores workload timeline, there's the OS threads timeline, and then there's this device timeline. If we just expand out this CPU cores workload, we see that this CPU 59 seems to be doing a lot of work, and it stops doing work for some reason around 1.47 seconds or so. Okay, so that's the CPU workload timeline.
C: As I said before, there are these OS threads, and it shows you all the calls for the various APIs. So we see calls from the CUDA API, like there's a cudaMalloc here, and you can click on this and there will be highlighted ranges here that show the correlated calls, and the beginning and end times for the thing that's actually executed on the GPU.
C: All right, well, okay, sorry; what I mean to say is that we could click on a kernel here from the GPU row. Let me zoom in a little bit.
C: We could click on a kernel that's running here, and then we can see that it's actually launched at this time from the CPU, so we'll zoom in a little bit more. We click on this; again, this blue bar down on the device row is showing a kernel that's executing on the GPU, and it's launched at this point. Okay, so we can see, for example, that there's a launch latency between the two.
C: Now we're going to switch to Nsight Compute. Again, we'll just show the command that you can use to generate a report with Nsight Compute. So again, if I were logged into a GPU node, then you could do srun and then nv-nsight-cu-cli.
C: We can generate a report called cu_profile; then you can specify the kernel name with -k. This is the kernel for which you want to collect some profiling metrics, and in the case of the vector addition exercise this would be the vecAdd kernel. If you're using this on some more complicated application, you may have a very mangled name, so you might want to use regex syntax to identify the kernel. And then the last thing to specify is the application to profile.
C: So you can use this command, and then it will generate a report.
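So the full command might look like this (using the CLI name from the talk; newer CUDA toolkits ship the same tool as ncu, and the kernel and binary names here are placeholders):

```sh
# -o names the report, -k filters the kernel to profile (regex helps with
# mangled C++ names); the last argument is the application itself
srun nv-nsight-cu-cli -o cu_profile -k vecAdd ./vecAdd
```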
C: Okay, so again, these are the measures I mentioned before: this is showing the SM usage and the memory bandwidth usage.
C: One pretty neat feature here, let's see: if you go to this page drop-down, you can click on "Source" and you can see the number of live registers per line of the low-level SASS code, which is kind of cool.
C: You can also see the speed-of-light values for different pipelines here. So if we just poke around these lists, we can see that the speed-of-light value for this pipe, FP64 cycles active, is zero, so it means the FP64 pipe is not being used at all in this example.
C: Yeah, so there's a lot of information here, and I would just recommend everyone take a look at the official NVIDIA documentation, because we've only provided a very brief introduction to these tools, and there are also lots of really good tutorials, like the ones from the Blue Waters workshops.
C: So there are tutorials on Nsight Systems and Nsight Compute, and I think, yeah, that's all I have on these profiling tools. So I'd recommend, in the remaining time, that you continue working on any examples from the previous sessions, or try out some different options using the command-line interface for Nsight Systems; try profiling the vecAdd kernel, for example. Another option is to use the NVIDIA Tools Extension (NVTX) to add some timelines of your own.
C: I think you can see this in the command-line interface too; there should be some statistics output if you use NVTX to instrument your application.
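A minimal sketch of NVTX instrumentation (the header location varies by toolkit: older toolkits use <nvToolsExt.h> and need -lnvToolsExt at link time; the names here are illustrative):

```cuda
#include <nvtx3/nvToolsExt.h>

__global__ void work(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void step(float* d_x, int n) {
    nvtxRangePushA("step");                 // open a named range
    work<<<(n + 255) / 256, 256>>>(d_x, n); // work covered by the range
    cudaDeviceSynchronize();
    nvtxRangePop();                         // close it; Nsight Systems shows
                                            // "step" as its own timeline row
}
```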