From YouTube: Profiling/debugging for GPUs
Description
Jonathan Madsen of LBNL presents a talk on Profiling/debugging for GPUs. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Session Chair: Yan Zhang.
I'm going to talk about some of the profiling and debugging tools. No doubt over the course of yesterday and some of today you've had some of these referenced, or maybe they were presented in some of the tutorials yesterday, so this will kind of be a review in certain places.
Let me turn on captions, okay. So just a quick little overview: I'm going to talk about some of the nuances and the different tools available for both debugging and profiling.
Debugging on the GPU is a little bit different than on the CPU, because it's asynchronous with the CPU, so sometimes where you detect the error isn't exactly where the error occurred, or, say, the kernel was launched in a different place. And if you are used to working with highly parallel codes, you know that debugging highly parallel codes, like what you have on the GPU, tends to produce heisenbugs, where the simple act of trying to study the bug makes it disappear.
This is quite a difficult problem, but there are tools available to sort of solve these things when debugging. I think this was covered elsewhere: for nvcc, the lowercase -g flag generates debug information for the host code.
I don't believe that nvcc supports this, and I'm not sure if clang does, but hopefully that is something that they will eventually address, because these features are very nice. The -fsanitize options for address and data-race checking are very nice for just compiling your code, running it, and then getting a report at the end. Most debugging error tools... oh, sorry, most CUDA routines return a cudaError_t, and you want to check whether or not that equals cudaSuccess.
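
To make that concrete, here is a minimal sketch of such a check; the CUDA_CHECK macro name and the exit-on-error policy are my own choices for illustration, not something from the talk:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: wrap every CUDA runtime call and verify that the
// returned cudaError_t equals cudaSuccess, reporting where it failed.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);     \
            std::exit(EXIT_FAILURE);                                        \
        }                                                                   \
    } while (0)

int main() {
    float* d_buf = nullptr;
    CUDA_CHECK(cudaMalloc(&d_buf, 1024 * sizeof(float)));
    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}
```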
You will quickly become a fan of using it in the IDE; it's so much nicer to use. But sometimes this requires actual integration with your build system, and that's one of the big benefits of using CMake. Then cuda-gdb is the NVIDIA-supported debugger. It's pretty much modeled after gdb, the GNU debugger; it's built as an extension.
A
You
simply
just
run
cuda
gdb
and
then
the
command
options.
And
then
you
will
get
an
interactive
prompt.
You,
you
type
run
enter
whenever
you
get
the
the
error
you
can
hit
back
trace
and
it
will
show
you
the
call
path
to
where
you're
getting
it.
And
then
you
can
print
variables,
switch
between
frames
and
stuff
like
that,
then
there
is
the
cuda
memcheck
tool,
which
is
a
functional
correctness,
checking
tool.
This
is
sort
of
similar
to
their
sanitizer
stuff,
the
sanitizer
stuff
that
I
mentioned
previously.
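
As an illustration of that workflow, a session might look something like this (the program name and the particular error are invented for the example; run, backtrace, print, and frame are the standard gdb-style commands):

```
$ cuda-gdb ./my_app                      # "my_app" is a made-up program name
(cuda-gdb) run
...
CUDA Exception: Warp Illegal Address     # example of an error it might stop on
(cuda-gdb) backtrace                     # show the call path to the failure
(cuda-gdb) print some_variable           # inspect a variable in the current frame
(cuda-gdb) frame 2                       # switch to another stack frame
```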
Sometimes, though, printf really will just be the best debugger, and I think all of us have used printf as a debugger quite a bit.
Usually printf is also great for enabling just sort of log messaging, and this is always nice to have in a code that you are distributing, so that if a user comes back to you with an error, you can simply say: set an environment variable, hey, turn on VERBOSE=3 or something like that, and you'll see values printed out in the code that help provide you context about where the errors are actually occurring. And the same thing with that macro: you should have sort of always-on error checking.
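
Here is a minimal sketch of that environment-variable pattern; the VERBOSE variable name and the level semantics are assumptions made for illustration:

```cpp
#include <cstdio>
#include <cstdlib>

// Read a verbosity level once from the environment, e.g.
//   VERBOSE=3 ./my_app
// ("VERBOSE" is just an example variable name).
static int verbosity() {
    static const int level = [] {
        const char* v = std::getenv("VERBOSE");
        return v ? std::atoi(v) : 0;
    }();
    return level;
}

// Print the message only when the user asked for at least this level.
#define LOG(lvl, ...)                          \
    do {                                       \
        if (verbosity() >= (lvl))              \
            std::fprintf(stderr, __VA_ARGS__); \
    } while (0)

int main() {
    LOG(1, "starting run\n");
    LOG(3, "detailed state: x=%d\n", 42);
    return 0;
}
```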
So let me move on now to profiling.
Just like debugging, profiling has some nuances. Measuring the performance can degrade your performance. Unfortunately, hardware counters in particular, the way they are implemented with nvprof, that particular API sort of serializes the kernel execution, so you don't get any overlap. And also, kernels are... sorry, hardware counters are a finite resource, so if you request measuring too many hardware counters, CUDA actually has to replay the kernel in order to collect all the hardware counters.
And also, when it comes to profiling, the performance on the CPU still matters. Simply optimizing the actual kernel execution isn't really as important as optimizing sort of the host-to-device communication patterns or your memory access patterns.
So it gives you a per-kernel breakdown, and you can identify computational bottlenecks with your memory access and occupancy and stuff like that. Once you have optimized the code, especially at a hackathon, if you spent, you know, a week at a hackathon optimizing the code using all these tools, you have things migrated to the GPU and they're running well.
Then you can easily just sort of look back and say: has this region expanded significantly or reduced in its run time compared to this old run, and really do sort of continuous monitoring. The GUIs are very nice to use, but they really don't get run all that often, and integrating sort of a continuous monitoring of performance that you can refer back to, or easily run without a GUI, is highly advantageous.
A lot of us have simple CPU timers in our CPU code, and you want to try and integrate something like this into your code. On the CPU there are compiler-based tools that sort of make it easy to do profiling; they've sort of built profilers into the compiler, like XRay, which is part of Clang, and then they also have...
ways... anyway, I'm sorry, I'm just going to move on. The flags above might work with clang, but I haven't actually tested that, and to my knowledge, nvcc does not have a compiler-based way to instrument your code.
And, as I mentioned, as far as GUIs go, there are Nsight Systems, Nsight Compute, and nvprof. There are also several open source tools: AMD has a profiler, and then there's TAU, Score-P, and HPC Toolkit. They're all open source projects that have been around for a long time. They have GUIs, visualization, stuff like that, a lot of features, because there's really not anything particularly special about Nsight Systems or Compute, in so much as they're not doing something that an open source tool cannot, because NVIDIA provides APIs for tool developers.
So nvprof uses the CUPTI callback API, which has been around for a while and has widespread support in open source tools. Nsight Systems is mostly timing sort of stuff, tracing, and it has widespread support. But the new Nsight Compute uses a new API, and in the open source tools
there's minimal support at this point, even though there will soon be quite a bit more. As far as building something into your software, the CUDA runtime API has some basic utilities that you can use. For example, there's cudaEventElapsedTime for getting the timing between two events; that just sort of sticks a timestamp into the stream that's being processed. And then there are ways to control an external profiler from your code, so you can start and stop it.
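
A minimal sketch of those two runtime features, using a placeholder kernel: cudaEventElapsedTime for event-to-event timing, and cudaProfilerStart/cudaProfilerStop (from cuda_profiler_api.h) for controlling an attached external profiler:

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

__global__ void work(float* x) { x[threadIdx.x] *= 2.0f; }

int main() {
    float* d_x;
    cudaMalloc(&d_x, 256 * sizeof(float));

    // Events record timestamps in the stream; elapsed time is measured
    // between the two recorded points.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaProfilerStart();           // tell an attached profiler to begin collecting
    cudaEventRecord(start);
    work<<<1, 256>>>(d_x);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);    // wait until the stop event has completed
    cudaProfilerStop();            // tell the profiler to stop collecting

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("kernel took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```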
The cuda-memcheck that I mentioned earlier uses the sanitizer API, so you can implement sort of a cuda-memcheck within your own code. I mentioned NVTX and the decoration; you can include that as a header-only source by including that file, you see that right there. And if you use wall-clock timers on the CPU, just remember those wall-clock timers are kind of meaningless unless you do a sync before them.
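
A short sketch combining both points, assuming the header-only NVTX that ships with the CUDA toolkit (the kernel is a placeholder; note the cudaDeviceSynchronize before reading the CPU timer):

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <nvtx3/nvToolsExt.h>   // header-only NVTX v3; older toolkits use <nvToolsExt.h>

__global__ void work(float* x) { x[threadIdx.x] += 1.0f; }

int main() {
    float* d_x;
    cudaMalloc(&d_x, 256 * sizeof(float));

    nvtxRangePushA("work region");      // shows up as a named range in the profiler
    auto t0 = std::chrono::steady_clock::now();

    work<<<1, 256>>>(d_x);
    cudaDeviceSynchronize();            // without this, the CPU timer measures only
                                        // the asynchronous launch, not the kernel
    auto t1 = std::chrono::steady_clock::now();
    nvtxRangePop();

    std::printf("wall time: %f s\n",
                std::chrono::duration<double>(t1 - t0).count());
    cudaFree(d_x);
    return 0;
}
```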
We call the project timemory, and it works by sort of creating single handles that you can use to invoke multiple of these APIs. So you can combine the profiler start and stop with an NVTX marker, or you can have direct access to the CUPTI tracing API or the hardware counters. It's available for C, C++, Fortran, and Python.
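
As a rough sketch of that handle idea, based on timemory's component-bundle design (treat the exact header, namespace, and component names as assumptions to verify against the timemory documentation):

```cpp
#include <timemory/timemory.hpp>

// A single handle type that bundles several measurement APIs at once:
// here a wall-clock timer plus an NVTX marker.
using bundle_t = tim::component_tuple<tim::component::wall_clock,
                                      tim::component::nvtx_marker>;

void compute() {
    bundle_t handle("compute");  // one handle drives all bundled components
    handle.start();
    // ... launch kernels, do work ...
    handle.stop();               // stops the timer and closes the NVTX range
}
```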
A
You
can
get
direct
access
to
the
data
for
your
continuous
monitoring,
one
minute
to
wrap
up
one
minute
to
wrap
up
guys
sure,
excellent,
okay
and
sort
of
the
the
key
features
of
this
is
it's
really
easy
to
create
new
components
because
they
can
be
composed
of
other
components,
and
if
you
integrate
it,
if
you
create
a
pull
request
and
get
it
integrated
into
sort
of
the
native
stuff,
it
becomes.
This
standalone
python
class
ii
that
can
be
used
from
python
and
c
plus
plus
users
can
create
their
own
components
locally.
If I don't see that in the chat... oh yes, okay, and that will just stay for maybe five or ten minutes. Sure. Yeah, the ERT (Empirical Roofline Tool) that I built in is available. It's implemented in headers and actually has an extension, so that, you know, the traditional ERT is sort of based on doing FMA operations to estimate the peak.
This actually allows you to replace those via lambdas, so that you can estimate the peak of, say, just a vectorized multiplication operation or a scalar operation, and sort of model it after the peak for what you think your code actually might look like, because not all of us have codes that can execute a whole lot of FMA operations.
I'd say: cool, thanks. I'm glad to see more profiling and debugging tools at NERSC and in the general HPC community. Thank you, Jonathan.