From YouTube: 6. Nvidia Nsight Developer Tools -- Max Katz
Description
Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/
Jeff Larkin already showed this slide, so I won't go through it in too much more detail, except to repeat that the NVIDIA developer tools suite has a focus on both code correctness tools and code development tools like IDEs. The other half of the picture is profiling tools: Nsight Systems, Nsight Compute, and a related tool suite.
We call the Nsight family of profiling tools the profiling side of the developer tools family, and there are really two tools that are relevant in the context of HPC. Nsight Systems is always going to be your starting point. Nsight Systems is a high-level application profiling tool: it collects information about an application as it runs, and it also collects information about the state of the system as it runs, so you can understand what was happening in your application as a function of time.
Where was I spending time transferring data between the CPU and the GPU? And where are the kernels that I'm launching? (Brent mentioned that a kernel is the atomic unit of work that gets launched on a GPU.) Where am I launching kernels that don't have enough work to be able to use the GPU effectively? Brent mentioned that modern GPUs have thousands of cores and that you really want tens of thousands, or maybe even hundreds of thousands... oh, my slide must have a slide transition.
GPUs can have tens of thousands or hundreds of thousands of threads, and you want to have that many launched at one time in order to use the GPU effectively, so Nsight Systems will help you answer: am I launching kernels that have that? If you find, using Nsight Systems, that you're launching kernels that only use a thousand threads, there's a very good chance that you're not using the GPU as effectively as you could. Now, once you have identified that a specific CUDA kernel is the bottleneck in your application,
then you would switch to Nsight Compute to analyze that kernel. Now, we will talk more about CUDA C++ tomorrow, as the kind of low-level programming model for NVIDIA GPUs that we've been referring to. The one thing that I want to emphasize, which is a nice feature of the NVIDIA platform, is that all programming models that run on NVIDIA GPUs generate the same kind of underlying code, and that means you can analyze them all equally, without any sort of favoritism towards any particular model.
The way that you analyze the code, of course, will be a little bit different, because it's different source code, and for some models we may have additional support for bonus profiling functionality, depending on which programming model you use. But at the core they all generate the same kind of code, and so they can all be analyzed in the same way.
So I've used the phrase "CUDA kernel" here to emphasize the fact that, in some sense, every programming model generates CUDA kernels, whether or not you're explicitly using CUDA. But you don't necessarily have to know that when you first start profiling on the NVIDIA platform.
Okay. So once a particular kernel has been identified as being the bottleneck in your application, Nsight Compute allows you to analyze the performance of that kernel in detail. I've also shown you Nsight Graphics here, which is the corresponding tool to Nsight Compute for graphics performance analysis; presumably that's not relevant to most, or probably all, of you. Now, talking about Nsight Systems in a little bit more detail: it is our system-level slash application-level profiler.
It allows you to identify gaps in the timeline where nothing is happening on the CPU or the GPU, and then you can use that information to go back to your source code and think: why is it that I'm not doing anything on the GPU at this time? Maybe you unintentionally left some part of your code on the CPU and you want to move it to the GPU. So Nsight Systems is useful for showing you, at a given time,
in wall time relative to the beginning of the application run, where am I on the CPU and the GPU? Nsight Systems can profile one or many CPU cores and GPUs. So if you're running on multiple GPUs on the same server, for example, if you have a multi-GPU server, you could profile an application that uses multiple GPUs and see them all on the same timeline.
Nsight Systems is supported on multiple architectures: we have support on Linux, Windows, and Mac. Generally speaking, we support collecting the profiles themselves on both Linux and Windows, because those are the places where you're likely to be running NVIDIA GPUs. And because of the way Nsight Systems is architected, you can collect the profile remotely on a remote cluster, without the use of a GUI, and then bring that stored,
saved report back to your local system and then analyze it in the user interface there, and you can do that cross-platform. (Sorry, I didn't realize the slide transitions over here.) So, for example, you can collect a report on Linux and then display the collected report in the user interface on either Windows or Mac, and that's the workflow that I'll show you in a second.
The timeline has rows, and each row shows you information about what's happening for that section of analysis as a function of time. Time increases from left to right here, and each row traces or collects a different kind of data. So in this example, information about what's happening on the CPU cores and threads is in the upper part of the screen. You can see a bunch of different Python processes being launched; actually, these are Python threads.
You might be calling any number of other APIs that Nsight Systems knows how to collect information about. It can then show you, in this example in the bottom half of the screen, the GPU-centric view: as a function of time, where are all the kernels that I launched on the GPU, as well as where are all the memory copy operations that are happening on the GPU? And in this example we're showing you a multi-GPU run, so for cases where you're using multiple GPUs in the same workload, you can see them all at once.
You can then zoom in to particular areas of interest. This screenshot showed a relatively large chunk of time, in the hundreds of milliseconds. Often you'll want to zoom in to a very particular slice of time and understand in more detail what's happening there, so you can just zoom in in the profiler. This is an example zoomed into a much smaller chunk of time, a small fraction of one millisecond, and then you can get information on individual operations.
The name of the command-line interface for Nsight Systems is nsys, and nsys has a number of different modes that it can run in, but profiling is the most common one that you'll use. So it's nsys profile and then the name of your application. As you'd expect, it has a number of runtime options; in this example I'm showing you two of them. -o sets the name of the report
file that's being generated, which in the newest version of Nsight Systems has the .nsys-rep file extension, but in older versions, like the one we'll be using today, has the .qdrep file extension. And --stats=true means that I want to generate a summary at the end of the run of all of the GPU- and CPU-related activities.
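Putting those pieces together, an invocation would look something like nsys profile -o myreport --stats=true ./myapp, where the report name and the application name here are just placeholders, not names from the demo.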
Typically, you'll want to do a little bit more work here, because you'll want to give each report file a separate name for each rank, and nsys does understand how to inject environment variables into the report name. So you could say something like: I want each report file to have a separate name corresponding to the MPI rank in question, and you just use the MPI environment variables for your implementation to do that, or you can use SLURM_PROCID if you're using Slurm.
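For instance, assuming nsys's %q{VAR} substitution syntax for pulling an environment variable into the output name, something like nsys profile -o report_rank%q{SLURM_PROCID} ./myapp would give each rank its own report file; again, the report and application names are placeholders.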
It is also possible to put nsys profile before mpirun, and in that case it will profile everything that's launched by the job launcher. But because at this time Nsight Systems is not multi-node aware, it will only profile ranks that are launched on the same node as you. So using nsys profile and then mpirun is primarily only useful for single-rank or single-node cases where you're launching from the same node that you're running on, which typically isn't the case on a cluster.
Now, this is an example of an Nsight Systems profile that I collected, I think on Perlmutter a few weeks ago, and you can see what the Nsight Systems user interface looks like. This is basically what the report looks like when you first open it: you'll see a timeline view in the upper half of your screen and then an events view in the lower half of your screen.
So, for example, if I click a particular row and then click "Show in Events View", it will create a table of operations that I can then see down here. And in the case of NVTX, which is the instrumentation API that you can use for annotating your code with human-readable strings, you can get a nested view of what's happening in your application. So this is an application that launches a bunch of time steps, and then I can filter down through individual time steps and see what's happening in each one.
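As a rough sketch of what that kind of NVTX annotation can look like in CUDA C++ (the function, loop, and range names here are hypothetical, not the code from the demo):

    #include <nvToolsExt.h>   // NVTX header; link with -lnvToolsExt

    // Each iteration shows up as a named, nestable range on the
    // Nsight Systems timeline and in the events view.
    void run(int nsteps) {
        for (int step = 0; step < nsteps; ++step) {
            nvtxRangePushA("timestep");   // open a human-readable range
            // ... launch kernels / copy data for this step ...
            nvtxRangePop();               // close the range
        }
    }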
In this example here, what's happening on the CPU side is in the bottom half of the screen, and what's happening on the GPU is in the upper half of the screen. Any row can be expanded by clicking the arrow here, and now that I've expanded the CUDA row, I can see a whole bunch of operations that are happening on the GPU, and I can zoom in so you can see them better.
Generally speaking, in the overall CUDA row, which is this uppermost row, kernels (compute operations) are blue, and memory copy operations, transferring data between the CPU and GPU, are green and red. And then, if I scroll in far enough, I can find any particular kernel operation. So, for example, here's a kernel: an individual, discrete piece of work. This example is using CUDA, but you'd see the same kind of presentation regardless of which programming model you used.
You'd see basic information about the kernel, which tells me things like how many threads in total I launched in order to run this kernel, and that can give you a sense of whether I'm launching enough work to even fill up a GPU. So Nsight Systems is super useful for understanding what's going on as a function of time. You can see that this chunk of the timeline here has zero activity happening on the GPU; that's normally a bad thing.
In this particular example, I had turned on MPI tracing, so this allows you to understand where MPI calls are happening in your application, and you can see that this chunk of the timeline on this rank happened to correspond nicely to an MPI all-reduce operation. So that explains why there's no GPU activity happening here: I'm waiting on an MPI operation to finish. You can also see calls into the CUDA runtime API, and these can be useful for understanding what your code is asking the CUDA runtime to do.
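(For MPI tracing, the collection would have been launched with nsys's trace selection, e.g. something like nsys profile --trace=cuda,nvtx,mpi -o report ./myapp; the exact trace list and names here are an assumption, not taken from the slides.)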
Okay, switching gears now to, and you can of course still ask me questions, switching gears now to Nsight Compute. Nsight Compute is our kernel profiling tool, and it allows you to get detailed performance analysis of individual kernels. In the profile that I was just showing you, you could get some information about a kernel, like how many threads it launched and how long it ran.
Nsight Compute works by doing different kinds, or levels, of analysis, each of them asking a different question about the performance of a given kernel. As one example, the one that you start with, the one that's at the top of the Nsight Compute report when you open it, is the GPU Speed of Light section. This gives you a high-level analysis of how much of the compute throughput that was available on the GPU I used, and how much of the memory bandwidth I
used. This is presented in a bar chart where SM percentage is a measure of the compute throughput and memory percentage is the percentage of memory bandwidth that I used. Generally speaking, if you are using at least 60 or 70 percent of one of those two subsystems on the GPU, that's a good sign that you're using the GPU effectively; if you're using less than that, then that could be a sign that you aren't using the GPU to its full potential.
Brent mentioned that the latency of individual operations is relatively high on GPUs, and there is really no way around that; it's a fundamental characteristic of how GPUs are designed. The only way to compensate for the fact that the latency of individual operations on GPUs is high is to have a lot of those operations going all at once, so that any individual operation can have its latency hidden by operations in other threads.
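To put a rough number on that (these figures are illustrative, not from the talk): by Little's law, the concurrency needed is roughly latency times throughput, so if a memory load takes on the order of 400 cycles and you want to sustain one load per cycle, you need on the order of 400 loads in flight at once, and the way you get them is from other threads.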
If you aren't able to achieve a high fraction of the peak throughput of the GPU, that's usually a sign that you don't have enough threads in flight to hide latency. And Brent mentioned that threads in flight can really be thought of as a measure, in many cases at least, of how many independent items of work you have to do, or how many degrees of freedom you have.
So if you think about a for loop from i equals one to n, in Fortran or in C, then the trip count of that for loop, what n is, will essentially, in almost all cases, be a measure of how many threads there are. If n is a thousand, then I launch a thousand threads.
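As a minimal CUDA sketch of that mapping (the kernel, its name, and the sizes are hypothetical): each loop iteration becomes one GPU thread, so the loop trip count n is the total thread count.

    __global__ void scale(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per loop iteration
        if (i < n) x[i] *= a;                           // body of the original for loop
    }

    void launch(float *x, float a, int n) {
        int threads = 256;                          // threads per block
        int blocks = (n + threads - 1) / threads;   // enough blocks to cover n
        // n = 1000 launches only ~1000 threads, likely too few to fill a GPU;
        // n in the millions gives the tens of thousands of threads needed.
        scale<<<blocks, threads>>>(x, a, n);
    }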
Nsight Compute also shows you movement between the main memory and the caches, and it helps you understand, for people who are willing to do detailed analysis, what's going on in your kernel. Again, for those of you who are just getting started on GPUs, you wouldn't start here; this is a relatively advanced level of analysis. But for those of you who really want to dig into a particular kernel, you can then use Nsight Compute for this.
Nsight Compute also allows you to look at the source code of your kernel, either the assembly code or the original high-level C or Fortran code, and then correlate that with some number of metrics that we have sampled through the kernel. What we can do is, every some number of clock cycles, record a sample, and that sample will record where we are in the code, as well as some information about why we are waiting, if we are waiting at that point in the code.
Why are we waiting there? This allows you to say, roughly speaking, where am I spending time in my code, and why am I spending time there? Is it because I'm waiting for some data to arrive before I can compute this operation? Is it because the latency of this particular arithmetic operation is high? That sort of thing. This level of analysis is quite tricky to do, for two reasons.
One is that understanding the actual data that we're presenting to you does require some understanding of the GPU architecture, and that requires practice and experience. The other is that GPUs are, by their nature, very highly parallel, and so you should think of these samples as averages across all the threads that are being launched on the GPU. It's not the work of any one particular thread, and that requires you to think in a fundamentally parallel, or averaged, way, which is non-intuitive for somebody who thinks primarily in terms of serial CPU threading.
The overhead that this adds to your kernel depends on how much data you're collecting, but it can be 10x, or sometimes even 50x or 100x, so, of course, that can make the runtime of your application 10 or 100 times longer. So if you profile every single kernel in your application, that will introduce a relatively large amount of overhead. I recommend not doing that, but instead profiling only the specific kernels that you care about in any given analysis iteration. So, for example, -k says I only want to profile kernels with a specific name.
You may even narrow it down further than that, for example by only profiling some number of launches of that kernel, rather than every single one. So you will want to be careful about how much data you collect with Nsight Compute, but this is the workflow that you would use.
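(As a hedged illustration of that filtering, with the kernel and application names as placeholders: an Nsight Compute CLI invocation along the lines of ncu -k mykernel -c 5 -o kernelreport ./myapp would profile only five launches of kernels whose names match "mykernel".)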
And then just a quick look at how this looks in the viewer. If you can't see the Nsight Compute user interface, let me know.
This is an example of what I get when I look at Nsight Compute. As I mentioned, it has these different kinds of sections of analysis. The first one is this GPU Speed of Light, and then, if I expand any given section with that arrow, it gives me a lot more detail about that analysis. So, for example, here's that bar chart that I was showing you, with both compute throughput and memory throughput, for this particular kernel.
You can see memory usage was 33 percent of peak throughput to begin with, and compute throughput was even less: it was 17 percent. So this is an example of a kernel that isn't using the GPU as effectively as it's possible to use the GPU, and potentially, if you could improve it, you would, although sometimes that's not easy, depending on the way that you've written the code. We can also generate a roofline chart, which allows you to see
your flop rate as a function of arithmetic intensity, for those of you who have used roofline analysis in other tools, and then there are a number of sections down below for different kinds of analysis. There are multiple pages in Nsight Compute: if you click from the Details page to the Source page, you would see this source code view, and then you could scroll down through your source code and see, as a function of the line of code, how many instructions were executed.
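(For reference, this is the standard roofline model rather than anything specific to the slides: with arithmetic intensity I = flops executed / bytes moved, attainable flop/s = min(peak flop/s, I x peak bytes/s), so kernels with low arithmetic intensity are capped by memory bandwidth rather than by compute.)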
In particular, we did a joint training with Oak Ridge almost two years ago now (it's crazy how fast the time goes), where we gave a much more detailed, hour-long presentation on each of Nsight Compute and Nsight Systems. So if you want to learn more about that, you can go consult those training sessions; most of the information there still gives a pretty good sense today of how to use these tools.
Will nvprof be deprecated? That's a question from the chat. nvprof is currently in what I would call, or describe as, maintenance mode, which means that we're not adding any new features, and while we're not formally calling it deprecated, it's functionally deprecated, in the sense that, unless the tool actively breaks, we're not going to be adding any new functionality to it.
nvprof already doesn't support the latest GPU architecture, so you can't use nvprof on the A100 GPUs that are on Perlmutter. So if you're still using nvprof today, I strongly recommend moving to Nsight Systems and Nsight Compute, and there's a blog post that I've linked here which tells you how you would transition from nvprof to Nsight Systems, if you want to do so. And it will be true that all future GPUs, including A100 and any future GPUs that we launch, will not have nvprof support.