From YouTube: Intro to GPU: 07 Profiling on GPU
So welcome back, everyone. I'm Max Katz. For those of you who joined just for the afternoon session, I'm the NVIDIA technical contact to NERSC, and it's my job to help train everybody on how to use GPUs effectively. In this talk, I'm going to talk a little bit about profiling tools for NVIDIA GPUs.
First, for when you're getting started on GPUs: the NVIDIA profiling tools are organized into a family called the Nsight developer tools. The way this works is that we have a set of three tools that can be used for various parts of the profiling and analysis workflow. Whenever you first start analyzing the performance of a code, the most important question is: where is the time being spent?
You want to be able to ascertain, as a function of wall time in your application, how much time is in each part of the program. That's the most important thing, because then you know what your bottleneck is — what is the most expensive part of your code — and usually that's the part you go in and optimize. You don't want to optimize the part that's the most fun to optimize, or the part that's easiest. The most bang for the buck comes from the part where you're spending 60 percent of your time.
If that's what you have, of course. That's not always the easiest thing to do; sometimes you can't, or you need to wait on some refactoring. But it's nevertheless always important to know where time is being spent, so you don't go prematurely optimizing. Before you go and optimize some code, you need to have a very clear understanding of what the possible benefit is.
If a piece of code already takes only one percent of your time, then spending your effort on that part doesn't make a lot of sense. So, with that in mind: Nsight Systems is the name of our tool that is designed to collect a timeline of the activity on your node. Nsight Systems can be used for system-wide application analysis — you're really asking, on my node, where is the time going? How much time is on the GPUs?
How much time is on the CPUs? You get a stacked-up view, as a function of time, of what's happening on your node. Then, once you've identified the particular part of your workflow that is the problem, it often breaks down into two categories. One is that you identify that memory is your most important bottleneck — in particular, copying data back and forth between the GPU and the CPU. Nsight Systems will very clearly tell you whether that's happening or not. And then sometimes the bottleneck is the actual compute workload.
I would say most of the time, the first time you port to GPUs, it's going to be memory that's your bottleneck: you either spend a lot of time allocating memory or copying it back and forth between the CPU and the GPU. But once you've gotten that all worked out, typically your compute workload will then be the most important part of your time, and you want to analyze that. Now, on NVIDIA GPUs, discrete chunks of work are called kernels.
Regardless of what programming language you use, there is a discrete chunk of work that gets launched on the GPU that has parallel work to do. In the CUDA context, that was the global function we saw earlier; in OpenACC, that would be a specific parallel region; same thing for OpenMP, within a target teams distribute construct. So, in NVIDIA terminology, those are kernels, and Nsight Compute is designed to pick out a particular kernel and analyze it. That's for when you've identified the particular loop that is the problem.
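To make the terminology concrete, here is a minimal sketch of my own (not from the talk) showing what becomes a "kernel" in CUDA versus OpenACC; the file names and the commented compile lines are assumptions.

```shell
# Write two toy source files illustrating what gets launched as a GPU kernel.
# (Illustration only; file names and compile lines are assumptions.)
cat > add_kernel.cu <<'EOF'
// In CUDA, the __global__ function is the kernel launched on the GPU.
__global__ void add(int n, const double *a, const double *b, double *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
EOF

cat > add_openacc.c <<'EOF'
/* In OpenACC, the compiler turns this parallel loop into a GPU kernel. */
void add(int n, const double *a, const double *b, double *c) {
    #pragma acc parallel loop
    for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
}
EOF

# Typical compile lines (require nvcc / an OpenACC compiler such as PGI):
#   nvcc -c add_kernel.cu
#   pgcc -acc -c add_openacc.c
```

Either way, the profiler sees the same thing: one named kernel launched on the device, which is the unit Nsight Compute analyzes.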
That loop is your bottleneck, and now you want to understand: what is the bottleneck for that loop? How can I optimize that loop? The tool that is used to do that is called Nsight Compute. There's another tool called Nsight Graphics, which is for people doing graphics optimization. I'm going to assume that's not anyone in this room, but it is important for the people who do game development on the NVIDIA platform.
Of course, I'm only showing you the profiling part of the NVIDIA tool chain; Woo-Sun kindly showed you the debugging tools. There are a couple of other tools as well, but I think profiling and debugging cover the most important part of the offerings. Nsight Systems, as I said, is designed to give you a timeline view of what happened in your application.
If you look here on the right-hand side of the screen, we have a screenshot of what this might look like for a fairly complicated application. Up here at the top of the image you see workload information about the CPU. By the way, I'm going to give a live demo of this tool after these slides are done, so you don't have to squint — I'll make it a little easier to see — but I'm just giving you a sense of what you would see.
Up here at the top you have the CPU workload. In the middle you have information about the various APIs that Nsight Systems knows how to track — it can track calls into CUDA and into the CUDA libraries, and it can track MPI calls, for example. And then at the bottom you have the actual GPU workloads. That's what's down here, where you typically see red for memory activity and blue for kernels, the compute workload. That helps you identify at a glance: where's my time being spent? Is it in memory?
We can basically just collect the data and print out to standard output what happened, but the more fun thing to do is collect a report on the remote system and then visualize it in the graphical interface. I'm going to show you how to do both, but the second one gives you a much richer view: you get information like this that helps you see at a glance what's going on in your application. With some exceptions, both versions are supported on Linux, Windows, and Mac.
The exceptions: on Mac, there's only a viewer version, because you're not going to be collecting on your Mac; and on Linux POWER9, which is what's running on Summit, for example, we only have the collection mode. So you do have to copy the report back to your local system, install Nsight Systems on your own machine, and load it up there. Nsight Compute, the other half of this workflow as I described, is where you would go once you identify that a particular kernel is your bottleneck.
Now you jump into this tool, and it's going to do some analysis for you on what's going on. Under the hood, what it's doing is collecting hardware performance counters on the GPU. It runs your application, collects these counters, and prints out a report, and then you can load that report into the user interface. It'll tell you things like: am I memory bandwidth bound? Am I compute performance bound? Or something else? And then hopefully you can use that to determine where you should go next in your optimization process.
The invocation is generic — nsys profile followed by your executable, say my_app.exe — and if you add the --stats=true flag, what it does is collect a report and also, at the end, post-process it and give you a standard-output view of what happened. I'll show you what that looks like in a little bit, but basically it gives you a summary of all the activity that happened. It's not an ASCII rendering of the timeline; it's just a rolled-up summary, like the average of what each activity cost.
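As a concrete sketch of that collection step (the binary name my_app.exe is a placeholder, and the block is guarded so it's harmless on a machine without Nsight Systems):

```shell
# Collect a timeline report and print the rolled-up summary to stdout.
if command -v nsys >/dev/null 2>&1; then
  # Placeholder binary; ignore failure in this sketch.
  nsys profile --stats=true -o my_report ./my_app.exe || true
  status="attempted collection into my_report"
else
  status="nsys not on PATH; run this on a system with Nsight Systems"
fi
echo "$status"
```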
It will generate a file with the extension .qdrep — that's just the file extension, so you know which files are Nsight Systems profiles. It is possible to use Nsight Systems for interactive profiling — if you have a GPU in your local laptop or workstation, you can use it that way — but in the HPC center context, you usually collect the profile remotely and then view it locally. Then you might get a view like this.
If you zoom in even further, you can hover over a particular kernel launch and see information about it: how long it took, what kind of resources it used on the GPU, that sort of thing. That's basically the most detailed information you can get about that kernel here. If you want more detail, then you would need to jump into Nsight Compute, target that kernel by name, and profile it. So that's the workflow.
The kernel name filter matches substrings: any kernel whose name includes the string you give will be profiled, so you want to tune that carefully. There's a further option to limit the collection even more if it's taking a long time. And as with Nsight Systems, you can use Nsight Compute to actually drive the application.
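A sketch of targeting one kernel, using the modern ncu launcher (the talk-era binary was named differently); the kernel and binary names are placeholders. -k filters by name, -c caps the number of launches profiled, and -o writes a report file:

```shell
# Profile only launches whose kernel name matches "my_kernel",
# and stop after the first matching launch to keep collection fast.
if command -v ncu >/dev/null 2>&1; then
  # Placeholder binary; ignore failure in this sketch.
  ncu -k my_kernel -c 1 -o my_kernel_report ./my_app.exe || true
  status="attempted kernel profile"
else
  status="ncu not on PATH; run this on a system with Nsight Compute"
fi
echo "$status"
```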
That's if you have a local workstation — but again, most of you, if you're running at NERSC, will be collecting on the command line remotely and then visualizing locally. By the way, these are generic slides, not specific to NERSC. So if you're running at NERSC, you need to launch under srun -n 1, because, as we talked about, the GPUs aren't visible from the allocated node otherwise.
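On a Slurm system like the one described, the collection step would be wrapped in srun, roughly as follows (the executable name is a placeholder, and the block is guarded for machines without Slurm):

```shell
# Launch one task on the allocated node so the profiler can see the GPU.
if command -v srun >/dev/null 2>&1; then
  # Placeholder binary; ignore failure in this sketch.
  srun -n 1 nsys profile --stats=true ./miniweather || true
  launched="yes"
else
  launched="no (srun not available outside a Slurm system)"
fi
echo "launched: $launched"
```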
And this is what the Nsight Compute interface looks like — again, I'll show you this in a moment — but it's broken down into a set of sections, and each section is intended to give you a different level of analysis on different parts of the GPU workload. The top-level section is probably the most important one, the one you'd always start with: it's called the Speed of Light section. The Speed of Light section is intended to tell you, of the various theoretical bottlenecks on the hardware,
how close you are getting to those theoretical limits. We publish that if you're using double-precision floating-point multiply-adds, then the peak performance of the Volta GPU is something on the order of seven teraflops. So this would tell you, if you had an application that was primarily doing that instruction,
what percentage of that seven teraflops your kernel is actually getting. That gives you a sense of how much more optimization you need to do before you hit a hard limit on the GPU, beyond which you could not get any faster and should move on to work on another kernel. That's what the Speed of Light section is for: the "speed of light" is the peak possible performance, and the percentage is how much of it you achieved.
There are several different limits to consider; it gets a little more complicated because there are different levels of memory. As we discussed, there's what's called global memory, which is basically your DRAM — the big chunk of memory the GPU has, 16 gigs — and there are also levels of cache, and any one of those could be your bottleneck.
If you're doing a streaming operation, like summing two long arrays together, that typically will be bottlenecked by how long it takes to go to DRAM, because the size of the arrays will be much larger than the size of your cache. But if you're doing something that can fit in cache, then you will primarily be bottlenecked by the bandwidth to the cache, or the latency to the cache.
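To see why the array-sum example is DRAM bound, here is a back-of-envelope estimate; the 900 GB/s and 7 TFLOP/s figures are approximate V100-class numbers I'm assuming for illustration, not quoted from the talk.

```shell
# Summing two double arrays: per element we read 16 bytes, write 8, do 1 flop.
bytes_per_flop=24
peak_bw_gbs=900        # assumed HBM2 bandwidth, GB/s
peak_fp64_gflops=7000  # assumed FP64 peak, GFLOP/s
# Flop rate achievable if limited purely by DRAM bandwidth:
stream_gflops=$(awk -v bw="$peak_bw_gbs" -v bpf="$bytes_per_flop" \
  'BEGIN { printf "%.1f", bw / bpf }')
echo "bandwidth-limited rate: ${stream_gflops} GFLOP/s vs ${peak_fp64_gflops} GFLOP/s peak"
```

The bandwidth-limited rate comes out two orders of magnitude below the compute peak, which is why no amount of tuning makes a pure streaming kernel compute bound.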
The memory hierarchy has different levels, and so it shows you what percentage of each level you're at. In this screenshot example — the nomenclature is a little unfortunate, or hard to parse for beginners — there's basically an entry for L1 cache, L2 cache, and DRAM: what percentage of each of those did you achieve? And approximately speaking, your total limiter will be essentially the highest of those three, because the highest number you get is the bottleneck.
Okay, so as a general rule — and this is getting more into the weeds; perhaps you'll need to build up some more experience with this — a kernel that looks like this is not an efficient use of the GPU's resources. It's not getting near the hundred percent that you'd like to get for either compute, which is the upper bar graph, or memory, which is the lower bar graph.
The question was: should I interpret this as saying that the top row should be as large as possible relative to the bottom row — should I try to be bottlenecked by compute rather than memory? My comment is: that's the ideal world, where you're bounded by compute. And if you are bounded by compute, often you will not be bound by memory, because you're not making that many accesses to memory — you have a lot of compute instructions to do.
A
That's
the
ideal
world,
but
don't
necessarily
make
that
a
goal
of
your
optimization,
because
if
you
are
trying
to
add
two
rays
together,
the
example
I
gave
you
can
never
make
that
compute
bounds
right.
The
best
you
can
do
is
make
it
memory
bandwidth
bound
and
then
just
optimize
that
as
much
as
you
can
by
doing
things
like
making
sure
the
memory
accesses
are
coalesced
and
that
you're
achieving
as
close
to
100%
as
possible.
Okay, so how much time do I have left — twelve minutes? That's fine. I'm going to give you a quick demo. I don't have enough time to do this full justice, but I'll give you a flavor, and this is something you might be able to do for homework — maybe later, not necessarily today; it's a little bit more advanced.
But if you want a follow-up exercise, you can do that. Oh — until 2:30, okay. So, Matt Norman is a climate scientist at Oak Ridge who has developed a mini-app called miniWeather. miniWeather is a simple C- or Fortran-based CFD-type code that does basically the type of analysis you would do for weather or climate simulations. He kindly made it available to the community as an open-source code.
It's a nice mini-app to play with if you want to get familiar with what that community's code looks like, and it's the type of thing that vendors would typically go off and try to optimize for procurements. miniWeather is an excellent code for doing various types of analysis, and I have chosen to fork it and make a version of it
that I use for demonstrating performance analysis. This is on my personal GitHub — it's in a branch called mkatz/tutorial — so follow up with me afterwards, or I can make it available in the NERSC documentation, something like that. But this is something you can go get from GitHub, and it gives you a few simple problem exercises that you can work on. And as you can see, I was working on this frantically while Woo-Sun was talking.
It gives you some hints along the way, and then eventually you get to the point where the kernels are your limiter, and you use Nsight Compute to profile a kernel. I'm not going to go through all five of those problems — I don't think we have time for that, and I also want to leave you something to do — but I'll give you a sense of how that workflow works, and I'll show you how to use both tools. So, I have cloned this repository on Cori.
I have currently allocated a node on Cori with salloc, the way we described before. You can see, if I do this, I get the GPU that's available to me. I've requested one GPU, and that's all I need for this. miniWeather itself is an MPI code; I stripped out all the MPI from my version of it, just for simplicity, so you can focus on just the single-GPU performance.
If I ls here, this is the structure: I have five problems, which I just keep in text files, and for each problem I've provided a git patch file. So if you get stuck and you just want to see what my solution to that problem was, you can just do git apply on the solution patch, and that will basically solve it for you, so you can then move on to the tool part of the analysis.
In order to build this, you just run make. It requires having the PGI compiler in your environment — this is an OpenACC-accelerated code; I basically took Matt Norman's code and stripped out everything but the C OpenACC implementation — so it needs a PGI compiler. On Cori, you just do module load pgi. And again, if you're confused, ask me afterwards and I can run you through the workflow. So now I have a miniweather executable, and I can just run it with srun.
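The build-and-run sequence just described, as a guarded sketch (the module name and executable path assume the Cori setup from the talk):

```shell
# Only meaningful on an HPC system with environment modules and Slurm.
if command -v module >/dev/null 2>&1 && command -v srun >/dev/null 2>&1; then
  module load pgi          # PGI compiler needed for the OpenACC build
  make || true             # assumes the miniWeather fork's Makefile is present
  srun -n 1 ./miniweather || true
  built="attempted build and run"
else
  built="skipped (not on an HPC system with modules/Slurm)"
fi
echo "$built"
```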
Basically, nvtop — from the 30 seconds that I've looked at it — is a tool that shows you a time-graph representation of the utilization of the GPU. And if you run nvidia-smi, you get a number of fields here; it basically gives us a status report on the GPU. I know this is a diversion, but it's a useful thing: the status report tells you things like the temperature of the GPU,
its total potential power draw — 300 watts for a Volta V100 — and the current power draw, so it's basically idling at this point. This is the total amount of memory available, this is the current amount of memory allocated, and then this number here is the GPU utilization. It's a very crude measure of how heavily you're hitting the GPU: it says, basically, over my last one second of wall time, what fraction of the time was something running on the GPU. So this is 100%.
That basically says you're running a continuous workload on the GPU. You can often use that for your highest-level questions, like: am I even running on the GPU at all? That's a useful thing to check. And if I am using the GPU, am I only spending 2% of the time on the GPU, with the rest presumably on the CPU?
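For scripted checks of the utilization number just described, nvidia-smi also has a query mode (guarded sketch; the query fields shown are standard nvidia-smi options):

```shell
if command -v nvidia-smi >/dev/null 2>&1; then
  # Machine-readable utilization and memory, one CSV line per GPU.
  nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
  queried="yes"
else
  queried="no (nvidia-smi not present on this machine)"
fi
echo "queried: $queried"
```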
nvidia-smi has a couple of different modes. One of them is the daemon-style monitoring mode, where basically every second it prints out a bunch of stats like this, and I believe nvtop is essentially a wrapper that runs it in that mode and shows it in a nice graphical interface. So, if you want, you can load this up in a separate terminal window — or you can be even simpler about it with plain Linux.
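Two generic ways to get the rolling view just described (both assume the NVIDIA driver tools are installed; the block is guarded so it's harmless elsewhere):

```shell
if command -v nvidia-smi >/dev/null 2>&1; then
  # Option 1: the device-monitoring mode, one stats line per second.
  nvidia-smi dmon -c 3          # -c 3: stop after three samples
  # Option 2: "be simple with Linux" - re-run the normal status report:
  #   watch -n 1 nvidia-smi
  monitored="yes"
else
  monitored="no (nvidia-smi not present on this machine)"
fi
echo "monitored: $monitored"
```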
So that's one thing you can do. I don't know exactly how that works at NERSC, but in general it's something that can be done at HPC centers, and you typically don't want to be doing it that often, because it's just more load to deal with. But it is a way to get around the fact that, with Slurm, you only have one srun task going at a time — you can't srun your program and then also srun nvidia-smi at the same time.
So that's a trick I use to get around it. If that's not currently documented in the Cori docs, we'll make sure we describe that workflow in case you're curious. Okay, so that was an aside on nvtop and performance monitoring. But that's a very crude metric: it only tells you whether the GPU was active, essentially. We want better information — we want to know what actually happened on the GPU. So I ran miniWeather, and it gives me a counter of where I am in time.
It tells me what the time step is — this is a CFD grid code, so it advances a particular time step — and the first two numbers, nx_glob and nz_glob, are the number of zones. This is a two-dimensional grid code, so it's x zones by z zones, and there's a total of 800 zones in the current version of the code in this simulation. So what I'm going to do is run nsys profile on this application — nsys being the binary for Nsight Systems, as we discussed.
That will do two things for me. One, it will capture to disk a report file, which is an opaque binary thing recording all the events that Nsight Systems captured. And two, it will post-process that report and give me a list of everything that happened.
This can be a little verbose — you might want to pipe it to a file and then look at it — but basically it gives you several different sections on memory operations and kernels. Remember, kernels are the compute workloads. If I scroll up to the top, what I see are these sections, and the first section is the CUDA API.
The CUDA API is what the CPU calls to launch work or do things on the GPU, and it's broken down into memory allocations, memory transfers, and then what we call kernel launches, which actually launch the work on the GPU. It's ordered in descending order by amount of time. So what this row is telling me is that of the CPU calls into CUDA — not of the entire application run time, just of the parts of the CUDA API that were tracked —
98% was in this API call, cuMemHostAlloc, which I don't expect you to recognize, but which is basically memory allocation. This is the symbol that gets called when you do a cudaMalloc, or something like an OpenACC acc data create, or an OpenMP map — it's the memory allocator underneath. So, in terms of the CPU's view, this run is dominated by memory allocation. For kernels, it gives you a descending list of all the kernels that ran on the GPU.
OpenACC and OpenMP have a nice property: they tend to generate nice-looking kernel names that are the name of the function in the code — in this case set_halo_values_z — plus a line number in the code. That's super useful for locating where that loop was. If you're doing heavily templated C++ code,
the names are not as fun. Basically this tells you that, of the time spent running compute work on the GPU, about half was spent in this set_halo_values function. It tells you how much time in nanoseconds was spent there, how many times it was called, and the average, min, and max for that kernel.
These kernel percentages are of the kernel total, so if you sum them up you'll get a hundred percent. And then finally, down here, you have memory operations — these are transfers between the CPU and the GPU — and basically it tells you that about half of that time was spent in CPU-to-GPU transfers, which we call host to device or HtoD, and about half in the other direction. Now, if you added up these various sections in nanoseconds, you could make some inferences.
So I have a second terminal window open here, and what I'm going to do is use the workflow of scp-ing the file down to my laptop and then viewing it in my local viewer — a pretty common workflow. If I print my working directory, this is where I am on Cori, so I'm going to copy that path and then copy this file down to my laptop, and hopefully I will type things correctly.
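The copy-down step, sketched with placeholder user, host, and paths (adjust all three for your own account):

```shell
# Pull the report from the remote system to the local machine, then open it
# in the local Nsight Systems GUI (File > Open). All names are placeholders.
remote="user@cori.nersc.gov"
remote_path="/path/to/workdir/my_report.qdrep"
copy_cmd="scp ${remote}:${remote_path} ."
echo "$copy_cmd"
```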
The question was: does Nsight Systems replace nvprof? For those of you who have done GPU programming on NVIDIA hardware before, there is an existing profiling tool called nvprof and a paired graphical interface called the NVIDIA Visual Profiler, NVVP. The answer is yes: that tool is now in what I would call maintenance mode. We've essentially stopped active feature development as of the current Volta GPUs, and on the next-generation GPUs that will be in Perlmutter, Nsight will be the only supported profiling tool.
So that's why I'm showing you this tool. The old profiling tool works on the Cori GPUs — nvprof works, there are no bugs that I know about, and you're welcome to use it — but I'm intentionally showing you the tools that will be supported on Perlmutter. So I open up this .qdrep report file, and I get a timeline view like the following.
It's broken down into sections, as I said earlier. The top part here is the CPU workload: this black bar is a measure of load on the CPU. If the bar is at 100%, that basically means I'm hitting a CPU thread heavily, whereas here towards the beginning I'm just sort of starting to spin up my application.
The CUDA API is on this row, showing all the calls into CUDA. There's a big chunk here for cuMemHostAlloc, which lines up with what we saw before: of the time spent in the CUDA API, most of it is in that memory allocation. But now we can see that this is a pretty darn big chunk of the application run time as a whole.
The entire run time goes from the left bar here to the bar on the right, and it's 800 milliseconds. Hopefully the screen will come back up — I'm not sure, but I hope; I'm not connected to it directly, I'm just going through Zoom. Okay, sorry about that — for those on WebEx, we were just having screen-sharing difficulties.
So, of that 800 milliseconds, you can see a big chunk — I can do something like this to highlight a section and look at the timeline — and I see that it's about 300 milliseconds, almost half of my run time, spent in this memory allocation. Then the second part down here, the CUDA row, is where the actual GPU activity occurs. I can expand it, and it's broken down into kernels, which is the compute, and memory, which is the memory transfers.
And if you have sharp eyes, you can see that all of the compute workload is right there, in that one little part of the application — and that's pretty tiny. That means this application is not spending much time on the GPU, and the main message I would take away from looking at this profile — and this is basically exercise one in my tutorial — is that this is not a workload that works well on the GPU. Even if every one of those kernels is fully optimized,
even if it cannot be better — you've written the best possible CUDA — it still does not make sense to run this on the GPU. This workload will be faster on the CPU, for sure, and it comes down to two things. One: it takes time to spin up CUDA, or the GPU in general — something like half a second to one second to actually get everything loaded onto the GPU.
Two: memory allocation is really expensive on GPUs. There are fundamental hardware reasons why that's true, and so it is much more expensive to allocate memory on GPUs than you're familiar with on CPUs. So if you're dwarfed by the amount of time it takes to allocate the memory, and you only spend a tiny amount of time computing with it, that's not a good use of your time. So think carefully for a second: what is the answer, then? How do I make this application run well on GPUs?
Running on the CPU, right — and that is the answer for this workload; this would be faster on the CPU. But the trick answer to that question is: make the problem bigger. This is my fundamental argument, that you should not — cannot — run small workloads on the GPU. In almost every part of science that I can think of, there is a way to make your problem more expensive by giving it more work and having it be higher fidelity. If it's a grid code, you give it more zones.
That's more work, but it's usually a higher-resolution, higher-fidelity computation. If you're doing molecular dynamics, you add more atoms. There are all sorts of things you can typically do to add more work, at the cost of making the computation more expensive, but also getting higher accuracy and higher fidelity. And that is what you do for GPUs: you do not run this version of the problem.
So the question is: if the number of GPU cores stays fixed, how can adding more work make it any faster? It's a great question, and the answer is basically this: I told you before that there are 800 zones to work on, so that's essentially 800 degrees of freedom in the application, and the fact is that NVIDIA GPUs can have a hundred thousand threads resident at one time. So we're not actually using all the cores right now. That's how we make this faster.
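The arithmetic behind that statement, using the numbers from the talk:

```shell
zones=800         # degrees of freedom in this miniWeather run (from the talk)
resident=100000   # threads an NVIDIA GPU can keep resident (order of magnitude)
# Fraction of the GPU's resident-thread capacity this problem can occupy:
pct=$(awk -v z="$zones" -v r="$resident" 'BEGIN { printf "%.1f", 100 * z / r }')
echo "occupied fraction: ${pct}%"
```

With under one percent of the potential threads active, the GPU has almost nothing with which to hide latency, which is exactly what the Speed of Light numbers later in the demo show.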
The rule of thumb is: if you have a hundred thousand to a million degrees of freedom in your application, you stand a good chance of saturating the GPU's compute capability. For anything below that, you're not using the GPU well — get up and go home. Well, don't go home; make your problem bigger.
Right, absolutely — the comment was that you can run into memory issues, and that's right: you might make a problem so big that it no longer even fits on the GPU. I agree that can happen, but for something around a hundred thousand degrees of freedom, you can usually fit that into memory, kind of generically, across all of the science applications I know.
A
Okay
right
so
so
the
question
was:
does
has
this
actually
been
tested
in
ask
I'm
promising
that
it
is
because
I
am
literally
doing
this
at
nurse
right
now
on
the
core
GP,
you
know
it
so
I
promise
it
works.
That
said,
there
is
an
issue
where
this
stats
equals.
True
option
was
added
in,
like
somewhat
recent
version
of
n
site
systems
and
the
CUDA
module
determines,
which
version
you
get.
So
if
I
do
module
list,
you
see
I
have
cuda
10.2
to
89.
A
A
A
For those following on the WebEx — on the Zoom, sorry — I am hovering over a kernel, and it gives me information. I highlight one somewhat at random, and I'm going to say that's the one that now takes the most time. In this case I happened to highlight a compute_tendencies_x kernel, so let's pretend that compute_tendencies_x is the most important kernel now that you've done your refactoring.
You can see this is already taking a lot longer than it used to, because each kernel is being run 17 times — that's the number on the right — so every instance of that kernel is being profiled, 17 times, and it's actually going to take a little bit longer. And it happens to be the case that there are actually two functions with compute_tendencies_x in their names, so it profiled both of them.
So, to make life simpler — and because I have one minute left — I'm going to intentionally profile only one kernel. The -c 1 option means profile only one instance of that kernel, and then I'm going to store this in a file, so I'll call it miniweather. You now see that instead of giving me standard output, it has created a file, and it has this extension — which, again, is a mouthful, sorry. I'm going to copy that down — and I'm going to go a couple of minutes over; I hope
nobody gets mad at me, just a couple of minutes. I copy that down, and now I have my report file. I open up the Nsight Compute user interface, which I've already pre-opened — it looks like this. You go to File, Open File (don't get confused by Open Project), and open my miniweather .nsight-cuprof-report file. And so finally, this is the view that I showed you a screenshot of half an hour ago. It gives me several sections with many different levels of analysis, and I'm just going to stay on this one.
For the sake of time, this is the GPU Speed of Light section that I showed you, and basically, if I look at these bar graphs, they tell me that I am using one percent of my peak memory bandwidth and one percent of my peak compute. So is that a good use of my GPU? No — that's not a great use of the GPU.
A
This is a latency-bound kernel. Latency bound is generally when you're not memory bandwidth bound and you're not compute bound; you're bound by latency. Remember I said that GPUs are latency-hiding processors: any one operation is very high latency, hundreds of cycles to go to global memory, to DRAM, but we can hide that by having lots of work ready to go, so that at any one clock cycle, the work that is ready to go can go.
A
This thing only has 800 zones of work to do in this problem, but we have a hundred thousand threads that we could be running, so like 0.1 percent... no, one percent of my potential threads are active. And so it kind of makes sense that I'm only getting something like one percent of the peak compute or memory bandwidth, right? I can never achieve the peak compute performance, because that peak performance relies on having enough threads going to hide latency.
A
And so when you see something like this, you know you're latency bound, and in this particular case the answer is: add more work, right? Make the grid bigger. That's the answer for this problem, but sometimes there are other limiters, and the tool will, I think, help guide you through that process. So that's the very high-level overview. I just want to let you know that these tools exist and that you should make them part of your workflow.
A
The most important thing to do when you start running on the GPU is profile, right? In fact, before you even get on the GPU, you should probably use Nsight Systems to collect a profile of your application and have that be the baseline, so that when you start putting things on the GPU, you can see: did it get faster or slower? And I'm going to warn you: the first time, it's going to get slower, right? Because you allocated memory
A
that took too long, or you're doing too much memory transfer, right? So don't get bummed; that's a normal part of the process. Profile it, see what your bottleneck is, and then work to eliminate that, whether by exposing more parallelism, or doing things like using memory more effectively, or putting more work in a row so that memory doesn't have to transfer back and forth.
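The baseline-then-compare workflow can be sketched with Nsight Systems; the executable name is a placeholder, and older toolkit versions write `.qdrep` reports rather than `.nsys-rep`.

```shell
# Collect a timeline before any GPU work exists, so you have a
# CPU-only baseline to compare each porting step against.
nsys profile -o baseline ./myapp

# Summarize the report on the command line: time per function,
# per kernel, and per memory transfer.
nsys stats baseline.nsys-rep
```
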
C
A
I will say that there's a light at the end of the tunnel for that, which we can discuss offline, but for now it's basically whatever name the compiler generates, and as long as you can find a substring of that, you can give it to -k. But it may be a pretty gnarly name. Kokkos is famous for generating kernel names that are literally 2000 characters long and, like, breaking some tools.
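For example, Nsight Compute's `-k` flag accepts a kernel name or, with the `regex:` prefix, a pattern, which helps when the mangled name is enormous; the substring and binary name here are only illustrative.

```shell
# Match any kernel whose (possibly huge, mangled) name contains the
# substring "tendencies"; quoting protects the pattern from the shell.
ncu -k "regex:tendencies" -c 1 -o report ./myapp
```
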
B
A
So the question was: do the Nsight tools work on other platforms, like Python, for example? The answer is yes. The nice thing about the NVIDIA platform is that all of the programming models go through the same underlying level of CUDA, and so everything generates CUDA kernels when it does work, right? A CUDA kernel can be generated from CUDA C, or it can be generated from CUDA Python or Numba or CuPy, and so every tool can profile every program; anything that runs on NVIDIA GPUs can be profiled this way.
A
That said, not all of the integration will be the same, so it may be easier to correlate lines of source code from C in the profiler than it would be from Python. So that part is different, maybe, but the underlying concept of being able to look at a timeline view and target particular kernels with Nsight Compute can be done independent of programming language. Yes.
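Since everything bottoms out in CUDA kernels, the collectors can wrap a Python interpreter just like a compiled binary; the script and kernel names here are made up for illustration.

```shell
# Timeline of a Python program that launches CUDA kernels through
# Numba, CuPy, or similar.
nsys profile -o numba_run python my_script.py

# Drill into one kernel instance from that same program with
# Nsight Compute.
ncu -c 1 -k my_kernel python my_script.py
```
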
C
A
But that is a free program. It's just that somebody on our marketing team decided that they wanted to collect that information, so you do have to join the developer program to do it. The other thing is: we have Nsight Systems installed on Cori, so if you had VNC or X forwarding, you could go that path by just loading the user interface remotely. If you're very close to the system, that can be okay, can be tolerable.
A
If you're halfway across the country, the X forwarding is pretty rough for these applications, and I'd recommend just biting the bullet: scp-ing it down to your system, registering for the relevant program, downloading the tool, and then running it. It's a free download; you just need to register an account. That's right!
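The bite-the-bullet workflow can be sketched as follows; the hostname and paths are placeholders, and the report extension depends on your toolkit version.

```shell
# Generate the report on the remote GPU system, then pull it down and
# open it in a locally installed copy of the Nsight Compute GUI.
scp user@remote-gpu-system:runs/miniweather.ncu-rep .
ncu-ui miniweather.ncu-rep
```
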