From YouTube: 06 - Performance Tools
Description
Part of the NERSC New User Training on September 28, 2022.
Please see https://www.nersc.gov/users/training/events/new-user-training-sept2022/ for the training day agenda and presentation slides.
Well, good afternoon-ish, everyone. I'll be giving a talk about performance tools, in particular the ones that we are going to look at today. Just to briefly remind everyone, as you might have already heard throughout the talks, Cori is going to be decommissioned shortly, and so we want all our users, new and old, to start working on Perlmutter as soon as possible and to start migrating to Perlmutter. Today we are going to look at the performance tools that are available on Perlmutter. Note that there are two major omissions in this list, namely Intel Advisor and Intel VTune; we no longer have those because the architecture on Perlmutter is different from Cori's. The entire list of all the available performance tools can be found at the links in the docs.
Primarily, we have seen users get interested in using perftools, which is provided by Cray; NVIDIA's Nsight Systems and Nsight Compute; and Arm MAP and Arm Performance Reports, which come from the same Arm suite as the DDT debugger. To measure I/O performance we have Darshan, an I/O profiler, and there is Timemory, a very high-fidelity profiling toolkit. It is not a profiling tool;
it is a profiling toolkit which can be leveraged to measure whatever performance metrics users are interested in. For today, though, we are going to look at a brief primer on how to use the CrayPat perftools, perftools-lite, and Nsight Systems and Nsight Compute. Note that the Cray perftools can be used with the compiler wrappers: for the supported compilers, you can use the MPI wrappers for MPI codes, you have the Cray compilers, and Fortran is supported through the ftn wrapper. The Nsight tools are used exclusively for GPU profiling, and they support these compilers as well as Python.
To start off the talk, we'll take a look at the CrayPat profiling tools; in particular, we'll be looking at perftools. Do note that CrayPat is specifically for use on Cray machines, and the results that are generated are mostly text-based. However, there is Apprentice2, which can be used as a GUI to look at the results.
The module that is a prerequisite for perftools is perftools-base, which must be loaded before loading perftools or perftools-lite; on Perlmutter it is loaded by default.
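As a quick check, and assuming the standard module commands on Perlmutter, you can confirm this before loading anything else:

    # perftools-base should already be loaded by default
    module list 2>&1 | grep perftools-base

    # If it is missing for some reason, load it before perftools or perftools-lite
    module load perftools-base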
perftools is the full suite, while perftools-lite, as the name suggests, is easy to use and is meant for quick analysis; many times it is adequate. So let's take a brief look at how we use perftools-lite. For this presentation we are using a Jacobi solver written in Fortran with MPI and OpenMP support.
One of the requirements of using perftools is that the code must be run in scratch, or you must set an environment variable to run it from elsewhere. Second, the object files generated while compiling the code should be created in a separate step and must remain present, because that is how the analysis can give you plots and charts to visualize the result. As I mentioned, you can use app2, which is Apprentice2, to view those; we recommend that you use NX (NoMachine) or, if you are in a pinch and in a hurry, launch a terminal with X11 forwarding. To profile your code with perftools-lite, these are the steps. The first step is to unload darshan and xalt, as they conflict with the metric collection.
You then load perftools-lite. Following that, we use the programming environment that contains the compilers we are looking at, and we compile the Jacobi solver in two steps: in the first step we generate the object files, and in the second step we use those object files to generate the executable.
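Put together, the steps look roughly like the sketch below; the OpenMP flag and the source and executable names are placeholders rather than the exact ones used in the demo:

    # Remove modules that conflict with CrayPat metric collection
    module unload darshan xalt

    # Load the lightweight instrumentation; perftools-base is already loaded by default
    module load perftools-lite

    # Step 1: generate the object files with the Fortran compiler wrapper
    # (use the OpenMP flag appropriate for your programming environment)
    ftn -fopenmp -c jacobi_mpiomp.f90

    # Step 2: link the object files; perftools-lite instruments the resulting executable
    ftn -fopenmp -o jacobi_mpiomp jacobi_mpiomp.o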
When you compile this code, two executables are generated. One has the name you gave it and already contains the profiling instrumentation that perftools needs to work. The other executable is called jacobi_mpiomp+orig, where "+orig" stands for the original compile; it does not generate any perftools-lite information when you run it.
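Running the instrumented binary is then an ordinary job launch from scratch; the rank and thread counts here are only illustrative:

    # perftools-lite expects to run from the scratch file system
    cd $SCRATCH/jacobi

    # Launch as usual; the text report is printed to stdout when the run finishes
    export OMP_NUM_THREADS=4
    srun -n 8 -c 4 ./jacobi_mpiomp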
The text-based results generated by perftools-lite look like this. The first chart it provides gives you the number of ranks and the amount of resources you used, along with information about the architecture on which you are running. Another important thing to notice is that we get the I/O information straight from a visible chart, without any additional modification to the code.
The units of the sample time are hundredths of a second, and the chart shows you the most time-consuming kernels or loops. In this particular example, the jacobi_mpiomp loop on line 61 takes around 53.3 percent of the sample time, and the next most computationally expensive part is the compute_diff loop, which starts on line 261. The information has to be read in this manner.
You will also notice that it gives you MPI information, which is pretty handy: if your code needs optimization in terms of MPI, you can use perftools very easily, with little to no modification to the code, and get the information you need to optimize it.
The next table shows you line numbers along with the function, so it is pretty much the same as Table 1, except that it also tells you which file the function or loop belongs to. Here, within the jacobi_mpiomp loop on line 61, the loop is broken down so you can see the two most time-consuming parts of this particular loop: lines 63 and 66 are the computationally dominant lines in this function. Similarly, we see the breakdown for the compute_diff and MPI functions. This is another way of seeing how much time is required, because it may turn out that some of the computationally most dominant kernels are not the ones requiring the maximum amount of time.
Most of the time they are, but if the code is run differently or there are bottlenecks somewhere else, Table 1 and Table 3 will tend to look different. Table 4 shows power consumption; this is useful for applications where you need to gauge how much power was required to complete the runs. Finally, it generates two more tables, Table 5 and Table 6. We have not shown Table 5, which reports
the average time taken and the number of bytes read from a file; Table 6 similarly reports the average time and the number of bytes written to a file. Since this code does not read anything from a file, it does not generate Table 5; it generates the Table 6 information, and since everything is written out to stdout, it shows the write speed and the average number of bytes written. Beyond perftools-lite, a similar analysis for GPU workloads is supported through perftools-lite-gpu.
It also supports loop analysis specifically, through perftools-lite-loops. The table looks very similar; here I am showing an example of a CUDA-aware MPI code that was profiled using perftools-lite-gpu. The rest of the information looks very similar, although in this case you get information directly from CUDA regarding the kernel launches and memory copies.
Do note, however, that this is just a brief primer, and we are not delving deep into the details of each of the performance tools that we have available.
A more in-depth analysis of your code can be performed using perftools. The steps are very similar to those for perftools-lite: again, the code must be run in scratch, and you follow the same steps that you followed for perftools-lite.
However, there is a bit of a difference. After requesting an interactive node, you take the executable that was generated previously and run pat_build on it to get the detailed information out; it generates a new executable called jacobi_mpi+pat. Before you run it, note that there is also a particular flag to be aware of.
If you are using perftools to analyze a GPU MPI code, you have to add this flag, -g mpi, along with the name of the executable. The syntax is very similar to what we use to run pat_build for a CPU code; you just add the -g flag to get it to work for a GPU code.
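As a rough sketch of that step, with the module swap and trace group following the description above and a placeholder executable name:

    # Swap to the full perftools suite
    module unload perftools-lite
    module load perftools

    # Instrument the previously built executable; -g mpi adds the MPI trace group
    pat_build -g mpi jacobi_mpi
    # This writes a new, instrumented executable named jacobi_mpi+pat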
Once you run the executable with the +pat name, it generates .xf files in the data directory for the run. You then have to convert these .xf files into the app2 format, because we use Apprentice2 to read them. The command is as follows: you run pat_report with the -f flag to format the data for app2, and once that is done, you can launch app2 on the result.
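A hedged sketch of the conversion and viewing steps; the name of the .xf data directory is hypothetical, since CrayPat names it per run:

    # Convert the raw .xf data to the .ap2 format that Apprentice2 reads
    pat_report -f ap2 jacobi_mpi+pat+<run-id>

    # Launch Apprentice2 on the converted result (needs NX or X11 forwarding)
    app2 jacobi_mpi+pat+<run-id>.ap2 &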
app2 opens a window like this, and each tile within the window gives you specific information. The code profiled here is the Jacobi solver that we looked at for perftools-lite, and you can see that it gives you a breakdown of the runtime of each function as a pie chart. It also gives you a flowchart of the code, or the flow of the code, which is really nice if you want to use it for documentation or if you need it for refactoring.
It also shows which directives are running; for example, over here the yellow part is the OpenMP part, and we can see, as a breakdown out of 100 percent, what percentage of time on which rank was used for what purpose. That is really detailed information which can be used to improve your code. One of the tiles also gives you communication information as a mosaic.
This mosaic shows the time taken for communication from each source rank to each destination rank, and you can improve the MPI communication time by analyzing your code using it. Shifting gears, we now look at Nsight Systems, which is a profiling tool provided by NVIDIA for GPU workloads.
Nsight Systems is again a low-overhead profiler, analogous to perftools-lite, while Nsight Compute is more like the full-featured perftools. It provides a broad description of a GPU-based application.
The only module required is cuda or cudatoolkit, and it supports a variety of applications written in CUDA, Kokkos, OpenMP, OpenACC, or Python, but they must run on a GPU and the application must be compiled using the GPU libraries. Here we will be discussing an OpenMP offload based application, which was compiled using the clang++ (LLVM) compiler, and again the code was run in scratch.
To visualize the results, you again have to use NX (NoMachine), or you can use X window forwarding; we do not recommend X window forwarding because it tends to be extremely slow. Instead, what you can do is download the files generated as part of the profile, install Nsight Systems and Nsight Compute, which are available for free, on your local machine, and then analyze your code using those.
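For example, to pull the generated reports back to a workstation where the free Nsight GUIs are installed (the host, path, and report names below are placeholders):

    # Run from your local machine: copy the profiler output out of scratch
    scp <user>@<perlmutter-login>:<scratch-run-dir>/report.qdrep .
    scp <user>@<perlmutter-login>:<scratch-run-dir>/case_one.ncu-rep .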
The run steps are very similar. Again, you compile your code, you request an interactive node, and then, to profile the code, all you have to do is add nsys profile --stats=true in front of the run command. This gives you information similar to what you just saw for perftools-lite: it generates CUDA API statistics so you can understand how the code can be improved and where the bottlenecks are. In particular, it shows you the most dominant kernels in your workflow.
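A minimal sketch of that profiling launch, with a placeholder executable name and illustrative srun options:

    # cudatoolkit provides the nsys command
    module load cudatoolkit

    # Prefix the usual launch with nsys; --stats=true prints the summary tables to stdout
    srun -n 1 --gpus=1 nsys profile --stats=true -o jacobi_offload_report ./jacobi_offload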
Do note, however, that if you are using OpenMP offload or any other GPU-based API, the kernel names may be mangled, so the full name of the function shown in this table might be an omp_offloading prefix with a bunch of letters, followed by the name of the function and the line it is located on.
In this particular example, 61 percent of our time is taken by the compute_yi kernel, which is on line 471, and this kind of information is very beneficial for improving the runtime of the code.
You can also improve your code quite a bit by making sure that your CUDA memcpy transfers from host to device and device to host are kept to a minimum. Here we see the total time taken just by CUDA memcpy from host to device, and the overall objective is to lower that time; you can do that by lowering the number of data transfers between host and device.
Once you have the report, you can use nsys-ui, which is a GUI, to analyze the trace of your code, and this is a very useful feature. Once you load the profiling report that was generated, the .qdrep file, you can zoom into particular parts of the runtime trace by selecting a brief time window. Here we can see that in this particular instance, when the CUDA API is running a host-to-device copy, we get this on our timeline.
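For reference, launching the GUI on the report looks roughly like this, either inside a NoMachine session or locally after copying the report; the file name is a placeholder:

    # Open the Nsight Systems GUI on the generated report
    nsys-ui jacobi_offload_report.qdrep &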
There are some aspects of the code which you can improve just by looking at the trace. Here we see that there are gaps: this is an OMP offload function followed by another function, but there is a gap between them, and our objective, just by looking at the trace, would be to figure out
why there are gaps and what the CPU is doing at that point. Here we also get a CPU trace, so you can compare the GPU trace against the CPU trace and figure out how to improve your runtime just by eliminating these gaps, which are idle times on both host and device. It also provides more information about the resources used: if you select the Events View at the bottom of the window and select a function, it will show you the launch statistics, such as the theoretical occupancy for that particular kernel, which threads were launched, and plenty of other information.
Nsight Compute, as I pointed out, provides a more detailed analysis, and in this demonstration we will do a pretty interesting comparison.
We have a code which we have improved just by adding a collapse clause to a loop, and we want to understand how much improvement we get just by adding this collapse clause to the OMP parallel for pragma. This is a nested loop, and the two for loops are now being collapsed. We will run our profiling step with Nsight Compute twice: the first time we run the baseline code, and the second time, separately, we run the optimized code, and we gather the reports for both of them. To get the reports, we can run ncu -o followed by the name of the report file it will generate.
So you can call the profiles, say, case_one and case_two, and you have to use --set full in order to also get the information about memory transfers. The key thing to note here is that you have to switch the dcgmi profiler to pause, because a profiling step is already running on Perlmutter by default; if you do not pause it, ncu will give you an error. Users who are interested in using Nsight Compute should remember this command: dcgmi profile --pause.
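Putting the two profiling runs together, a rough sketch in which the report names, executable names, and srun options are illustrative:

    # Pause the system-wide DCGM profiling first, otherwise ncu reports an error
    dcgmi profile --pause

    # Baseline run
    srun -n 1 --gpus=1 ncu -o case_one --set full ./jacobi_offload_baseline

    # Run with the collapse clause added
    srun -n 1 --gpus=1 ncu -o case_two --set full ./jacobi_offload_collapse

    # Resume DCGM monitoring when you are done
    dcgmi profile --resume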
This requirement is different from Cori. When you load the two report files, you can set one of them as the baseline; we have set case_one, shown in blue, as the baseline, and we see that, just by adding a single collapse clause to our kernel, our compute throughput as well as our memory throughput have increased tremendously, as has our runtime, which you can see in the comparison.
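The comparison itself happens in the Nsight Compute GUI; roughly, with the hypothetical report names from above:

    # Open both reports, then use "Add Baseline" on case_one to diff the two profiles
    ncu-ui case_one.ncu-rep case_two.ncu-rep &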
A
The
duration
of
this
particular
function
has
improved
improved
by
97.
It's
97,
lower,
compute,
throughput
and
memory.
Throughput
have
improved
as
well,
and
the
reasons
are
also
provided
so
here
the
L1
cache
throughput
has
improved
by
10
and
the
L2
cash
throughput
has
improved
by
142
percent.
Similarly, there is the DRAM improvement. So in a single snapshot, using the Speed of Light analysis, you can understand the improvements in your code and make similar changes in order to improve the overall runtime.
It also provides a roofline analysis, so you can see that, once you make the change, your arithmetic intensity, which is the number of flops executed per byte of data transferred, increased, because we no longer have to move as much data to do the same calculation, and our performance, measured in flops per second, also improved. In a single chart you get this information. It also provides other useful information regarding the compute workload analysis and what sort of memory transfers were taking place.
Here it is somewhat a repeat of the information that we saw in the Speed of Light section, but it gives you a more detailed analysis, for example whether fused multiply-add operations increased or FP64 instructions increased as a result of the change. A more visual picture of how the data moves for this function is also provided, and this is a very beneficial feature: by adding the collapse clause we reduced the data transfer between device memory and the L2 cache by 95 percent, we reduced the back-and-forth transfers between the L2 and L1 caches by 93 percent, and our cache hit rate increased by 22 percent. All of this information is key to improving the overall performance of the code.
A
And,
finally,
you
can
also
figure
out
the
number
of
instructions
and
how
they
changed.
So
this
is
sort
of
a
very
deep
dive.
It
provides
you
additional
information
on
what
aspects
of
the
code
change
by
just
making
one
change
in
your
in
your
code
with
that.
I
conclude
my
talk
I'd
like
to
thank
you
for
attending
this,
and
we
are
glad
to
have
all
the
new
users
and
all
the
users
migrating
to
Paul
matters.
Thank
you.