From YouTube: 5 - Analyzing CPU Applications
Description
Part of the Using HPCToolkit to Measure and Analyze the Performance of GPU-accelerated Applications Tutorial, Mar-Apr 2021. Slides available at https://www.nersc.gov/users/training/events/hpctoolkit-for-gpu-tutorial-mar-apr-2021/
A
Or, I think, you can also refer to them as, like, perf::cycles. Why would you add this? Well, if you just want to distinguish exactly what you're specifying, you might use this qualifier. But typically, if I'm using perf events, I'm just going to use the shorthand, like cycles and instructions.
A
Everybody's been doing the same thing, so everybody's circulating the information about what they have around the ring. So basically, it's like an all-to-all communication where everybody's telling everybody else, "Here's what I've got." And then you select out and say, "Oh, this is who's got my neighboring block, and so that's who I have to communicate with." And so we end up spending all this time on ...
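The ring exchange described here can be sketched as a small simulation. This is only an illustration of the pattern, not the application's actual code: the function names `ring_allgather` and `find_owner` and the block labels are hypothetical. It shows why the cost grows with the number of ranks, since every rank needs P - 1 steps just to learn who owns one neighboring block.

```python
def ring_allgather(local_blocks):
    """Simulate P ranks circulating their block descriptors around a ring.

    local_blocks[r] is the descriptor owned by rank r (hypothetical data).
    Returns, for each rank, a dict mapping owner rank -> descriptor.
    """
    p = len(local_blocks)
    # Each rank starts out knowing only its own block.
    known = [{r: local_blocks[r]} for r in range(p)]
    # The item each rank will forward in the next step.
    in_flight = [(r, local_blocks[r]) for r in range(p)]
    for _ in range(p - 1):
        # Every rank receives from its left neighbor in the ring.
        received = [in_flight[(r - 1) % p] for r in range(p)]
        for r, (owner, desc) in enumerate(received):
            known[r][owner] = desc
        # What was just received is forwarded in the following step.
        in_flight = received
    return known


def find_owner(known_on_rank, wanted_block):
    """Select out the one rank that holds the block we need."""
    for owner, desc in known_on_rank.items():
        if desc == wanted_block:
            return owner
    return None


blocks = ["b0", "b1", "b2", "b3"]
known = ring_allgather(blocks)
neighbor_owner = find_owner(known[0], "b2")
```

Every rank ends up holding all P descriptors even though it only needed one of them, which is the kind of overhead the profile is pointing at.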
A
So this worked well because everything is in a tree, because everything is rooted at flash, and I can go find where the losses are in the tree. If I'm not using the OMPT interface, and I have things broken up into master and thread root, then this top-down analysis doesn't work well on a tree, I'm sorry, on a forest.
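To illustrate why a single rooted tree makes this top-down search easy, here is a minimal sketch of a hot-path walk over a calling context tree. The `Node` class, the `hot_path` function, the 50% threshold, and the sample costs are all assumptions made for illustration; they are not taken from the tool itself.

```python
class Node:
    """One calling context with its own (exclusive) cost and its callees."""

    def __init__(self, name, exclusive=0.0, children=None):
        self.name = name
        self.exclusive = exclusive
        self.children = children or []

    def inclusive(self):
        # Inclusive cost = own cost plus the cost of everything called.
        return self.exclusive + sum(c.inclusive() for c in self.children)


def hot_path(root, threshold=0.5):
    """Follow the most expensive child while it holds at least
    `threshold` of its parent's inclusive cost."""
    path = [root.name]
    node = root
    while node.children:
        child = max(node.children, key=lambda c: c.inclusive())
        if child.inclusive() < threshold * node.inclusive():
            break
        path.append(child.name)
        node = child
    return path


# Hypothetical profile with a single root.
tree = Node("main", 1, [
    Node("solve", 2, [Node("kernel", 60)]),
    Node("io", 5),
])
path = hot_path(tree)
```

With a forest, the same walk has to be started from several roots, and the cost of one logical call path is split between them, so no single descent shows the whole picture.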
A
Spending a lot of time trying to figure out what was going on, because with LULESH it was just pointing into the workloads and saying, "You're spending time here." And so that looked good until you started looking at how many cycles I spent there and how many instructions I executed, and realized that actually the number of instructions I execute is pretty low. And so it was spending a fair amount of time clearing pages.
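The comparison of cycles against instructions can be sketched as a simple instructions-per-cycle check. The context names, the counter values, and the 0.5 cutoff below are made up for illustration and do not come from the LULESH measurements.

```python
def instructions_per_cycle(cycles, instructions):
    """Ratio of instructions retired to cycles spent in a context."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def flag_low_ipc(contexts, ipc_floor=0.5):
    """Return the contexts whose IPC falls below an assumed cutoff.

    contexts maps a context name to a (cycles, instructions) pair.
    """
    return [
        name
        for name, (cycles, instructions) in contexts.items()
        if instructions_per_cycle(cycles, instructions) < ipc_floor
    ]


# Hypothetical sample counts for two loops.
profile = {
    "compute_loop": (1000, 1500),
    "first_touch_loop": (1000, 100),
}
suspects = flag_low_ipc(profile)
```

A context that consumes many cycles but retires few instructions is a hint that the time is going somewhere other than the loop body itself, for example into the kernel clearing pages.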
A
We used a strategy where we would say, "I'm idle and I don't know why, and so let me just increment a counter saying that my thread is idle." And then, when other threads took samples and they were active, they would say, "Well, I have to assume some blame. How many of us are active and how many of us are idle? I'm going to take some blame for the idleness that's elsewhere." And so then what we're able to do ...
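The strategy just described can be sketched roughly as follows. This is a single-threaded simulation under stated assumptions: the `BlameShifter` class and its method names are hypothetical, and a real implementation would use an atomic counter shared between threads rather than a plain integer.

```python
class BlameShifter:
    """Toy model: idle threads bump a counter, active threads that take
    a sample accept a share of the idleness observed at that moment."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.idle_count = 0   # how many threads are currently idle
        self.work = {}        # samples attributed to each context
        self.blame = {}       # idleness blamed on each context

    def thread_becomes_idle(self):
        self.idle_count += 1

    def thread_becomes_active(self):
        self.idle_count -= 1

    def sample_active_thread(self, context):
        """Called when an active thread takes a sample in `context`."""
        active = self.num_threads - self.idle_count
        self.work[context] = self.work.get(context, 0) + 1
        if active > 0:
            # Split the current idleness evenly among the active threads.
            share = self.idle_count / active
            self.blame[context] = self.blame.get(context, 0.0) + share


shifter = BlameShifter(num_threads=4)
for _ in range(3):
    shifter.thread_becomes_idle()          # three threads wait
shifter.sample_active_thread("serial_init")
for _ in range(3):
    shifter.thread_becomes_active()        # everyone is working again
shifter.sample_active_thread("parallel_loop")
```

In this toy run the serial region picks up the blame for the three waiting threads, while the fully parallel region picks up none.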
A
So I think that this addresses the issue that you had, Steve, which is, you know, there is all this idle time, so what am I going to do about it? And so by attributing the OpenMP idleness to the code that's actually running, we're identifying regions of serial code or regions of code that have ...
J
Using in the Linux kernel.
J
I know that NVIDIA, at some point, were also working with it, but I don't know what their current plan is.
J
The mechanism is the AMD mechanism.