From YouTube: RustConf 2017 - Improving Rust Performance Through Profiling and Benchmarking by Steve Jenson
Description
Improving Rust Performance Through Profiling and Benchmarking by Steve Jenson
This talk will compare and contrast common industry tool support for profiling and debugging Rust applications. We'll discuss our experiences finding and fixing performance problems in a production Rust application.
So today we're going to talk about understanding the performance of Rust programs and libraries, we're going to talk about some Rust-specific pitfalls, and I'm going to talk about some tools you can use to see the performance of your Rust programs. Some of them you can even use today and install on your laptop.

I am an engineer; I work at Buoyant. We make — thanks to our design team for some great transitions — we make Linkerd. Linkerd is a cloud native service proxy.
We build what's called the service mesh, and here at Buoyant I work on some Rust and Scala. One of the things that we've built is called linkerd-tcp, which is a TCP proxy. It's written in Rust, and it's designed to work in cloud native environments like Kubernetes or DC/OS. We have more protocols coming: right now it's TCP, but we're adding HTTP. Like, just today we announced that Carl open sourced the h2 repo on GitHub — you should take a look at that. linkerd-tcp, just like Linkerd, is Apache licensed and free for everyone.
One of the things that we've built to enable us to build linkerd-tcp is a thing called tacho, which is a stats library: it allows you to instrument your applications. As a stats library it has some basic features. It has counters, so how many times something happens; it has timings, so how long something takes to occur — you know, for some bit of code to run; and then gauges, which are values at a point in time. This is also Apache licensed.
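To make that concrete, here's a minimal sketch of that style of instrumentation — the names are illustrative, not tacho's actual API:

    use std::time::Instant;

    struct Counter(u64);
    struct Gauge(u64);

    impl Counter {
        // A counter: how many times something happens.
        fn incr(&mut self) { self.0 += 1; }
    }

    impl Gauge {
        // A gauge: a value at a point in time.
        fn set(&mut self, v: u64) { self.0 = v; }
    }

    fn main() {
        let mut requests = Counter(0);
        let mut queue_depth = Gauge(0);

        // A timing: how long some bit of code takes to run.
        let start = Instant::now();
        requests.incr();
        queue_depth.set(42);
        println!("requests={} depth={} in {:?}",
                 requests.0, queue_depth.0, start.elapsed());
    }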
I'm going to use tacho — one of our macro benchmarks — as sort of a driver for the talk and the topics that I'm here to discuss today. Here is a macro benchmark that I wrote, and I'm going to talk about what macro benchmarks are here in a second, but I wanted to give you a sense of what tacho is before we start, you know, digging into it and looking at graphs of performance and charts and that sort of thing.
So here is a multi-threaded macro benchmark. Here you can see we have a timing: we measure how long it takes to do an action, and we have a loop where we do this about 10 million times inside of a single program — this macro benchmark driver we call multithread. And what do we do here? We set the current iteration, we increment a counter; this is the basic work we do, and we do it ten million times.
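Roughly, the shape of that driver's work looks like this — an illustrative sketch of the loop, not the real benchmark code:

    use std::time::Instant;

    const ITERATIONS: u64 = 10_000_000;

    fn main() {
        let mut counter = 0u64; // stand-in for a stats counter
        let mut gauge = 0u64;   // stand-in for a stats gauge

        let start = Instant::now();
        for i in 0..ITERATIONS {
            gauge = i;    // set the current iteration
            counter += 1; // increment the counter
        }
        println!("{} iterations (counter={}, gauge={}) in {:?}",
                 ITERATIONS, counter, gauge, start.elapsed());
    }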
In native programs there are generally three causes of performance problems: memory stalls, which is, you know, talking to DRAM, essentially; lock contention; and CPU utilization. When we talk about memory stalls, really what we're talking about is the memory hierarchy. These aren't exact numbers — these numbers are here to give you a sense of how the order of magnitude changes when you move up levels when you're talking to memory. So a register takes like half a nanosecond to talk to; your last-level
cache is going to be like 10 nanoseconds. If you remember your CS: you've got an L1 cache, you've got an L2 cache, and then sometimes you have an L3 cache, and each layer of the cache is larger and slower and cheaper — well, they're the same cost, but anyway. So again: a register in half a nanosecond, the last-level cache in ten, and then talking to DRAM is going to be about a hundred nanoseconds. So that's just a quick overview of the memory hierarchy to remind yourselves. Then we have lock contention — things like spin loops and blocking waits.
These are things that can cause performance problems in your program, and we'll see one of these here in a moment. And then CPU utilization. The reason I want to talk about CPU utilization last is because it generally hides a lot of the performance problems. CPU utilization can hide the other two things that I talked about: it can hide memory latency behind a slow instruction — it looks the same. What am I trying to say? For a given instruction,
you can't tell if the instruction is slow or if you're talking to memory when you're looking at it, is what I'm trying to say here. And an instruction — when I say instruction, I really should say a function here — can hide lock contention, like if you have a spin loop. Idleness is often counted as useful work: you look at top or htop and you'll see,
oh nice, I'm using 90% of my CPU — but it can also mean you're spending 80 percent of your time waiting for RAM or disk. And now I want to talk a little bit about some Rust-specific pitfalls.
Yeah, here's a big one: deriving Copy on large structs. Copy is great — Copy makes your programs much more ergonomic; you know, it can be a real lifesaver. But if you find yourself overusing it, you can copy accidentally, without meaning to. The great thing about Clone is it's explicit, but Copy is implicit.
So if you find yourself, you know, implementing Copy, you can find yourself killing your DRAM bandwidth, and the most common reason is: it was small when I started. Here's an example, where you can see someone might make a Person struct where they just have an id — an int — and a reference to a string, and then somebody decided, well, let's put the whole user's DNA in there, and that's 800 megabytes.
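A rough reconstruction of that example (the field names are illustrative):

    // Fine when Person was two small fields; once `dna` showed up,
    // every duplication drags hundreds of megabytes through DRAM.
    // (Vec itself can't be Copy, but the same trap applies to any
    // large by-value data on a struct that used to be cheap to copy.)
    #[derive(Clone)]
    struct Person<'a> {
        id: u64,
        name: &'a str,
        dna: Vec<u8>, // ~800 MB per user
    }

    fn main() {
        let p = Person { id: 1, name: "ada", dna: vec![0u8; 8] };
        let q = p.clone(); // explicit, so it shows up in review and profiles
        assert_eq!(p.id, q.id);
    }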
A
It
should
be
a
reference
or
maybe
not
there
at
all,
and
then
again
you
know
we're
talking
about
where
time
a
copy.
You
know
we
can
talk
about
clone,
doing
clone
a
little
bit,
something
that
can
kill
performance
through
through
saturating
your
DRAM
bandwidth.
The
nice
thing
about
clone
is,
is
explicit,
you'll
see
it
in
a
profile,
you'll
see
clone
being
being
there
at
the
very
top.
Here's just a sort of contrived example of going through a people vector and pushing clones into another friends vector. And then, something that pretty much everyone in this room is familiar with: the standard library's default hasher is cryptographically strong, and so it can be a little slow. It's a well-known trade-off to, you know, experienced Rust programmers; it's really surprising to brand-new Rust programmers.
They maybe aren't used to this — they haven't seen this before. The nice thing is, there are lots of great alternatives for using different hashers, and there's a good page with benchmarks — Rust-specific benchmarks — for these different hashers. And here's an example of using a different hasher, one that's a little bit faster: here we use an FNV hash. There's no reason to show the slow version — everyone knows what it's like to make a HashMap — so I decided to show what it's like to make a slightly faster one.
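For example, with the fnv crate (assuming `fnv = "1.0"` in Cargo.toml), swapping in the faster hasher looks something like this:

    use fnv::FnvHashMap;

    fn main() {
        // Same HashMap API, but with the FNV hash function instead of
        // the default (slower, DoS-resistant) SipHash.
        let mut counts: FnvHashMap<&str, u64> = FnvHashMap::default();
        *counts.entry("requests").or_insert(0) += 1;
        println!("{:?}", counts);
    }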
Again, it depends entirely on your use case; you can pick from one of the great alternatives. So something else that can bite us is using expensive arguments in places where maybe we don't expect the code to be evaluated. expect is a good one: like, you often expect your code — the reason it's called expect — to work, but you can put something slow in expect. For instance, this was pulled out of one of Eliza's
libraries. It ended up being in the hot path, and so this format! was called, you know, ten thousand — a bazillion — times every second. We pulled it out, did something a little cheaper, and it sped up the program dramatically. Again, this isn't specific to expect; it has to do with the fact that Rust is eager, and you just have to remember the evaluation order of your program.
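A small sketch of the pitfall and a cheaper alternative (the message and values are made up; the methods are std):

    fn hot_path(val: Option<u32>, id: u64) -> u32 {
        // Eager: the format! runs on every call, even when val is Some.
        val.expect(&format!("no value for request {}", id))
    }

    fn cheaper(val: Option<u32>, id: u64) -> u32 {
        // Lazy: the closure only runs in the failure case.
        val.unwrap_or_else(|| panic!("no value for request {}", id))
    }

    fn main() {
        assert_eq!(hot_path(Some(7), 1), 7);
        assert_eq!(cheaper(Some(7), 2), 7);
    }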
Last, I want to talk about preallocating Vecs.
You know, when you look at experienced Rust programmers' programs, you see that they preallocate — you know, Vecs, everywhere; like, they hardly ever allocate. And here's an example of where we do this in linkerd-tcp: we have a structure with a buffer size and a default size, and the highlighted line is us making the buffer that we use to shuttle data around from, you know, inbound to outbound.
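The pattern, in a minimal sketch (the size constant is illustrative, not linkerd-tcp's real default):

    const DEFAULT_BUFFER_SIZE: usize = 64 * 1024;

    fn main() {
        // One allocation up front; filling the buffer won't trigger
        // repeated grow-and-copy reallocations.
        let mut buf: Vec<u8> = Vec::with_capacity(DEFAULT_BUFFER_SIZE);
        buf.resize(DEFAULT_BUFFER_SIZE, 0);
        assert!(buf.capacity() >= DEFAULT_BUFFER_SIZE);
    }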
And now I want to talk about some of the tools that you can use. I'm going to talk about some tools on the Mac, like Instruments, cargo bench, and cargo-benchcmp, and also some Linux tools that we use — because our primary deployment target is Linux — where we use perf and flame graphs, VTune, and again also cargo bench and cargo-benchcmp. So cargo bench is a micro benchmarking tool; I'm sure many of you use it. It's part of the standard toolset, and it's really great for what you consider, like, the highly used parts of your API.
You should consider writing micro benchmarks for those. Here's an example of a micro benchmark using cargo bench that we have in tacho, where we benchmark how long it takes to make a new counter.
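For reference, a cargo bench micro benchmark looks something like this (nightly-only at the time of this talk; the benchmarked body is a stand-in, not tacho's real constructor):

    #![feature(test)]
    extern crate test;

    use test::Bencher;

    #[bench]
    fn new_counter(b: &mut Bencher) {
        // cargo bench runs the closure many times and reports ns/iter;
        // black_box keeps the work from being optimized away.
        b.iter(|| test::black_box(0u64) + 1);
    }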
And the output will look like this — here it's mixed in with a bunch of other benchmarks we've written. Here you can see I've highlighted 23 nanoseconds and 47 nanoseconds, and those numbers look great.
Everyone would probably be happy with those numbers — unless you're really paying attention. cargo-benchcmp is something that's relatively new; it allows you to compare two cargo bench runs. This is really great for avoiding performance regressions: you know, someone ships you a PR, you look at the PR on GitHub, you try it yourself, and you can do a textual comparison of what cargo bench outputs and see whether this does or doesn't help performance in this particular example.
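The workflow is roughly this (cargo-benchcmp installs as a cargo subcommand via `cargo install cargo-benchcmp`):

    cargo bench > before.txt
    # ...apply the PR...
    cargo bench > after.txt
    cargo benchcmp before.txt after.txt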
I borrowed this graphic directly from the cargo-benchcmp GitHub page. You can see it doesn't really make much of a difference: even though there's lots of green and red, the numbers themselves are not that different. While micro benchmarks are really useful and you should write them, what we have found is that writing macro benchmarks is another really good use of time.
A macro benchmark is taking your API and using it sort of in the same context that your users would use it. The reason that micro benchmarks are limited in utility is that you run something a handful of times and you measure it, so you don't really know how well it's running in the context of your CPU. So in the macro benchmark that I showed you before, we exercise the API in a loop of 10 million times, and we'll talk about why we do so many iterations.
Our tacho, as I mentioned earlier, has two macro benchmarks: we have a single-threaded benchmark and we have a multi-threaded benchmark, because these are the two different ways that people tend to use tacho. And this again is the same slide I showed before of the multi-threaded macro benchmark. So what's going on is: there's a couple of threads being spawned, they're calculating timings and adding them, and then we have another thread
that's doing the reporting work. And then here, just a really quick segue: using Instruments to look at the performance of this multi-threaded benchmark. Here I use a measure called IPC, which is instructions per cycle, and I'll talk about what that means here in a second. This is really just to give you a sense of what we're going to talk about more here in a moment.
We have stumbled upon using IPC — and I shouldn't say stumbled upon; it's a fairly commonly known metric. We use IPC as a useful empirical metric: it tells you how many instructions are completed every clock cycle. When we learn CPU architecture in school, we tend to learn it as: you have a program that runs one instruction, and then runs the next instruction, and the third, and so on — and that's just not how it works on modern Intel CPUs. I think this is sort of the model that we learn.
As you can see here, this again is the serial model: you run the first instruction, then the next instruction, then the third instruction, and here in this diagram you see I've broken out the pipeline stages. I got this graphic, and the next one, from a great page called "Modern Microprocessors: A 90-Minute Guide!", which you might enjoy. But this really is not how programs run anymore. They run more like this: as soon as one fetch stage has run,
the next fetch stage can start; as soon as one decode stage is finished, the next one can start. And you don't just have one execution unit anymore inside of a single core — now you might have three or four or five — and with even more complicated instructions, each one of these little boxes can depend on another box.
So given how complicated things are, and since instructions can depend on each other and we can have huge bubbles in the pipeline, how do we know we're doing well? Well, here we enter the realm of the performance counter. So Intel engineers had the same question — how do we know our program is running well, given how complicated a modern CPU is? — and they added something called performance monitoring counters. They count, just like the counters in the benchmark, whether
useful work is being done. And what we found is a good rule of thumb: if you have an IPC score less than 1, it means you're memory stalled; if you have a high IPC score, greater than 1, it means you're instruction bound — you maybe need to run fewer instructions. And you can verify this on your machine; you can learn this empirically. Like, if you have a deployment target — a modern CPU might be, like, 5-wide, or one might be 3-wide — you can write a program
that does nothing but, you know, work entirely out of registers, and you can determine whether this rule of thumb is appropriate or not — again reminding you just how complicated a CPU is. A three-wide CPU, which is what you saw at the very beginning, could have an IPC of 3; as I mentioned on that slide, that gives a max IPC of three. Instruments on the Mac has IPC built in. Again, this is going back to the earlier slide.
What I've chosen to do here is look at the IPC for the multi-threaded benchmark, and you can see we have an IPC score of less than 0.5. You might think, oh, it's 0.5, you know, that's great — but you know, it's a three- or four-wide CPU; that's 0.5 out of 4, and it's really poor — that's like maybe 10% utilized. And here I've done the same thing with the single-threaded macro benchmark: the IPC score is about 0.8 — a lot better, quite a bit better, yeah.
So all these counters are available directly in Instruments. What you do is you make a new counters instrument, and then you long-click on the recording button and pick an event, and then you'll have a really just daunting drop-down of performance counters. It sort of assumes that you've read these hundreds of pages of documentation, and really what you're going to do is find a handful of them that work well for you. But the nice thing is, you can actually create formulas from these performance
counters. IPC is the only one that ships by default; you can make your own. Here's an example using Instruments where — I can tell it's hard to read from there — I'm looking at L1 hits and L1 misses for the multi-threaded benchmark. So I specifically went through the drop-down and picked L1 hits and L1 misses, because I wanted to see what they look like for this multi-threaded macro benchmark.
Here you can see we do have about 20 million hits for the L1 cache and about half a million misses, which, you know, isn't great, in my opinion. And then another use of Instruments can be to sort of look at your typical waterfall of methods and see how long each individual function is taking, and you can see the heaviest stack trace over there.
So, you know, it's an easy-to-use performance tool. It's a little limited because it's Mac-specific, but since you have the performance counters available, it's actually much richer than a lot of people give it credit for, and I suggest you give it a try. On Linux — as I said, we deploy most of our code on Linux — we dig in with perf. perf has been part of Linux for a few years, and they've been constantly improving it; they've put a ton of work into perf.
It has both kernel and user space profiling. It's a sampling profiler with a configurable sampling rate, which you can play around with. You know, the Nyquist theorem tells you that if you want to see something that happens at rate X, you should sample it at 2X, and so the nice thing about it being configurable is that you can really turn it up. It's also pretty cheap; they designed it to run at low overhead.
So you can run it in production at a somewhat moderate sampling rate. And here's an example of looking at the IPC metric again, for our program, using Linux perf — I've highlighted it; it's roughly 0.5 instructions per cycle. This is a different CPU — this is a dedicated machine we use for performance work — which is why the instruction score is a little bit different. And here I use perf to look at the cache hits and misses.
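As a sketch, a perf invocation along these lines reports instructions, cycles (and therefore IPC), and L1 data cache hits and misses — the binary path is illustrative, and event names can vary by CPU:

    perf stat -e instructions,cycles,L1-dcache-loads,L1-dcache-load-misses \
        ./target/release/multithread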
I'm looking at the L1 misses and hits, and you can see that for the multi-threaded benchmark I have about a 5% miss rate, which isn't great; but for the single-threaded one I observed essentially a 100% hit rate — I mean, a roughly 0% miss rate there — which is great.
So I really encourage you to dig into this: if you ship software that's supposed to run on Linux, you should really dig into perf. It's much, much deeper than I had time to go into,
considering that I'm talking about other tools. It's very Linux-specific: you can dig into the Linux scheduler, you can dig into the I/O and network subsystems; there's an incredible amount of tooling being built around perf. One of the things that we use with perf is to build flame graphs.
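The usual recipe, sketched with Brendan Gregg's FlameGraph scripts (the sample frequency and binary path are illustrative):

    perf record -F 997 -g ./target/release/singlethread
    perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg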
A flame graph is built from samples: again, like I mentioned, perf is a sampling profiler, so you collect a bunch of samples and then you aggregate them, to get a sense of, like, what is the shape of your program. And here is what one looks like for our single-threaded program. So, like I said: you take — let's say this is a thousand samples — then it aggregates them. What's on the very top, you can mouse over and see what that is. I know it looks incredibly tiny.
You can actually drill in by double-clicking, and then you can mouse over to figure out exactly what function it is. What's on top is what was running on the CPU when the sample was taken, and then the width of what's below it gives you a sense of how often that was seen in a sample. So here we can see: we get a clock, we get the time, we make an Instant, we call map on a future — so it gives you sort of a sense of your program.
One thing that we do a lot is we'll just drill into some part of this, and I just give things a look from time to time to make sure things are looking the same. It's really useful for looking at long-running programs: since what we ship is a proxy that runs 24/7 in people's data centers, it's really nice to be able to get a sense of, like, how do things look at time T versus, like, 30 minutes from now? Are things changing as events happen inside of the system?
How does the shape of your running program change? Yeah, Netflix really pioneered this technique, and they use this to measure the health of all their online services, all the time — they're always building flame graphs of all their online services. The one thing that's trickier is you've got to remember to get symbols, just like I described earlier with profiling. You have to remember to add the symbols; otherwise, what you're going to get in the flame graph is a bunch of addresses — and good luck.
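For Rust specifically, one way to keep symbols around is to enable debug info for release builds in Cargo.toml, so perf and flame graphs show function names instead of raw addresses:

    [profile.release]
    debug = true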
So the last tool I want to talk about is probably the most complex: VTune. Like I said, you know, Intel engineers had the same question that you all have — how well is my program running? — so they added performance counters. Now they have another problem: what do all these performance counters do? So they made this commercial tool called VTune, which helps you sort of make sense of what all these performance counters do. It's an extraordinarily complex tool, but you can get your head around it
with some practice. It has tooltips and a GUI that works over X forwarding; it has a CLI; and if you're an open source developer, you can get a free license. Here I'm looking at, mm-hmm, the single-threaded benchmark — I really should have labeled these graphs; thankfully I've looked at them enough to know what they are. This is the single-threaded benchmark, and this is the general exploration view, and you can see there's a fair amount of detail here. I can tell this is memory bound.
This is backend bound, specifically on memory. And looking at the multi-threaded benchmark, that's also backend bound, but something really stood out to me from running this, which was that I was bound on a remote cache — and I'll come to this; this is a lesson I learned from preparing this talk. I didn't realize this about this particular benchmark — that's the multi-threaded benchmark. Here's a view of — I believe this is the memory explorer — here we're looking at the single-threaded benchmark.
I can tell from the miss count. And you can see stacked CPU time, but then, secondarily, what's memory bound. And what's really great about this tool is you can mouse over any cell and it'll give you a useful tooltip of what that means. That's really helpful when you're starting to use VTune; it's such a daunting tool.
It'll also give you the formula that it uses to derive many of these, which will be something like one counter divided by some other counter — you know, that's an example of a formula. You can actually take those formulas that you learn and plug them into Instruments for when you're doing development on your laptop. Here is the multi-threaded program.
The miss count is pretty high — note how I can tell this is the multi... oh yeah, it also has the name at the very end. Here's the multi-threaded benchmark, and here's the locks and waits view. So typically, when you use VTune, it has a lot of different analysis types that you can run, and you would start by looking at, like, the general exploration view, like I showed at the beginning, and then you would run it over and over again, using the different types of analysis. You would run your benchmark.
That's another great reason to write a macro benchmark: if you have an example application, you can hand it to perf to run, you can hand it to VTune to run, and then you have just an easy tool. You don't have to remember — how do I run this? what scripts do I need? You just run your macro benchmark in the tools.
This is another view — this is the event count view. You can see from the progress bar at the bottom that there's a tremendous amount of detail here. This is kind of like the day trader, or fantasy football, view of your performance. I wouldn't encourage you to spend a lot of time here, but — as I mentioned earlier, there's a GUI and there's also a CLI — this is something you can get CSV from, and you can do some post-processing on it to learn specific insights for yourself.
It can be really overwhelming, and the tooltips are helpful for all the different analysis modes. I found it really useful — I only dug into maybe a third of the analysis modes for my screenshots here. Oh, that's because — I learned a lesson while preparing this talk: VTune highlighted this remote cache issue, and what I learned is that I was remote-cache bound. What this means — as I was saying, this machine has two physical CPUs, and
for each stick of memory that you plug in, it shows you all the memory use for that, and also the memory traffic between the two CPUs — which was also, like, just a dead giveaway that there was a behavior I wasn't expecting. So going back to perf: you can see that I had that 5% miss rate that we talked about earlier. So what about the multi-threaded benchmark?
What I decided to do is: well, I'll just use taskset — I used taskset so the tasks run on one CPU — and we'll see how it improves.
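The pinning itself can be as simple as this (core 0 and the binary path are illustrative):

    taskset -c 0 ./target/release/multithread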
And then the miss rate dropped to 0.01 percent, which was pretty fantastic, and then you can see there: the total run time dropped from nine seconds to 3.8 seconds, which is pretty great — I'm really pleased with that. And we were talking about IPC being a useful empirical measure:
IPC increased about 10%, which tells me there's still a lot more work that we could do to improve that. So yes — performance is hard. It's hard to understand, you know, given how complex CPUs are, but there's a lot of tooling. You really need to measure empirically: you need to run your program and see how it does. There are a lot of great tools that you can use, and IPC is a really great measurement that you can use.
You know, I've been really pleased at how many tools there are to measure the performance of programs today. You can use Instruments on the Mac — it's much more powerful than people realize — you can use perf on Linux, you can use VTune; but ultimately the best tool is the one that you use on a regular basis. Now, I want to give a special thanks out to Eliza for walking me through Instruments, and thanks — thanks for listening.