Description
RustConf 2018 - Benchmarking and Optimization of Rust Libraries by Paul Mason
As we develop the Rust ecosystem, we have a goal that "Rust should provide easy access to high quality crates". This means libraries must be both ergonomic and perform well. Rust provides various frameworks to help benchmark libraries; however, achieving performance past a certain point requires knowledge of some deep language constructs.
This talk explores a journey towards benchmarking various library functions consistently and fairly, and consequently explores strategies for optimizing performance.
Hello, can you hear me? Okay, yep. Cool. And so, yeah, as I mentioned, I'm Paul, talking about benchmarking and optimization of Rust libraries. Just as a little bit of intro about who I am: I'm currently a principal engineer at Human Interest. Human Interest is a 401(k) processing company in San Francisco. You might be able to tell I'm not from there.
So essentially, if you take the vowels and you move the sounds of the vowels over one, that's the New Zealand accent. So if I say the word pan, you might hear this; if I say pin, you may be hearing that; and likewise, if I say pen, sometimes you might be hearing there. So if you're hearing the word yellow, then you're going to be good.
But if you're hearing the word on the right, you may need to reverse engineer what I'm saying sometimes. And so, on to the talk.
The Rust ecosystem is growing. When I first put together these slides, it was just under 17,000 crates; this morning it was around seventeen thousand seven hundred. It's growing very quickly.
If you're using a decimal number library, you don't want that to be the performance bottleneck of your system, and you also don't want it to be the source of bugs as well. And so, to really be able to understand the performance of a system, the first thing we need to do is be able to measure it. Really, the idea of this is so that you can understand whether you're making positive or negative impacts to your library. We'll go into some of that detail throughout this talk.
So, just to begin with: in order to benchmark, we need to know some of the tools that are out there, and cargo bench is the most obvious one to start off with. The reason why is because it's included with cargo. It leverages the test crate's internals, which aren't stable at the moment, which means it can only be run on nightly; but getting it up and running is relatively straightforward.
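
As a minimal sketch of what that looks like (a hypothetical `add` function, not the talk's actual code):

```rust
// Nightly-only: the unstable test crate provides the default harness.
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

fn add(a: u64, b: u64) -> u64 {
    a + b
}

#[bench]
fn bench_add(b: &mut Bencher) {
    // `iter` runs the closure repeatedly and reports ns/iter;
    // `black_box` stops the optimizer from deleting the work.
    b.iter(|| black_box(add(black_box(2), black_box(3))));
}
```

Running `cargo +nightly bench` picks up every `#[bench]` function.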
The first thing you probably notice in this particular table is the red and the green. The red indicates that something bad happened: a regression. The green indicates that something good happened: either the same performance or an improvement in performance. And so cargo is pretty, sorry, cargo bench is included with cargo, so it's pretty low effort to get set up. It's very fast to compile and execute, and you don't need any external crates within your Cargo.toml file. And, you know, with a good archiving strategy,
A
You
could
also
compare
results
over
from
version
3
division
1,
if
you
wish
to,
and
that
the
major
downside,
though,
is
that
it's
nightly
only
and
performed.
The
reason
for
that
is
that
performance
may
be
different
and
stable.
You
may
be
testing
against
experimental
compiler
features
or
whatever
else,
which
means
your
results
may
be
a
little
bit
different
cargo
bench
comp
can
be
a
little
bit
sensitive
to
thresholds.
So
in
that
particular.
In
the
previous
example,
there
was
green
and
red.
I was actually running that against the same code, without making any changes. Essentially you can control the threshold, but you've just got to be making sure you're comparing like-for-like tests: some of those tests were taking 2 nanoseconds, and if that goes to 3 nanoseconds, that's a 50% increase. The other thing with cargo bench is that looping through sets of values can be a little bit tedious, and with a decimal number library you want to be able to test a range of values.
Moving over to Criterion is a matter of including the criterion crate and then overriding the bench command so that, instead of invoking the default bench harness, it invokes the Criterion bench harness. And within the code it's relatively straightforward: we've got the criterion_main function, which essentially is what gets invoked by the harness, as the Criterion harness, and instead of having the bench attribute we have criterion_group. So where we had the bench attribute on each of the functions, instead you list them in the criterion_group macro.
The actual function that gets called is relatively similar. It looks similar in the fact that it's got an iter function, as well as that I'm passing the result to a black box. However, the big difference there is that it's wrapped in a bench_function, and what that is there for is to name all the benches: because you're not able to leverage the bench attribute, you've got to name them via a string for it to work on stable.
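
As a sketch of that migration (assuming a bench target with `harness = false` in Cargo.toml, and reusing the same hypothetical `add` function as before):

```rust
#[macro_use]
extern crate criterion;

use criterion::{black_box, Criterion};

fn add(a: u64, b: u64) -> u64 {
    a + b
}

// The bench is named with a string rather than a #[bench] attribute,
// which is what lets it run on stable.
fn bench_add(c: &mut Criterion) {
    c.bench_function("add", |b| {
        b.iter(|| black_box(add(black_box(2), black_box(3))))
    });
}

// criterion_group! replaces the per-function attribute, and
// criterion_main! is the entry point the harness invokes.
criterion_group!(benches, bench_add);
criterion_main!(benches);
```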
A
So
criterion
is
relatively
easy
to
set
up
and
what
I
mean
by
that
it's
an
easy
migration
from
cargo
bench.
The
code
was
fairly
similar
and
you
know
it
has
some
way
offs
and
which
turn
out
to
be
kind
of
nice
in
some
ways,
so
bench
function
over
inputs
means
you
can
pass
multiple
inputs
to
a
a
single
benchmarking
function
and
it
does
provide
statistics
out
of
the
box,
which
means
you
don't
need
a
second
plug
in
or
whatever
to
manage
there.
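
A sketch of `bench_function_over_inputs`, using the Criterion API of this era (it has since been superseded by benchmark groups) and extending the previous snippet's imports; the `divide` bench and its input values are illustrative only:

```rust
fn bench_divide(c: &mut Criterion) {
    // One benchmark function, measured once per input value.
    c.bench_function_over_inputs(
        "divide",
        |b, &n| b.iter(|| black_box(n) / black_box(7u64)),
        vec![100u64, 10_000, 1_000_000_000],
    );
}
```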
You've just got to be careful to not name them the same thing. If you do, then you get these weird, subtle comparison errors, and that can really throw you off if you haven't realized what you've done. So, before jumping into some of the practical things of performance, the other thing I just want to mention is really understanding your application before running it under bench.
You can use instrumentation to help understand this, and the reason you might want to do this is to really understand what the internals of the application are doing, so that you can make sure that what you're testing against is what's expected from a normal application run. There are various tools to be able to do this, but the whole idea here is that you understand both the expensive calls and also the high-invocation calls.
As an example, for a decimal library, division for bigger numbers might be calling a subtraction function maybe a thousand times internally, which isn't exposed by the public API, and so it's important that that particular function is tested under those sorts of loads. So we'll go into some of the practical things, and this one's going to seem a little bit of a cop-out, because it's going back to the fundamentals, but essentially some of the biggest wins you can get are really from
re-examining your approach: things such as early exit conditions; operational efficiencies, like using a bitwise shift instead of dividing by a power of two; parallel operations; dynamic programming; and use of efficient types. We'll jump into use of efficient types a little bit more later in this talk. But just as a bit of an example, let's talk about Postgres write performance. Postgres has a numeric data type, which you're probably all aware of, and to write out to the Postgres protocol:
A
You
know
if
we
had
it
in
a
string,
for
example,
it
might
be
quite
easy
to
reason
about
how
we
might
be
able
to
implement
this.
We
essentially
chunk
it
into
groups
of
four
and
paired
the
zeroes
and
output
the
results.
So
if
we
were
to
do
that
and
rust,
we
might
do
something
like
so
where
we
convert
it
to
a
string.
A
We
find
out
where
to
split
the
number,
and
so
that
we've
got
an
integer
portion
and
fractional
portion,
and
we
can
then
go
through
and
chunk
it
by
fours
heading
to
lift
for
the
integer
portion
and
padding
to
the
right
for
the
decimal
portion
and
then
writing
it
out
to
the
protocol
with
the
the
big
endian
format
down
the
bottom.
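
The slide's code isn't captured in this transcript, but a minimal sketch of that string-based approach could look like the following (a hypothetical `groups_via_string`, ignoring the sign and the numeric header fields for brevity):

```rust
// Chunk the decimal digits into base-10,000 groups via a string:
// pad the integer part on the left and the fraction on the right,
// then write each group of four digits out big-endian.
fn groups_via_string(value: f64, out: &mut Vec<u8>) {
    let s = format!("{}", value);
    let (int_part, frac_part) = match s.find('.') {
        Some(i) => (&s[..i], &s[i + 1..]),
        None => (&s[..], ""),
    };

    let left = format!("{:0>w$}", int_part, w = (int_part.len() + 3) / 4 * 4);
    let right = format!("{:0<w$}", frac_part, w = (frac_part.len() + 3) / 4 * 4);

    for chunk in left.as_bytes().chunks(4).chain(right.as_bytes().chunks(4)) {
        let group: u16 = std::str::from_utf8(chunk).unwrap().parse().unwrap();
        out.extend_from_slice(&group.to_be_bytes());
    }
}
```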
And so, if we were to do it this way, then say we had 15 samples that we're testing with: it roughly takes around 15,000 nanoseconds per iteration, so around about a thousand nanoseconds per sample,
for this particular example. And of course, I wouldn't be talking about this if you couldn't do it better. So what if you didn't have to convert it to a string? Well, you don't. If you take a step back and think about the pure math approach to how to do this, then you may think about how to scale up the number to the closest ten-thousand group, keep dividing it by 10,000 while storing the remainder, and keep going until you reach zero.
We've got our groups on the right there, and that's without having to convert it to a string first. So, implementing this in Rust, we could do the following, where we essentially figure out what we should be scaling up to, so the closest ten-thousand boundary. I'm also using a couple of operational efficiencies here: instead of using the modulus and divide functions, or the remainder and divide functions, I'm just taking the last two bits of the number, and likewise bitwise shifting right by two instead of dividing by 4.
So we've scaled up the number; we then divide the number by ten thousand, storing the group into an array; and finally, we write it out to Postgres.
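
Again as an illustrative sketch rather than the talk's actual code, simplified to a u64 mantissa:

```rust
// `scale` is the count of fractional decimal digits, e.g. 3.14 has
// mantissa 314 and scale 2.
fn groups_via_math(mut mantissa: u64, scale: u32, out: &mut Vec<u8>) {
    // Scale up to the closest multiple-of-four digit boundary, using
    // `& 0b11` instead of `% 4` and `>> 2` instead of `/ 4`.
    let partial = scale & 0b11;
    if partial != 0 {
        mantissa *= 10u64.pow(4 - partial);
    }
    let _groups_right_of_point = (scale + 3) >> 2; // would feed the header

    // Keep dividing by 10,000, storing each remainder as a group.
    let mut groups = Vec::new();
    while mantissa > 0 {
        groups.push((mantissa % 10_000) as u16);
        mantissa /= 10_000;
    }

    // The loop yields the least significant group first; the wire
    // format wants the most significant first, written big-endian.
    for group in groups.iter().rev() {
        out.extend_from_slice(&group.to_be_bytes());
    }
}
```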
So the performance of that is quite a bit faster: it's 5,000 nanoseconds versus 15,000 nanoseconds, so we're getting, you know, a three-times increase just from really looking at our approach. The first one was easy to reason about, but with the second one, if you take a math approach to it,
A
Then
you
can
get
some
clear,
clear
winds
just
from
doing
that,
so
the
next
one
I
want
to
talk
about
is
the
exercise
slices
performing
better.
So
if
we
have
two
identical
functions
like
so
where
one
of
them
takes
exercise
slice
of
three
and
the
second
one
doesn't,
then
the
hypothesis
is
the
first
one
runs
faster
and
you're,
probably
looking
at
this
and
thinking.
Well,
obviously,
it
runs
faster
pool
there.
The
second
one
does
all
these
bounds,
cheeks,
which
the
first
one
doesn't
do
and
you'd
be
right
in
this
particular
case.
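
A sketch of such a pair of functions (hypothetical bodies, since the slide code isn't in the transcript):

```rust
// The signature promises exactly three elements, so every index below
// is statically in range and no bounds checks are needed.
fn sum_fixed(x: &[u16; 3]) -> u64 {
    x[0] as u64 + x[1] as u64 + x[2] as u64
}

// Identical body, but the length is only known at run time, so each
// index carries a bounds check (unless the optimizer can remove it).
fn sum_dynamic(x: &[u16]) -> u64 {
    x[0] as u64 + x[1] as u64 + x[2] as u64
}
```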
So the first one, well, one of them takes two nanoseconds per iteration, where the one with bounds checks takes three nanoseconds, and within a loop we get 2,100 nanoseconds and 3,700 nanoseconds. So it starts to broaden a little bit there. But what if we didn't have those bounds checks?
What if we had this particular code here, which is of course a little bit dangerous to have in your production codebase, but nonetheless: how does it perform compared to a fixed-size slice?
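
That "dangerous" variant is presumably along these lines: skipping the bounds checks by hand with unchecked indexing.

```rust
// If the slice ever has fewer than three elements, this is undefined
// behaviour, hence the warning about production codebases.
fn sum_unchecked(x: &[u16]) -> u64 {
    unsafe {
        *x.get_unchecked(0) as u64
            + *x.get_unchecked(1) as u64
            + *x.get_unchecked(2) as u64
    }
}
```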
So what this is saying to us is that, by giving it the fixed-size slice, the compiler is able to make various assumptions, or apply various heuristics I guess, to understand that it can make further optimizations that it's unable to make without having that fixed-size slice. So fixed-size slices are indeed faster.
A
So,
following
on
from
their
inlining,
can
both
improve
and
hinder,
and
so
with
that
previous
example,
there
say
we
just
throw
inline
always
on
to
the
top.
What's
going
to
happen
well,
you
said
that
things
are
going
to
improve
substantially
then
you'll
be
right,
but
interesting
to
note
is
that
both
of
them
behave
exactly
the
same
way
now.
So
within
lighting
both
of
them
take
1,300
nanoseconds
per
iteration
for
the
multiple
iterations
and
one
nanosecond
for
this,
the
singular
version.
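
The experiment, sketched on the bounds-checked function from before:

```rust
// Forcing inlining makes the caller's array length (3) visible to the
// optimizer at the call site, so the bounds checks can disappear just
// as they do for the fixed-size version.
#[inline(always)]
fn sum_dynamic(x: &[u16]) -> u64 {
    x[0] as u64 + x[1] as u64 + x[2] as u64
}
```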
A
Well,
it
knows
that
the
size
of
the
array
that
I'm
passing
through
is
3,
so
it
can
basically
remove
all
those
bounce
checks
and
make
things
faster
as
well,
and
so
so
within
lining
that
the
common
thing
is
is
well
that
I
often
hear
is
I
use
matter
than
the
compiler
and
to
know
when
to
in
line
and
essentially
I.
Think
if
you're
measuring
things-
and
you
understand
the
impact
of
what
you're
doing
then
sometimes
I
say.
A
Yes,
you
can
be
smarter
than
the
compiler
and
so
with
a
library,
though
I
think
the
main
thing
to
call
out
there
is
that
just
to
be
worried
about
the
impact
you're
going
to
be
having
on
your
downstream
components.
So,
although
consuming
applications,
so
it
will
increase
the
binary
size,
and
sometimes
you
won't
get
the
benefits
you
want
now
with
with
the
increased
binary
size.
So
my
general
rule
was
to
be
very
wary
of
inlining
public
API
functions
going
nuts
with
your
internal
stuff.
Moving on to the use of efficient types, with addition as the example: essentially, you need to have the numbers at the same scale to be able to add them together. So if I have two point five, which is scale one, plus three, scale zero, then I just need to scale up the three to scale one to be able to add them together: 25 plus 30, it's easy, you have fifty-five at scale one, so five point five. And with a big-integer type it's a lot easier to reason about this: it's one single number you need to scale up. But if you have an array of, you know, three u32s, then you need to start thinking about words and how the overflow happens between them.
It's not just a simple addition anymore; you need to essentially add the bits together one by one. And so, with a naive implementation of both of these, you can see that the array version does actually behave a lot faster than the big-integer version.
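
A sketch of word-by-word addition with carry for a 96-bit mantissa stored as three u32 words (illustrative, not the library's actual routine):

```rust
fn add_96(a: &[u32; 3], b: &[u32; 3]) -> [u32; 3] {
    let mut out = [0u32; 3];
    let mut carry = 0u64;
    for i in 0..3 {
        // Widen to u64 so the overflow out of each 32-bit word lands
        // in the upper half instead of wrapping.
        let sum = a[i] as u64 + b[i] as u64 + carry;
        out[i] = sum as u32; // keep the low 32 bits
        carry = sum >> 32;   // carry into the next word
    }
    // A real implementation would treat a leftover carry as overflow.
    out
}
```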
However, that's not really what I want to go into in this particular part: I really want to talk about exploiting fixed-size primitives. By knowing that that array is 96 bits, we can make a lot of assumptions about how we add together, or do certain things with, that particular number. So, shifting on from addition to division: a naive implementation of division might look like so, and I'm using 32-bit numbers here, but essentially the concept's the same.
So this is the two's-complement version of division, if you're not aware of it. Essentially, you take the dividend and you keep subtracting the divisor, and every time you're successful, you increase the quotient. So, as you can imagine, that doesn't perform very well: for smaller numbers you can get really good results, 18 nanoseconds, but as you get to the bigger numbers it's up around 19 million nanoseconds, which is not acceptable in most cases, probably. So: we know that the size of the type is 32 bits.
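
A sketch of that subtract-and-count loop (illustrative; assumes a non-zero divisor):

```rust
// Subtract the divisor until the dividend runs out, counting each
// successful subtraction. The iteration count grows with the quotient,
// which is why large dividends blow out so badly.
fn divide_naive(mut dividend: u32, divisor: u32) -> (u32, u32) {
    let mut quotient = 0u32;
    while dividend >= divisor {
        dividend -= divisor;
        quotient += 1;
    }
    (quotient, dividend) // (quotient, remainder)
}
```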
So with that, we can also make various assumptions about how that two's-complement division works. Instead of looping through N number of times, depending on the size of the number, we can limit our loop to only 32 iterations, and basically exploit the fact that it's a binary number.
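
A sketch of what that fixed-iteration division looks like, as binary long division (again illustrative; assumes a non-zero divisor):

```rust
// Because the type is known to be 32 bits, the loop runs exactly 32
// times regardless of how large the operands are.
fn divide_binary(dividend: u32, divisor: u32) -> (u32, u32) {
    let mut quotient = 0u32;
    let mut remainder = 0u64; // one word wider, so the shift can't overflow
    for i in (0..32).rev() {
        // Bring down the next bit of the dividend.
        remainder = (remainder << 1) | ((dividend >> i) & 1) as u64;
        if remainder >= divisor as u64 {
            remainder -= divisor as u64;
            quotient |= 1 << i;
        }
    }
    (quotient, remainder as u32)
}
```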
The other thing I just want to pull out here is that this is another example of why you should be testing across a range of inputs. If I just have one test for a hundred, or even 200, or a thousand, then it's not going to show the incredibly bad performance; but as I start to get to the bigger numbers, it starts to get horrible.
So the fifth thing I wanted to talk about is copy and borrow semantics. Essentially, when we first start getting into Rust, I'm sure a lot of us have battled with the borrow checker before, and we've done things like this to get around it. And obviously that's a warning sign, because cloning to get around the borrow checker isn't the way forward; it's just the way forward temporarily, or whatever else. But essentially, if you're avoiding the borrow checker, it's probably a bad sign, the reason being that copying or cloning a large struct can be expensive.
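
A sketch of the pattern, with `Decimal96` as a hypothetical stand-in for the library's 96-bit struct:

```rust
#[derive(Clone)]
struct Decimal96 {
    mantissa: [u32; 3],
    flags: u32,
}

// Taking the value by ownership forces callers that still need their
// decimal afterwards to write `consume(d.clone())`.
fn consume(d: Decimal96) -> u32 {
    d.mantissa[2]
}

// Borrowing hands over the same 96 bits with no copy and no clone().
fn borrow(d: &Decimal96) -> u32 {
    d.mantissa[2]
}
```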
And so, if we were to look at the performance of this, then we can see that for a single iteration it looks fine, two nanoseconds each; but once we get into bigger loops, then we start to see a little bit of a variance. And to be honest, this particular benchmark is an interesting one, because it varies substantially between runs; it really depends on how the process is working at the time. But essentially, this one is roughly around 2,400 nanoseconds versus twenty-nine hundred nanoseconds.
So copying at this size, just 96 bits, isn't absolutely slower than just borrowing those 96 bits in this particular case. Again, if it can fit on the stack, it's fast; but if it doesn't, then it potentially could be a bit slower. So what about unsafe? I mean, unsafe is a great way to improve performance, right?
So, as library authors, I think that, well, the definition of unsafe, really, for Rust, from my understanding, is that you're basically telling the compiler: hey, I know better than you, I know what I'm doing, trust me. And well, that's fine when I'm writing the code, you know, I trust myself; but of course, I cause a lot of bugs as well, and essentially the community doesn't necessarily trust me like I trust myself.
So having unsafe within your code is something that, if you can avoid it at all, then you know you should be avoiding it. And I have done a little bit of experimentation within the library, and I have found that, in the cases that I've been looking at, where I've thought, hey, if I just had this check removed, that would be a lot faster, I've actually found that there hasn't been any noticeable difference.
And I think that's kind of interesting. I think that's something that, for everyone here: if you're measuring your performance and you see that there's a huge performance increase, then you can make that trade-off, but only if it's absolutely necessary. And so, I think, originally I was jumping to the conclusion that unsafe was always going to be better in terms of performance, and it wasn't necessarily the case. So it'd probably be interesting to challenge yourself by just measuring your results.
So, just as some closing thoughts: as library authors, we have an implicit responsibility to consider performance, as small details do matter, and it's really important that you understand how your application is working under load, or under normal circumstances, in order to really benchmark it effectively.
It's important that you're testing across a range of inputs; it would be a shame if you were just testing against dividing by 100 from before, when dividing the billion was turning into 19 million nanoseconds. So it's important that you're really considering a whole variety of benchmark tests. And the main purpose of measuring is really so that you have insights, so that you can make effective decisions going forward.