Description
Stefan (https://twitter.com/dns2utf8) speaks about Atomic Counters and Cache Lines
How do atomic counters and cache lines interact with each other and why do they affect the performance of my programs?
About Stefan Schindler
Stefan has been working with Rust since 2015, is a member of the RustFest project, and is the maintainer of the threadpool and sudo crates, among many others.
Rust Linz at https://rust-linz.at
Twitter: https://twitter.com/rustlinz
Speak at Rust Linz: https://sessionize.com/rust-linz
Okay, hello, I'm Stefan. Sorry for the delay, I'm not used to this kind of presentation yet, so we'll see how that goes. Quick introduction, then we'll go over integer types, so this will be more of an advanced talk. Our MC kindly has an eye on the chat, so please enter your questions there. Then we'll go through some examples of, like, why we care about the nitty-gritty details in an actual application.
So that's one thing that gives you a speed boost of at least factor two, usually factor 40, sometimes even more. And that's important, because you don't have to optimize, right, you just tell the compiler: do the optimizing for me. Another thing I like to run is cargo watch; you can see it on the second line.
Cargo watch watches all the files in your project. It watches src and static and all the other folders, and whenever you change one, it waits a little bit, I think it's 0.5 seconds or something, then it runs these -x commands, and those are equivalent to you typing in cargo check, cargo fmt for format, and cargo run. And of course you can add quotes around the run, so it's cargo watch -x check -x fmt -x "run --release", and then it will run in release mode. Why is that important? Link time is really bad in Rust, still.
And then, after all that, there's your game logic; that's like 300 crates into the build process. So cargo check is really fast because it does not link your project: it checks the syntax, it checks most of the types, and then it tells you if you're okay or not, and that's great. Format keeps your code clean in the sense that it has one kind of format.
That's really great for working together with people. There's nothing worse than looking at someone's code and just not having a feel for where the control flow is going, just because they have some uncommon indentation and whatnot.
If you're coming from C-like languages or JavaScript, where brackets are optional around the if or the for loops or the while loops, then you know what I mean. It's horrible, right, because someone can write stuff like an if, and then one statement, semicolon, and another one on the same line, and then you have to remember that the last one is not included in the if or the for loop. That leads to really, really dirty bugs. Moving on.
So. Why do we care about atomics? I've given you here the lstopo of my machine.
It has 24 gigs of memory, four cores with two compute units each, and you see it's pretty powerful, and that's a laptop. So, to make problems solvable, we need a lot of compute power, and Python is great for teaching algorithms and whatnot, but when it comes down to raw compute performance, we want to be able to leverage all these eight cores or whatever we have. I mean, I've worked on systems with 128 cores.
How many times do we need integer data? Most people think, like: yeah, I will work with strings or tables and databases. But you can do a lot with integer data. One of the big things is counting stuff or indexing stuff.
Index has type usize, because that's what the data index brackets force the type into, and because we operate on the index level, it actually lives in RSP; it's a register, so that symbol does not have a memory address at runtime. And because LLVM is pretty cool, the data will get analyzed... I mean, the program code will be analyzed, and then the compiler decides: oh, it's just for loops, right. It can determine that it's just for loops from the control flow, and it will say: too bad.
We will unroll that, so you don't see this loop in the assembler anymore, and that's pretty powerful. We don't have to care if we run this on a really small ARM machine; this will give us a huge improvement, because it doesn't matter what our machine is: there are no more index calculations, it's just jumping into the fields and reading them out.
This is a sign that you have to be really, really careful, and you should add a note why this unsafe is safe; and, as we will see, this one is not. Moving on: the watcher thread waits a little bit at the start, it waits for 500 milliseconds, just to give the operating system a chance to, like, set everything up, and the other one is like a clock, right: it goes round and round and round.
Another thing you may have noticed: if we didn't copy it into threshold_local, we would have to fetch it from memory every single time we access threshold_local. So this is another hint that this is probably unsafe and probably not as stable as we have it here. We have no comments; not great in production.
But this is what we get: we get 88, and we have observed it once; and then we get 93, we have observed it... one thing... 98, observed it just once. So this zero is how many repetitions we have... we wanted to have, I think it was a hundred thousand samples, and we only recorded 99,000 transitions. So this is like all over the place. But now we think: yeah, maybe it's just too slow, right? Maybe it misses it because it's debug mode, and debug mode is slow. So let's put it in release mode.
This is... this is even worse. Now we don't see any transitions anymore.
I don't see anyone proposing any answers anyhow, so: the main problem is that the compiler looks at this from a single-threaded point of view and says: this thing here is a symbol that we see out here.
And yes, as a compiler optimization, this is correct. So this one is a local symbol; this one is always the same, so we don't need this update anymore; and then this value will always be the same, so this is always true, so we can just increase the count. This branch here gets discarded completely, and then we have max_test, which just contains a count increment. The compiler is smart enough to transform this loop into just assigning the max value to count itself, and then it notices:
no, count is never read, actually, so it doesn't need count anymore. And then we see: last isn't read either, so we don't need that anymore either. History? Yeah, we still need that: we need to allocate that thing, because the user forced me to, and then we have to give it back, and that's the end of the story. So the only thing that survives this is a huge allocation and a return statement. Not very obvious in this small code, and it gets even worse in bigger code. So.
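A hedged reconstruction of that kind of function; the names (max_test, count, last, history) follow the talk, but the body is an assumption:

```rust
const MAX: u64 = 100_000;

// In release mode the optimizer proves `count` ends at MAX, sees that
// `count` and `last` are never read afterwards, deletes the loop, and
// keeps only the allocation that the caller receives.
fn max_test() -> Vec<u64> {
    let history = vec![0u64; 1024]; // survives: it escapes via the return
    let mut count = 0u64;
    let mut last = 0u64;
    while count < MAX {
        last = count; // dead store, never read: removed
        count += 1;   // loop folded to `count = MAX`, then discarded
    }
    history
}
```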
So let's fix that. So we have the counter function, and we say: hey, compiler, don't optimize this away. Yeah, Vladimir just wrote in the chat that Rust needs the volatile keyword, and we don't... we really don't want the volatile keyword; it's badly designed. It means different things in different C versions, and it means different things in C++ and in C,
even though the same compiler compiled that code. If we actually need volatile, then we have these read_volatile and write_volatile pointer functions, and we can actually do stuff like that. And in this case we don't get optimized away, because the compiler knows: yeah, we have to touch the memory, therefore we have a side effect, and this has to be attributed to something. So this will actually write to the memory, always, and then we get somewhat correct numbers.
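A minimal sketch of that pattern, assuming a raw pointer to the shared counter (the actual setup code isn't on screen here):

```rust
use std::ptr;

// Volatile accesses count as side effects, so the compiler must keep
// every read and write; note this is still NOT thread-safe.
fn bump(counter: *mut usize) {
    unsafe {
        let current = ptr::read_volatile(counter); // always loads memory
        ptr::write_volatile(counter, current + 1); // always stores back
    }
}
```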
I'm really sorry, I have to read you the numbers from the source code, because there is an entire segment missing here.
Okay, let's trust the slides. So now we need atomic operations. Really sorry for messing up the light text. So atomic access is stuff that appears as one indivisible operation to the user of the CPU.
There are some optimizations for that. That means we have to pay attention to caches. We can do misaligned access, but it will be very slow. This is especially true if we have bit flags that we have to check, because we have to fetch and load the whole block, and then access the word, and then perform AND or OR operations on the thing, and then we get the bit we want, which is terribly inefficient.
Another thing that most people are not aware of is that atomic operations also have effects on memory reordering and instruction reordering, and making sure that this works properly on every CPU is very, very difficult, because it's different on AMD64.
It's mostly the same on Intel x86-64, but it's different on 32-bit, and it's definitely different on ARM; depending on what ARM it is, it's very different. Sometimes we have reordering in the CPU, sometimes we don't. All of these things we would have to keep in our head all the time we compile the program. So why not use an abstraction?
This is the same with the LOCK prefix. So the most important thing here is actually the bold keywords: these are the assembler instructions that will behave differently with the LOCK prefix, and the LOCK prefix is what our abstractions with atomic variables and pointers and counters use to actually work, for compare, increment, add, sub, whatever. For all of these operations we need CPU support; the hardware has to guarantee atomic execution, and on AMD64 it's called LOCK, but on other ISAs, on other instruction sets, it's different. So, oh yeah.
Almost forgot about this one: we have very different guarantees. So, on Intel... Intel is very optimistic, and everybody has heard of Spectre, I assume, and other side-channel attacks, and it's buried in this number: how many cycles does it take? One cycle, for instance: if you increment one number single-threaded, it's one cycle, pretty fast. Atomic operations: 20 to 120 cycles. That's not great, but it's still...
Sorry. Intel CPUs are not running instruction sets in hardware like AMD's are; they are actually virtual machines that emulate the CPU and behave the same most of the time. That's why we need microcode updates all the time: because Intel makes mistakes and they have to be fixed. AMD Opteron, as the last platform I was able to get solid numbers from: it's 40 cycles for every operation, very deterministic, very nice for everybody that relies on real-time systems. For the Ryzen and Epyc...
So, oh, for those of you that are new to Rust: this is more functional style. I'm really sorry if you're confused by the thing, so let me go quickly through it.
We can also have an inclusive range: then it's one number, dot dot equals, and then the other number. Brilliant. Why do we need brackets? Because if we put this .map here, it would go to the second number. We don't want that: we want to have the range, and then, from this range on, we want to map. So ranges are iterable; this is a for loop.
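In code, the point about the brackets looks roughly like this (the numbers are made up):

```rust
fn main() {
    // The parentheses make `.map` apply to the whole range; without
    // them the call would attach to the second number, not the range.
    let doubled: Vec<u64> = (1..=4).map(|n| n * 2).collect();
    assert_eq!(doubled, vec![2, 4, 6, 8]);
}
```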
This is the closure that we spawn. Inside the closure we get the counter pointer, because we need that, and then we use read_volatile... sorry, this will read_volatile to get the current value, increment it by one locally, and then write_volatile back into the thing, because we know where it is and we can do that, and it's unsafe because you shouldn't do that.
So this is our setup, these are our spawn handles, and then we use collect. Collect means: everything that has been generated by the previous iterators, so these handles of these functions, we now collect into a vector. And because we cannot actually... well, in this case we can, this one is just laziness: I don't want to spell out what exact type this is. This would be a JoinHandle of a function that returns something, but I don't have to; I can just say underscore: figure it out, compiler.
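That pattern, as a minimal sketch (thread count and body are placeholders):

```rust
use std::thread;

fn main() {
    // `Vec<_>` lets the compiler infer Vec<thread::JoinHandle<()>>.
    let handles: Vec<_> = (0..8)
        .map(|i| thread::spawn(move || {
            // per-thread work would go here
            let _ = i;
        }))
        .collect();

    // join each handle: wait for the first, then the second, and so on
    for handle in handles {
        handle.join().unwrap();
    }
}
```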
The reason is that we need to buffer all of the jobs, because if we don't buffer them in this vector, then we will just spawn one at a time and process them in order, so they don't run concurrently. We don't want that, so: start all of them. At this point, all of them are started, and then, once everything is started, we'll loop over what we have, and then we say: okay, for each,
we wait for the result. So we'll just wait for the first one, the second, and so on, and then, yeah, that's it, because we operate on the global thing. And then we can ask ourselves: how big is it? So we have to calculate the pointer again with this, and then we can just read it... read the pointer, read_volatile, and say: okay, what we expected is how many parties we had times how many increments we wanted to do, and the result is pretty bad.
It's way off, and the reason for that is sometimes, actually most of the time, because we're pretty close to the lower end: all four threads read the same value, do the same calculation, and write back the same value. If you're a little bit lucky, they get a little bit out of sync, and then the value starts to increase a little more than one thread would have done alone.
Oh, moving on: ordering. If you come from C++: they introduced the concept of ordering, because there are different trade-offs when you write to stuff. For instance, there is... you see this at the bottom: sequentially consistent. This means we don't want memory reordering and we don't want instruction reordering. So at that point, when we load this thing, all the code that came before it has to be finished; every memory read and write has to be finished at that point.
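A small sketch of what that guarantee buys, as a hypothetical publish/consume pair (my example, not the talk's code):

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

static DATA: AtomicUsize = AtomicUsize::new(0);
static READY: AtomicBool = AtomicBool::new(false);

// writer: the store to DATA must be finished before READY turns true
fn publish() {
    DATA.store(42, Ordering::SeqCst);
    READY.store(true, Ordering::SeqCst);
}

// reader: once READY reads true, the 42 is guaranteed to be visible
fn consume() -> usize {
    while !READY.load(Ordering::SeqCst) {}
    DATA.load(Ordering::SeqCst)
}
```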
These are details. If you get into that: hire me. I mean, I have time for a project; I'm just doing a master's degree. All right. And now, with fetch_add and load, we get what we wanted: we have the exact store. The runtime is comparable, if I remember correctly, but I don't have the numbers here. So that's what we should do.
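The fixed counter, sketched under the same assumptions as before (thread and iteration counts are placeholders):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let handles: Vec<_> = (0..8)
        .map(|_| thread::spawn(|| {
            for _ in 0..100_000 {
                // one indivisible read-modify-write: no lost updates
                COUNTER.fetch_add(1, Ordering::SeqCst);
            }
        }))
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    assert_eq!(COUNTER.load(Ordering::SeqCst), 8 * 100_000);
}
```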
Also, as you may have noticed, we don't have unsafe anymore, so it just works. By the way, if you're using a more recent compiler: ATOMIC_USIZE_INIT, this constant, has been replaced by const generics code, so you can specify a certain number at compile time. So we don't have to have an init function that runs before our main and then sets the atomic value to a certain thing if this value is not zero; we can just write it into the code and put it to 42 or whatever we need. So: cache lines. So why was this so...?
This lock is a really bad one; the real mutex is a little more complex. Also, this is relatively slow, but if we don't have an operating system, that's how we are going to operate: we have to have an atomic linked list. Linked lists in Rust: usually really bad, but this time, yeah, we need it as a queue. And then we have this atomic queue, where everybody that wants to acquire the mutex has to operate on this atomic queue, and then we have to create these data structures,
add ourselves to the list, see that no one else has manipulated the list in a way that we don't like, and then check if we're at the head of the list; and if we are, we can access the value and manipulate it. And if we are done with the list, we can remove ourselves from the list and point the thing to something else, and then we have to actually wait for a bit until the hardware flushes out the cache information, and then we can free the thing.
If you think this is insane... who would write such a complicated thing? That's actually how the Linux driver stack works: just writing stuff around in memory and waiting a little bit, in the hope that no one else has access to it. And then there is a value in the documentation:
if you acquire any structure from the USB kernel driver stack inside the kernel, you're allowed to hold it for, I think, four seconds or something like that, and then you have to let go. Maybe it's less nowadays. So yeah, the point being: hotplug is hard, and doing that stuff in C is harder, and that's the solution they came up with. So, oh yeah, before we move on: there's another cool crate called parking_lot, and that leverages operating system support.
So that's the thing: we get the virtual interrupt from the CPU, and then we can react to that, and that technique can be used to build new mutexes.
They're called futexes, and they're faster than the traditional mutexes, because we leverage the hardware support; but it doesn't work on all platforms and usually requires operating system support. But it's really cool and it's really fast; look into parking_lot if you're interested.
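For reference, a minimal usage sketch of parking_lot's mutex (my example, not from the slides); its lock() returns the guard directly, with no poisoning to unwrap:

```rust
use parking_lot::Mutex; // external crate: parking_lot

fn main() {
    let counter = Mutex::new(0u64);
    {
        let mut guard = counter.lock(); // no .unwrap() needed
        *guard += 1;
    } // unlocked here when the guard drops
    assert_eq!(*counter.lock(), 1);
}
```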
So: cache lines. Taking another sip.
Our CPU is great, let's be honest: it's a marvel of technology. Whenever you think computers are shite, think again: think about how many things a CPU can do, how fast it is, and with what dumb instructions we have to feed it, and it still makes stuff fast, right?
I spoke to a professor a couple of years back, and he said: yeah, it's no fun with these modern CPUs, with the instruction reordering and the memory reordering, because even if you have the worst assembler code, these new... back then it was kind of the i7, this new i7, so this was like 2010-ish.
These i7s are bad, because they reorder your really bad assembler code and make it performant, and you don't even see why, so there's no point in optimizing assembler anymore, and they were just like: yay. So they actually cancelled that class, because the last time they had hardware that was old enough that it would not do instruction reordering and memory reordering was with these old 8086 32-bit CPUs: single core, really old, like, they don't need a cooler, they are that old.
They're like this big, like a AAA battery, and they run at, I think, 80 megahertz, but that's already overclocked, so the default clock is 40.
So, focusing back, I'm sorry, because otherwise this gets too long. So we have one NUMA node in my machine; our bigger machines have multiple. We have L3 cache: this is the cache the latest Ryzen generation has a lot of, like, I think, 16 megabytes instead of four, for a laptop CPU, which is great: it improves performance. Then we have L2.
They feed a lot of instructions, while the data stays mostly the same. On the other hand, if we have a program that is fast but runs in a loop, like a graphics calculation, the instructions are usually the same, but the data gets fed through really fast. So we need to have these two caches that are independent, and the data cache is smaller in my CPU. These caches grow with every generation, because usually it's the easiest way to improve performance by ten-ish percent: you double these caches. And that's what manufacturers will keep doing, because clock speeds are at the limit: we cannot go above five gigahertz without super special cooling.
Moving on: this is a more colorful diagram; this one's from AMD. So we see the instruction cache here, and this one gets fed by branch prediction. That's what I said about instruction reordering... instruction reordering is before that, then this branch prediction: the CPU actually calculates both branches of your if, and then does operation caching, and then the micro-op queue.
Even if we lose performance of 40 or 50 percent, we don't care: we don't want anything weird to happen, until there's this other flag that comes on and says: yeah, from now on it's application code, it's not security relevant anymore, you may optimize and decode in any fashion you want. Also, what's not shown on this diagram is the decoder and the op cache; they instruct the memory prefetcher.
So the prefetcher will actually fetch memory that's not used by your program yet, but the algorithm is like: yeah, there's a good chance it will be used soon, so let's fetch it from the memory, because the memory is really slow. Another fun fact: for the CPU, the part that we see on this diagram, the cache is mostly transparent.
So it does not see the difference between L1, L2, L3 and memory; it just operates on the thing, and the memory prefetcher will fetch stuff from the memory over L3 into L2 into L1, and then the actual compute units, the orange and the red ones here in the middle, will do their work. There are other compute units, and it would be another talk on what you can and cannot do with all of that.
Who came up with measuring this? It's insane, right: so many things, just between two instructions, and you usually never have to think about it. But one common thing is: if you have performance counters, you may use them with... what's its name... Prometheus, right. There's a Prometheus crate; you can say: okay, here is a bunch of counters.
I want to measure, I don't know, how many requests I get in my web server per, I don't know, second, or in total, maybe per route, right: you have the main page, some API call, some other API call. Then you maybe want to measure how much time it takes to fulfill all these requests, how many are authenticated...
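Such a counter could look like this (a sketch; the metric name, and the Relaxed ordering that the conclusion below recommends, are my choices):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// One global performance counter; Relaxed is enough, because we only
// want the count and impose no ordering on surrounding memory traffic.
static HTTP_REQUESTS: AtomicUsize = AtomicUsize::new(0);

fn handle_request() {
    HTTP_REQUESTS.fetch_add(1, Ordering::Relaxed);
    // ... actual request handling ...
}
```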
So we have our project here. Oh, by the way, this is xr, for the people that don't use it: it's very colorful. And then we say, in this rust-toolchain file: nightly. So that means all the cargo commands that we issue will automatically get redirected to nightly.
If we exit this folder and use cargo there, it's whatever else we have; in my case this is stable. So I can recommend that for measurement projects like this. Just as a side note: black_box is a really handy function, because it tells the compiler: don't optimize this away, don't touch it, just treat it as random input and output.
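A small sketch of black_box in a benchmark loop (today it is stable as std::hint::black_box; at the time of the talk this needed nightly):

```rust
use std::hint::black_box;

fn main() {
    let mut sum = 0u64;
    for i in 0..1_000_000u64 {
        // black_box hides the value from the optimizer, so the loop
        // cannot be folded away or precomputed at compile time
        sum = sum.wrapping_add(black_box(i));
    }
    black_box(sum); // keep the result "used"
}
```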
Let me see if I can show you the code real quick... src/main. So this is the trait, these are the counters, and then these are the tests; they're called bench. And I have this type argument that then gets matched by a constructor, and then the work gets done. So, moving on: each structure has eight fields, because I designed this test for up to eight CPUs; I only have eight. And, yeah, the fields are just named along the alphabet, and the order is important to us: so a is zero, b is 1, and so on. Normal is just atomic
usizes all the way. Then we have alignment: so we align the whole structure by 64 bits... bytes, pardon me. That means we will start at the cache line, right. So when the compiler allocates space for that thing in memory, it will start at a cache line boundary. This means it will not lap over cache lines, because a smaller structure may fit with the first half in the first cache line and the second half in the second, and this will cost us performance again.
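In Rust, that layout trick looks roughly like this (a sketch; the a..h field names follow the talk, the wrapper type is my own):

```rust
use std::sync::atomic::AtomicUsize;

// 64-byte alignment gives every counter its own cache line, so eight
// of them take 8 * 64 = 512 bytes instead of 64.
#[repr(align(64))]
struct Padded(AtomicUsize);

struct Counters {
    a: Padded, b: Padded, c: Padded, d: Padded,
    e: Padded, f: Padded, g: Padded, h: Padded,
}
```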
So if we just run one thread, it's like eight milliseconds per execution; that's not bad. And then with two it gets worse, right, and then it gets really bad. So, okay, let's just make sure that it's aligned... okay, let's align it so it starts the same. But apart from that, there's not much difference, right: the small numbers change, but the big ones are basically the same.
What about if we have cache-line awareness? Interesting, isn't it, right? We hardly lose any time anymore, because the CPU is actually able to access different parts, different caches, and... I'm really sad that... that's like pure fireworks. That's it! If we look at the numbers combined: that's it! In this case we save a factor of 10 with eight cores, and that's great.
Yeah, if we order differently, if we have Ordering::Relaxed, I hinted at it before: see these numbers, they come down a lot, like a lot, a lot, because now we can tell the CPU: yeah, just whenever, right; we don't need you to interrupt your memory prefetching. And this test is backed by, I think, 100k entries, so we allocate an array, one block of memory with 100k entries, and then, depending on the kind, we will access them overlapped or in different cache lines, and we actually gain speed.
A lot. Like, remember this one, the first cache-line-aware one: it's like 8.7 always, unless it's normal. With relaxed, we are faster even with eight cores working on these numbers.
So the takeaway here is: if you have performance counters, always, always use Ordering::Relaxed, because otherwise you will interrupt the... I'm sorry, I want to head to the conclusion... you will interrupt the program flow and the memory optimization so badly with your measurements that you will distort whatever you're measuring too much to have any, like, relevant data. Yeah, other conclusions:
we need atomic operations for many things. There are lots of great crates, the standard library being one of the big ones, that abstract away all of these things.
It frees our heads from the burden of thinking about all of this stuff. Like, I think most of you heard about this stuff for the first time: just be happy that someone else took care of it with an abstraction, and you can manage other problems than, like, running around and handling individual threads. Just use a pool, or think about which hardware capabilities your program has to use to be able to run or not; the compiler will just tell you. And with that: opening up to questions.
Actually, I should be a good citizen and show you the release mode. So this is the size, how big everything is. AtomicUsize: eight bytes, right, eight times eight, sixty-four bits; that's what we expect. If you align the structure, like eight fields with eight things, then it's 512 bytes long, so it's half a kilobyte for one data structure, and the others, that are normally aligned, are just 64 bytes.
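Those sizes can be checked directly (a sketch, assuming a 64-bit target and the Padded wrapper from above):

```rust
use std::mem::size_of;
use std::sync::atomic::AtomicUsize;

#[repr(align(64))]
struct Padded(AtomicUsize);

fn main() {
    assert_eq!(size_of::<AtomicUsize>(), 8);       // 8 bytes = 64 bits
    assert_eq!(size_of::<[AtomicUsize; 8]>(), 64); // packed variant
    assert_eq!(size_of::<[Padded; 8]>(), 512);     // half a kilobyte
}
```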
Right, yeah, that makes sense. Yes, it's pretty cool. I really loved your talk, it was amazing; I'm really into performance talks, and I loved it.