From YouTube: Rust Cologne, March 2018: Florian Zeitz - Caches and You
Description
One of the great things about Rust is that it gives you a lot of control if you need it. Amongst other things, it gives you control over memory: how big is your data structure really, and where should it be allocated? This talk will look at caching in modern CPUs in conjunction with Rust data types and data structures. We will see how efficient code can be written to best utilize the cache.
More on the website: http://rust.cologne/2018/03/12/what-are-you-plotting.html
So, yeah, I'm going to talk about caches. Quick poll, I guess: who thinks they're somewhat intimately familiar with CPU caches? I think that's one and a half to two and a half people, right. Okay. And then, actually, because we haven't really spoken about that: who feels like they're still more of a Rust beginner? Just so I know — okay. That gives me some pointers on how much to talk about this.

Okay, so to lead with: like everybody knows, there are two hard problems in computer science, right. The first one is naming things, the second one is cache invalidation, and the third one is off-by-one errors, obviously. What I actually want to talk about today is, like, three sections. So first I'll give you a quick caches 101: just what is the CPU cache, what does it do, why is it there. And then I'm going to talk about caches and performance.
So, a quick caches 101. Modern computers, right — there's a problem: access to RAM is relatively slow, and in particular it's relatively slow compared to how fast modern CPUs actually are. Modern CPUs run at about 4 gigahertz, and accessing just one piece of data from the RAM is a lot slower than that. So a lot of the time, if we're just reading from random access memory, we're waiting for memory; we can't compute with that memory. So the solution that's been deployed since forever is caches: small memories close to the core that are a lot faster.

Usually caching is done in two to three levels, so each level is further away from the core, and the level closer to the core is filled from the level above it. Some schematic pictures for that, just so you have seen it once: you usually have cores — let's say four per processor in this case. There are registers in those cores, which are basically the fastest thing the processor can access; you can usually read from them in a single cycle.
Then there is the L1 cache, which is a bit bigger, and outside of that the L2 cache, which is significantly bigger already than your L1 cache. So you get more data in it, but it's a bit slower. So if data is missing from your L1 cache, you might be lucky and it's still in the L2 cache: you get it from there instead of going back all the way to the RAM. And then again one layer out from that is your L3 cache which, if it exists — it doesn't always, but if it exists — usually spans all cores, so you have sort of a shared cache between cores. That's also kind of useful if you have threads that actually migrate between the cores, so they don't, after migration, have to access RAM; they can read from the L3. And then beyond that you obviously have RAM itself, which is the slowest thing in this picture; after that, in theory, there's swap space, if you have your RAM full and all that.
A
So
in
terms
of
how
caches
actually
work
it's
not
as
if
they
read
each
byte
individually.
If
you
get
data
from
the
RAM,
it
will
always
or
typically
always
end
up
in
caches
and
will
end
up
in
all
caches
up
to
the
CPU.
So
if
you
read
but
later,
it
is
an
l1,
l2
and
l3
cache,
and
it's
done
at
the
granularity
of
what's
called
the
cache
line.
So
you
can't
read
a
single
byte
or
four
byte
at
a
time,
but
always
would
call
the
cache
line.
Cache
is
working
that
granularities.
If you have to throw away data from the cache, all the data in that cache line will go away. The sizes differ between CPUs, obviously; a typical size today is something like 64 bytes. Right — so, cache invalidation, the thing from our initial joke. Modern caches are what's called coherent. You've seen that we have separate caches for separate cores: each core has its own L1 cache. But the thing is, if you read data from multiple cores, and also write to the same data from multiple cores, you obviously get problems with coherence if you're not doing anything special. Because then you'd write on one core, and it would only end up in the L1 cache of that core; it's not necessarily going through to memory. Another core might have the same data in its cache, and it would read the old data, because it never actually saw the write. Even if the write did go through to RAM, the other core would still have the old data cached from its earlier read and would still read the old value.
So what you have to do in order to get rid of that is implement some sort of what's called a cache coherency protocol and invalidate the data that's in other cores' caches. So whenever anyone writes, you have to invalidate that same data in everyone else's caches. And in addition to that, there are two other reasons that caches might drop data. With inclusive caches, if a line gets evicted from the shared L3 — say because of what another core is doing — then that data also has to be dropped from the L2 and L1 caches of, let's say, core 0. That's just an implementation detail and a property of inclusive caching, but also something to keep in mind: even if you're working on something that fits comfortably in the L1 cache, something another core is doing could potentially invalidate that data. So, some numbers for that, just to give you an idea. This is some info for a Skylake processor, so a relatively recent i7, I think.
The L1 cache is typically something like 32 kilobytes in size, the L2 cache is around 256 kilobytes, and the L3 cache is a whole 8 megabytes. But on the flip side, the further away from the CPU you get, the slower the whole thing gets: the L1 cache is already at about one nanosecond, the L2 cache is at about 3 nanoseconds, and the L3 is slower still.

Yeah, one quick last thing in the 101 section: there's prefetching. If you're reading linearly through data, either forwards or backwards, data is not fetched only when you actually access it; the processor also recognizes "hey, someone is iterating forwards through data — let me already fetch the things they'll probably access in the future". So while you're still doing calculations and not necessarily thinking about accessing the data, the processor will already have gotten it into the cache, so you can access it quickly, which is a kind of useful property.
So, on to caches and performance. I think this is the super classical example for this: matrix multiplication. The usual way you do matrix multiplication is, let's say, this: you have a function and you give it two slices. What I've done in this case is that these are two-dimensional — arrays of arrays — so each slice contains a set of arrays; the arrays all have length DIMENSION, and the slices, as you can see from this, also all have length DIMENSION. That's the theory about that.
And then the output is a vector. What we're basically doing is allocating a new result vector, then iterating over the rows and cells of the whole result vector with indices, and then we have that variable k to do your usual matrix multiplication. If you don't remember: matrix multiplication is "multiply the row by the column and add that up", and that's what this does — a is indexed with k as its second index, and b is indexed with k as its first index.
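Reconstructed from that description — this is not the actual slide code — the function might look roughly like this. DIMENSION, the element type (u32, matching the four-byte elements implied by "sixteen elements per cache line" below), and the exact index order are my assumptions:

```rust
const DIMENSION: usize = 1024; // assumed size; the talk doesn't state it

// Multiply two DIMENSION x DIMENSION matrices stored as slices of rows:
// result[row][cell] = sum over k of a[row][k] * b[k][cell]
fn matrix_multiply(
    a: &[[u32; DIMENSION]],
    b: &[[u32; DIMENSION]],
) -> Vec<[u32; DIMENSION]> {
    let mut result = vec![[0u32; DIMENSION]; DIMENSION];
    for row in 0..DIMENSION {
        for cell in 0..DIMENSION {
            for k in 0..DIMENSION {
                // a is walked linearly (stride 1); b is walked with a stride
                // of DIMENSION elements, i.e. one whole row per step of k.
                result[row][cell] += a[row][k] * b[k][cell];
            }
        }
    }
    result
}
```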
What that means in practice is that it goes linearly through a — each element in a is accessed one after the other — but it goes through b with huge jumps, where huge means DIMENSION elements, because it walks down a column instead of along a row. So the second element it touches in b will be one whole row further along in memory. A different approach to that: store your data differently.
So what we want to take from that: CPUs, as alluded to, are very good at linear reads — due to prefetching, but also due to the fact that, if you fetch one element, you get a whole cache line. In our case we have u32s, so if you fetch one element, you get sixteen elements read into the cache and don't have to do RAM reads for those sixteen elements; and while you work through those elements, there is time to do prefetching again and, in effect, fetch your next cache line, and things like that. The way you store data is important — you can always think about that: try to have data laid out linearly in memory instead of with big gaps, and try to keep it in large contiguous chunks. Also note that what I've deliberately done here is use those arrays.
If this were a nested vector instead, I'd have a heap allocation which basically contains pointers to further heap allocations, which means I'd still have DIMENSION elements directly adjacent, but then I might have a huge gap before I get the next DIMENSION elements. So using arrays in cases like this is often helpful. The reason I specifically chose a Vec at all — you obviously could do a two-dimensional array, just arrays, and put DIMENSION in twice — is that I needed it on the heap, because the size at this point is so big that it doesn't actually fit on the stack, and we don't have a good way yet to allocate an actual array directly on the heap. So that's what this does; it's the easiest way to get away with it. And the other thing is that a vector whose inner element is an array gets a nice data layout in memory.
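For comparison, a minimal sketch of the layout difference being described, with an assumed DIMENSION:

```rust
fn main() {
    const DIMENSION: usize = 1024; // assumed size

    // One contiguous heap allocation: DIMENSION rows of DIMENSION u32s back
    // to back, so iterating the whole matrix is a single linear read.
    let flat: Vec<[u32; DIMENSION]> = vec![[0; DIMENSION]; DIMENSION];

    // A Vec of Vecs: the outer Vec stores pointers, and each row is its own
    // heap allocation, so consecutive rows can be far apart in memory.
    let nested: Vec<Vec<u32>> = vec![vec![0; DIMENSION]; DIMENSION];

    let _ = (flat, nested);
}
```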
OK, a different example, a bit different. Let's say you're, I don't know, writing a physics simulation, writing a game. You have a huge pool of objects, and you're reusing those objects sometimes, but a lot of the time they are not alive, so to speak. So you always store them, but you don't always need to do computation on them. So you put an is_alive tag as the first element of the object and then, as you can see on the right, you iterate through that list of objects and just check for each one whether it's alive.

The object is slightly larger than a cache line on this particular system, and that's typical on many of them these days. So what we're doing is fetching a cache line, or maybe even two, for the object, checking — let's assume — the very first byte of it, and then, if it isn't alive, we don't care about it anymore; and we do that for each object. So basically every object access, even if you're doing no computation on it at all, gives you a cache miss, and you have to load a cache line.
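A sketch of the pattern being described — the struct layout and field names are my assumptions, chosen so one object is a bit larger than a 64-byte cache line:

```rust
// Assumed object layout: the liveness flag plus enough payload that one
// object is around a cache line or more in size.
struct Object {
    is_alive: bool,
    position: [f32; 3],
    velocity: [f32; 3],
    payload: [u8; 48],
}

fn update(objects: &mut [Object]) {
    for obj in objects.iter_mut() {
        // Even when the object is dead and we do nothing else with it,
        // reading `is_alive` pulls in a whole cache line for the object.
        if obj.is_alive {
            for (p, v) in obj.position.iter_mut().zip(obj.velocity.iter()) {
                *p += *v;
            }
        }
    }
}
```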
So what do we do instead? There are a couple of things you can typically do. One of them is a separate vector just for that flag — and make sure the indices line up. That way you can iterate over the first vector to check liveness, or whatever other information you might need — information you iterate over often, as opposed to data that you only read rarely — and that will be cache efficient, because everything is just one after the other, and it gives you nice prefetching.
The other way to do it is a conversion that's not too typical, I think, but seen relatively often in things like physics and high-performance-computing stuff, which is turning an array of structures into a structure of arrays. So where you now have a vector of objects, you would instead have, for each field of the object type, a vector containing that field for all the elements.
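A minimal sketch of the two alternatives — a separate, densely packed liveness vector with matching indices, or a full structure of arrays; the type and field names are mine:

```rust
struct Object { /* payload fields, as in the sketch above */ }

// Option 1: keep the frequently-scanned flag in its own dense vector,
// with indices lined up with the object vector.
struct World {
    alive: Vec<bool>,     // scanned every frame, very cache friendly
    objects: Vec<Object>, // only touched for objects that are alive
}

// Option 2: structure of arrays - one vector per field instead of one
// vector of structs.
struct Objects {
    alive: Vec<bool>,
    position_x: Vec<f32>,
    position_y: Vec<f32>,
    velocity_x: Vec<f32>,
    velocity_y: Vec<f32>,
}
```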
In theory, compilers can do that automatically, but it's, let's say, rarely successful, and the main reason is basically that, in order to be able to do that optimization, you have to be sure that nobody else is looking at that type — because if you're changing the type, the external interface obviously becomes different. So if the data is sufficiently private, compilers will actually, in some cases, do that optimization for you. Okay — is everyone done reading? Then I'm doing a decent job. Okay.
So, this one I don't have a cute code example for, but it's a general point: code size matters. If you have some hot code, ideally that code would fit into your L1 instruction cache — which, in this case, I'd say is also 32 kilobytes. Usually people say that inlining is always cheaper, or whatever, because function calls have overhead and everything; but particularly for code that's being repeated, it might be nicer to actually outline that code into a function, if that makes it fit into the cache, so you don't get a cache miss every time that code runs. That can happen with large amounts of code overall.
So if that's what's happening to you, you can try to annotate functions with #[inline(never)], if you think that gets you a speedup. Also interesting: for generic functions you get, for each type argument you put into the generic function, a copy of that function, so that also blows up the size of the code that's actually being executed at a time. You can try to get rid of that to some extent by passing a trait object — I don't know if that's a trait object or a reference to a trait; I'm never quite sure whether we say "trait object" just for Box<dyn Trait> or also for &dyn Trait — but basically: take something that just implements the trait, as a reference, and do it that way, which won't give you the code duplication. But for all of this advice, obviously: measure. If you're optimizing hot code, as a rule of thumb you never know in advance what's actually the problem. You can try those things, you can look at how big the code actually is, and you can try some of this advice, but always measure, always check whether it helps or doesn't help, and only keep it if it does — because in particular things like #[inline(never)] on very small functions can also significantly make things worse.
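As a hedged illustration of the two knobs mentioned — #[inline(never)] to keep cold, repeated code out of line, and a trait object instead of a generic parameter so only one copy of a function is compiled — something like:

```rust
use std::fmt::Display;

// Forcing a cold, repeated chunk of code out of line can keep the hot loop
// small enough to stay in the instruction cache.
#[inline(never)]
fn handle_rare_error(code: u32) {
    eprintln!("something went wrong: {}", code);
}

// Monomorphized: one copy of this function is generated per concrete T.
fn print_all_generic<T: Display>(items: &[T]) {
    for item in items {
        println!("{}", item);
    }
}

// Trait object: a single compiled copy, at the cost of dynamic dispatch.
fn print_all_dyn(items: &[&dyn Display]) {
    for item in items {
        println!("{}", item);
    }
}
```

Whether either of these actually helps in a given program is exactly what the "always measure" advice is about.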
So that's something to look out for. Okay — this is the code that I was referring to before. It's super synthetic, not the kind of thing you will ever find in actual code, but it's something that shows the phenomenon really nicely. Just for reference, each element is 8 bytes, because this is a 64-bit system. What I'm doing is getting another reference — a reference-counted handle — to the x array and passing it to a new thread, and within the thread I'm, 100 million times I think, incrementing an element; and outside the thread, so on my own thread, I'm also incrementing another element that many times. The particular elements which I'm incrementing are passed into the function as i and j. So is it sufficiently clear to everyone what that function actually does?
Maybe — okay. So basically it increments two different elements of an array, a lot. And it turns out that incrementing adjacent elements in that array is a lot slower than incrementing elements that are, in this case, seven or eight apart. So — any takers for why that might be happening? Any ideas?
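The demo code isn't shown in the transcript, but reconstructed from the description it is roughly the following. The real code presumably uses plain integers behind an Arc, while this sketch uses atomics to stay within safe Rust; the iteration count is the "100 million, I think" from above:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;

const ITERS: usize = 100_000_000; // "100 million, I think"

// Increment element i on a new thread and element j on the current thread.
// When i and j land on the same 64-byte cache line, the run is much slower:
// every write invalidates that line in the other core's cache.
fn bench(i: usize, j: usize) {
    let x: Arc<Vec<AtomicUsize>> =
        Arc::new((0..16).map(|_| AtomicUsize::new(0)).collect());
    let x2 = Arc::clone(&x);

    let start = Instant::now();
    let handle = thread::spawn(move || {
        for _ in 0..ITERS {
            x2[i].fetch_add(1, Ordering::Relaxed);
        }
    });
    for _ in 0..ITERS {
        x[j].fetch_add(1, Ordering::Relaxed);
    }
    handle.join().unwrap();
    println!("i={}, j={}: {:?}", i, j, start.elapsed());
}

fn main() {
    bench(0, 1); // adjacent: false sharing, slow
    bench(0, 8); // eight usizes (64 bytes) apart: separate lines, fast
}
```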
Yeah — so the answer was that the two elements share a cache line. Precisely, that's actually the problem. What's happening here is that, when we have exactly adjacent elements, both threads are writing to the same cache line: one thread is writing to the first eight bytes of the cache line, the other thread is writing to the next eight bytes of the same cache line. And, as we said before, if writes to a cache line happen on a different core, you have to invalidate it on all the other cores.
So
typically,
this
is
a
cruise
if
values
are
on
the
same
cache
line,
obviously
because
otherwise
you
won't
have
this
false.
Sharing,
like
will
be
different
cache
lines
and
each
one
can
really
add
independently
on
their
own
cache
line.
Access
needs
to
be
by
different
cores.
Access
needs
to
be
somewhat
frequent,
let's
say,
because
only
only
then
will
that
actually
make
a
difference.
A
So
if
cache
is
invalidated
anyway
between
your
writes
like
but
something
else,
but
just
using
data,
this
won't
matter
right,
but
if
it's
somewhat
frequent
and
you
actually
invalidated,
while
someone
else
cool
I've
wanted
to
access
it,
that's
a
problem,
and
there
has
a
piece
to
be
at
least
one
writer,
because
if
they're
only
reads
no
curtain
will
ever
be
invalidated,
I
think
this
code
actually
might
want
to
run
just
to
sew
something.
So,
oh
great.
A
A
This is basically how I benchmark that. The first thing is that I actually run the function once without measuring anything, and the reason you'd want to do that, if you're measuring anything related to caches, is that otherwise it usually messes things up: you get cases where you're measuring, in theory, different things and notice that the second one is always faster, even though you didn't expect it to be — and in that case the effect is usually that your cache was not warm. The first time it ran, everything got loaded into the cache; the second time it ran, everything was already in the cache, so it was a lot faster. And that also applies to different code, of course: if you run another algorithm entirely, but on the same data, the second algorithm will always get a positive effect from the first one having run, because the data already tends to be in the cache. So basically I'm always warming up the cache first, and then just using Rust's Instant interface — very, very basic — to get measurements. And, yeah, let's see whether it runs.
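The warm-up-then-measure pattern being described, as a minimal sketch using std::time::Instant:

```rust
use std::time::Instant;

fn measure<F: FnMut()>(name: &str, mut f: F) {
    // Warm-up run: get the data (and the code) into the caches first, so the
    // measured run isn't penalized just for starting with a cold cache.
    f();

    let start = Instant::now();
    f();
    println!("{}: {:?}", name, start.elapsed());
}
```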
So, it takes a while — but actually not too long, I hope. Right: immediately adjacent entries, in this case, more than five seconds — so, a lot of time. And when we set the entries eight apart, so not on the same cache line: less than a second. That's a very, very noticeable effect of false sharing. And as a way to actually prove that that's what's happening, I can force those threads to be on the same core, and if I do that...
So, this one is a different example, and credit for this one actually goes to — I forget the name. The basic thing you want to do is count the number of odd elements, in the matrix again, and we'll do it threaded. So basically, we get ourselves a result array, sufficiently large to be usable for different numbers of threads, and then we iterate over the matrix, but chunked into pieces — the number of lines divided by the number of threads.

The way we count elements in this case is that we write to one of the elements of our result vector: the first thread writes to element zero of the result vector, the second thread to element one. So, obviously, this should be a pretty classical case of false sharing right there: we write to adjacent elements, and the different threads are on different cores, supposedly, so it becomes a problem. The way you'd fix that is typically pretty obvious, actually. Can I get them both up on the screen? Well, sort of. Oops — sorry!
So the way you fix that is you introduce a local variable. Within the thread you just have your own local variable for the count — it's not in the array, not on the same cache line or anything — you increment that, and only at the very end do you write to the result vector, once. The problem with this code is, as I said, that it doesn't actually measure the right thing, because it turns out — and I've not looked at the assembly closely enough — it turns out that even with a single thread, where false sharing should not matter at all, the variant with the local variable is significantly faster. My guesstimate of why it is always significantly faster is just that the variant without the local variable always accesses memory — it always writes to the vector — but the one with the local variable can increment the variable in a register, so you're never accessing memory at all, not even the L1 cache, not anything. So, yeah, I think it's a bit unfortunate that it doesn't demonstrate false sharing like that.
But for me it's an interesting lesson, because what it shows is: the best thing you can do about caching and accessing memory is to not access memory. If you can do the work in a single register for some time, then only touch memory once, when you actually really need to — have a small local variable, just like this.
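Reconstructed from the description, the two variants of the per-thread counting loop look roughly like this; the function and parameter names are mine, and the thread setup is omitted:

```rust
// Variant 1: every increment writes to the shared result slot. Besides the
// false-sharing risk between threads, the count can never stay in a register.
fn count_odd_shared(chunk: &[u32], result: &mut usize) {
    for &value in chunk {
        if value % 2 == 1 {
            *result += 1;
        }
    }
}

// Variant 2: accumulate in a local variable (which the compiler can keep in
// a register) and touch the shared slot exactly once at the end.
fn count_odd_local(chunk: &[u32], result: &mut usize) {
    let mut count = 0usize;
    for &value in chunk {
        if value % 2 == 1 {
            count += 1;
        }
    }
    *result = count;
}
```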
Okay, so I thought I'd just do a quick run through a number of typical Rust data structures — not a lot of them, a few selected ones. Vectors, obviously: from a cache point of view, amazing, because a vector is one contiguous data structure in heap memory — no gaps, no nothing. You can iterate through it pretty fast; as I said, linear reads are what CPUs are great at. Whether you read through it forwards or backwards, you're good from the cache perspective, which is nice. And it turns out vectors are so good for caches that they often outperform other data structures at the very discipline those other data structures were specifically designed for, at least up to a certain number of elements, just because the cache often trumps everything.
If the other data structure for some reason has to use separate nodes that are spread apart in memory — sparse rather than tightly packed together — you get negative effects from that, and everything being in the L1 cache, as it usually is for the elements of a vector, is just nice.

So, linked lists. Linked lists are almost only in here because, if this talk were given in another language, someone would tell you about linked lists, because they show the effect so well. Who's ever used linked lists in Rust? Right — and it turns out I've never actually seen anyone use linked lists in Rust, but it exists, it's in the standard library, it always has been: std::collections::LinkedList is there. So: linked lists are not so great for caches, and that's usually why they're the example of "don't use this". Linked lists have one heap allocation per node — you have those separate nodes per element — and each node has one pointer to the previous element, one to the next, and the actual data.
Each of them is a separate heap allocation, and those heap allocations are not necessarily contiguous: unless you allocate them close together in time, they're very likely to be spread apart across your heap, and even if that's not the case, the allocator will usually look for bins that are pretty close to the size you're allocating, so they might still be spread out even if you allocate them back-to-back.

So iteration is pretty expensive: you're chasing, as I said, the next pointers through the linked list, and you are likely to get one cache miss per element, because you're jumping all across memory. And then the point I'd usually be making here is about insertion into linked lists — you might have been told, with great certainty, that linked lists are great for insertion, because insertion is very cheap.
Well, that insert function straight up does not exist, so, yeah, that's nice, I guess. What does exist is the append function, so you can link two linked lists together, which is super cheap — again, just changing two pointers — and actually, because we keep a pointer to the very back of the list, you don't have to walk across it or anything. So that's legitimately cheap, and legitimately a lot cheaper than chaining together two vectors.

What's also relatively cheap is pushing to the front or the back of the list — though, particularly for the back, it's again very likely to be slower than pushing to the back of a vector, because for the list you have to do the allocation for the new node, while for the vector you probably already have the heap space for that element, because vectors have a capacity, separate from their length, that grows in powers of two, so you only really rarely actually reallocate memory.
For vectors, push_front actually doesn't exist, interestingly enough. If you actually need push_front and push_back on a data structure, use VecDeque — that's the trick to basically get a push_front. The trick with VecDeque is that the underlying data structure is a vector — so, vector-great: one contiguous heap allocation and all that — used as a growable ring buffer. It wraps around past the last element of the vector back to the front, and whenever that ring buffer becomes too small, so the whole array is full, it will actually double in size, like a vector would with its memory allocations, and you get a larger ring buffer. That way you have super cheap push_front and push_back on that one too — and vector-cheap, not linked-list-cheap: you're not doing a heap allocation and storing a pointer to that heap allocation, you're actually just writing into a slot of the vector and you're done, apart from changing two indices, because it's a ring buffer.
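For reference, the VecDeque usage being described:

```rust
use std::collections::VecDeque;

fn main() {
    let mut deque: VecDeque<u32> = VecDeque::new();

    // Both ends are cheap: just a write into the ring buffer plus an index
    // update, with no per-element heap allocation like a linked-list node.
    deque.push_back(1);
    deque.push_front(0);

    assert_eq!(deque.pop_front(), Some(0));
    assert_eq!(deque.pop_back(), Some(1));
}
```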
So, HashMap — another data structure that's kind of interesting and used a lot — and hash maps in Rust actually happen to be pretty good. Interesting in the sense that people seem to be, let's say, a bit pissed off about C++'s... what is it, unique_map, I think? Is that the equivalent, for the C++ programmers? Yes — yeah, std::map is the one that people agree is shitty anyway. Oh yeah, unordered_map, right, the brain is working. Okay, so you have std::map and, as I now know, unordered_map, and unordered_map is pretty much what's equivalent to our HashMap, let's say, because std::map also has to keep up some ordering properties, I guess.
So it turns out the interface that C++ programmers — that everyone — gets is apparently specified in a way that only really works if your hash map is implemented in one specific way, and one specific way in this case means the implementation you probably heard about in university: you compute a hash, that hash gives you an entry in a list, and that entry in the list is itself a list of everything that hashes to the same value.

So if you care deeply about that, you might want to look at one of those. In any case: yes, it's cool, much better than in other languages; but, necessarily, just because it's a hash map it's not as good as a vector in terms of cache behaviour — then again, it also does a completely different thing: it does key-value lookup, and vectors just store your data.
There's also BTreeMap; the documentation will tell you nice things about how this is better than a regular binary tree, and there are probably also plenty of blog posts about it. The basic idea is: instead of the usual binary tree, where you allocate a node for each tree element — and again you'd get that pointer-chasing kind of thing — a B-tree keeps a number of elements together in each node, which is a lot friendlier to the cache.
So try to do that as often as you can: try to keep your working set small — the data that you're working on — both in terms of instructions and in terms of memory, if you need it to be fast; particularly in hot loops, obviously, because that's what you want to optimize. And if in doubt, use Vec; and second to that, maybe use VecDeque if you need a push_front for whatever reason.
Question about alignment — actually, I'm also not sure. I think there might now be an align attribute; I couldn't tell you for what exactly, though. There are technically multiple places where having an align attribute would make sense. It would make sense on actual data type definitions, so you could, let's say, define your vector as an SIMD vector type and make sure it's always aligned to 128 bits or whatever the requirement is.
There's probably something like #[repr(align(...))] or the like, and then, whenever the type is used, you get that alignment. And it would also make sense on individual variables, if you have just a very generic type, like an array, but a use case in mind that needs a certain alignment. I think at least the former has an attribute — it might be nightly-only, don't quote me on that — but there's certainly space to be filled there. Yes?
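On current stable Rust this exists as #[repr(align(N))] on type definitions. A common cache-related use is padding per-thread data out to a full cache line, so that neighbouring values can't falsely share one — a sketch:

```rust
// Force each counter onto its own 64-byte cache line, so two threads
// updating neighbouring counters no longer invalidate each other's line.
#[repr(align(64))]
struct PaddedCounter(u64);

fn main() {
    assert_eq!(std::mem::size_of::<PaddedCounter>(), 64);
    assert_eq!(std::mem::align_of::<PaddedCounter>(), 64);

    let counters = [PaddedCounter(0), PaddedCounter(1)];
    let _ = counters;
}
```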
I think part of the answer is: just do it. Coming back to the point where we'd usually talk about insertion and linked lists: as I said, they are not as good as you think they are anyway, because you have to walk the linked list; but also, they are surprisingly inefficient compared to vectors at things CPUs are apparently very good at — and, not surprisingly, given that a vector is straight linear memory, one of those things is shifting data around. I haven't done the benchmark myself, and I have forgotten the actual numbers, but people have compared inserting in the middle of a linked list with inserting in the middle of a vector — which obviously requires you to copy a whole bunch of trailing memory over — and the point at which linked lists actually became faster was surprisingly late: a few thousand, if not 100,000, elements or something like that, for, I think, typical 32-bit elements.

So, yeah: just do it, and don't expect it to be slow just because you feel like shifting things around should be slower. That's part of the answer. The other part is: just use the insert function where it exists — apparently it doesn't exist for data structures where it's inherently slow, interestingly enough. I feel the standard library guides you pretty well in terms of what you should be doing with a type.
If you tried to use push_front on a Vec, you'd see there is no push_front; and then, let's say, you go to the Rust IRC channel and ask why, or why it performs the way it does, and people will tell you: use VecDeque. It does basically the same thing, but it has a push_front, and that push_front will be a lot faster than it could ever be on the vector. So I feel like Rust is guiding you, to some extent, through which operations are actually available.
Honest answer: I don't know. The really, really broad answer is that the same tools that work for C++ often work for Rust, particularly in the Linux and open-source space. If a tool reads the DWARF debug info, there's no inherent reason it wouldn't work as well for Rust as it does for C++: you should be able to get the sample counts for a performance counter matched up to an address anyway — that's what these tools do — and then that address gets matched, based on the DWARF info, to a source line.
Audience: Some piece of advice that is often floating around is that you should keep enum variants of basically the same size. For example, you typically have a Result, and your Ok value is just a String, but your error type is a custom enum that has some variants that are very much larger than 24 bytes. So is this bad?
I think, in general, that would depend on the use case — on the point at which it makes sense to do something about it — but I think the observation is sound: you're obviously wasting a lot of potential cache space if you have got big variants in the enum, which may or may not matter, depending on which variants you actually use for the result type. The specific argument that I've heard people make is about the good case — the Ok case — that almost always happens:
A
Is
you
get
the
OK
variant
and
then,
if
you
do
so,
if
you,
of
course,
if
you
do
things
like
map
over
something
that
we
turns
a
lot
of
results
and
in
theory
all
of
them
are
like
you
ate,
and
one
bite
large
Patera
type
is
like
20
bytes
or
whatever.
That
means
you
have
to
access
20,
bytes
of
enum
and
theory.
Each
time
you
you
access
one
of
those
results
and
have
much
larger
gaps
in
between.
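A common way to act on that observation is to box the large error type so the Result itself stays small; a hedged sketch with a made-up error enum:

```rust
use std::mem::size_of;

// Hypothetical large error type, much bigger than the Ok payload.
enum BigError {
    Parse { line: u64, column: u64, context: [u8; 64] },
    Io(String),
}

fn main() {
    // The Result is as large as its largest variant...
    println!("Result<u8, BigError>:      {} bytes",
             size_of::<Result<u8, BigError>>());
    // ...but boxing the rare error variant shrinks it to pointer-ish size.
    println!("Result<u8, Box<BigError>>: {} bytes",
             size_of::<Result<u8, Box<BigError>>>());
}
```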
Audience: I'm a newbie to Rust, so my question is: is there some special trick that you can do in Rust that you cannot do in another language? For example — a good example is this inline case, where you say sometimes it's better to inline and another time it's better not to inline — could the language help you with this? Maybe you could say "inline only short stuff", or what kinds of things can a language do here?
Good question. I think just having the #[inline] attribute in itself is already something you could mention as something the language does for you — you can give it hints. Standard C and C++ certainly don't allow that; there are probably GCC extensions for it, though, and probably MSVC extensions as well. Other than that — good question, I...
Right — so, something I've mentioned in passing: what will help you keep data small is to just structure it appropriately. Well, that's the C advice for this, because in C your structs will always have the fields in the same order as you write them out. So if you do things like put an 8-bit integer first and then a 32-bit integer, you'll have three bytes of padding between them, which is space you're potentially wasting.
If you put the one-byte integer at the very end instead, it would not necessarily waste all three bytes. That's actually a terrible example, because it still pads for alignment reasons, but as soon as you have, say, two of those one-byte integers, it would help. That's something Rust does automatically: from the beginning we have said that a struct's memory representation is undefined in that sense — fields can be reordered by the language — and by now they actually are reordered; this has been implemented within, I think, the last year, which is amazing.
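A quick way to see the effect is to compare the default layout, which the compiler may reorder, against #[repr(C)], which keeps the declaration order the way C does:

```rust
use std::mem::size_of;

// Default repr(Rust): the compiler may reorder fields to minimize padding.
struct Reordered {
    a: u8,
    b: u32,
    c: u8,
}

// repr(C): fields stay in declaration order, so padding gets inserted after
// `a` (3 bytes) and after `c` (3 bytes) to satisfy u32 alignment.
#[repr(C)]
struct DeclarationOrder {
    a: u8,
    b: u32,
    c: u8,
}

fn main() {
    println!("repr(Rust): {} bytes", size_of::<Reordered>()); // 8 on current rustc
    println!("repr(C):    {} bytes", size_of::<DeclarationOrder>()); // 12
}
```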
So that helps, because it just keeps structs smaller than they would be in other languages, and you can still write the fields in the order that feels semantically meaningful to you. Good point — things that could help, that come to mind, not necessarily specific to the language: having things like actual development or debug output for how large functions got, in terms of code size, during compilation.
Audience: ...it can see this and basically optimize it down to a very tight loop, because it knows the type; but in other cases that's not possible, and the compiler will generate a huge bunch of code for each of these functions, and in that case it's much, much easier to reduce the code size by using a trait object. And this is quite easy in Rust — and it's not possible, or much harder, in other languages.