Description
Tom comes from a background of maths & statistics, getting his start in programming in Python and R for ML and statistics workloads.
He found Rust a few years ago whilst looking for a new language to learn and now uses it at work and for side-projects.
Alrighty, cool: this is a little talk I like to call a foray into thread per core programming, or architecture, something I discovered semi-recently and thought was just exceedingly cool, and when my friends volunteered me for a talk I was like, yeah, sure, I'll talk about that, that seems neat. Cool, yeah. So this is a foray into thread per core programming. I wanted to put a Zoolander meme in here, but I wasn't sure about copyright. Alrighty, let's just set the scene a little bit, a little bit of background.
Lots of applications, probably lots of ones you're already writing or have already used, do quite a lot of stuff. There are probably a lot of tasks that have to be done, and a reasonable degree of parallelism: do some work here, and when that's not being done, drop back to something else. Something like serving requests, which is latency sensitive, but at the same time you want to run some background script to clean things up, or vacuum some table, etc. Lots of reasonable degrees of concurrency, all familiar to us. Also, core counts on CPUs, or in the case of something like Lambda, go up; the number goes up every year. It's like CPU manufacturers keep saying, and here's something with even more cores in it.
And you're like, cool, yep, that's great. And Rust is nice and speedy, and it's got some cool parallelism things: sharing XOR mutability, Rayon, etc. Work stealing, that's cool! Could we do more with that? Probably, like get more out of these cores?
Okay, oh, and I missed the one about NVMe devices. A few years ago, obviously, storage would have been slow, and the advice was don't serialize that to disk, because disk is slow, blah blah blah. That gap is now somewhat smaller: with things like NVMe drives you can get commodity hardware doing something like seven gigabytes a second of reads. It's nuts. Anyway, what does this lead to? This is going towards kind of unlimited power.
If you've got an NVMe drive, and you've got a modern Linux, and you've got like 32 cores on your Threadripper, you can do a lot, right, if you can harness all of those. Cool, so, introducing the concept of thread per core programming. Pretty straightforward: it's basically, you have a thread per core. Terribly uninteresting in and of itself, but that's really only the start of it.
The underlying theme here is that we want to divvy up our application. Yeah, we want to divvy up our applications so that we separate the incoming work into as many independent shards as we possibly can, and we don't really want them to communicate too much, because that would mean waiting in queues, or waiting for locks and mutexes and stuff, and, as evidenced by the next slide, welcome to the inconvenience queue, because waiting for things is boring and bad.
So let's not do any of that. So we can do some things: we can shard our incoming data, and then we can hand that off to a thread, and we can pin the thread to a CPU, so your operating system won't then punt your thread and shuffle it around. That's great for your instruction cache locality and your data cache locality, because the next time your thread runs it won't be, oh, and by the way, here, let me reload all of your data back into the cache. If you don't need to spend time waiting on that, that's time you can spend doing productive work, right? That's good. And not shuffling stuff around, which I already covered: good for throughput and latency.
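Roughly, the shape of that is one pinned OS thread per core, each owning its own shard. Here is a minimal sketch of the idea, assuming the `core_affinity` crate for the pinning (the talk itself uses a framework for this, introduced below, and `do_shard_work` is a made-up placeholder):

```rust
use std::thread;

fn main() {
    // Ask the OS which cores we can run on.
    let cores = core_affinity::get_core_ids().expect("couldn't query core ids");

    let handles: Vec<_> = cores
        .into_iter()
        .map(|core| {
            thread::spawn(move || {
                // Pin this worker to a single physical core so the scheduler
                // won't shuffle it around and evict our caches.
                core_affinity::set_for_current(core);

                // Each worker owns its own shard of the data and never needs
                // locks to talk to its siblings.
                do_shard_work(core.id);
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}

// Placeholder for whatever per-shard work the application does.
fn do_shard_work(shard: usize) {
    println!("worker for shard {shard} running");
}
```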
If you follow this architecture, you can cut your tail latencies. I think it was Microsoft who did a study showing, basically, that if you can cut the worst of your tail latencies, you can end up reducing your main application's serving latency, which is cool because it's kind of counterintuitive: you'd think, oh, make the fast parts go fast, and that's how I get my application to go faster. Not quite. What else? ScyllaDB, the Cassandra drop-in, uses thread per core via a library called Seastar, and Redpanda, which is a Kafka drop-in, is also written in C++. I know,
this is a Rust talk and those are C++, but bear with me: it also uses thread per core for reducing some of that tail latency and getting more out of their machines. Cool, that sounds great. I've really sold it, lots and lots of marketing talk. How do we get started? How do we do anything with this? Cool: introducing thread per core programming, and the way to start is, you go to your local Spotlight and you find the thread of your choice.
No? Okay, yeah, moving on. Seriously, oh God. Seriously, though, there is a very cool framework called Glommio (I don't know how to pronounce it) that gives you thread per core functionality and some async functionality for your threads, along with, if your Linux kernel is recent enough, a thing called io_uring for async I/O, which is really cool, and Direct I/O. If you have an NVMe drive, Direct I/O lets you skip the file system cache on the way in and out, which means you can write more or less directly to your NVMe drive and read directly from your NVMe drive, which means noisy applications don't slow you down: you don't have to wait for the page cache to flush, and you don't get held up by other things going on. Yeah, very handy, very cool, very, very modern, and it also has some really cool scheduling features.
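For a flavour of what that looks like, here is a rough sketch of a single Glommio executor pinned to a core doing a Direct I/O read. The names (`LocalExecutorBuilder`, `Placement::Fixed`, `DmaFile`) follow Glommio's documentation, but exact signatures vary between versions (older releases pinned via a `pin_to_cpu` builder method), so treat this as an illustration rather than the project's code:

```rust
use glommio::io::DmaFile;
use glommio::{LocalExecutorBuilder, Placement};

fn main() {
    // One executor, pinned to physical core 0, driven by io_uring.
    let handle = LocalExecutorBuilder::new(Placement::Fixed(0))
        .name("io-worker")
        .spawn(|| async move {
            // O_DIRECT-backed file: reads and writes bypass the page cache
            // and go (more or less) straight to the NVMe drive.
            let file = DmaFile::open("segment.bin").await.unwrap();
            // Offsets and sizes should be aligned to the device's block size.
            let chunk = file.read_at(0, 4096).await.unwrap();
            println!("read {} bytes via Direct I/O", chunk.len());
            file.close().await.unwrap();
        })
        .unwrap();

    handle.join().unwrap();
}
```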
Is anyone familiar with control theory in engineering? Yes? Yeah. Each thread gets its own scheduler, obviously, because each thread has a local async executor, and these controllers that are attached to your thread are powered by control theory. So you can go, oh, that's cool, I would now like separate async task queues, and maybe for one task queue the latency doesn't matter and for another task queue the latency does matter. The tasks can specify their latency, and the scheduler on your thread will be like, oh, cool, there is no work in my latency-sensitive task queues, I can just, you know, continue ticking through my non-latency-critical work; and then, as work comes in from the latency sensitive task queues, it will shunt your latency insensitive tasks out of the way and be like, sorry, these things have to complete first.
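In Glommio terms that means per-thread task queues with a latency requirement attached. A hedged sketch from the documented API (`create_task_queue`, `Latency`, `Shares`, `spawn_local_into`); the shares, durations, and queue names here are made up for illustration, and the exact calls may differ between versions:

```rust
use std::time::Duration;

use glommio::{executor, Latency, LocalExecutor, Shares};

fn main() {
    let ex = LocalExecutor::default();
    ex.run(async {
        // Latency-sensitive queue: the scheduler preempts other queues to
        // keep this one's tasks inside roughly 10ms.
        let serving = executor().create_task_queue(
            Shares::Static(1000),
            Latency::Matters(Duration::from_millis(10)),
            "serving",
        );
        // Background queue: latency doesn't matter, it soaks up whatever
        // CPU time is left over.
        let background = executor().create_task_queue(
            Shares::Static(100),
            Latency::NotImportant,
            "background",
        );

        let fast = glommio::spawn_local_into(async { /* answer a request */ }, serving).unwrap();
        let slow = glommio::spawn_local_into(async { /* compact a segment */ }, background).unwrap();

        fast.await;
        slow.await;
    });
}
```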
It all gets nice and logical, very easy to get your head around, because you know exactly what's going on. The other neat thing is that, because these async executors are all thread local, your futures and what you await no longer have to be Send and Sync, because they never leave the thread. You can have thread-unsafe things now, because it's like, no, this is fine, it's not going anywhere, your ownership's all good. Cool.
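A tiny sketch of what that buys you: a future holding an `Rc<RefCell<...>>` (neither Send nor Sync) across an await point, spawned onto a thread-local executor. Glommio's `spawn_local` is assumed here; any single-threaded executor behaves the same way:

```rust
use std::cell::RefCell;
use std::rc::Rc;

use glommio::LocalExecutor;

fn main() {
    let ex = LocalExecutor::default();
    ex.run(async {
        // Rc and RefCell are !Send and !Sync, and that's fine here: this
        // state never leaves the thread it was created on.
        let hits = Rc::new(RefCell::new(0u64));

        let task = glommio::spawn_local({
            let hits = hits.clone();
            async move {
                *hits.borrow_mut() += 1;
            }
        });
        task.await;

        println!("hits: {}", hits.borrow());
    });
}
```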
So, putting it together, I've started a little project called Tarkhein, because it's a forest and I thought it was cool. It's a little reverse text search application.
A
It
uses
these
things
called
percolate
style
queries
traditionally
in
a
full
text
search.
You
would
persist
your
documents,
you
would
index
them
and
then,
when
you
make
a
query,
the
query
is
ephemeral
and
you
kind
of
you
look
through
all
your
documents
and
you
find
things
that
matches
and
you
come
back
at
that
point
in
time
with
the
said:
results,
percolate
style,
text
search,
works,
the
opposite.
A
You
store
your
queries
and
you
stream
the
documents
through
it
and
you
then
sort
of
you
build
up
a
persistent
set
of
results
and
you
can
like
you,
can
notify
on
change
or
you
can
notify
on.
You
know,
you've
got
more
hits
or
the
the
search
order,
change,
etc,
etc.
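In toy form (plain Rust, no search library, purely to illustrate the inversion, not the project's implementation): the queries live for the whole run and the documents stream past them:

```rust
use std::collections::HashMap;

/// A stored query: here just a set of terms that must all appear.
struct StoredQuery {
    name: &'static str,
    terms: Vec<&'static str>,
}

fn main() {
    // Register queries up front; they persist for the whole run.
    let queries = vec![
        StoredQuery { name: "rust-io", terms: vec!["rust", "io_uring"] },
        StoredQuery { name: "storage", terms: vec!["nvme"] },
    ];

    // Persistent result sets, one per stored query.
    let mut results: HashMap<&str, Vec<usize>> = HashMap::new();

    // Stream documents through the stored queries.
    let docs = [
        "rust and io_uring are a nice pair",
        "nvme drives read at seven gigabytes a second",
    ];
    for (doc_id, doc) in docs.iter().enumerate() {
        for q in &queries {
            if q.terms.iter().all(|t| doc.contains(t)) {
                // A hit: append to that query's result set (this is where
                // you'd notify subscribers about the change).
                results.entry(q.name).or_default().push(doc_id);
            }
        }
    }

    println!("{results:?}");
}
```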
It's useful if you want to search a lot, a lot, a lot of data and you have a reasonable idea of what you're looking for, or you would like to rerun the query lots and lots and lots of times and you don't want to sit there waiting for it to trawl through, you know, a 150 gig index every time you search. So I thought that made for a fun problem and would make a reasonably good fit for this. Cool.
Yes: threads, and very little communication between them. We can shard up that data quite easily. It's an unsolved problem at the moment (bearing in mind this project is all of two weeks old) whether the best fit is that I shard the queries and pass all the data through all the threads, or I shard the data and pass the queries through; whether I do it one way or the other way around.
But that's why it's a foray into thread per core programming, and not an explanation of its virtues and maxims. We also care about maximum utilization of all those resources, because the throughput of the documents is kind of paramount, right? You've got, you know, hundreds of terabytes of stuff you want to trawl through; you don't want to be waiting an unnecessarily long time just for each single one to complete. I think there's a search engine called Manticore;
they have this as a search functionality, they have this as a feature. Elasticsearch has it as a feature, and in their documentation they had a sort of shootout, and so my goal is to be able to beat the Elasticsearch throughput for percolate queries, which I think is several thousand documents a second. So fingers crossed I can get there.
You want to keep the write and read queues for those drives more or less as full as possible, so they're always doing nice page-size chunks of work, and this is a good fit for that, right, because if you're streaming something through, you've probably got hits occurring, and so you want to feed those off to your drive as quickly as possible, without that blocking the time spent searching new docs, yeah.
That's the other advantage: if you're just waiting for the drive to complete, that's dead thread time, and we don't really want that. We could spin out a new thread to do it, but if you do that too much you risk CPU oversubscription, because you have too much thread contention, and your OS scheduler is like, hey man, you've got a lot of threads, so I'm just going to start shunting things off, because this thing completed in the middle of this other thing, and you're like, no, that one was doing work! Which is obviously bad if you want to focus on throughput and you're optimizing your utilization.
Cool. I have some code samples here; my code's not great, fair warning. It occurred to me quite late that, oh, it's a Rust meetup, people would probably be interested in code samples, because crazy programming language, who knows. Sorry. Oh, there's a laser pointer, neat. Cool, yeah, it's pretty straightforward.
In my case, I spin out almost as many workers as I have cores: I think I have 16 cores in my computer, so I spin out most of them for the main indexing workers, I reserve a couple for a Tokio runtime to run my network server, and then, you know, leave some spare. The `Placement::Fixed` there is basically you telling the operating system and the CPU, hey, this thread's attached to this physical core, you can't bump it off, it has to run on there, which is what gives us the instruction cache affinity and the data cache affinity that's so useful.
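As a hedged reconstruction of that setup (Glommio names are from its docs; the exact core split and the use of the `num_cpus` crate are my illustration, not the slide's code):

```rust
use glommio::{LocalExecutorBuilder, Placement};

fn main() {
    let total_cores = num_cpus::get(); // assumption: counting cores via the num_cpus crate
    let tokio_cores = 2;
    let worker_cores = total_cores.saturating_sub(tokio_cores + 1); // leave one spare

    // Pinned indexing workers on cores 0..worker_cores.
    let workers: Vec<_> = (0..worker_cores)
        .map(|core| {
            LocalExecutorBuilder::new(Placement::Fixed(core))
                .name(&format!("indexer-{core}"))
                .spawn(move || async move {
                    // the per-shard indexing loop lives here
                })
                .expect("failed to spawn pinned executor")
        })
        .collect();

    // A small Tokio runtime for the network-facing side.
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(tokio_cores)
        .enable_all()
        .build()
        .unwrap();
    rt.block_on(async {
        // accept connections and push documents onto the workers' queues
    });

    for w in workers {
        w.join().unwrap();
    }
}
```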
Yeah, this one's also a little light, just a demo of the core indexing loop. It's kind of ugly, it's kind of blocking, it's not very good, but fundamentally it's quite a simple thing; there's a bit of everything there. These threads get their work off a lockless queue, a lock-free queue rather, coming in from Tokio, try to process it, and then spawn an immutable file builder, which is just a nice high-level interface over some direct I/O, as an async task. So whenever it gets those matches, it scatter-gathers them out over your drive, and then they just sort of complete in the background, and then the OS comes back and it's like, hey, all your stuff's there, and I'm like, cool, and then on we go.
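The simplified shape of that loop, as I read it (a fragment meant to run inside one of the pinned executors from the earlier sketch; the channel, types, and the `persist_matches` helper are illustrative stand-ins, not the project's actual code):

```rust
use std::sync::mpsc::Receiver;

struct Document { id: u64, body: String }
struct Match { query: &'static str, doc_id: u64 }

// Each pinned worker pulls jobs off a queue fed from the Tokio side,
// matches stored queries against them, and hands results to a detached
// async write task so the drive stays busy while matching continues.
async fn indexing_loop(jobs: Receiver<Document>) {
    // Blocking recv, mirroring the "kind of ugly, kind of blocking" loop in
    // the talk; a real version would use an async-aware queue.
    while let Ok(doc) = jobs.recv() {
        let matches = run_stored_queries(&doc);
        if !matches.is_empty() {
            // Fire off the write and keep matching; io_uring reports later
            // that the data landed.
            glommio::spawn_local(persist_matches(matches)).detach();
        }
    }
}

// Stand-in for the percolate matching step.
fn run_stored_queries(doc: &Document) -> Vec<Match> {
    if doc.body.contains("rust") {
        vec![Match { query: "rust", doc_id: doc.id }]
    } else {
        Vec::new()
    }
}

// In the real project this is a high-level builder over Direct I/O
// (scatter-gather writes to the NVMe drive); elided here.
async fn persist_matches(matches: Vec<Match>) {
    let _ = matches;
}
```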
Oh God. Yes, that is my talk. Are there any questions?
Oh, I haven't, but that sounds cool. Oh, sorry, yeah: have I explored using thread-local memory allocators yet? I haven't; at the moment my focus has been on, can I stitch this framework into my code and get it going. I believe there is some discussion on their Zulip chat about a sharding and thread-per-core aware allocator, and what the best option is to use for that, so we'll probably get there, but yeah.