Description
Writing the Fastest GBDT Library in Rust by Isabella Tromba
In this talk, I will share my experience optimizing a Rust implementation of the Gradient Boosted Decision Tree machine learning algorithm. With code snippets, stack traces, and benchmarks, we’ll explore how rayon, perf, cargo-asm, compiler intrinsics, and unsafe rust were used to write a GBDT library that trains faster than similar libraries written in C/C++.
So what are gradient boosted decision trees? Let's say you want to predict the price of a house based on features like the number of bedrooms, bathrooms, and square footage. To make a prediction with a decision tree, you start at the top, and at each branch you ask how one of the features compares with a threshold: if the value is less than or equal to the threshold, you go to the left child; if the value is greater, you go to the right child.
When you reach a leaf, you have the prediction. Here's an example: we have a house with 3 bedrooms, 3 bathrooms, and 2,500 square feet. Let's see what price our decision tree predicts. Start at the top: the number of bedrooms is 3, which is less than or equal to 3, so we go left. The square footage is 2,500, which is greater than 2,400, so we go right, and we arrive at the prediction, which is $512,000.
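The walk described above can be sketched in Rust. This is a minimal illustration, not the library's actual node layout; the `Node` type and feature ordering (bedrooms, bathrooms, square footage) are assumptions for the example:

```rust
// Hypothetical binary decision tree: internal nodes compare one feature
// against a threshold, leaves hold the predicted price.
enum Node {
    Branch { feature: usize, threshold: f32, left: Box<Node>, right: Box<Node> },
    Leaf { prediction: f32 },
}

// Walk from the root: go left when the feature value is <= the threshold,
// right otherwise, until a leaf is reached.
fn predict(mut node: &Node, features: &[f32]) -> f32 {
    loop {
        match node {
            Node::Branch { feature, threshold, left, right } => {
                node = if features[*feature] <= *threshold { left } else { right };
            }
            Node::Leaf { prediction } => return *prediction,
        }
    }
}
```

With a tree whose root splits on bedrooms ≤ 3 and whose left child splits on square footage ≤ 2,400, the feature vector `[3.0, 3.0, 2500.0]` ends up at the $512,000 leaf, exactly as in the walk-through.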
A single decision tree isn't very good at making predictions on its own, so we train a bunch of trees, one at a time, where each tree predicts the error in the sum of the outputs of the trees before it. This is called gradient boosting over decision trees. In this example, the prediction is $340,000.
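The ensemble prediction reduces to a sum over tree outputs. A minimal sketch, with each "tree" stubbed as a closure rather than a real trained tree:

```rust
// A boosted ensemble's prediction is the sum of the outputs of all trees,
// each trained to predict the residual error of the trees before it.
// Trees are stubbed as closures here, purely for illustration.
fn ensemble_predict(trees: &[Box<dyn Fn(&[f32]) -> f32>], features: &[f32]) -> f32 {
    trees.iter().map(|t| t(features)).sum()
}
```

For example, a first tree predicting 300,000, a second correcting by +50,000, and a third correcting by -10,000 sum to the $340,000 from the slide.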
A
The
process
of
training
trees
takes
in
a
matrix
of
training
data,
which
is
n
rows
by
n
features
to
decide
which
feature
to
use
at
each
node.
We
need
to
compute
a
score
for
each
feature.
We
can
compute
that
score
for
each
feature
in
parallel
with
rayon,
it's
as
easy,
as
changing
the
call
to
iter
to
par
eter.
Rayon.
Will
keep
a
thread
pool
around
and
schedule
items
from
your
iterator
to
be
processed
in
parallel?
Then we access an array of the same length, called values, but in the order of the indexes in the indexes array. This results in accessing each item in the values array out of order. From the flame graph, we knew which function was taking the majority of the time, so we looked at the assembly code the compiler generated for it to see if there were any opportunities to make it faster.
When we looked at the assembly for this loop, we were surprised to find an imul instruction, which is an integer multiplication. What is that doing in our code? We're just indexing into an array of f32s, and f32s are four bytes each, so the compiler should be able to get the address of the i-th item by multiplying i by four, and it can do that by shifting i left by 2, which is much faster than an integer multiplication.
Well, the values array is a column in a matrix, and a matrix can be stored in either row-major or column-major order. This means that indexing into the column might require multiplying by the number of columns in the matrix, which is unknown at compile time. But since we're storing our matrix in column-major order, we could eliminate the multiplication; we just had to convince the compiler of this.
Next, we used compiler intrinsics to optimize for specific CPUs. Intrinsics are special functions that hint to the compiler to generate specific assembly code. Remember how we noticed that this code results in accessing the values array out of order? This is really bad for cache performance, because CPUs assume you're going to access memory in order. If a value isn't in cache, the CPU has to wait until it's loaded from main memory, making your program slower.
But, as we said in the beginning, the indexes array is just a permutation of the values 0 to n, which means the bounds checks are unnecessary. We can fix this by replacing get_mut with get_unchecked_mut. We have to use unsafe code here, because Rust provides no way to communicate to the compiler that the values in the indexes array are always in bounds of the values array.
We can parallelize our code using unsafe Rust, wrapping a pointer to the values in a struct and unsafely marking it as Send and Sync. So, going back to the code we started out with, we combined the four optimizations together: making sure that the values array is a contiguous slice, prefetching values so they're in cache, removing bounds checks because we know that the indexes are always in bounds, and parallelizing over the indexes because we know they never overlap.