Description
Writing the Fastest GBDT Library in Rust by Isabella Tromba
In this talk, I will share my experience optimizing a Rust implementation of the Gradient Boosted Decision Tree machine learning algorithm. With code snippets, stack traces, and benchmarks, we’ll explore how rayon, perf, cargo-asm, compiler intrinsics, and unsafe rust were used to write a GBDT library that trains faster than similar libraries written in C/C++.
So what are gradient boosted decision trees? Let's say you want to predict the price of a house based on features like the number of bedrooms, bathrooms, and square footage. To make a prediction with a decision tree, you start at the top, and at each branch you ask how one of the features compares with a threshold: if the value is less than or equal to the threshold, you go to the left child; if the value is greater, you go to the right child.
When you reach a leaf, you have the prediction. Here's an example: we have a house with 3 bedrooms, 3 bathrooms, and 2,500 square feet. Let's see what price our decision tree predicts. Start at the top: the number of bedrooms is 3, which is less than or equal to 3, so we go left. The square footage is 2,500, which is greater than 2,400, so we go right, and we arrive at the prediction, which is $512,000.
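The walk described above can be sketched in Rust. This is a minimal illustration, not the library's actual node layout; the `Node` type and feature ordering (bedrooms, bathrooms, square footage) are assumptions for the example:

```rust
// Hypothetical binary decision tree: internal nodes compare one feature
// against a threshold, leaves hold the predicted price.
enum Node {
    Branch { feature: usize, threshold: f32, left: Box<Node>, right: Box<Node> },
    Leaf { prediction: f32 },
}

// Walk from the root: go left when the feature value is <= the threshold,
// right otherwise, until a leaf is reached.
fn predict(mut node: &Node, features: &[f32]) -> f32 {
    loop {
        match node {
            Node::Branch { feature, threshold, left, right } => {
                node = if features[*feature] <= *threshold { left } else { right };
            }
            Node::Leaf { prediction } => return *prediction,
        }
    }
}
```

With a tree whose root splits on bedrooms ≤ 3 and whose left child splits on square footage ≤ 2,400, the feature vector `[3.0, 3.0, 2500.0]` ends up at the $512,000 leaf, exactly as in the walk-through.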
A single decision tree isn't very good at making predictions on its own, so we train a bunch of trees, one at a time, where each tree predicts the error in the sum of the outputs of the trees before it. This is called gradient boosting over decision trees. In this example, the prediction is $340,000.
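The ensemble prediction reduces to a sum over tree outputs. A minimal sketch, with each "tree" stubbed as a closure rather than a real trained tree:

```rust
// A boosted ensemble's prediction is the sum of the outputs of all trees,
// each trained to predict the residual error of the trees before it.
// Trees are stubbed as closures here, purely for illustration.
fn ensemble_predict(trees: &[Box<dyn Fn(&[f32]) -> f32>], features: &[f32]) -> f32 {
    trees.iter().map(|t| t(features)).sum()
}
```

For example, a first tree predicting 300,000, a second correcting by +50,000, and a third correcting by -10,000 sum to the $340,000 from the slide.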
A
The
process
of
training
trees
takes
in
a
matrix
of
training
data,
which
is
n
rows
by
n
features
to
decide
which
feature
to
use
at
each
node.
We
need
to
compute
a
score
for
each
feature.
We
can
compute
that
score
for
each
feature
in
parallel
with
rayon,
it's
as
easy,
as
changing
the
call
to
iter
to
par
eter.
Rayon.
Will
keep
a
thread
pool
around
and
schedule
items
from
your
iterator
to
be
processed
in
parallel?
Then we access an array of the same length, called values, but in the order of the indexes in the indexes array. This results in accessing each item in the values array out of order. From the flame graph, we knew which function was taking the majority of the time, so we looked at the assembly code the compiler generated for it to see if there were any opportunities to make it faster.
When we looked at the assembly for this loop, we were surprised to find an imul instruction, which is an integer multiplication. What is that doing in our code? We're just indexing into an array of f32s, and f32s are four bytes each, so the compiler should be able to get the address of the i-th item by multiplying i by four, and it can do that by shifting i left by 2, which is much faster than an integer multiplication.
Well, the values array is a column in a matrix, and a matrix can be stored in either row-major or column-major order. This means that indexing into the column might require multiplying by the number of columns in the matrix, which is unknown at compile time. But since we're storing our matrix in column-major order, we could eliminate the multiplication; we just had to convince the compiler of this.
Next, we used compiler intrinsics to optimize for specific CPUs. Intrinsics are special functions that hint to the compiler to generate specific assembly code. Remember how we noticed that this code results in accessing the values array out of order? This is really bad for cache performance, because CPUs assume you're going to access memory in order. If a value isn't in cache, the CPU has to wait until it's loaded from main memory, making your program slower.
But, as we said in the beginning, the indexes array is just a permutation of the values 0 to n, which means the bounds checks are unnecessary. We can fix this by replacing get_mut with get_unchecked_mut. We have to use unsafe code here, because Rust provides no way to communicate to the compiler that the values in the indexes array are always in bounds of the values array.
We can parallelize our code using unsafe Rust, wrapping a pointer to the values in a struct and unsafely marking it as Send and Sync. So, going back to the code we started out with, we combined the four optimizations together: making sure that the values array is a contiguous slice, prefetching values so they're in cache, removing bounds checks because we know that the indexes are always in bounds, and parallelizing over the indexes because we know they never overlap.