Rust Programming Language Rust Australia - Sydney Meetup, 12 Oct 2022

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Ryan Kelly - Indexing Petabytes of Data with Rust and AWS

Description

Ryan has worked as a software engineer for over 15 years, and although he's spent a lot of time with JavaScript and Python, he now defaults to Rust for all but the most ad-hoc of projects.
He first discovered the joys of Rust while working on cross-platform client SDKs at Mozilla, but these days he's putting the language to work on the server-side, building data process pipelines at harrison.ai.

A

All right um nice to see you, even if only virtually everyone, sorry I can't be there in person, but I'll do my best to keep an eye on the chat so feel free to huckle me over there. uh My name's Ryan I'm on the software engineering team at Harrison AI, uh and if you take a look around the big spiral staircase in the room there with you, you'll see that our mission at Harrison is to uh improve the standard of healthcare for millions of lives a day.

A

We do it by building uh world-leading AI models and I want to tell you today a little story about how rust is starting to help us execute on that mission. uh This is based on some work that we did for commercial purposes, so I'm not going to be able to give you all of the the details.

A

Obviously it'll be a lightly fictionalized account, but I hope to give you a little bit of a A vibe and a sense of the journey that we went through trying to use rust when dealing with petabytes of data for one of our model building projects. So the starting conditions are, one of our adventures had happened to come by a few petabytes of de-identified Medical Imaging data right.

A

The sort of stuff that is is really great raw input to building medical AI models, but the bad news is that this data was distributed more or less at random across a couple of million individual Tire archives and if you've ever worked with atar archive you'll know that they're very good for moving data around from one place to another, but they're not really that good at being able to inspect and work with the data they're in right, they're, pretty opaque kind of blob of data.

A

So in order to put that data to use, we had to figure out what was inside. Those Tire balls basically make them available to some sort of big cloud database. In our case, we're using Amazon's Athena now Athena is pretty amazing right. It's sort of a big cloud database designed for a querying data at rest in in a cloud storage provider like S3, and it can read parquet files. It can read Json files. It can read things in various compressed formats.

A

Unfortunately, can't read tarballs full of de-identified medical image data, so we had to find a way to generate some sort of index of those files right process. The tarbles turn them into some data that Amazon Athena could use to. Let us see what's in there and decide which files we wanted to work with in what order? Now we actually had a few sort of failed attempts at this, including one very unfortunate incident, where we sent too much data to Amazon Kinesis fire hose and it started. Dropping things.

A

Ask me about it when I'm there in person I'm over beers one time, but at the end of the day, what we settled on was to try and do basically the simplest thing that could possibly work at scale right. So we wanted to take each one of those millions of Tire archives process it using a scalable compute function in AWS Lambda and turn it into just a plain Json lines file describing the contents of that archive all right.

A

If we got to Jason lines files sitting in cloud storage in Amazon S3, then we had something that Amazon Athena would read read and from there we could use all of the big data processing capabilities of Athena to sort and summarize and compress that data into some smaller indexes. That would help us navigate our way around those Tire walls. Now this is not a big data. Talk I'm not going to go into too many of the things on the right hand, side of that diagram.

A

But let's take a little look at this, a kind of first step right. The indexing piece where we want to be able to take a look at each of those millions of Terribles and spit out some data about what it contains. Now, that's the sort of thing that uh sort of for a lot of people you could reach for just about your favorite programming language, uh spin up enough copies of it in a cloud compute environment like AWS Lambda, and it would probably do the job.

A

So when we sat down to look at this right, probably the go-to language for it for our team would have been Python and, in fact, I've put together a little python demo here of the core logic that we wanted to use on top of these Terribles right. At the end of the day, what we wanted to be able to do was take the path to a tar file open it iterate over its contents right spit out Json lines of like the files that we found there in right.

A

So uh in real life like this was a lot more complicated right. We wanted to parse out like details about what patients said, what files and this sort of thing, but the core logic of like open up a terrible rummage around inside it write out what you find.

A

um It was at the heart of this system, uh and if we did this in Python like it would have worked, but all of the folks on the team at the time like we were pretty Keen to give rust a try like we hoped that we would be able to get some better performance out of it. We have to be able to get uh perhaps some better robustness out of it, and this seemed like a really nice opportunity sort of self-contained uh problem space in which to explore it.

A

So let's have a look at what a rust version of that core logic would look like um and unfortunately, Russ doesn't have like tar file, handling and Jason handling built in, but there are some pretty common dependencies that you can use to pull those in right, tire, crate, sorry day, if you're doing anything in uh in Rust with Jason, you want survey, we need to do a little bit of boilerplate set up right in order to write out index entries.

A

You'd have a little struct that we Define with certain serialize helper just to help spit things out, but at the end of the day, like that processing Loop in Rust, it operates at pretty much the same level of abstraction as it does in Python. Right. You open the tire ball. You open your output file, you iterate over each of the entries. You write out some some data about it right and um I.

A

Think like being able to to kind of take a go at this in Rust and have something that looks and feels at the same level of abstraction as we would in a high level. Language was pretty exciting for us as an early uh starting point right, engaging with this problem space.

A

um One thing that we noticed very quickly was that rust was pointing out to us a lot of places where things could go silently wrong in other versions such as python right here, rust is kindly letting us know that the file names in a tar archive might not be utf-8 right, and we need to deal with that error um so before we got too carried away with trying to do anything further with that right. Let's take a quick look at this right. Is this actually going to be uh any an interesting Expedition?

A

So I did a quick Benchmark using this Tool uh called hyperfine, which is a little benchmarking tool written in Rust? um This is the the python uh demo that I showed you earlier right. It runs in about 345 milliseconds for some sample data that I have if I run the rust version of that it um sorry slower.

A

No one second compile in release mode all right, then we can Benchmark uh the rust version against the python version, and we find that it runs like a little bit quicker, not a lot quicker, but a little bit um I, actually, when putting the slides together and after the uh benchmarking and release modes and uh kind of got a bit carried away and tried to dig in and optimize that code like removed, some temporary temporary string allocations and that sort of thing and I managed to get that down uh to like running in.

A

You know a bit of two and a half times faster than the python version, which I think is, is pretty good right. There's code operating at a similarly high level of abstraction, but because of the nature of rust in the way that it sort of gives us more control over the allocation of resources.

A

um We kind of get substantially better performance more or less for free, which is pretty cool. This is not a talk about. You know minute. Optimizations of of rust code, so the next step was to go from there to actually working with data in AWS and for rust. That gets a little bit thorny. um So this is where we started to get a little bit complicated, because rust has a lot of high quality. He crates for working with AWS right, there's a crate for uh reading and writing files in S3.

A

There's a crate for connecting your rust functions to Lambda. The thing about all of these crates is that they're, all async, all of the networking stuff in Rust, uses async functions, and so we also had to go and find new async versions of all of the crates that we're using for the basic demo version, um because everything inside of our AWS flight version of this function will need to be async. Now I'm not going to lie to you. This was a little bit of Adventure right. Async.

A

Rust is pretty hard to to get your head around when you dig in for the first time, but kind of working through some of the issues and learning a little bit about streams and Futures. We actually managed to come up with some code that we're pretty happy with. um So you know in uh uh in order to read a terrible from S3 and uh iterate through its entries, actually like the rust AWS S3 SDK has a pretty nice wrapper where you can get an object.

A

If you give it a bucket and a key, and you can turn it into what Russ calls an async read trait right, which basically lets you read the bytes out of that uh file in a streaming manner. You can pass that to the async tire crate and it will let you sort of do a very low memory. Usage efficient streaming read through the entries in that tar file, which is pretty nice.

A

uh Unfortunately, what that means is that, where we previously had just a nice little iterator of entries from the terrible, we now had this thing called a stream and if you've done any work in async Rust, uh you may have encountered these things they're a little bit more Awkward to work with you can't just iterate over them with a for Loop.

A

um You can do a little while loop like this, which is not too bad, um but we did have a lot of fun, trying to find all of the various helper methods on the stream traits. So here, uh for example, I'm iterating through the contents of that tire ball, uh one at a time using this tri-fold method, um basically capturing all of the the rows that we're trying to write there as Json into a VEC, so that we can then write them out again to S3.

A

So I'm not going to lie like we spent quite a bit of time trying to find out the details like how to work with streams and how to make this uh sort of kind of fit in our heads. But once we've got our heads around it and once we came to understand how to use these tri-stream act methods and the various helper methods on a stream, uh you kind of feel like a wizard right like you.

A

Can you can sort of write super efficient streaming read code uh like this, which is pretty cool um and then, fortunately, for us right, pretty uh straightforward, helper library to write stuff back out again to S3.

A

um So what I? What I want to kind of get across there um is that the core logic there, even in this async talking to AWS rust version, was actually pretty high level like pretty easy to follow right.

A

We kind of read the terrible from S3 and process the archives one at a time, and then we can write the data back out again to stream um the process of hooking that up to be a Lambda, is a little bit fiddly right, there's a crate called uh Lambda runtime, which will help you do this and we actually have a little crate of our own called Cobalt AWS, which has got some kind of helper logic in it.

A

um The core idea, here being basically that you can have a callback that handles a single message right and you can do a little bit of rust, boilerplate and hook it up into a function that will handle a stream of messages coming off, say an sqsq or a Lambda function invocation um now that little bit of rust code right.

A

If you go through the motions of compiling it into a Docker image, um you can send it off into AWS Lambda and run that as a function, one time 10 times or in our case millions of times, uh and that's uh thanks to kind of cargo and rust's pretty convenient to build ecosystem. This was actually really straightforward. I'll be honest with you like much much simpler than I've ever had a time of sort of packaging python packages in the past.

A

um The one thing to watch out for here if you're deploying into AWS, is that they tend to suggest that you compile for arm64 instead of x8664, because that is cheaper to run in the Lambda environment.

A

um So the docs say anyway. I actually put that to the test. For this example, code- and uh here are some rough performance numbers essentially of that demo code running over some some example: tar archives, like the naive python version, uh you know 14.5 seconds uh to process, one of those files on x86 and, as Amazon suggested, like a little bit faster to run uh on their arm 64 processes.

A

Interestingly, the rust code when we ran it on x8664, was consistently a little bit faster than when run on the arm64 processors, but because those processes are cheaper, it actually turns out that, yes, the recommended architecture for uh for us was still the cheapest option.

A

um Now, that's sort of a a very Whirlwind demo of the kind of code that we're trying to work with here and sort of what you can expect. If you try to dive into this sort of thing in the in real life version of this, it was, of course, a little more complicated right. We weren't just listing the files in the archives, we're sort of generating I think six different listings of different kinds of files that you might find in those Terribles and the metadata about each.

A

We were writing out the files in a particular format by organized by prefix that help them be careful by Athena.

A

Rust, actually had some really good support for doing transparent compression with Z standard, thanks to some async read and async right traits, but at the end of the day, but a combination of the performance that we got out over us and the memory usage we've got over us like running this in production sort of costs on the order of a few hundred dollars in Lambda execution time.

A

Now we didn't do a full run of a different language for comparison, obviously, but from what I've seen like I'm, pretty confident that the the python version of that, if we'd done, one would have been an order of magnitude more expensive. So it felt like a pretty good uh little experiment right.

A

At the end of the day, we managed to sneak some Rust in here and when we ran it in production, um when we finally sort of shook all the bugs out of the system and deployed and ran it at scale, it was honestly a little bit of a an anti-climax like it ran. It did it did the job uh and it finished I think in in less than a day, uh and we were happy, we were able to process those files with Athena. We were able to see what was in them.

A

We were able to start using that data to do some model building. um So we really liked the runtime performance that we got out of using rust for this. But, to be honest, like it wasn't that much of a difference right, you know you could have a few hundred dollars here and there.

A

What was really impressive for us I think was the stable memory usage that you get out of using something like rust, rather than a scripting language like python, because it means that you can run these functions on smaller Lambda instances and when you're doing compute in an environment like Lambda, you actually pay for memory usage as well as just CPU time. I. Think one of our operations guys actually watching these rust, lambdas Run For. The First Time commented that he basically never seen a Lambda run that steadily rather with that consistently low memory usage.

A

So that was pretty amazing. We're really overall enjoyed the uh the runtime robustness of doing this work in Rust right, like it sort of forced us to think through a lot of the error cases up front, and it really was the case that when we came to run this stuff in production, it pretty much just worked, which was incredible. On the other hand, some things that we found were pretty challenging. I think that the async ecosystem is still a little bit fragmented right.

A

We sort of had to find async equivalents of some of the crates we were using. We had a little bit of a challenge with some testing, some of the AWS Services, although I will say which we've got some good mileage out of us of a product called local stack which lets you do sort of mocking a local hosted, Services of various AWS services and the one comment I have on things that were challenging.

A

Actually, when you're working with rust or you're, working with streams and you're kind of feeling like a wizard, and you can use all of the the cool features that rust offers you. There are a lot of optimization opportunities in there that turn out to be an attractive nuisance right and one of the challenges we had was actually like knowing to when to put down the brush and kind of just ship what we had and let it cost maybe 200 instead of 100 and get on with our lives.

A

um But overall, like I, said this was a bit of an experiment in seeing how rust would would work out for this kind of use case, and will we do it again? Absolutely right and it's in fact something that we're trying to make a core competency for our team here at Harrison, um try and open source what we can.

A

So, if you're interested in doing some AWS related things in Rust, we have a high level rapper library for AWS, which lets us do uh sort of more it kind of wraps the lower level AWS SDK ipis into things like async, reads and streams um to really make working at that higher level of abstraction a bit more convenient we're also publishing some of the docker-based build tooling.

A

If that's something that's interesting to you and hopefully we'll be able to flush out even more things on this slide over time, um so yeah at the end of the day, I think we're we're prepared to make a pretty big bet on Rust. The combination of sort of uh forcing you to think through your problem up front a little bit right. uh This notion of, if it compiles it works, super uh predictable performance in production makes it a really good fit for working with this sort of data in AWS at scale.

A

uh So that was a little bit of a whirlwind tour happy to take any questions, and uh thank you for coming along. Thank you for uh for joining us at the Harrison office and I hope they're treating you well. There.

B

So any questions um we'll relay them Discord and Ryan's going to be on the spring back. So any questions.

B

B

A

B

Just a bit curious about how big was the team and how long it took you to be there and to build the whole.

A

So the question in this could hear about how long it took to build. Do you mean, like the compile time of the software itself or the messing around with Russ, to understand it? For the first time.

A

uh So this, in fact, was a team of uh three people I think at least one of them is in the room there uh this evening, I believe for Tim, if you're keeping an eye out for him, um and uh so if I came to the team with a little bit of experience with uh with rust uh I think we were sort of all rust, Russ curious at least but they're. Like not gonna lie.

A

It was certainly a learning curve right, I I, don't think the cost, saving that we would have realized this time around would have would have paid for the engineering hours of learning to get up to speed with rust, but we're pretty excited about how those skills are going to compound over time. Right, as we do more and more of this in more and more parts of our data processing pipeline, uh you know sort of those Baseline skills and patterns that were started to develop, I think they're really going to pay off over time.

A

And nine names in the chat here I agree. It is super hard to resist optimizing, every little detail of things when you're working with rust.