Rust Programming Language RustConf 2020, 21 Aug 2020

From YouTube: RustConf 2020 - Under a Microscope: Exploring Fast and Safe Rust for Biology by Samuel Lim

Description

Under a Microscope: Exploring Fast and Safe Rust for Biology by Samuel Lim

Ever wondered what goes on behind the scenes of breakthroughs in understanding proteins, viruses, our own bodies, and more?

Take a deep dive as we journey through some of the workings of computational biology at large, along with its advantages and pitfalls. In this talk, we will see how Rust bridges the biological sciences with safe, performant, and scalable systems, and discuss how you can play a role even as a fresh Rustacean.

A

Hello and welcome to RustConf 2020.

A

My name is Samuel Lim. I'm a developer and computational biologist: I perform research in computational biology, and I develop software systems to help automate and optimize it. We'll be looking at fast and safe Rust for biology and computational biology.

A

As a note, computational biology and bioinformatics are very broad, and the scope of biology is even broader. To keep focus, we'll be looking at computational biology from the angle of RNA sequencing and RNA sequencing analysis, aka RNA-seq, and we'll see how Rust can play into that. First, a crash course in RNA-seq.

A

What in the world is RNA sequencing, anyway?

A

One way to phrase it: RNA sequencing is a way to determine the qualities, the presence, and the quantities of RNA in a sample, in your fragments and in your reads, in comparison to reference data like a transcriptome.

A

Now, these are quite a few different terms, so we'll define them incrementally. We'll start from RNA, then we'll see why sequencing is important, and we'll talk a little bit about alignment along the way. As a working definition of RNA: you've probably heard of DNA first, and you've probably heard of it more often. It's the stuff that makes the replication of your entire genetic sequence possible, and you wouldn't be wrong to relate RNA to DNA.

A

In this sense, both RNA and DNA are mediums for genetic information, although they come in slightly different forms and their purposes can be different. In biology, we have something called the central dogma, which dictates how RNA comes to be and what it can be used for.

A

We start with the DNA, in the human cell's case in the nucleus, and we read through that DNA base by base (A, C, G, T), sequence by sequence. We translate it from one character to the next, and we transcribe it, meaning that we take the individual DNA bases and find their complements in RNA.

A

These become A, C, G, and U, in new characters. This RNA then matures into what we call messenger RNA, which gets sent out to different parts of the cell. This is the communication source for the synthesis of proteins, for the different factories of the cell in general.
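
The base-by-base complement step described above can be sketched in a few lines of Rust. This is an illustration rather than code from the talk: the function names are made up, the input is assumed to be the DNA template strand, and strand direction is ignored.

```rust
/// Map one DNA base on the template strand to its complementary RNA base.
/// Returns `None` for anything other than A, C, G, or T (case-insensitive).
fn complement_rna(base: char) -> Option<char> {
    match base.to_ascii_uppercase() {
        'A' => Some('U'),
        'C' => Some('G'),
        'G' => Some('C'),
        'T' => Some('A'),
        _ => None,
    }
}

/// Transcribe a DNA template into RNA, base by base.
/// The whole result is `None` if any character is not a valid DNA base.
fn transcribe(template: &str) -> Option<String> {
    template.chars().map(complement_rna).collect()
}
```

Returning `Option<String>` makes an unexpected character (for example an ambiguous `N`) a visible case the caller has to handle, instead of silently producing a wrong sequence.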

A

The original source information, the genetic sequence and the sum of all DNA found in the cell and in the nucleus, is what we call a genome. From that genome, we find that there are many different kinds of RNA produced and used, whether it be messenger RNA, tRNA, rRNA, and so on, and they can exhibit information from the genome.

A

As we've stated, we can compare them and see how similar they are by aligning them. They can serve as communication between genes and protein factories, so that we can get from the static source of our genetic sequence to the active source of cell interaction.

A

We can describe the expression of these genes based on those interactions, and we can define behaviors and subcellular processes by understanding how RNA works. RNA can describe the expression of genes within a cell.

A

That means the cell can have an identical genomic sequence, an identical genome, but as it changes, produces different proteins, sends them to different factories, and communicates differently, the cell can display different behavior even with the same genetics.

A

So the collection of all this dynamic, defined, and transcribed RNA in these cells is what we call a transcriptome, from "transcription". And because it's dynamic and serves as a bridge between our DNA and our proteins,

A

RNA can help us investigate the differences between individual cells, as in single-cell RNA analysis, or in groups of cells or communities, as we would see in bulk RNA analysis. We can look at specific expressions of genes or sets of genes, interrogate them by themselves, and look at how their sequences compare.

A

And finally, from the interactions of RNA, we can profile different objects and different characters in the microscopic world, whether they be cells in our human body, bacteria, viruses, or fungi.

A

And as a note, since many of you have probably been affected by COVID-19: COVID-19 is caused by an RNA virus, which means that the virus's entire genetic sequence is contained in the capsule, and its information format is RNA. So we've looked a little at how RNA comes to be, why it's important, and how sequencing can play into that.

A

But how does that relate to computing?

A

It comes to computing where we need to process the information that we've gathered. RNA-seq processing is how we quantify, compute, and analyze the data that we've taken after we've left the wet lab, after we've done our isolation of different samples, different reads, and different fragments.

A

Everything, from our information to the further information and inferences we want to gather, can go digital. And the Rust, and the applications in Rust, are coming soon, I promise. From this basic understanding of the mechanisms of RNA and RNA-seq, there's a simple methodology that we can take: we read the information from the files and the experimental samples that we've taken and turn them into data streams that we can manipulate.

A

We map and align these data streams to reference data where applicable, so we can take the information that we have, position it, compare it, see the similarities and the differences, and categorize it. Finally, we analyze the output, depending on whether we want to quantify the categories and the expressions of different genes or different sequences, and we send the results we have on for further processing in other pipelines or other programs.
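
The read, map, and analyze steps just described can be sketched as three small functions. This is a toy under loud assumptions: reads arrive one per line, "mapping" is an exact substring search (real aligners tolerate mismatches and gaps), and all names are hypothetical rather than taken from any tool in the talk.

```rust
use std::collections::BTreeMap;

/// Step 1: turn raw text into a stream of reads (one read per non-empty line).
fn read_reads<'a>(input: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    input.lines().map(str::trim).filter(|line| !line.is_empty())
}

/// Step 2: map a read to the first reference sequence that contains it exactly.
/// Exact matching is a stand-in for real alignment.
fn map_read<'a>(read: &str, references: &'a [(&'a str, &'a str)]) -> Option<&'a str> {
    references
        .iter()
        .find(|(_, sequence)| sequence.contains(read))
        .map(|(name, _)| *name)
}

/// Step 3: analyze by counting how many reads landed on each reference.
fn count_hits<'a>(
    input: &str,
    references: &'a [(&'a str, &'a str)],
) -> BTreeMap<&'a str, usize> {
    let mut counts: BTreeMap<&'a str, usize> = BTreeMap::new();
    for read in read_reads(input) {
        if let Some(name) = map_read(read, references) {
            *counts.entry(name).or_insert(0) += 1;
        }
    }
    counts
}
```

The resulting table of per-reference counts is the kind of output that would then be handed to another pipeline or program for further analysis.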

A

Now, RNA-seq tools are a broad spread. They can be focused on many different analyses or on different methods to achieve an analysis. To name a few, some of them deal with the quantification, the categorization, and the analysis of expression of different genes and different RNA sequences within our data stream. Each has its own uses and advantages, but most are largely disjoint in terms of their programmatic tooling.

A

We needed something that could bring the exposed functionality (the command arguments, the positional arguments, and the general command-line interface) of these many different tools, with their many different functions, together into one unified surface. I needed a language that could do that, and that's where Rust comes in, with no need for further introduction.

A

Rust is usually used for safety, in the sense of both memory safety, with Rust's borrow checker, and type safety, within its entire type system.

A

We look at performance at the level of systems programming, and we look at concurrency, both in its primitives and in the ecosystem surrounding them.

A

All of these are helpful in this regard, but how would that apply to a biologist or a computational biologist? The first thing we can look at is building ergonomic abstractions and layers. Because we had multiple tools to work with, the initial starting point of our translation was about 3,000 lines of logic, configuration, and command-line parsing.

A

Some of it was easier to translate than other areas, but none of it was trivial. Projects grow in size, and as this one grew, so did the amount that needed translation.

A

What originally started as about 3,000 lines of CLI parsing for a few tools grew as we went from three to four to five to six to ten different tools, each adding more arguments and more configuration. So what started as about three thousand lines of code became more than ten thousand lines of code to migrate over, translate, and unify.

A

Thankfully, though, many of the options that we translate have high-level structures and primitives in Rust that are generally synonymous, and when they don't, and when we want to configure further, we have macros, which enable us to do that.

A

This is one example of a direct translation, where we take not only basic configuration values like verbosity, for which we have flags, but also subcommands and other options. If something is not relevant to the functionality or the abstraction we want to define right now, we can skip it. This is thanks to crates like structopt and pico-args.

A

What was collectively about 10,000 lines of mixed logic, configuration, and options ended up condensing by about a factor of six, plus more for documentation. In the end, what we got was quite a sizable difference, as you can see in proportion.

A

This is what the size would have been before, and what it would have been after, in screenshots.
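
To illustrate the idea of one typed surface over a tool's flags, subcommands, and options, here is a minimal sketch using only the standard library. The talk's project used derive-style crates such as structopt and pico-args; this hand-written version is a stand-in, and every type, field, and flag name in it is hypothetical.

```rust
/// Subcommands of one hypothetical wrapped tool.
#[derive(Debug, PartialEq)]
enum Subcommand {
    /// Build an index from a reference file.
    Index { reference: String },
    /// Quantify reads against a previously built index.
    Quant { index: String, reads: Vec<String> },
}

/// A typed description of one invocation: global flags plus a subcommand.
#[derive(Debug, PartialEq)]
struct Invocation {
    verbose: bool,
    threads: Option<usize>,
    subcommand: Subcommand,
}

impl Invocation {
    /// Render the typed options into the argument list the tool expects.
    fn to_args(&self) -> Vec<String> {
        let mut args = Vec::new();
        if self.verbose {
            args.push("--verbose".to_string());
        }
        // Options that are not set are simply skipped.
        if let Some(n) = self.threads {
            args.push("--threads".to_string());
            args.push(n.to_string());
        }
        match &self.subcommand {
            Subcommand::Index { reference } => {
                args.push("index".to_string());
                args.push(reference.clone());
            }
            Subcommand::Quant { index, reads } => {
                args.push("quant".to_string());
                args.push("--index".to_string());
                args.push(index.clone());
                args.extend(reads.iter().cloned());
            }
        }
        args
    }
}
```

The resulting vector could be passed to `std::process::Command::args` to launch the underlying tool. The point of the sketch is that an invalid combination, such as a `Quant` without an index, cannot be constructed in the first place.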

A

But we don't just want an abstraction layer. We want to be able to interact with the tools that already exist and have already been made around us, and for that we need interoperability.

A

By the end of our initial abstraction and the command-line layer, the resulting project looks a little bit like Rust on top, with bindings to C++ and scripts almost all the way down, thanks to a few crates such as cmake, or cxx by David Tolnay.

A

We're able to build a very systematic and almost self-contained structure for interacting with our files: we can take in the files, parse them, and send them off for processing, whether in C, C++, Python, Makefile, Perl, or other analytic languages. We can finally destructure or serialize that data and bring it into further analysis for other pipelines. And many times we should leave the abstraction layer as it is.

A

There is no reason to go further unless we have precedent, and sometimes that precedent is very large in scale. There are easily 10 to the 16th to 10 to the 17th bases in some of the more popular public databases for RNA-seq and its associated data. That's more than 100 petabytes, or several hundred thousand terabytes, or several hundred million gigabytes, or several hundred billion megabytes.

A

And it's continuing to grow over time. When data can not only grow over time but can grow orders of magnitude in size just from a single step in the pipeline, performance does matter. So we can think about sequencing and the general process of analysis in three distinct steps: we read the information and parse the data; we map and align, and we parallelize operations; and we analyze and export the data that we need for further analysis.

A

As for parsing, Rust has a very strong track record, with crates like nom, lexers like logos, or pest, and so on.

A

We can see that Rust has the ability to handle not only long strings and sequences but bigger structures as well, and structured data is sometimes the thing that we most need.

A

If sequence data were the simplest we could possibly conceive, we would have a continuous stream of fragments of bases, joined together continuously. Realistically, we require more than just a continuous stream: we require more information, and we require structure around it. How does that structure look? One example would be the FASTQ format, where we take in not only the sequence information, which is crucial to our analysis, but also the identifier, which tells us what sequence we're looking at, and the quality scores.

A

The quality scores tell us how well, or how erroneously, this was actually sequenced, and we can continue to process it further. This is not the only format that is viable for computational biology and bioinformatics in RNA sequencing.
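
As a sketch of the FASTQ structure described above (identifier, sequence, quality), the following parses a single four-line record and decodes its quality characters. It assumes the common Phred+33 encoding, where Q is the ASCII code minus 33 and the error probability is 10^(-Q/10). The type and function names are illustrative only, not from the talk's slides.

```rust
/// One FASTQ record: identifier, bases, and one quality character per base.
#[derive(Debug, PartialEq)]
struct FastqRecord {
    id: String,
    sequence: String,
    quality: String,
}

/// Parse a single four-line FASTQ record:
/// `@identifier`, sequence, `+` separator, quality string.
fn parse_record(text: &str) -> Option<FastqRecord> {
    let mut lines = text.lines();
    let id = lines.next()?.strip_prefix('@')?;
    let sequence = lines.next()?;
    let separator = lines.next()?;
    let quality = lines.next()?;
    // The separator must start with '+', and every base needs one quality character.
    if !separator.starts_with('+') || sequence.len() != quality.len() {
        return None;
    }
    Some(FastqRecord {
        id: id.to_string(),
        sequence: sequence.to_string(),
        quality: quality.to_string(),
    })
}

/// Decode Phred+33 quality characters into per-base error probabilities.
fn error_probabilities(quality: &str) -> Vec<f64> {
    quality
        .bytes()
        .map(|byte| {
            let q = f64::from(byte.saturating_sub(33));
            10f64.powf(-q / 10.0)
        })
        .collect()
}
```

Under this encoding, the character `I` corresponds to Q40 (an error probability of about 1 in 10,000), while `!` corresponds to Q0.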

A

We actually have quite a few, whether it's FASTA for the genome and transcriptome, FASTQ for our experimental fragments, GTF for our annotations, or BED, or files that can contain the alignments that we have calculated.

A

Parsing in Rust is not just a general feat. We do sometimes need features specific to biological file formats, and we can measure this. Thanks to Professor Heng Li at Harvard, we have been able to quantify some of these basic benchmarks for common analyses and parsing.

A

Here we see the actual times it takes, where first and foremost Rust comes in, and we can count the number of sequences, and the quality thereof, contained in these FASTQ files.

A

And if we take a closer look at how we can use this, we can see that it's not that different from very simple, normal Rust code by the time it reaches the biologist.

A

What we have is a reader and a record. Once we take in the buffered file, we can loop over it and print the sequence data and quality data. The same goes for the FASTA benchmark.
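
The reader-and-record loop for FASTQ can be approximated with the standard library alone. This is not the code shown on the slides: it is a simplified sketch that assumes strict four-line records, reuses a single line buffer, and tallies records, bases, and quality characters instead of printing them. All names are hypothetical.

```rust
use std::io::{self, BufRead};

/// Totals gathered in a single pass over a FASTQ stream.
#[derive(Debug, Default, PartialEq)]
struct Counts {
    records: u64,
    bases: u64,
    quality_chars: u64,
}

/// Read one line into `buffer` (cleared first) and return its length without
/// trailing whitespace, or `None` at end of input.
fn next_line<R: BufRead>(reader: &mut R, buffer: &mut String) -> io::Result<Option<usize>> {
    buffer.clear();
    if reader.read_line(buffer)? == 0 {
        return Ok(None);
    }
    Ok(Some(buffer.trim_end().len()))
}

/// Loop over a buffered reader, one four-line record at a time.
fn count_fastq<R: BufRead>(mut reader: R) -> io::Result<Counts> {
    let mut counts = Counts::default();
    let mut line = String::new();
    while let Some(header_len) = next_line(&mut reader, &mut line)? {
        if header_len == 0 {
            continue; // skip blank lines between or after records
        }
        let bases = next_line(&mut reader, &mut line)?.unwrap_or(0);
        next_line(&mut reader, &mut line)?; // '+' separator line, ignored
        let quality = next_line(&mut reader, &mut line)?.unwrap_or(0);
        counts.records += 1;
        counts.bases += bases as u64;
        counts.quality_chars += quality as u64;
    }
    Ok(counts)
}
```

For a file on disk, the reader would be `std::io::BufReader::new(std::fs::File::open(path)?)`. Reusing one `String` avoids allocating a new buffer for every line, which matters when a file holds tens of millions of reads.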

A

Another version is very similar, where we parse from our file and continue to take in new record information until we finish all the sequences.
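
A FASTA reader follows the same shape. The sketch below is illustrative, with made-up names: it assumes `>` header lines, keeps only the first word of each header as the identifier, and joins wrapped sequence lines until the next header.

```rust
use std::io::{self, BufRead};

/// One FASTA record: identifier plus the full (unwrapped) sequence.
#[derive(Debug, PartialEq)]
struct FastaRecord {
    id: String,
    sequence: String,
}

/// Read every record from a FASTA stream.
fn read_fasta<R: BufRead>(reader: R) -> io::Result<Vec<FastaRecord>> {
    let mut records: Vec<FastaRecord> = Vec::new();
    for line in reader.lines() {
        let line = line?;
        let line = line.trim_end();
        if let Some(header) = line.strip_prefix('>') {
            // A '>' line starts a new record; keep the first word as the identifier.
            let id = header.split_whitespace().next().unwrap_or("").to_string();
            records.push(FastaRecord { id, sequence: String::new() });
        } else if let Some(current) = records.last_mut() {
            // Sequence lines may wrap, so append until the next header.
            current.sequence.push_str(line);
        }
    }
    Ok(records)
}
```

Collecting everything into a `Vec` keeps the example short. For genome-sized inputs, a streaming design that yields one record at a time would be the more realistic choice.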

A

Once we've parsed all these files, we need to do basic processing on them, which includes mapping and alignment, the basis of most bioinformatics pipelines. Not all mapping and alignment is created equal. Some tools are better at expression analysis. Some are better at quantifying different parts of RNA.

A

Some have better accuracy, and some are very, very fast. One example that we could find comes from a paper for kallisto, "Near-optimal RNA-Seq quantification".

A

We can see the variance in speed, that is, in the performance of different methods and different levels of analysis. On a normal machine, it can sometimes take as little as 15 minutes, and sometimes it can take several days of computation. In a more personal test,

A

We tested with at least 64 gigabytes of RAM, sometimes 128, and at least 30 million reads over multiple files, and we had enough computing power to fill a room, equivalent to a lab full of ThinkPads. In the end, the tools built around fast heuristics gave us reasonable answers in less than half an hour, some of them even within 10 minutes. But others, which relied purely on accuracy or purely on speed, or which were made more accurate, ran out of memory as we tried to get a proper answer out of them, even with these large computing resources.
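
To give a feel for what a fast heuristic can look like, here is a toy k-mer lookup in the spirit of pseudoalignment: instead of aligning a read base by base, it asks which references contain every k-mer of the read. This is a heavily simplified stand-in and not how any specific tool is implemented. It assumes ASCII sequences, and all names are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Index every k-mer of every reference by the names of the references it occurs in.
fn build_index<'a>(
    references: &[(&'a str, &'a str)],
    k: usize,
) -> HashMap<&'a str, HashSet<&'a str>> {
    let mut index: HashMap<&'a str, HashSet<&'a str>> = HashMap::new();
    for &(name, sequence) in references {
        if k == 0 || sequence.len() < k {
            continue;
        }
        for start in 0..=sequence.len() - k {
            index.entry(&sequence[start..start + k]).or_default().insert(name);
        }
    }
    index
}

/// Return the references that contain every k-mer of the read.
fn compatible<'a>(
    read: &str,
    index: &HashMap<&'a str, HashSet<&'a str>>,
    k: usize,
) -> HashSet<&'a str> {
    if k == 0 || read.len() < k {
        return HashSet::new();
    }
    let mut result: Option<HashSet<&'a str>> = None;
    for start in 0..=read.len() - k {
        // A k-mer that is missing from the index is compatible with no reference.
        let hits = index.get(&read[start..start + k]).cloned().unwrap_or_default();
        result = Some(match result {
            None => hits,
            Some(seen) => seen.intersection(&hits).copied().collect(),
        });
    }
    result.unwrap_or_default()
}
```

Hash lookups and set intersections are cheap compared with full alignment, which is the general trade-off behind heuristic tools: less positional detail per read in exchange for much shorter run times.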

A

So the commonality between these different tools is that parallelism and efficiency, with our time and our memory, are no longer optional in RNA-seq processing. Most computers

A

Nowadays, even personal computers, let alone compute clusters, have more than one core, and it's a general assumption of the field in recent years. So to work with these tools and to incrementally translate them into a language like Rust,

A

I had to call in the experts. We defer expertise to the designers of these Rust systems, who can optimize and work with these systems at a very fundamental level, and the response included the standard library.

A

The standard library is cohesive and extensive, to the point where we can get atomics, threading, and streams together in a fashion that's accessible, both in its documentation and in how it fits with other parts of the Rust ecosystem and its crates. Beyond that, we also had actual data-parallelism libraries such as Rayon, where we could take normally sequential data, place it into iterators and transformations, and parallelize the operations naturally and easily.
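
As a small illustration of turning a sequential computation into a parallel one, the sketch below counts G and C bases across many reads using only the standard library's scoped threads (`std::thread::scope`, stable since Rust 1.63). It is a stand-in for the Rayon style the talk mentions, where the same idea is typically written as a parallel iterator. The function names and the choice of GC counting are assumptions made for the example.

```rust
use std::thread;

/// Count G and C bases in one sequence (case-insensitive).
fn gc_count(sequence: &str) -> usize {
    sequence
        .bytes()
        .filter(|b| matches!(b.to_ascii_uppercase(), b'G' | b'C'))
        .count()
}

/// Split the reads into chunks, count each chunk on its own thread,
/// and sum the partial results.
fn parallel_gc_count(reads: &[&str], workers: usize) -> usize {
    if reads.is_empty() {
        return 0;
    }
    let workers = workers.max(1);
    // Ceiling division so that every read lands in exactly one chunk.
    let chunk_size = (reads.len() + workers - 1) / workers;
    thread::scope(|scope| {
        let handles: Vec<_> = reads
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || chunk.iter().map(|read| gc_count(read)).sum::<usize>())
            })
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .sum::<usize>()
    })
}
```

Scoped threads can borrow `reads` directly, so nothing is copied or wrapped in `Arc`. A work-stealing library such as Rayon additionally balances uneven chunks, which this fixed chunking does not.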

A

In the end, the data that we process, no matter how fast and no matter how much we parse, needs to go somewhere. It needs to be analyzed further, and sometimes Rust is not the only answer to a problem.

A

There's a diverse ecosystem of languages and tools out there, whether for scripting, for systems, or for pipelines, and in the end it boils down to the fact that biologists are not software engineers.

A

We certainly don't want to rewrite the world in Rust, and there's a lot out there to gain from. What kind of language we want to work with is less a question of what I want to stick to,

A

But what can I connect, and what can I interoperate with? Classics of bioinformatics and computational biology, especially for RNA sequencing, include C, C++, Fortran-powered systems, and other languages such as Java, Perl, and, for analysis, R. Newer languages are also cropping up, such as Python, Julia, Go, JavaScript, general scripting, and some languages you may have never heard of before, such as Futhark or Seq.

A

So while there's certainly overlap between biologists and software engineers, the end goal is different.

A

Biologists write software to best enable biology, and the tools that already exist and the tools that we can connect are the tools that we're going to use.

A

So what biologists, and scientists more generally, can take away from good-quality software is reusability, composability, and interoperability. All three interact so that we can get stable software that doesn't need to change, that we can build upon, that we can extend, and that we can interact with from different languages, such as scripting languages or systems languages. The lesson we can take away from Rust and biology, across parsing, parallelization, sequencing, processing, and analysis, is that there is a very kind and extensive ecosystem, both in the tools that Rust gives and in the communities that Rust has developed.

A

This includes Cargo, where we have an actual build tool, similar to pip, Snakemake, and CMake all brought together, and cohesive in the sense that you can test, make, build, run, and compile all these different things, tools, and crates together.

A

And we have a crates ecosystem where, if we know the Rust code compiles, we know that it will compile everywhere that Rust does. In that sense, we can continue to build upon different crates, tools, and libraries, based on the assumption that we know it works abroad and across. And when something is not available in this ecosystem, when something is so domain-specific that we really need a tool from somewhere else,

A

We not only have FFI, communication with the language at a fundamental level, but we also have tools and crates to abstract over it and to get a safe layer of ergonomic code that we can seamlessly transition between.

A

So what becomes the next step for Rust and biology together?

A

Well, the impact of Rust in the biological ecosystem is that we get a bridging at the level of languages, as we've seen before with different benchmarks, different tools, and different toolkits.

A

We have a plethora of languages at our disposal, some of them scripting languages, some of them web languages, some of them systems languages, and Rust really sits at the heart of the ability to take initial information, not only at the level of simple C FFI and C bindings but also at the level of safe abstractions, to interpreters, to different compile targets, and to different information flows.

A

And we can work with the community at large to continue to build these tools.

A

Not only does Rust enable languages and different language tools, it enables the community to build tools around it, to reuse, to extend, and to interact with Rust and the languages around Rust at an equal and bilateral level. So we can interact with different software in Rust, not only insofar as the software itself but also at the level of exchange from one software engineer to another and from one scientist to another, and build another community of mentorship for both science and software, and build a larger picture.

A

The biggest asset of the Rust programming language going forward may not just be the language itself but also its community, and the community mentorship model is what biologists can continue to take and learn from Rust beyond the language, even as they go further. Thank you for joining this talk. I hope you enjoyed it.

A

I will make further information available, and if you'd like to read more on biology, on how Rust plays in, or on computational biology and the different methods and algorithms to work with it, feel free to contact me and feel free to look at the slides. Okay. Thank you.