Description
Daniel Blinick—Software Engineer at Immunai—presented on Immunai’s use of Dagster to tackle bio-tech data engineering challenges.
🌟 Socials 🌟
Check out (and star!) our GitHub ➡️ https://github.com/dagster-io/dagster
Check out our Documentation ➡️ https://docs.dagster.io/
Join our Slack ➡️ http://dagster-slackin.herokuapp.com
Visit our Website ➡️ https://dagster.io/
Follow us on Twitter ➡️ https://twitter.com/dagsterio
A little bit about me: I actually started in web development. I joined Immunai about three years ago, when it was six months old, and I did web work there for most of my time. Only about nine months ago did I really get into data engineering and pipelines, so from my very beginning in this field I've been with Dagster, and I think it's made the transition a really good one.
A little bit about Immunai: we are a therapeutics company trying to develop drugs that help the immune system fight infection. To do that, we need a very comprehensive understanding of how the immune system works and what's really going on there. There are many approaches you could take to achieve that, and the one we've chosen is genomic sequencing. You may know the basic idea: you take a saliva sample, send it to a lab, and learn more about your DNA. But the DNA in every cell of your body is essentially the same; what differentiates cells from each other is which part of that blueprint each cell actually uses. When we can delve into that, we get a picture of what's going on in your immune system, and that's really what we're trying to do.
A
The
challenge
for
me
as
an
engineer
and
our
engineering
team
is
essentially
converting
that
biological
data
that
we
get
from
our
lab
into
digital
data
and
then
transforming
it,
enriching
it
in
order
for
our
computational
biologists
and
our
our
data
analysts
downstream
of
the
pipeline
to
do
analysis,
ai
machine
learning
and
and
bring
insight
into
how
we
can
better
develop
drugs.
A
So,
just
to
again
make
this
a
little
bit
more
concrete.
A
I
I
I
wanted
to
present
one
of
the
main
data
structures
we
work
with,
which
is
the
c
the
cell
gene
matrix
and
and
as
I
was
trying
to
explain
before
this,
this
matrix
is
essentially
made
up
of
columns
which
are
our
cells,
and
then
the
the
rows
are
genes
and
the
values
are
are
how
how
prevalent
that
gene
is
expressed
within
the
cell
and
so
different
types
of
cells
will
have
different
gene
signatures,
and
by
doing
that,
we
can
understand.
What's
going
on
in
the
immune
system.
For example, one patient might have a higher prevalence of T cells as opposed to B cells, and that tells us something: when we know that patient responded to a treatment, we know that alignment of signatures might be playing a role. And these matrices get really, really big.
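As a toy illustration of that structure (the gene names and counts here are made up for the example, not Immunai's actual data):

```python
# Toy cell-gene matrix: rows are genes, columns are cells, and each
# value is how strongly that gene is expressed in that cell.
genes = ["CD3E", "CD19", "MS4A1"]  # CD3E marks T cells; CD19/MS4A1 mark B cells
cells = ["cell_1", "cell_2"]
matrix = [
    [12, 0],  # CD3E
    [0, 7],   # CD19
    [1, 9],   # MS4A1
]

def expression(gene: str, cell: str) -> int:
    """Look up the expression value for one gene in one cell."""
    return matrix[genes.index(gene)][cells.index(cell)]

def dominant_marker(cell: str) -> str:
    """Return the gene with the highest expression in the given cell."""
    return max(genes, key=lambda g: expression(g, cell))
```

Here `cell_1` expresses the T-cell marker most strongly and `cell_2` the B-cell markers, which is the kind of signature comparison described above. Real matrices have tens of thousands of genes and up to millions of cells.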
To give an overview of what our pipeline looks like: the steps themselves, the business logic, are written by the computational biologists. We on the data engineering team are really responsible just for the orchestration logic, what gets run when, and we try not to concern ourselves with the actual business logic itself.
Initially, when the company was about a year old, we developed a homegrown solution, and what that gave us was a lot of flexibility. It was written in Python, and the people who wrote it did a great job. The company was so new that we just didn't know exactly where we would want the flexibility, and the solution did give us a lot of it. It worked, and it served us well for a while. But as time went on we started to have issues with it, as naturally happens as a company grows, and three main problems kept coming up. The first was that we didn't have a dev environment.
A
You
know
creating
that
kind
of
environment
obviously
takes
a
lot
of
resources,
and
we
just
didn't
have
that
and
because
the
computational
biologists
were
the
ones
changing
the
business
logic,
they
had
no
different
environment,
they
couldn't
test
things
really
until
it
hit
production
and-
and
so
the
development
cycle
was
very
brittle.
A
We
also
didn't
have
a
ui.
We
had
something
very
hacked
together,
but
it
wasn't
very
usable
and
the
flexibility
that
served
us
well
in
the
beginning,
as
the
orchestration
logic
got
more
and
more
complex,
it
just
got
very
unwieldy
and
we
ended
up
in
a
situation
where
there
was
a
there
wasn't
a
lot
of
transparency.
One
person
really
only
one
person,
really
knew
what
was
going
on
within
the
code
and
it
just
became
very
unworkable,
as
the
company
grew,
so
we.
This is essentially what it looks like now. When we started exploring, I was tasked with doing a bit of research into the different frameworks, and we boiled it down to the main player, Airflow, which everyone has heard of even if they're not a data engineer, and Dagster. I actually didn't find Dagster; one of my colleagues did. So it really came down to those two.
Were we going to go with the name everyone knows, or take a risk on a smaller project, which definitely felt like it suited us better? We obviously ended up going with Dagster, for five reasons I'll delve into a little. First, the abstractions in Dagster just seemed very well thought through.
A
We,
when
we
were
doing
the
analysis,
it
was
just
around
the
time
when
dexter
switched
from
pipelines
and
solids
and
composite
solids
to
graphs
and
ops,
and
just
even
just
that,
you
know
switching
to
the
graphs
which
are
kind
of
you
can
have
as
many
graphs
within
graphs
and
there's
no
difference
between
a
sub
graph
and
a
graph,
just
the
the
thought
behind
that
really
appealed
to
us.
A
The
other
thing
we
really
liked
was
the
data
centric
pipeline
model,
seeing
the
pipeline
as
a
flow
of
data
really
just
made
a
lot
of
sense
to
us
and
it's
much
more
intuitive
than
the
air
flow
model
and
what
I've
taken
the
screenshot
here
up
here
is
the
the
being
able
to
run
from
a
point
in
the
graph
which
really
revolves
around
the
fact
that
it's
that
the
pipeline
is
centered
around
around
data
and
we've
used
that
countless
times.
So
it's
really
served
us
well.
Our pipeline also makes use of the dynamic mapping feature, which not all orchestration platforms have: the ability to define the graph at runtime based on certain metadata. This greatly simplified how many different jobs we needed, and again, we really like this feature.
A
The
asset
materializations
truthfully,
we
haven't
actually
utilized
this
as
much
as
we
should
and
we're
really
excited
about
the
new
developments
in
the
asset
space,
but
but
really
even
just
the
idea
of
kind
of
like
declaring
the
assets
that
have
been
created
and
being
able
to
track
it
and
link
it
back
to
the
run
very.
Very
easily
makes
debugging
debugging
things
easier
and
and
again
just
like
that
thought
behind.
It
was
really
attracted
us
as
well.
A
I
also
I
don't-
I
didn't
even
add
this
here,
but
I
guess
this
is
kind
of
a
meta
point
for
all
the
slides
that
came
before
it.
But
dagit
was
just
in
our
mind
so
much
better
than
than
the
airflow
ui
and
a
lot
of
the
other
uis.
We
saw
it's
just
very
simple:
there's
really
there
there
aren't
that
many
bells
and
whistles,
which
in
my
mind,
is
a
feature.
It's
very
like
straightforward.
A
You
don't
really
have
to
guess
about
what
you're
doing
so,
that's
kind
of
like
what
we
the
reason
we
chose
it
and
but
obviously
we
ended
up
finding
so
many
more
things
that
really
made
us
happy
and
continue
to
make
us
happy.
So
I
just
want
to
talk
a
little
bit
about
those
and
how
we
use
them
so
the
first
one
which
should
have
been
obvious
to
me,
but
not
being
a
data
engineer.
A
I
guess
I
didn't
realize
how
painful
this
would
have
been
without
resources,
but
just
the
ability
to
use
resources
to
set
up
different
environments
have
a
test
environment
very
easily,
a
dev
environment
very
easily
and
not
change.
The
business
logic
has
been
amazing
and
it
really
just
takes
away
so
much
of
the
magic
in
the
sense
that
you,
you
know
exactly
what
you're
doing
in
a
test
environment,
you
don't
have
to
rely
on
frameworks
that
are
filling
in
all
these
gaps,
for
you,
you're,
you're,
feeding
it
the
resources
yourself
and
yeah.
A
It
just
made
things
so
much
easier
to
work
with
custom.
Aisle
managers
we've
also
made
use
of
this
is
one
of
our
our
audio
managers
that
basically
just
extends
the
in-memory
I
o
manager,
but
in
addition,
it
just
dumps
the
content
into
a
bucket,
and
we
have
a
few
others
like
this
and
again
just
added
flexibility
is
great.
A
The
different
sensors
we've
made
use
of
we
make
use
of
the
failure.
Success,
sensors,
obviously
the
standard
sensors
to
kick
off
jobs,
and
we
at
one
point
were
making
use
of
an
asset
materialization
sensor
and
yeah.
It's
just
a
wealth
of
things
to
choose
from
to
do
what
you
want
just
makes
everything
the
code
a
lot
simpler
and
and
more
obvious
in
terms
of
what
you're
trying
to
do.
A
The
graphql
api
is
something
new
that
we've
just
been
exploring
we're
using
it
to.
We
call
like
nuking
our
jobs
that
went
wrong.
Sometimes
we
have
jobs
that
run
with
with
the
wrong
metadata,
and
so
we
use
it
to
kind
of
query
the
runs.
This
is
still
in
development
and
then
just
get
rid
of
them,
and
then
the
sensor
kicks
off,
kicks
them
off
again
with
the
updated
metadata
and
actually
as
an
added
bonus.
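A sketch of that kind of cleanup script, using only the standard library. The endpoint URL is hypothetical, and the query shape follows Dagster's GraphQL schema as we understand it, so check it against the schema browser in your own Dagit instance.

```python
import json
from urllib import request

DAGIT_URL = "http://localhost:3000/graphql"  # hypothetical Dagit host

# Query runs so the script can find the ones launched with stale metadata.
RUNS_QUERY = """
query RunsByStatus($filter: RunsFilter) {
  runsOrError(filter: $filter) {
    ... on Runs {
      results { runId status }
    }
  }
}
"""

def build_payload(query, variables=None):
    """Encode a GraphQL request body."""
    return json.dumps({"query": query, "variables": variables or {}}).encode()

def post_graphql(query, variables=None):
    """POST a query to Dagit's GraphQL endpoint and decode the response."""
    req = request.Request(
        DAGIT_URL,
        data=build_payload(query, variables),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```

From there the script would iterate over the returned run IDs and issue the corresponding delete/terminate mutations before the sensor re-launches the runs.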
A
So
I
just
want
to
share
one
debate
that
we
had
on
our
team.
This
was
earlier
on
just
to
share
it
with
you
guys
and
maybe
it'll
spark
some
interesting
conversation.
A
So
we
we
rely
the
pipe
our
pipeline
relies
on
on
a
bunch
of
metadata
and
we
really
weren't
sure
what
to
do
with
that
metadata,
whether
to
hold
it
hold
it
in
an
external
database
and
query
it
in
during
the
during
the
pipeline
run
or
to
embed
it
within
the
run,
configuration
and
so
the
pro
of
putting
in
the
run
configuration
is
that
it's
explicit.
A
We
also
take
the
run
configuration
and
when
we
create
assets,
we
we
store
the
run
configuration
as
as
as
provenance
for
that
data
set,
and
so
anything
we
put
in
the
run,
configuration
is,
is
implicitly
stored
or
explicitly
stored
as
provenance,
which
so
storing
the
metadata
and
the
run
config
gave
us
that
ability.
Also,
more
importantly,
the
metadata
can't
change
mid-run
and
that's
what
we
were
saying
we
were
nervous
about.
A
The
cons
are
obviously
that
it
makes
running
from
the
launch
pad
almost
impossible,
because
you
need
to
then
like
copy
and
paste
this
massive
file.
It
just
wouldn't
really
have
been
workable
and
the
provenance
that's
created
on
the
data
set
is
is
much
harder
to
read
because
it
just
gets
really
big,
and
so
the
compromise
we
came
to
is
we
basically
created.
We took the files out of the run config and store them as versioned files in Google Cloud Storage, and we just put the version number into the run config. That guarantees the metadata doesn't change mid-run.
If the file does change, that simply becomes a different version of it. And again, we can run from the Launchpad, and it doesn't bloat the run configuration. And that's it.