From YouTube: Data Quality Meetup: Building Reliable Data Apps
Description
Max Gasner, co-author of Dagster, a data orchestration framework for ETL, talks about principles of building reliable data applications with Dagster. Sign up for the next Data Quality Meetup: https://bit.ly/3yiUH2H
Join our Meetup Group: https://www.meetup.com/data-quality-meetup/
Hi everybody, good morning, good afternoon, and good evening. My name is Max. I work at Elementl and I'm a core developer on Dagster, which is an open source data orchestrator. Today I'd like to share the orchestrator's perspective on the broad goal that we're all working towards: building more reliable data apps. By a data app, I mean a graph of computations that consumes and produces data assets, where the nodes in the graph and the assets they produce can be wildly heterogeneous.
These processes span persona and team boundaries, and they often involve multiple compute environments. As a consequence, everything is hard. All the pieces of the ordinary software development life cycle are hard: developing and testing data apps is hard, deploying and executing them is hard, observing their operations is hard. So what we're trying to do with Dagster is build a platform that makes it easier to work with the graph of computations and assets that makes up a modern data app, through all these stages of the application life cycle.
So how do you actually build a system that makes all this easier? I want to talk about a couple of design principles and how they cash out in practice. First of all, in order to make developing and testing easy, we need to bring some of the lessons from software engineering into the data domain. For lack of a better phrase, this is what we call functional data engineering. What it means in practice is that DAGs should be typed, so that issues with the data flowing between processes can be caught early.
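The idea of a typed DAG can be sketched in plain Python, outside any framework. This is a toy illustration of the principle, not Dagster's actual type system: each node carries type hints, and edges between nodes are validated before anything runs, so a mismatch fails fast instead of corrupting data downstream.

```python
from typing import Callable, get_type_hints

# Two nodes in a tiny graph, with typed inputs and outputs.
def extract() -> list[int]:
    return [1, 2, 3]

def transform(rows: list[int]) -> float:
    return sum(rows) / len(rows)

def check_edge(upstream: Callable, downstream: Callable, param: str) -> None:
    """Fail fast if upstream's output type doesn't match downstream's input."""
    out_type = get_type_hints(upstream).get("return")
    in_type = get_type_hints(downstream)[param]
    if out_type != in_type:
        raise TypeError(
            f"{upstream.__name__} produces {out_type}, but "
            f"{downstream.__name__} expects {in_type} for '{param}'"
        )

check_edge(extract, transform, "rows")   # ok: list[int] matches list[int]
```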
The application needs to be able to run on a really wide range of systems in order to be robust in the presence of many different user personas, tools, and needs. You need a system that isolates user code execution by design, so that if an analyst makes an error in one pipeline, it doesn't take down the production cluster; the failure is isolated to that particular pipeline. And you need the right basic primitives: for instance, a scheduling system that allows for conditional triggers and one-off execution without hacks, and a notion of execution.
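As a sketch of what a conditional trigger means, here is a minimal sensor-style loop in plain Python. The shape is invented for illustration and is not Dagster's sensor API: a poll function watches for a condition (say, new files landing) and requests exactly one run per new event, which also gives you one-off execution without cron hacks.

```python
def make_sensor(condition, launch_run):
    """Fire launch_run once per new key returned by condition()."""
    seen = set()
    def poll():
        for key in condition():
            if key not in seen:       # fire once per new item, never twice
                seen.add(key)
                launch_run(key)
    return poll

arrived = ["2024-01-01.csv"]          # e.g. files landing in a bucket
launched = []
poll = make_sensor(lambda: list(arrived), launched.append)

poll()                                # one run for the new file
poll()                                # nothing new: no duplicate run
arrived.append("2024-01-02.csv")
poll()                                # exactly one more run
print(launched)                       # ['2024-01-01.csv', '2024-01-02.csv']
```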
It's the boundaries between the testable units that are each written in different tools that we think are some of the biggest problems for the quality of data applications today. So in this example, we have dbt models and dbt tests running alongside an ingest process written in pure Python, some analysis running in notebooks, and a process that uploads some results to Slack, with all the dependencies between these processes made explicit.
"My chart looks weird; what pipeline run produced it?" Or, "what do the last five charts look like?" And finally, to answer questions like these, you need to build in a rich metadata system so that users can ask detailed questions about app operations. That way you can have, for instance, longitudinal views of data quality metrics or SQL execution times.
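A rich metadata system can be pictured as a structured event log that every step writes into, which later queries slice longitudinally across runs. The schema below is invented for illustration and is not Dagster's event log format:

```python
# Each step emits structured metadata (here, a row count and a
# duration) tagged with its run, into an append-only event log.
events: list[dict] = []

def record(run_id: str, step: str, **metadata) -> None:
    events.append({"run_id": run_id, "step": step, **metadata})

for run_id, rows in [("run-1", 1000), ("run-2", 990), ("run-3", 410)]:
    record(run_id, "ingest", row_count=rows, duration_s=1.2)

# Longitudinal view of one quality metric across runs; the drop in
# run-3 would stand out immediately on a chart of this series.
history = [e["row_count"] for e in events if e["step"] == "ingest"]
print(history)  # [1000, 990, 410]
```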
So hopefully, if we can make these three design principles real, give people a programming model that makes it easier to do the right thing when they're writing their business logic and easier to develop and test that logic, make sure that the orchestrator really is the kind of platform people need to be able to integrate with, and expose the graph as an interface that all of the personas using the data application can use to self-serve, debug, and observe their operations.
Then the difficulty of building a modern data application is going to get a little bit easier, and that's what we're working towards. Happy to take questions in the chat, and thank you so much. I hope this has piqued your interest in Dagster and, more importantly, given you some food for thought about what's coming next in data orchestration and how the orchestrator itself can help you achieve higher quality in your data systems. Thanks so much, and please be in touch if you'd like to learn more about what we're up to.