From YouTube: Dagster Data Orchestration 10 min walkthrough - Jan 2023
Description
Dagster is a cloud-native data orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. In this short overview, Sean Lopp—data engineer at Elementl—gives us a tour of Dagster's capabilities and how this modern orchestrator helps data engineering teams break out of a typical vicious cycle.
Hi, my name's Sean and I'm an engineer working on Dagster. I get to talk to a lot of different engineering teams, and unfortunately they all say that they're struggling: they spend too much time babysitting production, and they don't have a chance to build new things and be proactive with stakeholders. So why is this? Well, unfortunately, a lot of those teams are using task-based orchestrators, like Airflow, and that puts them into this vicious cycle.
They can't test code out locally, so they have to push it straight to production. But because it's hard for them to reason ahead of time about what new code will do, pushing straight into production often leads to failures and outages, and that's what ends up paging on-call and interrupting those engineers who are trying to do new work. Because of those interruptions, the team is slow and often criticized for being behind, and that in turn means they're unable to pay down technical debt.
Dagster breaks this cycle. It allows you to think about individual assets and to take a declarative approach. So instead of having to build one monolithic DAG that's tied to your production resources, you can write new code incrementally, and the orchestrator will figure out when those new data assets need to be run. If this approach sounds familiar, it's because many modern web engineers have taken this declarative approach; in fact, the migration from Angular to React was all about adopting these benefits.
Here's the global data asset graph for the Hooli data engineering team. You can see they start by grabbing some data from an API. That data is fed through a series of transformations, and eventually a daily order summary table is created. That table is then used by the data science team to run forecasting routines and create predictions, and it's also used by the marketing team for KPI reporting and executive dashboards.
So what are the benefits of using assets? Well, imagine an executive has a question about the daily order summary: something doesn't look quite right. In a normal orchestrator, you would have to go spelunking through all the different tasks' logs, trying to figure out what task might have impacted that table. With Dagster, you can immediately look at the daily order summary and see metadata about it, see the run logs associated with it, and even information like the SQL that generated the table.
In addition, an asset-first approach allows Dagster to do declarative scheduling. Instead of having to create a single monolithic DAG, or trying to reason through when different cron schedules should be applied to different jobs, you can simply define new assets and encode the SLA that stakeholders have for them. So, for example, this average order asset that the marketing team relies on needs to be updated pretty frequently, because it's in a KPI dashboard, so a policy has been set that the asset should never be more than 90 minutes stale.
In contrast, the daily order summary asset only needs to be updated every day by 9 AM. Dagster figures out when these assets should run, and because it's aware of all the different data assets that your team cares about and how they depend on one another, Dagster is smart enough to avoid redundant work. So here we're seeing that the average order dataset, which has that SLA encoded to be up to date every 90 minutes, needs to have itself and two other stale assets upstream updated, but everything else is already fresh enough.
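The decision Dagster is making here can be sketched as a toy staleness check: walk each SLA'd asset's upstream graph, and only mark for a run the assets whose last materialization misses the deadline (or that feed a stale asset). This is a simplification for illustration; Dagster's real scheduler is considerably more sophisticated:

```python
from datetime import datetime, timedelta

def assets_to_refresh(deps, materialized_at, max_lag, now):
    """Return the set of assets that must run so every SLA is met.

    deps:            asset -> list of direct upstream assets
    materialized_at: asset -> datetime of last materialization
    max_lag:         asset -> timedelta SLA (only for assets that have one)
    """
    stale = set()

    def needs_run(asset, deadline):
        # Check every upstream (no short-circuit, so all stale ancestors
        # get recorded), then check this asset's own materialization time.
        upstream = [needs_run(up, deadline) for up in deps.get(asset, [])]
        if any(upstream) or materialized_at[asset] < deadline:
            stale.add(asset)
            return True
        return False

    for asset, lag in max_lag.items():
        needs_run(asset, now - lag)
    return stale
```

With a fresh `orders` extract but a three-hour-old summary, only the summary and the SLA'd average-order asset are marked for a run; everything else is left alone.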
Let's take a look at a Dagster project. Dagster projects are formatted as Python packages, and within a project we can create an asset by simply writing a new function. Assets in Dagster can be pandas DataFrames, they can be Jupyter notebooks, they can be Spark DataFrames, or really any arbitrary code.
Once we have our asset created in Dagster, we can run everything locally. We'll fire up a local copy of the Dagster user interface, and here I can test that the code I just wrote runs. When I run things locally, I don't have to use production resources: when I run all of my code, I'm going to be using just the local file system to store intermediate results, and the SQL that I'm writing will execute against a local DuckDB warehouse.
Normally, when data teams open pull requests, you can review the code, but you have to guess what that code will actually do once it's in production. With Dagster, we create what's called a branch deployment, which is essentially an isolated copy of our entire data platform, just for this pull request. That allows my team to actually run the code and see what it's going to look like. In this case, we're running against resources that are very similar to production: we're using Snowflake to clone a copy of our production database that this pull request can run against.
Once you're ready to put code into production, Dagster was built with all the modern bells and whistles. So, for example, multiple teams can collaborate together in different virtual environments and different projects, so you don't have to try to get everyone on the same version of pandas, while still having a global asset view where those teams can depend on one another's work.
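Per-team projects are stitched into one deployment through Dagster's workspace file, where each entry becomes its own code location running in a separate process, so teams can pin their own dependencies. The package names here are hypothetical:

```yaml
# workspace.yaml - each entry is a separate code location, loaded in its
# own process, so the two teams can pin different versions of pandas.
load_from:
  - python_package: hooli_data_eng
  - python_package: hooli_forecasting
```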
Dagster has a variety of settings to help ensure that the orchestrator is robust, including things like automatic op retries and run queues with different priorities. And finally, Dagster supports a variety of different alerting policies. Like many orchestrators, you can alert on failure, but Dagster actually helps teams avoid alert fatigue by also allowing you to alert on SLA violations. That means you're only going to get notified when datasets are outside of the SLAs that actually matter to stakeholders, and not get notified based on spurious failures that are automatically recoverable.
So we hope you're excited about Dagster and ready to give it a shot. If that's the case, we've made it really easy to get started with Dagster Cloud: you can clone an example project and get running in no time, or you can start out by developing locally. Once you're ready to run things in production, you can either host Dagster open source yourself, or Dagster Cloud comes with a fully serverless option, with a hybrid computation model available as well. So be sure to check us out, find us on GitHub, and give us a star.