From YouTube: Supercharge dbt: You might not need dbt Cloud!
Description
Join the Dagster team for a special showcase of new dbt functionality included in Dagster 1.4, which makes Dagster a strong alternative to dbt Cloud. This session will benefit data professionals who work with dbt models and are looking to manage them as part of a larger pipeline. If you are looking to migrate off dbt Cloud or are simply exploring dbt alternatives, this is a session you will not want to miss.
A: Hi, I'm Pete, CEO at Elementl; we make Dagster, an open source data orchestrator, and I want to welcome you to our event today focused on dbt. But before we get into all the content, I wanted to first touch on something that we care very deeply about here, which is data engineering. We view data engineering as a discipline, not a job title. There are all sorts of different people in the organization with titles like data engineer, machine learning engineer, data platform engineer, analytics engineer, even data scientist and data analyst.
A: They are all often participating in the process of building, maintaining, and leveraging data pipelines, and so we view this as a discipline, not a job title. And the state of this discipline of data engineering today has a lot of challenges. First, there are a lot of different tools, so everyone participating in this process has to learn multiple tools; they have to jump through many of them in order to do their job.
A: Another problem is that there are multiple different data teams within the organization. Often they have titles like ML engineer, analytics engineer, or data engineer, and they often have entirely different stacks, which makes it difficult for them to collaborate, even though they are doing very similar activities, often operating on the same data sets, and serving similar, if not the same, stakeholders.
A: This siloing creates a lot of problems. And finally, many of the tools that data engineers, or people practicing the discipline of data engineering, have available to them have a really low-quality developer experience: they often lack local development (we see teams testing in production all the time), and version control and CI/CD are often not ubiquitously adopted. That's why we started the Dagster project: to accelerate the adoption of software engineering best practices and to solve a lot of the other problems that we were seeing among folks practicing the discipline of data engineering.
A: So just a quick recap for those who are not familiar with Dagster, to set the table: Dagster is a data orchestrator, so you write and test your data pipelines in Python. You might be doing your raw compute and transformations within Python, or you might be orchestrating systems outside of Python, based on SQL or Scala or another language like that, but fundamentally you're defining your data pipelines in Python and you're testing them locally and in the CI process. Then Dagster will run and monitor your computation for you. You can launch runs, and Dagster will monitor them and restart them when they fail. You can put your data pipelines on a schedule. You can kick off runs of data pipelines based on external signals using a feature called sensors. You can partition them, etc.
A: Dagster will run them for you, and it will give you a beautiful UI to monitor that computation and also hook up to alerting systems to let you know when something's going wrong. And then finally, one of our distinguishing features is that we track data lineage and metadata within the tool itself, so you can inspect every asset's status, its schema, its metadata, and its dependencies, all from one place. It's really convenient for engineers working on the data platform, as well as their stakeholders, to self-identify issues when they come up.
A: A lot of people are using Dagster: all sorts of different companies all over the world, across all sorts of industries, sizes, and stages, and we integrate with basically every data tool out there. One thing that we noticed was that one integration really rose above the rest in terms of adoption within the Dagster community, and that was dbt.
A: Over half of our Cloud users use dbt in at least one of their pipelines, and so dbt has become one of our most important integrations. That's why we have spent the last couple of months really focusing on making a lot of improvements to our integration and trying to set the standard for orchestration with dbt. So I wanted to share our agenda with you today.
A: Pedram is going to take over from me, and he's going to talk about the challenges that organizations are facing with dbt at scale. Sandy is going to make the case that Dagster is the best way to orchestrate dbt, and I agree with him. Then Rex is going to do a demo of our new dbt integration, and finally Ben is going to show us what this enables: both the current set of features that Dagster has and the future of Dagster, and how dbt can plug into that and leverage all the great features that everybody using Dagster has been able to take advantage of so far.
B: Thanks, Pete. Hi, I'm Pedram Navid, head of data engineering at Dagster. I've been a longtime user of dbt, and I've spent a lot of time talking to people in the community about the joys and pains of building data pipelines. While dbt is a great tool that has really changed how we build SQL-based data pipelines, I've also seen some common pain points as teams grow and their data needs become more complex.
B: First, let's talk about vendor lock-in. Perhaps not an obvious topic, but over the past year we've seen dbt adopt a less permissive licensing model, making certain features exclusively available to dbt Cloud users. Now, while dbt Core remains open source, many of the features required to make it usable, such as orchestration and deployment, as well as upcoming features like dbt Server and metrics, are all part of the cloud-only product.
B: This increases the risk of vendor lock-in, and along with things like last year's price hikes and new limitations on the non-enterprise plans, such as a one-project limit and two concurrent jobs, it has become a real concern within the community. Another issue we've heard a lot about arises as teams build more complex pipelines and integrations, especially with the rise of things like AI and ML.
B: They soon begin to realize that there's more to life than Jinja and SQL. Data teams are increasingly working cross-functionally with ML and ops teams, but dbt Cloud is essentially an operational silo: there's little to no understanding of the source data that feeds your models, and instead we rely on patchwork freshness tests, for example, to hope that upstream data has been refreshed. Now, dbt does support using Snowpark or PySpark for Python models, but in doing so there's now a tight coupling between your dbt models and your Python code.
B: Probably most importantly, dbt just isn't a full-lifecycle tool for data engineers. Jinja only gets you so far, and things that appear simple on the surface become very cumbersome at scale. This is the most common pressing concern when I talk to teams, and I really believe that data teams deserve better tools that support engineering best practices. dbt has certainly given us a head start on that from where we were before.
B: I think it's still not enough. For all the great aspects of SQL (and you know I love SQL; I defend it all the time), some things are just better in Python. Unit testing, for example: if you have complex logic that you want to wrap in a Python function, that's a very natural thing to do, and it's very easy to test. But you can't really test logic in SQL; all you can test is the output of these data transformations in a live system. The same goes when it comes to things like observability and notifications.
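As a tiny illustration of that point (the function and values here are made up, not from the talk): logic that lives in a Python function can be unit tested directly, with no database involved, whereas the equivalent SQL expression can only be checked by running it against a warehouse and inspecting the output.

```python
def cents_to_dollars(amount_cents: int) -> float:
    """Convert an integer price in cents into dollars."""
    return amount_cents / 100

def test_cents_to_dollars():
    # A plain unit test: no warehouse, no fixtures, runs in milliseconds.
    assert cents_to_dollars(250) == 2.50
    assert cents_to_dollars(0) == 0.0
```

The SQL version of the same division can only be validated by materializing a table and checking the rows that come out.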
B: dbt is limited to cron schedules, which results in teams creating large buffers between successive tasks just to create the illusion of dependency. If you've ever scheduled an integration at 1 AM, then run your dbt models at 4 AM, and then your operational tasks, like notebooks and reverse ETL, at 6 AM, you know exactly what I mean: you just hope and pray that nothing takes longer than those windows. And while YAML itself is an excellent language for configuration, we've really started to overload it.
C: Hi, I'm Sandy, and I'm the lead engineer on the Dagster project. If you're not familiar with it, Dagster is an orchestrator. Its goal is to help data practitioners orchestrate the computations and data that make up their data pipelines: to schedule them, to run them in the right order, to give visibility into failures, and to help pick up where they left off.
C: The alignment between the way that Dagster thinks about data pipelines and the way that dbt thinks about data pipelines means that Dagster can orchestrate dbt much more faithfully than other general-purpose orchestrators like Airflow, and at the same time, Dagster is able to compensate for dbt's biggest limitations. dbt is rarely used in a vacuum: the data transformed using dbt needs to come from somewhere and go somewhere, and a data platform needs more than just dbt.
C: Like dbt, Dagster puts data assets at the center: Dagster pipelines are graphs of connected data assets, so data lineage comes automatically, because the references between your tables are part of how you define your data pipeline. This means that Dagster can understand a dbt project at a really deep level.
C: Unlike other orchestrators, Dagster doesn't need to run each dbt model in a separate task, which incurs a lot of overhead. But dbt models aren't the only kind of data asset that Dagster works with: a Dagster asset could be a table ingested using a tool like Fivetran, Stitch, or Airbyte; it could be a machine learning model, a dataset of images, or a file; and you can compute Dagster assets using any Python code running on any platform.
C: Orchestration has been Dagster's primary job since its inception, and over that time it's grown to handle a very long tail of orchestration needs. For example, to determine when to run your dbt models, you often need to rely on logic that's specific to your use case: you might have a particular way to check whether new source data has arrived, or need to incorporate a specific business calendar into your scheduling. In Dagster, you can write arbitrary Python code that triggers runs of your dbt models.
C: Last of all, Dagster is open source. We offer a cloud product where we deploy Dagster for you and offer extra features for teams, but fundamentally you're not locked in. For example, if we change our pricing in a way that you believe is unfair, you always have the option to take your Dagster pipelines and deploy them on your own using the open source project. With that, I'm going to pass it off to Rex, who will demo how this all works in much more depth.
D: We're going to start off with a jaffle shop dbt project and supercharge it piece by piece with Dagster. We'll first scaffold the project to allow Dagster to load your entire dbt project as software-defined assets. We provide a built-in utility to do this in our integration library; here we're creating a new project to contain the scaffolded code.
D: We can see the individual assets in our current project and how they relate to one another. Currently this just contains all the assets modeled in our jaffle shop project, and we can search for models and filter accordingly. So, for example, we can filter for any model that matches the word "customers", and we can also do a case-sensitive match for the actual customers model and show all models upstream of it.
D: Now, to ensure that our data assets are up to date, we want to run them on a schedule. In our existing scaffold, we're given a schedule that only materializes our dbt assets, but we just added two new upstream and downstream assets. So let's incorporate all of our assets into the same schedule: here we'll create a new schedule that selects all of our data assets and materializes them on a daily cadence.
D: To recap: one, you can define your pipelines in Python instead of just Jinja and SQL. Two, to integrate your data platform, you can understand how all of your tools and data assets relate to one another in one central control plane, and you can configure alerting and monitoring that works for you and your team. And three, you can materialize your selected data assets together at desired times with Dagster. Knowing how and when your data will update will never be a guessing game.
E: Thanks, Rex. My name is Ben, and I'm an engineer working on Dagster. Dagster provides powerful orchestration capabilities that let you schedule your dbt models alongside other data assets, but that's just scratching the surface of what it can do. Dagster is built to provide fine-grained control and deep insight into your data platform, so today I'll showcase a few upcoming features that will help you supercharge your use of dbt.
E: We've already seen how you can set up Dagster to regularly update your dbt models. But what happens when you make changes to your models? Ideally, the table corresponding to the model would be re-materialized so that it matches our SQL change. Our downstream data, the tables, models, and reports that are built off of our model, might be out of date, and ordinarily it can be a nightmare to track down and update everywhere our models are used.
E: Let's take a look at how Dagster's auto-materialization functionality can help. Let's say we notice something is wrong in our payments model: it looks like we're incorrectly converting prices in cents to dollars. Let's go ahead and fix this bug and save our change. Because we've been computing this field incorrectly, our downstream models have also been flawed. Thankfully, since Dagster is aware of all of our data dependencies, it can automatically mark our assets as out of date. Let's go ahead and reload our code in Dagster; Dagster will automatically recognize that our SQL has changed. For those of you familiar with dbt's state:modified selection, this works in a similar way.
E: One difference is that Dagster will mark as out of date all affected assets in your data pipeline, not just those written in dbt. Here we can see that an asset which uses Python to generate a report shows as out of date, too. If we wanted, we could now manually re-materialize our assets to bring them back up to date.
E: Coming soon to Dagster is the ability, if you so choose, to instruct Dagster to automatically rebuild models and their downstream assets when their SQL has changed. Instead of manually re-materializing our assets, we can see that Dagster's auto-materialization functionality has automatically kicked off a run to rebuild just the changed model and its downstream assets. We can see that our dbt models which depend on the changed model, as well as the Python report asset, are being updated.
E: Let's say I wanted to add a new dbt model to collect all of our pending orders in one place. I've already tested it against mock data locally, but I want to make sure that it's going to work properly on real production data, so I've gone ahead and created a pull request. Since I've set up branch deployments on my GitHub repository, Dagster Cloud has created an ephemeral, sandboxed environment where I can test my code. Opening up my branch deployment, I'm greeted with an asset graph which now shows my new dbt model.
E: This is an ephemeral environment which coexists alongside my production deployment. It's running against a copy-on-write clone of my Snowflake schema, so I can kick off the materialization of my model, which reads from real production data, without having to worry about anything being written to production.
E: Branch deployments are just one way to get insight into what's going on in your data platform. For many data teams dealing with scale, tracking down resource usage and cost is a priority. When you have a lot of stakeholders who are building and scheduling computations, it's not always easy to get a handle on where to focus your effort to save money. For these users, we're building a feature for Dagster Cloud that helps track down sources of wasteful compute and spend out of the box.
E: Let's say that I'm a data platform engineer supporting a team of other stakeholders who have built jobs using dbt, Snowflake, and other tools. A recent budget review meeting showed that our Snowflake bill has been spiking recently. Let's go to the new reporting tab in Dagster to get a handle on what's going on.
E: There are a couple of different ways that we can investigate. Dagster provides a way to filter jobs, assets, and asset groups by costs, like Snowflake credits, or by performance metrics, like compute time or number of retries. Since we're looking at a Snowflake billing change, let's go to the Snowflake credits tab. It looks like one of our asset groups has had a major increase in consumption over the past couple of days, so let's drill down into that asset group.
E: First, let's see if any of our assets are being re-materialized more often. It seems like the re-materialization count has stayed the same, though, so let's look at duration instead. It looks like one of our assets, orders, has increased in runtime pretty significantly over the past couple of days.
E
We
can
click
over
to
the
asset
index,
there's
asset
catalog
and
jump
to
the
assets.
Definition
in
our
GitHub
repo,
let's
tab
over
to
the
get
blame
and
see
if
anything
has
changed,
it
looks
like
someone
has
recently
changed
the
DVT
model
to
add
a
pretty
expensive
join.
It
seems
like
it
might
be
our
culprit.