Description
In this month's Dagster Community Call, Rex Ledesma provided an overview of the new dbt™ integration and Georg Heiler shared his dbt™ use case: Unlocking Advanced Metadata Extraction with the New dbt™ API.
A
Okay, great. Thanks everyone for joining, and welcome to the Dagster community call; we have this meeting every other month. So let's get into the agenda for today. First we have Rex, who's going to walk through the dagster-dbt integration updates, and then we'll see those updates in use from Georg, who will show how he wrote a blog post and how he used them in different situations with the metadata. So, over to Rex, who'll be going through dagster-dbt.
B
Okay, hopefully everyone can see my screen here. I'm going to go over the API changes that we've made to dagster-dbt in our 1.4 release.
B
We basically revamped the API for using dbt and Dagster together. There's now a more ergonomic API to create software-defined assets, we give you a CLI to scaffold a Dagster project given a dbt project, and much more. This is a basic overview of the changes we've made, as well as some resources you can look at if this is of interest. So let's get started with what motivated this.
B
Our dbt integration is very, very popular: over 50% of our users use it. Once we released it back in 2022, with an integration with software-defined assets, a lot of the same recurring questions came up as people used the integration more extensively. People wanted to chain their dbt assets with upstream and downstream computations.
B
As Lorenzo pointed out here, people wanted to customize the metadata about the assets that they were creating in Dagster from their dbt assets. People wanted more flexibility in what they were able to execute using dbt.
B
For example, Stephen here is asking how he can use Slack to send a message after his dbt models have executed properly, and Dennis is asking how he can run other steps before his dbt run step.
B
The second question is: how do I customize the execution of my dbt assets? Rather than just running dbt run or dbt build, how can I add multiple dbt commands, or how do I use other Python libraries alongside my dbt executions?
B
The third is: how do I customize my asset attributes? How do I change my op name, how do I change my asset keys, et cetera. And then fourth: how do I add dependencies to my dbt assets? How could I make it so that my Fivetran or Airbyte assets kick off before my dbt assets, or materialize a dashboard after my dbt models have executed? How do I do that in an ergonomic way?
B
And so these are the four problems that we were tackling in this revamp, and we came up with a set of APIs that address these underlying problems.
B
So here's the new API that we have. This is very similar to the decorator-based approach we give for a raw op or a raw asset.
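The code shown on screen isn't in the transcript, so as a rough, stdlib-only sketch of what a decorator-based asset API looks like (the real integration uses the `@dbt_assets` decorator from `dagster-dbt`; the registry below is a hypothetical stand-in, not the actual implementation):

```python
# Hypothetical sketch of a decorator-based asset API, loosely mimicking
# the ergonomics of dagster-dbt's @dbt_assets. Not the real implementation.
ASSET_REGISTRY = {}

def dbt_assets(name):
    """Register a function as the compute for a named group of dbt assets."""
    def decorator(fn):
        ASSET_REGISTRY[name] = fn
        return fn
    return decorator

@dbt_assets(name="my_dbt_assets")
def my_dbt_assets():
    # In the real API this body would invoke dbt (e.g. `dbt build`)
    # and yield materialization events for each model.
    return ["dbt build"]
```

The point of the decorator style is that the function body is ordinary Python, so you fully control what runs inside it.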
B
Another example: with this new dbt_assets API, you can now run multiple steps, or multiple dbt commands, that execute on the same assets. For example, you can now execute a dbt run step and then a subsequent dbt test step. Before, this wasn't possible, but now it is with the new APIs.
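Independent of the dagster-dbt API itself, that two-step flow boils down to two CLI invocations against the same project; a minimal sketch that only assembles the commands (the `--target` value and project dir are assumptions):

```python
def dbt_command(step, target="dev", project_dir="."):
    """Build the argument list for a single dbt CLI step."""
    return ["dbt", step, "--target", target, "--project-dir", project_dir]

def run_then_test(target="dev", project_dir="."):
    """A dbt run step followed by a subsequent dbt test step."""
    return [dbt_command("run", target, project_dir),
            dbt_command("test", target, project_dir)]
```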
B
Another example: you can now integrate other resources into your dbt execution.
B
So in this example, we're sending a Slack message after running a dbt command. Here, we're running a dbt build step, creating events from it, and then, if it's not successful, we send a message to a Slack channel saying that it failed. And this isn't limited to just a Slack resource.
B
Another example: people want to customize the asset keys that are created for their dbt assets.
B
Here we give you an API to define the asset key associated with any dbt model that's ingested by Dagster. So in this example, we define a translator object that ensures that any asset key created from this integration has a prefix called "snowflake", so that we can label all of our dbt assets accordingly.
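The real mechanism is a `DagsterDbtTranslator` subclass in dagster-dbt; the key-prefixing logic itself can be sketched with a plain function (the node dict below is a simplified stand-in for dbt's per-node properties):

```python
def prefixed_asset_key(dbt_resource_props, prefix="snowflake"):
    """Derive a Dagster-style asset key (a tuple of parts) for a dbt
    node, prepending a fixed prefix so all dbt assets share a label."""
    return (prefix, dbt_resource_props["name"])

node = {"name": "customers", "resource_type": "model"}
key = prefixed_asset_key(node)
```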
B
We also give you new APIs to create downstream dependencies on your dbt assets. For example, here we created our dbt assets, and now we want to define a downstream dependency in Python that takes in the customers model, which is defined in dbt, and we can run any sort of Python to create that clean customers model.
B
And here is a similar use case: we can define an upstream dependency for dbt assets. We can take a dbt source called my_source and translate that into a Python asset defined in Dagster, so that it becomes an upstream dependency of your dbt assets.
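Conceptually, both cases just add edges around the dbt nodes in one asset graph; a minimal sketch with a plain adjacency dict and a topological order (asset names are hypothetical, matching the examples in the talk):

```python
from graphlib import TopologicalSorter

# Asset graph: each key depends on the assets in its set.
# "my_source" is a Python asset feeding a dbt source (upstream);
# "clean_customers" is a Python asset reading a dbt model (downstream).
graph = {
    "my_source": set(),                    # Python asset, upstream of dbt
    "customers": {"my_source"},            # dbt model reading the source
    "clean_customers": {"customers"},      # Python asset, downstream of dbt
}

# Dependencies come before dependents in the computed order.
order = list(TopologicalSorter(graph).static_order())
```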
B
So those are the flavor of API changes that we've made to improve the customization of our integration, giving our users more levers to customize the compute, customize the metadata about their assets, and integrate it more seamlessly with the Dagster framework. And we culminated all of these API changes in a nice little quickstart for dbt users.
B
So now, with the release of our 1.4 APIs, we're actually giving users a CLI to scaffold a Dagster project given any dbt project. In this example, we're creating a Dagster project from the standard jaffle_shop project. You can see here the brand new CLI command to create this new jaffle_shop Dagster directory.
B
This has a set of definitions that let you load from this dbt project, and we've updated our tutorial to make use of this new CLI.
B
So you can see here, we've scaffolded this project, and when we go to load it, it automatically loads in all the assets from our dbt project. When we materialize these assets, this actually runs a dbt build step by default, and you can customize this. It just gives you a template to get started quickly with Dagster and dbt. And yeah, this is just showing a successful run using the scaffold.
B
And that's a quick overview. If people have any questions about these APIs we've released, there's API documentation on our website; we've updated our tutorial, as I said before, to use these new APIs alongside our dagster-dbt project scaffold CLI; we have a set of frequently asked questions regarding this integration; and we covered all of this, and our vision for Dagster and dbt, in last week's live stream event.
C
And I will try to get the videos back on the second screen, so I can keep watching you. I want to start with the blog post, to give you, let's say, a deep dive into how you can utilize these new APIs of Dagster and dbt, in conjunction with, on one side, parsing additional metadata out of the dbt YAML files or run results, and secondly, integrating dbt with a full-blown data governance solution, OpenMetadata.
C
So Rex, thankfully, has already introduced the basics, but let me show you how this might look in code. I don't know if the code on the shared screen is large enough; please tell me. Okay, I will try to make it a bit larger in any case. So first of all, dbt has the concept of a target environment, like dev and prod.
C
You have to tell the dbt CLI resource which target should be used. Here I'm passing the dev target, but you might want to use a different one; it's defined up here. The reason why this is different now is that initially, with, let's say, the first release of the API, you had to basically pass it into the CLI as an option, and I just monkey-patched the code example yesterday against the fully released API.
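At the CLI level, the target and profiles location amount to a couple of flags; a small sketch of assembling them (the flag names `--target` and `--profiles-dir` are dbt's own; everything else here is illustrative):

```python
def dbt_cli_args(command, target=None, profiles_dir=None):
    """Assemble dbt CLI arguments; target and profiles dir are optional,
    falling back to dbt's defaults when omitted."""
    args = ["dbt", command]
    if target:
        args += ["--target", target]
    if profiles_dir:
        args += ["--profiles-dir", profiles_dir]
    return args
```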
C
In any case, we can pass the dbt config settings like this. You may be required to add a profiles location if you're using a different location. But from this point onwards, the basics are set up and you can start to make use of the new API. Most interestingly, you may want to retrieve additional information from the dbt JSON documents, like the manifest, the run results (maybe with test results), or the catalog file from the documentation. You have to actually retrieve these artifacts, and the new API has a method to get hold of them.
C
The nice thing about the new API is that it allows you to stream the events in a quite neat way, and this works fine for, let's say, the default use case, where you only want to visualize the outcomes in Dagster as quickly as possible. However, when you want to actually work with these JSON documents and retrieve the results, these documents are only written after dbt has finished.
C
So basically, it is required that we block the computation and then retrieve the data. This also means that the logs in target will not be updating. You might be able to change it up a bit and add some more complex logic to keep the logs streaming and still block, but I didn't do this here, for the sake of keeping the example simple.
C
So first of all, you have to wait on the CLI's task to block until the dbt run is finished, and once it is finished, you can retrieve the artifacts as desired. The events are then all available at once; they are not emitted in a streaming way, one after the other as they happen.
C
But after the blocking is finished, you have to retrieve the individual structures from the document. As Dagster is using the asset key notation, and dbt a slightly different notation with schema, database, table name and so on, you might be required to have a remapping or lookup table, which is basically inverting the dictionary, to look up between both systems. But once you have that in place, you can basically iterate over these events and extract the desired details from each event.
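A minimal sketch of that lookup table: build the forward map from dbt unique_id to a Dagster-style asset key, then invert it for lookups in the other direction (the manifest shape is heavily simplified; real manifest nodes carry much more):

```python
def asset_key_for_node(node):
    """Map a (simplified) dbt manifest node to a Dagster-style asset key."""
    return (node["schema"], node["name"])

manifest_nodes = {
    "model.jaffle_shop.customers": {"schema": "analytics", "name": "customers"},
    "model.jaffle_shop.orders": {"schema": "analytics", "name": "orders"},
}

# Forward map: dbt unique_id -> asset key; inverted: asset key -> unique_id.
dbt_to_dagster = {uid: asset_key_for_node(n) for uid, n in manifest_nodes.items()}
dagster_to_dbt = {key: uid for uid, key in dbt_to_dagster.items()}
```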
C
I was personally interested in retrieving two types of information. First of all, rows affected, which is somewhere in this JSON, and the second one is the compiled code. In the old version of Dagster, I was using the convenience feature of loading the dbt project right away, where basically the whole dbt project would be compiled on the fly, but this is a slow approach, and I wanted to move towards, let's say, a baked CI approach with a pre-parsed project.
C
So basically, I will only get the compiled code after actually running the code. This means the Dagster UI would not be able to show the nice documentation around the compiled SQL, and I wanted to get this back. Basically, by running the code in the dbt run task here and retrieving the executed, compiled SQL statement from the event, you can get back pretty much 99% of the initial functionality.
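In run_results.json, each result carries an adapter response (often including rows affected) and, in recent dbt versions, the compiled code; a sketch against a simplified structure (the field names follow dbt's run results artifact, but treat the exact shape as an assumption that varies by dbt version and adapter):

```python
def extract_run_details(run_results):
    """Pull rows affected and compiled SQL per model out of a
    (simplified) dbt run_results.json structure."""
    details = {}
    for result in run_results["results"]:
        details[result["unique_id"]] = {
            "rows_affected": result.get("adapter_response", {}).get("rows_affected"),
            "compiled_code": result.get("compiled_code"),
        }
    return details

sample = {
    "results": [{
        "unique_id": "model.jaffle_shop.customers",
        "adapter_response": {"rows_affected": 42},
        "compiled_code": "select * from raw_customers",
    }]
}
details = extract_run_details(sample)
```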
C
And yeah, you will also get the compiled SQL back in the UI. As explained before, that was necessary because I didn't pre-compile the dbt manifest, only parsed it. But last but not least, it is sometimes interesting to work with a more full-blown data cataloging solution, maybe for the sake of quickly accessing the documentation or working with assets beyond only dbt. There are multiple open-source projects around for this, like DataHub or OpenMetadata. I personally liked the OpenMetadata approach, and I'm also showcasing here how to integrate it.
C
The call that the CLI is doing is basically this one here: metadata ingest with the path to the config file. I have created a small CLI wrapper that is basically executing it from Python, and in order to make this work, we follow pretty much a similar pattern as before: we have to first call this wait method to block the task, and again we can then use the wrapper or utility functions to retrieve the artifacts.
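That wrapper boils down to shelling out to the OpenMetadata CLI (`metadata ingest -c <config>`); a sketch that separates command construction from execution so the runner is injectable (the config path is a placeholder, and the exact CLI flags should be checked against your OpenMetadata version):

```python
import subprocess

def ingest_command(config_path):
    """Build the OpenMetadata ingestion CLI command for a config file."""
    return ["metadata", "ingest", "-c", str(config_path)]

def run_ingestion(config_path, runner=subprocess.run):
    """Execute the ingestion; `runner` is injectable for testing."""
    return runner(ingest_command(config_path), check=True)
```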
C
However, because with the 1.4 release Dagster is modifying the path of the dbt outputs, so that it will allow for non-colliding parallel runs, we have to somehow pass the final output path downstream so that other tools can work with it. I was basically hard-coding this path, so not fully supporting parallel runs, but for the sake of my use case this was enough.
C
Why is this perhaps interesting? Because the data catalog not only has the assets, but it also has the pipelines, like the Dagster runs, and you might be able to show these to business users in a more streamlined way. It could also feed, let's say, machine learning models, Kafka topics and whatnot, and it can give you a neat way to surface the test results that you define in dbt tests to end users, to drive trust in the data assets, where you can show whether they were successful all the time or maybe not. Interestingly, it also surfaces dbt's default documentation.
C
That being said, this is the dbt demo part one, and I'm wearing multiple hats. So, first of all, I want to thank my university, where we were showcasing this.
C
That was the Dagster installation, and I was allowed to show you these screenshots from a production system. But with one of my other hats, I'm also working at, let's say, an enterprise company, where we are also starting to explore this new dbt integration, in a different way: we have a custom JVM-based, Spark-based data ingestion tool, and we want to integrate it with Dagster, where we intend to have this tool copy data over into the cloud environment and then retrieve the lineage from Dagster itself for this outside tool.
C
This is a very small demo of how it will look in the future, where you have this source asset that is living outside, you have an inside asset that is, in fact, depending on this outside one, that is managed from this ingestion tool, and then you have some dbt transformation kick off. The demo example would actually look like this, and again it quite neatly ties Dagster and its new API together.
C
Yeah, that's the second demo, in very short. I hope it was not too quick, because I could go into more depth there. If there are some questions after the call, you can reach out to me; maybe Rex will share how to contact me. And regarding the blog post, I think we can share it in the show notes, so if you want to look at it in more depth, it would be quite easy to read up. I'm also part of the Dagster Slack, so you can simply reach out to me there as well.
A
Boom, we do have two questions. One was on the column-level lineage: that was from OpenMetadata, is that correct?
C
Yes.
A
Cool, thank you. And one from Pedram, which is: is the dbt CLI running every time Dagster definitions are loaded?
C
So that depends on how you use it. In the old way, pre-1.4, you had two options. Let's say the easy mode was to simply run it every time Dagster is loading the definitions, and I was using that for the sake of convenience, but it is a bit slow. This is why I switched, when adopting the new API, to this pre-baked mode, where you can containerize with a pre-compiled image, or in my case a pre-parsed image, and then have it run only when you need it.
A
[inaudible question]
C
Only vaguely, but in general Dagster allows you to orchestrate anything. So if you write your custom code to do that, then you will be able to do it. Dagster doesn't have this integration out of the box, but you can easily roll it yourself by writing the custom code that is needed to get the job done. Yeah.
A
Okay, so any more questions on any of that, or even the dbt stuff?
A
Can you guys see my screen? No? Okay. So, other than that, just a couple of other call-outs: if you're interested in presenting on this call, feel free to reach out to us.
A
We have this call every two months, like I mentioned, and we're always looking for new people, or anyone who's interested in either collaborating on a blog post or presenting in this meeting itself, so we'll definitely reach out. And other than that, we are releasing a trial version of Dagster University, which is really meant for newer Dagster users. So if you have people on your team, or you're interested in taking a look, let us know and we'll add you to the Slack channel where you'll get the updates.