From YouTube: Dagster Demo for Genpact - April 2023
You know, a challenge that a lot of data teams face is that they're working with orchestrators that really weren't built for data engineering. Airflow, as an example, is a very general-purpose orchestrator. Dagster, on the other hand, was built to exist as an orchestrator specifically for data teams, and so throughout today's demo we're going to look at some of the benefits that come from being a data orchestrator, not just a general-purpose orchestrator.

The first thing I want to highlight out of the gate is that we're able to track the lineage and metadata of the actual data assets that the pipeline is producing. That's ultimately what stakeholders usually care about. No one comes to a data engineer and says, "I think task X failed, and that's why everything is bad today." They come to data engineers and say, "This dataset looks wrong. What's going on?"
A lot of the other tools in a data platform operate against datasets, and so the orchestrator being aware of datasets allows us to integrate more seamlessly with those tools, whether they're on the extract-and-load side like Airbyte, in the transformation layer like dbt, or even further downstream, like Jupyter notebooks that are creating models. Those all fit much more seamlessly with a tool that's aware of data assets than with an orchestrator that's just thinking about tasks. And this is another view of what I just said: essentially, Dagster gives us the ability to do many of these things because we're aware of data assets.

The other theme throughout here is that Airflow was built before a lot of software engineering best practices really hardened, and so things like local testing and CI/CD can be very challenging with some of those other orchestration tools, whereas Dagster was built to really adopt those best practices.
All right, so I want to show you what this looks like in practice, so I'm going to switch over to Dagster, and I'm going to work backwards: I'll start with what a production data platform looks like in Dagster, and then we'll work our way back to the actual code that you'd write to get there.

So here is Dagster Cloud. Dagster Cloud is one way that many teams use Dagster in production. Basically, what Dagster Cloud entails is a control plane that we run, which tracks things like who's logged in and what runs have occurred, while the actual execution plane, where these jobs are fired off, tends in most cases to be within an AWS, Azure, or GCP cloud environment that is close to the data. So we have that hybrid architecture. What we're looking at is a timeline of jobs that have been running, and we can jump into one of these jobs.
Right away, that's where you'll start to see the data-first orientation of Dagster as a tool. Here we have a pipeline: it starts by extracting some raw data from an API, loading it into a warehouse, and then orchestrating some dbt transformations to clean up that data. Behind the scenes there are still these raw tasks that the orchestrator is running: go extract the data from the API, then run some dbt transformations. But whereas with tools like Airflow you only get that view, in Dagster there's also an awareness of the datasets that are being created as the pipeline runs.

So if, as I mentioned, a stakeholder comes to the data engineering team and says, "We're doing all this work against a daily order summary table, and today that table doesn't look right," in Dagster you can immediately go straight to that table's definition. You can see where it fits relative to everything else within the pipeline, and you get a lot of rich information: the last time the data was updated, who owns it,
what the schema looks like, and, in the case of dbt, even what the raw SQL definition was. So right out of the gate you can start to see how Dagster is different from other tools, because it has that data awareness.

A couple of other things I'll mention here. You can see these pointers to other parts of the data platform. And in addition to looking at just a specific job (this one's set up to run every three hours), we can view the run logs for this specific job and see what events are happening over time.
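As a rough sketch of what that kind of cadence looks like in code (the job name and asset selection here are hypothetical placeholders, not from the demo), a Dagster schedule pairs an asset job with a cron expression:

```python
from dagster import AssetSelection, ScheduleDefinition, define_asset_job

# Hypothetical job that materializes every asset in an "orders" group.
orders_job = define_asset_job(
    "orders_job", selection=AssetSelection.groups("orders")
)

# "Run this every three hours" as a standard cron expression.
orders_schedule = ScheduleDefinition(job=orders_job, cron_schedule="0 */3 * * *")
```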
In addition to all of those table-stakes capabilities of a great orchestration platform, you can also see all of the assets, everything related to your data estate, in one place. Here we have the global view that our original job was linking out to. Here is that original job we were looking at, but now you can see those other pointers out to the rest of the data platform. As Fraser mentioned, the reason this type of view is so critical is that in most organizations the data estate spans many different teams. While the data engineers might be responsible for the raw extract and some of the initial transformations of the data in the warehouse, you'll often see other teams building on top of that data: maybe a data science team that's building models, or an embedded analytics team, perhaps inside a marketing organization, that's responsible for some KPIs used within your BI tools. So it's important that across these different teams you're able to see how data is flowing throughout the platform, and in Dagster this is really obvious, because we have that view of the assets, how they're connected together, and what the lineage looks like.

The one other thing I'll mention, which people get really excited about with this asset-first view of the world, is the different ways you can automate work.
Our original job was set up on that standard cron schedule: run this every three hours. But within Dagster you can also do event-driven orchestration, so you can have these pipelines be updated based on external events, or you can respond to SLAs.
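As a minimal sketch of what event-driven orchestration can look like (the job, directory path, and sensor name are hypothetical, not from the demo), a Dagster sensor can request runs in response to an external event:

```python
import os

from dagster import RunRequest, define_asset_job, sensor

# Hypothetical job that materializes the pipeline's assets.
orders_job = define_asset_job("orders_job")

@sensor(job=orders_job)
def new_orders_file_sensor(context):
    # Fire a run whenever a new file lands in a drop directory;
    # run_key de-duplicates, so each file triggers at most one run.
    for filename in os.listdir("/data/incoming"):
        yield RunRequest(run_key=filename)
```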
As an example, take this average-order KPI. Maybe this is something that lives in an executive dashboard and needs to be updated really frequently, and so in Dagster you can specify what we call a freshness policy, and then Dagster figures out what else needs to happen inside of your platform for this policy to actually be met. From a data engineering perspective, it becomes much easier to actually meet those stakeholder SLAs, as opposed to trying to figure out, across these different teams, what single cron schedule or DAG needs to be written so that we get data at the right time.
Often when we're talking to teams, they're kind of guessing: maybe my Fivetran sync will take an hour, and then my dbt transformations will take 30 minutes, so if I start one of those things at 7:30, I can update a dashboard at 10. That really is a complicated game to be playing as your data platform grows, and we believe that the orchestrator, which has knowledge of how all these things interconnect, can simplify that process significantly for data engineers.
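A minimal sketch of what declaring that kind of SLA looked like in Dagster releases around the time of this demo (the asset and upstream names here are placeholders):

```python
from dagster import FreshnessPolicy, asset

@asset
def daily_order_summary():
    # Placeholder for the upstream table.
    return [{"order_total": 20.0}, {"order_total": 35.0}]

@asset(
    # Ask Dagster to keep this KPI no more than 60 minutes stale
    # relative to its upstream data; Dagster works out which upstream
    # materializations need to happen to satisfy the policy.
    freshness_policy=FreshnessPolicy(maximum_lag_minutes=60),
)
def average_order_kpi(daily_order_summary):
    totals = [row["order_total"] for row in daily_order_summary]
    return sum(totals) / len(totals)
```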
So that's what those freshness policies are all about, in addition to regular cron scheduling and event-driven pipelines. That's kind of the high-level view of how things are orchestrated. Before I look at the code, I did want to jump back into these run logs just for a second, to give you a feel for what Dagster is providing as jobs are running. When we execute an asset, or a pipeline that creates assets, Dagster is going to go through this process.
I mentioned the hybrid architecture: spinning up compute in an environment that's close to the data, then executing the commands to actually create those data assets. It'll keep track of all the logs, and then it'll also read the metadata we were looking at before. In terms of what these commands are, what's actually being executed, we see people creating Dagster assets in three types of ways.

One way is that you can just write regular Python code and then incorporate that pipeline code into a Dagster project using these really simple function decorators. If you have code that does, say, extraction of data and then transforms that data, adopting Dagster is super straightforward: you just add these asset decorators, and that creates the lineage graph we were looking at before, between an extract asset and a transform asset.
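A minimal sketch of that decorator approach (the function bodies are placeholders standing in for real extract and transform code):

```python
from dagster import asset

@asset
def raw_orders():
    # Stand-in for code that extracts raw records from an API.
    return [
        {"id": 1, "status": "complete", "total": 20.0},
        {"id": 2, "status": "pending", "total": 35.0},
    ]

@asset
def cleaned_orders(raw_orders):
    # Naming the upstream asset as a parameter is what draws the
    # edge in the lineage graph between the two assets.
    return [o for o in raw_orders if o["status"] == "complete"]
```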
That's one way, and if you're ever scanning through the Dagster documentation, you'll probably see that approach presented front and center as a great way for teams to get started. We also, though, see a lot of teams that don't actually want to process the data within the orchestration tool; they just want to outsource the actual data manipulation and work to other systems. In Dagster, similar to what we saw before, you can decorate Python functions where those functions are executing tasks in other systems. Our integrations work the way I was just describing but keep you from having to write all the boilerplate code yourself. So if you're doing, for example, that use case where you're just orchestrating Fivetran syncs and then doing transformations in dbt, it's often just a one-liner to pull those existing projects into Dagster, and then you get the benefits of the lineage graph I was showing you and all the automation capabilities.
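A sketch of those one-liners as the dagster-airbyte and dagster-dbt integration packages exposed them around this time (the connection ID, table names, and project path are placeholders):

```python
from dagster_airbyte import build_airbyte_assets
from dagster_dbt import load_assets_from_dbt_project

# Each synced table from an Airbyte connection becomes an asset.
airbyte_assets = build_airbyte_assets(
    connection_id="your-airbyte-connection-id",
    destination_tables=["orders", "customers"],
)

# Every model in the dbt project becomes an asset, wired to the
# upstream tables it selects from.
dbt_assets = load_assets_from_dbt_project(project_dir="path/to/dbt_project")
```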
So we've tried to make it pretty straightforward to get going, and in fact, this would be fully functional code for the process I just described, where you're pulling something in from Airbyte, transforming it with dbt, and then even fitting a machine learning model in Python. All of that is built in pretty straightforward code, and that's something we're really proud of. If you've ever had to fight with constructing a DAG in Airflow, even Airflow 2, with some of the advances they're trying to make there, we hope that you can appreciate the developer-first experience that Dagster is creating for you.

The one other piece I'll mention, in terms of what this coding experience looks like, is that we've worked really hard to adopt those software development best practices that I mentioned.
And so, in Dagster, here's that same data platform, but now just running on my laptop. You can see that the assets here say they've never been materialized, because I haven't run anything yet, but I can get a feel for what the structure of the pipeline is and really quickly identify if I have any typos or syntax errors, or if I've incorrectly structured the dependencies between my data assets.
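A minimal sketch of that local loop, assuming the project exposes a standard Definitions object (the asset here is a placeholder): loading the module locally renders the same asset graph without materializing anything.

```python
from dagster import Definitions, asset

@asset
def orders_model():
    # Placeholder body; nothing runs until you materialize it, but the
    # asset graph (and any mis-wired dependencies) shows up immediately
    # when the definitions are loaded locally.
    ...

# Served by the local webserver (e.g. `dagster dev -f this_file.py`).
defs = Definitions(assets=[orders_model])
```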
Then, after I've done that, I can start to harden things by simply checking my work and any changes into version control, and so I'll show you what that looks like. For this project I'm playing around with, I've got all my code inside of a GitHub repo, and then I'm using CI/CD, in this case GitHub Actions.
So before, we were looking at prod; now I can look at just the staging environment for my new model, and what that allows me to do is actually run the model. We've set up our staging environment so that it's going to read from our production warehouse but write to a staging bucket, and that gives me a lot of confidence that the results here are actually similar to what they'd look like if I merged this into production, as opposed to just having to guess what's going to happen and then fighting those fire drills when I merge it into production and everything breaks. These branch deployments, as we call them, give you a way to adopt those CI/CD best practices.
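A sketch of how that read-from-prod, write-to-staging split can be wired up (the bucket names and asset are hypothetical; the assumption here is the environment variable Dagster Cloud sets inside branch deployments):

```python
import os

from dagster import Definitions, asset

# Dagster Cloud sets this variable to "1" inside branch deployments,
# so the same code can write to a staging bucket on branches and to
# the production bucket once merged.
IS_BRANCH = os.getenv("DAGSTER_CLOUD_IS_BRANCH_DEPLOYMENT") == "1"
OUTPUT_BUCKET = "staging-bucket" if IS_BRANCH else "prod-bucket"

@asset
def daily_order_summary_model():
    # Reads still target the production warehouse; only the write
    # destination switches between environments.
    print(f"writing results to {OUTPUT_BUCKET}")

defs = Definitions(assets=[daily_order_summary_model])
```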
Then, of course, when I actually merge this model into production, I can tap into all of those things we were just talking about: setting things up on a schedule, and running things in a hybrid environment where we're able to tap into large compute substrates to execute things at scale. Dagster in production has all of the things that you'd expect, like run retries, concurrency limits, and alerting, and in addition, Dagster Cloud has a lot of those enterprise checkboxes that big organizations are looking for: granular role-based access control, for example, and an audit log of everything that's changed in the system.