From YouTube: Orchestrating dbt with Dagster
Description
dbt defined an entire new subspecialty of software engineering: Analytics Engineering. But it is one discipline among many: analytics engineers must collaborate with data scientists, data engineers, and data platform engineers to deliver a cohesive data platform. In this video, Nick Schrock of Elementl talks about how orchestrating dbt with Dagster allows you to place dbt in context, de-silo your operational systems, improve monitoring, and enable self-service operations.
So here's our agenda. Well, there's this intro, and our opportunity to introduce our emojis: the speaker from Drizly yesterday really led by example, so we decided to copy her presentation, where she used our emojis. We're going to talk about this intro, and then we're going to talk about what is Dagster, the, I guess, elusive Dagster, and how it relates to dbt, because we often get the question of "should we use Dagster or dbt?", and it's understandable, because they both have dependency graphs, for example.
But we want to make it clear that we view them as complementary tools with very aligned values. Then, as mentioned, there's going to be a live demo, also some sample code, and then we'll have a discussion about the evolution of roles within the modern data platform, and then a brief conclusion.
So why would you use this thing? You use Dagster when you need to operate and deploy dbt alongside other tools within the context of a project or a larger data platform. So who would be the type of personas who would do this? Well, you're at a dbt conference, you're at dbt Coalesce, so very likely you're an analytics engineer. But often you might be left without support, and you also have to self-serve your own infrastructure, because you have to orchestrate your computations, or help out a machine learning person or a data scientist.
This morning Drew, one of the founders of Fishtown, talked about a postmodern data platform. I'm looking forward to seeing what that is; I was unable to catch the presentation. So when you spin up a new analytics team, a lot of times it looks like this: you have Fivetran or Stitch or a similar tool doing ingest, which is effectively replication into a cloud data warehouse like Snowflake, Redshift, or BigQuery.
You have analytics engineers who are writing templated SQL within the context of dbt to produce the consumable assets, and that is then consumed downstream by tools like Mode or other BI tools, which interface directly with the cloud data warehouse using SQL.
Now an interesting thing comes up here. We actually view this, and deem it internally, the modern analytics stack, and it begs a question, which we ask: is orchestration even necessary? One of the appealing things about this stack as presented is that it's very lightweight operationally, because you can delegate a lot of the ops to an out-of-the-box tool like Fivetran, or to the cloud data warehouse.
But that's not the world we see. What we see in reality is a data platform at every company customized for its own domain; every company and organization we see has customized needs that go beyond just a cloud data warehouse. So let's take an example here. Let's imagine you had a legacy data lake, where you had data engineers writing Spark, producing data assets on top of S3, and then you want to introduce this modern analytics stack.
Now you set up your dbt installation, you start using dbt within the data warehouse, and you start using Mode or a similar tool to consume it for downstream analytics. But there's a whole set of other use cases here, like data scientists who want to use Python to build machine learning pipelines on top of the data in the data warehouse. And this architecture that I'm laying out, this instance of the architecture, is a simple one. We see architectures like this with engineering teams as small as one, two,
or three people, maybe serving a couple of analysts. Then, going to a larger organization, you just have an explosion of complexity. So we think that orchestration is the beating heart of a data platform like this. If you visualize it, at almost every edge you have the orchestrator instigating and managing the computations. That puts the orchestrator in a unique place, where it is interacting with every computational runtime and tool.
Every single persona, either directly or indirectly, has to deal with it and interface with it, and then, by extension, it is interacting with or instigating computations which store data in every single data store in the system. It is the common denominator throughout the data platform, and so naturally we think that Dagster is a good fit for this. Okay.
So what is Dagster, the elusive Dagster? Our shorthand for it is "the data orchestrator," but a slightly longer description that we have for it is: Dagster is an orchestration platform for producing trusted data assets. I think the key word here is "trusted." Doing that is a really challenging problem, and in order to do it, we really think about, model, and manage the full application lifecycle.
So what are we talking about? What outcome are we trying to achieve? During the development and test phase, we want to be able to efficiently build well-structured, testable computations, and I really want to emphasize the word "testable," because making these systems testable is extremely challenging. You need to design for it from first principles, and it's always been a goal of the project since day one. Next, you want to reliably execute, debug, and operate those computations.
And lastly, the end goal of these systems is to produce data assets that are consumed by your downstream stakeholders. That is why these systems exist, so we think it's natural for a data orchestrator to have out-of-the-box data observability. And I want to note that all of these outcomes mutually reinforce each other: by building well-structured, testable computations, you make it more likely that you're going to reliably execute, debug, and operate those computations, by having a programming model and an API.
A couple of things, again, reinforce each other. Because we designed for testability, our actual infrastructure is very pluggable, meaning you can execute it in a wide variety of contexts, whether that's CI/CD systems or, you know, some people want out-of-the-box Kubernetes support, which we provide. But you don't need to use Kubernetes in order to use this system. It's designed to be cloud native, but it doesn't prescribe any specific vertically integrated infrastructure.
So again, let's put up this lifecycle, and let's start with "efficiently build well-structured, testable computations." Well, if I'm an analytics engineer, I'm thinking: I already have a tool that can do that. I love dbt. I can execute locally, I can build these well-structured compute DAGs of Jinja-templated SQL, I can inject data quality tests. It's great. Why would I need Dagster to do that? And we agree.
If you're an analytics engineer and you're embedded within a Dagster-enabled data platform, you are using the tools you know and love to develop dbt-driven computations within the data warehouse. The difference between these things is that dbt is for SQL-only transformations within the data warehouse, whereas Dagster is a generalized compute framework. The only time you would be interacting with the Dagster local development experience
is when you're dealing with heterogeneous data tools. You might be dependent on some upstream tool, or someone might be complaining about your data assets being out of date, and really a focus of Dagster is enabling folks like analytics engineers to self-serve ops in these cases where something has gone wrong, because you often need to interact with your production systems in order to unblock people and get your job done. All right, we're going to go on to the demo now.
My colleague Max is in the Coalesce Dagster channel, and I believe he is about to post a link to a Google Sheet. So what we're going to do here is, let's see. Max, have you posted this? Anyway, I will continue on; I do not see the Google Sheet yet. Okay, great. I want people to have the opportunity to fill this out
as I talk through the structure of this pipeline. You're going to fill out some information in a Google Form, and it's going to be ingested into Google Sheets. We're going to consume that with pandas, and we do that because we want to use some timestamp manipulation stuff in pandas that's a pain in the butt to do with SQL. We're going to load that into Snowflake and then do some aggregations using dbt.
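The exact transform in the demo isn't shown here, but as a minimal sketch of the kind of timestamp manipulation that's easy in pandas and clunky in SQL (the column names and data below are made up for illustration):

```python
import pandas as pd

# Hypothetical form responses as they might arrive from Google Sheets:
# timestamps come through as plain strings.
raw = pd.DataFrame({
    "country": ["US", "US", "AO"],
    "submitted_at": ["2020-12-08 06:30:00",
                     "2020-12-08 06:45:00",
                     "2020-12-08 14:10:00"],
})

# Parse the strings into real timestamps, then count responses per hour
# of day -- a one-liner in pandas.
raw["submitted_at"] = pd.to_datetime(raw["submitted_at"])
per_hour = raw.groupby(raw["submitted_at"].dt.hour).size()
```

A solid wrapping logic like this would then hand the frame downstream to the warehouse-loading step.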
There we go. So here you can see we're doing this Google Sheets ingest process, we're spitting out a data frame, and we're doing some post-processing. And look at all this rich metadata: there are user-defined descriptions, there are types. You'll notice that every node in this graph (we call each node a "solid") has inputs and outputs. This helps accessibility.
The DAG is typed, which both makes it self-descriptive and makes it more reliable. We ingest that pandas data frame into the warehouse. This is the node that actually runs the dbt model, and here's the node that runs the Jupyter notebook, which we can actually render inline, which is cool; here's a preview of what you're going to see. And then we're actually going to post this to Slack. So why don't we just go for it? All right, one thing you'll notice:
here we have these presets. I just ran this in test mode, which pushed it to a secret channel we have. I'm going to switch this, and you'll notice that these pipelines are configurable: you can parameterize them and run them in different contexts. And this config system (I'm not doing a deep demo of it) is actually fully typed and self-describing, which is just super, super powerful.
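To illustrate the idea of typed, self-describing config (this is a toy sketch, not Dagster's actual config API; the schema and field names are invented):

```python
from dataclasses import dataclass, fields

# Toy sketch of a typed config schema: each field declares a type, so
# supplied config can be checked before a run launches, and the schema
# itself documents what the pipeline expects.
@dataclass
class SlackConfig:
    channel: str   # where to post the rendered notebook
    dry_run: bool  # push to a test channel instead of the real one

def validate(schema, values):
    """Return a list of type errors for user-supplied config values."""
    errors = []
    for f in fields(schema):
        if f.name not in values:
            errors.append(f"missing field: {f.name}")
        elif not isinstance(values[f.name], f.type):
            errors.append(f"{f.name}: expected {f.type.__name__}")
    return errors

ok = validate(SlackConfig, {"channel": "#demo", "dry_run": True})
bad = validate(SlackConfig, {"channel": "#demo", "dry_run": "yes"})
```

The payoff of a schema like this is exactly what's demoed: bad config is rejected with a precise message before anything executes, and tooling can render the schema as documentation.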
When we launch, here we go, we're using this multiprocess executor. This is the Gantt viewer. This is the DAG we're going to execute, and you'll see this is a live-updating, beautiful UI that gives you a live view of what's going on. This works both locally and in prod.
Now, each one of these blue bits, you see this thing, it's "preparing": this is because we're using the multiprocess executor, which spins up a unique process for every single step in the system. That gives it process isolation, etc., and it incurs a little spin-up time. So here, actually, I'll just show you. Watch this; this is nice.
I'm going to show this structured event log. This is much more than just unstructured logging, which I call "developers thinking aloud." These are actually structured event logs, and we can do stuff like this: we can view a preview of the data that was flowing through the system. Note this timestamp over here, and then we can say:
oh, the output. Oh great, it's telling me how to spell things. And then you can see the munged data over here, so you can get a sense and observe whether your stuff is actually working, and often the sample data is sufficient for that. Okay, the dbt run is logged. I went a little slow, so it didn't show the live-updating log, but it actually worked. And now you'll notice that our integration emits different structured events. So what this does is:
it emits an event for every single model in your dbt graph, with all this interesting structured information about it. And now we can go over here, and we have all this interesting information. You can see this is the run that last touched it; it's still running; you can see what's going on. As you go down here, you can see familiar concepts, like it's materializing this as a table; here's the database, here's the schema, etc.
Now you can go down here, and there's all this information. For example, we can kind of give it away here: you can see that around 6:30 this morning I woke up and ran this a couple of times; then I had a meeting with a colleague and ran it a few times; etc. And look here, it just shows you all these nice mouseovers. Anyway, this is a really nice observability tool, and you can do things like this:
let's say we wanted to look this up. You can just hop directly to it, and you can see information about your assets. All right, let's go back to this. We've executed now, and now I'm going to go to another step where we're posting to Slack. I'm actually going to filter this down to materializations. We've actually executed the notebook, and now we've pushed this to our Slack channel.
Open this up. No, I need to open up the link, and here we go: here's the information about where folks are coming from. It looks very US-centric, but I think there's a little action here. I believe that's, is that Angola? I'm going to betray myself there. But anyway, we have a live demo, and let's see if it posts to Slack. There we go. Okay.
Well, thank heavens, that worked; that's always terrifying. So let me just quickly go through what the code that enables that looks like. Let's look at this DAG, and I'm just going to quickly show some Python code examples of what this looks like, just so you get a sense. This is by no means a tutorial. So this is our DAG structure. In order to define this DAG, we call a pipeline.
All of the solids take inputs and outputs, so we actually just use Python syntax to flow the data through this pipeline. It's very straightforward and intuitive, and this is actually constructing the DAG. Next we have these solids. A solid is a node in our graph, kind of the leaf node, which performs the actual computation. You'll see here it's a function that takes a data frame, produces a data frame, and then it's just performing some compute. You can see this infer-datetime-format is a capability in pandas that we wanted to use.
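As a simplified stand-in for the pattern being described (this toy is not the real Dagster API, which at the time used @solid and @pipeline decorators with typed inputs and outputs): each node is a plain function, and ordinary Python call syntax wires the DAG together.

```python
def solid(fn):
    # In real Dagster the decorator attaches metadata, input/output
    # definitions, and types; here it just tags the function.
    fn.is_solid = True
    return fn

@solid
def ingest_sheet():
    # stand-in for pulling form responses out of Google Sheets
    return [{"country": "US"}, {"country": "AO"}]

@solid
def postprocess(rows):
    # stand-in for the pandas munging step
    return [dict(r, processed=True) for r in rows]

@solid
def load_warehouse(rows):
    # stand-in for loading the frame into Snowflake; returns row count
    return len(rows)

def pipeline():
    # Plain function-call syntax expresses the dependencies:
    # ingest -> postprocess -> load
    return load_warehouse(postprocess(ingest_sheet()))
```

The design choice worth noting is that data dependencies are expressed as function composition, so the framework can recover the DAG from how outputs flow into inputs.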
And then, to integrate with dbt, we have a library that was community-driven and then donated to our monorepo. Thank you very much, David Wallace. Here you can just take the dbt CLI run solid, point it to your project and profiles, and you are off to the races. So I just wanted to quickly go through this: I made the claim earlier that dbt and Dagster have highly aligned values, and I want to run through that quickly.
So again, the difference between them is that dbt's domain is exclusively SQL (Jinja-templated SQL), and Dagster's is generalized compute in Python. So you have these analogous concepts. For functional compute dependencies, in dbt it's the models, which are defined by SELECT statements (you can just think of them as functions), and they're dependent on other models via the ref syntax, and that forms your dependencies. Dagster is very similar.
You have solids, which are meant to be effectively pure business logic, and they have logical inputs and outputs, which would be similar to models. Next, there's a real focus on fast dev workflow: in dbt you can view docs locally, you can use the dbt Cloud IDE, and then... I think I'm running out of time, so I'm going to skip this, but effectively there are very analogous concepts.
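Both systems resolve those declared dependencies into a DAG and run nodes in dependency order, which is just a topological sort. A small sketch of that shared idea, with made-up model names and dbt-style ref() edges:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical dbt-style project: each model lists the models it ref()s,
# the dependencies that {{ ref('...') }} calls would imply.
refs = {
    "stg_responses": [],
    "responses_cleaned": ["stg_responses"],
    "responses_by_hour": ["responses_cleaned"],
    "responses_by_country": ["responses_cleaned"],
}

# TopologicalSorter maps {node: predecessors} to a valid execution
# order, which is what both dbt and Dagster compute before running
# anything.
order = list(TopologicalSorter(refs).static_order())
```

The same sort applies whether the nodes are SQL models in a warehouse or Python solids in a generalized pipeline; only the runtime differs.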
So I want to talk about one more subject quickly while I have some time, and that is how the general ecosystem is developing. There's this line from Marshall McLuhan: "we shape our tools, and thereafter our tools shape us." What that means is that as we build or shape tools, it affects the way people work, it affects their career paths, and therefore it affects organizational structure.
Before dbt there were data engineers and analysts, and in order to deliver an end-to-end capability, the analyst actually had to talk to a data engineer in order to get a production asset produced. This formed what some people have deemed "data breadlines." So here's a breadline: it's a big line of people waiting for bread, and, as before, the data engineers are somewhere off-screen to the right, and the analysts and business users are somewhere in this perimeter. I don't know the guy.
Data engineers are responsible for the core assets and infrastructure, and analytics engineers are responsible for all the consumable assets in the data warehouse. But this is not a complete picture of what's going on. What we're seeing is that, just as most data platforms consist of heterogeneous tools, there are also many, many different roles or jobs happening within a data platform. By "job" I mean it's your job to produce an asset, it's your job to do an ML model, it's your job to produce the platform.
So if you think about this, you have data engineers responsible for maybe some core assets; you have analytics engineers using dbt, data scientists, and other subject matter experts, who are responsible for delivering assets; and then you have this emerging category of people who work exclusively on the platform. They are setting up everyone else to be successful within the context of their tool. But this current state of things is usually aspirational.
There are a lot of problems here. Let's go back to our data breadline. The original breadline was about the data assets themselves: tables, columns, etc. But there's this whole other domain. Right now, often there's an ops breadline: the moment that something goes wrong, analysts and business users kind of fall off a cliff, and they need to go to the data engineers in order to deal with their ops. We view Dagster as an empowerment tool on the ops dimension.
I was kind of inspired by the Apple M1 launch. I'm very excited to have a laptop that I don't have to put in my refrigerator; that's literally true, I've had to do that in order to get it to run. But I thought I had a great analogy: there's this overall machine that orchestrates computations, and there are lots of specialized co-processors, and I think that's what's happening in the data landscape today. So if the data platform is the machine, dbt is like the analytics GPU.
It performs a really important, huge subset of the computations that are critical to the functioning of this thing, but it's domain-specific and very powerful, and this all has to be governed by an orchestrator. dbt and Snowflake show up in the GPU, but then there's this whole other universe of co-processors: all these other tools that exist either for fundamental reasons or for legacy reasons. This is kind of the analogy that feels like it's making sense to me.
So to summarize: dbt and Dagster are complementary systems, and I kind of view dbt as the analytics GPU in the machine. One of the core values this provides to analysts and analytics engineers is that it allows them to self-serve ops. So you might be asking: listen, I don't think I'm ever going to write any Python, so why is this thing good, why is this thing important? And the answer is that we think it will enable you to self-serve your ops.
It's designed to be testable, and it's really fun stuff to build. So thank you so much for having me. We're an open source project, so we have a GitHub and a growing Slack community; feel free to join us. And without further ado, I will pass it back and take questions in the Slack. Thank you.