From YouTube: Drizly: Adopting Dagster for a Heterogeneous Team
Description
Dennis Hume, a staff data engineer at Drizly, demonstrates how Drizly uses the Dagster framework to build a data platform that diverse data practitioners enjoy using. Dennis covers how he leverages Dagster abstractions to set up local, staging, and production environments.
🎞 Slides 🎞
Drizly & Dagster (Dennis Hume) ➡️ https://drive.google.com/file/d/1sF_eIjwzVutPxjUwtayObgYgkUoquvE4/view?usp=sharing
🌟 Socials 🌟
Follow us on Twitter ➡️ https://twitter.com/dagsterio
Check out our GitHub ➡️ https://github.com/dagster-io/dagster
Join our Slack ➡️ http://dagster-slackin.herokuapp.com
Visit our Website ➡️ https://dagster.io/
Check out our Documentation ➡️ https://docs.dagster.io/
Hi, I'm Dennis. I'm a data engineer at Drizly, and I wanted to go over how Drizly has adopted Dagster and tried to make it something that works for our entire team.
Our data team is, I think, pretty similar to most data teams: we're a collection of analysts, data scientists, and data engineers who all have very different needs from our data stack. What really brought us to Dagster was the idea that everyone can get something different out of it; there's a lot of universal appeal.
So when we started rebuilding our data infrastructure about a year and a half to two years ago, we tried to build it along this line of shared spaces. We didn't want anyone's workflow to feel very different from anyone else's. We didn't want a data scientist to feel like what they do is very different from what an analyst does, and we didn't want data silos to emerge where someone on marketing has a very different workflow from someone on strategic partnerships.
We could also hold all of our logic in one place with dbt, and then it never felt that different whether you were building a data science SQL model or a model headed for our visualization layer for marketing or something else. That worked really well. The harder question was how we were going to build that shared space for non-SQL workflows, because there the differences in needs become a lot more apparent.
But the main thing with this, going back to how Dagster appeals to different roles for different reasons, is that we really didn't want it to be a barrier that, in order to contribute to the Dagster project, you needed to know every one of these abstraction layers.
A
So
if
you're
an
analyst
just
wanting
to
get
a
pipeline
off
the
ground,
you
really
shouldn't
have
to
worry
about
the
different
workspaces
and
you
shouldn't
really
have
to
worry
about
configuring.
Your
own
resources,
because
you
should
be
just
leveraging
resources,
we've
already
used
of
already
having
defined
our
snowflake
or
dbt
resources.
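To make that concrete, here is a minimal sketch, not Drizly's actual code, of what that looks like in the legacy Dagster API of this era: the analyst's solid only declares the resource keys it needs, and the platform-owned mode definitions supply the real Snowflake and dbt resources.

```python
from dagster import solid


@solid(required_resource_keys={"snowflake", "dbt"})
def refresh_report(context):
    # Credentials and connections come from the shared resource definitions
    # attached by the platform team; the analyst never configures them.
    sf = context.resources.snowflake  # shared Snowflake resource
    context.resources.dbt             # shared dbt resource
    context.log.info(f"refreshing report via {sf}")
```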
A
So
when
we
thought
of
how
we
should
configure
these
environments
to
kind
of
work
for
everyone
and
be
consistent
across
roles,
we
started
by
thinking
how
we
should
use
modes
within
dagster.
So
we
divided
these
up
into
local
dev
and
prod
local
mode
is
every
resource
is
mocked.
This
usage
is
for
kind
of
quick
local
development
and
unit
testing
of
pipelines,
so
as
an
example,
instead
of
actually
pinging
snowflake,
this
could
just
be
some
files
saved
that
just
mimic
what
the
results
of
a
snowflake
query
would
be.
A
Our
dev
mode
can
be
mocked
or
non-production
versions
of
the
system,
so
this
could
be
for
sticking
with
that
snowflake
example.
This
could
be
something
like
pinging
a
staging
table
with
a
limit
on
the
query,
and
this
again
is
more
just
for
integration
testing
and
maybe
confirming
that
the
schema
that
we're
using
and
then
prod
is
for
production
systems,
so
how
we
kind
of
wrap
all
this
together
is
our
different
deployments
of
dagster,
and
we
have
four
different
deployments.
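Here is a hedged sketch of that local/dev/prod split using the mode API from that era of Dagster; the resource names and the staging behavior are illustrative assumptions, not Drizly's actual code.

```python
from dagster import ModeDefinition, pipeline, resource, solid


@resource
def mock_snowflake(_):
    # local: never touch the network; replay canned results saved to disk.
    return lambda query: [("fake_row",)]


@resource
def staging_snowflake(_):
    # dev: would hit a real staging table with a LIMIT tacked onto the query;
    # stubbed here to keep the sketch self-contained.
    return lambda query: []  # real version: run f"{query} LIMIT 100" on staging


@resource
def prod_snowflake(_):
    # prod: the real warehouse connection.
    return lambda query: []  # real version: run the query as-is


@solid(required_resource_keys={"snowflake"})
def copy_orders(context):
    context.resources.snowflake("SELECT * FROM orders")


@pipeline(
    mode_defs=[
        ModeDefinition(name="local", resource_defs={"snowflake": mock_snowflake}),
        ModeDefinition(name="dev", resource_defs={"snowflake": staging_snowflake}),
        ModeDefinition(name="prod", resource_defs={"snowflake": prod_snowflake}),
    ]
)
def orders_pipeline():
    copy_orders()
```

Switching environments is then just a matter of executing with mode="dev" or mode="prod"; the solid body never changes.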
A
We
have
local,
which
is
just
running
dexter
from
within
a
virtual
environment,
and
this
is
specific
to
one
workspace,
and
this
is
for
just
again
quickly
getting
your
pipeline
to
to
be
able
to
compile
and
be
able
to
just
like
check
it.
A
We
have
a
dexter
compose
setup,
which
again
is
just
for
your
local
machine,
but
this
starts
to
bring
in
more
of
the
extra
dependencies,
such
as
the
postgres
database,
the
daemon
we're
working
with
on
creating
a
broker,
and
so
this
is
for
just
kind
of
more
involved
testing.
So
you
can
get
that
see
that
every
aspect
of
your
pipeline
is
working
correctly.
Then
after
this
we
start
to
actually
like
push
code
into
git
and
based
on
the
branch
you're
going
to
you.
A
Can
it
will
be
deployed
either
to
our
dev
environment,
which
is
just
our
aws
stack
on
our
dev
account
or
prod,
which
is
just
the
same
stack,
but
just
on
a
prod
aws
account.
The
other
thing
we
do
to
kind
of
limit
some
of
the
confusion
over
this.
The
different
deployments
are,
we
do
filtering
across
our
different
deployments,
so
we
do
filtering
on
pipelines,
modes,
presets
and
schedules,
and
this
is
just
to
make
it
easy
to
know
what
all
you
should
have
available
to
you
within
a
specific
deployment.
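The mechanics could look something like the following; the `DAGSTER_DEPLOYMENT` variable and the visibility map are assumptions for illustration, not Drizly's implementation.

```python
import os

from dagster import ModeDefinition, PresetDefinition, pipeline, solid

# Which deployment this process is running in (hypothetical convention).
DEPLOYMENT = os.getenv("DAGSTER_DEPLOYMENT", "local")

# Mode names each deployment should expose.
VISIBLE = {
    "local": {"local"},
    "compose": {"local", "dev"},
    "dev": {"local", "dev"},
    "prod": {"prod"},
}[DEPLOYMENT]

all_modes = [ModeDefinition(name=n) for n in ("local", "dev", "prod")]
all_presets = [PresetDefinition(name=n, mode=n) for n in ("local", "dev", "prod")]


@solid
def copy_orders(_):
    pass


@pipeline(
    # Only attach the modes and presets that belong in this deployment, so
    # Dagit never offers a prod mode on a laptop.
    mode_defs=[m for m in all_modes if m.name in VISIBLE],
    preset_defs=[p for p in all_presets if p.mode in VISIBLE],
)
def orders_pipeline():
    copy_orders()
```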
Our local deployment is very much just running Dagit against a specific repo. At this point, as I mentioned, the only mode you have available is local, and the instance configuration is empty because there are no dependencies. So this is what the instance looks like, and with object filtering you just have access to the local mode and local presets, and none of the schedules are available.
In the docker-compose setup we can use pretty much what we're going to be using in production, except we can use the Docker run launcher instead of the custom ECS launcher we'd use in AWS. This again allows us to keep moving our code along and get closer and closer to what it would look like in actual production, without having to get quite to that step yet. Here the instance is different, because we have a different workspace with multiple repositories.
A
Now
we
have
the
bi
repository
and
the
data
science
repository,
which
is
broken
in
this
screenshot
and
I
think,
is
going
to
be
broken
in
my
demo.
And
then
you
can
see
that
the
the
dagster
daemon
is
running
in
this
environment
and
for
object
filtering.
You
now
have
access
to
both
local
and
dev
for
presets
and
modes
and
now
have
schedules
present,
so
our
dev
environ,
our
dev
deployment,
is
just
again.
This
is
what
we're
starting
to
get
to
what
it
will
look
like
in
production.
A
So
this
is
just
dagster
running
on
aws
resources,
so
we
are
just
kind
of
built.
Our
own
stack
around
ecs.
We
don't
use
ek
eks
for
our
deployment
of
dagster,
but
this
is
pretty
much
again.
It
looks
pretty
similar
to
docker
compose
but
again
we're
just
using
aws
resources
at
this
point
and
then
our
production
stack
is
the
exact
same
stack,
but
just
on
a
different
account.
A
So
one
thing:
that's
kind
of
holding
this
all
together
are
our
data.
Scientists
have
put
together
some
very
nice
cookie
cutter
templates,
and
this
allows
us
to
just
very
easily
spin
up
new
pipelines
that
adhere
to
our
deployments.
So
this
again
just
makes
it
easier
for
people
to
be
able
to
quickly
get
a
pipeline
off
the
ground
and
not
have
to
worry
about
just
all
the
all
the
infrastructure
in
the
back.
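For illustration, cookiecutter templates like these can be rendered from Python as well as from the CLI; the template repo and context keys below are hypothetical stand-ins, not Drizly's actual template.

```python
from cookiecutter.main import cookiecutter

# Render a new pipeline project from a (hypothetical) shared template repo,
# answering the template's prompts programmatically.
cookiecutter(
    "gh:example-org/dagster-pipeline-template",
    no_input=True,
    extra_context={"pipeline_name": "orders_pipeline"},
)
```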
A
You
can
just
focus
on
the
logic
of
your
pipeline,
which
just
makes
this
easier
to
get
set
up
and
running
so
quick
demo,
I
can
just
kind
of
this
will
be
pretty
similar
to
the
slides,
but
we
can
just
go
through
a
pipeline
that.
At this point the resources being used are Snowflake and S3, but since we're in local mode, these will all just be mocked. We can run this pipeline, and it will run, but it won't actually be doing much, because it's not connected to S3. This is pretty much just to make sure that our pipeline compiles, that we can see it in the Dagster UI, and that we can write unit tests against it.
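Here is a minimal sketch of what such a unit test could look like under the legacy API, assuming a mocked S3 resource stands in for the real bucket (names are illustrative):

```python
from dagster import ModeDefinition, execute_pipeline, pipeline, resource, solid


@resource
def mock_s3(_):
    # Stand-in for S3: the listing the real bucket would return.
    return ["raw/orders/2021-01-01.csv"]


@solid(required_resource_keys={"s3"})
def list_files(context):
    return context.resources.s3


@pipeline(mode_defs=[ModeDefinition(name="local", resource_defs={"s3": mock_s3})])
def ingest_pipeline():
    list_files()


def test_ingest_pipeline():
    # The whole pipeline executes in-process with no network access.
    result = execute_pipeline(ingest_pipeline, mode="local")
    assert result.success
    assert result.result_for_solid("list_files").output_value() == [
        "raw/orders/2021-01-01.csv"
    ]
```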
This time it's actually going to be hitting our dev S3 bucket, and you can see this is still kind of a work in progress, but we actually did hit the dev S3 bucket, did the mapping on just a sample of what the files would look like, and generated the copy statement.
A
The
snowflake
is
still
mocked
at
this
point,
so
we're
not
actually
running
the
the
copy
into
our
table.
The
other
thing
that's
different
in
our
docker
compose
setup
is
just
since
we
have
the
daemon
running.
We
can
see
a
schedule
for
this,
and
one
thing
with
the
way
we
do
schedules
is
just
we
have.
A
Multiple
schedules
for
the
same
pipeline
that
are
just
keyed
to
the
different
modes.
So
again,
there's
one
for
dev
and
one
for
prod.
But
if
you
look
in
the
schedules
for
this
environment,
you
only
see
the
schedule
for
mo
for
development
and
that
again
is
just
to
kind
of
keep
it
a
little
simple.
And
you
don't
have
to
make
changes
by
like
testing
something
out
in
one
environment.
And
then,
if
you
forget
to
change
it,
it
accidentally
getting
pushed
to
prod
or
something
like.
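In code, that pairing might look like the sketch below (names and cron strings are illustrative); combined with the deployment filtering shown earlier, the dev environment only ever loads the dev schedule, so nothing can accidentally fire against prod.

```python
from dagster import ScheduleDefinition

# Two schedules for the same pipeline, keyed to mode; deployment filtering
# decides which one a given environment actually loads.
orders_schedule_dev = ScheduleDefinition(
    name="orders_schedule_dev",
    cron_schedule="0 6 * * *",
    pipeline_name="orders_pipeline",
    mode="dev",
)

orders_schedule_prod = ScheduleDefinition(
    name="orders_schedule_prod",
    cron_schedule="0 6 * * *",
    pipeline_name="orders_pipeline",
    mode="prod",
)
```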
If you want to learn more about bridging Dagster, dbt, and Snowflake all together in one place, our infrastructure lead Emily gave a talk at dbt Coalesce; I think that was November or December of last year. It's primarily a Snowflake and dbt talk, but Dagster is also under the hood there. And that's all I've got.