From YouTube: Data Engineering Podcast: Running Dagster in Production
Description
Tobias Macey from the Data Engineering Podcast discusses migrating from a cron job to Dagster, including the resulting tech stack (Pulumi, Packer, SaltStack, HashiCorp Vault, Consul, Vdist/FPM) and the Dagster features he used (scheduler, resources, hooks, assets, etc.).
🎞 Slides 🎞
MIT Open Learning & Dagster (Tobias Macey) ➡️
https://docs.google.com/presentation/d/1TKL9kem6SDyPr0MADOQIRqwvFgHOF7_gJ9Hqdubiuhs/edit
🌟 Socials 🌟
Follow us on Twitter ➡️ https://twitter.com/dagsterio
Check out our GitHub ➡️ https://github.com/dagster-io/dagster
Join our Slack ➡️ http://dagster-slackin.herokuapp.com
Visit our Website ➡️ https://dagster.io/
Check out our Documentation ➡️ https://docs.dagster.io/
All right, yeah, so I've been using Dagster, got it into production probably on the order of about a month ago now. So this is just kind of recapping my overall experience, from the problem that I was dealing with, of having a homegrown cron job running with poor visibility and error reporting, to using Dagster to replace that and serve as the foundation for bootstrapping other pipelines in my organization.
So a bit of backstory first: I actually met Nick a while before he went public with Dagster, and he told me that he had been using the Data Engineering Podcast as one of the critical sources for the research he was doing that ultimately led to his decision to build Dagster and identify that as a problem space that was worth pursuing.
So we had a script that would take SQL dumps of some select tables from the platform, from a MySQL database. It would run exports of the actual courseware, which is a shell script that generates a blob of XML files that we then tar up and send off to S3, and it would also dump the contents of a database of forum responses, to analyze some of the communications between students among themselves and also with the instructors. All of that would be bundled up and uploaded to S3 using a date-stamped path, so we would have one extract per day.
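As a rough illustration of that last step, here is a minimal sketch of bundling a day's extract and pushing it to a date-stamped S3 path with boto3; the bucket name, local paths, and function name are placeholders rather than the actual project code.

```python
import datetime
import tarfile

import boto3  # assumes AWS credentials are already configured in the environment


def upload_daily_extract(local_dir: str, bucket: str = "example-extracts-bucket") -> str:
    """Tar up a day's extract directory and upload it to a date-stamped S3 path."""
    date_stamp = datetime.date.today().isoformat()  # e.g. "2020-10-14"
    archive_path = f"/tmp/extract-{date_stamp}.tar.gz"

    # Bundle the SQL dumps, courseware XML, and forum dump into one archive.
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(local_dir, arcname=date_stamp)

    # One extract per day, keyed by date.
    key = f"daily-extracts/{date_stamp}/extract.tar.gz"
    boto3.client("s3").upload_file(archive_path, bucket, key)
    return key
```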
So, you know, as I said, we had a custom Python script, it ran on a cron job, and it would post some very poorly formatted output to Slack each time it ran, just kind of dumping the contents of standard out. It was just very noisy and was quickly ignored. We didn't really pay a lot of attention to that Slack channel because it was really hard to get any useful information out of it.
So, in order to break down the problem, I took each of the different sources that we were pulling from, where each SQL dump would be a separate source, and then the mongodump, and broke that down into its own solid definition. So there would be one solid definition for each database table we would need to export, or for the Mongo collection, and it all starts with that.
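As a sketch of that breakdown, using the legacy solid API from the 0.8/0.9-era releases; the table name, resource keys, and the `dump_table` helper are hypothetical stand-ins for the real definitions.

```python
from dagster import solid


# One solid per table to export; the MySQL connection and the daily extract
# folder come in as resources defined elsewhere in the repository.
@solid(required_resource_keys={"mysql_db", "results_dir"})
def export_auth_user(context) -> str:
    """Dump a single table to a file in the daily extract folder."""
    out_path = f"{context.resources.results_dir}/auth_user.csv"
    context.resources.mysql_db.dump_table("auth_user", out_path)  # hypothetical helper
    context.log.info(f"Exported auth_user to {out_path}")
    return out_path
```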
I also wrote a resource definition for the daily extract folder, so that it would be consistent in terms of creating the folder and then removing the folder at the end of the pipeline, just being able to take advantage of those lifecycle hooks that the resource capabilities in Dagster bring. Dagster also has much better logging and error reporting out of the box, and the visibility with Dagit, being able to go in and see: this is the stage that failed.
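That folder resource might look roughly like the following, using Dagster's generator-style resource so that setup happens before the `yield` and cleanup happens after the run finishes; the config key and temp-directory approach are assumptions for illustration.

```python
import shutil
import tempfile

from dagster import resource


@resource(config_schema={"date_stamp": str})
def daily_results_dir(init_context):
    """Create the day's extract folder, hand the path to solids, clean it up afterwards."""
    path = tempfile.mkdtemp(prefix=f"extract-{init_context.resource_config['date_stamp']}-")
    try:
        yield path  # solids receive this via context.resources
    finally:
        shutil.rmtree(path, ignore_errors=True)  # teardown at the end of the run
```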
So I wanted to be able to alert on cases where the pipeline failed. For that I'm using a service called Healthchecks and, as I mentioned, that's just an endpoint where we send a POST request when the task finishes, and it also gives you the possibility of sending negative acknowledgements. So if the task fails, you can signal that as well, and then that has integrations: the default one we use is email, but you can also have it post to Slack or various paging services. And so that way we know on a daily basis.
If I don't hear anything, then that's a good sign. It means that everything's up and running, but if it does fail I'll get a notification, so I know to go in and use the tools available in Dagster and Dagit to debug the pipeline. So for integrating with that, I actually wrote a hook definition.
First, I wrote a resource definition for being able to communicate with Healthchecks, and then wrote a hook that wraps the pipeline definition with a success and failure event, so that if the pipeline fails it'll tell Healthchecks that there was a failure event, I'll get notified of that, and then I can go in and debug.
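One way to wire that up, sketched with a resource plus the success/failure hook decorators that appeared around the 0.9/0.10 releases; the ping URL handling and client class are illustrative rather than the actual implementation, and note that hooks attached at the pipeline level fire per solid event.

```python
import requests

from dagster import ModeDefinition, failure_hook, pipeline, resource, success_hook


@resource(config_schema={"ping_url": str})
def healthchecks(init_context):
    """Thin Healthchecks client: ping the check URL on success, append /fail on failure."""
    url = init_context.resource_config["ping_url"]

    class HealthchecksClient:
        def ok(self):
            requests.post(url, timeout=10)

        def fail(self):
            requests.post(f"{url}/fail", timeout=10)

    return HealthchecksClient()


@success_hook(required_resource_keys={"healthchecks"})
def notify_healthchecks_success(context):
    context.resources.healthchecks.ok()


@failure_hook(required_resource_keys={"healthchecks"})
def notify_healthchecks_failure(context):
    context.resources.healthchecks.fail()


# Hooks applied at the pipeline level run for each solid's success/failure event.
@pipeline(
    mode_defs=[ModeDefinition(resource_defs={"healthchecks": healthchecks})],
    hook_defs={notify_healthchecks_success, notify_healthchecks_failure},
)
def daily_extract_pipeline():
    ...  # solids composed here
```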
The other element that I realized, once I first started working on getting it into production in our QA environment, is that I went to the Dagit UI and saw that there were no authentication or authorization restrictions. I could just go directly to the website, and if I clicked on the link for the Dagster instance definition, there would be secret values there or in the pipeline configuration.
I think it was the 0.8 release that added the capability of having pipelines deployed independently of each other and having Dagit be able to communicate between them, and I wanted to be able to take advantage of that and have each of the pipeline definitions siloed in terms of the deployment and development cycle.
So I went through the process of writing some tooling to package up the pipeline definition, all of its dependencies, and the version of Python into a Debian package, so that it's a single object that I download onto the host machine and use the standard apt tools to just install it. Then, when there are new versions, I just repackage the project with a different version number, install it using apt, and it installs cleanly over the previous version.
So it gives me a very clean way to upgrade in place without having to rebuild everything from scratch all over again. There's a tool, and I've got links to all the tools I used at the end, but it's called vdist, that just gives you the option of packaging all that information up into a Debian package, or an RPM if that's your flavor of choice. So, yeah, the tools used: I'm using a tool called Pulumi for provisioning all the infrastructure on AWS. It's similar to Terraform, but it's just infrastructure as code, so I'm able to version it and run it all cleanly. Packer is for building the EC2 machine image for the Dagster application, so it packages up all the pipeline code and dependencies so I have a basis to work from, as well as bringing in all the Caddy definitions.
Then I'm using Consul for service discovery, so that in the process of writing my Dagster code I don't have to worry about how to determine what the source destination is for being able to communicate with it. In terms of the Dagster features that were very useful in getting all of this running and into production: using the scheduler for being able to set up the timelines and make sure that the job runs every day.
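For the daily schedule, a minimal sketch with the legacy `@daily_schedule` decorator; the pipeline name, execution time, and run config shape are assumptions for illustration.

```python
import datetime

from dagster import daily_schedule


@daily_schedule(
    pipeline_name="daily_extract_pipeline",
    start_date=datetime.datetime(2020, 9, 1),
    execution_time=datetime.time(hour=4, minute=0),
)
def daily_extract_schedule(date):
    """Return the run config for the given execution date."""
    return {
        "resources": {
            "results_dir": {"config": {"date_stamp": date.strftime("%Y-%m-%d")}},
        }
    }
```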
There are also the resource definitions for being able to extract out the communication with my MySQL database and the daily folders for uploads; taking advantage of some of the resource definitions available in the Dagster ecosystem, such as the dagster-aws package for communicating with S3; and then also being able to use pipeline presets to separate out the configuration loading for different deployments of edX, because I have one environment that's for students at MIT and then another deployment.
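A sketch of how presets can separate out configuration per deployment; the preset names, config file paths, and mode are hypothetical.

```python
from dagster import ModeDefinition, PresetDefinition, pipeline


@pipeline(
    mode_defs=[ModeDefinition("default")],
    preset_defs=[
        # One preset per edX deployment, each pointing at its own config file.
        PresetDefinition.from_files(
            "residential_mit", config_files=["run_config/residential.yaml"]
        ),
        PresetDefinition.from_files(
            "global_learners", config_files=["run_config/global.yaml"]
        ),
    ],
)
def edx_daily_extracts():
    ...  # solids composed here
```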
Then I think it was maybe in 0.9 that hooks were added, for being able to have that on-success and on-failure logic attached to the health check system. And then, for being able to plan for future uses of Dagster, using the workspace definitions, so that I can have Dagit running as a service pointed at a single YAML file that tells it where to communicate with all the different pipelines, and have a stable and flexible way of bringing on new use cases. And then, within the pipeline definition itself, having the use case for being able to write expectations, so that I can see if certain quality checks are passing or failing; asset materializations, to understand what files are being produced by these pipelines; and then having flexible metadata to be able to generate another source of information that can be used for future analysis of how the use of this pipeline is trending.
How are the volumes of data that it's processing changing over time? So just a lot of really useful capabilities there.
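Inside a solid, the expectations and materializations look roughly like this, using the legacy event APIs; the asset key, labels, and metadata shown are illustrative.

```python
import os

from dagster import AssetMaterialization, EventMetadataEntry, ExpectationResult, Output, solid


@solid
def upload_extract(context, archive_path: str):
    size_bytes = os.path.getsize(archive_path)

    # Quality check surfaced in Dagit as a pass/fail expectation.
    yield ExpectationResult(
        success=size_bytes > 0,
        label="non_empty_archive",
        description="Daily extract archive should not be empty",
    )

    # Record what was produced, with metadata that can be trended over time.
    yield AssetMaterialization(
        asset_key="daily_extract_archive",
        metadata_entries=[
            EventMetadataEntry.path(archive_path, "local path"),
            EventMetadataEntry.int(size_bytes, "size (bytes)"),
        ],
    )

    yield Output(archive_path)
```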
Here's a quick diagram of the overall data flow: on the left, you see Dagster with its Postgres database in its own VPC.
So that's where we are now. In terms of next steps, I'm looking at using the Pants build tool for being able to use a monorepo structure, having all my pipeline definitions in one source control repository but being able to build and package all of the pipelines separately and deploy them independently from each other, while still having shared resource definitions and common libraries that can be used across those pipelines.
I'm going to be putting up a package archive to make it easier to upload and deploy new versions of the pipeline code as Debian packages, so using something like Artifactory or the Pulp project. As the use case grows, I'll probably end up bringing in Dask for being able to distribute the overall workload.
And then, to simplify onboarding new use cases, I'll be creating some pipeline templates using a tool called Copier, similar to Cookiecutter, to provide some scaffolding to users who want to write their own pipelines, so that I don't have to be hands-on with all of those new use cases; and then probably new resource definitions as time goes by, for things like Vault, for being able to pull in secret values in memory at runtime so they don't have to sit on disk as a potential source of compromise.
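That Vault resource would probably look something like this sketch built on the hvac client; the config keys, secret path, and KV v2 layout are assumptions, since this integration doesn't exist yet.

```python
import hvac
from dagster import resource


@resource(config_schema={"vault_addr": str, "vault_token": str, "secret_path": str})
def vault_secrets(init_context):
    """Fetch secrets into memory at run time instead of leaving them on disk."""
    cfg = init_context.resource_config
    client = hvac.Client(url=cfg["vault_addr"], token=cfg["vault_token"])
    secret = client.secrets.kv.v2.read_secret_version(path=cfg["secret_path"])
    return secret["data"]["data"]  # dict of key/value pairs for solids to use
```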
So the things that went really well are the flexibility and granularity of the Dagster framework: being able to define resources and have them be reusable, having the solid definitions be easy to define and hook together in various pipeline formations, using the hooks for success and failure events, and asset tracking and materializations for rich metadata.
The fact that it doesn't have a prescribed deployment methodology is useful, because it means that I can use my existing tooling to deploy and manage it without having to buy into another ecosystem such as Kubernetes if I don't want to, while still having that as an option.
Getting the cron scheduling set up is still a little bit opaque and requires a bit of manual intervention, but all in all it worked well. Being able to add in new ways to define scheduling, maybe doing it dynamically or with better granularity, or, you know, trigger-based scheduling, would be useful; for that I'm probably going to be hooking into the GraphQL API.
So yeah, if you want to find me online or follow up with the work I'm doing, here are the places you can find me. I run the platform and data engineering team at MIT Open Learning. I host the Data Engineering Podcast and Podcast.__init__, which some of you folks might be familiar with, and you can find me on LinkedIn or Twitter; I'm not very active there, but I do exist. And then all the code that I'm using for building the pipeline and managing deployment is actually open sourced.