From YouTube: How to Orchestrate Airbyte Syncs with Dagster | Community Call 16 w/ Ben Pankow & Shawn Wang
Description
In this video, Ben Pankow, Software Engineer at Elementl, shows how to streamline your process by using Dagster to manage your Airbyte connections and orchestrate syncs with downstream computation using dbt. He takes you step by step through setting up Airbyte with Dagster from scratch, so even if you're new to these tools, you'll be able to follow along. And for those who are more experienced, he also shows how Dagster and Airbyte can elevate your project to new heights. Don't miss out on this demo.
Subscribe to our newsletter: https://airbyte.com/newsletter?utm_source=youtube
Learn more about Airbyte: https://airbyte.com
#dataorchestration #data #communitycall
A
Super nice to meet you, and as I noted, this call has been a long time coming, so I'm excited for people to get introduced to Dagster and see what you've been working on with the awesome integrations. I work with Krishan on the developer experience team here at Airbyte, where we help run events, make content, and basically help anyone discovering data integration, data pipelines, and data engineering along the way, and provide them whatever content they need to get familiar with Airbyte.
A
We also do stuff with data warehouses and a fair amount of the data ecosystem, but we're big personal fans of Dagster. In fact, we have a blog post talking about how we're moving our internal data team's stack to something that is orchestrated by Dagster, and I know that the Dagster team has been cooking up something cool with your integration. So maybe you want to give a little intro of yourself and Dagster, and then we can take it from there.
B
Thanks again, Shawn, for the introduction. As mentioned, my name is Ben, and I'm a software engineer at Elementl, which is the company behind Dagster, an open-source data orchestrator. Today I'll be walking through how you can use Dagster to orchestrate Airbyte, along with some of your other modern data stack (MDS) tools, using our recently updated integration.
B
The goal of a data platform or a data pipeline is to produce some sort of result: some persistent output, an artifact that we call a data asset. Depending on your use case, this could be one of a wide variety of things. Maybe it's a BI dashboard that helps guide some decision making in your organization.
B
Maybe it's an ML model, maybe it's a notebook that one of your analysts is using, or maybe it's something like a file in S3 or even a database table in your cloud warehouse.
B
All of these things are data assets produced by your pipeline, and the reason a pipeline exists is to create one of these assets to be used down the line by someone or some other process. These data assets often have dependencies. Let's say we have some sort of BI dashboard showing the performance of our product. That dashboard is likely going to rely on some sort of metrics.
B
Maybe that's a daily active users metric, and those metrics are probably transformed from some raw data: maybe actual session data that we're ingesting, transforming into daily active user metrics, and then displaying on a dashboard. In the world of the modern data stack, we have a lot of great tools that are purpose-built for building specific kinds of assets or doing specific kinds of transformations.
B
So here already we have three different tools, but in a larger data platform this can grow to an even larger number of tools. Maybe you want to do some computation in Python or use PySpark; these are more purpose-built tools that you have to juggle, and this is where an orchestrator comes in. At its core, an orchestrator's job is to schedule.
B
...and you don't want to rebuild your dashboard when the data is broken, which falls into the second category: observability. Most orchestrators provide a layer of observability on top of all of these tools, making it easier to debug failures, to get alerts when one part of your pipeline fails, and to get history, being able to view all the times that something has run, to help figure out when something might have gone wrong. So then we move on:
B
Where does Dagster fit in? What's the sales pitch for Dagster in the realm of orchestrators? The first point is that Dagster is data-aware and asset-focused. A lot of more traditional orchestrators are what we call task-focused. If we look at the previous set of data assets, we talked about a dashboard that's powered by some metrics from some ingested data. There's an underlying set of steps to actually produce these assets: we'd run Airbyte to ingest our data, we'd run dbt to generate our metrics, and then we'd trigger a dashboard update to get our finalized dashboard. Most traditional task-based orchestrators confine themselves to working only at this lower-level, task-based view. With Dagster, we like to take a holistic view of both the data assets and the tasks that lie under them, because we believe that if you're a stakeholder who cares about a dashboard, or even a member of the team who's...
B
The other facet that sets Dagster apart, which hopefully you'll see in the demo, is that it's built for the engineering workflow: for local development using the same tooling and the same interface that you'll use when you deploy to production, and also for the ability to test, to write unit tests, to easily input sample data.
B
You
know
deploy
to
a
staging
environment
that
sort
of
thing
so,
with
that
all
being
said,
I'm
gonna
move
into
a
demo,
hopefully
that'll
help
to
illuminate
some
of
the
things
I
quickly
talk
through
the
code
for
this
demo
is
available
on
my
GitHub
here
here,
so
github.com
earbud
Community
demo.
So
what
I'll
do
is
kind
of
first
show
what
it
looks
like
to
use
everybody
with
dagster
and
then
I'll
kind
of
move
into
the
code
base
and
show
you
know
how
we
would
actually
build
out.
B
We'll start by going to a terminal and running dagit. This is going to kick off the local web interface for Dagster. You can get this and run it locally: just pip install dagster and pip install dagit. If we open up the UI in our browser, we're greeted by this update-activity-dashboard job. Here we get a DAG showing the relationship between the different assets in our job.
B
It's a pretty straightforward data ingest job: we're loading some data from GitHub using Airbyte, and also from Slack, then we're using dbt to generate some aggregate tables, rolling up some daily data for Slack messages and GitHub commits, and then building a central table that we'll use to power our Hex dashboard. Each node on this graph, as I mentioned, represents an asset, so you can click on one of these nodes to view information about that asset.
B
So
we
can
see
you
know
the
last
time
the
asset
was
materialized,
so
the
last
time
that
this
commits
table
was
recreated
by
air
byte
was
you
know
about
an
hour
ago.
We
get
a
glimpse
into
the
schema
of
this
table
in
Snowflake.
You
know
if
we
click
on
one
of
these
DBT
assets,
we
get
something
similar.
We
also
get
some
time
series
metadata,
showing
you
know
every
time
this
asset
was
created.
How
long
did
it
take
to
produce?
B
This
is
all
kind
of
data
that,
in
this
case,
is
automatically
added
by
dagster,
but
can
also
be
user
specified
metadata.
So
if
you
wanted
to
attach
any
you
know
metadata,
you
want
to
keep
track
of
to
one
of
your
assets.
You
can
totally
do
that
and
have
it
show
up
in
the
UI.
And
finally,
we
have
this
dashboard
asset,
which
actually
represents
kind
of
a
version
of
our
hex
dashboard
that
was
recreated
here
at
12
30..
B
So
we
can
go
ahead
and
click
on
the
URL
there
to
view
our
dashboard
at
the
time
that
it
was
generated.
So
here
we
have,
you
know,
commits
over
time
graph,
pretty
Bare
Bones
here,
but
you
can
imagine
this
is
you
know
some
sort
of
dense
dashboard
here,
providing
some
some
useful
information
for
some
outside
stakeholder.
So
one
thing
I'd
like
to
kind
of
showcase
here,
is
the
ability
for
you
to
kind
of
get
insight
on
your
assets
without
leaving
the
UI
without
having
to
delve
into
the
code.
B
So
if
I'm
you
know
an
outside
stakeholder
who
maybe
cares
about
this
text,
dashboard
I
don't
have
to
go
dig
through
the
code
to
see
how
it's
produced.
I
can
just
go
to
the
kind
of
dedicated
page
for
that
asset
and
I
can
see
every
time
it's
been
produced.
Historically
I
could
you
know
maybe
jump
to
one
of
the
runs
if
I
wanted
to
view
some
logs
I
could
go
ahead
and
and
open
up
a
version
of
that
dashboard
from
that
particular
point
in
time.
B
So
here's
kind
of
a
historical
version
of
the
dashboard
you,
you
can
see,
there's
a
slight
difference
here,
since
this
is
from
a
couple
hours
earlier
and
you
can
even
go
in
and
view
lineage
for
a
specific
asset.
So,
if
I
didn't
know
anything
about
how
this
dashboard
was
sourced,
I
could
go
to
this
lineage
Pane
and
see
you
know
it's
coming
from
this
DBT
table
which
in
turn
is
coming
from
these
Airbus
assets.
B
So
if
I
go
back
to
this
job
page,
let's
go
ahead
and
see
what
it
looks
like
to
actually
trigger
an
airbite
sync-
and
you
know
a
DBT
Transformations
from
dagster
I
can
either
go
ahead
and
click
this
materialize.
All
button
which
is
going
to
you
know
take
all
of
these
assets
figure
out
what
underlying
tasks
are
needed
to
produce
them?
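Conceptually, Materialize All resolves the asset graph into an execution order. The following is a toy sketch of that resolution using only the standard library; the asset names mirror the demo's graph, but the code is illustrative and is not Dagster's actual implementation:

```python
from graphlib import TopologicalSorter

# Upstream dependencies for each asset, mirroring the demo's graph
# (Airbyte tables feed dbt models, which feed the Hex dashboard).
deps = {
    "github_commits": set(),                                   # Airbyte sync output
    "slack_messages": set(),                                   # Airbyte sync output
    "daily_commits": {"github_commits"},                       # dbt model
    "daily_messages": {"slack_messages"},                      # dbt model
    "activity_summary": {"daily_commits", "daily_messages"},   # dbt model
    "hex_dashboard": {"activity_summary"},                     # dashboard refresh
}

def materialization_order(deps):
    """Return one valid order in which to materialize every asset."""
    return list(TopologicalSorter(deps).static_order())

print(materialization_order(deps))
```

The syncs always come first and the dashboard last, because every other asset is in its upstream closure.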
C
We can see that our syncs will be kicking off in just a moment.
B
We
can
see
you
know
both
of
our
air
byte
sinks.
Here
have
started.
You
know,
they're
starting
to
produce
logs
and
dagster
integrates
pretty
tightly
with
air
byte,
we
stream
the
log
information
directly
from
Air
byte.
So
if
something
were
to
go
wrong,
you
know
you
can
access
the
logs
directly
through
Dexter
and
you
can
also
get
kind
of
high
level
structured
information
from
our
structured
event
log
here.
So
once
the
air
writes
and
completes
we'll
get
some
kind
of
metadata
telling
us.
B
You
know
how
many
new
rows
we're
seeing
that
sort
of
thing
in
this
more
structured,
so
this
is
going
to
take
a
little
bit
to
to
actually
get
going,
but
I
wanted
to
delve
just
briefly
into
the
code
to
show
you
know
how
easy
this
is
to
get
going.
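Under the hood, triggering and monitoring a sync comes down to calls against Airbyte's REST API. Below is a minimal standard-library sketch of that trigger-and-poll loop; the endpoint paths ("connections/sync", "jobs/get") and response shapes are assumptions based on Airbyte's open-source API, and this is a sketch of the pattern, not the dagster-airbyte source:

```python
import json
import time
import urllib.request

def airbyte_post(host, port, path, body):
    """POST a JSON body to a (hypothetical) Airbyte API path and parse the reply."""
    req = urllib.request.Request(
        f"http://{host}:{port}/api/v1/{path}",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def run_sync(connection_id, post, poll_interval=5):
    """Kick off a sync for one connection and block until it finishes.

    `post(path, body)` is injected so the loop can be exercised
    without a live Airbyte server.
    """
    job = post("connections/sync", {"connectionId": connection_id})["job"]
    while job["status"] in ("pending", "running"):
        time.sleep(poll_interval)
        job = post("jobs/get", {"id": job["id"]})["job"]
    if job["status"] != "succeeded":
        raise RuntimeError(f"Airbyte sync ended with status {job['status']}")
    return job
```

Streaming logs back, as the integration does, would be another call in the same polling loop.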
B
This
is
all
the
code
that's
needed
to
get
this
example
working.
So
you
know
we
have
a
single
python
file
here
and
we
have
a
couple
kind
of
segments
that
are
generating
each
of
our
assets
so
to
set
up
our
earbud
assets.
Here,
all
we
need
to
do
is
tell
dagster
how
to
access
our
air
byte
instance.
So
here
we're
pointing
you
know
at
our
local
earbud
instance.
B
This
is
a
plugable
system,
so
you
know
in
our
production
environment
we
were,
you
know
pointing
towards
our
production
earbud
instance
and
then
we're
just
telling
diagster
to
load
all
of
the
earbud
assets
automatically
from
the
irbite
instance.
You
can
see.
There's
you
know
a
bunch
of
optional
inputs
here.
You
can
use
to,
of
course,
those
assets
to
look
the
way
you
want
them
to
here
we're
you
know
changing
what
the
asset
names
look
like,
but
for
the
most
part,
this
loading
just
kind
of
happens
automatically.
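The "make the asset names look the way you want" knob boils down to mapping each connection's stream (table) names to asset keys. Here is a toy standard-library sketch of that mapping with an optional prefix and renaming function; the connection and stream names are made up for illustration, and this is not the dagster-airbyte implementation:

```python
def airbyte_asset_keys(connections, key_prefix=None, rename=None):
    """Derive one asset key per destination table across all connections.

    connections: mapping of connection name -> list of stream (table) names.
    key_prefix:  optional list of path components to prepend to every key.
    rename:      optional function applied to each table name.
    """
    rename = rename or (lambda name: name)
    keys = []
    for streams in connections.values():
        for table in streams:
            keys.append((key_prefix or []) + [rename(table)])
    return keys

# Hypothetical connections mirroring the demo: GitHub and Slack syncs.
connections = {
    "github": ["commits", "issues"],
    "slack": ["messages"],
}
keys = airbyte_asset_keys(connections, key_prefix=["raw"], rename=str.lower)
print(keys)
```

The automatic loader effectively runs something like this over every connection it discovers via the API.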
B
You can see we're doing something very similar here with dbt. We have a dbt project defined here with some model SQL files, and all we do is point Dagster at the files and it will do the work of automatically generating the assets for you. Even for the Hex asset, really all we need to do, using our Hex integration, is point at the project ID and tell Dagster what data that notebook depends on, and that's all we need to get that asset up and running. Then we can define a job to recreate everything, attach a schedule to it, and tell Dagster all of the assets that we want to load, and that's really all we need to get a smaller example like this up and running. I wanted to make sure there was enough time for questions, and I know we got a little bit of a late start, so I'll take a pause there, but I'm happy to delve into any other part of this demo or answer questions if anyone has any.
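The "attach a schedule" step above amounts to rerunning the job on a fixed cadence (this demo runs every 10 minutes, as mentioned later). A tiny sketch of that tick computation using only the standard library, purely illustrative and not Dagster's scheduler:

```python
from datetime import datetime, timedelta

def next_tick(now, interval_minutes=10):
    """Round `now` up to the next multiple of `interval_minutes` past the hour."""
    minutes = (now.minute // interval_minutes + 1) * interval_minutes
    base = now.replace(minute=0, second=0, microsecond=0)
    return base + timedelta(minutes=minutes)

print(next_tick(datetime(2023, 1, 17, 12, 34)))  # 2023-01-17 12:40:00
```

In practice you would hand the orchestrator a cron expression and let it compute these ticks for you.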
A
Yeah, so if anyone is watching along, feel free to drop in some questions, but I can drop in a few. I think this is a really good high-level demo overall. What was involved in building this Dagster Airbyte integration? I would love to learn a little bit more about the process, and perhaps what could be better.
B
Yeah, a little bit from behind the curtain: most of our integrations, Airbyte included, boil down to Dagster talking to whatever service through whatever API surface that service has. Airbyte in this case has a really good API, which made it pretty easy for us not only to do things like kick off Airbyte syncs through the API, but also to stream a lot of that metadata back.
B
So,
whether
that's
you
know
incorporating
the
earbud
logs
into
diagster's
run
logs,
but
also
to
provide
some
of
this
metadata
so
like
being
able
to
show
us,
you
know
what
is
the
scheme
of
our
tables
look
like
directly
in
dagster
or
even
to
show
us.
You
know
how
many
new
rows
were
added
to
our
table
every
time
we
ran
this.
So
you
know
under
the
hood
a
lot
of
this
integration
work
was
really
just
interfacing,
with
their
bytes
Avi.
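Surfacing "how many new rows" is a matter of pulling counters out of the job payload the API returns. Here is a small sketch of that extraction; the payload shape below, with per-attempt recordsSynced counters, is an assumption about Airbyte's job API, and the sample numbers are made up:

```python
def records_synced(job_payload):
    """Sum the records synced across all attempts of one Airbyte job."""
    return sum(
        attempt.get("recordsSynced", 0)
        for attempt in job_payload.get("attempts", [])
    )

# Hypothetical response for a job that needed two attempts.
payload = {
    "job": {"id": 42, "status": "succeeded"},
    "attempts": [
        {"recordsSynced": 0},      # first attempt failed early
        {"recordsSynced": 1250},   # retry moved the data
    ],
}
print(records_synced(payload))  # 1250
```

The orchestrator can then attach this number to the materialization event so it shows up in the asset's metadata history.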
B
We have the ability to basically validate the outputs of each of your assets, and one of the things that we're actually working on internally right now is a notion of SLAs for assets: rules that you can set for assets that will alert if they're violated.
B
Yeah,
so
that's
kind
of
a
yeah,
that's
another
kind
of
way
that
the
slas
can
be
used
so
right
now,
you
know
this
example
is
run
on
a
you
know,
strict
time
schedule,
so
we've
decided
you
know,
hey
we're
going
to
rerun
all
these
assets
every
10
minutes,
but
you
can
also
have
dagster
kind
of
dynamically
figure
out
when
to
run
certain
assets
by
telling
it
I
want
this
asset
to
be
ready
by
9
A.M
every
morning
that
way,
you're
not
like
defining
explicitly
when
every
asset
runs
you're,
just
giving
dagster
a
list
of
criteria
and
it'll
kind
of
figure
out
behind
the
scenes,
one
to
run
everything
yeah.
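"Ready by 9 a.m." can be read as a deadline that propagates backwards through the dependency chain: each upstream step must start early enough for everything downstream to finish. A toy sketch of that backward computation follows; the step durations are invented, and this is not Dagster's freshness logic:

```python
from datetime import datetime, timedelta

def latest_start(deadline, durations):
    """Given a deadline and the ordered durations (in minutes) of each
    pipeline step, return the latest time the first step may start."""
    total = sum(durations)
    return deadline - timedelta(minutes=total)

# Hypothetical chain: Airbyte sync (15 min) -> dbt (10 min) -> dashboard (5 min),
# with the dashboard due at 9:00 a.m.
deadline = datetime(2023, 1, 17, 9, 0)
print(latest_start(deadline, [15, 10, 5]))  # 2023-01-17 08:30:00
```

Declaring the deadline and letting the orchestrator derive start times is what makes this style declarative rather than cron-driven.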
A
The ultimate in declarative assets, yeah.
B
If we go back to the code here, this is one of the ways you can define Airbyte assets: automatically generating them from your Airbyte instance. What Dagster does here is talk to the Airbyte API directly, figure out what syncs are present, and then generate assets from those based on the tables they're producing. You can do things like filter down the list of connections that you care about. There are also ways to build Airbyte assets manually, if you'd rather have each of the assets defined explicitly: you can use something like build_airbyte_assets here, which takes an explicit connection ID. It's a little bit less magical, but it might be nicer for cases where you want everything explicitly defined in code.
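The two styles, automatic discovery with a filter versus an explicit allow-list, can be sketched side by side as a selection step over connections. The connection data below is invented, and the real dagster-airbyte functions take different arguments; this just illustrates the trade-off:

```python
def select_connections(connections, connection_filter=None, explicit_ids=None):
    """Pick which Airbyte connections to turn into assets.

    connections:       mapping of connection id -> connection name.
    connection_filter: optional predicate on the name (automatic style).
    explicit_ids:      optional fixed list of ids (manual, build-style).
    """
    if explicit_ids is not None:
        return {cid: connections[cid] for cid in explicit_ids}
    if connection_filter is not None:
        return {cid: name for cid, name in connections.items() if connection_filter(name)}
    return dict(connections)

connections = {"c1": "github_to_snowflake", "c2": "slack_to_snowflake", "c3": "test_pipeline"}
# Automatic: everything except test connections.
prod = select_connections(connections, connection_filter=lambda n: not n.startswith("test"))
# Manual: exactly the connection you name, nothing discovered.
manual = select_connections(connections, explicit_ids=["c1"])
```

The filter style picks up new connections as they appear; the explicit style never surprises you.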
A
Know
one
thing
one
thing
I
feel
like
is
under
appreciated
when
people
look
at
Solutions
like
this
is
the
fact
that
you
don't
have
to
statically
define
the
asset
once
I
feel
like
because
you
have
control
of
the
API
and
I've
seen
some
of
our
most
advanced
users
do
this,
which
is
essentially
based
on
some
parameters
that
come
into.
Let's
say
your
dagster
run
spin
up
new
air
byte
instances
and
assets,
dynamically
I
I
feel
like
you
have
that
control
I
I,
don't
know.
If
there's
any
foot
guns
that
you
would
flag
for.
B
Yeah,
that's
actually
something,
you
know
not
that
specific
use
case,
but
we've
had
a
lot
of
people
playing
with
recently
kind
of
like
dynamically
spinning
up
and
tearing
down
infrastructure
like
as
part
of
a
run.
So,
yes,
you
know
one
thing
we've
had
people
do
is
kind
of
have
ephemeral
development
environments
that'll,
like
Fork
a
snowflake
schema.
You
can
kind
of
test
against
production
without
actually
interrupting
your
production
database.
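One lightweight way to realize that fork is simply to derive a per-branch schema name and run the whole pipeline against it. The naming convention below is invented for illustration:

```python
import re

def schema_for_branch(base_schema, branch):
    """Production writes to the base schema; any other branch gets its own copy."""
    if branch in ("main", "master"):
        return base_schema
    # Sanitize the branch name into a valid schema identifier.
    suffix = re.sub(r"[^a-z0-9_]", "_", branch.lower())
    return f"{base_schema}_dev_{suffix}"

print(schema_for_branch("analytics", "main"))          # analytics
print(schema_for_branch("analytics", "fix/dbt-join"))  # analytics_dev_fix_dbt_join
```

Tearing the environment down is then just dropping the derived schema once the branch is merged.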
A
So
it's
funny,
because
dagster
Cloud
also
kind
of
does
that
for
dagster
itself
right,
like
that's
your
sort
of
deploy,
preview
thing,
I,
don't
know
if
you
want
to
talk
about
that
a
little
bit
and
show
it
off
really
I
mean
I.
Think
it's
the
most
impressive
thing
that
that
you've
shipped
last
year,
yeah.
B
So, maybe a little bit of a plug: everything I've shown here is the open-source version of Dagster, which you can run locally and host yourself. But we also have a hosted version of Dagster called Dagster Cloud, which eases some of the deployment burden and also has enterprise features, things like RBAC. One of the other features is something called branch deployments: the ability to dynamically fork your entire Dagster environment with pull requests.
A
Production, yeah. Cool. I think part of the value proposition is that you're essentially innovating on all the things that Airflow doesn't really consider part of its core feature set, and I think that's pretty cool. What, I guess, is part of the roadmap, or what should people look forward to in 2023 when thinking about the Dagster story, in order to decide whether or not they should bet on...
B
...this horse. Yeah, that's a great question. A lot of the roadmap is stuff we're still internally debating, but among the things you can look forward to: one area we've been spending a lot of time on recently is our integrations. Here you've seen our integrations with Airbyte, dbt, and Hex, but we're hoping to build out even deeper integrations with a lot of the core tools that people are using as part of the MDS and as part of a lot of these data use cases. I think one of the things that sets Dagster apart from some more traditional orchestrators is the level of depth in our integrations, so we surface a lot of metadata.
B
...energy building out our set of integrations there. We're also doing a lot of work to improve the core Python ergonomics of the Dagster APIs; you maybe saw a little bit of that in the code preview I showed, but that's another area where we're focusing a lot of energy. And then there are the SLAs that we talked about a little bit earlier, making it even easier to just declare: "hey, I...
A
It really is kind of everything that you would want, taking the software engineering mindset, right? I feel like Nick, when he spoke at our conference, was talking a little bit about that, and bringing that to data engineering, which is kind of starved for all that innovation and progress.
A
You
know
on
our
ends,
we're
working
on
basically
the
bringing
the
API
that
you
that
you
use
here
in
sort
of
local
environments,
to
our
everybody
Cloud
instance,
so
just
kind
of
similarly
like
what
we
want
to
encourage
using
Daxter
using
orchestrators
in
general
as
part
of
your
production
quality
deployment
of
a
data
pipeline
right
over
the
sort
of
data
architecture-
and
you
know
all
the
all
the
Integrations
that
you
guys
have
is
just
really
incredible
to
see
so.
I
posted
it
up
in
the
chat
for
people
to
explore.
A
Awesome
I
think
that
about
covers
it
any
parting
words
for
the
community,
any
call
section
what
what
they
should
check
out
next.
B
Yeah
well,
first
of
all,
you
know
thanks
so
much
for
giving
me
the
opportunity
to
share
what
we've
been
working
on
here.
I
don't
have
too
much
to
share
other
than
if
you're
interested
in
trying
this
out,
you
know,
dagster.io
is
where
you
can
find
out
more.
Our
slack
Community
is
really
the
the
biggest
place
to
kind
of
engage
with
us
and
the
broader
dagster
community.
So
if
you
want
to
play
around
with
dagster,
you
have
any
questions
that
come
up.
The
slack
Community
is
probably
the
best
place
to
ask.