From YouTube: How to Orchestrate your Airbyte ELT Jobs with Dagster
Description
In this edition of the Community Call, Owen will show off the new Airbyte Dagster integration, allowing you to orchestrate Airbyte and the rest of your modern data stack easily.
Check out our announcement here: https://airbyte.com/blog/orchestrate-your-airbyte-elt-jobs-with-dagster
Subscribe to our newsletter: https://airbyte.com/newsletter?utm_source=youtube
Learn more about Airbyte: https://airbyte.com
A
All right, hey everybody — how's it going? I really appreciate everyone being here. I know people are still trickling in right now, but I'm going to do a quick overview of what we have planned for today and a quick introduction to our guest. Basically, Airbyte has been working with Dagster, which is an orchestration platform that allows you to orchestrate a lot of things on the modern data stack.
A
I don't want to take too much time away from him, so Owen, if you want to take it away and start presenting — thank you so much for being here.
B
Awesome — I appreciate the invite. It's always exciting to present this stuff to a new community, so I'm going to go ahead and share my screen.
Yeah, thanks for the introduction. I'm a software engineer at Elementl; we're the company behind Dagster, which is an open-source data orchestration tool.
At a high level, I'm going to explain what orchestration is, and then from there I'll just jump into the demo. Also, I can hear comments, but I can't see them on my screen, so if there's something important for me to adjust in a moment, let me know.
A
We'll do a Q&A session, so everyone, if you have any questions, feel free to leave them in the questions tab — if you look at the bottom, there's a questions tab where you can create a question, and we'll answer it directly at the end and make sure Owen hears it. And yes, this will indeed be recorded; it will be put on our YouTube channel to live in perpetuity.
B
Without further ado, then — and I appreciate that. At its most basic, orchestration is when you have a set of tasks and you want to run them in a particular order. If you have two tasks, one might depend on the other, which means you don't want to run task 2 until task 1 completes — and if task 1 fails, don't run task 2. These tasks can be anything. For example, task 1 might be "load new data into a database"; that would use a tool, something like Airbyte. Task 2 might be something like "transform those tables once they're there" — again, this could use a modern data stack tool like dbt, or it could be really anything; the tasks are completely arbitrary. Another example: maybe first you want to store some log file you have on your local machine to S3 or something, and only once that's done do you want to delete your local copy. It's really important to get the order right there, and an orchestrator is something that can help with that.
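(For illustration only — a minimal sketch of that two-task ordering expressed in Dagster; the op names and bodies here are placeholders, not code from the talk.)

```python
from dagster import In, Nothing, job, op

@op
def load_new_data():
    # Task 1: e.g. kick off an Airbyte sync (stubbed out in this sketch).
    ...

@op(ins={"start": In(Nothing)})
def transform_tables():
    # Task 2: e.g. run dbt. The Nothing-typed input expresses pure ordering.
    ...

@job
def two_task_pipeline():
    # transform_tables starts only after load_new_data completes successfully;
    # if load_new_data fails, transform_tables never runs.
    transform_tables(start=load_new_data())
```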
B
Other features of orchestrators that aren't strictly necessary, but that you'll generally find, would be scheduling — being able to kick off these runs at a particular time; alerting — if something fails, you want to know about it; and history — being able to look back over time and ask, you know, how long did this thing take at a given point in time, and so on. Seeing all that, the question might come up: why would you want to use an orchestrator — or even, is Airbyte an orchestrator? Airbyte actually offers an integration with dbt.
B
It allows you to say, okay, once this Airbyte sync completes, then I want to run this particular dbt project — and in a sense that is orchestration, and it's actually really useful for some set of use cases. However, often the reality is a bit more complex than can be expressed by that simple a dependency structure. Really, your entire data platform is a collection of these interdependent tools; it's not just Airbyte and dbt.
B
Usually you have other stuff mixed in, and where this gets really tricky is when you have custom code written in Python. So instead of just running Airbyte and then dbt afterwards, maybe you use Airbyte to take data from a database and put it on S3; once it's there, you use Spark to transform that partition in a particular way; then you do further transformation.
B
You might have machine learning use cases that also need to run on the data produced by these things, and as the data platform grows and the business logic becomes more complex, the dependencies between these different tools become a lot more complex in turn. It quickly becomes unmaintainable to just have these running on their own separate cron schedules or in their own separate tools. The second you need to do something like a backfill, it can be extremely arcane to know the correct incantation of scripts you need to run in a particular order to get everything working again. Similarly, debugging workflows can be really challenging, because you need to page through all these different tools to see where the error occurred, when the only symptom you have is "hey, some number looks weird." This is really where an orchestrator shines as these complexities grow. So that's the general orchestration view — what about Dagster?
B
What's different or interesting about Dagster as opposed to other orchestration tools? I could go on about feature differences and so on for a while, but I think the most salient point is that, fundamentally, the philosophy behind Dagster is that the orchestrator isn't just responsible for running tasks — it should be responsible for understanding them. It's a centralized place where you're defining all the dependencies between your tasks.
B
So it should have some sense of why you're actually running them, and I think the easiest way to explain what that means is with the demo — so you're free from me going through slides, and I'll just hop right into it. All of the code for this is available on my personal GitHub, so if you're interested in looking at that, feel free, and we're also going to have a recipe going out this week or next week that will show how to get all of this up and running on your local machine.
B
If you're interested, do check those out. Okay — with that, I'm going to do this demo in maybe a slightly backwards order: I'll show off the UI first and then explain the code that's used to generate what you're seeing. Dagster is completely free and open source, and that actually includes the UI tool as well. You can run the UI tool, which is called Dagit, on your local machine — no sign-up or anything like that required; you just install Dagit. So this is one of the files in that GitHub repo I was sharing, and I'm just going to point Dagit at it and spin it up. That'll take a second — but now it's up, so I can tab over here, and this is what Dagit looks like. At a high level, we've defined a single job here, and the purpose of this job is to take data from two different sources: GitHub and Slack.
B
It's going to run that through dbt to transform it, join that into a single unified metric, and then we're going to do some custom Python stuff at the end. Apologies to any data scientists who are watching this right now, because I'm not one, but it'll give you at least a sense of a data-sciencey workflow that you might experience. If we zoom in a little bit, you can see Dagster already gives you a ton of metadata about everything that's going on within this job.
B
For example, we can get a more complete description of this sync-GitHub step — whether runs have succeeded or failed, and so on — all this sort of operational information. This is really helpful, especially if you're not the person who wrote this job, because you might have no idea what all these things are; a name only goes so far. So having all this metadata available to you right there is really useful.
B
Taking a step back, there's sort of a naive question that I think people should ask about most of their data pipelines, which is: what is the point? Why have I defined this thing? Of course this is a demo, so the actual data flowing through here isn't that important, but if you imagine someone who's written something similar in reality — why would they have done it? It's not just so that they can run tasks; that's not their overarching goal. It's to achieve a particular result, and there are a few different ways we can understand that result.
B
One is that at the end of all this data science stuff we have this generate-chart step, and all that's doing is fitting some model to the data we're observing in dbt and generating a chart that represents how well the model fits the actual data. That's one thing a person writing something like this would care about: the chart they generate at the end. Another thing they care about is the models — the tables in the database that dbt is creating. When you run dbt, it creates or updates a bunch of tables in a database, and we probably care about those as well; analysts might want to make sure they're up to date and have accurate information. And finally, we also care about the data that's being moved by Airbyte from the source to our data warehouse.
B
So we probably care that the raw GitHub and Slack data is up to date. And if you ask a traditional orchestrator "what is the point of this data pipeline?", it often has no insight into the fact that those are the things actually being created by these steps. So we have a term for the things we care about: data assets. Traditionally that's something like a table in a database, a machine learning model, some report, and so on — and we actually consume metadata about those assets and let you visualize them and track them over time. We can see here that this job actually creates six assets. We have an asset for the Slack channel messages — that's one of the things Airbyte is syncing in the sync-Slack step. We have this Airbyte GitHub commits asset — again, that's the raw data Airbyte is moving. Then the dbt project has three models in it.
B
It's just some daily rollups on top of those tables, which are then joined together into a single metric, and then finally we have this chart that we create at the end, which is also a data asset. If we click on one of these, we see not only the most recent time that this chart was created, but also the fit function that was used to create it, as well as the chart itself — so you can take a quick peek at that. It's not the prettiest thing in the world.
B
We can also see this historically — Dagster gives you this sort of longitudinal view of the actual assets you're creating. For example, if I go all the way back in time to 9:32 PM last night, I can grab a different version of this asset and see what the chart looked like back then. And I promise these are different, even though they look very similar.
B
Again, instead of finding the asset this way, if we just know that there's some Airbyte GitHub commits asset, we can search for it and see the longitudinal information for this Airbyte-created asset over time. Airbyte is a great tool here because it gives us lots of metadata to work with — it's not like the user is manually inputting all this information.
B
Every run of Airbyte gives you tons of really important metadata that you can track over time. We can see how many bytes or how many records were created and track that over time; we also get schema information that, again, we can track over time. If something changes, this is a really powerful debugging tool — you can see the exact point in time at which the schema changed or at which a data spike occurred. So an orchestrator isn't just good for looking at things.
B
So, for example, if I go to the sync-GitHub step, we actually ingest those logs directly into Dagster, so you can view them as the thing is running to get an idea of what's happening. And that's useful not only while you're running something — you also get a historical record. Again, for debugging workflows this is invaluable: being able to look back and see if there were any warnings at a certain point in time; you get all the dbt transformation information, all of that. So this is going to take a little bit — there we go, actually.
A
I have a quick question for you: how is Dagster able to generalize having log output like that? Because you can orchestrate a bunch of different things — you can probably get dbt logs too — how did you generalize having this window that always shows you the log output?
B
The answer is that the tool needs to provide it to us. Luckily, Airbyte's API provides the log information, and as long as we can get that from a request, we can insert it into our own system. Under the hood, we're just re-emitting it on the standard output stream, and you can see this is actually a generic window showing all the standard out generated for this step — it just so happens that the step only runs Airbyte, and we're the ones producing the standard out.
A
And do you have a similar thing for capturing metadata? You mentioned you can have that longitudinal data where we send you a bunch of metadata and you're just capturing it through the API. So would you do that for any tool — basically look for the metadata individually and then capture it so that you can display graphs?
B
Exactly, yeah — the more metadata the tool provides, the better, which is one of the reasons we're really happy with Airbyte and how this integration turned out. And yes, as you mentioned, we also capture dbt output.
B
Actually, because dbt output is a little smaller, we show it inline; Airbyte can produce hundreds or thousands of log lines, so we don't want to show that in the same view where we're seeing all this operational information. But again, you can look back at this over time. So once we've run our step, we'll get some new asset materializations — for example, for this Slack sync we can see that we found three new records since the last time I ran it, that it's in incremental mode, and so on.
B
So that's a quick view of basic Dagster functionality, and I'll just hop into the code now. This is what the code looks like — it all fits on one screen. The reason it can be this condensed is that we're using pre-built integrations for a lot of this work. We're using three libraries here. The first one is dagster-airbyte; this was contributed by an Airbyte team member, which we're very grateful for, and we're very happy with how it turned out.
B
Then we're also using the dagster-dbt integration to run our dbt step, and because this is all running on my local machine, where we have a Postgres database where all the transformation stuff is happening, we're using dagster-postgres to eventually read from that. The first thing of import here is where we're defining our Airbyte sync operations.
B
We import this airbyte_sync_op thing, and because it's just a generic op, we need to configure it with a particular Airbyte connection ID so it knows which connection to kick off, and then we give it a name so that it shows up nicely in Dagit and people can understand what it is. And then we do the same exact thing for Slack.
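(For illustration — a minimal sketch of what configuring those ops with the dagster-airbyte library might look like; the connection IDs are placeholders, not values from the talk.)

```python
from dagster_airbyte import airbyte_sync_op

# Configure the generic op once per Airbyte connection; the `name` is what
# shows up in Dagit. The connection IDs below are placeholders.
sync_github = airbyte_sync_op.configured(
    {"connection_id": "<github-connection-uuid>"}, name="sync_github"
)
sync_slack = airbyte_sync_op.configured(
    {"connection_id": "<slack-connection-uuid>"}, name="sync_slack"
)
```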
B
We just take the generic op and point it towards the Slack connection. Then we do a very similar thing for dbt — this time it doesn't actually need any configuration; we just give it a name here, and even that is optional. If I didn't do it, it would just show up as "dbt run" in Dagit.
B
So this is maybe a little bit more understandable. I'll show the Python ops and how they're defined in a second — most of that code is just normal Python, so I didn't include it in this file — but here's how we define the dependencies between these things. We've defined the operations we want to compute, so how do we define the order and how they connect to each other? We do that using what's called a job: in Dagster, the collection of ops is the job. We also have the option here to define particular resources, and those resources are how Dagster communicates with the relevant APIs; they can be swapped out at will.
B
So you can imagine having the same exact dependency structure between things, but saying, hey, instead of pointing at my localhost Airbyte resource, maybe I want to point at the Airbyte server I have running in production when I'm actually running this for real. This lets you separate the concern of what the dependency structure is from how it's going to run — what things it's actually going to hit. We configure a dbt resource to point at a particular dbt project, and then we're looking at a local Postgres database.
B
So we give it a particular database connection string. This is going to be used in one of the Python ops — I'll show that in a second — and yes, it is just hard-coded for my local machine, but Dagster does make it easy to read from environment variables instead. So don't do this in the real world, please.
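(For illustration — a rough sketch of how those resources might be defined; the resource keys, host, port, paths, and connection string are assumptions for the sketch, not values from the talk.)

```python
from dagster import resource
from dagster_airbyte import airbyte_resource
from dagster_dbt import dbt_cli_resource

# Hypothetical resource that just hands a database connection string to the
# Python ops so they can read dbt's output tables with pandas.
@resource(config_schema={"con_string": str})
def db_con(init_context):
    return init_context.resource_config["con_string"]

# Local development resources; a production deployment would swap these out
# (e.g. point "airbyte" at the production host, read secrets from env vars).
local_resource_defs = {
    "airbyte": airbyte_resource.configured({"host": "localhost", "port": "8000"}),
    "dbt": dbt_cli_resource.configured({"project_dir": "./dbt_project"}),
    "db_con": db_con.configured({"con_string": "postgresql://localhost:5432/demo"}),
}
```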
A
Yeah — sorry to jump in again here, but just to touch on another question that someone asked, since this is an interesting place to do it: you could have a mix of local and cloud products, right? Say you were running Airbyte Cloud, or maybe Dagster Cloud or something — you could use this Airbyte resource config to point it at anything and still run Dagster locally. And then, I guess in reverse, you'd have this in Dagster Cloud and point it — but then I guess you wouldn't be able to point it at something local, right? So I guess the question would be—
B
I think, generally, the pattern we'll see is that all of the code to run either in production or locally would just be in the same Git repo — although Dagster also has a concept called repositories, which lets you segment which versions of those jobs you see in different environments. I don't want to get too bogged down in the weeds here, but—
B
One version of that graph would be pointed at all my local stuff, so I would give it those resources, and a different version of that graph would be configured with all of my prod stuff. When I'm running in prod, I point at the prod version of that graph, and when I'm running locally, I point at the local version.
A
Awesome.
B
No, no — happy to answer more as they come up. But yeah, so we've defined all the resources that are relevant to running this thing; now we just need to define the dependency structure. The first bit is defining the fact that this dbt step depends on these Airbyte steps. The way we do that is we take our transform-slack-github op — that was the dbt run op we configured — and we say that it starts after sync_github and sync_slack.
B
This is slightly different syntax than what you're going to see for the Python stuff, simply because dbt and Airbyte don't actually need to pass any data between each other. dbt knows how to read tables; it doesn't need to be passed a copy of the data in a table or anything in order to function, so there's no actual data flowing between those steps. But when you do Python transformations, data actually does need to get passed between the steps.
B
So we might pass a pandas DataFrame from one step to another, and therefore data needs to flow.
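(For illustration — a rough sketch of how that job wiring might look, reusing the resource definitions sketched earlier; the job name, the signatures of read_dbt_output / get_fit_params / generate_chart, and the list-style fan-in on start_after are assumptions, not code from the talk.)

```python
from dagster import job
from dagster_dbt import dbt_run_op

@job(resource_defs=local_resource_defs)
def airbyte_dbt_analysis():
    # dbt only needs ordering relative to the Airbyte syncs (a Nothing-typed
    # dependency), so no data is passed into it.
    dbt_done = dbt_run_op.alias("transform_slack_github")(
        start_after=[sync_github(), sync_slack()]
    )
    # The custom Python ops, by contrast, pass real data between steps
    # (a pandas DataFrame, then the fitted parameters).
    df = read_dbt_output(start_after=dbt_done)
    generate_chart(df, get_fit_params(df))
```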
Then we have these custom Python ops. I have a file called ops.py, and 99% of this file is just me messing around with various data-sciencey Python libraries, but the first op I've defined here is called read_dbt_output, and you can see it has a required resource key of db_con.
B
That's just a connection string that pandas can use to read SQL from a particular place — and this DB connection string could also be a Snowflake connection string. This is how you can define an op that does the same thing regardless of what it's pointed at: I don't need to separately create a read_dbt_output_postgres and a read_dbt_output_snowflake op and then create different jobs for each of those scenarios. I can just say, okay, this is a generic way to read some data from a dbt project that was run.
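(For illustration — a minimal sketch of what such an op might look like; the table name and the Nothing-typed start_after input are assumptions. The point is simply that pandas reads SQL through whatever connection string the db_con resource supplies.)

```python
import pandas as pd
from dagster import In, Nothing, op

@op(ins={"start_after": In(Nothing)}, required_resource_keys={"db_con"})
def read_dbt_output(context) -> pd.DataFrame:
    # The resource is just a connection string, so the same op works against
    # Postgres, Snowflake, or anything else pandas/SQLAlchemy can talk to.
    con_string = context.resources.db_con
    return pd.read_sql("SELECT * FROM daily_summary_metrics", con=con_string)
```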
B
Then we have this get_fit_params thing. Again, this is purely a normal Python function: it takes in a DataFrame — in this case a DataFrame containing that summarized daily metric data — and then uses some library to fit a curve and generate the set of parameters that most closely matches the data. And this is really powerful: you're not thinking about how you're storing this data or how you're passing it from task to task. These things actually get serialized and passed between different processes, but that's completely invisible to you as you're programming — you can think purely in terms of what Python code you want to write and how you want to transform your input into your output.
B
And then we have this generate-chart thing, and this is where that asset idea comes into play. We take in both a DataFrame containing the observed data and the fit params we generated in the previous step, to match things together, and then we plot that — all of this is just plotting stuff — and we save the plot to a particular storage path. And then this is how you tell Dagster: hey, there's some persistent asset that I want you to keep track of. We didn't do that in the previous ops, but if you want to track something over time, this is how you would do it: you give an asset materialization a name — I've just called it the analysis chart — and then you give it metadata like the fit function that was used and the storage location, and this all just gets rendered in Dagit.
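(For illustration — a minimal sketch of emitting that kind of asset event from inside an op; the asset key, metadata entries, and storage path are placeholders, and the exact metadata API has shifted across Dagster versions, so treat the details as assumptions.)

```python
from dagster import AssetMaterialization, MetadataValue, op

@op
def generate_chart(context, df, fit_params):
    storage_path = "/tmp/analysis_chart.png"
    # ... fit the curve, plot observed data vs. the fit, save to storage_path ...
    context.log_event(
        AssetMaterialization(
            asset_key="analysis_chart",
            metadata={
                "fit_params": str(fit_params),
                "storage_path": MetadataValue.path(storage_path),
            },
        )
    )
```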
I'm running low on time, but just very quickly: I think one of the reasons it's so important that Dagit can be run locally is the local dev experience.
B
If I'm developing on this and I want to change one of these functions, it can be pretty time consuming to say, okay, I'm going to rerun the entire thing just to see if this one change works — these runs might take minutes, you know, up to an hour. What you can actually do is run subsets of the graph at will.
B
So if I come back here and run from the selected step, this will now run with the new version of the code — and we'll see in a second that my new version of the code is not very good... and it has now failed. So this is actually a really nice local dev experience.
B
You can iterate really quickly — it's kind of a Jupyter-notebook-esque thing where you get all the stuff above it working, and then you just iterate and keep going to fix the issues as they come up. So I'm going to stop there. I'm happy to answer questions or show more stuff that people are interested in.
A
Yeah, that's awesome — that's really cool! I really appreciate your presentation; it was very detailed and thorough, and I think it was really great. Let's try to go through these questions quickly. Ari asks: how do you serialize data assets between ops? Can you configure that and override it to collect metadata about assets?
B
That is a great question. We have an abstraction called an IO manager. By default, when you run locally, that IO manager just pickles the Python object and writes it to a file, and then on the other end it reads that file and unpickles it — but that's completely customizable. We have built-in integrations for things like S3, so it does the same protocol but against S3 instead of the local filesystem. And yes, you can emit metadata during that.
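(For illustration — a minimal sketch of a custom IO manager along those lines: a local pickle-based example with a placeholder storage layout, not the built-in implementation.)

```python
import os
import pickle

from dagster import IOManager, io_manager

class LocalPickleIOManager(IOManager):
    """Pickles each op output to a file and unpickles it for downstream ops."""

    def _path(self, context):
        # One file per step output, keyed by run and step (placeholder layout).
        return f"/tmp/dagster_storage/{context.run_id}_{context.step_key}.pkl"

    def handle_output(self, context, obj):
        path = self._path(context)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(obj, f)

    def load_input(self, context):
        # For inputs, read the file written by the corresponding upstream output.
        with open(self._path(context.upstream_output), "rb") as f:
            return pickle.load(f)

@io_manager
def local_pickle_io_manager(_):
    return LocalPickleIOManager()
```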
A
Then Bogdan asks: does Dagster have to be installed locally? Which I think is asking — where are all the places that Dagster can be installed?
B
Yeah, so you don't need to install it locally. It is useful if you want to develop like that, but some people just prefer to develop in pure Python — you can test ops as regular Python functions; there's no need to run Dagit for that. You do need to install the Dagster Python library in order to develop code, but that's the same as any normal library. With Dagster Cloud you can get access to the same Dagit interface without running anything on your local machine.
B
Running it locally is purely a convenience, though — most people will deploy Dagit to some server, push code there, and interact with it like that.
A
All right, awesome — honestly, I think you did a fantastic job of summing up all the information. So, some last housekeeping: this video will be uploaded to YouTube within the next day or two, and you can go and check out any of the awesome information that Owen has bestowed upon us.
A
We will also, as Owen mentioned, have the recipe for implementing this and deploying it locally — in the same way Owen has shown you — out next week, and additionally we will have some documentation on this on the Airbyte website. Owen, are there any last shout-outs, calls to action, or anything else you want to say to the community before you go?
B
No — I just really want to thank the Airbyte team, honestly, first of all for getting the integration going and then for being super organized on all of this community outreach stuff. We really welcome people to join our Slack; that's the best way to reach us if you have support questions or are just curious about this. But yeah, that's it.
A
Awesome. It was really great having you here — honestly, a really fantastic presentation. So thank you so much again, Owen.