Description
The Modern Data Stack gives us a wide range of free technology for data ingestion, data storage, data transformation, data orchestration, and data visualization.
Pedram Navid, Head of Data Engineering and DevRel at Dagster, walks us through one data pipeline design using a number of open-source solutions, including Dagster, dbt, dbt-duckdb, DuckDB, PopSQL, the Mastodon API, Evidence, Sling, and Steampipe.
The pipeline is used to analyze bird observation data from the Cornell Lab.
Github Repo: https://github.com/dagster-io/mdsfest-opensource-mds
Source Data: https://feederwatch.org/explore/raw-dataset-requests/
Hey everybody, my name is Pedram Navid and I am the Head of Data Engineering and DevRel here at Dagster, and I want to talk to you about building an open-source modern data stack.
So what is a pipeline? A pipeline is really a collection of jobs that we need to get done in order to make data valuable. In most cases you want to ingest data from various source systems: that might be a Postgres database, it could be some data sitting in the cloud as a zip file, or it might be an API from a third-party SaaS app. Either way, the data is out there and we need to bring it in so we can do some type of analysis on it. So that's step one.
Then on the storage side, that can be anything from a database itself, to a data warehouse, or even just Parquet or Arrow files stored on local disk. Transformation is the next thing we really care about, and this is all about taking the data we have and doing something with it.
Whether that's aggregating it, grouping it, filtering it, joining it, all that kind of fun stuff. And finally, we need an orchestration tool: a way to take all these various steps and put them in a sequence so they can run and we can monitor and observe them. Once we have all of that, then we can do visualization or activation, which is really the final mile: you've done all this work, and now what do you want to do with it?
So I'll talk a little bit about the state of open-source tools as it relates to each of these areas. The first one is obviously ingestion, getting that data out, and I think one of the underrated ways to do that is just a plain old Python script. It's a time-tested and true method of getting data from wherever it may be, and one that I think is still very valuable today. Outside of that, we have Meltano and Airbyte.
Meltano is a framework for getting data out of these systems based on the Singer protocol, which is also used by companies like Stitch, for example. Then there's Airbyte, which is similar in some ways to Meltano, but it has its own set of connectors, and those have different qualities to them as well. dlt is a newer framework, one that I'm pretty excited about. I haven't spent too much time on it, but I do like how light and versatile it is, and it makes it very easy to write some of your ingestion logic.
Then there are also tools like Steampipe and Sling. Steampipe is really nice: what it can do is take data that sits behind a REST API and expose it as SQL, so that you can write SQL queries against some of these SaaS applications instead of having to deal with REST APIs, which, you know, are fine, but we don't love them that much. Sling is a wonderful little tool that's kind of like a Fivetran-type tool.
Behind that we have storage, and again we have a lot of options. There's the time-tested and true Postgres, which has been around for what feels like a hundred years now. It works pretty well; it's maybe not the best tool for large volumes of data when you want to do analytical processing, but I think it's a very solid option. There are ClickHouse and DuckDB, both much better suited to analytical queries, and Parquet and Arrow files are available as well. These are wonderful, highly performant ways to write data to disk.
They have compression schemes, so all the things you want out of a modern file format. And then there's Databend, which is an open-source data warehouse in a box, in some sense. I haven't used it much myself, but I've heard great things, so it's another wonderful open-source tool to investigate. Once the data is in storage, we typically want to start transforming it, and I think dbt is the canonical standard by which we transform data in a data warehouse or in a tool like DuckDB or Postgres.
Josh Wills has created a great tool called dbt-duckdb, which is a wonderful adapter for dbt. Outside of that, there are also Polars and Arrow DataFusion if, for some reason, you don't want to write SQL. For orchestration, Dagster is, I think, probably the best, but there are many others in this space as well, Airflow being the most obvious one. And then on the visualization side we have Superset, we have Evidence.dev, we have Lightdash.
Project FeederWatch is a project by the Cornell Lab and Birds Canada that collects checklists of bird observations submitted by people, and every three or four years they publish a whole new checklist. These checklists can be quite large, as you can imagine, one row for every bird observation by a person, and so they can be up to a gigabyte each, and there are multiple of them. They're compressed, they're in the cloud as zip files, and so we'll download those. There are also site descriptions and species translations.
These are, you know, mapping files that you want to combine with the checklists so you get a fuller picture of the data. Then we have a PopSQL demo database; we'll just hit this demo database that's in the cloud as an example of a way to get data out of Postgres. And finally there's Mastodon; we'll use that as an example of a REST API that we want to pull data from using SQL. And without further ado:
What I'll do is pull up the repo for this project. First things first, I'm going to start Dagster by typing dagster dev in the terminal. That starts Dagster, and then I'll pull it up in my browser by visiting localhost:3000. The first thing we see here is an overview of all the assets that I've created. If I click on the global asset lineage, you get the full picture. We have the data coming in from the checklists.
This is using plain old Python: it downloads the file from S3, unzips it, and extracts the CSV, and we do that for two checklists. We do it for the site-description data and for the species data. Once we have that, we also use Sling to get the events and tickets data from that Postgres database in the cloud, and finally we hit the Mastodon API and get bird toots, which is really just the birds hashtag on a bird-related Mastodon instance.
Next up we have the prepared section. Here what we're going to do is take all these datasets we've created, all these raw files that we saved to disk, and ingest them into DuckDB so we can then do some aggregation and some more fun stuff with them. So we ingest the data first, and then eventually we combine it into the all_birds data, which unions and joins all the birds, and from there we create something like a top birds by year, just as an example of the kind of aggregation you might want to do.
For this entire pipeline, I can click Materialize All and it'll start from the beginning, and if I click on the run, you can watch its progress. What it's going to do is exactly what I just described: it's going to download the data, hit the API, hit Postgres, run using all the tools we've mentioned, ingest that data into DuckDB, and make it available to the dbt models.
While that's running, the next thing I'm going to do is show you some of the code I used to build this, and the first thing I'll walk you through is the resources section of our code. This is where we define how we want to connect with the outside world from within Dagster. So we have a resource, for example, for DuckDB, and we have resources for Sling and for dbt.
All of these are quite simple to define. DuckDB is already supported by Dagster, there's a library for it, and so really in one or two lines we've defined a DuckDB resource. We give it the path to the database file we want it to save our data in, and that's pretty much it when it comes to DuckDB.
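As a rough sketch (the file path here is an illustrative placeholder rather than the repo's exact value), that resource definition looks something like this:

```python
from dagster_duckdb import DuckDBResource

# A DuckDB resource pointing at a local database file.
# The path is a placeholder, not the project's actual location.
duckdb_resource = DuckDBResource(database="data/birds.duckdb")
```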
Now for Sling, here we have some wrapper code around an existing library, because Dagster doesn't come with a Sling integration built in.
We have to write a little bit more code, but as you can see, it's not much. We define a function that we want our assets to call; we say, give us a source table and a file, and we'll write to that file. By default we do full refreshes, which just pulls all the data from the source table, but we could conceivably do incremental loads as well, if we chose to.
We pass a config to the CLI and then ask it to run, and we can even process the messages coming in from standard out to find patterns we care about. For example, here we're getting the number of rows, and we output that as metadata. That's it on the Sling side.
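A minimal sketch of that kind of wrapper, assuming a hypothetical sling CLI invocation (the connection name, flags, and log format below are guesses, not the project's actual code):

```python
import re
import subprocess

from dagster import Output

def sling_sync(source_table: str, target_file: str) -> Output:
    """Sync one source table to a local file with the sling CLI."""
    cmd = [
        "sling", "run",
        "--src-conn", "DEMO_POSTGRES",           # hypothetical connection name
        "--src-stream", source_table,
        "--tgt-object", f"file://{target_file}",
        "--mode", "full-refresh",                # default to a full refresh
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)

    # Scan stdout for an "N rows" style message and surface it as metadata.
    match = re.search(r"(\d[\d,]*) rows", proc.stdout)
    rows = int(match.group(1).replace(",", "")) if match else 0
    return Output(None, metadata={"rows_synced": rows})
```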
dbt, because it's again supported by Dagster, takes just a couple of lines: we give it the path to our dbt project.
We tell it to parse the project and then create a manifest that we'll use in our assets, and that's really it.
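In dagster-dbt terms, that setup is roughly the following; the project directory and manifest path are placeholders, and the manifest itself is the file dbt writes under target/ when the project is parsed:

```python
from pathlib import Path

from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

DBT_PROJECT_DIR = Path("dbt_project")                  # placeholder project path
dbt_resource = DbtCliResource(project_dir=str(DBT_PROJECT_DIR))

# manifest.json is produced by parsing the dbt project (e.g. `dbt parse`).
DBT_MANIFEST = DBT_PROJECT_DIR / "target" / "manifest.json"

@dbt_assets(manifest=DBT_MANIFEST)
def dbt_models(context: AssetExecutionContext, dbt: DbtCliResource):
    # Run `dbt build` and stream events back to Dagster as materializations.
    yield from dbt.cli(["build"], context=context).stream()
```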
When we get into the assets, that's where the interesting stuff starts to happen. First, we've defined a function that downloads the data from S3, extracts it from the zip file, and then saves it to a location depending on the file name. That's the helper function itself; it's not used by Dagster except through these assets, and so we have these assets here, which are really the representation of that file.
So we create an asset by using the asset decorator. We use compute_kind to give it that nice little tag in the asset graph, we also give it a group to make it a little nicer to see in the UI, and then all we really do is ask it to download and extract the data; that's the helper function we had just above. We pass in some context and some constants so that it knows where to find the file, and that's really it.
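A hedged sketch of that pattern follows; the URL, paths, asset name, and group name are made-up placeholders rather than the project's actual values:

```python
import zipfile
from pathlib import Path
from urllib.request import urlretrieve

from dagster import asset

RAW_DATA_DIR = Path("data/raw")  # placeholder output directory

def download_and_extract(url: str, target_dir: Path = RAW_DATA_DIR) -> None:
    """Download a zip file and extract its contents into target_dir."""
    target_dir.mkdir(parents=True, exist_ok=True)
    zip_path = target_dir / Path(url).name
    urlretrieve(url, zip_path)            # fetch the archive
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)         # unzip the CSV next to it

@asset(compute_kind="python", group_name="raw_data")
def checklist_2023() -> None:
    # Hypothetical URL standing in for the real FeederWatch file in S3.
    download_and_extract("https://example.com/feederwatch_checklist_2023.zip")
```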
The asset downloads that file, extracts it, and then we're sort of done. We repeat that once for this checklist and again for the other checklist down here, and from there we can download the other files, for example the site-description data, the species-translation data, and so on and so forth. Once we have all that data put together, the next thing we want to do is join it in DuckDB, because right now it's all just sitting on disk as CSV files.
Take the two checklist files: we want to combine those. We have these two very large checklists, one for 2020 and one for 2023, but we don't want two separate tables, so what we can do is use our DuckDB resource to first read each into its own table and then create our birds table, which is just a union of the two. For reference, this is about 17 million rows, and it runs on my laptop in under a minute; that's really the power of the tools we have in our hands today.
And so all this code does is read the files using a CSV reader, join them, and create the tables. We do that once for the birds table, and then we do it again for species and again for the site descriptions.
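Conceptually, with DuckDB's CSV reader the birds table is built with something like the following; the file names, table names, and database path are illustrative, not the repo's exact ones:

```python
import duckdb

# Connect to the same database file the Dagster resource points at (placeholder path).
conn = duckdb.connect("data/birds.duckdb")

# Read both checklist CSVs and union them into a single birds table.
conn.execute("""
    CREATE OR REPLACE TABLE birds AS
    SELECT * FROM read_csv_auto('data/raw/checklist_2020.csv')
    UNION ALL
    SELECT * FROM read_csv_auto('data/raw/checklist_2023.csv')
""")

# The species and site-description tables follow the same pattern.
conn.execute("""
    CREATE OR REPLACE TABLE species AS
    SELECT * FROM read_csv_auto('data/raw/species_translation.csv')
""")
```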
So now we have these three tables in DuckDB, and we can move on to the toots. Toots are like Mastodon's version of tweets.
We hit the Mastodon API, and we want to query that API to get, you know, the top toots or whatever it is. To do that we run Steampipe. Steampipe is a command-line tool that runs locally on my laptop, so we have a subprocess that runs it, and then we pass it a query. Let's take a look at that query to see how simple it really is. We're writing SQL, and this is being converted into API calls to Mastodon.
So we're saying: select the content from Mastodon toots, give us a thousand of them, and then match it with some regex to find hashtags.
This is very interesting, because now we can write complex queries that are difficult to express as REST API calls but very simple to write in SQL, and then we can say: give me the aggregate of that, ordered by the count and by hashtag. So we get this nice SQL transformation layer, and all we have to do once we've written the query is pass it into Steampipe, and Steampipe will translate it and write the result to a CSV file for us, which is really, really nice.
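A rough sketch of that call is below; the Steampipe table name, the regex, and the output flag are assumptions standing in for the real query in the repo:

```python
import subprocess

# SQL that Steampipe's Mastodon plugin turns into API calls. The table name
# and regex here are guesses at the shape of the real query, not copies of it.
QUERY = r"""
with toots as (
  select content
  from mastodon_toot_local
  limit 1000
)
select lower(m.tag[1]) as hashtag, count(*) as ct
from toots, regexp_matches(toots.content, '#(\w+)', 'g') as m(tag)
group by 1
order by ct desc
"""

def fetch_toots(output_path: str = "data/raw/toots.csv") -> None:
    """Run the query through the steampipe CLI and save the result as CSV."""
    result = subprocess.run(
        ["steampipe", "query", QUERY, "--output", "csv"],
        capture_output=True, text=True, check=True,
    )
    with open(output_path, "w") as f:
        f.write(result.stdout)
```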
The next big step is to take that CSV file and, again, just create a table, so nothing too interesting here; we're just creating a DuckDB table from that CSV file. And finally, we're going to use Sling to get data from Postgres. Like I mentioned, we have this Postgres database in the cloud; it has, you know, events and tickets and that type of stuff, and what we want to do is sync that data from Postgres into a CSV that we can then load into DuckDB as well.
In just one line we can do that: we run our Sling sync, we give it the name of the table and a path to save the files to, and that's it. It's done, it saves the files, and we get some metadata around it as well. Once that's done, we can again create our tables in DuckDB by reading that CSV file, and we do that for the tickets and the events, and that's really it.
The last step here is to just run dbt, and because dbt is reading from DuckDB, Dagster knows this runs only after all the source data has been ingested and transformed and is ready for dbt to consume. If you look at our dbt models, there's not a whole lot there. We just have, for example, this all_birds model I created as a simple example; all it does is take the birds, the sites, and the species and combine them into one all_birds table that we'll continue to use in our BI tool.
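For completeness, the assets and resources above all get registered with Dagster so that dagster dev can discover them. This is a hedged sketch rather than the repo's exact wiring, reusing names from the earlier sketches:

```python
from dagster import Definitions

# Wire the assets and resources together; the names come from the earlier
# sketches and stand in for the project's actual module layout.
defs = Definitions(
    assets=[checklist_2023, dbt_models],  # plus the other checklist, sites, species, toots, and Sling assets
    resources={
        "duckdb": duckdb_resource,
        "dbt": dbt_resource,
    },
)
```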
Okay, so I'll exit out of this and go back to the Dagster UI. As you can see, it's now all complete, everything is done, and if we take a look here, it says it finished in about a minute, which is incredible, because this birds table, for example, is 17 million rows, and it was able to not only join that data but aggregate it and do all this fun stuff with it.
So that's quite interesting. The next thing we're going to do is take a look at the output of all this data. We created these tables with Dagster; now what we can do is open Evidence, which is a visualization tool for, you know, analyzing and looking at data. It can connect directly to your DuckDB instance, so I'll pull it up now, and as you'll see, it loads data from DuckDB and creates these live outputs using the queries we've written.
Here we're selecting from all_birds, for example, and we can see that table of our data. We can do top toots, we can do bar charts, we can do all kinds of stuff, and it's reading this directly from DuckDB. And it's so easy to update: we can go back into our terminal here and pull up the index file, and this is really all the code being used to generate the query and the visualization we just saw.
In here I have a very simple SQL statement, select star from all_birds, and that's what is being rendered right here. I can go ahead and change this, for example, to 100 rows, and if I go back, you can see it's already been updated and we've got 100 rows now. So that's really the power of this entire stack: as you can see, it works beautifully for getting data, transforming it, running it, and visualizing it, all locally on a laptop, all with free, open-source tools.
And I think that is really powerful and very exciting. So I hope you liked it. If you want to dig in more, you can go check out the repo for this project; it's called mdsfest-opensource-mds, and it has all the instructions you need to install it and run it locally on your own laptop. Thanks, and enjoy!