From YouTube: Build a poor man’s data lake from scratch with DuckDB
Description
DuckDB is so hot right now. Could it replace our cloud data warehouses or data lakes?
Pete Hunt builds a data lake from scratch with DuckDB and Dagster. Follow the tutorial here: https://dagster.io/blog/duckdb-data-lake
Combined with Dagster, S3, and Apache Parquet, DuckDB can become a powerful, multiplayer data lake that can serve the needs of many organizations with very little effort. Think of it as a free, simple SQLite.
We can see the rise of DuckDB for subsets of workloads that don’t have massive scale and can take advantage of its simplicity and fast performance.
Give it a go!
Actually, you already know what DuckDB is. It's like, so hot right now; everybody's talking about it. You can think of DuckDB as kind of like a SQLite for analytics. I think there are a couple of reasons why people like DuckDB and why it's so hot right now: it's very feature-rich, it's very fast, it runs locally, it integrates really nicely with Python, and it has a bunch of features that make it just really useful. And it's all MIT licensed.

I think the bigger reason, though, and the underlying reason people like it so much, is that it's just a breath of fresh air: it seems really simple and lightweight, and it makes sense. The modern data stack in 2022 is super complex and requires a bunch of different components, and DuckDB is just a real breath of fresh air. It's simple and it's free, and I think people like it for that reason. So we're going to talk about building a data lake on top of DuckDB. But before we do that,
we should probably talk about why this is even a good idea. I told you about a bunch of features that DuckDB has and a bunch of reasons that DuckDB is good, but why should we even embark on this project?
Well, despite the limits of DuckDB, specifically that it runs on a single machine, when it's combined with a couple of other technologies, like an orchestrator such as Dagster, S3, and the Parquet format, you can really build a data lake that scales pretty far, because there are so many high-performance machines available on AWS. DuckDB runs on a single machine, so if a single machine can't churn through all the data fast enough, then you can't use it.
But the question is: how many organizations are really running lots of queries that need more than, say, 2,000 GB of RAM and 128 CPUs? The answer is that there are probably a lot that do work at that scale of data; I've worked at a bunch of those places, and there are way more of them than you'd think. However, number one, there's a large class of businesses that don't process that much data, and number two,
A
There
are
lots
of
teams
within
large
organizations
that
want
to
take
advantage
of
a
lot
of
the
benefits
of
duckdb,
and
maybe
the
specific
problem
they're
working
on
can
fit
Within
These
constraints
so
and
when
you
get
to
use
duckdb
for
something
like
this,
you
get
the
same.
You
know
SQL
dialect
locally
when
you're
running
your
tests,
as
as
you
would
remote
and
you
have
to
pay
licensing
fees
to
some
SAS
vendor
in
order
to
to
run
your
queries
and
so
before
we
dig
into
this
like
in
depth.
I just want to highlight that this is totally experimental. We're not using it in production; in fact, we're not using any libraries here for this specific problem that are battle-tested or anything. We're going to build this solution from scratch. So think of this as a tutorial to help you learn these technologies, and potentially as a sketch of a future architecture for these types of data lakes.
I think it's a really exciting time to be in data engineering, and this is one of the really exciting areas that I'm pumped up to explore. So here's the thing we're going to build: we're going to take dbt's jaffle shop example, a classic example that a lot of data engineers and analytics engineers use, and we're basically going to plug DuckDB into Dagster, S3, and Parquet, and we're going to call it Duck Pond.
It's going to be like a little mini data lake built on top of these technologies. The code is available on GitHub if you just want to git clone it and look at it, the blog post is available as well, and we're going to get started right now. The first thing we're going to do is create a new Gitpod instance. This is a really cool app that I like; it's called Gitpod.
It creates a cloud development environment for you, with a standard set of packages and package versions and so on, so for tutorials it's really, really convenient. Just go to gitpod.new and it will spin up a container in the cloud for you, with an in-browser Visual Studio Code.
So you'll be working in exactly the same tools that I am, and we won't have to deal with Python dependencies or inconsistencies or anything like that. The first thing we're going to do is get the project started, so I'm going to type pip install dagster, and this is going to install the Dagster command-line interface.
Dagster is what's called a data orchestrator: it's the thing that combines and coordinates all of the different tools in your data stack, and so we're going to start with Dagster. It's convenient because I work there too, but really, your orchestrator does sit at the center of these things. So we're going to start with that.
We're going to be generating SQL strings, because we're using DuckDB, so we need a library to help us create safe SQL strings. And then, finally, for one of our examples we're going to scrape Wikipedia, so we need a couple of libraries in order to do that. We're also going to add some dev dependencies. There's this really cool library, or tool, called LocalStack.
A
If
you're
not
aware
of
it,
it
lets
you
run
a
bunch
of
AWS
like
mock
AWS
in
Services
locally,
so
we're
going
to
use
this
to
run
S3
on
our
on
our
local
machine
or
this
Cloud
Dev
environment.
In
order
to
to
set
things
up,
we
also
need
the
AWS
CLI
and
the
this
dialect
of
it
called
the
AWS
CLI
local,
which
points
basically
the
AWS
CLI
at
local
stack.
So
now
that
we've
saved
this,
we
can
pip
install.
You know what, I'll make the font size a little bit bigger so you can see it. Oops, I should probably switch to the right directory, so I can run pip install -e, and this is going to install all of these different packages. Except I misspelled the SQL escaping library, sqlescapy.
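For reference, the dependency list discussed above might look something like this in a setup.py. This is a sketch, not the tutorial's actual file; the project name is hypothetical, and the exact packages for the Wikipedia scrape are my best guess (pandas.read_html needs an HTML parser such as lxml):

```python
from setuptools import find_packages, setup

setup(
    name="jaffle",  # hypothetical project name
    packages=find_packages(),
    install_requires=[
        "dagster",
        "dagit",
        "duckdb",
        "pandas",
        "sqlescapy",  # safe SQL string escaping
        "lxml",       # HTML parser used by pandas.read_html for the Wikipedia scrape
    ],
    extras_require={
        "dev": [
            "pytest",
            "localstack",    # mock AWS services running locally
            "awscli",
            "awscli-local",  # `awslocal`: the AWS CLI pointed at LocalStack
        ]
    },
)
```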
And so this can take a little bit of time. Fortunately, the Gitpod environment's got a bunch of CPUs and it's in a data center, so it's generally pretty quick.
The next thing we need to do, once this finishes installing, is spin up our LocalStack environment. LocalStack, like I said, is a set of mock AWS services, and it's really convenient. You just run localstack start, which we installed as part of that Python
installation process, and it runs on localhost. So now I can go and create my S3 bucket. Just to give you an idea of what this architecture kind of looks like: we're going to be running a Dagster app in here.
For example, if you're using Snowflake or BigQuery, each of these Parquet files would correspond to a table partition or a full table in those systems. And where BigQuery and Snowflake couple compute and storage, in this architecture we separate them: S3 is going to be our storage, and DuckDB is going to be
our compute. I also want to note that there are a lot of generic S3 implementations, so you don't actually have to be tied to any vendor using this architecture. All of these tools are completely open source, and S3 is a generic API, so you can use DigitalOcean Spaces or MinIO as your storage layer here. So with that, we're going to go and create our S3 bucket.
I think you'll just have to trust me on that, and let's go in and start writing some code. Like I said, the core abstraction here is going to be Parquet files sitting on S3, and we're going to interact with them through SQL.
So first I want to create a little abstraction for representing SQL statements, because we're going to be doing a lot with SQL. I'm going to create a file here called duckpond.py, and in it I'm going to create a little class to represent a SQL statement.
This is basically just a data class, and you can think of it as representing a SQL statement template. So you might want to say select star from $my_table; my_table would be a binding that's passed to this constructor, and the SQL would be the template string. So you would construct it with something like my_table equals the value you want bound. I'll show you that implementation in a bit.
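Here's a minimal sketch of what that class might look like; this is my paraphrase of the tutorial code, not a verbatim copy:

```python
from typing import Any, Dict


class SQL:
    """A SQL template string plus the bindings for its placeholders.

    `sql` is a template like "select * from $my_table", and each binding
    maps a placeholder name to a pandas DataFrame, a Python primitive,
    or another SQL object (for nesting queries).
    """

    def __init__(self, sql: str, **bindings: Any):
        self.sql = sql
        self.bindings: Dict[str, Any] = bindings
```

So the example above would be written as SQL("select * from $my_table", my_table=df).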
Hopefully that makes sense. With this little abstraction, we're going to start actually building an example application. I told you that we were going to do the jaffle shop example, but we're actually going to start with something really, really simple: something that reads a couple of data sets from Wikipedia and then computes some basic statistics on them.
I'm going to just copy and paste some code in here. We're going to use Dagster software-defined assets to model each of those Parquet files, or tables, in our database. We'll start with a population asset.
What this does is read the UN list of countries by population from Wikipedia, using the pandas read_html function. Then, because Wikipedia uses a weird Unicode dash to show negative numbers, we replace that, and finally we return a SQL statement that selects all the columns from that data frame and returns the result. I'm also going to add a second asset that is based on the first asset.
You can see over here that we run a query: we select the continent and the average population change from this population asset, which is referred to up here, then we group by the continent and order it in descending order, from highest population change to lowest.
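A sketch of those two assets, paraphrased from the description above. The module path and the column names (continent, population_change) are illustrative; the real Wikipedia table's columns differ and need renaming in the cleanup step:

```python
import pandas as pd
from dagster import asset

from .duckpond import SQL  # the SQL class sketched above


@asset
def population() -> SQL:
    # Scrape the UN list of countries by population with pandas.
    df = pd.read_html(
        "https://en.wikipedia.org/wiki/"
        "List_of_countries_by_population_(United_Nations)"
    )[0]
    # Wikipedia renders negative growth rates with a Unicode minus sign;
    # normalize it so the column parses as a number downstream.
    df = df.replace("\u2212", "-", regex=True)
    return SQL("select * from $df", df=df)


@asset
def continent_population(population: SQL) -> SQL:
    # Naming the parameter `population` tells Dagster that this asset
    # depends on the `population` asset above.
    return SQL(
        """
        select continent, avg(population_change) as avg_change
        from $population
        group by continent
        order by avg_change desc
        """,
        population=population,
    )
```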
So this is hopefully pretty straightforward. You should go check out Dagster software-defined assets if it's not clear how this works, but basically, this decorator plus the name of this parameter tells Dagster to set up a dependency between these two assets.
The next thing we're going to do is write a test case, so that we're doing a little bit of test-driven development here. Obviously, we haven't implemented anything yet that actually talks to DuckDB, but let's just write a test so we kind of know where we're going.
Let's open up test_assets within the jaffle tests package, and we'll save this. Basically, what we're going to do is call those two assets that we talked about, assert that we get that SQL statement back, then actually run the query and assert that the expected continent is the top one and that the population change matches the known, expected population change.
By the way, this test does call the population asset directly, which does go to the network, so in a real production application we would mock that out. But for the purposes of this tutorial, we're just going to leave that in. You'll also notice that we import this, so far fictional,
DuckDB class: we instantiate an instance of DuckDB, we call query, we pass it that SQL object, which represents the SQL statement, and then we get back a data frame.
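The test might look roughly like this. The module paths and the expected values are hypothetical; the real assertions depend on what the Wikipedia data actually contains:

```python
from jaffle.assets import continent_population, population  # hypothetical module path
from jaffle.duckpond import SQL, DuckDB


def test_continent_population():
    # Call the assets directly, as plain Python functions.
    pop = population()
    assert isinstance(pop, SQL)

    result = continent_population(pop)
    assert isinstance(result, SQL)

    # Actually execute the composed query against an in-memory DuckDB.
    df = DuckDB().query(result)
    assert df["continent"][0] == "Oceania"  # illustrative expected value
    assert df["avg_change"][0] > 0          # illustrative expected value
```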
So we should probably go and implement that DuckDB class. I'm going to go back here into the duckpond module, copy and paste my implementation of the DuckDB class, and we'll take it step by step.
I should probably pull this up here. We import the connect function from duckdb; if you've ever used SQLite from Python, you'll recognize that it tries to mimic that API. Hmm, that's not right.
There we go. So this basically has a constructor, and you can pass in custom DuckDB options, which we'll use during this tutorial. We create an in-memory, or temporary, database, and we load what's called the httpfs extension. This lets DuckDB talk directly to S3, which is actually really important: if you have a Parquet file sitting on S3, or on any kind of HTTP server, DuckDB is smart enough to only scan the parts of the Parquet file that it needs in order to fulfill a query.
This means that if you have a giant Parquet data set sitting on S3, you don't necessarily have to download the whole thing in order to fulfill all your queries; maybe it only downloads certain row groups, or certain columns, that it needs. This is a really great feature of DuckDB, and if you combine it with the fact that it's got really great parallelism, you can do a lot with this architecture.
And httpfs is key to that. So we set that up, we also enable any options that are passed into DuckDB for us, and then we're going to do a couple of things.
First, we're going to collect all the pandas data frames that are referenced by the SQL objects that were passed in. If you recall (let me go back to our assets), we actually pass in a pandas data frame as a binding here to the SQL statement, so we need a way to tell DuckDB to query it. This is one of the magic tricks that DuckDB does, by the way: not every SQL engine can just query a data frame that's sitting in memory in your Python process.
One of the great advantages of DuckDB is that the interop between Python and SQL is great. So we basically have to write a function that collects all the data frames referenced in the query, and then we call db.register, which is a DuckDB function that says: register this data frame as a SQL view.
Then you can query against that view. And then, finally, we're going to run the provided SQL statement. Because it has placeholders in it, the dollar-sign placeholders, we need a function that replaces those placeholders with actual valid SQL, and we'll write that function in a second.
Finally, we return a data frame as a result from DuckDB. By the way, it's worth noting that, according to the DuckDB documentation, queries against in-memory data frames are actually fairly efficient: I don't believe data is copied, or if it is, it's done fairly efficiently. So it's just really cool.
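Putting those steps together, the class might look like this minimal sketch. It leans on two helpers, sql_to_string and collect_dataframes, which are sketched further down; the method and class names are my paraphrase of the tutorial code:

```python
import duckdb
import pandas as pd


class DuckDB:
    def __init__(self, options: str = ""):
        # `options` holds extra `set ...;` statements, e.g. S3 credentials.
        self.options = options

    def query(self, select_statement: SQL) -> pd.DataFrame:
        # A fresh in-memory (temporary) database for each query.
        db = duckdb.connect(":memory:")
        # httpfs lets DuckDB read from and write to S3 (or any HTTP server),
        # fetching only the byte ranges a query actually needs.
        db.query("install httpfs; load httpfs;")
        if self.options:
            db.query(self.options)

        # Register every DataFrame referenced by the (possibly nested)
        # SQL object as a view DuckDB can query.
        for view_name, df in collect_dataframes(select_statement).items():
            db.register(view_name, df)

        result = db.query(sql_to_string(select_statement))
        return result.df() if result is not None else None
```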
So I'm just going to copy and paste this implementation in here, and I'll pull these imports up to the top.
A warning, by the way: we're going to be writing some recursive functions here, so they can be a little mind-bending, but I'll try to walk you through it. We basically need to take a SQL object, which has a string with placeholders in it plus a bunch of bindings, and those bindings can refer to data frames, to primitive Python values (if you want to put a string in your query, for example), or to other instances of this SQL object.
That last case is for when you want to embed one query in another. For example, in here, in this first asset we pass in a data frame, and in this second asset we actually pass in a SQL object, which references the first one. We need to be able to turn all of that into a SQL string that DuckDB actually understands, because DuckDB doesn't understand these placeholders.
So what we do is iterate through all the bindings. If we find a data frame, we reference the view that is going to be registered for it: we create a name, df_ plus the ID of the data frame. Every Python object has a unique ID number associated with it.
If we pass in an instance of a string, or an int, or a float, or any other kind of primitive Python type, we basically just convert it to a string here, and we use the sqlescape function
in order to safely escape strings. The most interesting case is when we pass in another SQL object: we recursively call this function, so we can substitute in a valid SQL string for it.
Finally, in order to do the actual string substitution, we use the built-in Python string.Template class. That's the thing that takes those dollar-sign placeholders and interpolates them. You can look up the Python documentation for it; it's not that commonly used, but it's in the standard library and it fulfills our purpose.
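A sketch of that function, assuming the escaping library is sqlescapy (which exposes a sqlescape function, as mentioned in the dependencies earlier):

```python
from string import Template

import pandas as pd
from sqlescapy import sqlescape


def sql_to_string(s: SQL) -> str:
    replacements = {}
    for key, value in s.bindings.items():
        if isinstance(value, pd.DataFrame):
            # Refer to the view that will be registered for this frame.
            replacements[key] = f"df_{id(value)}"
        elif isinstance(value, SQL):
            # Nested query: recurse and splice it in as a subquery.
            replacements[key] = f"({sql_to_string(value)})"
        elif isinstance(value, str):
            replacements[key] = f"'{sqlescape(value)}'"
        elif isinstance(value, (int, float, bool)):
            replacements[key] = str(value)
        elif value is None:
            replacements[key] = "null"
        else:
            raise ValueError(f"Invalid binding type for {key}: {type(value)}")
    return Template(s.sql).safe_substitute(replacements)
```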
So, the next thing: we've already implemented that, and the last thing we need to do to make this class work is implement the collect_dataframes function. So I'm going to go do that. Whoops.
We just need that type definition, and then I think I can just paste this in here. The collect_dataframes function, like I said, goes from a SQL object to a mapping of view names to data frames. If you remember from before, we created these view names, df_ plus the ID, so we need a function that walks through all of our different nested SQL statements and collects all the data frames that are used in those queries, so that DuckDB can know about them. So we again iterate through all the bindings, and if we see a data frame, we register it here in the return value.
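That function is the simpler half of the recursive pair; a sketch, using the same df_ naming convention as sql_to_string above:

```python
from typing import Dict

import pandas as pd


def collect_dataframes(s: SQL) -> Dict[str, pd.DataFrame]:
    # Walk the SQL object (and any nested SQL objects) and map each
    # referenced DataFrame to the view name used in sql_to_string.
    dataframes = {}
    for value in s.bindings.values():
        if isinstance(value, pd.DataFrame):
            dataframes[f"df_{id(value)}"] = value
        elif isinstance(value, SQL):
            dataframes.update(collect_dataframes(value))
    return dataframes
```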
I'm going to run pytest with the name of my package, and we'll see if it works.
The thing that you might find a little interesting, though, is that we have this working with DuckDB, but we haven't really brought in Parquet or S3 or any of the stuff that makes this a data lake yet. So, in order for us to do that, we need to implement something called an IO manager.
You can look up the documentation for Dagster IO managers, but basically, what an IO manager does is manage passing return values from one asset into another asset. When you're running in a test, for example (if I pop over to this test), you can see that we just call the function and then pass that value into the other asset. That's great for tests: it's really fast and really easy to test. But in a real production environment,
you're going to want to write the value to some storage and then read it back from storage when you actually run the second step, so you don't have to rerun all the steps of your pipeline every single time, and so you can potentially parallelize steps if multiple of them depend on the same piece of data. Dagster provides IO managers, which are the abstraction for doing this.
You register an IO manager with your app, and then it handles serializing and deserializing data, and reading and writing it to storage, for you. This keeps your business logic super clean and lets you transparently swap out your IO, which can be useful in different environments.
So I've just copied and pasted that in here. You can see that when we create an IO manager, we pass it the name of our S3 bucket, we pass it an instance of DuckDB, and then we pass in an optional prefix, to prefix the names of the Parquet files we're going to create. This is not the whole implementation; it's the first step, and I'm going to go through the implementation of this class step by step.
What we need to do is determine the name, or the URL, of the file on S3 for a given asset. Said another way: we've got this asset called population and it returns some data; where in S3 do we store that data?
In order to do that, we can get this context object (I'll show you where that comes from in a second) that represents the current asset, and we can call get_asset_identifier, which will get us a string representing the asset. Then we just construct a little Parquet file URL from the S3 bucket name, any prefix provided by the user, and that ID that corresponds to the asset. Now, by default, that ID is just going to be the name, population, but there are more complex Dagster features where you can add time-based partitioning, for example, and that will make the URL include that information.
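A sketch of the class so far; the two required IOManager methods follow below, and the names are my paraphrase of the tutorial code:

```python
from dagster import IOManager


class DuckPondIOManager(IOManager):
    def __init__(self, bucket_name: str, duckdb: DuckDB, prefix: str = ""):
        self.bucket_name = bucket_name
        self.duckdb = duckdb
        self.prefix = prefix

    def _get_s3_url(self, context) -> str:
        # The identifier defaults to the asset name (e.g. ["population"]),
        # but can include partition keys for partitioned assets.
        if context.has_asset_key:
            asset_id = context.get_asset_identifier()
        else:
            asset_id = context.get_identifier()
        return f"s3://{self.bucket_name}/{self.prefix}{'/'.join(asset_id)}.parquet"
```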
The next thing we need to do is handle writing to that storage. How do we go from the return value to a file sitting on S3? I will show you how; let me just copy and paste this in.
This handle_output function, or handle_output method, actually comes from the IOManager base class, and basically, given a context, which represents the asset, and what I'm calling the select statement (which is just the return value from the asset), it writes that value to storage somewhere.
I'll just walk through this step by step. If we're returning None from our asset (sometimes you've got an asset that's only logging something during debugging, and you don't need to store anything), we just return and do nothing. Otherwise, if we're not receiving an instance of SQL, we throw an exception: this Duck Pond system, with this particular IO manager, expects every asset to return a SQL object.
By SQL object I mean those objects that take a SQL template string and a set of bindings; it expects those to be returned. If you're returning anything else, you should be using a different IO manager. And then, finally, we just run a simple query.
We call query on our DuckDB instance, and this is the magic incantation to save something to S3 with DuckDB. Look at how simple that is: I just say copy this table to this URL in the Parquet format. By the way, the default here is to use the Parquet format with Snappy compression, but you can provide zstd or gzip or no compression as an argument. This is all provided by the httpfs extension that we installed earlier. But how easy is this?
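A sketch of that method, continuing the DuckPondIOManager class from above. Note how the nested SQL object becomes a subquery inside DuckDB's COPY ... TO statement:

```python
class DuckPondIOManager(IOManager):
    # ...continuing the sketch above...

    def handle_output(self, context, select_statement: SQL):
        if select_statement is None:
            return  # nothing to store (e.g. a debug-only asset)
        if not isinstance(select_statement, SQL):
            raise ValueError(
                f"Expected asset to return a SQL object; got {select_statement!r}"
            )
        # DuckDB streams the query result straight to S3 as Parquet
        # (Snappy-compressed by default).
        self.duckdb.query(
            SQL(
                "copy $select_statement to $url (format parquet)",
                select_statement=select_statement,
                url=self._get_s3_url(context),
            )
        )
```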
This is so great, and this is all the code that we need in order to write to our data lake. We're combining the great features of DuckDB, which are themselves built on the great features of Parquet, with the great abstractions provided by Dagster, to make this incredibly simple. And the read path is, in my opinion, even simpler.
Just like we have a handle_output method, we have a load_input method, which takes a context for the asset that's trying to load a parameter and returns the value of that asset. We create a new SQL object with a template string, and there's this read_parquet function in DuckDB that again talks to httpfs and does all of that magic around only reading the parts of the file that it needs, using HTTP range requests. All we do is pass it the URL of the file that we saved to up here.
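A sketch of the read path, again continuing the class above; in Dagster, an input's context exposes the upstream output's context, which is what lets us rebuild the same URL:

```python
class DuckPondIOManager(IOManager):
    # ...continuing the sketch above...

    def load_input(self, context) -> SQL:
        # Hand the downstream asset a query that reads the upstream
        # asset's Parquet file; httpfs fetches only the ranges needed.
        return SQL(
            "select * from read_parquet($url)",
            url=self._get_s3_url(context.upstream_output),
        )
```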
You can see that the URL we wrote to is the same as the URL we're reading from. And I think one thing that makes this really cool is that in our tests, we just pass those values directly into the function; we don't have to go through any sort of IO process, which makes your tests really fast. But then in production, we rewrite the SQL a little bit, totally transparently to the user, and you get the exact same semantics, but you also get a big, scalable data warehouse, which is really cool. Or at least a reasonably scalable data warehouse.
So now, before we're able to run this whole thing end to end, we need to actually connect this IO manager to our project. And actually, I think we have to get our indentation right first. Yeah. So, we have to connect our IO manager to our project.
For example, we need to figure out: what is our bucket name? How do we get an instance of DuckDB? Do we want to provide a prefix? And so on. So we're going to head on over to this repository.py file.
A repository is kind of what Dagster calls a project, and we're just going to add a couple of lines of code here. It's actually probably the most lines of code in this whole project, but hopefully they're pretty straightforward; no crazy recursion in here. First I need to add a couple of imports, and then I'm going to add what's called a DuckDB resource.
A resource in Dagster is basically a connection to some external service, and in this case it's connecting our in-memory DuckDB to the remote S3 instance: it's providing the credentials for S3, the URL, and so on. Well, I guess not the bucket name (we'll do that later), but it provides the access keys and such.
This is basically just the set of variables that we're going to set inside DuckDB in order to configure the httpfs extension to talk to S3. LocalStack uses an access key ID and secret that are both just "test".
We point it at localhost:4566, which is where LocalStack is running, and, just because of the way that S3 URLs are constructed and how SSL and DNS work, we need to set two more variables: one to disable SSL and one to use the path style rather than the subdomain style.
That's not super important, but basically, with real S3 you'd otherwise have to configure a bunch of DNS to make it work. Calling this .configured method takes the resource and applies a configuration to it, so now we have a new resource called duckdb_localstack, and we're going to use that resource in a second.
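A sketch of what that resource definition might look like. The s3_* settings are real DuckDB options for the httpfs extension; the resource name and config shape are my paraphrase:

```python
from dagster import resource


@resource(config_schema={"options": str})
def duckdb_resource(init_context):
    # Wrap the DuckDB helper as a Dagster resource so it can be
    # configured differently per environment.
    return DuckDB(init_context.resource_config["options"])


# DuckDB `set` statements pointing the httpfs extension at LocalStack:
# access key and secret are both literally "test", the endpoint is
# localhost:4566, SSL is off, and URLs use the path style.
duckdb_localstack = duckdb_resource.configured(
    {
        "options": (
            "set s3_access_key_id='test';"
            "set s3_secret_access_key='test';"
            "set s3_endpoint='localhost:4566';"
            "set s3_use_ssl=false;"
            "set s3_url_style='path';"
        )
    }
)
```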
The next thing we need to do: just like we brought in DuckDB connected to LocalStack and added it to our project, we also have to connect the IO manager that we wrote to our project.
You should just read the documentation for how this works, but we define an IO manager, which instantiates the class that represents our IO manager. We pass in the name of our S3 bucket, which is called datalake (you could potentially provide this via runtime config if you wanted to). We say that this IO manager requires a resource called duckdb, and then we pass that duckdb resource, which is an instance of the DuckDB connection, to our Duck Pond.
And then I just need to say that I want to apply the DuckDB instance and our Duck Pond IO manager to all of the assets in my project.
What I basically did was take load_assets_from_package_module, which loads all the assets in our project, and then with_resources says: hey, any time you want to use an IO manager, use the Duck Pond IO manager instead of the default one. And we also provide a resource called duckdb, which points at our LocalStack version of DuckDB, and that resource is used by, and passed into, the IO manager.
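Pulled together, the repository wiring might look like this sketch; the package and repository names are hypothetical:

```python
from dagster import (
    io_manager,
    load_assets_from_package_module,
    repository,
    with_resources,
)

from jaffle import assets  # hypothetical package holding the asset modules


@io_manager(required_resource_keys={"duckdb"})
def duckpond_io_manager(init_context):
    # The bucket name is hardcoded here, but could come from runtime config.
    return DuckPondIOManager("datalake", init_context.resources.duckdb)


@repository
def jaffle_repository():
    return with_resources(
        load_assets_from_package_module(assets),
        resource_defs={
            "io_manager": duckpond_io_manager,
            "duckdb": duckdb_localstack,
        },
    )
```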
So that's kind of a lot of typing, and a lot of keys for things, and I know that can be a little annoying when you first type it, but there are a lot of advantages here. Like, if you have a really complex project and you want some of your assets to be stored in one data lake and some other assets stored in another data lake:
for example, say you want to put some in the US and some in Europe, or some in your own data center with MinIO or something, Dagster gives you the flexibility to do that in a really elegant way, by just providing these resources to the proper set of assets. All right, so I think, if we've done everything correctly, we have created our example application and wired up our IO manager. Let's actually try running this thing.
You know, for real, like in production. So I just ran the command dagit; this is basically a user interface for Dagster. And if I go over here... all right, so I just had a quick problem with Gitpod, but I found a solution for it: you can just run gp url 3000, and this prints out the URL that you can use to access that port on your instance. And so you can see here, this is the Dagster
UI, and I've got two assets, the population asset and the continent population asset, and I can smash this Materialize All button, and it launches a run. Now we have Dagster running that computation for us: we fetch the data from Wikipedia and parse it with pandas, then we use DuckDB to query that data frame, take the results, and store them as a Parquet file on S3. Then we have the derived data, the continent population
asset. So now I can run awslocal, which is our AWS CLI pointed at LocalStack, and just ask: what is in my data lake bucket? You can see we've got those two Parquet files, and if we want to actually take a look at them, I'm going to need to install a quick dependency here to do it.
Yeah, this pyarrow dependency is just going to be used right now to view these files. I can download the file: I can say awslocal s3 cp with the path to the continent population file in the data lake bucket. Okay.
And so you can see now I've got that continent population Parquet file sitting there, and I can use pandas to read it. So I can say import pandas as pd, then read_parquet on the continent population file, and, if I match up my parentheses appropriately, yeah, you can see that we've got our data right here, created from our Dagster job.
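In other words, something like this; the local filename is whatever you gave the awslocal s3 cp command:

```python
import pandas as pd

# Inspect the Parquet file we just copied down from the (LocalStack) bucket.
df = pd.read_parquet("continent_population.parquet")
print(df)
```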
So I want to just recap what we saw. We've just built, in a fairly short amount of time and with not that many lines of code, a pretty useful data lake that lets you compose together different SQL queries, efficiently query data sitting on S3 in Parquet, orchestrate all of that with Dagster, and write tests against it efficiently.
This can very easily grow to more complex apps, and if you continue through the blog post, we implement a more complex one, the jaffle shop example. I'm not going to take you through all of the details here, but you can see in the blog post that we basically take the patterns we walked through here and expand them out to something that resembles an e-commerce store.
And so, as you can see, rather than reading from Wikipedia, we read from an example data set that's a CSV file, we do some additional pandas transformations, and our SQL queries are a bit bigger and a bit more complex, but it's basically all the same stuff that I showed you in that example. Maybe the one little addition in this final example is that the DuckDB resource that we pass in,
we actually use it to query directly at the end, so we can materialize our query results and inspect them. And so, yeah, this has been a really interesting exploration for us. Sandy and I have had a lot of great discussions about it, and even members of our community have come upon similar ideas. So with that, I want to just leave you with a thank you for sticking to the end.