From YouTube: Troubleshooting your Data Workflows: a live debugging session using Noteable and Dagster
Description
Data engineers waste a lot of time troubleshooting long-running pipelines and know only too well the frustration of minor errors consuming hours of work. In this practical tutorial, we will demonstrate an innovative solution for dramatically shortening testing cycles and reducing the number of reruns required, boosting developer/practitioner productivity, and reducing frustration on the team.
Join Noteable's CTO & Co-Founder Matthew Seal and Elementl's Jamie DeMaria for this virtual event.
Matt: All right, so one of the things we wanted to introduce here is troubleshooting data workflows with Noteable and Dagster. We really want to walk through a live example, guide folks through how this can be achieved, and let you follow along with some of the blog posts and the content we're walking through. So, from the beginning, I want to talk a little bit about the pain points you can run into doing data pipeline work. One of the really big pain points when we're doing ETL, and especially scheduled ETL, is that when you have errors to troubleshoot, the feedback loop between the error and the data engineer can get very slow and long. You could have a data pipeline step in the middle that takes hours or even days to fully execute.
So when you get errors, you often have to get creative about how you reproduce a minimal copy of what happened in order to fix the problem. Related to that, this creates a really fragile chain of tools that's owned by multiple teams. Let's take, as a real example, something I debugged in the past and helped people walk through: it involved three different teams, and the root of it was a fragile chain of tools owned by many teams.
Something goes wrong in your ETL and you have to track it down. Well, the problem is your Tableau report isn't refreshing. Okay, you go back and look at the SQL extract query, and ask where that data is coming from. You look, and it's actually just pulling from a copy of data that came from Druid. So you go to Druid, and what Druid pulled was ultimately sourced from data populated by Spark, and you keep going back and back and back, all the way to the original event.
So part of this is to walk through how, when you use notebooks, you can actually capture the intent of what was executed. You can have the collection of queries, and on top of that, when we go through the demo here, we'll show you that once you combine Noteable and Dagster, you can actually land in a live session that has the real context in memory for you to play with and manipulate. This gives a lot better visibility into the tool chain.
A
You're
using
it
allows
you
to
share
with
your
your
constituents
that
are
using
that
content
to
manipulate
and
edit
it
to
their
own
needs,
and
it
reduces
the
friction
for
for
having
to
have
lots
of
data
engineers
in
the
middle
for
many
types
of
data
recovery
and
problems
and
we'll
kind
of
walk
through
a
few
examples
here
as
we
go
and
what
we're
using
today,
just
to
kind
of
outline
we're
using
git
pod.
Otherwise,
you
can
use
a
local
virtual
environment
if
you're
comfortable
with
that
pod's
a
great
place
to
start
for.
A
If
you
want
to
get
a
clean
slate,
that's
going
to
be
consistent,
we're
using
dagster,
which
is
the
the
primary
product
here
built
by
Elemental
and
then
we're
using
some
other
open
source,
libraries,
Paper
Mill
and
then
some
extensions
that
talk.
Allow
paper
mill
to
talk
to
to
notable
paper
mill
is
the
Headless
notebook
executor
that
runs
in
Python,
so
sway
the
Run
notebooks
programmatically,
and
we
have
a
nice
little
plugin
that
just
slots
right
into
that
technology
called
origami.
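
For readers following along, here is a minimal sketch of the kind of programmatic execution Papermill does; the notebook paths and the parameter are illustrative placeholders, not the workshop's actual files.

```python
import papermill as pm

# Execute a notebook headlessly: Papermill reads the input notebook, injects
# the given parameters into a copy, runs it, and writes the executed output.
pm.execute_notebook(
    "analysis.ipynb",         # placeholder input notebook
    "analysis-output.ipynb",  # executed output copy
    parameters={"dataset_url": "https://example.com/iris.csv"},  # hypothetical parameter
)
```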
So, on shortening that feedback loop of errors you can run into: one of the things you often need to do is go inspect the most recent run, meaning what was compiled and what is being executed at the end of a particular pipeline. What often happens is that you'll get an error, and most of the time the error is a trivial one.
A
It's
in
something
like
a
column,
got
renamed
or
there's
a
new
row
in
the
data,
that's
causing
a
problem
or
something
along
those
lines
and
to
be
able
to
live
troubleshoot
this
on
the
actual
data
that
you
had
pulled
in
and
collected
locally,
in
order
to
evaluate,
what's
different
with,
if
the
air
Message
doesn't
tell
you
immediately,
it
can
be
really
valuable
and
really
shortening
this
feedback.
Loop
saves
you
a
ton
of
time,
saves
you
a
ton
of
resources
over
time.
We can say this with confidence: in places where we built internal tools for doing these types of things, not even to this extent, just having visibility into the rerun, let alone a live session, it saved a ton of time. So let's talk a little bit about how we're going to achieve this and what the relationship is between these tools. We've talked about Papermill, I've talked about Dagster, and we haven't talked a whole bunch about Noteable yet.
A
Do
you
might
be
familiar
with
that?
As
the
is
the
scheduling
dag?
It
has
a
concept
of
asset
resolution
where
it's
going
to
go,
find
and
build
you
assets
based
on
definitions,
you've
provided
in
order
to
accumulate
workflow
notable,
is
a
notebook
platform
that
really
provides
a
great
enhanced
experience
over
what
you
typically
get
in
open
source
Jupiter,
it's
based
on
Jupiter
under
the
hood
and
it
it
runs
through
and
does
a
bunch
of
quality
of
life
improvements,
as
well
as
a
lot
of
features
that
are
well
and
Beyond.
A
The
scope
of
a
notebook
into
the
scope
of
data
engineering
needs
and
data
analyst
analysts
sorry
analytics
needs
on
the
the
UI
side.
It's
a
really
great
platform
for
turning
your
exploratory
work
into
production,
work
and
really
kind
of
working
through
a
lot
of
the
the
issues
that
you
would
have
in
clearly
running.
A
The
open
source
offerings
in
Notebook
space
Here, Papermill in the middle is going to be doing the translation of the parameters Dagster is providing: it's going to apply those against the notebook version and give you an immutable copy of the notebook run that materializes the results of the asset. On failure, you'll have a window of time to go log in and play with the live context that has the error, before it automatically shuts down.
So, for the workshop here, we're going to productionize the Jupyter notebook that analyzes Iris data. We're going to start with just a basic example using a Jupyter notebook and see how that would work in Dagster; we're going to do some introduction to the data pipelines that are used, in particular task-focused versus asset-focused pipelines; and then we're going to cover data pipelines in Dagster and the APIs and concepts around them.
Jamie: Great. So yeah, like Matt said, we're just going to start off with a bit of an overview of what data pipelines are, and we'll go through the process of designing the data pipeline that we're eventually going to be implementing in the more hands-on workshop portion of this. So, just starting off in a broad sense.
A really classic example of a data pipeline is the ETL pipeline. The first step of this pipeline is that we fetch data from an external source; then we might need to do some transformation on the data, to clean it up, or join data sets together, or whatever we may need to do to get our data into a usable form; and then we need to store that data in a data warehouse.
So today, in the workshop, what we're basically going to be doing is analyzing the canonical Iris data set within a Jupyter notebook, and we want to make the process of doing that analysis and running the Jupyter notebook part of our data pipeline. So we're going to start, in this design process, by replacing the steps of this template ETL pipeline with the steps we'll need to complete our Iris analysis and make that a self-contained data pipeline.
The first thing we'll need to do is actually fetch the Iris data set, and we may do that with a tool like Airbyte, which allows you to ingest data easily without writing a bunch of custom API calls. We're going to take a little bit of liberty here, because the canonical Iris data set doesn't really change, but we want to make our example pipeline feel more like a real-world example.
We're just going to pretend for a little bit that there's a group of scientists consistently publishing new data about different species of flowers to a public database, and we always want to be doing our Iris analysis on the latest data. So, at the start of our pipeline, we're going to be refetching the data from this database, and then, once we have the data, we want to transform it or clean it up.
B
All
of
this,
like
data
that
we've
received
from
the
flowers
database
into
the
data
we
want
to
analyze
and
for
this
step
we
might
use
some
kind
of
specific
data
transformation
tool
like
DBT
and
the
last
step
stays
the
same.
We
load
our
data
into
a
data
warehouse
and,
let's
say
we're
using
snowflake.
The next thing we need to do is actually add our Jupyter notebook into this pipeline. So let's add a step at the very end that fetches the data from Snowflake and executes the Jupyter notebook, and this Jupyter notebook is going to do our actual analysis of the Iris data. Through this initial design of our data pipeline I've mentioned a couple of different tools, like Airbyte and dbt, and if you aren't familiar with some or all of them, that's completely fine.
So let's take a step back and think about what might happen once this pipeline is up and running. Pretty soon, we might get a request to analyze daffodils too. So we'll add some more code to the step that fetches data from the flower database to also get the daffodil data set, we'll update the remaining tasks to handle this data as well, and we'll end up with a new table in our Snowflake database.
Our task to fetch the flower data from the flower database produces a table of data for each species of flower; the data we're storing in Snowflake is exactly the transformed data from our dbt tasks; and, finally, the step to execute the Jupyter notebook produces an executed notebook file, and that executed notebook file is what we're actually going to be sharing around with our co-workers.
So these data sets and the Jupyter notebook, in these green boxes, are what we actually care about in this data pipeline, and these are the data assets. If you'll remember from earlier, a data asset can be any deliverable from a data pipeline; for example, in this case, we've got tables and we've got a Jupyter notebook.
A data asset doesn't necessarily have to be what's produced at the very end of your data pipeline; here, we're producing data assets at every step of our data pipeline. So, now that we've uncovered these assets in this pipeline, let's try constructing the same pipeline, but with our focus on the data assets rather than on the tasks.
The first thing we're going to need to do is figure out what objects we'll be modeling in the pipeline. Let's start by focusing on our legend: we'll need a way to represent our external source data, which in this case is the flowers database; we'll need a way to represent the data assets we'll be creating in our pipeline; and we'll need the connections, or edges, in our pipeline.
The edges will connect assets that have data dependencies, meaning that the asset at the end of the edge requires data from the asset at the beginning of the edge. Finally, it'll still be useful to understand how each asset is created, so we'll also document the operation required to create each asset along the edge connecting the assets.
Overall, this pipeline looks larger, but it's actually doing the exact same thing as the task-focused data pipeline; this one is just a lot more descriptive and aware of the data that it's processing. We can look at this and, at a glance, know exactly what data is available to work with and how those data sets relate to each other.
Before we move on, let's do a quick visual comparison of these two data pipelines we've designed. There's a lot going on on this slide, but the main point is to demonstrate that when we model our data pipeline focused on the tasks we want to execute, we get a data pipeline that looks dramatically different from one where we focus on the assets the pipeline should produce. There are some situations where a task-focused data pipeline is the correct approach.
But at Dagster we tend to believe that the majority of data pipelines should be asset-focused. Again, the point of a data pipeline is to produce data assets that you use at your organization, so it makes sense that it would be helpful to have those data assets be the primary thing you model when you think about your data pipeline.

By modeling your data pipeline with tasks, you actually end up increasing your cognitive burden, because you may have to relearn what the data pipeline does, and that increases the amount of time you have to spend getting your job done. Conversely, when you have your data pipeline modeled with assets as the primary object, you get immediate insight into all the data you have available, and it can be much easier to figure out where a new task fits into the pipeline.
So how does Dagster fit into all of this? Dagster provides a framework to build and execute asset-focused data pipelines, and Dagster also has support for task-focused data pipelines, since, like I mentioned, there are cases where that makes the most sense. But again, we think most data pipelines should focus on assets, so that's what we're going to be doing today. Using Dagster's APIs, you can create these graphs of assets that span across the different technologies in your data platform.
So, in the workshop today, I will be writing a data pipeline that results in a Jupyter notebook. Let's go over some of the main Dagster concepts we'll be working with to build our data pipeline; the most important one is the software-defined asset. In Dagster, a software-defined asset is a declaration, in software, of a data asset you expect to exist. It's a way to write in software that you expect a data asset, like an ML model, a table in a database, or a Jupyter notebook, to exist.
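
As a rough sketch of what such a declaration looks like with Dagster's asset decorator (the URL and the asset name mirror the workshop's Iris example but are illustrative, not necessarily the repo's exact code):

```python
import pandas as pd
from dagster import asset

@asset
def raw_iris_data() -> pd.DataFrame:
    # Declaring an asset: the decorated function's body describes how the
    # asset is computed, and the function's name becomes the asset's name.
    return pd.read_csv(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
        header=None,  # the classic Iris CSV ships without a header row
    )
```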
So then Dagster is going to execute the iris_data function, and there's a little magic going on here: Dagster knows that the raw_iris_data parameter passed into iris_data corresponds to the output of the raw_iris_data asset. So Dagster is going to load the raw_iris_data asset from storage and provide it as input to the iris_data function. Then we execute this function, add these column names, and return the result; again, Dagster will store that to a persistent location, and the persistent location is actually important here.
The next time we want to materialize these two assets, the new return values will be written to the same location as the previous values, and this means that if we want to manually look at the most recent data, we know the exact location to look at. There's no more staring at a bucket of data files and wondering which one has the most recent data. The persistent storage location also allows us to materialize an asset without necessarily materializing the upstream assets.
Dagster also has built-in support for working with notebooks through the dagstermill library, and dagstermill is just a thin wrapper around Papermill that allows Jupyter notebooks to be directly executed from Dagster pipelines. So, instead of copying the development work you do in a Jupyter notebook into a Python function, Dagster can just run the Jupyter notebook as part of your data pipeline, and you don't need to translate that notebook into some other format and potentially lose readability.
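
The dagstermill helper used later in the workshop looks roughly like this; the asset name, notebook path, and group name are placeholders rather than the repo's exact values:

```python
from dagster import file_relative_path
from dagstermill import define_dagstermill_asset

# Declare a Jupyter notebook as a Dagster asset; dagstermill wraps the
# Papermill execution details for you.
iris_notebook = define_dagstermill_asset(
    name="iris_kmeans_notebook",  # placeholder asset name
    notebook_path=file_relative_path(__file__, "notebooks/iris-kmeans.ipynb"),
    group_name="template_tutorial",  # optional grouping shown in dagit
)
```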
So, let's move on to the workshop portion of this. Like I mentioned, we're going to be productionizing a notebook that analyzes the Iris data set, and, more specifically, we're going to be writing a slimmed-down version of the data pipeline we just designed. We'll create two assets, one for the Iris data set and another for the Jupyter notebook, and the Jupyter notebook will use the Iris data set asset as its input data. We'll be working directly with the canonical Iris data set.
So we don't need to do any data transformation steps in our pipeline, and, additionally, we won't be using any external tools to fetch the data set; we'll just be using Dagster and the dagstermill library to execute our Jupyter notebook. Then, at the end of the workshop, we're also going to start executing a Noteable notebook, and we'll explore some of the features there as well. For each step in the workshop, I'll basically talk through what we're going to do, demo it on my computer, and then go back to a slide.
That slide lists the tasks to complete. We'll wait a couple of minutes for each step so that everyone has enough time to complete it, and if you run into any issues, you can just put them in the Zoom chat, and either myself or Matt will try and help you out. I think, since we're in Zoom, what might be kind of helpful is, once you're done with a task, to use the little raise-hand feature or one of the reactions, just so I can get a good idea of who's done, so we can move on. Cool. And then, during the first step of the workshop, we'll download the code we'll be using; that will also have a README in it that contains all of the steps for the workshop, and you also get a fully completed version of the project as well.
So the first thing we'll need to do is actually just get our environment set up. I'm going to be using a tool called Gitpod that creates a fresh Python environment that I can use directly in my browser. If you're up to try that out, I would definitely recommend giving it a try; it'll help us all be using the same setup, and that'll make any issues we may run into easier to fix, since we'll be working from a consistent place.
B
You'll
be
sort
of
dropped
into
something
that
looks
a
lot
like
a
vs
code.
Editor
like
this.
So
once
you're
here
give
me
like
a
little
react
and
then
we
can
move
on
cool,
okay.
It
looks
like
we're
looking
good,
so
the
next
thing
we'll
need
to
do
is
like
install
the
example
code
we'll
be
working
with
and
all
the
required
dependencies.
So
I
have
all
of
these
commands
here
in
this
text
file.
B
If
you
want
to
start
executing
them
along
with
me,
but
I'll
also
go
back
to
that
slide
once
I'm
done
so
we'll
just
start
by
just
like
upgrading
pip,
just
to
make
sure
everything,
I
don't
know,
goes
a
little
bit
more
smoothly
and
we'll
need
to
First
install
Dexter
and
once
we've
installed
Dexter
we'll
actually
have
access
to
a
CLI
tool.
We
have
that
will
allow
you
to
like
download
custom
and
like
fully
supported
dagster
example
projects.
B
So
that's
what
we're
going
to
be
doing
here.
Let
me
actually
make
this
a
little
bigger,
so
you
can
see
the
full
command
we're
going
to
be
downloading
an
example
called
tutorial
notebook
assets,
and
then
we
can
do
that.
Might
be
helpful
if
I
throw
these
in
the
chat
yeah.
Let
me
do
that.
Great, okay. So, once we have the example code downloaded, we can just move into this new folder that's been created, and then there's a setup.py file in there that we can use to install all the required dependencies.
B
I'm
doing
this
pip
install
and
it
takes
like
30
seconds
or
so
so
over
here
you
guys
can
get
started
and
then
we
can
move
on
all
right
great.
So
let's
just
take
a
minute
I'm
going
to
kind
of
walk
you
through
sort
of.
What's
in
this
project
you
downloaded
and
we
can
pull
up
kind
of
the
files
that
we'll
be
working
with
so
in.
Let
me
make
this
a
little
bigger,
so
you
can
see
the
file
name's
a
little
easier.
So
in
the
stacks
here,
not
a
little
demo
folder.
B
We
have
kind
of
two
subfolders,
there's
tutorial
finished
that
contains
a
fully
completed
version
of
the
workshop
today.
So
you
can
use
that
to
like
get
a
sneak
peek
into
what
we'll
be
doing
or
just
kind
of
see.
The
final
version,
but
where
we'll
be
working
is
in
this
tutorial,
template
folder.
B
So
in
here
we've
got
a
couple.
Other
subfolders,
the
ones
that
are
important,
are
in
the
notebook
subfolder,
which
is
where
our
jupyter
notebook
that
does
the
analysis
of
our
Iris
data
set.
Is
we'll
go
through
this
in
just
a
minute,
but
you
can
open
that
up
in
your
text
editor.
You
should
see
this
comment
at
the
top
saying
that
we're
filling
it
out
as
part
of
the
pi
data
workshop
and
that,
if
you
see
that
comment
there
there
you
know
you're
in
the
right
folder.
Most of what we'll actually be doing is uncommenting code blocks, to help keep this workshop mostly bug-free, but I'll be going through exactly what we're doing at every step, so that you can understand what all the code is doing. So I recommend just having both of these files open in your text editor, so that they're easy to get to.
This cell here we'll get to later in the workshop. What we're going to do (actually, let me start running this) is get into some descriptive analysis of our data. We'll just start exploring our data set, understanding what's there, and getting an idea of what the data looks like. I didn't start at the actual top cell; okay, now it should work. All right, here we go. We're looking at our data, and we're going to make this plot here.
B
That
gives
us
an
idea
of
like
what
our
data
looks
like
and
how
the
different
axes
of
the
data
kind
of
compare
to
each
other.
B
And
then
we'll
get
into
our
actual
k-means
analysis,
so
we'll
run
our
clustering
algorithm
and
then
we're
going
to
do
some
more
plotting
so
that
we
can
understand
like
how
our
clustering
did.
If
we
scroll
down
to
the
very
bottom,
we'll
see
a
plot
with
our
results,
and
we
can
see
that,
like
one
of
our
species
of
Iris,
data
is
very
easily
distinguishable
from
the
other
two,
but
the
other
two
are
still
a
little
mixed
up,
which
means
we
might
need
to
do
some
more
like
complicated
analysis
to
separate
them.
So the first thing we're going to do is scroll down a little bit, and we're going to uncomment this code block under TODO 1. When you do that, this one line is still going to stay commented, and that is fine; we're going to get to it in a minute. So let's walk through what this code is doing. We want to start by making an asset for the Jupyter notebook we just looked at, using the Dagster API we went over in the presentation.
B
Doing
that
might
look
something
like
this:
we
would
have
our
asset
decorator,
the
name
of
our
asset,
the
code
to
execute
our
notebook
and
then
maybe
we
would
return
our
executed
notebook,
but
you'll
notice.
I
just
have
this
comment
here
code
to
execute
the
notebook
and
that's
because
it's
actually
quite
complex
to
to
execute
a
notebook,
and
so
the
diagram,
Library
kind
of
helps
abstract
away
that
complexity,
and
it
just
gives
you
a
helper
function
that
will
just
return
this
whole
asset
for
you.
B
So,
instead
of
having
to
write
this
out
yourself,
you
just
get
to
call
this
helper
function
and
it
does
all
this
work
for
you.
We
don't
need
that.
So
let's
look
at
kind
of
what
we're
providing
to
the
helper
function.
So
we
have
our
defined
diagonal
asset
function.
We
give
it
the
name.
We
want
our
asset
to
have
and
then
we
give
it
the
path
to
our
notebook
file
and
then
the
last
thing
we're
doing
in
this
case
is
giving
it
a
group
name.
This
is
sort
of
an
optional
parameter.
I will again wait here for a minute or so, and then we will move on. So now we can move on to actually materializing our asset in dagit. To do that in Gitpod, we need to actually start dagit running, which we'll do in the terminal.
A really quick tour before we get to materializing our asset. This first page we're on is sort of your home page, a timeline view, and it'll give you an overview of all of the recently run data pipelines or assets that Dagster is executing. We haven't run anything yet, so this is blank. So we'll go up here to this hamburger menu in the top left, and we'll see our two different repositories, or Dagster projects, that we're running.
B
So
we
have
one
for
the
finished
version
of
the
project,
and
then
we
have
our
our
template
project,
which
is
where
we're
working
right
now.
So,
let's
open
up
that
one
we'll
have
you'll
see
this
like
Ping
notable
job.
This
is
here
to
help
us
like
test
our
connection
to
notable
later
in
the
workshop.
If
you
need
it
and
then
here
in
this
asset
groups,
you'll
see
a
template
tutorial
asset
group,
and
that
is
from
the
the
group
name
we
added
to
our
asset
earlier.
B
So
let's
click
on
this
asset
group
and
we
can
see
our
notebook
asset
right
here.
So
we
can
click
on
this
asset
and
this
right
side
panel
will
pop
open
with
some
more
information
about
the
asset.
We
got
our
description.
We
can
click
this
view,
Source
notebook
button
and
it'll
open
up
a
preview
of
our
notebook
and
we
can
see,
see
the
contents
and
then,
if
we
close
this,
we
can
click
the
materialize
button
and
this
will
actually
execute
our
notebook.
B
So
let's
go
ahead
and
click
that
and
then
this
view
windows
won't
pop
up
and
you
can
click
this
view
button
to
watch
the
asset
materialize.
If
you
kind
of
miss
this
button
at
first,
you
can
come
back
down
here
and
click
this
little
hash.
That
appears
on
the
asset
itself,
so
we'll
go
here.
B
I
took
a
little
bit
of
time,
so
I
missed
watching
it
actually
execute,
but
we
can
see
that
our
notebook
has
executed
successfully
and
then,
if
we
go
back
to
our
sort
of
main
asset,
page
and
click
on
this,
we'll
have
some
additional
metadata
about
the
asset
and
we
can
actually
click
this
to
see.
The
executed
version
of
The
Notebook.
So I will give you all a minute to go through that process yourself, and then we can move on to the next steps in making our data pipeline. All right. So we've executed our notebook, but if we go back to our notebook file, we'll see that the logic to fetch our data is still in this notebook, and that means that every time we execute this notebook, we're refetching our data set. That may be a really costly operation that we may not want to perform every time we execute the notebook.
So the next thing we're going to do is factor out this data fetching into its own asset, and we want to do that for a couple of reasons. The first, like I mentioned, is not necessarily having to refetch the data every time we execute this notebook. It'll also help us if we ever want to add a second notebook that's also analyzing the Iris data set.
Instead of having to copy the data loading logic into that notebook, and potentially worry about the copies getting out of sync, with maybe one notebook analyzing a slightly different data set than the other, we can instead just have one asset that has the Iris data set, and both notebooks will use that asset.
This asset looks a lot more like the ones we went over in the presentation portion: we have our asset decorator, we give our asset the name, and then we just tell it what we want it to do. Again, we have this group name here, just to help with organization. So we have this Iris data set asset now, but we still haven't told our notebook to use it. To do that, we'll just scroll down a little bit, and we'll uncomment the line that has the TODO 3 comment in it.
So I'll uncomment this. This tells Dagster that we should be using the Iris data set asset in our notebook, and we do this by specifying a dictionary in a parameter called ins. If you'll recall from our notebook, we are storing the fetched Iris data set in a variable called iris.
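
A hedged sketch of what that ins mapping looks like, extending the earlier dagstermill sketch; the asset and variable names follow the description above, but the exact code in the template may differ:

```python
from dagster import AssetIn, file_relative_path
from dagstermill import define_dagstermill_asset

iris_notebook = define_dagstermill_asset(
    name="iris_kmeans_notebook",  # placeholder asset name
    notebook_path=file_relative_path(__file__, "notebooks/iris-kmeans.ipynb"),
    group_name="template_tutorial",
    # Map the upstream iris_dataset asset onto the notebook's `iris` variable.
    ins={"iris": AssetIn("iris_dataset")},
)
```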
So we can go ahead and do those two things, and then we will make one more change in our notebook file, and then we'll be ready to materialize these assets again. So again, wait here for a minute or two. Cool, that was easy.
B
We've
got
our
Irish
data
set
being
fed
into
our
notebook
asset,
but
we
need
to
do
one
final,
like
small
change
in
our
notebook,
so
that
we
know
not
to
actually
execute
this
code.
So
we
could
just
kind
of
like
delete
this,
and
that
would
be
fine.
But
then,
if
we
like
maybe
wanted
to
like
Scandal
and
execute
this
notebook,
we
might
run
into
issues
and
we'd
have
to
like
copy
this
back
in.
B
So
if
we
just
cut
and
paste
this
block
of
code
into
the
cell,
that
says
it's
been
tagged
with
parameters
instead
of
executing
this
code,
it'll
actually
get
overwritten
and
we'll
be
pulling
in
the
value
from
the
Irish
data
asset.
B
So
the
reason
we're
like
cutting
and
pasting
code
here
is
that
in
gitpod
there's
not
really
like
a
good
way
to
add
parameters
or
add
tags
to
to
Jupiter
notebook
cells.
So
in
like
the
real
world,
if
you
were
doing
this,
you
would
be
either
like
in
your
Jupiter
kernel
or
in
like
your
vs
code
editor,
and
you
would
just
add
the
parameters.
Tag
to
the
cell,
where,
where
this
Iris
data
set
fetching
is,
is
happening,
but
instead
we've
provided
a
cell.
So, right, you can't see the parameters tag on the cell, but I promise you it is there, and when we execute the notebook, it's in the metadata and it'll be found; you just can't see it in Gitpod. So, again, you just need to cut this block of code and move it into the cell that's been tagged with parameters.
B
That's
been
tagged
with
parameters,
okay,
so
now
we
can
go
back
to
daggit
and
we
can
click
this
reload
definitions
button,
and
this
is
going
to
just
basically
like
look
at
the
new
assets
we've
created
and
like
pull
them
in,
so
we
can
see
them
in
dag
it.
So
we
should
see
our
Iris
data
set
asset
now
and
then
that
is
Upstream
of
our
jupyter
notebook.
So
we're
going
to
materialize
these
assets,
but
like
real
quickly,
just
to
like,
hopefully
go
over.
What's going to happen is that Dagster is going to execute this Iris data set asset, fetch that data, and then store it in local storage. Then it's going to execute this notebook asset, where we have this new parameter cell and we've set up the input mapping of our upstream asset onto the notebook variable.
B
It's
going
to
inject
the
the
content
of
our
Iris
data
asset
as
the
IRS
parameter,
and
then
we'll
be
doing
our
analysis
on
that
on
that
asset.
So
we'll
click
the
materialize
all
button
and
then
again
we
can
click.
This
view
button
to
kind
of
watch.
This
all
execute
it's
a
little
crowded
on
my
screen,
because
it's
kind
of
tiny,
but
you
can
kind
of
watch
these
assets
execute
up
here.
B
Okay,
great,
so
that's
done
so
now
we
can
go
back
to
our
main
doctor.
Page
and
again
we
can
click
on
this
view,
notebook
button
to
see
our
executed
notebook
and
you
can
see
that,
like
we've
injected
the
cell,
that's
got
some
kind
of
funky
code
going
on
that's
like
pulling
in
the
asset
that
we
materialized
before
this
one.
B
So
give
everyone
just
like
a
minute
to
do
that,
and
then
we
can
move
on.
So
we
sort
of
completed
the
sort
of
like
indexer
and
Jupiter
a
portion
of
this
and
we're
actually
going
to
kind
of
move
on
to
start
looking
at
notable.
We're going to go through a very similar process to what we just did, but with a Noteable notebook. So the first thing we actually need to do is make a Noteable account: you can go to noteable.io and make a free account, and then we can go ahead and start creating a notebook there, and then we'll execute that in Dagster.
B
But
first
thing
to
do
is
we
can
just
make
an
account
and
then
we'll
go
through
the
rest
of
the
setup
after
that,
once
you're
in
notable
you'll,
see
kind
of
a
page
that
looks
like
this
you'll
be
in
like
your
space
and
then
once
you're,
there
you're
good,
and
then
we
will
move
on.
B
Okay,
so
the
next
thing
we'll
need
to
do
is
actually
upload
the
notebook
we
were
just
working
with
to
notable.
So
if
you're
in
gitpod,
you
can
download
The
Notebook
we've
been
working
with
over
in
the
left
sidebar
you
can
right
click,
the
name
of
the
notebook
and
then
there's
this
download
button
here
and
then
that
will
download
The
Notebook
and
if
you're
working
locally
in
your
own
virtual
environment,
you
should
just
have
it
on
your
computer.
And then we can drag and drop our downloaded notebook into this little upload window, and then it should be uploaded. Then we can just open that up and see the same notebook, just in the Noteable UI.
Yeah, in Gitpod, it's in the file navigation; the notebook file in there is the Jupyter notebook we've been working with, and then you just right-click, and the download option is there.
So the last thing we need to do is actually get an API token from Noteable. Again, to do that from within Noteable, you can click on your profile and go to Settings, and then on the left side there will be this API token section. You can generate a new API token and copy the value, and then, once you have that value copied, we want to go back to your terminal.
And then, once you've done that, you can just restart dagit with the same command as before.
Matt: These tokens are basically like any other GitHub token or other service token you'd have, so you can delete them and make new ones. They're a way for a machine to act on your behalf in the Noteable ecosystem while that token is valid, and the default there, if you see it, is set up for one-year tokens, so you don't have to worry about them expiring in the short term; you can just play with it.
Jamie: Okay, yeah, I had to go through and set up my token as well, so thanks for keeping an eye on that. Okay, great. So now we can actually make an asset for our new Noteable notebook. So, back in Gitpod, back in the same file where we've been making all of our other assets, you can scroll down to the very bottom, and there's this TODO 5.
So we can uncomment this code block, and this is doing pretty much exactly the same thing as our define_dagstermill_asset call, but instead we're defining a Noteable Dagster asset. Again, we're giving a name to the asset, specifying our input mappings, and then giving it the group name. The main difference is that, instead of a notebook file path, we need to give it a notebook ID, and we'll go over getting that notebook ID in the next steps. You can just leave it like this for right now.
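
As a sketch of the shape of that declaration; note that the helper name here is an assumption based on the description in the session, not a confirmed API, so check the workshop repo for the real import and call:

```python
from dagster import AssetIn

# Hypothetical sketch: mirrors define_dagstermill_asset, but points at a
# Noteable cloud notebook by ID instead of a local file path.
noteable_iris_notebook = define_noteable_dagster_asset(  # assumed helper name
    name="iris_notebook_noteable",           # placeholder asset name
    notebook_id="<paste-your-notebook-id>",  # copied from the Noteable notebook URL
    ins={"iris": AssetIn("iris_dataset")},
    group_name="template_tutorial",
)
```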
So the last thing we need to do for running this asset is actually get that notebook ID, and we can do that by going back to the notebook we uploaded and just getting it from the URL. So, in Noteable, just open up the notebook we uploaded, and then look at the very top.
B
You
should
have
sort
of
this
ID
kind
of
in
the
middle
of
the
URL,
and
you
can
just
copy
that
and
then
go
back
over
to
where
your
asset
is
defined,
and
we
will
just
replace
the
value
of
this
notebook
ID
variable
with
the
value
we
copied.
So now we can go back to Gitpod. I did not restart dagit, so if you also did not restart dagit, you can just do that. If you're in Gitpod, I think it'll reopen that little window, if you want to open it; I'm not 100% sure how Gitpod works, but it might be starting a new port-forwarding thing each time. So, once we're back in dagit, you can go back to our asset group, and we'll see our new Noteable asset.
If you restarted dagit before we uncommented that Noteable asset, you may need to click this Reload Definitions button, and then you should see it up here. So we have our Noteable notebook, and it's going to be using the Iris data set as input. So we can execute this: we can hold down the shift key and click both the Iris data set and the Noteable notebook, and this will materialize both of those assets together. So we'll go ahead and click Materialize.
Matt: And the notebook that's running in this parameterized run is being executed live, by the way.

Jamie: I think this one is parameterized on the input, rather than the...

Matt: Oh, okay, got it. So this notebook here is live-running, and you can have multiple people sitting and watching at the same time; you can have multiple editors at the same time, if you want to pair-debug something that comes up. It's an ephemeral copy of the original notebook you had, for this particular Dagster materialization, right?
Jamie: Yeah. So, I mean, if you wanted to test that out, you could go back to Noteable and open up the notebook you uploaded, and you'll see that it's just sitting there, not being executed, while this copy of the notebook is the one being executed.
B
Okay,
there
we
go
so
it's
run
successfully.
We
can
go
back
to
daggit,
we
can
click
on
it
and
in
the
right.
Sidebar
we'll
have
this
link
to
notable
that
shows
kind
of
the
that'll
like
reopen
that
ephemeral,
notebook
and
show
you
like
the
last
executed
version.
B
So
we
can
wait
there
for
a
minute
just
to
finish
like
executing.
B
Your
different
notebooks
and
then
we'll
jump
into
like
actually
doing
like
some
live,
debunking,
debugging
and
sort
of
like
what
it
looks
like
if
your
notebook
kind
of
fails
halfway
through
and
how
you
would
deal
with
that,
all
right,
so
yeah
into
live
debugging.
So
in
order
to
actually
have
something
to
debug,
we
need
to
like
introduce
an
error
into
our
notebook.
B
So
we
can
go
back
to
the
notebook
we
uploaded
you
shouldn't
like
I,
don't
know
this
is
the
one
that
was
executed.
We
can
execute
exit
out
of
that
and
go
back
to
kind
of
the
like
Source
version
of
our
notebook
that
we
originally
uploaded
to
notable
and
I'm
just
going
to
scroll
down
a
couple
of
cells,
maybe
after
this
Iris
dot,
head
cell
I'll,
add
a
new
cell
and
I'm
just
going
to
raise
an
exception.
Imagine what this process would be like with a more challenging bug, one where you might have to actually dig into your data to figure it out; for now, let's just throw an exception.
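
The injected failure is just a one-line cell; something like this is enough (the message text is arbitrary):

```python
# New notebook cell added purely to simulate a pipeline failure.
raise Exception("simulated failure for the live-debugging exercise")
```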
I'll add that here, and then we can wait for a second to get everyone set up with a notebook that will fail, and then we'll go through, materialize this notebook, and then debug it. Great. So we're going to go back to dagit; we can just click on our notebook and materialize it.
We won't need to re-materialize the Iris data set, because we already have that data. So we'll click on our notebook, click Materialize Selected, and then we can click on the View button, and what we should see is this notebook fail. We have this notebook link here; again, we're going to wait for the notebook to fail, and then we can click on this link and start trying to debug it.
You can imagine a situation where, you know, maybe you have this notebook running on a schedule, so you're not watching it execute every time, and you see that the last scheduled execution of the notebook failed. You can go back into the logs, find this link, and then get into the live notebook and start debugging.
Matt: And by default, that live notebook will stay around for 90 minutes. So after 90 minutes, it'll shut down the context and clean up; but for now, from this point on, we have 90 minutes to go jump into the live session. Otherwise, you'll get the non-live copy of what happened.
Jamie: Cool. So we also get some logs in dagit; our bug is very simple, so we could find it just based off of these logs, but if our bug were a lot more nefarious than that, we can click on our live notebook, and we will be dropped into this live notebook here. We can maybe delete this, and we can see that we can still investigate our data. Everything is already loaded into memory, so if there were issues with our data, we could start popping in here and trying to figure out what's going on. So we found our bug: we had that exception in there, and we can just change it to something else, or delete it, whatever. And then, if you want, you can just start executing the remainder of the cells in your notebook, maybe just to make sure that things continue to run.
Oh, there it is. Great, okay.
And yeah, my fix of changing the exception was saved, and we're all set from there.
So, going through that process of doing that little live-debugging exercise is the last step in the workshop. So we can stop here and do a little Q&A, if you all have any questions, and then we'll be done.