From YouTube: VMware: Monitoring Driven Development with Dagster
Description
David Laing, Staff Data Engineer at VMware, presents on monitoring-driven asset development.
See full Feb 8, 2022 Community Meeting here: https://www.youtube.com/watch?v=fYJBN6MAtbE
Yeah, so, as Sandy mentioned, I'm talking about monitoring-driven development, which is a play on the ideas of test-driven development from the agile software development movement. I think that in the data engineering space we're watching a set of paradigm shifts unfold.
If you're like me, you probably started with a project mindset: you were thinking about projects where you delivered some kind of data artifact, maybe a report, to a stakeholder.
There was always some number of features wanted by some specific date, and we all know how that goes. Success was your stakeholder accepting the data artifact that you gave them, and then you moved on to the next project. Then, I think, as the automation tools started to mature, we moved to what I call the pipelines mindset: we were basically using automation to do the manual steps that we had been doing before, but we still had something of a project mindset.
We were just doing more automation when delivering that project; we were still trying to deliver something, automatically, by a certain date. Success moved from acceptance by the stakeholder to your pipeline succeeding or not succeeding, and when you were finished, you moved on to making the next pipeline. Replay that cycle a couple of times, and I'm sure we've all had the experience where you have a whole bunch of pipelines, some of them are failing, and no one is quite sure who cares, or whether the failure is a problem at all.
The next shift makes the data asset a first-class citizen: we talk about the data asset that our team maintains and that our stakeholders use for things like drawing those reports. We shift into a sort of incremental delivery pattern, where we get feedback and engagement from our stakeholders and use that to guide how we improve that data asset.
We also shift the automation: instead of automation that's just used to create the asset, we also have automation that validates that the asset is available and meets the quality criteria that our stakeholders expect. Only once we have all of that in place do we do the familiar work of creating pipelines to create the asset. And then we take a much more iterative approach, where we ask: are our stakeholders asking for more data in this asset?
Then let's prioritize that work. Are our quality monitors showing us that the existing asset isn't being updated frequently enough, or that there are errors? Well, let's focus our attention on that.
In the pipelines mindset, the pipeline was going to do some work, say the daily job, and it had a set of operations that we strung together very nicely. Some assets sort of fell out of the bottom of this pipeline, but our focus was very much on building the pipeline: is the pipeline succeeding? I think that if we make that paradigm shift to thinking about assets, which the new features in Dagster 0.13 and now 0.14 help us to do, we can shift into much more of a data-product-focused mindset. We can start by publishing an MVP asset to the Dagster asset catalog and putting it front and center in front of our stakeholders, to get feedback early on, before we've even written any pipelines to generate that asset, because, as we all know, changes at this stage are significantly cheaper than changes later in the process.
We can also start, and this is the monitoring-driven part, by writing the monitors that validate that our asset meets some level of quality. We expect them to fail in the beginning, but now, once we've implemented some logic, we know when we're finished, because these monitors are succeeding.
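As a rough sketch of that "write the monitor first, expect red" idea in Dagster (the table name and the `table_exists` helper are hypothetical, not from the talk):

```python
from dagster import job, op


def table_exists(name: str) -> bool:
    """Hypothetical stand-in for a real warehouse lookup."""
    return False  # nothing exists yet, so the monitor starts out red


@op
def active_customers_table_should_exist() -> None:
    # Written before any creation logic, so this check fails at first
    # and goes green once the asset actually lands in the data mart.
    assert table_exists("analytics.active_customers"), (
        "active_customers has not been created yet"
    )


@job
def monitor_active_customers():
    active_customers_table_should_exist()
```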
It's also a great way to demonstrate to our stakeholders why our data assets are trustworthy. We can show them: this is the automation that validates that the data assets are good, and this is how frequently they're good and how frequently they're bad. Only at that stage do we get on to actually implementing the asset creation logic.
So this is the traditional create-the-pipeline sort of step, but it's done in a way that gives us a much clearer target, and we know when we've reached that target thanks to the previous two steps. And then we don't finish and move on to the next project or pipeline.
Now we iterate on that asset. We get feedback from the stakeholders and from the automation, and we use that to prioritize adding more data, perhaps improving the quality, and so forth. Success in this paradigm looks like a team owning some data products, long-lived data assets with good quality and monitoring, and you always know what's important to the team and where you should focus your attention.
All right, that's lots of theory; let me show you some code. This, I hope, will be merged into the standard set of examples once this pull request gets merged, under the monitoring-driven software-defined assets example.
So, the first step here: publish the MVP asset to get some early stakeholder feedback. We want the simplest possible thing that we can write in Dagster. We then want to write some logic to validate the properties of that asset, and we can even use the same validation graph to drive our acceptance tests while we're developing.
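A minimal sketch of such an MVP software-defined asset, assuming a pandas DataFrame whose column names are illustrative rather than from the talk:

```python
import pandas as pd
from dagster import asset


@asset
def active_customers() -> pd.DataFrame:
    # MVP: publish only the proposed schema, with no real rows yet, so the
    # asset shows up in the catalog and stakeholders can react to its shape
    # before any pipeline logic exists.
    return pd.DataFrame(
        {
            "date": pd.Series(dtype="datetime64[ns]"),
            "customer_id": pd.Series(dtype="int64"),
            "is_active": pd.Series(dtype="bool"),
        }
    )
```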
All right, so look at this. We want to publish this very basic asset schema, with no actual data in it yet, into our asset catalog and get our stakeholders engaged. Is this the thing they were expecting? Is this the definition they mean when they talk about a customer being active, for example? Is the data it presents helpful for them?
They can come back to this asset catalog page to see the state of the data asset. So here we go: we filled in one partition, and we can see that one of the six partitions is now filled with some test data. But more importantly, we can also start showing them how we're going to validate the quality of this data.
In this case the implementation I did was the simplest possible thing: I just made a view in my data mart with some static data in it. So it happens to work for that date, but obviously it won't work for any future dates.
So I can write a Dagster pipeline that is focused not so much on the creation of the data, but on the validation of the data that's in the mart. I can have a set of graphs around validating specific data assets, and within those, a particular set of conditions: it should conform to some schema, it should contain the current date, and so on. And then there's the actual implementation of these.
Here's the implementation of one of those validation operations, the "it should contain current data" check. You can put whatever logic you wish in here to validate whatever it is you're validating.
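A hedged sketch of what such a "should contain current data" op could look like; the asset key, the stubbed mart query, and the staleness rule are all assumptions, but the `AssetObservation` event is Dagster's mechanism for attaching run metadata to an asset:

```python
from datetime import date

from dagster import AssetObservation, Failure, op


def latest_date_in_mart() -> date:
    """Hypothetical stand-in for something like
    SELECT max(date) FROM analytics.active_customers."""
    return date(2022, 2, 1)


@op
def should_contain_current_data(context) -> None:
    latest = latest_date_in_mart()
    # Link what this validation run observed back to the asset catalog.
    context.log_event(
        AssetObservation(
            asset_key="active_customers",
            metadata={"latest_date": str(latest)},
        )
    )
    if latest < date.today():
        raise Failure(f"active_customers is stale: latest date is {latest}")
```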
And so this just looks like a regular Dagster pipeline. If something went wrong in the asset validation, it would show up here as a problem against this particular partition, and you could see what the cause of that problem was.
And then finally, I mentioned that we can take this same job, the one that we ran manually over here and that we can have Dagster's scheduling capabilities run for us automatically, and run it in a pytest context to help drive our actual implementation development.
I'm using the pytest parametrize feature to execute this monitoring job and then validate that all of the steps succeed. So now I get a set of tests that I can use to guide my implementation work.
All right, so that's phase one. Just to recap: we talked about publishing the asset in the asset catalog as one of the first things you do, and what we're trying to do is get early feedback from our stakeholders and trigger the kind of questions that always come up. Can you actually access this location in the data mart? That's a big one I bumped into a lot. And the perennial: what do you mean when you say "a customer"? Especially in a corporate environment: is the customer the subsidiary, the parent company, or the one that pays the bills? You can also have conversations about the required grain. We've assumed daily here; maybe this actually needs to be done hourly, or maybe it's much slower and we just need something quarterly. Having those conversations early on can really impact how you go about implementing things.
Then we talked about writing a validation job where each operation is a check on some aspect of the data quality, and using the AssetObservation type to link the metadata from that validation run into the asset catalog. And finally, using the same validation pipeline, but run inside a unit-test context, to serve as your acceptance tests while you're implementing the features. All right, phase two.
Now that you have set the stage, it's time to get on to the bit that used to be step one, which is actually developing the logic to put the asset in the right place. What I think we should be doing under this paradigm is to first deploy our monitoring and see that it's failing, and failing in a way that we expect. Are the error messages we see in the monitoring helpful?
Maybe we don't have a fully automated process in the beginning, because we're still learning about the problem that we want to automate; you can't automate something you don't understand. And you definitely don't want to be worrying about performance optimizations or anything like that; those are things to invest in once this asset is used and important, not at this early stage. And then finally, we can use the monitoring, and the associated acceptance tests, to tell us when an implementation is good enough.
Looking at the time, I'm going to skip over the phase two demo, but let me just give you the recap. You would have deployed your monitoring, and it would have failed for some amount of time; then, once you deployed an implementation that worked, your monitoring would have started to succeed.
Your first implementation can be really simple. Perhaps rather than writing a whole lot of logic to compute the daily information, you just do that manually, stick it in a CSV or a spreadsheet, and have that be imported.
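As a sketch, that deliberately naive first implementation might be nothing more than importing a hand-maintained CSV (the path and asset name are assumptions):

```python
import pandas as pd
from dagster import asset


@asset
def active_customers() -> pd.DataFrame:
    # Deliberately naive first implementation: the daily numbers were
    # computed by hand and checked in as a CSV; the "pipeline" just
    # imports them. That can be enough to turn the monitors green.
    return pd.read_csv(
        "data/active_customers_manual.csv", parse_dates=["date"]
    )
```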
And then phase three is to iterate on your solution and to prioritize where you focus based on the feedback that you're getting from your stakeholders and from your automated asset monitoring. Are they asking for more data? Are they asking for a different schema? Maybe they want to change the update schedule.
Some further ideas. You could add notifications based on the monitoring job, to let you know about problems before your stakeholders notice them; a sketch follows below.
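One hedged way to wire that up is a Dagster failure hook on the monitoring job; `send_alert` and the failing check are hypothetical stand-ins, not from the talk:

```python
from dagster import HookContext, failure_hook, job, op


def send_alert(message: str) -> None:
    """Hypothetical notifier; swap in Slack, email, PagerDuty, etc."""
    print(f"ALERT: {message}")


@failure_hook
def notify_on_monitor_failure(context: HookContext) -> None:
    # Fires whenever an op in the monitoring job fails, so the team
    # hears about a broken asset before the stakeholders do.
    send_alert(f"monitoring op {context.op.name} failed")


@op
def a_failing_check() -> None:
    raise Exception("simulated monitor failure")


@job(hooks={notify_on_monitor_failure})
def monitored_validation_job():
    a_failing_check()
```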
There's no reason that your monitoring job should be limited to monitoring your own assets. As we saw, the monitoring operations are just Python functions; you can do whatever you like in them. Maybe there's a set of upstream assets that you depend on that would be helpful to monitor, or maybe you want to monitor something downstream.
Is the data getting downstream without being corrupted? That might be something you could monitor as well. And then finally, you can start to use the monitoring as the beginnings of some kind of data service level agreement with your stakeholders. You might say something like: this particular data asset will be stale no more than 10 days per quarter. And you have a nice measuring system, because the days when your monitors failed are the days when that asset was stale.
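As a toy illustration of that measurement, assuming you've already extracted from the run history the days on which the monitors failed:

```python
# Days this quarter on which the monitoring job failed, i.e. days on
# which the asset was stale (illustrative values, not real run history).
stale_days = {"2022-01-03", "2022-01-04", "2022-01-10"}

SLA_MAX_STALE_DAYS_PER_QUARTER = 10

sla_met = len(stale_days) <= SLA_MAX_STALE_DAYS_PER_QUARTER
print(f"{len(stale_days)} stale days this quarter; SLA met: {sla_met}")
```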
To visualize that: here's our monitoring system, and we're running it on a schedule. I think this is quite subtle, but important: you want to pick a schedule that's relevant to your stakeholders, which doesn't necessarily have to be the same schedule that the automation creating the asset runs on.
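A sketch of giving the monitors their own stakeholder-facing schedule, reusing the hypothetical `validate_active_customers` job from earlier; the cron expression is an assumption:

```python
from dagster import ScheduleDefinition

from monitoring_jobs import validate_active_customers  # hypothetical module

# Run the monitors every weekday at 07:00, before stakeholders open their
# reports, regardless of what cadence the asset-creation job runs on.
validation_schedule = ScheduleDefinition(
    job=validate_active_customers,
    cron_schedule="0 7 * * 1-5",
)
```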
All right, so recapping. We talked about how this whole idea of monitoring-driven development is part of the paradigm shift that we're seeing, and that Dagster is helping to drive: the move away from pipelines and towards assets. The approach you follow is to start by publishing information about the asset up front, so you can get stakeholder feedback nice and early, you can drive that engagement, and you can make changes when it's cheapest to make changes.
We're asking our stakeholders to place trust in our data assets, so we should have a mechanism for verifying why they should give us that trust. And it's only once you've done steps one and two that you get to what used to be step one, which is to actually implement the asset creation logic.
But now you've got a nice set of guides to tell you what you're aiming for, and you know when you're there: when the monitoring job starts to pass. And then, instead of finishing and moving on to the next project, we iterate to try and make this asset better.
Based on feedback we add more data, or maybe we improve the quality or the reliability of something. Success in this new paradigm looks like a team that has a bunch of long-lived data assets with a well-known quality, some kind of SLI driven by our monitoring. And then we're thinking about selling those same assets to additional stakeholders, rather than being on the perpetual treadmill of just doing the next project.
And that's it. I'm not sure about the timing; if there are questions, please just add them to the Zoom chat and I'll answer them, and I'll stick around at the end to answer questions as well.

Thank you, David. That was fantastic.