Delta Lake Delta Lake Tech Talks, 21 Apr 2022

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Architecting for Data Quality in the Lakehouse with Delta Lake and PySpark

Description

Join us for a live tech talk and learn about architecting for data quality in the lakehouse with delta Lake and PySpark. After the presentation, we’ll have time for questions. Excited to have you join us!

From null values and duplicate rows to modeling errors and schema changes, data can break for millions of reasons. To combat this, teams are increasingly adopting best practices from DevOps and software engineering to identify, resolve, and even prevent this "data downtime" from happening in the first place. Join Prateek Chawla and Ryan Kearns as they walk through how data and ML engineers can solve for data quality across the data lakehouse by applying data observability techniques. Topics to be discussed include: how to optimize for data reliability across your lakehouse's metadata, storage, and query engine tiers, building your own data observability monitors with PySpark, and the role of tools like Delta Lake to scale this design.

Links:
Exercises: http://github.com/monte-carlo-data/data-downtime-challenge
Jupyter Notebooks: http://github.com/monte-carlo-data/data-observability-in-practice

A

uh So rachel is just setting up the live stream so um for joining us in zoom uh from the the event page. We also are streaming live to youtube and linkedin, which is really exciting. um So we just have some things that on our side, that we need to get set up and and give a few more minutes for folks to join and then then we'll go we'll get started all right. Rachel says we already see it on youtube.

A

Awesome! Welcome if you're joining us from youtube all right. Let me check oh wow yep welcome everyone, and I think that means we're also live on linkedin. So let me check linkedin here.

A

Awesome we're live on linkedin, so exciting.

A

Welcome if you're joining us from linkedin or youtube too also we'd love to hear where you're joining us from um city country myself, I'm in oakland in the bay area and um yeah. I have to say it's like a nice day. It was raining last night um but uh denny. Where are you joining us from.

B

Hey everybody how's it going uh I'm joining from the I believe, cloudy. Yes, cloudy seattle, evergreen city uh up here in the pacific northwest hiding out here so.

C

Hey I'm joining from dublin california and it's raining quite a bit, so we got the rain from you guys.

A

Nice, I have some family. That's that's over. There come on ryan. Where are you joining us from yeah.

D

I'm calling from from san mateo california um so a little on the on the west side of peninsula. It's kind of cleared up a little bit here, but I think it'll.

A

B

Get back to it in the afternoon.

D

A

Yeah nice, I went to um elementary school in san mateo, I'm about one at local from the bay area. I guess you could say really not quite.

B

A

Francisco but like so, I grew up in daly city, which is like a couple blocks, and you know some areas like couple blocks over from san francisco. By now.

D

Yeah cool yeah: I went to high school in balboa park, so I drove right by.

A

The city- oh okay, okay, nice, small world.

A

Awesome well, we have quite a few folks on zoom already.

A

I think, maybe we'll give it like one more minute and get started here. Just have a few uh quick slides to share.

B

Meanwhile, it looks like there's a lot of folks that uh chiming in from all over the place, I'm looking to india, florida, brazil, italy, uh cleveland, ohio, some other folks from seattle, uh texas yeah. This is a great turnout here. uh We've got folks from new york, brighton uk um over in zoom as well. So I'm looking at linkedin and uh let me switch over to youtube so yeah. It's we've got some really good turnout here right. This is awesome.

A

Also in zoom 2, so brighton, uk and new york welcome and then buenos aires, alabama chicago costa rica awesome, some more folks from india, peru cool, really exciting, all right. Well, I let's get started here. um I just have a few quick slides that I want to share and welcome everyone.

A

All right, so, thank you, everyone for joining us. This is a tech talk uh with our folks, our friends from monte, carlo, so architecting for data quality in the lake house, with delta lake and pi spark, and today we're uh joining joining us is pratik and ryan from monte, carlo and uh new. This this time around.

A

So we typically run our meetups on um a meetup group, the data in ai online meetup group and um just recently, actually, our team launched a a page on the community linux foundation and I'd love for you to join us there. So it's a virtual. It's our delta lake um chapter there, where we're gonna host, all of our community office hours and tech talks and all awesome content um with the linux foundation.

A

So there's the link there to join I'll drop, that in the chat, spaces um and uh stay tuned up upcoming delta lake specific um talks and then also too, uh we have I'd love for you to subscribe on youtube. We have a great tech talk and meetups playlist there, where we record all of our sessions and they're all available immediately after the recording and then um follow us on linkedin. We have a delta lake and a databricks linkedin page, and I wanted to do a quick call out because we have a conference coming up soon.

A

It's the data nai summit, it's gonna be hybrid this year, which really excited um in person and online at end of june. The 27th through 30th lots of awesome sessions and the theme kind of is, you know, building the modern data stack on the on the data lake house, so hopefully you'll join us um in person or online and before I pass it over to jenny, my co-host and our speakers. I just want to remind everyone that the session is recorded and it'll be available on youtube directly after we're done so I'll drop.

A

That link in the chat and then also too we'd love to hear your questions and gather those while we're going through the presentation. So, um if you're joining us on zoom q, a is your best place and then you know the chat is on for youtube and linkedin and denny myself will be moderating that so we'll capture your questions and answer as many as we can at the end. So with that I'll pass it over to demi.

B

Thanks very much karen really appreciate it, um so I'm not going to go ahead and do too much. I just wanted to reiterate two things. uh One is that please do ask her questions we'll do our best to answer them live in linkedin, zoom or youtube all three. In fact, okay, um we're not gonna. There will be pauses during the session, so both pratik and ryan can go ahead and also chime in as well, but for the some of the other questions, we'll wait till the end of the session.

B

So that way we'll allow the flow without further ado. That's it for me. I'm gonna go switch it over critique uh ryan. Why don't you go to introduce yourselves and go uh and start the show uh we'll start with you pratik, I believe so.

C

Cool then thanks dan and karen, I'm pretty chavela, and I'm here with my colleague ryan kearns and we're the founding technical team members of monte carlo. So we're really excited to be here today and we really wanted to thank the databricks team and the linux foundation for this opportunity and we're huge fans of both organizations and the broader communities they're building around.

C

I think some of the most innovative and important open source frameworks we'll be discussing today is what it takes to build: a build and design a more reliable data, lake and lake house using systems like data breaks delta like pi, spark indian observability, but before we get started. Let's just do some quick introductions so, like I said, I'm prathik, I'm the founding engineer and technical lead at monte, carlo, where I help drive the technical strategy for our data observability platform.

C

Previously, I served as a technical lead at barracuda, where I worked on: email fraud, preventive technologies and I graduated summa laude, with a bs in computer science from uc santa cruz and, of course, in my free time, as you can see from the slides I enjoy watching, broadway shows flying airplanes, reading and exploring new places I'll. Let ryan introduce himself as well.

D

Yeah thanks, pratik and and thanks to our host uh databricks and the linux foundation, putting this together so my name's ryan, I'm one of the founding data scientists.

B

D

At monte, carlo I've been for for a little over a year and change. I work on the majority of anomaly detection, machine learning techniques for our platform.

D

Prior to this, I was at stanford studied computer science and philosophy, with kind of a particular focus in natural language processing and then before that, I've worked at after pay in backend engineering working on the data quality fundamentals uh book with monte carlo, which will go into some of the concepts you'll see today um in a sort of longer form, uh if you're, if you're interested and then my little fun fact is I'm a big fan of the american road trip.

D

So I took a pretty beat up minivan across the country with some friends of mine when I was in high school and remains one of my coolest memories. So um definitely a big fan of that car, but I'll pass it back to pratik who's, going to get us jumping in right to the the technical meet and details of the presentation.

D

C

Yeah thanks thanks ryan and back to what you guys all signed up for, so we wanted to do like a walk-through of common data quality challenges and how to solve them for the lake and lake house. So I'll just go over the agenda very briefly on today's agenda.

C

We'll basically imagine ourselves, inheriting a data lake house and um it's a lay cost environment specifically with a bunch of data, reliability, issues and maybe you've actually all been in this position before and then we'll talk about a new approach, called data observability to solving many of these issues and then we'll walk through what it takes to apply.

C

Some of these principles and technological approaches to lakes and lake houses using delta and all the other tools we have at our disposal after that, ryan will walk through anomaly detection by doing a demo using pi spark and ml flow and and like um karen and danny's iterated. If you ever have any questions feel free to put them in the chat and we'll try our best to answer them all right. So, let's get started so imagine. You've joined a company as a lead data, architect and you're, inheriting a data-like environment.

C

This one's built on spark and data bricks, which is I'd, say pretty exciting um you. You can't wait to dive in and solve the new challenges and like work through all the thorniest data problems the company has, but the first one on your roadmap really is data quality, and this only came to light because uh wonky dashboard was shared with your ceo and was projecting next year's financials, but it actually used last year's data and of course that's not what you want to do.

C

So the question is: how are we actually going to solve this before we dive in on how to like actually fix the problem and it sure doesn't happen again. It kind of helps to understand the implications of bad data for stakeholders across the company. It's like the gift that keeps on giving so to speak and works as you move downstream, and these broken pipelines can manifest in any number of ways originating in buggy code, missing data or even operational issues.

C

So and then, even if even no matter how you test, like the best testing, there's literally no way to catch all of them, so the key really is to monitor and alert on these data quality issues as soon as they happen and before they affect your stakeholders, whether those stakeholders are internal analysts or maybe your external customers. So here we put together this, uh what we've deemed the data quality cone of anxiety- and I think it highlights a lot of these different levels of impact as data moves downstream beyond your lake house altogether.

C

So you can see bad data can have serious ramifications like it. You can lead to lost revenue, it can lead to maybe erosion of trust. You can obviously waste your resources, there's just so many things that bad data can lead towards. It affects, and it affects all different types of organizations like, for example, with nasa they had their.

C

They had their mars climate orbiter. It was. This is a multi-million dollar satellite, but it was lost because the software used to build this technology used a different measurement that the astronauts used to program where it should land. I believe it was pound seconds versus metric units, they lost about 125 million dollars and years and years of progress and there's lots of other examples of this.

C

Like, for instance, there's a there was a 617 billion dollar so-and-so quote-unquote fat finger incident where uh trading error on the japanese stock market happened where a broker just entered the wrong number and it caused it. A huge stock fluctuation and the japanese stock market was sent to rattle for a few hours. It was just a simple order for toyota motor stock, but it was worth more than the sizes like sweden's economy, um so there's like that list goes on and you might necessarily be risking. You know 617 billion dollars due to bad data quality.

C

You can understand, like I think, how the implications grow as your bad data flows downstream. So now the question that begs the question: how do we address this bad data? So we believe that you can learn a thing or two from our friends in software engineering. The problem of stopping bad data in its tracks can be tackled by relying on some of the best practices of like our friends.

C

In software engineering, software engineers leverage these principles of site reliability, they use observability and they use all this stuff to ensure that their applications are performing as expected, and that up time is high, while downtime is low, so as organizations grow and the underlying tech stack powering them becomes more complicated. For instance, let's say you're moving from a monolith to a microservice architecture.

C

I think it's very important for devops teams to maintain a constant pulse on the health of their systems and, more specifically at least we believe observability speaks this need and refers to the monitoring, tracking and triaging of incidents to prevent downtime and, as a result of this, like we're kind of shifting as an industry to more distributed systems. So observability engineering is really becoming really really critical. So now, if you look at it from from our friends in software engineering, observability engineering really is broken down into three major pillars.

C

As you can see on the slide, you have your metrics, you have your traces and you have your logs now. What are these three things? So metrics refer to like a numeric representation of data measured over time. Traces are like the casually causally related events in a distributed environment and logs are like a the source of record for an event that took place to have like a time stamp.

C

They provide context and they basically indicate like an event occurred, but unfortunately, I think on the data side, I've noticed- and maybe you can empathize with this- that our approaches to data reliability and generally data engineering are about 10 years or so behind that of software engineering, because, like in an application development, every team has a solution already they have a new relic.

C

You have a data dog, you have pagerduty and we use all these really great tools to help measure the health of our application and ensure that there's some degree of reliability yet for some reason, like the data teams, are completely flying blind.

C

So what if we can apply some of these principles to our data lake houses? So it kind of begs the question: what exactly is data observability?

C

Well, if I think, if you hadn't guessed already, it's kind of an organization's ability to fully understand the health of the data in their system, it kind of the data observability has eliminates like data downtime, and we kind of do this by applying the best practices of devops observability to our data pipelines and like it's devops counterpart data observability uses automated monitoring, alerting, triaging, and we use all of this to identify and evaluate data, quality and discoverability issues, and we think this leads to more reliable data, healthier pipeline and, hopefully, more productive teams.

C

So now what does this look like in practice? How can you actually actually follow these footsteps of these engineering friends that I've been talking about and apply these principles to your data pipelines in the next few slides I'm going to outline a few of these best practices and hope? My my hope is that you can take that away from this presentation and apply it to your own systems and just like how you have the three pillars of observability for software reliability. Every data team has these pillars of data reliability.

C

I've I've broken them down into these five here and we believe they're strong indicators of whether or not something is broken or gone wrong. So these five pillars are really freshness distribution, volume schema and lineage, I'm going to walk through each one of them, so the first one is freshness, freshness kind of seeks to understand how up-to-date your data tables are, as well as the cadence at which your tables are updated. It's particularly important when it comes to decision making. As we all know, stale data is synonymous with wasted time and money.

C

That was our first example last year's data. The next step on our our next pillar up, basically, is distribution, and it's really- or you can say it's more like a function of your data as possible values. It tells you whether your data is in an accepted range data. Distribution gives you insight, I think, into whether or not the tables in your your the tables in your data can be trusted. Based on what you expect from your data, the next next pillar is volume.

C

So volume really refers to the completeness of your data tables and offers insight into the health of your data sources. Let's say you had 200 million rows and they suddenly turned into 5 million. I think you should know about that, because that indicates something happened.

C

The fourth of these is schema, so schema really is changes in the organization of your data, in other words, schema can indicate broken data and monitoring who makes these changes to these tables is foundational to understanding the health of your data ecosystem, not to mention preventing. I think, some of these things from occurring again and finally, we have lineage when data breaks.

C

I think the first question people always ask is where data lineage, I think, provides this answer by telling you which upstream sources and downstream integrators were impacted as well as which teams are generating the data and accessing it. Essentially, I think lineage acts as their context and it kind of serves as a singular source of truth for all your data, consumers- and you can see from my slide here. This is a dlt which is from databricks and shameless plug, so tracking these five pillars can give you a good sense of how healthy your data is.

C

I think from there whether or not a data downtime incident has occurred. Okay, so I've, given you all this conceptual stuff, that's great! How do you actually use this? So in software engineering we have, we have what is known as a devops life cycle.

C

The devops lifecycle incorporates a bunch of distinctious, continuous stages from like planning to coding, to development, to monitoring the cycle repeats itself, and while many of our technologies and frameworks have adapted to meet the standard and best practices we see in software engineering, I think we tend to handle data a little bit more reactively and it's prevented us from driving these changes in like a scalable way. So much in the same way, devops apply the continuous feedback loop to improving software. It's we believe.

C

It's time you can leverage the same blanket of diligence for data, so the data reliability life cycle is kind of an organization-wide approach simultaneously and proactively. Improving your data health. It eliminates data downtime by applying some of the best practices from devops to your to your data pipelines. So we hope we think this framework will kind of allow you to be to hit a bunch of different key factors. One of them is, you can be the first to know about data quality issues in productions.

C

Then you can fully understand the impact of the issue and then fully understand where your data broke, but now, let's actually dive into the environment. So if you recall, we inherited a lake house and to see we can set up data pipelines for success when it comes to preventing like these billion billion dollar data incidents. I think the first thing we need to discuss here, though, is what is really the difference between a warehouse and lake.

C

um There's, a bunch of different factors and a bunch of people have opinions on this, but I think it really boils down to the way data is stored and structured, as well as the various entry points you have in a lake that many warehouses don't uh so I've kind of broken down the data lake into four components or layers. The first of them being metadata, then you have your storage layer. Then you have your query engine layer and then you have your query logs there. As you can see from this slide.

C

There are various technologies you can use to manage this, um and this is not even complete close to all of them and but for the purposes of this presentation, let's dive a little bit into uh building a reliable freshness detector with spark and data bricks. But before we get to that and I hand it off to ryan, I wanted to briefly discuss all of these different layers. So the first one is your metadata layer. So leica warehouses have an information schema that provides like an extensive amount of metadata information about objects like tables and views.

C

Many lakes have what is called a metastore and acts as a central repository for metadata, and there are many flavors of this mana store, but I'm going to talk about the three that work with data bricks. So you have your central hype, metastore and every databricks deployment has this and it's accessible by all of the clusters you use. You can also provision an external hive, mysql meta store or you can leverage the aws blue catalog.

C

This ladder is only of course available if you're using the aws provider and then for those of you who are using delta tables, you can actually access the whole transaction log, which is the real source of truth and contains like an ordered record of every transaction. That's ever been formed on a delta lake table and databricks is doing some great work, so I think, unify a lot of this with their upcoming unity, catalog, which you can sign up for the waitlist today. It's okay! So that's great!

C

That's like I just introduced a bunch of different technologies. What does that mean? And why do they actually matter? And why do you want to crawl and track any of these? So I think doing so, we'll let you cover actually three of the pillars that we talked about by just by just tracking metadata. You can get schema changes, you can get size and you can get freshness well to be fair.

C

I think the last two have a little bit of an asterisk because they mainly apply to delta tables, but schema can be tracked for any type of table and what I mean by asterisks on this is freshness and size can be unreliable from the meta store and they only change in their respective catalog, at certain circumstances like, for example, how the data is loaded or maybe, if you recently ran an analyze command or the type of table you're using and in some cases some of these catalogs just don't have all this information, but but don't worry, there is a solution to that and it's basically tracking it at the storage layer and I'll get into that right next, but I just wanted to mention one other really nice thing about the metadata layer is: if you utilize the metastar, you can actually help it can actually help you with lineage too, um for example.

C

If you have a view that often contains the underlying view, query and you can actually use that to derive and build the upstream sources as well. Okay, so now at the storage layer, so by tracking the storage layer. What I really mean is watching how the underlying data of a table changes.

C

This can be done in one of two ways: kind of regardless of what blob storage you're using so the first one- and I think the most obvious one is: you can crawl the paths that correspond to a table, and this allows us to compute like the total size by just you know, summing up the files and the one or more directories that correspond to the table. Obviously, this is like this for very large table. This can be very slow, but you could there are a couple different optimizations that you can make like.

C

For instance, you can cache older partitions if you know that only new partitions are written to and, of course, like all resource problems, you can throw compute at it until of course, the date you can't and then for freshness. If a file is encountered in the crawl, you know there's no freshness event, but if you haven't encountered one, you know something is up or you can do. What I personally believe is a more is a more scalable and better approach, and you can just track files as they change and map these to a table.

C

It's obvious, I think it's more scalable, because it's push based, so you can bake this into your etl tool or use a service like s3 events, if you're using that provider like on the left here, I have an example of s3 events through event bridge and you can basically use this to track, puts and deletes, and obviously this can only track new updates. But if you need to get a history, you can always bootstrap and create the baseline as needed.

C

So the next layer we have is query logs. Query logs is a very important layer because it can be used for multiple different purposes. First and foremost, it can act as an audit trail, and you can do things like track if a query is changed during a data incident, so this can help you with rca.

C

You can also use it to help build lineage by creating like a source to destination relationship by just parsing these queries and then for cases where the query logs are not available, which are some of the cases when you're using spark different spark environments, uh you might be able to extract the same information from the etl tool, you're using or even using a spark extension like spline um and then there's a lot of other information you can find in the query logs in this context like, for example, if you have the runtime available, you can analyze.

C

Let's say if your queries are degrading over time or let's say you have tables that don't get frequently run, queries or specific fields that don't get run frequently. You can either clean them up or move them to a slower, yet cheaper tier of storage, if you're using a cloud provider and the last layer we have here is the query engine. So if you have the query engine, I think you can execute more specific or custom data checks. This includes things like running field health, which covers like you, can get retinol race, unique grade percent, none.

C

So many different versions of that you can also do distribution, which covers like values with low cardinality tables, to like determine if something's significantly changed like. If you have something tracking different clients like iphone or android. If a new client shows up, maybe that wasn't expected and then obviously, with the query engine, you can kind of write any sql queries you can validate whatever type of business logic you want. You have some custom thresholds that you know that have to be hit.

C

Every day you have slis that you want to write whatever you want to do you can you can write any sort of custom, query and validate that, and these type of things can be built into your pipeline to stop or circuit break, as you can see here on the right, it's an example of one of our airflow providers, where we basically have an operator, that's sitting in the middle of your etl pipeline that checks the data, integrity or quality of of your table before it goes on to the next step in the pipeline or you can be notified after the fact, and then you can be.

C

You can use all this information context. You have to triage it and just an important thing about after the fact you just you need the necessary tools to rca and determine impact, but you can use all these things to kind of build like a holistic picture of. What's going on in your data environment, okay, so now I've kind of walked through some of these components.

C

I'm actually going to hand it over to my colleague, ryan and he's going to demo how to actually build one of these monitors using uh your leg house with python and spark and pi spark.

D

Great thanks, pratik, I think I'll, take over screen sharing in a sec um for the moment, uh while we're in this transition point are there any any questions. We'd want to raise about the last couple sections danny.

B

No, no, no current questions right now. uh We've actually answered most of them. I did want to call out that for any folks that are asking data brick, specific questions, let's go ahead and leave them in the community form the database community forum. uh Community.Datebricks.Com, that's the great place for that. We definitely want to discuss much more about the about the topic that we're here today to talk about: okay, so about architecting data quality. So let's go ahead and definitely focus on that here, but pl, uh please!

B

If you have any other questions on that, go ahead and chime in and uh we'll continue answering them and also, of course, answer them um post uh ryan's demo. So that's it for.

D

Me at least great thanks denny here I'm going to share a screen now um so we'll be looking at databricks in just a moment. If you can stick with me, um so I wanted in in the second part of this presentation, to sort of get our hands dirty.

D

Take a look at pi spark itself, the databricks environment, and you know, look at now that we've been given kind of a great overview of all of the conceptual tools you have at your disposal when trying to tackle a data observability campaign, you know how can you actually go about implementing that in some code in some way that sort of respects the constraints of the lake environment and the particular features that you have at your disposal say in databricks, so um we'll be doing a short exercise in in a databricks notebook in pi spark uh I'll look through some sources of data they're going to be a bit sort of fanciful and construed, but you'll have to work with me and you can extrapolate into your environment here.

D

um We'll do some feature engineering on that data to get a sense of um how you might configure it for better detection and then we'll look at specifically the case of freshness, anomaly detection, so that is detecting updates to tables or rather tables that skew from their typical update patterns.

D

um At the very end, we'll give a brief kind of indication of how you would do some sort of model, tracking experiment, tracking and general kind of parameter, search grid, search with ml flow. But I'll have some important points about the limitations of that technology. In this case and kind of invite you to think about some, some future directions that you could take.

D

um Also just plugging here, you'll notice that there are some github links on this slide um to our repository.

D

We've been doing these kind of demonstrations for a little while and we've actually got a sort of ipi notebook format that utilizes the same concepts, but in raw sql and sqlite, um using some pandas and some plotly for visualization, so all open source um stuff and if you're interested in kind of seeing this in a more kind of ubiquitous domain, feel free to take a look.

D

So what I'll do is I'll get myself over to data bricks here, um making sure pretty you guys can check out my yeah, my presentation, yeah so architecting for data quality in the lake house, I'm now in um a notebook environment. I'll, be writing some python and just sort of importing some some base. Libraries for for handling this task. What I'm going to be using is I'll, be leaning heavily into this new command that delta lake tables support, which is called describe detail.

D

So if I just sort of capture diamonds, as you may be, familiar is a uh is a demo table in in um uh the databricks environment.

D

I can describe detail on this table and actually get some information, including its unique identifier, its name uh where it's located and then for our purposes here when it was last modified, um you'll notice, as as pratik said, that I could get all this information directly from kind of the query history, uh so the transaction log of this table for the purposes of this demo, I'm actually only working with described detail for reasons of scalability, which I'll explain.

D

So, if I uh can just sort of post-process on this describe detail command, you'll notice, I can get some last modified timestamp and some measured timestamp, which is just the current time, um and what I would do if I was uh architecting for an observability platform here is I'd, have some you know, update times table and I would use uh the pi spark data frame format to write into that table. A batch of these uh last modified timestamps.

D

um What I get to do here is, since uh I described detail just to describe command, is so uh cheap in terms of scale.

D

I could run this on many tables in my environment and just get a sense of when they've all updated and then tack that information to the back of a um kind of a growing incremental model, uh maybe using like um elt or some sort of uh orchestration scheduling service, I could get a sort of growing log of all the last modified timestamps uh at pretty easy scale, so you'll notice, obviously we won't be working with the real data from diamonds.

D

This table hasn't been updated since 2020, um it is a demo table in the databricks environment. So I sort of cheated and I created some synthetic data that would respect you know uh this workflow had we been recording data um for about the last two months of last 60 days, and so, if I run on this update times table you'll notice, I've got four tables. Now, as I said, we could kind of scale this up.

D

I've got diamonds, but also emerald, sapphires and rubies, and then, since you know mid to late february, I've got when we modified those and when we measured them and the measurement is, is sort of just an augmented feature. It's not strictly necessary for understanding how tables update, but it does give you a sense of how up to date, your own orchestration system is running. So if you.

B

Have delays on your.

D

End, maybe you want to um learn that that doesn't uh necessarily indicate an anomaly in the freshness of the actual table, um but those are sort of nuances.

D

I also just want to showcase if the actual delta table format's not for you that you can load the exact same data, either in csv or json, if you're more familiar with in your data science, workflow working with these types of formats, um the the lake environment, sort of extensible to a large number of formats, um and so you can, you can see that I get the exact same data here same schema.

D

So I've got like I said: four table names, I've just been running, described, detail and I've been I've been tacking the results onto an incremental table, and I can identify these tables with unique, ids and sort of see when I measured them and when they updated.

D

So now um I'll do a bit of of sql uh kind of wrangling on this table. uh You'll notice that, in order to plot this, I'm just going to kind of catch a ones column as my value and I'm going to get the measured time stamp and the modified timestamp uh I'll. Take the first 400 points: this table is actually around 6 000 rows, but for the purposes of this demo, we're going to stick with a kind of constrained group.

B

D

Rubies table updates, of course, this uh notebook will be open sourced and you can come in and mess around with all sorts of stuff. If you'd like to examine the correlations.

B

D

Tables I put some stuff into the scripting to set them up, but if I just run some plotly visualization to get a sense of these updates, um what you're seeing here is an update time series. This is your kind of base feature set for anomaly: detection, you'll notice.

D

If I hover over these, you see like at february 20th, we had an update, we had another one, uh five hours later and then another one, seven hours after after that or nine hours after that, uh so we've got a bit of a what seems to be a semi-regular update cadence, although there seem to be many delays and probably some some probabilistic updating based on whether new data is available.

D

So this table is not perfectly accurate, um yet uh it does observe a sort of regularity and you'll notice, like specifically here, I'm pointing a very large delay of around two days.

D

So if we're using this table for um dashboarding purposes or we're feeding the results into an ml model that runs live in production, uh this may be fine. You know. Maybe we can tolerate a delay of roundabout um 20 hours. Maybe one day, um two full days seems to be quite a lot and we may want to identify this anomaly here this anomaly and the tables update, cadence and I'll sort of show how we can do some really naive uh anomaly, detection to to surface that result.

D

um So what I'm gonna do is is a form of a bit of feature engineering um and, of course, for those in the in the machine learning world feature. Engineering can get a whole lot more nuanced than this, um but I'm just going to kind of give you the minimal, uh minimal augmentations, and then you can sort of infer all of the fun directions. You would take this on your own, um so what I did is I just I added a lambda function.

D

I assigned a new column that just takes the last modified timestamp and subtracts it by the shifted timestamp. So this intuitively gives you the delay between incremental updates and you'll notice. If I sort of I can scroll between these two and you'll see that dependent on the size of the gap between updates, the corresponding bar is larger, because this records the number of seconds since the table was last updated.

D

um So we've got like kind of a little regularity here. It doesn't seem like we eclipse something like 80 000 seconds all too often, and we've got the spike at around 150 000 seconds. So visualize like this, it's not quite so clear that the anomaly is all that significant. But if we're looking in terms of its delay, time can sort of get a more nuanced understanding, more obvious understanding.

D

So I'm going to show you some really simple unsupervised: attempts to to capture this anomaly and I'm going to do it through the lens of ml flow. So we can actually see how this technology can work in our favor for experiment tracking.

D

So what I'm using is a pretty out of the box detector from scikit-learn called a local outlier factor, lof detection. If you may be familiar, this basically sort of computes the density of neighboring points in a time series and will check their deviation relative to their to their neighborhood, and if the deviation is significant enough, we get a higher score for an outlier factor and then we're more likely to tag those data points as anomalous.

D

um So, if you know the kind of machine learning breakdown of tasks, this is an unsupervised learning approach. We don't have gold labels as to what constitutes real anomaly or not, and this is a kind of key- uh I don't want to say setback, but a consideration you need to make when you're doing anomaly. Detection is the fact that you know most of the um most of the tasks that you can configure at least at the beginning are going to be unsupervised.

D

You have to do a bit of kind of hard thinking to guarantee that uh that you can. You can leverage those results in a positive way.

D

So I just fit to my data with this uh local outlier factor, I'm using 10 neighbors on a on a train set of around 50 data points, and if I can just highlight, there's one outlier factor that stands out uh the threshold. If you don't change the kind of vanilla scikit-learn configuration is a 1.5 absolute value 1.5 for detection, so um we've got one point eclipsing that and actually, if I use some some plotly to add a vertical line where that integer is located, the one that that returns positive for the predictions um well negative.

D

Actually, it'll return minus one. If we get an anomaly, I've got a single anomaly here um and that's great you know we we figured out that we can partition our our our set of data points into kind of standard and non-standard points.

D

um If we were doing this in a production setting and we had actually grabbed the latest 50 data points from this table run a sort of anomaly detection on it and then surfaced this particular results. Then we could think about.

D

You know setting up some notification routing to dump this anomaly into another table or um some event log somewhere, where we could maybe send an email or push a slack notification or something to alert our engineers to the fact that the table is out of date at this point in time and what I want to get into I'm going to I'm going to show the ml flow stuff in just a second, but I also want to indicate kind of the setbacks. That of using this such a naive attempt.

D

um Lof is, uh I mean it's good in massive data settings uh for certain um certain tasks, because it actually scales quite nicely. It has good complexity, but it's not the right detection algorithm for our particular case here. For one um you know the threshold for detection is in it's bidirectional, which would mean that if one of these small updates down here was uh too small we'd be issuing an alert for that. uh For that point, and that's obviously not what we want, we don't really care if our table updates so close together in time.

D

Unless you have like a particular use case that would make such a such an instance relevant, um also we're only passing in the delay information. We have sort of a kind of a singular, one-dimensional vector as our feature set I've ignored all of the fancy and interesting stuff. You can do around seasonality, so there's there's um no auto regression, no rolling average uh exponential, smoothing uh kind of seasonality prediction any of the stuff that you might have heard about.

D

If you know time series, anomaly, detection, whole winters, and what have you um you can do that stuff here? If you, if you'd like to um and uh there's a lot of features of your disposal, even from just pulling the the measurement timestamp and the kind of splitting the date time of the the update time intelligibly.

D

So we could do a lot better with that and I'd encourage you to think about their limitations and sort of think about what the best anomaly approach towards this particular task would be.

D

But just in my sort of last couple of uh last couple minutes in the technical demo, I want to show what ml flow had been doing after I had configured. My auto log here and um the representation is a bit uh a bit lacking, given that the task is unsupervised and I'll talk about that in just a moment, but um here's my latest experiment run. You can see at 9, 35, pacific time. uh If I wanted to, I could restore the notebook to the state where that model had been run.

D

I'm not going to do that because it's identical, but I can look at what the experiment looks like. um It took three seven 3.7 seconds to train it's finished and then I can actually look at the parameters I used for this model instance, so I can check that my lof leaf size was 30. I used 10, neighbors um and p equals two means. I was using a euclidean distance metric instead of manhattan.

D

So if you're interested in sort of those hyper parameter tuning questions that would be sort of how you would uh how you'd get that um you'll also notice.

D

Actually, I think if I go back to my code, we may surface a warning at some point about the fact that yeah, because uh we're running unsupervised, we don't have training labels and, as a result, all of the kind of nice out of the box, training, metric uh tracking and accuracy tracking that mlflow provides is uh not available for us and um I'll actually go back to the slides to explain why that's an interesting thing, um so I'm gonna go back into the presentation here. But um do people have any questions in the meantime.

B

D

There are actually a whole.

B

Bunch of questions that actually have popped up uh since there, so let me go ahead and try to chime in on some of them. In that case, okay um sounds good. uh For starters, uh from logistically uh will the notebooks and the slides be made available. uh We've already responded to various forms that we will prop them up to the youtube channel. There is a request to us email it out as well.

B

I'm not I'm a little concerned that we might be spamming folks, so we might not do that, but I'll talk to uh rachel and karen to figure out what the right approach is for something like this, uh but we will definitely be uh propping it up to the youtube channel. So that way you can see it from there. Okay, um all right, let's see! Oh uh there is an earlier question uh concerning the data lineage graph.

B

uh That was uh that particular shown it's like how do you go create that, and so I guess that's the first question right there. So.

D

Richard you want to tackle that yeah.

C

Sure so the very cool thing is: if you're using delta live tables, it's created for you, you don't really have to do anything. It's just built out of the box from data breaks and you can just get it through there. If you're, not using data bricks, there's a lot of open source options and plug monte, carlo as well a bunch of other different ways to build lineage out there.

C

Does that answer? What does that.

B

Yeah, no, I think it does yes and for everybody, uh I'm looking at linkedin youtube and zoom, so I'm trying to compile the questions together. So do me a favor if it's not, if we didn't answer it's not on purpose, just chime in again and we'll we'll continue to follow up okay. uh So that's the first question next question: um okay, uh let's see also, would you be able to throw some light into how you see the changes in the query?

B

Log changes where there's some specific uh option available right here in databricks I'll, actually just answer that question right now uh in terms of seeing the query change logs.

B

That actually is directly, I believe during earlier um in the demo that ryan you had shown that was actually from a delta table, so the delta table, when you saw you described table, actually contains a snap snapshot, history of all the transactions that were committed to the table, and so because, that's all there, that's actually how you see any of those changes, and then there was a subsequent question. I believe from malcolm that asked: oh well, why don't you just use delta cdf? That's actually a good call out delta change.

B

Data feed is a cool ability to go ahead and see all the changes, but at the individual data level. So every road change every insert every update every modification shows up in the change data feed itself. But if you want a summary of the transactions that occur, which obviously will often will occur with multiple rows, that's what the describe table will end up showing you, especially when you show the history of that. So hopefully that helps answer that question. um I think that's it for now.

B

Oh sorry, um I did want to clarify uh and then uh ryan and pratik you can chime in on this one this this these notebooks are not just for uh uh writing and databricks right. You can definitely run this in other open source environments, so.

D

Yes, yeah, not only can you, but we actually have implementations of of pretty much that exact thing, um using a kind of lightweight package of uh sql lite to spin up kind of a database file um and doing some scikit-learn stuff and some pandas to kind of manipulate the data frames so yeah. You can run this out of the box on some like local jupiter server and there's some demos for that.

D

um Also, that that describe detail command happens to be a particularity of the kind of delta table environment, but you can do very similar stuff and the point in using that command was to showcase that yeah. There are kind of there's a natural way to surface freshness data from tables that need not be specific to the database table format. So yeah a lot of extensibility.

A

B

Perfect, uh I think that covers most of the questions now, uh why don't we flip back over yeah.

D

I'm gonna give some talking points about uh some, uh some more kind of like machine learning, considerations running this at scale, building out kind of an experiment, uh laboratory and and whatnot. But why don't we just uh hop in here and then uh we can? We can pick it back up for questions in just a few minutes, so we kind of did the demo.

D

um I I sort of just mentioned two minutes ago that I think there are some uh setbacks to unsupervised learning. Obviously, in the big data world you hear unsupervised learning gaining a lot of traction, because when you have like a massive amount of unstructured data, running clustering, algorithms and doing sort of feature, exploration and discovery is is quite a powerful tool.

D

The problem is, obviously you have a it's not straightforward to compare the results of two unsupervised classifiers uh to each other. So two anomaly detectors might disagree on what constitutes anomalous and, uh if you don't have a good kind of like north star signal as to what you're trying to pick up it's hard to know how to improve on our naive baseline that we just implemented, and that prompts me to talk a little bit about two concepts that are sort of interrelated bootstrapping and meta learning.

D

So, if you're familiar with this statistical learning, literature you might know, bootstrapping is like a form of model bagging or ensemble learning, um I'm using it in kind of a more colloquial sense to refer to bootstrapping a process so specifically bootstrapping a supervised learning process. um So unsupervised. Learning at a high level is basically taking data that doesn't have a gold label. So no correct answer with the data and clustering that data into different sets that share common features and that's what we just did.

D

We just took data with certain features and put it in two buckets one for non-anomalous one for anomalous. An anomalous buckets should be way smaller in terms of the size of the data.

D

Supervised learning is different in that you actually have a source of truth. You know what the real answer to the question is in my little diagram over here you can see the the points are colored differently, based on what class they actually belong to, and so that's not something that's up to the model to decide. That's a gold feature, a sort of gold standard and a supervised classifiers allow for uh kind of much more precise accuracy. Metrics.

D

You can look at like the precision or the recall or the um f1 score of a of a classifier and see if it is improved on some labeled corpus relative to other detectors and if you're, using supervised learning with ml flow, you can get a whole suite of tools that I unfortunately don't have time to get into, but that can showcase how that accuracy has improved over time.

D

In fact, I think, if I just show my slide on tracking monitor performance in this line graph over here, you're looking at the mean squared error of a supervised classifier over time and noticing that as it as it runs, it's improving. So it's it seems to be converging on a good strategy for um for this detection task, as indicated by the the low error rate so um back to bootstrapping.

D

What I'm referring to with bootstrapping is basically, uh if you have anomalies in your environment, and you have some ability to capture those anomalies- represent real incidents in the data pipeline. You know a real event where some etl schedule went offline or some schema change broke. Something upstream, um consider recording that data and consider using it as a labeled feature when you're going about training anomaly detection models.

D

um I know that's kind of a tantalizing suggestion, because I could go into a whole nother hour of presentation on how you would actually go about doing that and sort of tracking the feedback. From your end, users like data engineers who actually go about investigating incidents, um it's a whole other ballpark, but uh something to consider and especially to consider if you're, using something like ml flow, and you have interest in building out some sort of scaled um sorry, some sort of scaled approach towards anomaly detection.

D

So, oh man, sorry, my slides are being uh sticky but I'll sort of sum up uh in just a few brief points. You know: data.

B

Quality for lake.

D

Houses can be tricky like pratik mentioned, the flexibility that you get through all the different endpoints and data formats that are able to be manipulated comes at the cost of a lot of messiness. If you don't govern your system correctly, so we provide the five pillars of data observability to give a holistic view, a sort of end-to-end understanding of data health and to provide kind of a natural transition towards knowing what actions to take when data goes down um dependent on you know the different, the different signals from your data observability pipeline.

D

um You can apply this observability at all four layers like critique mentioned, so based on the difference, um uh the specificity, the specifics of your uh your environment, you may be using the storage layer or the query log layer to pull certain fresh and exercise information.

D

That sort of depends on your stack, uh but this this notion really is ubiquitous across the data lake, and specifically today we saw how pi spark with ml flow can help build and scale anomaly detection, and also that, if, if you want to achieve sort of a robust experiment, tracking and improvement pipeline, you need to look at kind of supervising the learning process and providing the ability for your experiment tracking software to to compare the results of of different detectors.

D

So we've talked about a lot uh I as I've kind of mentioned multiple times. Unfortunately, we've got way more content to discuss that I can possibly introduce in this short time we have together, um but I will say if you're interested in kind of hearing some of this uh in in longer form, uh we do have some interesting events coming up. So the impact tour is a three city. Stop uh in-person events uh uh on on date, observability concepts uh our next one.

D

Coming up um these three locations in san francisco, new york and london, we've got afua bruce she's. The former data strategy lead for the fbi and chief program officer at datakin, so pretty cool resume she'll be talking about democratizing data quality, which is a very cool concept.

D

I recommend taking a look um and in even longer form, if you have, the attention span, we're putting together a book on these concepts, called data quality fundamentals which will be going through not just the data lake, but the data warehouse environments, everything from elt to dashboard applications, business, intelligence, all of the types of observability concerns and data quality concerns. You'd want at different places in that pipeline.

D

So do take a look. We've got some chapters for early pre-release with new ones coming pretty soon, and we just wanted to say thanks. It's been a real pleasure to work with databricks and linux foundation on this. uh We hope that uh it's for you guys at the beginning of learning a whole lot more about data quality and, of course, we're super accessible for asking questions feel free to send us emails here and yeah. I think I do have some appendices if people are interested in seeing some specific stuff, but.

B

D

I'd open it back up to questions so that we can uh close our time here.

B

Absolutely uh actually one one of the questions that I wanted to make sure that people uh that you got a chance to answer is like you've shown a great demo of running uh uh of showing of showcasing um uh data quality uh within databricks, and we've already answered the question of like forsake argument like that: oh that we're going to be attaching open um um links to notebooks to do this all in open source. There is a question that was like hey what what does monte carlo data. Do, though, like what do you?

B

What do you guys do it's because you've been demoing this on databricks or you can demo this in open source, but people aren't so sure about what monte carlo data. So I want to give a one for you guys to chime in on that.

C

Yeah, I can feel this one. So basically, what monte carlo does is a lot of what you saw was us kind of manually walking through the different components and a lot of the complexities that are associated around it. Monte carlo's aim is to automate this for you and make this kind of automagical like the rest of your different tools, so kind of it's like a data observability platform.

C

I need a discovery platform, so it kind of does all of what we were talking about and I hopefully a lot more just out of the box and it works on a bunch of different platforms. You can connect it to a bunch of different warehouses and lakes and different components, and so you don't have to go through the collecting this data.

C

You don't have to go through processing this data, you don't have to go through building your detectors and all these things, setting up notification channels and building out linear, displaying and creating your rca tools and connecting all this stuff monte.

C

Carlo is a platform that kind of tries to do all of that for you and give you the tooling, to investigate and understand your data better by providing you uh different programmatic methods that you can use in sdk, you can use our airflow provider and you can do kind of it'll it just basically it's like if you've ever used a if you've ever used new relic or if you've ever used the data dogs. It's intended to be that for data.

C

I hope that answers that I can go into more details like on a specific component. If you guys want to learn like a specific warehouse or more about like some of these specific layers and like the different components of them, and how they're monitored and instrumented out of the box happy to do it, but the intent is basically it's like a no code solution. You don't have to run, describe detail, you don't have to store it in a table. You don't have to monitor this and kind of build your own framework on it.

C

It just does it for you, it's kind of just the intent. The platform.

B

No that's great.

C

B

Think I think you definitely answer the question. Oh sorry, ryan, did you want to chime in.

D

I was just going to say briefly, you know from my wheelhouse on the anomaly detection side. um We what I just showed was sort of a naive freshness detector. We have a whole family of models that we maintain that um run freshness, detection, uh size, detection, different field, health metrics. So we track things like the null rate and all of those models are sort of purpose built to scale to the whole environment.

D

So you can sort of run out of the box anomaly, detection that warms up in a couple days and then you've got um observability across basically the entire uh warehouse, for example. So that's that's my end of the pipeline, but petite can talk a lot about the different storage solutions that we haven't went with because uh yeah it's it's uh quite a cool tool. I think.

C

Yeah I'll actually try them in one other point, sorry for plugging more, um but basically I think a lot of the complexities with lakes comes to scalability. We didn't get an opportunity to really dive into that, but that's I think that can be the entire deck, because the nice thing about warehouses- and maybe relational databases- is they're relatively small in scale. So you can do a lot of these operational things.

C

When you talk about lakes, you're talking about terabytes and petabytes of data right, it's just it's a lot to manage, so you have to do a lot of things to handle that scale. I think there was actually a question in the chat about um how does how does delta handle this so like, for example, when you're writing the describe details command, it actually isn't pinging your metastore, it's pinging, the transaction log, which is wherever your blob storage. Is it's actually leveraging the scalability of these like provider.

C

Solutions like it uses s3 azure all of these blob storages and it leverages all of that and then the spark enters actually compute them. So it's like there's a lot of really cool things. I'm I love I'm very excited to talk about that. If you guys are interested.

B

Oh, no, that's awesome. uh Okay, we still have lots of questions we're trying to answer as many as we can. uh You know I want to give the last question to um to evgen from youtube uh he's from the ukraine and he wanted from your perspective, ryan. I think I'll have you chime in on this one.

B

um Can you explain the benefits of using mlflow versus just using like simple check rules on the data frames, and I think that we want to differentiate between the data itself versus the ml side, and so I figured ryan wants you to chime in on that one for sure.

D

Yeah, it's a good question, because um you're tempted to kind of meld them together right because feature engineering is like. Is it it's a complicated thing model tracking and experiment tracking is another complicated thing. If you want to understand why your model is performing poorly, it could be the modeling. It could also be.

D

The data you know garbage and garbage out is one of the most overused phrases in like modern machine learning, but it's it's real that part of why this data quality concept exists is actually because, if you're running a complex production pipeline- and you have really terrible data coming in that's constantly out of date or has high levels of nulls like you can't fit well to that collection of data. But on the modeling side I will say you know um what I just showed was local outlier factor detection. It's.

D

It can run out of the box and scikit in like two lines um in a real kind of machine learning. Context you'll be running, um really complicated models you might be like implementing neural stuff with uh tensorflow or pytorch. You might have like multiple modules of hundreds of lines of code to like. Actually build those models in in kind of the specific time series anomaly detection case: you'll have uh probably a lot of seasonality like if you're doing this for real.

D

You would be looking at things like monthly, weekly, daily trends and trying to smooth those in order to get anomaly. Detection that respects the existing trends of the data um and all of those things I mentioned are hyper parameters that can go into a model, and so, if you are building a model to kind of solve for that, you'll want to explore that space a bit and say what does it look like if on this table? Like I try a much more aggressive seasonality, smoothing approach, um maybe I uh implement monthly smoothing? Maybe I don't.

D

Maybe my training window is smaller or larger. All of those things are going to impact. How well your ml can perform in production and you'll want an experiment tracking tool like mlflow, to um kind of look at your whole corpus of available detectors and say: okay, like we just made this change, the training window is five days shorter, but now accuracy on this benchmark data set, we had is down 20. So clearly, we're missing some key feature by changing that hyper parameter.

D

um Sorry, that's kind of a a wild answer, but the space is like pretty huge and and yeah you're right to point out that it's hard to sometimes decouple those but um yeah you do want both. I think you can't do robust production um machine learning without both uh uh kind of a good understanding of how your data is coming in and also a good understanding of how your model is performing to that data. So, thanks for the question, oh.

B

Perfect: okay: we are definitely on time uh right top of the hour, so apologies for any of the questions we weren't able to answer um karen uh rachel. Why don't you take it away.

A

Thanks danny thanks so much ryan and critique. That was awesome session. um I think you know when we get a lot of questions. It just shows that people are really excited and engaged. So um thank you so much for taking the time to do this.

A

Tech talk with us and um thank you everyone for joining, so I dropped the youtube link of the recording in the chat, and I know um there were some links on the slides so I'll make sure I get the youtube copy updated with those links um and then, uh if you're joining us on zoom, um we'll just send a quick email with the recording and the links too. I'm since we're certain, since we're able to do that uh with that um thanks so much- and I hope everyone has a great rest of your day. Thank.