From YouTube: Simon + Denny AUA (2022-09-06)
Description
Join us for the brand new monthly series "Simon and Denny - Ask Us Anything!" where Simon Whiteley and Denny Lee will answer your data engineering questions from building a data platform to ingestion to ETL to analytics. With their background in SQL Server and BI to Apache Spark and Delta Lake - they want to show you how to build your own lakehouse.
As this session is interactive, come prepared to ask questions all throughout the session! Be prepared for another geeky, trans-Atlantic event from two data nerds.
Quick links:
https://delta.io/
https://go.delta.io/slack
https://groups.google.com/g/delta-users
https://go.delta.io/github
B
Okay, that is the most interesting answer I've given. Well, it is currently about 66 degrees — almost 18, or I guess 16, celsius — here in Seattle. Yes, sunny Seattle, so goodness gracious! I know people might be surprised that we actually might have decent weather here. So we've got YouTube up and running, so that's good. Now we're just waiting for LinkedIn to kick in, and once we get that we're good to go.

A
I mean, something's wrong in the world if London is warmer than Texas.

B
All right, perfect, I think we are — oh yeah, Ardavan from Texas has noted that it's 85 degrees, so that's a little on the warm side. Okay, perfect. We are up and running on LinkedIn and YouTube, so if you have questions, by all means please start asking. Let me go ahead and stop sharing, and then — perfect. So for everybody that's wondering, this is Simon and Denny: Ask Us Anything. Now, we do paraphrase a bit when we say "ask us anything": it is within the realm of data engineering, Delta Lake, and lakehouses. So it is ask us anything related to that — I guess you could throw in some coffee questions for me and baking for Simon, but nevertheless, that's the context. Okay, so saying that, first things first, why don't we introduce ourselves. Simon, why don't you start?
A

B
All right, thanks very much. Okay, and then myself: my name is Denny. I'm a developer advocate here at Databricks — as I go ahead and bust my mic — I'm a long-time contributor to Apache Spark and also Delta Lake, and I was also a committer on Delta Lake prior to Databricks. Before that I was at Microsoft, and you can blame me for things like HDInsight and SQL Server and Cosmos DB and things of that nature. So, lots of data questions — that's what you're here for, so let's dive right into it.
B
So now, as this is an ask us anything, we do ask you to drop your questions directly into either the Q&A, or into LinkedIn or YouTube. This show will pretty much run based on the premise of the questions you ask — so do note that. In other words, this thing could end in about five minutes if you've got no questions; by the same token, we can also iterate and talk about lots of really cool things related to data.

B
So one of the first things we actually got — one of the first questions, because you can also send questions via LinkedIn directly to Simon and myself, or tweet us for that matter — was: can we dive a little bit into star schema migrations into Delta Lake, and data warehouse best practices?
A
You know, you can write your insert statements, your merge statements, whatever you use to make your facts and dimensions, and you can lift and shift and run the majority of that in Spark — it's just going to work. There might be a little bit of syntactical change: I don't want square brackets, I want backticks around my long column names, that kind of stuff. You can't have recursive CTEs, but apart from that — and if you've got a massive recursive CTE in your dimension model, there's something a bit weird going on — the majority of this stuff you can just lift and shift over. Now, I don't necessarily say that's the best way of doing things, because if you had 50 stored procs sitting in your SQL database that you'd been using to build your existing warehouse, and you just turned them into 50 different notebooks and ran them, you'd be missing out on some of the nice parts of performance, tuning, and engineering. So a lot of the time we take that SQL logic and put it into a more generic Python notebook: we'll have a bit of PySpark that says load data from somewhere, transform it, land it somewhere with partitioning, do a proper merge and kind of manage it for you — and you just take that same SQL and run it inside that PySpark. You know, dataframe equals spark.sql, run that SQL, or have it as a view you select from, or however it happens to be. We tend to separate the business logic — put the SQL over there — and then have a really generic kind of wrapper that says: get some SQL, then land it properly in a nice, governed, performant way.
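For illustration, a minimal sketch of that wrapper pattern in PySpark — the table names, paths, and SQL here are hypothetical, not from the session:

# Business logic stays as SQL; a generic wrapper handles read, merge, and write.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

business_sql = """
    SELECT customer_id, SUM(amount) AS total_amount
    FROM bronze_orders
    GROUP BY customer_id
"""

df = spark.sql(business_sql)  # run the lifted-and-shifted SQL as-is

# Generic "land it properly" step: merge into the target Delta table.
target = DeltaTable.forName(spark, "silver_customer_totals")
(target.alias("t")
       .merge(df.alias("s"), "t.customer_id = s.customer_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())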
B
Perfect. Well, adding to that: one of the things I often talk about is that it sort of depends on what environment you're starting off with first. So, for example, if you're starting from scratch, and your source data is relatively small and you don't have that many different varieties, you may actually not need a star schema, in all seriousness. You can start from the standpoint of just the data. I know, I know — I'm hurting poor Simon on this one, I apologize for hurting him like that — and no, no, dude, I came from SQL Server and all I did was promote data warehouses, so calm down, buddy boy.

B
No, no, I know, but the context is that in order to get yourself up and running, it's not a necessity to do that, because the whole reason we talked about star schemas — especially for online analytical processing, or OLAP, or data warehousing design in general — was that we were trying to be extremely efficient with our joins, especially during the older days. This is aging myself, but the idea was that we could utilize memory more efficiently by ensuring that joins were being done on primary keys and foreign keys — whether we declared actual primary and foreign keys or just logical ones — using integers. That way those joins could be very efficient, and by only storing integers, at worst bigints, inside the fact tables, you were basically reducing the size of the fact table as well.

B
So that was the initial impetus for why we built star schemas in the first place, and a lot of that logic isn't necessary, especially when you're starting off. But as you progress and build more complex systems, what usually ends up happening is that, for starters, your dimension tables aren't necessarily stored in the same system that you're actually processing in your lake. You may have some third normal form design, or some operational data store, that actually controls what those dimension data sets look like in the first place. Remember, a star schema is comprised of a fact table with a bunch of dimension tables — and that dimension data may be organized or controlled by some other system. Because it's controlled by some other system, you'll end up needing to build a star schema anyway, just because you're going to be extracting the data out of that other system, and because you may not have full control of the system end to end. What ends up happening ultimately is that you'll have to do exactly what Simon just described: all those data warehousing design patterns you would normally apply — the actual techniques — even though you're not building a data warehouse, you're building a lakehouse, the patterns of a data warehouse are still very much applied to your lakehouse design.
A
You know, a lot of Kimball — how you design that star schema — was actually designed for a relational system. There's loads of it that we keep, because it's a good way to manage data: things like a slowly changing dimension, which is, dimensionally — you know, if I've pulled all the information to do with this dimension out onto a separate table because I'm doing some data model design, and I need to update something because something's changed, and I can do that by updating one record rather than updating every single record in a 10-billion-row fact table, that's way fewer operations. So there are loads of bits of data modeling that just make a lot of sense. But some of the things we used to never do — you know, the cardinality rules when designing a star schema: if you put a string on a fact, you're a rebel, you get thrown out of data modeling — don't matter so much anymore. It's a little bit more forgiving; the compression and dictionary management of parquet is a little bit better, so you can do that kind of thing. So I'm a little more laissez-faire with my Kimball modeling rules now — I'm not the dictator I used to be about how strictly you should design and how strictly you should conform to the Kimball modelling principles.
B
Do what we've been doing for the past — exactly, exactly. One thing to note, and this is just a shameless plug — and Ardavan, who originally asked the question, called it out — Douglas Moore and myself actually did a couple of sessions, three sessions, about data warehousing techniques with Delta Lake. The session in particular related to what Simon's calling out is the one on surrogate keys and Type 2 slowly changing dimensions. So we actually have sessions just on those two, and I am proud to say, as a nerd, that I made sure the entire session was themed around Stargate SG-1 and Stargate Atlantis as well — yes, because it's important for us to get that in, so I just wanted to call that out. Makes sense? Yeah, okay. So, as a quick follow-up — because otherwise you and I will keep on talking about this particular topic — we're going to stop now, because we have other questions and we're going to actually try to answer some of them.
A
I mean, for me it was Spark 2.x: you didn't really build a star schema because it didn't really perform that well with joins. You couldn't do predicate pushdowns over partitioned joins, and it wasn't very good, so you used to have to do really odd designs. Essentially, pre-Spark 3.0, I spent most of my design time trying to trick people into including the actual partition column in the filter predicate of their SQL statements. That's how I spent my life: no, no, ignore those columns, use this nice shiny column that you really should query on. But after 3.0 — especially after adaptive query execution and dynamic partition pruning went in — you can do things like have a date dimension, have people filter on the date dimension, and have Spark work out that, oh, actually, I can take that filter context and apply it over to my fact table — and it just works. So it's less of a "what should you consider when migrating"; it'll just actually work, as opposed to not working so well.
B
Yeah, though the one thing I would definitely add — which I'm a huge fan of — is adaptive query execution, AQE, which actually speeds up performance, and it's designed very much with that particular data warehousing style in mind. And my favorite part of AQE is skew handling: Spark 3.x basically solves that for you — well, I always just say it solves it for you, but it does a great job handling skew, so you don't actually have to sub-partition the data yourself. Specifically to your question: there's really nothing you need to do, it just happens. These are the learnings from the Spark community itself, as over time more and more people built these complicated systems — lakehouses — on top of Delta Lake.
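For reference, a minimal sketch of the Spark 3.x settings being discussed — these are on by default in recent releases, so this is only to show which knobs are involved (using the notebook's spark session):

# Spark 3.x: adaptive query execution, skew-join handling, and
# dynamic partition pruning (all enabled by default in recent versions).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")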
A
Okay, cool. The one thing — casting my mind back a couple of years now — I think the minor thing that tripped us up with the dimensions that we generated was that we had to change the case sensitivity of the various different data formats. I think that was the only main thing that actually trips you up in taking it as literally a lift and shift: suddenly it's like, oh, it's case sensitive now, because it's gone ANSI standard.
B
That's
right:
that's
right,
yes,
and
then
for
table
column
names.
As
of
delta
2.0.
You
can
accept
the
so
long
story
short.
The
reason
why
you
couldn't
you
actually
had
couldn't
have
spaces.
It
couldn't
have
capital
capitalizations,
though
frankly
capitalization
sort
of
sort
of
suck
anyways,
but
is
because
parque
itself
didn't
accept
it,
but
then
but
delta
2.0
onwards,
basically,
except
there's
a
mapping
that
was
actually
introduced
in
delta
1.2.
That
now
allows
us
to
go.
B
Do
that,
but
again,
these
are
minor
things
that
that
actually
just
make
things
a
little
bit
easier
for
you
more
more
than
anything
else.
That's
all
so
cool
ready
for
the
next
question,
because
we've
got
somebody
from
the
uk
asking
and
by
the
way
for
the
folks.
B
I
we
see
your
questions
from
jeff
and
hilbert
somebody
anonymous,
but
there's
one
from
graham
that
we
want
to
tackle
first,
just
because
it's
from
your
neck
of
the
woods,
okay,
so,
and
so
I
want
me
to
say
it
yeah
I'll,
say
it:
okay,
hello
from
wet
in
windy,
blackpool
in
the
uk.
So
there
you
go
for
your
neck
of
the
woods.
I'm
using
delta
lake
on
azure
synapse
to
build
a
lake
house
against
the
against
data
versus
data.
What
is
the
best
way
to
generate
generate
surrogate
keys
with
spark?
B
He
saw
mai
and
doug's
tech
talk
about
certain
keys
a
couple
years
ago
again,
the
one
that
includes
references
to
stargate,
g1
and
stargate
atlantis.
Have
your
views
changed
yet.
A
Was
the
thing
I
was
I
was,
I
was
teasing
danny
before
the
stream
with
a
question
of
my
own,
which
is
so
we
used
to
build
surrogate
keys
in
a
couple
of
different
patterns.
You
can
do
it
with
a
a
row
number
window
function
and
that
that
involves
a
little
bit
of
sorting
and
it's
like
performance
or
you
could
do
it
with
the
monitor
monotonically
increasing
id,
which
is
a
fantastic
name
which
is
super
performance.
A
A
Options,
it
was
just
a
performant
way,
a
tidy
way,
but
these
days
delta
you've
got
an
identity
column.
So
if
you
have
built
a
delta
table
before
you
insert
data
into
it,
so
if
you
do
a
create
table
as
and
specify
the
table,
properties
and
specific
properties,
you
can
say,
have
an
identity,
column
and
then
just
insert
data
into
it,
merge
data
into
it
and
it
manages
its
own
identity.
So
the
same
way
as
you
would
in
terms
of
a
normal
sql
database,
you
would
have
an
identity
column.
I
mean
I
know.
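A minimal sketch of the identity-column approach — table and column names are hypothetical, and the GENERATED ALWAYS AS IDENTITY syntax depends on your Delta/Databricks version (using the notebook's spark session):

spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_customer (
        customer_sk   BIGINT GENERATED ALWAYS AS IDENTITY,
        customer_id   STRING,
        customer_name STRING
    ) USING DELTA
""")

# Inserts (and merges) then manage the surrogate key for you:
spark.sql("""
    INSERT INTO dim_customer (customer_id, customer_name)
    VALUES ('C-001', 'Graham')
""")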
B
The funny thing about that statement is that, yes, within an SMP — a single system, within SQL Server, since I'm from SQL Server — it's possible to do such things. When you have a distributed system where each of the workers actually needs to generate a unique ID, that's a lot more complicated than people realize, and so yes, it took a little bit longer for us to get there. There is actually a great session — I'm pasting it in; let me paste it in the chat so y'all can see it. Oh, where did my chat go? There you go, all right. The one I'm pasting right now into LinkedIn and YouTube is Simon's video about identity columns and Delta. And then, of course, if you wanted the previous history — where, like I said, there are references to Stargate SG-1 — I'm going to paste my and Doug's version of that as well.

B
So I think that should cover those two scenarios quite nicely, and that should cover us on surrogate keys for now. Let's go ahead and go to the next question. All right, let me go to LinkedIn first, because otherwise I'm going to forget what people are asking here. What are the biggest challenges in Delta Lake for real-time data refresh?
A
I
mean
joins
realistically,
so
in
terms
of
how
we
design
likes.
You
know
you
talk
about
the
even
talk
about
bronze
silver
gold,
so
you
know
what
I
use
but
run.
Somebody
else
is
fine,
having
bronze
and
for
real
time
great
easy,
because
you
tend
to
have
a
one-to-one
mapping
between
a
data
set
comes
in,
we
put
it
down,
we
pick
the
data
set
up
and
we
clean
it.
We
get
it
right
and
we
get
it
all
ready
and
validated.
A
We
put
it
down
again
now,
having
that
single
line
of
hops
that
easy
to
stream
things
just
work
yeah,
you
have
to
manage
it.
You
get
like
some
optimizer.
You
have
to
look
after
for
the
smalls
files
problem
of
parquet,
if
you're
constantly
constantly
inserting
small
things,
but
that's
all
fairly
simple
the
moment
you
say
right
and
then
I
want
to
join
these
five
tables
together
to
make
something.
Like
effect,
things
get
really
complicated
because
it
depends
on
the
the
window
depends
on
the
kind
of
the
temporal
relation
of
those
different
things.
A
How
much
state
are
you
having
to
keep
in
memory
of
your
spark
cluster
to
actually
get
that
working,
so
the
majority
of
them
for
pure
brutal
simplicity?
We
tend
to
have
things
going
full
real
time
up
to
that
kind
of
silver
layer
and
then
have
just
incremental
rebuilds
of
the
gold
layer
or
kind
of
just.
You
know
every
15
minutes
every
five
minutes,
whatever
happens
to
be
saying,
take
everything
that
changed
pull
that
over
just
because
trying
to
do
it
as
spark
structured
streaming.
A
You've
got
limitations
as
to
how
many
streams
you
can
join
together
and
then
a
steam
streams
are
static
or
static
to
stream
and
there's
just
different
constraints
about
what
you
can
achieve
there
and,
if
you're
asking
people
to
do
it
from
a
from
a
business
point
of
view.
From
a
data
analyst
point
of
view,
kind
of
you've
got
all
your
data,
your
dimensions
designed
effective
designs.
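As an illustration of that "incremental gold rebuild" pattern, a minimal sketch that reads only what changed in silver and upserts it into gold on a scheduled trigger — table names and paths are hypothetical, and trigger(availableNow=True) needs Spark 3.3+ or a recent Databricks runtime (using the notebook's spark session):

from delta.tables import DeltaTable

def upsert_to_gold(batch_df, batch_id):
    # Merge each micro-batch of silver changes into the gold table.
    gold = DeltaTable.forName(spark, "gold_daily_sales")
    (gold.alias("g")
         .merge(batch_df.alias("s"), "g.sale_id = s.sale_id")
         .whenMatchedUpdateAll()
         .whenNotMatchedInsertAll()
         .execute())

(spark.readStream.table("silver_sales")
      .writeStream
      .foreachBatch(upsert_to_gold)
      .option("checkpointLocation", "/checkpoints/gold_daily_sales")
      .trigger(availableNow=True)   # process what's new, then stop
      .start())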
B
Yeah, I can't overemphasize Simon's point enough, which is that, more often than not, it's actually the business logic. That's not to say the technical logic isn't difficult — it can be quite the opposite; quite assuredly it can be quite difficult from a technical perspective. But what usually happens is that the business that's making the request doesn't actually understand the implications of what they're asking, and because they don't, the people trying to build it end up struggling. So you want to make sure that, when the system breaks down — and all systems break, right — it's really easy to reboot, restart, refresh, whatever. The other quick call-out I would make is that when you talk about "refresh", it really depends: that word actually means different things to different people.

B
So, for example, somebody from the business might say "refresh" means everything from bronze to silver to gold needs to be completely rebuilt, versus an analyst just saying, oh no, I just need the gold table refreshed. With the former — everything needs to be refreshed — you're basically taking everything from the raw data, because you've decided, for the sake of argument, that the business logic has changed: what used to be coded as red is now blue. All of it needs to be rebuilt — every silver table, every gold table, all the machine learning, all the analytics that went with it. Now, be forewarned: more times than not, that's actually not what they mean — it's just what they think it means. And the latter is: oh no, I just need a change to a single table from silver to gold, and that just requires some business logic updates, that's all — or modifications; I should probably be very specific with my wording here. But the context, nevertheless, is that it's that business logic that really comes into play and really starts messing you up. So, more times than not, when people use that word "refresh", you have to be very careful about what it actually implies and what it entails. Does that make sense?
A
Makes sense to me. All right, final thing on that: if you're going through your bronze and silver and you're not going to rebuild your gold, you can also point your reports at silver, and whenever someone hits go, they recalculate those things on the fly — the report takes a second or two longer to render, but you're not actually having to recalculate or refresh everything up front. There are different patterns, right; it's about deciding what needs to be updated.
B
Excellent. Okay, so let me switch to YouTube before I go back to Zoom here — and this is actually related to Hilbert's question from Zoom about DLT and dbt. So: do you recommend using data modeling tools like dbt instead of a bunch of transformation notebooks?

A
I don't have a strong opinion on dbt, honestly, because in terms of the stuff that I do, we've already invested: I've already built a lot of transformation notebooks, and I tend to be lifting and shifting existing SQL to then just plug into them. So for me, dbt doesn't make sense for my estate of code. I've seen a lot of clients who have had great success using dbt and applying some of those principles to template out the logic, but I don't have a huge amount of hands-on experience with it yet. It's on my list of stuff I need to spend a lot of time digging into, so I can't really say too much about it.
B
Okay,
well,
and
but
by
the
same
token,
that
that's
actually
the
valid
answer
right,
because
a
lot
of
the
times
when
it
comes
to
using
systems
like
dbt
or
dlt,
which
is
related
to
hilbert's
question
about
dlt,
okay
and
just
using
notebooks,
it
really
depends
on
where
you're
coming
from
right.
So,
for
example,
the
often
the
data
engineering
persona
is
one
that
they're
using
clis
and
ides
to
do
all
their
development.
B
The
data
scientist
is
the
one
who
uses
notebooks
right
so
for
sake,
argument
if
you're
or
data
analyst,
for
that
matter,
I'm
sorry.
So
if
you're,
typically
a
data
analyst
or
data
scientist
that
has
a
lot
of
sql
statements
or
python
statements
using
that
as
an
example,
it
may
actually
make
in
all
seriousness,
make
complete
sense
to
go
ahead
and
just
use
notebooks
for
those
type
of
transformations.
B
By
the
same
token,
if
you're
coming
in
from
and
by
the
way,
dbt
is
great
for
that,
because
the
whole
purpose
is
to
write
everything
in
sql,
okay.
By
the
same
token,
if
you're
coming
from
like
no
no,
I
really
need
an
ide
style
or
I
need
cli.
This
is
where
I'm
saying
okay.
Well,
then,
maybe
I
want
a
cli
system.
B
Maybe
I
want
to
use
dlt
or
in
some
cases
dbt
or
whatever
else,
to
do
that
and
so
really
honestly,
it's
the
context
is
very
much
where
you're
coming
from
and
which
what
tools
you're
already
used
to
using.
So,
for
example,
if
you
are
a
scala
ide
developer,
then
you
typically
use
intellij
honestly,
that's
probably
where
you're
going
to
come
from
you're
going
to
come
from
that
aspect
right
versus.
If
you're
going
to
go,
be
a
python
developer,
you're,
probably
going
to
be
perfectly
happy
inside
the
notebooks
okay.
B
So
hopefully
that
answers
that
question
from
youtube.
Now
this
flows
into.
Are
there
big
differences
between
dlt
and
dbt
right
away?
I
can
I'll
take
that
answer.
Since
I've
used
dbt
a
little
bit
yeah
I
mean
there
are
very
big
differences.
Delta
live
tables
is
very
much
about
the
structure
of
delta
lake
tables
within
the
context
of
data
bricks
right
now.
Okay,
right
now,
that
is
okay.
Now
in
the
case
of
dbt,
dbt,
is
great
for
using
sql
statements
on
multiple
different
sources:
okay,
so
dlt!
B
It's
not
is
very
it's.
Its
design
is
very
centric
about
how
to
simplify
the
streaming
and
and
or
batch
processing.
So
it's
very
much
this
context
of
you're
coming
from
spark
you're
coming
from
delta
lake.
Now,
how
do
I
go
ahead
and
abstract
away
some
of
the
complexities
around
using
different
environments,
basically
write.
The
code
once
apply
to
different
environments,
you're
good,
to
go
in
the
case
of
dbt.
On
the
other
hand,
it's
very
much
about
saying.
B
Okay,
let
me
use
sql
statements
to
go
ahead
and
apply
to
all
of
these
different
databases,
all
these
different
systems,
source
systems,
and
so
there
there
definitely
is
some
intersection
between
the
two
but
they're
actually
really
designed
for
two
different
environments,
and
hopefully
that
helps
answer
that
question.
B
Rock on. Okay, we've got plenty of other questions and I realize we're probably going to try to solve them in the next 15 minutes, so let's do this. We have another question from our buddy Greg Kramer on YouTube. He's asking: is there a relation between dbt and Fivetran? There are lots of tools — it was easier when we could just handle everything, all the data, with Excel. So, do you want to comment, or do you want me to comment first on this one?
B
Okay, so Greg — Greg, Greg, Greg — Greg's an old friend. In terms of all the different tools: I'm probably not going to be able to explain dbt versus Fivetran in a decent enough way. These are great tools that allow connectivity to different source systems. The fact is, there are a lot of systems out there — a lot of different orchestration systems, a lot of different development systems — and we're probably not the best people to explain all that stuff. Honestly, we will actually have sessions on dbt and Delta Lake, and Fivetran and Delta Lake, and all this stuff in the near future, and those will be more appropriate for that type of discussion. But what it comes down to — and the part that I will say — is that yes, it was easier when we did everything in Excel. Part of the reason I'm laughing at that particular statement is that Greg knows one of the first demos I did — by the way, when I was still at Microsoft — was to basically get Excel to connect directly to Hadoop to bring data down. And that's true when you're dealing with less than 65,000 rows — the old 65,000-row limit in Excel. As data has gotten larger and there's more complexity, and things have gotten much faster, the reality is that while Excel is a great end tool to look at this stuff, 95% of our problems are much more related to everything before that, as opposed to that little tidbit near the tail end.
A
I'll tell you, it was easier when we could do everything in Excel because the data was a small enough problem that it could be done in Excel. Yes — the fact that we're doing larger, bigger, crazier things, that we're doing things in real time, that we're dealing with horrible, nasty, gnarled, nested JSON, that we're doing mad data science stuff — you couldn't really do that back in those days. So yeah, it's gotten more complicated, and yeah, so have the environments.
B
Exactly. Join us — let's switch back over to Zoom here. Jeff actually asked this question early on. This is a bit of a longer one, so I'm going to read it, but a little slower so that everybody can hear it. "Good morning. I'm using Qlik Replicate (it's a CDC tool) and also Dynamics 365 to push data to a data lake. Both create incremental change files in an Azure storage account. In both scenarios, the initial and any subsequent full-load files land in one folder, and change files land in a second folder" — so, full-load files in one folder, change files in a second folder. "I'd like to use Auto Loader to ingest to a bronze Delta table, but how do you best account for two source paths?" Keeping in mind full reloads that could be issued after the fact, plus the fact that you still have change data capture — all of the above.
A
I
mean
it's
a
pain.
Essentially,
you
need
to
have
a
switch
in
your
notebooks
that
look
after
it.
Essentially
it's
two
separate
parts
I
mean
whenever
we're
doing
an
autoloader
load,
we'd
have
a
single
notebook
that
does
it
anyway,
so
he's
saying,
load
from
autoloader
and
I'll.
Tell
you
the
folder
path,
I'll,
tell
you
the
details,
I'll
tell
you
the
scheme
to
expect
all
that
kind
of
stuff
and
then
transform
it
and
land
it
into
whatever
songs
table
you're
dealing
with.
A
Essentially
it
something
would
have
to
do
to
trigger
the
run
of
the
historical
something
would
have
to
run
to
trigger
a
run
of
the
at
the
increment.
It
depends
how
you're
using
autoloader,
because
I
think,
compared
to
the
databricks
true
standard,
if
it's
a
streaming
tool
and
it
should
be
turned
on
streaming
all
the
time
I
use
autoloader
like
a
criminal
and
just
turn
it
into
a
trigger
available
now
root
and
just
run
it
as
an
increment
batch
one.
A
It's
just
a
very
easy
one
for
me
to
batch
data
in
so
it
depends
on
how
you
trigger
in
it.
If
you
just
leave,
it
turned
on,
if
you
leave
it
streaming,
so
you
should
essentially
have
two
parallel
streams
running
which
could
be
the
same
notebook
but
just
executed
in
parallel
with
different
different
parameters.
Playing
one
saying
is
my
historical
stream,
one
saying
that's
my
incremental
stream,
but
there's
an
overhead
right,
there's
a
cost
of
leaving
that
historic
one.
A
If
that's
just
sat
there
streaming
waiting
for
for
a
full
historical
reload
that
might
never
come
that.
That's
a
big
overhead
to
have
sat
on
your
spark
driver
constantly
running
microbatch's
gun
any
more
data,
no
any
more
data,
no
any
more
data.
No,
so
I
don't
tend
to
do
it.
I
tend
to
split
it
up
as
yeah
two
two
parallel
streams,
but
I
have
another
mechanism
that
triggers
the
full
reload
and
build
that
into
the
logic
of
whatever
orchestrator
you're
using.
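A minimal sketch of that "Auto Loader as an incremental batch" pattern — cloudFiles is a Databricks Auto Loader feature, and the paths, options, and table names here are hypothetical (using the notebook's spark session):

(spark.readStream
      .format("cloudFiles")                      # Databricks Auto Loader
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "/schemas/orders_changes")
      .load("/landing/orders/changes/")          # the change-file folder
      .writeStream
      .option("checkpointLocation", "/checkpoints/bronze_orders_changes")
      .trigger(availableNow=True)                # run as an incremental batch, then stop
      .toTable("bronze_orders_changes"))

# The full-load folder would be a second, parameterized run of the same
# notebook pointing at "/landing/orders/full/", triggered by the orchestrator.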
B
That's great — I don't think there's much more I can add to that. Okay, we've got tons of other questions, let me try to get into them. Next one, from Hilbert: what is the best way to start with a framework to make your notebooks more generic?
A
Give away my secrets! Yeah, there are loads of things you can do. Essentially, it's getting used to the idea that whenever you write Spark code — whenever you write a bit of PySpark — if you get a choice between doing something that can be parameterized and something that can't, do the one that can be parameterized. And there are a few little tips and tricks with that. So, making your spark.read command generic: don't use the implicit file formats — don't do spark.read.csv, do spark.read.format and pass in the string "csv", which means you can pass it in as a parameter, then .load and pass in the path. Get used to building your spark.read so it's fully parameterized: pass in the options, pass in the config — get used to passing things in. Again, for transformations, we had a lot of SQL, so you can use the expr function — it's a bit like Python's eval — which allows you to just dump a string in there, and if that string can be parsed as a bit of Spark SQL, it'll work. That means that in the logic of your notebook you don't have to say "I'm going to change this column, I'm going to apply this calculation to it"; you say "apply something to some column", and then you can start to make it generic — you can inject strings at runtime and make it very generic. And then, at the end, your spark.write — your dataframe.write — has the same thing: you can make that generic. You tell it where to go, how to put it, whether to use Delta or not — or just not include a format and it'll use Delta automatically, but then you're a criminal, yeah. So you've got those three steps, and that's what a generic notebook is: read something, do something to it, write it somewhere. And if you break your entire data pipeline into lots of steps like that — so you're not having these big notebooks holding all the different transformations needed to generate the 20 tables in your data model — each element of your data model is a single notebook that runs independently and is entirely generic: read something, do something to it, write it out. If you can start to build in those patterns, you can basically parameterize anything in your notebooks. For the majority of the lakes that we deal with, if we're talking about going from bronze to silver and silver to gold, we tend to have maybe three or four notebooks, and that's it. Everything else is just parameters and configuration: run it in a loop, run that same notebook 100 times with different parameters to build 100 tables at once. That's how we do it, in a nutshell.
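A minimal sketch of that read-something / do-something / write-something pattern — every name here (paths, formats, expressions, tables) is a hypothetical parameter (using the notebook's spark session):

from pyspark.sql import functions as F

def run_generic_step(source_format, source_path, options,
                     column_exprs, target_table):
    # Read: format, path, and options all arrive as parameters.
    df = spark.read.format(source_format).options(**options).load(source_path)

    # Transform: column expressions arrive as strings and are parsed with expr().
    for col_name, sql_expr in column_exprs.items():
        df = df.withColumn(col_name, F.expr(sql_expr))

    # Write: land it as Delta; mode and partitioning could be parameters too.
    df.write.format("delta").mode("append").saveAsTable(target_table)

# The same function runs for every table, only the parameters change:
run_generic_step(
    source_format="csv",
    source_path="/landing/orders/",
    options={"header": "true"},
    column_exprs={"order_total": "quantity * unit_price"},
    target_table="bronze_orders",
)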
B
Yeah, that's great. Okay, I've got a bunch of small questions that I'll just knock off right now. First one, from LinkedIn, from Anyesh — I apologize if I did not say your name correctly: can Delta Lake views be registered in data catalogs like Collibra? In general, when you register a Delta Lake table, you're registering it into something like an HMS, a Hive metastore.
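For context, registering an existing Delta location in a Hive-metastore-backed catalog is a one-liner — the database, table name, and path below are hypothetical (using the notebook's spark session):

# Register an existing Delta table location in the metastore so catalog
# tools that read the metastore can see it.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.events
    USING DELTA
    LOCATION 'abfss://lake@myaccount.dfs.core.windows.net/delta/events'
""")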
B
You
can
often
do
it
with
things
like
glue
as
well,
there's,
actually
a
glue
delta
light
connector
that
was
released.
I
want
to
say
two
months
ago,
I
believe,
there's
also
azure
purview
integration
and
data
hub
recently
released
theirs.
I
don't
believe
collabora
itself
currently
has
one,
but
it
would
not
be
that
complicated
to
do
a
plug-in.
B
So
one
of
the
things
I
usually
ask
people
that
if
you
have
ideas
for
pl
for
connectors
like
libra
or
anything
else,
let
us
know
join
us
at
go.delta.io,
slack,
okay,
that's
the
delta
user,
slack
chime
in
there
with
your
ideas
and
and
then
we'll
track
them
or
either
or
go
to
the
delta
users
delta
lake
github.
B
Excuse me — as in go.delta.io/github — and you can just chime in there with your ideas, because more times than not, that's how we as the Delta community are getting ideas to actually add these different integrations: basically based on those Slack messages and/or GitHub issues being created. So hopefully that answers your question on this one. Let's see, there was another one — oh, Ardavan, you had asked — the surrogate key version with the notebook isn't available.
B
Bummer
on
that,
my
apologies.
So
do
me
a
small
favor
audubon
if
you
can
just
either
link
ping
me
by
linkedin,
like
you
did
before
or
open
a
github
issue
and
then
we'll
go
tackle
that
asap.
So
my
apologies
for
that
one.
Let's
see
there's
another
question
from
sema
from
youtube.
She
had
she
or
he
I
apologize.
You
had
asked
if
a
specific
thing
about
delta
sharing
this
one's
a
little
bit
more
complicated.
B
So
what
I
did
is
I
did
a
response
saying:
please
go
to
go.delta
dot,
io
slack.
There
is
a
delta
sharing
channel
by
all
means.
Ask
us
questions
there,
because
then
we'll
have
a
bunch
of
the
delta
sharing
folks
there
to
help
you
get
everything
up
and
running
avishek
asked
the
question:
will
the
when
will
the
delta
lake
definitive
guy
be
published,
we're
still
working
on
it?
It's
been
with
everything
that
happened
with
delta
2.0.
We
end
up
delaying
writing
because
half
of
what
we
wrote
was
changing
based
on
delta
2.0.
B
So
now
we're
actually
going
through
the
process
of
trying
to
get
ourselves
up
and
running
with
that
again
all
right.
So
those
were
the
quick
questions.
Let's
see,
there
is
a
great
question
from
dennis
or
denis
from
linkedin,
which
is:
is
there
a
known
or
good
strategy
for
often
run
vacuum
and
define
file
retention
values
depending
on
your
table,
update
frequency.
A
I
know
it's
very
much
and
it
depends
how
how
often
do
you
change
the
team,
how
much
how
how
much
of
the
the
data
gets
touched
in
each
incremental
upset
so
because,
currently,
whenever
you
update,
you
know
if
there's
a
parquet
file
that
gets
changed
as
part
of
an
update
action.
Anything
that
didn't
change
now
is
copied
into
another
parka
file
and
lo
shuffle
likes
of
vader,
the
the
delete's
gotten
better.
So
it's
now
actually
kind
of
just
lotion
for
merging,
like
if
you're
doing,
upsets
copies
the
records
that
didn't
change
separately.
A
So
you
might
protect
them.
But
then,
if
they
get
changed
in
the
next
one,
you're
still
going
to
have
multiple
copies
but
like
incremental
upsets,
do
create
lots
and
lots
and
lots
of
redundant
copies
of
data.
But
if
you're
doing
a
daily
update,
you
don't
need
probably
no
need
to
daily
vacuum.
A
You
can
put
it
away
with
a
weekly
vacuum
if
you're
doing
streaming
and
you're
updating
the
table
every
10
seconds
every
60
seconds.
Maybe
it's
actually
touching
a
lot
and
your
data
volume
just
exponentially
growing.
So
it's
always
kind
of
a.
How
much
are
you
touching
it
kind
of
how
many
times
a
day?
Are
you
updating
that
table
and
therefore
how
many
extra
copies
you're
eating
you
were
streaming
and
it
was
append
only
so
you're
not
actually
making
any
records?
Obviously,
it's
fine.
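For reference, the knobs being discussed look roughly like this — the table name and the seven-day retention are just illustrative defaults (using the notebook's spark session):

# Retention is a table property; VACUUM removes files older than the window.
spark.sql("""
    ALTER TABLE silver_orders
    SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 7 days')
""")
spark.sql("VACUUM silver_orders RETAIN 168 HOURS")  # e.g. run weekly, keeping 7 days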
B
Perfect:
okay,
we
actually
probably
only
have
time
for
two
more
questions
and
both
are
a
little
bit
of
a
doozy
one.
So
we'll
try
our
best
to
answer
these
questions.
First,
one's
from
shaman
from
linkedin.
The
question
is:
what
is
the
best
way
to
handle
configure
information
for
your
silver
layer,
batch
processing?
A
So
depends
what
kind
of
configuration
are
we
talking?
So
we're
talking
about
things
that
we'd
kind
of
correct
one
of
the
one
of
the
question
things
that
we
talked
about
earlier
is
what
comes
first,
the
table
or
the
update
state,
because
a
lot
of
the
way
that
we
used
to
build
these
things
is
the
dataframe.right
command
would
implicitly
create
the
table.
So
you
can
apply
table
properties.
A
You
can
do
lots
of
things
in
that
just
in
the
right
state
and
so
we'd
have
kind
of
various
things
built
in
to
go
and
do
that
and
then
we
switched
things
around
and
we
said
well.
Actually
we
want
to
we're
going
to
start
you
doing
a
merge,
because
we've
got
merge
now
and
delta
and
there's
an
easier
way
to
do
things
and
then
actually
so,
we'll
actually
so
table
level
properties
kind
of
need
to
exist.
Before
we
give
a
merge,
we
can't
just
run
a
merge
statement.
It'd
be
lovely.
A
So
you
have,
you
normally
have
a
separate
little
function
and
we
just
generate
some
sql
in
there
kind
of
a
create
table
as
here's
all
my
columns,
here's
my
identity,
column,
here's
my
partitioning,
all
the
all
the
relevant
information
needed
for
that,
and
we
just
do
that
as
a
separate
little
three
commands
and
then
have
it
behind
in
this
statement
going
does
the
table
already
exist
if
not
go
and
create
the
table?
Here's
my
table
other
configuration.
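A rough sketch of that check-then-merge flow — the table name, columns, and partitioning are hypothetical, and tableExists is only available in newer Spark versions (using the notebook's spark session):

def ensure_table_exists(table_name):
    # Create the target with its properties before the first merge runs.
    if not spark.catalog.tableExists(table_name):
        spark.sql(f"""
            CREATE TABLE {table_name} (
                order_id   BIGINT,
                order_date DATE,
                amount     DOUBLE
            ) USING DELTA
            PARTITIONED BY (order_date)
        """)

ensure_table_exists("silver_orders")
# ...then run the merge into silver_orders as usual.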
B
Perfect. There's probably a lot more to this — just like what Simon talked about, it depends on what the definition of "configuration" is. Since we're pretty much out of time, but I still want to at least try to tackle two more questions, I'm going to chime in and say: please join us on the Delta users Slack — I'm going to paste it to everybody here so you can just join and ask us questions there.
B
But let's dive into the final question — actually, there's one quick question I'll answer right away, and then the final question, which I think is an interesting one for us. The question is: was Z-ordering introduced in Spark 3.x, or was it there before? Long story short, Z-ordering is actually specific to Delta Lake — well, I should rephrase that slightly: there are Z-orders in other systems too, but in the context of data lakes it's not a Spark thing, it's a Delta Lake thing. Z-ordering existed within Delta Lake within Databricks itself, and we recently open-sourced it as part of Delta Lake 2.0, so now it's available for everybody to work with right away. Underneath the covers, it basically reorganizes the data to improve data skipping — I'm not going to get into that right now because we don't have much time, but we can definitely chime in on that next time if you want to dive into it.
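For reference, a minimal example of the command being discussed — the table and column names are hypothetical, and it requires Delta Lake 2.0+ or Databricks (using the notebook's spark session):

# Cluster the data files by the columns you most often filter on.
spark.sql("OPTIMIZE silver_orders ZORDER BY (customer_id, order_date)")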
B
Exactly, that's fun — I'm fine with it. In fact, all I'll do is grab that in my session, we'll just paste it there, and we're good to go. All right, this one I figured we'd want to answer because I think it's a great question to end the session on. It's from Davine: in Delta Lake, how do we handle data deletion — that is, in scenarios like right-to-be-forgotten or data consent withdrawal — when we need to keep time travel for other records? For example, if we have a thousand records and you want time travel for the last 15 days, but 10 of those records need to get deleted today and you run the vacuum as well, you basically remove all of the last 15 days of data, right? So what is the right balance, and how do you approach that problem? I've got some answers and you've got some answers — why don't you go first.
A
Yeah
I
mean
so
gdpr,
certainly
for
us
in
europe
is,
is,
is
an
issue
right
most
of
the
time.
It's
not
an
immediate
someone
rings
up
and
says:
hey,
I
want
to
be
forgotten,
get
rid
of
my
data
and
you
have
to
get
rid
of
it.
That
moment
there
is
a
reasonable
period
for
you
to
enact
that
request
now
so
talking
about
15
days
worth
of
kind
of
date
of
attention
that
should
be
covered
by
the
amount
of
time
that
you've
gotten.
A
As
long
as
you
action
that
request,
you
do
the
deletion
and
you
are
keeping
a
short
enough
period
in
terms
of
your
time
trial
for
that
to
reasonably
be
forgotten
within
the
realm
of
that.
That's,
okay,
that's
that's!
Usually
fine.
I
mean
it
depends
on
the
the
rules
that
you're
in
depends
on
the
country
that
you're
in
depends
on
the
data
protection
legislation
that
you're
actually
under
as
to
how
big
that
window
needs
to
be,
but
you
can't
get
around
it.
A
If
you
have
to
delete
that
data,
you
have
to
vacuum
up
to
that
point
and
you
you'll
lose
time
travel
ability
up
to
that
point.
There
isn't
really
much
you
can
do.
We've
seen
some
interesting
things
that
people
have
done
to
try
and
ground
it,
which
is
more
less
keeping
the
data
but
keeping
the
data
in
an
obfuscated
or
encrypted
state
and
then
storing
an
encryption
key
elsewhere.
So
you
actually
at
real
time
when
you're
querying
it,
you
use
the
aes
decrypt
function
to
decrypt
the
data.
A
Then,
if
you
delete
the
value
for
that,
that's
gone
and
then
actually
your
actual
table
still
has.
Its
title
still
has
everything,
but
no
one
has
the
ability
to
decrypt
that
data.
Therefore,
the
data
is
technically
destroyed.
If
you
cannot
reverse
engineer
what
that
data
is
because
your
key
is
gone,
it
kind
of
gets
around
the
problem,
but
it
means
you,
then
have
to
join
that
table
and
decrypt
and
it's
a
performance
issue.
So
you
can,
you
can
build
it
to
be
incredibly
robust
and
just
kill
it
immediately
using
decryption
stuff.
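A minimal sketch of that crypto-shredding idea — aes_decrypt is a Spark SQL function available in newer Spark versions, and the tables, columns, and key storage here are hypothetical (using the notebook's spark session):

# Query-time decryption: PII is stored encrypted, and each person's key
# lives in a separate key table; deleting the key "destroys" the data.
decrypted = spark.sql("""
    SELECT c.customer_id,
           CAST(aes_decrypt(c.name_encrypted, k.key, 'GCM') AS STRING) AS name
    FROM customers c
    LEFT JOIN customer_keys k ON c.customer_id = k.customer_id
""")

# A right-to-be-forgotten request then becomes a delete of the key only:
spark.sql("DELETE FROM customer_keys WHERE customer_id = 'C-001'")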
A
Or you can just increase the amount of vacuuming you're doing, build it into your processes, and actually just talk to your data protection officers and ask: is the amount of time I've got acceptable — because I've got, say, a 15-day backup of my database — and that is normally absolutely fine in terms of how those things work. One thing which is a challenge is that the log retention period also needs to be considered, because if the right-to-be-forgotten PII information ends up in your minimum/maximum statistics, that's also PII sitting in the log. So "don't take stats on PII columns" needs to be something you're thinking about — which means it needs to not be in your first 32 columns — so there are a few things you need to be careful about.
B
Yeah, and just to add to Simon's point: this is a very complicated area, and it really depends on your legal department's and engineering department's definitions of how they follow GDPR. So, for example, one pattern that I've worked with — not even one company, sorry, one pattern — is that they've identified what is deemed PII and put it into separate Delta tables; in this case it's, in essence, a demographics table, if you want to think of it that way. So basically, all the PII is placed into these three or four demographic tables. What ends up happening is that they actually will not delete — interestingly enough, they'll specifically redact. In other words, say there's ID 20 with the name Simon: what they'll do is keep the ID 20 but change the name from Simon to "redacted". The reason they do that is because there are downstream systems they don't control that they have to make sure are aware of it, so they have to keep the ID. That way they know that ID 20 was the one that was specifically redacted, and they trigger the downstream systems: oh, we got the word "redacted" — now go ahead and clean up the downstream systems as well. So even though the Delta Lake — the data lake — is completely cleaned out by that (because, in essence, there is little to no history on the demographics table; that's the whole point — the history they need is on the facts, not on the demographics themselves, so they don't care if they wipe history away from the demographics table), what they do have to do is keep the ID, so they can specifically impact the downstream systems. And this in itself, by the way, is a longer discussion.
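A minimal sketch of that redact-in-place pattern — the table, columns, and ID are hypothetical (using the notebook's spark session):

# Redact the PII columns but keep the ID, so downstream systems can
# detect the 'REDACTED' marker and clean themselves up.
spark.sql("""
    UPDATE demographics
    SET name = 'REDACTED', email = 'REDACTED'
    WHERE person_id = 20
""")

# Then VACUUM the demographics table so the older file versions that still
# hold the original PII are physically removed once the retention window passes.
spark.sql("VACUUM demographics")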
B
All right, so that's it for today. We apologize for not being able to tackle all the questions — in prototypical Simon-and-Denny fashion, we probably did rat-hole a little bit in the beginning, so apologies for that. But please do join us on the Delta users Slack; you can ask your questions there, where both Simon and I are pretty regularly anyway, and we'll be here next month as well. So, without further ado, we're going to end today's session by going ahead and doing our little splash screen. Anything else to add before I do that, Simon?