From YouTube: Simon + Denny Ask Us Anything - November 15, 2022
Description
Join us for the monthly series "Simon and Denny - Ask Us Anything!" on Tuesday, November 15, 2022 where we will answer your data engineering questions from building a data platform to ingestion to ETL to analytics. With our background in SQL Server and BI to Apache Spark and Delta Lake - we want to show you how to build your own lakehouse.
As this session is interactive, come prepared to ask questions all throughout the session! Be prepared for another geeky, trans-Atlantic event from two data nerds.
Learn more about Delta Lake
Read Our Newest Blog Post: https://delta.io/blog
Join us on Slack: https://go.delta.io/slack
Delta Lake Releases: https://github.com/delta-io/delta/releases
A
Okay, there we go. Perfect. All right, hi everybody. As we slowly trickle in for today on Zoom, we're also getting ourselves broadcast to LinkedIn and to YouTube at the exact same time. So because of that, give us a couple of minutes to go ahead and get everything jump-started. Candace in the background from the Linux Foundation is setting up all the events, so basically we're just getting ourselves ready. Give us maybe 30 more seconds and we should be good to go.

A
Meanwhile, if you're on YouTube, LinkedIn or Zoom, why don't you tell us where you're based out of? For example, I myself, Denny, am based out of sunny... yes, actually, sunny Seattle. Go for it.

B
Well, I don't know. I mean, Kent in England, where it's rainy and grim and miserable, and it's not a holiday anymore, and I'm sad. Fine.

A
All right, perfect. Looks like we are live on all three platforms, so this is great. We've got some folks from New Jersey, from Leeds, from Miami. Ed, I'm hoping at least you from Miami are enjoying the hot weather. Northern California, I think it's not cold, at least I hope not. So there you go. Before we dive in... oh good, he's already commented back.

A
He says it's awesome down in Miami. Awesome for you, man. All right, one thing, just to provide some context and remind you all: this is basically an open forum where you're going to be asking myself, Denny Lee, here and Simon some awesome questions. Hopefully they're data engineering, Delta Lake, Spark and SQL related; at least those are the ones we intend to answer, I should put it that way.
A
So we can probably talk a little bit about that as a starting point, but if you have other questions or other contexts that you'd like to dive into, by all means go ahead and chime in and put your questions in the Q&A. Sorry, the Q&A is for Zoom; for LinkedIn, just drop them right here, where I'm actually typing right now. And perfect, Ed's already started with the GDPR question, and we actually have a question from YouTube.

A
So let me start with that one from YouTube and then we'll switch over to Ed for the GDPR question, and again, keep on chiming in. From YouTube there's an initial question from Tariq, which is a great question, although unfortunately I don't have a good answer here. And Tram, by the way, if you're trying to type your questions, type them exactly like you just did in the Q&A; that's exactly where you're supposed to put them, this is a perfect way to do it. But Tariq, your question: any idea if the book Delta Lake: The Definitive Guide will be out in print anytime soon? And candidly, the answer is, unfortunately, probably another year or so. Based on the feedback that we received from folks, we do need to do a rewrite. Part of it was the rewrite; part of it was also the fact that we had the release of Delta 2.0, which basically restructured a sizable chunk of the book.

A
So that's what I meant by the restructuring of the book. By the same token, like I said, there's a lot of things that happened because of Delta 2.0, so because of that we are currently in the process of rewriting the book. As soon as we have some structure around that, I will gladly go ahead and inform you all when that book's available. But right now, I'd say honestly the timeline's more like a year from now. Okay, so, all right, perfect, all right.
A
Now we have the first question, which is really the GDPR one from Ed from my awesome Miami. So I'll read out the question and then, Simon, you and I can figure out who wants to answer it first. Okay, he's got a loaded question, so let's start off with the pain right away. Okay, on GDPR: what is the best practice for using Azure ADLS and connecting it to Databricks, also considering that Unity Catalog may or may not be in the picture right now, but what should be?

B
I'm just skimming through it; there's a lot in terms of ADLS container and lake structuring in there, which we can go into. If you want to talk about Unity Catalog before we get into that, then you go ahead, and then we'll dive into how you should structure your lake for good security boundary concerns.

B
All right, okay, so the wider question. So yeah, best practice for using ADLS and connecting to Databricks, considering Unity Catalog may not be in the picture now but could be in the future. Should we use one storage container for everything, all layers in all environments? One storage container per environment: dev, QA, prod? How do you mount it? How does all that stuff fit together? So there's some basic advice around there in terms of separation of concerns.

B
Certainly when we're talking GDPR and all that kind of stuff, we can dig into it. I think that's already a whiteboard question. If that's all right, give me a moment, I will dive into a whiteboard.
B
Always. Yes, except weirdly my pen's kind of stopped working, so I have to shake it like it was a real biro. Now it's... it's authenticity. All right, okay. If you're building a lake, so you're saying we've got all layers; you know, we talked about, maybe you've got bronze, silver, gold, that kind of thing. I always annoy people by not using the silver and gold, kind of, you know, periodic-table naming. So you've got layers.

B
Yeah, yeah, fair enough, yeah. So we've got kind of our dev, test...

B
...we've got prod. So the question we're asking is: if I'm in Azure, for this question (other cloud providers have similar mechanisms), should I have a single blob storage account and then one container, kind of, you know, the traditional, sort of Hadoop-style root container, and then it's all just folder structures? That's option one. Option two, exactly, so...

B
Well, actually, you know, maybe I can keep having one storage account, but my containers are my dev, test and prod. Or the other one is saying: do I actually need real separation, and should I have entirely separate storage accounts and then maybe break down my bronze, silver and gold as my separate container layers? So there's lots of different ways we can configure around it. The easy answer is: environments should always be separated by Azure resources.
B
You
should
never
have
your
prod
data
in
the
same
Azure
Resource
as
your
Dev
data,
it's
your
test
date
and
all
that
kind
of
stuff.
A
few
reasons
for
that.
One
separation
of
admin
concerns
the
kind
of
people
you
want
to
give
admin
access
to
your
Dev
Storage
Lake.
If
you're
playing
around
and
testing
stuff,
you
don't
want
to
give
those
same
people
full
carte
blanche,
access
to
your
product
data.
That's
just
not
a
good
idea!
B
So
generally,
you
always
want
to
be
talking
about
having
your
entire
Dev
resources,
your
entire
test
resources,
your
entire
broad
resources,
because
you
don't
want
to
give
your
developers
full
absolute
access
to
all
prod
data
you're
not
going
to
pass
any
infosec
Security
reviews,
if
you're
doing
that,
so
environments
should
always
be
separated
just
from
a
security
concerns.
Point
of
view
whether
you
want
to
have
different
different
containers
for
your
different
things.
That's
saying:
I've
got
my
Dev
storage
account
and
maybe
I
have
containers
in
here.
B
For
my
bronze,
my
silver,
my
gold,
you
can
do
or
do
you
have
just
one
generic
one
and
then
that's
just
folder
structures
inside
having
a
generic
one
means
you
only
need
to
mount
it
to
databricks
once
so.
It's
slightly
easier
for
configuration
having
separate
containers
yeah,
you
can
do
admin
a
little
bit
separately.
There's
not
a
huge
amount
you
can
do
at
the
container
level
rather
than
the
storage
account
level.
So
that's
just
kind
of
a
tidiness
thing.
How
do
you?
B
How
do
you
want
to
structure
it
either
way
you
can
create
containers
programmatically,
you
can
create
folders,
programmatically
you're,
not
really
limited
by
a
which
way
around
you
go
with
those.
It's
just
there's
a
slight
bit
of
extra
work.
If
you
have
separate
containers
that
you
have
to
mount
each
and
every
container
as
a
separate
thing,
just
when
you're
actually
working
that
out
or
not
Mount
create
external
credentials.
If
you
go
in
the
unity,
catalog
route,
so
general
advice,
I
absolute
storage
accounts
separate
for
different
environments.
B
Whether
or
not
you
have
different
containers
kind
of
depends
on
how
often
you're
going
to
change
layers.
If
you
think
you'll
ever
change
and
add
new
layers
into
your
leg,
if
they're
fairly
fixed,
then
yeah,
you
do
the
effort
of
setting
up
once
and
then
it's
fine.
B
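A minimal sketch of the separation Simon draws on the whiteboard, assuming one ADLS Gen2 storage account per environment with bronze/silver/gold as containers inside each; all account and container names below are hypothetical.

```python
# One storage account per environment; layers live as containers (or folders)
# inside each account. All names here are made up for illustration.
ENV_ACCOUNTS = {
    "dev":  "mycorplakedev",
    "test": "mycorplaketest",
    "prod": "mycorplakeprod",
}

def layer_path(env: str, layer: str, table: str) -> str:
    """Return an abfss:// URI like abfss://<layer>@<account>.dfs.core.windows.net/<table>."""
    account = ENV_ACCOUNTS[env]
    return f"abfss://{layer}@{account}.dfs.core.windows.net/{table}"

# e.g. spark.read.format("delta").load(layer_path("prod", "silver", "sales"))
```

Admin access is then granted per storage account, so dev admins never touch prod, while the per-layer container split stays the tidiness choice discussed above.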
A
That was a cool answer, man. All right, so Ed, I'm sure you have a bunch of questions to follow up on that, but at least this is probably a good starting point for us. So if you do have a follow-up, by all means go ahead and chime in here. Okay, so great question and a great answer, Simon. Let's go right to the next one. Okay, and let's see... oh okay, you and I... you or I could answer those questions.
A
No, exactly, no, I apologize. All right, so it is actually pretty common to use, actually, even the open-source methodology to share your Delta tables. There's a Delta Sharing server, and it's actually completely open source: you go to github.com/delta-io/delta-sharing, okay, and the server code is basically there. And, like we noted, I'm going to answer the open-source one, and then the non-Databricks employee is going to answer the UC one, okay.

A
Because that's how we roll here. And so the context is that, because your data is already sitting inside scalable, reliable storage, i.e. ADLS Gen2, or if you're using AWS S3 or Google Cloud Storage or HDFS or whatever else, it can scale basically based off of that. And so what the Delta Sharing server is basically, simply doing is saying: hey, do you have the rights or the credentials to access this data?

A
That's all it's really doing. So when you, as the client, whether you're using pandas or Spark or Tableau or Power BI or whatever it is that you want to do, you're basically providing a token that says: hey, here are the credentials, I'm allowed to access this data. Once I'm about to access the data, all the Delta Sharing server does is provide pre-signed URLs, directly from the cloud storage, to the client.

A
So what that means is: whatever workspace, whether it's a Databricks one, whether it's Azure Synapse, whatever, I don't care, whatever workspace you're currently working with, because it's got the token it will do its job. Let's say, for the fun of it, I'm using Python pandas. It will basically use those pre-signed URLs to grab the Delta tables and then upload them... oh, excuse me, download them, and be able to query them. Those pre-signed URLs actually have a security timeout.

A
I forget exactly what the timeout is, but basically after X number of seconds, or whatever it is, you no longer have access to them. But basically this allows you to share the data without actually replicating the data, because you already have scalable storage. Now, underneath the covers maybe you've also geo-replicated the data and things of that nature, and that actually helps things out, but ultimately what it boils down to is that it's coming from the cloud storage itself, and you do not need to replicate the data to do that.
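A minimal sketch of the consumer side Denny describes, using the open-source delta-sharing Python client; the profile file path and the share/schema/table names are hypothetical.

```python
# pip install delta-sharing
import delta_sharing

# The profile file holds the sharing server endpoint plus the bearer token
# the recipient was given; the server validates the token and hands back
# short-lived pre-signed URLs that point straight at the cloud storage.
profile = "/path/to/my-company.share"
table_url = profile + "#sales_share.analytics.transactions"

# Load the shared Delta table as a pandas DataFrame (fine for small tables);
# load_as_spark() is the equivalent when running inside a Spark session.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```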
B
Man, there we go, cool. Okay, so essentially, one, I'm assuming we're talking about external tables, or unmanaged tables, where we've got... I need to shake my pen again and it magically starts working, what is that? And I've got... we've got our Databricks workspace sitting here, so we've got a Delta table that exists inside our lake, and because we're saying I want to share it between workspaces, I'm assuming you've registered that as a Delta table so people can see it in the SQL layer.

B
It's in Hive, it's in the metastore. So he's saying, well, how do I share that, amongst other things? So this is the old-school way of doing it, right: I've got a separate Databricks workspace, and there's nothing to stop me writing a bit of code saying CREATE TABLE over this thing, with a LOCATION, and just pointing at that existing Delta table. You can have multiple different workspaces all using the same bit of data, all exposing it; you can refer to it as different table names; you can do whatever you want. That's been in...

B
You know, you can do that in any kind of Spark environment; you can just register a SQL table over a file. But you'd have to do that manually. You know, if you spat out 100 Delta tables into your lake and you wanted to replicate those 100 tables, or register them, and federate those registrations across a load of different Databricks workspaces, you've got to manage a load of metadata. You need to run a load of CREATE TABLE scripts across a load of different workspaces, and it's honestly a bit of a pain.
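A minimal sketch of the "old school" registration Simon is describing: pointing a metastore entry at an existing Delta path from a second workspace. The path and table names are hypothetical; no data is copied, it is only a pointer.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register the same Delta files under whatever name this workspace prefers.
spark.sql("""
  CREATE TABLE IF NOT EXISTS analytics.sales
  USING DELTA
  LOCATION 'abfss://silver@mycorplakeprod.dfs.core.windows.net/sales'
""")
```

The pain point is that this script has to be run, and kept in sync, in every workspace that wants to see the table, which is exactly what the shared metastore described next removes.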
B
That's
the
old
way.
That's
the
way
we
used
to
do
things
so
we'll
have
some
process
that
creates
the
table
and
then
something
that
Rand
Robinson
goes
and
creates
them
in
a
lot
of
different
data,
which
workspaces,
which
is
really
annoying
well,
Unity,
catalog,
is
doing,
is
saying:
well
actually
this
thing
the
thing
that's
registered
here
that
kind
of
that
meta
store,
let's
not
make
it
live
in
databricks
anymore,
let's
rip
it
out
and
then
that
actually
lives
at
a
slightly
higher
grain.
B
So
it's
gonna
go
over
here
and
we
get
Unity
catalog
which
exists
at
your
tenant
and
region
level,
so
you
create
a
metastore
at
West,
Europe,
north
Europe
for
Azure.
You
know
whatever
Cloud
provisor
you're
using
you
register
the
table
with
that,
and
then
you
associate
your
workspaces
with
that
metastore.
So
it's
a
shared
metastore
that
exists
across
whatever
database
workspaces
that
you
have
so
you
can
just
register
the
table
once
and
then
use
it.
B
Many
many
times
and
again
it's
all
about
the
just
that
meta
metadata
registration,
it's
the
SQL
object
the
points
at
the
data
that
you
are
sharing
between
these
workspaces.
There's
no
data
replication
we're
not
keeping
10
20
copies
of
the
data.
So
if
you're
in
databricks
environment
you
can
switch
on
Unity
catalog,
you
can
take
that
step
and
you
can
have
that
centrally
managed
catalog,
which
is
again
good
for
governance.
It's
good
for
gdpr,
knowing
all
the
different
data
across
all
your
different
workspaces.
All
that
kind
of
stuff
is
great
for
that.
B
A
Cool, this is really awesome. I did want to remind everyone that this session actually wasn't going to be very UC-centric, even though we've got a bunch of UC questions, so try to keep them around Delta Lake and lakehouse. We'll still talk about UC a little bit, but I just want to remind folks what the main context for today's session is. Okay, let me switch over to LinkedIn. This is not directed at you, Simon; this is directed at some of the questions coming in, that's all.
A
Yeah, okay, let me switch over to LinkedIn, because we got some great questions there. So hey, one of the questions is: what is the best practice for leveraging DLT across different zones like bronze, silver and gold? My understanding is that DLT cannot cross between database targets. And Brent, it's a great question, but I'm probably going to have to ask you to clarify this one a little bit, because when you say "cross between databases", it's more a matter of: when we say you have different zones...

A
...these are tables, okay. Like your bronze set of tables that were, for the sake of argument, direct dumps of, or loads of, data from whatever databases or whatever CSV files or whatever else you're currently working with, okay. And the whole premise of saying we have a bronze and then a silver and a gold level is not about "can I cross between databases" per se.

A
It's more about saying: okay, when you're at bronze, it's basically just a dump of data; when you're at silver, we've actually filtered it out and cleansed it and made it something that's actually usable; and then for gold, it's something that's really great for aggregation or for machine learning. In other words, it's more about a data quality framework. So, if you could clarify a little bit about where you're coming from in terms of the database...
B
I
I
I
get
the
question
because
I've
got
I've,
got
the
same
complaint:
okay,
fair
enough!
So
when
we're
talking
about
you
know,
essentially
it's
all
it's
just
organization,
it's
making
it
tidy
right
right.
B
So,
if
I've,
if
I
had
100
tables,
if
I
had,
if
I
had
an
a
slightly
awkwardly
large
number
of
tables
or
100
data
feeds
that
I'm
trying
to
pull
through
and
I
pulled
them
all
in
as
bronze
tables
and
then
I
cleaned
them
all
of
the
silver
tapes
and
then
out
of
that
I
made
50
or
so
gold
tables.
Currently
in
DLT
they
would
all
live
inside
the
same
database
structure.
This
came
the
same.
B
Hive
schema,
so
I'd
have
100
brands,
100,
silver
and
then
50
gold,
so
I've
got
250
pedals
just
as
a
giant
pile
that
I
have
to
have
naming
conventions.
I
have
to
have
gold
underscore
so
I
know
where
it
lives,
yeah,
I,
can't
easily
say
Grant
ax
to
just
gold.
For
that
team.
It's
just
a.
We
can't
currently
organize
the
tables
into
that
nicer
bucket
for
managing
it.
That's
it!
Oh.
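For context, a minimal sketch of what such a pipeline looks like in Delta Live Tables, under the assumption (as described in this session) that every table it defines lands in the pipeline's single target schema; all table names and paths are hypothetical.

```python
import dlt
from pyspark.sql import functions as F

# In a DLT pipeline, `spark` is provided by the runtime.

@dlt.table(name="bronze_orders", comment="Raw dump of the orders feed")
def bronze_orders():
    return spark.read.format("json").load("/mnt/raw/orders")

@dlt.table(name="silver_orders", comment="Cleansed orders")
def silver_orders():
    return dlt.read("bronze_orders").where(F.col("order_id").isNotNull())

@dlt.table(name="gold_orders_daily", comment="Daily aggregates for reporting")
def gold_orders_daily():
    return dlt.read("silver_orders").groupBy("order_date").count()
```

All three tables publish to the one target database configured on the pipeline, hence the gold_ and silver_ naming conventions Simon mentions.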
A
Yeah, okay, so that's great color. I was obviously coming at that particular question from a very different context. So for something like that, for manageability...

A
...I would absolutely provide the feedback directly to the Databricks team. Part of the reason, in terms of, I want to say a first principle, though I think that's a little too strong a statement, but in terms of the context: a lot of the feedback that we have received, not just for Databricks but in general from the community, was the ability to basically have different catalogs for different sets of users.

A
Now, I think that also ends up having its own set of problems, as you just described, both Simon and also Brent here, so I'm not disagreeing with your assessment now that I understand the context better. So thank you for clarifying, Simon, and Brent, I would just simply say: provide that feedback directly, so that we can make that manageability significantly easier, because I do agree with your assessment, actually. It's just that we're probably not the right forum for that, that's all I'm saying. Yeah.
A
Yeah, yeah, we love frameworks around frameworks, that's great. By the way, for everybody who's listening to us right now, we are being excessively sarcastic, just in case you didn't know us that well. So, all right, this one's a little bit more freewheeling, but I still think it's great to talk about here; it caused a discussion. This is from Gordon Anthony on the LinkedIn side. Don't worry, folks on Zoom and LinkedIn, we'll go...

A
Oh sorry, and YouTube, we'll go back to you; we're just literally going back and forth between the three different platforms here. So it's from Gordon. He says: a constant discussion at work is when to finally dive into Delta Lake, not look back, and move away from our classic on-prem data warehouse and analytics stack. As people are making more of these decisions and giving people the power, what I work to say is: is it big enough? Or what is your opinion? Basically, the context is...

A
...is there any advice in terms of how to make that transition? So I'll start first, but I'm sure Simon will have a lot of his own advice on this one too. One of the things, and the reason I like this particular question even though it's a little confusing, is because I actually came from there, and actually both Simon and I both came up that way.

A
So if you even watch some of our older sessions, Simon and I talk about how we're both SQL Server hounds, and we made the transition from SQL to Delta Lake, to Spark, things of that nature.
A
I myself personally came from the SQL Server team, so, in other words, classic on-prem data warehousing. I literally came from the system where I was proposing: hey, everything should be in a SQL data warehouse. And the thing that we have to remember, when you're making that transition and saying you want to build with Delta Lake and lakehouses in general, right...

A
...the discussion is really about the fact that you're making this transition from an on-prem database, which was great for what it was very good at doing. But the data paradigm has changed since then. When we built databases, it was around the idea that we knew everything had to be structured.

A
Okay, when we talk about data lakes versus databases: databases allowed us to say, hey, we want transactional protection and we want to go ahead and have reliability, but everything is structured. With data lakes we made that transition and said: hey, no, no, we don't care about schemas, we don't care about any of that; we need flexibility and we need scalability.

A
You know, and that is why it's so fundamental for lakehouses in the first place: because it gives you that ability to have that scale, have that reliability, while also having the transactional protection. And so that's part of the reason I myself made that transition out of databases and data warehouses into the realm of Spark and Delta Lake in the first place.
A
Just because of the fact that I needed the best of both worlds in order to be able to tackle the new problems: like, for example, I want to do streaming, or I want to do machine learning or AI, or any of these other things. And so, at least from where I'm coming from, that's usually where the discussion should be, because you're recognizing that the data problems that you have are actually much more complicated than what databases by themselves could actually do. So that's my little spiel.
B
Yeah, I mean, so there's two things. One, massively, is the use cases; you know, the things that a relational database doesn't do well. If you have a ton of them knocking around, absolutely. And it's, to be super cheesy about it, the Vs of big data, right? So yeah, you've got scale, but it's good to mention there it should not just be a "how big is it" discussion. Scale is one thing: we can't process that amount of data in the amount of time that we need to.

B
You might need something that can do distributed compute, which Spark's good for. Streaming, as you said: you can't process data fast enough, we need something that can do that kind of stuff. Variety: you know, we need to be able to process a whole mix of images, data, documents, JSON, horrible gnarly nested data, anything that happens to be there. They're all good, classic use cases.

B
There are still tons of people who I help do a lift and shift of their classic legacy warehouse into a lakehouse, and that's not because of a use case; it's not some newfangled thing that's come along, it's not because their data's getting bigger and bigger and bigger. A lot of the way that I work tends to be about agility. A lot of the legacy, classic tools, when you're building something that can process data, clean data and shape data...

B
...it's very, very manual in the old set of tools. You're either using code generators and turning a crank and it spits out reams of code, or you're doing something graphical, and for every single new bit of data that you have, you have to hand-code a bit of how to process that thing and deploy it. And the move away from the legacy tech has kind of moved a lot of that; that's where the whole data engineering thing came from, for me.
B
The big argument for doing things in a lakehouse way is that the technology that sits behind it means we can do things like write one PySpark script that says: clean a bit of data from here and put it into there, look up a list of rules and dynamically apply them. So I can have one script to do that stuff, and then, if someone says "can you load more data in?", then I can just add metadata to load more data, and I...

B
...don't need to deploy code, I don't need to do a load of productionization stuff. So for me that is just such a massive sales pitch of agility, and, you know, it's an investment in the tech so that you can then speed up massively down the line. So we had the wonderful use cases, size and scale, sure; and then, do you ever need to add new data, are you constantly having to find new data sources and add them in? That is a major, major selling point for me.
A
Yeah, typically my response to something like this is: you know your data warehouse or data system is no longer being used when you don't need to change it anymore, right? The reality of a successful data project is that you actually are changing things: new business problems, new data sources, whatever else; you're changing it all the time.

A
This is not a knock on databases, by the way. When I say something like this, some people might imply that; it's not. The knock is really more toward management who think: oh yeah, I've built it, now I'm done and I have nothing else to do. Like, no, no, that's not how these systems work, right? Cool, all right.
A
Let's switch gears; we're going to switch over to, sorry, YouTube, because we have not actually answered questions on YouTube for a little while. Okay, there's two questions that I want to tackle right away, just because they've been waiting for us; one's a little bit quicker, so I'll get to that one right away.

A
There's a question from TCP Packet on YouTube, which is asking: hey, should we replace batch jobs with run-once streaming? And I believe that's a reference to trigger once or trigger availableNow. The quick answer is: in many scenarios, that's exactly what you can do. You can actually change your batch jobs into Structured Streaming jobs, because of the way Spark works when it comes to Structured Streaming: really the only difference, outside the APIs going from read to readStream or write to writeStream, is basically latency.

A
That's it. Basically, it's the same logic, same Catalyst engine, same everything that's happening underneath the covers, so it actually simplifies the way you write your code. So what that translates to is: you can then go ahead and literally use trigger once or trigger availableNow to basically look at the data that's coming in, so that way, instead of having multiple batch jobs, you can have, for the sake of argument, a single streaming job. One really cool example is actually from Comcast.

A
They actually had a sessionization problem, where they're sessionizing data from their set-top boxes, and, because they switched to Delta Lake, they went from 640 VMs down to 64, which is pretty sweet. But in addition to doing that, they went from 84 batch jobs to three streaming jobs. So, pretty impressive; but that's the whole point: from a manageability standpoint it's a lot simpler. So I just wanted to call that out. Anything else you want to add before I go to the next one?
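A minimal sketch of the point about the trigger being the only real difference: the same Structured Streaming job can run as a "batch" catch-up or continuously just by swapping the trigger. Paths are hypothetical, and availableNow assumes a reasonably recent Spark (3.3+); older versions use trigger(once=True).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

writer = (
    spark.readStream.format("delta")
         .load("/mnt/bronze/events")                       # source Delta table
         .writeStream.format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/events_silver")
)

# Run like a batch job: process everything available, then stop.
writer.trigger(availableNow=True).start("/mnt/silver/events")

# The exact same code runs continuously by changing only the trigger, e.g.:
# writer.trigger(processingTime="1 minute").start("/mnt/silver/events")
```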
B
For
us,
as
a
consultancy
out
of
the
books,
reference
architecture
is
using
stream
once
using
available.
Now,
that's
because
if
someone
says
I
do
a
back
job
cool,
if
someone
says,
can
you
make
it
streaming?
We
just
change
the
trigger
and
it's
the
same
code,
if
someone
says
actually,
rather
than
doing
a
daily
make
it
hourly,
it's
just
you
just
trigger
it
differently
and
you
don't
exchange
on
your
code.
It
makes
life
a
hell
of
a
lot
easier
for
all
of
that
stuff.
B
The
only
caveat
I
would
say
is
the
complexity
of
doing
some
of
the
bad
things
that
aren't
supported
by
streaming.
B
So,
if
you're
trying
to
do
a
merge
into
Delta
you're
trying
to
write
to
multiple
sources,
you're
trying
to
do
those
some
things
you're
trying
to
do
things
like
Drop
duplicates
on
on
there,
you
don't
want
to
do
that
on
a
stream,
because
that's
going
to
be
very
stateful,
problematic
things,
so
you
end
up
doing
that
for
each
batch.
So
you
have
your
stream,
you
have
a
for
each
batch.
B
All
your
actual
processing
logic
is
in
there
for
each
batch,
so
from
a
complexity
of
building
the
script,
there's
one
or
two
you
want
Spitz
that
you
need
to
get
around
once
you've
built
it
once.
You
can
then
use
it
for
many,
many
things,
and
it's
really
straightforward
and
that's
how
we
build
things
but
yeah
I
would
say
super
easy,
just
click
a
button
it'll
make
your
life
easier,
but
it's
a
such
a
flexible
pattern.
We
use
it
for
everything.
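A minimal sketch of the foreachBatch pattern Simon describes, where the stream hands each micro-batch to a function that is free to run batch-only operations such as MERGE or dropDuplicates; the table paths and column names are hypothetical.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def upsert_to_silver(micro_batch_df, batch_id):
    # Batch-style logic lives here: dedupe the micro-batch, then MERGE it in.
    deduped = micro_batch_df.dropDuplicates(["event_id"])
    target = DeltaTable.forPath(spark, "/mnt/silver/events")
    (target.alias("t")
           .merge(deduped.alias("s"), "t.event_id = s.event_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(spark.readStream.format("delta").load("/mnt/bronze/events")
      .writeStream
      .foreachBatch(upsert_to_silver)
      .option("checkpointLocation", "/mnt/checkpoints/events_merge")
      .trigger(availableNow=True)
      .start())
```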
A
Thanks for the callout; that's a really good call about forEachBatch. That becomes a very, very powerful tool. It takes a little while to get used to, just like Simon called out, but once you get used to it, it actually becomes very useful and very helpful. So yes, all right, next question: Louis from YouTube actually has a great question, and there is not a straightforward answer for this one.

A
So: can you explain how to disable CDC, or sorry, change data feed, on Delta tables for huge data reprocessing? Okay, so the thing about it in general is that you technically don't need to turn off change data feed, because all it's doing is leveraging the transaction log you already have, and we're simply exposing that information so you get it on a row-by-row basis. Now, to your point, though, which is valid...
A
Maybe, if you're doing a large enough data reprocessing, you don't actually want to go ahead and see all of the row-by-row changes from change data feed; you may not even want to see it in the history, okay. So in those particular situations, one of the ways, and I'm sure Simon has some other ways, but at least a pattern that I often see is: you don't actually reload the data fully into that particular table...

A
...you actually load it into a whole new table. You verify that that's the table you want, because it's a full-blown data reprocessing; you rename the old table, and you basically keep it there for a month or two, just to make sure there were no issues. That way you have both tables running side by side, and you don't run the risk of needing to keep all the history that is no longer applicable because you did need to do a reprocess. And then, once you're comfortable...
A
...then flip on change data feed on the new table, okay. So I would typically do something like that because, like I said, change data feed really is just opening up, or making available, the log that Delta already has; we didn't actually really do much more than that. I mean, this is not a knock on all the engineers that worked on it.
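For completeness, a minimal sketch of the switch itself: change data feed is controlled by a Delta table property, so it can be disabled or enabled per table with ALTER TABLE. Table names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Turn change data feed off on a table before a heavy rewrite...
spark.sql("""
  ALTER TABLE silver.sales
  SET TBLPROPERTIES (delta.enableChangeDataFeed = false)
""")

# ...and back on, e.g. on the reprocessed replacement table once it is verified.
spark.sql("""
  ALTER TABLE silver.sales_reprocessed
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")
```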
B
I guess the only challenge is if it was a partial reprocess. So if you had, like, five years of history and you wanted to reprocess the most recent year, you didn't want to lose the change feed from all four previous years, and you're just restating a given year: that's not going to work in that scenario. So, is there a solution for that scenario?

A
The thing is, like I said, in the end it doesn't avoid the problem. Even if you were to disable it, that still doesn't change the fact that you would actually end up rewriting so much of the data in the first place, and then you basically build up a gigantic transaction log and duplicate data. So that's...
A
I'm
trying
to
avoid
that
particular
problem
to
say:
is
it
possible
for
us
to
basically
like,
even
if,
if
it's
a
partial
reload?
Theoretically,
what
you
do
is
you
could?
Oh
here
we
go
like
we're,
there's
a
solution
on
the
spot
here.
If
it's
a
Year's
worth
of
data,
you
basically
build
a
view
that
points
to
the
old
table
like
that.
Has
two
thousand
you
know:
18
19
20
data
2021
is
the
one
year
of
processing.
That's
actually
in
the
new
table.
That's
gets
rebuilt.
A
Look
at
the
data,
verify
the
view
then
go
ahead
and
coalesce
everything
and
and
because
a
coal
lesson,
because
the
view
basically
skips
out
the
2021
data
from
table
a.
But
then
it's
now
and
it's
merging
sorry
unioning
with
the
data
from
table
Beach.
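A minimal sketch of that on-the-spot idea: keep the untouched history in the old table, reprocess the one year into a new table, and expose the two as a single view. Table and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
  CREATE OR REPLACE VIEW silver.sales AS
  SELECT * FROM silver.sales_history     WHERE year(event_date) <  2021
  UNION ALL
  SELECT * FROM silver.sales_reprocessed WHERE year(event_date) >= 2021
""")
```

The change feed and history on the pre-2021 table stay intact, while only the reprocessed table carries the rewritten 2021 data.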
A
Yeah, exactly, so yeah. So this is us just literally solving problems on the fly, so this may solve your problem. Sorry, this was for, who was it, Louis on YouTube? So hopefully that helps. Don't forget, join us on the Delta users Slack, go.delta.io/slack; we're there answering tons of questions. And so, okay, let me go ahead and switch back to Zoom, because we have not answered a ton of Zoom questions. Okay.

A
So let me go back to this now. So, Babak, hopefully I said your name correctly, from Dusseldorf, Germany, cool, awesome. He has a question about Parquet modular encryption. Basically, since Spark 3.2, columnar encryption is supported for Parquet tables with Apache Parquet 1.12+, so the file parts are encrypted with a key. Is this encryption compatible with the Delta format, or can we use it with Delta tables? Honestly, I...
A
...so if your system is actually able to both write and read, presumably because you're using Spark 3.2+ with your encrypted columns, then this should not be a problem at all within Delta Lake whatsoever. Saying that, I do want to put in the caveat "should work", because I have not personally tested this. But if you do run into issues, please open up a GitHub issue and/or join us on the Delta users Slack; we'd love to actually see that, because, again, I'm not expecting a problem.
B
I've never used Parquet modular encryption; I've not dug into it. Whenever we encrypt things, we've been doing it using the Spark functions for AES encryption, aes_encrypt, which is what we tend to use for doing that kind of column-level encryption and decryption. That's pushing it up to the compute level rather than the storage level. Gotcha.
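A minimal sketch of the compute-level approach Simon mentions, using the built-in aes_encrypt / aes_decrypt SQL functions (available in newer Spark releases, 3.3+). The key handling is deliberately simplified and the table and column names are hypothetical; in practice the key would come from a secrets manager.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

key = "0123456789abcdef"  # 16/24/32-byte key; normally fetched from a secret store

# Encrypt a sensitive column before writing it to the lake...
encrypted = (
    spark.table("silver.customers")
         .withColumn("email_enc", F.expr(f"aes_encrypt(email, '{key}')"))
         .drop("email")
)

# ...and decrypt it on read for users who hold the key.
decrypted = encrypted.withColumn(
    "email", F.expr(f"cast(aes_decrypt(email_enc, '{key}') as string)")
)
```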
A
Gotcha,
perfect
tram.
You've
got
a
couple
questions
specifically
about
Unity
catalog.
Honestly,
I'm
gonna
probably
have
a
skip
those
questions
and
I'd
like
you
to.
If
that's
okay,
if
you
can
ping
your
databricks
rep
for
those
specific
questions,
because
right
at
least
for
now,
you
see
specific
to
the
databricks
environment,
so
I
I,
don't
feel
either
of
us
probably
will
be
very
good.
Answering
those
questions
I
mean
unless
Simon
is
brutal.
Simon
the
non-databricks
guide
knows
more
about
UC
than
I
do
so.
A
A
Let's
see,
okay,
hero
and
hacks
for
my
I'm
thinking
for
Miami
probably
has
another
question
on
Zoom
before
we
switch
back
to
LinkedIn
what
is
happening
under
the
covers
when
executing
analytics
on
data
that
is
stored
in
different
geographical
regions
like
across
regions
or
across
clouds,
is
data
filter
remotely
and
then
copied
across
a
single
location?
A
And
how
can
this
be
performant
when
data
is
stored
remotely
from
the
current
workspace?
And
you
don't
want
to
copy
data
to
a
local
workspaces?
Okay,
there's
no
straight
answer
to
this
question:
I'm
going
to
start
it,
but
knowing
full!
A
Well, there's no straight answer. Now, with Delta Lake in general there is a concept of predicate pushdown, so there can be some filtering done at the remote location. But the reality is, because it's a remote location, and the Spark job, and I'm using Spark as an example, but whether you're using Trino or Flink or anything else it's really the same, the fact is the job, the data processing framework that you're using, okay, is not sitting in the same location as your remote data.

A
So, since it's not, there's not much you can do; you will end up still grabbing the vast majority of your data across the wire, and then the egress costs are basically going to be very, very painful. Now, how you can potentially work around that problem is by running the jobs remotely, where basically your data processing framework is also running on the remote system as well. Okay, so for the sake of argument now, I'm going to just use Spark, just because it's easier for me to talk about.
A
You have a Spark job in region A, your Spark job in region B and your Spark job in region C. You run them as disparate jobs to basically shrink the data down, because you've done a join. So, for example, you're doing a join with a demographics table: in that particular case I would copy the demographics table from your current central location to each of the regions and run the joins there. So now you've shrunk the data from, let's just say, a billion rows each, down to 200,000.

A
Okay, then what ends up happening is that the egress from region A, region B and region C to the central location is a lot less: instead of transferring a billion rows from each region, now you're only transferring 200,000 rows. So this could actually simplify the egress cost, but again, this becomes a lot more complicated to manage, because you're running, basically, remote jobs on each system. Now, I wouldn't say it's common, but it is actually a practice that some organizations do, to basically reduce the workloads or the egress costs between systems.
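A minimal sketch of that "shrink it remotely, then move it" idea: each regional job joins and reduces the data next to where it lives, and only the small results cross regions. All paths and table names are hypothetical.

```python
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# In each region, a job running next to the data does something like:
#   (spark.read.format("delta").load(local_billion_row_table)
#         .join(demographics_copy, "customer_id")
#         .groupBy("segment").count()
#         .write.format("delta").save(regional_result_path))

# The central job then only pulls the already-reduced results across the wire.
regional_results = [
    "abfss://results@lakeuswest.dfs.core.windows.net/sales_summary",
    "abfss://results@lakeeuwest.dfs.core.windows.net/sales_summary",
    "abfss://results@lakeapse.dfs.core.windows.net/sales_summary",
]
dfs = [spark.read.format("delta").load(p) for p in regional_results]
combined = reduce(lambda a, b: a.unionByName(b), dfs)
```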
A
Saying what I just said: more often than not, if you're a large enough organization in which you do need to worry about egress costs, usually you can actually work with your cloud providers, and they're actually going to go ahead and deal with each other. Because, for example, if you've got a direct connect, right, for example your on-prem environment has a direct connect to both AWS and GCS, or in Azure, to all three, for example; there are various data centers...

A
...that allow you to do that. Because there's a direct connect, the egress costs between the data center and each of the clouds are actually negligible anyway. And so then, if you're doing something like that, for example on-prem, just using that particular example, your on-prem system is actually executing that same Spark job, or distributed framework job, against each of the regions; it's grabbing the data, and the egress costs actually are negligible; it's just more the time it takes to do the actual egress, right? And so, as you can tell, there's no straightforward answer to this.

A
This is basically something where you can jump back and forth in terms of how you want to do it, and so it really depends on what you're trying to achieve. It makes a lot of sense where you're coming from, but depending on your scenario, you may want to, again, run remote jobs, or again, if you've got a direct-connect type of setup, the egress costs aren't actually going to be that big a deal, so...
B
I want to touch on a slightly different direction, please. If we talk about Delta specifically, just Delta things: all of the performance tuning that you've got in Delta is about reading less data. It's about selectively reading just some of the files that are inside that table, rather than all of the files. So it depends whether your analytics setup has just distributed data across lots of different storage, or distributed Delta tables across lots of different places.

B
You know, reducing that egress cost, improving the performance: if we're talking Spark, Spark does everything in memory; it has to read the data into memory to work with it, but it doesn't have to read all of the data. If we're talking about partitioning, and you run a query that hits the partition key, or it's using dynamic partition pruning, it will only read the data from those partitions, so you can just ignore all the other subfolders.

B
If you're talking about Z-ordering and you've got data skipping, you've got your statistics; that's going to go, well, actually, I don't need to read those files, and you're reading even fewer files. If you're talking about Bloom filter indexes, it goes: actually, it's most likely going to be in that file. You've got these various different performance techniques you can use if you're using Delta, which is pushing that filtering down towards the cloud storage, so it doesn't bother reading data...

B
...it doesn't have to read. Again, that's only if we're working with Delta and partitioning in Spark; yeah, for other things it just depends on what kind of data you've got and where it is.
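A minimal sketch of the "read fewer files" techniques Simon lists: Z-ordering the table so data-skipping statistics can prune files, then filtering on the partition and Z-ordered columns. The table, partition and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cluster the file layout by a commonly filtered column (Delta OPTIMIZE / ZORDER).
spark.sql("OPTIMIZE silver.events ZORDER BY (customer_id)")

# A query hitting the partition key plus the Z-ordered column lets Delta skip
# most files using its per-file statistics, so far less data crosses the wire.
spark.sql("""
  SELECT count(*)
  FROM silver.events
  WHERE event_date = '2022-11-15' AND customer_id = 'C-123'
""").show()
```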
A
Excellent, okay. We have a ton more questions and not enough time. So look, you and I, we keep... we're workaholics, so let's try to dive into them. Okay, back to LinkedIn: Jonathan just asked...

A
We need to backfill a Delta bronze raw table, loading JSON files segmented by day, about 500,000 files a day, for the last 13 months. Once loaded, we want to turn on DLT to process it as a stream, as real-time-ish sales data. Strategies for loading that much data? So, Jonathan, great question; I'm going to do a high-level answer and try to make it short.

A
You probably want to stream the data in, just because that way it's a continuous process, and that way it'll actually go ahead and work in your current mode, whether you're using DLT or not, basically just from the standpoint of streaming, because you want to go ahead and keep running the stream.
A
The main strategy that you want to take into account is the fact that, because you're trying to load a whole bunch of data from the past, you'll probably want to knock your cluster up to a larger size, so it can go ahead and stream the data in faster. But the tracking of all the individual files, whether they're processed or not, things of that nature, that's actually done by Structured Streaming.

A
You can also make use of, if you're using Databricks, Auto Loader to help with that process. But irrespective of whether you use, you know, vanilla Spark or whatever else, the context I'm trying to get at is basically: I would use streaming, basically the same code that you would normally run for regular processing, run that for your backfill, but then knock up the number of available nodes so you can actually catch up. So, anything you'd like to add, Simon, on that one?
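A minimal sketch of the backfill Denny describes, assuming Databricks Auto Loader (cloudFiles) so Structured Streaming tracks which of the day-partitioned JSON files have been processed, with an availableNow trigger so a temporarily oversized cluster can chew through the 13 months of history and then stop. Paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = (
    spark.readStream.format("cloudFiles")                  # Databricks Auto Loader
         .option("cloudFiles.format", "json")
         .option("cloudFiles.schemaLocation", "/mnt/checkpoints/sales_schema")
         .load("/mnt/landing/sales/")                       # day-segmented JSON folders
)

(raw.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/sales_bronze")
    .trigger(availableNow=True)                             # catch up, then stop
    .start("/mnt/bronze/sales_raw"))
```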
B
So I've not actually tried switching existing Delta tables over to then be DLT-loaded afterwards; I don't know how awkward that's going to be in terms of just switching that over to then be a DLT process. You probably could, if you've got the location and folder structure set up in advance, but I don't know if it's going to try and reprocess that table.

B
If you couldn't, and you're trying to work out the best, most efficient way to get in and read that: if that's 500k files per day, we've got the small-files JSON problem, and you're figuring out how to get that in, the easiest way is to batch them into JSON Lines files, turn them into slightly more performant files, and still treat it as your DLT source if you have to reload them in from scratch. But it's certainly easier to build it as a Delta table than to switch it over; I've just not actually tried. Okay.
A
Yeah, well, all right, give it a shot; let us know. Cool, all right, back to the GDPR one. Tony asked a great question: how can we selectively remove records from a Delta version history, in order to service a right-to-be-forgotten request? So, do you want to go first, or do you want me to go first on this one?
B
Tony, deliberately asking the awkward questions. Yeah, I mean, so, realistically, removing them from the Delta version history... there are two parts to it. Today we talk about the right to be forgotten, squeezing in an actual bit of GDPR at the end. You know, if you just run a delete statement saying, well, this person, we're going to obfuscate the data, we're going to mask the data, we're going to trash the encryption key, we're going to delete the record, whatever method of forgetting you're doing...

B
...if you just do that, run that on your Delta table, then it's going to be in your history until you do a vacuum, and you can't selectively open up one of those historical Parquet files, scribble out the data and close it back up again. Essentially, if you perform a right to be forgotten, you have that record still in your data, in your version history, until the point when you vacuum. It's part of building your processes to know: okay, right to be forgotten, how long have we got to action that process?
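A minimal sketch of the two steps Simon separates: the logical delete, and then the vacuum that physically removes the older Parquet files which still contain the record. Table and column names are hypothetical, and the retention window shown is the Delta default.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 1: the logical "forget" (delete, mask or obfuscate the record).
spark.sql("DELETE FROM silver.customers WHERE customer_id = 'C-123'")

# Step 2: the record still exists in older file versions (time travel can reach it)
# until those files fall outside the retention window and are vacuumed away.
spark.sql("VACUUM silver.customers RETAIN 168 HOURS")  # 7 days, the default retention
```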
A
Agreed. Saying that, one of the things I definitely want to say is that you can design systems up front to make it less painful, okay. Now, this goes back to the processes. For example, one of the scenarios that I often talk about, which the folks over at Starbucks did, right, is the data that their legal department, and again, remember...

A
...we are not lawyers, so you do need to talk to your lawyers for stuff like this, okay. But the data that would be deemed as the stuff that would fall under a GDPR request, like a right-to-be-forgotten request, I'm going to deem that as a demographics table, okay. And so the idea of the data processing itself is that the data itself, when it comes to demographics data, goes into a separate demographics table, and then the fact information would only contain the ID and nothing else.

A
So in their particular case, what they did is that, when they received a right-to-be-forgotten request, they would update the demographics table, not delete it, at least not yet. The reason they updated it, basically, is to say, very clearly, "redacted", okay. And they often, within the notes, or even within the transaction, would put the GDPR request ID, so they can say: okay, this is associated with this request that just came in, okay.
A
The reason they did the update versus the delete wasn't because they were trying to avoid anything, and by the way, underneath the covers they would actually still have to run a vacuum, exactly as Simon was calling out. But the idea is that you wouldn't need to redact and vacuum all the fact data; you would only need to vacuum the demographics data, which isn't changing nearly as much, okay. And so, in the process of doing that, number one...

A
...they still wouldn't have to run the vacuum for 30 days, precisely because: what if you accidentally deleted the wrong person, right? There are GDPR compliance rules where you're actually allowed to keep the information in history, like a backup, precisely because you might accidentally delete the wrong thing. The point is that there is a paper trail that clearly states it, and that's the reason why they said: no, we actually went and updated the information to say we redacted that person.
They
also
did
that,
because
that
way,
the
the
downstream
systems
could
go
ahead
and
actually
automatically
delete.
They
could
tell
their
Downstream
system
to
delete
any
demographic
information
related
to
that
person
as
well.
So
that's
why
they
didn't
want
to
go
ahead
and
delete
the
information
right
away
and
so
now
again
we're
not
lawyers
so
different
organizations.
A
Different
countries
are
going
to
see
this
slightly
differently
and
you're
going
to
have
to
work
with
your
legal
department,
and
this
is
not
a
bad
thing
by
the
way,
it's
actually
hard
and
tricky
to
do
some
of
the
stuff
and
lawyers
are
in
fact
some
of
our
best
friends
when
it
comes
to
stuff
like
this.
So
this
is
me
not
trying
to
knock
lawyers.
Quite
the
opposite.
I've
worked
with
some
really
awesome.
Lawyers
sounds
funny
coming
from
me,
but
really
awesome
lawyers
to
basically
go
ahead
and
actually
make
sure
we
follow
gdpr
compliance.
A
B
The one slight twist on that approach of having the separate table and deleting those separate attributes, which again links into what we were originally going to talk about in terms of PII, is doing it with dynamic decryption. So we encrypt the data in that table, whether it's separate as a dimension table with a key off the fact, or however you do that, and people have to join to a separate table that is essentially, like you're saying, that lookup list of keys, with a decryption key for each of those different keys.

B
And then, if someone says "forget me", you just go in and delete the decryption key, so essentially the data's still there, but it can no longer be decrypted: you have trashed that data. The flip side of that means everyone who accesses that data has to join to a table and run an AES decrypt, and so you end up with views that do that dynamic decryption, which is the same pattern we used to do for things...

B
...you know, in kind of traditional relational warehouses, but we can now get those same patterns working, and it's much more efficient, because again, you're deleting a single record and then vacuuming that one table's history. Plus you don't have your statistics, you don't have any PII data exposed in any of your Delta log stats.

B
So if you're trying to be very, very compliant, and you do have really sensitive data, and you've got a mix of users, some who are allowed to see it, some who are not allowed to see it: having everything sensitive encrypted, and having a central master decryption key linked to each of those individual master records, actually works really nicely for that pattern. It hits performance a little, because you're having to decrypt on the fly.
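A minimal sketch of that crypto-shredding pattern: sensitive columns are stored encrypted, a key table maps each person to their own decryption key, authorised readers go through a view that decrypts on the fly, and a right-to-be-forgotten request deletes just the key row. This assumes the built-in aes_decrypt function (Spark 3.3+); all table, column and key names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# View for authorised readers: join to the per-person key and decrypt on the fly.
spark.sql("""
  CREATE OR REPLACE VIEW gold.customers_pii AS
  SELECT c.customer_id,
         cast(aes_decrypt(c.email_enc, k.decryption_key) AS string) AS email
  FROM silver.customers c
  JOIN secure.customer_keys k
    ON c.customer_id = k.customer_id
""")

# Right to be forgotten: remove the key, not the data. The encrypted bytes that
# remain, including in old file versions and statistics, are now unreadable.
spark.sql("DELETE FROM secure.customer_keys WHERE customer_id = 'C-123'")
```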
A
All right, this is excellent, excellent stuff. So, unfortunately, we're actually at the point where we need to stop answering questions, because it's eight minutes past the hour, and technically we were supposed to finish ten minutes before the hour. Saying this, you might have heard a big ping in the background. What we're going to do, because we did not answer a bunch of the questions, and a lot of them are actually really, really good...

A
First of all, you can always join us at go.delta.io/slack and ask those questions there. But the reason you heard the big ping is because I took a screenshot of all of your questions. Simon and I will make sure to answer those questions in the next session, which will be December 13th, I believe. We run this monthly, so you can come back in, but we're going to actually copy down the questions and we're going to make sure we answer your questions in that session as well.
A
So thank you very much, everybody, for taking the time out of your day to spend it with the two of us. Like I said, if you have any questions in the interim, definitely go ahead and join us on Slack. But meanwhile, like I said, I've already taken screenshots of all the existing questions, so I apologize for not being able to answer...