Delta Lake Delta Lake Tech Talks, 23 Mar 2021

From YouTube: Seattle Spark + AI Meetup: How Apache Spark™ 3.0 and Delta Lake Enhance Data Lake Reliability

Description

Apache Spark™ has become the de-facto open-source standard for big data processing for its ease of use and performance. The open-source Delta Lake project improves Spark’s data reliability, with new capabilities like ACID transactions, Schema Enforcement, and Time Travel.

Join us in this meetup to learn more about the performance improvements in Apache Spark 3.0 including Adaptive Query Execution (AQE), Dynamic Partition Pruning (DPP), and handling skewed queries!

Topics to be covered include:

* The new Adaptive Query Execution (AQE) framework within Spark 3.0 can yield query performance gains. Based on a 3TB TPC-DS benchmark, two queries had more than a 1.5x speedup, and another 37 queries had more than 1.1x speedup.
* With Dynamic Partition Pruning (DPP), we can significantly speed up performance by pruning partitions based on the joins between the fact and dimension tables common in star schema design.
* Showcasing transactional support as part of DataSourceV2 with Delta Lake

Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here: https://databricks.com/databricks-named-leader-by-gartner

A

In Kirkland, so the area a little bit north of Juanita Beach. I am going to be so excited when we can finally start doing these sessions together on site, as opposed to virtually. Hey Harry, from Bellevue, great to hear. Hopefully the weather is a little bit better in Bellevue than in Kirkland, just because the Lake Washington weather zone is what it is. Hey Paul, good to hear from you again. Too bad

A

we don't have this thing set up so we can actually chat by voice, but I would definitely love to say hi to you and catch up over coffee sometime, Paul. And yes, you're from spanish. All right.

A

Let's see, we're going to wait just a couple more minutes to let other people join, and then we'll go forward from there. Meanwhile, I'm going to do my best to answer any questions on LinkedIn, though, like I said, I'm going to be a little bit inconsistent for the LinkedIn folks, not because I don't want to answer your questions, but only because it's easier for me to see the questions directly within Zoom.

A

Okay, so again, apologies in advance if I do not answer your questions directly on LinkedIn. For the folks on LinkedIn, like I said, I'm not trying to ignore you. Go ahead and join us and say hi. Where are you from? You can do it in LinkedIn, which I'm monitoring, or you can do it in the chat within Zoom, so we can do it either way. This is meant to be a fun session for us to chat and have conversations.

A

Unfortunately, you're going to be primarily hearing my voice, but by the same token, let's definitely chat. I will try my best to answer questions as well as I can. Obviously I'm not omnipotent, but I will do my best. So let's dive into it, because we've got a lot to cover today. By the way, thanks very much to Karen. She is in the background managing all the logistics here, so I want to give a shout-out to her for taking care of this.

A

So in case you don't know, today is March 23rd, 9 a.m. Pacific, and as part of the Seattle Spark + AI Meetup we're going to talk about how Apache Spark 3.0 and Delta Lake enhance data lake reliability.

A

For the folks wondering if it's being recorded: it is being recorded right now, and it's being broadcast to YouTube. And hey to the person from the UK, awesome. Again, this is the session on how Apache Spark 3.0 and Delta Lake enhance data lake reliability, in case you didn't know which session you're in. The one thing to note here is that we actually have a webinar. It was called out in the meetup.

A

We have a webinar that talks about some of this stuff. Number one, we've updated this a little bit, and number two, I wanted people to go ahead and chime in and ask questions. That's the reason why we're doing this session. In case you don't know who I am (hopefully the folks in Seattle sort of know who I am), my name is Denny Lee. I'm a developer advocate at Databricks. Actually, I'll skip through this stuff.

A

Basically, I'm a geek. I'm based out of the Seattle area. I used to work at Microsoft, where I helped build Project Isotope, which is now known as Azure HDInsight. I've been working with Spark since 0.6, if I recall correctly, and I've been with Databricks for a while and love it here.

A

So I have a bit of background when it comes to big data, data engineering, and data science, because when I was at Microsoft I was working with the SQL Server team, and that transitioned us into what is now known as Hadoop and HDInsight, and yada yada. So that's my spiel. Now the cool stuff: if you don't already have it, please go ahead and download your own free copy (I apologize for sounding like a salesperson) of Learning Spark, second edition.

A

It's co-written by my co-authors Jules, Brooke, and TD, and myself, with a foreword by Matei himself, so pretty rocking. You can get the ebook for free: just go to this site, dbricks.co/get-ebook, fill out the little form (it's just a few questions), and bam, you're going to get all the cool stuff on Spark, Delta Lake, MLflow, and Koalas, for example. And in case you want to dive deeper into the technical issues around why we built Delta Lake

A

(this session will be primarily Spark, don't worry, but we are going to talk a little bit about Delta here as well), you'll want to read the VLDB paper. It's right on the delta.io website, so you can just grab it from there. Again, same context: I'm not going to bother diving much deeper than that. In case you're wondering who we are, we're Databricks. This is the one and only marketing slide,

A

so do not worry. We are the unified data analytics platform for accelerating innovation across data science, data engineering, and business analytics. Okay, so that's it for that spiel. Key thing: we're the original creators of projects like Apache Spark, Delta Lake, MLflow, and Koalas, hence what you saw with the book just previously. So that's why I would love for you to go ahead and download the book, read

A

it, provide feedback, the whole kit and caboodle. Now, this is going to be more of a breadth session. Mind you, some of you are going to complain to me that this is too deep, but it's more of a breadth session. If you want to dive really deep, there is the session from the Data + AI Summit, a deep dive into the new features of Apache Spark 3.0, with Xiao Li and Wenchen. Great session; I highly

A

advise you to go ahead and take a look at that. And then, if you want to dive deeper specifically on AQE, Adaptive Query Execution, we did a tech talk back in December with Maryann and Allison, rock stars to say the least in terms of being able to explain it, because they helped build AQE in Databricks. It's not just for Databricks, it's also in Spark, but we just wanted to call out what they've done. So, cool stuff.

A

Oh, sorry about that. That was a link to the session; apologies, I didn't mean to play it. And finally, another cool session that you might want to check out: the Delta Lake 0.7.0 + Spark 3.0 AMA.

A

Basically, it's myself, Burak, and TD. Actually, I'll rephrase that: it was Burak and TD, and a little bit of myself, and we covered questions on Delta Lake and Spark 3.0. Basically, we had you ask your questions and we tried our best to answer them, and I think we covered most of them, so it's a great session for you to check out. It's also on the Data + AI Online Meetup.

A

Okay, so with that out of the way, let's go ahead and talk about Apache Spark 3.0. Oh, by the way, he's just lower. No, I'm not; I'm just really, really antsy because I've had too much coffee. Nevertheless, Apache Spark 3.0: we saw something on the order of 3,400 JIRAs just in Spark 3.0

A

RC2. And for those of you who are asking for links to everything: once we're done (we're going to leave about five minutes for Q&A), I'll make sure to copy those links and put them into LinkedIn. So, pretty cool stuff.

A

We could try to talk about all of this, but honestly we're not going to, because we only have 45 minutes and it's going to be a little bit rushed even then. What we're really going to focus on are the areas around performance and the extensibility ecosystem. Like I said, there's a lot of good stuff in Apache Spark 3.0, including richer APIs, SQL compatibility, and monitoring and debuggability. So you know what, it's rocking.

A

Don't get me wrong, it's all awesome stuff, but we're going to need to really focus on just these two sections. Specifically on performance, we're going to talk about AQE (Adaptive Query Execution), dynamic partition pruning, query compilation speedup, and join hints. We're probably going to talk mostly about the first two, just because there's actually a lot of good stuff here

A

just on that. And Harry, you just asked about the Q&A links: like I said, I'll copy them once we're done with the primary session, and I'll make sure to paste them directly into Zoom for those that are in Zoom and into LinkedIn for those that are on LinkedIn. All right, let's start with the Spark Catalyst optimizer.

A

So for those of you who have played with Spark or worked with Spark quite a bit, the context is that in the Spark 1.x line, our Catalyst optimizer was basically rule-based.

A

The 2.x line was basically rule-based plus a cost-based optimizer, and now with 3.0 we've included the ability to perform a lot of the optimization at runtime. So what do we mean by this? Some of you may be familiar with this diagram.

A

What you see here is the logical and physical planning that the Catalyst optimizer of the Spark engine performs. Whether you're running a SQL statement, on the left here, or a Dataset or DataFrame, it first goes from an unresolved logical plan to a logical plan to an optimized logical plan, then to physical plans, where a cost model selects the physical plan, and then at the core execution level it creates the RDD DAGs that will execute against your data. Now,

A

the important context here is that those of you who have a database background are going to say: hey, this looks very similar to the logical and physical plans that are created in databases. And you're exactly right.

A

The idea within the Catalyst optimizer is to bring in some of those concepts directly from databases, in terms of, like I said, logical and physical planning, and apply them to Spark. Now, this stuff was already available as part of 2.3, if not technically even earlier than that (now that I'm thinking about it, it's more like 1.6), but obviously we improve things over time. Now,

A

the key thing when it comes to AQE, though, is that we can do adaptive planning from that logical planning stage all the way to the RDDs. So every time, we can look at the system statistics, re-optimize the execution plan, and improve the performance. That's the most important aspect here.

A

So, let's talk specifically about AQE. By the way, we're going to send out the slides I'm presenting here; they'll be available on meetup.com and on YouTube, and I'll also post them directly on my own LinkedIn, so you can download them and use them yourself. That'll include all the links that you're seeing here.

A

Okay, so the AQE fundamentals. We're going to talk about three things: dynamically switching join strategies, dynamically coalescing shuffle partitions, and dynamically optimizing skew joins. That last one I'm a huge fan of: if you've ever worked with databases, you often find that when you have skew in your data and you try to do a join, it becomes excessively inefficient. Well, Spark 3.0 has a pretty cool way of solving that particular problem, and if you're wondering how, the TL;DR is sub-partitioning beyond belief.
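
As a rough companion to the three AQE features just listed, the sketch below collects the Spark 3.0 SQL settings that, to the best of our knowledge, switch them on. The values only enable each feature and are not tuning advice, and `to_submit_args` is a hypothetical helper written for this note, not part of Spark. Join-strategy switching has no entry of its own here; as far as we know it follows from AQE being enabled together with the broadcast threshold discussed later in the talk.

```python
# Spark SQL settings that correspond to the AQE features named above.
# The keys are Spark 3.0 configuration properties; the values simply turn
# each feature on and are not tuning recommendations.
AQE_CONF = {
    "spark.sql.adaptive.enabled": "true",                     # master switch for AQE
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # coalesce shuffle partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",            # split skewed join partitions
}


def to_submit_args(conf):
    """Render a settings dict as `--conf key=value` arguments (hypothetical helper)."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Prints a flag string that could be appended to a spark-submit command line.
    print(" ".join(to_submit_args(AQE_CONF)))
```

The same keys can also be set on a running session; the dictionary form is used here only so the example stays self-contained.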

A

Okay, so let's start with broadcast hash joins. When you do your joins and shuffling, what you're looking at here on the left side is that you're trying to join one table against your fact table, which is spread across multiple nodes. Sorry, the orange are the nodes, and the green is basically a table

A

you want to join against; it's a smaller one, while the dark gray (or light gray, excuse me) is your fact table, distributed across multiple nodes. Now, when you're doing the joins and shuffling, what you end up having is the one join table, the smaller table, on one node, while your fact table is across four nodes, so you start to saturate the network getting the data from the join table across the network to each individual portion of the fact table. So now, let's flip it around.

A

If I do a broadcast hash join (you know, good old-fashioned spark.sql.autoBroadcastJoinThreshold), the join table is replicated across multiple nodes, and then I'm able to do the join intra-node as opposed to across nodes. That way I'm reducing the shuffling, especially from a network perspective. So that's pretty sweet. But then you're going to automatically ask me questions like: hey, well then, dude,

A

why not just always broadcast join? Why is that not the default? Well, there are lots of things, like missing or incomplete statistics, compressed files, whether it's in a column store, complex filtering, UDFs, and complex query fragments. Because of these things, we cannot automatically do it, even though, candidly, for a lot of my queries I just automatically do that.
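
To make the broadcast hash join described above concrete, here is a toy, single-process sketch in plain Python. It is not Spark's implementation: partitions are modelled as lists, "broadcasting" is modelled by letting every partition probe the same in-memory hash table, and the function name and row shapes are assumptions made for illustration.

```python
from collections import defaultdict


def broadcast_hash_join(fact_partitions, small_table):
    """Join each fact partition against a full local copy of the small table.

    fact_partitions: list of partitions, each a list of (key, fact_value) rows.
    small_table:     list of (key, dim_value) rows, small enough to copy everywhere.
    """
    # Build the hash table once; conceptually every node receives this copy.
    lookup = defaultdict(list)
    for key, dim_value in small_table:
        lookup[key].append(dim_value)

    joined = []
    for partition in fact_partitions:
        # Each partition probes its local copy, so no fact rows move between nodes.
        joined.append([
            (key, fact_value, dim_value)
            for key, fact_value in partition
            for dim_value in lookup.get(key, [])
        ])
    return joined


if __name__ == "__main__":
    facts = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]
    dims = [(1, "red"), (2, "blue")]
    print(broadcast_hash_join(facts, dims))
```

The point of the sketch is the shape of the work: the small table is copied once per node, and the large table never has to be shuffled or sorted.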

A

So the context when it comes to AQE is that one of the fundamentals is that you can dynamically switch your join strategies. For example, you start with a sort merge join, here on the left. You do your sort and your shuffle for stage 1, where one table has an estimated size of 100 megabytes, and then on the other side, for the other table, you do sort, shuffle, filter, and scan, and its estimated size is 30 megabytes.

A

Well, when you execute it and you actually run the logical and then the physical plans, what you'll recognize, in this particular example, is that the first table, instead of 100 megabytes, is actually 86.

A

The smaller join table is actually eight megabytes, so it's even smaller, i.e., potentially small enough that you can automatically broadcast it. At 30 megabytes, maybe the table was too big for us to broadcast, but at eight megabytes it is small enough. So the idea of AQE is that it's able to go and say: hey,

A

we noticed that our original estimates or original statistics were not quite right. Let's go ahead and flip the switch and recognize the fact that we can actually do a broadcast hash join. So that's what ends up happening. Instead of using the default of sort merge join, we can optimize it mid-flight and switch to a broadcast hash join, because the smaller table, the join table, is actually eight megabytes.
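
The re-planning step described above can be sketched as a small decision function. This is a simplified model, not Spark's planner: it assumes a 10 MB broadcast threshold (the value the talk uses for the sake of argument, which is also the usual default of `spark.sql.autoBroadcastJoinThreshold`), and `choose_join` and `plan_then_replan` are hypothetical names.

```python
MB = 1024 * 1024


def choose_join(smaller_side_bytes, broadcast_threshold_bytes=10 * MB):
    """Pick a join strategy from the size of the smaller join input."""
    if smaller_side_bytes <= broadcast_threshold_bytes:
        return "broadcast hash join"
    return "sort merge join"


def plan_then_replan(estimated_bytes, actual_bytes):
    """Return (initial plan from the estimate, plan once runtime sizes are known)."""
    return choose_join(estimated_bytes), choose_join(actual_bytes)


if __name__ == "__main__":
    # The talk's numbers: estimated 30 MB, actually 8 MB after the stage ran.
    print(plan_then_replan(30 * MB, 8 * MB))
```

With the talk's numbers, the initial plan is a sort merge join and the re-planned one is a broadcast hash join, which is the switch the speaker describes.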

A

So it's within the threshold in this particular example, and bam, you're good to go. As opposed to you needing to run a query to determine whether it can handle that and then flipping the switch yourself, AQE automatically does that for you. So yes, Manish, you're asking whether we're basically optimizing joins: that's exactly what we're doing. All right, so the second part is dynamically coalescing shuffle partitions. Now, here's always a fun one, and I'm going a little off script.

A

But, for example, if you look at the past, whenever you'd do a join, Spark would automatically go ahead and do something like 200 partitions, and people would go: wait, why is it always 200? And honestly, I've long since forgotten how we got there.

A

So, in fact, if you know, let me know, because I've forgotten how we did this. Nevertheless, when you look at your data, if you've got these two tables, which are represented by map 1 and map 2, and you're not coalescing your partitions, then because there are in essence five partitions to the data, there's a lot of shuffling back and forth between the two mappers, and when you run your reduce, you have five different reducers. Okay, cool.

A

So that's how it runs normally. How about if I coalesce the shuffle partitions? What do I mean by coalescing? You can see in this diagram how the (I guess it's) light yellow, blue, and light blue shuffle partitions are smaller. Since they're smaller, why don't you just coalesce those three partitions together? So instead of having five reducers,

A

why not go ahead and make it three? You've got two mappers, and now you're running three reducers. Because it's done that way, you're shuffling a lot less, so even if you do have network saturation, it's going to be a lot less, since instead of five you're down to three. Oh, and by the way, Manish, your question about whether optimizing joins actually affects the table size: it doesn't.
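
The coalescing of shuffle partitions described above (five reducers becoming three) can be illustrated with a greedy merge of adjacent partitions. This is a sketch under stated assumptions rather than Spark's algorithm: the sizes and the `target_size` parameter are made up, and `coalesce_partitions` is a hypothetical name.

```python
def coalesce_partitions(sizes, target_size):
    """Greedily merge adjacent shuffle partitions up to roughly target_size.

    sizes: partition sizes in shuffle order. Returns a list of groups, each a
    list of original partition indices that one reducer would read together.
    """
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        # Start a new group when adding this partition would overshoot the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    # Two large partitions followed by three small ones, as in the talk's diagram.
    print(coalesce_partitions([70, 60, 20, 15, 25], target_size=70))
```

For the sizes above, the two large partitions stay on their own and the three small ones are read by a single reducer, so five reduce tasks become three.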

A

What I was referring to in this section was that we estimated the size of the tables initially, and the estimate was 100 megabytes and 30 megabytes. But when you run within the Catalyst optimizer, from the logical to the physical plan that you had seen here, like this part here, when you go through that step you find out that, oh wait, the table size is actually not 100 megabytes; the actual size is 86, and the actual size of the join table itself is eight megabytes. And for the sake of argument,

A

our broadcast threshold is at 10 megabytes, so it's below the threshold. Since it's below the threshold, the table will automatically be replicated when we use a broadcast hash join. But instead of you needing to specify that, AQE with Spark 3.0 automatically does that for you. So that's pretty cool; that's the context. Hopefully that answers your question. So now, back to the coalescing of shuffle partitions.

A

Exactly as I said: we have all these partitions; can you coalesce them together so that you have fewer reducers? Cool, so that's what AQE does. Now, to put the two together, we have a really cool feature for dynamically optimizing skew joins. That's the cool stuff.

A

So this is an age-old data warehousing and database problem that we've seen forever. Oh, and by the way, to the person who asked whether you can pass hints to the optimizer: you absolutely can. Give me a couple more slides and we're going to go through it. Sorry about that.

A

So with the traditional problem, for example, I have the data skew in red. Think about old-fashioned data warehousing problems where that particular value is null, or that value is negative, or, if you're even older school like me, you flip it to negative one. So the idea is that all my values for this particular widget color are negative one, because I don't know what the widget color is. That means what's stored in the fact table is a bunch of negative ones or nulls.

A

Okay, and so that's what this table A part 0 is representing. Table A is the fact information, and there's a heck of a lot of negative ones or nulls inside it. That's your standard skew, and it's really, really common in data warehousing and, for that matter, even databases. So it's ugly, and what ends up happening is, when you try to run those queries normally,

A

you'll notice that when I try to shuffle, the oranges and yellows are more or less together in partition 1, then there's partition 2, then the cyan or green (I'm sorry, I'm actually partially color deficient, so I can't tell the colors, and I actually chose them, so my bad), but the idea is that with partition 3 you're good to go. But whoa, table A part 0, partition 0, table A partition

A

zero, yeah, we're really far off there. In other words, partitions 1, 2, and 3 will complete extremely quickly, while partition 0 is going to take a really, really long time. So how do we solve this particular problem? Oh, there you go. You solve it using the same concept as AQE: we try to do a sort merge join first, we execute, and we get what the reality is for the actual sort merge join. Remember, in the previous slides,

A

we were going from estimated size to true size. Same concept: look, on the left, here's the estimate; the middle one is the true size; but when we go to the right, there's an additional skew reader, so it actually figures out how to deal with skew specifically. Same idea as we've said before: on the left side we have your skewed table A. If you look in the middle, that's your table A partition 0, which is really big. Now,

A

what we do is we split that, so table A partition 0 now gets split into its own set of sub-partitions, which we're naming split 0, split 1, and split 2. In this particular example we're just breaking it into three of its own partitions. So table A part 0 is now broken up into three partitions; that's why we have the splits. Table B itself has its own partition 0.

A

We replicate that, and that is all happening underneath the covers. So now, instead of having four partitions where one of them is much larger than the other three, which means those three finish fast while the first one takes way too long to finish, what ends up happening is I've got six sets of partitions in total.

A

Yes, it's partition 0 with splits, but what I have is the three splits of partition 0 and the three other partitions, partitions 1, 2, and 3. Put those together and you now have six partitions that are all roughly the same size, and because of that they're all going to complete at roughly the same time, so it finishes really fast.
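
The skew handling just described, where one oversized partition of table A is split and the matching partition of table B is replicated, can be sketched as a task-planning function. This is a toy model, not Spark's rule: it flags a partition as skewed when it is more than `skew_factor` times the median size, always cuts it into a fixed number of splits, and ignores any absolute size threshold. The function name, the factor of 5, and the three splits are assumptions chosen to mirror the talk's example.

```python
def split_skewed(partition_sizes, skew_factor=5, splits=3):
    """Plan join tasks, splitting partitions much larger than the median.

    partition_sizes: {partition_id: size} for the skewed side (table A).
    Returns a list of (partition_id, split_index) tasks; split_index is None
    for partitions that are left whole.
    """
    ordered = sorted(partition_sizes.values())
    median = ordered[len(ordered) // 2]
    tasks = []
    for pid in sorted(partition_sizes):
        if partition_sizes[pid] > skew_factor * median:
            # Each split of A joins against a replicated copy of B's matching partition.
            tasks.extend((pid, s) for s in range(splits))
        else:
            tasks.append((pid, None))
    return tasks


if __name__ == "__main__":
    # Partition 0 holds the skewed key (for example -1 or null); the rest are small.
    print(split_skewed({0: 900, 1: 100, 2: 110, 3: 90}))
```

With these illustrative sizes, four partitions turn into six tasks of comparable size: three splits of partition 0 plus partitions 1, 2, and 3.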

A

In a nutshell, that's how the AQE fundamentals in Spark 3.0 handle skew joins. Now, the performance. This is an older graph, because when we ran this it was actually on DBR 7.0 beta, i.e., the early release. Just to provide some context, as of today we just released DBR 8.1.

A

So it's a later version of Spark with additional optimizations, but the key context is what you see in this graph: comparing AQE off, which is the blue color, with AQE on, which is the orange color, you're going to see significant performance improvements by turning AQE on. And so by default,

A

having AQE, Adaptive Query Execution, on fundamentally improves the performance of your queries. So that's why we're going: yeah, let's use Spark 3.0, because this is really rocking. Oh, by the way, I should have mentioned that these are just select queries from TPC-DS, so your standard TPC-DS workload. So that was just AQE, and that was a heck of a lot of stuff I just reviewed. If you want to dive deeper, there are links inside

A

these slides; like I said, I'm going to send the slide deck out so everybody can get access to all of those links. There's a session by Maryann and Allison on faster Spark SQL with AQE, and they also wrote a really good blog. Both links are inside these slides, so if you want to dive deeper, I highly recommend reviewing those. All right, now I'm going to change gears big time.

A

Oh, sorry, let me take a second to look at some questions here.

A

Okay, so there is a good question on Zoom about how you know the statistics, and whether it means you actually execute the query twice. Yes and no. You personally are not running the query twice: you're running a Spark SQL query or a DataFrame using Spark 3.0, and Spark 3.0 just handles that for you. So that's what's happening underneath the covers.

A

So, okay, we are in fact running the query multiple times, based on updates in statistics. It's basically a feedback loop: as you go from the logical planning to the physical plans to the execution of those physical plans, it basically goes back and forth.

A

Oh, sorry, guys, can you still hear me? It looks like some folks are telling me that my audio is muffled.

A

We hear you. Okay, just a little bit.

A

These are just my AirPods, so how about if I switch this way; does it sound better now? A little bit better? It's not too bad? I'll switch to this mode for now and see if that actually helps. All right, so anyway,

A

thanks, Harry, for calling that out. I apologize for any muffled sound you might have heard; it could just be my voice. Nevertheless, the context is that when you're running those queries, underneath the covers, yeah, we might run it multiple times to improve things. These are happening in parallel and extremely quickly.

A

So that's why you're not going to notice it, but what it boils down to is that that's the whole reason why we have a logical versus physical planner as it is. Hopefully that sort of answers your question. Frankly, we'd probably have to dive deeper; there's actually a really good blog post that Reynold Xin wrote about how Apache Spark is a compiler. So just Google or Bing it: Apache Spark as a compiler.

A

You can dive into the details of what I mean by how that logical and physical planning works and how it actually determines those plans. All right, we have a lot to cover, so let me talk a little bit, at least, about dynamic partition pruning. As this slide calls out, with dynamic partition pruning we can get faster performance for all your reports. So, looking again at your queries with or without DPP (again, DPP is dynamic partition pruning),

A

what ends up happening is that the green is with it off and the red is with it on, and you can tell from this that the query speedup is somewhere between 2x and 18x, so we're talking about significantly faster performance. That's pretty sweet, and again, part of the reason why you want to consider Spark 3.0 is dynamic partition pruning as well. So how does that work?

A

Well, let's talk about how this worked before the actual optimization. We're looking at a query which is: select table 1's ID and table 2's partition key, or pkey, from table 1, joined to table 2 on the pkey (I'm just going to say key, because pkey sounds weird), and you do a little filtering.

A

Well, what's happening underneath the covers is that we do a project on top of that join, then we do the join on the key, and then we do a scan on table 1, which is a large fact table with many, many partitions. Where we can optimize, we do the filtering, which is table 2's ID less than 2, and then we scan. The idea would be that instead of running the join first,

A

we filter the second table, make it smaller, and then do the join. That would run significantly faster, but before the optimization, that's not happening, just like I was implying before.

A

Instead, when we do the actual optimization, it's again project, then join between the two keys, but then we do a filter plus scan, a filter pushdown, to get table 2 to be significantly smaller. In the process, we also say: hey, which table 1 keys are in the result of select table 2 key from table 2? In other words, we're not just filtering table 2 but also filtering table 1, so table 1 is smaller, and only then do we do the scan.
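
The pruning idea described above can be sketched in plain Python: filter the small table first, collect the join keys that survive, and use them to decide which partitions of the large table are scanned at all. This is a conceptual model only, not Spark's implementation; the data layout (a dict keyed by partition key) and the function name are assumptions made for illustration.

```python
def dynamic_partition_prune(fact_partitions, dim_rows, dim_filter):
    """Scan only fact partitions whose key survives the dimension filter.

    fact_partitions: {partition_key: [fact rows]} for the large table (table 1).
    dim_rows:        list of (partition_key, dim_id) rows for table 2.
    dim_filter:      predicate over dim_id, e.g. lambda dim_id: dim_id < 2.
    """
    # Step 1: filter the small table and collect the join keys that remain.
    surviving_keys = {key for key, dim_id in dim_rows if dim_filter(dim_id)}
    # Step 2: reuse those keys as a partition filter on the large table.
    scanned = {k: rows for k, rows in fact_partitions.items() if k in surviving_keys}
    skipped = len(fact_partitions) - len(scanned)
    return scanned, skipped


if __name__ == "__main__":
    fact = {"p0": ["f1", "f2"], "p1": ["f3"], "p2": ["f4"], "p3": ["f5"]}
    dim = [("p0", 1), ("p1", 2), ("p2", 3), ("p3", 5)]
    # Mirrors the talk's filter of table 2's id being less than 2.
    print(dynamic_partition_prune(fact, dim, lambda dim_id: dim_id < 2))
```

In this made-up example only partition `p0` is scanned and the other three are skipped, which is where the scan-time savings come from.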

A

So we significantly reduce the amount of time we're scanning, because we've already pre-filtered the data in the first place. Okay, let's see. Once we do that, the basic mechanism here is that you're inserting, I'm sorry, not you,

A

Spark 3.0 is inserting a duplicated subquery from the other side; the table to prune is partitioned by that join key, just like we called out; and the join operation is one of these four things: inner, left semi, left outer, or right outer. So it covers many of the use cases that you really care about when it comes to joins. So, boom,

A

you should be good to go on that front. Now, because we can do that, we can push it down. I'm now going the other way, so I apologize for the way I wrote this particular slide, but the optimization is going down this way. Basically, we are doing the filter and the scan that's required for table 2, and for table 1 we're applying the partition key, which is in the DPP filter results. So now it's a filter plus scan. And so then, finally,

A

What results is a significantly faster response time, and by significantly faster I mean that in some of our cases we saw 33x performance improvements. So pretty sweet, to put it rather lightly. All right, so with this together, that's why we suggest: hey, if you're able to, by all means go ahead and use Spark 3.0, which has dynamic partition pruning. Okay, so there is a question from LinkedIn here that is asking: why is dynamic partition overwrite support removed from Delta Lake? Well, it's not so much removed.

A

We are still debating back and forth on when or how we want to put it back. But don't forget, DPP, dynamic partition pruning, in this case is actually about how you're working within the context of Spark itself, not so much how you're going to write the data to disk, right? The fact is, you still will need to put the full table on disk, because you want to keep the state of it, right?

A

Your one query wants to go ahead and say table two id is less than two, but subsequent queries may or may not have that, right? So that's the reason why you're going to want to keep all of that data. But the key context is that for those queries, you've now got basically 90% less file scanning, and that significantly improves performance. All right, so we've got only a little bit of time left.

A

So I'm going to pseudo-rush through this. Okay, so now join hints. All right, one slide; hopefully it covers everything for you. Okay, so these are in order of performance. Broadcast hash join is the fastest. Then you have your sort merge join, then your shuffle hash join, and then your shuffle nested loop join. Okay, well, can you put in hints? Absolutely. So the context is that you can go ahead and put in hints like this, like where you say select /*+ BROADCAST(a) */ id from a join, right?

A

You can put in hints like this to specify: I already know that the table sizes are small enough, so let me go ahead and use a broadcast hash join; or I already know x, y, and z, so I want to use a shuffle hash join or whatever else. But the context is that a broadcast hash join is awesome. We try our best to use this one if possible, all right? But it requires one side of the join to be really small. No shuffle, no sort, super fast. So that's awesome.

A

The reason we often default to sort merge join is because it's really, really robust; it can actually handle any data size. You do need a shuffle and you do need to sort the data, okay?

A

So if the data is small and you don't need a sort, then in fact you do want to use the broadcast hash join versus the sort merge join. But if you don't know what's going on, in terms of whether the data needs to be split, or whether you need a shuffle, or whatever else, then guess what: you can just use the default of sort merge join.

A

Now, you can potentially also use a shuffle hash join, all right? The context is that I'm saying I need a shuffle, but I don't actually need to sort the data, because, for sake of argument, the data that you have is already in the correct sort order, or you don't care about order at all. All right, but here's the reason why it's really cool.

A

This way you can handle really, really large tables, including large tables on both sides: not just a small dimension table and a large fact table, but two larger tables that you need to join. The key problem, of course, is that you will run into out-of-memory errors if your data is skewed, and that's usually a problem, by the way. So you have to be careful about that. All right, and then finally, a shuffle nested loop join, i.e.

A

It's a cartesian product. Okay, so you're not actually joining by key, you're just doing a cartesian product of those two tables. There are absolutely some scenarios where that makes sense. So, of course, you can explicitly say that you want to do a cartesian product of the tables, and then you can specify the shuffle nested loop join, okay? Because that'll help that run a little bit faster as well.
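
As a rough illustration of why the first two strategies trade off the way they do, here is a small sketch in plain Python. These are textbook versions of the algorithms, not Spark's code, and the `dims` and `facts` rows are invented for the example. The broadcast hash join needs one side small enough to hold in a hash table; the sort merge join needs both sides sorted but keeps nothing large in memory.

```python
# Toy versions of two join strategies; plain Python, not Spark's implementation.
from collections import defaultdict


def broadcast_hash_join(small, large, key):
    """Build a hash table from the small side, then stream the large side.
    No shuffle and no sort, but the small side must fit in memory."""
    table = defaultdict(list)
    for row in small:
        table[row[key]].append(row)
    return [(s, l) for l in large for s in table.get(l[key], [])]


def sort_merge_join(left, right, key):
    """Sort both sides by key, then walk them together with two cursors."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Pair the current left row with the whole run of equal keys on the right.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                out.append((left[i], right[j_end]))
                j_end += 1
            i += 1
    return out


# Hypothetical rows: a small dimension table and a larger fact table.
dims = [{"k": 1, "name": "a"}, {"k": 2, "name": "b"}]
facts = [{"k": 2, "v": 10}, {"k": 1, "v": 20}, {"k": 2, "v": 30}, {"k": 3, "v": 40}]

pairs_bhj = sorted((d["name"], f["v"]) for d, f in broadcast_hash_join(dims, facts, "k"))
pairs_smj = sorted((d["name"], f["v"]) for d, f in sort_merge_join(dims, facts, "k"))
```

Both strategies produce the same pairs; which one is cheaper depends on table sizes, which is what the hints let you tell the optimizer.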

A

Okay, so, like I said, for all the folks that are asking, whether in Zoom or LinkedIn: these slides, we're going to save them, post them on LinkedIn, post them on the meetup, and post them on YouTube, because they're actually on YouTube right now as well. So we're going to post them in all these locations, so you'll have all the links to all the stuff I have here.

A

You'll have it that way, because I just realized there are a lot of links here, so maybe I don't want to just paste them directly. Okay, but as an FYI, that's what we're going to do, and when this is done I'll make sure to keep this window open, so on this LinkedIn video I'll paste the slides in here as well. All right, okay.

A

So let me go on. Tons of other stuff here too, extensibility and ecosystem. There's a lot of cool stuff like data source v2 with catalog support, Java 11 support, Hadoop 3 support, Hive 3.x metastore, Hive 2.3 execution. I'm actually going to focus pretty much on the data source side of the house, okay? Because that actually allows me to segue into Delta Lake, which is why I want to cover data lake reliability. Okay, so the motivation for the data source API is that it allows compatibility that's dependent on the DataFrames API in the upper level, for when you cannot leverage the Dataset API for whatever reason.

A

Okay, so the context is that the physical storage information, partitioning and sorting, is not propagated from the data sources, and because of that, the Spark optimizer can't actually use it. All right, and so with the original data source API, the first one, the extensibility was not good and the operator pushdown capabilities were limited.

A

It lacked a columnar read interface for high performance, and the write interface did not support transactions, i.e. ACID transactions. So for those of you still here with the 15 minutes or so that we have left, you're going to say: hey, are you talking about ACID transactions? And the answer is yes, we're talking about ACID transactions: the idea that you can protect your data when it gets written to your cloud object store. All right, why do I call that out?

A

Well, because often when you're running your Spark job, especially when you're dealing with large amounts of data, a task ends up failing. Now, the good thing about Spark is that it can roll it back or you can retry. But how about if it actually wrote to disk and only did partial writes? All right, partial writes can be a bit of a pain, to say the least, right? That means you might have to go clean up the data yourself.

A

Well, if you actually had ACID transactions to protect the data, then you could avoid that problem, because either the partial writes never happened, or, even if the partial writes did happen, there's a transaction log that details which files you need to read. So even if there was a partially written file on the storage itself, we would never read it, because the transaction log says: hey, here's what I'm supposed to go read. All right.
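
The point about the log shielding readers from partial writes can be sketched in a few lines of plain Python. This is a toy model under simplifying assumptions: `storage`, `transaction_log`, and the `fail_after` parameter are hypothetical names, and a real transaction log records much more than file names.

```python
# Toy model: readers trust the transaction log, not a directory listing.
storage = {}          # file name -> rows physically present on "disk"
transaction_log = []  # each entry is the list of files added by one committed write


def write(files, fail_after=None):
    """Write files one by one; commit to the log only if every file landed."""
    written = []
    for n, (name, rows) in enumerate(files.items()):
        if fail_after is not None and n >= fail_after:
            return False              # task died: files so far stay on disk, uncommitted
        storage[name] = rows
        written.append(name)
    transaction_log.append(written)   # the commit is a single append
    return True


def read_table():
    """Read only the files the log knows about."""
    committed = [name for entry in transaction_log for name in entry]
    return [row for name in committed for row in storage[name]]


write({"part-0": [1, 2], "part-1": [3]})               # succeeds and commits
write({"part-2": [4], "part-3": [5]}, fail_after=1)    # fails after writing part-2
```

After the failed write, `part-2` is sitting in storage, but `read_table()` never touches it because it was never committed to the log.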

A

So that's what we're going to focus on, obviously. Okay, so with the catalog plugin API, users can register customized catalogs and start to access the table metadata directly. So in other words, your good old-fashioned create table, alter table. So good stuff, loving that. All right, cool. Now, by the way, we're actually on Delta Lake 0.8.

A

But the reason I'm calling out 0.7.0 is because it's the first release which supports Apache Spark 3.0, okay? And it adds support for those metastore-defined tables and SQL DDLs. All right, so as opposed to just these, now we can do all of these. Okay, so support for SQL delete, update, and merge, right?

A

It's not just for inserts anymore, right? And so that's what's really cool about this. In other words, for those of you who are much more into the database or data warehousing perspective of things, where you'd love to be able to write SQL statements for updates and deletes, guess what: now you can, and that's what's awesome here. All right, and so what's actually happening under the covers? Because remember, I said that I was worried about partial writes.

A

I'm worried about transactions, all these other things. So, for example, when I'm doing a delete or an update or a merge, what's really happening underneath the covers? So let's talk about it. Let's talk about delete in this case. Okay, I want to delete this little bit of yellow here. So that's what the v1 of the table is: I have a bunch of files that represent the table, and that's my version one. I need to delete five rows from this table.

A

Those five rows happen to be in one file, okay? So what I need to do is, quote unquote, delete the rows, which implies I'm trying to delete from the file. Except these systems, like Spark and, for that matter, any distributed system, tend to be additive, so you're not actually going to simply open a file and remove the data from it.

A

What we do is we simply say: okay, the v2 of the table is all of the files that we didn't delete, added in. Okay, so traditionally, in the past, what I had to do instead of actually running a delete statement was basically say: select all the data except for the data that I wanted to delete, and insert that into a new table. Okay, underneath the covers we're doing almost the exact same thing, except you don't have to do any of that.

A

You just simply write delete from events, and that's it, right? So that makes things significantly simpler. Now, we can definitely dive in much, much deeper, but I'm going to skip a lot of that stuff, honestly, because we have other sessions for that. But the context, basically, is that what's actually happening under the covers is that we'll recreate and copy new files for the file that contains the five rows.

A

For the rows we delete, we're going to recreate that one file, and only that file. So we filtered it down: as opposed to recreating the whole table, we just recreate that one file, except, you know, it has a new GUID or whatever, and it contains all the rows but the five rows. Okay, and so what's happening in the transaction log? First of all, let's just say there were four files that represented the table: files a, b, c, and d. That's in v1, all right?

A

Well, now let's just say file c is the one that I need to remove five rows from, so now I have a c1. So for v2 in the transaction log, I have files a, b, c1, and d. I didn't actually have to replicate all the data in file a or file b or file d. I only had to replicate part of file c, and that goes into c1. So inside the transaction log, that's what I have.
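
The a, b, c, d walkthrough above can be written out as a small copy-on-write sketch in plain Python. The row values are invented, and naming the rewritten file by appending "1" is only there to mirror the c to c1 example; as mentioned in the talk, real files get a new GUID.

```python
# Toy copy-on-write delete; file names follow the a, b, c, d example in the talk.
files = {
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9, 10, 11, 12, 13],
    "d": [14, 15],
}
versions = {1: ["a", "b", "c", "d"]}   # version -> files that make up the table


def delete_where(predicate, from_version, to_version):
    """Rewrite only the files that contain matching rows."""
    new_files = []
    for name in versions[from_version]:
        rows = files[name]
        if any(predicate(r) for r in rows):
            rewritten = name + "1"     # hypothetical naming, for illustration only
            files[rewritten] = [r for r in rows if not predicate(r)]
            new_files.append(rewritten)
        else:
            new_files.append(name)     # untouched files are reused as-is
    versions[to_version] = new_files


# Delete five rows, all of which live in file c.
delete_where(lambda r: 8 <= r <= 12, from_version=1, to_version=2)
```

Note that file c is still on storage and version 1 still points at it; only the version 2 entry refers to c1.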

A

I actually have a, b, c1, d, so I'm not truly replicating all the data. Okay, so that does speed things up significantly. All right, what happens when I actually want to go back in time? For sake of argument, I've inserted data, I've updated data, but now I realize: no, no, I screwed up. For sake of argument, in v3 I did an insert, because I added the brighter yellow here, and in v4 I realized, hey, I'm going to do an update.

A

Okay, that's cool, right?

A

What ends up happening is that if I want to do time travel or data versioning, I can just do select * from events version as of 2, and I can go back to v2 and just get the files a, b, c1, and d from it, even though by the time of v4 it's probably like a, b, c, c1, d1, e, f, g, whatever else, right? That's the context. All right, now somebody asked a question, which I'm sure is a good question, here.
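
One way to picture time travel is as a replay of the log up to the version you asked for. The sketch below is plain Python with a made-up log layout (`add` and `remove` lists per commit); the contents of versions 3 and 4 are hypothetical stand-ins for the insert and update mentioned above.

```python
# Toy transaction log: each commit records the files it added and removed.
log = [
    {"version": 1, "add": ["a", "b", "c", "d"], "remove": []},
    {"version": 2, "add": ["c1"], "remove": ["c"]},   # the delete
    {"version": 3, "add": ["e"], "remove": []},       # an insert
    {"version": 4, "add": ["d1"], "remove": ["d"]},   # an update
]


def files_as_of(version):
    """Replay commits up to `version` to get that snapshot's file list."""
    live = []
    for commit in log:
        if commit["version"] > version:
            break
        live = [f for f in live if f not in commit["remove"]]
        live.extend(commit["add"])
    return sorted(live)
```

Calling `files_as_of(2)` gives the v2 snapshot even though later commits exist, because the older files were never overwritten in place.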

A

Did I have to scan all the files? Yes and no. The reason I'm saying yes and no is because the transaction log itself already tells me what files; it literally lists the files. So, for example, I don't actually have to do a file listing; I already know exactly which files to go ahead and scan. And because there are statistics also inside the Delta transaction log, depending on, for example, if I use delete, let's see if I have the delete statement here.

A

Okay, if I do a delete from statement where, like, id equals two, and the table is actually partitioned by id, right, then what ends up happening is that I have statistics that'll tell me which partitions or which files I need to read, so I don't need to scan all the files. All right, so that's why I'm saying yes and no, right?
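
As a minimal sketch of the "yes and no" answer, assume the log keeps a minimum and maximum id per file. The field names and values below are hypothetical; the point is only that a predicate like id equals two can rule files out without opening them.

```python
# Toy data skipping: per-file min/max statistics kept alongside the file list.
file_stats = [
    {"file": "a", "min_id": 0, "max_id": 1},
    {"file": "b", "min_id": 2, "max_id": 2},
    {"file": "c", "min_id": 3, "max_id": 9},
    {"file": "d", "min_id": 1, "max_id": 4},
]


def files_to_scan(target_id):
    """Keep only files whose [min_id, max_id] range could contain target_id."""
    return [s["file"] for s in file_stats if s["min_id"] <= target_id <= s["max_id"]]


def files_to_scan_without_stats():
    """With no usable statistics, every file is a candidate."""
    return [s["file"] for s in file_stats]
```

Here `files_to_scan(2)` keeps two of the four files; file d still has to be read because its range overlaps, even though it may not contain the id.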

A

So if I'm using a broader query, or there is no index for me to reference, or there are no additional statistics or metadata for me to hit, you're right, I potentially will need to scan all the files. But if there is metadata or partitioning or statistics to reference, then in fact I don't necessarily need to scan all the data under the covers. Okay, being able to do all this comes with an important callout.

A

That means I've got a history of data, right? Because even if I'm able to be efficient, so instead of replicating all the data I just go from c to c1, over time there's a lot of data. And not just data: over time there's a large transaction log too. So these are two very important history retention commands, all right? So basically you can alter them.

A

So that way, when you execute a vacuum statement, it will clean up all the old data. In other words, if you're on version 20 of your table right now and version one is from, let's just say, two months ago, well, yeah, maybe just run a vacuum and clear out any data that's older than x amount of time. By default it's seven days, but you can change that time as well.
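
A rough sketch of the retention idea, in plain Python: given when each old file stopped being part of the table, a vacuum with a seven-day window would only consider files removed before the cutoff. The file names and dates are made up. The two settings on the slide are presumably the Delta table properties `delta.logRetentionDuration` and `delta.deletedFileRetentionDuration`; that mapping is my assumption, since the talk doesn't name them.

```python
from datetime import datetime, timedelta

# Hypothetical history: files no longer referenced by the current version,
# with the time each one was logically removed from the table.
removed_files = {
    "c": datetime(2021, 1, 20),
    "d": datetime(2021, 3, 10),
    "e": datetime(2021, 3, 21),
}


def vacuum_candidates(now, retention=timedelta(days=7)):
    """Return removed files older than the retention window."""
    cutoff = now - retention
    return sorted(name for name, removed_at in removed_files.items() if removed_at < cutoff)


now = datetime(2021, 3, 23)
```

A longer window keeps more history available for time travel at the cost of storage, which is the trade-off behind making the duration configurable.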

A

Okay, that's the context of these particular commands: one's for the transaction log, the log retention duration, and one's for the actual data itself. Okay, all right, let's see. All right, well, and then, oh right, because we enabled data source v2 and catalog API integration, right?

A

These were important so that when you're working with Spark and Delta Lake, you're actually able to run all these commands. All right, in order for Delta Lake to work with Spark 3.0 correctly, we need to reference the catalog APIs in data source v2. So enable these configurations, enable these extensions, and then you are good to go. All right, let's see, and then one little tidbit: the data quality framework is just in terms of improving SQL DDLs and DMLs and ACID transactions, just to start. Basically, the last little tidbit:
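
The slide itself isn't in the transcript, but the two settings that the Delta Lake documentation lists for Spark 3.0 are the SQL extension and the catalog implementation. The sketch below just holds them in a Python dict and formats them as `spark-submit` arguments; the helper function is hypothetical, and I'm assuming these are the configurations being referred to.

```python
# Settings documented for Delta Lake 0.7.0+ on Spark 3.0 (assumed to match the slide).
DELTA_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def spark_submit_args(conf):
    """Turn a config dict into `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args
```

The same key and value pairs can be set on a session builder instead of the command line; the dict form is just a convenient way to keep them in one place.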

A

It sounds like I'm doing a little bit of sales, I think, but I really am not. It's about, more or less, the lakehouse paradigm. The whole premise is that you want to have improved performance.

A

You want data warehouse-like capabilities, but on your low-cost cloud object stores, right? So with things like adaptive query execution, dynamic partition pruning, query compilation speedups, join hints, and so forth in Spark 3.0, you significantly improve your ability to do ETL, machine learning, and query performance using Spark 3.0, right? That's why the community, we're constantly adding all these different things. And then, in the case of Delta Lake, we add that in order to ensure ACID transactions on your data.

A

In other words, if there are errors, if there are issues, if I want to run multiple streams directly to the exact same table, no problem, because Delta Lake actually allows me to do that without any problem. All right, so it's the combination of those two things together that really allows me that capability, and that's more the context we're talking about here.

A

Okay, so if you want to know more about Delta, obviously go to delta.io. And then the final set of links: try out Spark 3.0 and Delta Lake. Now, you can try out Spark 3.0 right from the Databricks Community Edition, that's free, so databricks.com.

A

Obviously, if you want to download it, just go to spark.apache.org to go grab that. We have a lot of cool notebooks on how all the stuff works, so try out the notebooks; they're available on GitHub under databricks tech-talks. And again, we're going to give you these slides, so you can reference all this stuff.

A

This session here is actually part of the Seattle Spark and AI Meetup, not actually part of the Data and AI Online Meetup, but we're using some of its references, because Karen runs the Data and AI Online Meetup. I'm just one of those guys who happens to be a fan of the Seattle one, so that's why I'm here. So there you go, go ahead and check out all those cool sessions there. You can also get the ebook; again, we put the link inside there, and the paper, the link's inside here.

A

So I'm going to leave it at that. We've got about three minutes left for questions, so I will try my best to answer the questions. But the first thing I'm going to do is look for the slides real quick, and I'm going to paste them into both the LinkedIn and the Zoom channels, so that way you guys can have the slides right away. I believe I put up a version, maybe a slightly earlier version, of the slides, but I can update those. So give me one.

A

Second, as I do that. All right.

A

Meanwhile, if you do have questions that are not slide related, by all means, please go ahead and let me know and ping me about it here. Perfect. I think, yes, I do. All right, so, to those folks: I'm about to send the slides here to Zoom, and now I'm going to send them here to LinkedIn.

A

I'm going to actually update them, because I did some updates, but it'll be the exact same link that I just sent you, so you can go ahead and just grab the slides. So, like I said, since I did do a small update, I'll actually add a couple of links here. All right, so let's see, let me look at the remaining questions from Zoom. I've got a question on how transaction logs look and on recovery techniques. So that's a very um point.

A

There are a lot of potential answers to this particular question, so the long story short is this: the transaction log contains every single file that you possibly need to read for your table. Okay, so because it does, and it has multiple versions, if there are errors, like oops, you can always roll back, right? You basically run a rollback statement, or just a select from a version and an insert overwrite, and boom, you've brought your table back. Okay.
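
Conceptually, rolling back is just one more commit whose contents match an earlier version. The plain-Python sketch below simplifies heavily: it copies the old file list into a new version, whereas a real select-and-insert-overwrite would rewrite the data. The `history` layout and the `restore` helper are hypothetical.

```python
# Toy rollback: restoring is just another commit that points at the old contents.
history = {
    1: ["a", "b", "c", "d"],
    2: ["a", "b", "c1", "d"],
    3: ["a", "b", "c1", "d", "e"],   # suppose this commit was the "oops"
}


def restore(to_version):
    """Commit a new version whose file list copies an earlier version."""
    new_version = max(history) + 1
    history[new_version] = list(history[to_version])
    return new_version


restored = restore(2)
```

The bad version stays in the history rather than being erased, so the rollback itself can be audited or undone later.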

A

Now, in terms of recovery techniques, you can always either roll back the table itself or, like I said, overwrite the table by doing the select from version as of and then insert overwrite, or you can also potentially clone your tables. So, for sake of argument, you can clone the tables and put them into colder or cold storage or warm storage, so that way you can actually have a different copy of the table to work with. And for that matter,

A

One interesting aspect that we've recently been talking about is the ability to, in essence, treat your data like an artifact. You know how with GitHub you're treating your code like an artifact? Well, the idea is to not just treat your code as an artifact, but treat your data as an artifact. So what do you do in GitHub? You'll often fork or branch the code so you can make modifications to it. Well, same idea with Delta Lake and Spark 3.0.

A

Those two together: Spark 3.0 obviously gives you the performance aspect of things, but what you can do with Delta is clone the table, i.e. similar to branching or forking the code, and do your modifications directly against that table first, while you do not mess with the production table. And then, once you've determined that this is exactly what you want to do, you can go ahead and actually merge the changes back together.

A

If you want to. And what's great about it is that it's not just merging the code changes back together; you potentially can even merge the data changes together, because we keep versions of the table intact. Okay, so hopefully that answers your question, and I'm going to answer one more question from LinkedIn: is Delta Lake an alternative to Kafka? No, it's absolutely not. Delta Lake is about file storage. Kafka is an awesome system. Love it, would never insult Kafka. It's there for the purpose of your streaming environments.

A

Big time, right? At some point you're going to need to land the data. Okay, that's it. So when you have to land that data, and you want to keep session state of that data, that's what Delta Lake stores. In other words, Delta Lake underneath, and by the way, I realize I probably should have covered this earlier, is like any of your Parquet tables. That's it. Literally, that's all it really is: it's just your Parquet table.

A

It still uses Apache Parquet, whatever codec you want to use, Snappy codec for compression or whatever else. Same context, same everything, so we're good to go on that front. What's different is that we have a transaction log with those Parquet files, so that way you can track each and every file that lands in there, right? So that's why we're saying: no, no, it's definitely not competition against Kafka. I mean, I think.

A

The only way it can be competition against Kafka is if you decide that you really want to hold three months' worth of data in Kafka. Probably not the best idea. Okay, I mean, if it's small, you can probably do it, but generally you probably don't want to do that, right? Kafka is well designed for what it's for; in this case, we're talking about storage. All right, so I apologize for not answering all of your questions, but that's it for today's session.

A

It's three minutes to the hour, so I did want to wrap things up. So again, thank you very much. Karen already sent a link to the YouTube video, so that will allow you to see this recorded as well. Obviously, this LinkedIn link that you're using right now will work perfectly fine. For those that are on Zoom, like I said, the link will be sent there. We will have another Seattle Spark and AI Meetup that follows up on this session, except it's specifically demo heavy.

A

So, in other words, that's what I want to leave you with. I've provided a lot of context on how to conceptually run the stuff. The next Seattle Spark and AI Meetup is going to be demo heavy, about how all this stuff runs. And yes, if you're wondering, it will include notebooks that you can download. Okay, so again, thank you very much. I appreciate you taking the time. Please ping me on LinkedIn or on YouTube if you have any other questions. That's about it, and thanks very much, everybody.