Delta Lake Delta Lake Tech Talks, 27 Oct 2020

From YouTube: Tech Talk: Top Tuning Tips for Spark 3.0 and Delta Lake on Databricks

Description

Apache Spark™️ has become the de-facto open-source standard for big data processing due to its ease of use and performance. And the open-source Delta Lake project enhances Spark’s lead with new capabilities like ACID transactions, Schema Enforcement and Time Travel. These features help ensure that data lakes and data pipelines can deliver high-quality, reliable data to downstream data teams for successful data analytics and machine learning projects.

In this tech talk, we will discuss the top tuning tips for Apache Spark 3.0 and Delta Lake on Databricks. Come prepared to ask your questions and join Joe Widen, Chris Hoshino-Fish, and Denny Lee to discuss when to use which join operations, how to pick your machine sizes, how to help speed up your merge operations, and how to make your jobs easier!

Link to slides and the notebooks used in this tutorial: https://github.com/databricks/tech-talks

Chapters
0:00 Welcome
02:52 Use the latest version of DBR
04:53 Picking the best join strategy
13:39 Use Apache Spark 3.0 and AQE
26:27 Partition Pruning
28:36 Data Skipping
31:24 Z-Ordering
39:34 Databricks Delta Lake and Stats
44:39 Optimizing Merges
47:24 Picking good instance types

Speakers:

Chris Hoshino-Fish is a Solutions Architect at Databricks. Chris is an active member of the Performance Subject Matter Expert group and a former Principal Consultant focused on Data Engineering, working with several Fortune 500 Databricks customers. Prior to Databricks, Chris worked for an adtech company as a data engineer managing pipelines using Apache Spark for 3.5 years. Chris has a B.A. in Computational Mathematics from University of California, Santa Cruz.

Denny Lee is a developer advocate at Databricks, where he works on Delta Lake, Apache Spark, Data Sciences, and Healthcare Life Sciences. He has previously built enterprise DW/BI and big data systems at Microsoft including Azure Cosmos DB, Project Isotope (HDInsight), and SQL Server as well as the Senior Director of Data Sciences Engineering at SAP Concur. Denny holds a Masters in Biomedical Informatics from Oregon Health Sciences University.

Joe Widen is a Solutions Architect at Databricks. Joe leads the Performance and Delta SME horizontal initiatives along with making customers successful with the Databricks Unified Analytics Platform. Joe has been working with Spark and more generally Hadoop for 5 years, with previous stops at Hortonworks and Capital One.

To join the zoom live chat:
https://www.meetup.com/data-ai-online/events/274093223/

A

Perfect, Karen, thanks very much. Really excited to have this session. If you don't know which session you're on, this is the top tuning tips with Apache Spark and Delta Lake on Databricks. My name is Denny Lee. I'm a developer advocate and long-time data geek, and you're probably sick of me at this point. So let me switch right over to Chris Fish and Joe Widen. Chris, why don't you start, and then Joe, you'll be right after that.

B

Hi everyone, my name is Chris Fish. I'm a solutions architect here at Databricks, and I've been a Spark data engineer for about six years.

C

And my name is Joe Widen. I'm a solutions architect as well, with Chris here at Databricks, and I've been a data engineer with Spark and Hadoop for the last six or seven years now.

C

So with that I'll go ahead and start sharing.

C

And hopefully you can see my tuning tips screen. Is that right?

A

Looking good.

C

All right, so we'll go ahead and get started with our tuning tips here. These are the top five tuning tips that I generally give to all of the companies that I work with. When I'm working with them, I generally go through a progression of different things to make their jobs run faster when they're using Databricks, Delta Lake, and Spark 3.0, so we'll go ahead and share those with everybody.

C

So just a quick reminder of who everyone is, but we've already done intros. So, let's hop into the top five tuning tips, so we'll walk through all these in some pretty good depth.

C

The expectation here is that, at the very least, you know just a little bit about Databricks, and at the very most you know Spark really well and you're here to learn more about Delta Lake and Spark 3.0. So we'll level set a little bit on how the functionality works, and then we'll dive a little deeper into some of the nuances that you should be thinking of.

C

So our list here is: using the latest version of DBR; picking the best join for your workload; using Spark 3.0 with adaptive query execution; and then we'll talk a little bit about Delta Lake, how it works, how to think about it, and how to optimize it. At the very end, if we do have time, we will try to tell you how to pick the right machines, and we'll try to be very prescriptive on all this. I know in the past a lot of people have said that working with Spark is really difficult, but when you know what to do, it's actually really easy.

C

At the end of this, I hope you feel empowered to basically take what you learn, apply it almost verbatim, and see your workloads run quicker.

C

So the first one, and this is what I tell all of my customers and all the companies I work with, is always use the latest version of the Databricks Runtime, or DBR. Every release of the Databricks Runtime has performance improvements.

C

So when you went from Databricks Runtime 6.6 to 7.0, you actually got an upgraded version of Spark, from 2.4 to 3.0. You got some features in Delta Lake, like Auto Loader and COPY INTO, and you got some performance improvements like dynamic subquery reuse. The next one: if you went to 7.1, there are a bunch of performance improvements in MERGE INTO, in addition to inserts, updates, and deletes. It's the same version of Spark; if I remember right, it might be 3.0 to 3.0.1, and all you had to do was upgrade to get those performance improvements.

C

From 7.1 to 7.2, we upgraded the cloud connectors. So if you were working on the Amazon cloud, you'd probably see a bunch of S3 requests that were optimized. On the Azure platform, we're always improving the performance of the ADLS Gen2 and WASB connectors, so that's there.

C

And then in DBR 7.3, you may have noticed your Delta queries execute a lot faster to begin with, so there were some metadata overhead times that were drastically reduced just by upgrading. And again, if I remember right, the difference between 7.0 and 7.3 is Spark 3.0 to Spark 3.0.1, so there is very minimal impact to the upgrade, and you get all these benefits by upgrading.

C

Databricks has tons of regression tests built in, and we don't actually release a new version of the Databricks Runtime if there are any regressions, so you can almost be assured that anything that ran in 7.0 will be the same speed or faster in 7.1.

C

All right, the next one is picking the best join strategy, so one of the complexities of spark is there's a lot of ways to join different data and they're not explained very well anywhere on the internet. So hopefully this will kind of demystify how to pick your join strategies.

C

So on the left here we see the different types of joins that are commonly used in Spark, and from top to bottom they decrease in performance. The broadcast hash join is by far the fastest join in Spark. Then we have the shuffle hash join, then the sort-merge join, and then the dreaded Cartesian join. So again, the broadcast join's the fastest.

C

The cartesian is by far the slowest. It should be on a different slide. It's so much slower, but you kind of get the picture. So the general strategy for using your joins is broadcast as much as you can.

C

When you can't broadcast, go ahead and do a shuffle hash join. When you can't do a shuffle hash join, do the sort-merge join. When you have to do a Cartesian join, you should probably rethink what you're doing. There might be a better way to work with that data.

C

So the obvious question is: okay, why can't I just broadcast everything? A broadcast hash join has a couple of technical requirements.

C

Well, the main requirement is that one side of the join has to fit completely into memory, and that's not the size of the data on disk; that's the size of the data deserialized and put into memory. As a rule of thumb, the default is 10 megabytes on disk, which is really small. I'll show you how to up that in a second. The biggest you can do is two gigs.

C

If I remember right, if you do more than two gigabytes you'll get an exception. Spark has an optimizer that's pretty smart at doing this, but it never hurts to give a hint. You can see on the left we have an example of using the broadcast hint in a Spark SQL join here.
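To make the mechanism concrete, here is a minimal sketch in plain Python (not Spark code) of what a broadcast hash join does: the small side is turned into an in-memory hash map, conceptually copied to every executor, and the large side is streamed past it without a shuffle. The orders and countries rows are hypothetical. In Spark SQL, the broadcast hint itself is written as a comment right after SELECT, for example SELECT /*+ BROADCAST(c) */ with c being the alias of the small table.

```python
from collections import defaultdict


def broadcast_hash_join(large_rows, small_rows, key):
    """Inner join: hash the small side once, then stream the large side past it."""
    lookup = defaultdict(list)
    for row in small_rows:          # the "broadcast" side must fit in memory
        lookup[row[key]].append(row)

    joined = []
    for row in large_rows:          # the large side is never shuffled or sorted
        for match in lookup.get(row[key], []):
            joined.append({**row, **match})
    return joined


# Hypothetical data: a larger fact table and a small lookup table.
orders = [
    {"country_id": 1, "amount": 10},
    {"country_id": 2, "amount": 25},
    {"country_id": 1, "amount": 7},
    {"country_id": 9, "amount": 3},   # no matching country, dropped by the inner join
]
countries = [{"country_id": 1, "name": "US"}, {"country_id": 2, "name": "CA"}]

result = broadcast_hash_join(orders, countries, "country_id")
```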

C

The next join is the shuffle hash join. The shuffle hash join is when you do a full shuffle of both sides of the data. It's not a broadcast join, but it is a little bit faster than the sort-merge join. What a shuffle hash join requires is that any given partition of the right side needs to fit in memory. That doesn't mean the entire table needs to fit in memory; it just means any partition at any time can fit in memory.

C

So there's a heuristic in Spark. It's kind of primitive, but basically, if the left-side table estimate is three times bigger than the right-side table estimate, then it will invoke this. But we can also give it a hint, and I'll show you an example of this in a second.

C

The biggest difference between the shuffle hash join and the sort-merge join is that the shuffle hash join does a basic hash partition and then creates a hash map for the right table. So when you go ahead and join the two tables together, it's basically just a lookup in the hash map, so it is generally faster than the sort-merge join.
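The two steps described here, hash partitioning both sides and then building a hash map per partition of the right table, can be sketched in plain Python as follows. This is an illustration of the idea only, not Spark's implementation; the clicks and users rows and the partition count are hypothetical.

```python
def hash_partition(rows, key, num_partitions):
    """Shuffle step: route each row to a partition by the hash of its join key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def shuffle_hash_join(left_rows, right_rows, key, num_partitions=4):
    """Inner join: shuffle both sides, then hash-join partition by partition."""
    left_parts = hash_partition(left_rows, key, num_partitions)
    right_parts = hash_partition(right_rows, key, num_partitions)

    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        # Only this one partition of the right side has to fit in memory.
        table = {}
        for row in right_part:
            table.setdefault(row[key], []).append(row)
        for row in left_part:
            for match in table.get(row[key], ()):
                joined.append({**row, **match})
    return joined


# Hypothetical data.
clicks = [{"user_id": u, "page": p} for u, p in [(1, "a"), (2, "b"), (1, "c"), (5, "d")]]
users = [{"user_id": 1, "plan": "free"}, {"user_id": 2, "plan": "pro"}]

joined_rows = shuffle_hash_join(clicks, users, "user_id")
```

Matching keys always land in the same partition number on both sides, which is why each partition can be joined independently.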

C

The sort-merge join always works. So if reliability and robustness are super important and performance not so much, use the sort-merge join. The sort-merge join basically sorts both tables that are being joined. It then goes ahead and does a merge, and that's how it does the join. The most expensive part of the sort-merge join is the sort, if you're working with really large amounts of data.

C

What ends up happening is you start to spill to disk during the sort, and that can really, really increase the run time of that job. So the sort-merge join can handle any data size and it always works, but when you have the ability to, you should just use a shuffle hash join. A Cartesian join is a Cartesian join everywhere, and it's slow everywhere. It requires you to basically compare every data point to every data point. Yeah, just try not to use it. Not a good idea.
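For comparison with the hash-based joins, here is a minimal plain-Python sketch of the sort-then-merge idea. It is single-machine and in-memory, so it shows the algorithm but not the shuffle or the spill to disk that make the sort expensive in practice. The sample rows are hypothetical.

```python
def sort_merge_join(left_rows, right_rows, key):
    """Inner join: sort both sides on the key, then merge them with two cursors."""
    left = sorted(left_rows, key=lambda r: r[key])     # the expensive step on big data
    right = sorted(right_rows, key=lambda r: r[key])

    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][key], right[j][key]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == left_key:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][key] == left_key:
                for right_row in right[j:j_end]:
                    joined.append({**left[i], **right_row})
                i += 1
            j = j_end
    return joined


# Hypothetical data with duplicate keys on both sides.
left_table = [{"id": 3, "l": "c"}, {"id": 1, "l": "a"}, {"id": 2, "l": "b"}, {"id": 1, "l": "a2"}]
right_table = [{"id": 1, "r": "x"}, {"id": 1, "r": "y"}, {"id": 4, "r": "z"}, {"id": 2, "r": "w"}]

merged = sort_merge_join(left_table, right_table, "id")
```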

A

Hey, before you switch the slides, sorry about that, Joe. There was a good question from the Q&A. I know, Fish, you had already answered it, but I think it's useful for everybody, so Fish, you might want to answer this question. The question is: is eight gigabytes still the maximum data size to broadcast within Spark? Fish, why don't you chime in.

B

Yeah, the response I gave: right now the default for broadcast joins, I think, is still pretty low. It's only 10 megabytes, I believe. Generally, we recommend that it's safe to increase that threshold to 100 megabytes or one gigabyte, but past that you may want to benchmark it in some way, because the network traffic will start to get higher and higher and become more of a bottleneck.

B

If you're broadcasting eight gigabytes of data, you have to send that out to every executor in the cluster, and so it's dependent on how many nodes you have in the cluster and how much data you're actually broadcasting.
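A back-of-the-envelope version of that point, as a small Python sketch: every executor needs its own copy of the broadcast table, so the total data moved grows with both the table size and the executor count. The 20-executor cluster is a hypothetical number, and the calculation ignores compression and how the copies are distributed.

```python
def broadcast_traffic_mb(table_mb, num_executors):
    """Approximate data received across the cluster when one table is broadcast."""
    return table_mb * num_executors


EXECUTORS = 20  # hypothetical cluster size

modest = broadcast_traffic_mb(100, EXECUTORS)        # a 100 MB lookup table
heavy = broadcast_traffic_mb(8 * 1024, EXECUTORS)    # an 8 GB table

print(f"100 MB table: about {modest} MB moved")
print(f"8 GB table: about {heavy / 1024:.0f} GB moved")
```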

A

Perfect, thanks a lot, Fish. Joe, right back to you. Thanks very much for letting me interrupt here.

C

No problem. Thanks, Chris, for answering that. All right, so the next part here, actually, I'll hop back here. You can give hints for all of these, and hints are great, but that requires you to know a decent bit about your data, so we can make this even easier.
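As a reference for the hints mentioned here, the sketch below maps each of the four join strategies to the Spark 3.0 SQL join hint name and builds a query string with it. The facts and dims tables, their columns, and the helper function are hypothetical; only the hint names come from Spark SQL.

```python
# Spark 3.0 SQL join hint names, keyed by the strategy they request.
JOIN_HINTS = {
    "broadcast_hash": "BROADCAST",
    "shuffle_hash": "SHUFFLE_HASH",
    "sort_merge": "MERGE",
    "cartesian": "SHUFFLE_REPLICATE_NL",
}


def hinted_join_sql(strategy, hinted_alias):
    """Build a join query (hypothetical tables) that asks for one join strategy."""
    hint = JOIN_HINTS[strategy]
    return (
        f"SELECT /*+ {hint}({hinted_alias}) */ f.*, d.name "
        "FROM facts f JOIN dims d ON f.dim_id = d.id"
    )


broadcast_query = hinted_join_sql("broadcast_hash", "d")
shuffle_hash_query = hinted_join_sql("shuffle_hash", "d")
```

A hint is a request rather than a guarantee, so it is still worth checking the physical plan to confirm which join was actually used.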

C

So there are a couple of things that I generally always do with a customer's workload. The first thing, as Chris was mentioning, is I up the auto broadcast join threshold, so you can see here I set it to 100 megabytes.

C

What that does is basically say that if the table is less than 100 megabytes on disk, it can be broadcast.

C

So this will increase the performance of your small-table-to-big-table joins without having to go ahead and put a hint in. This one's pretty safe to do. As Chris mentioned, when you get to two, four, or eight gigs, it's not always better to do it that way, and I believe Spark will throw an exception if you try to broadcast more than eight gigs in memory. Eight or two, I forget the exact number. So that one's a good one to always set on all of your clusters.

C

The next one is spark.sql.join.preferSortMergeJoin set to false. So in Spark 1.6 the default join was the shuffle hash join. In Spark 2 it switched to the sort-merge join. In 2.1, I think, it switched back to the shuffle hash, and then in 2.3 it went back to the sort-merge, or something like that. In general, the shuffle hash join is faster than the sort-merge join, and the Spark Catalyst optimizer is smart enough to figure out, based on some projections and some multipliers, how big each table is going to be when it does the join.

C

So if the Catalyst optimizer thinks that at runtime it will be small enough, then by setting preferSortMergeJoin to false it'll actually do the shuffle hash join. And again, what that does is get rid of all the spill to disk during the sort, which is naturally a very skewed, compute-heavy, and slow operation.
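The two settings described above can be written down as a small configuration sketch. The config keys are the standard Spark SQL names; the helper function is hypothetical and assumes it is handed an existing SparkSession (for example the spark object in a notebook), which exposes conf.set(key, value).

```python
MB = 1024 * 1024

# The two join-related settings from the talk.
JOIN_TUNING_CONF = {
    # Tables estimated below this many bytes may be broadcast automatically.
    "spark.sql.autoBroadcastJoinThreshold": str(100 * MB),
    # Let the planner pick a shuffle hash join when it estimates that is feasible.
    "spark.sql.join.preferSortMergeJoin": "false",
}


def apply_conf(spark, conf):
    """Apply settings to a SparkSession-like object (hypothetical helper)."""
    for key, value in conf.items():
        spark.conf.set(key, value)


# In a notebook with a live session this would be: apply_conf(spark, JOIN_TUNING_CONF)
```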

C

So generally, what I do is I actually just set these two first and run the workload. If I still notice some oddities happening, like, let's say we join two tables and do a bunch of filtering, and I know I can broadcast the resultant table and it's not getting broadcast, that's when I will actually fall back to the previous slide and start adding hints.

C

But generally, you do know your data best. There are some cases where a sort-merge join is better than a shuffle hash join. A good example is when you're joining tables on a partition column. Let's say you partition by some sort of shard ID. Well, the default shuffle hash join requires a full shuffle, whereas the sort-merge join is actually localized to the partition, so the sort's actually not that bad in that case.

C

So you do know your data the best, and you can play with both, but in general go ahead and follow that rule of thumb: broadcast as much as you can, then go ahead and do the shuffle hash join, then fall back to your sort-merge join, and then, if possible, rethink your Cartesian join.

C

And that's it for joins, so we're going to move on to my favorite part of Spark 3.0, and that's adaptive query execution, something everyone's been wanting forever.

C

I like to think of it as somewhat of an easy button for Spark. In Spark 2.4 and earlier you had to set this thing, spark.sql.shuffle.partitions, and hope it was good enough for your entire job.

C

Well, now, with adaptive query execution, that number is figured out on the fly for you, and you no longer have to think about it as much. With Databricks specifically, we're actually figuring out the number of initial partitions for you too, coming soon to a Databricks Runtime near you. It's not quite there yet, but it's coming. So we're basically trying to make Spark as easy as possible and reduce the number of knobs that you need to twist to make your job run faster.

C

So adaptive query execution is by far the coolest thing that Spark's done in a long time. What it does is, during runtime, it'll actually change the query plan, and it does that because we have stats on all the different partitions that are created during shuffles. So we take that information and we can intelligently do different things.

C

So the first one is, we can adaptively change a sort-merge join to a broadcast hash join; we can coalesce the shuffle partitions; and we can basically split out skewed partitions automatically for you. I have pictures that show how all of these work, but it's a super powerful feature, and I have a few customers who have used it, and it made things run three to five times faster. I'll show you an example of what that looks like as well.

C

So the first one we have is the adaptive sort-merge join to broadcast join. On the left, we have what you would normally see in Spark 2.4, so it's doing a sort-merge join.

C

We have two data sets, the left child and the right child. It'll go ahead and do a shuffle write, which is basically shuffling the data into arranged partitions. It'll go ahead and read that data on the remote executors, it'll then sort it, and then it will do the merge in the sort-merge join.

C

But let's say on the right side we have a very small table that doesn't fit in our broadcast join threshold. The static size is 15 megabytes and the default size for the broadcast join is 10 megabytes, so Spark would say, hey, I can't broadcast that. With adaptive query execution, what happens is that during the shuffle we actually track the stats on the size of each shuffle partition, and given those sizes, we can dynamically re-plan the query so that we use the broadcast join.

C

So the downside to this is that if you had manually specified a broadcast join, the shuffle wouldn't happen; but the upside here is you didn't have to do that, and the loss of performance actually isn't that big of a deal.
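The re-planning decision can be sketched as a small plain-Python function. The 15 MB static estimate and the 10 MB default threshold come from the example in the talk; the 8 MB size observed after the shuffle is a hypothetical value, and the function is a simplification of the decision, not Spark's planner. In Spark 3.0 the feature is switched on with spark.sql.adaptive.enabled set to true.

```python
MB = 1024 * 1024


def pick_join(static_estimate, runtime_size, broadcast_threshold, aqe_enabled):
    """Choose a join the way the talk describes, first statically, then at runtime."""
    if static_estimate <= broadcast_threshold:
        return "broadcast hash join (planned up front)"
    if aqe_enabled and runtime_size <= broadcast_threshold:
        # Only known after the shuffle write, so the shuffle has already been paid for.
        return "broadcast hash join (re-planned at runtime)"
    return "sort-merge join"


without_aqe = pick_join(15 * MB, 8 * MB, 10 * MB, aqe_enabled=False)
with_aqe = pick_join(15 * MB, 8 * MB, 10 * MB, aqe_enabled=True)
```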

C

The next one is the shuffle partition coalescing. So generally, what we tell people is to overestimate the number of partitions you'll need and go ahead and set that. The nice part is, for really large tables that don't get filtered or reduced much, it works pretty well. But let's say you have a query where you're doing some sort of aggregation, then a join, and then an aggregation, and the number of partitions you chose for the first join was optimal but for the second aggregate was sub-optimal.

C

So what adaptive query execution is going to do is, after the shuffle write, go ahead and use those stats and basically coalesce those small partitions into fewer big partitions. In this example, you can see on the left we chose the shuffle partitions as 5000 arbitrarily, and we didn't really need that many.

C

So what adaptive query execution said was, hey, you have many small partitions; let me go ahead and automatically coalesce those and basically make your join run faster, so the overhead of Spark is less.

C

This is a really powerful feature when you're doing aggregations or filtering where the size of your data changes significantly during a job.

C

You set spark.sql.shuffle.partitions once, the entire job uses that, and then you don't really have any flexibility there. With adaptive query execution, it will go ahead and coalesce those for you. And then again, with the Databricks Runtime, we're going to be releasing something to actually increase that number if it's sub-optimal as well. So right now Spark 3.0 just coalesces them, but Databricks will be releasing something to increase the number as well.
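The coalescing step can be illustrated with a simplified plain-Python sketch: walk the post-shuffle partition sizes in order and merge neighbours until a target size would be exceeded. This is an assumption-laden simplification, not Spark's actual algorithm, and the sizes and the 64 MB target are hypothetical. In Spark 3.0 the behaviour is controlled by spark.sql.adaptive.coalescePartitions.enabled, with spark.sql.adaptive.advisoryPartitionSizeInBytes as the size target.

```python
def coalesce_partitions(sizes, target_size):
    """Group adjacent partition indexes so each group stays at or under target_size."""
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical post-shuffle partition sizes in MB.
partition_sizes_mb = [10, 20, 30, 40, 5, 5]
coalesced = coalesce_partitions(partition_sizes_mb, target_size=64)
```

Six small shuffle partitions become two reduce tasks here, which is the same effect as lowering the shuffle partition count by hand for just that stage.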

A

All right, perfect. Actually, hey Joe, there's a good question that popped up, in this case from YouTube, and it was based on what you had been talking about in terms of AQE: does this mean we should refrain from actually using broadcast hints ourselves? My response online was basically that it's not that we're trying to tell you to refrain from it. It's more a matter of, why don't you let AQE try first, because there are a lot of cool things that AQE is actually able to figure out now.

A

But I figured, is there anything that you yourself, Joe, or Fish, would like to add to that?

C

Yeah, sure. So if we look at the auto broadcast from AQE, it still does require that we shuffle both of the tables, because we use the post-shuffle stats to decide on the broadcast.

C

So if you give the broadcast hint, we'll skip the shuffle on the left child and the right child and it'll be more performant. So again, you know your data the best. If you know you can broadcast it, it's always best to put that hint in, but in the event that you forgot, AQE is still going to try and help you out.

A

Yeah, perfect fish, anything else you want to add, or we can continue on for that matter.

B

No, I think maybe the only other thing is you still may want to increase the threshold for broadcast.

A

Oh good yeah, that's a great call out. Yes, good good.

B

Yeah, just increasing the threshold and turning on AQE is pretty nice, and Spark will try to adaptively adjust the plan. And then, to Joe's point, one thing you can do is look at the plans and see if adaptive query execution has kicked in, and maybe that tells you what hints you should be using the next time you run the query.

A

Perfect, all right, joe right back at you, man.

C

All right, so the last one was the skew split join. This is by far my favorite feature of AQE.

C

It makes dealing with skew almost trivial in Spark. A skewed join, or skewed partitions, is when you're joining two data sets and a subset of the keys makes up a significant portion of your data. What ultimately ends up happening, in this example, is that L1 and R1 are pretty small partitions and they execute really quickly, and then L2, R2, L3, and R3 are larger partitions (we need to update that graphic on the right there), and they end up taking significantly longer.

C

So what we see is a stage barrier where the cluster is not really doing much, and I'll show you a really good example here in a second, because it's just working on a couple of skewed tasks.

C

So before you had this, Databricks had a feature called the skew join, which worked okay, but you had to tell it either the keys or the columns ahead of time, and it did a bunch of sampling. It worked really well if you kind of knew how to use it.

C

If you didn't, sometimes it would regress performance. With the adaptive query execution skew split, what it does is, after the shuffle write, it'll compare the L1, L2, and L3 partition sizes and go ahead and actually split the big ones. What you'll see in the Spark UI is the 25th percentile input size is like 10 megabytes, the median's like 30 megabytes, the 75th is like 50, and the max is like 8 gigs.

C

So this will basically solve that issue for you. I've worked with a couple of customers, and I'm going to show you an example of it, but basically with zero tuning except enabling the feature, we were able to decrease the runtime by threefold. And again, what it's doing is, after the shuffle write, it's taking a look at the partitions it's written, comparing the sizes, and splitting out the big ones. This just reduces the skew, and your cluster can run more optimally.
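The compare-and-split step can be sketched in plain Python using the partition sizes quoted in the talk (10 MB, 30 MB, 50 MB, and one 8 GB outlier). A partition is treated as skewed when it is both several times larger than the median and above an absolute floor, and it is then cut into roughly target-sized pieces. The factor, floor, and target values below are illustrative assumptions, not necessarily Spark's defaults; the corresponding Spark 3.0 settings live under spark.sql.adaptive.skewJoin (enabled, skewedPartitionFactor, skewedPartitionThresholdInBytes).

```python
import math
import statistics


def plan_skew_splits(sizes_mb, factor, min_skew_mb, target_mb):
    """Return (partition index, number of sub-partitions) for each shuffle partition."""
    median = statistics.median(sizes_mb)
    plan = []
    for index, size in enumerate(sizes_mb):
        is_skewed = size > factor * median and size > min_skew_mb
        pieces = math.ceil(size / target_mb) if is_skewed else 1
        plan.append((index, pieces))
    return plan


# Sizes echo the Spark UI percentiles from the talk; the last partition is the 8 GB one.
sizes_mb = [10, 30, 30, 50, 8192]
split_plan = plan_skew_splits(sizes_mb, factor=5, min_skew_mb=256, target_mb=64)
```

With these inputs only the 8 GB partition is split, into 128 tasks of about 64 MB each, so no single straggler task holds up the stage.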

C

There are going to be fewer straggler tasks, so your CPU usage will go kind of through the roof. By far the coolest feature. Sometimes it'll kick in when you didn't even know you had skew in your job, and a particular join runs five or ten minutes faster. This one's the closest to magic I think there is.

A

Sorry, go ahead. No, I was about to add in: I just want to help Joe overemphasize that point. Skews were a real problem, especially coming from where a bunch of us came from, with data warehousing backgrounds.

A

We would have to do all sorts of magic to basically get the databases to be able to work, and that translated in one way or another to when we applied that to Spark. The fact that, with AQE and Spark 3.0, it's just able to handle the skews for us means it's a lot less work, it runs significantly faster, and it's all just happening under the covers. It's gorgeous, and it does simplify things a lot. So I cannot overemphasize this enough.

A

So sorry, Joe, this is just me putting on that sales tactic of saying: yes, you've got to go use this.

C

Yeah, definitely. One quick callout, though: open source Spark 3.0 only supports the sort-merge join for the skew split. Databricks supports the broadcast or the shuffle hash and the sort-merge join. So if you're doing this with open source Spark 3.0 and you followed my recommendation and set preferSortMergeJoin to false, you may not see it kick in. It only works for the sort-merge join on the open source version of Spark.

C

So we've kind of you know said this is really cool, but the proof is in the pudding.

C

So in this example, I had a customer running a 12-hour, or rather a six-and-a-half-hour, job here, and what was really common was to see Ganglia metrics that look like this. What these Ganglia metrics are telling us is that there is an extreme skew, and probably their cluster is a little too big for this particular job. If we take a look here, you can see in the middle the cluster was relatively well utilized, and then half the tasks had finished in about eight minutes, and then it took another 12 minutes for the rest of the tasks to finish, 100 percent attributed to skew.

C

We were looking at it for a while and were able to pinpoint it pretty quickly, and then we were figuring out how to solve it. Again, this ran in six hours and 47 minutes. The customer was actually reasonably happy, because before it had been running in 12 hours, but as a data engineer, this still kind of bugged me. So we went ahead and turned on adaptive query execution with the broadcast or the shuffle hash split skew join, and it ran in two hours and 28 minutes.

C

You can see the cluster was actually completely utilized, or significantly more utilized, for this query, and literally all we did was enable adaptive query execution and set preferSortMergeJoin to false. Those are the only two things we did: a three-times improvement, cluster utilization through the roof, and a very happy company that we were working with. All right, now I'll hand it over to Chris to walk through partitions, files, stats, and merges. Awesome.

B

Thanks, Joe. If you want to go to the next slide.

B

So this section is going to go through how to partition your table a little bit, what the benefits of partition pruning are, and then how we can leverage OPTIMIZE and Z-order to improve different types of operations as well. So this here is just showing you what partitioning actually looks like on disk. Spark partitioning is based on Hive's partitioning, which uses physical folders to separate the data, and this has benefits where we can isolate operations to a specific folder and control what is going to each partition.

B

And then, if you go to the next one, Joe, we're also able to do partition pruning, and so this has immediate benefits where you're able to skip a lot of data in your scans just by partitioning your data set.

B

Oftentimes there will be a logical partition for you to choose; something like a date is pretty common.
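To illustrate the folder layout and the pruning it enables, here is a small plain-Python sketch. The paths follow the Hive-style column=value folder convention; the table location, file names, and dates are hypothetical. A filter on the partition column is answered from the folder names alone, so files in other folders are never opened.

```python
def partition_values(path):
    """Parse Hive-style 'column=value' folder names out of a file path."""
    values = {}
    for segment in path.split("/"):
        name, separator, value = segment.partition("=")
        if separator:
            values[name] = value
    return values


def prune(paths, column, wanted_values):
    """Keep only files whose partition folder matches the filter."""
    return [p for p in paths if partition_values(p).get(column) in wanted_values]


# Hypothetical table partitioned by date.
table_files = [
    "/data/events/date=2020-10-25/part-0000.parquet",
    "/data/events/date=2020-10-26/part-0000.parquet",
    "/data/events/date=2020-10-27/part-0000.parquet",
    "/data/events/date=2020-10-27/part-0001.parquet",
]

to_scan = prune(table_files, "date", {"2020-10-27"})
```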

B

If you have low data volumes, maybe a month would be better, or you can partition by things that are actually in your data, like customer information or geographic location, that kind of stuff. Something new that was added in Spark 3 is dynamic partition pruning, which is where Spark will actually look at all of your tables that are being joined together, and it'll check if you have a filter being applied to one table that might also be able to be pushed down to other tables that are partitioned on that value.

B

So you could be joining in something like a lookup table with metadata information, where you're applying a filter for a specific partition value, and Spark will be able to identify that, push it over to the other side of the join, and do partition pruning on the table you did not specify the filter on.

B

That's super helpful, because it'll be able to speed up some joins without you having to rewrite them. Yeah, if you want to go to the next one, Joe. So at a lower level than partition pruning, Databricks Delta is also able to do what's called data skipping, and this has other names in other databases.

B

I think in Oracle it's called zone maps, and it has another name in SQL Server that I'm forgetting. This is basically us taking advantage of the columnar statistics that are available.

B

So the Parquet file format is already collecting these min/max values on each column, but Parquet stores them in the footer of the file. This is helpful, but not enormously helpful, because you still have to scan to the Parquet file and check the footer before deciding whether or not you're going to avoid loading that file, and by then you've already scanned to it and loaded some portion of it anyway.

B

So what Delta is able to do is pull these statistics up into the Delta log itself. Delta will first issue a metadata query to the Delta log and determine whether your filter can possibly avoid files inside the Delta table based on these min/max values. Here we see an example: we have a query with a really simple filter on an integer column.

B

So then we're able to take the min/max values that are recorded for each file and check whether or not these files are going to have relevant values for your query. In this example, we're able to eliminate that first Parquet file, because we can see based on the statistics that it won't match your filter.
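The min/max check can be sketched in a few lines of plain Python. The file names and statistics below are hypothetical stand-ins for what the transaction log records per file; the point is that the decision is made from metadata only, before any data file is opened.

```python
# Hypothetical per-file statistics for one integer column, as kept in the log.
file_stats = [
    {"path": "part-0001.parquet", "min": 1, "max": 5},
    {"path": "part-0002.parquet", "min": 4, "max": 9},
    {"path": "part-0003.parquet", "min": 6, "max": 12},
]


def files_for_equality_filter(stats, value):
    """Keep only files whose [min, max] range could contain the filter value."""
    return [f["path"] for f in stats if f["min"] <= value <= f["max"]]


# WHERE col = 7: the first file's range is 1..5, so it cannot match and is skipped.
candidates = files_for_equality_filter(file_stats, 7)
```

Overlapping ranges, as in the second and third files, are what limit the benefit; the tighter and less overlapping the ranges, the more files can be skipped.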

B

This can have really huge benefits on scan reduction. We have customers querying tables with hundreds of petabytes, and they're able to get a 95 or 99 percent reduction in the amount of data scanned. This is massive in a world where you're paying for everything based on time running: you're paying for compute by the second, and you're paying for storage by API cost.

B

So one hidden benefit here, and it's not an enormous cost savings, is that in S3, Azure Blob Storage, or ADLS you're getting charged per API request, so every time you can avoid loading a file, that's one less API request that you're getting charged for.

B

So it has benefits both on the speed and performance side and on the cost side.

B

So then, to improve these statistics and make them even more effective: obviously the statistics are going to work best when your data is clustered or sorted in some way, and that's where Z-order comes into play. In Databricks you have this compaction job called OPTIMIZE, and you can run OPTIMIZE against any Delta table. It'll go in and compact your smaller files into larger files for better read performance and better metadata management.

B

Z-order is a kind of cool math algorithm. It's a space-filling curve: Z-order is going to map your data into a one-dimensional space and then sort it there. If you go Google it, you end up with this cool Z pattern,

B

hence the name Z-ordering. Z-order is going to cluster your data more effectively and improve those data skipping statistics. You're going to want to use Z-order on any columns that you filter on a lot and run highly selective queries on, or columns that you run update or merge operations against, as those will also benefit from having better-clustered data.
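
To make the curve idea concrete, here is a small sketch of the textbook Z-order (Morton) key: interleave the bits of two column values and sort on the result. This is an assumption-laden toy, not the Databricks implementation; the 4x4 grid, the four-rows-per-file layout, and the helper names are invented for illustration.

```python
def interleave_bits(x, y, bits=8):
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # bits of x land on even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # bits of y land on odd positions
    return key


def chunk(rows, size):
    """Split sorted rows into consecutive 'files' of `size` rows each."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def files_touched(files, column, value):
    """Count files whose min/max range on `column` (0 = x, 1 = y) contains `value`."""
    count = 0
    for rows in files:
        col_values = [row[column] for row in rows]
        if min(col_values) <= value <= max(col_values):
            count += 1
    return count


# Toy table: every (x, y) pair on a 4x4 grid, written as 4 files of 4 rows.
points = [(x, y) for x in range(4) for y in range(4)]

linear_files = chunk(sorted(points), 4)  # plain sort by x, then y
z_files = chunk(sorted(points, key=lambda p: interleave_bits(*p)), 4)

# Files that must be read for `x = 3` and for `y = 3` under each layout.
linear_result = (files_touched(linear_files, 0, 3), files_touched(linear_files, 1, 3))
z_result = (files_touched(z_files, 0, 3), files_touched(z_files, 1, 3))
print(linear_result, z_result)
```

In this toy, the plain sort is ideal for filters on `x` but reads every file for filters on `y`, while the Z-ordered layout skips half the files for either column.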

B

So here in this example, on the left we have just the default layout of the table before, where we did not apply any kind of clustering or sorting to the table.

B

Now on the right we have the results after the Z-order. Again, we're filtering here for the value equal to seven. Now that the data is clustered and all the similar values are co-located, either in the same file or in a smaller set of files, in this simple example we're able to avoid two-thirds of the table.

B

And so just some general tips and best practices around Z-order. Z-order works really well up to about three or four columns, and then the effectiveness is going to start to drop off quite a bit. So if you Z-order on five columns, you would expect three to four of them to give you decent performance, and then probably that fourth and fifth column would start to have a lot less data skipping. We're also currently beta testing a new type of curve called the Hilbert curve.

B

The Hilbert curve has some slightly cooler properties than Z-order: it bounds how big a range each file is going to have and makes sure that there are no big jumps between files, and it is also more effective at clustering your data. In our simple, sort of synthetic testing, we see that the Hilbert curve provides good clustering up to five or six columns, depending on how naturally clustered your data is.

B

The Hilbert curve can actually be effective up to 10 or more columns if you have really naturally clustered data. We're currently beta testing that in the latest DBR, 7.3, and if you're at all interested in that, we'd be happy to show you how you can get that testing in your own workspace. Another thing that's worth considering is the actual file size that's produced by OPTIMIZE and Z-order.

B

So right now the default is one-gigabyte files, so either for pure OPTIMIZE or with Z-order, it'll target these one-gigabyte output files. This imposes some restrictions: if you have partitions that are less than one gigabyte in size, then Z-ordering won't actually do anything. For Z-order to actually do anything, you need more than one gigabyte of data inside each partition, as the Z-order happens on a per-partition basis. The other thing is that the one-gigabyte size was chosen to provide good scalability, so you'll be able to scale your Delta tables easily up to hundreds of terabytes or petabytes.

B

But what we've seen is that there are benefits to dynamically choosing this file size. There are circumstances where you have smaller tables and the one-gigabyte files are maybe too large. And if you're specifically focused on interactive queries, scanning the data really fast, doing lots of updates, or doing things that will benefit from a lot more file pruning,

B

then it's beneficial to output smaller files out of Z-order, because that'll give you more opportunity to do data skipping and avoid more files with your operations. We've definitely gone through and done

B

some tuning exercises with customers where we found 64 megabytes or 128 megabytes has huge benefits over the default file size, specifically for merge operations or for interactive queries. In the future, we're looking to dynamically choose this file size based on the overall size of your table and its layout as well.
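
The file-size point can be sketched with simple arithmetic: if a partition is smaller than the target file size, compaction produces a single file and there is nothing left to skip inside that partition. The sketch below assumes, purely for illustration, that the number of output files is the partition size divided by the target size, rounded up; the 512 MB partition and the `output_file_count` helper are hypothetical.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def output_file_count(partition_bytes, target_file_bytes):
    """Rough estimate of files produced when compacting one partition."""
    return max(1, math.ceil(partition_bytes / target_file_bytes))


# Hypothetical partition holding 512 MB of data.
partition_size = 512 * MB

files_at_1gb = output_file_count(partition_size, 1 * GB)     # default target
files_at_64mb = output_file_count(partition_size, 64 * MB)   # tuned target
print(files_at_1gb, files_at_64mb)
```

With one file per partition there is no clustering across files to exploit; with eight files, a selective filter has a chance of skipping most of them.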

B

Then one last point. This is something that's very different between Delta and all the other data sources available to Spark: Delta is never going to be bottlenecked by the number of partitions that you have on your table. In Delta as a format, the partition is effectively metadata about the file, whereas for all the other sources

B

it's the other way around: the partition comes first and then the files come after. So Delta cares much more about how many files are in your table, and it can easily scale up to hundreds of millions of files.

B

On the flip side, that also means it doesn't care how many partitions you have. You could have one partition per file, and to Delta that would be transparent.

B

And then, for just a general rule of thumb, you want to Z-order on the columns that you're going to be filtering on or that you're going to be updating on. That's kind of the basic rule.

A

Oh, actually, before you dive into that, we had a couple of questions that I wanted to chime in with. For some folks, the first question, which is a little bit more about data skipping and Z-ordering in the first place, is: do they have to do anything special for data skipping, number one, and Z-order, number two, to kick in for their environment? So, Fish or Joe, why don't you go for it, or I can take it, whatever makes sense.

B

Data skipping is going to be automatically applied and automatically picked up whenever you're writing to Delta from Databricks. Databricks will automatically collect the statistics for you for any Delta table and then automatically handle taking advantage of those statistics. It's all hidden behind the scenes from the user: you just have to write data into your Delta table and then query it, and you get the advantages.

A

Perfect, thank you very much. I think that answers the question, so go for it.

B

So this is some information about the actual data skipping statistics and how to optimize them. As you might guess, collecting these statistics is not zero cost.

B

There is a small amount of added write latency to collect these statistics, so you want to make sure that you are collecting useful statistics and not just collecting them for all columns even if you're not going to use them. The default for Delta tables is to collect statistics on the first 32 columns, and this is probably too many for most use cases.
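
Because statistics cover the first N columns by position, which columns get them depends on column order. The sketch below models that in plain Python with a made-up 41-column schema; `columns_with_stats` and `move_first` are hypothetical helpers, not Delta APIs. To my knowledge the limit itself is the `delta.dataSkippingNumIndexedCols` table property, but treat that name as something to confirm in the documentation for your runtime.

```python
def columns_with_stats(schema, num_indexed_cols=32):
    """Statistics are collected on the leading columns only."""
    return schema[:num_indexed_cols]


def move_first(schema, column):
    """Return a new column order with `column` moved to the front."""
    return [column] + [c for c in schema if c != column]


# Hypothetical wide table: the column we filter on sits at position 41.
schema = [f"payload_{i}" for i in range(40)] + ["customer_id"]

has_stats_before = "customer_id" in columns_with_stats(schema)

# Move the filter column to the front and shrink the limit to 3 columns.
reordered = move_first(schema, "customer_id")
indexed_after = columns_with_stats(reordered, num_indexed_cols=3)
print(has_stats_before, indexed_after)
```

The combination is the useful part: fewer indexed columns keeps write overhead down, and ordering makes sure the few that are indexed are the ones queries filter on.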

B

In most use cases you're only actually going to query and filter on a few of your columns, and some of your columns may not have useful statistics either: if there's no variation in a column, or if there is a really small number of values, the statistics may not actually be helpful for data skipping. So what I tend to look at is columns that you're going to filter on a lot, or columns that you're going to run update or merge operations on. Those you definitely need to collect statistics on. And then with Delta

B

you can simply reorder the columns of the table and ensure that the columns you want are the ones that statistics are being collected on. Down here at the bottom, we show you the config to alter the number of columns that statistics are collected on. One cool thing about Delta is that, because it's managing the Parquet files under the hood for you, you can reorder columns in a Delta table without physically rewriting the data. Because we have columnar storage behind the scenes,

B

we're able to just map the column order that's being presented at the table level to the correct column in the Parquet file itself. So after you've generated your tables, you can go in and start moving columns around and alter what order they appear in, so that you have a better logical representation of your table. And then we're able to take advantage of these statistics and turn a bunch of queries into metadata-only queries.

B

These will then be super, super fast, and this starts to give you an experience a lot more similar to a data warehouse, where the data warehouse is collecting these statistics all the time for you, and that's how it's able to be really fast for certain queries. So for things like selecting the maximum value or the minimum value from a column, we've already collected all the min/max values from every file.

B

So then we can issue a metadata-only query to determine what the maximum value is for a particular column. And, as mentioned before, it allows us to get really granular with file skipping: if you have a selective enough filter, you could potentially load just one file out of your entire table.
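
A metadata-only aggregate can be pictured as folding over the per-file statistics instead of the data. This is a simplified sketch with invented statistics and field names (`min`, `max`, `num_records` are assumptions here, not a statement of the log's exact layout), and it assumes the statistics are complete and exact for the column.

```python
# Hypothetical per-file statistics for one integer column.
file_stats = [
    {"path": "part-0000.parquet", "min": 1, "max": 3, "num_records": 1000},
    {"path": "part-0001.parquet", "min": 4, "max": 8, "num_records": 1500},
    {"path": "part-0002.parquet", "min": 6, "max": 9, "num_records": 500},
]


def column_max(stats):
    """MAX(col) is the largest of the per-file maxima."""
    return max(f["max"] for f in stats)


def column_min(stats):
    """MIN(col) is the smallest of the per-file minima."""
    return min(f["min"] for f in stats)


def row_count(stats):
    """COUNT(*) is the sum of the per-file record counts."""
    return sum(f["num_records"] for f in stats)


summary = (column_min(file_stats), column_max(file_stats), row_count(file_stats))
print(summary)
```

No data file is opened in any of the three functions, which is why such queries can return almost immediately.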

B

This is basically taking you one step lower than what partition pruning can already accomplish. And then right now, for timestamp and string types,

B

the statistics that get collected on them typically aren't super useful. For strings there is truncation being applied, so if your strings don't vary in the first few characters, you're probably not going to get useful statistics for those string types. For timestamps there's a precision element: timestamps require, you know, nanosecond or millisecond precision, and we don't collect statistics down to that level at the moment. So typically for a timestamp

B

I might map it to something like Unix time, and turn it into something that is easier to collect good statistics on.
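
Both points can be shown in a few lines of plain Python. The truncation length of 4 and the example URLs are hypothetical, chosen only to make the effect visible; the second half shows one way to map a timestamp to integer Unix seconds.

```python
from datetime import datetime, timezone

PREFIX_LEN = 4  # hypothetical truncation length, for illustration only


def truncated_min_max(values, prefix_len=PREFIX_LEN):
    """Min/max computed on truncated prefixes, as a stand-in for string stats."""
    prefixes = [v[:prefix_len] for v in values]
    return min(prefixes), max(prefixes)


# Strings that share a long common prefix: min and max collapse to the same
# value, so the range says nothing useful about what the file contains.
urls = ["http://a.example/1", "http://b.example/2", "http://c.example/3"]
url_stats = truncated_min_max(urls)


def to_unix_seconds(ts):
    """Map a timezone-aware timestamp to whole seconds since the Unix epoch."""
    return int(ts.timestamp())


event_time = datetime(2020, 10, 27, 12, 0, 0, tzinfo=timezone.utc)
unix_seconds = to_unix_seconds(event_time)
print(url_stats, unix_seconds)
```

An integer column like `unix_seconds` gets ordinary numeric min/max statistics, which is the property the speaker is after.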

A

Cool. Actually, before you dive into it, Chris, I just realized we've got about eight minutes left, because we want Karen to be able to finish up. So why don't we breeze through the next few slides, just to at least give everybody the context. So, yeah, perfect.

B

Yeah, so the last thing is: you can completely disable stats collection. There are a couple of instances where you would want to do that. Stats are not useful if you consume the table via streaming. Streaming is going to consume every file in the table, so you don't really need stats collected if you're only reading it via streams.

B

However, you may want to go back and collect stats later on if you have interactive query use cases. You can always go back and turn stats collection back on, and there's actually a hidden API to force stats collection, or you can run something like Z-order and it'll collect the stats.

B

And then merge. This just kind of explains how merge is going to work: merge is going to take advantage of those data skipping statistics to figure out exactly which files match your merge query, and then it will load that data and apply the merge operation.

B

One thing to note is that Delta right now rewrites data even if the WHEN MATCHED and WHEN NOT MATCHED clauses did not succeed.

B

So you do want to make sure that whatever your MERGE INTO is using for the actual update clause, the ON portion of it, is selective enough that you are getting matches out of your merge. Otherwise, even though you think this merge may be a no-op, we may still rewrite some data, so that's definitely something to watch out for.

B

And then explicitly filtering on your partitions is going to help a lot with merge. We're working on making this dynamic so that it automatically picks up what partitions you're targeting with your merge, but at the moment it's extremely beneficial if you can pull out the explicit partition values and pass them in as an explicit list to your merge operation, and then it'll apply the partition pruning correctly.
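
One way to picture the explicit list is a small helper that collects the distinct partition values from the incoming batch and writes them into the ON condition as a literal IN list. The sketch below only builds the SQL text; the table names, column names, and the `build_merge` helper are hypothetical, and the values are assumed to be trusted, since plain string formatting does no escaping.

```python
def merge_condition(key_col, partition_col, partition_values):
    """Build an ON condition that names the target partitions explicitly."""
    quoted = ", ".join(f"'{v}'" for v in sorted(set(partition_values)))
    return f"t.{partition_col} IN ({quoted}) AND t.{key_col} = s.{key_col}"


def build_merge(target, source, key_col, partition_col, partition_values):
    """Assemble a MERGE statement; values are assumed trusted (no escaping)."""
    condition = merge_condition(key_col, partition_col, partition_values)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {condition}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Partition values that would be collected from the incoming batch first,
# so every source row belongs to one of the listed partitions.
batch_dates = ["2020-10-27", "2020-10-26", "2020-10-27"]

sql = build_merge("events", "updates", "event_id", "event_date", batch_dates)
print(sql)
```

The literal list lets the engine rule out every other partition before it looks at any file statistics; in a Spark session the resulting string would be passed to `spark.sql`.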

B

The last thing, as mentioned before: you can alter the file size for your tables, and we've definitely seen plenty of cases where reducing the file size can improve data skipping a lot and, in turn, improve merge operations quite a bit. I think we had one customer where we reduced the file size from one gigabyte down to 64 megabytes, and we were able to reduce the amount of data rewritten by merge by over 50 percent.

B

And this is kind of just a general tip: I personally recommend optimized writes all the time. This is going to enable your writes to write out larger, better-sized files. It targets 128 megabytes as the default, and this is going to improve your stability on the read side and give you better read performance after writing your data. It's also going to help you avoid any throttling from the storage layer and get around overloading the storage layer with too many PUT requests.

C

All right, and then we just have a couple more slides. One of the most common things I see customers do, coming from a different system, is always pick the cheapest or the biggest instance type, because they think that's the best. So here we're trying to give you some prescriptive guidance on how to pick your instance type on Databricks.

C

So in this example, we have the tale of three different instance types. It was a very basic query, just SELECT * from a table with a filter on a partition column and then some other column equals one, just to force us to read all the Parquet files in that particular partition.

C

It was a 2.2-terabyte data set with 3,000 files. For the cluster configs, we had six nodes with 16 cores and 32 GB of RAM each, six nodes with 16 cores and 128 GB of RAM each, and three nodes with 32 cores and 256 GB of RAM each. So you'll notice the core count is all the same. In general, I/O is limited by the number of cores and your storage system.
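
The "same core count" remark is easy to check with arithmetic. The sketch below uses the three configurations quoted above and assumes the core and RAM figures are per node; the dictionary keys are just labels.

```python
# The three cluster configurations from the talk (figures assumed per node).
clusters = {
    "6 x (16 cores, 32 GB)": {"nodes": 6, "cores": 16, "ram_gb": 32},
    "6 x (16 cores, 128 GB)": {"nodes": 6, "cores": 16, "ram_gb": 128},
    "3 x (32 cores, 256 GB)": {"nodes": 3, "cores": 32, "ram_gb": 256},
}

# Total cores across the cluster, and memory available per core.
total_cores = {name: c["nodes"] * c["cores"] for name, c in clusters.items()}
ram_per_core_gb = {name: c["ram_gb"] / c["cores"] for name, c in clusters.items()}

for name in clusters:
    print(name, total_cores[name], ram_per_core_gb[name])
```

All three clusters expose 96 cores, so the configurations differ only in memory per core (2 GB versus 8 GB) and in how many machines that memory is spread across.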

C

All the cloud storage systems can support a really large number of concurrent readers, so cores are the important part. But what you'll notice is there was a difference in RAM and in the size of the machines.

C

So for this particular query, we ran it with different Databricks Runtime versions, we ran it with data in the Delta cache, and we ran it on the different instance types. What you'll see is that, contrary to popular belief, the smaller machine actually performed much better. This is mostly due to some of the JVM overhead and garbage collection. You can see with the six-node,

C

oh, this should say, excuse me, 16-core, 32 GB RAM VM, it took us about 1.8 minutes to go ahead and read that entire data set. With the same number of cores and machines but four times the RAM, it actually took a little longer. And then obviously you can see here, when the data is already in the Delta cache on the NVMe SSD,

C

the performance is a little better on the second read. Just for the sake of comparison, we did work with an older version of the runtime, and you can see that the older version was a little bit slower, which reinforces the fact that you should always try to use the latest version of the Databricks Runtime. And the most interesting thing here is that when we went from six machines to three machines, but just increased the size of the machines,

C

performance actually started to deteriorate pretty quickly. So this is kind of just a little bit of evidence for the next slide here. Basically, some rules of thumb you can live by: use 32 to 128 gigabyte machines. Cores are cheap; your time and your employees' time aren't. As long as you have fewer than 800 cores, you're probably not going to run into any cloud limitations for the most part, so a bigger cluster usually scales pretty linearly in terms of performance.

C

The next thing: not all processors and cores are created equal. Cloud vendors are constantly trying to release new machines, and they use different processors.

C

Some processors are better than others, so definitely take that into consideration when you choose your machine type for particular workloads. ETL is generally bottlenecked by I/O, and as I mentioned earlier, the more cores you have, the quicker you can do your I/O, so use compute-optimized instances. For interactive queries and building machine learning models, any instance with an NVMe SSD that can power or leverage the Delta cache is going to be more optimal for you. So just some quick rules of thumb there, and I believe that's my last slide.

C

So, Karen, we'll pass it back to you.

D

Thanks, Chris, thanks, Joe. Great presentation, and thanks, everyone, for joining us and for all of your questions. We did record this session and it's available now on YouTube. I posted the link in the LinkedIn chat and also in the Zoom chat, and we'll also post a link there to the slides as well, so you can review them after the session.

D

So with that, thanks again, Chris, Joe, and Denny, for your time, and everyone have a great rest of your day. Take care.

C