Delta Lake Conference Talks, 21 Oct 2019

Previous Meeting

⏯

youtube image

►

From YouTube: Optimizing Delta Parquet Data Lakes for Apache Spark - Matthew Powers (Prognos)

Description

This talk will start by explaining the optimal file format, compression algorithm, and file size for plain vanilla Parquet data lakes. It discusses the small file problem and how you can compact the small files. Then we will talk about partitioning Parquet data lakes on disk and how to examine Spark physical plans when running queries on a partitioned lake. We will discuss why it's better to avoid PartitionFilters and directly grab partitions when querying partitioned lakes. We will explain why partitioned lakes tend to have a massive small file problem and why it's hard to compact a partitioned lake. Then we'll move on to Delta lakes and explain how they offer cool features on top of what's available in Parquet. We'll start with Delta 101 best practices and then move on to compacting with the OPTIMIZE command. We'll talk about creating partitioned Delta lake and how OPTIMIZE works on a partitioned lake. Then we'll talk about ZORDER indexes and how to incrementally update lakes with a ZORDER index. We'll finish with a discussion on adding a ZORDER index to a partitioned Delta data lake.

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Read more here: https://databricks.com/product/unified-data-analytics-platform

Connect with us:
Website: https://databricks.com
Facebook: https://www.facebook.com/databricksinc
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/databricks
Instagram: https://www.instagram.com/databricksinc/ Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here. https://databricks.com/databricks-named-leader-by-gartner

A

Hello, everybody. So, as I said, my name is Matthew Powers I work for a company called prognosis, health. The main gist of this talk is going to be optimizing. Delta lakes, although a chat a bit about Park a, but it's mainly gonna, be Delta lakes. So just a quick agenda of what we'll cover we're gonna be talking about why delta lakes are awesome, some basics and the transaction log. It's gonna be some low-level file system stuff.

A

We cover in this in this talk, we'll talk about compacting, Delta lakes, kind of evacuating files, post compaction, partitioning lakes and how that can be a huge benefit for whenever you're querying based on partition filters, and then, if we have some extra time, we'll talk about deleting rows and persisting transformations and columns, which can be good performance optimizations for certain types of queries. So a little bit about myself, Before we jump in I.

A

Am you know a data engineer during the day and a spark nerd at night is my first time in Amsterdam I, really like it here. I think everybody thinks I'm Dutch because they keep talking to me in Dutch I'm, not sure. Maybe that's a how people are here. After now, maybe everybody gets talked to in Dutch. I I spend a lot of time in South America and nobody thinks I'm from South America and nobody thinks I speak Spanish, but I do speak Spanish.

A

So it's kind of like the opposite of my normal life. So getting back to the kind of spark open-source stuff I do have some pretty popular spark. Open-Source repos spark fast tests and spark Daria I have a medium blog, which gets a lot of views. So if you've ever googled for spark stuff, you might have stumble across my blog. My new blog is called munching data have a bunch of Delta blog posts on my new blog, so you might enjoy those if you're just learning about Delta here's my contact information.

A

The best way to contact me is probably the github issue or pull request. I seem to care about those more than anything else. You can also hit me up in the Delta slack channel if you're, not in the Delta slack Channel, recommend it all. The Delta Corps guys are pretty responsive there. So what is Delta Lake at a high level? It's pretty much a bunch of part cave files plus a transaction log. The transaction log allows for tons of amazing things.

A

You can't do with the plain vanilla part K like so, if you that your choice is between Park and Delta, I, think it's a pretty obvious choice to use Delta Lake and now that it's open source, especially so Delta Lake and data bricks. Delta, are two different technologies. I've seen tons of confusion about this in the slack channel and also and some presentations so Denner bricks. Delta is a proprietary.

A

Technologies but they're different, so data bricks, Delta, is a proprietary technology that is optimized for different data bricks, runtimes dead bricks, runtimes our data bricks, specific closed source, Park versions, and so there's certain data bricks, Delta commands like optimized for compaction, which are not available in Delta Lake and, as you can see in this issue here, this is Michael arborist, like there's no plans to open source the optimized command, because it's specific to the data bricks runtime and also has some file system optimizations.

A

If you look in the slack channel, there's two different channels, one for the open source stuff and another for data Brooks Delta. So this talk is going to be focusing on Delta Lake to be clear on that one. So, let's look at and I know. This is like a lot of high-level stuff we're getting into code. Soon, don't worry so this is like a high-level best practices at Delta Lake. You know we want to be using the snappy compression algorithm, which is used by default.

A

Snappy is fast on the compression and the decompression side, which is ideal for spark use cases. We generally want to target one gig files and not have nested directories, although the new nested directories doesn't matter as much for Delta. So where do we get this one gig target file size? This was recently reiterated in the Delta slack channel by TV. This is also comes from a data brick support ticket from like 2016.

A

That's what I used to go off of, but I wouldn't use that that one gig file size optimal target as like the only file size. You should target, for example, the data bricks Delta, auto optimize is targeting 128 Meg files, but it also gives a shout-out to the one gig file, so yeah I think it's still like a kind of an open question on what the optimal file size to use for for Delta Lake is, and it probably depends on your use case. Okay. So now, let's jump into some code stuff.

A

Let's talk about compaction at a high level and why compaction matters for for big data file systems. So whenever you have a big data system, you're doing lots of writes when you do lots of writes you're, creating lots of files listing if you're dealing with a park, a data Lake listing lots of small files, it can be really expensive.

A

So just the file listing at operation itself for Delta the file listing operation doesn't matter as much because Delta's not I mean Delta technically lists the transaction log to figure out which which files to read, but it's not listing the parkade files, so in general compaction makes it easier for spark to read in data on disk into memory partitions in an efficient manner. So this is the sample. Data set will be covering throughout this presentation.

A

We're gonna have three columns of data first-name, lastname and country. We can see we have one Argentinian, two Russians and two Chinese folks.

A

So let's look at some code. That's going to create a Delta data Lake. So what we do, let's assume that that that did here is stored in a CSV file. So what we do is we read in that CSV file into a data frame and then we for purposes of this example we're going to repartition it into five memory partitions before we write it out, we're going to write it out in the Delta file format. Technically, that save mode override, isn't or overwrite isn't necessary and they were gonna write it out.

A

So let's take a look at what this looks like on disk after we read it out, so there's this Delta log directory that gets written out.

A

There's that zero zero dot JSON file, which indicates this was the first transaction to take place on this Lake and then there's those five park' files that are written out so underneath the hood earlier in this presentation, I was saying that Delta at its core is pretty much a transaction log, plus a bunch of park' files, I hope now that we're looking at the file system, we can kind of see why I said that, so, let's inspect the kind of the contents of this zero zero JSON file.

A

This is one of the rows that are added to that file. We can see that this is an add statement, said saying that this this snappy, that part code file was added to our Delta Lake. The partition values are blank because this isn't partitioned the size of that file is 875 bytes we're going to be using that size later to calculate the optimal number of partitions when we're compacting a data Lake the modification time.

A

That's another important piece of metadata that that Delta uses for time travel and the data change is set to true data trains set to true means that this changes, the overall data in the lake. We're gonna also talk about that flag later and why it's it's important.

A

So if we look here in the zero zero dot JSON file, this is just one row of the file. I've expanded it to multiple rows to make it easier to read. But if you open up that zero, zero JSON file and your own machine, this is just going to be one row and there's gonna be four other rows.

A

Just like this for each file, that's added to the to the Delta Lake, oh yeah, so I just wanted to make a quick shout out to that all the code that's covered here is in this Delta examples code, repo, how they recommend cloning, this repo and running these examples on your local machine, gonna, probably the best way to learn about how Delta's working under the hood okay. So now, let's look at how we can compact this Delta data Lake. So we have these five part K files.

A

Let's assume we want to kind of compact this data as to be one one file. So what we're gonna do is read in our Delta data Lake into a data frame, then we're going to repartition it to one partition and write it out to the same path that we read it in from so we we read and write the same path so again, Delta data lakes, we're looking at the open source Delta. So there is no optimized command, so you need to do this stuff manually if you're using the open source version.

A

So, let's look at the file system post compaction, so we have this new JSON file entry. So we have that 0-1 JSON file and we have an additional park' file. That's been written to the file system. Now. One thing you can kind of immediately notice is that the original five park' files are not deleted and we can kind of see that Delta overrides our over writes are operating fundamentally different than how a park' overwrite would work.

A

So if you do this in park', it's going to delete those five original files, if you do it in Delta that doesn't it doesn't delete those files deleting files in Delta is a different type of operation. Now the reason Delta does it like. This is because Delta supports version data and time travel. So this is intentional.

A

Let's take a look at some of those zero one JSON file entries to see what's happening. We can see that when a single file is being added this, this file, that's being added, is going to be the compacted data and the file that's being removed.

A

Well, there's five files that are being removed, and what this removes statement is saying is that when the latest version of the Delta table is read, don't read in these files just skip them and by analyzing the transaction log was kind of how you can understand how Delta is working under the hood. So one thing we can see is that the data change is unfortunately being set to true post compaction. Now, that's actually I'm just terrible. We would prefer to have data change set to false when we perform a compaction operation.

A

So when we're compacting data were not changing, the overall data in our lake we're just rearranging data into fewer files, so we don't want that to be considered a data change for a downstream streaming. Consumers like I use a lot of structured streaming a trigger once and compaction is, unfortunately, a breaking operation because you've compact, you create new files, and your downstream consumers interpret that as new data being added to the lake. So what we want to do is when we perform a compaction operation to be able to set data change equal to false.

A

So it's not it's not a breaking operation for downstream consumers, so this still needs to be done. I think I'm, probably gonna, work on this. This pull request and try to get it in there just because I really need this. Actually.

A

Another thing to note when you are compacting data, is that the repartition method has the potential to increase the file size on disk, so here we're seeing that we're doing DF through partition, one that algorithm.

A

You know at least when I ran it on my production data increase the size of data on disk, which was undesirable and slowing up queries I found that the coalesce algorithm did not increase the size of my data on disk, so just kind of an FYI something to watch out for when you compact do some file listing operations and make sure that your data hasn't increased in size on disk and if it has perhaps switch to coalesce if you're using repartition.

A

Ok. So now, let's switch to the vacuum command, what the vacuum command does and when it's useful, unfortunately, I think some folks think that vacuum is going to speed up their data like operations and that's, unfortunately, not the case. Vacuuming is just the leading files in a key value store, probably where you keep your data, that's being ignored, anyways, so deleting stuff, that's being ignored, doesn't unfortunately, doesn't help performance. What is vacuuming? It's deleting files from disk that are all marked for removal and also older than the retention period.

A

The the default retention period is 7 days, so in this case we're gonna show how to vacuum at Delta data Lake. We want to vacuum this this data. Immediately after its created, we want to kind of bypass that seven-day retention period. So when we're instantiating our spark session, we need to set this Delta retention duration check enable to false.

A

So in order to vacuum our lake, we need to instantiate this Delta table object, which we do Delta table that for path, and then we run the the vacuum method on the Delta table. In this case, we're setting it to point zero. Zero zero zero one hours, which is just a matter of seconds, so this is how you can kind of vacuum a table immediately after it's or files that have been added recently.

A

This is what the file system is going to look like post vacuum. We can see that we still have that zero, zero dot, JSON file, the zero one dot JSON file. We still have our compacted park' file, but now we don't have our five original files. Now. One thing that's interesting to note about how Delta is currently working is it doesn't consider a vacuum as a transaction right, we still have the same: zero, zero to Jason and zero one Jason transaction log files as before.

A

I think that that's okay, but maybe that should be you know, put in the transaction log I'm still kind of on the fence about it. I opened an issue for that to kind of discuss that, so it's a kind of an open issue: okay, well, I, never thing to think about is the potential of thinking of vacuum as a more abstract concept like what to do with files that are older than the retention period and marked for removal.

A

So right now we're just saying we're going to delete them from disk, but maybe we could think about that as a more generic concept like well for my files that are older than the retention period and marked for removal, maybe I want to move them to glacier to save on storage costs. Maybe I want to do other stuff with them right now, it's kind of a narrow concept, just deleting them from disk. So it's another thing that I'm, probably gonna, make some open-source pull requests to potentially change that and make it more generic okay.

A

So now we need to think about the optimal number of partitions to be using when we're repartition and compacting our data. So the way we can determine the optimal number of partitions is by querying the Delta log. So if you take this Delta log for table dot snapshot all files, that's going to return a data frame that contains all of the Delta log information. So previously we were looking at the the Delta logic, stuff and Jason format. If you want to get that in data frame format, you just call that series of methods.

A

So if we aggregate the sum of the size column, that's going to enable us to find out how much bytes are in our Delta data lake. Now, where we want a partition to one gig partitions, so we want to figure out how many gigabytes are in our Delta Lake. In order to calculate the number of gigabytes from bytes, we just take the number of bytes divided by the conversion factor.

A

The conversion factor is pretty much just 1,024 cubed and that's going to give us the number of gigabytes in our debt and our Delta Lake from there. We can I mean that's pretty much going to be what we what the number we want to use when we're repartition in our data now there's kind of an edge case. If you have a tiny lake, maybe you want to use one or two hundred as the number minimum number of one gig partitions, but that's only if you're dealing with the tiny Lake.

A

So this part arielle library, helps with these type of common data operations.

A

In this case, we can just import the Delta log helpers and call the number of one gig partitions method, basically just abstracts what I just showed you and then when you're compacting, you can pass in that num partitions argument for the optimal number of partitions spark'd area is a library I developed offers tons of useful SPARC helper methods, I, actually kind of can't even develop spark with that spark Daria at this point, because I need to use addicted to the helper methods, so I highly recommend taking a look.

A

This repo, if you haven't already so calculating the optimal number of partitions, if you're dealing with Parque Lake, is significantly less convenient. You need to run a series of these org Apache Hadoop FS path methods. They have a bunch of unintuitive names like in order to get the number of bytes. You need to call get length. It's the same math at the end of the day, I've tested this on s3 and my local file system.

A

So, if you're dealing with a park a data Lake and need to repartition it, this works there's other complications of compacting Park, a data Lakes that I'm not gonna cover here, but they basically just use Delta and make your life easier. It would be my recommendation, okay, so let's get into a little bit of partitioning data likes on disk and why that it can offer some huge performance improvements.

A

So when your data is partitioned on disk, you can enable for a lot of data skipping you're only going to be able to have you leverage that data skipping four queries that filter on the partition key so having the right partition key and your partitioned data Lake is very important to get these 50 to 100 X performance gains.

A

So let's use the same data as before and let's take a look at what filtering this unpartitioned Lake looks like. So, let's assume this data is in our CSV file as before. Let's take a look at the physical plan when we run some filtering operations and then we'll we'll partition this data or run the same query and then we'll see how the physical plan differs. So in this case we're looking at data frame where the country is equal to Russia and the first name starts with M, so focus your attention kind of the bottom.

A

There the partition filters. You can see that those are empty. So when we're running this, this query on a none partition lake. There are no partition filters, which makes sense, and all of the filtering is done in the push filters. So you can see they're there at the bottom, there's equal to country, Russia and strings arts with first name M. So, let's a partition. This data, like we're, gonna, read in that data. We are going to repartition on the column country that the repartition method is for memory partitioning.

A

So that's not gonna change anything when we're writing it out disk, but the partition. By does so. We set partition by country and then say if it's the output path. This is what our data is going to. Look like on disk once it's partitioned. We have that Delta log, zero, zero dot, JSON file, and then we have the the directory structure indicating the data for the different partitions.

A

Let's take a look at the contents of the zero zero dot JSON file. This is just for one of the rows and that zero zero dot JSON file. We can see that this is an add statement for the country of list of Russia. That's the the path, the park' file. Now we can see that that partition values is populated, so the country is being set to Russia. We still have the number of bytes the modification time and the date of change as before.

A

So let's run that same query, we were running before, but let's run it on this partition data like so here we can see we're grabbing where the country is equal to Russia and the first name starts with M, and if we look at the partition filters, we can see that that has the country equal to Russia apart is now being filtered on the partition filters and then the push filters now only contains string starts with M. So this is just putting those physical plans side-by-side.

A

We can see basically what we just covered, so this is kind of like an FYI for partition filtering in general, so SPARC currently is performing a file listing operation when you're leveraging partition filters. That is arguably unnecessary, so grabbing the data directly is I've found it to be eighty-three times faster than relying on the partition filters. That was for file listing time because SPARC, the current implementation of partition filters is performing what could arguably be be classified as an unnecessary file listing operation.

A

So what I found is, if you're dealing with a a partitioned data Lake it's better to just grab the partitions that that you want directly. So, let's look at a real partitioned data. Lake I work with it's being updated every three hours. It currently has five million files. It just seems like such a ridiculous amount of files, but that's how it is.

A

15,000 files are being added on a daily basis, so it's still great for a lot of queries, but any small file problem that you have in a regular data Lake is going to be amplified significantly. When you start partitioning any incremental updates just going to start adding a ridiculous amount of files and that's why it's hard maintaining partition Lakes, it's also hard creating partition Lakes.

A

So let's take a look at this query right here for creating a partitioned data Lake and we'll look at how it works on our little trivial data set and then explain why this can't be used on a real data set. So right here we are going to format by Delta partition by country and write out this data. So in this case we have a separate file being being written out for every row of data now when you think about how partitioned by works its partitioning by the memory partitions.

A

So if all of our data is being stored in a separate memory partition, then partition buys and is not going to kind of compact, the the dead and the given partitions. So if you run this, this code on a production size, data set you're just going to create a massive amount of files. So when I, when I was running this on my production data set, we had our data and like 50,000 memory partitions, the unique values of the partition key were about 20,000.

A

So this this operation was going to create up to a billion files, so it was an insane amount of files when I ran that I feel like this. How I learned about spark? You know you kind of just do stuff and fail, and then you're like, oh okay, I learned something new I, don't know if anybody else had that experience.

A

So let's look back at our original approach for creating this partition lake, which was let's see where that was repartition on the country column. Before writing the data out. This is not a production grade way of partition. You know creating an initial partition data like because this is going to create a single file per country in this case or per partition key so there's. Obviously, your data is going to have some huge partitions that can't be written out as a single file. Like let's say in this data set.

A

Let's say that China had two terabytes of data. This is gonna, try to write out all that Chinese data as one two terabyte file, which obviously isn't gonna work. So, let's look at another option for creating this partitioned data. Lake, which is the one I ended up doing, and it's still not ideal, but it at least works. So this right here, wery partitioning and we're supplying repartition with three arguments. The first argument is the max number of files per partition.

A

The second argument is the column two partition by, and the third argument is how to distribute the data among amongst those hundred files. So this is at least better. This is what I ended up using and it works, so this is going to for each country, create a maximum of 100 files and distribute that data evenly.

A

Now so like, let's say that that first argument that 100 that could be set like- let's say the Chinese partition, has 100 gigs of data, or maybe we say, ok, we're going to use 100 partitions for our worst case scenario, but unfortunately, that a hundred is going to be used for all different countries in our lake, so like Jamaica is gonna. Have up to 100 files and for if we're thinking about just the countries we'd want maybe lots of files for China lots of files for India, but then one file for all these tiny countries.

A

So this is kind of where we're at right. Now we need a fourth method. We need to have create a partition method. Data like four of three to give the real solution for the community.

A

So just to recap, these partitioned data lakes, the small file problem- is growing quickly and compaction is really hard. I, don't think that there's a good community solution for how to compact a partitioned Lake, yet so I think we need to come out with that.

A

Okay, so I have about 12, more minutes, I think I can keep going. So, let's suppose we have a use case where you know we, we, our business, changed. We have a bunch of Russia Russian data, and now we don't need to we exited the Russian market. We don't need that dead anymore. We don't want to run this where country is equal to Russia every time we just want to delete it from disk.

A

That's going to be a performance optimization for us, so you know: Delta Lake gives us the opportunity to delete data, so we can instantiate that Delta table object. We can call Delta table dot, delete and it's going to delete that data. Now. A good thing to know note when you're thinking about deleting data is park' is an immutable file. Format right, so it's not like Delta can go into the park' file and just delete a few rows from from the file. It's like.

A

Let's say you have a file that has a billion rows and one of them is a part of Russia data. What SPARC means to do or what Delta needs to do is write all that data, except that one row again. So one thing to note: when you're deleting data is it can create a massive write operation like you can, let's say you have 10 terabytes of data and you're deleting a tiny amount of data. That means that basically you're in theory, your whole data Lake can essentially be rewritten again.

A

So this can take a lot longer than you'd expect. Let's take a look at deleting under the hood, what it does so it's removing a file and adding a new file.

A

Of course, the file that's being added, does not contain some of the rows of the file that's being removed.

A

So that's one potential, optimization and another potential optimization we can do is appending a column on the fly so a lot of times in spark. What you do is you say well, parquet is an immutable file format I'm just going to do everything on the fly. So let's say we have this with continent method which append the continent of a country on the fly.

A

So we can kind of read in our data and run that with continent method to apply that on the fly now this with continent method would actually be pretty stable. So maybe we're gonna say: okay, we're just gonna persist the continent on disk and Delta allows for for that. So if we want to make this data frame on disk Delta makes it easy to add a continent or to a column. So basically, what we do is we read in our Delta data lake, we're gonna run that transformation.

A

We're gonna write out the data with the merge schema option set to true it's going to kind of overwrite our old schema and set save mode to append, and we can save that data.

A

So I've asked in the in the Delta channel if there was any downsize to Delta Lake and there were. There was a couple of downsides that the sketches okay mentioned, but not really so, if you're, if you're, looking at the options between Delta and Part A I, don't think there's a real good argument to use plain, vanilla, parquet anymore, for for spark analyses. You just use Delta and it's it's gonna, be a lot easier.

A

So this is my contact info again. Definitely hit me up on github if you want to do some open-source hacking or ping in the Delta slack Channel join the Delta slack channel if you're interested in this stuff. So thank you very much.

B

Thanks Matthew, we have a few more minutes for questions. Are there any.

C

Hello, thanks for the presentation, so I have one question on the let's say: I have an open an open source Delta table on on s3 and I. Do two updates or an update and delete on two separate clusters? Would that work, or is that not supported yet you're.

A

Saying two simultaneous.

C

Yeah two simultaneous updates on two separate clusters that.

A

C

Would work yeah even with the s3 being eventually consistent? How? How is that managed under the hood, since there are several files, jacent files, I.

A

Think how it works under the hood is Delta. Has the transaction lock right? So it's saying this is like it's looking and say: if you do DF dot read and read in some data and saying this is the state of the table and so you're saying you're writing in data simultaneously from two other sources: yep.

C

A

Then are using streaming readers or batch readers. Let's.

C

Say batch, for example,.

A

Yeah I think the batch readers are just gonna say not not change. Anything they're gonna say like.

C

How how does the, how do the two readers, not, for example, overwrite? So the let's say you starts with a zero JSON file and then you have choose simultaneous, writes: aha, how do they not collide and overwrite the the one JSON file? At the same time,.

A

Yeah, that's that seems like an edge case, but I. Could that could happen right so yeah I, don't know I, guess I, don't know he.

C

A

I think I think they've thought to do that. I I, don't think they that would happen. I! Don't think that that I think it would create the oh one: Jason and the Oh to die. Jason I'm, not sure how.

D

Optimistic concurrency for the death lake, so I think that would go through and then one of them will win. One.

A

Of them will win okay yeah, that's possible how they do it. Yeah I do know that that's a solved problem, though, which is which is cool.

E

All right, yeah.

F

E

You for that's all, and so I wasn't aware that there was a a distinction between the Delta for four days works and the UM source version. But so does that mean if I'm inside a Tate's works, environment and I perform an optimized operation on that day's lake, for which I'm writing from the open source version? From another spark instance, yeah.

A

E

Lock can be corrupted in some way or how that work. So.

A

It's it's actually kind of a problem how its, how its structured right now, because so, if you look at the the Delta source code, a lot of the Delta source code is actually in the org apache spark namespace. Now the reason that's the case is because of there's a lot of private stuff in spark like a packaged private stuff that Delta needs to access.

A

So like I did this, and if you package your like, let's say you go to Delta I owe you run SBT package, you build the jar file and then you attach to your cluster. It's actually not gonna work. So it's actually. It's actually problem. I mean some stuffs gonna work, but some stuff isn't gonna work. I, don't know. If that makes sense, think I can again yeah I can show you the code snippet for that later to justify that.

G

Hi great talk: what kind of consumers can consume date? Delta tables my VI team are using consuming pocket tables with tableau or any other sort of consumer. So what kind of connectors or presto what kind of connectors does Delta has today yeah.

A

It's a good question, I'm, not sure, but you need to be careful when you're reading and Delta stuff into none places that aren't used to Delta because, like there's all those part K files- and we did the compaction example right. So we have our data and then we compact it and our data is stored in the same data stored in two different places.

A

So if people aren't looking at, if your review know, your consumer is not looking at the transaction log to see that oh I'm only supposed to be reading this one file and ignoring those five other files. If they're not doing that, then you could be getting some erroneous results. So I'm not sure what you know. What support there is there yet, but you need to be careful there. Yeah.

F

I think uestion, so, what's at all about publishing by multiple fields like columns, I think.

A

That, if you, if you have a query pattern where you're frequently grabbing you know doing a where statement, you know where country- and you know first, you know the first letter or whatever and that's great but the more basically a partitioned data. Lake is great for whenever you're doing a where statement and using all of those in the filtering and then as soon as you don't do that. Those where statements that that correspond with the partition keys, it's pretty bad because we have so many small files. How.

F

About the ordering of them, like I, say you have like three columns like ABC, yes, Amanda like do go by with like the country Russia and then by the like date, stamps or wherever yeah.

A

It's a good question: I would go with whatever ordering allows for the maximum amount of data skipping. Creating a partitioned data lake is also really tough when, when you're doing it on a column that has very high cardinality, so it's it's tough, yeah I can see. You felt the pen.

H

When you said that sport doesn't do listing optimally when you use partition by, do you mean that it does the listing for even the partitions? That's not actually part of the query, and the second question is it's book. If I supported sorry.

A

So for the first question, the answer is yes and the second. Can you repeat the second question: please um is.

H

Book it by supported by the Delta for month, you're.

A

Saying it's Park I supported bucket.

H

A

By bucket by I, don't know, I haven't used bucket by side. Now it's something I should know about, but I don't I.

I

Noticed in the slides about the override and the delete operation that basically the time stamps for the deletion of files are later than the time stamps for writing the new files. So does that mean you need to take into account the possibility that a query that basically asked for the data of the specific time stamp in between those times will get duplicate data? Let.

A

Me just look this up to make sure I understand this. So.

I

Basically, here you see the time stamps and you see that the add happens before the delete. So basically, if you would query your data and just happen to create that for a timestamp in between those two times, you would get both the both files. Basically, both duplicate data.

A

That's a really good question.

A

It seems like an etch case, but it also seems like I'll have to look into that. I've never done that before, but the this is from the same that that's from the same JSON file entry, so I, don't know if Delta factors that in but yeah it is interesting that that deletion time stamp is different from the modification time yeah. So so I'll look into that I've! Never I, never considered that one. Thank you for bringing it to my attention. I love! Looking at education like this, that's a good one spot all.

B

Right, we are running out of time, so we'll obviously close the session, but please feel free to stay in the room and then talk to Matthew Thank, You Matthew.