From YouTube: Delta Lake Connector for Presto - Denny Lee, Databricks
Description
Delta lake is an open-source project that enables building a lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS. We - the Presto and Delta Lake communities - have come together to make it easier for Presto to leverage the reliability of data lakes by integrating with Delta Lake. In this session, we would like to share the design decisions and internals of the Presto/Delta connector.
For more info about Presto, an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes, see: https://prestodb.io/
A: Prior to this I was a principal program manager at Microsoft for various teams, including Azure Cosmos DB, Project Isotope (which was the incubation team for what is now known as HDInsight), SQL Server, and Bing, and somewhere in between all that I was also a senior director of data science.
A: So, let's talk a little bit about the motivation here: what is Delta Lake, and why did we work with the Presto community to build a Presto connector for Delta Lake? Well, let's talk about Delta Lake first. Delta Lake is one of the major open-source data lake storage standards for ensuring data reliability on data lakes, which are arguably very unreliable systems on their own, and we're going to talk about that in a second as well.
A: The issue at hand is that currently, when you work with Presto and Delta Lake, you actually have to use a manifest file, which then allows you to register a Delta Lake table into the Hive metastore as a symlink table type. The symlink manifest is basically a file that contains the list of files that Presto will access, so Presto uses it to figure out what the current files for the table are.

A: That's great if you are doing interactive queries on relatively stable data, but how about if you want to run interactive queries on data that does change over time, or that is updated or modified regularly, and so forth?
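As a rough illustration of the manifest approach described above: a symlink-style manifest is essentially just a text file listing the data files that currently make up the table, and the engine reads that list instead of listing the directory itself. A minimal sketch (file and directory names are hypothetical, not the actual Hive/Presto implementation):

```python
from pathlib import Path

def write_manifest(table_dir: Path, data_files: list[str]) -> Path:
    # A symlink-style manifest is just a newline-separated list of the
    # data files that currently make up the table.
    manifest = table_dir / "_symlink_format_manifest" / "manifest"
    manifest.parent.mkdir(parents=True, exist_ok=True)
    manifest.write_text("\n".join(data_files) + "\n")
    return manifest

def read_manifest(manifest: Path) -> list[str]:
    # An engine reads this list instead of listing the directory,
    # so it only ever sees the files named in the manifest.
    return [line for line in manifest.read_text().splitlines() if line]
```

The consistency problem discussed later in the talk follows directly from this: the manifest is only as fresh as the last time it was regenerated.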
A: So, Delta Lake is an open-source project that enables building a lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS. Now, when I talk about the lakehouse architecture, it's not just a marketing buzzword. It's a paradigm shift, which says the v1 world of data was typically associated with databases.
A: The v2 world of data engineering has now shifted over to data lakes. What's great about these two worlds: in the case of databases, the v1 world, you actually had this very reliable system with ACID transactions that was simple, or at least relatively simple, to use.
A: The v2 system, data lakes, gave you a massive amount of flexibility to use low-cost object stores to store all of your data, and a high degree of flexibility to solve problems that you really couldn't try to solve with the v1 databases. The lakehouse paradigm is like the version three of this world now, which is to say: can you marry the best of these two worlds?
A: Can you have the ACID transactions and reliability that you would normally get out of databases together with the flexibility that you got out of data lakes? With open-source storage systems like Delta Lake, the fact is, you can. The idea that you can have that flexibility but also have reliable data storage allows you to get the best of both worlds, the data warehousing and database world and the data lake world, and that's why Delta Lake is fundamental to what we're seeing as the lakehouse paradigm.
A: Now, let's talk a little bit about the promise of the data lake. What was great about the data lake is: hey, let's go collect everything; it allowed you to do all this really cool stuff. You could store it in your data lake without any problems (apologies for the bias there), and you also have data science and machine learning, so now you can run your recommendation engines, your risk and fraud detection...
A
All
these
other
cool
things
directly
against
your
data,
lake
and
everything's
done
right,
because
everything's
fixed
no
problem
at
all
problem
is-
and
this
is
an
old
adage
garbage
in
garbage
stored
garbage
out
your
data,
science
and
machine
learning
was
only
as
reliable
as
what
you
actually
stored
and
what
you
actually
collected.
So
how
do
we
make
this
better?
A
So
we
you
actually
don't
have
a
reliable
storage,
sorry
reliable
set
of
files
that
truly
dictate
what
the
table
that
you're
trying
to
query
is
made
out
of
there's
no
wave
form
of
quality
enforcement
of
that
data
and
there
isn't
any
form
of
consistency
or
isolation
that
goes
with
it
as
well.
So
this
these
distractions
are
not
just
minor.
They
actually
have
a
ultim,
truly
big
impact
on
whether
you
can
trust
the
data
that
you're
querying.
A: So what happens when I do this with Delta Lake instead? Well, with Delta Lake, we often talk about Delta Lake hand in hand with what we often term the medallion architecture: this idea that you have bronze, silver, and gold data quality levels for data as it's coming in. The quality is basically defined as: bronze is your raw ingestion, silver is your filtered, cleaned, and augmented data, and gold is your business-level or aggregate data.
A
The
idea
is
that
delta
lake
allows
you
to
as
part
of
that
process
to
incrementally
improve
the
quality
of
your
data
as
you're
processing
it.
So
it's
ready
for
consumption
and
so
again,
when
I
look
at
the
broad,
the
focus
just
a
little
bit
on
that
in
terms
of
the
raw
ingestion,
it's
the
dumping
ground
for
your
data.
It's
often
you're
retaining
it
for
a
very
long
time
and
you
will
avoid
any
error
prone
parsing.
So
you
can
keep
the
data
there.
A: Silver is at that point where you have intermediate data with some cleanup applied; it's queryable for easy debugging. It's common that this is where you would do your debug logs, or if you're running any machine learning you might actually run it at this level as well. But ultimately, gold is where you want to query that data, whether it's for streaming purposes or for your AI and batch reporting purposes, and you want to be able to query it reliably.
A: You haven't lost anything: you've kept and retained all the data, even if you have to change business logic from the past.
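A minimal sketch of that bronze, silver, gold flow, just to make the idea concrete (the field names are invented for illustration; this is not Delta Lake code):

```python
# Hypothetical medallion-style pipeline: raw events land as-is (bronze),
# silver filters/cleans them, and gold aggregates to a business-level view.
def silver(bronze: list[dict]) -> list[dict]:
    # The "cleanup" step: drop malformed rows and normalize types.
    out = []
    for row in bronze:
        if row.get("user") and row.get("amount") is not None:
            out.append({"user": row["user"], "amount": float(row["amount"])})
    return out

def gold(silver_rows: list[dict]) -> dict:
    # Business-level aggregate: total spend per user.
    totals: dict[str, float] = {}
    for row in silver_rows:
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals
```

The point of keeping bronze around is exactly what the talk says: if the cleanup or aggregation logic changes, you can replay it from the raw data.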
A: So who's using Delta Lake? It's used by thousands of organizations worldwide, as we've listed here; I'm going to skip past most of this stuff.
A: The one cool case that I do want to call out is Comcast, for example. This is actually from one of the Data and AI Summit sessions with Comcast, one of the keynotes from the Data and AI Summit: it's about sessionization with Delta Lake, written by Comcast.
A: They improved the reliability of their petabyte-scale jobs a lot. The cool thing is that because they were able to leverage Delta Lake to run as both a streaming and a batch processing system, they were able to do two really important things. First, 10x lower compute: they went from 640 VM instances down to 64.
A: And because they could leverage streaming capabilities, they were able to run simpler and faster ETL jobs, going from 84 jobs down to three, and also halve the data latency. So, pretty cool things, right? I just want to do some quick call-outs on some of the amazing things here: having data reliability and using a system like Delta Lake allows you to make things more efficient and ultimately cheaper.
A: Okay, and I'm going to skip through this pretty quickly, but there's a lot of innovation, because Delta Lake, just like Presto, is an open-source project, and there are a lot of innovations moving quickly. This is a piece of that innovation: from April 2019 up to February 2021 we went from 0.1 to 0.8 and added lots of really cool features, and then with Delta Lake 1.0, which was announced earlier this year, we added other really cool features as well.
A: I would highly advocate for you to go attend Michael Armbrust's keynote presentation from this year's 2021 Data and AI Summit, which goes into great detail about this: things like generated columns, multi-cluster writes, cloud independence, Spark 3.1 support, pip-installable Delta everywhere (a key component in terms of having other systems work with Delta really natively), and also, of course, connectors.

A: All right, so that segues us back to the motivation. Like I said before, Delta Lake is a major open-source data lake storage standard, and Presto is arguably the most popular distributed SQL query engine. Up to this point, until today's session, we could only really talk about the two together from the standpoint of manifests. Well, how about if we actually gave Presto the ability to read a Delta Lake table right at runtime?
A: So if there are any changes, then right at runtime, at that point in time, Presto is able to automatically know which files it's supposed to access for the table, and so it has a clean read of that data. Well, that's exactly what we're here to talk about today. Okay, and why are there issues when it comes to using the manifest? Well, let's go into that.
A
There's
data
consistency,
issues
for
partition,
delta
tables,
which
may
result
in
an
inconsistent
view
of
that
delta
table
right,
also,
the
performance,
if
there's
a
lot
of
data
which
results
in
basically
lots
of
files
to
be
listed.
There's
a
lot
in
the
manifest
that's
loaded
into
memory,
and
then
it's
going
to
be
loaded
in
memory
all
at
once,
and
if
there
are
a
lot
of
files
for
that
table,
there
definitely
is
going
to
be
a
performance
issue
x,
we'll
see
after
the
first
record
right.
A
It's
just
going
to
take
a
lot
of
time
for
it
to
figure
all
that
stuff
out.
If
you
look
at
time,
travel
queries
with
the
manifest
file
you're,
not
actually
able
to
look
at
time
travel.
One
of
the
really
cool
things
about
delta
lake
is
the
ability
to
say:
what's
what
does
my
data
lake
table?
What
does
my
delta
lake
look
from
previous
versions?
Well,
with
a
proper
connector,
you
can
actually
see
older
versions
of
the
data
okay,
so
so
without
getting
into
all
of
the
details
here.
A
Okay,
because,
frankly,
I
put
a
link
here
to
give
you
an
access
to
the
design
document.
So
that
way
you
can
definitely
go
look
in
the
details.
Number
one
and
number
two.
We
actually
have
as
part
of
the
delta
users
slack
we'll
have
we
have
not,
we
will
have.
We
have
bi-weekly
presto,
connector
meetings.
So
you're,
more
than
welcome
to
join
us
and
I'll
put
a
link
near
the
bottom
for
where
you
can
find
that
information.
A: But the key call-out of the design is this: it starts with the Presto coordinator and the Presto executors, as you see here, and, as you already know, those are the two types of JVM processes. So now the Delta connector coordinates its calls with the Delta Standalone Reader library rather than the Hive metastore: the metadata provider that you see here is what loads the Delta metadata, which is stored in Delta's own transaction log.
A
So
traditionally,
when
it
comes
to
working
with
presto,
the
metadata
is
actually
stored
inside
the
metastore,
but
because
of
the
way
delta
works
that
metadata.
That
tells
you
which
files
file
paths
excuse
me
contain
what
files
which
ultimately
make
up
your
table.
That's
stored,
actually
in
the
underscore
delta,
underscore
log
file
folder,
which
is
a
bunch
of
transaction
log
files
that
contain
json
files
that
contain
that
information.
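To give a feel for what reading table state out of the _delta_log means, here is a deliberately simplified sketch: each versioned JSON commit holds add and remove actions, and replaying them in order yields the current file list. (The real Delta protocol also has checkpoints, metadata, and protocol actions; this is an illustration, not the connector's code.)

```python
import json
from pathlib import Path

def current_files(table_dir: Path) -> set[str]:
    # Replay _delta_log JSON commits in version order to get the set of
    # data files that currently make up the table.
    files: set[str] = set()
    log_dir = table_dir / "_delta_log"
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files
```

This is why no manifest is needed: the log itself is the authoritative listing of the table's files at any point in time.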
A
So
we
created
a
metadata
provider
that
is
able
to,
even
though
normally
presto
is
going
to
access
that
information
directly
from
metastore
it's
going
to
access
it
from
the
delta
lake
transaction
log.
Instead,
it
still
accesses
the
table
information
from
the
meta
store,
but
when
it
comes
to
the
underlying
metadata,
it's
actually
going
to
access
it
from
there.
Okay,
any
of
the
information
containing
splits
on
how
presto
is
going
to
be
splitting
that
split
generator.
A
That's
also
included
as
part
of
and
part
of
this
code
base,
in
which
the
delta,
in
which
now
we
can
figure
out
how
to
split
that
delta
table
into
multiple
input,
splits
and
then.
Finally,
the
page
source
provider
is
that's
the
interface
in
which
the
task
will
get
the
record
reader
for
a
given
split.
So
by
building
it
out.
A
This
way
now,
we've-
hopefully
at
least
seamlessly
made
it
so
that
presto
can
go
ahead
and
interact
with
the
with
a
delta
lake
table
without
the
users
themselves
being
aware
at
all
that
they're
talking
to
a
delta
lake
table,
so
it's
completely
transparent
to
the
users,
which
is
our
main
goal
here,
all
right
so
enough
to
be
said
about
talking.
Let
me
just
go
show
it
to
you.
So
let
me
let
me
dive
into
it
here.
So
I've
got
this
terminal
window
in
which
I'm
logged
into
my
pr
local
presto
instance.
A
Do
you
know
the
fact
that
this
precedence
is
locally
running
locally
on
my
box,
though
I'm
actually
for
the
fun
of
it?
Accessing
data
is
stored
in
an
s3
bucket,
so
there's
gonna
be
a
little
bit
of
latency,
but
I
didn't
want
to
call
that
out
and
also,
if
I
shift
over
here,
this
is
the
ui,
the
local
ui
than
running
that
you
can
see.
What's
going
on,
I've
got
here's
the
number
of
nodes
that
are
running.
I
did
actually
have
an
error
from
a
fake
sorry,
not
a
faulty
query
from
before.
A
So
I'm
going
to
go
ahead
and
run
this
right
now,
okay,
so
let's
go
ahead
and
of
course
I'm
copying
pasting,
let's
be
honest,
so
because
I
cannot
type
this
fast,
but
nevertheless
let
me
go
ahead
and,
oh
sorry,
I
am
showing
you
the
wrong
screen,
I'm
going
to
go
ahead
and
paste.
My
first
query
inside
here
and
so
right
now,
what's
going
on,
is
exactly
what
you
expect
presto
in
this
case,
what
it's
doing
it's
accessing
the
s3
bucket
this
particular
bucket
for
the
new
york
city
taxi
data
set.
A
So
this
is
initially
going
to
run
a
little
slow,
but
around
30
seconds
or
so,
oh
there
you
go,
it
is
it's
done.
I
did
want
to
call
out
that,
because
of
the
way
we've
set,
this
particular
set
up
right
now.
What
we're
doing
is
we're
actually
specifying
the
path,
not
the
actual
hive
metastore
bucket,
so
because
we
use
delta
s3
and
we
specify
this
particular
path
setting
right
here,
then
we
know
that
we're
actually
accessing
the
delta
table
through
its
file
path,
as
opposed
to
traditionally
through
the
metadata
metastore.
A
Yes,
you
have
the
access
to
the
metastore
as
well,
but
I
just
wanted
to
call
that
out
so
perfect.
You
get
to
see
the
data
set
and
you're
good
to
go
perfect.
When
I
go
ahead
and
switch
back
to
the
ui,
you
get
to
see
the
queries
here,
exactly
what
you
expected
in
the
presto
ui,
all
the
pertinent
information.
All
that
stuff
is
here
exactly
as
you
would
expect
so
so
far,
so
good,
nothing
terribly
unexpected.
Let
me
go
back
to
the
terminal
again,
all
right.
A
So
now
I'm
going
to
run
a
bigger
query,
but
without
partitions,
okay,
and
so
I'm
going
to
run
this
one
right
now
so
so
far
so
good.
Let's
think
here
all
right
I'll
show
you
the
ui
view
of
it
now,
and
so
it's
right
now
planning
it
through,
and
this
query
should
also
take
about
35
seconds
or
so
we'll
see
what
happens
all
right.
A: You can look at the UI in terms of the rows per second and all that fun stuff. Look at this, it might take a little bit longer, so my apologies for my faulty predictions, but as I did note, the fact is that it is actually accessing an S3 bucket, so it's probably running a little bit longer. Now, this, you'll notice, is against a non-partitioned table. It's not a huge table, but nevertheless...
A
It's
not
able
actually
to
break
things
down
faster
because
it's
accessing
a
non-partition
table
so
what's
great
about
it
is
that,
of
course,
you
know
if
I
have
a
same
query,
but
I'm
going
to
this
time
run
it
using
a
partitions,
okay
same
table
except
you'll
notice
that
in
this
case
I'm
saying
new
york,
city
219
part
because
that's
the
represent
the
partition
table
versus
the
non-partition
table,
okay,
and
so
I'm
going
to
go
back
and
sure
enough.
Here's
the
query!
So
this
first
one
that
I
ran
took
about
45.82
seconds.
A
Okay,
35.86
was
actually
done
executing
all
right.
Let's
see
what
happens
now
and
as
you
can
tell
we're
almost
done
here,
we're
looking
at
the
ui
so
significantly
faster
because
we're
actually
using
a
partition
table
instead
of
the
45
seconds
or
excuse
me
46
seconds
now
we're
talking
about
26.88
seconds
so
significantly
faster,
which
is
pretty
cool
all
right
so
far,
so
good,
but
right
now
all
I've
really
shown.
A
You
is
the
fact
that
okay,
I've
made
it
super
easy
for
presto
to
query
the
delta
lag
table,
and
that's
great
don't
get
me
wrong.
That's
all
good
stuff,
but
am
I
actually
also
able
to
leverage
some
of
the
cool
functionality
of
deltalic
and
the
best
way
for
me
to
show
this
actually
is
to
go
ahead
and
run
a
version
or
a
history
table
query?
Okay,
so
here
is
I'm
going
to
show
you
a
little
bit
of
syntax
now,
okay,
give
me
one!
A
Second,
I
just
go
back
to
terminal
all
right,
so
I'm
going
to
run
this
query.
It'll
run
relatively
quickly,
but
basically,
what
it's
doing
is
it's
doing
a
select
count
from
the
partition
table.
Okay,
now!
Well,
you
notice
here
and
it
number
one.
It
actually
fits
pretty
quickly,
but
you
notice,
I
have
this
additional
syntax,
the
at
v1.
A
By
the
way,
the
design
document
that
I
had
posted
to
this
inside
the
the
slides,
the
design
document
actually
explains
exactly
what
we're
doing
here,
but
adding
this
additional
syntax
is
basically
saying
I
want
to
look
at
version
one
of
the
table,
basically
the
first
set
of
insertions
that
I
put
into
this
table
this
delta
lake
table.
By
the
way
there
are
nine
versions
to
this
table.
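For illustration, the @v suffix can be thought of as a table reference plus a version number. A hypothetical sketch of parsing such a reference (this is not the connector's actual parser, and the table name is invented):

```python
import re
from typing import Optional

def parse_versioned_table(name: str) -> tuple[str, Optional[int]]:
    # Split a reference like 'nyctaxi_2019_part@v5' into the base table
    # name and the requested version (None if no @v suffix is present).
    m = re.fullmatch(r"(?P<table>.+?)@v(?P<version>\d+)", name)
    if m:
        return m.group("table"), int(m.group("version"))
    return name, None
```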
A: So now I'm going to run it, and again it's actually running pretty quickly: it's a partitioned table on only a two-node cluster, but it's still able to make use of that, so it's able to split the data quickly enough, bring the data back, and come back with the results nicely. And this time you'll notice that version five of the table actually has 79 million rows; well, 78.9 million rows, but close enough, 79 million rows.
A
So
again,
I'm
looking
at
the
history
and
then
if
I
was
to
look
and
you'll
notice
right
here
it
says
version
5.
and
then
finally,
I'm
going
to
run
the
last
version
of
the
table,
so
right
here
bam
and
sure
enough
it'll
go
run
through
and
when
it's
done,
it'll
actually
have
even
more
rows.
And
so,
if
I
go
ahead
and
look
at
the
ui
okay,
sorry
just
make
sure
yes
right.
You'll
need
notice
how
the
queries
are
running
perfectly
fine.
A
If
I
was
to
look
at,
let's
just
say,
the
v1
version
of
this
there's
actually
no
difference.
Okay,
because
what's
happening
what's
happening
here,
is
that
the
delta
standalone
reader,
what
it
does
sorry
the
presto
delta
connector,
what
it
does
is
it
actually
is
accesses
the
delta
standalone
reader.
The
delta
standalone
reader
itself
is
automatically
able
to
return
exactly
which
sets
of
files
that
belong
to
version.
One
of
that
table,
because
it's
the
one
that
returns
that
information,
the
presta
connector
just
gets
the
list
of
files
it
needs.
A: It actually understands what's happening. And just to finish up what we're showing here, the final query we ran there, which is against the ninth version of the table: there are 84 million rows. So you'll notice that, with this capability, you're actually able to make use of the time travel capability within your Delta Lake table right from the get-go as well. So, pretty cool; hopefully you get to enjoy using that. All right.
A
So
I'm
going
to
switch
back
to
the
slides,
real
quick
before
we
take
on
any
more
questions,
but
saying
that
I
did
want
an
important
callout
to
do
some
attributions
to
the
folks
that
actually
helped
build
this.
I
want
to
call
it
venky,
sadgeth
and
george.
They
were
crucial
to
the
help
development
of
the
project.
If
you
yourself
want
to
get
involved
with
this
project
or
any
other
one,
please
ping
us
at
delta
dot
io.
I
think
it's
the
next
time.
Yes,
it
is
all
right,
so
you
want
to
build
your
own
delta
lake.
A
You
want
to
help
us
with
the
delta
lake
inner.
You
want
to
join
the
presto
and
delta
community
meetings
that
we
have
every
two
weeks
just
join
us
at
https,
delta.io
and
and
all
the
information
from
the
slack
user
group
and
everything
else
is
all
available
there.
So
we
absolutely
welcome
you,
as
part
of
the
presto
community,
to
come,
join
us
to
help
us
improve
this
connector
and
then
saying
that
if
you
have
any
questions
left
by
all
means,
this
is
the
perfect
time
to
ask
them
right
now.
A
You'll
notice
that
I'm
a
bit
of
an
expanse
fan,
so
yes,
the
quote:
I'm
going
to
use
as
you're
about
ready
to
send
any
questions
my
way,
so
you
can
tell
you
found
a
really
interesting
question
when
nobody
wants
you
to
answer
it,
so
hopefully
I'll
actually
be
able
to
answer
it,
but
nevertheless,
please
do
ask
away
and
again,
if
you
have
any
questions
or
want
to
join
the
delta
lake
community
with
the
this
presto
connector
just
join
us
at
delta,
dot,
io.
A
Hey
everybody-
hopefully
you
guys
could
hear
me
now,
but
if
you
guys
get
questions,
I'd
love
to
hear
them.
B: Yeah, thanks so much, Denny. That was a really great presentation; it went into much more detail on Delta Lake, obviously, but on the connector as well, and it was really good to see time travel, going back in time.
A
Yes,
that
was
a
big
ask
that
everybody
was
going
for,
and
so
that
was
one
of
the
first
things
we
did.
We
did
talk
to
the
presto
community
specifically
about
the
syntax,
so
because
originally
our
syntax
was
much
more
closer
to
be
honest
to
spark
and
then
based
on
the
feedback
from
the
person
we're
like
got
it
we're
switching
this
right
now
and
so
yeah.
We
have
much
more
presto
friendly
syntax,
so
I
I
mean
for
anybody,
that's
listening,
especially
if
it's
like
whether
you're
watching
it
now
or
watching
later.
Really.
A
Please
do
join
us
because
we
really
want
to
take
into
account
your
feedback
so
that
way,
whether
we're
working
out
or
you
guys,
are
working
or
doesn't
matter.
Yeah
like
you,
can
chime
in
and
give
us
that
feedback.
So
we
can
keep
on
updating
it.
Yeah.
B: Absolutely. Sebastian has a question: what are some of the key features on the roadmap in the next few months for the connector?
A: Yeah, so right now the two key things that we're working on: we're actually trying to do some faster optimizations, basically better memory management, so that we can handle larger tables even faster. It already works for pretty large tables, but right now, candidly, there is a caveat.
A
There
are
issues
surrounding
the
metadata
in
which
you
actually
have
to
basically
take
all
of
the
the
entire
file
listing,
and
so,
if
you're
talking
about
batch
processing,
that's
honestly
not
that
big
of
a
deal,
but
if
you're
streaming
data
into
that
delta
lake
at
the
same
time,
the
idea
that
you
actually
have
to
read
the
entire
thing,
the
entire
file
list
in
the
memory
and
then
tell
presto
what
it's
supposed
to
do
can
actually
be
very
memory
intensive,
and
so
the
context
is
that
we're
gonna
change
that
up
by
adding
an
iterator
specifically
to
speed
up
processing
of
that
the
in
terms
of
farther
out,
obviously,
there's
going
to
be
features
and
functionalities,
as
in
like,
oh,
you
know,
can
we
do
better
partition
pruning
things
of
that
nature?
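The iterator idea can be sketched like this: instead of materializing the entire file listing in memory, yield one file path at a time so that memory stays bounded (a hypothetical illustration of the approach, not the connector's code):

```python
import json
from pathlib import Path
from typing import Iterator

def iter_added_files(log_dir: Path) -> Iterator[str]:
    # Stream 'add' actions from versioned _delta_log commits one at a
    # time, instead of loading the entire file listing into memory.
    for commit in sorted(log_dir.glob("*.json")):
        with commit.open() as fh:
            for line in fh:
                action = json.loads(line)
                if "add" in action:
                    yield action["add"]["path"]
```

Because this is a generator, a consumer can start handing out splits after the first file is read rather than after the whole listing is built.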
A
So
we
are
definitely
looking
at
that,
but
the
other
big
call-out
that
I
wanted
to
say
is
that
we
are
starting
to
and
I've
again
that's
why
I
want
love
to
get
feedback
is
to
go
ahead
and
okay
is
there
interest
for
the
presto
to
delta
writer
as
well
right
right
now,
the
vast
majority
of
people
are
asking
for
readers.
So
that's
why
we
focused
on
that
one
first.
But
are
you
also
interested
in
writing?
So,
if
you
are
again,
please
chime
in
we,
incidentally,
we're
going
to
be
publishing
the
proposed
roadmap.
A
I'd
say
in
mid
to
late
january,
for
for
the
del
for
delta
lichens,
specifically,
obviously,
if
it's
not
this
audience
the
delta
connector
for
presto,
so
we're
gonna
want
you
guys
to
chime
in
and
tell
us
what
you
want
out
of
this
okay.
So
right
now
we're
getting
a
little
bit
of
feedback
that
like,
for
example,
ctas
operations,
are
the
desired
flow
cool.
A
Then
that's
actually
honestly
relatively
easy,
so
we
can
do
it,
but
by
the
same
token,
if
there's
some
more
complex
scenarios
love
to
know
about
it,
so
you
can
chime
in
and
then
I
did
notice
denise.
Hopefully
I'm
saying
your
name
correctly.
At
least
myself
used
to
live
in
montreal,
I'm
assuming
that's
french,
so
I
apologize
if
it's
not
does
at
v1
syntax
work
in
sparse,
equal
as
well,
not
right
now,
in
fact,
we're
gonna
go
back
and
pull
do
some
pull
requests
into
the
spark
community
to
accept
that
syntax.
B: Sounds good, thank you. And then on the write path, I wanted to say that CTAS is something that some of the other connectors already support for the other table formats, so it's a good place to start. We are also looking at just basic, you know, insert/update/delete operations: start with inserts first, the most common, and then look at updates, upserts, deletes, and so on. Those get more complicated, because they basically touch the entire code path, from the parser all the way down to the writer itself, and that's something that Ahana is looking at adding on the write path as well. So, just a last question on catalogs: is there a preferred catalog or not?
A
No,
there
isn't
actually
a
preferred
one.
I
mean
we're
all
we're
going
to
be
compatible
like
the
at
least
the
ones.
We've
been
testing,
it's
more
like
the
hive,
metastore
or
hms
and
like
glue,
for
example,
there's
no
big
requirement
that,
but
the
part
of
the
reasons
is
the
the
context.
At
least.
Is
that
mo
the
vast
majority
of
the
metadata
for
that's
red?
It
actually
is
not
read
from
the
the
meta
store.
The
vast
majority
is
actually
read
from
the
transaction
log.
A
Now,
what
we've
done
with
the
connector
and
also
using
the
delta
standalone
reader,
is
that
basically,
that
metadata,
it's
the
exact
it's
completely
transparent
to
presto
whatsoever
in
terms
of
it,
doesn't
need
to
know
that
it's
grabbing
the
data
from
the
transaction
log,
which
is
basically
a
file
system
versus
actually
reading
from
the
meta
store.
The
reason
delta
lake
was
designed
this
particular
way
was
to
allow
for
much
much
larger
scale.
Building
we're
talking
about
hundreds
of
petabyte
systems,
which
normally
operate
where
basically
candidly
we
would.
We
would
see
the
metastore
fail
and.
A
Of
those
scenarios,
we're
like
that's
why
we
did
that
with
the
with
the
the
file
system,
but
because
of
that,
we
wanted
to
make
sure
that
the
from
a
presto
perspective
didn't
matter
it
just
it
plainly
didn't
matter.
So
the
cool
advantage
by
the
way,
and
thanks
for
the
question
rohan,
is
that,
in
addition
to
be
able
to
query
through
the
meta
store
directly
itself,
you
can
also
query
directly
through
the
file
path.
So
you
actually
have
both
options
now
so.
B: Awesome. Well, Denny, thank you so much; we're out of time, and our drummer must be going in the other room, so we're going to go join him and close out the session. Denise, Sebastian, Rohan, thanks for the questions; looking forward to the write path being executed as well. Next PrestoCon, hopefully we'll have both in and out. Exactly.