Description
Denny Lee from the Delta Lake project discusses in detail the new Native Delta Lake connector for Presto.
A

Okay, hopefully you're able to see my slides, so just give me a nod if there are any issues. Perfect, thank you very much. So today we're going to talk about building reliable data lakes with Delta Lake, but obviously the star of the show is Presto. So let's go ahead and start the show here.
In case you're wondering who I am and why I've decided to join you all here today: I'm a senior staff developer at Databricks, a long-time Brickster. I've been working with Apache Spark since 0.6 and with Delta Lake since its inception. Previously I was the senior director of data science engineering at SAP Concur, and prior to that I was a principal program manager at Microsoft for Azure Cosmos DB, for Project Isotope (what is now known as Azure HDInsight), and also for SQL Server. So, a little bit of experience when it comes to data. Hopefully we're going to make this a fun session, and I'm glad to chat with you all on all things, well, you know, Delta Lake.

If you're not familiar with what Delta Lake is: it's an open-source storage format that brings ACID transactions to big data workloads on cloud object stores, and it's the key ingredient for building your lakehouse.
Why do I harp on this concept? I want to talk about the evolution of data management from that perspective, especially given that I used to be part of the SQL Server database team. (By the way, there's a little bit of background noise; I'm not sure what happened there. Whoever's not muted, if you can mute yourself, please. Thank you very much.) So, why do we need lakehouses?
Well, let's start by talking about data warehouses, because this is literally my historical context: "Hey, let's go build a data warehouse." I was part of the SQL Server team, like I mentioned before, and I helped build some of the largest SQL Server implementations you could find. Pretty sweet, right? And we said, yes, let's go build these data warehouses; they're purpose-built for BI and reporting. In addition to helping build SQL Server data warehousing itself, I was also part of the SQL Server Analysis Services team, the BI side, which eventually turned into what is now known as Power BI. We were building these super Analysis Services cubes that allowed really fast BI and ad-hoc querying that would come back in seconds. To provide a little context, I was actually one of the first, if not the first, Microsoft employee to present at Hadoop Summit and Hadoop World, where we showcased the creation, in collaboration with Yahoo, of what was at the time the largest Analysis Services cube on the planet. It was 24 terabytes, and the source of that cube was a 2,000-node Hadoop cluster. So, you know, a little bit of data. Pretty cool; this is awesome.
We had the best BI and the best reporting, but there inherently were problems with this. There was no support for video or audio or text. There definitely wasn't any support for data science or machine learning. We had these star schemas that you had to follow, and when you say "limited support for streaming," that's actually us being a little too nice; frankly, there was no support for streaming. And certainly there were closed and proprietary formats: in the case of SQL Server, or in the case of Analysis Services, and this is similar to any other data warehouse, these were specific formats that you had to work with. And of course it was extremely expensive to scale out. To give you some context: while Hadoop itself was designed very well for scaling out, the 24-terabyte cube solution I talked about was a single box that we basically had to maximize to the nth degree. We also had to do hardware cloning in order to be covered if there were any concerns about downtime, and we ended up using (not my choice, by the way) an Oracle RAC server as the staging server, so, the most expensive staging server you'll ever see. A really, really expensive way of doing it.
Obviously not ideal, and that's why we created data lakes. I was part of Project Isotope, which, like I mentioned before, was the precursor to Azure HDInsight. We were a nine-person team that said, hey, let's bring Hadoop into Microsoft. This was during the Ballmer years, so that was a lot of fun when we pulled that one off. And because of that: hey, we got Hadoop, we got it running, so now you can do your data science, now you can do your ML, and you're good to go. And sure, maybe the queries were slow, but if I had terabytes, hundreds of terabytes, or petabytes of data, the query could actually finish, as opposed to if I tried to chuck it into SQL Server, where it would just never finish.
Now, of course, there were standard problems with data lakes: poor BI support, complex setup, poor performance, and unreliable data swamps. I really want to harp on that last point, this idea of unreliable data swamps. You hear about data swamps, data whatever, but the context is that we pitched an idea, and I'm guilty of this by the way, so I want to call myself out on it: the idea that you could solve all of your problems by simply doing schema-on-read. The idea was that, no problem, we were done; you build your data lakes and we magically solve the problem every time. Now, for anybody who is using Presto, or for that matter Spark, Delta Lake, or any of these other systems, we know that statement is not even remotely close to true.
So what we need is somewhere in between, the best of both worlds, and that's what this concept of the lakehouse is really about. The reason why is that people say, "Well, no, let me just have one side of it running warehouses and one side of it running data lakes." BI and reporting would go off the warehouses, and the machine learning, data science, and real-time work would all go off the data lakes, and you're perfectly fine. Except what you run into is basically this really messed-up lambda architecture, where you have to reconcile the data sitting on one side versus the other side.
So this is where we go back to saying: well, then, really what you need is a lakehouse, which is the best of both worlds. I'm saying I can take the transactional consistency that I've got with warehouses, but also the flexibility that I've got with data lakes, so I can handle all the different data domains, whether that's streaming, BI, data science, machine learning, or whatever else. And in order for you to build that lakehouse, this is the context: you start with Delta Lake, which allows us to have that scalable, open, general-purpose transactional data format. That's half of the solution. What's the other half of the solution? Using a high-performance query engine, of course. Today we're going to talk about Presto, but it's applicable to all of the above. And that's the context here: this is how you can go build your lakehouse, by allowing yourself that high-performance query engine on top of a scalable, open, general-purpose transactional data format.
Cool, all right. So then, in order for Delta Lake to work, it's not just about scalable storage for your data; it's also about a scalable transaction log for your metadata, the metadata that defines what's really going on with your data. We're going to talk about that a lot more in a second. For the scalable storage, let's talk about the good old-fashioned cloud object stores: S3, ADLS Gen2, it doesn't matter. Perfect.
Well then, what is a scalable transaction log? In the same folder as your Delta Lake table, in the path of the table, we put an additional folder called _delta_log. Inside that folder is a bunch of JSON files, and then subsequently Parquet files. What is that? It's the Delta transaction log. It is a scalable transaction log: a sequence of metadata files that track all the operations made on the files that make up your table.
Okay, so you store it in cloud storage along with the table, so that it's portable. In other words, if you move the table from one system to another system, as long as you move that folder, everything underneath that table path, the transaction log comes with it. So it's portable. And because it's made up of JSON files, with a Parquet file at every 10th checkpoint, you can read and process the metadata, which tracks all the files, in parallel. That's the key thing: in other words, you're taking advantage of the distributed scale that you already have in storage to scale your transaction log. That's what Delta Lake allows you to do. All right, so it's important that these transaction log commits are done as ordered, atomic commits.
In other words, they either happen or they don't. So, for example, say I'm doing an insertion of new rows of data, and those rows of data are comprised of, let's just say, 001.parquet and 002.parquet. Perfect. In the first entry into the transaction log, 000.json, what happens is that we record the fact that the table is now comprised of 001.parquet and 002.parquet. Perfect. Then we're about to do an update. An update basically says: okay, I'm going to do an update, which ends up deleting a whole bunch of files (because it's a merge as well, let's just say), and what ends up happening is that we generate a new set of rows. Those new rows make up 003.parquet. The table now, at this checkpoint, is comprised of just 003.parquet, because we actually removed 001.parquet and 002.parquet.

The point is that at a particular version, or a particular point in time, the table was comprised of 001.parquet and 002.parquet, and then, at a later time, after the update transactionally completed, it's comprised of 003.parquet. That list of files that makes up the table, in this case 003.parquet, is included in the transaction log itself.
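To make that sequence concrete, here is a minimal sketch in SQL of the two operations just described, with comments showing the commits each one would append to _delta_log. The table, column, and file names are illustrative, not taken from the talk.

```sql
-- Hypothetical table; file names below mirror the narration.
-- Commit 00000000000000000000.json records:
--   add 001.parquet, add 002.parquet
INSERT INTO events VALUES (1, 'click'), (2, 'view');

-- The update rewrites the affected files rather than editing them
-- in place. Commit 00000000000000000001.json records:
--   remove 001.parquet, remove 002.parquet, add 003.parquet
UPDATE events SET action = 'tap' WHERE action = 'click';

-- A reader now resolves the table to just 003.parquet by replaying
-- the log, never by listing the storage bucket.
SELECT * FROM events;
```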
So when any system like Presto is attempting to read what makes up that table, as opposed to doing a file listing of your cloud object store, which can be extremely slow (the listing of objects on a cloud object store is extremely slow), what it does instead is say: let me just go to the transaction log, where that list of files is already there. In this case, 001.json contains the list of files. In this example it's only one file, but the point is that for any large-scale system it contains the list of files, and then Presto can go ahead and send its workers to read all those different files.
Okay, so that's the context, and that's why it's so important for us to have these transaction log commits: they allow us to have consistent snapshots. Either the reader, if the query from Presto occurred before the update completed, knows it's supposed to read 001.parquet and 002.parquet; or, if another query occurs a millisecond later, at a point in time where the update had completed, then for that second reader it'll say: oh, I'll read only 003.parquet. In other words, no dirty reads. There's nothing in between: when you query the table, you know exactly which transaction it refers to at that point in time, and then you'll know exactly which files make up the table at that point in time. Okay. Okay, sorry!
Because it's so important for us to provide ACID transactions, the whole context is that we do it via mutual exclusion on the log commits. Basically, if concurrent writers are trying to write to the same table at the exact same point in time, they have to agree on what the order of those changes is. Now, we typically follow optimistic concurrency control, meaning that most of the time we're hopeful that these writes will in fact not interfere. That's the idea of optimistic concurrency control.
Why do we allow for that? For example, if there's one writer that's trying to write to one partition, but another writer that's trying to write to another partition, under optimistic concurrency control that's perfectly fine: they're not actually interfering with each other. They're writing to the same table, but they're writing to different partitions, so that's not that big of a deal.

But what about when they're both trying to write to the exact same partition? Even under optimistic concurrency control, one will fail to complete, because the other one actually takes precedence. This is what we mean when we say the concurrent writers need to agree on the order of those changes. So, for example, writer one goes ahead and writes the initial transaction log commit, 000.json. Writer two, no problem, says, hey, I'll do my insertions, and they go to 001.json. So far so good. But what if they're both trying to do updates or insertions, or whatever else, to the same partition of the same table? Well then, what happens? Writer two and writer one are going to fight with each other over which one gets to commit, and in this specific example only writer two wins; writer one would fail. Depending on the situation (and this is included within the Delta transaction protocol), the client will then get to auto-retry and see if it works. Different scenarios play out differently: if, for the sake of argument, you're trying to do an update and the underlying data has changed, the reality is that you most likely will have to fail and allow the user to recognize, hey, the data is different since when you ran it; you might need to do something different now. Versus, if it's just an insertion, it probably would not interfere at all; it'll just automatically retry and do the insertion. Okay, so far so good, everybody?
Cool, all right. So, because we rely on these cloud object stores, or, for that matter, HDFS, we're leveraging the scale of these storage systems. The whole premise is that, even as we're doing all of these things, there isn't a single point of failure. That's the whole purpose of working with S3 or ADLS Gen2 or GCS: it's scalable infrastructure, so there is no single point of failure. It's not like a single node's disk that could fail; you're writing to cloud object stores. Now, in terms of storage support for concurrent writes: S3 has always allowed concurrent writes with Delta Lake, by the way, but through a single driver. For multiple drivers, HDFS, Azure, and GCS have the mutex, specifically put-if-absent consistency guarantees, out of the box, so they never had that problem.
This is a problem for any system that's running on S3. The way most systems have solved it, including how Delta solved it, is that we use DynamoDB as a lock store (not log store; lock store). In other words, it determines who gets the lock on the transaction log at that point in time, so that we can ensure the commits are done in the proper order. We're not sure when put-if-absent guarantees are going to be added to S3; we've had discussions with them for quite some time on this one. But nevertheless, included when we released Delta Lake 1.2 (was it last month or the month before?) is the DynamoDB-based S3 multi-cluster writes support, and we actually recently blogged about that as well. So, in the end, this allows you to not have a single point of failure for any of these systems.
Now, what makes Delta Lake unique? I'm not going to dive too deep; I'm going to keep it relatively high-level, because I don't want this to be just a marketing pitch. I want to show you a demo and dive into the details. But like I said before, Delta Lake's key features include ACID transactions: it protects your data with the strongest level of isolation. Because of its design, it handles scalable metadata: petabyte scale of data, with the excessively large sizes of metadata that often go with it. With time travel, you're able to access and revert to earlier versions of your data for audits, rollbacks, or reproduction, and this inherently also gives you an audit history to go with it as well.
We can ensure that either the write happened or the write did not happen; it's atomic. And so it allows us to run batch and streaming concurrent writers and readers all at the same time without running into too much trouble, which is pretty sweet. In the process of doing that, we do support schema evolution and enforcement. In other words, if you have a table and it says it is comprised of these five columns, we can enforce that: if another set of files, or a set of data, being inserted into the table decides to add a sixth column, we can enforce the schema and say, nope, you're not allowed to make changes to that table. Yet, at the same time, you can tell the table during that query: no, no, we're actually going to allow the schema to evolve. So, no problem, go ahead and do that.
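As a rough illustration of that enforcement-versus-evolution choice, here is a hedged Spark SQL sketch (writes go through a writer such as Spark; the Presto connector discussed here is a reader). The table and column names are made up, and the config shown is the Delta Lake setting for automatic schema evolution on MERGE.

```sql
-- Schema enforcement: the table has five columns, so a write that
-- smuggles in a sixth column is rejected by default.
INSERT INTO customers
SELECT id, name, city, state, zip, loyalty_tier  -- extra 6th column
FROM staging_customers;
-- => fails with a schema mismatch / too-many-columns error

-- Schema evolution: opt in explicitly, and the table's schema grows
-- to include the new column as part of the write.
SET spark.databricks.delta.schema.autoMerge.enabled = true;
MERGE INTO customers t
USING staging_customers s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```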
Other things include constraints and generated columns, so you can go ahead and partition using a generated column. For example, one of the most common approaches: say your data comes in with a timestamp, but you really want to partition not by the timestamp but maybe by the day, you know, a day, week, month type of scenario. Well then, you create a generated column to do that instead.
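Here is a hedged sketch of that timestamp-to-date pattern using Delta Lake's SQL syntax for generated columns (again on the writer side; all names are illustrative):

```sql
-- The raw event time arrives as a timestamp, but the table is
-- partitioned by a date column that Delta derives automatically.
CREATE TABLE events (
  event_id   BIGINT,
  event_time TIMESTAMP,
  event_date DATE GENERATED ALWAYS AS (CAST(event_time AS DATE))
)
USING DELTA
PARTITIONED BY (event_date);

-- Writers only supply event_time; event_date is filled in on write.
INSERT INTO events (event_id, event_time)
VALUES (1, TIMESTAMP '2022-05-01 09:30:00');
```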
And then the DML operations are applicable to Scala, Java, SQL, whatever, and they include MERGE, UPDATE, DELETE, and all these other things. So Delta Lake gives you all of those capabilities. Within the context of Presto, what we've done is we started with the Delta Standalone project, which allows you to automatically read the metadata that makes up the transaction log. Subsequently, Presto is able to read that metadata from the transaction log, and then its workers are able to go read directly from those Delta tables, natively. In the past we actually had to use a manifest file, which was great (being a little sarcastic, honestly) if you were okay with regenerating that manifest on an hourly basis, or on some very slow cadence. But if you are dealing with streaming and batch all running at the exact same time, that's actually problematic, to put it lightly. Okay, thank you, perfect.
All right, so, Delta Lake. This slide is fast-paced information; do not worry, I'm not going to ask you to read it. But the point is that we open-sourced Delta Lake at 0.1 back in 2019. In December 2021 we released 1.1, and we recently released 1.2, so we're continuously adding more and more to Delta Lake as we speak. Like I said, just released with Delta 1.2, maybe a month ago now: data skipping with column stats is now included.
The S3 multi-cluster writes that I referred to before are included, along with compaction of small files with the OPTIMIZE command, restoring a previous version using the RESTORE command, and renaming columns. More is also being added to the roadmap, and this is what we're targeting for the next version of Delta Lake, around the end of June, because we have the Data + AI Summit coming up, and hey, a bunch of the Presto folks are going to be at Data + AI Summit too. So please do join us there, whether you want to join us physically in San Francisco or virtually; I'll have a slide at the tail end of this which actually gives you a promo code as well. Included is not just OPTIMIZE, but now OPTIMIZE with Z-ORDER as well, so we're adding that capability too; that's being targeted for around Q3, I believe. And we're also including generated change data feeds as well as dropping columns.
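For reference, a minimal sketch of those maintenance commands in Delta Lake SQL. At the time of this talk, OPTIMIZE and RESTORE shipped in 1.2 while the ZORDER clause was still on the roadmap, so treat the second statement as targeted syntax rather than settled API; the table name, column, and version number are illustrative.

```sql
-- Compact many small files into fewer large ones (Delta 1.2+).
OPTIMIZE events;

-- Compact and co-locate rows by a commonly filtered column, so that
-- data skipping can prune more files (roadmap item at the time).
OPTIMIZE events ZORDER BY (event_date);

-- Roll the table back to an earlier version if a bad write lands.
RESTORE TABLE events TO VERSION AS OF 12;
```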
So we have a lot of really cool things that are still in the woodworks for us, and hopefully you're able to go ahead and dive into those details. And so, let's see. Oh, okay, before I go into the demos, I just want to say: look, Delta Lake covers many different cloud platforms, with many different APIs and languages, with many different query and SQL engines, and many different ETL and streaming engines. So it is a system that has a very broad integration and connector ecosystem.
That's really cool for you to work with. So, by all means, if you have any questions about any of these things, you're more than welcome to join us. We also typically have Delta Lake community office hours every two weeks; in fact, the next one's on the third, I believe. That way the community can definitely ask us any questions. And so, what are some of the key native connectors?
The one we're going to show right now is the Presto one, which includes a Delta reader that was first included in PrestoDB 0.269, which is pretty cool. All right, so, let's see, let me switch to demo mode, and then I'll probably jump a little back and forth. Okay, so, all right, you're all familiar with this view, of course. And in this case, thank you, Rohan, for letting me leverage Ahana to run the queries against; it allowed me to make my life a little bit easier.
All right, perfect. So here we go; everybody's familiar with good old-fashioned Presto. I've already connected to that environment that Rohan set up, and you'll notice the catalogs here: basically I've got multiple schemas that I can work with. I'm going to specifically query the available schemas within my Delta Lake catalog. So far so good. And by the way, if you're wondering, I'm doing a little bit of switching back and forth, so apologies for that, but I figured it was easier. I've got a very large screen, so I don't want to share my full screen, because then usually nobody can see anything.
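The statements behind this part of the demo look roughly like the following Presto SQL. The catalog and schema names are assumptions based on the narration, not the exact ones on screen; the catalog name depends on how the Delta connector's catalog properties file is named.

```sql
-- Browse what the Delta connector exposes.
SHOW SCHEMAS FROM delta;
SHOW TABLES FROM delta.tpch;

-- Peek at a table's columns before querying it.
DESCRIBE delta.tpch.customer;
```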
But basically, if you go back to this here, you can see that the queries came through perfectly fine in my Presto cluster right here. So, no problem. I'm going to go back to my terminal here; there we go, boom. All right, so what's next? Let's show the tables. And as I'm running these queries, by the way, you're more than welcome to ask any questions; again, you can unmute yourself or you can put them in chat. I'm going to specifically take a look at the customer Parquet table, so let's see what's inside there. Okay, so let's definitely take a look at what's inside this table. All right, I think it's running a little slow today, but only because I think Rohan and I probably set it up using spot instances, so we brought that one on ourselves, didn't we?
C

So, Danny, I have a question about the querying part, actually. What I want to ask is: suppose Presto is trying to query a Delta table. Is it only reading the latest JSON file from the transaction log, or does it read all the files from the transaction log?
A

Oh, that's a great question. So, in the end, what it typically does is put all the transaction log state in memory, but what it reads is the Parquet checkpoint files plus up to the last nine JSON files. That also reflects the fact that when you get to the tenth commit, it becomes a Parquet checkpoint file again. The reason it does that is so that it has the full historical context of what the table is. Now, saying that: if you only query the last JSON file, then the list of the files that make up that table is contained inside it. So, exactly to your point, you can theoretically just do that; that's actually how we generate the manifest files themselves, initially. And so, take what I'm running right now.
This query, for example, this query here, where I'm querying the table and just looking at the results: what it's basically doing is looking at the last JSON file to get what the table is comprised of, and then Presto goes ahead and queries the files that make up that table.
Correct. So, basically, when we make up the transaction log, there are multiple JSON files that make up the transaction log, but at the tenth file a checkpoint kicks in. That checkpoint basically says: okay, take the previous 10 JSON files and make them into a Parquet file. That way, if your cluster crashes, as opposed to trying to iterate through all those JSON files, it can just iterate through the Parquet files. That's the context. Makes sense?
C
A
Honestly, it was relatively arbitrary. I mean, it was basically the fact that we knew we didn't want to have that many JSON files, because it would slow down the reading of the metadata; it would slow down the reading of the transaction log. That's the reason why you'd definitely want to be able to checkpoint like that. So we just arbitrarily chose 10. That's really it.
That's correct. So, specifically, there's a command called VACUUM that you can execute that basically clears out any old files, whether that's data files (i.e., Parquet data files) or your JSON log files. The idea is that by default, when you run a vacuum on its own, it will clear out anything that's older than seven days for data and anything that's older than 30 days for logs. Now, the key important thing: it's not removing data the current table still needs. If you've got data from, like, five years ago that's inside there and still part of the table, it's not removing anything.
That's the reason why you want to go ahead and clear that out. Now, there are plenty of people that actually leave things accumulating for an extended period of time. In batch scenarios, I think that's perfectly fine, but in streaming scenarios I would definitely not advise that; I would definitely advise clearing these things out. Yeah, makes sense.
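A minimal sketch of that retention behavior in Delta Lake SQL, run from the writer side (the table name is illustrative; note that retaining less than the default requires disabling Delta's retention safety check):

```sql
-- Default: remove files no longer referenced by the current version
-- of the table that are older than the 7-day retention window.
VACUUM events;

-- Preview what would be deleted without deleting anything.
VACUUM events DRY RUN;

-- Aggressive cleanup for high-churn streaming tables: keep only 48
-- hours of history. This also limits how far back time travel works,
-- and requires spark.databricks.delta.retentionDurationCheck.enabled
-- to be set to false.
VACUUM events RETAIN 48 HOURS;
```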
No problem. So, for example, I'm going to go ahead and run this query, and I'm also going to answer Dennis's (or Denis's) question here. Okay, so I'm going to run a quick query just to get the count of the table right now. And to your point: yeah, basically right now it's going to the most recent (the last) JSON file to determine what files make up the table, and then it basically goes and queries and comes back with the results.
So it is currently planning and running right now, and, like I said, it's probably a little slow today, so my apologies for that. Saying that: Dennis (or Denis), you've asked the question, "As far as you know, views are not supported by the Delta catalog. Are there any plans to support views?"
So, one of the things I would ask you to do, though, is to go ahead and create a GitHub issue on the Delta GitHub, so that people can vote on it. My take on stuff like this is that a lot of the asks, we prioritize based on what the community feedback is. Right now there have been massive asks for things like dynamic partition overwrites, S3 multi-cluster writes, the change data feed capabilities, and so forth and so forth. So we've been focusing more on those. It's less of a "we don't want to do it" and much more of a "what is the community asking for"; that's what we, together, are working on.

Now, just to finish up this query here, I did want to call out that, basically, when I query the customer table, it currently has about, that's right, 2.6 million rows inside. Okay. So, excuse me, no problem at all.
But what happens if I go ahead and query an earlier version of this table? Well, all I have to do... oh, of course, when I run it, that's when it fails. Duh. This is what happens when you start doing live queries. There you go. So in this case I'm taking the same query that I just specified, but I'm now adding this @v1: this tells it the version of the table.
You can actually also specify the timestamp, but it's easier to write when I just use the version number, so that's what I'm going to do. The context, basically, is that by adding that @ sign I can choose the version of the table. There are multiple versions of this table, so right now it's querying not the current version of the table but the second version, because v0 is the first.
So the second version, v1 of that table, is actually being queried right now. This will come back with a much higher number, because what happened is that in between version one and the current version of the table, we basically deleted a bunch of rows from that table. And so, back to your initial point about the vacuuming: what happens if version one of this table is really, really old? Like, say it's older than seven days. It is possible that when I run the vacuum, I'll delete the files that allow me to calculate the fact that there are 15 million rows inside that table at that version, because we no longer need them.
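Here is roughly what those time-travel queries look like in Presto, following the @ syntax demonstrated here. The schema and table names echo the demo's customer table but are assumptions, and the exact identifier quoting may vary by connector version.

```sql
-- Current version of the table: about 2.6 million rows in the demo.
SELECT count(*) FROM delta.tpch.customer;

-- The same table as of version 1 (v0 is the table's creation),
-- which still counts the rows deleted by later versions.
SELECT count(*) FROM delta.tpch."customer@v1";

-- Any other retained version works the same way.
SELECT count(*) FROM delta.tpch."customer@v5";
```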
So, if you want to keep specific versions reachable, there are definitely other options in play, but the idea is that if you don't want that table's history to grow too much, because you don't care about the old history, then you can run VACUUM to basically clear it up. So, as you see, that's pretty cool. Now, in this case, because I've got multiple versions of the table sitting in here, I can basically just specify, for example, version five.
And I can see the number of rows that's inside there as well. So that's the ability to go ahead and look at the different versions of the table, and clearing out the versions you no longer need will basically make things a lot more performant as well. Oh, and yeah, to follow up on the question about the Delta or Presto repository with respect to views: part of the work will definitely be in the Presto repository, but part of it is actually in the Delta repository.
So, remember, the reason why I'm saying part of it is Presto and part of it is Delta is this. What Presto normally does is say: hey, let me look at the catalog, and the catalog will tell me the initial location of the table. I specify the table name and it tells me the base location. That's great. But it also often tells me other pieces of information, and in the case of Delta, we don't want that. We only want you to specify the table name; that gives the table path, and then, subsequently, Presto uses the Delta Standalone project, the Delta Standalone reader, to say: okay, I have the name, it tells me the path; now, with the path, go get that information. And what it's doing is reading the transaction log to get that information. So that's why I'm saying that part of the work when it comes to views will definitely be on the Presto side, in terms of saying: hey, Presto, here's a view, and the view is comprised of, what, a bunch of SELECT statements against, let's just say, a table. But then that table itself needs to be translated into "how do I query that" from the standpoint of Delta, which is, for example, working off the query path.

To use a very explicit way of showcasing this idea: instead of specifying the table name, I can literally specify the path. So, instead of going SELECT * FROM the Delta catalog's TPC-H SF100 table by name, I'm saying: no, no, just give me delta, then "$path$" (you actually specify it like this, dollar path dollar), then the S3 path. So this is actually an S3 path with the files that are inside it. Now, this is an extremely small table, so you can just query it directly from the path. And this path that you're seeing here, in essence, is what's actually inside the Presto catalog. So that way, when we specify the Delta table: got it, simply give me the initial path, then tell the Delta Standalone project to go get the metadata from there, because the metadata would be under that path, in the sample table's _delta_log folder. That's the context.

So, in the end, this is the quick callout I wanted to show you about how cool the Presto Delta native reader is: you're able to make sense of the table, query it, and run natively, without actually generating and reading a manifest file.
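A hedged sketch of the path-addressed form described above; the bucket and folder names are placeholders, and the $path$ spelling follows the narration.

```sql
-- Query a Delta table through its registered catalog name...
SELECT * FROM delta.tpch.customer;

-- ...or point the connector straight at the table's storage path.
-- The connector locates <path>/_delta_log and reads the file list
-- from the transaction log, with no manifest involved.
SELECT * FROM delta."$path$"."s3://my-bucket/sample_table";
```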
Okay. And Rohan asked a wonderful question. Let me go ahead and switch back to my slides, because I did have a couple more things to show before I go back. Oh, and by the way, here it is again: thank you again, Rohan, for the Ahana cluster that allows me to showcase all the queries that we're running and all the active workers and so forth and so forth. And the question: is there a survey coming up for 2022 for users to request new features or provide feature feedback?
Yes, we will be having a survey; we're probably targeting around August or September for the next Delta Lake survey. You'll get free t-shirts, by the way, free swag, when you go ahead and fill out the survey. But between now and then, like I say, you're more than welcome to go ahead and ping us in the GitHub issues and get all that information from there.
Okay. Seeing that, I did want to finish up by talking about the incredible scale of Delta Lake. We're talking about more than 450 petabytes of data processed a day on Databricks alone; 75% of the data scanned is all on Delta; and there are more than 5,000 companies running Delta Lake in production, which is pretty cool.
Oh, sorry, there we go. So, how do you engage with us? Just like I've been implying and calling out before: you can go to delta.io; we have the Delta Users Slack, the Delta Lake YouTube channel, and the Delta Users Google Group; definitely ping us in the Delta Lake GitHub issues; and there's the Delta Lake LinkedIn and the Data + AI online meetup.
So there are lots of ways to engage with us, and that didn't even include our Stack Overflow tag as well. And then, like I said, we have community office hours, or AMAs, every two weeks. For example, this one that I just posted here is from February 17th. Why? Because I wanted to call out a cool appearance by Apple's Dominique Brzezinski, who has been involved with Delta Lake since its inception. In fact, there's a good, cool story about him and Michael Armbrust getting together during Spark Summit 2017, specifically to go work on what ended up becoming Project Tahoe, which itself became Delta Lake. So, like I said, that's how to engage with Delta Lake; go make use of this stuff.
The key thing I did want to call out is that, not just about Delta Lake, there are plenty of sessions from Presto, from our friends there, as well at Data + AI Summit, which is at the end of June, June 27th through the 30th. It is a hybrid format, so you're more than welcome to join us virtually, or, if you're in San Francisco, you're more than welcome to go ahead and join us physically there in San Francisco.
But if you're going to join us physically, sign up soon, because tickets are actually running out. Saying that, I do have a code here, this D-A-I-S one, for 25% off the conference pass and of training as well. So, of course, I should spell better, but nevertheless, that's the context. So, that's it for me today.
B
Awesome, Denny, what a great presentation and demo; appreciate it. Thank you so much. I don't know if somebody's raising their hand because they have a question; if you do, you can unmute yourself, or you can put it into the chat and then we can ask it.
D
Yeah, so this is Surya; a couple of questions here. So, right now we are using the open-source version of Delta Lake, and what we're trying to do is get incremental data from Kafka, and then we are merging that data into Delta Lake, which has a large number of records, right? So...
C
D
What we found out was that initially, when we start the process, when there isn't much Delta log, the merge process takes about 12 seconds or something, and later on, when the Delta logs increase, the merge process keeps on increasing: the 12 seconds will turn into 60 seconds after two weeks of time, right? Sure, yeah. So we are also running this compaction process, where we merge these smaller files into larger files, and, you know, we also run the vacuum once every day.
D
Is there something else we can do, you know, to optimize this? We also have partitioning on the Delta lake as well.
A
Okay, so, without rat-holing, let me provide... I'm actually going to answer your question, but I want to give you the shorter version, because the longer version is going to take up much more time. Basically, first of all: definitely do chime in on the Delta Users Slack. All of us are actually there answering questions exactly like the one that you've asked. Now, specifically, because you're running a merge as you're writing into it:
The fact is, it really depends on whether the merge is actually looking at all of the historical data or only at the current data. If you're looking at all of the historical data, candidly, there isn't much you can do except increase the number of workers involved, in order to be able to handle the load that's coming in, because of the fact that you're going back to historical data and making changes, whether that's an update, a delete, or whatever else is included with that. Now, saying that: are there ways to, for the sake of argument, increase the number of workers even if you're not increasing the number of nodes, or to partition slightly differently? These are all definitely in play and can help things out. The other thing to note as you're debugging is that you want to figure out how often the merge is happening.
If the merge is happening, for the sake of argument, specifically within the context of the current day's data, and there actually isn't that much history involved, what I'm inclined to say is that you probably want to run the vacuum more often than just once a day. In those typical situations, you probably even want to set up a second cluster; it's small, probably even just a single node, honestly, which will simply go ahead and vacuum up the files to decrease the number of both log files and data files, thus reducing the sizes of everything, such that your merge can run faster. Okay, so, like I said, that was the short version; there's a longer version where we'd probably really need to understand how to debug your scenario a little bit better.
So it's not that... like I said, I do want to answer a couple of other questions, but by the same token: definitely make use of the Delta Users Slack, because those types of questions often require us to dig into the details, and it'll be a lot easier for us to dive into those scenarios there. If that's okay with you? Yeah.
B
A
Oh, there we go. Okay, no problem. And Dennis (Denis) asks: after adding Z-order support to Delta Standalone, will it work for the Delta connector? Yes... no, it will not, not on its own. How do you make use of the column stats to determine which files are to be scanned, which is basically what Z-order is built around? That's what you'd need to work with. So this is definitely, again, a two-parter: there are parts of it that we can definitely do in the Standalone, but there are parts of it that we actually need to go do with the Presto community.
B
Hey, just out of curiosity, while we wait for another question to come in: who's using Presto with Delta Lake today? If you want to just do a quick virtual hand raise, I'd be interested to see how many folks today are using it, or may be thinking about using it. Surya, we know you are, okay. Cool; Dennis too, awesome. Okay, good, it's good to see, and hopefully, after this, even more.
B
A
Definitely, yeah. Like I said, definitely ping us on the Delta Users Slack; I just dropped it in there. If you have any other questions: we love working with the Presto community, and we hope to see more of you involved, so that we can go ahead and work on more features together, faster. So, absolutely.
B
All right, well, I think that's it; it looks like we don't have any more questions. So I will reiterate: two great events coming up. We have the Data + AI Summit, with a ton of great sessions around Delta Lake, and I think there are a few Presto ones in there too, which is cool. And then PrestoCon Day, fully virtual, happening in July. We'd love to see folks at both of those; I think it's a great crossover between the two communities. And with that, Denny, thank you again.