From YouTube: Delta Hack: Introduction to Delta Lake
Description
Apache Spark is the dominant processing framework for big data. Delta Lake adds reliability to Spark so your analytics and machine learning initiatives have ready access to high-quality, reliable data.
A: All right, welcome everybody. This is the first session of Delta Hack 2021. Delta Hack is a little hackathon that Denny and I organized to help grow the Delta Lake open source community. We also thought it would be something fun to do after everybody recovered from Data + AI Summit, which Databricks hosted a couple of weeks ago. We're going to get this session started with an introductory talk about Delta Lake with Denny. If you've got any questions, throw them in the chat and I'll interrupt Denny and ask them along the way. Otherwise, Denny, take it away.
B: Thank you very much, Tyler, I really appreciate the introduction. Just like Tyler noted, it's been a crazy few weeks post-Summit, but we got all excited because we had a lot of questions and a lot of asks about Delta Lake, and that was the brainchild behind this Delta Hack. Thank you, Tyler, for pushing this along. As the title says, for this introduction to Delta Lake we're going to start off with a real quick, high-level "hey, how are things going?"

B: Here's the quick context about data reliability for data lakes. That's what today's initial session is. Don't forget that Tyler, later on, is going to be showing a Delta hack himself, where he goes ahead and codes away, and then we also have Steven Yu coming on board to show you how to create your Delta table in a short amount of time as well, in under 10 minutes. The session's 30 minutes, but he'll actually do the table part of it in 10 minutes.
B: So, just like Tyler noted, if you've got questions, please drop them in the chat and he'll interrupt me. It's probably a worthwhile endeavor to interrupt me anyway, so not a big deal. And so let's start with the key context here on what the promise of the data lake is. To provide a little context, my background is that I'm originally a database guy. I used to be part of the SQL Server team at Microsoft, and yet I had shifted over to Hadoop.

B: A little more background: I was actually on the nine-person incubation team, known as Project Isotope, that introduced what is now known as HDInsight, and so brought Hadoop into Microsoft. So we loved this concept of a data lake, even though I was a SQL Server person.
B: Originally, this idea of a data lake, being able to hold semi-structured and unstructured data and all these other things, was really, really great. It allowed us to collect everything, and that was the whole point: it wasn't just about structured data anymore. I could collect your streams, your sensor data, semi-structured email, videos, whatever else. I just store all this in a data lake, and I'm good to go and I'm happy. And then, because it's all in one place, I can go ahead and run my data science and machine learning, doing things like recommendation engines, things of that nature. So I'm happy.

B: All right, so within the realm of Hadoop I should have been happy, because I've got this data lake, or, for that matter, cloud object stores, whether it's S3 or Blob Storage or GCS or whatever else. The idea is that I could just chuck everything there and I'm good, right? So I'm happy, right? Well, not exactly. It's the old adage of garbage in, garbage out. Basically, yeah, we collected everything, including a lot of garbage, which means garbage in and then garbage stored, which means your data science, machine learning and things of that nature aren't actually terribly good.
B: So, in other words, the output that you got from your AI, from your data science and machine learning, wasn't high quality, because the data that you had wasn't high quality to begin with. So it doesn't really help out that much. (Oh, I just realized I should have turned off my video; that way you can see the garbage out. All right, there we go, perfect.) And so, what does a typical data lake project look like? Forget about Delta Lake just for a second; let's just talk about what a data lake project looks like. Well, the evolution of a cutting-edge data lake often looked like a lambda architecture. Now, mind you, we're talking about the lambda architecture specifically for data warehousing-style or OLAP-style processing. So you've got your events going into Kafka, or into your other system, whether it's Kinesis or Event Hubs or whatever else; I'm just using Kafka as the open source example.
B: For the sake of argument, let's say you then use Apache Spark with Structured Streaming, so I can go ahead and do my streaming analytics. That's awesome: I can process the data in real time, or close to real time, see what's going on, and see the trends happening immediately. That's often good for things like real-time advertising analytics and things of that nature.

B: Well, what if I wanted to look at historical queries? The whole context of using something like Structured Streaming to process the data for your streaming analytics was this concept of saying: okay, I'm actually not storing all of the data, I'm looking at the data as it comes in. That's all I'm doing. I'm not actually trying to look at all the data, because if you're trying to look at all the data of all time, that's terabytes or petabytes of data, so you're not going to be able to do that. What I'm doing with Structured Streaming is just asking: what's the current set of data that I have? And so, if I want to do historical queries, what I will have instead is a data lake, which stores all of the old data; that's where we're back to the terabytes or petabytes of data. And so that's what the beginning of this lambda architecture was.

B: You're basically splitting it off. I'm going to multicast my data out from Kafka, in this example, and I have one engine that's processing it for streaming purposes and one engine that's processing it for historical purposes. You can do that, but it starts getting complicated because of the maintenance that goes with it. (I'm just turning the camera back on so you can see my expressions.) You can tell this starts getting a little bit more complicated, to say the least. This is the context of the lambda architecture I'm referring to here. And so now I've got two streams: one for streaming, one for my historical data.
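As a minimal sketch of the streaming leg of that lambda architecture in PySpark (the broker address and topic name are assumptions for illustration):

```python
# Read the current events from Kafka and compute per-minute counts,
# without ever storing or scanning the full history.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, window

spark = SparkSession.builder.appName("streaming-leg").getOrCreate()

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "events")                        # assumed topic
    .load())

# A near-real-time trend query: events per one-minute window.
per_minute = (events
    .groupBy(window(col("timestamp"), "1 minute"))
    .agg(count("*").alias("n")))

(per_minute.writeStream
    .outputMode("complete")
    .format("console")
    .start())
```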
B: Well, how about when you get messy data? This initial concept works really well if you have clean data, and I'm sure Tyler, and anybody else that's on today's session here, could vouch for the fact that, especially in production, you don't have clean data. Your data has issues: cleanliness, corruption, bad code, all of the above. So you need to figure out how to run a validation step. Basically, as your stream is running in the lambda architecture (the top part here), it's going ahead and checking, maybe doing some numeric validation like the counts per minute or per hour or whatever else, and you also validate that against your data lake as well, or at least against the job that's loading into your data lake. That'll allow you to validate that the numbers actually match each other really well. Okay, so that's completely cool.
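A hedged sketch of what such a validation step could look like (both table names and their columns are assumptions for illustration):

```python
# Compare hourly row counts between the batch load in the data lake and
# the counts recorded by the streaming side; any divergence is flagged.
lake_counts = spark.sql("""
    SELECT date_trunc('HOUR', event_time) AS hr, count(*) AS n
    FROM lake_events
    GROUP BY 1
""")
stream_counts = spark.table("stream_hourly_counts")  # assumed: columns hr, n

mismatched = (lake_counts.alias("l")
    .join(stream_counts.alias("s"), "hr")
    .where("l.n <> s.n"))

if mismatched.count() > 0:
    raise ValueError("streaming and data lake counts do not match")
```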
B: All right, so that's this validation concept. Well then, how about if there are mistakes and failures in the processing of this data? Because even if your validation steps work well, you're going to need to reprocess data at some point, because there were mistakes in the code, there were failures in the system outright, or, for that matter, the data just changed on you and you weren't expecting that.

B: So, even if the code was perfect for this point in time, a data lake, or any data system for that matter, is constantly changing. It's a flowing system. It's not one of those systems where you can just be static. Even back in the old data warehousing days, your data warehouse never stopped changing, and you never stopped updating the schema; things are going to change. Same context here with a data lake, except that context is even more on steroids, because you have that much more data to work with. So I'm going to have to reprocess my data. For example, I update my Apache Spark jobs, and I partition the data by time so that I can reprocess specific partitions. Okay, perfect. Well, that eases things, and I can certainly go do that.
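As a rough sketch of that pattern (the paths and the fix-up function are assumptions): because the table is partitioned by date, one bad day can be recomputed and overwritten in place.

```python
# Reprocess a single date partition: recompute it from the raw source
# with the corrected job and overwrite only that partition.
def fix_up(df):
    # hypothetical corrected transformation logic
    return df.dropDuplicates(["event_id"])

day = spark.read.parquet("/raw/events/date=2021-06-01")
(fix_up(day).write
    .mode("overwrite")
    .parquet("/lake/events/date=2021-06-01"))
```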
B: This reprocessing will take care of this part of the context. But how about if I have to update the data? In other words, everything was working fine, but the upstream system is saying: hey, wait a minute, the data that came in actually isn't quite right. Originally I told you yesterday we sold 20 widgets, but now it's actually 30 widgets, or whatever.

B: So, in other words, I need to be able to update the data. Prototypically, a data lake is very much an append-only system: you're just constantly adding data, you're not trying to actually update it. But the reality is, if you want to make it useful, make it powerful, you're going to actually need to update some of the data at some point, because that's what's going to be reported on. You could always try to make it additive by doing a negative; it's an old data warehousing trick where we wouldn't update the data. We would just simply say: oh, 20 widgets today. Oh, you actually only sold 10 widgets? Okay, well, we'll just add a negative 10 widgets, and that way we can "update". So sure, you could do that or whatever else, but the point is that you do need to update the data. And then, even if you could update, you have to schedule it correctly so you can avoid modifications to your reports or the downstream systems while they run. So, like I said, lots of fun.
B: Well, that's absolutely more of the context. The context is that, as you productionize your systems, the reality is this diagram does get much more complicated, and that's more or less the issue. I'm guilty of this too: when we initially built Hadoop on Windows, and then Hadoop on Azure, which was part of my team, we were thinking to ourselves, yeah, we'd be good.

B: All of these challenges, in one form or another, reared their ugly heads, whether it was for streaming data or the fact that I had to reprocess the data. The real work wasn't going in and saying I could store everything; the real work ended up being that now I have to build all of these systems around it to compensate for these challenges. So, exactly to your point, Tyler.
B: Okay. If I was to take these problems, and all the other problems we have there, there are some key generalizations (this is probably too much of a generalization, but still). The first is no atomicity, which means that failed production jobs are going to leave data in a corrupted state, and it becomes really, really complicated to recover. This is for many of you who have ever run any production job; I don't care on what system.

B: The problem with that approach is: what happens when one of the 20 tasks that the job runs fails? Well, in your data lake, whatever distributed system is running, it's constantly writing files to storage. Again, it doesn't matter what storage you're talking about, whether it's cloud object storage or HDFS: it's writing stuff down. It can't keep everything in memory, and since you can't keep everything in memory, you've got to write it to storage.

B: Well, if the task fails, what ends up happening is there's a bunch of files that have now been written to storage. If the job failed, the system just shuts itself down, you lose network connectivity, whatever happens: there's a bunch of files that have been written to storage. That's not good, because now you've got these orphan files, or straggler files, sitting in storage with your good data, and you're going, oh great.
B: So if I query that table, or for that matter that data in the file system, I don't know which files are actually good and which are bad. And so this is the real importance of this context of atomicity.

B: Next, there's no quality enforcement. I sort of implied this when I was talking about the challenges of your data lake: the problem is that I'm just dumping the data in. This is an old context that we used to advocate for, which I'm guilty of as well, by the way, which is schema-on-read. I don't have to worry about what the schema looks like; I'm just going to put the data down, and then I'll define the schema when I read the data. So I don't have to worry about it, and I can get the data written to my data lake as fast as I can. So that's great.

B: But let's just say the schema of the data is supposed to have X number of columns, and of those X columns, this many of them are integers versus strings or whatever else. I need to be able to enforce that type of quality.
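A small sketch of what that enforcement looks like once you have it, here with a Delta Lake table (the path is an assumption): an append whose schema doesn't match is rejected instead of silently landing bad files.

```python
# The second write changes the type of `id`, so Delta rejects the append
# rather than corrupting the table.
ok = spark.createDataFrame([(1, "click")], ["id", "event"])
ok.write.format("delta").save("/tmp/delta/events")

bad = spark.createDataFrame([("oops", "click")], ["id", "event"])  # id is now a string
try:
    bad.write.format("delta").mode("append").save("/tmp/delta/events")
except Exception as err:
    print("append rejected by schema enforcement:", err)
```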
B: And then there's no consistency and there's no isolation. Because there's no consistency or isolation when it comes to your data lake, it basically becomes nearly impossible to mix appends and reads, batch and streaming. Remember, now I'm basically talking in the realm of databases all over again, because you'll notice that what I've basically done is talk about ACID: atomicity, consistency, isolation and durability. For those of you who are not database folks, which is perfectly cool, this concept is that what was great about databases, and what made them the flavor du jour when it came to business data processing, was this idea that we had this one system, the database, that allowed us to ensure that if two clients were trying to write to the table at the exact same time, one trying to do an update, one trying to do a delete, nothing would conflict.

B: Well, this is even harder when it comes to a data lake, because you're going to have not just lots of concurrent batch jobs, but also streaming jobs, trying to write the data at the same time. So you've got a Structured Streaming job that's writing data into that file system every second or so; meanwhile, I'm also trying to go ahead and say: oh, maybe I need to update a value, because the widget number was originally 20 and it really should be 10. Well, how do you ensure, if there's nothing locking the file system, that what the streaming job is doing and what the update is doing aren't going to conflict with each other? That's what databases were really good at doing. And so the context, or what I'm implying very strongly, is that, man, it'd be pretty cool if we could provide ACID consistency to my data lake.
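That is the guarantee Delta Lake, introduced next, provides. A hedged sketch (paths, the `widget_events` stream and the column names are assumptions):

```python
# Both writers commit through Delta's transaction log, so the streaming
# append and the batch update are isolated from each other: a genuine
# conflict surfaces as a clean error rather than as corrupted files.
from delta.tables import DeltaTable

(widget_events.writeStream          # assumed streaming DataFrame
    .format("delta")
    .option("checkpointLocation", "/tmp/ckpt/widgets")
    .start("/delta/widgets"))

# Meanwhile, a batch correction: the count was recorded as 20, should be 10.
(DeltaTable.forPath(spark, "/delta/widgets")
    .update(condition="widget_id = 42", set={"sold": "10"}))
```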
B: That way I could go ahead and prevent failed production jobs from corrupting my system, enforce some form of quality, and have consistency and isolation. And that's the introduction to Delta Lake: that's what Delta Lake does. The reason I called the session this, and if you look at other sessions you'll notice they're called things like "Making Apache Spark Better with Delta Lake" or "Bringing Data Reliability to Your Data Lake with Delta Lake", yada yada, is that this whole context is about saying that I can bring ACID consistency protections, transactional protections, to my data lake. That way, all the problems we were challenged by, all the issues that we just talked about before, get resolved automatically for you.

B: And look, we can go back and look at the challenges again. Remember, for example, back to the details: it's a lambda architecture, we have the validation, we have the reprocessing and the updates. Okay, well, in reality it's not that simple!
B: It's actually even more complicated, because each one of those four diagrams is basically impacted by the arrows that you see here. You don't just have one Kafka stream; you have a Kafka, a Kinesis, you've got your batch data from other sources, you've got other Kafka streams. And so what I'm alluding to here, by the way, with the color coding, is that you really have a bronze/silver/gold data quality framework that you're trying to create.

B: So first you have the bronze. That's the initial dump of data from your different sources: your Kafka, your Kinesis, data like CSV, JSON, text, whatever else. Then you're going to transform that data into a silver level, and silver means, from a data quality perspective, that I'm going to augment it, I'm going to filter the data, and I potentially will do joins to it. The point is: now here's the data on which I am enforcing the schema and cleaning it up. Sure, I have some minimal enforcement at the bronze level, but the point is that with bronze I'm still just trying to get the data in, in that context of "oh yeah, I really care about getting the data into my data lake as fast as I can". But the silver allows me to say: okay, I'm going to do a bunch of tasks to clean up that data before I hit gold.
B: The gold data set is basically your feature engineering for your machine learning, or your aggregates for your BI tools. It'll aggregate or summarize the data, or whatever else, for the purpose of streaming analytics and AI reporting. So you want to be able to create your data pipelines to take that data from bronze to silver to gold, and ensure that if any errors occur throughout the system, at any one of these arrows, you can say: okay, no problem. I can restart the process again, and I can trust that what I write to a subsequent system downstream will in fact be solid. And so that's the key thing. I wish it was this simple, and I'm being very facetious, I realize, but remember: this is just for one Kafka source and the things that connect to it. In reality, it's all of this. So I really need that type of consistency to protect the data as it progresses.
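As a compact sketch of that bronze-to-silver-to-gold progression with Delta tables (paths and column names are illustrative assumptions):

```python
from pyspark.sql.functions import col, sum as sum_

# Bronze: land the raw data as fast as possible, minimal enforcement.
raw = spark.read.json("/landing/events")  # assumed raw source
raw.write.format("delta").mode("append").save("/delta/bronze/events")

# Silver: filtered, cleaned, augmented; schema enforced by the table.
silver = (spark.read.format("delta").load("/delta/bronze/events")
    .where(col("event_type").isNotNull())
    .dropDuplicates(["event_id"]))
silver.write.format("delta").mode("append").save("/delta/silver/events")

# Gold: business-level aggregates for BI and feature engineering.
gold = (spark.read.format("delta").load("/delta/silver/events")
    .groupBy("customer_id")
    .agg(sum_("amount").alias("total_spend")))
gold.write.format("delta").mode("overwrite").save("/delta/gold/spend")
```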
B: So that's the medallion architecture, and Delta Lake itself: you have full ACID transactions, and you can focus on the data flows instead of worrying about your failures. The key context also for Delta Lake is that we're open standards and open source; we're part of the Linux Foundation. You may or may not be able to see it, because the coloring scheme is a little off here, but there's actually a Linux Foundation logo there. And so the key context is: you can store petabytes of data without worries of lock-in.

B: It's a growing community, including Presto, Spark and more. Michael Armbrust announced Delta 1.0 during one of his keynotes at the Data + AI Summit; I probably should have slipped that slide in here, because there are now so many different systems. Hats off to Tyler and the rest of the Scribd and Back Market team, who introduced the Delta Rust API with its Python bindings and upcoming Ruby and Golang bindings. So Delta does not require you to use Spark at all to work with Delta Lake. That's the whole point: it's an open source protocol, right on GitHub, so you can go ahead and see exactly how it works, and there are more and more partners daily that are integrating with Delta Lake.
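For instance, here is a sketch of reading a Delta table with no Spark at all, via the `deltalake` Python package built on those Rust bindings (the table path is an assumption):

```python
# Read a Delta table without Spark, using the delta-rs Python bindings.
from deltalake import DeltaTable

dt = DeltaTable("/delta/silver/events")  # assumed table path
print(dt.version())   # current version in the transaction log
df = dt.to_pandas()   # materialize the table as a pandas DataFrame
```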
B: Open source or commercial, it doesn't matter: because we are working with open standards, and because it's open source, you're never going to get locked in to any particular vendor by using Delta Lake. Quite the opposite: using Delta Lake ensures that you can use whichever vendor you want. And yes, at the same time as I call out that you don't need to use Spark, it obviously is powered by Spark as well.

B: This would really require a much longer conversation, but what it boils down to is the context of exactly-once semantics. Structured Streaming brought you most of what was needed, but ACID transactions were the final step that allowed exactly-once semantics for your system, end to end. So it really allowed you to unify your streaming and your batch, and it allows you to convert existing jobs with minimal modifications; I'm actually going to talk about an example of that shortly. That's what's really cool about this.
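A minimal sketch of that end-to-end pattern (paths and the `events` stream are assumptions): the Structured Streaming checkpoint plus Delta's transactional commits mean a restarted job neither drops nor duplicates records.

```python
# Exactly-once, end to end: checkpointed Structured Streaming writing
# into a Delta table.
(events.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/ckpt/bronze_events")
    .start("/delta/bronze/events"))
```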
B: Right, like I said, these are the data quality levels: your bronze, which is your raw ingestion from all of these different sources; then it switches to silver, which is your filtered, cleaned, augmented data; and then you have your gold, your business-level aggregates, such that you can do your streaming analytics and AI reporting. So it massively simplifies all of those challenges, and that's more or less the key call-out.
A: I do have a question. When you're thinking about this move from bronze to silver to gold, in your experience, or as you look at designing Delta tables, is that sort of less structured to more structured data? Because I know I've seen some Delta Lake users that are basically using that bronze level for "here's some raw data and some metadata around our ingestion", and they dump that into bronze. And so they go from, not necessarily a quality change, although there are quality changes along the way, but they go from less structured.
B: I think the less structured to more structured progression definitely can come into play, and, depending on how you look at it (now we start getting into almost a philosophical debate), in a lot of ways, from the standpoint of AI, reporting and streaming, your data is "higher quality", quote-unquote, when it's structured. Now, I'm not trying to knock semi-structured or less structured data, quite the opposite: I actually like it. But I'm also saying that for most analytics, AI reporting and data science feature engineering, they basically need the data to be structured. So from the standpoint of that type of reporting or those aggregates, the less structured to more structured progression is also a data quality context. Does that make sense?

A: Yep, got it.

B: Perfect. Okay, and so, just to finish up with the silver concept: like we said, filtered, cleaned and augmented, but remember, there's going to be intermediate data with some cleanup applied, and it's queryable for easy debugging. I often see customers, and I'm sure Tyler would have asked this question too.
B: It's less about saying you must build it in a specific way, and much more about saying you are going to think about your data quality progression; you're going to think about your semi-structured, or less structured, to more structured progression. So, for example, if I was to pull back to my data warehousing days, the old days, I would say: oh yeah, it's like your old OLTP staging to your data warehouse. And we all knew, even back in those days, that it wasn't one size fits all. We did it because we wanted this concept of an OLTP dumping ground, the staging area, where we basically filtered, cleaned and augmented, and maybe joined into third normal form data, and then our data warehouses held the business-level aggregates that our business could query against.

B: But how you did that, and what the details were, really depended on your business. So we're not trying to advocate that you must design it a specific way; we're advocating for the idea that you must think about your data quality, as opposed to thinking that you could just dump the data into your data lake and quality magically happens. Because that's where the real work is. The real work is not so much getting the data into that bronze; the real work is: how do I go ahead and cleanse the data, and then reprocess the data if I need to?
B: And that's part of the reason why the problems exist: if I need to go back and recognize that, let's just say, the last two months of data were wrong, or need to be updated because of new business rules, I can basically delete that date range from my silver and gold, go back to bronze, say "all right, pick up the process from there", kick it in, and reprocess the last two months of data, plus any new data, using these new business rules.
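A hedged sketch of that two-month reprocess with Delta (dates, paths and the rules function are assumptions): recompute the window from bronze and transactionally replace just that slice of silver.

```python
# Recompute the affected window from bronze and atomically replace only
# that date range in the silver table, using the new business rules.
def apply_new_rules(df):
    return df  # hypothetical new business logic

recomputed = apply_new_rules(
    spark.read.format("delta")
        .load("/delta/bronze/events")
        .where("event_date >= '2021-05-01'"))

(recomputed.write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", "event_date >= '2021-05-01'")
    .save("/delta/silver/events"))
```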
B: And, like I said, that brings us to gold: basically clean data, ready for consumption, read with Spark or Presto. Presto, by the way, already works with the manifest files, but coming soon is the ability to not just read the manifest file, but to actually read the transaction log within Delta Lake and query it.
A: On that "coming soon", by the way, since we are talking about hacking on Delta Lake this week: which repository is that coming soon into?
B: Thanks, thank you very much. Perfect, all right. So let me finish off with gold: streaming through to Delta Lake, low latency or manually triggered, which eliminates the management of schedules and jobs. What do I mean by that? Let me talk about that. I'm already running a little over time, so let me just finish up with some quick call-outs. Basically, Delta Lake allows you to do inserts, deletes, merges and overwrites.

B: I should have removed the "standard DML released in 0.3.0" note, because we're on Delta Lake 1.0 now, but the whole point is that you can actually run standard DML, which is helpful for retention, corrections, GDPR. So, back to that reprocessing context I just talked about: that's more or less the point. If I need to go back in time because of business logic changes or whatever, I can just clear the tables, clear the partitions involved, restart the streams, and I'm back up and running.
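A few one-liners sketching that standard DML through Spark SQL (table and column names are assumptions for illustration):

```python
# DELETE for retention/GDPR erasure, UPDATE for corrections,
# MERGE for upserting a batch of corrections into the table.
spark.sql("DELETE FROM events WHERE user_id = 'gdpr-request-123'")
spark.sql("UPDATE widgets SET sold = 10 WHERE widget_id = 42")
spark.sql("""
    MERGE INTO widgets t
    USING corrections s
    ON t.widget_id = s.widget_id
    WHEN MATCHED THEN UPDATE SET t.sold = s.sold
    WHEN NOT MATCHED THEN INSERT *
""")
```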
B: There are lots of organizations using this. I'm going to skip past this slide, just because the Data + AI Summit keynotes are more up to date; I would definitely watch those. There are thousands of organizations, with exabytes of data processed per day. But the one thing I did want to call out there: Comcast had a Data + AI Summit session, from I believe two or maybe three years ago, called "Sessionization with Delta Lake". I can't do it justice in 30 seconds, but the key context about their sessionization of data, in this case from the remote that you have with your Comcast box, is that because of Delta Lake they were able to improve the reliability of their petabyte-scale jobs, which is awesome. And because they were able to improve that reliability, they were able to run with 10x lower compute: in other words, instead of having 640 machines, they got it down to 64, which is pretty amazing. And because they were able to not just do everything in batch, but combine streaming and batch, they were able to go from 84 jobs down to three, and the data latency is significantly smaller, about half. So just by switching to Delta Lake, and following this context of a data quality framework, they were able to improve reliability, use 10 times less compute, and get easier maintenance.
B: So this is why we love Delta Lake, and why we advocate for Delta Lake: this idea that I can have data reliability, and that translates into so many other things: easier jobs, faster performance, lower compute. That's, again, why you hear us talking about it. And then, how do I use Delta Lake? Here's the last slide for us. If you want to get started with Delta Lake, you can add it as a Spark package, or you can use it via Maven. You'll also notice that you can do a pip install, and that's for use with the Spark APIs. Don't forget, you can also use it with the Rust API: for that, it's pip install deltalake, and then it's pip install delta-spark if you want the Spark APIs. And if you're using Spark and you're so used to using Parquet, that's great: simply change dataframe.write.format("parquet") to dataframe.write.format("delta"), and boom, now you're using Delta.
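That one-line switch, as a minimal sketch (the DataFrame and path are assumptions):

```python
# Launch Spark with Delta available, e.g. one of:
#   pyspark --packages io.delta:delta-core_2.12:1.0.0
#   pip install delta-spark
# Then the only change from a Parquet pipeline is the format string:
# df.write.format("parquet").save("/data/events")   # before
df.write.format("delta").save("/data/events")       # after: an ACID table
```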
B: And I think that's it for my session today; I ran a little bit over.
A: Well, thank you very much for joining us, Denny. For anybody that's interested in joining Delta Hack: if you just search for Delta Hack 2021, you'll find the Devpost site. You can join us in our Slack channel: if you head to delta.io, there's a Slack button at the bottom, and if you've just joined Delta Hack, I think our channel is called deltahack2021 there as well. And if you've got any questions, join us on Twitter, or Slack, or the delta-users mailing list. In about an hour or so, we'll welcome back Steven.