Description
A live walk-through demonstrating the many simple ways you can create and manage Delta tables! We'll leverage a few sample datasets to showcase ACID in action - Updates, Deletes, Merges, Schema Evolution, Time Travel and more!
A
Just as a reminder, Delta Hack is a hackathon that we're hosting from pretty much today to the end of the week, just a little thing to get some people introduced to the Delta Lake ecosystem and some of the open source projects that are under the Delta Lake umbrella. If you search for Delta Hack 2021 you'll find more there. And if you've got any questions about the stream, you're more than welcome to ask those in the YouTube live chat; I'll be monitoring that and interrupting as is appropriate. But I'd also encourage you to go to delta.io and join the Slack channel (there's a link there for our Slack workspace) or the Delta Users Google group. But without further ado, I figure I'll pass it over to you, Stephen, and you can go ahead and get started.
B
Awesome, thank you, Tyler. Hi everybody. I'm going to first start off by showing you three different ways that you can create Delta tables: using the SQL APIs, the DataFrameWriter APIs, as well as the DeltaTableBuilder APIs. Then we'll go ahead and start doing something interesting with those tables; I'll show you how to do updates and deletes and merges, demonstrate schema evolution, take a look at history, all kinds of exciting things. So I guess we'll start with the most basic.
B
Let's just call this one, super creative, delta_hack_2021, and then we're just going to put in some dummy columns. We'll just say column one, and, real creative, we'll make that a string, and then I'll create a column called date and make that, no surprise there, a date. The real difference here is I'm going to say USING delta.
B
Instead of saying USING parquet or USING orc or some other file format. And that's really it; let's just go ahead and execute that command, and it's going to create an entry in the Hive metastore, in a database that I have already created ahead of time called, again, delta_hack (super creative), and this table is going to have no records in it.
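A minimal sketch of the SQL DDL described here, run through spark.sql from the Python notebook; the exact database, table, and column names are assumptions reconstructed from the narration.

```python
# Create an empty Delta table; USING delta is the only difference from a
# parquet or orc table. Names follow the transcript and may differ.
spark.sql("""
    CREATE TABLE IF NOT EXISTS delta_hack.delta_hack_2021 (
        col1 STRING,
        date DATE
    )
    USING delta
""")
```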
B
So I'll show you how to create the table now using the DataFrameWriter APIs, and then we can insert some actual data and make things look a little bit more interesting. So let's just go ahead and title this one "DataFrameWriter APIs", and make that look not silly. So let's import some functions. This is a Python notebook, so I don't have to do anything special here: from pyspark.sql.functions import expr, because I'm going to need that. Let's create a data frame; let's just generate some fake data.
B
So let's do something like spark.range. For the purpose of this demo, let's just create a hundred thousand records; we'll just make some things up. So let's say we have an id with spark.range.
B
All right, and then let's generate a date column that's not going to be completely today. So for the date we'll create an expression: let's say cast, and then we'll concat this month, so 2021-06, okay, and then we'll concat a random day, rand() times 30.
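A sketch of the DataFrame being built here, assuming 100,000 rows with an id, a col1 of alternating foo/bar values (implied by the later updates and deletes), and a random June 2021 date; the exact expressions are reconstructed from the narration.

```python
from pyspark.sql.functions import expr

# 100,000 fake records: an id, a foo/bar marker column, and a random
# June 2021 date. Both expressions are assumptions from the narration.
df = (spark.range(100000)
      .withColumn("col1", expr("CASE WHEN id % 2 = 0 THEN 'foo' ELSE 'bar' END"))
      .withColumn("date", expr("CAST(CONCAT('2021-06-', CAST(CEIL(RAND() * 30) AS INT)) AS DATE)")))
```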
B
Now we're going to be using the DataFrameWriter APIs. You'll notice here that, instead of saying again parquet, orc, whatever, I can simply just say format delta, and then we'll just stick in a mode, let's say overwrite, and then we'll do a saveAsTable. What saveAsTable does is, as it writes the data out, it will actually create the metastore entry for you. So let's go ahead and do that.
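The write he's describing, as a minimal sketch; the table name is assumed from the count query that follows.

```python
# Write the DataFrame as a Delta table; saveAsTable creates the Hive
# metastore entry as part of the write.
(df.write
   .format("delta")
   .mode("overwrite")
   .saveAsTable("delta_hack_2021"))
```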
B
And populate our table with a hundred thousand records, and then I'll just show you real quickly, do a quick sanity check on this table: SELECT COUNT(*) FROM delta_hack_2021, as soon as that finishes writing. All right, let's take a look. This should have 100,000 records, and there we go; that's not a surprise. So so far we've covered how to create a Delta table using the SQL APIs, and this is how you do it using the DataFrameWriter APIs.
B
DeltaTableBuilder: the advantage here is that, compared to, like, the DataFrame APIs, you can actually specify extra information like comments and table properties. There's also a new, exciting feature in Delta Lake 1.0 that's experimental, called generated columns, so I'll go ahead and show you how you can create that. So let's go ahead and do from delta.tables import *, and then we'll import some data types as well.
B
And then go ahead and create. The delta.tables import * is what gives you access to this DeltaTable.createIfNotExists; pass in the Spark session, and then let's just go ahead and create a new table name. So the previous one we called delta_hack_2021; we'll call this one delta_hack_2021_new, just because I'm creative that way. Let's go ahead and add some columns and make this a little bit more interesting.
B
So let's go ahead and add an id like we had before, and make this a long type. Let's go ahead and add another column, column one; we'll make that a string type just like we had before. We'll add another column.
B
Let's call this one timestamp, or ts, and then we'll make that a timestamp type. All right, and then we'll add, and this is the cool one, a generated column. So we'll call this one date.
B
We'll partition this whole thing by the date. Execute... all right, and that's because it's not DeltaTables, it's DeltaTable, and I cannot type today. So let's go ahead and execute this.
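A sketch of the DeltaTableBuilder call being assembled here; the generated-column expression (deriving date from ts) is an assumption, since it isn't spelled out in the narration.

```python
from delta.tables import DeltaTable
from pyspark.sql.types import LongType, StringType, TimestampType, DateType

# Build the table declaratively; generatedAlwaysAs makes date a generated
# column computed from ts, and the table is partitioned on it.
(DeltaTable.createIfNotExists(spark)
    .tableName("delta_hack_2021_new")
    .addColumn("id", LongType())
    .addColumn("col1", StringType())
    .addColumn("ts", TimestampType())
    .addColumn("date", DateType(), generatedAlwaysAs="CAST(ts AS DATE)")
    .partitionedBy("date")
    .execute())
```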
B
That's a great question. So compared to the DataFrameWriter API, you have a few more options for specifying the different types. You can even add column comments, you can specify table properties, you can do generated columns; these are not things that you can actually specify when you're doing a df.write.
B
I mean, I guess data types you could, if you were to provide a schema for the data frame up front, but it gives you a little bit more control over how the table gets created.
A
Gotcha, thanks.
B
Okay. So now, since we've written that out, let's do a quick describe on this table, DESCRIBE, and since I can't type, I'm just going to copy and paste that. Make sure that... here we go: we have a table with four columns, and it's partitioned on my date column. Okay, cool. So let's go ahead and write some data to it. So I'm just going to go ahead and copy that data frame... actually, well, I've already defined the data frame.
B
B
B
Actually,
I
should
probably
specify
that
I
also
do
want
the
id
column
here
all
right.
So
let's
go
ahead
and
write
that
ef.right.format
delta
overwriting.
I
I
guess
that
doesn't
really
matter
because
there's
nothing
there,
but
let's
just
go
ahead
and
write
it
into
my
new
delta
hack,
2021
table.
B
Now
we're
going
to
write
another
100
000
records
with
one
timestamp.
What
I'm
going
to
go
ahead
and
show
you
is
that
the
date
column
does
automatically
get
calculated
so
select
star
from
delta
hack,
make
21
new.
B
So just to review, I showed you how to create Delta tables using the SQL APIs, the DataFrameWriter APIs, and now the DeltaTableBuilder API. Let's go ahead and do something more interesting with our Delta Lake table. So we'll just call this, I don't know, "ACID on Delta".
B
All
right
so,
let's
just
say
I
want
to
run
some
updates
on
my
delta
table,
for
whatever
reason
I
decided
that
I
don't
want
my
column
to
be
called
bar
anymore.
Maybe
I
want
it
to
be
called
baz,
so
there
are
several
ways
that
I
can
do
this.
I
can
either
use
the
delta
table
apis.
I
can
use
the
sql
apis
I'll
just
go
ahead
and
show
you
both.
So,
let's
just
let's
say
my
table
equals
deltatable.for
name.
We
called
this
one
delta
hack,
hack,
may
21
mu.
B
Then I'm going to call an update on this object, so my_table.update. The condition's going to be where column one equals foo. Actually, let's do this: instead of changing the value, let's just go ahead and update the timestamp. So update the ts column, and we'll make that the new current timestamp. All right, go ahead and run this... forName is missing... oh yes, I almost forgot, gotta pass in the Spark context.
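A minimal sketch of the update as described, using the DeltaTable API; the condition and column names follow the transcript.

```python
from delta.tables import DeltaTable

# Look the table up by name (note the Spark session argument), then touch
# the ts column for every row where col1 is 'foo'.
my_table = DeltaTable.forName(spark, "delta_hack_2021_new")
my_table.update(
    condition="col1 = 'foo'",
    set={"ts": "current_timestamp()"}
)
```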
B
There we go, all right. So now we're actually going and finding all the records inside the files that we've written out that have column one equal to foo, and we're going to update the current timestamp. So if I just go ahead and do this, if I query the table again, let's take a look at what changed. The original write here happened at 16:42 UTC; you can see now I updated all columns for foo at 16:45.
A
Someone from chat wanted to know if the generated column syntax that you showed a little bit before, with the table builder API, if that was available for SQL DDLs as well.
B
It
is,
and
at
the
I
don't
know
off
the
top
of
my
head,
but
let
me
let
me
follow
up.
Let
me
follow
up
at
the
end
with
some
documentation.
That's
good
thanks!
Okay,
cool
okay,
so
I
showed
you
updates.
Let's
do
something
interesting
like
I
wanted
to
delete
all
the
records
that
contained
bar
in
column,
one.
So
very
simple.
Again,
I
have
my
table.
Fine,
so
I'm
just
gonna
say
delete
where
column
one
equals
bar.
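The delete he's describing, sketched with the DeltaTable object defined above.

```python
# Delete every row whose col1 value is 'bar'.
my_table.delete("col1 = 'bar'")
```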
B
I
showed
you,
the
delta
table
builder
apis,
for
how
to
do
all
this,
the
sql
syntax
should
you
should
be
pretty
familiar
with
it's
just
the
same
like
update
and
then
the
table
name
set.
You
know,
column
equals
to.
You
know
current
time
stamp
where
column
one
equals
var
same
thing,
with
the
delete
it's
going
to
be
just
basically
delete
from
table.
B
Where
condition
so
I
mean
I
come
from
a
sequel
background.
I,
like
writing,
sql.
So
I
think
it's
it's
a
lot
faster
for
me,
but
you
you
have
options.
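The SQL equivalents he mentions, as a sketch run through spark.sql; table and column names are the assumed ones from earlier.

```python
# Same update and delete as above, expressed in SQL.
spark.sql("""
    UPDATE delta_hack_2021_new
    SET ts = current_timestamp()
    WHERE col1 = 'foo'
""")

spark.sql("""
    DELETE FROM delta_hack_2021_new
    WHERE col1 = 'bar'
""")
```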
B
Let's see, what else? Let's talk about overwriting schemas. So periodically your table changes; maybe something's changed upstream. Let's just say I want to overwrite this table and get rid of that date column. So let's just go ahead and do this: I have a data frame that I created earlier, it just has three columns in it, and I don't want the generated column anymore. So how do I overwrite my Delta table? You actually just have to specify an option that's called overwriteSchema. So if I say...
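A sketch of that overwrite, assuming a hypothetical three-column DataFrame df3 (id, col1, ts) standing in for the one he defined earlier.

```python
# overwriteSchema lets the overwrite replace the table's schema, which
# drops the generated date column along with the old data.
(df3.write
    .format("delta")
    .option("overwriteSchema", "true")
    .mode("overwrite")
    .saveAsTable("delta_hack_2021_new"))
```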
B
Let's
see,
let's
talk
about,
let's
talk
about
another
another
interesting
thing
that
you
can
do
with
delta
tables.
Basically
these
are.
These
are
called
merges
and
upserts.
So
let's
say
you
have
a
new
data
set
and
you
want
to
do
something
like
if
this
data
on
this
key
already
exists
in
my
target
table,
then
let's
just
go
ahead
and
update
that
data.
B
With the DeltaTable API, we have my_table, which I should probably redefine, because I've changed my table: DeltaTable.forName, spark (we're going to not forget the context this time), delta_hack_2021_new. All right, and then we're going to take that same data frame that I've got defined up here somewhere, all right, so I'm just going to copy that. I've got my data frame: records, same id and timestamp. So let's go ahead and... but actually, let's make this different.
B
This should update our data set. Now the ids are going to be the same, from zero to 99,999, so it should update those records that say foo to baz, and then insert these bar records. So we'll say my_table, and just for the sake of making this easier I'm going ahead and giving this an alias, and say merge, and we'll take our data frame and give this an alias too; we'll just call this one updates, and say dh.id equals updates.id.
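A minimal sketch of the merge being assembled, assuming whenMatchedUpdateAll and whenNotMatchedInsertAll for the update-or-insert behavior he describes, and an updates_df standing in for the modified DataFrame.

```python
# Upsert: rows whose id already exists in the target are updated; the
# rest are inserted.
(my_table.alias("dh")
    .merge(updates_df.alias("updates"), "dh.id = updates.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```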
B
All right, so you see here, all of these records that previously said foo have been updated to baz with this new timestamp, and then all these bar records, which didn't exist in the table anymore because I deleted them, are now there. Okay, cool. So let's go ahead and say I want to evolve the schema. Again, instead of replacing the schema, let's say something upstream happened and a developer or platform team or somebody decided to add extra columns to my data set. So let's call this one "schema evolution".
B
All
right,
so
I'm
going
to
create
another
data
frame,
we'll
just
call
this
one
bf
new
and
accept
I'm
not
going
to
type
all
this
out
again,
I'm
just
going
to
say
copy
this
all
right,
but
instead
of
just
having
ib
and
column,
one.
B
We'll just go ahead and do df_new.write.option mergeSchema. Now, what happens with Delta tables if you don't specify mergeSchema is it will actually stop you and say, hey, the schema of this new data frame that you're trying to write to this table doesn't match, so you can't do this. But if you explicitly say that you want to evolve the schema, or merge the schemas, you can go ahead and provide this option. So .mode, let's just say overwrite, and then we'll do the same saveAsTable, delta_hack_2021_new.
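A sketch of that schema-evolving write, assuming df_new carries one extra column beyond the table's current schema.

```python
# mergeSchema tells Delta to accept the new column instead of rejecting
# the mismatched schema; the column is appended to the table's schema.
(df_new.write
    .format("delta")
    .option("mergeSchema", "true")
    .mode("overwrite")
    .saveAsTable("delta_hack_2021_new"))
```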
B
And
see
that
we've
got
that
new
column
in
there
and
there
it
is
added
to
the
end
cool.
So
we've
made
a
number
of
changes
to
our
delta
table.
How
do
you
sort
of
make
sense
of
all
the
all
the
things
that
have
happened
and
then
what?
If
you
want
to
go
back
in
time
and
and
you
know
maybe
pull
some
data
or
or
maybe
new
data
engineer
that
you
just
hired
on
your
team
pulls
a
mulligan?
B
I
may
or
may
not
have
done
this
before
and
update
something
with
the
incorrect
or
absolutely
missing
where
clause.
Let's
take
a
look
at
our
history,
so
describe
history
and
then
name
of
our
table.
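The history lookup, sketched through spark.sql.

```python
# Each row is one commit: version, timestamp, operation, and parameters.
spark.sql("DESCRIBE HISTORY delta_hack_2021_new").show(truncate=False)
```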
B
It'll show you the history of changes that have occurred over this Delta table, because, as I'm sure Denny mentioned earlier, you can sort of think about the changes to the table as a series of new snapshots of this table. So you can think of it as: at version zero, this is when I created the table; at version one, I went ahead and added a bunch of data; at version two, I updated some data; at version three, I deleted something; and then we did some merges, so on and so forth.
So
let's
just
say
I
want
to
see.
I
want
to
see
how
many
records
I
had
in
my
table
right
after
I
did
the
delete.
So,
let's,
actually,
let's
do
some.
Let's
do
something
interesting,
so
if
I'd
say,
select
count
star
from
the
table
as
it
stands
right
now,
after
all
of
my
updates,
there
are
a
hundred
thousand
records
in
it,
but
after
I
did
the
delete
earlier,
I
deleted
half
of
those
records
because
I
deleted
everything
where
the
column
was
bar.
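The time-travel count he runs next, sketched in SQL through spark.sql; the version number follows the history walk-through above.

```python
# Count the records as of the snapshot right after the delete.
spark.sql("SELECT COUNT(*) FROM delta_hack_2021_new VERSION AS OF 3").show()
```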
B
That
version
has
50
000
records
and
accordingly
you
can
go
ahead
and
actually
see
this
snapshot
at
that
point
in
time.
So
these
were
the
values
for
these
columns
as
a
version
three,
you
can
also
inspect
this
using
timestamps.
So
instead
of
saying
version,
as
of
you
would
say,
timestamp
as
of
and
then
like
everything
else.
Of
course,
there
is
a
way
to
use
the
do
this
programmatically
using
data
frames
it'd
just
be
a
matter
of
doing
something
like
spark.read.format
oops
and
not
actually
executing
that
before
I'm
done
typing
it
format
delta.
B
But
then
you
pass
in
an
option
like
version
as
of
and
say,
and
I
deleted
everything
that
I
just
typed
so
version
as
of
and
say,
like
version
three
dot
table
delta,
hack,
21,
new.
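The programmatic equivalent, as a sketch: reading a historical snapshot through the DataFrame reader.

```python
# Read the table as it looked at version 3, via the versionAsOf option.
df_v3 = (spark.read
         .format("delta")
         .option("versionAsOf", 3)
         .table("delta_hack_2021_new"))
```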
B
So yeah, I think I...
A
Cool, well, thank you, Stephen. If you want to connect with Stephen, he's Stephen Yu on LinkedIn. What he showed here today is using Delta Lake, of course, but using it in the Databricks product. You don't have to use Databricks (I am actually a customer of Databricks, and I like what they do a lot), but everything that he showed you today is in Delta Lake 1.0, which was recently announced and is available for download.