Description
A live walk-through demonstrating the many simple ways you can create and manage Delta tables! We'll leverage a few sample datasets to showcase ACID in action - Updates, Deletes, Merges, Schema Evolution, Time Travel and more!
A
Just as a reminder, Delta Hack is a hackathon that we're hosting from pretty much today to the end of the week, just a little thing to get some people introduced to the Delta Lake ecosystem and some of the open source projects that are under the Delta Lake umbrella. If you search for Delta Hack 2021 you'll find more there. And if you've got any questions about the stream, you're more than welcome to ask those in the YouTube live chat; I'll be monitoring that and interrupting as is appropriate. But I'd also encourage you to go to delta.io and join the Slack channel (there's a link there for our Slack workspace) or the Delta Users Google group. But without further ado, I figure I'll pass it over to you, Stephen, and you can go ahead and get started.
B
Awesome, thank you, Tyler. Hi everybody. I'm going to first start off by showing you three different ways that you can create Delta tables: using the SQL APIs, the DataFrameWriter APIs, as well as the DeltaTableBuilder APIs. Then we'll go ahead and start doing something interesting with those tables; I'll show you how to do updates and deletes and merges, demonstrate schema evolution, take a look at history, all kinds of exciting things. So I guess we'll start with the most basic.
B
Let's just call this one, super creative, delta_hack_2021, and then we're just going to put in some dummy columns. We'll just say column one, and, real creative, we'll make that a string, and then I'll create a column called date and make that, no surprise there, a date. The real difference here is I'm going to say USING delta.
B
Instead of saying USING parquet or USING orc or some other file format. And that's really it; let's just go ahead and execute that command, and it's going to create an entry in the Hive metastore, in a database that I have already created ahead of time called, again, delta_hack (super creative), and this table is going to have no records in it.
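A minimal sketch of the SQL DDL described here, run through spark.sql from the Python notebook; the exact database, table, and column names are assumptions reconstructed from the narration.

```python
# Create an empty Delta table; USING delta is the only difference from a
# parquet or orc table. Names follow the transcript and may differ.
spark.sql("""
    CREATE TABLE IF NOT EXISTS delta_hack.delta_hack_2021 (
        col1 STRING,
        date DATE
    )
    USING delta
""")
```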
B
So I'll show you how to create the table now using the DataFrameWriter APIs, and then we can insert some actual data and make things look a little bit more interesting. So let's just go ahead and title this one "DataFrameWriter APIs", and make that look not silly. So let's import some functions. This is a Python notebook, so I don't have to do anything special here: from pyspark.sql.functions import expr, because I'm going to need that. Let's create a data frame; let's just generate some fake data.
B
So let's do something like spark.range. For the purpose of this demo, let's just create a hundred thousand records; we'll just make some things up. So let's say we have an id with spark.range.
B
All right, and then let's generate a date column that's not going to be completely today. So for the date we'll create an expression: let's say cast, and then we'll concat this month, so 2021-06, okay, and then we'll concat a random day, rand() times 30.
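A sketch of the DataFrame being built here, assuming 100,000 rows with an id, a col1 of alternating foo/bar values (implied by the later updates and deletes), and a random June 2021 date; the exact expressions are reconstructed from the narration.

```python
from pyspark.sql.functions import expr

# 100,000 fake records: an id, a foo/bar marker column, and a random
# June 2021 date. Both expressions are assumptions from the narration.
df = (spark.range(100000)
      .withColumn("col1", expr("CASE WHEN id % 2 = 0 THEN 'foo' ELSE 'bar' END"))
      .withColumn("date", expr("CAST(CONCAT('2021-06-', CAST(CEIL(RAND() * 30) AS INT)) AS DATE)")))
```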
B
Now we're going to be using the DataFrameWriter APIs. You'll notice here that, instead of saying again parquet, orc, whatever, I can simply just say format delta, and then we'll just stick in a mode, let's say overwrite, and then we'll do a saveAsTable. What saveAsTable does is, as it writes the data out, it will actually create the metastore entry for you. So let's go ahead and do that.
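The write he's describing, as a minimal sketch; the table name is assumed from the count query that follows.

```python
# Write the DataFrame as a Delta table; saveAsTable creates the Hive
# metastore entry as part of the write.
(df.write
   .format("delta")
   .mode("overwrite")
   .saveAsTable("delta_hack_2021"))
```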
B
And populate our table with a hundred thousand records, and then I'll just show you real quickly, do a quick sanity check on this table: SELECT COUNT(*) FROM delta_hack_2021, as soon as that finishes writing. All right, let's take a look. This should have 100,000 records, and there we go; that's not a surprise. So so far we've covered how to create a Delta table using the SQL APIs, and this is how you do it using the DataFrameWriter APIs.
B
DeltaTableBuilder: the advantage here is that, compared to, like, the DataFrame APIs, you can actually specify extra information like comments and table properties. There's also a new, exciting feature in Delta Lake 1.0 that's experimental, called generated columns, so I'll go ahead and show you how you can create that. So let's go ahead and do from delta.tables import *, and then we'll import some data types as well.
B
And then go ahead and create. The delta.tables import * is what gives you access to this DeltaTable.createIfNotExists; pass in the Spark session, and then let's just go ahead and create a new table name. So the previous one we called delta_hack_2021; we'll call this one delta_hack_2021_new, just because I'm creative that way. Let's go ahead and add some columns and make this a little bit more interesting.
B
So let's go ahead and add an id like we had before, and make this a long type. Let's go ahead and add another column, column one; we'll make that a string type just like we had before. We'll add another column.
B
Let's call this one timestamp, or ts, and then we'll make that a timestamp type. All right, and then we'll add, and this is the cool one, a generated column. So we'll call this one date.
B
We'll partition this whole thing by the date. Execute... all right, and that's because it's not DeltaTables, it's DeltaTable, and I cannot type today. So let's go ahead and execute this.
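A sketch of the DeltaTableBuilder call being assembled here; the generated-column expression (deriving date from ts) is an assumption, since it isn't spelled out in the narration.

```python
from delta.tables import DeltaTable
from pyspark.sql.types import LongType, StringType, TimestampType, DateType

# Build the table declaratively; generatedAlwaysAs makes date a generated
# column computed from ts, and the table is partitioned on it.
(DeltaTable.createIfNotExists(spark)
    .tableName("delta_hack_2021_new")
    .addColumn("id", LongType())
    .addColumn("col1", StringType())
    .addColumn("ts", TimestampType())
    .addColumn("date", DateType(), generatedAlwaysAs="CAST(ts AS DATE)")
    .partitionedBy("date")
    .execute())
```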
B
That's a great question. So compared to the DataFrameWriter API, you have a few more options for specifying the different types. You can even add column comments, you can specify table properties, you can do generated columns; these are not things that you can actually specify when you're doing a df.write.
B
I mean, I guess data types you could, if you were to provide a schema for the data frame up front, but it gives you a little bit more control over how the table gets created.
A
Gotcha, thanks.
B
Okay. So now, since we've written that out, let's do a quick describe on this table, DESCRIBE, and since I can't type, I'm just going to copy and paste that. Make sure that... here we go: we have a table with four columns, and it's partitioned on my date column. Okay, cool. So let's go ahead and write some data to it. So I'm just going to go ahead and copy that data frame... actually, well, I've already defined the data frame.
B
B
B
Actually,
I
should
probably
specify
that
I
also
do
want
the
id
column
here
all
right.
So
let's
go
ahead
and
write
that
ef.right.format
delta
overwriting.
I
I
guess
that
doesn't
really
matter
because
there's
nothing
there,
but
let's
just
go
ahead
and
write
it
into
my
new
delta
hack,
2021
table.
B
Now
we're
going
to
write
another
100
000
records
with
one
timestamp.
What
I'm
going
to
go
ahead
and
show
you
is
that
the
date
column
does
automatically
get
calculated
so
select
star
from
delta
hack,
make
21
new.
B
So just to review, I showed you how to create Delta tables using the SQL APIs, the DataFrameWriter APIs, and now the DeltaTableBuilder API. Let's go ahead and do something more interesting with our Delta Lake table. So we'll just call this, I don't know, "ACID on Delta".
B
All
right
so,
let's
just
say
I
want
to
run
some
updates
on
my
delta
table,
for
whatever
reason
I
decided
that
I
don't
want
my
column
to
be
called
bar
anymore.
Maybe
I
want
it
to
be
called
baz,
so
there
are
several
ways
that
I
can
do
this.
I
can
either
use
the
delta
table
apis.
I
can
use
the
sql
apis
I'll
just
go
ahead
and
show
you
both.
So,
let's
just
let's
say
my
table
equals
deltatable.for
name.
We
called
this
one
delta
hack,
hack,
may
21
mu.
B
Then I'm going to call an update on this object, so my_table.update. The condition's going to be where column one equals foo. Actually, let's do this: instead of changing the value, let's just go ahead and update the timestamp. So update the ts column, and we'll make that the new current timestamp. All right, go ahead and run this... forName is missing... oh yes, I almost forgot, gotta pass in the Spark context.
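A minimal sketch of the update as described, using the DeltaTable API; the condition and column names follow the transcript.

```python
from delta.tables import DeltaTable

# Look the table up by name (note the Spark session argument), then touch
# the ts column for every row where col1 is 'foo'.
my_table = DeltaTable.forName(spark, "delta_hack_2021_new")
my_table.update(
    condition="col1 = 'foo'",
    set={"ts": "current_timestamp()"}
)
```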
B
There we go, all right. So now we're actually going and finding all the records inside the files that we've written out that have column one equal to foo, and we're going to update the current timestamp. So if I just go ahead and do this, if I query the table again, let's take a look at what changed. The original write here happened at 16:42 UTC; you can see now I updated all columns for foo at 16:45.
A
Someone from chat wanted to know if the generated column syntax that you showed a little bit before, with the table builder API, if that was available for SQL DDLs as well.
B
It
is,
and
at
the
I
don't
know
off
the
top
of
my
head,
but
let
me
let
me
follow
up.
Let
me
follow
up
at
the
end
with
some
documentation.
That's
good
thanks!
Okay,
cool
okay,
so
I
showed
you
updates.
Let's
do
something
interesting
like
I
wanted
to
delete
all
the
records
that
contained
bar
in
column,
one.
So
very
simple.
Again,
I
have
my
table.
Fine,
so
I'm
just
gonna
say
delete
where
column
one
equals
bar.
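The delete he's describing, sketched with the DeltaTable object defined above.

```python
# Delete every row whose col1 value is 'bar'.
my_table.delete("col1 = 'bar'")
```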
B
I
showed
you,
the
delta
table
builder
apis,
for
how
to
do
all
this,
the
sql
syntax
should
you
should
be
pretty
familiar
with
it's
just
the
same
like
update
and
then
the
table
name
set.
You
know,
column
equals
to.
You
know
current
time
stamp
where
column
one
equals
var
same
thing,
with
the
delete
it's
going
to
be
just
basically
delete
from
table.
B
Where
condition
so
I
mean
I
come
from
a
sequel
background.
I,
like
writing,
sql.
So
I
think
it's
it's
a
lot
faster
for
me,
but
you
you
have
options.
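The SQL equivalents he mentions, as a sketch run through spark.sql; table and column names are the assumed ones from earlier.

```python
# Same update and delete as above, expressed in SQL.
spark.sql("""
    UPDATE delta_hack_2021_new
    SET ts = current_timestamp()
    WHERE col1 = 'foo'
""")

spark.sql("""
    DELETE FROM delta_hack_2021_new
    WHERE col1 = 'bar'
""")
```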
B
Let's see, what else? Let's talk about overwriting schemas. So periodically your table changes; maybe something's changed upstream. Let's just say I want to overwrite this table and get rid of that date column. So let's just go ahead and do this: I have a data frame that I created earlier, it just has three columns in it, and I don't want the generated column anymore. So how do I overwrite my Delta table? You actually just have to specify an option that's called overwriteSchema. So if I say...
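A sketch of that overwrite, assuming a hypothetical three-column DataFrame df3 (id, col1, ts) standing in for the one he defined earlier.

```python
# overwriteSchema lets the overwrite replace the table's schema, which
# drops the generated date column along with the old data.
(df3.write
    .format("delta")
    .option("overwriteSchema", "true")
    .mode("overwrite")
    .saveAsTable("delta_hack_2021_new"))
```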
B
Let's
see,
let's
talk
about,
let's
talk
about
another
another
interesting
thing
that
you
can
do
with
delta
tables.
Basically
these
are.
These
are
called
merges
and
upserts.
So
let's
say
you
have
a
new
data
set
and
you
want
to
do
something
like
if
this
data
on
this
key
already
exists
in
my
target
table,
then
let's
just
go
ahead
and
update
that
data.
B
With the DeltaTable API, we have my_table, which I should probably redefine, because I've changed my table: DeltaTable.forName, spark (we're going to not forget the context this time), delta_hack_2021_new. All right, and then we're going to take that same data frame that I've got defined up here somewhere, all right, so I'm just going to copy that. I've got my data frame: records, same id and timestamp. So let's go ahead and... but actually, let's make this different.
B
This should update our data set. Now the ids are going to be the same, from zero to 99,999, so it should update those records that say foo to baz, and then insert these bar records. So we'll say my_table, and just for the sake of making this easier I'm going ahead and giving this an alias, and say merge, and we'll take our data frame and give this an alias too; we'll just call this one updates, and say dh.id equals updates.id.
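A minimal sketch of the merge being assembled, assuming whenMatchedUpdateAll and whenNotMatchedInsertAll for the update-or-insert behavior he describes, and an updates_df standing in for the modified DataFrame.

```python
# Upsert: rows whose id already exists in the target are updated; the
# rest are inserted.
(my_table.alias("dh")
    .merge(updates_df.alias("updates"), "dh.id = updates.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```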
B
All right, so you see here, all of these records that previously said foo have been updated to baz with this new timestamp, and then all these bar records, which didn't exist in the table anymore because I deleted them, are now there. Okay, cool. So let's go ahead and say I want to evolve the schema. Again, instead of replacing the schema, let's say something upstream happened and a developer or platform team or somebody decided to add extra columns to my data set. So let's call this one "schema evolution".
B
All
right,
so
I'm
going
to
create
another
data
frame,
we'll
just
call
this
one
bf
new
and
accept
I'm
not
going
to
type
all
this
out
again,
I'm
just
going
to
say
copy
this
all
right,
but
instead
of
just
having
ib
and
column,
one.
B
We'll just go ahead and do df_new.write.option mergeSchema. Now, what happens with Delta tables if you don't specify mergeSchema is it will actually stop you and say, hey, the schema of this new data frame that you're trying to write to this table doesn't match, so you can't do this. But if you explicitly say that you want to evolve the schema, or merge the schemas, you can go ahead and provide this option. So .mode, let's just say overwrite, and then we'll do the same saveAsTable, delta_hack_2021_new.
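A sketch of that schema-evolving write, assuming df_new carries one extra column beyond the table's current schema.

```python
# mergeSchema tells Delta to accept the new column instead of rejecting
# the mismatched schema; the column is appended to the table's schema.
(df_new.write
    .format("delta")
    .option("mergeSchema", "true")
    .mode("overwrite")
    .saveAsTable("delta_hack_2021_new"))
```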
B
And
see
that
we've
got
that
new
column
in
there
and
there
it
is
added
to
the
end
cool.
So
we've
made
a
number
of
changes
to
our
delta
table.
How
do
you
sort
of
make
sense
of
all
the
all
the
things
that
have
happened
and
then
what?
If
you
want
to
go
back
in
time
and
and
you
know
maybe
pull
some
data
or
or
maybe
new
data
engineer
that
you
just
hired
on
your
team
pulls
a
mulligan?
B
I
may
or
may
not
have
done
this
before
and
update
something
with
the
incorrect
or
absolutely
missing
where
clause.
Let's
take
a
look
at
our
history,
so
describe
history
and
then
name
of
our
table.
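The history lookup, sketched through spark.sql.

```python
# Each row is one commit: version, timestamp, operation, and parameters.
spark.sql("DESCRIBE HISTORY delta_hack_2021_new").show(truncate=False)
```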
B
It'll show you the history of changes that have occurred over this Delta table, because, as I'm sure Denny mentioned earlier, you can sort of think about the changes to the table as a series of new snapshots of this table. So you can think of it as: at version zero, this is when I created the table; at version one, I went ahead and added a bunch of data; at version two, I updated some data; at version three, I deleted something; and then we did some merges, so on and so forth.
So
let's
just
say
I
want
to
see.
I
want
to
see
how
many
records
I
had
in
my
table
right
after
I
did
the
delete.
So,
let's,
actually,
let's
do
some.
Let's
do
something
interesting,
so
if
I'd
say,
select
count
star
from
the
table
as
it
stands
right
now,
after
all
of
my
updates,
there
are
a
hundred
thousand
records
in
it,
but
after
I
did
the
delete
earlier,
I
deleted
half
of
those
records
because
I
deleted
everything
where
the
column
was
bar.
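The time-travel count he runs next, sketched in SQL through spark.sql; the version number follows the history walk-through above.

```python
# Count the records as of the snapshot right after the delete.
spark.sql("SELECT COUNT(*) FROM delta_hack_2021_new VERSION AS OF 3").show()
```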
B
That
version
has
50
000
records
and
accordingly
you
can
go
ahead
and
actually
see
this
snapshot
at
that
point
in
time.
So
these
were
the
values
for
these
columns
as
a
version
three,
you
can
also
inspect
this
using
timestamps.
So
instead
of
saying
version,
as
of
you
would
say,
timestamp
as
of
and
then
like
everything
else.
Of
course,
there
is
a
way
to
use
the
do
this
programmatically
using
data
frames
it'd
just
be
a
matter
of
doing
something
like
spark.read.format
oops
and
not
actually
executing
that
before
I'm
done
typing
it
format
delta.
B
But
then
you
pass
in
an
option
like
version
as
of
and
say,
and
I
deleted
everything
that
I
just
typed
so
version
as
of
and
say,
like
version
three
dot
table
delta,
hack,
21,
new.
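The programmatic equivalent, as a sketch: reading a historical snapshot through the DataFrame reader.

```python
# Read the table as it looked at version 3, via the versionAsOf option.
df_v3 = (spark.read
         .format("delta")
         .option("versionAsOf", 3)
         .table("delta_hack_2021_new"))
```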
B
So yeah, I think I...
A
Cool, well, thank you, Stephen. If you want to connect with Stephen, he's Stephen Yu on LinkedIn. What he showed here today is using Delta Lake, of course, but using it in the Databricks product. You don't have to use Databricks (I am actually a customer of Databricks, and I like what they do a lot), but everything that he showed you today is in Delta Lake 1.0, which was recently announced and is available for download.