From YouTube: Deep-Dive into Delta Lake
Description
Delta Lake is becoming a de-facto standard for storing big amounts of data for analytical purposes in a data lake. But what is behind it? How does it work under the hood? In this session we will dive deep into the internals of Delta Lake by unpacking the transaction log and also highlight some common pitfalls when working with Delta Lake (and show how to avoid them).
Connect with us:
Website: https://databricks.com
Facebook: https://www.facebook.com/databricksinc
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/data...
Instagram: https://www.instagram.com/databricksinc/
Hello and welcome, everyone, to my session at this 2022 summit about a deep dive into Delta Lake. Before we get started, let me also quickly introduce myself so that you know who's speaking and who's giving you all this information.

Okay, so let's get started. I prepared a little agenda of what we're going to cover in the next hour, or the next 40 minutes: we'll have a quick look at what Delta Lake actually is, and then go into some details on the delta log, which is obviously very important for the Delta Lake architecture. I'll also show you some brief examples of how this actually works and what happens under the hood if you perform some of the most common transactions on a Delta Lake table. We will cover file and storage management, as this is very crucial for Delta Lake, and we'll also have a look at streaming use cases and what you need to watch out for when you use Delta Lake for streaming. The last technical thing that we'll cover are table properties, properties that you can set in Spark or on the table itself to control how Delta works internally. And at the end, I've prepared some conclusions and lessons learned.
So what is Delta Lake? Delta Lake is basically an open-source storage format that you can use to store your data. It's compatible with most of the common processing engines for big data, foremost Spark, but also others like Hive, Trino, Flink and so on. What is also very important, and makes it very usable across different people and skill sets, is that there are APIs for most of the common languages: Python, Scala, and also Ruby and Rust, for example.

One of the key features of Delta Lake is that it is ACID compliant in terms of transactions, meaning that whenever you run a transaction on your Delta Lake table, be it an insert, an update, a delete, whatever, Delta Lake, or the Delta Lake protocol, basically ensures that this transaction is either done completely or not at all. So you basically never end up in a state where your transaction is only executed halfway and then aborted; this can never happen, and the data is always in a consistent state.
Some other features, which are actually very important because they make Delta Lake very usable, are the support for UPDATE and MERGE. If you used Hive in the very beginning, you probably know that updating data row by row is actually not that simple. Delta Lake really helps here, especially with the MERGE statement, which provides a lot of cool features that you can use to easily manage your data on top of the transaction protocol.

There are also features built in, foremost time travel: as each transaction is basically recorded in the delta log, which we will cover later, you can always jump back to a previous version of your table and of the data in your table. There are also features for schema evolution and schema enforcement, so it's actually not possible to add data with a wrong schema to an existing Delta table; that transaction would just fail if you try to, I don't know, insert a string value into an integer column, for example.

Also, Delta Lake supports both batch and streaming, which is very important because you can maintain the same code for your batch processing and for your stream processing, and it is also 100% compatible with Apache Spark.
When it now comes to the actual implementation: what you get when you create a Delta table is basically a folder that contains the metadata, the transaction log (the delta log, which we will cover later), and obviously also the data itself. This folder is more or less self-contained, so you could just take that folder, put it on a USB key and open it from there if you wanted. It's not actually tied to any file system or storage system; you can place it anywhere. Obviously it makes the most sense in a data lake, but as I said, that's not mandatory, as the Delta Lake table contains everything it needs to operate in itself, in that folder. Anyone who wants to work with the Delta table just needs to know the location.

So, for example, if you have a Delta Lake table in your data lake and you have a Spark cluster that needs to process it, you can just tell Spark the path of the Delta table and then read it as a regular data frame. You don't need to specify the schema, you don't need to specify anything else; it's all handled automatically for you. So that's pretty convenient.
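As a rough sketch of what that looks like in PySpark (the path here is just a made-up example):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Point Spark at the Delta table's folder; schema and file list come
    # from the _delta_log, so nothing else needs to be specified.
    df = spark.read.format("delta").load("/mnt/datalake/sales/orders")
    df.show()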
So how is this all possible? As I said, the key concept of Delta Lake is actually a transaction layer on top of the actual data, and this transaction layer is called the delta log. This delta log is stored as part of the Delta Lake table in a special folder called _delta_log, and it basically contains everything that's important to manage the transactions and keep the table data and metadata up to date. The delta log foremost contains the table schema, obviously, and all the changes to it; so if, say, a new column is added, that's also recorded there. It also contains the references to all the actual files that contain the data, and for every operation and transaction that you run on the Delta table it also stores some additional metadata and metrics, as you can see on the right.
Those transactions in the delta log are actually stored as JSON files, and after 10 transactions a so-called checkpoint file is generated, which basically takes all the previous transactions and aggregates them into one big Parquet file. The reason for that is that, as you can imagine, if you have a table that's out there for a couple of years and you have a lot of transactions on it, that would result in, I don't know, millions of JSON files, and reading those JSON files would be a big hassle, because you would have to touch, let's say, a million JSON files. To avoid this, those checkpoints were introduced, and basically all you need to do is read the latest checkpoint and all the JSON transactions from there on, so it's a maximum of 11 files, actually.
The delta log is also what enables the ACID-compliant features for concurrency control. If multiple processes write to the same data at the same time, the output would not be the same, or not what you expect, so one of those transactions has to fail; that's the so-called optimistic concurrency control. Basically, before the process actually commits the transaction, it checks whether the data has changed since the transaction was started. If that's the case, the transaction will fail. And the delta log is also used for time travel.
A
So
you
see
all
those
different
versions
on
the
right
side.
You
can
basically
say:
okay,
please
show
me
the
table
as
it
was
in
version:
zero,
zero,
seven
for
example,
and
it
will
return
the
data
as
it
was
back
then
also
for
streaming
as
every
transaction
is,
is
locked
in
the
delta
log.
We
can
also
use
this
information
for
streaming
and
process
the
data
as
it
was
changed
or
added
in
order
for
for
our
streaming
pipelines.
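In PySpark, reading an older version with time travel looks roughly like the sketch below (path, version number and timestamp are made-up examples):

    # Read the table as it was at an older version recorded in the delta log.
    df_v7 = (
        spark.read.format("delta")
        .option("versionAsOf", 7)
        .load("/mnt/datalake/sales/orders")
    )

    # Time travel by timestamp is also possible.
    df_old = (
        spark.read.format("delta")
        .option("timestampAsOf", "2022-06-01")
        .load("/mnt/datalake/sales/orders")
    )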
There is a neat command called DESCRIBE HISTORY, which you can run on your Delta table, and it will basically give you all the information that currently exists in this delta log. You see the version, the timestamp when it was changed, the user ID and username, the actual operation, and there are a lot of other columns that provide even more information; just have a look. I just wanted to show you that you can actually find all this information in an easy-to-use way. That also helps a lot when you're debugging some ETL pipelines and wondering what actually happened to your Delta tables: just have a look at the history and you will very likely get a lot of information from there.
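A minimal sketch of that command, addressing the table by a hypothetical path:

    # Show the transaction history kept in the delta log.
    history = spark.sql("DESCRIBE HISTORY delta.`/mnt/datalake/sales/orders`")
    history.select("version", "timestamp", "userName", "operation").show(truncate=False)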
So how is the delta log actually read? Once you read the checkpoint file, you already have most of the transactions in there, and if there are any transactions after the checkpoint file, the processor also needs to read all the following JSON files. As we create a checkpoint file by default after every 10 transactions, it only has to read a maximum of 10 JSON files in addition to the checkpoint file. Once the delta log is read, the engine basically receives all the files that belong to the latest state of the Delta table, which are basically Parquet files, reads them, and returns the results to the caller. So that's how a query is processed on top of a Delta Lake table.
So how does Delta Lake work in practice, and how are the transactions, the delta log and the actual data files persisted in the data lake? Let's take this very simple example: we have a table of three rows. It was created as a Delta table, so we have one transaction in the delta log folder, and in that transaction we basically added one file, the part-0001 Parquet file, which contains the three rows.
Okay, so that's what it looks like after the update. Now what happens if I delete a row? If I run a DELETE FROM the product table WHERE product equals 'PC', it might be a bit counter-intuitive, because we're actually deleting some rows, but what happens is that it still creates a new Parquet file. It only has two rows, though. So it still adds files, even though I'm running a delete, which is a bit counter-intuitive, right? And again, the delta log entry, the JSON file, contains a remove entry for the previous Parquet file and a new add entry for the new Parquet file that now only contains those two rows.

For an insert it's slightly different, because for an insert we don't actually have to remove anything: the Spark engine that reads the Delta table, or the consumer in general, basically always has to read all the Parquet files that are in the current Delta table and belong to the latest version, so we can simply add a new file. In this case we add one row, and we also add one new file.
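Sketched in SQL run through PySpark, the two statements discussed above would look roughly like this (the products table, its path and its columns are hypothetical):

    # DELETE rewrites the touched file: the delta log gets a "remove" for the
    # old Parquet file and an "add" for a new file with the remaining rows.
    spark.sql("DELETE FROM delta.`/mnt/datalake/products` WHERE product = 'PC'")

    # INSERT just adds a new Parquet file and a corresponding "add" entry;
    # nothing has to be removed.
    spark.sql("INSERT INTO delta.`/mnt/datalake/products` VALUES (4, 'Monitor')")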
So that's pretty straightforward and should give you a quick overview of what each transaction causes in the delta log and also on the storage. As you have seen, even if we delete data, the files still reside on the storage, so basically each transaction can potentially create a new file. An insert or an update obviously always creates new files, and even a delete can create new files. And as you can imagine, if you do a lot of transactions and touch a lot of different files, this can create a lot of files in your storage; it can be millions. Well, it very much depends on your use case and your table, but I guess you get the point. So how do we deal with this issue, with this amount of files? That leads us to the next big part, which is file and storage management.
So if I run this specific VACUUM command, it's actually not changing the data itself. As you can see, the table at the top on the left side is the very same as the table on the right side. But what happens is that it's actually cleaning up our storage: as you can see at the very bottom, the first and the second Parquet file are physically removed. Before, they were only logically removed due to the delete statement or the update statement, and now, once we run the VACUUM command, those files are actually physically deleted, which also frees up the storage, so we don't have to pay for it anymore.

The vacuum transaction is also logged in the delta log: there are specific operations called VACUUM START and VACUUM END, which were added just recently, like a couple of months ago, so that you can also keep track of that and know from the delta log history when the table was actually vacuumed.
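A sketch of that command (the path is hypothetical; 168 hours corresponds to the default retention of 7 days):

    # Physically delete files that were logically removed more than
    # 168 hours (7 days) ago; the current table version is never touched.
    spark.sql("VACUUM delta.`/mnt/datalake/products` RETAIN 168 HOURS")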
Okay, the second command that I want to mention here is the so-called OPTIMIZE. OPTIMIZE basically consolidates smaller files into larger files. A pain point for most big data processing engines is that if you have a very large number of very small files, the metadata overhead of reading those single files can be a potential bottleneck. So Delta Lake introduced the OPTIMIZE command, which basically consolidates small files, like the one with two rows and the one with one row, into a new file that contains all the rows. It's a very simple example here, but you get the point of OPTIMIZE. Again, the data stays the same; as you can see in the transaction, we logically remove two files and add a new one, but on the storage we now have the data twice again, even though basically nothing changed, right?
So what does VACUUM do? As I said, it basically physically removes files from the actual storage. It's a bit more complicated than what I've shown in the example, because it's not removing all files, but only files that have been deleted at least x days or hours ago, so files that have been outdated for longer than a given retention period. You can basically run VACUUM whenever you want; it will never have an effect on the most recent version of the Delta table. What you need to keep in mind, though, is that once you remove the physical files, you obviously cannot use time travel anymore and go back to, let's say, the table as it was seven or ten days ago if you ran VACUUM with a retention period of five days, because then everything that was deleted more than five days ago has also been physically removed. You would get an error message saying that the actual file doesn't exist anymore.
For OPTIMIZE, as I mentioned, it basically collapses small files into bigger files, but there are also some other features, mainly Z-ordering, which is a kind of clustering and ordering of the data that allows for better file skipping, similar to partition pruning. OPTIMIZE, as the name implies, is mainly used to optimize query performance, so that you don't have the overhead of the small files but can ideally read files that are already around one gigabyte, or a size in that range.
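Roughly, assuming a Delta Lake version (or Databricks) where the OPTIMIZE and ZORDER BY SQL commands are available, and with a hypothetical path and column:

    # Compact small files into larger ones.
    spark.sql("OPTIMIZE delta.`/mnt/datalake/products`")

    # Optionally cluster the data by a column so that file-level statistics
    # allow better data skipping on that column.
    spark.sql("OPTIMIZE delta.`/mnt/datalake/products` ZORDER BY (product)")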
Okay, some additional information for VACUUM and OPTIMIZE. VACUUM also has a DRY RUN parameter, which just tells you which files it would remove. The issue with the current implementation of VACUUM is that it would show you, like, a thousand files that would potentially be deleted if you ran the actual VACUUM command, but you don't know how many files there are in total. So it could be a thousand and one files, but it could also be a hundred thousand files; you just don't know. A little trick that you can use: if you execute the VACUUM command in a Scala notebook, or in a Scala cell, it actually prints out the number of files that it would delete, which is not the case if you run the very same statement in Python or in SQL, for example.
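The dry run mentioned here would look roughly like this (path and retention are again made up):

    # List (but do not delete) the files that a real VACUUM would remove.
    spark.sql(
        "VACUUM delta.`/mnt/datalake/products` RETAIN 168 HOURS DRY RUN"
    ).show(truncate=False)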
One thing to keep in mind when you run a VACUUM is that it can take a very long time. We had cases at customers where we were basically deleting a couple of million files, I think 30 million, and the job ran for, I don't know, like a week. The main reason for that was that the deletion was actually a single-threaded operation running on the driver only; it didn't really leverage the whole cluster and didn't distribute the load.
Another useful command is RESTORE. This can be very useful if you accidentally deleted data, or ran some ETL or data pipelines that modified the data and you're actually not happy with the output: you can just restore the previous version. It's super convenient, and what's also nice about it is that it doesn't really copy any data; it's just a metadata-only operation that makes your Delta Lake table point to a different version of the files. The restore operation also creates a new version, so it will also be logged in the delta log, which is quite good actually.
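A sketch of a restore, assuming a hypothetical table registered in the metastore as sales.orders:

    # Roll the table back to an earlier version; only metadata changes,
    # and the restore itself is logged as a new version in the delta log.
    spark.sql("RESTORE TABLE sales.orders TO VERSION AS OF 7")

    # Restoring by timestamp is also possible.
    spark.sql("RESTORE TABLE sales.orders TO TIMESTAMP AS OF '2022-06-01'")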
The second function, CLONE, is also very important, especially for testing. CLONE basically allows you to clone an existing Delta Lake table into a different path. There are two options: you can use a shallow clone or a deep clone. A shallow clone basically just copies, or forks, the delta log, and every operation that you run on top of the shallow clone will not touch the original table anymore; it will only touch your shallow clone and do all the changes there. So if you want to test some data pipeline, the safest way is probably to create a shallow clone up front and then run the pipeline on that shallow clone.

The other option is the so-called deep clone: in addition to the delta log that is copied, it also copies all the data files. So you need to be aware that if you have a big table with a couple of terabytes, it will actually copy the whole table, and again, that can take some time.
Okay, some additional information about RESTORE and CLONE. You can run a restore as often as you want. As I mentioned, a restore also creates a new version in the delta log, so if you are, for example, not happy with your restore and you want to go back to the original version, you can basically restore the restored version. That's just nice to know. And as I mentioned before, it doesn't create any new data files, it's just a metadata operation, so you cannot really break anything there.

Regarding clones, one very important feature, especially about deep clones, is that they can be incremental. If you run a CREATE OR REPLACE ... DEEP CLONE command, it will actually do incremental updates to your clone, which is super convenient and can be used, for example, for backups. So if you want to back up a Delta Lake table, using deep clones, or incremental deep clones, is actually a very good and very efficient solution, because you don't need to scan the storage; everything can basically be done using the delta log only. That's really convenient.
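Sketched in SQL (table names are hypothetical; CLONE needs a Delta Lake or Databricks version that supports it, and deep clones in particular are a Databricks feature):

    # Shallow clone: fork only the delta log; changes to the clone never
    # touch the original table's data files.
    spark.sql("""
        CREATE OR REPLACE TABLE sales.orders_test
        SHALLOW CLONE sales.orders
    """)

    # Incremental deep clone: copies the delta log and the data files;
    # re-running it only transfers what changed, which makes it a cheap backup.
    spark.sql("""
        CREATE OR REPLACE TABLE backup.orders
        DEEP CLONE sales.orders
    """)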
As you have probably been working with big data already, I guess you are familiar with partitioning anyway, but I want to briefly cover what we want to achieve when we create partitions for our Delta tables. Basically, we partition our data so that either our file management is easier or our query performance is better. For the lower layers, bronze and silver, the tables are usually partitioned for ETL performance. So, ideally, if you load data into bronze, the data that you load matches exactly one partition. For example, if you get a daily export from your source system, you would usually create one partition per day for each export that you receive. For silver it's usually similar, and ideally you can just use the same partitioning concept as on bronze, because then you can copy or reload silver very easily by just replacing partitions.
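As a small sketch, a daily bronze load partitioned by its load date (the input data frame, columns and path are hypothetical):

    from pyspark.sql import functions as F

    # Append today's export into bronze, partitioned by the arrival date,
    # so one daily load maps to exactly one partition.
    (
        daily_export_df.withColumn("load_date", F.lit("2022-06-27"))
        .write.format("delta")
        .mode("append")
        .partitionBy("load_date")
        .save("/mnt/datalake/bronze/orders")
    )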
But this again very much depends on your use case and whether that's actually possible. For gold, which is usually designed towards query performance, because end users query gold, performance should be your main priority, which means that your partitioning can vary. For example, say you receive data on a daily basis in bronze and silver and process it there, and the data contains a start date that is not aligned with that partitioning: you can receive data today which has a start date from, I don't know, last year, for example. If users usually query by this start date column, it's a good idea to change the partitioning in gold to use that start date column, because then every query that uses the start date can already be filtered down to the partitions that actually contain the data, without having to read and scan the whole table, which obviously is very time-consuming in general.
In this case, you need to change the partitioning, as I said, between silver and gold. That's just something that you need to keep in mind: changing partitioning can be very complicated and resource-intensive, because potentially every new batch that you load from silver touches many gold partitions. We had cases where loading one batch basically rewrote the whole gold table, and as you can imagine, that's not really efficient in terms of processing time, but also not in terms of storage, because we know that whenever we run an update or a merge it's actually copying the data, and if we touch the old partitions of the gold table we basically copy all that data again, thereby consuming twice the amount of storage that the actual table would need.
To make your queries more efficient, and also your ETL, it's advisable to always explicitly specify the partitioning columns when you, for example, merge into a table, delete data, or also select data. As you can already see from the example that I've given you, a good candidate for partitioning is usually time, but again, it very much depends on your use case. Let's say 95 percent of the Delta Lake tables that I've seen were at some point partitioned by time, whether it's the arrival time of the data in bronze and silver or some event time or start time in gold; it's usually related to time. Depending on your requirements and your data, you can add additional columns to your partitioning. One thing to keep in mind: if you have too many partitions, you again potentially run into trouble with a large overhead of reading all the metadata, which is something that you would like to avoid.
As a rule of thumb, you should have a few thousand partitions at a maximum, and ideally a single partition should be one gigabyte or bigger; it doesn't make sense to have a partition that's only one megabyte. However, it very much depends on your data, on the distribution of your data, on your query patterns, on your ETL patterns and so on.
Another thing that I need to mention is so-called generated columns. For some months now, Delta Lake has had a feature called generated columns, which basically allows you to derive data from existing columns and persist it in new columns. A very common use case is if you have an event timestamp column and you want to partition the table by the date, or the year, of that event timestamp: you can use a generated column that basically extracts this information from the timestamp column and populates it automatically. The cool thing about this is that if you follow that approach, the Delta engine will also try to push the filters that you have on a query down to the partitions.
So if you have a setup like this, where you have an event date column generated based on the event timestamp, and you run a query like SELECT * FROM my_table WHERE event_date = '2022-06-27', then it will basically do that push-down automatically for you. And even if you create more specific filters on the original event timestamp column, those will also be pushed down to the actual partitions, which makes your queries much more efficient. There are some functions that can be pushed down and some others that can't; I included the link here showing which ones those are, so if you're not sure, just have a look there.
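A sketch of such a setup using the delta-spark Python builder API (table and column names are hypothetical):

    from delta.tables import DeltaTable

    # event_date is derived from event_ts and used as the partition column;
    # filters on either column can then be pruned down to partitions.
    (
        DeltaTable.createOrReplace(spark)
        .tableName("events")
        .addColumn("event_id", "BIGINT")
        .addColumn("event_ts", "TIMESTAMP")
        .addColumn("event_date", "DATE", generatedAlwaysAs="CAST(event_ts AS DATE)")
        .partitionedBy("event_date")
        .execute()
    )

    # Both of these queries can be pruned to the matching partition(s).
    spark.sql("SELECT * FROM events WHERE event_date = '2022-06-27'")
    spark.sql(
        "SELECT * FROM events "
        "WHERE event_ts >= '2022-06-27 08:00:00' AND event_ts < '2022-06-27 09:00:00'"
    )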
Delta Lake supports transactions and also supports multiple updates at the same time, as long as each of the updates only touches a specific partition. So you can have 10 concurrent updates if each of those updates only touches one single partition and those partitions are not overlapping; then that's just fine, because Delta will manage it. So make sure that you always specify those partitions when you, for example, run a merge. This can also speed up the merge process itself, because it then knows in advance which partitions to scan for changes; otherwise, if you do not specify the partitions on the target, it will actually scan the whole target table.
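A sketch of a merge where the partition column shows up explicitly in the join condition (the schema is hypothetical, and an updates view with matching columns is assumed to exist):

    # Including the partition predicate in the ON clause lets Delta restrict
    # both the scan and the conflict detection to those partitions.
    spark.sql("""
        MERGE INTO gold.sales AS t
        USING updates AS s
          ON  t.sale_date = s.sale_date
          AND t.sale_date >= '2022-06-01'   -- explicit partition filter
          AND t.sale_id   = s.sale_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)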
If you want to know whether your partition filters have actually been used in, for example, a merge statement, you can check the delta log history and search for the query predicates that were used when running the merge statement; ideally you will see your partition filters there.
Some more information on partitioning: on the right side we basically see the raw delta log entry for when a new file is added as part of a transaction. As you can see, in lines three, four and five there is the partitionValues object, which contains all the partitioning values that this file belongs to. Usually, if you have a look at a Delta table, you probably think that the partitions are resolved based on the path, the folders and the subfolders, but that's actually not the case; the path could point anywhere. The only thing that's actually important are those partition values.
For example, if you create a clone, the path points somewhere else and does not necessarily have to contain those folders and subfolders, but by default that's the way your data files are laid out, actually just for historical reasons, as it was the same for Hive in the past. What you will also notice is that the actual physical Parquet files do not contain the values for those partitioning columns. It doesn't really make sense, if you have Parquet files with one million rows, to store the same sales territory key, or the same date, one million times with the same value. So it is just omitted there and is only available from the delta log, and obviously the Delta engine, or the reader, has to pick up the value from there.
Another thing that's slightly different from other processing engines is that you don't have to specify all the partitioning columns sequentially. If you have, let's say, five partitioning columns, you don't need to specify them all; you can also specify only the last one, as it is resolved in the delta log only: it just uses your partition filter predicates to filter the delta log and then retrieves the files that still match your filters.
When it comes to streaming, Delta Lake streams are also processed as micro-batches, and the lowest granularity that we can actually stream is a file, or, well, a part of a file, for example a Parquet file. When you're reading from a Delta Lake table in a stream, the files are actually processed in order: first in order of the version, obviously, and within a version, if one transaction creates multiple files, you will see the physical files named part-00000, part-00001, part-00002, and that's the order in which they are processed in the stream from that table. To make streaming possible, and to basically save the state of what has already been processed, so-called checkpoints need to be created, not to be confused with the checkpoint files in the delta log.
You always need one checkpoint per source, and you can technically also stream from the same source multiple times if you use different checkpoints.

When you do streaming and you need to use MERGE: a merge is only available inside the foreachBatch command. You cannot, or, well, it doesn't make sense to, stream each individual row and run a merge statement for each of them, so what you actually need to do is run the merge as part of this foreachBatch function.
What you need to watch out for is the size of your trigger and of your batch, so how much data you actually read in one batch. This can be controlled with mainly two properties when you start the stream: maxBytesPerTrigger and maxFilesPerTrigger. There are different triggers that you can use. One of them is trigger once, which you should actually avoid, because it processes everything in one big batch, which is definitely not what you want; it is very likely to cause an out-of-memory error if you stream initially from a big table. There is another one, called trigger availableNow, which creates batches in the size that you specify, whereas trigger once ignores maxFilesPerTrigger and maxBytesPerTrigger. And yeah, you can basically stop and resume a stream at any time.
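Putting those pieces together, a rough sketch (paths, table names and batch sizes are made up, and trigger availableNow requires a Spark version that supports it):

    from delta.tables import DeltaTable

    # Read the Delta table as a stream, capping the micro-batch size by file count.
    source = (
        spark.readStream.format("delta")
        .option("maxFilesPerTrigger", 100)
        .load("/mnt/datalake/silver/orders")
    )

    def upsert_batch(batch_df, batch_id):
        # MERGE can only run per micro-batch, inside foreachBatch.
        target = DeltaTable.forName(spark, "gold.orders")
        (
            target.alias("t")
            .merge(batch_df.alias("s"), "t.order_id = s.order_id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )

    (
        source.writeStream
        .foreachBatch(upsert_batch)
        .option("checkpointLocation", "/mnt/checkpoints/gold_orders")
        .trigger(availableNow=True)   # drain the backlog in bounded batches, then stop
        .start()
    )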
We have a scenario at one of our customers where we basically implemented a streaming architecture, but it's not running 24/7: we just start the pipeline every now and then, like once a day, it processes everything that has been accumulated since the previous execution, and then we stop the pipeline again, because it's just more cost-efficient and we don't need to have real-time data.
Now to Delta Lake table properties: you can use properties on a Delta Lake table directly, or also in the Spark context, to control how Delta works. There are some more important ones, and some that I'm not mentioning; you'll see them on the next slide. What's important, though, since you can specify them on different levels, is to know which ones are actually used: if you have specified a property on the Delta table itself and you have also configured the corresponding Spark setting, the setting from your execution context will always override what you have specified as the table property.
These are just some important table properties that you should know. If you want to know the details, you can simply google them or look them up in the references that I'll show you afterwards. And if you want to make your data management and your Delta Lake tables very efficient, I usually recommend always running commands like OPTIMIZE or VACUUM with the defaults and specifying the exceptions on the table level. So, for example, if I want one table to delete files older than three days instead of the default of seven days, I can just specify that exception on that table and then run a plain VACUUM command on it; it will then use the setting from the table, and if I run the VACUUM on another table which doesn't have the table property defined, it will just use the default of seven days.
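A sketch of that pattern (the table names are hypothetical; the property is the standard Delta retention property for deleted files):

    # Exception: this one table only keeps 3 days of deleted files.
    spark.sql("""
        ALTER TABLE silver.events
        SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 3 days')
    """)

    # A plain VACUUM now picks up the table property (3 days) here ...
    spark.sql("VACUUM silver.events")

    # ... and falls back to the default retention (7 days) on tables
    # without the property.
    spark.sql("VACUUM silver.customers")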
Okay, so let's come to the conclusion of my session and the takeaways. Delta Lake can obviously solve a lot of problems, especially when it comes to data management, updating and merging data, and doing ETL and data pipelines. However, due to the nature of the delta log and how it handles versions and data files, file management is super crucial, because, as you have seen, each transaction can potentially create new files, and if you run, for example, an OPTIMIZE on a whole table, it will actually, at least in the first run, duplicate the whole table physically.
So data maintenance jobs are absolutely mandatory: you should have a VACUUM job, and an OPTIMIZE job, though the more important one is the VACUUM job I guess, scheduled on a regular basis, ideally daily, or maybe even weekly or monthly if that also works for you; you just need to check. But you should definitely have some in place. And, as I just mentioned before, use table properties to manage the different settings of your Delta Lake tables.
If you want to have a look at the internals and more details on the Delta Lake transaction log protocol and on Delta Lake itself, there are some very good links that basically describe what's happening underneath and, for example, what table properties you can use and what they actually do. These are just two very important resources and references. And yeah, that's it. Thank you from my side for having you in my session. If you have any questions, please feel free to reach out to me; you have all my contact details on the first slides, so just drop me an email or ping me on Twitter if you need anything. Thank you.