The Pace of Innovation in Delta Lake - Vini Jaiswal, Delta Lake
A: I had a keynote this morning where I talked about lakehouses and how they emerged as the new modern data architecture, because they add reliability, performance, and quality features on top of existing data lakes, so that you can make sense of your data and use it for critical decision making. So thank you to all who attended. One of the most important parts of that was Delta Lake, which is the foundation of lakehouses. Before diving into the project, let's look at why we care in the first place.
A: Companies have already adopted this system called data lakes, and the promise of data lakes is that you can take all your data, whether it's unstructured or structured, and dump it into a file system on S3, Google Cloud Storage, or Azure Blob Storage. This is a really powerful concept when you compare it to traditional databases, because with a traditional database you have to come up with a schema and do a lot of pre-processing and cleaning.
A: What a data lake allows you to do is forego that whole process and just start collecting everything, because sometimes you don't know that data is valuable until much later, and if you don't store it, then you have lost it. Think about the many powerful use cases you could have improved, or the innovations you could have brought to your business, if you had access to that data. Unfortunately, what happens when you collect all that data in a data lake is that the data at the beginning of the pipeline is of bad quality, which in turn means the bad quality flows through the entire pipeline to the more advanced processes that rely on it. Machine learning models and AI built on top of that data now become unreliable, and as a consequence so does the work of the data scientists and business leaders who are trying to extract meaningful information out of their data.
A: Through this talk I will cover data engineering challenges, the solutions that modern data technologies and open source projects have to offer, and how and where Delta Lake fits into the picture. Most importantly, if you find the technology useful, I will share where and how you can become a part of it.
A: Maybe the cluster failed because you relied on an EC2 spot instance and that spot instance is now lost. When a job fails halfway through, you have to think about the corrupted data, because it fell over in the middle and now you have half-written, partial data and so on. That needs to be cleaned up.
A: Data validation is vital for any data engineering pipeline, because machine learning and AI applications depend on it. If there is no way to gauge whether something about the data is broken or inaccurate, the worst case is that you cannot identify data errors at the beginning of the pipeline, so you can corrupt the whole pipeline. Plain data lakes don't offer any kind of schema enforcement or data quality guarantees.
A
This
is
a
screenshot
from
one
of
my
projects
where
I
have
a
parking
table
with
over
14
000
rows
and
four
columns,
and
now
I
have
a
streaming
job
that
appends
the
data
into
this
table.
Let's
see
what
happens
after
my
stream,
query
is
run
so
looks
like
my
streaming.
Job
went
through,
however,
my
table
now
has
51
records
and
it
has
two
extra
columns
that
I
didn't
expect.
So
what
really
happened
there?
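Roughly what that scenario looks like in code: a minimal sketch (the paths, column names, and toy source are made up) of a streaming append landing columns the table never declared, with nothing in a plain Parquet data lake to stop it.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("schema-drift-demo").getOrCreate()

# Toy stream (the rate source emits `timestamp` and `value`), plus two columns
# the original four-column table never had.
drifting_stream = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
                   .withColumn("extra_col_1", F.lit("surprise"))
                   .withColumn("extra_col_2", F.rand()))

(drifting_stream.writeStream
    .format("parquet")                                    # plain data lake sink: no schema checks
    .option("checkpointLocation", "/tmp/checkpoints/loans")
    .option("path", "/tmp/tables/loans_parquet")           # pre-existing table directory
    .start())
```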
A: With the increasing amount of data that is collected in real time, companies also need ways to reliably perform updates, merges, and deletes so that their data can remain up to date at all times. With traditional data lakes it can be incredibly difficult to perform simple operations like these and to confirm that they occurred successfully.
A
This
is
how
it
happens
in
legacy
data
pipelines
to
insert
or
update
a
table.
A
data
engineer
has
to
find
new
rows
to
be
inserted.
Then
they
have
to
identify
what
rows
they
should
be
replaced
by
or
updated
identify
all
the
rows
that
are
not
impacted
by
the
insert
or
update,
create
a
new
temp
table
based
on
all
these
insert
statements.
Now
that
delete
the
original
table
with
the
wrong
records,
and
then
you
have
to
rename
the
temp
table
drop
the
tem
table.
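For contrast, here is roughly what that whole upsert procedure becomes once the table is stored in Delta Lake, which the rest of the talk introduces. This is a sketch using the delta-spark Python API with made-up table paths and columns, not the exact pipeline from the slide.

```python
from delta import DeltaTable, configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (SparkSession.builder.appName("delta-merge-demo")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# New and changed rows arriving from upstream (illustrative values).
updates = spark.createDataFrame([(1, "active"), (42, "new")], ["id", "status"])

target = DeltaTable.forPath(spark, "/tmp/tables/loans_delta")
(target.alias("t")
       .merge(updates.alias("s"), "t.id = s.id")
       .whenMatchedUpdateAll()        # rows that already exist are updated in place
       .whenNotMatchedInsertAll()     # new rows are inserted
       .execute())                    # one atomic commit; no temp tables or manual renames
```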
A: Those are some of the problems that I have seen. I don't know if that resonates with you; hopefully it does. So how are we solving those reliability and quality problems? We need something so that data can be reliable and used for production applications, right?
A
Delta
lake
file
structure
actually
consists
of
two
main
components.
The
first
component
is
the
data
objects.
The
data
objects
are
stored
as
parquet
files
in
the
scalable
storage,
like
s3
azure,
google
cloud
storage
second
component
is
the
scalable
transaction
log.
One
of
the
really
cool
properties
of
delta
lake
is
that
it's
as
highly
available
as
cloud
data
lakes,
so
you
automatically
get
all
the
same
availability,
scalability
and
flexibility
that
cloud
providers
have
already
baked
into
their
flagship
service.
A: These files carry meaningful transaction information, like indexes and statistics, to ensure that you have all the metadata about every commit and every change that happened in the table. Delta Lake also periodically takes checkpoints in the same log folder; you can see that there are some checkpoints added, and a checkpoint is sort of like a shortcut to fully reproduce a table state. This is useful because it allows a query engine to avoid reprocessing the entire log.
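A rough sketch of what this looks like on disk, with made-up paths: the data objects are top-level Parquet files, every commit is a numbered JSON file under _delta_log/, and checkpoints summarize the log so engines can avoid replaying it from the start.

```python
# Illustrative layout of a Delta table directory:
#
#   /tmp/tables/loans_delta/
#   ├── part-00000-....snappy.parquet                 # data objects
#   ├── part-00001-....snappy.parquet
#   └── _delta_log/
#       ├── 00000000000000000000.json                 # commit 0
#       ├── 00000000000000000001.json                 # commit 1
#       ├── ...
#       └── 00000000000000000010.checkpoint.parquet   # periodic checkpoint
import os

for root, _, files in os.walk("/tmp/tables/loans_delta"):
    for name in sorted(files):
        print(os.path.join(root, name))
```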
A: The actions of each write are then recorded in the transaction log that we saw earlier, as ordered atomic units known as commits. Atomic means that in a Delta table, either a data file is stored in its entirety or it isn't stored at all; the write will simply fail. Through these commits you can access any historical version of the data, and I will show you how that is useful later on in a slide. So now we have an understanding of how the logs are stored.
A
Delta
lake
allows
the
same
logs
to
be
accessed
by
multiple
users
and
also
allows
readers
and
writers
to
perform
actions
at
the
same
time-
and
you
know
you
might
see
like
how,
how
does
that
happen,
and
this
is
fine,
because
there
no
one
is
going
to
see
any
changes
until
you
have
successfully
committed
a
change
and
log
says
that
the
table
is
now
this
version
of
the
table.
So
basically,
this
is
made
possible
by
snapshot,
isolation,
property
of
delta
lake.
A
In
this
figure,
for
example,
one
of
the
query
is
actually
doing
an
insert
and
another
one
is
updating
the
table
with
the
file:
zero,
zero,
three
dot
parking,
but
only
one
of
them.
Those
actions
will
succeed.
Either
a
reader
will
be
able
to
see
zero
zero,
one
plus
zero,
zero,
two
dot
par
k
or
only
zero,
zero.
Three
dot
parking.
A
Now,
what
happens
when
multiple
writers
want
to
update
the
same
table?
We
talked
about
readers
and
writers.
At
the
same
time
now,
let's
talk
about
multiple
writers,
because
delta
lake
has
optimistic
concurrency
control.
Multiple
writers
can
concurrently
modify
a
delta
table
by
agreeing
on
the
order
of
changes.
A
For
example,
when
you
have
a
query
like
this
one,
where
there
are
two
writers
trying
to
modify
a
delta
delta
table
with
002.json,
only
one
of
their
changes
will
succeed
and
other
change
will
fail
in
this
case
writer,
zero,
zero,
zero,
zero
writer,
two
with
zero
zero,
two
dot
json
will
be
successful
because
of
the
agreement
of
order
of
changes.
A: Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table's existing schema. It's like the front desk manager at a busy restaurant who only seats guests who have a reservation and turns you away if you don't have one.
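A minimal sketch of that safeguard in action, assuming the Delta-configured `spark` session from the earlier merge sketch and the same made-up table path: an append whose columns don't match is rejected instead of silently landing.

```python
from pyspark.sql.utils import AnalysisException

bad_rows = spark.createDataFrame([(1, "active", "surprise")],
                                 ["id", "status", "unexpected_col"])
try:
    bad_rows.write.format("delta").mode("append").save("/tmp/tables/loans_delta")
except AnalysisException as e:
    # Delta refuses the append because of the extra column that isn't in the table schema.
    print("write rejected:", e)
```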
A
Moreover,
the
schema
evolution
feature
of
delta
lake
allows
users
to
easily
change
the
table's
current
schema
so
that
you
can
now
accommodate
changing
data
over
time,
and
you
know
there
are
two
modes:
it
supports
append
and
overwrite,
so
you
can
just
give
dot
options
schema,
merge
schema,
and
it
will
understand
that
this
is
a
merge
schema
and
it
won't
actually
replace
the
whole
file.
It
will
add
separate
records
as
well
as
separate
two
columns.
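A short sketch of the same append succeeding once schema evolution is requested, re-using the assumed `spark` session and table path from the previous sketches.

```python
evolving_rows = spark.createDataFrame([(2, "active", "new info")],
                                      ["id", "status", "unexpected_col"])
(evolving_rows.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")     # evolve the table schema instead of rejecting the write
    .save("/tmp/tables/loans_delta"))
# Existing rows show NULL for the new column; new rows carry both the old and new columns.
```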
A
Now
that's
about
how
delta
lake
works
behind
the
scenes.
If
you
remember,
I
talked
about
versioning
and
how
it
solves
some
of
the
use
cases.
So,
let's
look
at
those
so
delta
data
versioning
is
one
of
the
most
helpful
feature
in
lot
of
different
ways.
We
have
seen
you
know
with
regulations
as
well
as
audit
auditing.
If
there
is
data,
of
course,
somebody
wants
to
audit
it
right.
A: How that information can be used here is that, since Delta Lake records every action that has been performed, it also captures in the metadata the schema, the files that were impacted, and so on. So you can use DESCRIBE HISTORY to see all the commits and look at the history of table changes; that's how you can address auditing. The second use case is reproducing experiments. During model training, data scientists run various experiments with different parameters on different data, and after a period of time they revisit those experiments to reproduce the models.
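A quick sketch of both pieces, again assuming the Delta-configured `spark` and the made-up table path: inspecting the commit history for an audit, and time-traveling to an earlier version or timestamp of the table.

```python
# Every commit, who/what produced it, and which operation it was:
spark.sql("DESCRIBE HISTORY delta.`/tmp/tables/loans_delta`").show(truncate=False)

# Read the table as of an earlier version or an earlier point in time.
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/tmp/tables/loans_delta"))

as_of_june = (spark.read.format("delta")
              .option("timestampAsOf", "2022-06-01 00:00:00")   # illustrative timestamp
              .load("/tmp/tables/loans_delta"))
```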
A
Typically,
the
source
data
has
been
modified
by
somebody
else
right
a
lot
of
change.
At
times,
though,
the
changes
were
caught,
unaware
of
and
upstream
data
teams
can
actually
sometimes
modify
it
without
even
telling
you
so
delta
lake
time.
Travel
capabilities
works
well
in
conjunction
with
another
popular
linux
foundation,
project
called
ml
flow.
A
So
for
reproducible,
machine
learning,
training,
you
can
simply
log
a
timestamp
url
to
that
path
as
an
ml
flow
parameter
so
that
you
can
track
all
the
changes
versions
of
the
data
which
was
used
for
training
and
so
on,
so
that
capability
of
time
travel
allows
you
to
go
back
in
earlier
stages
and
settings
for
those
data
sets
and
reproduce
earlier
models,
and
to
do
that,
you
also
have
to
make
sure
that
that
data
historically
has
has
been
retained
somewhere.
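A minimal sketch of that pairing, with illustrative names and values: the run records exactly which snapshot of the Delta table it trained on, so the same snapshot can be re-read later.

```python
import mlflow

data_path = "/tmp/tables/loans_delta"        # assumed table path from earlier sketches
as_of = "2022-06-01 00:00:00"                # snapshot used for this training run

with mlflow.start_run():
    mlflow.log_param("data_path", data_path)
    mlflow.log_param("data_timestamp", as_of)    # enough to re-read the exact same snapshot
    train_df = (spark.read.format("delta")
                .option("timestampAsOf", as_of)
                .load(data_path))
    # ... train a model on train_df and log it with mlflow ...
```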
A
A
A
So
this
can
be
very
useful
during
reading
of
a
delta
table
where
you
can
skip
reading,
keep
reading
the
files
that
are
not
matching
specific,
specific
condition.
For
example,
if
you
have
like
different
set
of
groupings,
it
will
skip
the
groups
and
only
look
for
the
groups
where
that
data
is
could
be
present.
A
So
in
the
example
here,
our
query
is
actually
looking
for
events
triggered
by
user
id
24
000,
and
you
can
see
those
groupings
here,
file,
one
dot
parking
file,
two
dot
parking
file,
three
dot
parque
in
those
first
two
groupings-
user
id
24
000-
will
not
be
found.
So
it
will
skip
those
two
groupings
and
look
for
file
three
dot
parque.
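A conceptual sketch of the same idea (file names, stats, and paths are made up): Delta keeps per-file min/max statistics in the transaction log, so a filtered read only opens files whose value range can contain the filter value.

```python
# Per-file statistics recorded in the log (illustrative):
#   file1.parquet  ->  user_id: min=1,      max=10000
#   file2.parquet  ->  user_id: min=10001,  max=20000
#   file3.parquet  ->  user_id: min=20001,  max=30000
#
# Only file3.parquet can contain user_id 24000, so the other files are skipped.
events = spark.read.format("delta").load("/tmp/tables/events")
events.where("user_id = 24000").show()
```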
A
So
it's
easy
for
a
delta
to
skip
the
records
that
don't
match
the
query
condition
and
you
save
a
lot
of
bucks
now.
Another
cool
feature
is
generated
columns.
Let's
say
you
have
a
table
that
have
a
timestamp.
We
deal
with
a
lot
of
timestamps
right
because
we
have
to
find
records
from
like
specific
dates
and
years,
so
you
don't
have
to
write
partition
by
timestamp.
That
would
result
in
way
too
many
partitions.
A
Instead,
you
want
to
particularly
partition.
By
date
you
can
just
do
a
column
adjustment
that
takes
the
time
stem
and
converts
it
into
a
date,
but
you
need
to
do
that
manually.
So
remember.
I
remember
my
sequel
days
when
I
had
to
deal
with
a
lot
of
time.
Stamps
and
like
different
tools
did
not
match
this
query
formats,
so
you
have
to
play
around
with
time
stamps
but
delta
lake
has.
This
feature
called
generated,
columns
which
automatically
calculates
those
dates
and
times
times
and
values
for
you.
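A sketch of a generated column, issued as SQL DDL through spark.sql with made-up table and column names. Depending on your Delta Lake version, you may need to create the table through the DeltaTable builder API instead of SQL.

```python
spark.sql("""
  CREATE TABLE IF NOT EXISTS events (
    event_id    BIGINT,
    event_time  TIMESTAMP,
    event_date  DATE GENERATED ALWAYS AS (CAST(event_time AS DATE))
  )
  USING DELTA
  PARTITIONED BY (event_date)
""")
# Writers only supply event_time; Delta fills in event_date, so you can partition
# and filter by date without maintaining that column by hand.
```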
A: This co-locality is automatically used by the Delta Lake data skipping algorithm that we saw earlier to dramatically reduce the amount of data that needs to be scanned. To Z-order data, you simply specify the columns that you want to Z-order by, and queries on the most common columns can then find the data in the same files rather than having to look at many files.
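A one-line sketch of what that looks like in practice, assuming the made-up events table from the earlier sketches; OPTIMIZE with ZORDER BY is available in recent Delta Lake releases.

```python
# Rewrite the table's files so rows with nearby user_id values are co-located,
# which makes the min/max statistics much more selective for user_id filters.
spark.sql("OPTIMIZE delta.`/tmp/tables/events` ZORDER BY (user_id)")
```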
A
A
You
want
to
use
it
for
and
that's
because
we
we
released
the
delta
standalone
a
few
months
ago,
so
it
can
integrate
well
with
other
ecosystem
projects
and
also
it's
available
from
a
wide
variety
of
languages,
a
wide
variety
of
services-
and
there
are
some
popular
connector
tools
for
data
engineers
that
it
integrates
with.
So
you
can
query
many
different
databases
and
also
delta
lake
is
multi-cloud,
so
it
operates
on
aws,
google
cloud
and
azure.
A: It's been a long, very exciting, and thrilling journey to get here, ever since Delta Lake was open sourced in April of 2019 at Spark Summit, which is now the Data + AI Summit. You can go to the delta.io website and check out our blogs and release notes, and you will find each of these features and how they are used. And we are not done: the community is always looking at and working on bringing more innovation to data engineering, and these are some of the features the community is working on.
A
You
can
stay
tuned
with
the
updates
on
our
roadmap
through
github,
where
we
discuss
all
the
features
and
track
progress
there
and
open
source
projects
cannot
be
here
without
thriving
community
of
users
and
developers,
so
lots
of
organizations
have
adopted
and
are
contributing
to
delta
lake.
This
is
just
a
subset
of
many
organizations
that
are
running
that
workloads
on
delta
lake
and
the
work
that
they
are
doing
is
very
transformational
collectively.
A
More
than
an
exabyte
of
data
gets
processed
per
day
on
delta
lake
and
we
have
an
engaged
very
active
community
of
slack
users,
which
is
more
than
6
000,
now
very
exciting,
and
more
than
50
or
50
companies
have
contributed
to
data
lake.
A
So
while
we
have
exciting
momentum
going
on
in
the
community,
I
want
to
encourage
you
to
get
involved.
There
are
a
bunch
of
channels
how
you
can
get
involved
with
delta
lake
check
us
out
on
slack
check
us
out
on
youtube
channel
where
you
will
find
tech
talks,
live
q,
a's
demos
there's
a
mailing
list.
If
you
want
to
get
involved
with
the
actual
code,
you
can
always
join
on
github.
A: We have the Data + AI Summit next week, where there are over 17 sessions just on Delta Lake, and of course, since it's a data and AI summit, you get to hear a lot of use cases from different companies as well as different projects the community is working on. Specific to Delta Lake, Michael Armbrust is also going to give a keynote, there are some cool AMAs with the committers, and we are celebrating Delta Lake's third birthday party.
A
So
you
can
join
in
person
in
san
francisco
or
you
can
totally
join
it
from
the
comfort
of
your
couch
online
for
free
thanks
a
lot
for
joining
me
today
and
yeah.
I
can
take
any
questions
you
may
have.
B: Yes, yeah, I'm curious. I think the way the Parquet data files are used to manage the data, and the ability to go back in time and branch off later, makes sense. When you go ahead and change the schema of your table, when you update the schema of your table, is that also stored in a similar format? Something comparable?
A
Yeah,
that's
a
good
point,
so
every
change
that
you
do
either
it's
a
schema
change
or
a
data
change.
It
will
add
a
new
version
of
commit
to
that
delta
lake,
folder
delta
log
folder,
and
you
see
that
you
remember.
I
showed
you
the
two
file,
the
delta
log
log
directory
as
well
as
well
as
data
files.
So
it
will
keep
all
those
changes
in
that
data
delta
log
directory.
C: Thank you for presenting. My question: if I heard you correctly, there is a retention policy on the Delta log, then?
A
Yeah,
so
typically,
there
is
a
retention
policy
as
well
as
vacuum,
so
retention
policy
is
30
days.
So
anytime
you
have
a
changes
to
data,
it
will
store
it
for
30
days.
Of
course
you
can
play
around
with
retention.
There's
a
specific
command
you
can
give
either
to
retain
only
for
seven
days,
because
maybe
your
organization
doesn't
want
to
keep
historical
data,
so
you
can
always
play
around
with
that.
Yeah.
A
There
is
another
retention,
another
thing
I
would
say:
let's
say
if
you
have
deleted
the
data,
so
actually
it
doesn't
take
all
the
data
away.
It
will
store
that
deleted
data
for
seven
days,
so
it
will
still
allow
you
to
you
know
to
retain
that
data
in
case
somebody
accidentally
deleted
it.
So
after
seven
days
it
will
automatically
get
rid
of
that
deleted
data.
But
if
you
do
want
to
not
keep
that
seven
day
deletion
policy,
you
can
always
vacuum
at
zero
retention,
so
it
will
completely
remove
that
data.
So.
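A hedged sketch of how those knobs look in practice, with an assumed table path and illustrative values; the 30-day and 7-day figures above roughly correspond to the delta.logRetentionDuration and delta.deletedFileRetentionDuration table properties.

```python
# Keep less history than the defaults:
spark.sql("""
  ALTER TABLE delta.`/tmp/tables/loans_delta` SET TBLPROPERTIES (
    'delta.logRetentionDuration' = 'interval 7 days',
    'delta.deletedFileRetentionDuration' = 'interval 7 days'
  )
""")

# Remove files no longer referenced by the table and older than the retention window:
spark.sql("VACUUM delta.`/tmp/tables/loans_delta`")

# Vacuuming with zero retention requires disabling a safety check first; use with care,
# since it permanently removes the history needed for time travel.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM delta.`/tmp/tables/loans_delta` RETAIN 0 HOURS")
```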
A
Good
question
any
other
questions
awesome.
Well,
hopefully
I'll
see
you
in
like
some
of
the
delta
sessions
or
maybe
like
if
you
get
involved
in
the
community.
Thank
you
all
for
attending
appreciate
it.