Description
Data + AI Summit Keynote talk from Michael Armbrust
Connect with us:
Website: https://databricks.com
Facebook: https://www.facebook.com/databricksinc
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/data...
Instagram: https://www.instagram.com/databricksinc/
In this talk, I'm going to cover three things. I'm going to explain what Delta Lake is and why it's the foundation of the lakehouse; I'm going to tell a story about Delta Lake's history, which is actually deeply entwined with this conference we're all at today; and then, finally, I have a really exciting announcement about Delta's future.
So, first of all, what is Delta, and why is everyone talking about it? Well, as Ali said this morning, there's always been this big divide between data lakes and data warehouses. Data warehouses were the traditional technology: they were really easy to use and really fast, but they were expensive and not very scalable.
Data lakes were the young upstart. They came in and let you store tons of data, but they were kind of slow and clunky and difficult to configure. Delta really was created to unify these two worlds. It brings ACID transactions to the data lake, it brings speed and indexing, and it doesn't sacrifice scalability or elasticity. It's what enables the lakehouse. When Delta first started, it was mostly a Spark technology, but today that couldn't be further from the truth.
Today we have connectors for everything from old-school technologies like Hive to fancy new technologies like dbt, which you'll hear about here in just a little bit. This year has been no different: we've added a ton of connectors, including support for Flink, Trino, and Presto, and we're working on support for Pulsar and Google's BigQuery. And as the ecosystem has expanded, so has our user base.
So this graph I'm showing here is actually a metric from the Linux Foundation that looks at the health of contributions in any given open source project: how many unique people are fixing bugs, responding to pull requests, and merging code. You can see just how much momentum there is behind the project; it's increased by over 600 percent in the last three years.
But now I want to rewind and take a trip down memory lane and talk about the history of Delta. So I'm going to go back to the year 2017, a much simpler time. I was working on Structured Streaming and Spark SQL, as Ali just said, and I was talking to a bunch of users at this conference about how they were processing tons of data from a variety of sources, all in parallel in the cloud.
I was constantly fielding bug reports from users saying Spark was broken, when what they were actually doing was corrupting their own tables, because there were no transactions. When their jobs failed because a machine was lost, the job didn't clean up after itself. Multiple people would write to a table and corrupt it. There was no schema enforcement, so if you dropped data with any schema into the folder, it would make the table impossible to read. And there was a bunch of added complexity in working with the cloud.
The Hadoop file system just wasn't really built for it. I'm sure people in this room remember setting the direct output committer, and if you got it wrong, things would break. Even just working with large tables was slow; just listing all the files could take up to an hour. And so it was here, at what used to be known as Spark Summit, that I talked to a bunch of people and thought:
There's got to be a better way. I actually always take a couple of days off after Spark Summit to decompress, and I was so excited that I wrote a design doc during that vacation. And so in 2018 we came back on this stage and announced Databricks Delta. It was one of the first fully transactional storage systems that preserved all of the best parts of the cloud, and even better, it was battle-tested by hundreds of our users at massive scale.
In fact, one of my friends, Dom, got up here and told us about his use case: he had been using Delta for the last year to process petabytes of data in real time, with hundreds of analysts around the globe, for a critical information security use case. If you haven't seen that video, I suggest you go to YouTube and check it out. But Delta was too good to keep just for Databricks, and so in 2019 we came back and announced the open source Delta Lake. And we didn't just open source the protocol.
That protocol is the description of how different clients can connect and make transactions in the system. We also open sourced our battle-tested Spark reference implementation and put all of that code up on GitHub. But we weren't done with Delta. We believed in it so much that Databricks committed its business to it, and so, in addition to all the exciting things happening in open source Delta Lake, we were busy building features to make Delta even better.
We built this really cool command to go alongside it, called OPTIMIZE ZORDER, which takes your data and maps it onto a multi-dimensional space-filling curve so that you can filter efficiently on multiple dimensions. That works really well with a cool trick called data skipping, based on statistics; it's basically like a coarse-grained index for the cloud.
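To make those two ideas concrete, here is a toy sketch in plain Python, not Delta's actual implementation: the Z-value is computed by interleaving the bits of two column values (a Morton curve), files store min/max statistics per column, and a query prunes any file whose statistics cannot match the predicate. All names here are illustrative.

```python
# Toy illustration of OPTIMIZE ZORDER + data skipping -- NOT Delta's code.

def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two column values to get a point on a
    Z-order (Morton) space-filling curve. Rows sorted by this value
    stay clustered in BOTH dimensions at once."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bit i of x -> even position
        z |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y -> odd position
    return z

def write_files(rows, rows_per_file=4):
    """Sort rows by Z-value, split them into 'files', and record per-file
    min/max statistics (the kind Delta keeps in its transaction log)."""
    rows = sorted(rows, key=lambda r: z_value(r[0], r[1]))
    files = [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]
    stats = [dict(min_x=min(r[0] for r in f), max_x=max(r[0] for r in f),
                  min_y=min(r[1] for r in f), max_y=max(r[1] for r in f))
             for f in files]
    return files, stats

def files_to_scan(stats, x, y):
    """Data skipping: prune files whose statistics rule out x == ? AND y == ?."""
    return [i for i, s in enumerate(stats)
            if s["min_x"] <= x <= s["max_x"] and s["min_y"] <= y <= s["max_y"]]

rows = [(x, y) for x in range(8) for y in range(8)]  # 64 rows -> 16 files
files, stats = write_files(rows)
print(len(files), "files; scan only files", files_to_scan(stats, 3, 5))
```

Because Z-ordering keeps each file to a small contiguous block in both dimensions, the point predicate above touches a single file out of sixteen; a plain sort on `x` alone would cluster `x` but leave `y` scattered across every file.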
And so, if you look at this feature matrix, you can see we're already on our way. If you've been following the project closely, you might have noticed something is up: we've been opening tickets on GitHub, and we are rapidly open sourcing all of these different features and bringing them out for the community. And what we've seen is that this is going to dramatically improve the performance of the open source project. So, as you can see, the baseline here is Delta 1.0.
With the OPTIMIZE command alone, performance improves a little bit, but when you add in Z-ordering and data skipping, performance gets really good, which is super exciting. This is the same TPC-DS query we've been showing all day today. And the other really exciting thing is that Delta is now one of the most featureful open source transactional storage systems in the world.
This is from some people over at Databeans, who are users of the open source project, and as you can see, they found that we are dramatically faster than Iceberg, not only at loading data but also at processing it. But we're not done yet; there's actually some really cool technology waiting in the wings, and I want to give you just a quick preview of one of those things. There's always been a problem with columnar formats like Parquet, and the problem is due to the encoding.
When you want to update even just a single value, you have to rewrite the entire file. This is called write amplification in databases, because it takes a single tiny write and turns it into a big, massive copy of all of the unchanged data. And so we're very excited to add to the Delta protocol a new technology that allows you to delete a single row: deletion vectors.
What this is going to let you do is mark that row as deleted, so you only write out the data that changed, which will dramatically speed up things like deletes, updates, and merges. And we've already gotten started on this effort.
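The idea can be sketched in a few lines of plain Python. This is only a conceptual toy, not the actual Delta protocol format: an immutable "file" stays untouched on disk, a small side structure records which row positions are deleted, and the read path filters them out, so a one-row delete no longer rewrites the whole file.

```python
# Toy sketch of the deletion-vector idea -- NOT the actual Delta protocol.

class DataFile:
    """Stands in for an immutable Parquet file: rows can't be changed."""
    def __init__(self, rows):
        self.rows = list(rows)

class DeletionVector:
    """Set of row positions marked deleted; tiny compared to the file."""
    def __init__(self):
        self.deleted = set()

    def delete(self, pos):
        self.deleted.add(pos)  # O(1): none of the file's data is rewritten

def scan(data_file, dv):
    """Read path: skip rows whose position is in the deletion vector."""
    return [row for pos, row in enumerate(data_file.rows)
            if pos not in dv.deleted]

data_file = DataFile([("a", 1), ("b", 2), ("c", 3), ("d", 4)])
dv = DeletionVector()
dv.delete(2)                     # delete the single row ("c", 3)

print(scan(data_file, dv))       # [('a', 1), ('b', 2), ('d', 4)]
print(len(data_file.rows), "rows still on disk; rewrote 0 of them")
```

Without the vector, deleting `("c", 3)` would mean copying the three unchanged rows into a new file, which is exactly the write amplification described above; with it, the delete is a tiny metadata write applied at scan time.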
So, as you can see, there are a couple of JIRAs, some of them already resolved, where we've been adding some of the groundwork to both Parquet and Apache Spark.
If you'd like to learn more, there are a ton of exciting things going on at this conference: come join us for our AMAs or our deep dives into various topics. And if you want to get involved in actually coding on the project, you can come to our meetup, talk to some committers, and figure out some good projects to get started on.