From YouTube: Building Reliable Lakehouses with Delta Lake
Welcome to Building Reliable Data Lakes with Delta Lake. My name is Denny Lee. I'm a Senior Staff Developer Advocate here at Databricks; I've been here since 2015 or so. I've been working with Apache Spark since 0.6 and with Delta Lake since its inception. Prior to that, I was the Senior Director of Data Sciences Engineering at SAP Concur and a Principal Program Manager at Microsoft for Azure Cosmos DB and Project Isotope, which is what is currently known as Azure HDInsight, as well as SQL Server.
So let's talk about the evolution of data management: why do we even need lakehouses in the first place? Well, first let's start with data warehouses. I'm actually from that era, by the way; I don't know if I look it, but I certainly feel it, okay? I was part of the SQL Server team, and we were boosting this idea that the data warehouse would allow us to do everything: we would build these OLTP transactional databases and do some ETL.
I apologize in advance for my involvement with SQL Server Integration Services and DTS, for any of the folks that ended up using them. We pushed all of this data into data warehouses, which was great, right? Because that meant I could throw my BI and my reporting at it. So again, if you want to blame me, or partially blame me, for SQL Server Reporting Services or SQL Server Analysis Services for your BI: yes, I was there during those days as well, helping to build those systems. They're great, right? Except... and that's the key thing.
A
There
wasn't
support
for
video,
audio
or
text,
there's
no
support
for
data
science
and
machine
learning.
Even
this
idea
that
we
could
potentially
use
cursors
or
functions
within
python
functions
or
r
functions
directly
within
the
database
was
problematic
at
best.
There's
limited
support
for
streaming.
Some
different
companies
actually
did
try
to
provide
streaming
capability
within
warehouses,
but
it
was
very,
very
complicated
and
the
most
important
aspect.
It
was
the
closed
and
proprietary
formats.
The
systems
themselves
were
not
open.
They were built as closed silos, closed houses that ensured you would only utilize that particular data warehouse, and it was really expensive to scale out. So then we switched to data lakes, with most data stored in data lakes and blob stores, because we wanted that flexibility. And I was actually part of that transition as well, because I was part of the crew that said, hey:
A
We
need
to
build
hadoop
hadoop
on
windows,
originally
first
and
then
afterwards,
azure
hadoop,
so
in
other
words
hadoop
on
linux,
but
in
azure
cloud
and
we
we
really
pushed
that
idea
of
like
schema
on
re.
Yes,
we
could
do
everything
with
schema
on
read.
We
could
wouldn't
care
about
the
structure
of
the
data.
We
didn't
need
a
scheme
anymore.
We
would
just
simply
go
ahead
and
determine
the
schema
at
that
point
of
read
time
right
and
no
problem
at
all
and,
of
course,
to
anybody.
That's
in
this
particular
conference.
A
You
know
that's
not
true
at
all,
okay,
so,
but
that
was
the
idea.
The
idea
that
data
links
could
handle
all
of
your
data
for
data
science
and
machine
learning,
but
really
poor,
bi
support
any
systems
I
want
to
buy.
So,
for
example,
if
I
wanted
trino
to
go
ahead
and
read
a
data
like
you
know,
if
I
didn't
have
a
schema,
that's
a
horrible
thing
to
actually
have
right.
It's
super
complex
to
set
up
at
times
more
often
than
not
extremely
poor
performance
and
unreliable
data
swamps.
Now
some
people
call
data
swamps.
A
Some
people
call
data,
salad
doesn't
matter
what
the
nomenclature
is.
What
it
meant
is
that
you
had
a
lot
of
data
stored
in
these
cloud
object,
stores
or
hdfs,
I.e
data
lakes
in
which
there
was
no
lineage
or
control
over
this
data.
Nobody
seemed
to
understand
what
was
going
on.
It
did
give
you
the
flexibility
to
store
structured,
semi-structured
and
unstructured
data.
A
So
the
idea
of
saying:
okay,
no,
no
we'll
have
a
data
warehouse
on
one
side
and
we'll
have
a
data
lake
on
the
other
side,
and
everything
will
be
perfectly
fine.
Well,
how
do
you
reconcile
all
that
data?
How
do
you
make
sure
that
data
is
actually
what
you're
reading
out
of
the
data
warehouse
and
what
you're
reading
out
data
lake
actually
made
sense
that
the
users
themselves
could
actually
reconcile
between
these
two
different
systems?
A
You
end
up
having
reporting
by
one
system
saying
one
number
or
another
system
say
another
number
and
technically
they
make
sense,
but
they
don't
and
the
reconciliation
process
ends
up
becoming
or
validation
process
becomes
more
complicated
than
the
actual
process
of
processing
the
data.
Now,
that's
very
undesirable,
to
say
the
least,
and
so
this
is
the
reason
why
we
really
want
to
bring
this
concept
of
a
lake
house
best
of
both
worlds.
A
It
gives
you
the
ability
to
have
the
flexibility
of
a
data
lake
here
on
the
right
side,
yeah
right
hand
here
in
which
you
can
store
structured
and
semi-structured,
unstructured
data
and
work
with
it
and
be
able
to
deal
with
multiple
paradigms
like
data
science,
machine
learning,
real-time
databases,
yet
at
the
same
time,
on
the
left
side
that
it
actually
has
some
of
the
most
important
aspects
of
a
data
warehouse.
I
have
a
schema.
I
have
structure.
A
So
how
do
you
reconcile
these
two
okay?
So
we
there
are
multiple
versions
like
house
and
obviously
today
we're
going
to
be
talking
about
delta
lake,
but
the
key
most
important
aspect
is
especially
here
in
this
conference
of
single
neutrinos
that
you
really
need
this
concept
of
somewhere
in
between,
where
you
have
a
delta
lake
for
a
data
lake.
Excuse
me,
data
lake,
for
all
your
data
and
one
platform
for
every
use.
A
Okay,
you
need
somewhere
in
between
okay
and
how
you
do
that
is
with
delta
lake,
okay,
scalable,
open
general
purpose,
transactional
data
format,
all
right,
that's
a
fundamental
concept
in
order
for
rios
to
be
able
to
do
it
and
then,
in
order
to
add
on
top
of
it
but
delta
lake,
obviously
we're
going
to
showcase
trino
today,
but
that's
the
call
out
the
high
performance
query
engine
that
talks
to
delta
lake
in
order
to
give
you
the
best
of
both
worlds.
You
need
these
two
concepts.
A
So
let's
go
backwards
a
little
bit
and
ask
the
question
of
in
delta
lake
or
the
architecture.
Why
is
it
scalable?
How
does
it
actually
give
you
that
context?
Well,
delta
lake
equals
this
concept
of
scalable
storage
for
data
as
well
as
scalable
transaction
log
from
edit.
You
actually
need
those
two
things,
because
azure
transaction
log
or
as
your
metadata
grows,
it
actually
has
to
be
able
to
scale
as
well,
and
so
how
do
we
do
it?
Well,
let's
talk
about
the
scalable
storage
concept.
A
First,
the
table
data
is
stored
as
part
k,
files
on
cloud
storage
all
right
so
in
other
words
your
standard,
open
source,
parque
files,
that's
exactly
what's
inside
your
delta
lake
table
itself,
okay,
so
in
other
words,
for
example,
you
have
your
past
a
table.
You
have
your
zero
part,
zero
zeros
there
are
ones
there
are
two
parquet
files.
These
are
the
part
k
files
that
make
up
your
start.
Your
scalable
storage,
they're
sitting
on
cloud
object,
storage.
So
that
way
they
can
scale
appropriately.
So how do I speed things up? Well, I speed things up by having a scalable transaction log. Inside that same folder, there's an _delta_log directory, and inside that is a bunch of JSON files: 000.json, 001.json, and so on. It's basically a sequence of metadata files; these JSON files track the operations made on the files in the table.
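To make that concrete, here is a sketch of what a Delta table directory typically looks like (names abbreviated; real commit files use zero-padded 20-digit version numbers):

    mytable/
        _delta_log/
            00000000000000000000.json
            00000000000000000001.json
            ...
            00000000000000000010.checkpoint.parquet
        part-00000-....snappy.parquet
        part-00001-....snappy.parquet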
A
Now
these
are
like,
I
said,
they're
stored
in
the
cloud
storage
along
with
the
table,
and
they
allow
you
to
read
and
process
the
metadata
in
parallel.
That's
an
important
aspect.
So
when
you
look
at
the
trino
connector,
for
example,
when
it's
accessing
a
delta
lake
table,
it's
able
to
go
ahead
and
read
that
those
json
files
and
by
the
way
there's
every
tenth
is
a
checkpoint
which
is
a
parquet
file.
A
So
it'll
just
read
the
parquet
files,
but
the
idea
is
that
it's
actually
able
to
read
those
files
quickly
understand
what
it
inside
those
json
parquet
files
in
the
delta
log.
That
is,
it
includes
a
list
of
every
single
parquet
file
that
makes
up
the
table.
Not
every
single
par
k
file.
That's
inside
the
cloud
object
store
actually
makes
up
the
table,
at
least
for
that
particular
snapshot.
I'll
talk
about
that
concept
in
a
second
a
little
bit
more
in
a
second.
The idea is that every single Parquet file inside the cloud object store does not necessarily translate to what the table is at a given time, because Delta Lake also includes time travel: the ability to go back in time and look at an earlier version of the table, based on the metadata that you're reading. For example, if I'm looking at the most recent snapshot of the data, it'll contain all the Parquet files that make up the current table.
A
If
I
was
to
go
back
in
time,
it
would
actually
tell
me
all
the
parquet
files
that
actually
made
up
that
version
of
the
table
at
that
time.
So,
instead
of
actually
doing
that
file
listing
from
cloud
object
stores,
it's
just
simply
reading
a
single
parquet
or
json
json
files
to
determine
what
makes
up
that
table,
then
trino's
able
to
go
ahead
and
quickly
send
its
worker
notes
to
go
ahead
and
query
the
data
directly
all
right.
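As a hedged sketch of that time-travel capability using the open-source Delta Lake Spark connector (the table path here is hypothetical):

    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    # Spark session with the open-source Delta Lake extensions enabled
    builder = (SparkSession.builder
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Latest snapshot: the log lists exactly the Parquet files that are live
    current = spark.read.format("delta").load("/data/events")

    # Earlier snapshot: the log reconstructs the file list as of version 0
    v0 = (spark.read.format("delta")
            .option("versionAsOf", 0)
            .load("/data/events"))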
That way it knows exactly which Parquet files it's supposed to hit in order to read that data, so it's a really powerful way for you to scale both the storage and the transaction log itself. Okay, so let's talk about that: why is this transaction log so important? Well, going back to your database days, it's about transaction log commits: changes to the table are ordered, and they are atomic commits. This is very important; in other words, it allows you to have ACID transactions.
A
That's
so
more,
so
much
so
important
that
versus
if
you
were
to
go
ahead
and
read
and
write
directly
to
a
parque
file
directly
to
a
data
lake
in
general,
you
would
not
have
that,
and
what
do
I
mean
by
that?
What
I
mean
is
that
there's
this
you're
running
a
job
doesn't
matter
which
distributed
system
could
be
trino.
It
could
be
spark.
It
doesn't
really
matter
just
you're
running
some
job
itself.
It's
writing
files
in
mid-flight,
okay,
but
the
job
fails.
A
What
happens?
Well,
a
bun
if
there
were
let's
say
20
tasks
executed
at
that
time
to
actually
write
all
those
files.
What
happens
is
that
let's
say
the
19th
task
fails.
Well,
that
means
18
files
were
already
most
likely
were
already
written
to.
The
cloud.
Object
store
all
right.
Well
then,
what
happens
to
that
19th
file?
Okay,
well,
the
19
file
failed.
A
So
now
what
you
have
is
a
bunch
of
orphaned
files,
a
bunch
of
files
that
are
in
the
cloud
object
store
in
the
same
location
that
you
don't
know
if
they're
supposed
to
be
there
or
not
right.
Well,
that's
what
the
transaction
log
does
and
transaction
log
allows
us
to
know
are
those
files
that
were
written
supposed
to
be
there
or
not.
So
just
in
case
there
was
a
failure,
the
transaction
log
itself
would
have
failed
okay,
which
means
that
it
would
not
have
been
committed
transaction
log.
Okay, so here's what this allows us to do: when you're trying to process lots of data and write lots of data, whether batch or streaming, to the cloud object store, at any one point in time it's writing these JSON files, and that is the final commit that lets us know what's going on. So, for example, say I'm doing an insert action: I'm inserting data into my Delta Lake table.
A
Well,
for
example,
using
this
green
here,
I
added
001
and
zeros
are
due
parquet
files
in
zero,
zero,
zero
dot,
json
the
zeroth
json,
okay,
perfect!
So
now
we
know
that
if
I
was
to
read
the
table
at
the
point
in
time
where
you're
saying
okay,
because
it's
a
couple
milliseconds
before
01
json
got
written
right,
I
don't
want
any
dirty
reads:
okay,
what
is
the
snapshot
at
that
point
in
time
that
I
went
ahead
and
ran
the
query?
A
I'm
only
supposed
to
read
zero
one
and
zero
two
parquet
files.
A few seconds later, I do an update. That update goes ahead and generates a 003 Parquet file and ends up removing the 001 and 002 Parquet files. Okay, no problem: any query, whether it's a read, a write, an update, or another modification, that happens after the update action will instead see the 001.json file, and because it sees that, the listing of files will only include 003.
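A simplified sketch of what those two commits would contain (action fields abbreviated; real entries carry more metadata such as file sizes and stats):

    000.json  (the insert)
      {"add": {"path": "001.parquet", "dataChange": true, ...}}
      {"add": {"path": "002.parquet", "dataChange": true, ...}}

    001.json  (the update)
      {"add": {"path": "003.parquet", "dataChange": true, ...}}
      {"remove": {"path": "001.parquet", "dataChange": true, ...}}
      {"remove": {"path": "002.parquet", "dataChange": true, ...}}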
A
Okay,
so
that's
pretty
cool,
because
that
way,
if
you
had
a
query
that
ran
right
after
xero
dot
json
and
you
had
another
query
that
ran
right
after
zero1.json,
you
know
exactly
which
files
and
you're
not
risking
any
dirty
reads
at
that
point
in
time,
because
this
is
actually
a
very
for
short
running
queries.
That
probably,
is
not
that
big
of
a
deal
honestly,
because
you'll
probably
just
generate
right
right
off
the
zero
one,
but
for
long
running
queries.
A
This
actually
is
a
big
deal,
because
you
want
to
know
what
the
status
of
the
data
was
at
the
time
that
you
queried
it.
Okay,
and
so
that's
why
it's
important
that
every
single
atomic
commit
goes
as
a
json
file
within
that
delta
log,
folder,
okay,
a
set
of
actions,
as
you
can
see
here
all
right,
but
in
order
to
keep
that
consistent,
snaps,
just
like
I
said
this.
A
Actually,
I
just
described
it
for
you,
so
that
can
the
readers,
whenever
they're
reading
they're
going
to
read
those
atomic
units,
so
they
actually
will
see
those
zero
dot,
json
or
01.json.
So
that
example
that
I
just
gave
you
right
if,
if
it
happened,
let's
just
say
at
one
o'clock,
it
was
zero.json
at
two
o'clock
at
zero.
One
json,
just
for
the
simplicity
of
this
example
that
I'm
speaking
to
you,
what
will
happen
is
basically
any
query
that
happens
between
one
o'clock
and
159
59.99
seconds.
A
It'll
actually
ensure
it
reads:
the
0102
parquet
files
right
from
the
insert
action,
but
at
2
o'clock,
when
the
zero
one
json
gets
committed,
it'll
only
read
the
zero
three
parquet
files,
and
so
the
readers
any
of
the
readers
to
ensure
consistency.
It'll
only
read
either
zero
one
and
zero
two
per
k
or
zero,
three
par
k,
but
nothing
in
between
and
nothing
in
between.
Excuse
me:
okay,
because
that
way
you
have
consistency
when
you're
reading
your
data,
okay,
and
so
when
we
talk
about
acid
transactions
by
mutual
exclusion
on
those
law
commits.
A
What
has
to
actually
happen
is
that
the
concurrent
writers
have
to
agree
on
the
order
of
changes,
so
we
we
actually
follow
the
optimistic
and
currency
control
that
we
basically
are
choosing
to
say
or
if
you're
able
to
go
ahead
and
write
to
the
the
the
to
your
delta,
like
forsake
argument
to
different
partitions
of
data,
so
one's
doing
an
update
to
partition
zero
while
one's
doing
an
insert
to
partition
two,
that's
fine,
that's
optimistic
and
currency
control
will
allow
that
right
to
happen
because
we're
saying
it's
they're
actually
not
interfering
with
each
other,
but
if
you're
actually
trying
to
update
the
same
partition
of
data
as
you're
trying
to
insert
okay,
there
may
in
fact
be
problems
because
you
need
to
know
exactly
which
order.
A
Did
you
update
this
first
and
then
you
insert
the
data,
or
did
you
insert
the
data
which
the
update
itself
might
actually
impact?
Okay,
so
these
new
commit
files
will
be
created
mutually
exclusively
from
each
other,
using
storage,
specific
api
guarantees?
Okay,
so,
for
example,
writer
one
is
going
ahead
and
trying
to
write
to
zero.json.
Okay,
meanwhile
zer
excuse
me
writer,
two
is
actually
going
ahead
and
try
to
write
to
zero
one.json.
But then writer one and writer two both try to write 002.json at the exact same time. Only one of them is going to succeed; in this case it's writer two, but you never know, maybe writer one would have won. Either way, it doesn't matter: only one of them succeeds, and then writer one will try again. This is an important fact of atomicity: we have to make sure that whatever's being written is consistent at that point in time, and only one writer is allowed to do the commit at that point in time. So, even with all of our discussions about parallelism, at some point there is a serial point where, basically, we have to make a decision about which one commits, which one succeeds, and only one gets to win in this case. Okay, oops, sorry, all right.
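A minimal sketch of that commit loop (the log_store object and its put_if_absent method are hypothetical stand-ins for the storage-specific primitives Delta actually uses):

    import json

    def commit(log_store, version, actions):
        """Publish a commit file; on a collision, retry at the next version."""
        while True:
            path = f"_delta_log/{version:020d}.json"
            try:
                # put_if_absent must fail if the file already exists;
                # that mutual exclusion is what makes the commit atomic.
                log_store.put_if_absent(
                    path, "\n".join(json.dumps(a) for a in actions))
                return version
            except FileExistsError:
                # Another writer won this version. After re-checking for
                # logical conflicts (e.g. the same partition touched), retry.
                version += 1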
So when you look at the storage system support: Delta relies on the scalable cloud storage infrastructure for its ACID guarantees.
So remember, it's relying on S3 itself, or it's running on Azure Blob Storage, Azure Data Lake Storage Gen2, or Google Cloud Storage, so that there is no single point of failure and it's production-ready. On the storage system support, though, there is one thing I want to call out with S3:
A
There
is
one
call
out:
okay
that
actually
in
order
to
ensure,
because
s3
itself
lacks
put
if
absent,
consistency,
guarantees
what
happens
that
we
actually
need
a
lock
store,
which
one's
locking
first
in
order
for
us
to
write
to
the
log
store
in
this
case
with
s3
we're
also
using
dynamodb
as
well.
It's
commonly
used
for
other
storage
formats
as
well,
but
so
we're
we're
following
the
same
thing.
It
was
just
released
actually
in
delta
1.2,
with
our
friends
over
at
samba
tv
to
go
ahead
and
do
that
contribution.
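As a rough sketch of how that multi-cluster S3 setup is wired (configuration keys as documented for Delta 1.2; the DynamoDB table name and region here are placeholders):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        # Route commits on s3a:// paths through the DynamoDB-backed LogStore
        .config("spark.delta.logStore.s3a.impl",
                "io.delta.storage.S3DynamoDBLogStore")
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName",
                "delta_log")        # placeholder DynamoDB table
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region",
                "us-west-2")        # placeholder region
        .getOrCreate())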
A
So
it's
pretty
cool
all
right.
So
what
does
it
make
it?
What
does
delta
make
it
delta
lake?
What
features
make
delta
lake
unique?
Well,
in
addition
to
the
fact
that
we're
talking
about
acid
transactions-
and
we
talked
about
things
like
scalable
metadata-
we
also
have
the
ability
to
time
travel
and
audit
history.
All of this together allows this idea of unified batch and streaming: the semantics of reading from and writing to the cloud object store or HDFS are the same whether it's a batch process or a streaming process doing the reading or writing. In fact, not only is it being utilized for Structured Streaming, but we actually have a PR for Pulsar, and the Flink sink actually just came out as part of Delta Connectors 0.4.
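To illustrate that unification (hypothetical path, building on the Spark session sketched earlier), the same Delta table is both a batch source and a streaming source:

    # Batch read of the table
    batch_df = spark.read.format("delta").load("/data/events")

    # Streaming read of the exact same table; new commits arrive
    # as micro-batches with the same transactional semantics
    stream_df = spark.readStream.format("delta").load("/data/events")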
There are also things like constraints and generated columns, a pretty cool concept to ensure that the data meets your semantic requirements: actual constraints, you know, from our database days, and generated columns. Generated columns are basically this idea (some people compare it to partitioning, but we think it's a little bit better) that you can generate new columns based on the existing data, and then you can build your structures or your constraints based off of that. And, just as important, of course, your DML operations, data manipulation language. So you can do your updates,
A
Your
merges
your
deletes,
whatever
you
want
in
scala
sql
java
python,
whatever
language
you
prefer?
Okay,
so
really
cool
okay.
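As a hedged sketch of those features through Spark SQL (table and column names are made up; syntax per the open-source Delta docs):

    # A generated column derived from existing data, usable for partitioning
    spark.sql("""
        CREATE TABLE events (
            ts  TIMESTAMP,
            id  BIGINT,
            day DATE GENERATED ALWAYS AS (CAST(ts AS DATE))
        ) USING DELTA
    """)

    # A constraint enforced on every future write
    spark.sql("ALTER TABLE events ADD CONSTRAINT valid_id CHECK (id > 0)")

    # DML works directly against the table
    spark.sql("UPDATE events SET id = id + 1 WHERE day = '2022-04-01'")
    spark.sql("DELETE FROM events WHERE day < '2021-01-01'")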
So, with the few minutes I have left, I want to talk a little bit about the roadmap. Delta Lake has a very fast pace of innovation, so this is a massive screen which I'm not going to go through, but it shows a lot of innovations that have happened in the two years since we open-sourced the project.
A
Okay,
so
we're
actually
coming
up
to
april,
actually
we're
already
april
2022
as
part
of
summit,
we're
going
to
be
having
a
big
splash
talking
about
our
three
year
anniversary
of
open
sourcing
delta
lake,
which
is
great
so
there's
a
lot
of
innovations,
as
you
can
see
from
here,
so
as
part
of
delta
lake,
one
two
which
was
just
released,
we're
also
including
things
like
data
skipping
with
column
stats.
We
already
talked
about
the
s3
multi-cluster
writes,
which
was
great
okay
with
the
data
skipping.
A
I
do
want
to
call
things
out
right.
It's
the
column
and
max
values
are
automatically
collected
and
when
you're
writing
files
and
committed
to
the
logs.
So
that
way,
when
you
write
you're
running
your
read,
queries
like
from
trino,
you
can
go
ahead
and
skip
those
files.
You
don't
have
to
read
those
files
directly
pretty
sweet.
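For instance (reusing the hypothetical events table from above), a selective filter lets the engine prune files using those logged stats:

    # Only files whose recorded [min, max] range for `id` can contain 42
    # are scanned; every other file is skipped without being opened.
    spark.sql("SELECT count(*) FROM events WHERE id = 42").show()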
Okay, there are also things like compaction, via the OPTIMIZE command. Now, some of you are going to ask me the question: hey, that's great that I've got OPTIMIZE, but where's OPTIMIZE ZORDER?
A
That's
actually
part
of
our
delta
2.0
push
as
well.
So
if
you
take
a
look
at
the
delta
lake,
github
issues
or
delta,
I
o
dot
slash
roadmap.
You
notice
it's
right
there.
Okay,
you
want
to
build
to
restore
or
roll
back
to
previous
table
versions.
You
could
do
before
with
some
funky
queries,
but
now
we're
just
including
the
restore
table
function.
Okay,
so
that
that
way
makes
it
a
lot
easier
and
you
can
also
rename
columns.
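A short sketch of those two operations (again via Spark SQL on the hypothetical table; RENAME COLUMN assumes column mapping is enabled on the table):

    # Roll the table back to an earlier version
    spark.sql("RESTORE TABLE events TO VERSION AS OF 0")

    # Rename a column without rewriting the underlying data files
    spark.sql("ALTER TABLE events RENAME COLUMN id TO event_id")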
A
Oh
that's
right.
I
forgot
I
put
this
in
the
slide
already
the
optimize.
The
order
is
actually
important
included
for
the
the
remaining
of
the
h1
roadmap
for
of
this
year.
Okay,
so
that's
being
included
as
well,
so
with
the
optimize
z
order.
The
whole
context
is
this
ideas.
A
Multi-Column
data
clustering-
that
is
better
than
just
simply
than
multi-column
sorting,
so
we
actually
are
clustering,
the
data,
so
that
way
we
know
how
to
skip
or
read
which
file
so
that
way:
you're
you're
scanning,
less
files
overall,
there's
less
false
positives
for
use
to
go
through.
Okay,
now
with
call
stats.
This
enables
also
the
better
data
skipping,
which
also
leads
to
faster
queries.
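As a sketch (syntax as it appears on the roadmap and in existing Databricks docs; the column choices are illustrative):

    # Compact small files and co-cluster rows by the columns you filter
    # on most, so that stats-based skipping prunes more files per query
    spark.sql("OPTIMIZE events ZORDER BY (event_id, day)")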
A
So
all
this
together
allows
us
this
much
faster
performance,
okay
and
then
oh
other
things
that
we
definitely
want
to
be
adding
things
like
generate
it
generate
the
change
data,
feed
change.
Data
feed
is
a
popular
feature,
that's
also
being
included
as
part
of
the
roadmap
as
well
as
draw
columns
like
I
said
these.
This,
these
slides
by
the
way
will
be
available
to
everybody.
A
So
if
you
want
to
see
the
full
road
map
magnus,
just
look
at
the
github
at
you
or
go
to
delta
io
roadmap
and
then
in
terms
of
the
connector
ecosystem,
with
the
few
minutes
that
I've
got
left,
as
you
can
tell,
delta
lake
has
multiple
api
languages,
scala,
ruby,
python,
multiple
cloud
platforms,
multiple
c
collisions
engines
and
etl
and
streaming
engines.
Okay,
so
lots
of
different
systems
that
are
all
working
together
that
all
work
well,
extremely
well
with
delta
lake
and
there's
even
more
integrations,
some
calls
outs.
A
We
we're
calling
out
the
presto,
trino,
flink
ones
that
were
just
recently
released.
So
that's
why
I
want
to
call
those
out:
let's
sell,
oh
communion
adoption,
oops,
sorry
there
we
go
so
to
give
you
some
context
of
the
incredible
scale
of
delta,
like
we're
talking
about
450
petabytes
of
data
processed
each
day
and
within
the
databricks
environment.
A
We're
talking
about
75
of
the
data
scanned
is
all
in
delta
lake
and
more
than
5
000
customers
in
production
and
actually,
when
we
go
ahead
and
talk
about
more
in
delta
2.0
for
the
data
and
ai
summit
yeah
we're
actually
going
to
be
telling
you
even
larger
numbers,
because
these
are
last
year's
numbers,
which
is
pretty
cool.
But
how
do
you
want
to
engage
with
us?
Well,
there's
multiple
ways
to
engage
with
us.
There's
the
delta
io
website.
There's
the
slack
google
group
youtube
channel
github
issues
linkedin
the
meetups
right.
A
There
are
multiple
ways
to
engage
with
us
and
don't
forget.
We
have
community
office
hours
every
two
weeks,
okay,
so,
for
example,
this
one's
from
february
17th
included
apple's
dominique
berzinski,
who
was
actually
talking
about
how
he
had
committed,
contributed
to
delta
lake
and
then
finally,
most
important.
How
do
I
use
delta
lake
well
for
us,
cinco
de
trino
folks
just
go
to
go
dot,
dot,
dot,
delta,
dot,
io,
slash,
trino,
okay,
that
actually
goes
directly
to
the
trino
delta
lake
page.
A
To
give
you
all
the
information
you
need
to
get
trino
up
and
running
in
delta
lake
and
as
well,
like
I
said
before,
there's
multiple
ways
to
engage
with
us.
You
can
start
with
the
delta
io
page
and
all
the
different
ways
to
engage
us
with
this
there,
and
with
that
I
appreciate
your
time.
Thank
you
very
much
I'll
go
ahead
and
probably
answer
some
questions
now,
so
that's
it
for
now.