From YouTube: Delta Lake Connector for Presto - Denny Lee, Databricks
Description
Delta lake is an open-source project that enables building a lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS. We - the Presto and Delta Lake communities - have come together to make it easier for Presto to leverage the reliability of data lakes by integrating with Delta Lake. In this session, we would like to share the design decisions and internals of the Presto/Delta connector.
For more info about Presto, an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes, see: https://prestodb.io/
A: Prior to this I was a principal program manager at Microsoft for various teams, including Azure Cosmos DB, Project Isotope (which was the incubation team for what is now known as HDInsight), SQL Server, and Bing, and somewhere in between all that I was also a senior director of data science.
A: So, let's talk a little bit about the motivation here: what is Delta Lake, and why did we work with the Presto community to build a Presto connector for Delta Lake? Well, let's talk about Delta Lake first. Delta Lake is one of the major open-source data lake storage standards for ensuring data reliability on data lakes, which are arguably very unreliable systems on their own, and we're going to talk about that in a second as well.
A: The issue at hand is that currently, when you work with Presto and Delta Lake, you actually have to use a manifest file, which then allows you to register a Delta Lake table into the Hive metastore as a symlink table type. The symlink manifest is basically a file that contains the list of files that Presto will access, so Presto uses it to figure out what the current files for the table are.

A: That's great if you are doing interactive queries on relatively stable data, but how about if you want to run interactive queries on data that does change over time, or that is updated or modified regularly, and so forth?
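As a rough illustration of the manifest approach described above: a symlink-style manifest is essentially just a text file listing the data files that currently make up the table, and the engine reads that list instead of listing the directory itself. A minimal sketch (file and directory names are hypothetical, not the actual Hive/Presto implementation):

```python
from pathlib import Path

def write_manifest(table_dir: Path, data_files: list[str]) -> Path:
    # A symlink-style manifest is just a newline-separated list of the
    # data files that currently make up the table.
    manifest = table_dir / "_symlink_format_manifest" / "manifest"
    manifest.parent.mkdir(parents=True, exist_ok=True)
    manifest.write_text("\n".join(data_files) + "\n")
    return manifest

def read_manifest(manifest: Path) -> list[str]:
    # An engine reads this list instead of listing the directory,
    # so it only ever sees the files named in the manifest.
    return [line for line in manifest.read_text().splitlines() if line]
```

The consistency problem discussed later in the talk follows directly from this: the manifest is only as fresh as the last time it was regenerated.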
A: So, Delta Lake is an open-source project that enables building a lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS. Now, when I talk about the lakehouse architecture, it's not just a marketing buzzword. It's a paradigm shift, which says the v1 world of data was typically associated with databases.
A: The v2 world of data engineering has now shifted over to data lakes. What's great about these two worlds: in the case of databases, the v1 world, you actually had this very reliable system with ACID transactions that was simple, or at least relatively simple, to use.
A: The v2 system, data lakes, gave you a massive amount of flexibility to use low-cost object stores to store all of your data, and a high degree of flexibility to solve problems that you really couldn't try to solve with the v1 databases. The lakehouse paradigm is like the version three of this world now, which is to say: can you marry the best of these two worlds?
A: Can you have the ACID transactions and reliability that you would normally get out of databases together with the flexibility that you got out of data lakes? With open-source storage systems like Delta Lake, the fact is, you can. The idea that you can have that flexibility but also have reliable data storage allows you to get the best of both worlds, the data warehousing and database world and the data lake world, and that's why Delta Lake is fundamental to what we're seeing as the lakehouse paradigm.
A: Now, let's talk a little bit about the promise of the data lake. What was great about the data lake is: hey, let's go collect everything; it allowed you to do all this really cool stuff. You could store it in your data lake without any problems (apologies for the bias there), and you also have data science and machine learning, so now you can run your recommendation engines, your risk and fraud detection...
A
All
these
other
cool
things
directly
against
your
data,
lake
and
everything's
done
right,
because
everything's
fixed
no
problem
at
all
problem
is-
and
this
is
an
old
adage
garbage
in
garbage
stored
garbage
out
your
data,
science
and
machine
learning
was
only
as
reliable
as
what
you
actually
stored
and
what
you
actually
collected.
So
how
do
we
make
this
better?
A
So
we
you
actually
don't
have
a
reliable
storage,
sorry
reliable
set
of
files
that
truly
dictate
what
the
table
that
you're
trying
to
query
is
made
out
of
there's
no
wave
form
of
quality
enforcement
of
that
data
and
there
isn't
any
form
of
consistency
or
isolation
that
goes
with
it
as
well.
So
this
these
distractions
are
not
just
minor.
They
actually
have
a
ultim,
truly
big
impact
on
whether
you
can
trust
the
data
that
you're
querying.
A: So what happens when I do this with Delta Lake instead? Well, with Delta Lake, we often talk about Delta Lake hand in hand with what we often term the medallion architecture: this idea that you have bronze, silver, and gold data quality levels for data as it's coming in. The quality is basically defined as: bronze is your raw ingestion, silver is your filtered, cleaned, and augmented data, and gold is your business-level or aggregate data.
A
The
idea
is
that
delta
lake
allows
you
to
as
part
of
that
process
to
incrementally
improve
the
quality
of
your
data
as
you're
processing
it.
So
it's
ready
for
consumption
and
so
again,
when
I
look
at
the
broad,
the
focus
just
a
little
bit
on
that
in
terms
of
the
raw
ingestion,
it's
the
dumping
ground
for
your
data.
It's
often
you're
retaining
it
for
a
very
long
time
and
you
will
avoid
any
error
prone
parsing.
So
you
can
keep
the
data
there.
A: Silver is at that point where you have intermediate data with some cleanup applied; it's queryable for easy debugging. It's common that this is where you would do your debug logs, or if you're running any machine learning you might actually run it at this level as well. But ultimately, gold is where you want to query that data, whether it's for streaming purposes or for your AI and batch reporting purposes, and you want to be able to query it reliably.
A: You haven't lost anything: you've kept and retained all the data, even if you have to change business logic from the past.
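A minimal sketch of that bronze, silver, gold flow, just to make the idea concrete (the field names are invented for illustration; this is not Delta Lake code):

```python
# Hypothetical medallion-style pipeline: raw events land as-is (bronze),
# silver filters/cleans them, and gold aggregates to a business-level view.
def silver(bronze: list[dict]) -> list[dict]:
    # The "cleanup" step: drop malformed rows and normalize types.
    out = []
    for row in bronze:
        if row.get("user") and row.get("amount") is not None:
            out.append({"user": row["user"], "amount": float(row["amount"])})
    return out

def gold(silver_rows: list[dict]) -> dict:
    # Business-level aggregate: total spend per user.
    totals: dict[str, float] = {}
    for row in silver_rows:
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals
```

The point of keeping bronze around is exactly what the talk says: if the cleanup or aggregation logic changes, you can replay it from the raw data.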
A: So who's using Delta Lake? It's used by thousands of organizations worldwide, as we've listed here; I'm going to skip past most of this stuff.
A: The one cool case that I do want to call out is Comcast, for example. This is actually from one of the Data and AI Summit sessions with Comcast, one of the keynotes from the Data and AI Summit: it's about sessionization with Delta Lake, written by Comcast.
A: They improved the reliability of their petabyte-scale jobs a lot. The cool thing is that because they were able to leverage Delta Lake to run as both a streaming and a batch processing system, they were able to do two really important things. First, 10x lower compute: they went from 640 VM instances down to 64.
A: And because they could leverage streaming capabilities, they were able to run simpler and faster ETL jobs, going from 84 jobs down to three, and also halve the data latency. So, pretty cool things, right? I just want to do some quick call-outs on some of the amazing things here: having data reliability and using a system like Delta Lake allows you to make things more efficient and ultimately cheaper.
A: Okay, and I'm going to skip through this pretty quickly, but there's a lot of innovation, because Delta Lake, just like Presto, is an open-source project, and there are a lot of innovations moving quickly. This is a piece of that innovation: from April 2019 up to February 2021 we went from 0.1 to 0.8 and added lots of really cool features, and then with Delta Lake 1.0, which was announced earlier this year, we added other really cool features as well.
A: I would highly advocate for you to go attend Michael Armbrust's keynote presentation from this year's 2021 Data and AI Summit, which goes into great detail about this: things like generated columns, multi-cluster writes, cloud independence, Spark 3.1 support, pip-installable Delta everywhere (a key component in terms of having other systems work with Delta really natively), and also, of course, connectors.

A: All right, so that segues us back to the motivation. Like I said before, Delta Lake is a major open-source data lake storage standard, and Presto is arguably the most popular distributed SQL query engine. Up to this point, until today's session, we could only really talk about the two together from the standpoint of manifests. Well, how about if we actually gave Presto the ability to read a Delta Lake table right at runtime?
A: So if there are any changes, then right at runtime, at that point in time, Presto is able to automatically know which files it's supposed to access for the table, and so it has a clean read of that data. Well, that's exactly what we're here to talk about today. Okay, and why are there issues when it comes to using the manifest? Well, let's go into that.
A
There's
data
consistency,
issues
for
partition,
delta
tables,
which
may
result
in
an
inconsistent
view
of
that
delta
table
right,
also,
the
performance,
if
there's
a
lot
of
data
which
results
in
basically
lots
of
files
to
be
listed.
There's
a
lot
in
the
manifest
that's
loaded
into
memory,
and
then
it's
going
to
be
loaded
in
memory
all
at
once,
and
if
there
are
a
lot
of
files
for
that
table,
there
definitely
is
going
to
be
a
performance
issue
x,
we'll
see
after
the
first
record
right.
A
It's
just
going
to
take
a
lot
of
time
for
it
to
figure
all
that
stuff
out.
If
you
look
at
time,
travel
queries
with
the
manifest
file
you're,
not
actually
able
to
look
at
time
travel.
One
of
the
really
cool
things
about
delta
lake
is
the
ability
to
say:
what's
what
does
my
data
lake
table?
What
does
my
delta
lake
look
from
previous
versions?
Well,
with
a
proper
connector,
you
can
actually
see
older
versions
of
the
data
okay,
so
so
without
getting
into
all
of
the
details
here.
A
Okay,
because,
frankly,
I
put
a
link
here
to
give
you
an
access
to
the
design
document.
So
that
way
you
can
definitely
go
look
in
the
details.
Number
one
and
number
two.
We
actually
have
as
part
of
the
delta
users
slack
we'll
have
we
have
not,
we
will
have.
We
have
bi-weekly
presto,
connector
meetings.
So
you're,
more
than
welcome
to
join
us
and
I'll
put
a
link
near
the
bottom
for
where
you
can
find
that
information.
A: But the key call-out of the design is this: it starts with the Presto coordinator and the Presto executors, as you see here, and, as you already know, those are the two types of JVM processes. So now the Delta connector coordinates its calls with the Delta Standalone Reader library rather than the Hive metastore: the metadata provider that you see here is what loads the Delta metadata, which is stored in Delta's own transaction log.
A
So
traditionally,
when
it
comes
to
working
with
presto,
the
metadata
is
actually
stored
inside
the
metastore,
but
because
of
the
way
delta
works
that
metadata.
That
tells
you
which
files
file
paths
excuse
me
contain
what
files
which
ultimately
make
up
your
table.
That's
stored,
actually
in
the
underscore
delta,
underscore
log
file
folder,
which
is
a
bunch
of
transaction
log
files
that
contain
json
files
that
contain
that
information.
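To give a feel for what reading table state out of the _delta_log means, here is a deliberately simplified sketch: each versioned JSON commit holds add and remove actions, and replaying them in order yields the current file list. (The real Delta protocol also has checkpoints, metadata, and protocol actions; this is an illustration, not the connector's code.)

```python
import json
from pathlib import Path

def current_files(table_dir: Path) -> set[str]:
    # Replay _delta_log JSON commits in version order to get the set of
    # data files that currently make up the table.
    files: set[str] = set()
    log_dir = table_dir / "_delta_log"
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files
```

This is why no manifest is needed: the log itself is the authoritative listing of the table's files at any point in time.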
A
So
we
created
a
metadata
provider
that
is
able
to,
even
though
normally
presto
is
going
to
access
that
information
directly
from
metastore
it's
going
to
access
it
from
the
delta
lake
transaction
log.
Instead,
it
still
accesses
the
table
information
from
the
meta
store,
but
when
it
comes
to
the
underlying
metadata,
it's
actually
going
to
access
it
from
there.
Okay,
any
of
the
information
containing
splits
on
how
presto
is
going
to
be
splitting
that
split
generator.
A
That's
also
included
as
part
of
and
part
of
this
code
base,
in
which
the
delta,
in
which
now
we
can
figure
out
how
to
split
that
delta
table
into
multiple
input,
splits
and
then.
Finally,
the
page
source
provider
is
that's
the
interface
in
which
the
task
will
get
the
record
reader
for
a
given
split.
So
by
building
it
out.
A
This
way
now,
we've-
hopefully
at
least
seamlessly
made
it
so
that
presto
can
go
ahead
and
interact
with
the
with
a
delta
lake
table
without
the
users
themselves
being
aware
at
all
that
they're
talking
to
a
delta
lake
table,
so
it's
completely
transparent
to
the
users,
which
is
our
main
goal
here,
all
right
so
enough
to
be
said
about
talking.
Let
me
just
go
show
it
to
you.
So
let
me
let
me
dive
into
it
here.
So
I've
got
this
terminal
window
in
which
I'm
logged
into
my
pr
local
presto
instance.
A
Do
you
know
the
fact
that
this
precedence
is
locally
running
locally
on
my
box,
though
I'm
actually
for
the
fun
of
it?
Accessing
data
is
stored
in
an
s3
bucket,
so
there's
gonna
be
a
little
bit
of
latency,
but
I
didn't
want
to
call
that
out
and
also,
if
I
shift
over
here,
this
is
the
ui,
the
local
ui
than
running
that
you
can
see.
What's
going
on,
I've
got
here's
the
number
of
nodes
that
are
running.
I
did
actually
have
an
error
from
a
fake
sorry,
not
a
faulty
query
from
before.
A
So
I'm
going
to
go
ahead
and
run
this
right
now,
okay,
so
let's
go
ahead
and
of
course
I'm
copying
pasting,
let's
be
honest,
so
because
I
cannot
type
this
fast,
but
nevertheless
let
me
go
ahead
and,
oh
sorry,
I
am
showing
you
the
wrong
screen,
I'm
going
to
go
ahead
and
paste.
My
first
query
inside
here
and
so
right
now,
what's
going
on,
is
exactly
what
you
expect
presto
in
this
case,
what
it's
doing
it's
accessing
the
s3
bucket
this
particular
bucket
for
the
new
york
city
taxi
data
set.
A
So
this
is
initially
going
to
run
a
little
slow,
but
around
30
seconds
or
so,
oh
there
you
go,
it
is
it's
done.
I
did
want
to
call
out
that,
because
of
the
way
we've
set,
this
particular
set
up
right
now.
What
we're
doing
is
we're
actually
specifying
the
path,
not
the
actual
hive
metastore
bucket,
so
because
we
use
delta
s3
and
we
specify
this
particular
path
setting
right
here,
then
we
know
that
we're
actually
accessing
the
delta
table
through
its
file
path,
as
opposed
to
traditionally
through
the
metadata
metastore.
A
Yes,
you
have
the
access
to
the
metastore
as
well,
but
I
just
wanted
to
call
that
out
so
perfect.
You
get
to
see
the
data
set
and
you're
good
to
go
perfect.
When
I
go
ahead
and
switch
back
to
the
ui,
you
get
to
see
the
queries
here,
exactly
what
you
expected
in
the
presto
ui,
all
the
pertinent
information.
All
that
stuff
is
here
exactly
as
you
would
expect
so
so
far,
so
good,
nothing
terribly
unexpected.
Let
me
go
back
to
the
terminal
again,
all
right.
A
So
now
I'm
going
to
run
a
bigger
query,
but
without
partitions,
okay,
and
so
I'm
going
to
run
this
one
right
now
so
so
far
so
good.
Let's
think
here
all
right
I'll
show
you
the
ui
view
of
it
now,
and
so
it's
right
now
planning
it
through,
and
this
query
should
also
take
about
35
seconds
or
so
we'll
see
what
happens
all
right.
A: You can look at the UI in terms of the rows per second and all that fun stuff. Look at this, it might take a little bit longer, so my apologies for my faulty predictions, but as I did note, the fact is that it is actually accessing an S3 bucket, so it's probably running a little bit longer. Now, this, you'll notice, is against a non-partitioned table. It's not a huge table, but nevertheless...
A
It's
not
able
actually
to
break
things
down
faster
because
it's
accessing
a
non-partition
table
so
what's
great
about
it
is
that,
of
course,
you
know
if
I
have
a
same
query,
but
I'm
going
to
this
time
run
it
using
a
partitions,
okay
same
table
except
you'll
notice
that
in
this
case
I'm
saying
new
york,
city
219
part
because
that's
the
represent
the
partition
table
versus
the
non-partition
table,
okay,
and
so
I'm
going
to
go
back
and
sure
enough.
Here's
the
query!
So
this
first
one
that
I
ran
took
about
45.82
seconds.
A
Okay,
35.86
was
actually
done
executing
all
right.
Let's
see
what
happens
now
and
as
you
can
tell
we're
almost
done
here,
we're
looking
at
the
ui
so
significantly
faster
because
we're
actually
using
a
partition
table
instead
of
the
45
seconds
or
excuse
me
46
seconds
now
we're
talking
about
26.88
seconds
so
significantly
faster,
which
is
pretty
cool
all
right
so
far,
so
good,
but
right
now
all
I've
really
shown.
A
You
is
the
fact
that
okay,
I've
made
it
super
easy
for
presto
to
query
the
delta
lag
table,
and
that's
great
don't
get
me
wrong.
That's
all
good
stuff,
but
am
I
actually
also
able
to
leverage
some
of
the
cool
functionality
of
deltalic
and
the
best
way
for
me
to
show
this
actually
is
to
go
ahead
and
run
a
version
or
a
history
table
query?
Okay,
so
here
is
I'm
going
to
show
you
a
little
bit
of
syntax
now,
okay,
give
me
one!
A
Second,
I
just
go
back
to
terminal
all
right,
so
I'm
going
to
run
this
query.
It'll
run
relatively
quickly,
but
basically,
what
it's
doing
is
it's
doing
a
select
count
from
the
partition
table.
Okay,
now!
Well,
you
notice
here
and
it
number
one.
It
actually
fits
pretty
quickly,
but
you
notice,
I
have
this
additional
syntax,
the
at
v1.
A
By
the
way,
the
design
document
that
I
had
posted
to
this
inside
the
the
slides,
the
design
document
actually
explains
exactly
what
we're
doing
here,
but
adding
this
additional
syntax
is
basically
saying
I
want
to
look
at
version
one
of
the
table,
basically
the
first
set
of
insertions
that
I
put
into
this
table
this
delta
lake
table.
By
the
way
there
are
nine
versions
to
this
table.
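For illustration, the @v suffix can be thought of as a table reference plus a version number. A hypothetical sketch of parsing such a reference (this is not the connector's actual parser, and the table name is invented):

```python
import re
from typing import Optional

def parse_versioned_table(name: str) -> tuple[str, Optional[int]]:
    # Split a reference like 'nyctaxi_2019_part@v5' into the base table
    # name and the requested version (None if no @v suffix is present).
    m = re.fullmatch(r"(?P<table>.+?)@v(?P<version>\d+)", name)
    if m:
        return m.group("table"), int(m.group("version"))
    return name, None
```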
A: So now I'm going to run it, and again it's actually running pretty quickly: it's a partitioned table on only a two-node cluster, but it's still able to make use of that, so it's able to split the data quickly enough, bring the data back, and come back with the results nicely. And this time you'll notice that version five of the table actually has 79 million rows; well, 78.9 million rows, but close enough, 79 million rows.
A
So
again,
I'm
looking
at
the
history
and
then
if
I
was
to
look
and
you'll
notice
right
here
it
says
version
5.
and
then
finally,
I'm
going
to
run
the
last
version
of
the
table,
so
right
here
bam
and
sure
enough
it'll
go
run
through
and
when
it's
done,
it'll
actually
have
even
more
rows.
And
so,
if
I
go
ahead
and
look
at
the
ui
okay,
sorry
just
make
sure
yes
right.
You'll
need
notice
how
the
queries
are
running
perfectly
fine.
A
If
I
was
to
look
at,
let's
just
say,
the
v1
version
of
this
there's
actually
no
difference.
Okay,
because
what's
happening
what's
happening
here,
is
that
the
delta
standalone
reader,
what
it
does
sorry
the
presto
delta
connector,
what
it
does
is
it
actually
is
accesses
the
delta
standalone
reader.
The
delta
standalone
reader
itself
is
automatically
able
to
return
exactly
which
sets
of
files
that
belong
to
version.
One
of
that
table,
because
it's
the
one
that
returns
that
information,
the
presta
connector
just
gets
the
list
of
files
it
needs.
A: It actually understands what's happening. And just to finish up what we're showing here, the final query we ran there, which is against the ninth version of the table: there are 84 million rows. So you'll notice that, with this capability, you're actually able to make use of the time travel capability within your Delta Lake table right from the get-go as well. So, pretty cool; hopefully you get to enjoy using that. All right.
A
So
I'm
going
to
switch
back
to
the
slides,
real
quick
before
we
take
on
any
more
questions,
but
saying
that
I
did
want
an
important
callout
to
do
some
attributions
to
the
folks
that
actually
helped
build
this.
I
want
to
call
it
venky,
sadgeth
and
george.
They
were
crucial
to
the
help
development
of
the
project.
If
you
yourself
want
to
get
involved
with
this
project
or
any
other
one,
please
ping
us
at
delta
dot
io.
I
think
it's
the
next
time.
Yes,
it
is
all
right,
so
you
want
to
build
your
own
delta
lake.
A
You
want
to
help
us
with
the
delta
lake
inner.
You
want
to
join
the
presto
and
delta
community
meetings
that
we
have
every
two
weeks
just
join
us
at
https,
delta.io
and
and
all
the
information
from
the
slack
user
group
and
everything
else
is
all
available
there.
So
we
absolutely
welcome
you,
as
part
of
the
presto
community,
to
come,
join
us
to
help
us
improve
this
connector
and
then
saying
that
if
you
have
any
questions
left
by
all
means,
this
is
the
perfect
time
to
ask
them
right
now.
A
You'll
notice
that
I'm
a
bit
of
an
expanse
fan,
so
yes,
the
quote:
I'm
going
to
use
as
you're
about
ready
to
send
any
questions
my
way,
so
you
can
tell
you
found
a
really
interesting
question
when
nobody
wants
you
to
answer
it,
so
hopefully
I'll
actually
be
able
to
answer
it,
but
nevertheless,
please
do
ask
away
and
again,
if
you
have
any
questions
or
want
to
join
the
delta
lake
community
with
the
this
presto
connector
just
join
us
at
delta,
dot,
io.
A
Hey
everybody-
hopefully
you
guys
could
hear
me
now,
but
if
you
guys
get
questions,
I'd
love
to
hear
them.
B: Yeah, thanks so much, Denny. That was a really great presentation; it went into much more detail on Delta Lake, obviously, but on the connector as well, and it was really good to see time travel, going back in time.
A
Yes,
that
was
a
big
ask
that
everybody
was
going
for,
and
so
that
was
one
of
the
first
things
we
did.
We
did
talk
to
the
presto
community
specifically
about
the
syntax,
so
because
originally
our
syntax
was
much
more
closer
to
be
honest
to
spark
and
then
based
on
the
feedback
from
the
person
we're
like
got
it
we're
switching
this
right
now
and
so
yeah.
We
have
much
more
presto
friendly
syntax,
so
I
I
mean
for
anybody,
that's
listening,
especially
if
it's
like
whether
you're
watching
it
now
or
watching
later.
Really.
A
Please
do
join
us
because
we
really
want
to
take
into
account
your
feedback
so
that
way,
whether
we're
working
out
or
you
guys,
are
working
or
doesn't
matter.
Yeah
like
you,
can
chime
in
and
give
us
that
feedback.
So
we
can
keep
on
updating
it.
Yeah.
B: Absolutely. Sebastian has a question: what are some of the key features on the roadmap in the next few months for the connector?
A: Yeah, so right now the two key things that we're working on: we're actually trying to do some faster optimizations, basically better memory management, so that we can handle larger tables even faster. It already works for pretty large tables, but right now, candidly, there is a caveat.
A
There
are
issues
surrounding
the
metadata
in
which
you
actually
have
to
basically
take
all
of
the
the
entire
file
listing,
and
so,
if
you're
talking
about
batch
processing,
that's
honestly
not
that
big
of
a
deal,
but
if
you're
streaming
data
into
that
delta
lake
at
the
same
time,
the
idea
that
you
actually
have
to
read
the
entire
thing,
the
entire
file
list
in
the
memory
and
then
tell
presto
what
it's
supposed
to
do
can
actually
be
very
memory
intensive,
and
so
the
context
is
that
we're
gonna
change
that
up
by
adding
an
iterator
specifically
to
speed
up
processing
of
that
the
in
terms
of
farther
out,
obviously,
there's
going
to
be
features
and
functionalities,
as
in
like,
oh,
you
know,
can
we
do
better
partition
pruning
things
of
that
nature?
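The iterator idea can be sketched like this: instead of materializing the entire file listing in memory, yield one file path at a time so that memory stays bounded (a hypothetical illustration of the approach, not the connector's code):

```python
import json
from pathlib import Path
from typing import Iterator

def iter_added_files(log_dir: Path) -> Iterator[str]:
    # Stream 'add' actions from versioned _delta_log commits one at a
    # time, instead of loading the entire file listing into memory.
    for commit in sorted(log_dir.glob("*.json")):
        with commit.open() as fh:
            for line in fh:
                action = json.loads(line)
                if "add" in action:
                    yield action["add"]["path"]
```

Because this is a generator, a consumer can start handing out splits after the first file is read rather than after the whole listing is built.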
A
So
we
are
definitely
looking
at
that,
but
the
other
big
call-out
that
I
wanted
to
say
is
that
we
are
starting
to
and
I've
again
that's
why
I
want
love
to
get
feedback
is
to
go
ahead
and
okay
is
there
interest
for
the
presto
to
delta
writer
as
well
right
right
now,
the
vast
majority
of
people
are
asking
for
readers.
So
that's
why
we
focused
on
that
one
first.
But
are
you
also
interested
in
writing?
So,
if
you
are
again,
please
chime
in
we,
incidentally,
we're
going
to
be
publishing
the
proposed
roadmap.
A
I'd
say
in
mid
to
late
january,
for
for
the
del
for
delta
lichens,
specifically,
obviously,
if
it's
not
this
audience
the
delta
connector
for
presto,
so
we're
gonna
want
you
guys
to
chime
in
and
tell
us
what
you
want
out
of
this
okay.
So
right
now
we're
getting
a
little
bit
of
feedback
that
like,
for
example,
ctas
operations,
are
the
desired
flow
cool.
A
Then
that's
actually
honestly
relatively
easy,
so
we
can
do
it,
but
by
the
same
token,
if
there's
some
more
complex
scenarios
love
to
know
about
it,
so
you
can
chime
in
and
then
I
did
notice
denise.
Hopefully
I'm
saying
your
name
correctly.
At
least
myself
used
to
live
in
montreal,
I'm
assuming
that's
french,
so
I
apologize
if
it's
not
does
at
v1
syntax
work
in
sparse,
equal
as
well,
not
right
now,
in
fact,
we're
gonna
go
back
and
pull
do
some
pull
requests
into
the
spark
community
to
accept
that
syntax.
B: Sounds good, thank you. And then on the write path, I wanted to say that CTAS is something that some of the other connectors already support for the other table formats, so it's a good place to start. We are also looking at just basic, you know, insert/update/delete operations: start with inserts first, the most common, and then look at updates, upserts, deletes, and so on. Those get more complicated, because they basically touch the entire code path, from the parser all the way down to the writer itself, and that's something that Ahana is looking at adding on the write path as well. So, just a last question on catalogs: is there a preferred catalog or not?
A
No,
there
isn't
actually
a
preferred
one.
I
mean
we're
all
we're
going
to
be
compatible
like
the
at
least
the
ones.
We've
been
testing,
it's
more
like
the
hive,
metastore
or
hms
and
like
glue,
for
example,
there's
no
big
requirement
that,
but
the
part
of
the
reasons
is
the
the
context.
At
least.
Is
that
mo
the
vast
majority
of
the
metadata
for
that's
red?
It
actually
is
not
read
from
the
the
meta
store.
The
vast
majority
is
actually
read
from
the
transaction
log.
A
Now,
what
we've
done
with
the
connector
and
also
using
the
delta
standalone
reader,
is
that
basically,
that
metadata,
it's
the
exact
it's
completely
transparent
to
presto
whatsoever
in
terms
of
it,
doesn't
need
to
know
that
it's
grabbing
the
data
from
the
transaction
log,
which
is
basically
a
file
system
versus
actually
reading
from
the
meta
store.
The
reason
delta
lake
was
designed
this
particular
way
was
to
allow
for
much
much
larger
scale.
Building
we're
talking
about
hundreds
of
petabyte
systems,
which
normally
operate
where
basically
candidly
we
would.
We
would
see
the
metastore
fail
and.
A
Of
those
scenarios,
we're
like
that's
why
we
did
that
with
the
with
the
the
file
system,
but
because
of
that,
we
wanted
to
make
sure
that
the
from
a
presto
perspective
didn't
matter
it
just
it
plainly
didn't
matter.
So
the
cool
advantage
by
the
way,
and
thanks
for
the
question
rohan,
is
that,
in
addition
to
be
able
to
query
through
the
meta
store
directly
itself,
you
can
also
query
directly
through
the
file
path.
So
you
actually
have
both
options
now
so.
B: Awesome. Well, Denny, thank you so much; we're out of time, and our drummer must be going in the other room, so we're going to go join him and close out the session. Denise, Sebastian, Rohan, thanks for the questions; looking forward to the write path being executed as well. Next PrestoCon, hopefully we'll have both in and out. Exactly.