From YouTube: Delta Lake Community Office Hours (2022-04-19)
Description
Join us for the next live Delta Lake Community Office Hours on April 19 at 9 AM PST / 12 PM EST and ask us your Delta Lake questions. Featuring: Tathagata Das, Allison Portis, Ivana Pejeva, and Vini Jaiswal. The session will be hosted live and the recordings are available on the Delta Lake YouTube channel.
A
We had to postpone this session due to some technical difficulties, but we finally got to it. Here we have Ivana and TD from our Delta Lake ecosystem teams. We are here to answer questions about Delta Lake releases, Delta Lake features, anything else that's going on, and even features you don't see on our roadmap; you can ask about those too.
A
A quick recap from the last session: we had Scott Benke and Denny as panelists, and they provided amazing insights on the connector ecosystem and the Presto Delta and Trino Delta integrations.
A
We also covered the file compaction feature and the improvements for restoring a Delta table to earlier snapshots, and we will talk a little more about those in this session as well. There were also a few highlights on future releases: improving file compaction to capture partial progress, and adding Z-ordering so that queries will be faster.
A
So that's the recap. Now I will hand it over to the panelists so they can do a quick introduction. TD, why don't you say hi and a few words to our community?
B
Hey everyone, I'm universally called TD; those are my initials! I am a staff software engineer at Databricks. I work with the folks that focus on the Delta Lake OSS project. My background: I've been involved with the Spark community for the last 12 years, and I've been working on stream processing in Spark for close to 10 years now.
C
Thanks. Hey everyone, I'm Ivana. I'm a data engineer working as a consultant based in Belgium. I'm part of the Delta ecosystem, and I do a lot of implementations using Delta at different customers, in production systems.
A
That's awesome. You're working on large-scale products, Ivana, so glad to have you here, along with our OG TD. That's exciting. So TD, you mentioned that we were working on the Delta 1.2 release. What are some of the key features the community should be excited about? I know we talked about performance and about user experience improvements, so what are a few things you would like to shed light on?
B
Yeah, so let's see where we can start. As you said, there is performance and there is user experience. Performance-wise, one of the features that was most awaited for a long time, and that we finally have in open source, is column stats. The short explanation of the feature is: when we write out files in the Delta Lake format, we automatically collect stats for the columns in each file, for the first 32 columns, which is configurable.
B
Within each file, we keep track of the max and min values of each column. That is the column stats, and we save them as part of the Delta log, which contains all the metadata about what files are present in the Delta table. The advantage is that when you actually go to read the table, it doesn't matter whether it's a SQL query or a DataFrame query, or where you're querying from.
B
If your query has filters, like 'read only rows where this column equals five', we can use these min/max stats to pretty aggressively eliminate files that cannot possibly contain the rows you're looking for, even before reading them.
B
We completely skip reading files whose min/max range does not include five. That is data skipping, and depending on the type of query, if the query is expected to filter out a lot of files and focus on a few rows, this can eliminate a lot of scans out of the box.
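To make that concrete, here is a minimal PySpark sketch (not from the session) of the kind of selective query that benefits from this file-level min/max skipping; the table path and column name are placeholders:

```python
from pyspark.sql import SparkSession

# Minimal sketch: a selective filter that Delta can answer while skipping files
# whose per-file min/max column stats exclude the requested value.
spark = (
    SparkSession.builder.appName("data-skipping-sketch")
    # Assumes the delta-spark package is on the classpath.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical Delta table path and column name.
events = spark.read.format("delta").load("/tmp/delta/events")

# Files whose min/max range for `device_id` cannot contain 5 are never read.
events.where("device_id = 5").show()
```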
B
Now, tied to this is a future feature, OPTIMIZE ZORDER. Without going into the details, Z-ordering is basically a multi-column space-filling curve. What it does, to not use fancy terms, is cluster the data in your table such that the min/max range of each column in each file is small, so that when you are looking for a particular value of a column, only a few files can possibly contain that value.
B
Most of the files can be eliminated just by doing the basic data skipping on the min/max stats, without even touching or reading those files. So, to make this column-stats data skipping even more effective, we are working on OPTIMIZE ZORDER data clustering, which will hopefully make it into the next release of Delta.
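For reference, the Z-ordering command that Delta Lake later shipped (it was still an upcoming feature at the time of this session) takes roughly this shape; the table path and columns are placeholders:

```python
from pyspark.sql import SparkSession

# Assumes a SparkSession already configured with the Delta Lake extensions.
spark = SparkSession.builder.getOrCreate()

# Recluster the table so that per-file min/max ranges are tight on these columns,
# which makes the column-stats data skipping described above more effective.
spark.sql("OPTIMIZE delta.`/tmp/delta/events` ZORDER BY (device_id, event_date)")
```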
B
So that's the performance track. The user experience track is actually tied to this OPTIMIZE command. I was just saying that we are working on OPTIMIZE ZORDER, which is data layout reorganization for better clustering, but what we released in this version is OPTIMIZE, which is just file compaction. Earlier, there wasn't any built-in command to do file compaction.
B
If you are writing from a streaming query, you often end up with many, many small files, which can cause read performance to be lower. This command can very easily compact them into roughly 1 GB files. 1 GB is a sweet spot: large enough to avoid small-file reading overheads, but not so large that you sacrifice parallelism when reading those files. So OPTIMIZE comes out of the box now.
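A minimal sketch of invoking the new compaction command from PySpark (Delta Lake 1.2 exposes it as SQL; the table path is a placeholder):

```python
from pyspark.sql import SparkSession

# Assumes a SparkSession already configured with the Delta Lake extensions.
spark = SparkSession.builder.getOrCreate()

# Compact the many small files produced by streaming writes into larger files
# (targeting roughly 1 GB per file by default).
spark.sql("OPTIMIZE delta.`/tmp/delta/events`")
```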
B
This is again a very useful feature when you have to fix something that went wrong, so having all of this out of the box definitely improves the user experience.
A
Thank you, TD, for giving a highlight of each of the features. I think the first feature you mentioned was data skipping, for which we collect column stats. Can you go a little bit deeper into how it collects information about those column statistics? Are we limited to only 32 columns? Is there any limitation on the column names or things like that?
B
Very good question. Let me break it into two parts: how it collects the stats, and then the limitations on column names.
B
How it collects them: Delta uses Spark's Parquet writing code path, so Delta just lets Spark do the Parquet writing.
B
Spark has hooks in its Parquet writing code path where you can add your own stats-collection framework, so as every row is being written out to a Parquet file, that framework can inspect the row and check the value of each column. Delta plugs into that framework to collect statistics for the first 32 columns, specifically min, max, null counts, and so on. So it hooks nicely into existing APIs already provided by Spark. Now, why the first 32 columns? This is a limit as of now, and it is not a hard rule. All these additional column stats need to be stored in the Delta log, along with the file names and the rest of the metadata.
B
This can add quite a lot of overhead to reading the log, and we don't want the log to be unnecessarily huge. That's why, as of now, we are limiting it to the first 32 columns. There are two things to note: one, 32 is a configurable number.
B
You can expand it if you want. And two, if you don't want to expand it but you want a particular column to be covered because you are going to filter on it frequently, you can use ALTER TABLE to change the column order and put that column within the first 32 columns. Those are the two things you can do to make it work.
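As a sketch of those two knobs (the property name and DDL below follow the Delta Lake documentation rather than anything quoted in the session, and the table and column names are placeholders):

```python
from pyspark.sql import SparkSession

# Assumes a SparkSession already configured with the Delta Lake extensions.
spark = SparkSession.builder.getOrCreate()

# Option 1: raise the number of leading columns for which stats are collected.
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '48')
""")

# Option 2: move a frequently filtered column earlier in the schema so it
# falls inside the indexed range.
spark.sql("ALTER TABLE events ALTER COLUMN device_id FIRST")
```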
B
We also had the limitation that column names could not contain spaces, Unicode characters, and that kind of thing, because the Parquet file format did not allow column names like that. With 1.2.0, the new column mapping mode lifts that limitation, so such column names are now supported.
A
That's awesome. With that, there was also one more feature: Delta offers generated columns, right? Can you talk a little bit about that as well?
B
Yes, another feature; man, we would need a lot more time to talk about all the features we released. Along with the column-stats-based data skipping, another performance improvement we made builds on generated columns, which Delta already supported. For those who are not familiar with generated columns, it's a feature where you can say that the value of column one will always be generated by an expression over column two. For example, here is where that is useful.
B
Let's say you have a timestamp in your data, but you want to partition by date instead. One way to do it is to extract the date from the timestamp yourself, but if you do it through a generated column, then the Delta log is actually aware of the relationship between the two columns: that the date is always a function of the timestamp.
B
What the new feature allows us to do is that whenever you have filters on the timestamp column, say you want timestamps within a certain date range, we can automatically generate additional filters on the dependent generated partition column for the date. That means in many places we can take full advantage, out of the box, of partition pruning, and things like dynamic partition pruning can work with this automatically generated filter on partition columns that are also generated columns.
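A minimal sketch of declaring such a table with the delta-spark builder API (the table and column names are placeholders):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Assumes a SparkSession already configured with the Delta Lake extensions.
spark = SparkSession.builder.getOrCreate()

# `event_date` is a generated column derived from `event_time`, and the table
# is partitioned by it. Filters on `event_time` can then be translated into
# partition filters on `event_date` automatically.
(
    DeltaTable.createIfNotExists(spark)
    .tableName("events")
    .addColumn("event_time", "TIMESTAMP")
    .addColumn("event_date", "DATE", generatedAlwaysAs="CAST(event_time AS DATE)")
    .partitionedBy("event_date")
    .execute()
)
```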
A
So TD, about what you just mentioned, which is awesome by the way: a generated column is a column that is automatically populated by Delta, if I understand correctly. What is it that happens in Delta Lake that allows it to generate that column? Can you give a little bit of the mechanics there?
B
If you just give it the timestamp column, the date column gets generated automatically. It is more foolproof, because otherwise multiple applications writing to the same table could compute the date differently, and so on. That can't happen here. And if, by any chance, some application does misbehave by computing the date incorrectly itself, the generated-column expression will also validate whether the data being written satisfies that expression, and if it doesn't, because some writer is misbehaving, the write is blocked.
B
The way these features are designed, they work very well with each other using the same underlying concepts, so that ultimately, from the user's point of view, things just work better out of the box. You cannot corrupt the data, and if you give the table more information, like additional constraints or generated columns, the features coming along the way will try to take advantage of it as much as possible to give better performance.
C
Yeah, actually this release has very exciting features, things people were really waiting for, especially on the performance side with OPTIMIZE. That is something that was very much needed, especially if you're working with small files, and especially in streaming cases where you generate a lot of small files.
C
There are a lot of features here. The one he mentioned about column names: I've hit that problem a lot of times at clients, where you have columns with strange characters that people still wanted to persist in Delta, and it was not possible before. So I think that is also one of the nice features. This release really has many features that bring better performance and a lot of user experience improvements.
C
I think the Z-ordering will be one which is going to be really nice. I got to use it before, and I think it can bring a lot of extra performance benefits, so it's one of the features I'm looking forward to.
A
Awesome. So TD, one more question around the features. There is a feature where we have concurrency controls for reads and writes: Delta makes sure that whenever there are multiple reads and writes, it ensures mutual exclusion and decides on the order of the write operations. There is a feature related to this that we were working on in the Delta 1.2 release. Would you like to add what's going on there?
B
So, since the beginning of the Delta Lake open source project, we have had the problem that, because S3 does not provide the necessary mutual exclusion guarantees, we could not support multi-cluster writes on S3. At any point in time, multiple clusters, say EMR clusters, could not write to the same Delta table on S3 while still satisfying the ACID guarantees.
B
Finally, we have a solution, and this was a huge collaborative effort between us and folks from Samba TV, who conceptualized the idea first. We co-designed it together to make it more rock solid, and then they developed it, and now it is in Delta 1.2: true S3 multi-cluster write support, using DynamoDB as the mutual exclusion solution.
B
The crux of the matter is that for the ACID guarantees to work, you need to write to the Delta log, basically create a log file in the Delta log directory, atomically and mutually exclusively; that is, only one writer should be able to create the log file for, say, version 5. S3 did not provide any API to get that mutual exclusion, so two writers both attempting to write 5.json in the log directory would both succeed, which means they would overwrite each other's changes, effectively losing data.
B
With this DynamoDB solution, instead of writing the commit directly to S3, we write it to DynamoDB, using DynamoDB's mutual exclusion so that only one of the writers can commit version 5 to DynamoDB. Only one writer wins; the other one backs off and retries as version 6. Then, from DynamoDB, we sync lazily.
B
Once the serialized order of the commits has been decided by committing to DynamoDB, we sync it back into S3, so that all writers can get the full correct view just by looking at S3 as well. That means any writers, as long as they are correctly configured to use DynamoDB, will get the full ACID guarantees without any chance of data loss. This is an experimental feature; we released it for the first time.
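As a rough sketch of how this experimental LogStore is wired up (the configuration keys below paraphrase the delta-storage documentation rather than the session itself, so check the 1.2.0 docs for the exact names; the table name and region are placeholders):

```python
from pyspark.sql import SparkSession

# Sketch: route Delta commit writes on S3 through the DynamoDB-backed LogStore
# so that concurrent writers from multiple clusters stay mutually exclusive.
# Requires the separate Delta S3/DynamoDB storage artifact on the classpath.
spark = (
    SparkSession.builder.appName("s3-multi-cluster-writes")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Use the DynamoDB LogStore for s3a:// paths.
    .config("spark.delta.logStore.s3a.impl", "io.delta.storage.S3DynamoDBLogStore")
    # DynamoDB table that serializes the commits (placeholder name and region).
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log")
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-west-2")
    .getOrCreate()
)
```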
A
That's helpful, TD. I have one question related to that. You said that because of DynamoDB we were able to get this S3 multi-cluster write feature. When you were deciding how to go about achieving it, why did you use DynamoDB? Was this something that Samba TV was already using? A little bit about that.
B
That's a good question. Conceptually speaking, any data storage, any database or key-value store that provides this single-row mutual exclusion guarantee, which pretty much all key-value stores and databases provide, can be used in this way. We chose DynamoDB because this is primarily a problem only with S3, and therefore most of the affected users would be in the AWS ecosystem, trying to write to S3. So the most natural product in AWS to use seemed to be its key-value store, which takes very little to maintain.
B
There is almost no infrastructure to maintain: just create a DynamoDB table, and you don't need to give it a lot of get/put requests per second or a large quota, because it is only used for the per-version reads and writes. It only contains metadata, not data, so it seemed the right, natural thing for all the AWS Delta users.
B
So that's why we chose it. But in the future, if other use cases arise where we want to provide multi-cluster writes on other file systems that do not provide those guarantees out of the box, we could follow the same approach using other key-value stores or databases, the same technique.
A
Got it, okay, that's helpful to know and learn, TD. There are a lot of people who might be working with other kinds of design considerations, so if you are, please reach out to us either on GitHub or the Slack channel, and we can look at those requests as well. So, talking about the ecosystem: there was a feature which the community has been excited about, along the lines of streaming, which is the Flink sink connector. Any light on that?
B
Yeah, that itself could take a dedicated office hours, because people were so excited about it. So much happened in the last month: we released the Presto and Trino integrations, we released the Flink sink, we released Delta 1.2; it's a very exciting time for us. So yes, we finally have a Flink sink with which you can write to Delta tables from Flink, both from a streaming job and from a batch job. Unfortunately, it uses the Flink DataStream sink API, which is a lower-level API, so it doesn't support the SQL and Table APIs yet.
B
We are working towards that; it is in progress. But for people who want to start using Delta tables in the Flink ecosystem, this is a great unblocking feature. We designed it in such a way that it supports Flink 1.12, which is a pretty old Flink version, and the connector will work on 1.13 and 1.14 as well.
B
You can write streaming queries using the Table API, but there is a way in Flink to get down to the DataStream API and use the DataStream sink as well, so a lot of things are unblocked by this. In parallel, we're currently working on the Flink source, which will support both streaming and batch queries, and also on SQL and Table API support, so that you can define Delta tables in the metastore and run SQL queries on them purely using Flink.
A
Yeah, so TD, we talked about a lot of features, and there are a lot of features we haven't been able to get to in this half-hour session. So it looks like we are done, right? We released so many features, and now we are done, right?
B
We were doing a trial run of it, and now it is finally robust enough that we can safely open source it: change data feed. The idea is that you have DML queries, like insert, update, delete, and merge, writing into a Delta table.
B
A Delta table right now uses copy-on-write to rewrite entire files, so it is harder to identify and read only the rows that were changed by the delete, update, and merge queries. Change data feed exposes only the changed rows, for both batch and streaming reads.
B
So that is another thing we are very excited about, this change data feed feature. What it helps you build is end-to-end, fully incremental pipelines. One way to think of pipelines is the medallion architecture: bronze table, silver table, gold table. You can have fully end-to-end incremental pipelines that append to a bronze table, then from bronze, say, merge changes into the silver table, and so on from silver.
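A minimal sketch of what consuming the change data feed looks like from PySpark once the feature lands (the property and option names follow the later Delta Lake documentation; the table name is a placeholder):

```python
from pyspark.sql import SparkSession

# Assumes a SparkSession already configured with the Delta Lake extensions.
spark = SparkSession.builder.getOrCreate()

# Enable the change data feed on an existing (hypothetical) silver table.
spark.sql("""
    ALTER TABLE silver_events
    SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')
""")

# Read only the rows changed by inserts, updates, deletes, and merges since
# version 5, instead of re-reading every rewritten file.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .table("silver_events")
)
changes.show()
```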
A
Yeah, that's exciting, TD. I know we are always working hard, and our community has given us a huge amount of support so that we can make these features happen, and their feedback keeps helping us improve our roadmap. So we are never done, and we keep releasing features.
B
And just to highlight one more thing: please engage with us on GitHub, on the roadmap posted there. That really helps us decide and prioritize what features we want to build. For example, all of these features that we open sourced were the result of the feedback we got from the community, from actual practitioners like Ivana here. So this is the culmination of all the feedback the community has given us.
C
Actually, the change data feed is something that I've already tried, and it's really nice for creating incremental pipelines. It's really nice to see that that's coming next; I didn't know.
C
It's really exciting. Yeah, I've used it, and it makes your life easier when you're dealing with incremental data: with all of the updates and inserts that you get, you don't have to propagate all the rows or all the files that were changed to your next job. So that's indeed very nice to hear; I'm excited for that.
A
Yeah, it's always helpful to hear from users like you; it's a kind of testament to the work that the community is doing. It was great having both of you here on the call today. For the community: if you would like to get involved in any of these projects, or even if you just have feedback, please share it with us on Slack, GitHub, or the Google Group. Our community works on Slack.
A
They will be there to answer your questions. So hopefully these sessions are meaningful to you, and we'll see you again next week. Thank you all.