From YouTube: Delta Lake Community Office Hours
Description
Join us for the next Delta Lake Community Office Hours and ask us your #DeltaLake questions. Thanks!
A: That was one heck of an awesome echo, wasn't it? Okay, so I think we are live now. Welcome, everybody, to today's Delta Lake Community Office Hours. I am currently trying to set everything up on YouTube as well as on LinkedIn, so give me a minute to get that in order, and then we will rock it. Perfect, we are on LinkedIn.
A: Excellent. Thank you very much, everybody, for joining today's community office hours. Right now we've got Christian from Scribd, Scott from Databricks, and myself from Databricks, and I think we're going to have TD joining us a little bit late. We are live on LinkedIn and YouTube, so if you have any questions when it comes to Delta Lake, you are currently in the right set of office hours. We're going to run for about 20 minutes today, so we'll end at about 9:20 a.m. Pacific time, or if you have no questions we'll end a little bit earlier. But I did want to let Christian and Scott introduce themselves real quick. To start off, Christian, why don't you provide a little background on who you are and, as a Delta Lake committer, what projects you are currently focused on.
B: Sure, yeah. I'm involved in Delta insofar as I'm a committer on delta-rs, which is the Rust implementation of the Delta Lake protocol, and on kafka-delta-ingest, which is a Rust daemon that streams JSON messages from Kafka topics to Delta Lake tables. Scribd is using it in production right now for about 60-ish topics.
C: Thanks, Denny. Good morning, everyone. Hi, I'm Scott, a software engineer on the Delta Lake ecosystem team here at Databricks. Recently I've been working on a variety of open source projects that we contribute to, such as the Delta Standalone library, which is the main low-level library for new connectors to connect to the Delta Lake protocol. Also, last week I open sourced the latest version, Delta Lake 1.1, which is really exciting, and now I'm continuing to work on some open source features and performance improvements.
A: Perfect, thank you very much, Scott. And hi, everybody, my name is Denny Lee. I'm a developer advocate here at Databricks, long-time Spark guy, long-time Delta Lake guy, so hopefully I'll also be able to answer your questions here. I did want to start off with, I believe, the more uber-detailed part of it, which is: Christian, in yesterday's Delta Rust API meeting we had talked about basically trying to figure out how to read the metadata faster.
B: Yeah, actually, rather than faster, the concern I have is more around memory. Currently, when we load the transaction log for a table in delta-rs, we're using a row-based format; we aren't using Arrow or any kind of columnar structure. And Denny, I think, had mentioned that you were facing some memory issues in the Delta Standalone reader, and I guess you may have already started to take steps to resolve them. So I'm just interested to hear any gotchas you hit along the way, or just anything about the problem as you're facing it in the Delta Standalone reader.
C: The main issue here is that you need to get the correct, latest snapshot of a table, which requires reading all the actions. You need to know which add files were added to the table, and you also need to know the remove files, the files that were removed, because sometimes those will cancel out. That's why you have to read all of them to get to the latest point, and typically you do that all at once: you load it all into memory. That's where the memory concerns come from.

A better way to do it is with an iterator method, but you can't just iterate forwards in time through the transaction log, because there could be remove files that come later. So the trick that we did, and the code is open source, feel free to look at it, in fact you can email me and I can send you the exact links, is that you actually iterate through the transaction log backwards. As you're iterating backwards, you keep track of the remove files you've seen and the add files you've seen, and you use that to decide which files to return to the reader. So that's how it works when you're updating and actually reading all the files.

And just a quick hint: when you instantiate a new Delta log, as I'm sure you probably do in your Rust API, there are two things a Delta log requires to be instantiated, the correct protocol and the correct metadata of the table. You can do that too using the iterator API: you just read backwards until you find those two and then you're done; you don't need to load all the actions into memory. Let me know if that made enough sense for you.
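As a rough illustration of the backward replay Scott describes, here is a minimal Python sketch. It is not the Delta Standalone or delta-rs implementation, and it ignores checkpoints: it walks the `_delta_log` JSON commits from newest to oldest, remembers the remove actions it has already seen, and yields only the add files that are still live. The directory layout and action keys follow the Delta transaction log protocol; the function and variable names are illustrative.

```python
import json
import os

def live_files(delta_table_path):
    """Replay the Delta transaction log newest-to-oldest, yielding paths of
    add files that were never removed by a later commit (checkpoints ignored)."""
    log_dir = os.path.join(delta_table_path, "_delta_log")
    commits = sorted(
        (f for f in os.listdir(log_dir) if f.endswith(".json")),
        reverse=True,  # zero-padded version numbers, so lexical sort = newest first
    )
    removed = set()  # paths removed by a commit we have already seen (later in time)
    yielded = set()  # paths already returned, to avoid duplicates
    for commit in commits:
        with open(os.path.join(log_dir, commit)) as fh:
            for line in fh:
                action = json.loads(line)
                if "remove" in action:
                    removed.add(action["remove"]["path"])
                elif "add" in action:
                    path = action["add"]["path"]
                    if path not in removed and path not in yielded:
                        yielded.add(path)
                        yield path
```

The same backward walk can also stop early once it has seen the latest protocol and metaData actions, which is the hint about instantiating the Delta log without loading every action into memory.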
A: Perfect, perfect. All right, so that was committer-to-committer questions, which is pretty sweet. All right, without further ado, TD, want to do a quick introduction of yourself? We already did ours. We have TD now joining us, which is awesome, because it's good timing: we have a bunch of questions from LinkedIn that we've got to go answer, and we're going to throw them all your way, TD. I'm joking, but still, introduce yourself.
E: All right, first of all, apologies, everyone, I was a little late; my previous meeting ran over. My name is, well, TD, everyone calls me TD. I am a staff software engineer at this company, on the same team as Scott; we work together very closely. I've been working on the Delta Lake project for the last four years, since its inception, and we're having way too much fun working on this Delta Lake stuff, so I'm still not bored of it after four years. So yep, ready to answer all your questions, throw all the hardballs at me.
A: No problem. Well, in fact, Shiv from LinkedIn has a bunch of questions, so I'm going to go backwards through them, and of course, Christian and Scott, definitely chime in too, but I'm going to direct some of these to TD because they seem to be very much in TD's wheelhouse. So let's start with the most recent one. Shiv, please continue asking your questions, this is awesome. The question is: can I run parallel queries to Delta Lake?
E: Yes, you definitely can, and the coolest thing is that you can run these parallel queries within the same cluster or across any number of clusters, with full consistency guarantees, because Delta internally maintains versioned snapshots of the table. Each of these queries will see a consistent version of the table. They may not see the same version of the table, because the table can be getting new versions as people are writing into it, but every query will see a consistent version of the entire table. You'll never get a case where a query sees half of what somebody is writing but not the other half; it will see entirely version one or entirely version two, and nothing in between. So yes, you can run things in parallel and things will be consistent. Whether you can actually run them in parallel is just a matter of your cluster configuration, whether there is enough capacity in your cluster for Spark to run those tasks. But from the data format point of view, you can absolutely run everything in parallel and you get a full consistency guarantee.
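A minimal PySpark sketch of what TD describes, assuming a Delta table at a hypothetical path with hypothetical column names: queries defined side by side each resolve against a consistent snapshot even while other writers append, and time travel can pin an explicit version if several queries must agree on exactly the same snapshot.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
table_path = "/tmp/delta/events"  # hypothetical table location

# Two concurrent queries: each reads a consistent snapshot of the table,
# even if another job is appending new versions at the same time.
totals = spark.read.format("delta").load(table_path).groupBy("date").count()
recent = spark.read.format("delta").load(table_path).where("date >= '2021-12-01'")

# Time travel pins a query to an explicit version when several queries
# need to see exactly the same snapshot.
v5 = spark.read.format("delta").option("versionAsOf", 5).load(table_path)
```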
A: Excellent, thanks very much, TD. I'm just going to go to the next question, though Christian and Scott, if you have anything that you'd like to add, by all means. The next question is also from Shiv, and this one is specifically about Z-ordering, the how and the why. The question is: can you please explain how Z-ordering works and why it improves performance?

Now, before I hand that off to the gentlemen here, I do want to call out that Z-ordering is currently specific to Databricks, to the Delta version that's in Databricks, as opposed to the one that's in OSS Delta. We definitely want to answer the question, but we wanted to call that out. So, saying that, who would like to tackle the question concerning Z-ordering?
E: I can, but then I won't let others have an opportunity. It's hard to explain this without visuals and slides, but say you have two columns and you want to cluster the data such that, when filtering by either of the columns, you can narrow down which files to read, through some sort of clustering on both those columns. The naive way would be to have a primary sorting column that you sort by, then a secondary sorting column within that, and then split that data into files.
E: The thing is, if you really do the math, that if you have queries that filter by one of the columns, you can filter down very well, but if you filter by the other column, it doesn't filter down very well, because one is the primary sort and the other is the secondary sort. What Z-order, or the generic term, space-filling curves, does is balance it out across the multiple columns. For example, to put some sample numbers on it: with simple primary/secondary sorting, if you filter by column one, it will filter down from 100 files to two files, but if you filter by column two, it will only filter down from 100 files to 50 files, which is not much better. Whereas if you use a space-filling curve, then for both columns it will filter down to, let's say, ten files. So it's not the best for any individual column, but it's more evenly balanced across whichever column you want to filter by.
E: There are other space-filling curves, like the Hilbert curve, and Z-ordering is just one particular type of space-filling curve. So that's the basic idea of Z-ordering: how do you cluster the data across files such that, when you want to search for a particular row with particular values in those multiple columns, you can very quickly narrow it down without reading the data in every file? You can say: these are the only files that can contain those particular values, because in these files the min/max values of those two columns cover a small range that matches the value, and the other files cannot possibly contain it. Hopefully that answers the question. It's hard to explain without slides and visuals, but hopefully that conveys the point.
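For reference, this is roughly what the command looks like on Databricks (as noted above, Z-ordering is not in OSS Delta at the time of this session); the table and column names here are hypothetical.

```python
# Rewrites the table's files so event_date and user_id values are clustered
# together; data skipping then prunes files using per-file min/max statistics.
spark.sql("OPTIMIZE events ZORDER BY (event_date, user_id)")

# A filter on either column can now skip most files.
spark.sql("SELECT count(*) FROM events WHERE user_id = 42").show()
```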
A: I certainly think it does. By the same token, folks, if you do have more questions concerning Z-ordering, by all means do ping us. While we answer the next question, I'm going to grab the blog post that we wrote on Z-ordering and paste it into both the YouTube and LinkedIn links, so that it provides the additional details. But I figured at least giving us a chance to answer first would be really good.
A: The next question I'll answer, or at least I'll start, and we'll go from there. It's also from Shiv and it's a great question: how can I replicate a Snowflake database as a Delta Lake table? I'm going to answer the question not so much targeting any particular database, just more the general idea of going from a database to Delta Lake.
A: The approach is to basically utilize the database's bulk export mechanism to get the data into blob storage, whether that's initially in CSV, or Parquet, since some databases support that and some don't, so I don't want to automatically assume it. The context is that since you're talking about databases, it's typically structured data to begin with, which means CSV or tab-delimited or that type of format is perfectly fine.
A
When
you
export
the
data
out,
then
having
delta
lake
rated
is
pretty
much
a
straight
forward,
you
know
literally
just
a
straightforward.
Okay
in
spark
read
bam.
You
create
the
delta
lake
and
you're
good
to
go
the
the
the
one
little
except
I
won't
say
exception,
but
one
little
call
out
would
be
that
if
it's,
if
you
can't
explore
in
parque,
there's
an
in-place
conversion
of
a
parquet
table
directly
into
delta
lake
as
well,
so
you
could
always
utilize
that
now
the
real
concern
isn't
so
much
that
part
of
the
replication.
A
Honestly,
though,
even
though
I've
went
on
a
little
already,
the
real
concern
is
more
more
like
when
you're
online
and
you're
continuously
processing
data.
How
to
get
that
data
in
and
typically
that
means
you're
going
to
have
to
do
some
form
of
streaming
mechanism
or
an
etl,
a
regular
periodic
etl
process,
and
so
you
will
always
need
to
separate
that
bulk
export
portion
from
that
regular
etl,
batch
or
stream
process.
What folks typically do is keep those as two separate processes. One is the bulk export from the database, and I've seen it happen multiple times with Oracle, SQL Server, whatever else. Then, upstream, they work on a regular ETL process where they multicast the data out, one stream to the database and one directly to Delta Lake. They do a reconciliation between what's in the Delta Lake versus what's in the database over a period of time, let's just say two weeks or a month, to make sure the numbers are matching, and once they validate that, they can shut off the flow into the database and it goes directly to Delta Lake. So I provided a whole bunch of context, and I recognize that if we want to dive deeper, we probably want to have a much longer conversation in the Delta Lake users Slack. So please go to delta.io.
Near the bottom there is a Slack channel link; you can just join us there. My name is Denny, you can just hunt me down. Okay, so I just wanted to call that out, but hopefully that answers your question or at least provides enough context. Anybody else like to add anything that I might have missed?
A: I'm seeing nods, so I think that'll be it.
A: Right, I'm going to switch to the next question. It's from Sharon Givi, and I apologize if I said your name incorrectly. This one I'm definitely going to target at TD, just because it is about merge. The question is: does merge, in Spark or Databricks, rewrite the files again even if there are no inserts or updates, when utilizing the Delta Lake MERGE statement?
E: Okay, good question. The way merge works is in two passes on the data. In the first pass, it scans the data to identify which files contain data that needs to be updated or deleted, and then, in the second pass, it rewrites only those files. So in your table, if a file doesn't contain any data that needs to be updated or deleted, it will not be rewritten at all. Merge is very selective in that way.

Now, how many files out of your table will actually need to be rewritten is a separate question, and it totally depends on the distribution of your data, the data layout, and the changes you are trying to merge in: what is the distribution of those changes, and is there any spatial locality in them or not? Those are the things that make a difference in merge performance. Things like Z-ordering, basically data clustering in the data layout, can make a difference and make merge faster, because clustering the key space limits the number of files that need to be rewritten: if you're touching only a small range of the key space, that is, if there is data locality, fewer files need to be rewritten.
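A small sketch of such a merge using the delta-spark Python API; the table paths and the join key are hypothetical, and `spark` is assumed to be an active SparkSession.

```python
from delta.tables import DeltaTable

# Target Delta table and a DataFrame of incoming changes (paths are hypothetical).
target = DeltaTable.forPath(spark, "s3://my-bucket/delta/orders/")
updates = spark.read.format("delta").load("s3://my-bucket/delta/orders_changes/")

# Pass 1 (under the hood): find the files whose rows match the condition.
# Pass 2: rewrite only those files; untouched files are left as-is.
(target.alias("t")
 .merge(updates.alias("s"), "t.order_id = s.order_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```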
A: It does, actually. So I think that's probably all the time we have for today's community office hours, so apologies, but we had some great questions. Saying that, I did want to call out a couple of things. Number one, I was going to share my screen, but it looks like I can't share my screen today, so my bad on that one. I did want to call out that there is a Data + AI Summit CFP, a call for presentations, and the deadline currently is January 2022.
So please go ahead: if you've got a great data session, it does not have to be about Delta Lake or Spark, by the way, just any session on data and AI, we'd love to have you present there, or at least submit to the call for presentations. If you're looking for this video, it's going to be both on LinkedIn, where it is right now, as well as on YouTube; there is a Delta Lake YouTube channel, so just go to delta.io and look at the bottom.
There are links to both the YouTube channel and also the Slack channel. Since we didn't answer all of the questions that popped up, again, apologies for that, please join us on the Delta users Slack. We are there and we will go ahead and answer your questions there as well. So seeing that, I think that's it for today. I wonder, any last words from Christian or TD or Scott? Okay, I'm getting nods, so I think that's good to go.
E: So again, just one thing: keep these questions coming. These are absolutely great questions, and we want to educate the community on the internals of Delta Lake. So please keep asking these questions in all the different venues you heard about from Denny, Slack, etc. Please keep giving us these questions; we're happy to explain and help, and to make everyone in the community aware of the Delta internals and how they help in different scenarios.
A: And especially on LinkedIn, you're asking some awesome questions, so again, apologies for not being able to answer all of them. Please join us on the Delta users Slack; we'd love to answer them, and it'll allow us to be a little more long-form in our responses too, so that we can get into the details as well. Okay, so again, thank you very much.
Everybody, really appreciate your time, and we'll see you in two weeks. We'll still have one more session, and then we'll take the winter break. Okay, so again, thank you very much, everybody. Thank you, Christian, Scott, and TD for attending today's session and providing answers, and thank you to everybody on LinkedIn and YouTube for asking some amazing questions. All right, have a good one, everybody. Thank you. Bye.