From YouTube: Delta Lake Community Office Hours (2022-03-03)
Description
Join us for the next Delta Lake Community Office Hours and ask us your #DeltaLake questions. The Delta Lake community AMAs occur bi-weekly on Thursdays at 9AM PST. These sessions allow our community to ask questions about Delta Lake OSS and learn about what we are building, what we plan to build, and recently released features.
B
Okay, awesome! So if we are live, hi everybody.
B
All right, one second. Okay, if you're live, hi everybody! We are live on LinkedIn as well as YouTube. I hope you have the right links; if not, I'm going to paste them again shortly in the LinkedIn channel. If you can say hi and tell us where you're from, I would love to see what global presence we are getting today. So, let's see.
A
All right, Vinnie, we should be live now on both LinkedIn and YouTube. So hopefully we've got a little bit of traction here. Whenever we're ready, we can start the show, or we can continue talking about Quebec politics, whichever you guys prefer.
B
Let's see, there are some chats coming in, which I love. We have people from Arizona, Toronto, South Africa, wow, Amsterdam, Canada, Brazil. Interesting, nice. I'm joining from San Francisco, live from Denny's office. So why don't we introduce our panel? I will start with Florian. Florian, please introduce yourself to the audience.
C
We have a lot of data pipelines running using Delta Lake technology, mostly the open source one, and I'm also a contributor to delta-rs, the Rust-native integration of Delta Lake. I'm also doing a bit of exploration with BigQuery, to find a good way to have an integration between GCP BigQuery and Delta Lake.
B
Awesome, awesome. Thank you, Florian, amazing to hear what you are working on. And Ryan, next to you.
E
Yeah, hello everyone, I'm Ryan, a software engineer from Databricks. I've mostly been working on Delta Lake for the past several years, and previously I also worked on Structured Streaming, the streaming engine of Apache Spark.
B
Awesome, thanks Ryan. Denny, next.
A
Hey, thanks very much. Hi everybody, my name is Denny Lee. I'm a developer advocate here at Databricks, a long-time Spark and Delta Lake guy. Before that, you can either blame me or thank me, I don't know which one you want to do: I was part of the SQL Server team at Microsoft, and also on the team that created HDInsight, so for that one you probably want to blame me. My apologies. Back to you, Vinnie.
B
That's a great history, isn't it? Don't we have an exciting panel here? And hi, myself: I'm a developer advocate working in the Delta Lake open source community. So that's about us. Let's get started with some questions. A few topics you can ask questions about: anything related to Delta Lake. For example, we have recently released amazing features like performance improvements in MERGE, and we have new features coming in, like Z-ordering and data skipping.
B
We have released, and continue to release, new connectors like Trino, Presto, and Flink, so ask any questions about those. While we wait for some good questions to come in, why don't we quickly get started with some updates on Rust? Florian, why don't you provide some updates on what the Delta Lake community should look forward to? Any insights would be appreciated.
C
Yeah, definitely. We have a lot of improvements in the delta-rs bindings. Mostly, we integrated the new Azure bindings, meaning that you will use a new crate for it. There has been a lot of improvement in integrating the Azure cloud provider with delta-rs, and many more things are coming as soon as we have requests, so feel free to reach out to us. It will be a nice way to improve delta-rs or to bring more features into it.
B
Awesome, that's a great update, Florian. For those who don't know, can you provide some input on what delta-rs does, and a little bit of what they can expect from the connectors, like a specific use case?
C
Yeah, actually I can use Back Market as an example. We use delta-rs to read Delta tables without having a cluster running, meaning that we can fetch a lot of metadata using delta-rs. You can also read Delta tables, list the files of a partition, and read the underlying Parquet files using this library. We also provide a writer to write Delta tables, or to update Delta Lake tables, without requiring Spark to be installed. So mostly the Rust binding is there, and we also have Python and Ruby bindings.
B
That's awesome. So you said a cluster up and running is not required when you are fetching the metadata, which means saving on the cost of running compute. That's awesome.
C
Yeah, exactly. You can read, you can work with a small data set if you want to do some exploration. I'm thinking about the data scientist use case, when you just need to look at or work with a small data set, without launching everything on the Databricks side, or on the open source side with a cluster running Spark. So mostly, it's really good when you just want to explore a small portion of the Delta table.
B
Yeah, I'm pretty sure, Florian, I missed a lot of details there; happy to have a detailed conversation later, that would be awesome for people to know. Awesome. So we have some other questions in the chat. Thomas is asking: anything on the horizon to help with the surrogate key problem?
A
Sure, I can probably tackle that. I do want to call out that there's probably a little bit of bias in my answer, because we actually had, and I'll post it, a Delta tech talk that explains how to work with surrogate keys within Delta Lake. In addition to that, in case you're into Stargate SG-1, there are plenty of Stargate SG-1 references in that particular session. But geekdom aside:
A
The one thing that I did want to call out is that, as part of the roadmap, Delta Lake will actually be including an identity column. So that is one of the things working its way down the pipe, and the combination of what we are planning to work on in terms of the community to address some of these issues will, I think, be helpful for the longer term. Now, for the shorter term:
A
Honestly, and I think it was Thomas that went ahead and asked the question: the reality is, yes, the problem is cumbersome, and it really does involve you figuring out which type of hash, or which kind of surrogate key generation, you want to work with. So honestly, I'm just going to give you the answer that we'll post the link in there. Now, saying this, I'm not trying to avoid providing more details; it's just that it depends on how you want to solve the problem.
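As a sketch of the hash-based approach mentioned above: derive a deterministic key from the business-key columns, so the same row always maps to the same surrogate key. The column values here are made up for illustration:

```python
# Deterministic surrogate key from business-key columns via hashing.
import hashlib

def surrogate_key(*business_keys) -> str:
    # Join with a separator that cannot appear in the keys themselves,
    # so ("ab", "c") and ("a", "bc") do not collide.
    raw = "\x1f".join(str(k) for k in business_keys)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

k1 = surrogate_key("customer-42", "2022-03-03")
k2 = surrogate_key("customer-42", "2022-03-03")  # same row, same key
k3 = surrogate_key("customer-7", "2022-03-03")   # different row, different key
```

A hash key like this is stable across reruns, which matters for reproducible MERGE logic; the identity-column feature on the roadmap would instead have the table generate monotonically increasing values for you.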
A
There are different answers, so please join us on the Delta Users Slack. If you join us there, we can dive into those scenarios a little bit better. So hopefully that gives you some context. Oh, thank you, Vinnie, for dropping in that particular video; that's quite helpful.
B
Yeah, thank you, Denny, that was a good answer. I think the next question is about what differentiates Delta Lake from competitor products like Azure ADLS and Snowflake. There are many answers to that, but Ryan, do you want to take a pick at answering that question?
E
I think one major difference is that, for example, Delta Lake is an open source format. We basically documented how to read and write a Delta table, on a website, so this is entirely open, and we are also building a lot of open source connectors. Presto and Flink, for example, can read it directly using the Delta format, so you don't need to worry about lock-in. Compare that with something like Snowflake:
E
If your data is in Snowflake and you decide to use a different product, you probably need to move all your data out of Snowflake. But with Delta Lake, whatever platform you are using, you can easily switch the engine without changing the underlying storage. Yeah, hopefully that answers your question.
C
I would just highlight that PROTOCOL.md is open source, in the Delta repository itself, and it is a great help if you want to add anything related to Delta. That was the case with delta-rs: it was the main source of information if you want to build or change a connector. So it's really a good place for the open source community to see what the requirements, the guidelines, and the information are, if you plan to integrate something with it.
B
Yep, great. There's a question from Sean Collins which is only part of a question, so Sean, if you can please ask the entire question in one chat message, I would be able to help you out. Okay, moving on to the next question: Jether is asking, is Hadoop installed on a Databricks cluster by default?
A
Let me try to dive in here, because in general, let's try to limit our questions to Delta Lake specific questions, okay? But I will answer this particular question. No, Hadoop is not installed on a Databricks cluster, so that's the easy answer; that's why I didn't mind answering it. There's always a small file problem, though, and it doesn't matter whether you've got Hadoop or ADLS or GCS or AWS.
A
The small file problem is rooted in the fact that the listing of files on cloud object stores is excessively slow. Don't forget, when you're dealing with cloud object stores, what you're doing is web-based REST API calls, as opposed to bulk batch calls, which inherently will be slower. So if you've got a lot of small files, you've got a lot of transactions, a lot of pings, a lot of back and forth communication, which will significantly slow the performance down. And Delta Lake does in fact solve this problem.
A
The way it solves it is that, instead of making you list the files directly, the actual file list is inside the transaction log. So whenever Spark, or delta-rs, or the upcoming Flink connector, or the current Presto connector, or any of these systems wants to access Delta Lake, it reads the transaction log. By reading the transaction log, it gets the full list of files, so you don't actually have to use that really cost-prohibitive listing of your cloud object store to get the list of files.
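To make that concrete, here is a minimal sketch of the idea: the `_delta_log` directory holds JSON commits whose `add` and `remove` actions, replayed in order, yield the table's current file list without ever listing the data directory. These log contents are handmade and greatly simplified relative to the real protocol:

```python
# Replay simplified Delta transaction-log commits to get the active
# file list, instead of listing the storage path itself.
import json

def active_files(commits):
    """commits: iterable of JSON-lines commit strings, oldest first."""
    files = set()
    for commit in commits:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)

log = [
    # commit 0: two files added
    '{"add": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0001.parquet"}}',
    # commit 1: one file rewritten
    '{"remove": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0002.parquet"}}',
]
result = active_files(log)
```

The real protocol adds checkpoints, statistics, and further action types; PROTOCOL.md, mentioned earlier, spells those out.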
A
One other notion is that you do want to go ahead and optimize your file system every so often, i.e. compaction, because if you've got lots of little small files, it doesn't matter whether you're listing them from a transaction log or from HDFS or cloud object storage; that's still too many files. And so what file compaction does, and Delta Lake already has the ability to do file compaction, is basically take a lot of small files and make them into bigger files.
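The gist of compaction can be sketched as simple bin-packing: group many small files into batches of roughly a target size, and rewrite each batch as one larger file. The file sizes and the target are made-up numbers; real compaction also handles the transaction log and concurrency:

```python
# Bin-pack small files (name, size-in-MB pairs) into compaction
# batches that each rewrite to roughly one target-sized file.
TARGET_MB = 128

def plan_compaction(files, target=TARGET_MB):
    batches, current, current_size = [], [], 0
    for name, size in sorted(files, key=lambda f: f[1]):
        if current and current_size + size > target:
            batches.append(current)   # close the batch, start a new one
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

small_files = [("a", 10), ("b", 70), ("c", 60), ("d", 5), ("e", 100)]
plan = plan_compaction(small_files)  # e.g. [["d", "a", "c"], ["b"], ["e"]]
```

In Delta Lake the rewrite is transactional: the removed small files and the added large file are both recorded as actions in the log, so readers never see a half-compacted table.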
A
So that way, it will actually significantly speed things up. Now, this naturally leads to a question, which is: hey, I heard there's this thing called OPTIMIZE, and I would really love to have OPTIMIZE included in Delta OSS, and oh yeah, also OPTIMIZE Z-ORDER, I would really love to have that too. The good news is we heard everybody's question on that: we had a survey last year, and we combined the survey with the feedback from GitHub.
A
It was the overwhelming winner, to put it rather lightly, so we actually had to put OPTIMIZE in. As part of the current Delta Lake 2022 H1 roadmap (I'm trying to be precise, though it's hard to say exactly), which is targeted for June of this year, we are targeting both OPTIMIZE and OPTIMIZE Z-ORDER. The current timelines are approximately Q1 for OPTIMIZE and Q2 for OPTIMIZE Z-ORDER.
B
Thank you, Denny, that was a wonderful answer. I'll just add a little bit more information there. I think there might be an underlying question as well that you were trying to ask, which is about setting up a data lake. So I put a link for you on getting started with Delta Lake, as well as on how you can manage the small file problem. There are different ways:
B
You
don't
have
to
just
use
hadoop,
it's
a
delta
x
standalone,
so
you
can
basically
have
you
know
any
configuration
underneath
for
your
storage
systems
and
then
the
link
works
on
top
of
that
awesome
and
then
there
was
amazing
roadmap
which
recently
got
launched
and
we
are
happy
to
have
more
discussions.
I
put
a
link
in
the
chat,
so
please
get
involved
in
the
discussions
awesome.
So
moving
on
to
the
next
question,
all
right,
so
there's
a
question
about.
B
I'm thinking how, if you do a SQL Server DACPAC deployment, it largely handles schema changes for you. So it's a question around how we manage schema outside of manual scripting.
B
Do you want me to repeat the question?
A
Okay, cool, cool, cool. All right, the only reason I figured I could tackle this is because, since I was formerly on the SQL Server team, I can claim I know a little bit about SQL Server. So Thomas's question is very prototypical, and that's a good thing, by the way: a prototypical, standard DW/BI type question about how you handle metadata and schema management. Just like in SQL Server, and for that matter other relational databases:
A
The context, at least, is that you're going to write DDL scripts, and you probably save them in GitHub or some version control system, and that's how you manage that. And so there are two things when it comes to Delta Lake. First of all, Delta Lake actually has both schema enforcement and schema evolution.
A
So whatever DDL you are writing, it will basically make the changes to the schema of your Delta Lake table, assuming you want it to. In other words, you have to enable evolution to allow the schema to evolve; by default, it will enforce the current schema. So if you try to put, for the sake of argument, five columns into a four-column table, it will actually prevent that from happening, because you don't want to risk accidentally corrupting your data. Now, to Thomas's question specifically about how to manage that:
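A minimal sketch of that enforcement-versus-evolution behavior (a toy schema check, not the actual Delta Lake implementation, and the column names are invented):

```python
# Toy model of schema enforcement vs. schema evolution on write.
def write(table_schema, incoming_columns, merge_schema=False):
    extra = [c for c in incoming_columns if c not in table_schema]
    if extra and not merge_schema:
        # Enforcement: reject writes that don't match the table schema.
        raise ValueError(f"schema mismatch, unexpected columns: {extra}")
    # Evolution: fold the new columns into the table schema.
    return table_schema + extra

schema = ["id", "name", "city", "signup_date"]

# A fifth column is rejected by default...
try:
    write(schema, ["id", "name", "city", "signup_date", "score"])
    rejected = False
except ValueError:
    rejected = True

# ...but accepted once evolution is explicitly enabled.
evolved = write(schema, ["id", "name", "city", "signup_date", "score"],
                merge_schema=True)
```

In Spark, the corresponding switch is the `mergeSchema` write option: leave it off and mismatched writes fail, turn it on and the table schema evolves with the data.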
A
Typically, you would just keep the DDL scripts that you're using to evolve the schema, for the sake of argument, inside GitHub. Now you're saying: well then, how do I match those two? What's really cool is that there's actually custom metadata that you can put within your Delta Lake transaction log. So a common practice is, as part of the custom metadata, to link directly to the GitHub repo that contains the DDL. So the Delta transaction log not only contains the fact that the schema changed:
A
It also contains a link to the schema itself, so you can just review the transaction log and go get that information. At least, I've seen this happen more recently with various customers doing this. So hopefully that helps answer your question.
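A sketch of that practice, using a handmade, heavily simplified commitInfo entry (the real log entry carries many more fields, and the repo URL is made up):

```python
# Toy commitInfo entry carrying custom user metadata that points at
# the DDL script which produced this schema change.
import json

commit_info = {
    "commitInfo": {
        "timestamp": 1646300000000,
        "operation": "ADD COLUMNS",
        "userMetadata": "https://github.com/example/ddl-scripts/commit/abc123",
    }
}
entry = json.dumps(commit_info)

# Later, reviewing the log tells you both *that* the schema changed
# and *where* the corresponding DDL lives.
meta = json.loads(entry)["commitInfo"]["userMetadata"]
```

So the transaction log becomes the join point between the table's history and the version-controlled scripts that produced it.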
B
Awesome, all right, thank you, Denny. Moving on to the next question. So, Ivana, this question is for you: you are doing a lot of work around streaming. Can you maybe give us a few highlights on some awesome features you have come across, and why you are interested in working on those features?
D
Well, Delta doesn't have only data engineers but also data scientists, and I do a lot of projects where data engineering is involved, so I want to keep my hands dirty with the data engineering stuff as well. So yeah, I do a lot of streaming, and for that, Delta is really the savior.
D
One of the latest features is the change data feed. I find it quite useful, because then I don't have to read all the data. Say I have a stream over the typical bronze, silver, and gold layers in my data lake, and I have to read only the changes from the silver layer, get them into gold, and do some kind of aggregation there.
D
I
don't
want
to
get
all
the
data,
because
that
can
be
a
lot
of
a
lot
of
data,
and
I
would
like
to
only
read
that
the
rose
or
the
the
data
that
has
recently
been
updated
and
with
this
change
data
fit
feature
I
can.
I
can
easily
I
can
easily
get
only
those
rows
and
then
merge
them.
Merge
them
into
my
into
my
gold
layer.
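A tiny sketch of that pattern: consume only the changed rows and merge them into the downstream table by key, instead of re-reading the full upstream table. The change records here are handmade and simplified; `_change_type` is the kind of marker column a change feed exposes:

```python
# Merge a small feed of changed rows into a gold table keyed by id,
# rather than rescanning the whole silver table.
gold = {1: {"id": 1, "total": 10}, 2: {"id": 2, "total": 20}}

# Changed rows since the last processed version (toy change feed).
changes = [
    {"id": 2, "total": 25, "_change_type": "update_postimage"},
    {"id": 3, "total": 5, "_change_type": "insert"},
]

for row in changes:
    data = {k: v for k, v in row.items() if k != "_change_type"}
    gold[row["id"]] = data  # upsert: insert new keys, update existing
```

The win is proportionality: the work scales with the number of changed rows per batch, not with the size of the silver table.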
D
That's one of them. And for the data science part, Delta really helps a lot on that front as well: keeping all the data sets and all the features there, making it easier to build a feature store, and so on.
B
It's present on both fronts. So I guess you touched upon all the layers, bronze, silver, and gold, and as a data scientist you understand how that architecture really helps in setting up your pipeline well, and really helps the end use case: once the data engineering is set up properly, then as a data scientist you can go ahead and make use of the data in a curated manner, right?
D
As
a
data
scientist,
it's
I
think
it's
the
dream
to
have
the
the
data
that
you
have
as
an
input
as
as
with
good
with
good
quality
as
possible
and
delta
lakes.
Actually,
delta
lake
helps
us
with
that.
So
we
can
just
read
the
data
that
the
data
engineers
have
prepared
and
yeah
directly
have
have
the
clean
and
yeah
high
quality
data.
B
That's awesome, Ivana, that's a wonderful insight coming from a data scientist on what data engineers do with Delta Lake. Awesome. Anybody else would like to add anything there for data science?
A
Not to take up the show, but the one thing I definitely want to add is exactly Ivana's point: it's garbage in, garbage out, right? Depending on which machine learning algorithm you're using, literally the sort order of the data coming in could change the hyperparameters and change the meaning of everything, right? So data quality really matters there.
B
So we touched on data quality, reproducibility, and having a clean slate for data science. There is one more question related to this, which Dennis is asking: would you recommend using dedicated buckets for the bronze, silver, and gold layers, for any reason? There are so many answers I could provide, but Ivana, I would love to have you answer this question.
D
Yes, I would put them in different buckets. I think there can be many different opinions on this, but I prefer that they're separated, and sometimes that's also better in terms of security and governance.
B
You have to have some kind of governance. So when the data is coming in, it's always a good practice to curate that data: in the first bronze layer, you can just keep the raw data as is, but then apply transformations, separate out, you know, PII data from regular-use data, and then use specific buckets for the use case that you are trying to solve.
B
For
example,
if
you
have
like
10
different
teams
working
on
different
use
cases,
it's
always
a
good
idea
to
have
separate
buckets
in
the
goal
layer
where
you
can
curate
the
data
add,
do
the
you
know,
merge
of
tables
and
all
that
in,
like
a
specific
use
case
manner,
so
hope
that
helps.
There
are
a
lot
of
different
use
cases
which,
which
you
can
think
of
when
creating
your
delta
architecture
awesome.
So
I
think
that's
about
that's
a
wrap.
We
are
coming
close
to
our
office
hours.
B
I'm
gonna
pay
some
chat
links
in
the
chat
so
that
you
can
keep
continuing
this
discussion
through
various
channels.
We
have
either
through
slack,
google
group
or
github.
You
know
I
love
seeing
how
much
engaged
the
audience
is
on
the
chat.
So
thank
you
for
that.
Any
closing
words
from
the
panel.
A
My
only
call
out
is
join
us
for
the
delta
users
slack.
If
you've
got
any
questions,
I
think
that's
been
posted
in
linkedin
and
youtube
because
I'm
sure
you
all
have
other
great
questions
we'll
be
here
in
another
two
to
three
weeks
to
answer
your
questions
again
with
a
new
panel
so
but
between
now
and
then
please
join
us
in
delta
user
slack.