Description
We're re-igniting the Spark Online Meetup! In this live meetup, Denny Lee (Engineer and Developer Advocate at Databricks) interviews Delta Lake engineer Burak Yavuz.
Read more here: https://delta.io/
Learn more about Delta Lake Connectors: https://github.com/delta-io/connectors
Join the Delta Community Slack: https://dbricks.co/DeltaSlack
Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here: https://databricks.com/databricks-named-leader-by-gartner
B
Thanks very much, Karen. Hi everybody, my name is Denny Lee. I'm a developer advocate here at Databricks, based out of Seattle, Washington. That's why I'm actually sitting with a weird espresso background while Burak moves chairs and pops on over; he's based out of San Francisco, California. So, without further ado: you are currently watching our interview with Burak on the genesis of Delta Lake. But before we go into it, the heart of this type of online meetup is that we get to interview the people behind the technology.
C
Hi Denny, and hello everyone. My name is Burak, I'm a software engineer here at Databricks, and I'm also a Spark committer. Basically, I work in a team called the stream team at Databricks. Our goal is to make the lives of data engineers much simpler, and our team motto is "we make your dreams come true." That's...
C
I don't know, we were joking about it. We had versions like "we make your streams come true." We were joking around: we called ourselves the stream team, and once we started working on Delta we switched that over to "dream team." Whatever it was, it was just joking around, right. So.
C
So originally I'm a mechanical engineer, but I've been programming since early high school. I was really interested in robots and bionic arms and things like that, so I wanted to get a full view of engineering. In mechanical engineering you build things that move and work, you know, control systems; on the other hand, you have to program them. I really enjoyed that whole area of making something work, making something move, so I studied mechanical engineering. But then I came to the U.S. and did something called management science and engineering, which is kind of like industrial engineering, worked on large-scale optimization problems, and that's how I got into the world of big data.
C
So I came to Stanford University for grad school; that's where I did management science and engineering. I was planning on doing a full Ph.D. program, like six years, to learn everything about optimization and things like that. But that kind of introduced me to the world of big data and machine learning, because every machine learning algorithm in the end uses some optimization algorithm, some optimization routine, to actually get a result.
B
Okay, so before you got introduced to Spark (because, as you know, you're a Spark committer), I'm just curious: what were the types of libraries or machine learning tools you were using? Was it, you know, old-school Java Mallet? Was this pre-Python, or were you using Python pandas? Just curious about your progression through the machine learning cycles before you actually switched over to the data engineering cycles. Yeah.
C
I mean, honestly, we were doing very academic research in that sense. We had a lot of code in MATLAB, a lot of code in C, and some code in Java. We were doing pretty well-known routines like stochastic gradient descent, and it turned out I was using all these tools that people had built on top of C, or even Fortran, right. You know, you have the very optimized matrix-matrix multiplication routines and whatnot.
B
Gotcha, so we're going back to almost old-school Fortran 77 types. Yeah, got it, got it. Okay, cool, fair enough; that's even what I did in the past. Completely okay. So then you finished off your degree at Stanford. But then how did you progress into Databricks, for that matter? How did you even progress into Spark in the first place? Yeah.
C
When I first joined, it was kind of like building the tools, so I started off as an intern. Back then Spark 1.0 had just been released, and we were trying to build all these tools to figure out regressions in Spark. We were adding all kinds of code to Spark, and we just wanted to make sure that we didn't regress in performance. So one of the first things I worked on was spark-perf, which was this library that allowed us to run benchmarks on Spark.
C
Maybe we should come up with streaming DataFrames as well, and so we were wondering how that would be. Because we had Spark Streaming, which had this DStream API; we had Spark Core, which was the RDD API; and then we had Spark SQL, which had the DataFrame APIs. How do we connect all these two or three? Are we going to have a DStream of DataFrames? Are we going to have some other concept? That was what we were thinking about early on.
B
Okay, so let's hold on that. So basically, what you're telling us here is that the progression into streaming DataFrames, which we'll talk about shortly, actually started off from that. Can you, without obviously diving into too many details, provide some context on why this was such a big problem, or the types of problems that you were trying to solve? Yeah.
C
I mean, it was a humongous problem. Early on, we had this very simple batch data pipeline that took one day's worth of data, processed it, and put it into a nice table. We needed to do something else, and that's when Databricks decided that, hey, a new college grad should build our new...
C
Your team, precisely. And the idea was, you know, we were streaming data, which was a completely different API, in Spark Streaming with DStreams. But then we had our batch processing, which was in DataFrames, and managing two completely different code paths started becoming a hassle. We started thinking about how we could unify these APIs and come up with logic where you wouldn't have to change too much code. You just give us the transformations that you want to do. Doing transformations on DataFrames was super declarative, super easy, and it was kind of a SQL-like API that people were used to. So we were like: oh, can we start doing this in a streaming fashion as well?
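The unification Burak is describing can be sketched in plain Python. This is a toy model of the idea only (one declarative transformation, executed either over a whole data set at once or over arriving micro-batches), not the actual Spark API:

```python
# Toy model of batch/streaming unification: one declarative
# transformation definition, two execution modes.

def transform(records):
    """The user's declarative logic: filter, then project."""
    return [{"user": r["user"], "amount": r["amount"] * 2}
            for r in records if r["amount"] > 0]

def run_batch(dataset):
    # Batch mode: apply the transformation to the whole data set at once.
    return transform(dataset)

def run_streaming(micro_batches):
    # "Streaming" mode: apply the SAME transformation to each
    # micro-batch as it arrives, accumulating results.
    out = []
    for batch in micro_batches:
        out.extend(transform(batch))
    return out

data = [{"user": "a", "amount": 3}, {"user": "b", "amount": -1},
        {"user": "c", "amount": 5}]

batch_result = run_batch(data)
stream_result = run_streaming([data[:2], data[2:]])  # two micro-batches
```

For a map/filter-style transformation like this, both modes produce the same result, which is exactly the property the unified API was after: the user writes the logic once and does not care how it is executed.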
B
I got it. So then, from that progression: you had all this data, and you wanted to do a lambda architecture, where basically you're doing batch queries, whether it was machine learning or BI or whatever else, on that data, but you also needed to look at the data live, right, via streaming. And because you're looking at the data live, you want to be able to use that same declarative nature from batch and apply it to streaming. So that was basically part of the reason why the Spark community itself was saying: hey, we see the experience, and we want to go ahead and build streaming on top of the Spark SQL engine, using Spark SQL syntax, excuse me, versus DStreams. I'm presuming that's the progression. How did that communication go with the community? How many other folks were bugging you about the same problem? Yeah.
C
I mean, in the stream team we had many ideas, but basically people like Matei Zaharia and Michael Armbrust put their heads together and asked: how can we get this to work nicely? And then they were like: yeah, we could just unify this within this API. And then we started talking about it within the developer community as well, and people were like...
B
By building the streaming DataFrames... sorry, by building the streaming DataFrames and by building the batch DataFrames, now you have your lambda architecture. What were some of the issues that you and your team were running into in that case? Because it seemed like you had a solution: you had the lambda architecture, which was the popular concept of the era, per se. So what were some of the issues that you ran into?
C
There were so many, I can't begin to count. Basically, we would get alerts every other day on our pipelines: something's failing. And it was just kind of the issues that come with working on large-scale distributed systems. We had many cases where data would arrive late and we would forget to process that data; we didn't know how far back to look when new data came in. So we just said: okay, we expect data to come in within three days, so let's just reprocess our entire data set over the last three days. And as we were doing all this streaming work alongside our batch pipeline, the latest data, the streaming data that we wanted to query, was always very slow. The reason for that was this concept of small files, having a lot of small files. The idea there is that we were initially up in AWS, working with Amazon S3, and these kinds of storage systems, data lakes as we call them, are just key-value blob storage systems. They're great for storing insane amounts of data, but they're not great at telling you what data is there, or which version of the data is there; it's just very hard for them to give you very consistent semantics. So we would hit so many issues around Amazon S3's eventual consistency. We would write out a file, but before writing out the file you would have to check whether the file is there, just so that you don't overwrite it or write garbage data, and that check would prime a negative cache. You would write the file, try to read it back, and then you'd have these issues of: oh, the file doesn't exist. And you're like: well, I just wrote it there, how does it not exist? Those were the kinds of problems that we had to deal with. Listing all those files was super expensive, because S3 just wasn't built for listing things; it's very hard for those kinds of systems to give you a list of what's there. And then just reading all those small files, opening so many HTTP connections, would be super expensive.
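The read-after-write surprise Burak describes can be modeled in a few lines of plain Python. This is a simplified simulation of the behavior he recounts (an existence check priming a "not found" cache), and it assumes nothing about S3's actual internals:

```python
# Toy simulation of the eventual-consistency bug described above:
# a safety check before the write primes a "negative cache" (the
# store remembers "not found"), so a read right after the write
# can still report the freshly written file as missing.

class EventuallyConsistentStore:
    def __init__(self):
        self.objects = {}
        self.negative_cache = set()   # keys recently observed as absent

    def exists(self, key):
        if key not in self.objects:
            self.negative_cache.add(key)  # remember the miss
            return False
        return True

    def put(self, key, data):
        self.objects[key] = data          # the write itself succeeds...

    def get(self, key):
        # ...but a cached miss can shadow the new object for a while.
        if key in self.negative_cache:
            return None                   # stale "does not exist"
        return self.objects.get(key)

    def cache_expires(self, key):
        self.negative_cache.discard(key)

store = EventuallyConsistentStore()
store.exists("part-0001")        # safety check primes the negative cache
store.put("part-0001", b"rows")
stale = store.get("part-0001")   # "I just wrote it there, how does it not exist?"
store.cache_expires("part-0001")
fresh = store.get("part-0001")   # eventually consistent: now visible
```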
B
I got it. So the heart of the matter, at least when you were doing the lambda architecture for log analytics, was that the underlying file system, in this case the cloud storage system itself, was not reliable, right? And I'm just curious: obviously this was your experience, but did you see the same thing happen with lots of Databricks customers? Yeah.
C
I mean, like I said, everyone was building the same thing at the time with these architectures. And it's not that the storage system is unreliable; it's that everyone had to build their own database semantics on top of this storage system. People were just used to working with things like MySQL, or data warehouses, these kinds of storage systems that were very easy to deal with; you didn't have to think about a lot of the problems that you might face. But then suddenly, when you came into this data lake architecture, which all our customers were also in, you had to deal with all kinds of questions: how do I deal with files? Can I delete files? How do I optimize my I/O patterns with these files? Which file format do I save them in, and which file sizes do I want to have?
B
Right. So that transition from on-premise to the cloud, that transition from a single box to distributed systems, the fact that you actually had to deal with a distributed file system: this basically introduced a whole set of issues that not only you were suffering from, in terms of doing the analysis of the data, but that many of the Databricks customers themselves were suffering from as well. Yeah.
C
I mean, we were trying to solve all these problems in different ways. We would get all kinds of support tickets saying: oh, my queries are slow, or my listing is super slow, can I make this faster? We were getting all kinds of support tickets around: oh, two people tried to change the same table at the same time, but now I have this totally inconsistent garbage state of my table. People would be like: oh, I have duplicate records here, why do I have duplicate records? Well, you know, you had partial failures. Those were the kinds of issues. And then, with Spark for example, a lot of things were built with the Hadoop Distributed File System in mind, where the idea was that with an on-prem HDFS you could just write to a temporary location and rename, and renames are super fast, a constant-time operation. Whereas with cloud storage systems, it could either be a very quick rename or it could be a server-side copy of the entire file. As people were moving from on-prem to the cloud, they just had so much trouble dealing with all these kinds of inconsistencies and all these kinds of performance issues. So that really led to the problem... like, we came up with intermediate solutions. Yeah.
B
Actually, with that, I'd love to dive a little bit into some of the intermediate solutions you had to put in place before you actually had the solution. So, just to give a little heads-up to the audience here: we are going to be talking about the Delta Lake transaction log, which ultimately solved some of these problems. But before we talk about that, I'd love to understand, as you hinted at, Burak, a little bit more.
C
Yeah, yeah. So, for example, with Structured Streaming, once we released streaming DataFrames, Structured Streaming, we came up with this file sink implementation, where you could take your data from anywhere (Kafka, Kinesis, Azure Event Hubs, whatever, or files) and then store it in some file storage system. And what this file sink did was actually kind of the initial implementation of Delta's transaction log.
C
It would write out all the files with unique names. These unique names ensured that you wouldn't ever hit an eventual consistency problem; you would never hit a failed task writing out a file and then a second task, a retry, writing out the same file. You would just get new sets of files every time, and once all the files were complete, it would take the set of files that it wrote and store it in a manifest file that said: okay, in this micro-batch I wrote all these files. And when Spark would actually query this table, it would go directly to this manifest; it wouldn't have to list any of the directories, it wouldn't have to list anything. This manifest was kind of the source of truth about which files Spark had to read to actually have a full view of the table. So that was one of our initial solutions for avoiding listing, and for kind of having an atomic operation: if the write fails, then Spark is not going to read those files; Spark is still going to look at the transaction log, or manifest file, to see what's the source of truth. So that was a very early implementation of what was getting us there.
B
Got it. So basically, prior to the transaction log, in essence you have a manifest file, basically just a file which lists all the names. And because, if you have a lot of files, listing the files from S3 became relatively slow, it was actually faster to read that one manifest file, which itself contained the list of, let's just say, 25 files. Even though there may have been 50 files, due to failures on write or for whatever reason, it would only grab the 25 files that you needed, exactly.
C
Exactly. And it wasn't just a single file; it was actually an ordered operation. So it was kind of like: the first batch wrote these files, the second batch wrote these files, the third batch wrote these files. And once we saw this directory, we would have to read, you know, 1, 2, 3: we would list that directory, read all the files within 1, 2, 3, and then answer a query based on the file list generated from those.
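That ordered-manifest scheme can be sketched as a tiny append-only log. Again, this is a toy model rather than the real implementation: replaying the per-batch manifests in order yields the table's full file list.

```python
# Sketch of the ordered-manifest log: batch N's manifest lists the
# files that batch N wrote; a reader replays manifests 1..N in order
# to assemble the complete file list for the table.

log = {}  # batch number -> list of files written by that micro-batch

def commit_batch(batch_id, files):
    log[batch_id] = files

def snapshot():
    """The table's current file list: all manifests, in batch order."""
    files = []
    for batch_id in sorted(log):
        files.extend(log[batch_id])
    return files

commit_batch(1, ["part-a", "part-b"])
commit_batch(2, ["part-c"])
commit_batch(3, ["part-d", "part-e"])

current = snapshot()
```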
B
Gotcha. So this is the precursor to the transaction log. So the idea then is that basically you had a file. I'm just curious: was there any discussion on why, for the sake of argument, that manifest would in fact be a file, versus, say, some other SQL or NoSQL store or something like that? Were there discussions about that, looking at it from the standpoint of a queuing or in-memory system instead? Yeah.
C
No, that's a great question. I mean, with Spark, every time, the biggest question is scalability: how can we build something scalable? And one other thing was, we don't want to depend on external systems; avoid dependencies on external systems, because it just adds more problems onto the users. So we were like: oh, they're trying to write to a storage system, why not have our source of truth along with all the data files within the storage system? We didn't want to have them set up a connection to some other database; we didn't want to have them set up a connection to some other key-value store. We already have permissions; they've set everything up so that the right people can write to that directory or read from that directory. Why not just store all the information that we need there?
B
Got it, cool. So the manifest file basically did solve a bunch of things, especially the file-writes issue. But I'm just curious: you started off with the lambda architecture, so you were really talking about streaming, right? So was it just the manifest itself that resolved the streaming issues, or what else did you have to do in order to be able to resolve these things? Yeah.
C
I mean, the manifest file kind of resolved the issues around distributed failures, partial failures, and file listing with streaming, but it didn't get rid of the problem of having a lot of small files. The manifest is the source of truth, it tells us which files we have to read, but it only worked with streaming writes. So how do you actually read from this table and, like, compact your files at the end of the day? Do you have one table that's just streaming and then you compact into a separate table? Or some customers would just ignore that transaction log and just overwrite everything and blow everything out at the end of the day, even though we told them: hey, please don't do this, you're not getting any guarantees this way. But, you know, some people were okay with that solution.
C
But still, a lot of issues existed, because there was no unification with batch yet, especially with this new streaming file sink. So in the end, we noticed all these problems, people were starting to use Structured Streaming a lot more, and we were like: well, maybe we should start thinking about this again and come up with this v2 concept of streaming, because, you know, people...
B
Okay, well, this is, I guess, the love of doing live sessions, where sometimes we have technical difficulties. I don't know if it's on my end or yours, but nevertheless, okay, let's progress, because we actually only have a few minutes left on the interview portion. We did want to try to time these interviews to be about the average length of time it takes for somebody to commute in San Francisco, which is about 32 minutes. So nevertheless, all right: you've gone ahead and told us a little bit about how that transaction log worked, the streaming sink. So then, what were some of the other issues that you ran into as well, especially with your customers? Because you progressed with the file sink, you progressed with what ultimately turned into a transaction log; what were the other things? For example, I'm presuming one of the problems was, as time changed... oh sorry, as time progressed, excuse me.
C
No, it's fine, your face said it, yeah. So yeah, to repeat your question: what kind of business need came up, along with time, that the file sink manifest did not support? Another big thing that came up was GDPR, you know, all these issues around data protection and data privacy, and the requirements for data subject requests, where people could ask specifically: what is my data?
C
Or: can you update my data, or just delete my data entirely? And people had to build very complex systems and data pipelines, or odd architectures, to kind of solve those issues. But we were like: normally, you should be able to write an UPDATE statement in SQL and be able to update your table; that's what people are generally used to. Or you should be able to write a DELETE statement on your table and delete all the records for a user. That's what our users were used to from on-premise data warehouses or databases. So we saw these new, more complex workloads emerging from all these new requirements around the data world as well, and our transaction log, which only supported streaming writes, was never going to support those use cases. So we had to come up with this new protocol that was actually able to understand what changes were being made to the table.
C
People knew SQL across different roles, and the SQL syntax allowed them to transition a lot more easily into this world without having to know any Spark APIs or DataFrame APIs. Maybe data scientists knew about DataFrames from pandas or R, but not necessarily a data analyst who was using BI tools and writing dashboards. We needed to empower all these people to actually build such things as well, when required.
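A DELETE on a table made of immutable data files is typically implemented copy-on-write style: rewrite only the affected files without the matching records, then atomically swap them into the table's committed file list. The sketch below is a simplified plain-Python illustration of that general idea, not Delta's actual implementation:

```python
# Simplified copy-on-write sketch of a GDPR-style DELETE on immutable
# files: find the files containing the user's records, rewrite them
# without those records, and atomically swap old files for new ones
# in the table's committed file list.

files = {  # file name -> rows, each row is (user, value)
    "part-1": [("alice", 10), ("bob", 20)],
    "part-2": [("carol", 30)],
}
table_files = ["part-1", "part-2"]  # the committed view of the table

def delete_user(user):
    global table_files
    new_list = []
    for name in table_files:
        rows = files[name]
        if any(u == user for u, _ in rows):
            kept = [(u, v) for u, v in rows if u != user]
            rewritten = name + ".rewritten"
            files[rewritten] = kept        # write a new file...
            new_list.append(rewritten)     # ...and reference it instead
        else:
            new_list.append(name)          # untouched files are reused as-is
    table_files = new_list                 # atomic swap of the file list

delete_user("bob")
visible = [row for name in table_files for row in files[name]]
```

Note that the old file is not physically erased by the swap; it simply stops being referenced, which is also what makes looking at older versions possible.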
C
Time travel, yeah. So when we came up with Delta, with all the intermediate solutions behind it, Delta, with its transaction log and its protocol, honestly solved a lot of the support tickets that we would get; it eased a lot of issues. Another issue that came up with Delta was this:
C
We tried to enforce kind of like best practices on users, and here's what would happen. With Hive there's this concept of dynamic partition overwrites. What that does is: you have a data set, you try to write it out in overwrite mode, so it's going to overwrite some amount of data, and it overwrites only the partitions that it writes new data to. It was kind of a lazy way of saying: I have this entire new data set, just overwrite whatever I need to overwrite. So the initial users of Delta, who were used to that kind of mode, started overwriting their entire tables, which meant deleting all their data and actually just overwrite-inserting a very small subset of data. And when they asked, oh, why did this happen, they would create a lot of support tickets: oh, Delta lost all my data. And we're like: so, here's the history log, and here's the operation that you wrote, and we have this operation called replaceWhere, which you can use to actually guarantee that you're overwriting the right data, so that you're not accidentally deleting data you meant to keep. But also, here you go: here's time travel. That will allow you to actually get whatever data was in a previous version, and you can merge all this data back into your current version. In the first week we released time travel, actually, six or seven customers saved all the data that they had accidentally deleted.
B
Wow.
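The failure mode and the guard Burak describes can be contrasted in a small toy model. The `guarded_overwrite` function below is written in the spirit of Delta's replaceWhere option, but it is an illustrative sketch, not the real API:

```python
# Toy contrast between a blind full-table overwrite (the failure mode
# described above: users wipe the whole table to insert a small slice)
# and a guarded overwrite that only replaces the partitions named by
# the predicate, refusing data that falls outside them.

table = {"2019-01-01": [1, 2], "2019-01-02": [3, 4], "2019-01-03": [5]}

def blind_overwrite(new_data):
    """Overwrite mode on the whole table: everything else is deleted."""
    return dict(new_data)

def guarded_overwrite(table, new_data, partitions):
    """Replace only the named partitions; reject data outside them."""
    assert set(new_data) <= set(partitions), "data outside replaced range"
    kept = {k: v for k, v in table.items() if k not in partitions}
    kept.update(new_data)
    return kept

wiped = blind_overwrite({"2019-01-03": [9]})          # oops: two days of data gone
safe = guarded_overwrite(table, {"2019-01-03": [9]},
                         partitions=["2019-01-03"])   # only one day replaced
```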
C
Yeah, it came from the idea that people make mistakes. If there's a way that we can prevent those mistakes, or roll back from them much more easily, then we should provide that feature to users. And from how the transaction log and the concepts of multi-version concurrency control worked, it was really easy for us to actually go back to the state of a table at any given time. So why not just empower users: if they want to query the differences, do that; if you accidentally deleted data, add it back. Just provide all that functionality very easily.
B
Perfect. Well, okay, this has been an awesome interview. I'm glad you spent the time with us here today, Burak, on the genesis of Delta Lake, and also telling us a little bit about yourself. It's a really interesting journey that you went from basically mechanical engineering to machine learning to being a hardcore engineer. I did want to leave a few minutes to ask some questions before we wrap this up.
C
Yeah. I mean, this concept of having these two tables, a streaming workload and a batch workload: what we wanted to do with the Delta architecture was, you can stream into your table, and you can additionally do all kinds of operations on that table. And that came from the biggest power of Delta, which was ACID transactions, which, funnily, we didn't get to in this interview yet. But through these ACID transactions you basically had all the power to append new data, delete existing data, compact existing data, without causing any transaction conflicts and whatnot. So what we wanted to do was propose this new architecture style, called the Delta architecture, where you would incrementally improve the quality of your data. What that meant was: you would have data coming in from this centralized message queue, which has a very short retention period, maybe seven days' worth of data, or two weeks of data max. And, you know, people don't always realize mistakes within that short period of time; you need a longer retention period. So the first step would be to take all that data and store it in cheap storage, leaving it untouched. And from there, you do one more layer of refinement, where you say: okay, take this raw data, let me just parse it out and move it into nicely cleaned tables, where I have my source of truth for all the events that I need. And then just add one more layer, and one more layer: combine all these event sources back again into tables that, you know, a data analyst can query optimally, or very quickly. So that was the idea. These kinds of operations, and this architecture, really took off with a lot of our customers, because they understood the pain points of making unfixable mistakes, and this architecture kind of gave them the flexibility to actually fix those mistakes, right.
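The incremental-refinement pipeline Burak outlines (raw data landed untouched, then parsed into clean tables, then aggregated for analysts) can be sketched end to end in a few lines. The three hops below are a toy model of that layered approach; the bronze/silver/gold labels in the comments are common names for these layers, used here only as an assumption for illustration:

```python
# Toy sketch of the multi-hop "Delta architecture" described above:
# land raw events untouched in cheap storage, refine them into a
# clean parsed table, then aggregate into tables analysts can query.

import json

raw = ['{"user":"a","amt":"3"}', 'not-json', '{"user":"b","amt":"4"}',
       '{"user":"a","amt":"2"}']

# Hop 1 ("bronze"): store raw data as-is, so later hops can be replayed
# if a bug is found downstream.
bronze = list(raw)

# Hop 2 ("silver"): parse and clean; bad records are dropped here
# (a real pipeline might quarantine them instead).
silver = []
for line in bronze:
    try:
        rec = json.loads(line)
        silver.append({"user": rec["user"], "amt": int(rec["amt"])})
    except (ValueError, KeyError):
        pass

# Hop 3 ("gold"): aggregate into a table optimized for analyst queries.
gold = {}
for rec in silver:
    gold[rec["user"]] = gold.get(rec["user"], 0) + rec["amt"]
```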
B
And so, actually, you called out a really good point, which, again, you're right, we should have brought up a little bit earlier, which is the context of transactions. Why was this concept so important that it ultimately led to the creation of that transaction log, or ultimately allowed you to be able to provide the reliability within Delta Lake in the first place? Yeah.
C
I want to make sure that, if I have two concurrent writers doing things to my table, they are consistent with each other. I don't want two people trying to delete the same file, or to delete the records from the same file and update it with new values. Or, you know, if you're doing compaction: compaction means you're going to have a second copy of the data within the same table. You don't want to break anything that was running at the time. A query that was started before you start your compaction process, you needed to give it isolation, so that that query can run for two or three days, if it's kind of a deep learning algorithm, for example, but the next time it runs...
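The isolation described above is commonly implemented with optimistic concurrency over a versioned log: each writer commits against the version it read, conflicting commits are rejected and must retry, and long-running readers keep the snapshot they started with. The following is a simplified plain-Python model of that general technique, not Delta's actual conflict-resolution rules:

```python
# Sketch of optimistic concurrency + snapshot isolation: writers commit
# against the table version they read; a commit is rejected if another
# writer got there first; readers keep using the version they pinned.

class Table:
    def __init__(self, contents):
        self.versions = [list(contents)]

    def latest_version(self):
        return len(self.versions) - 1

    def snapshot(self, version):
        return list(self.versions[version])   # long queries pin a version

    def try_commit(self, read_version, new_contents):
        if read_version != self.latest_version():
            return False                      # conflict: someone committed first
        self.versions.append(list(new_contents))
        return True

t = Table([1, 2, 3])
v = t.latest_version()

pinned = t.snapshot(v)             # a long-running query pins its snapshot

ok1 = t.try_commit(v, [1, 2, 3, 4])  # writer 1 (say, a compaction) wins
ok2 = t.try_commit(v, [0])           # writer 2 also read v: rejected, must retry

still_pinned = t.snapshot(v)       # the old query still sees its snapshot
```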
B
Perfect. Okay, so I actually only want to ask one more question, just because we have a lot of other good questions, but I figured some of them are a little bit more detailed, so they'd probably take longer than just a quick Q&A. But I did want to ask, excuse me, the final question, which is: when do we expect the full Hive integration with Delta?
C
Oh, full Hive integration; that's a great question. So we do have a Hive connector right now that we're hoping people are going to try out and help give us feedback on. Delta is an open source project; it's under a repository called delta-io/delta, and we have a GitHub repository called delta-io/connectors, and there we wish to have connectors for other kinds of analytics engines, where Hive is one of them.
B
Perfect. Okay, well, hey Burak, thanks very much for this great session; I really appreciate your time. I did want to do a couple of things just to do a wrap-up. So, Karen, I don't know if you're going to do a wrap-up, but I at least want to call out some quick things for wrapping up. Number one is that Burak and myself and a few other members of the stream team will be doing a three-part webinar series called Diving Deep into Delta Lake, where we're going to be talking about the transaction log.