Description
In this session, Yeshwanth Vijaykumar, Senior Engineering Manager and Architect at Adobe, and our host Denny Lee discuss how the data lakehouse architecture at Adobe Experience Platform combines with the Real-time Customer Profile architecture to increase Apache Spark batch workload throughput and reduce costs while maintaining functionality with Delta Lake.
Quick Links
Read Our Newest Blog Post: https://delta.io/blog
Yeshwanth Vijaykumar: https://www.linkedin.com/in/yeshwanth-vijayakumar-75599431/
Denny Lee: https://www.linkedin.com/in/dennyglee/
Join us on Slack: https://go.delta.io/slack
Join the Google Group: https://groups.google.com/forum/#!forum/delta-users
A: So we're currently live streaming now. Just give us a couple of minutes. If you're wondering what this session is, and what this random set of people is doing here: you're actually currently logged into Massive Data Processing in Adobe Experience Platform Using Delta Lake. So, like I said, give us a couple of minutes to get ourselves all set up and established. We're live casting to both LinkedIn and YouTube, so that's being set up as we speak.

A: Meanwhile, why don't you tell us where you're based out of, where you're from? My name is Denny and I'm based out of the Evergreen City, i.e. Seattle. I'm super excited that Geno Smith got re-signed, so yes for Seahawks football; I had to toss that in a little bit. Yesh, do you want to go ahead and kick us off here?
A: Actually, how cold is it right now in San Jose?

A: Why don't you tell us where you're based out of? You hear Yesh is from San Jose, myself from Seattle. Why don't you chime in while we finalize logging into LinkedIn and YouTube; it looks like it's being set up right now.
A: All right, I think we're good to go. Excellent. All right then, let's start the show. Oh, we've got somebody from Spain: Pedro from Spain, that's awesome. We have Goro from Brussels; oh, I love Brussels, by the way, a huge fan of that town. And hey, Pedro, you said you're from Spain, but whereabouts? Just curious. Oh, Valencia, nice. Okay, that's pretty sweet! Okay, let's start off the show, so, just to provide context.
A: We are currently live casting on the Linux Foundation Zoom; we're also on the Delta Lake LinkedIn and the Delta Lake YouTube. So today's session is Massive Data Processing with the Adobe Experience Platform Using Delta Lake. Saying that, I want to have Yesh introduce himself and provide a little background. So tell the audience who you are, and actually, how did you even get into the data engineering space? You can even go back to college if you want to.
B: Sure. So I'm Yeshwanth Vijaykumar, currently an engineering manager at Adobe. I'll start all the way back at college and so on. I started off as a machine learning engineer, doing churn prediction and such at Ericsson, then was building machine learning models for ads at Yelp, and then I moved to Adobe. Then I became all things distributed systems: distributed databases, distributed computation frameworks. So I kind of became a generalist, having started off from the machine learning side, because how do we get data into the models, right?
B: For us to build the models, a lot of the time, most of the time, is probably spent trying to clean up the data or trying to get the data into the right shape. Building the model is much easier; the time you spend on the modeling pipeline is probably much less than what you spend on the data side. So that's how I got into the data space.
A: Yeah, let's talk a little bit about that. I actually want to go backwards a little bit, because I don't know this about you. So, for starters: I completely relate on the ads-model side, because I was actually doing a very similar thing. This is going to age me now, but I actually helped with some of the display ads at Bing, initially, when we were writing Perl scripts to go ahead and process data.

A: So exactly to your point: we were building models, we were using God knows how many different programs at the time; it's even worse than what you think it was. We would use Perl to actually try to do our data pipelines. And exactly to your point: you start off trying to do machine learning models, and then you end up spending the vast majority of your time on data cleansing and data pipelines, just making sure the data is even at a point where you can actually generate the actual model. So I'm just curious: when you joined Yelp, were you actively fighting that process?
B: No, actually, Yelp had a very good infrastructure setup; I think a lot of smart folks had set up a lot of things. So, I hated Java.

B: And I still love Scala, by the way, before any pitchforks come out. Yelp had this amazing thing. At Ericsson, we'd had a problem where we had to set up our own clusters; we literally set up a mini data center within the office to try to simulate some stuff. Yelp, at the other end, was full AWS, so there was a lot of EMR going on, but you still had to write the goddamn job.
B: You got to write Python, right? All the stuff was in Python, so that was amazing. You were still doing all the Hadoop-y stuff, but it was in Python. Of course, overhead-wise it's going to cost you, but when you have so many thousands of cores to throw at it, who cares, in a way? Of course, the AWS bill will care, but it was good from that point onwards.
B: But then, as you go out, it's not as structured; data teams were not as invested in, I would say, even eight years back. Now you see a much larger focus on that area. But getting to the question: at least I didn't have to do too much; a lot of these things were set up for me. But then I got out of Yelp, and that's when I noticed: oh, okay, this is not something that's common everywhere.
B: So that's what got me into it, so to say: okay, I need to actually understand what's happening at the infra level, at the storage level. Somebody's not going to solve this for me; I need to go and understand exactly how the data is stored, what the data is, and how I can get it in the most efficient way. So that's what you could say was my inception into the whole data engineering space.
B: I would say not the first, but it's a completely organically built platform focused on getting the user data, the marketing data, first-party data and second-party data, from the various customers that are there. For us, for example, Target could be a customer, or Nike could be a customer, and then they bring their data into our platform.

B: And then there are even more critical experiences, like the emails that you get, say, from your airline saying, hey, I'm targeting you for some points, or discounts, etc. A good amount of these daily interactions flow through the Adobe systems, particularly the Adobe Experience Platform. It's all the way from data collection and assimilation to, you know, weaponizing your first-party data into customer experiences. That's what the Adobe Experience Platform is about.
A: So basically it's all about personalization and recommendation: being able to take that information and combine it with other pieces of information, so that whichever customer you've got can personalize specifically for their users. Gotcha, okay. Exactly. Well then, why don't we start with what it was like when you first started with the Adobe Experience Platform, because it didn't start with Delta Lake. I mean, we got there, which is great; obviously, I want to talk about that.
B: Quite a few challenges. When you join an already mature product or team, it's much different: a lot of the groundwork has been put in for you. In the case of AEP, this was being built from the ground up, so a lot of the best practices, even infrastructural best practices, had to be laid down from the beginning.
B: With respect to the challenges that I personally faced: even getting a proper Spark cluster was hard. We had our own homegrown thing, which was built on top of... I don't know if everyone's on the Kubernetes train right now, but I don't know how many people remember Marathon, DC/OS, and Mesos.
B: If you've tried working with that... I self-hosted a DC/OS cluster for one of the teams I worked with, and guess what, it's not good for your hairline; you can see the remnants of that. So trying to even get a simple Spark cluster set up was a problem. And then, we are primarily on Azure, so that was another thing: ADLS Gen 1, the Data Lake Gen 1, wasn't as scalable as what Gen 2 is right now.
B: So we had to go discover the issues that were there. But in terms of the actual data storage problem: my team, the Unified Profile team, is responsible for ingesting this fire hose of data from everybody. Every single click, app link, whatever you do, is getting fed into the system in real time. We were initially trying to model it with HBase, and that meant HBase on Azure.
B: At least, whatever was touted to be a managed service... the HDInsight version was not exactly something that you could go to war with. So we were trying to build that on top of Azure ourselves, and that didn't work out. Then we moved to a managed NoSQL store.
B: Okay, yeah. So tears and blood were spent on those things, but all those were learning lessons: how far can we push something, what works, what does not work. Because for every lesson where we found something was not working, we also found some good things that we should probably carry forward. But that was basically the state of it when we were trying to build and explore.
B: At the same time, there were a lot of issues on that front. Storage-wise, we then moved to our own managed NoSQL store, and that worked out great; in terms of scaling out and all of that, it worked out pretty well. But managed NoSQL stores, on any cloud, are going to be very, very expensive; that's where they bleed you the most. And then there's the way you have to model the data.

B: That's another thing. In the case of our product offering, we are in a very weird space, where we also do real-time transactional operations: you have hundreds of thousands of requests per second, either ingesting data or querying data, so point lookups. But at the same time, the majority of our workloads are the Spark-based scan workloads.

B: So you could maintain two different ways of doing it, but there was no Delta back then, and whatever we would have had to build analytics on top of would not have worked out for us in terms of latencies, with respect to our data packing. So we ended up consolidating, at least in the build phase: let's have one store, let's try, to the best of our abilities, to build a Spark connector on top of that same store, and we'll also do the transactional loads on top of it. Now, that obviously has limitations, but I'll stop at that point. That's basically how it was when we were building out this entire thing, and the challenges we faced.
A: Got it. So, in a nutshell, from this perspective, you were about to shift into a solution in which a sizable chunk is your traditional batch-processing type, where a standard Spark cluster could actually query and get you: okay, this number of people were recommended X, or saw these ads, or these events occurred, yada yada yada. Standard stuff, basically that approach, okay.
B: Yeah, in terms of the functionality: you can classify it as a scan-based system, an analytical store with a lot of use cases catering to that side, versus a single-transaction, point-lookup kind of scenario.
A: Right, okay. So then, before we talk about the actual solution, how did you go about reasoning what the solution to the problem would be? I think a lot of people here would like to understand how you got to that point and what the gotchas were, because this is actually, unfortunately, quite common.

A: This idea that you have this scanning query access pattern, à la Spark, and then there's a hot-store type of access where you actually have to run these point-in-time queries. So how did you go about reasoning that? Were you prioritizing one over the other? Did you basically have to treat both as the same priority? I'm just curious.
B: We were kind of putting it in that CDP ecosystem; we were kind of there before a lot of the remarketing of that category happened, so there was a lot of, I would say, push for us to get there quickly too. We couldn't wait for the perfect solution; we needed to make things work. Our priority was: we know we want this, and we know we want this, and we cannot trade one off and say, oh, we're not going to give this part for the next one.
B: So we were willing to take on that architectural debt: let's make this work, and as long as we are spending less money than what we're bringing in, we're kind of okay with it. Let's evolve the architecture, and let's make sure we have enough escape hatches built in so that we can get out of this architecture as the tech improves. And that kind of paid off.
A: Perfect, okay, this is excellent. I love your flow. So basically you had to solve both the hot store and the batch store at the exact same time, knowing that, as long as whatever you do ends up lowering the cost, you're basically not losing money running the system. It's a migration path; it's not a flip-the-switch, it's very much a pathway, where you're not even sure what the end goal looks like yet.
A: You just know that you're going to iterate through this. This is excellent; I love that piece of advice, because a lot of people are often looking for "I know exactly what the solution is," and my usual response is: I can't tell you what the technology is able to do six months from now, let alone a year from now. That's the wrong way of looking at the problem.
B: I would love to say yes; it would make me look like a genius, but no, it was the entire team, not just me; there are a lot of smart folks on the team. We all came to the same conclusion: look, we don't know what might be there in five years. When we were designing this, there was no Delta, there was no Delta Lake, so we didn't know something like that would come along. In terms of our plan B and plan C, it was: okay, how can we make our representation of the data more lightweight?

B: Can we optimize on that? We knew we could make some headway there. But then you have literally groundbreaking stuff coming out that changes the equation completely, and at that time it makes you look really good; you're like, oh, that's great.
A: Two years from now, and you would magically be able to make use of it? No. This actually is a really good lesson for a lot of people, because this is what I constantly try to remind folks: you design things so you can plan for a future that's different, as opposed to designing for things that you think are going to happen, because more times than not, whatever you think will happen isn't going to happen anyway. True. Okay! So let's talk about those lessons learned. So now you've joined.
A: You have this awesome scenario where you have to actually do two competing things at once. Because if we go into the über details, this will be a five-hour-long conversation, and we only have about 30 more minutes. So why don't we break down some major milestones of how you got there?
B: Right. So, actually, this ties back into the previous point. One of the escape hatches we were talking about: very early on, right from the V1 of our system, we took an event-sourcing design, which was meant to feed into other parts of the platform, because we are kind of the hub at the center of the Experience Platform, powering a lot of the applications, depending on what app it is.
B: So what we thought was: okay, let's be a bit event-driven, so that we can at least lessen a bit of the load on our system, and that decision the team took probably paid off the most in terms of the migration. What I mean by event sourcing here is: for any database, any NoSQL DB that we provisioned, we basically built our own CDC, or change data capture. So every time a mutation happens to this primary store, this single store that we have...
B: ...it would emit a notification of the change to our centralized Kafka topic. So basically, even if you have 1,000 customers or 1,000 DBs, all the notifications are fanned in to a single fire hose. Anybody listening downstream of us knows when something has changed and what has changed, specifically on a per-row-level basis. Think of this as a row-level change feed on any of the databases that we have, like logical replication on the Postgres side.
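The event-sourcing design described above, where every provisioned database emits row-level change notifications that fan in to one centralized feed, can be sketched in plain Python. This is a minimal illustration, not Adobe's implementation; all class and field names here are invented:

```python
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    # One row-level change notification, akin to a CDC record
    db: str        # which provisioned database emitted the change
    key: str       # primary key of the mutated row
    op: str        # "insert" | "update" | "delete"
    payload: dict  # new row contents (empty for deletes)

class ChangeFeed:
    """Single fan-in 'fire hose': every DB publishes here, and
    downstream consumers read one ordered stream, Kafka-style."""
    def __init__(self) -> None:
        self.events: list[ChangeEvent] = []

    def publish(self, event: ChangeEvent) -> None:
        self.events.append(event)

    def consume_since(self, offset: int) -> list[ChangeEvent]:
        # Each downstream system tracks its own read offset
        return self.events[offset:]

feed = ChangeFeed()
feed.publish(ChangeEvent("db-customer-a", "user-1", "insert", {"email": "a@b.c"}))
feed.publish(ChangeEvent("db-customer-b", "user-9", "update", {"tier": "gold"}))

# A consumer starting from offset 0 sees changes from every database
assert len(feed.consume_since(0)) == 2
```

The key property is the fan-in: no matter how many source databases there are, a downstream consumer subscribes to one stream and sees every row-level mutation.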
B: So we have a bit of agnosticity here: I have a particular data model, I have modeled all the changes, and I have a way to consume those changes in one particular way. So that was the first part. Before we decided to migrate or move or replicate anything, we hardened that system to verify that the replication behaves the same. And then now comes the milestones part.
B: So now we have the transport layer built in: oh, something changed in the current source of truth. Then we started building the layers on top: step one, let's just duplicate everything row by row, a mirror of the data that we have. The advantage of that is that it makes our verification process very easy; we could have chosen to have a different data model, but...
B: Right, so our first milestone was to simplify that entire process: if I have a hundred rows on my left-hand side, which is the source of truth, I'm going to have 100 rows on my right-hand side. The guarantee that this replication system gives is very simple: within a particular time bound that we can measure, we're going to replicate all the rows from A to B. And now this becomes a much more easily verifiable thing, through hashes and Jaccard similarity and things like that.
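The hash-based verification mentioned here can be illustrated with a toy check that the replica holds exactly the same rows as the source of truth. The content-hashing scheme below is an assumption for illustration, not the actual Adobe scheme:

```python
import hashlib
import json

def row_hash(row: dict) -> str:
    # Stable content hash: serialize with sorted keys so that
    # field order does not change the digest
    blob = json.dumps(row, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

# Left-hand side: source of truth; right-hand side: the replica
source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
replica = [{"id": 2, "v": "b"}, {"id": 1, "v": "a"}, {"id": 3, "v": "c"}]

lhs = {row_hash(r) for r in source}
rhs = {row_hash(r) for r in replica}

# 100 rows on the left means 100 matching rows on the right:
# an exact set match says every row arrived and nothing extra did
assert lhs == rhs
```

Comparing sets of content hashes, rather than the rows themselves, keeps the check cheap and order-independent.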
A: So, just to reiterate the point: this replication is of the changes. In other words, it's not that, for the sake of argument, if there were a billion rows in the source of truth, you're replicating a billion rows; what you're doing is saying: no, no, I'm focusing on the CDC, the change-data-capture side of the house, and I have the verification. You've basically hardened the replication so that you never have to rebuild the whole system; you only need to grab the 100 rows, or whatever it is, at that point in time.

A: That time window, that batch window, basically ensures that those 100 rows came across, and whatever you do downstream doesn't really matter at that point, because you've validated that the 100 rows that came through are exactly what you were expecting. Exactly, exactly. Okay.
B: Okay, so that was our first milestone, because we wanted to narrow down the problem. Now for the second milestone: the first thing was the replication, and the second milestone was to get the actual...
B: ...workloads that were dependent on the hot store to move on top of this. But there was a big advantage now, because we had narrowed the scope to a one-to-one match: every row on the LHS equal to every row on the RHS. Now for the Spark workload, for those familiar with Spark (and I'm assuming everyone here is), in terms of our reader, whatever df.read we do...
B: ...in terms of migration: if you have 10 different workloads that are dependent on this hot store, we are not rewriting 10 different workloads. One team writes the mapping function to make sure the rest of the job thinks it's just talking to the hot store, because the job doesn't care; it just cares about what input schema it's getting at the top of the pipeline. The rest of the flow remains unchanged.
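A thin mapping function of the kind described, one that makes rows read from the Delta Lake replica look like the struct schema downstream jobs expect from the hot store, might look roughly like this; every field name here is hypothetical:

```python
def map_delta_row_to_hotstore_schema(delta_row: dict) -> dict:
    """Adapt a row read from the Delta Lake replica into the
    struct schema downstream jobs expect from the hot store.
    Only this one function changes; the rest of each pipeline
    is unaware the storage layer was swapped underneath it."""
    return {
        "profileId": delta_row["profile_id"],       # rename field
        "attributes": delta_row.get("attrs", {}),   # default when absent
        "lastModified": delta_row["_commit_ts"],    # remap metadata column
    }

delta_row = {"profile_id": "u-42", "attrs": {"tier": "gold"},
             "_commit_ts": "2022-01-01T00:00:00Z"}
hot_row = map_delta_row_to_hotstore_schema(delta_row)
assert hot_row["profileId"] == "u-42"
assert hot_row["attributes"]["tier"] == "gold"
```

In a real Spark job, the same idea would be applied as a projection over the DataFrame rather than per-row Python, but the contract is identical: one adapter, zero changes to the consumers.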
A: So, for all intents and purposes, as opposed to 10 different replicas, you're basically multicasting it in memory, so each of those 10 jobs (using the magical number of ten) is able to access that data in memory, as opposed to it being written in storage, with the 100 rows or whatever the number of rows is.
B: How do you validate it... yeah, sorry, it works slightly differently. Think of it like this: you don't have to go into 10 jobs; just think of a single job which reads, let's say, a particular schema of interest. That was from the hot store: whatever struct schema I was reading, it was reading from the hot store. Now, when we switched this job over to use the Delta Lake replica as the source, we need to ensure... well, we have two options now.
B: The other option we have is to just write a very thin mapping function to make sure that the data coming from Delta Lake looks exactly the same as whatever the job was reading from the hot store. From a milestone point of view, that reduces the amount of friction; otherwise we would need to get multiple teams to align on using the Delta Lake store instead.
B
I
would
say
blockages
that
you
have,
one
is
I
can
say:
oh
look,
Delta
lake
is
10
times
faster
than
reading
from
the
hot
stone
right
people
will
be
very
happy,
but
then
there's
also
the
human
element
or
the
engineering
element.
Oh
by
the
way,
it's
going
to
take
you
six
months
to
Port
your
workload
to
work
on
top
of
the
Delta
Lake.
Sure
sure
right.
A
Right
so,
in
other
words,
you've
abstract
for
the
downstream
systems,
you've
abstracted
away
the
problem,
they
don't
know
they're
occurring
whatever
it
doesn't
really
matter
to
them.
Correct
right,
okay,
got
it.
Okay,
so
basically,
you've
got
this
hot
store.
You've
placed
it
into
Delta,
Lake,
you've
built
this
mapping
functions,
so
the
downstream
systems
doesn't
actually
have
to
care,
which
is
great.
Is
everything
solved
or
no.
A
So
why
don't
we
talk
a
little
bit
more
so
so
this
this
gives
us
the
idea
of
being
the
extractor
way
and
minimize
the
downstream
Engineers
what
they
actually
had
to
do
in
order
to
be
able
to
query
the
same
data
right
okay,
so
this
is
great
right,
so
you've
basically
you've
stabilized
it,
but
then
and
you've
made
it
easy
for
Downstream
systems,
but
obviously
there's
a
whole
bunch
of
other
problems.
So
let's
get
into
that
now
so.
B: So now we have to go into the NoSQL itself. Take any NoSQL store that's there; it has a concept of multiple partitions. Looking at the status quo: when our Spark jobs go talk to the NoSQL store, the parallelism of the read is going to be proportional to the number of partitions that the NoSQL store has, because if you have only 10 partitions, I can throw 100 cores at it...
B: So now you have twice as much, or n times as much, the more sub-partitions you create, which means you're doing more things in parallel, which means you're going to incur more cost in terms of the higher concurrent operations that are going to happen. So we solved the time problem by parallelizing it, but guess what, we had to pay through the nose in order to actually get the functionality that we needed. So right now it was like, okay, fine.
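The point about read parallelism being capped by the store's partition count can be stated as a one-liner (a simplification that ignores sub-partitioning):

```python
def effective_read_parallelism(executor_cores: int, store_partitions: int) -> int:
    # No matter how many Spark cores you throw at the scan, each
    # store partition is a single read stream, so the partition
    # count is the ceiling on useful parallelism.
    return min(executor_cores, store_partitions)

assert effective_read_parallelism(100, 10) == 10  # 90 cores sit idle
assert effective_read_parallelism(8, 10) == 8     # cores are the limit
```

Sub-partitioning raises the ceiling, but, as noted above, every extra concurrent reader is billed as extra provisioned throughput on a managed NoSQL store.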
A: Yeah, the data...

B: So when some action, a personalization action, is happening on me, the data has to be combined from across both of those individual fragments before a decision can be made, because if you don't include all of it, you're going to have the wrong logic getting fired in the downstream system. So the problem was shuffle; if we were to characterize it, shuffle was the problem. Those were the two main problems that we had. So now comes one of the points that you were mentioning, with the CDC approach.
B
We
basically
had
an
incremental
handle
on
top
of
what
changed
now
with
just
the
raw
Delta
Lake
mirroring.
We
basically
increased
our
read
throughput
right
because
the
data
stored
in
Delta
Lake
in
terms
of
compression,
we
see
anywhere
between
10
to
15
x,
compression
from
what
we
store
in
the
this.
This
one,
so
our
read
performance
has
gone
up
because
more
partitions,
more
core
usage,
but
then
another
important
part
was
our
main
workflow,
which
is
assimilating
these
profiles.
So
now
we
are
doing
this
on
an
incremental
basis.
B
So
now
Delta,
because
you
have
the
change,
feed
and
all
that
stuff
that
you're
going
to
put
in
in
terms
of
like
only
when
the
comment
happens
in
a
particular
table.
Is
it
going
to
the
change
fee
is
going
to
be
there,
which
basically
serves
as
a
notification
for
another
system
that
we
have
to
say
like
okay,
incrementally
materialize,
whatever
you
have
right
right,
it's
not
possible
right,
as
in
I,
wouldn't
say
not
possible.
B
If
you
throw
enough
money
at
something
you,
it
is
possible
even
with
the
hotstar,
but
then
the
the
problem
trying
to
do
a
materialization
like
that
is
because
there
are
like
multi-level
joints
that
are
happening
across
spark
and
a
nosql
store.
So
now
you
have
exactly
plan
a
query
like
this
right,
but
then
now
what
happens?
The
materialization,
whatever
the
incremental
materialization,
that
you're
doing,
is
within
the
same
Delta,
Lake,
Universe
right
right,
so
like
parque
parque.
B
So
in
terms
of
spark,
it
knows
exactly
how
to
optimize
this
incremental
materialization
pretty
well
so
right,
the
change
feeds
so
Delta.
Of
course,
thanks
to
parquet,
we
get
the
compression
and
all
the
statistics
and
everything
we
are
able
to
do
the
replication
much
more
efficiently,
but
then
because
of
change
feeds,
it
enables
us
to
do
incremental
materialization,
so
that
solves
the
second
problem.
That
I
was
talking
about.
Okay,.
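Incremental materialization driven by a change feed, as described, can be sketched without Spark: fold only the rows that changed since the last processed commit into the materialized view, instead of rescanning the whole table. The record layout below is illustrative, not Delta's actual change-data-feed schema:

```python
def apply_changes(materialized: dict, changes: list[dict]) -> dict:
    """Fold one batch of change-feed records into a materialized
    view keyed by row id: upserts overwrite, deletes remove."""
    for c in changes:
        if c["op"] == "delete":
            materialized.pop(c["id"], None)
        else:  # insert or update
            materialized[c["id"]] = c["row"]
    return materialized

view = {"u1": {"score": 1}}
# Only the delta since the last commit is read, not the full table
changes = [
    {"op": "update", "id": "u1", "row": {"score": 5}},
    {"op": "insert", "id": "u2", "row": {"score": 2}},
]
view = apply_changes(view, changes)
assert view == {"u1": {"score": 5}, "u2": {"score": 2}}
```

The cost of refreshing the view scales with the number of changed rows per commit, not with the total table size, which is the whole point of the change feed.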
A: No, that's actually really cool. So basically you're able to leverage CDF, the change data feed, to solve that second problem, and then it also technically solves part of the first problem, because the read hits are no longer on the NoSQL store; they're on Delta Lake, and because it's all a bunch of Spark queries anyway, it's not that big of a deal for you. Spark can basically cache it in memory on its own anyway.
A
If
you
have
to
query,
run
multiple
concurrent
queries,
it
doesn't
really
matter
it's
hidden
off
the
Delta,
just
like
you
said,
you're
actually
able
to
leverage
the
calm
stats,
the
compression
algorithms,
the
snappy
code.
So
all
that's
basically
put
in
play.
So
this
is
this,
makes
your
life
easier
and
your
Reliance
on
the
nosql
store
has
reduced
not
so
much
on
functionality,
but
so
much
on
cost
like
we're
able
to
reduce
the
cost.
That's
that
because
you
no
longer
have
to
throw
that
much
concurrency
on
the
nosql
store
exactly.
B
We
were
kind
of
hacking
it
or
trying
to
use
it
for
our
own
needs,
and
what
do
you
say
our
own
timeline
we're
trying
to
fit
our
timeline
with
the
solution
right?
So,
but
now
we
don't,
we
can
we
have
two
systems
which
can
be
kept
in
sync.
We
know
that
right
and
each
of
them
do
what
they
do
best.
So
that's
kind
of
right.
It's
done
something
yeah.
A
Like
choosing
the
right
tool
for
the
job
right
and
right
right,
there's
no
SQL
stores
I've
tried
to
basically
do
operational
type,
queries
on
that
and
like
for
the
occasional
one.
It
actually
makes
a
lot
of
sense
too,
because
if
you
need
it
immediately
because
like
for
example,
you're
debugging
something-
and
you
need
it's
a
customer
scenario
where,
like
like,
we
haven't
even
transformed
the
data
yet
we're
just
simply
trying
to
like
hey.
Oh
no,
we
need
to
deal
with
it
right
away
because
we
got
a
debug
call.
A: The concurrency cost alone is going to basically kill you. Oh, actually, related to this, there's a great question on LinkedIn, and I'll pose that question right now. I believe I'm saying your name correctly; I apologize if I'm not. The question is: how can we do hash comparison for finding the new changes? In this case, they're trying to better understand the new changes, and I think there are two concepts that we want to talk about.
A
There's
the
new
changes
that
went
from
the
hot
store
into
Delta
Lake,
which
is
the
hash
comparison,
so
we'll
talk
about
that
a
little
bit,
but
then
there's
also
the
downstream
systems
which
are
actually
piggybacking
off
the
CDF
for
from
Delta
lake.
So
I'll
answer
the
latter
one
just
because
that's
straightforward
like
for
the
latter
part,
basically
change
that
you
feed
is
an
option
directly
within
Delta
Lake
Delta
Lake
itself
has
its
own
transaction
log
anyway.
A
So,
by
definition,
all
we
did
was
really
make
it
available,
and
so
that
way
it
became
really
easy
for
you
to
go
ahead
and
understand
what
were
the
changes
that
happened
to
the
Delta
Lake
directly.
That's
why
yes
was
able
to
talk
about
like
the
incremental
views
or
things
of
that
nature
like
because
you're
able
to
or
the
notifications
go
with
it,
because
it's
you,
you
basically
have
that
information
right
directly
in
the
CDF.
A
The
question
basically,
then,
is
more
a
matter
that
I'll
have
you
answer.
You
is
about
like
the
hash
replica
of
between
it
looks
like
my
videos
Frozen.
So
that's
pretty
cool
yeah
like
wow
I,
look
really
weird
right
now,
but
okay
is
to
basically
go
ahead
and
do
the
replica
between
the
hot
store
and
the.
B
Right,
so
no
that's
a
good
question
so,
and
this
is
something
that
we
have
spent
a
good
amount
of
time
and
we
have
an
injury,
an
individual
work
stream
dedicated
collaborating
with
Adobe
research
just
on
this.
So
there
are
multiple
answers
to
it,
but
I'll
kind
of
give
you
the
more
simple
version
of
what
we're
doing
the
naive
version.
So.
B
So, if I were to make a query to the hot store to say, okay, give me all the documents that changed for this DB in the last, let's say, 15 minutes, that is kind of a cheap query. And then we also have a pseudo document hash that is computed to represent the content of the document itself.
A
B
That lives in the hot store itself, so now that is also getting replicated, right? So now, at regular intervals, we are able to query the hot store to say: give me all the hashes that have occurred in the past 15 minutes, for our comparison. Now you can look at Delta Lake and see, okay, how many of these hashes have actually made it into the system?
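The check Yesh describes can be sketched in plain Python. The hashing scheme and function names here are hypothetical illustrations, not AEP's actual code: each document gets a canonical content hash, and the hashes seen in the hot store's window are compared against the hashes that have landed in the lake.

```python
import hashlib
import json

def doc_hash(doc: dict) -> str:
    # Canonicalize the document (sorted keys, fixed separators) so the
    # same logical content always yields the same hash.
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def not_yet_landed(hot_window: set, lake_window: set) -> set:
    # Hashes seen in the hot store's window that have not yet
    # shown up in Delta Lake.
    return hot_window - lake_window
```

Because the hash is computed over a canonical form, the same document produces the same hash regardless of key order, which is what makes the set comparison meaningful.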
B
Got it. You can think of this as an exact set match, or you can think of it like Jaccard similarity. Your imagination, and your budget, is the limit on this thing, right, because you can throw a lot of computation at it. You can do exact matches too, depending.
B
But we are right now at kind of a terabytes-a-day, petabytes-in-total scale. So we try to optimize this into more probabilistic techniques, to say: okay, we're probably okay with, say, a five percent diff between this and this, but we need to make sure that all the data does progress into the system. So that's the verification process that we're doing, right.
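The Jaccard-style tolerance check mentioned above can be sketched like this. The function names and the five percent default are illustrative assumptions, not AEP's actual implementation:

```python
def jaccard(a: set, b: set) -> float:
    # Jaccard similarity: |intersection| / |union|; define two empty
    # sets as identical (1.0).
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def window_ok(hot_hashes: set, lake_hashes: set, max_diff: float = 0.05) -> bool:
    # Accept the replication window if the two hash sets differ by at
    # most max_diff (the "five percent" tolerance from the talk).
    return 1.0 - jaccard(hot_hashes, lake_hashes) <= max_diff
```

The tolerance makes the check cheap at scale; anything falling outside it would trigger a closer (exact) comparison.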
B
But apart from that, for any data that gets into this CDC approach, we maintain elaborate metadata at a source level so that we can track it. This entire replication process keeps track at every source; every database has multiple sources hydrating into it. So on a per-source level, we are maintaining counts of how many things are happening. So we see, okay, you've got an input from, say, Source One, let's say Adobe Analytics, that sent in 500 million, or no, 5 million, records in the last 15 minutes.
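The per-source bookkeeping could look something like the following. This is a hypothetical sketch (the class and source names are made up for illustration), just to show the shape of a ledger that downstream counts can be reconciled against:

```python
from collections import defaultdict

class SourceLedger:
    # Per-source ingest ledger: each replication window records how many
    # documents each upstream source hydrated in, so the replication
    # system's output can be reconciled against it.
    def __init__(self):
        self._counts = defaultdict(int)

    def record(self, source: str, n: int) -> None:
        self._counts[source] += n

    def count(self, source: str) -> int:
        return self._counts[source]

ledger = SourceLedger()
ledger.record("adobe_analytics", 5_000_000)
ledger.record("adobe_analytics", 250_000)
```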
B
So now we'll be able to know, okay, how much did the processor, the replication system, also do. That's more from a systems-engineering point of view, to keep track of, or give some transparency on, how much data flowed, so we'll be able to measure it that way. But then, assuming the entire system is a black box, the other thing is there too: we built this mechanism so that we can actually check the, what do you say, the strength of the replication process through this hash comparison.
A
B
A
Okay, cool, now this is awesome, but I'm going to switch gears, because I just realized we only have about eight minutes left. Okay. So up to this point we've talked about the replication; it works great. We've talked about how Delta Lake has actually helped solve a lot of the problems, because it simplified how the downstream systems were able to do it.
A
You had the mapping function that made things easier, so that the downstream systems didn't actually have to make many changes, if any at all, to be able to go ahead and still query the data. That's awesome, but there have got to be some problems. I mean, up to this point you've basically made almost a great ad for Delta Lake: okay, it definitely solves everything, we're good to go, run away. We know that's not true, and this is a D3L2 session.
A
B
Okay, cool. So, in terms of problems, right, there were a lot of problems. There are still some problems which we work around. But if you look at the bigger ones, I think it's the underlying infrastructure. Delta Lake gives you a good paradigm in terms of doing your upserts, deletes, and all of that on top of Parquet data.
B
But what people forget is that you are bound by the performance of the underlying HDFS-compatible storage that you have, whether it's S3 or ADLS, as in Azure Data Lake Storage. In the case of ADLS, and I think I did say this before, we were earlier on Gen 1, and Gen 1 did not scale. I think even Microsoft acknowledged that, because there were a lot of limitations in terms of, you know, the number of metadata files, as in the number of nodes that you could have within a single folder, etc.
B
B
Gen 2 changed the game completely. In fact, unpopular opinion, right, but Gen 2, in terms of Delta Lake support, is better than S3, because on S3 you don't have multi-cluster writes out of the box: you actually need a coordinating database to make sure that you can write from multiple clusters. But on ADLS Gen 2 you don't need to do that. So thanks. That's right, we didn't have to, yeah.
A
B
There was not yet another component that we needed to add to our infrastructure stack. But scaling out Gen 2 was a problem for sure, because people forget: it's not "oh, I just did an operation on Delta, it just works." No, no. You definitely have to monitor your underlying HDFS-compatible storage to see what throttling limits are there, what bandwidth limits are there, so that you can work around them, or work with your cloud provider to make sure that you have some exceptions in place ahead of time.
A
So specifically for ADLS Gen 2, just to provide some context for folks: ADLS Gen 2, in all seriousness, is basically Azure Blob storage with the option of a hierarchical namespace on top of it, and that actually often makes a lot of sense. That's why we pretty much tell you, yeah, yeah, go do it. It provides an excessive amount of scalability, but, and I'm pretty sure this is where the "but" is going to kick in...
A
I take it that you had to basically work with the cloud provider, in this case Azure, to make sure that... and when I use the word partitioning here, I'm talking about partitioning from the infrastructure perspective, not from the data perspective, right, right. That you had to basically pre-allocate the partitioning of ADLS Gen 2, or did they speed that up?
A
Because in the past, and I'm literally using numbers from years ago, it was something like: you had to hit a threshold of something like five thousand or ten thousand IOPS over a 24-hour period before it auto-partitioned underneath the covers. Is this something where you actually waited that long, or did you go ahead and create...
B
No, no. The good thing about AEP is it's a big enough team with a lot of different expertise, I would say. We have a dedicated team who is managing a lot of the data on the lake, as in the data lake, and they were able to use the relationship with Microsoft to make sure all these pre-allocations happened ahead of time. Gotcha.
B
They took care of it. So there is a dedicated team, kind of, you know, working behind the scenes to make sure that all these exact infra details that you're talking about are nailed, so that the rest of the system does not need to worry about them. This is one of the advantages of being in a platform that was grown from the bottom up. Yeah, yeah.
A
No, this is also good learning, because a lot of folks don't know this, right? And so the fact that you have a platform team that focuses on that is analogous to what I've talked about in the past, where, when you're dealing with database systems, and this was before the cloud became such a big thing, you actually had to care about the idea of random IOPS, right? How random IOPS, especially on spinning disks, would screw you over.
A
So you actually had to understand the underlying infrastructure and recognize, and that's why we told you all to use SSDs, because they allowed you to have random IOPS without any problem. So with cloud object stores, yes, we abstracted that away. We said yes, it's all just REST API calls, for all intents and purposes, against the cloud.
A
B
A
I realized that we only have a few minutes left, okay, so I did want to call out, though, and I'm going to call myself out on this one: Delta Lake being awesome, you had the infrastructure figured out, but there was a problem. Again, I'm going to call this out myself: we hadn't open-sourced enough of Delta Lake fast enough. So why don't you talk a little bit about that, because I...
A
B
B
So it was amazing, because of a lot of our workloads. I just talked about replication, right, and people think, okay, you just replicated A to B, fine. No: this replication consists of inserts, updates, and deletes, right? You need to know how to do that efficiently, so that the time bound that we're giving is not like hours. It can be minutes, but it cannot be hours. The only way we were able to do that was, I think, two important features that kind of made the thing work.
B
Z-ordering was the one super important one; another one was auto-optimize. But Z-ordering made a huge difference, because we were reordering based on a particular primary key, which we knew all our update statements, our upsert statements, were based on. Now, the problem is: the tests came out really well, the performance was really good, but then in terms of selling it to the rest of the platform and upper management, it's like, oh, so the entire business functionality, the performance, hinges on a proprietary feature.
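For readers unfamiliar with the feature being discussed: Z-ordering is based on a space-filling (Morton) curve, interleaving the bits of the clustering columns so rows with nearby key values land in the same files, which is what lets data skipping prune files on any of those columns. A toy illustration of the bit interleaving, not Delta Lake's actual implementation:

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    # Interleave the bits of x and y: x occupies the even bit
    # positions, y the odd ones. Sorting rows by this key keeps
    # records with nearby (x, y) pairs physically close together.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z
```

In Delta Lake itself this is simply `OPTIMIZE table ZORDER BY (col1, col2)`; the sketch only shows why sorting by an interleaved key clusters on both columns at once.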
B
That was a huge pain point, right? Because, yes, you could say all the yada yada yada about Delta Lake, and then, oh, by the way, it's open source; oh, but by the way, we're dependent on this proprietary feature from Databricks, which we love, by the way, but it's a risk that we need to go ahead with, right? Yeah. Because any organization has a combination of risk-averseness versus risk-taking. So how do you rationalize it? That was a problem.
B
So then, after a lot of talking with Databricks, a lot of discussions and trying to rationalize it, I was super happy that you guys actually just said, okay, fine, let's go for the open source; let's just open source this. And I would say that was a big thing, mainly because, one, internally we had the results and everything showing up, but then there was this thing that the code could not change, right?
B
A
No, no, that's awesome. And, incidentally, we were actually trying to figure out if we could do it even earlier than 2.0, but as you can tell, we actually had a bunch of the column stats stuff, and the clustering, that we had to do first before we could get Z-order out. That's why it took us until Delta 2.0 to do it. And to be very clear...
A
It wasn't just Adobe asking. I do want to give Yesh a ton of callouts here, but this was definitely an ask by the community. So, in typical fashion, this is where customer and community were extremely aligned. There are times, obviously, when the two won't be aligned, right, but in many cases, especially when it comes to Delta Lake, the customers and community are extremely aligned.
A
So we're like, okay, cool, we heard you, we listened, and we just went ahead and did it. And I think you also called out that you needed open source, if for no other reason, for local development, right? If I want to compile and run this code base on my local machine, you're not running a Databricks cluster on your MacBook.
B
That will never sell, right? There are so many hundreds of developers. It'll be like, okay, fine: oh, by the way, you want to test something, or you want to build something? Oh, it has Z-order in it, which will now be a part of every single thing, because this is in the hub. Oh, by the way, go do it in Databricks? No, that's not going to happen. No, no, you'll have to pry IntelliJ and VS Code, or whatever, from...
A
B
A
We've tried to make it easier, which is, by the way, just as a shout-out, where Apache Spark is going forward, including Spark Connect. Spark Connect and Databricks Connect are basically the same thing; Databricks Connect just has authentication and authorization stuff that's specific to Databricks. Otherwise it's the exact same code base. And so the idea is that you could presumably, from your IDE, from your VS Code, write the code and actually submit it directly to a remote cluster.
A
But even with me saying that, you and I are on the same page: no, you're going to pry local development from my cold, dead hands, right? No, I need to run this stuff locally first before I do anything. Yeah. So, we only have a few minutes left, but I did want to leave off with a cool tidbit. You had told me this one when we chatted before: exactly how much easier it is from a manageability perspective, in terms of the number of tables and the number of tenants that your team actually has to manage.
A
B
So if you think about it, and I'm not going to give exact tenant counts, but in terms of sheer numbers, I'd say we have close to 5,000 to 6,000 Delta tables right now, and actively growing. Usually when people talk about a migration, it's like you have one big fat table which you basically split into probably two or three to make a good alignment.
B
But in this case we have internal clients, external clients, and all that, and when you add up the total number, we have close to five to six thousand Delta tables that we're actively replicating to, fanning out to, on a regular, really a real-time, basis. So now, maintaining all of that has actually become pretty easy, contrary to what people might think, because we just have one huge maintenance job that runs, and all it does is take care of orchestrating the vacuums, optimizes, and everything.
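A sketch of what such a maintenance pass might loop over. The table names and the `run_sql` hook are hypothetical stand-ins for however your environment submits Spark SQL, not AEP's actual orchestration; `OPTIMIZE` and `VACUUM` are standard Delta Lake commands:

```python
def maintenance_pass(tables, run_sql):
    # One pass of the maintenance job: compact small files, then clean
    # up files older than a 7-day (168-hour) retention window.
    for table in tables:
        run_sql(f"OPTIMIZE {table}")
        run_sql(f"VACUUM {table} RETAIN 168 HOURS")

# For illustration, collect the statements instead of submitting them.
issued = []
maintenance_pass(["profiles.tenant_a", "profiles.tenant_b"], issued.append)
```

With thousands of tables, the value of this shape is that the table list is data: one job, parallelized however your orchestrator allows, instead of one scheduled job per table.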
B
So everything is broken down, right? Even if you had a traditional DB, you still had to sometimes schedule your vacuums and stuff like that, yeah. What we did is basically acknowledge that we're running a database, compute on demand for a database; that is what we are managing, right, right. That is the acknowledgment that we have to make: that we have to take care of all these maintenance operations ourselves.
B
From our side, we already have a good orchestration system that can parallelize, which, again, is just compute on Databricks and the Spark cluster, so we've just managed that. And for the rest of the operations, in terms of the vacuum, optimize, and everything: as long as you have awareness of what it's doing, you schedule it and you're good to go.
B
Your tables are fine. Of course, I say this very easily, but as long as you read the docs and you understand the concurrency conflicts and everything that happens, which is a separate talk of its own, as long as you do that, you should be golden to manage however many databases you want. We have learned a lot of lessons from maintaining and managing multiple databases before, so a lot of them came in, what do you say, came in handy this time.
A
Perfect. Yes, this is excellent. I think this is a great way for us to end today's session. If you do have any questions or want to chat a little bit more, we'll be updating this, by the way; there will be a follow-up blog that we'll be writing together on this topic as well, based on basically all this conversation that we're having today. This video is already being live-streamed on both LinkedIn and YouTube, so you can already see us there. But just as importantly, go to go.delta.io/slack.
A
A bunch of us are already there answering questions as well. So that's really it. Yesh, anything else that you want to add?
A
Yeah, I figured as much, which is why I wanted to type it. But no, really, Yesh, I really appreciate your time. This was super insightful, super helpful. I'm really glad there are a lot of really cool lessons learned for the community. But then, like I said, if you want to continue chatting, or you have specifics that maybe would be worthwhile for us to cover in a follow-up session or follow-up blog, just join us at go.delta.io/slack. So with that: Yesh, thank you very much. I really appreciate your time.