From YouTube: D3L2: Migrating from a Data Warehouse to a Lakehouse with Structured Streaming and Delta Lake
Description
In this D3L2 episode, we sit down with Christina Taylor, a data engineer who has worked at Carvana, Bread Finance, and the Walt Disney Company, to discuss her path from data warehousing to the lakehouse. In the process, she led her teams to an open data lake that unifies batch and streaming workloads with Delta Lake, decoupling data storage from proprietary formats and dramatically reducing data extraction costs.
Denny: Welcome to this next section of the D3L2 vidcast and podcast. My name is Denny Lee, and I'm really happy to have Christina Taylor here for our next session, about migrating from a data warehouse to a lakehouse with Structured Streaming and Delta Lake. But before we start, I really wanted to let Christina introduce herself. So, hey Christina, why don't you tell the audience a little about who you are and how you even got involved in the data engineering space?
Christina: Sure thing. Hey folks, I'm a data engineer today at Carvana, but I have kind of an interesting personal story: I originally had a background in teaching and education.
Christina: I graduated into a recession and worked in universities in career advisement for three years. I could not afford to pay rent in New York City on a teacher's salary, so, with the insight I'd gained as a career advisor, I got into data and analytics. I thought, oh okay, this is hot, let's see what's going on here. I started off as an entry-level analyst and made it all the way to analytics manager, but then I realized: okay, for people to do the cool business intelligence, visualization, and data science stuff, you really need a strong data engineering foundation.
Christina: So, let me be that janitor with a keyboard. I took a stab at joining the founding team at Disney+, learned Spark streaming and Structured Streaming on the job, and migrated from EMR to Databricks and from Redshift to Snowflake. Fast forward a few years, I grew to be the staff engineer for a fintech startup called Bread and came over to Carvana after the acquisition. So: very strong startup mentality, and I really love the Spark distributed computing framework.
Christina: And Delta Lake, I've seen that shirt. It brings a transactional nature to Parquet files and adds all these cool features like time travel, version control, and safer overwrite and replace, so I really enjoy working with that as well. So here we are.
Denny: Okay, well, this is pretty awesome, and I'm actually sort of impressed by that fact. So let's go back just a little bit, because I find your history very interesting. You originally were going to be a teacher, and you hit the recession, which sucks, of course, but then you noticed, as you were doing career advisement, that the hot topic, or the hot career, was within the realm of data engineering, and that's actually what got you involved in this space, yeah?
Denny: I love that approach to it, because it definitely continues that push for the idea of self-learning, where you're constantly educating, changing, and updating yourself based on the newest technologies and the newest pushes, which is really cool. And so then, okay, you got involved; what ultimately led you into the realm of even just Spark and Structured Streaming? I mean, I'm really happy that you like the DataFrame API, which is great.
Denny: You know, a shout-out to the Spark community that helped build that. But what got you to care about distributed processing in the first place? Because you jumped right to it right away, and I'm like, well, how did you get there? What led you to actually needing to utilize something like Spark in the first place, yeah?
Christina: So I'll take Spark first and then streaming. I became a huge advocate for open source technologies, and open formats as well, after seeing a lot of the opposite trends in some of the more traditional industries. I felt like, with the Spark framework, first of all, the code you write is naturally parallel, and in the second place, you have the support of the entire open source community for contributions, rather than just using a particular patented technology within your organization.
Christina: So that's extremely powerful, and it probably explains Spark's popularity and growth. And open format, I think I'll touch on a little bit more when we talk about data warehouses; I also feel like it's a very misunderstood but necessary concept as of today. The idea of owning your data, with no vendor lock-in, keeping it whether on-prem or secure in your own cloud environment, is really, really important. As a matter of fact, it's this kind of thinking that contributed to the $500,000 cloud cost reduction here at Carvana, so I'll discuss that.
Christina: Streaming can be real time, but I think the most powerful aspect of Spark Structured Streaming is the idea of checkpointing and disaster recovery. Coupled with technologies such as AWS SQS, a notification system, you can achieve file detection and ingestion on demand, as well as this trigger-based streaming pattern.
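To make the checkpointing and trigger-based pattern concrete, here is a minimal PySpark sketch. The bucket, paths, and schema are placeholders, and the SQS-backed file notification Christina mentions maps to features like Databricks Auto Loader's notification mode rather than anything in core Spark; the idea shown here is just a checkpointed, trigger-driven file stream.

```python
# Minimal sketch: checkpointed, trigger-based file ingestion with Structured Streaming.
# All paths, bucket names, and the schema are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("events-ingest").getOrCreate()

schema = (StructType()
          .add("event_type", StringType())
          .add("timestamp", TimestampType())
          .add("payload", StringType()))

events = (spark.readStream
          .schema(schema)
          .json("s3://example-bucket/raw-events/"))  # plain file source; a notification-based
                                                     # listing (e.g. SQS-backed Auto Loader on
                                                     # Databricks) can detect new files instead

query = (events.writeStream
         .format("delta")
         .option("checkpointLocation", "s3://example-bucket/_checkpoints/raw-events")
         .trigger(once=True)                         # trigger-based: drain new files, then stop
         .start("s3://example-bucket/delta/events_raw"))

query.awaitTermination()
# Restarting the job with the same checkpointLocation resumes exactly where it left off,
# which is the disaster-recovery property: no reprocessed files, no gaps.
```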
Denny: You know, that makes a lot of sense. I mean, exactly to your point, I think one of the key facets that a lot of people forget about when it comes to things like Spark or Structured Streaming is that one of the most complex parts of this is actually the state management: just understanding what files you processed, or what rows you processed. Having a system with that state management pre-built right from the get-go actually simplifies your life, whether you're dealing with real-time scenarios or batch scenarios. The fact that you can basically think about the problem purely in terms of latency, while the business logic stays exactly the same, I think that's a pretty powerful concept. And I think this naturally segues to exactly what the title of today's session is, which is migrating from a data warehouse to a lakehouse with Structured Streaming and Delta Lake. Well, yeah, once you give your motivation and your background: you're saying you're going to migrate from a warehouse to a lakehouse, so what was the infrastructure or the setup that you had in terms of data warehousing in the first place, and why? You know, let's just start with that, what is the current state, and then afterwards we'll discuss why the transition to the lakehouse, yeah.
Christina: For sure. Let me start with the business impact and what problems we're trying to solve with data. Being from an analytics background really gave me sympathy for stakeholders. My group at Carvana handles customer communications, so that's customer engagement, retention, workforce management, conversational AI, and so on. The crown jewel of our front-end services is called Sebastian; that's the chatbot you talk to when you visit the Carvana website.
Christina: It's often your first point of interaction with a customer, and it also routes communications to a relevant advocate. So we keep track of all customer communications, and that goes into something called a comm router service. That's a legacy monolith service that collects a lot of different things that go through it, like: a customer started a chat, the chat is assigned to an advocate, a customer clicked the escalation button, and so on. It's a lot of different things.
Christina: All mixed together, and as I said, this is a legacy service; it was never designed for analytics use cases. But then people want to help Sebastian work smarter, and help workforce management assign conversations more effectively, so we definitely want that insight. So what do we do there? Okay, this was before we had a proper data engineering team, so the quickest and fastest turnkey solution...
Christina: ...was one of these export targets; for example, Google Cloud BigQuery is one of them. From the Google Cloud Logging service, you click a button, you choose a BQ table as a destination, and boom, in a couple of minutes you have an analytics table. And I have to say that, having worked with SQL Server, Redshift, Snowflake, and Google Cloud BigQuery, BQ is actually, in my opinion, one of the most powerful analytics data warehouses there is. Unlike all the others, it's truly serverless; you don't have to provision at all.
Christina: Even some of the serverless data warehouses, like Snowflake for example, require you to somewhat size the warehouse, right? You choose, like, small or extra small, and quite often there's a contract model, so you have to have some thought in mind. But with BQ you can truly just pay as you go, and the pricing is also fairly transparent: it all depends on how much data you process in the query, and for on-demand pricing it's five dollars per terabyte. It's been pretty fast, and I've actually not had performance complaints.
Christina: Cost is not a problem until it becomes a problem, right? But one of the reasons cost got out of control is that when you're using these out-of-the-box things, you don't write any code, right, but there are also very few customizations you can make. At best, you could only partition that destination table by timestamp; BQ is smart enough to recognize that as a datetime, so it's not really partitioning on the raw high-cardinality value, but you can't do anything else.
Christina: You won't be able to cluster, or Z-order, by event name or event type, and the result is that every time somebody does analytics on that table, it's a full table scan. So think about a two-terabyte table: every time someone does a select star from the table where event equals something, it does a full table scan, so that's ten dollars per query. That gets very expensive, but it's only one problem.
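One way to see this before paying for it is BigQuery's dry-run mode, which reports how many bytes a query would scan. The sketch below uses the google-cloud-bigquery Python client with a made-up project, dataset, and table, purely to illustrate the on-demand pricing math described above.

```python
# Illustrative only: estimate on-demand query cost from bytes scanned (dry run).
# The project/dataset/table names are placeholders, not Carvana's actual setup.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT *
    FROM `example-project.logs.events`   -- time-partitioned only, no clustering
    WHERE event_type = 'chat_assigned'
"""
cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=cfg)

tb_scanned = job.total_bytes_processed / 1e12
print(f"~{tb_scanned:.2f} TB scanned, ~${tb_scanned * 5:.2f} at $5/TB on-demand pricing")
# Without clustering on event_type, the WHERE clause prunes nothing:
# on a ~2 TB table that is roughly a $10 full scan per query.
```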
Christina: This target doesn't really have schema evolution capabilities, and neither do many data warehouses; it doesn't just add the column for you. So if your event has a new field, it doesn't add it; that's not something a typical data warehouse can support. We've been very lucky that this is a legacy service and the schema doesn't change a lot, so a schema change, which means standing up a new target and doing a tedious backfill, was just a once-a-year event.
Christina: But when that happens, it's quite tricky and really annoying. Okay, so I talked about the full table scans and not being able to break out fields, right? And so, okay, we realized that one of these jobs, because we need to recompute the metric every five minutes, even though it runs against just a two-terabyte table, that job by itself costs a thousand dollars a day. It's crazy. And using DML is very tricky in BQ; it's not like in Delta Lake.
Christina: Okay, yeah, so we're like, this is really unsustainable, so what do we do? Our first try was actually to have one separate sink for each event type. And then people would keep requesting new event types and new tables, and we do use manifests to release these things, so Kubernetes Config Connector, but that quickly became a deployment nightmare. A senior developer needs to be involved, you are messing with production GCP resources, and we just cannot keep up with the demand, right?
Denny: Right, because, basically, just to roll back a little bit and provide people context: the idea is that every single time anybody wants to go ahead and create a new topic of any type, it basically generates a whole new set of tables. You actually have to have somebody like a senior person involved; it's not automated, it's not metadata-driven, and even if you could automate it, now there's a maintenance nightmare, because you have all of these tables to go keep track of.
Denny: Yeah, no, no, no! Okay, sorry, this is story time, just for the sake of posterity, for fun. Actually, if you think generating these things from Excel is bad, I had worked on a system in my past, I'm not going to mention which system it is, where they designed the pipelines using the precursor to Visio.
Denny: Oh, and SQL scripts, not code; I don't want to claim SQL is the code. So yeah, I could empathize with the pain, but it's also really enlightening that, going back to when I had done it, and I'm pretty much dating myself really nicely here, we seem to keep reinventing the wheel and doing more or less the same type of hacks, even when we're talking about completely different projects. In my case I was talking about Visio and Perl, and you're talking about Scala and SQL on BigQuery. It's just interesting that we're constantly shifting back and forth like that, making the same mistakes, I mean.
Christina: In terms of pain points, I haven't finished with all of mine yet.
Denny: Oh!
Christina: There's one last pain point for us, and that's the data export. Now, Carvana is a very interesting organization: we don't have a CTO, and our CPO's vision is for each engineering group to be run like its own independent startup. Great. So my team is a team of people with a very strong startup mentality: we roll up our sleeves, we picked our own favorite cloud tools and technologies based on our skills and interests, and other teams did the same.
Christina: So we ended up on all three clouds: the engineering operational database is in Azure, with some SQL Server stuff in GCP; the communications team is all on GCP; and the core analytics team is on AWS. It's come to the point where we need to understand, from a centralized perspective, how much business the company has done, and my comms team needs to provide data to other organizations, so we're taking data from GCP into AWS every day. So, that two-terabyte table I was talking about, right?
Christina: There are actually three bills we have to pay; I've realized this painfully. First, you select something from that big table: five dollars per terabyte. Then we pay a BQ Storage API cost, which is more than the network calls; they didn't give me a breakdown. And then we pay network egress, 12 cents per gigabyte, from...
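As a rough back-of-the-envelope check, and purely for illustration (it assumes the whole two-terabyte table crosses clouds, which is exactly what the incremental extraction described later avoids), the quoted rates add up quickly:

```python
# Back-of-the-envelope: moving ~2 TB across clouds at the rates quoted above.
table_tb = 2
select_fee = table_tb * 5.00          # $5 per TB scanned by the SELECT
egress_fee = table_tb * 1024 * 0.12   # $0.12 per GB of network egress
print(select_fee, egress_fee)         # ~$10 + ~$245 per full copy,
                                      # before the BQ Storage API charge on top
```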
Christina: The funny thing is, when you look at the bill, it just says network egress, North America. I have an English degree; I looked at the GCP pricing page, which is pretty transparent compared to a lot of other providers, and I still had to write to support and have them translate for me: hey, this is my understanding, is that correct? So we're actually paying because of the cross-cloud nature of it.
Denny: Now I get it, because even though you started off talking about how much you loved using BQ as a data warehouse, the overall operations, the overall infrastructure, resulted in the fact that you need a data platform that goes way outside the boundaries of what a traditional data warehouse can do. And so this invariably, I guess, led you to thinking about needing to build a lakehouse, per se. Basically, is that the transition, or did you end up making any steps before you decided that the lakehouse was the right approach?
Christina: A lot of factors went into consideration here. I think one of the core principles is the open format: I really like the idea of decoupling storage from the compute, and from a specific vendor or technology. This was particularly relevant because I came from fintech, and our customers at that time were very adamant that each tenant would have its own AWS account and that the data lives...
Christina: ...in our AWS account, not in some other data warehouse provider's account. So that was extremely important for us. We also like the idea of open format because one of the draws of data warehouses is that they're a very easy, turnkey solution with a great user experience, and there's usually no cost or low cost when you put data in. The price problem starts when you have this repeated compute, and there's only so much SQL can do; and then, when you want to take data out, that's where you start paying handsomely.
Denny: Oh sorry, I muted myself by accident. So basically, what it came down to is that before you even started that lakehouse discussion, you were already thinking: I need to store this in an open format, so that way you can actually make sure you're not locked into any vendor. Such that, even if that's the tool du jour today, the reality, with how fast this industry is changing and how we're introducing new systems all the time, is that maybe tomorrow you want to use blah, right, and this project blah is great.
Christina: Yeah, exactly, and it's a super important consideration for us, and sadly one of the least covered aspects when people are evaluating these technologies. It's like worse than a divorce.
Denny: That's pretty brutal. All right, so let's switch gears. So then, that invariably, I guess, led you, before even the lakehouse, to Delta Lake. I mean, is that what made you go to the open format, or did you start with Parquet first, or JSON first? I'm just curious, what was the modus operandi for you?
Christina: Yeah, so we've worked on both semi-structured data and structured data in the past, and there were a lot of legacy systems that are still writing Parquet, but I think of Delta as supercharged Parquet files. It really enforces the schema, so it's not like you can append two incompatible schemas together, just layer them over each other in a Parquet file, and then have to make these expensive merge-schema calls every time. And it's transactional in nature.
Christina: So you don't have to worry about partially corrupted files, and it has schema evolution capabilities, which was really important for us, so we don't have to redeploy a Delta sink every time the service wants to change the schema. And perhaps one of the more powerful features we like is how easy it is to do an upsert; on a Parquet table, if you want to do an update, it's like a seven-step process.
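For contrast, here is roughly what those two features look like with the delta-spark library; this is a minimal sketch, and the table path, key column, and the `new_events` and `updates` DataFrames are hypothetical, assuming an active SparkSession.

```python
# Sketch of Delta Lake schema evolution and an upsert (placeholder paths and columns).
from delta.tables import DeltaTable

# Schema evolution: new columns appearing in `new_events` are added to the table
# on write instead of requiring a redeploy and backfill.
(new_events.write
 .format("delta")
 .mode("append")
 .option("mergeSchema", "true")
 .save("s3://example-bucket/delta/events_bronze"))

# Upsert: a single MERGE replaces the multi-step rewrite a plain Parquet table needs.
target = DeltaTable.forPath(spark, "s3://example-bucket/delta/events_bronze")
(target.alias("t")
 .merge(updates.alias("s"), "t.event_id = s.event_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```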
Denny: Oh no, I mean, don't get me wrong, that's pretty awesome. So basically the features of Delta Lake in and of itself, plus the openness, allowed you to feel comfortable saying: we're not going to need to worry about those egress costs anymore; we have this one lake that allows us to put all of our data in and then process and query our data against it.
Christina: What we ended up doing for the legacy service is that we first ship the data to cloud storage, and then we take it out from cloud storage. We have to pay the network cost, but it's something that only has to be done once, right? And because of Structured Streaming, this extraction is incremental in nature. It's not like every time anyone wants to query an event I pay this BigQuery select fee, plus the Storage API cost, plus network.
Christina: No, I only have to do this once, and once the data has landed in the bronze Delta table, there are very interesting things we can do. Not only can we partition on event date, we can Z-order on event type, so it's like a multi-dimensional index and the lookup is much faster. And we can also do interesting things that we were not able to do in SQL, and that's called star expansion in PySpark, because what we're really only interested in is just the JSON payload.
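A sketch of what that looks like in PySpark follows; the paths, column names, and payload schema are assumptions, and the OPTIMIZE ... ZORDER BY command is a Delta Lake feature (Databricks, and Delta Lake 2.0+ in open source). The payload column is parsed once and then "star-expanded" into top-level columns.

```python
# Star expansion of a JSON payload column, plus Z-ordering the bronze table.
# Paths, column names, and the payload schema are illustrative placeholders;
# assumes an active SparkSession `spark` with Delta Lake configured.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType

payload_schema = (StructType()
                  .add("conversation_id", StringType())
                  .add("advocate_id", StringType()))

bronze = spark.read.format("delta").load("s3://example-bucket/delta/events_bronze")

expanded = (bronze
            .withColumn("payload", F.from_json("payload", payload_schema))
            .select("event_date", "event_type", "payload.*"))   # star expansion

# Z-order co-locates rows by event_type within the files of each partition,
# so a lookup on event_type no longer has to scan the whole table.
spark.sql("""
  OPTIMIZE delta.`s3://example-bucket/delta/events_bronze`
  ZORDER BY (event_type)
""")
```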
Denny: That's pretty cool. So basically, because of the nature of combining PySpark with Structured Streaming and Delta Lake, you're actually able to have that reliable store, and because of the JSON payload attributes that you're pulling out, even though we sort of bashed on SQL a little bit before, it's actually nice and simple now. So for any of the analysts that want to get access to it, it's a very simple SQL statement for them to extract this data.
Christina: Exactly. So we have all incoming messages passing into a bronze Delta sink that's being incrementally ingested using Structured Streaming. We don't do it 24/7; we use trigger availableNow, so we update this bronze sink once an hour. And we do use a data medallion architecture here, so bronze, silver, gold, and the people who are writing the silver and gold jobs actually have a lot more understanding of which attributes to look for for a particular event type.
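A minimal sketch of that hourly bronze ingestion is below, assuming the exported event files land under a cloud storage prefix; all names and the schema are placeholders. The availableNow trigger processes whatever is new and then stops, so the same streaming query can be scheduled like an hourly batch job.

```python
# Hourly bronze ingestion: a batch-style run of a streaming query (placeholder names).
# Assumes an active SparkSession `spark` with Delta Lake configured.
from pyspark.sql.types import StructType, StringType, TimestampType

schema = (StructType()
          .add("event_date", StringType())
          .add("event_type", StringType())
          .add("timestamp", TimestampType())
          .add("payload", StringType()))

raw = (spark.readStream
       .schema(schema)
       .json("gs://example-bucket/pubsub-export/"))

(raw.writeStream
 .format("delta")
 .option("checkpointLocation", "gs://example-bucket/_checkpoints/events_bronze")
 .partitionBy("event_date")
 .trigger(availableNow=True)        # drain whatever has arrived, then stop
 .start("gs://example-bucket/delta/events_bronze")
 .awaitTermination())
```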
Christina: So even though the bronze table is a little ugly, okay, it has 100 columns and lots of them are nulls, this is overall a good trade-off between the number of things you have to develop and the cleanliness. Ideally, probably one event group should have one destination, but as I said, this service really wasn't designed with analytics in mind. A schema registry and publishing events to a queue, that's the holy grail!
Denny: So that way, in case you ever need to go back, because the business logic was wrong or there was some failure in processing, you can always go back to the original source and reprocess it from there. And then subsequently, as you called out, there's the Delta medallion architecture, where we're talking about a data quality framework going from bronze to silver to gold: the bronze is where the data drops down, silver is filtered, and gold is basically the business-level aggregations or roll-ups.
Christina: That is exactly it, yeah, exactly. And there are so many cool things you can do with Spark Structured Streaming that with a data warehouse sink you would not be able to easily accomplish, like streaming aggregation and streaming deduplication. So, because we...
Christina: There's actually a lot of consideration that went into this design. From the log router, there are actually several destinations you can choose other than BigQuery; you could choose Cloud Storage or Pub/Sub, for instance. But if you use Cloud Storage as a native destination, the file gets rotated every hour.
Christina: So that's a huge latency penalty, and you also don't have a lot of control over the layout of the storage; you can't do things like lexical ordering for Structured Streaming optimization. And then, okay, we tried Pub/Sub, and technically you don't have to use Spark Structured Streaming, there's a Python library that can ingest from a Pub/Sub source, but we like Structured Streaming because of the fault tolerance and checkpointing mechanism, and we really don't want to worry too much about tuning the Python job and worrying about, okay...
Christina: ...how many messages go in a batch and that sort of thing; Spark handles that for us. And we actually used an Apache Beam job, also open source technology, running on Dataflow, which is Google Cloud's managed streaming pipeline, to copy the Pub/Sub messages to cloud storage. Why? Because we want that infinite retention and easy backfill: Pub/Sub doesn't retain messages forever, but analytics events are a mission-critical data set for us.
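Very roughly, that Beam job has the shape sketched below; the subscription, bucket, and pipeline options are placeholders, and Google's pre-built "Pub/Sub to Text Files on Cloud Storage" Dataflow template covers the same pattern, so treat this as an illustration rather than the team's actual code.

```python
# Rough sketch of a Beam/Dataflow job copying Pub/Sub messages to Cloud Storage
# in fixed five-minute windows. Subscription and bucket names are placeholders.
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# Add runner/project/region/temp_location options to actually submit to Dataflow.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadPubSub" >> ReadFromPubSub(
           subscription="projects/example-project/subscriptions/analytics-events")
     | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
     | "FiveMinuteWindows" >> beam.WindowInto(window.FixedWindows(5 * 60))
     | "WriteFiles" >> fileio.WriteToFiles(path="gs://example-bucket/pubsub-export/"))
```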
Christina: It impacts the way that we train Sebastian and the way advocates get evaluated on their speed to answer, so we want to be correct as well; it's a lot of balancing between speed, reliability, and scale. So we ended up exporting Pub/Sub messages to cloud storage every five minutes, and right now we trigger the ingestion pipelines to run every hour. But another factor that we really like about Structured Streaming is that it's actually one unified API for both batch and streaming.
Christina: So you use trigger availableNow for a batch-style workload, and with the same code and data layout you can change it to go real time. All of our bronze and silver jobs are streaming aggregations in nature, because we know that all the events should be received within five minutes, that's how the file rotation works, so we're able to compute data as they arrive near each other. We're using a watermark with a window of 10 minutes for...
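A sketch of what those streaming operations look like follows; the column names and paths are assumptions. The watermark tells Spark how long to keep state for late events before a window is finalized, which is what bounds the state for both the deduplication and the aggregation.

```python
# Streaming deduplication and a windowed aggregation with a 10-minute watermark.
# Column names and paths are illustrative; assumes an active SparkSession `spark`.
from pyspark.sql import functions as F

events = (spark.readStream
          .format("delta")
          .load("s3://example-bucket/delta/events_bronze")
          .withWatermark("event_timestamp", "10 minutes"))

# Drop duplicate deliveries of the same event; state is bounded by the watermark.
deduped = events.dropDuplicates(["event_id", "event_timestamp"])

# Streaming aggregation: counts per event type in 10-minute windows, emitted once
# the watermark says no more late data is expected for that window.
counts = (deduped
          .groupBy(F.window("event_timestamp", "10 minutes"), "event_type")
          .count())
```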
Denny: There's no change in business logic outside of literally a line of code that you comment out. I think Reynold Xin had actually explained this when he talked about Spark streaming all those years ago; I believe the term was continuous applications.
Denny: I'm referring to, I think, a Spark Summit from about four years ago, where the context was basically: what's great about Spark streaming is the idea that you don't actually have to think about streaming anymore; you're decoupling the business logic from the latency, basically, yeah.
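Continuing the sketch above, that one line of code is literally the trigger on the write; everything upstream of it stays the same. The paths are placeholders.

```python
# Same business logic, different latency: only the trigger line changes.
writer = (counts.writeStream
          .format("delta")
          .outputMode("append")
          .option("checkpointLocation", "s3://example-bucket/_checkpoints/events_silver"))

# Batch-style, scheduled run:
writer.trigger(availableNow=True).start("s3://example-bucket/delta/events_silver")

# Near-real-time run: comment out the line above and use a processing-time trigger instead.
# writer.trigger(processingTime="1 minute").start("s3://example-bucket/delta/events_silver")
```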
Christina: Exactly. So, at a high level, these are the things we consider when we design a pipeline: what is the format of the data, where does it come from, how often does it get updated? But, more importantly, the one-time historical load will be easy; how do you accommodate new data, and how do you avoid recomputing historical data? And that's exactly what Structured Streaming offered us, with these window-based functions and streaming deduplication and aggregation.
Denny: Got it. So, almost as a wrap-up then, it seems to me that what it boils down to is that you've migrated, like the title says, from a data warehouse to a lakehouse with Structured Streaming and Delta Lake. Structured Streaming, as we've talked about, gives you the ability to deal with batch and streaming at the exact same time, and streaming aggregation and streaming deduplication are amazingly powerful tools.
Denny: Delta Lake gives you the reliability, but the crux of it all is that it's all open. And so, from your perspective, the migration from the warehouse to a lakehouse really is because you have multiple clouds, multiple warehouses, all these different systems where there are egress costs and ingress costs, and you're basically trying to prevent that lock-in. In essence, that's what it boils down to: if I switch to a lakehouse, I've got an open system, and then you're able to build whatever you...
Christina: Yeah, so it's not only the vendor lock-in, but cloud lock-in as well, right? I mean, it's easy if all of your infrastructure is, say, in AWS us-west-2, but the reality, especially at bigger organizations, is that we increasingly need to live with a multi-cloud environment. So that's where I think open format really shines.
Denny: Got it, got it, cool. I mean, this is probably a good end for this particular episode; this has been super interesting. Any tidbits, any advice you want to give to people who are facing the very thing that you're doing right now, which is multi-cloud, multi-system migration from legacy? Any other little tidbits? Because this has been super interesting, super helpful, and I just want to leave you the last little note, basically, yeah.
Christina: I guess one of the most important lessons that I have learned is that it's actually incredibly difficult to compare vendors or clouds. You're always comparing apples to oranges, and the cost estimate itself could require a separate degree; I really wish we had one for cloud cost understanding, or cost reduction. But I think I've been fortunate to be able to work in both AWS and GCP and understand some of the nuances when it comes to network ingress and egress, but I...
Denny: Thank you very much for saying that; this has been a wonderful session, and you've been super helpful. So, for any of the folks that are watching us on the vidcast, or eventually on Spotify for the podcast, join us on the Slack at go.delta.io if you have any questions. But Christina, I want to say thank you very much for taking your time to speak to me today.