Description
Considering shifting gears into Spark data engineering? Join this fun session with Simon Whiteley (@mrsiwhiteley) and Denny Lee (@dennylee) as they chat through their meandering journeys from SQL Server & BI to Apache Spark, Delta Lake, and the emerging Data Lakehouse approach. Be prepared for a geeky, trans-Atlantic event from two data nerds.
A: So, if you are watching us live, or are just launching in, you're probably seeing us go ahead and mess around with technology and all that stuff. Hi there, I'm Denny. Welcome to the tech chat with Simon and Denny. Basically, we're here to talk about our experiences from the standpoint of SQL Server and BI to Spark, Delta Lake, and lakehouses. So basically we wanted to go ahead and have you guys chat if you'd love to. Sorry, it looks like... there we go.
A: So, now that we've survived our technical issues: if you have any questions, please drop them into the chat or the Q&A, and for those of you on YouTube Live, go ahead and chime in there as well. I'll be monitoring YouTube Live, and Simon and I will be monitoring the Q&A when the other person isn't talking, basically. So, Simon, without further ado.
B: Okay, so hello, I'm Simon Whiteley. Hi! Yeah, who am I? I don't know, I'm just a tech... Spark person thing. So I run a consultancy in the UK, but generally I'm getting around all over the place at the moment: clients in the US, clients in Europe, clients all over the place, because there's lots of people suddenly doing lakes and Spark and big data stuff that haven't gone there previously. And that's kind of what I spend most of my time doing: going around helping people out, learning Spark, learning how it integrates with the rest of Microsoft Azure. I'm a Microsoft MVP, so I spend lots of time doing Microsoft stuff and talking to those guys about product directions and all that kind of stuff, and I've recently started making lots and lots of YouTube videos about the stuff that we know and we like, which happens to be Databricks. Hence why I'm suddenly talking to lots of dangerous people.
A: Rock on! Well, okay, then, that actually is a good segue to myself. My name is Denny Lee. I am a developer advocate at Databricks. I am a former SQL Server guy myself; literally, I was part of the SQL Server team itself. I was the... my goodness, what was my title? SQL Customer Advisory Team, Data Warehousing and BI Lead, I think that was it. That's why I had such a hard time figuring out what the heck it was: it's some asinine title. But nevertheless, the context is, I worked with some of the largest enterprises that use SQL Server. I was post-sales, not pre-sales: I was there to go ahead and help you get the max out of SQL Server. So, yeah. And then, due to a story which I'll probably talk about a little bit later, I went ahead and made that transition into Spark.
A: I was fortunate enough to be able to join Databricks, because this is an awesome company. And then, just like Simon, I'm really interested in having the conversations about how you work within the context of both the Microsoft Azure world and also the Databricks Spark, Delta Lake, you know, lakehouse world, right? And the fact is that those concepts actually aren't that far apart, right? In fact, at least from where I'm coming from (I'm sure Simon and I have slightly different experiences), the fact is, there's a lot of similarities, and the transition, while in some cases a little rough, wasn't actually that bad to go from one to the other. And so, in fact, that's probably really the basis for the first question that we want to ask. And by the way, please do continue asking questions inside the chat and the Q&A; we're going to go ahead and definitely answer them.
A: It's just that we're probably going to start with a little bit of a... not a script, because we have no script, by the way. This is completely live, and so if you hear my three-year-old in the background, that's live: you're listening to my three-year-old in the background, right? So the idea was more a matter of, we had some things that we want to cover first, just to provide some context, and then we want to definitely dive into the questions. So definitely chime in and put your questions in. But then, yeah, I mean, Simon, I think maybe the first thing we should be talking about a little bit, now that people have some background of who we are, is: what was your first data job like? Not your first job, like, not newspaper boy or anything like that, just your first data job, and how does that compare to your current job, what you're doing right now?
B: Yeah, I mean, I think I started in the same place that so many people started: just, you know, as a reporting grunt in a support team, right. So, fresh out of uni, having done the likes of business and stats and all that kind of stuff, I did one of those year-long intern things at IBM, and at the time it was a balance between CRM and some reporting. And it was like, your job for this week... it was like three days a week I was meant to be basically building a report using Lotus 1-2-3, in the dark ages, and it was like: manage all this data together, copy that column over. It was like a giant script of manual things to do to then put data into a report. So I was there for like a month, just chopping and changing columns around, and I was like, have we ever asked if we could just get it in the right format? Like, no, no, no, it's a big report.
B: And they had this whole Lotus Notes thing, taking emails and putting them into a reporting system, basically, from their CRM system, and then they could do views, which were like little reports and little sales dashboards. And I just took that and started building and building, and by the end of that year it was their fully fledged reporting tool; they did a load of stuff in it. So I went back to uni for a year and then kicked off joining support and stuff, doing a little bit of CRM and loads and loads of reporting, and then that was SQL Server, and that was Microsoft BI. Well...
B: Actually, at that point it was Access. You know, Monday morning, every Monday: run the giant Access database of doom that spits out a load of Access reports, and it takes the entire day cranking out PDFs. You know, world of pain. Eventually we migrated that over to SQL Server, started learning about BI, started introducing MDX, but it was just kind of, you know, slowly learning that stuff, and that was all internal.
B: I was there for like six, seven years, just again learning little bits, but at that point I wasn't at all involved in the community. You know, I didn't get out there and go to meetups and go to talks, so we were kind of scratching together what we knew and what we could learn. And then I left, looked outside, and went, oh, okay, yeah, we could have been doing this a whole lot differently. And then, yeah, after that, Microsoft, and onwards. But yeah, I think a lot of people have had that Monday morning "it's your job to make sure the reports go out" kind of pain.
A: No, I actually absolutely 1,000% agree with you on that one. I mean, the reality is, whenever you start talking about data projects, that's exactly it. Now, admittedly, I did not come from Lotus 1-2-3; that was not my background. But nevertheless, one of my first jobs basically was just to do web analytics. Actually, before that first job I want to talk about... because before that it was just sort of, you know, tossing some data inside a database. I did play with cubes, actually. Actually, no, maybe I should talk about the first one, because, in fact, I just realized my first one actually was part of Microsoft IT. And so, what was interesting about Microsoft IT?
A: It was that we were actually the first team to build the very first OLAP cube, like the very first Analysis Services cube ever. This is back when, you know, Microsoft had purchased that portion of Panorama Software, and the Netz brothers had come from Israel over to Redmond. And literally I was on the team; it was myself and a guy named Jim Berg, so shout out to Jim if he happens to be online right now, and Dave Shuba. Oh, I definitely want to give a shout out to him. We basically went ahead, as part of Microsoft IT, in HR, like, you know, human resources, all right. This is back in the time where the hierarchy from Bill Gates to myself, little me, was only seven levels. Okay, yeah, right, yeah, exactly. So we built the first Analysis...
A: ...oh, sorry, OLAP Services at the time, cube. And so that's how I got into it, and then slowly, because of that, the transition went into web analytics. And so that's actually where I was introduced to the idea of distributed computing early on, even though we didn't have Hadoop or anything like that; it was just more the concept of doing that, basically. And then, because I was doing web analytics and was constantly dealing with very large cubes, that transitioned to joining Bing to help them build really large SQL Server and Analysis Services instances.
A: Then, at the time, with a buddy of mine named Bella Obed (he's actually a solution architect here at Databricks as well), we built, at the time, the largest cube at adCenter, at 6.5 terabytes, or 6.1 terabytes, or whatever it was. That was a cube of that size, and that transitioned into the awesomeness of the Yahoo cube.
A: That's the... yeah, exactly. So, for those of you who don't know what that is: the Yahoo cube, at least in the Microsoft SQL Server BI side of the house, is a 24-terabyte cube that sat on top of a 5,000-node Hadoop cluster, with a massively large Oracle RAC as its staging server, actually, which is amazingly painful. Yeah, exactly, it was just a tad painful. So, yeah.
A: It turned out to be the largest cube, and shortly after we built that is actually when I got introduced to Spark, right? Because the Yahoo team was going, let's stop transferring all of that data from the 5,000-node Hadoop cluster over to this one cube, right? And just to give you some context (and this sort of leads to our next, you know, quote-unquote question of our script here): it took us 72 hours just to process a quarter's worth of data.
A: But I want to start with your story, Simon: tales from the trenches, like, what are some of the issues that we got into? And then I think we actually have some Q&A here from Bob that would probably be very applicable for us to talk about, things like how important ETL and data wrangling are to getting things done. You know, like, yeah, so...
A: Exactly, we'll get to that later. So, yeah, sorry, we're going to answer that one live. So we apologize... I apologize for going ahead and not calling that out. So go ahead: tales from the trenches, some of the pain and the headaches that you've been going through yourself, Simon.
A: No, no, you are, you are, don't worry. It's just, you know, a call out to everybody else: yes, Simon absolutely is. I just happen to have really good stories at the most insane sizes, that's all. That's actually not normal. If it's normal for you to build a 24-terabyte cube, you need to start really not doing that. That's the best answer: just don't do that.
B: So, the normal-ish one, which I look back on kind of fondly, with a sort of Stockholm syndrome: this was the thing that punished me, and it wasn't even that long a project. It was like a three-, four-month project, fairly early in my consultancy life. So I'd been working away as a reporting guy, done some MDX, joined the consultancy, and towards the end of my first year I was going, okay, actually, this is how all this cool stuff works.
B: There was a client who was doing some interesting stuff, like market research, trying to read, sort of, we'll compare that to that, and that, and that. And it's like, okay, this is fine. And, you know, consultancy: the client kind of goes, oh, and then on this, and it goes like this, this, this, this. And doing it in MDX just got gnarly. So essentially it's, you know, just a load of...
B: It wasn't pretty, but the fact that we got this thing actually working... I was like, okay, I'm amazed that it's actually accurate and gives you correct things. And they're like, it's not very fast. It's just like, what do you expect? And yeah, I just keep looking back at stuff like that. These days, you know, MDX isn't to be seen, no one goes near it, and it just makes me sad. I used to love MDX, because of some of the horrible glory.
B: Yeah... problems. But that one's always stuck with me as, kind of, the goalposts just moving and me just desperately trying to catch up, with it just growing and growing. Like, MDX calculations for gift cards? Please, no.
A: Oh no, oh no, I'm with you a thousand percent. I mean, exactly to your point: I remember doing additive measures myself, and like, oh, distinct count. That was the bane of my existence. The absolute bane. Funny, right? These days that's like, great, easy. Exactly, yeah, exactly, exactly. But at the time, right, at the time: the bane. And exactly to your point, in a lot of ways, actually, your cubes, what you had to build, were actually harder than mine, right? Because you actually had a lot of business logic, complex calculations that you had to go through, that you actually had to understand. It wasn't even about getting the data from the storage engine. Oh, sorry, to provide context: Analysis Services had a formula engine, which is basically how it does its queries, and it had a storage engine. The storage engine was grabbing data from disk and chucking it up to memory, and the formula engine was actually the part that did the calculations.
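Denny's "distinct count" pain is worth unpacking: unlike a sum, a distinct count cannot be pre-aggregated per partition and then added up, which is why the engine had to touch so much detail data. A minimal Python sketch of the problem (the user names and partitions are invented for illustration):

```python
# Why DISTINCT COUNT was "the bane of my existence" in cubes:
# SUM-style measures are additive across partitions, distinct counts are not,
# because the same user can show up in more than one partition.

partition_jan = ["alice", "bob", "carol"]   # visitors in January
partition_feb = ["bob", "carol", "dave"]    # visitors in February

# Additive measure: the total is just the sum of per-partition totals.
total_visits = len(partition_jan) + len(partition_feb)

# Naive "sum of distincts" overcounts users seen in both months.
naive_distinct = len(set(partition_jan)) + len(set(partition_feb))

# The correct distinct count needs the union of the underlying sets,
# i.e. the engine must see the detail data (or a sketch such as HyperLogLog).
true_distinct = len(set(partition_jan) | set(partition_feb))

print(total_visits, naive_distinct, true_distinct)  # 6 6 4
```

This is also why "these days that's like, great, easy": modern engines ship approximate distinct-count sketches that do merge cleanly across partitions.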
A: So, in your case, Simon, a lot of your stuff was actually very formula-engine heavy. In other words, it wasn't so much about getting the data off of disk; once you got it off disk, it would actually take a long time to process the data in memory. That was your problem. And so, in a lot of ways, my problems were simpler, right? Because even though I was dealing with a massively large cube, basically it was more about, er...
A: ...can I speed up the disk fast enough? That was it. It was all about, can I get the... because if you think about the calculations: if I were to do semi-additive measures on a 24-terabyte cube, it's not going to work. Yeah, it's just not going to work; let's not bother with the pretense, right? So I did simplify... we did simplify the cube, right. Hats off to Dave Mariani, who actually was leading the project, so I want to call that out: I helped him, I...
A: I wasn't the lead of it, because I was at Microsoft; Yahoo's the one who created it, so shout out to Dave Mariani here. Okay. So they went ahead, and it was just about getting the data into memory faster. So if you look at some of the older presentations Thomas Kejser and I did, it was things like, you know, we love SSDs, right, because they allowed random IOPS, so we could get the data from disk into memory fast enough.
A: No, no, you're right. No, you're absolutely right: the four V's, volume, variety, veracity and velocity, right? Exactly, the four V's. Sorry: volume, yeah, velocity...
B: Veracity came more as a kind of reaction to the lake stuff, you know. The fact that it kind of took off, when someone's like, I'll chuck data into this thing like it was a network drive and just go, it's fine. And, cool, oh yeah, it's fine, exactly. And then no one had a clue what was going in there, and no one could trust anything, and it's like, well, this thing is entirely pointless. So I kind of like putting in veracity just to say: you still have to manage your data. It's not magic.
B: It still needs some kind of management in there. And that's like another real thing right now when talking big data, you know. It's like the number of people where I say, you know, Spark's really good, Spark can help you out, and they go, we don't have big data. And it's like...
B: What do you mean? And they go, we don't have huge amounts of data. It's like, cool, but what kind of data are you dealing with? Oh, we're ingesting a live stream of till data that's in fairly nested JSON. It's like: you have a big data problem. That is an exotic data type with some gnarly unstructured stuff in there, and it's coming in as a stream. That's the very definition of dealing with a big data problem. But there's not that much data!
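Simon's till-data example is the variety problem in miniature: even a modest feed of nested JSON needs real wrangling before it looks like rows. Here is a toy pure-Python flattener (the event shape is invented); at scale, Spark expresses the same thing declaratively with schemas and nested-column access:

```python
import json

def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names, e.g. 'basket.total'."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

# One invented point-of-sale event, as it might arrive off a stream.
event = json.loads("""
{
  "till_id": 42,
  "basket": {"total": 18.5, "items": 3},
  "store": {"region": "uk", "id": "LDN-01"}
}
""")

row = flatten(event)
print(row["basket.total"], row["store.id"])  # 18.5 LDN-01
```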
B: And that's what infuriates me: that everyone sees this thing as this big "I don't need that tool, because that tool is for people who are this tall", got to be this tall to ride, kind of thing, right? So, just pulling those things out: as much as I hated the four V's originally, because that's, like, you know, big data marketing people getting a buzzword out (oh yeah, yeah, yeah), it's a great thing to have that conversation around.
A: Right. And, more importantly, the callout... because this is actually one I was calling out when we originally... oh, sorry, I forgot, I think I forgot to mention this. One of the projects I was involved with: I was actually on the Project Isotope team. That was the team that actually built what is now currently known as HDInsight, okay? So we were the ones who brought Hadoop into Microsoft, which was a fun conversation to have with lots of people, by the way.
A: But, number one, I like your four V's, by the way, so I'm going to stick with that: volume, velocity, variety and veracity. I love that one. Number one, okay, so we're in agreement. Number two: what we would typically consider a big data problem wasn't the fact that you had volume and nothing else, or velocity and nothing else, because theoretically you could then just have a single system handle volume, right?
A: If it's just volume, you could literally chuck it into Azure Blob Storage or, you know, ADLS Gen2 and be done for the day, right? Or if it's, you know, velocity, you could literally write custom code for that type of thing, and so forth and so forth, right? The problem was that you had all of the above, or a combination, or some of them, like, you know, two of the four, or whatever, right? And typically, especially in this day and age...
A: ...it's really, in all seriousness, three of the four or four of the four anyway, right? Even if you don't have the volume aspect, you'll often have that velocity aspect, which is, just like you said, streaming JSON coming in. Then you have the variety, in terms of: you're not just looking at one set of data, right?
A: You're looking at JSON, plus you're looking at a database, plus some CSVs while you're at it, plus, you know, REST API calls for social or whatever else, and you've got to combine all that together. And exactly to your point: veracity, right? This is the entire idea that you actually need reliability underneath that, right? Because the great thing about data lakes was that I could go ahead and chuck all my data in there and not worry about it. Oh, by the way, somebody just chimed in and asked: is it possible to re-watch this?
A: We actually put this on YouTube Live concurrently as well, so you're more than welcome to go ahead and just watch it there. So, anyways, back to my point: if you have all this data coming in, the reality is you need to actually make sure it's reliable. You actually have to manage it. And so, at this point, this is where, I think (at least that's my guess, because, you know, in fact, I think this is only our second time)...
A: ...we've actually talked together. Even though we know each other, this is actually only our second time talking. This ultimately led us not just into Spark data engineering, even though we came from the SQL Server side of the house, but it also led us into things like Delta Lake, because it brought us back to ACID transactions: something we missed, something we actually loved having before. And so, yeah, I mean, Simon, what do you think? You know, paint a little picture.
B: One thing I want to pull out, just before stepping into that, is going back to... so Bob had that question about, you know, what about ETL, right? So...
B: We were talking about business logic, and we were talking about encoding all the actual end calculations that you do at runtime on top of a semantic layer, right? And, you know, you get to a point where, putting calculations in, there's an amount of domain understanding that you need for that, right? And then, from a consultancy point of view, you're going from client to client to client and seeing they all had the same problems getting the data in there, the same problems making it trustworthy and quality, the same problems trying to just hop through the same things. That then became the interesting tech problem, going:
B: How do we just solve that? How do we make it so that when we get to a client we can just go, yeah, fine, getting the data in is easy, and then we can start fixing your actual business problem and start getting into the nuts and bolts of it, right? You're going through client after client after client, and the tech is moving, and each time you do it there's a slightly better approach you can take. And then I find myself looking back going...
B: It's been a long time since I've actually looked at the customer data, honestly. You know, you can build an almost abstracted "how do you manage data", and I don't care what the data is about. You care about the shape of the data, how fast the data comes in, the volume of it, the requirements for doing that stuff. But whether you're a bank, or a retailer, or in marketing, or whoever you are, actually they all have the same data problems.
B: You know, there are some real common, similar data-engineering-style problems that you see across it, and that's when you start talking data engineering, right? So we mentioned Spark data engineering, but back when we started it was BI, ETL, all this. And exactly: there's a transition that people are making, going, you know, I'm no longer just building an SSIS package so I can get it into a cube.
B: I'm now designing a reasonable data pipeline that I'm actually sort of programming, and I'm actually having to take software engineering principles to make it decoupled, and have a microservice-style architecture and all of that kind of stuff. And that's a big shift, and I think that scares a lot of people. A lot of people are going: oh, they're talking software engineering, they're talking coding standards, they're talking unit testing on a Python build, making a wheel, and that sounds so far away. And, yeah, it has...
B: Things have changed in terms of what it takes to actually do that stuff. But for me, it's changed in that the amount of work has gotten much less, right? It's just slightly more complex.
B: You know, each new feed I need to load in, and you sit there copying and pasting a templated SSIS package and changing the connection, and the next one, and then the next one. Or you write a Biml script. Biml is a markup language, awesome tool, but geez: just writing C# and XML and having to hack them together into a nested loop that then generates things for you on the fly is fairly painful, to say the least. And, you know, I'm fairly cheeky about the, you know, DevOps story, right?
B: You've still got to get them out, and so that takes DevOps and deployments and slick processes. Whereas, you know, with Spark and all the modern stuff, we can just write a generic, reusable package that's metadata-driven. So if you want to say, I want to onboard a new data set, that's then configuration; it's a bit of JSON I'm going to add.
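The metadata-driven pattern Simon describes can be sketched in a few lines: one generic loader, with each feed described by configuration, so onboarding a new data set is another config entry rather than another copy-pasted package. This is a minimal sketch; the feed names, fields, and the stand-in reader are all invented:

```python
# Config, not code: one entry per feed, all driven through the same loader.
FEEDS = [
    {"name": "sales",     "source": "sales.csv",     "key": "order_id"},
    {"name": "customers", "source": "customers.csv", "key": "customer_id"},
    # Onboarding a new data set = adding one dict, not a new package:
    {"name": "stores",    "source": "stores.csv",    "key": "store_id"},
]

def load_feed(feed, read_source):
    """Generic load path: dedupe on the configured business key."""
    seen, deduped = set(), []
    for row in read_source(feed["source"]):
        if row[feed["key"]] not in seen:
            seen.add(row[feed["key"]])
            deduped.append(row)
    return deduped

def fake_reader(path):
    # Stand-in for real storage access; returns a duplicated row on purpose.
    return [{"order_id": 1}, {"order_id": 1}, {"order_id": 2}]

loaded = load_feed(FEEDS[0], fake_reader)
print(len(loaded))  # 2
```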
B: And, like, the whole, you know, slogan is: you can't deploy faster than not deploying, right? So if it doesn't involve deployments, that's going to go faster, no matter how good your code-generation stuff is. And it's like a whole evolution of thinking about that as an almost isolated problem, right? So going from very much "the customer's trying to predict this value", or "they're trying to report on this thing", where it's all about that end user, to taking steps back: how do you get smarter?
B: It becomes a separate technical challenge of its own, and that's super interesting to me, and that's where we are now. All these tools around Delta Lake and all these kinds of new technologies, they're all just evolutions of that same thing. They're all ways for us to make that slicker: remove work, remove pain. You know, so the way we'd been building data lakes, we used Parquet...
B: ...because it's a columnar store and it's super fast for aggregations and all that. And then your client's going, yeah, but I still need some Kimball-style stuff, so can you make that a slowly changing dimension? Slowly changing dimensions, the pain of my life, follow me everywhere. And you're having to build something going, okay, so I've got a gigantic table of Parquet, I've got some change coming in, so I need to write a script that says lift them both up...
B: It's like, that's my life: just taking buckets and buckets of script and going, okay, now it's just that much script; okay, now it's that much script; now it's just one command. And then life is just getting easier, and that's what does it for me. Delta is just a whole bucket of utilities, meaning a load of stuff that we could do before...
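The "one command" Simon lands on here is Delta Lake's MERGE (`MERGE INTO` in SQL, or `DeltaTable.merge` in the Python API). As a toy in-memory sketch of the upsert semantics it replaces those buckets of script with: matched keys get updated, unmatched keys get inserted. The table contents below are invented:

```python
# Target "table" keyed by id, plus a batch of incoming changes.
target = {
    1: {"id": 1, "city": "Seattle"},
    2: {"id": 2, "city": "London"},
}
changes = [
    {"id": 2, "city": "Leeds"},     # existing key -> update
    {"id": 3, "city": "New York"},  # new key -> insert
]

def merge(target, changes, key="id"):
    """Upsert: update when matched on the key, insert when not matched."""
    for row in changes:
        target[row[key]] = row
    return target

merged = merge(target, changes)
print(sorted(merged))  # [1, 2, 3]
```

In Delta itself the same intent is one statement, with the engine handling the file rewrites, concurrency, and the transaction log underneath.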
B: ...it just now takes a hell of a lot less work, and it's just a lot more approachable. And approachable is, like, the key, right? That whole imposter syndrome: I can't do big data, I'm not a big data person, I've never... I don't write Scala, I didn't do MapReduce back then. Being able to say, you know what, actually, if you can write a bit of SQL, you can actually use Delta, and you can start doing things that encapsulate all of the Parquet, all of the big data stuff...
B: ...all the big, massively parallel processing, distributed engine stuff. Basically, if you can write a bit of SQL, you can now use a load of it, and that's cool. And that's the big shift that, for me, has happened in the past... it's only really the past three, four years that it's become that approachable. Okay, you could do it before, but it took a bit more engineering and config and setting up, you know. So, HDInsight...
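For contrast, the map/shuffle/reduce plumbing that "doing big data" used to imply looked roughly like this, for a count that today is a one-line `GROUP BY` in Spark SQL. A pure-Python sketch with invented input lines:

```python
from itertools import groupby
from operator import itemgetter

lines = ["spark is fast", "delta is reliable", "spark is approachable"]

# map: emit (key, 1) pairs for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# shuffle: bring each key's pairs together (groupby needs sorted input)
mapped.sort(key=itemgetter(0))
shuffled = {k: [v for _, v in g] for k, g in groupby(mapped, key=itemgetter(0))}

# reduce: sum each key's values
counts = {word: sum(ones) for word, ones in shuffled.items()}

print(counts["spark"], counts["is"])  # 2 3
```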
B: But I came fairly late to that stuff, honestly, because I went on a meandering journey, so...
B: Yeah, good. And then, you know, Data Factory v1 came out, and it was just the biggest bag of spanners known to man. It's like, great... but we went down the whole path of Data Lake Analytics. That was my first real kind of foray into that stuff. Oh my god, writing U-SQL. You know, because I'd been writing C# for various different things, and going, okay, it's like C# and SQL jammed together. I get this.
B
I
think
this
makes
sense
and
building
code
generators
right
to
spin
up
adla
jobs.
They
went
on
demand
and
that's
like
that's
great,
I'm
not
having
to
I'm
not
having
to
build
anything.
I've
got
my
little
job
function,
but
on
the
fly
writes
some
new
sql
scripts
kicks
the
job
off.
That
does
some
stuff.
B
I
I
don't
need
to
write
ssis
anymore,
I'm
just
like
happy
happy
days
and
then
from
there
you
know
that
evolved.
Like
kind
of
future
uncertainty,
kind
of
things,
data
breaks
came
released.
I
managed
to
sneak
on.
I
was
so
there
was
a
microsoft
internal
training
course
for
likes
of
some
of
the
microsoft's.
Like
csa
or
yeah,
yeah,
sp,
yeah,
exactly
yeah
and
me
as
a
partner,
we
were
like
you
know
what
we're
super
interested
in
databricks.
Now
we
can
sneak
you
in.
B: So it's like, me and, I think, a couple of others were just about the only people who weren't Microsoft in the room, going, they've not noticed, it's cool, when they were first doing the Databricks rollout in the UK. And then, you know, we just tried to use it, trying to piece it in. And that's fairly late in terms of Spark, right? We're already talking versions of Spark that had become so much easier. I'm looking at it going, wow, I am so glad I didn't start back in 2012 when it was literally just cranking things by hand. So, when you started, when you did your hop, was that RDD land? Was that it?
A: Oh, absolutely. No, no: I was involved with Spark back in 0.7, so when the project was still in Berkeley, right? So I was starting to mess around with it then, shortly after we made HDInsight beta, like when the name Isotope was still prevalent. And, by the way, there were only like nine of us that created that project, which is pretty sweet.
A: I had already dived into Spark, because a bunch of us who had been working on the project recognized some of the issues with Hadoop, right? It allowed us to process massive amounts of data, which was great, especially for the scenarios that we were trying to address. But there wasn't the same level of clarity that, for example, you have right now. At the time...
A: ...it was just more like: you couldn't process that data, period. So it wasn't about speed anymore; it was just about the fact that I couldn't even do it, right? Like, with the sheer... look, at the time, terabytes, hundreds of terabytes, was considered a really hard thing to do, and we were approaching petabytes already at that time. This was like 10 years ago, right? So, you know, that's why we had no choice but to distribute the problem.
A: It was a while before Michael Armbrust and Matei introduced this concept of schema RDDs, which of course became what is now known as DataFrames, back in the Spark 1.0 to 1.3 time frame, right? And so the funny story was that I actually had a regular sync with my friends at Yahoo, because, you know, that 24-terabyte cube, right? And what ended up happening is, purely by accident, all of our meetings ended up being about Spark.
A: So it was just like, oh, okay. Well, it wasn't like I was telling them Spark, or they were telling me; we both came to that conclusion in separate ways. At least in our case it was just, like I said, purely about the fact that the data was that large: we couldn't process it, so we had to have distribution, but we still wanted it to not take three days to process the data. Yeah, yeah, I'm just saying, you know, yeah, exactly. I mean, I realized...
A
A
A
The idea also that these queries that would take maybe hours were now taking
like
you
know,
minutes
and
so
that
and
then,
but
yes
exactly
to
your
point,
actually
the
running
joke was that, because I
was
actually
trying
to
initially
work
with
hadoop
right
in
terms
of
actually
working
with
the
internals
and
all
that
stuff
and
then,
of
course,
invariably
that
led
to
working
with
spark
and
its internals.
So
don't
get
me
wrong.
I'm
not
trying
to
pretend
that
I'm
really
good
at
what
I'm
doing.
A
I'm not; I'm okay, right?
I
started
writing
a
lot
of
my
code,
of
course,
in
scala
right
and
then,
and
so
the
reason
I
wanted
to
just
do.
A
real,
interesting
callout
is
because
holden
karau
she's
one
of
of
the
awesome
people
that
was
able
to
push
forward
with
pyspark
right.
So
because
you
know
why
we
have
pyspark,
I
mean
don't
get
me
wrong,
I'm
not
trying
to
discredit
other
people,
I'm
just
simply
calling
out
there.
A
There
are
a
lot
of
really
good
people
and
holden
was
one
of
those
people
right
that
helped
push
through
pyspark.
The
the
reason
I'm
calling
out
this
running
joke
is
because,
as
I
started,
diving
into
data
science,
of
course I just started
doing
python
myself
and
then
did
pyspark,
but
when
she
wanted
to
do
performance
she
started
getting
into
scala.
So
invariably,
even
though
I
started
in
scala
and
she
started
with pyspark,
her
most
recent
spark
book
was
written
in
primarily
in
scala.
A
I.e. the person who helped us create pyspark is now
writing
in
scala
and
then
my
first
book
in spark was learning pyspark.
That
book
actually
was
well
obviously
written
in
python,
even
though
I
was
a
scala
engineer
first,
and
so
yes,
and
at
that
time,
that
as
the
evolution
happened,
I
was
perfectly
happy
saying:
oh,
it's
bouncing
back
and forth between
scala
and
python.
In
fact,
here's
a
shameless
plug
to
learning
spark.
A
A
That
was
insanity
on
its
own
right,
okay,
but
I
do
mean,
like
writing,
like
at
least
it
wasn't
java.
So
at
least
I
wasn't
like
writing
like
the
reams
of
java
code,
but
still
I
would
write
like
scala
code
or
exactly
to
your
point,
like
the
merge
statement,
like
you
know
now,
with
spark.
Actually,
the
scala
api
is
actually
pretty
smooth
too,
but
the
idea
originally
what
I
had
to
write
in
scala
versus
right
now.
A
A
simple
little
merge
statement
is
in
sql
yeah,
exactly
to
your
point
like
as
time
progresses
and
as
people
who,
in
some
ways
I
wish.
I
didn't
hadn't
I
mean.
Obviously
I
don't
wish
I
didn't
have
to
do
it,
but
in
terms
of
if
I
was
starting
from
a
data
warehousing
perspective,
oh
boy,
yeah
it'd be
a
lot
simpler
to
jump
to
it.
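The merge pattern being discussed here, sketched as a spark sql upsert against a delta lake table (the table and column names are hypothetical, purely to illustrate the shape of the statement):

```sql
-- Upsert: update rows that already exist in the target, insert the rest.
-- `events` and `updates` are hypothetical table names.
MERGE INTO events AS target
USING updates AS source
  ON target.event_id = source.event_id
WHEN MATCHED THEN
  UPDATE SET target.value = source.value, target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (event_id, value, updated_at)
  VALUES (source.event_id, source.value, source.updated_at);
```

Hand-writing the same upsert as a join-and-rewrite in scala is exactly the boilerplate being contrasted with the sql version here.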
A
Now,
because,
with
spark sql you
have
the
friendliness
and
the
awesomeness
and
the
power
of
your
sql
language,
yet
it
actually
can
be
distributed,
which
is
awesome,
sauce
right
and
at
the
same
time,
the
idea
that
now
we
have
delta
lake,
it
allowed
us
to
go
ahead
and
actually
have
acid
transactions
and
the
combination
of
the
two
together
allowed,
especially
with
some
of
the
apis
that
were
included
in
spark
3.0
with
delta,
lake
0.7.0.
A
That
allowed
us
to
actually
significantly
simplify
the
manageability
of
a
distributed
system,
because
we
all
know
a
distributed
system
actually
is
harder
to
maintain,
not
easy
to
maintain
like
if
I
was
to
maintain
a
single
sql
server
instance.
That's
actually
not
that
hard.
If
I'm
trying
to
maintain
50
of
them,
that's
a
little
tricky
yeah,
just
just
a
little
bit
so
yeah.
Sorry.
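The spark 3.0 plus delta lake 0.7.0 sql support mentioned above covers things like defining and inspecting delta tables straight from sql; a rough sketch with hypothetical names (exact command support varies by delta lake version):

```sql
-- Define a Delta table directly in SQL (Delta Lake 0.7.0+ on Spark 3.0).
CREATE TABLE customer_orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10, 2),
  order_date  DATE
) USING DELTA
PARTITIONED BY (order_date);

-- Every write is an ACID transaction; the transaction log can be
-- inspected afterwards to see who changed what, and when.
DESCRIBE HISTORY customer_orders;
```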
B
But those kinds of things, you know. So many clients
like
we've,
worked
with
they've
gone.
Like
you
know,
someone
high
up
has
gone
we're
having
a
modern
platform,
we're
going
to
get
people
in
and
we're
going
to
put
a
modern
platform
in,
and
we
speak
to
the
warehousing
guys
and
they're
like
oh,
I
don't
want
to
learn scala or python,
they're,
just
looking
in.
A
B
Got
an
amazing
distributed
system,
we're
gonna,
do
some
cool
stuff:
let's
go,
and
then
you
start
working
and
it's
like
how
do
you
get
the
data
in
okay?
Well,
this
is
a
merge
statement.
I'm
like
wait.
What
it's
just
like
this
anchor
point
of
familiarity-
and
you
know
we
teach
them
pyspark.
We
teach
them.
B
B
A
B
Ham-fistedly
going
that's
probably
gonna.
A
Work
so,
okay,
we
actually
only
have
16
minutes
left
and
I
just
so
this
one
we
went
a
little
longer
than
we
thought,
but
all
right,
actually,
let's
dive
right
into
some
of
the
questions,
because
actually
this
is
a
good
segue.
So,
for
example,
let's
go
back
to
the
first
one,
which
is,
I
understand,
delta
lake
from
previous
presentations.
But
what
is
what
are?
What
is
a
lake
house
like?
Why
are
lake
houses
important,
so
we
actually
covered
it
without
actually
explicitly
calling
out.
A
So
why
don't
you
start
and,
and
then
I'll
chime
in
as
well
for
that
matter.
B
Okay,
so
specifically,
data
lakes
have
an
absolute
ton
of
flexibility
and
power
and
all
that
crazy,
big
datay
stuff
of
handling different kinds
of
exotic
data
types.
Vast
amounts
of
data
streaming
all
that
kind
of
stuff
data lakes
are
awesome
at
it,
but
structurally
having
that
kind
of
hey
I've
got
a
schema.
I
know
what
structure the data is in.
I'm
doing
some
regulatory
reporting
and
I
need
to
actually
manage
this
thing.
I
don't
want
to
use
it
to
come
in
and
just
accidentally
delete
my
data
because
they've
got
access
to
that.
B
All
of
that
kind
of
stuff
has
always
historically
been
a
little
bit
flaky
in
the
data
lake
land,
then
over
on
the
warehouse
side.
You've
got
all
of
that
so
transactional
consistency,
management
of
schemas,
deployability
control,
auditing,
awesome,
and
then
you
know
you
try
and
get
jason
in
there
and
it's
like.
B
Oh, we've got a json column now. How do you write a json parsing thing right there?
oh
god,
and
it's
just
like
there's
so
many
things
that
it's
just
hard
and
especially
these
days
when
people
are
going
there's
some
new
data
we've
got
an
opportunity.
Can
we
take
advantage
of
that
new
data
and,
like
traditional
warehousing
teams,
are
going
yeah
yeah?
Absolutely
our
next production deployment
is
scheduled
for
three
weeks
time.
Is
that
soon
enough
I
mean,
I
know
three
weeks
is
being
generous.
B
Normally
we're
talking,
you
know,
monthly
cadences
at
the
end
of
this
project
we
might
do
a
deployment
and
then,
by
that
point
you've
missed
the boat.
You
know,
so
it's
missing
out
on
all
the
stuff
in
the
lake
that
you
can
do
just
you
know
what
actually
bring
some
data
in.
Let's
do
some
generic
landing
of
some
data
and
we'll
figure
out
what
we
want
to
do
with
it
later
and
like
there's
kind
of
the
two
different
sides
of
the
fence.
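The json flexibility being described can be sketched in spark sql, where semi-structured files sitting in the lake are queried in place and nested fields addressed directly (the path and field names are made up for illustration):

```sql
-- Query raw JSON files in the lake without loading a warehouse first.
CREATE TEMPORARY VIEW raw_clicks
USING json
OPTIONS (path '/lake/landing/clicks/');

-- Nested attributes come out with plain dot notation.
SELECT device.os, COUNT(*) AS clicks
FROM raw_clicks
GROUP BY device.os;
```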
B
A
Perfect.
Well
I
mean
I
don't
think
I
could
do
much
better.
I
I'll
just
do
the
short,
the
short
phrase,
tag
line,
which
basically
is
best
of
both
worlds
of
data,
warehousing
and
data
lakes,
the
manageability
of
warehouses
with
the
flexibility
of
lakes,
right
that
that's
the
the
marketing
tagline,
but
you
dive
in
a
little
bit
deeper
when
you,
I
think
simon
called
it
out
perfectly
right.
The
reality
is
like
we're
not
done
yet.
A
There's,
obviously
things
the
community
as
a
whole,
whether
it's
the
spark
community
or
the
delta
lake
community,
or
just
the
overall
data
community.
You
know
we
still
have
more
work
to
do.
Let's call a spade a spade here.
There
are
things
that
we
can
do
to
improve,
but
the
reality
is
that's
what
lake
houses
are,
which
is
to
say
that
for
me
like
okay,
actually,
I
know
this
sounds
like
a
like.
A
I'm
going
off
track
a
little
bit,
but
in
fact
it's
actually
an
important
component
like
when,
when
simon
and
I
started
talking
about
our
past-
it's
not
just
because
we're
reminiscing.
Okay,
I
mean
yes,
we
are
too
okay,
but
but
the
reason
why
we're
doing
this
is
because
there's
a
fundamental
concept
that
there's
actually
a
model,
that's
supposed
to
be
applied
to
your
data
right
there's.
Actually,
your
data
is
actually
important
right
when
we
went
to
lakes.
The
whole
premise
is
that
we
were
just
trying
to
chuck
the
stuff
in
as
fast
as
we
could.
A
So
we
kept on saying things like schema on read, schema on read, schema on read: we'll worry about the schema later, we'll worry about whether it's important later, right,
and
there
is
value
to
that
statement
by
the
way,
I'm
not
saying
there
is
no
value
to
that
statement.
Quite
the
opposite.
A
We
needed
to
actually
go,
read
it
and
do
something
with
it
and
build
a
model
on
it
I.e.
The
schema
is
the
model
for
your
data
right.
We
needed
to
do
things
like
that.
So
that
way,
people
remember
what
the
value
of
that
is. We have
these
awesome
marketing
and
I'm
being
very
facetious.
When
I
say
this
awesome
marketing,
taglines,
like
oh
yeah,
there's
the new
oil
is
data
or
the
new
gold
is
data
or
whatever
you
know
that
type
of
bs
right
and
that's
that's
a
great
marketing
tagline,
A
But
the
fact
is
the truth of the statement,
If,
even
if
I
was
to
follow
the
marketing
tagline,
is
that
yeah
there's
a
lot
of
work,
though
right
oil
doesn't
just
automatically
come
out of the ground already processed, right?
data
is
the
same
thing
right,
so
it
required
us
to
do
a
lot
of
things
to
to
make
sense
of
it
and
so
for
me,
it's
not
just
about
manageability.
A
It's
also
about
remembering
the
value
of
your
data
and
reapplying
that
back,
which
is
why
lake
houses
are
so
important
to
me,
because
it's
not
just
the
technical
construct
or
you
know,
like
you,
sometimes
hear
us
say
paradigm
and, admittedly,
That's
a
marketing
term.
So
I
also
apologize
for
using
that.
But
the
reason
we
often
use
the
word
paradigm
is
because
it's
not
it's,
because
it's
not
just
a
tactical
innovation.
A
B
B
Right
and
then
there's
a
ton
of
like
evolution,
that's
been built on
it,
which
is
all
the
data
management,
patterns
and
things
like
slowly
changing dimensions
and
things
like
auditing,
lineage
columns
and
things
like
making
a
fact
table
with
all
your
data
quality,
lineage
data
and
calling
that
an
audit
fact,
and
if you think about it,
it's
just
data
management,
and
I
can
do
that
in
a
lake
really
easily
and
you
know
so
it
used
to
be.
B
You
know
if
you
had
like
some
kind
of
fact
table
and
dimension
tables
that
didn't
perform
too
well
in
spark,
but
spark
3,
because
we
got
dynamic partition pruning,
so
you
can
actually
filter
your
date table and have that actually correctly filter the fact table.
Suddenly
that
unlocks
all
that
stuff
that
stuff's
a
hell
of
a
lot
easier
to
do
so,
you
can
actually
get
a
lot
of
that fairly easily, which,
if
you
talk
to
a
big
data
person
and
say
I
am
doing
kimball
in
my
lake,
they
go.
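The spark 3 star-schema win being described, dynamic partition pruning, kicks in on queries shaped like this (hypothetical kimball-style tables, with the fact table partitioned on the date key):

```sql
-- Fact table partitioned by date_key; dim_date is a small dimension.
-- With dynamic partition pruning (Spark 3.0+), the filter on dim_date
-- is pushed across the join so only matching fact partitions are scanned.
SELECT d.fiscal_quarter, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d
  ON f.date_key = d.date_key
WHERE d.fiscal_year = 2020
GROUP BY d.fiscal_quarter;
```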
B
B
A
B
So
you
can
go
to
a
certain
part
of
your
lake
and
you
know
the
data
in
there
is just as it came in. It's enter at your own risk.
You
need
to
be
able
to
understand
the
data
that
might
not
be
right.
The middle bit: that data
has
been
sanity
checked,
it's
been
sense-checked,
it's
had
some
quality
cleaning
done.
You
can
go
and
trust
it,
but
it's
still
its
original
format
so
requires
a
bit
of
skill
to
figure
it
out
and
know.
B
What's
going
on
there
and
then
some
kind
of
curated
this
is
this
is
managed
trustworthy.
This
has
been
shaped
for
ease
of
querying
now.
Sometimes
facts
and
dimensions
make
sense
for
that,
because
the
kind
of
data
and
the
kind
of
people
you're
showing
it
to
sometimes
it's
a
big
wide
reporting
table.
Sometimes
it's
a
mix
of
the
two
in
some
other
shape
and
that's
absolutely
fine
and
that's
just
options
right.
That's
just
that
you.
B
We
now
have
the
flexibility
for
all
of
the
different
data
management
paradigms
that
we've
used
to
get
in
so
many
different
places.
They
can
actually
fit
and
there's
no
longer
a
technological
barrier
saying
you
have
to
use
this
way.
You
have
to
use
it
you're
not
allowed
to
do
a
star schema.
It's
now
manage the
data
in
a
model
that
makes
sense
for
your
business
purpose,
which
is
great
right.
B
A
Right
and
actually
exactly
to
your
point,
so
this
is
a
slightly
plug
for
the
databricks
youtube
channel,
but
I
did
want
to
call
out
that,
like
in
the
databricks
data
and
ai
online
meetup
that
you're
on
right
now
in
the
databricks
youtube
channel.
We
actually
have
videos
a
la
kimball
talking
about
surrogate
key
generation,
the
importance
of
them
right
for
delta
lake
for
data
lakes,
slowly
changing
type
two
dimensions
in
your
data
lake
right
cdc
in
your
data
lake
right
it's
so
exactly
to
simon's
point
it's
like.
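A slowly changing dimension type 2 load, one of the warehouse patterns those videos cover, can be sketched as a delta lake merge (the dimension and its `is_current`/date columns are hypothetical, following one common SCD2 convention; a full load also needs a second pass, typically via a staged union, to insert the new version of rows it closes out):

```sql
-- Close out the current row for changed customers, insert brand-new ones.
MERGE INTO dim_customer AS t
USING staged_customers AS s
  ON t.customer_id = s.customer_id AND t.is_current = true
WHEN MATCHED AND t.address <> s.address THEN
  UPDATE SET t.is_current = false, t.end_date = current_date()
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, is_current, start_date, end_date)
  VALUES (s.customer_id, s.address, true, current_date(), null);
```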
A
I
get
the
idea
of
using
this
these
techniques
and
in
some
cases
people
are
going
like.
Why
would
you
try
to
apply
that,
to
you
know
a
data
warehousing
technique
to
a
data
lake
and
I'm
going
like
well
because
it
it's
not
like
the
concept
was
wrong.
The
concepts
actually
make
a
ton
of
sense.
It's
just
that
now,
with
you
know,
spark
especially
with
spark,
3.0
and
and
and
and
delta,
like
I'm
actually
able
to
do
that
now.
A
B
Yeah,
so
previously
it
was
a
square
box.
I
had
a
massively
round
peg,
I'm
just
trying
to
squeeze
into
it
and
try
to
force
it
to
it
right.
Well,
it's
not
to
say
that
absolutely
everything
still
stands
right.
You
know,
from the olden days.
If
you
ever
had
a
string
or
sorry
have
you
ever
had
a
varchar
on
your
fact
table,
then
you
are
the
devil
and
that's.
B
Right,
that's
fine!
You
can
denormalize
some
things
onto
your
fact
table
for
ease
of
querying
because
actually
parquet, with column compression, dictionaries, run-length encoding.
All
of
that
stuff
squeezes
down
really
nicely
exactly
exactly
so
there's
some
stuff
so
that
that's
like
the
use
case
right
if
you've
got
like
a conformed
dimension,
you've
got
a
hierarchy
and
you
need
to
manage
that
if
you're
insisting
on
having
your
big
wide
reporting
tables,
and
then
you
need
to
change
how
your
product
hierarchy
works,
and
you
suddenly
have
to
regenerate
all
of
your
giant
transaction
reporting
tables.
B
B
You
can
use
and
go,
but
that
still
makes
still
makes
sense
but
yeah,
but
for
a
long
time
it
was
very
much
the
you
weren't
one
of
the
cool
kids
if
you're
trying
to
do
like
some
traditional
data
modeling
in
big
data
land,
but
people do want
to
do
that
and
those
things
make
sense
the
business
and
that's
how
they
think
about
their
data
in
a
lot
of
places.
A
Cool
okay,
so
you
know
what
we're
probably
gonna
need
to
wrap
up
in
two
minutes.
I
just
realized
because
we
were
running
along
so
for
all
the
people
that
have
asked
questions
in
both
youtube
live
and
the
q
a
or
the
chat
we
apologize
for
not
getting
to
all
of
them.
I
just
want
to
start
off
with
that,
based
on
the
feedback
that
we're
getting
it
sounds
like
simon.
You-
and
I
probably
should
do
this
a
couple
more
times,
so
so
so
we'll
definitely
plan
to
do
so.
A
A
B
B
A
No,
no,
no!
No!
It's!
Okay!
I'm
a
kimball
guy,
too,
all
right!
So
most
people,
a
lot
of
my
sql server
friends
are
like
are
like,
like
how
can
you
say:
you're
a
kimball
guy
when
you
went
into
hadoop
and
I'm
like
you
know
what
there
is
a
fairness
to
that
statement.
There
really
is
okay,
so
I'm
not
gonna
go
ahead
and
actually
disagree
with
that.
A
B
A
Idiot
all
right,
dude,
okay,
let's
wrap
it
up,
we're
gonna
have
to
go.
I
did
want
to
call
out
two
things
again,
one
the
youtube
link.
Is
there
put
your
questions
since
we're,
since
we
did
not
answer
your
questions
and
there
are
too
many,
we
apologize
simon
and
I
obviously
are
having
a
little
too
much
fun
here,
so
put
them
onto
youtube
chime
in
there.
A
We're
gonna
use
those
as
the
basis
for
our
next
show,
simon
I'll,
find
another
time
to
do
this
number
one
number
two:
we,
small
plug,
we
do
have
a
show
next
week
on
the
24th,
so
come
join
us
for
that.
Oh,
so,
that's
a
completely
different
show
on
the
automation
of
pyspark.
It's
actually
the
data
collab
lab
with
franco
myself,
so
it'll
be
a
little
bit
of
fun
there.
It's
also
very
much
sql
sql
centric,
so
I
definitely
would
love
you
guys
to
join
for
that.
A
B
Yes,
well
next
week,
we're
both
at
big.
A
data
london,
no,
no
I'm
not
going to
big
data,
london,
but
my
boss,
ali
is
gonna,
be
at big data london.
No,
no!
It's
okay,
we're
still
gonna
be
there.
I
just
I
can't
go
for
other
reasons.
That's
the
reason
why
that's
all
sorry.
B
So
yeah
ali
is
doing
a
keynote
on.
This
is
the
data
lake
house.
This
is
why
we
did
this
whole
thing,
I'm
doing
a
session
which
is
actually
here's
all
the
actual
individual
bits
of
delta,
which
enables
data
lake
house.
So if that's something you guys are interested in, big data london is
happening
next
week
on
wednesday
thursday,
I'm
going.
A
B
Yeah
well
again,
thanks
for
coming
again,
you've
got
your
youtube
channel.
I've
got
my
youtube
channels
to
look
out
for
advancing
analytics
and
we're
talking
all
things.
databricks and spark
and.