From YouTube: Simon + Denny Ask Us Anything - November 15, 2022
Description
Join us for the monthly series "Simon and Denny - Ask Us Anything!" on Tuesday, November 15, 2022 where we will answer your data engineering questions from building a data platform to ingestion to ETL to analytics. With our background in SQL Server and BI to Apache Spark and Delta Lake - we want to show you how to build your own lakehouse.
As this session is interactive, come prepared to ask questions all throughout the session! Be prepared for another geeky, trans-Atlantic event from two data nerds.
Learn more about Delta Lake
Read Our Newest Blog Post: https://delta.io/blog
Join us on Slack: https://go.delta.io/slack
Delta Lake Releases: https://github.com/delta-io/delta/releases
A
Okay, there we go. Perfect. All right, hi everybody. As we slowly trickle in for today on Zoom, we're also getting ourselves broadcast to LinkedIn and to YouTube at the exact same time. So because of that, give us a couple of minutes to go ahead and get everything jump-started. Candace in the background from the Linux Foundation is setting up all the events, so basically we're just getting ourselves ready. Give us maybe 30 more seconds and we should be good to go.

A
Meanwhile, if you're on YouTube, LinkedIn or Zoom, why don't you tell us where you're based out of? For example, I myself, Denny, am based out of sunny... yes, actually, sunny Seattle. Go for it.

B
Well, I don't know. I mean, Kent in England, where it's rainy and grim and miserable, and it's not a holiday anymore, and I'm sad. Fine.

A
All right, perfect. Looks like we are live on all three platforms, so this is great. We've got some folks from New Jersey, from Leeds, from Miami. Ed, I'm hoping at least you from Miami are enjoying the hot weather. Northern California, I think it's not cold, at least I hope not. So there you go. Before we dive in... oh good, he's already commented back.

A
He says it's awesome down in Miami. Awesome for you, man. All right, one thing, just to provide some context and remind you all: this is basically an open forum where you're going to be asking myself, Denny Lee, here and Simon some awesome questions. Hopefully they're data engineering, Delta Lake, Spark and SQL related; at least those are the ones we intend to answer, I should put it that way.
A
So we can probably talk a little bit about that as a starting point, but if you have other questions or other contexts that you'd like to dive into, by all means go ahead and chime in and put your questions in the Q&A. Sorry, the Q&A is for Zoom; for LinkedIn, just drop them right here, where I'm actually typing right now. And perfect, Ed's already started with the GDPR question, and we actually have a question from YouTube.

A
So let me start with that one from YouTube and then we'll switch over to Ed for the GDPR question, and again, keep on chiming in. From YouTube there's an initial question from Tariq, which is a great question, although unfortunately I don't have a good answer here. And Tram, by the way, if you're trying to type your questions, type them exactly like you just did in the Q&A; that's exactly where you're supposed to put them, this is a perfect way to do it. But Tariq, your question: any idea if the book Delta Lake: The Definitive Guide will be out in print anytime soon? And candidly, the answer is, unfortunately, probably another year or so. Based on the feedback that we received from folks, we do need to do a rewrite. Part of it was the rewrite; part of it was also the fact that we had the release of Delta 2.0, which basically restructured a sizable chunk of the book.

A
So that's what I meant by the restructuring of the book. By the same token, like I said, there's a lot of things that happened because of Delta 2.0, so because of that we are currently in the process of rewriting the book. As soon as we have some structure around that, I will gladly go ahead and inform you all when that book's available. But right now, I'd say honestly the timeline's more like a year from now. Okay, so, all right, perfect, all right.
A
Now we have the first question, which is really the GDPR one from Ed from my awesome Miami. So I'll read out the question and then, Simon, you and I can figure out who wants to answer it first. Okay, he's got a loaded question, so let's start off with the pain right away. Okay, on GDPR: what is the best practice for using Azure ADLS and connecting it to Databricks, also considering that Unity Catalog may or may not be in the picture right now, but what should be?

B
I'm just skimming through it; there's a lot in terms of ADLS container and lake structuring in there, which we can go into. If you want to talk about Unity Catalog before we get into that, then you go ahead, and then we'll dive into how you should structure your lake for good security boundary concerns.

B
All right, okay, so the wider question. So yeah, best practice for using ADLS and connecting to Databricks, considering Unity Catalog may not be in the picture now but could be in the future. Should we use one storage container for everything, all layers in all environments? One storage container per environment: dev, QA, prod? How do you mount it? How does all that stuff fit together? So there's some basic advice around there in terms of separation of concerns.

B
Certainly when we're talking GDPR and all that kind of stuff, we can dig into it. I think that's already a whiteboard question. If that's all right, give me a moment, I will dive into a whiteboard.
B
Always. Yes, except weirdly my pen's kind of stopped working, so I have to shake it like it was a real biro. Now it's... it's authenticity. All right, okay. If you're building a lake, so you're saying we've got all layers; you know, we talked about, maybe you've got bronze, silver, gold, that kind of thing. I always annoy people by not using the silver and gold, kind of, you know, periodic-table naming. So you've got layers.

B
Yeah, yeah, fair enough, yeah. So we've got kind of our dev, test...

B
...we've got prod. So the question we're asking is: if I'm in Azure, for this question (other cloud providers have similar mechanisms), should I have a single blob storage account and then one container, kind of, you know, the traditional, sort of Hadoop-style root container, and then it's all just folder structures? That's option one. Option two, exactly, so...

B
Well, actually, you know, maybe I can keep having one storage account, but my containers are my dev, test and prod. Or the other one is saying: do I actually need real separation, and should I have entirely separate storage accounts and then maybe break down my bronze, silver and gold as my separate container layers? So there's lots of different ways we can configure around it. The easy answer is: environments should always be separated by Azure resources.
B
You
should
never
have
your
prod
data
in
the
same
Azure
Resource
as
your
Dev
data,
it's
your
test
date
and
all
that
kind
of
stuff.
A
few
reasons
for
that.
One
separation
of
admin
concerns
the
kind
of
people
you
want
to
give
admin
access
to
your
Dev
Storage
Lake.
If
you're
playing
around
and
testing
stuff,
you
don't
want
to
give
those
same
people
full
carte
blanche,
access
to
your
product
data.
That's
just
not
a
good
idea!
B
So
generally,
you
always
want
to
be
talking
about
having
your
entire
Dev
resources,
your
entire
test
resources,
your
entire
broad
resources,
because
you
don't
want
to
give
your
developers
full
absolute
access
to
all
prod
data
you're
not
going
to
pass
any
infosec
Security
reviews,
if
you're
doing
that,
so
environments
should
always
be
separated
just
from
a
security
concerns.
Point
of
view
whether
you
want
to
have
different
different
containers
for
your
different
things.
That's
saying:
I've
got
my
Dev
storage
account
and
maybe
I
have
containers
in
here.
B
For
my
bronze,
my
silver,
my
gold,
you
can
do
or
do
you
have
just
one
generic
one
and
then
that's
just
folder
structures
inside
having
a
generic
one
means
you
only
need
to
mount
it
to
databricks
once
so.
It's
slightly
easier
for
configuration
having
separate
containers
yeah,
you
can
do
admin
a
little
bit
separately.
There's
not
a
huge
amount
you
can
do
at
the
container
level
rather
than
the
storage
account
level.
So
that's
just
kind
of
a
tidiness
thing.
How
do
you?
B
How
do
you
want
to
structure
it
either
way
you
can
create
containers
programmatically,
you
can
create
folders,
programmatically
you're,
not
really
limited
by
a
which
way
around
you
go
with
those.
It's
just
there's
a
slight
bit
of
extra
work.
If
you
have
separate
containers
that
you
have
to
mount
each
and
every
container
as
a
separate
thing,
just
when
you're
actually
working
that
out
or
not
Mount
create
external
credentials.
If
you
go
in
the
unity,
catalog
route,
so
general
advice,
I
absolute
storage
accounts
separate
for
different
environments.
B
Whether
or
not
you
have
different
containers
kind
of
depends
on
how
often
you're
going
to
change
layers.
If
you
think
you'll
ever
change
and
add
new
layers
into
your
leg,
if
they're
fairly
fixed,
then
yeah,
you
do
the
effort
of
setting
up
once
and
then
it's
fine.
B
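A minimal sketch of the separation Simon draws on the whiteboard, assuming one ADLS Gen2 storage account per environment with bronze/silver/gold as containers inside each; all account and container names below are hypothetical.

```python
# One storage account per environment; layers live as containers (or folders)
# inside each account. All names here are made up for illustration.
ENV_ACCOUNTS = {
    "dev":  "mycorplakedev",
    "test": "mycorplaketest",
    "prod": "mycorplakeprod",
}

def layer_path(env: str, layer: str, table: str) -> str:
    """Return an abfss:// URI like abfss://<layer>@<account>.dfs.core.windows.net/<table>."""
    account = ENV_ACCOUNTS[env]
    return f"abfss://{layer}@{account}.dfs.core.windows.net/{table}"

# e.g. spark.read.format("delta").load(layer_path("prod", "silver", "sales"))
```

Admin access is then granted per storage account, so dev admins never touch prod, while the per-layer container split stays the tidiness choice discussed above.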
A
That was a cool answer, man. All right, so Ed, I'm sure you have a bunch of questions to follow up on that, but at least this is probably a good starting point for us. So if you do have a follow-up, by all means go ahead and chime in here. Okay, so great question and a great answer, Simon. Let's go right to the next one. Okay, and let's see... oh okay, you and I... you or I could answer those questions.
A
No, exactly, no, I apologize. All right, so it is actually pretty common to use, actually, even the open-source methodology to share your Delta tables. There's a Delta Sharing server, and it's actually completely open source: you go to github.com/delta-io/delta-sharing, okay, and the server code is basically there. And, like we noted, I'm going to answer the open-source one, and then the non-Databricks employee is going to answer the UC one, okay.

A
Because that's how we roll here. And so the context is that, because your data is already sitting inside scalable, reliable storage, i.e. ADLS Gen2, or if you're using AWS S3 or Google Cloud Storage or HDFS or whatever else, it can scale basically based off of that. And so what the Delta Sharing server is basically, simply doing is saying: hey, do you have the rights or the credentials to access this data?

A
That's all it's really doing. So when you, as the client, whether you're using pandas or Spark or Tableau or Power BI or whatever it is that you want to do, you're basically providing a token that says: hey, here are the credentials, I'm allowed to access this data. Once I'm about to access the data, all the Delta Sharing server does is provide pre-signed URLs, directly from the cloud storage, to the client.

A
So what that means is: whatever workspace, whether it's a Databricks one, whether it's Azure Synapse, whatever, I don't care, whatever workspace you're currently working with, because it's got the token it will do its job. Let's say, for the fun of it, I'm using Python pandas. It will basically use those pre-signed URLs to grab the Delta tables and then upload them... oh, excuse me, download them, and be able to query them. Those pre-signed URLs actually have a security timeout.

A
I forget exactly what the timeout is, but basically after X number of seconds, or whatever it is, you no longer have access to them. But basically this allows you to share the data without actually replicating the data, because you already have scalable storage. Now, underneath the covers maybe you've also geo-replicated the data and things of that nature, and that actually helps things out, but ultimately what it boils down to is that it's coming from the cloud storage itself, and you do not need to replicate the data to do that.
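A minimal sketch of the consumer side Denny describes, using the open-source delta-sharing Python client; the profile file path and the share/schema/table names are hypothetical.

```python
# pip install delta-sharing
import delta_sharing

# The profile file holds the sharing server endpoint plus the bearer token
# the recipient was given; the server validates the token and hands back
# short-lived pre-signed URLs that point straight at the cloud storage.
profile = "/path/to/my-company.share"
table_url = profile + "#sales_share.analytics.transactions"

# Load the shared Delta table as a pandas DataFrame (fine for small tables);
# load_as_spark() is the equivalent when running inside a Spark session.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```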
B
Man, there we go, cool. Okay, so essentially, one, I'm assuming we're talking about external tables, or unmanaged tables, where we've got... I need to shake my pen again and it magically starts working, what is that? And I've got... we've got our Databricks workspace sitting here, so we've got a Delta table that exists inside our lake, and because we're saying I want to share it between workspaces, I'm assuming you've registered that as a Delta table so people can see it in the SQL layer.

B
It's in Hive, it's in the metastore. So he's saying, well, how do I share that, amongst other things? So this is the old-school way of doing it, right: I've got a separate Databricks workspace, and there's nothing to stop me writing a bit of code saying CREATE TABLE over this thing, with a LOCATION, and just pointing at that existing Delta table. You can have multiple different workspaces all using the same bit of data, all exposing it; you can refer to it as different table names; you can do whatever you want. That's been in...

B
You know, you can do that in any kind of Spark environment; you can just register a SQL table over a file. But you'd have to do that manually. You know, if you spat out 100 Delta tables into your lake and you wanted to replicate those 100 tables, or register them, and federate those registrations across a load of different Databricks workspaces, you've got to manage a load of metadata. You need to run a load of CREATE TABLE scripts across a load of different workspaces, and it's honestly a bit of a pain.
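A minimal sketch of the "old school" registration Simon is describing: pointing a metastore entry at an existing Delta path from a second workspace. The path and table names are hypothetical; no data is copied, it is only a pointer.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register the same Delta files under whatever name this workspace prefers.
spark.sql("""
  CREATE TABLE IF NOT EXISTS analytics.sales
  USING DELTA
  LOCATION 'abfss://silver@mycorplakeprod.dfs.core.windows.net/sales'
""")
```

The pain point is that this script has to be run, and kept in sync, in every workspace that wants to see the table, which is exactly what the shared metastore described next removes.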
B
That's
the
old
way.
That's
the
way
we
used
to
do
things
so
we'll
have
some
process
that
creates
the
table
and
then
something
that
Rand
Robinson
goes
and
creates
them
in
a
lot
of
different
data,
which
workspaces,
which
is
really
annoying
well,
Unity,
catalog,
is
doing,
is
saying:
well
actually
this
thing
the
thing
that's
registered
here
that
kind
of
that
meta
store,
let's
not
make
it
live
in
databricks
anymore,
let's
rip
it
out
and
then
that
actually
lives
at
a
slightly
higher
grain.
B
So
it's
gonna
go
over
here
and
we
get
Unity
catalog
which
exists
at
your
tenant
and
region
level,
so
you
create
a
metastore
at
West,
Europe,
north
Europe
for
Azure.
You
know
whatever
Cloud
provisor
you're
using
you
register
the
table
with
that,
and
then
you
associate
your
workspaces
with
that
metastore.
So
it's
a
shared
metastore
that
exists
across
whatever
database
workspaces
that
you
have
so
you
can
just
register
the
table
once
and
then
use
it.
B
Many
many
times
and
again
it's
all
about
the
just
that
meta
metadata
registration,
it's
the
SQL
object
the
points
at
the
data
that
you
are
sharing
between
these
workspaces.
There's
no
data
replication
we're
not
keeping
10
20
copies
of
the
data.
So
if
you're
in
databricks
environment
you
can
switch
on
Unity
catalog,
you
can
take
that
step
and
you
can
have
that
centrally
managed
catalog,
which
is
again
good
for
governance.
It's
good
for
gdpr,
knowing
all
the
different
data
across
all
your
different
workspaces.
All
that
kind
of
stuff
is
great
for
that.
B
A
Cool, this is really awesome. I did want to remind everyone that this session actually wasn't going to be very UC-centric, even though we've got a bunch of UC questions, so try to keep them around Delta Lake and lakehouse. We'll still talk about UC a little bit, but I just want to remind folks what the main context for today's session is. Okay, let me switch over to LinkedIn. This is not directed at you, Simon; this is directed at some of the questions coming in, that's all.
A
Yeah, okay, let me switch over to LinkedIn, because we got some great questions there. So hey, one of the questions is: what is the best practice for leveraging DLT across different zones like bronze, silver and gold? My understanding is that DLT cannot cross between database targets. And Brent, it's a great question, but I'm probably going to have to ask you to clarify this one a little bit, because when you say "cross between databases", it's more a matter of: when we say you have different zones...

A
...these are tables, okay. Like your bronze set of tables that were, for the sake of argument, direct dumps of, or loads of, data from whatever databases or whatever CSV files or whatever else you're currently working with, okay. And the whole premise of saying we have a bronze and then a silver and a gold level is not about "can I cross between databases" per se.

A
It's more about saying: okay, when you're at bronze, it's basically just a dump of data; when you're at silver, we've actually filtered it out and cleansed it and made it something that's actually usable; and then for gold, it's something that's really great for aggregation or for machine learning. In other words, it's more about a data quality framework. So, if you could clarify a little bit about where you're coming from in terms of the database...
B
I
I
I
get
the
question
because
I've
got
I've,
got
the
same
complaint:
okay,
fair
enough!
So
when
we're
talking
about
you
know,
essentially
it's
all
it's
just
organization,
it's
making
it
tidy
right
right.
B
So,
if
I've,
if
I
had
100
tables,
if
I
had,
if
I
had
an
a
slightly
awkwardly
large
number
of
tables
or
100
data
feeds
that
I'm
trying
to
pull
through
and
I
pulled
them
all
in
as
bronze
tables
and
then
I
cleaned
them
all
of
the
silver
tapes
and
then
out
of
that
I
made
50
or
so
gold
tables.
Currently
in
DLT
they
would
all
live
inside
the
same
database
structure.
This
came
the
same.
B
Hive
schema,
so
I'd
have
100
brands,
100,
silver
and
then
50
gold,
so
I've
got
250
pedals
just
as
a
giant
pile
that
I
have
to
have
naming
conventions.
I
have
to
have
gold
underscore
so
I
know
where
it
lives,
yeah,
I,
can't
easily
say
Grant
ax
to
just
gold.
For
that
team.
It's
just
a.
We
can't
currently
organize
the
tables
into
that
nicer
bucket
for
managing
it.
That's
it!
Oh.
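For context, a minimal sketch of what such a pipeline looks like in Delta Live Tables, under the assumption (as described in this session) that every table it defines lands in the pipeline's single target schema; all table names and paths are hypothetical.

```python
import dlt
from pyspark.sql import functions as F

# In a DLT pipeline, `spark` is provided by the runtime.

@dlt.table(name="bronze_orders", comment="Raw dump of the orders feed")
def bronze_orders():
    return spark.read.format("json").load("/mnt/raw/orders")

@dlt.table(name="silver_orders", comment="Cleansed orders")
def silver_orders():
    return dlt.read("bronze_orders").where(F.col("order_id").isNotNull())

@dlt.table(name="gold_orders_daily", comment="Daily aggregates for reporting")
def gold_orders_daily():
    return dlt.read("silver_orders").groupBy("order_date").count()
```

All three tables publish to the one target database configured on the pipeline, hence the gold_ and silver_ naming conventions Simon mentions.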
A
Yeah, okay, so that's great color. I was obviously coming at that particular question from a very different context. So for something like that, for manageability...

A
...I would absolutely provide the feedback directly to the Databricks team. Part of the reason, in terms of, I want to say a first principle, though I think that's a little too strong a statement, but in terms of the context: a lot of the feedback that we have received, not just for Databricks but in general from the community, was the ability to basically have different catalogs for different sets of users.

A
Now, I think that also ends up having its own set of problems, as you just described, both Simon and also Brent here, so I'm not disagreeing with your assessment now that I understand the context better. So thank you for clarifying, Simon, and Brent, I would just simply say: provide that feedback directly, so that we can make that manageability significantly easier, because I do agree with your assessment, actually. It's just that we're probably not the right forum for that, that's all I'm saying. Yeah.
A
Yeah, yeah, we love frameworks around frameworks, that's great. By the way, for everybody who's listening to us right now, we are being excessively sarcastic, just in case you didn't know us that well. So, all right, this one's a little bit more freewheeling, but I still think it's great to talk about here; it caused a discussion. This is from Gordon Anthony on the LinkedIn side. Don't worry, folks on Zoom and LinkedIn, we'll go...

A
Oh sorry, and YouTube, we'll go back to you; we're just literally going back and forth between the three different platforms here. So it's from Gordon. He says: a constant discussion at work is when to finally dive into Delta Lake, not look back, and move away from our classic on-prem data warehouse and analytics stack. As people are making more of these decisions and giving people the power, what I work to say is: is it big enough? Or what is your opinion? Basically, the context is...

A
...is there any advice in terms of how to make that transition? So I'll start first, but I'm sure Simon will have a lot of his own advice on this one too. One of the things, and the reason I like this particular question even though it's a little confusing, is because I actually came from there, and actually both Simon and I both came up that way.

A
So if you even watch some of our older sessions, Simon and I talk about how we're both SQL Server hounds, and we made the transition from SQL to Delta Lake, to Spark, things of that nature.
A
I myself personally came from the SQL Server team, so, in other words, classic on-prem data warehousing. I literally came from the system where I was proposing: hey, everything should be in a SQL data warehouse. And the thing that we have to remember, when you're making that transition and saying you want to build with Delta Lake and lakehouses in general, right...

A
...the discussion is really about the fact that you're making this transition from an on-prem database, which was great for what it was very good at doing. But the data paradigm has changed since then. When we built databases, it was around the idea that we knew everything had to be structured.

A
Okay, when we talk about data lakes versus databases: databases allowed us to say, hey, we want transactional protection and we want to go ahead and have reliability, but everything is structured. With data lakes we made that transition and said: hey, no, no, we don't care about schemas, we don't care about any of that; we need flexibility and we need scalability.

A
You know, and that is why it's so fundamental for lakehouses in the first place: because it gives you that ability to have that scale, have that reliability, while also having the transactional protection. And so that's part of the reason I myself made that transition out of databases and data warehouses into the realm of Spark and Delta Lake in the first place.
A
Just because of the fact that I needed the best of both worlds in order to be able to tackle the new problems: like, for example, I want to do streaming, or I want to do machine learning or AI, or any of these other things. And so, at least from where I'm coming from, that's usually where the discussion should be, because you're recognizing that the data problems that you have are actually much more complicated than what databases by themselves could actually do. So that's my little spiel.
B
Yeah, I mean, so there's two things. One, massively, is the use cases; you know, the things that a relational database doesn't do well. If you have a ton of them knocking around, absolutely. And it's, to be super cheesy about it, the Vs of big data, right? So yeah, you've got scale, but it's good to mention there it should not just be a "how big is it" discussion. Scale is one thing: we can't process that amount of data in the amount of time that we need to.

B
You might need something that can do distributed compute, which Spark's good for. Streaming, as you said: you can't process data fast enough, we need something that can do that kind of stuff. Variety: you know, we need to be able to process a whole mix of images, data, documents, JSON, horrible gnarly nested data, anything that happens to be there. They're all good, classic use cases.

B
There are still tons of people who I help do a lift and shift of their classic legacy warehouse into a lakehouse, and that's not because of a use case; it's not some newfangled thing that's come along, it's not because their data's getting bigger and bigger and bigger. A lot of the way that I work tends to be about agility. A lot of the legacy, classic tools, when you're building something that can process data, clean data and shape data...

B
...it's very, very manual in the old set of tools. You're either using code generators and turning a crank and it spits out reams of code, or you're doing something graphical, and for every single new bit of data that you have, you have to hand-code a bit of how to process that thing and deploy it. And the move away from the legacy tech has kind of moved a lot of that; that's where the whole data engineering thing came from, for me.
B
The big argument for doing things in a lakehouse way is that the technology that sits behind it means we can do things like write one PySpark script that says: clean a bit of data from here and put it into there, look up a list of rules and dynamically apply them. So I can have one script to do that stuff, and then, if someone says "can you load more data in?", then I can just add metadata to load more data, and I...

B
...don't need to deploy code, I don't need to do a load of productionization stuff. So for me that is just such a massive sales pitch of agility, and, you know, it's an investment in the tech so that you can then speed up massively down the line. So we had the wonderful use cases, size and scale, sure; and then, do you ever need to add new data, are you constantly having to find new data sources and add them in? That is a major, major selling point for me.
A
Yeah, typically my response to something like this is: you know your data warehouse or data system is no longer being used when you don't need to change it anymore, right? The reality of a successful data project is that you actually are changing things: new business problems, new data sources, whatever else; you're changing it all the time.

A
This is not a knock on databases, by the way. When I say something like this, some people might imply that; it's not. The knock is really more toward management who think: oh yeah, I've built it, now I'm done and I have nothing else to do. Like, no, no, that's not how these systems work, right? Cool, all right.
A
Let's switch gears; we're going to switch over to, sorry, YouTube, because we have not actually answered questions on YouTube for a little while. Okay, there's two questions that I want to tackle right away, just because they've been waiting for us; one's a little bit quicker, so I'll get to that one right away.

A
There's a question from TCP Packet on YouTube, which is asking: hey, should we replace batch jobs with run-once streaming? And I believe that's a reference to trigger once or trigger availableNow. The quick answer is: in many scenarios, that's exactly what you can do. You can actually change your batch jobs into Structured Streaming jobs, because of the way Spark works when it comes to Structured Streaming: really the only difference, outside the APIs going from read to readStream or write to writeStream, is basically latency.

A
That's it. Basically, it's the same logic, same Catalyst engine, same everything that's happening underneath the covers, so it actually simplifies the way you write your code. So what that translates to is: you can then go ahead and literally use trigger once or trigger availableNow to basically look at the data that's coming in, so that way, instead of having multiple batch jobs, you can have, for the sake of argument, a single streaming job. One really cool example is actually from Comcast.

A
They actually had a sessionization problem, where they're sessionizing data from their set-top boxes, and, because they switched to Delta Lake, they went from 640 VMs down to 64, which is pretty sweet. But in addition to doing that, they went from 84 batch jobs to three streaming jobs. So, pretty impressive; but that's the whole point: from a manageability standpoint it's a lot simpler. So I just wanted to call that out. Anything else you want to add before I go to the next one?
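A minimal sketch of the point about the trigger being the only real difference: the same Structured Streaming job can run as a "batch" catch-up or continuously just by swapping the trigger. Paths are hypothetical, and availableNow assumes a reasonably recent Spark (3.3+); older versions use trigger(once=True).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

writer = (
    spark.readStream.format("delta")
         .load("/mnt/bronze/events")                       # source Delta table
         .writeStream.format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/events_silver")
)

# Run like a batch job: process everything available, then stop.
writer.trigger(availableNow=True).start("/mnt/silver/events")

# The exact same code runs continuously by changing only the trigger, e.g.:
# writer.trigger(processingTime="1 minute").start("/mnt/silver/events")
```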
B
For
us,
as
a
consultancy
out
of
the
books,
reference
architecture
is
using
stream
once
using
available.
Now,
that's
because
if
someone
says
I
do
a
back
job
cool,
if
someone
says,
can
you
make
it
streaming?
We
just
change
the
trigger
and
it's
the
same
code,
if
someone
says
actually,
rather
than
doing
a
daily
make
it
hourly,
it's
just
you
just
trigger
it
differently
and
you
don't
exchange
on
your
code.
It
makes
life
a
hell
of
a
lot
easier
for
all
of
that
stuff.
B
The
only
caveat
I
would
say
is
the
complexity
of
doing
some
of
the
bad
things
that
aren't
supported
by
streaming.
B
So,
if
you're
trying
to
do
a
merge
into
Delta
you're
trying
to
write
to
multiple
sources,
you're
trying
to
do
those
some
things
you're
trying
to
do
things
like
Drop
duplicates
on
on
there,
you
don't
want
to
do
that
on
a
stream,
because
that's
going
to
be
very
stateful,
problematic
things,
so
you
end
up
doing
that
for
each
batch.
So
you
have
your
stream,
you
have
a
for
each
batch.
B
All
your
actual
processing
logic
is
in
there
for
each
batch,
so
from
a
complexity
of
building
the
script,
there's
one
or
two
you
want
Spitz
that
you
need
to
get
around
once
you've
built
it
once.
You
can
then
use
it
for
many,
many
things,
and
it's
really
straightforward
and
that's
how
we
build
things
but
yeah
I
would
say
super
easy,
just
click
a
button
it'll
make
your
life
easier,
but
it's
a
such
a
flexible
pattern.
We
use
it
for
everything.
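A minimal sketch of the foreachBatch pattern Simon describes, where the stream hands each micro-batch to a function that is free to run batch-only operations such as MERGE or dropDuplicates; the table paths and column names are hypothetical.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def upsert_to_silver(micro_batch_df, batch_id):
    # Batch-style logic lives here: dedupe the micro-batch, then MERGE it in.
    deduped = micro_batch_df.dropDuplicates(["event_id"])
    target = DeltaTable.forPath(spark, "/mnt/silver/events")
    (target.alias("t")
           .merge(deduped.alias("s"), "t.event_id = s.event_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(spark.readStream.format("delta").load("/mnt/bronze/events")
      .writeStream
      .foreachBatch(upsert_to_silver)
      .option("checkpointLocation", "/mnt/checkpoints/events_merge")
      .trigger(availableNow=True)
      .start())
```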
A
Thanks for the callout; that's a really good call about forEachBatch. That becomes a very, very powerful tool. It takes a little while to get used to, just like Simon called out, but once you get used to it, it actually becomes very useful and very helpful. So yes, all right, next question: Louis from YouTube actually has a great question, and there is not a straightforward answer for this one.

A
So: can you explain how to disable CDC, or sorry, change data feed, on Delta tables for huge data reprocessing? Okay, so the thing about it in general is that you technically don't need to turn off change data feed, because all it's doing is leveraging the transaction log you already have, and we're simply exposing that information so you get it on a row-by-row basis. Now, to your point, though, which is valid...
A
Maybe, if you're doing a large enough data reprocessing, you don't actually want to go ahead and see all of the row-by-row changes from change data feed; you may not even want to see it in the history, okay. So in those particular situations, one of the ways, and I'm sure Simon has some other ways, but at least a pattern that I often see is: you don't actually reload the data fully into that particular table...

A
...you actually load it into a whole new table. You verify that that's the table you want, because it's a full-blown data reprocessing; you rename the old table, and you basically keep it there for a month or two, just to make sure there were no issues. That way you have both tables running side by side, and you don't run the risk of needing to keep all the history that is no longer applicable because you did need to do a reprocess. And then, once you're comfortable...
A
...then flip on change data feed on the new table, okay. So I would typically do something like that because, like I said, change data feed really is just opening up, or making available, the log that Delta already has; we didn't actually really do much more than that. I mean, this is not a knock on all the engineers that worked on it.
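For completeness, a minimal sketch of the switch itself: change data feed is controlled by a Delta table property, so it can be disabled or enabled per table with ALTER TABLE. Table names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Turn change data feed off on a table before a heavy rewrite...
spark.sql("""
  ALTER TABLE silver.sales
  SET TBLPROPERTIES (delta.enableChangeDataFeed = false)
""")

# ...and back on, e.g. on the reprocessed replacement table once it is verified.
spark.sql("""
  ALTER TABLE silver.sales_reprocessed
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")
```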
B
I guess the only challenge is if it was a partial reprocess. So if you had, like, five years of history and you wanted to reprocess the most recent year, you didn't want to lose the change feed from all four previous years, and you're just restating a given year: that's not going to work in that scenario. So, is there a solution for that scenario?

A
The thing is, like I said, in the end it doesn't avoid the problem. Even if you were to disable it, that still doesn't change the fact that you would actually end up rewriting so much of the data in the first place, and then you basically build up a gigantic transaction log and duplicate data. So that's...
A
I'm
trying
to
avoid
that
particular
problem
to
say:
is
it
possible
for
us
to
basically
like,
even
if,
if
it's
a
partial
reload?
Theoretically,
what
you
do
is
you
could?
Oh
here
we
go
like
we're,
there's
a
solution
on
the
spot
here.
If
it's
a
Year's
worth
of
data,
you
basically
build
a
view
that
points
to
the
old
table
like
that.
Has
two
thousand
you
know:
18
19
20
data
2021
is
the
one
year
of
processing.
That's
actually
in
the
new
table.
That's
gets
rebuilt.
A
Look
at
the
data,
verify
the
view
then
go
ahead
and
coalesce
everything
and
and
because
a
coal
lesson,
because
the
view
basically
skips
out
the
2021
data
from
table
a.
But
then
it's
now
and
it's
merging
sorry
unioning
with
the
data
from
table
Beach.
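A minimal sketch of that on-the-spot idea: keep the untouched history in the old table, reprocess the one year into a new table, and expose the two as a single view. Table and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
  CREATE OR REPLACE VIEW silver.sales AS
  SELECT * FROM silver.sales_history     WHERE year(event_date) <  2021
  UNION ALL
  SELECT * FROM silver.sales_reprocessed WHERE year(event_date) >= 2021
""")
```

The change feed and history on the pre-2021 table stay intact, while only the reprocessed table carries the rewritten 2021 data.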
A
Yeah, exactly, so yeah. So this is us just literally solving problems on the fly, so this may solve your problem. Sorry, this was for, who was it, Louis on YouTube? So hopefully that helps. Don't forget, join us on the Delta users Slack, go.delta.io/slack; we're there answering tons of questions. And so, okay, let me go ahead and switch back to Zoom, because we have not answered a ton of Zoom questions. Okay.

A
So let me go back to this now. So, Babak, hopefully I said your name correctly, from Dusseldorf, Germany, cool, awesome. He has a question about Parquet modular encryption. Basically, since Spark 3.2, columnar encryption is supported for Parquet tables with Apache Parquet 1.12+, so the file parts are encrypted with a key. Is this encryption compatible with the Delta format, or can we use it with Delta tables? Honestly, I...
A
...so if your system is actually able to both write and read, presumably because you're using Spark 3.2+ with your encrypted columns, then this should not be a problem at all within Delta Lake whatsoever. Saying that, I do want to put in the caveat "should work", because I have not personally tested this. But if you do run into issues, please open up a GitHub issue and/or join us on the Delta users Slack; we'd love to actually see that, because, again, I'm not expecting a problem.
B
I've never used Parquet modular encryption; I've not dug into it. Whenever we encrypt things, we've been doing it using the Spark functions for AES encryption, aes_encrypt, which is what we tend to use for doing that kind of column-level encryption and decryption. That's pushing it up to the compute level rather than the storage level. Gotcha.
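A minimal sketch of the compute-level approach Simon mentions, using the built-in aes_encrypt / aes_decrypt SQL functions (available in newer Spark releases, 3.3+). The key handling is deliberately simplified and the table and column names are hypothetical; in practice the key would come from a secrets manager.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

key = "0123456789abcdef"  # 16/24/32-byte key; normally fetched from a secret store

# Encrypt a sensitive column before writing it to the lake...
encrypted = (
    spark.table("silver.customers")
         .withColumn("email_enc", F.expr(f"aes_encrypt(email, '{key}')"))
         .drop("email")
)

# ...and decrypt it on read for users who hold the key.
decrypted = encrypted.withColumn(
    "email", F.expr(f"cast(aes_decrypt(email_enc, '{key}') as string)")
)
```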
A
Gotcha,
perfect
tram.
You've
got
a
couple
questions
specifically
about
Unity
catalog.
Honestly,
I'm
gonna
probably
have
a
skip
those
questions
and
I'd
like
you
to.
If
that's
okay,
if
you
can
ping
your
databricks
rep
for
those
specific
questions,
because
right
at
least
for
now,
you
see
specific
to
the
databricks
environment,
so
I
I,
don't
feel
either
of
us
probably
will
be
very
good.
Answering
those
questions
I
mean
unless
Simon
is
brutal.
Simon
the
non-databricks
guide
knows
more
about
UC
than
I
do
so.
A
A
Let's
see,
okay,
hero
and
hacks
for
my
I'm
thinking
for
Miami
probably
has
another
question
on
Zoom
before
we
switch
back
to
LinkedIn
what
is
happening
under
the
covers
when
executing
analytics
on
data
that
is
stored
in
different
geographical
regions
like
across
regions
or
across
clouds,
is
data
filter
remotely
and
then
copied
across
a
single
location?
A
And
how
can
this
be
performant
when
data
is
stored
remotely
from
the
current
workspace?
And
you
don't
want
to
copy
data
to
a
local
workspaces?
Okay,
there's
no
straight
answer
to
this
question:
I'm
going
to
start
it,
but
knowing
full!
A
Well, there's no straight answer. Now, with Delta Lake in general there is a concept of predicate pushdown, so there can be some filtering done at the remote location. But the reality is, because it's a remote location, and the Spark job, and I'm using Spark as an example, but whether you're using Trino or Flink or anything else it's really the same, the fact is the job, the data processing framework that you're using, okay, is not sitting in the same location as your remote data.

A
So, since it's not, there's not much you can do; you will end up still grabbing the vast majority of your data across the wire, and then the egress costs are basically going to be very, very painful. Now, how you can potentially work around that problem is by running the jobs remotely, where basically your data processing framework is also running on the remote system as well. Okay, so for the sake of argument now, I'm going to just use Spark, just because it's easier for me to talk about.
A
You have a Spark job in region A, your Spark job in region B and your Spark job in region C. You run them as disparate jobs to basically shrink the data down, because you've done a join. So, for example, you're doing a join with a demographics table: in that particular case I would copy the demographics table from your current central location to each of the regions and run the joins there. So now you've shrunk the data from, let's just say, a billion rows each, down to 200,000.

A
Okay, then what ends up happening is that the egress from region A, region B and region C to the central location is a lot less: instead of transferring a billion rows from each region, now you're only transferring 200,000 rows. So this could actually simplify the egress cost, but again, this becomes a lot more complicated to manage, because you're running, basically, remote jobs on each system. Now, I wouldn't say it's common, but it is actually a practice that some organizations do, to basically reduce the workloads or the egress costs between systems.
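A minimal sketch of that "shrink it remotely, then move it" idea: each regional job joins and reduces the data next to where it lives, and only the small results cross regions. All paths and table names are hypothetical.

```python
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# In each region, a job running next to the data does something like:
#   (spark.read.format("delta").load(local_billion_row_table)
#         .join(demographics_copy, "customer_id")
#         .groupBy("segment").count()
#         .write.format("delta").save(regional_result_path))

# The central job then only pulls the already-reduced results across the wire.
regional_results = [
    "abfss://results@lakeuswest.dfs.core.windows.net/sales_summary",
    "abfss://results@lakeeuwest.dfs.core.windows.net/sales_summary",
    "abfss://results@lakeapse.dfs.core.windows.net/sales_summary",
]
dfs = [spark.read.format("delta").load(p) for p in regional_results]
combined = reduce(lambda a, b: a.unionByName(b), dfs)
```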
A
Saying what I just said: more often than not, if you're a large enough organization in which you do need to worry about egress costs, usually you can actually work with your cloud providers, and they're actually going to go ahead and deal with each other. Because, for example, if you've got a direct connect, right, for example your on-prem environment has a direct connect to both AWS and GCS, or in Azure, to all three, for example; there are various data centers...

A
...that allow you to do that. Because there's a direct connect, the egress costs between the data center and each of the clouds are actually negligible anyway. And so then, if you're doing something like that, for example on-prem, just using that particular example, your on-prem system is actually executing that same Spark job, or distributed framework job, against each of the regions; it's grabbing the data, and the egress costs actually are negligible; it's just more the time it takes to do the actual egress, right? And so, as you can tell, there's no straightforward answer to this.

A
This is basically something where you can jump back and forth in terms of how you want to do it, and so it really depends on what you're trying to achieve. It makes a lot of sense where you're coming from, but depending on your scenario, you may want to, again, run remote jobs, or again, if you've got a direct-connect type of setup, the egress costs aren't actually going to be that big a deal, so...
B
I want to touch on a slightly different direction, please. If we talk about Delta specifically, just Delta things: all of the performance tuning that you've got in Delta is about reading less data. It's about selectively reading just some of the files that are inside that table, rather than all of the files. So it depends whether your analytics setup has just distributed data across lots of different storage, or distributed Delta tables across lots of different places.

B
You know, reducing that egress cost, improving the performance: if we're talking Spark, Spark does everything in memory; it has to read the data into memory to work with it, but it doesn't have to read all of the data. If we're talking about partitioning, and you run a query that hits the partition key, or it's using dynamic partition pruning, it will only read the data from those partitions, so you can just ignore all the other subfolders.

B
If you're talking about Z-ordering and you've got data skipping, you've got your statistics; that's going to go, well, actually, I don't need to read those files, and you're reading even fewer files. If you're talking about Bloom filter indexes, it goes: actually, it's most likely going to be in that file. You've got these various different performance techniques you can use if you're using Delta, which is pushing that filtering down towards the cloud storage, so it doesn't bother reading data...

B
...it doesn't have to read. Again, that's only if we're working with Delta and partitioning in Spark; yeah, for other things it just depends on what kind of data you've got and where it is.
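A minimal sketch of the "read fewer files" techniques Simon lists: Z-ordering the table so data-skipping statistics can prune files, then filtering on the partition and Z-ordered columns. The table, partition and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cluster the file layout by a commonly filtered column (Delta OPTIMIZE / ZORDER).
spark.sql("OPTIMIZE silver.events ZORDER BY (customer_id)")

# A query hitting the partition key plus the Z-ordered column lets Delta skip
# most files using its per-file statistics, so far less data crosses the wire.
spark.sql("""
  SELECT count(*)
  FROM silver.events
  WHERE event_date = '2022-11-15' AND customer_id = 'C-123'
""").show()
```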
A
Excellent, okay. We have a ton more questions and not enough time. So look, you and I, we keep... we're workaholics, so let's try to dive into them. Okay, back to LinkedIn: Jonathan just asked...

A
We need to backfill a Delta bronze raw table, loading JSON files segmented by day, about 500,000 files a day, for the last 13 months. Once loaded, we want to turn on DLT to process it as a stream, as real-time-ish sales data. Strategies for loading that much data? So, Jonathan, great question; I'm going to do a high-level answer and try to make it short.

A
You probably want to stream the data in, just because that way it's a continuous process, and that way it'll actually go ahead and work in your current mode, whether you're using DLT or not, basically just from the standpoint of streaming, because you want to go ahead and keep running the stream.
A
The main strategy that you want to take into account is the fact that, because you're trying to load a whole bunch of data from the past, you'll probably want to knock your cluster up to a larger size, so it can go ahead and stream the data in faster. But the tracking of all the individual files, whether they're processed or not, things of that nature, that's actually done by Structured Streaming.

A
You can also make use of, if you're using Databricks, Auto Loader to help with that process. But irrespective of whether you use, you know, vanilla Spark or whatever else, the context I'm trying to get at is basically: I would use streaming, basically the same code that you would normally run for regular processing, run that for your backfill, but then knock up the number of available nodes so you can actually catch up. So, anything you'd like to add, Simon, on that one?
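A minimal sketch of the backfill Denny describes, assuming Databricks Auto Loader (cloudFiles) so Structured Streaming tracks which of the day-partitioned JSON files have been processed, with an availableNow trigger so a temporarily oversized cluster can chew through the 13 months of history and then stop. Paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = (
    spark.readStream.format("cloudFiles")                  # Databricks Auto Loader
         .option("cloudFiles.format", "json")
         .option("cloudFiles.schemaLocation", "/mnt/checkpoints/sales_schema")
         .load("/mnt/landing/sales/")                       # day-segmented JSON folders
)

(raw.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/sales_bronze")
    .trigger(availableNow=True)                             # catch up, then stop
    .start("/mnt/bronze/sales_raw"))
```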
B
So I've not actually tried switching existing Delta tables over to then be DLT-loaded afterwards; I don't know how awkward that's going to be in terms of just switching that over to then be a DLT process. You probably could, if you've got the location and folder structure set up in advance, but I don't know if it's going to try and reprocess that table.

B
If you couldn't, and you're trying to work out the best, most efficient way to get in and read that: if that's 500k files per day, we've got the small-files JSON problem, and you're figuring out how to get that in, the easiest way is to batch them into JSON Lines files, turn them into slightly more performant files, and still treat it as your DLT source if you have to reload them in from scratch. But it's certainly easier to build it as a Delta table than to switch it over; I've just not actually tried. Okay.
A
Yeah, well, all right, give it a shot; let us know. Cool, all right, back to the GDPR one. Tony asked a great question: how can we selectively remove records from a Delta version history, in order to service a right-to-be-forgotten request? So, do you want to go first, or do you want me to go first on this one?
B
Tony, deliberately asking the awkward questions. Yeah, I mean, so, realistically, removing them from the Delta version history... there are two parts to it. Today we talk about the right to be forgotten, squeezing in an actual bit of GDPR at the end. You know, if you just run a delete statement saying, well, this person, we're going to obfuscate the data, we're going to mask the data, we're going to trash the encryption key, we're going to delete the record, whatever method of forgetting you're doing...

B
...if you just do that, run that on your Delta table, then it's going to be in your history until you do a vacuum, and you can't selectively open up one of those historical Parquet files, scribble out the data and close it back up again. Essentially, if you perform a right to be forgotten, you have that record still in your data, in your version history, until the point when you vacuum. It's part of building your processes to know: okay, right to be forgotten, how long have we got to action that process?
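A minimal sketch of the two steps Simon separates: the logical delete, and then the vacuum that physically removes the older Parquet files which still contain the record. Table and column names are hypothetical, and the retention window shown is the Delta default.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 1: the logical "forget" (delete, mask or obfuscate the record).
spark.sql("DELETE FROM silver.customers WHERE customer_id = 'C-123'")

# Step 2: the record still exists in older file versions (time travel can reach it)
# until those files fall outside the retention window and are vacuumed away.
spark.sql("VACUUM silver.customers RETAIN 168 HOURS")  # 7 days, the default retention
```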
A
Agreed. Saying that, one of the things I definitely want to say is that you can design systems up front to make it less painful, okay. Now, this goes back to the processes. For example, one of the scenarios that I often talk about, which the folks over at Starbucks did, right, is the data that their legal department, and again, remember...

A
...we are not lawyers, so you do need to talk to your lawyers for stuff like this, okay. But the data that would be deemed as the stuff that would fall under a GDPR request, like a right-to-be-forgotten request, I'm going to deem that as a demographics table, okay. And so the idea of the data processing itself is that the data itself, when it comes to demographics data, goes into a separate demographics table, and then the fact information would only contain the ID and nothing else.

A
So in their particular case, what they did is that, when they received a right-to-be-forgotten request, they would update the demographics table, not delete it, at least not yet. The reason they updated it, basically, is to say, very clearly, "redacted", okay. And they often, within the notes, or even within the transaction, would put the GDPR request ID, so they can say: okay, this is associated with this request that just came in, okay.
A
The reason they did the update versus the delete wasn't because they were trying to avoid anything, and by the way, underneath the covers they would actually still have to run a vacuum, exactly as Simon was calling out. But the idea is that you wouldn't need to redact and vacuum all the fact data; you would only need to vacuum the demographics data, which isn't changing nearly as much, okay. And so, in the process of doing that, number one...

A
...they still wouldn't have to run the vacuum for 30 days, precisely because: what if you accidentally deleted the wrong person, right? There are GDPR compliance rules where you're actually allowed to keep the information in history, like a backup, precisely because you might accidentally delete the wrong thing. The point is that there is a paper trail that clearly states it, and that's the reason why they said: no, we actually went and updated the information to say we redacted that person.
They
also
did
that,
because
that
way,
the
the
downstream
systems
could
go
ahead
and
actually
automatically
delete.
They
could
tell
their
Downstream
system
to
delete
any
demographic
information
related
to
that
person
as
well.
So
that's
why
they
didn't
want
to
go
ahead
and
delete
the
information
right
away
and
so
now
again
we're
not
lawyers
so
different
organizations.
A
Different
countries
are
going
to
see
this
slightly
differently
and
you're
going
to
have
to
work
with
your
legal
department,
and
this
is
not
a
bad
thing
by
the
way,
it's
actually
hard
and
tricky
to
do
some
of
the
stuff
and
lawyers
are
in
fact
some
of
our
best
friends
when
it
comes
to
stuff
like
this.
So
this
is
me
not
trying
to
knock
lawyers.
Quite
the
opposite.
I've
worked
with
some
really
awesome.
Lawyers
sounds
funny
coming
from
me,
but
really
awesome
lawyers
to
basically
go
ahead
and
actually
make
sure
we
follow
gdpr
compliance.
A
B
The one slight twist on that approach of having the separate table and deleting those separate attributes, which again links into what we were originally going to talk about in terms of PII, is doing it with dynamic decryption. So we encrypt the data in that table, whether it's separate as a dimension table with a key off the fact, or however you do that, and people have to join to a separate table that is essentially, like you're saying, that lookup list of keys, with a decryption key for each of those different keys.

B
And then, if someone says "forget me", you just go in and delete the decryption key, so essentially the data's still there, but it can no longer be decrypted: you have trashed that data. The flip side of that means everyone who accesses that data has to join to a table and run an AES decrypt, and so you end up with views that do that dynamic decryption, which is the same pattern we used to do for things...

B
...you know, in kind of traditional relational warehouses, but we can now get those same patterns working, and it's much more efficient, because again, you're deleting a single record and then vacuuming that one table's history. Plus you don't have your statistics, you don't have any PII data exposed in any of your Delta log stats.

B
So if you're trying to be very, very compliant, and you do have really sensitive data, and you've got a mix of users, some who are allowed to see it, some who are not allowed to see it: having everything sensitive encrypted, and having a central master decryption key linked to each of those individual master records, actually works really nicely for that pattern. It hits performance a little, because you're having to decrypt on the fly.
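A minimal sketch of that crypto-shredding pattern: sensitive columns are stored encrypted, a key table maps each person to their own decryption key, authorised readers go through a view that decrypts on the fly, and a right-to-be-forgotten request deletes just the key row. This assumes the built-in aes_decrypt function (Spark 3.3+); all table, column and key names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# View for authorised readers: join to the per-person key and decrypt on the fly.
spark.sql("""
  CREATE OR REPLACE VIEW gold.customers_pii AS
  SELECT c.customer_id,
         cast(aes_decrypt(c.email_enc, k.decryption_key) AS string) AS email
  FROM silver.customers c
  JOIN secure.customer_keys k
    ON c.customer_id = k.customer_id
""")

# Right to be forgotten: remove the key, not the data. The encrypted bytes that
# remain, including in old file versions and statistics, are now unreadable.
spark.sql("DELETE FROM secure.customer_keys WHERE customer_id = 'C-123'")
```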
A
All right, this is excellent, excellent stuff. So, unfortunately, we're actually at the point where we need to stop answering questions, because it's eight minutes past the hour, and technically we were supposed to finish ten minutes before the hour. Saying this, you might have heard a big ping in the background. What we're going to do, because we did not answer a bunch of the questions, and a lot of them are actually really, really good...

A
First of all, you can always join us at go.delta.io/slack and ask those questions there. But the reason you heard the big ping is because I took a screenshot of all of your questions. Simon and I will make sure to answer those questions in the next session, which will be December 13th, I believe. We run this monthly, so you can come back in, but we're going to actually copy down the questions and we're going to make sure we answer your questions in that session as well.
A
So thank you very much, everybody, for taking the time out of your day to spend it with the two of us. Like I said, if you have any questions in the interim, definitely go ahead and join us on Slack. But meanwhile, like I said, I've already taken screenshots of all the existing questions, so I apologize for not being able to answer...