From YouTube: Simon + Denny AUA (2022-09-06)
Description
Join us for the brand new monthly series "Simon and Denny - Ask Us Anything!" where Simon Whiteley and Denny Lee will answer your data engineering questions from building a data platform to ingestion to ETL to analytics. With their background in SQL Server and BI to Apache Spark and Delta Lake - they want to show you how to build your own lakehouse.
As this session is interactive, come prepared to ask questions all throughout the session! Be prepared for another geeky, trans-Atlantic event from two data nerds.
Quick links:
https://delta.io/
https://go.delta.io/slack
https://groups.google.com/g/delta-users
https://go.delta.io/github
B
Okay, that is the most interesting answer I've given. Well, it is currently about 66 degrees — almost 18, or I guess 16, celsius — here in Seattle. Yes, sunny Seattle, so goodness gracious! I know people might be surprised that we actually might have decent weather here. So we've got YouTube up and running, so that's good. Now we're just waiting for LinkedIn to kick in, and once we get that we're good to go.

A
I mean, something's wrong in the world if London is warmer than Texas.

B
All right, perfect, I think we are — oh yeah, Ardavan from Texas has noted that it's 85 degrees, so that's a little on the warm side. Okay, perfect. We are up and running on LinkedIn and YouTube, so if you have questions, by all means please start asking. Let me go ahead and stop sharing, and then — perfect. So for everybody that's wondering, this is Simon and Denny: Ask Us Anything. Now, we do paraphrase a bit when we say "ask us anything": it is within the realm of data engineering, Delta Lake, and lakehouses. So it is ask us anything related to that — I guess you could throw in some coffee questions for me and baking for Simon, but nevertheless, that's the context. Okay, so saying that, first things first, why don't we introduce ourselves. Simon, why don't you start?
A

B
All right, thanks very much. Okay, and then myself: my name is Denny. I'm a developer advocate here at Databricks — as I go ahead and bust my mic — I'm a long-time contributor to Apache Spark and also Delta Lake, and I was also a committer on Delta Lake prior to Databricks. Before that I was at Microsoft, and you can blame me for things like HDInsight and SQL Server and Cosmos DB and things of that nature. So, lots of data questions — that's what you're here for, so let's dive right into it.
B
So now, as this is an ask us anything, we do ask you to drop your questions directly into either the Q&A, or into LinkedIn or YouTube. This show will pretty much run based on the premise of the questions you ask — so do note that. In other words, this thing could end in about five minutes if you've got no questions; by the same token, we can also iterate and talk about lots of really cool things related to data.

B
So one of the first things we actually got — one of the first questions, because you can also send questions via LinkedIn directly to Simon and myself, or tweet us for that matter — was: can we dive a little bit into star schema migrations into Delta Lake, and data warehouse best practices?
A
You know, you can write your insert statements, your merge statements, whatever you use to make your facts and dimensions, and you can lift and shift and run the majority of that in Spark — it's just going to work. There might be a little bit of syntactical change: I don't want square brackets, I want backticks around my long column names, that kind of stuff. You can't have recursive CTEs, but apart from that — and if you've got a massive recursive CTE in your dimension model, there's something a bit weird going on — the majority of this stuff you can just lift and shift over. Now, I don't necessarily say that's the best way of doing things, because if you had 50 stored procs sitting in your SQL database that you'd been using to build your existing warehouse, and you just turned them into 50 different notebooks and ran them, you'd be missing out on some of the nice parts of performance, tuning, and engineering. So a lot of the time we take that SQL logic and put it into a more generic Python notebook: we'll have a bit of PySpark that says load data from somewhere, transform it, land it somewhere with partitioning, do a proper merge and kind of manage it for you — and you just take that same SQL and run it inside that PySpark. You know, dataframe equals spark.sql, run that SQL, or have it as a view you select from, or however it happens to be. We tend to separate the business logic — put the SQL over there — and then have a really generic kind of wrapper that says: get some SQL, then land it properly in a nice, governed, performant way.
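For illustration, a minimal sketch of that wrapper pattern in PySpark — the table names, paths, and SQL here are hypothetical, not from the session:

# Business logic stays as SQL; a generic wrapper handles read, merge, and write.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

business_sql = """
    SELECT customer_id, SUM(amount) AS total_amount
    FROM bronze_orders
    GROUP BY customer_id
"""

df = spark.sql(business_sql)  # run the lifted-and-shifted SQL as-is

# Generic "land it properly" step: merge into the target Delta table.
target = DeltaTable.forName(spark, "silver_customer_totals")
(target.alias("t")
       .merge(df.alias("s"), "t.customer_id = s.customer_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())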
B
Perfect. Well, adding to that: one of the things I often talk about is that it sort of depends on what environment you're starting off with first. So, for example, if you're starting from scratch, and your source data is relatively small and you don't have that many different varieties, you may actually not need a star schema, in all seriousness. You can start from the standpoint of just the data. I know, I know — I'm hurting poor Simon on this one, I apologize for hurting him like that — and no, no, dude, I came from SQL Server and all I did was promote data warehouses, so calm down, buddy boy.

B
No, no, I know, but the context is that in order to get yourself up and running, it's not a necessity to do that, because the whole reason we talked about star schemas — especially for online analytical processing, or OLAP, or data warehousing design in general — was that we were trying to be extremely efficient with our joins, especially during the older days. This is aging myself, but the idea was that we could utilize memory more efficiently by ensuring that joins were being done on primary keys and foreign keys — whether we declared actual primary and foreign keys or just logical ones — using integers. That way those joins could be very efficient, and by only storing integers, at worst bigints, inside the fact tables, you were basically reducing the size of the fact table as well.

B
So that was the initial impetus for why we built star schemas in the first place, and a lot of that logic isn't necessary, especially when you're starting off. But as you progress and build more complex systems, what usually ends up happening is that, for starters, your dimension tables aren't necessarily stored in the same system that you're actually processing in your lake. You may have some third normal form design, or some operational data store, that actually controls what those dimension data sets look like in the first place. Remember, a star schema is comprised of a fact table with a bunch of dimension tables — and that dimension data may be organized or controlled by some other system. Because it's controlled by some other system, you'll end up needing to build a star schema anyway, just because you're going to be extracting the data out of that other system, and because you may not have full control of the system end to end. What ends up happening ultimately is that you'll have to do exactly what Simon just described: all those data warehousing design patterns you would normally apply — the actual techniques — even though you're not building a data warehouse, you're building a lakehouse, the patterns of a data warehouse are still very much applied to your lakehouse design.
A
You know, a lot of Kimball — how you design that star schema — was actually designed for a relational system. There's loads of it that we keep, because it's a good way to manage data: things like a slowly changing dimension, which is, dimensionally — you know, if I've pulled all the information to do with this dimension out onto a separate table because I'm doing some data model design, and I need to update something because something's changed, and I can do that by updating one record rather than updating every single record in a 10-billion-row fact table, that's way fewer operations. So there are loads of bits of data modeling that just make a lot of sense. But some of the things we used to never do — you know, the cardinality rules when designing a star schema: if you put a string on a fact, you're a rebel, you get thrown out of data modeling — don't matter so much anymore. It's a little bit more forgiving; the compression and dictionary management of parquet is a little bit better, so you can do that kind of thing. So I'm a little more laissez-faire with my Kimball modeling rules now — I'm not the dictator I used to be about how strictly you should design and how strictly you should conform to the Kimball modelling principles.
B
Do what we've been doing for the past — exactly, exactly. One thing to note, and this is just a shameless plug — and Ardavan, who originally asked the question, called it out — Douglas Moore and myself actually did a couple of sessions, three sessions, about data warehousing techniques with Delta Lake. The session in particular related to what Simon's calling out is the one on surrogate keys and Type 2 slowly changing dimensions. So we actually have sessions just on those two, and I am proud to say, as a nerd, that I made sure the entire session was themed around Stargate SG-1 and Stargate Atlantis as well — yes, because it's important for us to get that in, so I just wanted to call that out. Makes sense? Yeah, okay. So, as a quick follow-up — because otherwise you and I will keep on talking about this particular topic — we're going to stop now, because we have other questions and we're going to actually try to answer some of them.
A
I mean, for me it was Spark 2.x: you didn't really build a star schema because it didn't really perform that well with joins. You couldn't do predicate pushdowns over partitioned joins, and it wasn't very good, so you used to have to do really odd designs. Essentially, pre-Spark 3.0, I spent most of my design time trying to trick people into including the actual partition column in the filter predicate of their SQL statements. That's how I spent my life: no, no, ignore those columns, use this nice shiny column that you really should query on. But after 3.0 — especially after adaptive query execution and dynamic partition pruning went in — you can do things like have a date dimension, have people filter on the date dimension, and have Spark work out that, oh, actually, I can take that filter context and apply it over to my fact table — and it just works. So it's less of a "what should you consider when migrating"; it'll just actually work, as opposed to not working so well.
B
Yeah, though the one thing I would definitely add — which I'm a huge fan of — is adaptive query execution, AQE, which actually speeds up performance, and it's designed very much with that particular data warehousing style in mind. And my favorite part of AQE is skew handling: Spark 3.x basically solves that for you — well, I always just say it solves it for you, but it does a great job handling skew, so you don't actually have to sub-partition the data yourself. Specifically to your question: there's really nothing you need to do, it just happens. These are the learnings from the Spark community itself, as over time more and more people built these complicated systems — lakehouses — on top of Delta Lake.
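For reference, a minimal sketch of the Spark 3.x settings being discussed — these are on by default in recent releases, so this is only to show which knobs are involved (using the notebook's spark session):

# Spark 3.x: adaptive query execution, skew-join handling, and
# dynamic partition pruning (all enabled by default in recent versions).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")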
A
Okay, cool. The one thing — casting my mind back a couple of years now — I think the minor thing that tripped us up with the dimensions that we generated was that we had to change the case sensitivity of the various different data formats. I think that was the only main thing that actually trips you up in taking it as literally a lift and shift: suddenly it's like, oh, it's case sensitive now, because it's gone ANSI standard.
B
That's
right:
that's
right,
yes,
and
then
for
table
column
names.
As
of
delta
2.0.
You
can
accept
the
so
long
story
short.
The
reason
why
you
couldn't
you
actually
had
couldn't
have
spaces.
It
couldn't
have
capital
capitalizations,
though
frankly
capitalization
sort
of
sort
of
suck
anyways,
but
is
because
parque
itself
didn't
accept
it,
but
then
but
delta
2.0
onwards,
basically,
except
there's
a
mapping
that
was
actually
introduced
in
delta
1.2.
That
now
allows
us
to
go.
B
Do
that,
but
again,
these
are
minor
things
that
that
actually
just
make
things
a
little
bit
easier
for
you
more
more
than
anything
else.
That's
all
so
cool
ready
for
the
next
question,
because
we've
got
somebody
from
the
uk
asking
and
by
the
way
for
the
folks.
B
I
we
see
your
questions
from
jeff
and
hilbert
somebody
anonymous,
but
there's
one
from
graham
that
we
want
to
tackle
first,
just
because
it's
from
your
neck
of
the
woods,
okay,
so,
and
so
I
want
me
to
say
it
yeah
I'll,
say
it:
okay,
hello
from
wet
in
windy,
blackpool
in
the
uk.
So
there
you
go
for
your
neck
of
the
woods.
I'm
using
delta
lake
on
azure
synapse
to
build
a
lake
house
against
the
against
data
versus
data.
What
is
the
best
way
to
generate
generate
surrogate
keys
with
spark?
B
He
saw
mai
and
doug's
tech
talk
about
certain
keys
a
couple
years
ago
again,
the
one
that
includes
references
to
stargate,
g1
and
stargate
atlantis.
Have
your
views
changed
yet.
A
Was
the
thing
I
was
I
was,
I
was
teasing
danny
before
the
stream
with
a
question
of
my
own,
which
is
so
we
used
to
build
surrogate
keys
in
a
couple
of
different
patterns.
You
can
do
it
with
a
a
row
number
window
function
and
that
that
involves
a
little
bit
of
sorting
and
it's
like
performance
or
you
could
do
it
with
the
monitor
monotonically
increasing
id,
which
is
a
fantastic
name
which
is
super
performance.
A
A
Options,
it
was
just
a
performant
way,
a
tidy
way,
but
these
days
delta
you've
got
an
identity
column.
So
if
you
have
built
a
delta
table
before
you
insert
data
into
it,
so
if
you
do
a
create
table
as
and
specify
the
table,
properties
and
specific
properties,
you
can
say,
have
an
identity,
column
and
then
just
insert
data
into
it,
merge
data
into
it
and
it
manages
its
own
identity.
So
the
same
way
as
you
would
in
terms
of
a
normal
sql
database,
you
would
have
an
identity
column.
I
mean
I
know.
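A minimal sketch of the identity-column approach — table and column names are hypothetical, and the GENERATED ALWAYS AS IDENTITY syntax depends on your Delta/Databricks version (using the notebook's spark session):

spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_customer (
        customer_sk   BIGINT GENERATED ALWAYS AS IDENTITY,
        customer_id   STRING,
        customer_name STRING
    ) USING DELTA
""")

# Inserts (and merges) then manage the surrogate key for you:
spark.sql("""
    INSERT INTO dim_customer (customer_id, customer_name)
    VALUES ('C-001', 'Graham')
""")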
B
The funny thing about that statement is that, yes, within an SMP — a single system, within SQL Server, since I'm from SQL Server — it's possible to do such things. When you have a distributed system where each of the workers actually needs to generate a unique ID, that's a lot more complicated than people realize, and so yes, it took a little bit longer for us to get there. There is actually a great session — I'm pasting it in; let me paste it in the chat so y'all can see it. Oh, where did my chat go? There you go, all right. The one I'm pasting right now into LinkedIn and YouTube is Simon's video about identity columns and Delta. And then, of course, if you wanted the previous history — where, like I said, there are references to Stargate SG-1 — I'm going to paste my and Doug's version of that as well.

B
So I think that should cover those two scenarios quite nicely, and that should cover us on surrogate keys for now. Let's go ahead and go to the next question. All right, let me go to LinkedIn first, because otherwise I'm going to forget what people are asking here. What are the biggest challenges in Delta Lake for real-time data refresh?
A
I
mean
joins
realistically,
so
in
terms
of
how
we
design
likes.
You
know
you
talk
about
the
even
talk
about
bronze
silver
gold,
so
you
know
what
I
use
but
run.
Somebody
else
is
fine,
having
bronze
and
for
real
time
great
easy,
because
you
tend
to
have
a
one-to-one
mapping
between
a
data
set
comes
in,
we
put
it
down,
we
pick
the
data
set
up
and
we
clean
it.
We
get
it
right
and
we
get
it
all
ready
and
validated.
A
We
put
it
down
again
now,
having
that
single
line
of
hops
that
easy
to
stream
things
just
work
yeah,
you
have
to
manage
it.
You
get
like
some
optimizer.
You
have
to
look
after
for
the
smalls
files
problem
of
parquet,
if
you're
constantly
constantly
inserting
small
things,
but
that's
all
fairly
simple
the
moment
you
say
right
and
then
I
want
to
join
these
five
tables
together
to
make
something.
Like
effect,
things
get
really
complicated
because
it
depends
on
the
the
window
depends
on
the
kind
of
the
temporal
relation
of
those
different
things.
A
How
much
state
are
you
having
to
keep
in
memory
of
your
spark
cluster
to
actually
get
that
working,
so
the
majority
of
them
for
pure
brutal
simplicity?
We
tend
to
have
things
going
full
real
time
up
to
that
kind
of
silver
layer
and
then
have
just
incremental
rebuilds
of
the
gold
layer
or
kind
of
just.
You
know
every
15
minutes
every
five
minutes,
whatever
happens
to
be
saying,
take
everything
that
changed
pull
that
over
just
because
trying
to
do
it
as
spark
structured
streaming.
A
You've
got
limitations
as
to
how
many
streams
you
can
join
together
and
then
a
steam
streams
are
static
or
static
to
stream
and
there's
just
different
constraints
about
what
you
can
achieve
there
and,
if
you're
asking
people
to
do
it
from
a
from
a
business
point
of
view.
From
a
data
analyst
point
of
view,
kind
of
you've
got
all
your
data,
your
dimensions
designed
effective
designs.
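As an illustration of that "incremental gold rebuild" pattern, a minimal sketch that reads only what changed in silver and upserts it into gold on a scheduled trigger — table names and paths are hypothetical, and trigger(availableNow=True) needs Spark 3.3+ or a recent Databricks runtime (using the notebook's spark session):

from delta.tables import DeltaTable

def upsert_to_gold(batch_df, batch_id):
    # Merge each micro-batch of silver changes into the gold table.
    gold = DeltaTable.forName(spark, "gold_daily_sales")
    (gold.alias("g")
         .merge(batch_df.alias("s"), "g.sale_id = s.sale_id")
         .whenMatchedUpdateAll()
         .whenNotMatchedInsertAll()
         .execute())

(spark.readStream.table("silver_sales")
      .writeStream
      .foreachBatch(upsert_to_gold)
      .option("checkpointLocation", "/checkpoints/gold_daily_sales")
      .trigger(availableNow=True)   # process what's new, then stop
      .start())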
B
Yeah, I can't overemphasize Simon's point enough, which is that, more often than not, it's actually the business logic. That's not to say the technical logic isn't difficult — it can be quite the opposite; quite assuredly it can be quite difficult from a technical perspective. But what usually happens is that the business that's making the request doesn't actually understand the implications of what they're asking, and because they don't, the people trying to build it end up struggling. So you want to make sure that, when the system breaks down — and all systems break, right — it's really easy to reboot, restart, refresh, whatever. The other quick call-out I would make is that when you talk about "refresh", it really depends: that word actually means different things to different people.

B
So, for example, somebody from the business might say "refresh" means everything from bronze to silver to gold needs to be completely rebuilt, versus an analyst just saying, oh no, I just need the gold table refreshed. With the former — everything needs to be refreshed — you're basically taking everything from the raw data, because you've decided, for the sake of argument, that the business logic has changed: what used to be coded as red is now blue. All of it needs to be rebuilt — every silver table, every gold table, all the machine learning, all the analytics that went with it. Now, be forewarned: more times than not, that's actually not what they mean — it's just what they think it means. And the latter is: oh no, I just need a change to a single table from silver to gold, and that just requires some business logic updates, that's all — or modifications; I should probably be very specific with my wording here. But the context, nevertheless, is that it's that business logic that really comes into play and really starts messing you up. So, more times than not, when people use that word "refresh", you have to be very careful about what it actually implies and what it entails. Does that make sense?
A
Makes sense to me. All right, final thing on that: if you're going through your bronze and silver and you're not going to rebuild your gold, you can also point your reports at silver, and whenever someone hits go, they recalculate those things on the fly — the report takes a second or two longer to render, but you're not actually having to recalculate or refresh everything up front. There are different patterns, right; it's about deciding what needs to be updated.
B
Excellent. Okay, so let me switch to YouTube before I go back to Zoom here — and this is actually related to Hilbert's question from Zoom about DLT and dbt. So: do you recommend using data modeling tools like dbt instead of a bunch of transformation notebooks?

A
I don't have a strong opinion on dbt, honestly, because in terms of the stuff that I do, we've already invested: I've already built a lot of transformation notebooks, and I tend to be lifting and shifting existing SQL to then just plug into them. So for me, dbt doesn't make sense for my estate of code. I've seen a lot of clients who have had great success using dbt and applying some of those principles to template out the logic, but I don't have a huge amount of hands-on experience with it yet. It's on my list of stuff I need to spend a lot of time digging into, so I can't really say too much about it.
B
Okay,
well,
and
but
by
the
same
token,
that
that's
actually
the
valid
answer
right,
because
a
lot
of
the
times
when
it
comes
to
using
systems
like
dbt
or
dlt,
which
is
related
to
hilbert's
question
about
dlt,
okay
and
just
using
notebooks,
it
really
depends
on
where
you're
coming
from
right.
So,
for
example,
the
often
the
data
engineering
persona
is
one
that
they're
using
clis
and
ides
to
do
all
their
development.
B
The
data
scientist
is
the
one
who
uses
notebooks
right
so
for
sake,
argument
if
you're
or
data
analyst,
for
that
matter,
I'm
sorry.
So
if
you're,
typically
a
data
analyst
or
data
scientist
that
has
a
lot
of
sql
statements
or
python
statements
using
that
as
an
example,
it
may
actually
make
in
all
seriousness,
make
complete
sense
to
go
ahead
and
just
use
notebooks
for
those
type
of
transformations.
B
By
the
same
token,
if
you're
coming
in
from
and
by
the
way,
dbt
is
great
for
that,
because
the
whole
purpose
is
to
write
everything
in
sql,
okay.
By
the
same
token,
if
you're
coming
from
like
no
no,
I
really
need
an
ide
style
or
I
need
cli.
This
is
where
I'm
saying
okay.
Well,
then,
maybe
I
want
a
cli
system.
B
Maybe
I
want
to
use
dlt
or
in
some
cases
dbt
or
whatever
else,
to
do
that
and
so
really
honestly,
it's
the
context
is
very
much
where
you're
coming
from
and
which
what
tools
you're
already
used
to
using.
So,
for
example,
if
you
are
a
scala
ide
developer,
then
you
typically
use
intellij
honestly,
that's
probably
where
you're
going
to
come
from
you're
going
to
come
from
that
aspect
right
versus.
If
you're
going
to
go,
be
a
python
developer,
you're,
probably
going
to
be
perfectly
happy
inside
the
notebooks
okay.
B
So
hopefully
that
answers
that
question
from
youtube.
Now
this
flows
into.
Are
there
big
differences
between
dlt
and
dbt
right
away?
I
can
I'll
take
that
answer.
Since
I've
used
dbt
a
little
bit
yeah
I
mean
there
are
very
big
differences.
Delta
live
tables
is
very
much
about
the
structure
of
delta
lake
tables
within
the
context
of
data
bricks
right
now.
Okay,
right
now,
that
is
okay.
Now
in
the
case
of
dbt,
dbt,
is
great
for
using
sql
statements
on
multiple
different
sources:
okay,
so
dlt!
B
It's
not
is
very
it's.
Its
design
is
very
centric
about
how
to
simplify
the
streaming
and
and
or
batch
processing.
So
it's
very
much
this
context
of
you're
coming
from
spark
you're
coming
from
delta
lake.
Now,
how
do
I
go
ahead
and
abstract
away
some
of
the
complexities
around
using
different
environments,
basically
write.
The
code
once
apply
to
different
environments,
you're
good,
to
go
in
the
case
of
dbt.
On
the
other
hand,
it's
very
much
about
saying.
B
Okay,
let
me
use
sql
statements
to
go
ahead
and
apply
to
all
of
these
different
databases,
all
these
different
systems,
source
systems,
and
so
there
there
definitely
is
some
intersection
between
the
two
but
they're
actually
really
designed
for
two
different
environments,
and
hopefully
that
helps
answer
that
question.
B
Rock on. Okay, we've got plenty of other questions and I realize we're probably going to try to solve them in the next 15 minutes, so let's do this. We have another question from our buddy Greg Kramer on YouTube. He's asking: is there a relation between dbt and Fivetran? There are lots of tools — it was easier when we could just handle everything, all the data, with Excel. So, do you want to comment, or do you want me to comment first on this one?
B
Okay, so Greg — Greg, Greg, Greg — Greg's an old friend. In terms of all the different tools: I'm probably not going to be able to explain dbt versus Fivetran in a decent enough way. These are great tools that allow connectivity to different source systems. The fact is, there are a lot of systems out there — a lot of different orchestration systems, a lot of different development systems — and we're probably not the best people to explain all that stuff. Honestly, we will actually have sessions on dbt and Delta Lake, and Fivetran and Delta Lake, and all this stuff in the near future, and those will be more appropriate for that type of discussion. But what it comes down to — and the part that I will say — is that yes, it was easier when we did everything in Excel. Part of the reason I'm laughing at that particular statement is that Greg knows one of the first demos I did — by the way, when I was still at Microsoft — was to basically get Excel to connect directly to Hadoop to bring data down. And that's true when you're dealing with less than 65,000 rows — the old 65,000-row limit in Excel. As data has gotten larger and there's more complexity, and things have gotten much faster, the reality is that while Excel is a great end tool to look at this stuff, 95% of our problems are much more related to everything before that, as opposed to that little tidbit near the tail end.
A
I'll tell you, it was easier when we could do everything in Excel because the data was a small enough problem that it could be done in Excel. Yes — the fact that we're doing larger, bigger, crazier things, that we're doing things in real time, that we're dealing with horrible, nasty, gnarled, nested JSON, that we're doing mad data science stuff — you couldn't really do that back in those days. So yeah, it's gotten more complicated, and yeah, so have the environments.
B
Exactly. Join us — let's switch back over to Zoom here. Jeff actually asked this question early on. This is a bit of a longer one, so I'm going to read it, but a little slower so that everybody can hear it. "Good morning. I'm using Qlik Replicate (it's a CDC tool) and also Dynamics 365 to push data to a data lake. Both create incremental change files in an Azure storage account. In both scenarios, the initial and any subsequent full-load files land in one folder, and change files land in a second folder" — so, full-load files in one folder, change files in a second folder. "I'd like to use Auto Loader to ingest to a bronze Delta table, but how do you best account for two source paths?" Keeping in mind full reloads that could be issued after the fact, plus the fact that you still have change data capture — all of the above.
A
I
mean
it's
a
pain.
Essentially,
you
need
to
have
a
switch
in
your
notebooks
that
look
after
it.
Essentially
it's
two
separate
parts
I
mean
whenever
we're
doing
an
autoloader
load,
we'd
have
a
single
notebook
that
does
it
anyway,
so
he's
saying,
load
from
autoloader
and
I'll.
Tell
you
the
folder
path,
I'll,
tell
you
the
details,
I'll
tell
you
the
scheme
to
expect
all
that
kind
of
stuff
and
then
transform
it
and
land
it
into
whatever
songs
table
you're
dealing
with.
A
Essentially
it
something
would
have
to
do
to
trigger
the
run
of
the
historical
something
would
have
to
run
to
trigger
a
run
of
the
at
the
increment.
It
depends
how
you're
using
autoloader,
because
I
think,
compared
to
the
databricks
true
standard,
if
it's
a
streaming
tool
and
it
should
be
turned
on
streaming
all
the
time
I
use
autoloader
like
a
criminal
and
just
turn
it
into
a
trigger
available
now
root
and
just
run
it
as
an
increment
batch
one.
A
It's
just
a
very
easy
one
for
me
to
batch
data
in
so
it
depends
on
how
you
trigger
in
it.
If
you
just
leave,
it
turned
on,
if
you
leave
it
streaming,
so
you
should
essentially
have
two
parallel
streams
running
which
could
be
the
same
notebook
but
just
executed
in
parallel
with
different
different
parameters.
Playing
one
saying
is
my
historical
stream,
one
saying
that's
my
incremental
stream,
but
there's
an
overhead
right,
there's
a
cost
of
leaving
that
historic
one.
A
If
that's
just
sat
there
streaming
waiting
for
for
a
full
historical
reload
that
might
never
come
that.
That's
a
big
overhead
to
have
sat
on
your
spark
driver
constantly
running
microbatch's
gun
any
more
data,
no
any
more
data,
no
any
more
data.
No,
so
I
don't
tend
to
do
it.
I
tend
to
split
it
up
as
yeah
two
two
parallel
streams,
but
I
have
another
mechanism
that
triggers
the
full
reload
and
build
that
into
the
logic
of
whatever
orchestrator
you're
using.
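A minimal sketch of that "Auto Loader as an incremental batch" pattern — cloudFiles is a Databricks Auto Loader feature, and the paths, options, and table names here are hypothetical (using the notebook's spark session):

(spark.readStream
      .format("cloudFiles")                      # Databricks Auto Loader
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "/schemas/orders_changes")
      .load("/landing/orders/changes/")          # the change-file folder
      .writeStream
      .option("checkpointLocation", "/checkpoints/bronze_orders_changes")
      .trigger(availableNow=True)                # run as an incremental batch, then stop
      .toTable("bronze_orders_changes"))

# The full-load folder would be a second, parameterized run of the same
# notebook pointing at "/landing/orders/full/", triggered by the orchestrator.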
B
That's great — I don't think there's much more I can add to that. Okay, we've got tons of other questions, let me try to get into them. Next one, from Hilbert: what is the best way to start with a framework to make your notebooks more generic?
A
Give away my secrets! Yeah, there are loads of things you can do. Essentially, it's getting used to the idea that whenever you write Spark code — whenever you write a bit of PySpark — if you get a choice between doing something that can be parameterized and something that can't, do the one that can be parameterized. And there are a few little tips and tricks with that. So, making your spark.read command generic: don't use the implicit file formats — don't do spark.read.csv, do spark.read.format and pass in the string "csv", which means you can pass it in as a parameter, then .load and pass in the path. Get used to building your spark.read so it's fully parameterized: pass in the options, pass in the config — get used to passing things in. Again, for transformations, we had a lot of SQL, so you can use the expr function — it's a bit like Python's eval — which allows you to just dump a string in there, and if that string can be parsed as a bit of Spark SQL, it'll work. That means that in the logic of your notebook you don't have to say "I'm going to change this column, I'm going to apply this calculation to it"; you say "apply something to some column", and then you can start to make it generic — you can inject strings at runtime and make it very generic. And then, at the end, your spark.write — your dataframe.write — has the same thing: you can make that generic. You tell it where to go, how to put it, whether to use Delta or not — or just not include a format and it'll use Delta automatically, but then you're a criminal, yeah. So you've got those three steps, and that's what a generic notebook is: read something, do something to it, write it somewhere. And if you break your entire data pipeline into lots of steps like that — so you're not having these big notebooks holding all the different transformations needed to generate the 20 tables in your data model — each element of your data model is a single notebook that runs independently and is entirely generic: read something, do something to it, write it out. If you can start to build in those patterns, you can basically parameterize anything in your notebooks. For the majority of the lakes that we deal with, if we're talking about going from bronze to silver and silver to gold, we tend to have maybe three or four notebooks, and that's it. Everything else is just parameters and configuration: run it in a loop, run that same notebook 100 times with different parameters to build 100 tables at once. That's how we do it, in a nutshell.
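A minimal sketch of that read-something / do-something / write-something pattern — every name here (paths, formats, expressions, tables) is a hypothetical parameter (using the notebook's spark session):

from pyspark.sql import functions as F

def run_generic_step(source_format, source_path, options,
                     column_exprs, target_table):
    # Read: format, path, and options all arrive as parameters.
    df = spark.read.format(source_format).options(**options).load(source_path)

    # Transform: column expressions arrive as strings and are parsed with expr().
    for col_name, sql_expr in column_exprs.items():
        df = df.withColumn(col_name, F.expr(sql_expr))

    # Write: land it as Delta; mode and partitioning could be parameters too.
    df.write.format("delta").mode("append").saveAsTable(target_table)

# The same function runs for every table, only the parameters change:
run_generic_step(
    source_format="csv",
    source_path="/landing/orders/",
    options={"header": "true"},
    column_exprs={"order_total": "quantity * unit_price"},
    target_table="bronze_orders",
)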
B
Yeah, that's great. Okay, I've got a bunch of small questions that I'll just knock off right now. First one, from LinkedIn, from Anyesh — I apologize if I did not say your name correctly: can Delta Lake views be registered in data catalogs like Collibra? In general, when you register a Delta Lake table, you're registering it into something like an HMS, a Hive metastore.
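For context, registering an existing Delta location in a Hive-metastore-backed catalog is a one-liner — the database, table name, and path below are hypothetical (using the notebook's spark session):

# Register an existing Delta table location in the metastore so catalog
# tools that read the metastore can see it.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.events
    USING DELTA
    LOCATION 'abfss://lake@myaccount.dfs.core.windows.net/delta/events'
""")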
B
You
can
often
do
it
with
things
like
glue
as
well,
there's,
actually
a
glue
delta
light
connector
that
was
released.
I
want
to
say
two
months
ago,
I
believe,
there's
also
azure
purview
integration
and
data
hub
recently
released
theirs.
I
don't
believe
collabora
itself
currently
has
one,
but
it
would
not
be
that
complicated
to
do
a
plug-in.
B
So
one
of
the
things
I
usually
ask
people
that
if
you
have
ideas
for
pl
for
connectors
like
libra
or
anything
else,
let
us
know
join
us
at
go.delta.io,
slack,
okay,
that's
the
delta
user,
slack
chime
in
there
with
your
ideas
and
and
then
we'll
track
them
or
either
or
go
to
the
delta
users
delta
lake
github.
B
Excuse me — as in go.delta.io/github — and you can just chime in there with your ideas, because more times than not, that's how we as the Delta community are getting ideas to actually add these different integrations: basically based on those Slack messages and/or GitHub issues being created. So hopefully that answers your question on this one. Let's see, there was another one — oh, Ardavan, you had asked — the surrogate key version with the notebook isn't available.
B
Bummer
on
that,
my
apologies.
So
do
me
a
small
favor
audubon
if
you
can
just
either
link
ping
me
by
linkedin,
like
you
did
before
or
open
a
github
issue
and
then
we'll
go
tackle
that
asap.
So
my
apologies
for
that
one.
Let's
see
there's
another
question
from
sema
from
youtube.
She
had
she
or
he
I
apologize.
You
had
asked
if
a
specific
thing
about
delta
sharing
this
one's
a
little
bit
more
complicated.
B
So
what
I
did
is
I
did
a
response
saying:
please
go
to
go.delta
dot,
io
slack.
There
is
a
delta
sharing
channel
by
all
means.
Ask
us
questions
there,
because
then
we'll
have
a
bunch
of
the
delta
sharing
folks
there
to
help
you
get
everything
up
and
running
avishek
asked
the
question:
will
the
when
will
the
delta
lake
definitive
guy
be
published,
we're
still
working
on
it?
It's
been
with
everything
that
happened
with
delta
2.0.
We
end
up
delaying
writing
because
half
of
what
we
wrote
was
changing
based
on
delta
2.0.
B
So
now
we're
actually
going
through
the
process
of
trying
to
get
ourselves
up
and
running
with
that
again
all
right.
So
those
were
the
quick
questions.
Let's
see,
there
is
a
great
question
from
dennis
or
denis
from
linkedin,
which
is:
is
there
a
known
or
good
strategy
for
often
run
vacuum
and
define
file
retention
values
depending
on
your
table,
update
frequency.
A
I
know
it's
very
much
and
it
depends
how
how
often
do
you
change
the
team,
how
much
how
how
much
of
the
the
data
gets
touched
in
each
incremental
upset
so
because,
currently,
whenever
you
update,
you
know
if
there's
a
parquet
file
that
gets
changed
as
part
of
an
update
action.
Anything
that
didn't
change
now
is
copied
into
another
parka
file
and
lo
shuffle
likes
of
vader,
the
the
delete's
gotten
better.
So
it's
now
actually
kind
of
just
lotion
for
merging,
like
if
you're
doing,
upsets
copies
the
records
that
didn't
change
separately.
A
So
you
might
protect
them.
But
then,
if
they
get
changed
in
the
next
one,
you're
still
going
to
have
multiple
copies
but
like
incremental
upsets,
do
create
lots
and
lots
and
lots
of
redundant
copies
of
data.
But
if
you're
doing
a
daily
update,
you
don't
need
probably
no
need
to
daily
vacuum.
A
You
can
put
it
away
with
a
weekly
vacuum
if
you're
doing
streaming
and
you're
updating
the
table
every
10
seconds
every
60
seconds.
Maybe
it's
actually
touching
a
lot
and
your
data
volume
just
exponentially
growing.
So
it's
always
kind
of
a.
How
much
are
you
touching
it
kind
of
how
many
times
a
day?
Are
you
updating
that
table
and
therefore
how
many
extra
copies
you're
eating
you
were
streaming
and
it
was
append
only
so
you're
not
actually
making
any
records?
Obviously,
it's
fine.
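For reference, the knobs being discussed look roughly like this — the table name and the seven-day retention are just illustrative defaults (using the notebook's spark session):

# Retention is a table property; VACUUM removes files older than the window.
spark.sql("""
    ALTER TABLE silver_orders
    SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 7 days')
""")
spark.sql("VACUUM silver_orders RETAIN 168 HOURS")  # e.g. run weekly, keeping 7 days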
B
Perfect:
okay,
we
actually
probably
only
have
time
for
two
more
questions
and
both
are
a
little
bit
of
a
doozy
one.
So
we'll
try
our
best
to
answer
these
questions.
First,
one's
from
shaman
from
linkedin.
The
question
is:
what
is
the
best
way
to
handle
configure
information
for
your
silver
layer,
batch
processing?
A
So
depends
what
kind
of
configuration
are
we
talking?
So
we're
talking
about
things
that
we'd
kind
of
correct
one
of
the
one
of
the
question
things
that
we
talked
about
earlier
is
what
comes
first,
the
table
or
the
update
state,
because
a
lot
of
the
way
that
we
used
to
build
these
things
is
the
dataframe.right
command
would
implicitly
create
the
table.
So
you
can
apply
table
properties.
A
You
can
do
lots
of
things
in
that
just
in
the
right
state
and
so
we'd
have
kind
of
various
things
built
in
to
go
and
do
that
and
then
we
switched
things
around
and
we
said
well.
Actually
we
want
to
we're
going
to
start
you
doing
a
merge,
because
we've
got
merge
now
and
delta
and
there's
an
easier
way
to
do
things
and
then
actually
so,
we'll
actually
so
table
level
properties
kind
of
need
to
exist.
Before
we
give
a
merge,
we
can't
just
run
a
merge
statement.
It'd
be
lovely.
A
So
you
have,
you
normally
have
a
separate
little
function
and
we
just
generate
some
sql
in
there
kind
of
a
create
table
as
here's
all
my
columns,
here's
my
identity,
column,
here's
my
partitioning,
all
the
all
the
relevant
information
needed
for
that,
and
we
just
do
that
as
a
separate
little
three
commands
and
then
have
it
behind
in
this
statement
going
does
the
table
already
exist
if
not
go
and
create
the
table?
Here's
my
table
other
configuration.
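A rough sketch of that check-then-merge flow — the table name, columns, and partitioning are hypothetical, and tableExists is only available in newer Spark versions (using the notebook's spark session):

def ensure_table_exists(table_name):
    # Create the target with its properties before the first merge runs.
    if not spark.catalog.tableExists(table_name):
        spark.sql(f"""
            CREATE TABLE {table_name} (
                order_id   BIGINT,
                order_date DATE,
                amount     DOUBLE
            ) USING DELTA
            PARTITIONED BY (order_date)
        """)

ensure_table_exists("silver_orders")
# ...then run the merge into silver_orders as usual.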
B
Perfect. There's probably a lot more to this — just like what Simon talked about, it depends on what the definition of "configuration" is. Since we're pretty much out of time, but I still want to at least try to tackle two more questions, I'm going to chime in and say: please join us on the Delta users Slack — I'm going to paste it to everybody here so you can just join and ask us questions there.
B
But let's dive into the final question — actually, there's one quick question I'll answer right away, and then the final question, which I think is an interesting one for us. The question is: was Z-ordering introduced in Spark 3.x, or was it there before? Long story short, Z-ordering is actually specific to Delta Lake — well, I should rephrase that slightly: there are Z-orders in other systems too, but in the context of data lakes it's not a Spark thing, it's a Delta Lake thing. Z-ordering existed within Delta Lake within Databricks itself, and we recently open-sourced it as part of Delta Lake 2.0, so now it's available for everybody to work with right away. Underneath the covers, it basically reorganizes the data to improve data skipping — I'm not going to get into that right now because we don't have much time, but we can definitely chime in on that next time if you want to dive into it.
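For reference, a minimal example of the command being discussed — the table and column names are hypothetical, and it requires Delta Lake 2.0+ or Databricks (using the notebook's spark session):

# Cluster the data files by the columns you most often filter on.
spark.sql("OPTIMIZE silver_orders ZORDER BY (customer_id, order_date)")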
B
Exactly, that's fun — I'm fine with it. In fact, all I'll do is grab that in my session, we'll just paste it there, and we're good to go. All right, this one I figured we'd want to answer because I think it's a great question to end the session on. It's from Davine: in Delta Lake, how do we handle data deletion — that is, in scenarios like right-to-be-forgotten or data consent withdrawal — when we need to keep time travel for other records? For example, if we have a thousand records and you want time travel for the last 15 days, but 10 of those records need to get deleted today and you run the vacuum as well, you basically remove all of the last 15 days of data, right? So what is the right balance, and how do you approach that problem? I've got some answers and you've got some answers — why don't you go first.
A
Yeah
I
mean
so
gdpr,
certainly
for
us
in
europe
is,
is,
is
an
issue
right
most
of
the
time.
It's
not
an
immediate
someone
rings
up
and
says:
hey,
I
want
to
be
forgotten,
get
rid
of
my
data
and
you
have
to
get
rid
of
it.
That
moment
there
is
a
reasonable
period
for
you
to
enact
that
request
now
so
talking
about
15
days
worth
of
kind
of
date
of
attention
that
should
be
covered
by
the
amount
of
time
that
you've
gotten.
A
As
long
as
you
action
that
request,
you
do
the
deletion
and
you
are
keeping
a
short
enough
period
in
terms
of
your
time
trial
for
that
to
reasonably
be
forgotten
within
the
realm
of
that.
That's,
okay,
that's
that's!
Usually
fine.
I
mean
it
depends
on
the
the
rules
that
you're
in
depends
on
the
country
that
you're
in
depends
on
the
data
protection
legislation
that
you're
actually
under
as
to
how
big
that
window
needs
to
be,
but
you
can't
get
around
it.
A
If
you
have
to
delete
that
data,
you
have
to
vacuum
up
to
that
point
and
you
you'll
lose
time
travel
ability
up
to
that
point.
There
isn't
really
much
you
can
do.
We've
seen
some
interesting
things
that
people
have
done
to
try
and
ground
it,
which
is
more
less
keeping
the
data
but
keeping
the
data
in
an
obfuscated
or
encrypted
state
and
then
storing
an
encryption
key
elsewhere.
So
you
actually
at
real
time
when
you're
querying
it,
you
use
the
aes
decrypt
function
to
decrypt
the
data.
A
Then,
if
you
delete
the
value
for
that,
that's
gone
and
then
actually
your
actual
table
still
has.
Its
title
still
has
everything,
but
no
one
has
the
ability
to
decrypt
that
data.
Therefore,
the
data
is
technically
destroyed.
If
you
cannot
reverse
engineer
what
that
data
is
because
your
key
is
gone,
it
kind
of
gets
around
the
problem,
but
it
means
you,
then
have
to
join
that
table
and
decrypt
and
it's
a
performance
issue.
So
you
can,
you
can
build
it
to
be
incredibly
robust
and
just
kill
it
immediately
using
decryption
stuff.
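A minimal sketch of that crypto-shredding idea — aes_decrypt is a Spark SQL function available in newer Spark versions, and the tables, columns, and key storage here are hypothetical (using the notebook's spark session):

# Query-time decryption: PII is stored encrypted, and each person's key
# lives in a separate key table; deleting the key "destroys" the data.
decrypted = spark.sql("""
    SELECT c.customer_id,
           CAST(aes_decrypt(c.name_encrypted, k.key, 'GCM') AS STRING) AS name
    FROM customers c
    LEFT JOIN customer_keys k ON c.customer_id = k.customer_id
""")

# A right-to-be-forgotten request then becomes a delete of the key only:
spark.sql("DELETE FROM customer_keys WHERE customer_id = 'C-001'")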
A
Or you can just increase the amount of vacuuming you're doing, build it into your processes, and actually just talk to your data protection officers and ask: is the amount of time I've got acceptable — because I've got, say, a 15-day backup of my database — and that is normally absolutely fine in terms of how those things work. One thing which is a challenge is that the log retention period also needs to be considered, because if the right-to-be-forgotten PII information ends up in your minimum/maximum statistics, that's also PII sitting in the log. So "don't take stats on PII columns" needs to be something you're thinking about — which means it needs to not be in your first 32 columns — so there are a few things you need to be careful about.
B
Yeah, and just to add to Simon's point: this is a very complicated area, and it really depends on your legal department's and engineering department's definitions of how they follow GDPR. So, for example, one pattern that I've worked with — not even one company, sorry, one pattern — is that they've identified what is deemed PII and put it into separate Delta tables; in this case it's, in essence, a demographics table, if you want to think of it that way. So basically, all the PII is placed into these three or four demographic tables. What ends up happening is that they actually will not delete — interestingly enough, they'll specifically redact. In other words, say there's ID 20 with the name Simon: what they'll do is keep the ID 20 but change the name from Simon to "redacted". The reason they do that is because there are downstream systems they don't control that they have to make sure are aware of it, so they have to keep the ID. That way they know that ID 20 was the one that was specifically redacted, and they trigger the downstream systems: oh, we got the word "redacted" — now go ahead and clean up the downstream systems as well. So even though the Delta Lake — the data lake — is completely cleaned out by that (because, in essence, there is little to no history on the demographics table; that's the whole point — the history they need is on the facts, not on the demographics themselves, so they don't care if they wipe history away from the demographics table), what they do have to do is keep the ID, so they can specifically impact the downstream systems. And this in itself, by the way, is a longer discussion.
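A minimal sketch of that redact-in-place pattern — the table, columns, and ID are hypothetical (using the notebook's spark session):

# Redact the PII columns but keep the ID, so downstream systems can
# detect the 'REDACTED' marker and clean themselves up.
spark.sql("""
    UPDATE demographics
    SET name = 'REDACTED', email = 'REDACTED'
    WHERE person_id = 20
""")

# Then VACUUM the demographics table so the older file versions that still
# hold the original PII are physically removed once the retention window passes.
spark.sql("VACUUM demographics")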
B
All right, so that's it for today. We apologize for not being able to tackle all the questions — in prototypical Simon-and-Denny fashion, we probably did rat-hole a little bit in the beginning, so apologies for that. But please do join us on the Delta users Slack; you can ask your questions there, where both Simon and I are pretty regularly anyway, and we'll be here next month as well. So, without further ado, we're going to end today's session by going ahead and doing our little splash screen. Anything else to add before I do that, Simon?