From YouTube: Delta Hack: Introduction to Delta Lake
Description
Apache Spark is the dominant processing framework for big data. Delta Lake adds reliability to Spark so your analytics and machine learning initiatives have ready access to high-quality, reliable data.
A: All right, welcome everybody. This is the first session of Delta Hack 2021. Delta Hack is a little hackathon that Denny and I organized to help grow the Delta Lake open source community. We also thought it would be something fun to do after everybody recovered from Data + AI Summit, which Databricks hosted a couple of weeks ago. We're going to get this session started with an introductory talk about Delta Lake with Denny. If you've got any questions, throw them in the chat and I'll interrupt Denny and ask them along the way. Otherwise, Denny, take it away.
B: Thank you very much, Tyler, I really appreciate the introduction. Just like Tyler noted, it's been a crazy few weeks post-Summit, but we got all excited because we had a lot of questions and a lot of asks about Delta Lake, and that was the brainchild behind this Delta Hack. Thank you, Tyler, for pushing this along. As the title says, for this introduction to Delta Lake we're going to start off with a real quick, high-level "hey, how are things going?"

B: Here's the quick context about data reliability for data lakes. That's what today's initial session is. Don't forget that Tyler, later on, is going to be showing a Delta hack himself, where he goes ahead and codes away, and then we also have Steven Yu coming on board to show you how to create your Delta table in a short amount of time as well, in under 10 minutes. The session's 30 minutes, but he'll actually do the table part of it in 10 minutes.
B: So, just like Tyler noted, if you've got questions, please drop them in the chat and he'll interrupt me. It's probably a worthwhile endeavor to interrupt me anyway, so not a big deal. And so let's start with the key context here on what the promise of the data lake is. To provide a little context, my background is that I'm originally a database guy. I used to be part of the SQL Server team at Microsoft, and yet I had shifted over to Hadoop.

B: A little more background: I was actually on the nine-person incubation team, known as Project Isotope, that introduced what is now known as HDInsight, and so brought Hadoop into Microsoft. So we loved this concept of a data lake, even though I was a SQL Server person.
B: Originally, this idea of a data lake, being able to hold semi-structured and unstructured data and all these other things, was really, really great. It allowed us to collect everything, and that was the whole point: it wasn't just about structured data anymore. I could collect your streams, your sensor data, semi-structured email, videos, whatever else. I just store all this in a data lake, and I'm good to go and I'm happy. And then, because it's all in one place, I can go ahead and run my data science and machine learning, doing things like recommendation engines, things of that nature. So I'm happy.

B: All right, so within the realm of Hadoop I should have been happy, because I've got this data lake, or, for that matter, cloud object stores, whether it's S3 or Blob Storage or GCS or whatever else. The idea is that I could just chuck everything there and I'm good, right? So I'm happy, right? Well, not exactly. It's the old adage of garbage in, garbage out. Basically, yeah, we collected everything, including a lot of garbage, which means garbage in and then garbage stored, which means your data science, machine learning and things of that nature aren't actually terribly good.
B: So, in other words, the output that you got from your AI, from your data science and machine learning, wasn't high quality, because the data that you had wasn't high quality to begin with. So it doesn't really help out that much. (Oh, I just realized I should have turned off my video; that way you can see the garbage out. All right, there we go, perfect.) And so, what does a typical data lake project look like? Forget about Delta Lake just for a second; let's just talk about what a data lake project looks like. Well, the evolution of a cutting-edge data lake often looked like a lambda architecture. Now, mind you, we're talking about the lambda architecture specifically for data warehousing-style or OLAP-style processing. So you've got your events going into Kafka, or into your other system, whether it's Kinesis or Event Hubs or whatever else; I'm just using Kafka as the open source example.
B: For the sake of argument, let's say you then use Apache Spark with Structured Streaming, so I can go ahead and do my streaming analytics. That's awesome: I can process the data in real time, or close to real time, see what's going on, and see the trends happening immediately. That's often good for things like real-time advertising analytics and things of that nature.

B: Well, what if I wanted to look at historical queries? The whole context of using something like Structured Streaming to process the data for your streaming analytics was this concept of saying: okay, I'm actually not storing all of the data, I'm looking at the data as it comes in. That's all I'm doing. I'm not actually trying to look at all the data, because if you're trying to look at all the data of all time, that's terabytes or petabytes of data, so you're not going to be able to do that. What I'm doing with Structured Streaming is just asking: what's the current set of data that I have? And so, if I want to do historical queries, what I will have instead is a data lake, which stores all of the old data; that's where we're back to the terabytes or petabytes of data. And so that's what the beginning of this lambda architecture was.

B: You're basically splitting it off. I'm going to multicast my data out from Kafka, in this example, and I have one engine that's processing it for streaming purposes and one engine that's processing it for historical purposes. You can do that, but it starts getting complicated because of the maintenance that goes with it. (I'm just turning the camera back on so you can see my expressions.) You can tell this starts getting a little bit more complicated, to say the least. This is the context of the lambda architecture I'm referring to here. And so now I've got two streams: one for streaming, one for my historical data.
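As a minimal sketch of the streaming leg of that lambda architecture in PySpark (the broker address and topic name are assumptions for illustration):

```python
# Read the current events from Kafka and compute per-minute counts,
# without ever storing or scanning the full history.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, window

spark = SparkSession.builder.appName("streaming-leg").getOrCreate()

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "events")                        # assumed topic
    .load())

# A near-real-time trend query: events per one-minute window.
per_minute = (events
    .groupBy(window(col("timestamp"), "1 minute"))
    .agg(count("*").alias("n")))

(per_minute.writeStream
    .outputMode("complete")
    .format("console")
    .start())
```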
B: Well, how about when you get messy data? This initial concept works really well if you have clean data, and I'm sure Tyler, and anybody else that's on today's session here, could vouch for the fact that, especially in production, you don't have clean data. Your data has issues: cleanliness, corruption, bad code, all of the above. So you need to figure out how to run a validation step. Basically, as your stream is running in the lambda architecture (the top part here), it's going ahead and checking, maybe doing some numeric validation like the counts per minute or per hour or whatever else, and you also validate that against your data lake as well, or at least against the job that's loading into your data lake. That'll allow you to validate that the numbers actually match each other really well. Okay, so that's completely cool.
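A hedged sketch of what such a validation step could look like (both table names and their columns are assumptions for illustration):

```python
# Compare hourly row counts between the batch load in the data lake and
# the counts recorded by the streaming side; any divergence is flagged.
lake_counts = spark.sql("""
    SELECT date_trunc('HOUR', event_time) AS hr, count(*) AS n
    FROM lake_events
    GROUP BY 1
""")
stream_counts = spark.table("stream_hourly_counts")  # assumed: columns hr, n

mismatched = (lake_counts.alias("l")
    .join(stream_counts.alias("s"), "hr")
    .where("l.n <> s.n"))

if mismatched.count() > 0:
    raise ValueError("streaming and data lake counts do not match")
```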
B: All right, so that's this validation concept. Well then, how about if there are mistakes and failures in the processing of this data? Because even if your validation steps work well, you're going to need to reprocess data at some point, because there were mistakes in the code, there were failures in the system outright, or, for that matter, the data just changed on you and you weren't expecting that.

B: So, even if the code was perfect for this point in time, a data lake, or any data system for that matter, is constantly changing. It's a flowing system. It's not one of those systems where you can just be static. Even back in the old data warehousing days, your data warehouse never stopped changing, and you never stopped updating the schema; things are going to change. Same context here with a data lake, except that context is even more on steroids, because you have that much more data to work with. So I'm going to have to reprocess my data. For example, I update my Apache Spark jobs, and I partition the data by time so that I can reprocess specific partitions. Okay, perfect. Well, that eases things, and I can certainly go do that.
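As a rough sketch of that pattern (the paths and the fix-up function are assumptions): because the table is partitioned by date, one bad day can be recomputed and overwritten in place.

```python
# Reprocess a single date partition: recompute it from the raw source
# with the corrected job and overwrite only that partition.
def fix_up(df):
    # hypothetical corrected transformation logic
    return df.dropDuplicates(["event_id"])

day = spark.read.parquet("/raw/events/date=2021-06-01")
(fix_up(day).write
    .mode("overwrite")
    .parquet("/lake/events/date=2021-06-01"))
```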
B: This reprocessing will take care of this part of the context. But how about if I have to update the data? In other words, everything was working fine, but the upstream system is saying: hey, wait a minute, the data that came in actually isn't quite right. Originally I told you yesterday we sold 20 widgets, but now it's actually 30 widgets, or whatever.

B: So, in other words, I need to be able to update the data. Prototypically, a data lake is very much an append-only system: you're just constantly adding data, you're not trying to actually update it. But the reality is, if you want to make it useful, make it powerful, you're going to actually need to update some of the data at some point, because that's what's going to be reported on. You could always try to make it additive by doing a negative; it's an old data warehousing trick where we wouldn't update the data. We would just simply say: oh, 20 widgets today. Oh, you actually only sold 10 widgets? Okay, well, we'll just add a negative 10 widgets, and that way we can "update". So sure, you could do that or whatever else, but the point is that you do need to update the data. And then, even if you could update, you have to schedule it correctly so you can avoid modifications to your reports or the downstream systems while they run. So, like I said, lots of fun.
B: Well, that's absolutely more of the context. The context is that, as you productionize your systems, the reality is this diagram does get much more complicated, and that's more or less the issue. I'm guilty of this too: when we initially built Hadoop on Windows, and then Hadoop on Azure, which was part of my team, we were thinking to ourselves, yeah, we'd be good.

B: All of these challenges, in one form or another, reared their ugly heads, whether it was for streaming data or the fact that I had to reprocess the data. The real work wasn't going in and saying I could store everything; the real work ended up being that now I have to build all of these systems around it to compensate for these challenges. So, exactly to your point, Tyler.
B: Okay. If I was to take these problems, and all the other problems we have there, there are some key generalizations (this is probably too much of a generalization, but still). The first is no atomicity, which means that failed production jobs are going to leave data in a corrupted state, and it becomes really, really complicated to recover. This is for many of you who have ever run any production job; I don't care on what system.

B: The problem with that approach is: what happens when one of the 20 tasks that the job runs fails? Well, in your data lake, whatever distributed system is running, it's constantly writing files to storage. Again, it doesn't matter what storage you're talking about, whether it's cloud object storage or HDFS: it's writing stuff down. It can't keep everything in memory, and since you can't keep everything in memory, you've got to write it to storage.

B: Well, if the task fails, what ends up happening is there's a bunch of files that have now been written to storage. If the job failed, the system just shuts itself down, you lose network connectivity, whatever happens: there's a bunch of files that have been written to storage. That's not good, because now you've got these orphan files, or straggler files, sitting in storage with your good data, and you're going, oh great.
B: So if I query that table, or for that matter that data in the file system, I don't know which files are actually good and which are bad. And so this is the real importance of this context of atomicity.

B: Next, there's no quality enforcement. I sort of implied this when I was talking about the challenges of your data lake: the problem is that I'm just dumping the data in. This is an old context that we used to advocate for, which I'm guilty of as well, by the way, which is schema-on-read. I don't have to worry about what the schema looks like; I'm just going to put the data down, and then I'll define the schema when I read the data. So I don't have to worry about it, and I can get the data written to my data lake as fast as I can. So that's great.

B: But let's just say the schema of the data is supposed to have X number of columns, and of those X columns, this many of them are integers versus strings or whatever else. I need to be able to enforce that type of quality.
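A small sketch of what that enforcement looks like once you have it, here with a Delta Lake table (the path is an assumption): an append whose schema doesn't match is rejected instead of silently landing bad files.

```python
# The second write changes the type of `id`, so Delta rejects the append
# rather than corrupting the table.
ok = spark.createDataFrame([(1, "click")], ["id", "event"])
ok.write.format("delta").save("/tmp/delta/events")

bad = spark.createDataFrame([("oops", "click")], ["id", "event"])  # id is now a string
try:
    bad.write.format("delta").mode("append").save("/tmp/delta/events")
except Exception as err:
    print("append rejected by schema enforcement:", err)
```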
B: And then there's no consistency and there's no isolation. Because there's no consistency or isolation when it comes to your data lake, it basically becomes nearly impossible to mix appends and reads, batch and streaming. Remember, now I'm basically talking in the realm of databases all over again, because you'll notice that what I've basically done is talk about ACID: atomicity, consistency, isolation and durability. For those of you who are not database folks, which is perfectly cool, this concept is that what was great about databases, and what made them the flavor du jour when it came to business data processing, was this idea that we had this one system, the database, that allowed us to ensure that if two clients were trying to write to the table at the exact same time, one trying to do an update, one trying to do a delete, nothing would conflict.

B: Well, this is even harder when it comes to a data lake, because you're going to have not just lots of concurrent batch jobs, but also streaming jobs, trying to write the data at the same time. So you've got a Structured Streaming job that's writing data into that file system every second or so; meanwhile, I'm also trying to go ahead and say: oh, maybe I need to update a value, because the widget number was originally 20 and it really should be 10. Well, how do you ensure, if there's nothing locking the file system, that what the streaming job is doing and what the update is doing aren't going to conflict with each other? That's what databases were really good at doing. And so the context, or what I'm implying very strongly, is that, man, it'd be pretty cool if we could provide ACID consistency to my data lake.
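That is the guarantee Delta Lake, introduced next, provides. A hedged sketch (paths, the `widget_events` stream and the column names are assumptions):

```python
# Both writers commit through Delta's transaction log, so the streaming
# append and the batch update are isolated from each other: a genuine
# conflict surfaces as a clean error rather than as corrupted files.
from delta.tables import DeltaTable

(widget_events.writeStream          # assumed streaming DataFrame
    .format("delta")
    .option("checkpointLocation", "/tmp/ckpt/widgets")
    .start("/delta/widgets"))

# Meanwhile, a batch correction: the count was recorded as 20, should be 10.
(DeltaTable.forPath(spark, "/delta/widgets")
    .update(condition="widget_id = 42", set={"sold": "10"}))
```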
B: That way I could go ahead and prevent failed production jobs from corrupting my system, enforce some form of quality, and have consistency and isolation. And that's the introduction to Delta Lake: that's what Delta Lake does. The reason I called the session this, and if you look at other sessions you'll notice they're called things like "Making Apache Spark Better with Delta Lake" or "Bringing Data Reliability to Your Data Lake with Delta Lake", yada yada, is that this whole context is about saying that I can bring ACID consistency protections, transactional protections, to my data lake. That way, all the problems we were challenged by, all the issues that we just talked about before, get resolved automatically for you.

B: And look, we can go back and look at the challenges again. Remember, for example, back to the details: it's a lambda architecture, we have the validation, we have the reprocessing and the updates. Okay, well, in reality it's not that simple!
B: It's actually even more complicated, because each one of those four diagrams is basically impacted by the arrows that you see here. You don't just have one Kafka stream; you have a Kafka, a Kinesis, you've got your batch data from other sources, you've got other Kafka streams. And so what I'm alluding to here, by the way, with the color coding, is that you really have a bronze/silver/gold data quality framework that you're trying to create.

B: So first you have the bronze. That's the initial dump of data from your different sources: your Kafka, your Kinesis, data like CSV, JSON, text, whatever else. Then you're going to transform that data into a silver level, and silver means, from a data quality perspective, that I'm going to augment it, I'm going to filter the data, and I potentially will do joins to it. The point is: now here's the data on which I am enforcing the schema and cleaning it up. Sure, I have some minimal enforcement at the bronze level, but the point is that with bronze I'm still just trying to get the data in, in that context of "oh yeah, I really care about getting the data into my data lake as fast as I can". But the silver allows me to say: okay, I'm going to do a bunch of tasks to clean up that data before I hit gold.
B: The gold data set is basically your feature engineering for your machine learning, or your aggregates for your BI tools. It'll aggregate or summarize the data, or whatever else, for the purpose of streaming analytics and AI reporting. So you want to be able to create your data pipelines to take that data from bronze to silver to gold, and ensure that if any errors occur throughout the system, at any one of these arrows, you can say: okay, no problem. I can restart the process again, and I can trust that what I write to a subsequent system downstream will in fact be solid. And so that's the key thing. I wish it was this simple, and I'm being very facetious, I realize, but remember: this is just for one Kafka source and the things that connect to it. In reality, it's all of this. So I really need that type of consistency to protect the data as it progresses.
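As a compact sketch of that bronze-to-silver-to-gold progression with Delta tables (paths and column names are illustrative assumptions):

```python
from pyspark.sql.functions import col, sum as sum_

# Bronze: land the raw data as fast as possible, minimal enforcement.
raw = spark.read.json("/landing/events")  # assumed raw source
raw.write.format("delta").mode("append").save("/delta/bronze/events")

# Silver: filtered, cleaned, augmented; schema enforced by the table.
silver = (spark.read.format("delta").load("/delta/bronze/events")
    .where(col("event_type").isNotNull())
    .dropDuplicates(["event_id"]))
silver.write.format("delta").mode("append").save("/delta/silver/events")

# Gold: business-level aggregates for BI and feature engineering.
gold = (spark.read.format("delta").load("/delta/silver/events")
    .groupBy("customer_id")
    .agg(sum_("amount").alias("total_spend")))
gold.write.format("delta").mode("overwrite").save("/delta/gold/spend")
```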
B: So that's the medallion architecture, and Delta Lake itself: you have full ACID transactions, and you can focus on the data flows instead of worrying about your failures. The key context also for Delta Lake is that we're open standards and open source; we're part of the Linux Foundation. You may or may not be able to see it, because the coloring scheme is a little off here, but there's actually a Linux Foundation logo there. And so the key context is: you can store petabytes of data without worries of lock-in.

B: It's a growing community, including Presto, Spark and more. Michael Armbrust announced Delta 1.0 during one of his keynotes at the Data + AI Summit; I probably should have slipped that slide in here, because there are now so many different systems. Hats off to Tyler and the rest of the Scribd and Back Market team, who introduced the Delta Rust API with its Python bindings and upcoming Ruby and Golang bindings. So Delta does not require you to use Spark at all to work with Delta Lake. That's the whole point: it's an open source protocol, right on GitHub, so you can go ahead and see exactly how it works, and there are more and more partners daily that are integrating with Delta Lake.
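For instance, here is a sketch of reading a Delta table with no Spark at all, via the `deltalake` Python package built on those Rust bindings (the table path is an assumption):

```python
# Read a Delta table without Spark, using the delta-rs Python bindings.
from deltalake import DeltaTable

dt = DeltaTable("/delta/silver/events")  # assumed table path
print(dt.version())   # current version in the transaction log
df = dt.to_pandas()   # materialize the table as a pandas DataFrame
```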
B: Open source or commercial, it doesn't matter: because we are working with open standards, and because it's open source, you're never going to get locked in to any particular vendor by using Delta Lake. Quite the opposite: using Delta Lake ensures that you can use whichever vendor you want. And yes, at the same time as I call out that you don't need to use Spark, it obviously is powered by Spark as well.

B: This would really require a much longer conversation, but what it boils down to is the context of exactly-once semantics. Structured Streaming brought you most of what was needed, but ACID transactions were the final step that allowed exactly-once semantics for your system, end to end. So it really allowed you to unify your streaming and your batch, and it allows you to convert existing jobs with minimal modifications; I'm actually going to talk about an example of that shortly. That's what's really cool about this.
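A minimal sketch of that end-to-end pattern (paths and the `events` stream are assumptions): the Structured Streaming checkpoint plus Delta's transactional commits mean a restarted job neither drops nor duplicates records.

```python
# Exactly-once, end to end: checkpointed Structured Streaming writing
# into a Delta table.
(events.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/ckpt/bronze_events")
    .start("/delta/bronze/events"))
```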
B: Right, like I said, these are the data quality levels: your bronze, which is your raw ingestion from all of these different sources; then it switches to silver, which is your filtered, cleaned, augmented data; and then you have your gold, your business-level aggregates, such that you can do your streaming analytics and AI reporting. So it massively simplifies all of those challenges, and that's more or less the key call-out.
A: I do have a question. When you're thinking about this move from bronze to silver to gold, in your experience, or as you look at designing Delta tables, is that sort of less structured to more structured data? Because I know I've seen some Delta Lake users that are basically using that bronze level for "here's some raw data and some metadata around our ingestion", and they dump that into bronze. And so they go from, not necessarily a quality change, although there are quality changes along the way, but they go from less structured.
B: I think the less structured to more structured progression definitely can come into play, and, depending on how you look at it (now we start getting into almost a philosophical debate), in a lot of ways, from the standpoint of AI, reporting and streaming, your data is "higher quality", quote-unquote, when it's structured. Now, I'm not trying to knock semi-structured or less structured data, quite the opposite: I actually like it. But I'm also saying that for most analytics, AI reporting and data science feature engineering, they basically need the data to be structured. So from the standpoint of that type of reporting or those aggregates, the less structured to more structured progression is also a data quality context. Does that make sense?

A: Yep, got it.

B: Perfect. Okay, and so, just to finish up with the silver concept: like we said, filtered, cleaned and augmented, but remember, there's going to be intermediate data with some cleanup applied, and it's queryable for easy debugging. I often see customers, and I'm sure Tyler would have asked this question too.
B: It's less about saying you must build it in a specific way, and much more about saying you are going to think about your data quality progression; you're going to think about your semi-structured, or less structured, to more structured progression. So, for example, if I was to pull back to my data warehousing days, the old days, I would say: oh yeah, it's like your old OLTP staging to your data warehouse. And we all knew, even back in those days, that it wasn't one size fits all. We did it because we wanted this concept of an OLTP dumping ground, the staging area, where we basically filtered, cleaned and augmented, and maybe joined into third normal form data, and then our data warehouses held the business-level aggregates that our business could query against.

B: But how you did that, and what the details were, really depended on your business. So we're not trying to advocate that you must design it a specific way; we're advocating for the idea that you must think about your data quality, as opposed to thinking that you could just dump the data into your data lake and quality magically happens. Because that's where the real work is. The real work is not so much getting the data into that bronze; the real work is: how do I go ahead and cleanse the data, and then reprocess the data if I need to?
B: And that's part of the reason why the problems exist: if I need to go back and recognize that, let's just say, the last two months of data were wrong, or need to be updated because of new business rules, I can basically delete that date range from my silver and gold, go back to bronze, say "all right, pick up the process from there", kick it in, and reprocess the last two months of data, plus any new data, using these new business rules.
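A hedged sketch of that two-month reprocess with Delta (dates, paths and the rules function are assumptions): recompute the window from bronze and transactionally replace just that slice of silver.

```python
# Recompute the affected window from bronze and atomically replace only
# that date range in the silver table, using the new business rules.
def apply_new_rules(df):
    return df  # hypothetical new business logic

recomputed = apply_new_rules(
    spark.read.format("delta")
        .load("/delta/bronze/events")
        .where("event_date >= '2021-05-01'"))

(recomputed.write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", "event_date >= '2021-05-01'")
    .save("/delta/silver/events"))
```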
B: And, like I said, that brings us to gold: basically clean data, ready for consumption, read with Spark or Presto. Presto, by the way, already works with the manifest files, but coming soon is the ability to not just read the manifest file, but to actually read the transaction log within Delta Lake and query it.
A: On that "coming soon", by the way, since we are talking about hacking on Delta Lake this week: which repository is that coming soon into?
B: Thanks, thank you very much. Perfect, all right. So let me finish off with gold: streaming through to Delta Lake, low latency or manually triggered, which eliminates the management of schedules and jobs. What do I mean by that? Let me talk about that. I'm already running a little over time, so let me just finish up with some quick call-outs. Basically, Delta Lake allows you to do inserts, deletes, merges and overwrites.

B: I should have removed the "standard DML released in 0.3.0" note, because we're on Delta Lake 1.0 now, but the whole point is that you can actually run standard DML, which is helpful for retention, corrections, GDPR. So, back to that reprocessing context I just talked about: that's more or less the point. If I need to go back in time because of business logic changes or whatever, I can just clear the tables, clear the partitions involved, restart the streams, and I'm back up and running.
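A few one-liners sketching that standard DML through Spark SQL (table and column names are assumptions for illustration):

```python
# DELETE for retention/GDPR erasure, UPDATE for corrections,
# MERGE for upserting a batch of corrections into the table.
spark.sql("DELETE FROM events WHERE user_id = 'gdpr-request-123'")
spark.sql("UPDATE widgets SET sold = 10 WHERE widget_id = 42")
spark.sql("""
    MERGE INTO widgets t
    USING corrections s
    ON t.widget_id = s.widget_id
    WHEN MATCHED THEN UPDATE SET t.sold = s.sold
    WHEN NOT MATCHED THEN INSERT *
""")
```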
B: There are lots of organizations using this. I'm going to skip past this slide, just because the Data + AI Summit keynotes are more up to date; I would definitely watch those. There are thousands of organizations, with exabytes of data processed per day. But the one thing I did want to call out there: Comcast had a Data + AI Summit session, from I believe two or maybe three years ago, called "Sessionization with Delta Lake". I can't do it justice in 30 seconds, but the key context about their sessionization of data, in this case from the remote that you have with your Comcast box, is that because of Delta Lake they were able to improve the reliability of their petabyte-scale jobs, which is awesome. And because they were able to improve that reliability, they were able to run with 10x lower compute: in other words, instead of having 640 machines, they got it down to 64, which is pretty amazing. And because they were able to not just do everything in batch, but combine streaming and batch, they were able to go from 84 jobs down to three, and the data latency is significantly smaller, about half. So just by switching to Delta Lake, and following this context of a data quality framework, they were able to improve reliability, use 10 times less compute, and get easier maintenance.
B: So this is why we love Delta Lake, and why we advocate for Delta Lake: this idea that I can have data reliability, and that translates into so many other things: easier jobs, faster performance, lower compute. That's, again, why you hear us talking about it. And then, how do I use Delta Lake? Here's the last slide for us. If you want to get started with Delta Lake, you can add it as a Spark package, or you can use it via Maven. You'll also notice that you can do a pip install, and that's for use with the Spark APIs. Don't forget, you can also use it with the Rust API: for that, it's pip install deltalake, and then it's pip install delta-spark if you want the Spark APIs. And if you're using Spark and you're so used to using Parquet, that's great: simply change dataframe.write.format("parquet") to dataframe.write.format("delta"), and boom, now you're using Delta.
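That one-line switch, as a minimal sketch (the DataFrame and path are assumptions):

```python
# Launch Spark with Delta available, e.g. one of:
#   pyspark --packages io.delta:delta-core_2.12:1.0.0
#   pip install delta-spark
# Then the only change from a Parquet pipeline is the format string:
# df.write.format("parquet").save("/data/events")   # before
df.write.format("delta").save("/data/events")       # after: an ACID table
```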
B: And I think that's it for my session today; I ran a little bit over.
A: Well, thank you very much for joining us, Denny. For anybody that's interested in joining Delta Hack: if you just search for Delta Hack 2021, you'll find the Devpost site. You can join us in our Slack channel: if you head to delta.io, there's a Slack button at the bottom, and if you've just joined Delta Hack, I think our channel is called deltahack2021 there as well. And if you've got any questions, join us on Twitter, or Slack, or the delta-users mailing list. In about an hour or so, we'll welcome back Steven.