From YouTube: D3L2: The Journey Unifying Data Lake and Data Warehouse with Robert Kossendey at Claimsforce
Description
In this D3L2 episode, we chat with Robert Kossendey, Tech Lead at Claimsforce, about their journey unifying the data lake and the data warehouse. As Robert's team builds and expands, they chose Delta Lake and AWS Athena as the foundation for their lakehouse.
Quick Links
Read Our Newest Blog Post: https://delta.io/blog
Robert Kossendey: https://www.linkedin.com/in/robert-kossendey-303b0019a/
Denny Lee: https://www.linkedin.com/in/dennyglee/
Join us on Slack: https://go.delta.io/slack
Join the Google Group: https://groups.google.com/forum/#!forum/delta-users
Denny: Welcome to the new session of D3L2. Yeah, that's the name of this, right? Yes, it is. We're going to be talking to Robert here from Claimsforce about their journey of going from a data warehouse to a data lake using Delta Lake. We've still got a few minutes, so Robert, why don't you just introduce yourself a little bit, and then we'll do the formal introductions once we get the LinkedIn and YouTube streams going.
Robert: Sure, yeah. Hi, I'm Robert, the tech lead at Claimsforce. We are a small InsurTech startup based in Hamburg, Germany, and I'm leading the data team at our company; I'm responsible for the data architecture. I've been there for more than three years now. I started as a working student, transitioned into the data engineering role, and then also started leading our data team, or rather became the technical lead for our data team. Happy to be here.
Denny: Okay, all right, perfect. It looks like we are live on YouTube and live on LinkedIn. So thank you everybody, welcome aboard. All right, now we're going to do our official introduction. Give me a second; let me just type the welcome aboard into all our different channels. And I believe there's actually a Twitch stream going too, if I recall correctly. Yeah, that's pretty sweet, eh?
Denny: Yeah, so I believe Twitch is going; at least, we'll find out whether we're doing it or not after the fact, which is usually how that ends up happening. But saying that, welcome aboard again, everybody. This is a session of D3L2, The Journey Unifying Data Lake and Data Warehouse with Robert at Claimsforce. I want to have Robert introduce himself, but I did want to do a little housekeeping first. I just want to let you all know that there's the Data Council Austin conference, Data Council Austin 2023; that's in, well, Austin, Texas, from March 28th to March 30th. A bunch of us are going to be there, so if you want to look at the latest when it comes to data engineering and data science, make sure to show up at the Data Council Austin conference. All right, perfect, I did my little housekeeping callout.
Denny: Oh yes, and yes, we are on Twitch. The Twitch channel, by the way, is twitch.tv/deltalakeoss, just to let you know. All right, now, saying all that, Robert, why don't you introduce yourself more formally to this audience? Give people a little bit about your background and we'll go from there.
Robert: Yeah, sure. So, hey, I'm Robert, the tech lead of the data team at Claimsforce. We are a small InsurTech startup based in Hamburg, Germany. I joined Claimsforce more than three years ago, starting as a working student. I came directly from university; I studied computer science and business at the University of Applied Sciences in Hamburg, and because I had an interest in, or a focus on, data during my studies, I transitioned into the role of data engineer at Claimsforce and then also took leadership of our tech team in data. I'm now responsible for our whole data architecture and the data engineering department, plus BI and data science. Yeah, that's me. Thank you for the invite today; I'm happy to talk about our journey from the data warehouse and data lake to the lakehouse, finally.
Denny: Perfect. So actually, let's go backwards a little bit. You said you started off in Hamburg, going into university there, so I want to understand a little bit more about your background. We all have slightly different ways of getting into the data space, right? For example, I myself was originally going to go into gaming, then made this really weird turn into this field, and now I'm in data.
Robert: To be honest, I applied for computer science and business just out of curiosity; I've always been curious about working with computers. I didn't really know what to do with the degree, essentially, but I knew I wanted to do something with computers. During my last year at university I attended some NoSQL and big data lectures and really got interested in distributed computing, learning all about the Hadoop ecosystem, Spark, all the good stuff, and this led me to an interest in data itself. There was also a little bit of data science I did during my university time, which further grew my interest in data. So yeah, it kind of came naturally; it's not like I started my studies with the goal in mind of becoming a data guy.
Robert: No, no, we did a lot of Java at university, but also Python. The main part was Java, but once you specialized, the lectures, at least the data ones, were primarily Python.
Robert: Sure. We are a startup that focuses on insurance claims, so we provide software for participants in the insurance market. Let me give you an example. Let's say you have a water leakage claim at home. If it's a bigger one, then probably someone needs to come to your home: a so-called expert who assesses the claim. We provide software for those experts, and also for the clerks and back-office employees at insurers who manage those experts. So we help them do the disposition, we help them assess the claim correctly, providing them tools for measuring, for assessing the claim essentially, and then also for writing a report that states the claim damage and how much needs to be repaid to the policyholder.
Denny: Oh, that makes sense. So basically, in terms of data speak, you've got a lot of data that comes in with the initial claims from the original user, and you need to figure out how to process it. You provide data that allows the adjusters to determine what the actual claim is supposed to be worth, and you aggregate it so you can do financial reporting from both an individual perspective and an aggregate perspective. That way, as a startup, you provide that software as a service to everybody, so you can process all this information, yet at the exact same time make it as streamlined and as efficient as possible, so that folks aren't wasting money on the process.
Denny: So, basically, okay, wow, that's eye-opening, oh my goodness. Okay, all right, so no wonder you guys exist, because there's obviously a lot of things to optimize for. Okay, cool. Well, okay, sorry, I digress for all the folks there, but that just blew my mind. All right, let's talk a little bit about this. In terms of the data space, you obviously have tons of data coming from all these different locations. Why don't you describe the original state? Before we do, to provide context to everybody (and we're going to add this to the LinkedIn and YouTube links afterwards), Robert has actually written three really cool blogs about that journey, and so this is really a discussion based on those three blogs, in all seriousness. So why don't you start telling us about that original state, back when this all started, when you had this data warehouse? What did it look like? What were the sources, the sizes? What were your issues with processing, yada yada yada. I'm just curious.
Robert: Yeah, so the original state was that we actually didn't really have a data architecture at all; we basically just dumped data into S3. I joined Claimsforce very early on. The startup is not even five years old now, so I joined very early, and back then we basically just dumped our production data, because we used DynamoDB, and doing analytical queries on DynamoDB is practically, well, not possible at all. So we dumped our data into S3 and used Glue crawlers so we could use Athena to query the data, and this turned out to be very impractical because, first of all, we didn't have any schema enforcement there. So we decided to team up, or partner, with AWS Solutions Architects, and they helped us build our initial architecture, which consisted of a raw landing zone in S3, with the data then loaded into Redshift for further processing. So that was the initial architecture we had: basically the raw zone in S3, where you could do ad hoc queries via Athena, and then Redshift for further processing.
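For readers following along, the "no schema enforcement" pain Robert describes is exactly what a table format later solved for them. Here is a minimal sketch of Delta Lake's write-time behavior; the table path and columns are made up for illustration, and a Spark session with the Delta extensions is assumed (a configuration sketch appears later in this transcript):

```python
# Assumes `spark` is a SparkSession with the Delta Lake extensions enabled.
path = "s3://my-lake/silver/claims"  # hypothetical table path

spark.createDataFrame([(1, "water")], ["claim_id", "type"]) \
    .write.format("delta").mode("append").save(path)

# A frame whose schema doesn't match is rejected at write time instead of
# silently landing as a mismatched file, which is what raw S3 dumps allowed.
try:
    spark.createDataFrame([(2, "fire", 9.5)], ["claim_id", "type", "oops"]) \
        .write.format("delta").mode("append").save(path)
except Exception as err:
    print(f"Rejected by schema enforcement: {err}")
```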
Denny: Gotcha, gotcha, gotcha. Okay, so you provided a ton of context. So then I guess the question is, from your perspective, what were the issues, the problems with your initial setup? You mentioned that DynamoDB was problematic for those types of queries; that's a fair assertion. I'm just curious what else there was, basically.
Robert: We had schema-on-read, so we could just dump our production databases and query them ad hoc, which was great, right? And the data lake supported all data formats, so we could also dump the ton of photos and video footage we have there. But it also came with some disadvantages. Running queries on raw CSV, JSON, or even Parquet files with Athena gives you fairly low performance, and we didn't have any ACID transactions there. It was also only ever a raw landing zone, because we were only able to do append-only operations; there were no merge or upsert capabilities on top of S3. That's why we then had Redshift as our data warehouse, but that also came with some disadvantages, mainly maintaining two different places, two different storage systems.
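For context, the merge and upsert capability Robert is describing is what Delta Lake provides on top of plain S3 objects. A minimal sketch in PySpark; the table path, update source, and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder
    # Register Delta Lake's SQL extension and catalog, required for MERGE.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical: an existing Delta table of claims plus a batch of updates.
claims = DeltaTable.forPath(spark, "s3://my-lake/silver/claims")
updates = spark.read.format("json").load("s3://my-lake/raw/claim-updates")

# Upsert: update matching claim rows, insert new ones, in one atomic commit.
(claims.alias("t")
    .merge(updates.alias("u"), "t.claim_id = u.claim_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```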
Robert: Yeah, exactly, exactly. This leads to data staleness, first of all, because you essentially have two steps, right? And Redshift is really, really expensive, at least in our experience, and you are not really able to scale storage and compute independently. Back then there was no Redshift Serverless, which maybe would have helped us mitigate some of the costs, but still, it was very painful to work with Redshift. And we really liked the Athena interface; we really liked just being able to query the data on top of S3. But we didn't have the capability to move all of the workload that was performed on Redshift to the data lake, because of those disadvantages I mentioned: not ACID compliant, no merge.
Denny: Makes a lot of sense. Oh, by the way, let me just pop in a question one of the attendees asked. The question is about Claimsforce's services, so this is back to the business side, but I apologize for missing the question; it was asked eight minutes ago, so again, my apologies. For what you're doing, is this Property and Casualty insurance industry claims?
Robert: Yeah, yeah, it's only P&C. We only focus on P&C; we're not in the health business.
Denny: Which we're not going to get into today. All right, perfect. So, was it because of the problems you were facing, like data duplication, costs, complexity of pipelines, that you knew you wanted a lakehouse per se? I'm just curious how you got to the point where you knew what you wanted. Was it early on, or not?
Robert: It was pretty early on. We quickly realized, also because new requirements from our customers surfaced, for example real-time data, which gets really difficult with Redshift, that this architecture was not meant to be there for eternity and we needed to come up with a different solution. So then I ventured out and looked for alternatives, pretty quickly stumbled onto the lakehouse approach, read the white paper, and went on from there.
Denny: Perfect, and the white paper, I'm presuming, is the one by Michael Armbrust et al., "Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores"? Exactly. Okay, perfect.
Denny: So for everybody there, I'll send a link and include it in all of our different chats. Sorry, hang on, I'm just trying to send it to everyone; there we go, all right. The context for everybody, basically, is that this particular paper is about Delta Lake, and you'll notice that Robert keeps talking about the ACID transactions. So now let's talk about that from your perspective, because, cool, you found the paper, but obviously you were looking for either ACID transactions for data lakes specifically or something akin to that. So I'm just curious: why did you care about ACID transactions? What was it about them? Because, not to age myself, but you're younger than me, okay? You didn't necessarily come up through the world where the idea was that I have to have databases, I have to have ACID transactions. I come from that world; I'm a former SQL Server guy, so I'm like, yeah, of course you need databases for transactions, until I realized I was wrong. So I'm just curious what led you to recognize the fact that you did need ACID transactions.
Robert: So, first of all, data quality is very, very important, or not just data quality: data correctness is very, very important to our customers. So eventual consistency is, let's say, a no-go. We really want to have up-to-date data and correct data, and without consistency there, with transactions that might fail partially and then write inconsistent data to a database, that is simply a no-go. That's why ACID transactions were a must when we considered our solution.
Denny: So let's start with that. Okay, you said eventual consistency is a no-go. To provide context to everybody listening who may or may not know this: eventual consistency is basically the default mode when it comes to working with S3, and for that matter many other systems. In fact, you'll probably see old comments of mine about ACID versus BASE. BASE is basically the realm of the Hadoop world, where we introduced Basically Available, Soft state, Eventually consistent, with the eventual consistency being the key part, versus ACID, which is Atomicity, Consistency, Isolation, Durability, basically the realm of the database world. And so, from your perspective, because you were working with S3 and knew you were dealing with an eventually consistent position, you were basically worried about data corruption. But I'm just curious: why were you concerned about data corruption? What was inherent about your pipelines, or whatever else, that was causing it to happen?
Robert: So we were using AWS Glue as our compute engine, essentially, or as our Spark environment, and Glue has this bookmarking feature that captures which files have been processed already. For some reason, and we weren't able to figure out why, it sometimes failed on us.
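For readers who haven't used it: Glue job bookmarks track which input a job has already consumed between runs, keyed by a transformation_ctx string, and the state only persists when the job commits (bookmarks must also be enabled on the job via the job-bookmark-option argument). A minimal sketch of the pattern; the catalog database and table names are hypothetical:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # loads any existing bookmark state

# transformation_ctx is the key Glue uses to remember which files this
# node already processed on previous runs; only new files are returned.
frame = glue_context.create_dynamic_frame.from_catalog(
    database="claims_raw",         # hypothetical catalog database
    table_name="dynamodb_export",  # hypothetical table
    transformation_ctx="read_claims",
)

# ... transform and write `frame` here ...

job.commit()  # persists the bookmark; skipping this reprocesses everything
```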
Denny: Right, right, right. So basically it was the nature of the pipelines you're processing, because there's so much of it. And by the way, this is related to a question that's coming in from the Q&A: do you also put your audio and visual data into your lakehouse, or is it primarily the structured data? I'm just curious about your perspective.
Robert: Both, although we don't have audio data, just video and photo data, and both reside in S3, exactly. So both can essentially be used in the same breath when you want to do any machine learning tasks, for example.
Robert: And there is the next thing we always have to consider: compliance, as we are based in Germany, and this is a very important topic here. It was amazing that you get an audit log and are also able to delete data, essentially. You can do row-level deletes on top of Parquet files, which is great for our use cases.
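As an aside, the row-level delete Robert mentions is a one-liner on a Delta table, and every such operation is recorded in the table's transaction log, which is where the audit trail comes from. A sketch with a hypothetical path and predicate:

```python
from delta.tables import DeltaTable

# Hypothetical table of claim records, keyed by policyholder.
# Assumes a Delta-enabled `spark` session.
claims = DeltaTable.forPath(spark, "s3://my-lake/silver/claims")

# GDPR-style erasure: remove all rows for one policyholder. Delta rewrites
# only the affected Parquet files and commits the change atomically.
# (Physically purging old file versions additionally requires VACUUM.)
claims.delete("policyholder_id = 'abc-123'")

# The table history doubles as an audit log of every operation.
claims.history().select("version", "timestamp", "operation").show()
```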
Denny: I do understand; that process is extremely important in Germany, yeah. All right, let's not go into that, just because I know I'll rat-hole into it. So, you started off by building your data lake on Delta Lake to get that transactional protection, and then you started, I believe, with Athena and Glue to work with it? Okay, so how did that go?
Robert: Yeah, so Glue, at least back then, inherently only supported two Spark versions, essentially 2.4, I think, and 3.1, and even the 3.1 was a fork from Amazon, I think. And there was no native Delta integration, so you had to essentially bring the JARs to Glue yourself, but that also meant you had compatibility issues. You cannot use the latest Delta Lake versions, because Delta Lake needs a specific version of Spark, and that's when we quickly realized, okay, if we really want to do this in production, we need to look for an alternative.
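"Bringing the JARs" here means supplying the delta-core artifact yourself and wiring up the Delta extensions by hand, roughly as below. The version numbers are examples only; Delta's compatibility matrix dictates which pairings actually work, which is exactly the constraint Robert hit:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Pull in Delta Lake as an external dependency; the artifact version
    # must match the Spark version (per the Delta/Spark compatibility matrix).
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.1")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Delta 1.0.x is the line that pairs with Spark 3.1; newer Delta releases
# require newer Spark, so a pinned Glue Spark version pins Delta too.
spark.range(5).write.format("delta").save("/tmp/delta-smoke-test")
```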
Robert: Thank God Glue now comes with native Delta support, so this is not a problem anymore, but back then, when we made our decision, it was still a problem for us.
Denny: Yeah, and a shout-out to our friends over at Glue. During AWS re:Invent late last year they announced Glue 4.0, and Glue 4.0 actually includes native Delta Lake support. They had Delta support in Glue 3.0, but that was with the manifest file, so Glue crawlers would basically go ahead and read the manifest files, and in terms of real time, even pseudo-real-time capability was not possible. Now the crawlers are able to query the Delta Lake tables directly, so it's included natively, no big deal. But that was back at the time, right?
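On the manifest approach Denny references: Delta can emit a symlink-format manifest that Presto- and Athena-style engines read to discover which Parquet files make up the current table version. A sketch with a hypothetical path:

```python
from delta.tables import DeltaTable

# Assumes a Delta-enabled `spark` session; the path is hypothetical.
table = DeltaTable.forPath(spark, "s3://my-lake/silver/claims")

# Writes _symlink_format_manifest/ under the table root; the Athena
# external table is then defined over the manifest, not the raw Parquet.
table.generate("symlink_format_manifest")
```

The manifest goes stale after every write unless it is regenerated (or the table property delta.compatibility.symlinkFormatManifest.enabled is set), which is why the crawler-plus-manifest setup could never get close to real time.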
Denny: You know, in fact, your own blog post actually does call out that Noritaka Sekiyama went ahead and pointed that out, and the reason I like calling that out is because we're actually working with Noritaka quite a bit. We actually have a separate D3L2 session with him.
Robert: Definitely, yeah. Funny enough, the day I published, I think it was the second article, was the first day of re:Invent, and they immediately announced that Glue now natively supports Delta.
Denny: That'll come off quite well with him. Okay, but let's talk a little bit more about this: you were working with Glue, you had some of those issues, and that's ultimately why you ended up moving away from Glue, basically because of that at the time? Yeah, exactly. All right, and sometimes that's just how it happens in terms of timing.
Robert: So it was nothing to do with the Spark version itself; we wanted to use the latest version of Delta Lake, and because of the compatibility issues, or the compatibility matrix, it wasn't possible for us to use the latest version on top of Glue 3. So it wasn't about the Spark version, rather the Delta Lake version.

Denny: Gotcha.
Denny: So in your journey of switching over to this, what are the efficiencies or the improvements to your own processes that you've been able to attain, going from what you were originally working with, back in the beginning when we were talking about Redshift, to now, working with Delta Lake? I'm just curious what the inherent efficiencies, improvements, and process gains were. Is data processed faster? Is it more real time? I'm just curious, because if anybody asks, the objection goes: well, it's just a freaking storage format on Parquet that happens to have a transaction log, so it doesn't really help. So what's your context behind that statement, or that question, excuse me?
Robert: First of all, the biggest upside we saw is cost reduction. Just not having a Redshift cluster up and running is a breeze. You now have increased costs for Athena and also on top of S3, but that is negligible in comparison to the cost reduction we achieved by turning off Redshift, so this is probably the biggest achievement. But the data staleness also got reduced: we have less writing, since the data can stay in S3, and this really cut down our ETL times, because the load into Redshift was always the slowest part of our pipelines. And reading data from our gold and aggregated tables, the loading times improved drastically, especially for machine learning purposes, because back then we had to use a JDBC connection to Redshift, and now we can read directly on top of S3, which is great.
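The before and after Robert describes looks roughly like this: instead of pulling aggregated features through a JDBC connection into Spark, the ML jobs read the gold tables straight off S3. A sketch with hypothetical connection details and paths (assumes an active `spark` session; Redshift JDBC driver setup omitted):

```python
# Before: feature extraction bottlenecked on a single JDBC pull from Redshift.
features_old = (
    spark.read.format("jdbc")
    .option("url", "jdbc:redshift://example-cluster:5439/analytics")  # hypothetical
    .option("dbtable", "gold.claim_features")
    .load()
)

# After: the same gold table is a Delta table read directly from S3, so the
# Spark workers fetch Parquet files in parallel with no warehouse hop.
features_new = spark.read.format("delta").load("s3://my-lake/gold/claim_features")
```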
Robert: And I love the Delta Lake API; it's very easy to interact with a Delta table, and up until now it has just been great, so I cannot complain. It basically exceeded our expectations of what we wanted to improve. We're still a small startup, right, and we don't have the data that a big Fortune 500 company has, but it seems like we can scale indefinitely with our current setup, at least for now, and we are now completely future-proof. It's great.
Robert: But what I really like is that we can operationalize the data in our Delta lake, or lakehouse, because with frameworks like the Delta Standalone Reader or delta-rs you can really quickly interact with the Parquet files, which allows you to essentially write back to your data lake. We weren't doing that on top of Redshift before, obviously, and this is really cool: we can write applications that write back to our data lake based on, yeah, data visualizations, for example.
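delta-rs ships Python bindings (the deltalake package) that read and write Delta tables without any Spark cluster, which is what makes this kind of lightweight write-back service practical. A sketch under assumed table names; note that concurrent S3 writers need delta-rs's locking setup, omitted here:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Read the current state of a (hypothetical) Delta table, no Spark needed.
table = DeltaTable("s3://my-lake/gold/claim_features")
df = table.to_pandas()

# A small app computes something, e.g. user annotations coming back from a
# visualization tool, and appends the result straight into the lake.
annotations = pd.DataFrame({"claim_id": ["abc-123"], "flagged": [True]})
write_deltalake("s3://my-lake/gold/claim_annotations", annotations, mode="append")
```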
Denny: That's amazing. Actually, I'd love to talk a little bit more about that, but I just realized something. Don't worry, even though I'm from Databricks, I'm not actually trying to pitch Databricks here, or at least not that much, I should put it that way; there are obvious biases here, come on, guys. But I'm just curious, because it's interesting: you have the Databricks environment running a lot of your stuff, but you also have Athena. In other words, you have this mixture where some of the stuff is sitting in Databricks and some of the stuff is sitting in Athena, so you're basically picking and choosing what you like from either world, as opposed to necessarily doing all Databricks or all AWS native services. I'm just curious what led to that, or is it just because that's the way it was built and that's good enough?
Robert: So, first of all, what enables this, and I think this is the most important thing, is that we bet on open source software. Our core, the storage format, the table format, is Delta Lake, and our processing engine is Spark, and this allows us to choose whatever we want to use to interact with the data. As for why we picked Athena instead of, for example, Databricks SQL: again, just historical reasons. We also had Athena already up and running and configured. But the goal eventually is to move over to Databricks SQL, to leverage all the other cool features. That's the reason why.
Denny: But it actually makes sense, right? For the sake of argument, if Databricks SQL sucked, and it doesn't, by the way, but pretend it did, you'd have no problem switching back to Athena, no problem switching to all these other services. You have the flexibility, and that's more or less the whole point behind this.
Denny: I've actually answered a bunch of the questions off to the side, so I'm going to give this opportunity to let other folks go ahead and chime in if they have any other questions. And like I said, I've already posted Robert's blogs, and we'll update the YouTube video by the way, so YouTube is actually where we'll have the final recording.
Robert: And what really helped us during the migration from Redshift to Delta Lake is all the great open source libraries that live next to Delta Lake and next to Spark. I want to give a shout-out to Matthew Powers there; he's maintaining a lot of great libraries that helped us in the transition period, so definitely check them out. There's mack, which I'm using now, and definitely chispa, which is one of the must-use libraries for every developer out there. But yeah, mack, definitely using that, and I'm also looking forward to contributing there. It's great.
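For the curious: chispa provides DataFrame equality assertions for PySpark tests, and mack adds Delta Lake helpers such as deduplication and type-2 upserts. A tiny chispa usage sketch; the transformation under test is made up:

```python
from pyspark.sql import functions as F
from chispa import assert_df_equality

def test_claim_id_normalization(spark):
    # Hypothetical transformation under test: trim and lowercase claim IDs.
    source = spark.createDataFrame([("ABC-123 ",)], ["claim_id"])
    actual = source.select(F.lower(F.trim("claim_id")).alias("claim_id"))

    expected = spark.createDataFrame([("abc-123",)], ["claim_id"])

    # Fails with a readable row-by-row diff when the frames differ.
    assert_df_equality(actual, expected)
```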
Denny: Oh yeah, definitely. Actually, that reminds me, I did see your contribution to one of the libraries; I forgot which one it was, but yes, that's right, there's a reason I brought that up. You made a contribution there, and by the way, I believe we're moving that entire library to the Delta repo.
Denny: He's been going nuts, as you can tell, so we're going to be moving a bunch of stuff to various locations, just because we've got tons of contributions. But sorry, I completely went sideways on you. Go on: any advice for anybody who's trying to make that transition to lakehouses?
Robert: My biggest advice: don't reinvent the wheel. A lot of things have been done already. And then, let me think. For us, everything went so smoothly that I can't really tell any stories of pitfalls and caveats, because it went really well. I'm trying hard to think of some advice I could give, but just do it, essentially, because, at least for us, it has been one of the best decisions we've made. If anyone's thinking about it, at least try it out; try using Delta.
Denny: I could have sworn there were problems, but it looks like, and this is of course a little bit of a tagline here, Delta makes things so simple that you actually didn't run into pitfalls. Or I guess you got very lucky. No, no, yes, there's usually an amount of both. It's always a little bit of luck, a little bit of good tech, a little bit of Spark, and all three mix together to turn out that way.
Denny: So yes, I completely agree with you, man. Perfect. Well, thank you very much, Robert, I really appreciate your time. Anybody, if you've got more questions, as Robert and I were already discussing, join us in the delta-users GitHub, because we're actually doing lots of contributions in that arena. Also, don't forget to join us on the Delta users Slack; that's basically go.delta.io/slack. All of us are there, actively asking and answering questions. I will call myself out:
Denny: I disappeared for a couple weeks due to last week's Databricks CKO, and I'm still recovering from it. But yes, we are normally there, normally answering questions. And then one small final callout: I'll actually be in London at the end of the month, and we're going to have a meetup there to talk about lakehouses, so come join us there in person.