From YouTube: 2021-08-31 delta-rs open development meeting
Description
Discuss tombstones and challenges with vacuuming
A
Okay, cool. So, as I explained yesterday on the kdi channel, the problem is that when you've got a list and it's empty, the emptiness of the list is expressed as an offset, but that offset only lives on the list. It doesn't get propagated down to the list's children.
A
So when the list child is a primitive, it's fine, because when we get to the primitive and calculate its definition we'll be able to pick that up. Well, if, for example, your maximum definition level is two: as I explained, if a list is nullable you've actually got three levels. There are three valid definition levels, where zero says that the list is null, one says that the list is empty, and two says that the list has a slot or multiple slots.
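The three levels just described can be sketched as follows. This is a simplified Python illustration with a hypothetical helper name, not the actual arrow/delta-rs code (which is Rust); it assumes a nullable list of required primitives, so the maximum definition level is two.

```python
# Hypothetical sketch: definition levels for a nullable list column
# whose elements are required primitives (max definition level 2).
def list_definition_levels(lists):
    levels = []
    for lst in lists:
        if lst is None:
            levels.append(0)               # level 0: the list is null
        elif len(lst) == 0:
            levels.append(1)               # level 1: the list is empty
        else:
            levels.extend([2] * len(lst))  # level 2: one per occupied slot
    return levels

# A null list, an empty list, and a two-element list:
print(list_definition_levels([None, [], [7, 8]]))  # [0, 1, 2, 2]
```

Note that the null and empty lists each still contribute one level entry even though they contribute no values; that is exactly the information that has to survive when the child is not a primitive.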
So when doing that with a primitive, if you've got a list that has primitives, it's not an issue, because after we calculate the definition of the list we immediately calculate that of the primitive, and that's not an issue. But where you've got a nested type...
A
Well, when we've got another nested type inside the list, that's where it becomes a problem, because you effectively pick up that you've got a zero offset. So you set one value, like an empty slot, to be zero. But then, when you come to calculate, let's say, the struct inside a list, the current way that we're doing it disregards that empty list slot, and then you end up with nothing.
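The failure mode just described can be sketched like this. These are hypothetical helpers in Python, not the delta-rs/arrow code: the child's levels are derived from the parent list's offsets, and a zero-length slot produces no level at all unless null and empty slots are handled explicitly.

```python
# Hypothetical sketch of the bug: deriving a struct child's definition
# levels from the parent list's offsets and validity bitmap.
def child_levels_buggy(offsets, validity):
    # Walks offsets only; a zero-length slot emits no level, so null
    # and empty lists silently disappear from the child column.
    levels = []
    for i in range(len(offsets) - 1):
        levels.extend([2] * (offsets[i + 1] - offsets[i]))
    return levels

def child_levels_fixed(offsets, validity):
    # Null and empty slots still emit one level (0 or 1), so the
    # child's levels stay aligned with the parent list.
    levels = []
    for i in range(len(offsets) - 1):
        n = offsets[i + 1] - offsets[i]
        if not validity[i]:
            levels.append(0)
        elif n == 0:
            levels.append(1)
        else:
            levels.extend([2] * n)
    return levels

# offsets [0, 0, 0, 2] with validity [False, True, True] encodes
# the column [null, [], [a, b]]:
print(child_levels_buggy([0, 0, 0, 2], [False, True, True]))  # [2, 2]
print(child_levels_fixed([0, 0, 0, 2], [False, True, True]))  # [0, 1, 2, 2]
```

The buggy variant loses the two slots belonging to the null and empty lists, which matches the "you end up with nothing" symptom above.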
B
I get the general idea, yeah. So basically it's specifically a problem for lists that have structs as an element type, yeah.
B
What I don't understand, though, is how severe a problem this is to resolve. Is this going to take a while?
A
I don't think so, because I've started working on a solution. Obviously I started with the test case that fails, so instead of writing a whole Arrow record batch to Parquet, I've just reduced it down to the level calculations. I'm probably about 60 percent there with the solution, because I've got it working partially.
A
When you calculate the list, when you go from the list to the struct, it's working correctly now. I'm left with going from the struct to whatever the primitive value might be, or even struct to another list, and making sure that's accurate. So it's just been a matter of not having enough bandwidth. I could probably have finished it over the weekend, but I intend to try to get that done during this week.
B
Okay, excellent, that's good news. In the meantime, you know, we finally realized we're kind of working on some bleeding-edge stuff here and we should expect occasional bugs like this to show up. So we started implementing a dead letter queue in kafka-delta-ingest, and the approach we're taking, basically, as I think you know: we buffer JSON messages in memory, and at certain intervals we write those to Arrow record batches.
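The buffering approach just described can be sketched as follows. This is a hypothetical Python illustration (kafka-delta-ingest itself is Rust and builds real Arrow record batches); the class name, the message-count flush trigger, and the `write_batch` callback are all stand-ins.

```python
# Hypothetical sketch: buffer JSON messages in memory and flush them
# as a "record batch" once a threshold is reached.
class MessageBuffer:
    def __init__(self, flush_every, write_batch):
        self.messages = []
        self.flush_every = flush_every   # flush trigger, in messages
        self.write_batch = write_batch   # stands in for the Arrow write

    def append(self, msg):
        self.messages.append(msg)
        if len(self.messages) >= self.flush_every:
            self.write_batch(list(self.messages))  # one batch
            self.messages.clear()

batches = []
buf = MessageBuffer(2, batches.append)
for m in ({"id": 1}, {"id": 2}, {"id": 3}):
    buf.append(m)
print(batches)  # [[{'id': 1}, {'id': 2}]]  -- third message still buffered
```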
B
You know, to test the validity of it. So the bug only surfaces for us on very rare occasions, when we get a null value for one of these fields, and so the approach we're taking is: don't validate the message up front.
But when we go to write the record batch to Parquet, if that write fails, we back up and basically we're creating a test Parquet buffer.
B
At that point we write each message in the batch to a separate Parquet file buffer for each record, and that way we can sift through the good records and the bad records. Once we have the good ones, we then create a new record batch out of that and write it to Parquet. Given the way we're using Arrow and Parquet to do this...
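The dead-letter fallback just described can be sketched like this. It is a hypothetical Python illustration; the real project does this in Rust against actual Parquet buffers, so a stand-in validating writer plays the role of the Parquet write here.

```python
# Hypothetical sketch of the dead-letter fallback: try the whole batch,
# and on failure retry record-by-record to sift good from bad.
def write_with_dead_letters(batch, write_parquet, dead_letters):
    try:
        write_parquet(batch)
        return
    except ValueError:
        pass  # fall back to per-record writes
    good = []
    for record in batch:
        try:
            write_parquet([record])   # separate test buffer per record
            good.append(record)
        except ValueError:
            dead_letters.append(record)
    if good:
        write_parquet(good)           # re-batch the good records

# Stand-in writer that rejects records with a null id:
def writer(records):
    if any(r["id"] is None for r in records):
        raise ValueError("null id")

dead = []
write_with_dead_letters([{"id": 1}, {"id": None}, {"id": 2}], writer, dead)
print(dead)  # [{'id': None}]
```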
B
I'd like to send you a link to some code in the PR, so that you can just let us know if this seems like a reasonable approach to you. I'll grab that real quick, okay.
B
And this one is interesting because, let's see here.
B
This
this
last
one
is
interesting,
because
what
I'm
doing
is
to
protect
the
clean
buffer,
I'm
storing
off
the
existing
rk
bytes.
First,
I'm
cloning
them
and
then,
if
a,
if
an
error
happens,
then
I'm
replacing
the
existing
par
k
buffer
with
the
copied
bytes
and
a
new
fresh
cursor
re-initialized
to
to
the
good
bites
that
we
that
we
left
off
with
before
we
found
the.
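The buffer-protection trick just described can be sketched as follows. The names are hypothetical and the real code guards an in-memory Parquet cursor in Rust; here `io.BytesIO` stands in for that cursor and a validating callback stands in for the Parquet write check.

```python
import io

# Hypothetical sketch: snapshot the known-good bytes before a write,
# and restore a fresh cursor from them if the write fails.
class GuardedBuffer:
    def __init__(self):
        self.cursor = io.BytesIO()

    def write_guarded(self, payload, validate):
        snapshot = self.cursor.getvalue()  # clone the existing good bytes
        try:
            self.cursor.write(payload)
            validate(self.cursor.getvalue())
        except ValueError:
            # Replace the buffer with a fresh cursor re-initialized to
            # the good bytes we had before the failing write.
            self.cursor = io.BytesIO(snapshot)
            self.cursor.seek(0, io.SEEK_END)
            raise

def reject_bad(data):
    if b"bad" in data:
        raise ValueError("corrupt write")

buf = GuardedBuffer()
buf.write_guarded(b"good", reject_bad)
try:
    buf.write_guarded(b"bad", reject_bad)
except ValueError:
    pass
print(buf.cursor.getvalue())  # b'good'  -- the clean bytes survived
```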
B
So
you,
you
don't
have
to
review
this
right
in
a
second,
but
if
you
get
some
time
today
to
take
a
peek
at
it,
and
just
let
me
know
if
you
see
anything
too,.
B
Misha
had
another
issue
that
popped
up
for
him,
but
it
was,
it
was
related
to
spark
not
aero,
and
it
was
related
to
vacuum,
which
florian
implemented
in
delta
rs,
and
so
we
were
thinking
he
might
have
some
experience
in
case
misha
still
had
any
existing
questions,
but
I
think
where
misha
landed
this
morning,
he
actually
might
not
need
that
assistance
anyway.
So
so
I
think
we're
good
here
for
this
call.
A
Cool, I'll go through this and then we can just continue this in the channel, and then I'll also give you an update on how far I am with fixing the bug.
A
Cool, no worries. And then, have you come across any issues with the map support so far?
A
Now that's great to hear, because I was just a bit worried that maybe I would have introduced some issues or something.
B
Yeah, when we first started deploying it we ran into a couple of bugs, but it turned out they were on our side.
A
Cool, awesome. And then I'm assuming, well, actually yeah, we might have.
A
We might have a similar use case in the near future where we need to use kafka-delta-ingest. One of the teams at work, I think, yeah: Azure has finally enabled, or it's still in preview, but they've finally enabled change data capture on Azure SQL, instead of just the normal Microsoft SQL Server.
A
So I think our current ETL batches are hourly, so we're looking at potentially moving to real time, so it'll be great to introduce the team to kafka-delta-ingest.
B
Gotcha, yeah. So one thing to keep in mind there for Azure: right at the moment, kafka-delta-ingest only has S3 support, so we'll need to get some Azure bits in there before you'll be able to leverage it, most likely.
B
Cool, well, I appreciate the call, sir. I'm gonna stop the live stream and I'll talk to you next time.