From YouTube: Delta Hack: Live demo of kafka-delta-ingest
Description
One of the lead developers on kafka-delta-ingest, Christian Williams, will share the progress and demonstrate the tool. Kafka Delta Ingest helps bring Kafka-based data into Delta Lake as quickly and efficiently as possible.
A
Welcome! Let's see if we get our YouTube stream going... looks like it's going. Welcome to another live demo, or live stream, for the Delta Hack hackathon that we're organizing. It runs for just a couple of days; it'll end at the end of the day tomorrow. So if you've got projects, definitely submit those. The site is on Devpost; just search for "deltahack2021" and you'll find it, and you can sign up or submit whatever projects you have there.

But today I've got Christian Williams, who is one of the primary developers on a daemon in the Delta project called kafka-delta-ingest. The goal of kafka-delta-ingest is to bring data out of Kafka as quickly and efficiently as possible and put it into Delta tables, which can be very useful for data streaming, Spark streaming, or any sort of high-throughput streaming application you might want to build on top of Delta Lake. Now, kafka-delta-ingest is still early. It's not quite production-ready, but I'm looking forward to learning just how ready it actually is from Christian.

If you've got questions, feel free to throw those in the chat sidebar on YouTube, or, if you join the Delta community Slack workspace (which you can find at the bottom of delta.io), you can join us in the kafka-delta-ingest channel on our Delta users Slack workspace. But without further ado, Christian:
Why don't you go ahead and show us something cool?
B
Sure. So, like Tyler mentioned, kafka-delta-ingest is a daemon written in Rust for streaming data from Kafka topics to Delta Lake tables. We've done a demo on this previously, so for today's demo I'm going to pick up on one of the key features.
B
Speaking of that production-readiness thing: one of the key features we did not have in our prior demo, that we do now have in kafka-delta-ingest, is support for multiple worker processes running for the same topic-to-table stream, which is of course really useful for handling high-volume topics.
B
So I've already got stuff up on my screen; let me explain what we're seeing here. Prior to running anything, I set up a Kafka topic with three partitions, and the name of that topic happens to be "web requests". That's the same example data set that we used in the previous demo, and it's also the same data set that we have in our open source repo that you can run locally. I've launched two separate kafka-delta-ingest workers, and these are each configured...
A
...and we're seeing those at the top, and then the second... we've got four panes here, right?
B
Yeah, four panes. The top two are my two separate kafka-delta-ingest workers, and they're configured to write a new Delta file every minute. In between that time, they're just buffering messages: they're consuming from Kafka, doing some tiny transformations on those messages, and sticking them in a buffer. So that's one worker in the top pane and one worker in the second pane.
B
So we have 238 files right now. Since we've got two workers and three partitions, Kafka has assigned two of these partitions to one worker and one partition to the other worker, and our output directories...
B
Yeah, so one of the things I wanted to zoom in on, to make the point that we're leveraging Kafka partitions to split the traffic across our workers: each of these data files will have a mutually exclusive set of partitions. For the worker that is handling partitions zero and one, all the records are going to have the partition value in our message equal to either zero or one, because those are the partitions it's handling.
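As an aside on how that two-and-one split can arise: Kafka's assignor divides a topic's partitions among the consumers in a group. Here is a toy, standard-library-only Rust sketch of a range-style split; this is a simplification for a single topic, not Kafka's actual assignor code:

```rust
// Toy range-style assignment of one topic's partitions across workers:
// each worker gets a contiguous chunk, and the first
// (partitions % workers) workers get one extra partition.
fn assign(partitions: i32, workers: usize) -> Vec<Vec<i32>> {
    let base = partitions as usize / workers;
    let extra = partitions as usize % workers;
    let mut out = Vec::with_capacity(workers);
    let mut next = 0i32;
    for w in 0..workers {
        let count = (base + if w < extra { 1 } else { 0 }) as i32;
        out.push((next..next + count).collect());
        next += count;
    }
    out
}

fn main() {
    // Three partitions, two workers: one worker ends up with [0, 1]
    // and the other with [2], matching the split seen in the demo.
    println!("{:?}", assign(3, 2));
}
```

The real assignment strategy is configurable on the consumer, which is why the split can come out differently after a rebalance.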
B
These are the Delta log files. They don't necessarily correspond to the exact same parquet files I was showing, but these are Delta log files, and one difference here is that the worker handling partitions zero and one is going to include this action we've called "txn", the t-x-n action. It's going to include two of those, one for each partition that it's handling, and each records the Kafka offset, so that if another worker gets this later, it can look at the version number and then seek to the appropriate Kafka offset. So that is the part for the worker handling partitions zero and one.
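To make the txn idea concrete: in the Delta protocol, a `txn` action carries an application id and a version number, and here the version effectively records the last committed Kafka offset for one partition. A hedged sketch of building such log lines in Rust follows; the app-id naming scheme (`<app>-<topic>-<partition>`) is an illustrative assumption, not necessarily the exact keys kafka-delta-ingest writes:

```rust
// Sketch: building the per-partition `txn` actions a worker would append
// to a Delta log entry. The field names (`appId`, `version`) follow the
// Delta protocol's txn action; the app-id scheme shown here is an
// illustrative assumption.
fn txn_action(app_id: &str, topic: &str, partition: i32, last_offset: i64) -> String {
    format!(
        r#"{{"txn":{{"appId":"{}-{}-{}","version":{}}}}}"#,
        app_id, topic, partition, last_offset
    )
}

fn main() {
    // A worker handling partitions 0 and 1 writes one txn action per partition.
    for (partition, offset) in [(0, 1042_i64), (1, 987)] {
        println!("{}", txn_action("kafka_delta_ingest", "web_requests", partition, offset));
    }
}
```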
B
So the next thing we can do, now that we're playing with two workers: anybody who's familiar with Kafka probably knows about the concept of a rebalance and the concept of a partition assignment.
B
So
if
you
have
the
simplest
possible
scenario
and
you've
got
a
single
topic
and
a
single
worker,
I'm
sorry
a
single
topic
partition
and
a
single
worker
there's
only
one
place
for
that
partition
to
go
to
to
the
one
worker
and
there's
only
one
partition
that
needs
to
be
distributed
to
anywhere,
and
so
obviously,
worker
one
is
going
to
handle
that
partition.
B
Now, the number of consumers for a topic can change at any time, and when that happens, something called a rebalance happens. And what does Kafka do when there's a rebalance?
B
Our partitions were revoked, and then immediately afterwards we received a new partition assignment, and our new partition assignment includes all three partitions, because we only have two workers, right?
B
Right. And so once it got its new assignment, it stopped what it was doing, went and checked out the latest state of the Delta log, and found: okay, the Delta log says the last data we committed to the Delta table corresponded with these three partitions, and it had a record written for each. This is what comes out of those txn actions.
B
It had the offset recorded for each of those partitions, so it extracted those offsets and then did a seek with the Kafka consumer to pick back up where it left off. So basically, when we lose workers, we always start from the last recorded offset. We know that we're not duplicating data and we're always picking up where we left off. Eventually, once it hits that one-minute threshold, it's going to do another write. Let's see... I haven't been watching the bottom panes.
B
...of the whole run loop. Just to situate us and be able to answer that question, I'll actually, if you don't mind, walk through it from the beginning. Is that okay?
A
Go for it.
B
All right, cool.
B
So after it reads the Delta log, it gets the offsets out of those txn actions, and once it discovers exactly what offsets it needs to start from for each of its partitions, it seeks the Kafka consumer. There's an API for doing that; it's like consumer.seek, and you give it the offset for each partition.
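That recover-then-seek step can be sketched with just the standard library. The per-partition app-id keys (`<app>-<partition>`) and the resume-at-last-offset-plus-one rule are assumptions for illustration; a real worker would then hand these offsets to the Kafka client's seek API rather than return a plain map:

```rust
use std::collections::HashMap;

// Sketch: compute where each assigned partition should resume, given the
// txn versions recovered from the Delta log. The app-id key scheme is an
// illustrative assumption, not necessarily kafka-delta-ingest's exact keys.
fn resume_offsets(
    txn_versions: &HashMap<String, i64>, // appId -> version (last committed offset)
    app_id: &str,
    assigned_partitions: &[i32],
) -> HashMap<i32, i64> {
    assigned_partitions
        .iter()
        .map(|&p| {
            let key = format!("{}-{}", app_id, p);
            // Resume one past the last committed offset, or from 0 if this
            // partition has never been committed.
            let next = txn_versions.get(&key).map(|v| v + 1).unwrap_or(0);
            (p, next)
        })
        .collect()
}

fn main() {
    let mut log = HashMap::new();
    log.insert("kdi-0".to_string(), 1041_i64);
    log.insert("kdi-1".to_string(), 986);

    // After the rebalance this worker owns all three partitions; partition 2
    // has never been committed, so it starts from offset 0.
    let offsets = resume_offsets(&log, "kdi", &[0, 1, 2]);
    println!("{:?}", offsets.get(&2)); // prints: Some(0)
}
```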
B
Then it starts the actual run loop, and it checks to see: has a rebalance happened? When we're first starting up, it's unlikely that it has, so it's going to go down this "no" path, and it's going to consume and buffer the message it just received, and it's going to buffer each message that comes in... actually, I've got these things out of place: the rebalance check happens after consume-and-buffer. Apologies.
B
So it's consuming and buffering, and each time it consumes and buffers, it checks to see if a rebalance has happened yet, and then afterwards it checks: should I write yet? Have I buffered for long enough? Have I buffered either enough messages, or buffered for the max time? If so, it writes out a parquet file and writes out a Delta log file.
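The loop described above, consume and buffer, check for a rebalance, then check whether to flush, can be sketched in plain Rust. The `Worker` type and its thresholds here are stand-ins, not kafka-delta-ingest's real types:

```rust
// Sketch of the run loop: consume & buffer each message, handle a pending
// rebalance, then flush when the message-count or max-latency threshold
// is reached (writing a parquet file plus a Delta log entry, stubbed out).
use std::time::{Duration, Instant};

struct Worker {
    buffer: Vec<String>,
    last_flush: Instant,
    max_messages: usize,
    max_latency: Duration,
}

impl Worker {
    fn should_write(&self) -> bool {
        self.buffer.len() >= self.max_messages
            || self.last_flush.elapsed() >= self.max_latency
    }

    // Returns true when this step triggered a flush.
    fn run_once(&mut self, msg: String, rebalance_pending: bool) -> bool {
        // 1. Consume and buffer (with any small per-message transforms).
        self.buffer.push(msg);

        // 2. If a rebalance happened, drop the buffer; the real worker would
        //    re-read the Delta log, re-seek, and start the cycle over.
        if rebalance_pending {
            self.buffer.clear();
            return false;
        }

        // 3. Flush when a threshold is hit.
        if self.should_write() {
            self.buffer.clear();
            self.last_flush = Instant::now();
            return true;
        }
        false
    }
}

fn main() {
    let mut worker = Worker {
        buffer: Vec::new(),
        last_flush: Instant::now(),
        max_messages: 3,
        max_latency: Duration::from_secs(60), // "a new Delta file every minute"
    };
    for i in 0..3 {
        let flushed = worker.run_once(format!("msg-{}", i), false);
        println!("message {} -> flushed: {}", i, flushed);
    }
}
```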
B
Then it goes back to consuming and buffering. In the case where a rebalance does happen, it rechecks its partition assignment and starts back over, reading the Delta log. And just because it's so confusing that I got those things misplaced... Tyler, you can call me off of this if you want me to just go back to the demo, if it's clear enough.
A
Like I mentioned, I think it's clear, though.
B
Okay, yeah. This is one of my little OCD behaviors that I just can't...
B
I can't live with it if this isn't in the right spot, but all right, I'll stop tinkering with it. I think this is enough to get the idea. So basically: consume and buffer; have we rebalanced? If we have, we need to check what our current partition assignment is from Kafka, go back and read the Delta log again, get our offsets, and then re-seek and start the whole cycle over.
B
Okay, so I switched back over to, as we see now, the console, and let's see here... oh yeah, okay, so we spent a bit too long on that diagram; we missed the first write, the first exciting write after our rebalance. But as you can see, it picked back up where we left off and kept writing new Delta log files. So what happens if we restart worker two?
B
What we'll see here is that we get a rebalance on the worker that was continuing to run, and we also get a rebalance on the new worker, and, you know, what's...
A
...
B
Unless we have a bug, which we could have, I don't know; but as far as the designed behavior goes, for sure we should, you know, when it hits that should-write node in the workflow.
B
One of the funny things that happened here when we terminated that second worker: I think, and I could be wrong about this, that the second worker was originally handling partition two and the first worker was handling partitions zero and one, and now they're flip-flopped. That's the thing about a rebalance: when you have one consumer handling all partitions and you bring another one on, you don't know how the shuffle is going to come out.
B
It's kind of an interesting nuance that I happened to notice while we were watching this. So, let's see: we watched our workers run, we inspected the parquet files, and we took a look at the Delta logs. That's all I had lined up for the demo. Any further questions, or anywhere you'd like to go deeper, Tyler?
A
Could you share a little bit about the work that's come in lately to make sure that those two workers can actually write to the same Delta table?
B
The original plan was that we were only going to have one transaction, one txn action, in the Delta log, and instead of storing Kafka offsets there, we were going to store an identifier that points to a record in our write-ahead log, which keeps track of the offsets for all partitions.
B
Then somebody on our team had a really brilliant idea: what if we don't do that, and instead just write multiple transactions, multiple txn actions, in the Delta log file, where each would have an app id specific to the partition and would record the offset for that specific partition? By doing that, we were able to get rid of a bunch of infrastructure that we were going to have to support.
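One way to picture why per-partition txn actions help avoid duplicates (a deliberately simplified sketch; the real conflict handling in kafka-delta-ingest is more involved): before committing, a worker can compare the offsets a batch would record against the versions already in the log, and skip a batch that was already committed:

```rust
use std::collections::HashMap;

// Sketch: a pre-commit check against the per-partition txn versions already
// in the Delta log. `committed` maps partition -> last committed offset;
// `batch_end` maps partition -> last offset of the batch about to be written.
fn batch_already_committed(
    committed: &HashMap<i32, i64>,
    batch_end: &HashMap<i32, i64>,
) -> bool {
    batch_end
        .iter()
        .all(|(p, end)| committed.get(p).map_or(false, |last| last >= end))
}

fn main() {
    let committed: HashMap<i32, i64> = [(0, 1041), (1, 986)].into_iter().collect();
    // A retried batch that ends at already-committed offsets can be skipped.
    let retried: HashMap<i32, i64> = [(0, 1041)].into_iter().collect();
    println!("skip retried batch: {}", batch_already_committed(&committed, &retried));
}
```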
B
So, whereas our initial write-ahead log was going to require a DynamoDB table, and then a locking table as well, now we're just relying on the Delta log for all of our offset information, and that's what allows us to maintain this... you know, I hate to use the phrase "exactly-once semantics", but we think it is.
A
I think that answers all of the questions I was thinking about. Maybe to wrap us up: if you could talk a little bit about how somebody might be able to get involved, either helping with writing code, or testing, or documenting, or what have you?
B
Yeah, definitely. And you sent out the repository?
A
Yeah, I did.
B
Issues and pull requests are very welcome in any of those three categories: documentation, testing, code. I'll say, you know, a lot of people have expressed interest in supporting formats other than JSON, and there's even been some talk of abstracting Kafka as the source, so that this might become delta-ingest instead of kafka-delta-ingest, which would also be really cool. Those are two of the bigger items.
B
I think those would be neat contributions. Some of the smaller-scoped things that would be good contributions include adding some resiliency to some of the network calls we're doing right now. We've got some pretty limited retry handling in case of network failures when we're, for example, writing to S3. That might be a good, easier item to get in on. What else... there's something else that just occurred to me.
B
Oh yeah, we're very focused on S3 right now, so if there's any kind of Azure-related thing that somebody might want to contribute... really, though, at this point, since we got rid of the write-ahead log, I actually don't think we have any direct S3 or cloud storage touch points; we always go through delta-rs for that. So I might take that back; that might not be as important anymore.
B
I did notice on our issue list that somebody had a really neat issue.
A
So I should mention, for anybody watching: if you go to the kafka-delta-ingest repository, good first issues are noted there. So if you're just getting started with Rust, or are looking for an easy entry into kafka-delta-ingest, there are a couple of good first issues available.
B
Oh yeah, here it is: somebody had posted this issue to add the ability to access Kafka message headers in transformations. I could see that being a nice, easy add.
A
Interesting. So kafka-delta-ingest, I'll reiterate, is open source; it's part of the Delta project, or Delta Lake project, excuse me. Any parting words?
A
Okay, well, Christian, thank you very much for demoing the current state of kafka-delta-ingest. I'm very much looking forward to this finding a production workload soon. I also want to thank QP, Misha, and Neville, who have all been, I would say, major contributors in getting kafka-delta-ingest this far.
A
If you've got any more questions about kafka-delta-ingest, or want to tinker with it, you can join us in Slack in the kafka-delta-ingest channel on the Delta users Slack, or hit us up on GitHub: start a discussion, open up an issue, etc. But with that said, have a good rest of your day, and I hope everybody enjoys some Delta hacking. Bye now.
B
Okay, have a good one, Tyler.