From YouTube: Python Ingestion Framework Tour : Feb 19 2021
Description
Harshal Sheth describes the new python-based ingestion framework in DataHub, including support for Airflow.
Recorded at: DataHub Community Meeting: Feb 19, 2021
A: ...the scientist community, and we're using a lot of open source in our tools. Of course, as the AWS service teams that are building products, we always try to use the native AWS analytics services first, if there is a service that solves the problem, but if not, we look at third party or open source. One thing that people do not know is that some of our customers are Amazon themselves: Amazon has different business units, from Alexa to Amazon Go, and we actually help them as well, because they are using AWS. I joined Amazon in 2013, when Amazon Web Services had 15 services; today it's almost 200, so the platform has evolved over the years. I've been in data analytics for probably most of my career. My background is distributed computing, so I'm a big fan of what LinkedIn did with Kafka and other projects. I know Jay and Neha from a past life, before Confluent, so I'm always happy to join back for these meetups. Now, in my role, I think there is a tight correlation between the things we can do together on open source, so I'm happy to help, and again, I'm not selling any AWS services here; I'm just sharing knowledge and learning from you as well. So, nice to meet you. Cool, thanks.
B: Awesome.

C: Yep, so for the past couple of weeks I've been working on a new Python ingestion framework for DataHub. First off: why did we do this? The status quo was that the Python ingestion framework we had previously was really just a set of scripts, and people were already using that to ingest metadata into DataHub.
C: But there were a couple of shortcomings there. Specifically, it was hard to ingest via both Kafka and the REST API: if you wanted instantaneous feedback, you'd want the REST API, but using that with those scripts wasn't possible. Another thing was that we had these opaque JSON blobs that you would ingest into DataHub, based on Avro, which is a serialization format similar to Protobuf. The issue is that it's pretty difficult to use; there are a lot of sharp edges. People run into issues not knowing what the schema is or how to format their data, and so we get a lot of questions about what is even possible with it.
C: Specifically, not having type annotations around any of that was something a lot of people struggled with: they'd run into a bunch of runtime errors when they tried to execute their code, without any prior warning. The other thing was that, in order to configure your metadata ingestion, you needed to go and modify a bunch of code, which was not ideal. Ideally, you just modify some configuration and the code remains the same.
C: What we found is that people wanted to stick with Python because of the ecosystem around it: all the open source projects, things like Airflow and NumPy and TensorFlow and so forth. They were used to this, they wanted to continue using it to ingest data into DataHub, and we wanted a principled way to make that happen.
C: With Kafka you write and then forget: you don't necessarily know that an event was processed correctly until you receive the audit event or the failed metadata event back, whereas with the REST API it's a little more instantaneous. So for different use cases you'd want to use different things.
C: We wanted to enable that. We were inspired by Apache Gobblin for the architecture of this, and I'll go into a little more detail as to what that means. The final thing we made sure to do when architecting this was file-based configuration: you write your configuration in a YAML or a TOML file, and then you can just run the DataHub ingestion framework against that config.
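A recipe-driven run along these lines might look roughly like the sketch below, with the config written here as a Python dict instead of a YAML file. The Pipeline import path, the source/sink type names, and the config keys are assumptions about the framework, not details quoted from the talk.

```python
# Minimal sketch of a recipe-driven ingestion run (names assumed, not verbatim).
from datahub.ingestion.run.pipeline import Pipeline  # assumed import path

# The same structure you would put into a YAML/TOML recipe file.
recipe = {
    "source": {
        "type": "file",  # replay metadata change events from a local file
        "config": {"filename": "path/to/sample_mce.json"},  # placeholder path
    },
    "sink": {
        "type": "datahub-rest",  # or "datahub-kafka", "console", "file"
        "config": {"server": "http://localhost:8080"},
    },
}

pipeline = Pipeline.create(recipe)
pipeline.run()
pipeline.pretty_print_summary()  # assumed helper for the end-of-run summary
```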
C: So how did we architect this? As I mentioned, it was inspired by Apache Gobblin. Right now we have two main abstractions: one is the source and the other is the sink. Sinks are the pieces that take an event of some sort and write it into DataHub, and that can happen over Kafka, over REST, or, for debugging purposes, you can just dump it to the console or write it to a file.
C: The metadata change event is a central concept in DataHub: every single change in metadata is modeled as a metadata change event, and you can do basically all of the operations you want by emitting a number of MCEs, or metadata change events. And then, finally, we have the sources. These are wide and varied: everything from databases, to a file, to even ingesting the metadata of Kafka itself. As long as the source can create a metadata change event, it can plug into the framework.
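To make the metadata change event idea concrete, here is a rough sketch of what a single MCE for one dataset might look like when built with the generated classes. The class names, field names, and URN format below are assumptions based on the Avro-driven codegen discussed later in the talk.

```python
# Sketch only: one MetadataChangeEvent describing a single dataset (names assumed).
from datahub.metadata.schema_classes import (  # assumed import path
    DatasetPropertiesClass,
    DatasetSnapshotClass,
    MetadataChangeEventClass,
)

mce = MetadataChangeEventClass(
    proposedSnapshot=DatasetSnapshotClass(
        urn="urn:li:dataset:(urn:li:dataPlatform:mysql,coursedb.students,PROD)",  # illustrative URN
        aspects=[
            DatasetPropertiesClass(description="Students table from the course selection tool"),
        ],
    )
)
# Any sink (Kafka, REST, console, file) can then take this event and write it into DataHub.
```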
C: I'll talk a little bit more about what it actually takes to add a source as well.
C: As with all live demos, I'm going to take a stab at it, but things go wrong in live demos, so bear with me. Hopefully you can all still see my screen here.
C: All of this lives in the metadata-ingestion directory. For the easiest way to install it, we build the schemas from the rest of DataHub and then have a relatively simple set of commands; you can just copy and paste them into your terminal and get set up immediately.
C: The first thing you might want to do is ingest some sample data, and the way to do that is "datahub ingest"; we've included a number of examples.
C: So if we run this, we'll see... oops, that's not supposed to happen. As with all demos, bear with me one sec; I'll worry about debugging that afterwards. Theoretically it will work, so we'll fix that later.
C: What you'll see is that you go to DataHub, go under Datasets, and we'll see some entries here; we'll fix the ingest. The other thing that I did: I used to be a student at Yale, and I ran a course selection tool there backed by a MySQL database, so I actually ingested this real-world database into DataHub.
C: This is the configuration. There's a username and password above; scrolling down, we have the host and port, and then we have a filter rule, so you can filter out certain MySQL tables and allow the rest in. Initially I just printed the output to the console, but we can actually write these to DataHub, over Kafka this time, let's say. So... oops, I've got to change the recipe to run, and then we can run it.
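The recipe being shown probably looks something like the sketch below, again written as the Python-dict form of the YAML file. The option names such as host_port and table_pattern, and the datahub-kafka sink config, are assumptions about the config schema, not values read off the screen.

```python
# Rough sketch of a MySQL -> DataHub-over-Kafka recipe (option names assumed).
mysql_recipe = {
    "source": {
        "type": "mysql",
        "config": {
            "username": "datahub_reader",  # placeholder credentials
            "password": "...",
            "host_port": "localhost:3306",
            # Filter rule: deny MySQL-internal tables, allow everything else.
            "table_pattern": {"deny": ["mysql\\..*"], "allow": [".*"]},
        },
    },
    "sink": {
        "type": "datahub-kafka",  # was "console" while debugging
        "config": {"connection": {"bootstrap": "localhost:9092"}},
    },
}
# Run it the same way as before: Pipeline.create(mysql_recipe).run()
```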
C: It will configure all of the ingestion, then it will go through and ingest a number of tables, and then you get a nice summary of what happened; here it filtered out all of the MySQL internal tables.
C: Here are all the tables that it actually fetched. If we go into DataHub, we can take a look at, let's say, the students table, and we get the full schema plus a bunch of other information related to the tables we just ingested. So that's the usage side of things. Let's talk a little bit about what it takes to add a source, starting with the simplest source that we have.
C: These are the three operations that a source needs to support, and if it supports those three, then we can integrate it into the rest of the framework. As an example, here's a very simple source that just reads from a file.
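In rough terms, a minimal file-backed source might look something like the sketch below. The method names (get_workunits, get_report, close) and the report helper are assumptions about the interface being shown, not a verbatim copy of the code on screen.

```python
# Sketch of a minimal source that replays events from a local JSON file.
# Method names and helper types are assumptions; consult the metadata-ingestion
# source code for the real interface.
import json
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class SourceReport:
    workunits_produced: int = 0
    failures: List[str] = field(default_factory=list)


class FileSource:
    """Reads serialized metadata change events from a file and hands them to the framework."""

    def __init__(self, filename: str):
        self.filename = filename
        self.report = SourceReport()

    def get_workunits(self) -> Iterable[dict]:
        # Each object in the file becomes one unit of work (one MCE) for the sink.
        with open(self.filename) as f:
            for obj in json.load(f):
                self.report.workunits_produced += 1
                yield obj

    def get_report(self) -> SourceReport:
        return self.report

    def close(self) -> None:
        pass  # nothing to clean up for a plain file
```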
C: It reads the file, constructs metadata change events out of it using the codegen classes, and then sends them into the rest of the framework; that's how simple it is to add a source. For a slightly more complex one, let's take a look at the source for Kafka. This ingests metadata about the topics, partitions, and so forth in Kafka and sends it into DataHub. Here it's a little bit more involved: you connect, you construct your consumer, and then it's a similar thing.
C: You use things like aspects and snapshots and all of these concepts that we're relatively familiar with; as long as you can do that, the source plugs into the rest of the framework accordingly. All right, the last thing I wanted to touch upon is where we are headed with this.
B: Can I ask a question? Can you walk us through the code gen that you're doing from the Avro schemas? I thought that was one of the hard things about this project.

C: Yep, sure.
C: Let's take a look here. The .avsc file, which is the Avro schema file, looks like a big JSON blob, and it has all of the fields and everything. Accordingly, we have a codegen system that I took from an open source project and modified, and I'm also working to contribute that back. With the codegen system, let's say this is the ML properties class; actually, for this, let's use the dataset one. When you construct it, the type annotations are automatically there and everything associates correctly, and this way, if you run mypy on this code, it completely checks every single assignment, every single constructor, all of that, so you know for sure that the code is at least semantically correct in terms of types.
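As a rough illustration of what the generated, type-annotated classes buy you, the sketch below shows the kind of mistake mypy can catch statically. The class and field names follow the generated schema_classes naming and are assumptions, not the exact code shown on screen.

```python
# Sketch: mypy checks assignments and constructors against the generated classes.
from datahub.metadata.schema_classes import DatasetPropertiesClass  # assumed import path

# Fine: description is a string, customProperties maps strings to strings.
props = DatasetPropertiesClass(
    description="Course enrollment records",
    customProperties={"owner_team": "registrar"},
)

# mypy flags this before the code ever runs: an int is not a valid description.
bad_props = DatasetPropertiesClass(description=42)
```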
C: So that's a little bit on the code gen, and obviously it generates a truly massive file of 5,000 lines, so I'm glad we aren't writing this by hand and updating it. Did that answer your question?
B: Thank you, yeah.
C: So let's talk a little bit about where we're headed with ingestion. We're going to run a more formal RFC process on this relatively soon, and then hopefully also publish a package to PyPI, so you can pip install, say, datahub or something along those lines and start executing this, without needing to do the codegen yourself and all of those other steps.
C
The
other
things
are
more
functional
improvements
so,
for
example,
detecting
when
metadata
stale.
So
if
you've
deleted
a
a
table
in
your
source
in
let's
say
mysql,
we
want
to
be
able
to
detect
that
it
disappeared
and
perform
the
according
the
the
associated
deletion
in
data
hub
right.
Now,
it's
a
purely
like
additive
process
and
we
can
do
updates,
but
the
the
deletes
are
not
something
that
we
get
support.
C
Another
thing
is
validating
that
it
was
actually
ingested
correctly.
So
you
know,
if
you
run
with
kafka
and
you
ingest
via
that,
you
just
send
you
know,
let's
say
30
events
to
to
a
kafka
topic
or
to
the
broker.
You
don't
necessarily
know
that
those
all
got
accepted
correctly
and
so
doing
that
validation
step
would
also
be
very
helpful
and
then
on
the
java
ingestion
side.
So
we
have
python
and
we
also
have
java
we're.
C
So
if
people
really
love
the
the
python
or
we'll
invest
more
on
that,
if
people
still
want
to
be
able
to
use
the
the
java
we'll
we'll
do
that
accordingly
as
well
and
then
the
last
thing
is
standardizing
a
testing
harness
between
the
two
so
that
we
can
test
functional
parity
between
the
the
java
and
the
python.
D: I have one, Harshal. I was hoping you could talk a little bit about how someone would productionalize this, say in Airflow or something.
C: Yeah, sure. We include a couple of sample DAGs in our repo, so that you can use this directly within Airflow. It's actually quite simple: let's say for MySQL, you just create a pipeline, give it a configuration, and call run. As long as you can do that within a Python operator in Airflow, you can run this, and you'll get all the standard error reporting out of Airflow, as you would expect.
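A minimal sketch of such a DAG is shown below, reusing the Pipeline API from earlier inside a standard PythonOperator. The import paths, recipe contents, and schedule are assumptions and placeholders, not the sample DAG from the repo.

```python
# Sketch: running the DataHub ingestion pipeline from an Airflow DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from datahub.ingestion.run.pipeline import Pipeline  # assumed import path


def ingest_mysql_metadata():
    # Same recipe structure as before; swap in real connection details.
    pipeline = Pipeline.create(
        {
            "source": {"type": "mysql", "config": {"host_port": "localhost:3306",
                                                   "username": "datahub_reader", "password": "..."}},
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()  # surface failures so Airflow marks the task as failed


with DAG(
    dag_id="datahub_mysql_ingestion",
    start_date=datetime(2021, 2, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest_mysql", python_callable=ingest_mysql_metadata)
```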
C: So it is production-ready already, and you can use it however you like. I do think it makes sense to run it within an orchestration framework like Airflow, or even cron if you want something simple, so that you continually get updated metadata from a given source instead of just a one-time import.
B: Yeah, this is cool, because I know that DataHub has this positioning almost as push-based, and so a lot of people think, oh, it's push-based, so I cannot pull metadata into this system. I think this shows people how easy it is, once you have a push-based system, to add a pull-based system upstream of it to essentially pull metadata into your system. You don't really have to choose between the two; you can do both if you want. All right, so I had just a few very small logistical things to go over very quickly.