Description
DataHub Lineage Demo with Airflow and Superset
Community TownHall on Apr 23rd 2021
Airflow Lineage Support
Lineage Viz Demo
B: Okay, cool. Let me, yeah, present.
A: The first is pretty simple: using Airflow essentially as a cron system to just run ingestion on a schedule. Similar to how you define a recipe with the DataHub CLI, you can create a pipeline, give it your source, tell it to push to GMS via the datahub-rest sink, and just run it every day. It's pretty simple to do this and then set it up with Airflow.
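The scheduled-ingestion approach described above can be sketched as a plain Python dict mirroring the YAML recipe. This is a minimal sketch: the `snowflake` source, the credentials, and the server URL are illustrative placeholders, not values from the talk.

```python
# Sketch of a DataHub ingestion recipe as a Python dict, mirroring the YAML
# you would pass to `datahub ingest -c recipe.yml`. The source type and the
# credential/server values below are illustrative placeholders.
recipe = {
    "source": {
        "type": "snowflake",  # any supported source type
        "config": {
            "username": "example_user",
            "password": "example_password",
        },
    },
    "sink": {
        "type": "datahub-rest",  # push to GMS over REST
        "config": {"server": "http://localhost:8080"},
    },
}

# In an Airflow DAG, you would wrap the run in a task and let the DAG's
# schedule (e.g. "@daily") act as the cron trigger, roughly:
#
#   from datahub.ingestion.run.pipeline import Pipeline
#   Pipeline.create(recipe).run()

print(recipe["sink"]["type"])
```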
A: The second method is to emit MCEs via a DataHub operator directly within your DAG. The reason you might want to use this is if, say, you're generating a DAG and you know exactly what lineage, or some extra information, applies to a given dataset. You can construct that MCE object right there and push it up to DataHub, to tell DataHub whatever you know from within that Airflow DAG.
A: Now, in order to use this, you have to set up an Airflow connection. This is a pretty standard thing. Then you just put the connection ID in the operator as a parameter, and Airflow will figure out the rest of how to pass the credentials into the emitter operator and then push that information to DataHub.
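The pieces involved here can be sketched in plain Python. The URN format shown is DataHub's standard dataset URN; `make_dataset_urn` is a hypothetical helper written for illustration, and the operator usage in the comment is a rough sketch of the emitter operator the DataHub Airflow integration ships (its module path has moved between releases, so check the docs for your version rather than treating this as a pinned API).

```python
# Hypothetical helper illustrating DataHub's standard dataset URN format:
#   urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

# In a real DAG, the emitter operator takes the Airflow connection ID
# (configured once via the Airflow UI or CLI) plus the MCEs to emit,
# roughly like this (illustrative sketch, not a pinned API):
#
#   DatahubEmitterOperator(
#       task_id="emit_lineage",
#       datahub_conn_id="datahub_rest_default",
#       mces=[...],  # MetadataChangeEvent objects referencing dataset URNs
#   )

print(make_dataset_urn("snowflake", "demo_db.public.orders"))
```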
Now, a third way to integrate Airflow and DataHub, and the one that I'm most excited about, is via the lineage backend.
A: So the way this works is you set up a little bit of config in your Airflow config. If you see that second screenshot, you configure the DataHub Airflow lineage backend, which ties DataHub into Airflow, and give it the connection ID, similar to how we did it in the operator case. Then, in the operators within your DAG, you pass inlets and outlets, and this is an Airflow-native integration.
A: Every single operator supports inlets and outlets in the right version of Airflow. You just declare the datasets that are consumed and produced by a given job, and DataHub is able to view and visualize all of that metadata. Plus, it fetches a bunch of extra metadata about the DAG and the tasks themselves.
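Declaring inlets and outlets on an operator can be sketched as below. The tiny `Dataset` class here is a stand-in for the dataset entity class the DataHub integration provides; the class definition, field names, and dataset names are illustrative, not the real package's API.

```python
from dataclasses import dataclass

# Stand-in for the dataset entity class the DataHub Airflow integration
# provides; only platform and name are modeled here for illustration.
@dataclass(frozen=True)
class Dataset:
    platform: str
    name: str

# In a real DAG, any operator in a lineage-enabled Airflow version accepts
# inlets/outlets, roughly:
#
#   BashOperator(
#       task_id="transform",
#       bash_command="...",
#       inlets={"datasets": [Dataset("snowflake", "db.schema.raw_events")]},
#       outlets={"datasets": [Dataset("snowflake", "db.schema.daily_summary")]},
#   )
#
# The lineage backend then reports these consumed/produced datasets, plus
# DAG and task metadata, to DataHub on each run.

inlets = {"datasets": [Dataset("snowflake", "db.schema.raw_events")]}
outlets = {"datasets": [Dataset("snowflake", "db.schema.daily_summary")]}
print(len(inlets["datasets"]), len(outlets["datasets"]))
```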
A: So you can view properties like what parameters were passed into the task, or when this thing was last run. Now, the one caveat here is that it requires a little bit of config, and it only works with Airflow 1.10.15 or newer, or 2.0.2 or newer, because the lineage backend was not supported prior to those versions. And so there you have it. If you want to learn more, on the next slide:
A: I've got a link to the docs where you can read about it, and I'm on Slack if you have any other questions. Awesome, and thank you; passing it over to Gabe. Sweet.
C: So next up, I'm going to talk about how users of DataHub can take advantage of these connections that we have between entities to get a better understanding of the data that they have. I know that Harshal, in his spare time, made some analytics pipelines and dashboards about the demo data that we have for our demo DataHub project.
C: We have lots of datasets there, and as a little pet project he built some pipelines, and some dashboards, to visualize metadata about them. I'm going to go through and explore this, understand what he's built, and use lineage to get a better understanding of, and more trust in, the data. Say I wanted to understand what the documentation coverage is for our demo data, so that I could understand:
C
Well,
what
what
platforms
do
I
need
to
improve
the
documentation
for
so
I
can
go
into
the
search
bar
and
search
for
documentation,
and
I
see
a
superset
chart
that
has
been
created.
That
gets
the
completeness
of
documentation
for
some
data
sets.
I
can
click
into
that
in
our
superset
integration.
That's
been
included
in
the
new
release.
You
can
go
and
see
some
basic
properties
like
the
metrics
and
the
dimensions
of
the
chart,
as
well
as
the
sources
that
feed
into
that
chart.
C: But how do I know that these datasets that are being talked about in the chart are the datasets that I'm interested in, and are actually our demo data?
C: This is when I'm going to need to go into the lineage view to see the whole picture of how this chart was created. So when I click this button in the top right of my entity, I get taken to a graphical view that shows, in the center, the chart that we're talking about. It also shows the upstream table dependency that the chart is reading from, and, downstream, what we saw before.
C: So I might make a mental note that, if this chart is what I'm looking for, I might want to go investigate this dashboard and see what other charts it has. And if I double-click on this dashboard, I can re-center the graph around it and see all the other charts that are contained in that dashboard.
C: Going back to this upstream table, I can click on it and get the full name. If there were a description, or other metadata like tags, I would be able to see that as well. But all I know is this: it is some generated Snowflake table in Harshal's pipeline. How do I actually know that this table is generated off of the data that I'm expecting, and that the dimensions are constructed in the way that I want? At this point, I'm going to hit this plus icon to further expand out the lineage graph.
C: So, as I hit these pluses, the lineage graph expands out and I can see more and more of the dependencies of these tables, until finally, once all the pluses have been clicked, I now have a full picture. As I zoom out, I can see the full flow of the lineage that leads to my dashboard. Zooming in, I can see it all starts with this S3 bucket, which is a snapshot of our demo aspects, for people familiar with the project.
C: Exactly. So what we've done here is built a pipeline around the metadata included in our demo DataHub project. As you can see from the flow from left to right, we're starting with the raw data of the aspects, and then Harshal has created pipelines to produce derived tables off of these aspects, so that they're easier to consume. Until finally we have the table that all the Superset charts read from, which is a much more consolidated version of our metadata that charts can easily be built off of.
C: And if I want to zoom in and look, I can then inspect an Airflow task. Say I might want to know: okay, I know now that the source data and the flow of data seem expected, but how do I know the transformations that Harshal has done are the transformations that I would want? I can then click "View Profile" on this Airflow task that I've highlighted, and I'm brought to the Airflow task's profile.
C: When I click on that, I'm brought to Airflow, and I can actually go in, inspect the code, and verify that this code is what I'm expecting, and that the transformations Harshal's done are the transformations that I expect. Going back to our lineage graph: now that I've been able to do my due diligence on the lineage flow, understanding from the beginning all the way down to my chart how this data is transformed, I can now, finally, go back to the chart
C: profile, click out to my "View in Superset" button, and now I finally have confidence in this chart I'm looking at. It doesn't just say it's showing dataset documentation completeness; I understand, from the beginning, through the transformations, all the way to the chart, that this is what I expect.
B: We've got to improve the Snowflake documentation. That much is clear.
C: That's right. It looks like the Snowflake documentation is lacking and S3 is barely there, but our BigQuery documentation and our Kafka documentation are looking pretty good. So now I know next week I've got my work cut out for me, and I can now go into my other charts, do investigations there, and draw more conclusions. So now, in the new release, now that we have the Airflow integration, we have dbt, Superset, Looker, and other sources that are helping tie your different entities together.
C: These lineage visualizations will help you understand how all your data relates to each other, and with that, this concludes our walkthrough of the lineage feature.
B: Awesome, cool. Let's move to the next section. One second while I go find my...
B: Yep, here we are. And, oh, another handoff: we have John, who's going to talk to us about the DataHub hackathon that the Depop folks did. It was pretty cool to see them pop in, literally pop in, to the community channel and say, "Hey, we're doing a hackathon," and in a few days they were contributing the Glue integration back to us. And the Klarna folks helped them out, so thanks for that collab; I think it was really nice to see that happen.