From YouTube: DataHub Basics: Lineage 101
Description
John Joyce & Surya Lanka (Acryl Data) review the basics of managing lineage in DataHub during the November 2021 Community Town Hall.
Referenced Links:
https://github.com/linkedin/datahub/blob/master/metadata-ingestion/README.md#using-as-a-library
A
Right, zooming along, we're going to dive into Lineage 101. This one is a combination of the basics and a highlight of some of our new features. It really comes from community feedback that lineage was still kind of an opaque feature that folks didn't really know how to interact with.
A
So we just wanted to take a chance to do a run-through of it from start to finish. John, do you want me to share my screen, or do you want to take control?
B
You can keep the screen. I'll just ask you to flip.
B
[Thanks] for that intro, Maggie. Like Maggie mentioned, we're going to do two sections. One talks about how we model lineage in DataHub at a higher level, as well as the integrations that are most notable for each type of lineage. Then we'll hand it over to Surya to give an overview of how to emit lineage, a more technical dive. With that, I think we can move on. So, taking a step back:
B
Data ecosystems are quite complex, and in many cases it's very difficult to understand how data is actually moving. Lineage is going to help you unlock two things. One is proactive impact analysis: understanding who's going to be affected when you make a change to a particular data asset or data job. Then there's the reactive side, which is debugging something once it's already gone wrong.
B
So if you notice that some dataset is showing weird results, it's about understanding what is upstream of that dataset and being able to triage that more efficiently.
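The two uses of lineage described above, impact analysis and upstream debugging, can both be pictured as walks over a lineage graph. The following is a minimal sketch in plain Python, not DataHub code: the asset names, the `UPSTREAMS` mapping, and the helper functions are all hypothetical and only illustrate the idea.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to its direct upstreams.
UPSTREAMS = {
    "dashboard.revenue": ["table.revenue_daily"],
    "table.revenue_daily": ["table.orders_clean"],
    "table.orders_clean": ["topic.orders_raw"],
    "table.churn_scores": ["table.orders_clean"],
}


def invert(edges):
    """Flip upstream edges into downstream edges."""
    downstream = {}
    for node, parents in edges.items():
        for parent in parents:
            downstream.setdefault(parent, []).append(node)
    return downstream


def reachable(start, edges):
    """Breadth-first walk: every node reachable from start, nearest first."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def upstream_of(asset):
    """Reactive debugging: what feeds this asset, directly or indirectly?"""
    return reachable(asset, UPSTREAMS)


def impacted_by(asset):
    """Proactive impact analysis: what depends on this asset?"""
    return reachable(asset, invert(UPSTREAMS))
```

Calling `upstream_of("dashboard.revenue")` walks back toward the raw topic, while `impacted_by("table.orders_clean")` lists everything that a change to that table could affect.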
I think we can move on, and we'll talk first about dataset-to-dataset lineage.
B
There are various types of entity-to-entity lineage that are currently supported on the platform. I think the most important right now is dataset-to-dataset lineage. We model that through the upstream lineage aspect, which is attached to the dataset entity, and this allows you to see relationships between datasets.
B
Currently, we have two notable integrations where we can extract that information automatically on your behalf: Snowflake, for which we use the access history log, and BigQuery, where we use the audit log. What you can see here in this example is the lineage that's been extracted from BigQuery using the audit log.
B
If you go to the next slide, Maggie, you can see that we also support Snowflake-to-Snowflake dataset lineage. One other thing that I'll call out is that we support fetching lineage between Redshift and external tables. That's what you're seeing on the left side of the screen here, where we have a Redshift table that's actually pointing to some external source that's on Glue.
B
So that's also a new addition to the lineage graph support. That's the first broad type of entity-to-entity lineage. I think we can move on to the next one, which is really about tracking lineage between datasets and the pipelines that they are a part of. Currently we model a concept, or an entity, called a data job.
B
And so, if you annotate that on your task, we are able to pick up that information behind the scenes and push it as lineage edges into DataHub automatically, as long as those annotations exist. You can actually find that as a package on the Astronomer marketplace, under DataHub, if you search for it. That's what you're seeing in the right corner here. It's very, very easy to set up.
B
It takes probably two lines of configuration, and I think it's one of the most popular integrations that we have right now. You can see on the left side there that we have this lineage being captured from a Snowflake table to an Airflow job, which is doing some SQL or other transformation, back to a Snowflake table, so we're able to model that edge.
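The table-to-job-to-table shape described above can be sketched as a job that declares its inputs and outputs, from which lineage edges are derived. This is a simplified stand-in in plain Python, not the Airflow integration itself: the `DataJob` class, the `lineage_edges` helper, and the task and table names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataJob:
    """Simplified stand-in for a pipeline task with annotated inputs and outputs."""
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


def lineage_edges(job):
    """Return (source, target) edges: inputs feed the job, the job feeds outputs."""
    edges = [(src, job.name) for src in job.inputs]
    edges += [(job.name, dst) for dst in job.outputs]
    return edges


# Hypothetical task that reads one Snowflake table and writes another.
job = DataJob(
    name="airflow.transform_orders",
    inputs=["snowflake.raw.orders"],
    outputs=["snowflake.analytics.orders"],
)
edges = lineage_edges(job)
```

The point of the sketch is that the job sits in the middle of the path: the two tables are connected through it rather than directly to each other.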
I think you can move on now, Maggie. The next edge we'll talk about is between dashboards and charts.
B
Now, some people get confused about this, because a lot of people intuitively think, "Isn't a chart only associated with one dashboard?" That's what I initially thought as well. But as I started playing around with these dashboarding tools, I found that most of them actually allow you to break out charts into reusable little submodules and then associate them with different dashboards.
B
So what we've done is model an explicit edge between a dashboard and a chart, such that charts can be attached to multiple dashboards. That edge is modeled through the dashboard info aspect, which is attached to the dashboard entity. I think there's an inputs field in there, which specifies which charts are serving as the inputs to the dashboard.
B
The integrations that I'll call out for this edge type are Looker, where we're able to support automatic extraction of relationships between models, explores, views, and looks, and Superset, where we're able to get those dashboard and chart relationships. That's what you're seeing on the bottom two screens here: we have Superset on the right and the Looker integration highlighted on the left.
B
Okay, the next one is chart to dataset. Mostly, when you're building a chart, you're going to have some input dataset. On these platforms, whether it's Looker, Mode, Superset, or anything else, you're going to have a view that serves as the basis of that chart. So what we do is model this chart-to-dataset edge, and that's through the chart info aspect, which is attached to the chart entity.
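The dashboard-to-chart and chart-to-dataset edges described above can be chained to answer questions such as "which dashboards reuse this chart?" or "which datasets sit behind this dashboard?". The following is a minimal sketch with plain dictionaries; the entity names and helper functions are hypothetical and do not reflect the actual aspect schemas.

```python
# Hypothetical dashboard -> chart edges. One chart appears on two dashboards.
DASHBOARD_CHARTS = {
    "dashboard.sales": ["chart.revenue_by_region", "chart.orders_trend"],
    "dashboard.exec_summary": ["chart.revenue_by_region"],
}

# Hypothetical chart -> dataset edges.
CHART_DATASETS = {
    "chart.revenue_by_region": ["view.revenue"],
    "chart.orders_trend": ["view.orders"],
}


def dashboards_using(chart):
    """All dashboards that include the given chart."""
    return sorted(d for d, charts in DASHBOARD_CHARTS.items() if chart in charts)


def datasets_behind(dashboard):
    """All datasets feeding any chart on the given dashboard."""
    found = set()
    for chart in DASHBOARD_CHARTS.get(dashboard, []):
        found.update(CHART_DATASETS.get(chart, []))
    return sorted(found)
```

Because the edge is explicit rather than a one-to-one ownership, `dashboards_using("chart.revenue_by_region")` can return more than one dashboard.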
B
I think that's very useful. Finally, I'll talk about what is, in some ways, a new category of edge or lineage support that we just rolled out recently: rich lineage to capture dbt-native lineage, using the dbt concepts themselves that you're familiar with, like tables, views, tests, seeds, etc. What you can see on the bottom here is that we're able to use datasets as the base entity type and extend, or basically subclass, datasets for each one of those dbt-specific concepts, and then link them together.
B
We actually recently rolled out this ability to subtype base entities such as datasets, so that you can achieve a much more native-feeling experience without having to reinvent all of these models or model a bunch of new concepts on DataHub. So we're very excited about this new integration and definitely looking for feedback on it. With that, I'm going to hand it over to Surya, who's going to talk about how you can push your own custom lineage into DataHub.
C
Okay, yeah. Many times we got this request where people wanted to do their own custom lineage, so we have extended the samples, which now cover all the types of lineage you have seen so far. So, how to do that? Here is a simple example of how you can emit dataset-to-dataset lineage.
C
If you look at this, it's between the Hive table and the upstream Kafka topic, so it's very simple. The recommended way of constructing most aspects these days is the MCPW.
C
So this is the lineage aspect. This is basically how we construct it from the upstreams, and we simply emit it using the emit_mcp method of the DataHub REST emitter. A few things to note here: most of the identifiers are URNs. It's entity to entity, and any entity typically has a URN. If you wonder what these things are, "hive" is identifying the platform, so each URN is essentially identifying a full entity.
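To make the URN idea concrete, here is a minimal sketch of how a dataset URN embeds the platform, and how a Hive table with a Kafka upstream could be described. It uses plain Python rather than the DataHub library: the `dataset_urn` helper and the `UpstreamLineageSketch` class are hypothetical stand-ins, and the table and topic names are made up. The URN layout shown follows the usual DataHub dataset URN form.

```python
from dataclasses import dataclass, field


def dataset_urn(platform, name, env="PROD"):
    """Build a dataset URN string; the platform is itself a nested URN."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


@dataclass
class UpstreamLineageSketch:
    """Hypothetical stand-in: one downstream dataset and its upstream URNs."""
    downstream_urn: str
    upstream_urns: list = field(default_factory=list)


# A Hive table whose upstream is a Kafka topic (names are illustrative).
lineage = UpstreamLineageSketch(
    downstream_urn=dataset_urn("hive", "logging_events"),
    upstream_urns=[dataset_urn("kafka", "logging_events_topic")],
)
```

In the real library, the equivalent objects would be wrapped in a metadata change proposal and sent through the REST emitter, as described in the talk; this sketch only shows how the identifiers fit together.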
C
So we specify the entities to emit simple upstream lineage between datasets this way. But if you click on samples, I would like people to see all the examples we have. Another very important thing to note is that whenever you emit any aspect, the existing aspect gets completely overwritten, which means that if you have existing edges, they'll be gone.
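The overwrite behavior described above is easy to trip over, so here is a small illustration. It is a sketch with an in-memory dictionary standing in for the metadata store, not the DataHub client: the `store`, `emit`, and `add_upstream` names and the table names are hypothetical. It shows one way to keep existing edges, namely reading the current list, merging, and emitting the full result.

```python
# In-memory stand-in for a metadata store, keyed by (entity, aspect name).
store = {}


def emit(entity, aspect_name, upstreams):
    """Mimic overwrite semantics: the emitted value replaces the stored one."""
    store[(entity, aspect_name)] = list(upstreams)


def add_upstream(entity, new_upstream):
    """Read-modify-write: merge the new edge into the current list, then emit."""
    current = store.get((entity, "upstreamLineage"), [])
    merged = current if new_upstream in current else current + [new_upstream]
    emit(entity, "upstreamLineage", merged)


# Naive approach: the second emit silently drops the edge to table_a.
emit("table_c", "upstreamLineage", ["table_a"])
emit("table_c", "upstreamLineage", ["table_b"])
after_overwrite = list(store[("table_c", "upstreamLineage")])

# Safer approach: read what is there, merge, and emit the full list.
add_upstream("table_c", "table_a")
after_merge = list(store[("table_c", "upstreamLineage")])
```

After the two plain emits only `table_b` remains as an upstream; after the merge step both edges are present.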
C
People need to know these things, so it's all documented here. We have about 10 samples here covering all the lineage types that John talked about. So yeah, that's pretty much it. Feel free to go through the samples and provide your feedback.
B
Awesome, thank you, Surya. Just to wrap up: we've recently rolled out more documentation around this concept. As Surya mentioned, we have examples showing every single edge type that I talked about, as well as creating every single edge type using this metadata change proposal wrapper abstraction. Definitely take a look at that.