From YouTube: Automatic Spark Lineage
Description
Mugdha Hardikar and Shirshanka Das team up to give a demo of Spark Lineage during the January Town Hall
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
A: The jar is essentially a listener for Spark application events; it produces metadata events and sends them over to the DataHub metadata service. Mugdha actually recorded a couple of nice demos for us to walk through. It basically produces metadata for data jobs: the Spark application becomes a pipeline, along with the input and output datasets that the job consumed and produced.
B: Let's see how we can use our DataHub Spark lineage jar in a Jupyter notebook. Here we have a couple of imports, specifically for PySpark, and after that, when we create a Spark session using the SparkSession builder, we specify the master, whichever we want: it can be local, or it can be a cluster setup as well. Then comes the app name. The app name is important, because whatever app name is specified here will be reflected in DataHub.
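As a rough sketch of what that notebook setup could look like (not the exact demo code): the master and appName are set as just described, and the remaining keys are the ones documented for the datahub-spark-lineage agent; the jar version, master URL, and DataHub server address below are placeholders.

```python
from pyspark.sql import SparkSession

# Hedged sketch: jar version, master URL, and DataHub server address are placeholders.
spark = (
    SparkSession.builder
    .master("local[*]")                    # or a cluster master, e.g. spark://spark-master:7077
    .appName("flight_departure_analysis")  # this name becomes the pipeline name in DataHub
    # Pull in the lineage jar and register its listener for Spark application events
    .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:<version>")
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    # Where the emitted metadata events are sent (the DataHub metadata service)
    .config("spark.datahub.rest.server", "http://localhost:8080")
    .enableHiveSupport()                   # assumption: needed later for saveAsTable into Hive
    .getOrCreate()
)
```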
B: The airports CSV contains information about all the airports, which city and state each airport is in, and so on, whereas the flights CSV contains information about the flight: the day of the month, the carrier, what the origin airport was, what the destination airport ID is, and whatever delays are present, like departure and arrival. In this particular notebook I'm going to work on departure delays and see how the departures are affected by the airport and the carrier together.
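Just to illustrate, continuing from the session above, loading the two CSVs might look roughly like this; the HDFS paths and column comments are assumptions based on the fields mentioned, not the demo's actual files.

```python
# Hypothetical HDFS paths; the demo reads both inputs from HDFS.
airports = (
    spark.read.option("header", "true").option("inferSchema", "true")
    .csv("hdfs:///data/airports.csv")   # airport id, city, state, ...
)
flights = (
    spark.read.option("header", "true").option("inferSchema", "true")
    .csv("hdfs:///data/flights.csv")    # day of month, carrier, origin/dest airport ids, dep/arr delays
)
```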
B: So what I have done here is merge the two CSVs to get all the data together. After that I applied an aggregation function over the delay, the mean departure delay, and whatever information we got back we have written into a table. That kind of information will be useful for further analysis, let's say in model training, where, instead of just giving the departure airport, if we provide the average delay at that airport for the given carrier, it might be better for those models.
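A minimal sketch of that join-and-aggregate step, assuming hypothetical column and table names; the real notebook may differ in detail.

```python
from pyspark.sql import functions as F

# Join flights to airports on the origin airport id (column names are assumptions)
merged = flights.join(
    airports, flights["ORIGIN_AIRPORT_ID"] == airports["AIRPORT_ID"], "inner"
)

# Mean departure delay per origin airport and carrier
avg_dep_delay = (
    merged.groupBy("ORIGIN_AIRPORT_ID", "CARRIER")
          .agg(F.mean("DEP_DELAY").alias("mean_dep_delay"))
)

# The write is the only step the lineage listener records as a task;
# the reads above show up as the upstream datasets of this write.
avg_dep_delay.write.mode("overwrite").saveAsTable("flight_dep_delay_by_airport_carrier")

spark.stop()  # stop the session so the application run is finalized
```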
B: So that's why we created this and stored it into a Hive table, which will be utilized further by our pipeline, and then we just call spark.stop. I'm going to execute this notebook, and then we will see how this executed notebook gets reflected in DataHub. So let me open DataHub, and you will quickly see how it is being picked up. Right now it's executing.
B: Now here is the pipeline. If we go to the pipelines, we see Spark, because the pipeline was executed over Spark. Here it shows whatever master we have; if I had, let's say, spark-master:7077, then that is what would pop up here. I just click on local, and this is our pipeline. This is the name of the pipeline, which is reflected from our app name, flight departure analysis. So let's go inside flight departure analysis, into its documentation.
B: Here we can see when the job was started, what the application name was, who created or triggered the job, and what the ID of that app was inside the Spark context. In addition, here we will see tasks. In our notebook we have only one task which actually writes or updates something, and everything before it is read-only, so those things won't get recorded; only the step where we are actually storing something will get recorded.
B: So, as we can see, we have a saveAsTable, or whatever the native save method was. Here we can see it has properties, and in the properties we can see things like the application name and the complete description. The important thing here is that we have the query plan: the logical query plan is shown, along with the query execution ID. So we can actually relate this query execution ID, as well as the app ID, to the Spark history server, in case we want to link the two things together.

B: Inside this job the lineage is being created, so there are two upstream lineages and one downstream, because we read from the two CSV files and we write to one dataset. If we look, this is the dataset that was written, which is a Hive dataset, and the two before it are HDFS datasets, because we are reading them from HDFS and processing them. As you can see, they have different lineages: there is no upstream, there is only one downstream, which is being created by this job. So all the information related to this is present inside it, and we can see what...
A: Let's get out of this. Cool. So I will post both of these videos: one that shows how to do this with the notebook, and the other with the spark-submit option and how to add the jars via spark.conf. The documentation for this is also available on the website. What's coming next is Spark 3.x support, support for Databricks environments, and column-level lineage. There were a couple of questions around column-level lineage as well.
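For the spark-submit variant mentioned above, the same settings could be supplied as --conf flags (or in spark-defaults.conf); a rough sketch, with the package version and script name as placeholders:

```bash
spark-submit \
  --packages io.acryl:datahub-spark-lineage:<version> \
  --conf spark.extraListeners=datahub.spark.DatahubSparkListener \
  --conf spark.datahub.rest.server=http://localhost:8080 \
  flight_departure_analysis.py
```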