From YouTube: Metadata Use Cases at LinkedIn: DataHub Community
Description: Lightning Talk by Shirshanka Das at DataHub Community, Nov 6, 2020
Actually, what are the different things you can do with metadata, or what are the different directions we've taken in terms of use cases to build on top of metadata? I thought it might be good to tell the community how we look at metadata and what we are doing with it at LinkedIn, so it gives people an idea of where they can take this at their own companies.
So, I'm Shirshanka. I'm a principal staff software engineer at LinkedIn. I have been accused of many things, including being a godfather of three projects: LinkedIn DataHub, Apache Gobblin, and Dali. There are a lot of other projects I've worked on, but these are the three that are probably closest to my heart. And I have promised Naga that I'll keep this to five minutes, so I'm going to go forward really quick. All right.
So this is what we use to build a set of metadata services that actually talk to each other and form what I would call a metadata mesh or a metadata fabric. And on top of that is DataHub the app. This is what most people see: the application that actually enables productivity and governance use cases on top of this mesh.
We have lots and lots of types of data stores. We have streams coming out of them, and we have dumps coming out of them, into a warehouse that holds streams as well as batch data. There's a bunch of standardization and reporting, and then derived data going back into these stores, sometimes going back out into the services, sometimes going out externally into third-party APIs. It's complex. Hopefully everyone has similar problems; that's why we are here.
So the first use case, of course, is search and discovery. That's the bread and butter of what we do, and everyone gets it. We take the metadata platform, connect the entire ecosystem to it, and then we build an app on top of this platform that gives us search and discovery. Everyone understands what that looks like: you get search, you get faceted browse across a bunch of different kinds of entities, and then you can explore relationships among them.
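To make "faceted browse across entities" concrete, here is a minimal in-memory sketch; the entity fields and the `facet_search` helper are invented for illustration, not DataHub's actual API (real deployments back this with a search index):

```python
from collections import Counter

# Hypothetical entity records; a real system indexes these in a search engine.
ENTITIES = [
    {"name": "pageviews", "type": "dataset", "platform": "kafka", "owner": "ads"},
    {"name": "members", "type": "dataset", "platform": "mysql", "owner": "growth"},
    {"name": "ctr_model", "type": "ml_model", "platform": "mysql", "owner": "ads"},
]

def facet_search(query, facets=None):
    """Filter entities by name substring and facet values, then count buckets."""
    facets = facets or {}
    hits = [e for e in ENTITIES
            if query in e["name"]
            and all(e.get(k) == v for k, v in facets.items())]
    counts = {f: Counter(e[f] for e in hits) for f in ("type", "platform", "owner")}
    return hits, counts

hits, counts = facet_search("", facets={"owner": "ads"})
print([e["name"] for e in hits])  # everything owned by the "ads" team
print(counts["type"])             # remaining facet counts by entity type
```

The facet counts are what drive the sidebar checkboxes in a typical discovery UI: each filter you apply narrows the hit set and recomputes the buckets.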
Then something interesting happened. The AI team started saying: oh, we need a bunch of things. We need explainability, we need reproducibility, we've got all this model training going on and we need a place to store this stuff. And I said: well, we've built this thing called GMA. It allows you to store metadata; you might want to use it. And they actually went all in on it. So we have the metadata platform extending to support metadata around experiments, metadata around model training, and metadata around features.
A
There
are
a
few
other
concepts
that
have
shown
up
and
the
goals
have
always
been
reproducibility,
auditability
visibility
and
then
consistency
of
concepts,
and
also
this
thing
where
you
know
you
want
things
to
be
integrated
with
the
dev
workflow.
If
I'm
checking
in
my
stuff
in
git,
my
metadata
should
be
right
there
with
it.
So
we've
added
a
bunch
of
new
concepts,
like
you
know,
what's
the
problem
statement
that
the
ai
team
that
the
the
experiment
is
about
what
are
the
pipeline
and
run
infos?
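One way to picture those concepts (problem statement, pipeline, run info) living next to the code is a small record serialized to JSON and checked into Git alongside the experiment. The field names below are purely illustrative, not GMA's actual schema:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class RunInfo:
    run_id: str
    git_commit: str            # ties the run to the exact code version
    metrics: dict = field(default_factory=dict)

@dataclass
class ExperimentMetadata:
    problem_statement: str     # what the experiment is about
    pipeline: str              # which training pipeline produced it
    runs: list = field(default_factory=list)

exp = ExperimentMetadata(
    problem_statement="Improve feed ranking CTR",
    pipeline="feed-ranking-train-v2",
    runs=[RunInfo("run-001", "abc1234", {"auc": 0.81})],
)
print(json.dumps(asdict(exp), indent=2))  # lands in the repo next to the code
```

Because the record carries the Git commit, anyone can rebuild the exact training setup later, which is what reproducibility and auditability demand.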
The other big thing that we've done is compliant data management. We've got a ton of data, obviously, on-prem and in the cloud, 100-plus petabytes of it and growing every day, and we have a ton of APIs, some internal, some external, that we're integrating. And we had similar problems: hey, do we know where all the data is? What is the compliance status of those things? And we have retention policies and fine-grained data deletion, stuff that we need to do.
So: can I attach purge policies to every single dataset that I need to care about, to every single API that we need to be exchanging data with? We've managed to do that. The single metadata platform has tentacles into every single entity in this ecosystem, and you can attach tags as well as policies to these things, and then we use this.
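As a toy sketch of what "attach tags as well as policies" can look like, here is a minimal metadata store keyed by entity URN; the `MetadataStore` class and aspect names are made up for illustration (LinkedIn's platform models these as versioned aspects on entities):

```python
class MetadataStore:
    """Minimal metadata store: entity URN -> named aspects."""
    def __init__(self):
        self._aspects = {}

    def attach(self, urn, aspect_name, value):
        self._aspects.setdefault(urn, {})[aspect_name] = value

    def get(self, urn, aspect_name):
        return self._aspects.get(urn, {}).get(aspect_name)

store = MetadataStore()
urn = "urn:li:dataset:(hdfs,member_profiles)"
store.attach(urn, "tags", ["pii", "member-data"])
store.attach(urn, "purgePolicy", {"type": "auto-purge", "retention_days": 180})

print(store.get(urn, "purgePolicy"))
# {'type': 'auto-purge', 'retention_days': 180}
```

Once every dataset and API carries a policy aspect like this, downstream systems can enforce retention and deletion mechanically instead of by tribal knowledge.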
This big thing called Apache Gobblin, which is our massive data ingestion and lifecycle management system, does all of these things as operators on top of the metadata. So you can ingest data from external APIs, and you can then manage its lifecycle inside: you can do limited retention, you can do fine-grained data deletion, stuff like the right to be forgotten, and you can automatically create obfuscated data to create PII-free zones. And then, finally, you can actually export and manage this data in external APIs using this metadata. So if I've got some data in Salesforce, I can export that data there, or to Dynamics, export it there, and then, when the customer deletes their data or wants their data to be deleted, we can actually fire off deletes against that external endpoint.
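That export-and-delete flow can be sketched as follows: the metadata records which external endpoints hold a copy, so a deletion request fans out to all of them. Names like `EXPORTS`, `fire_delete`, and `propagate_deletion` are invented for this sketch:

```python
# Hypothetical registry: dataset -> external endpoints it has been exported to.
EXPORTS = {
    "crm_contacts": ["salesforce", "dynamics"],
}

def fire_delete(endpoint, member_id):
    # Stand-in for a real delete call against the external system's API.
    return f"DELETE {member_id} @ {endpoint}"

def propagate_deletion(dataset, member_id):
    """Fan a member's deletion request out to every endpoint that got a copy."""
    return [fire_delete(ep, member_id) for ep in EXPORTS.get(dataset, [])]

print(propagate_deletion("crm_contacts", "member-42"))
# ['DELETE member-42 @ salesforce', 'DELETE member-42 @ dynamics']
```

The key point is that the fan-out is driven entirely by metadata: if the export registry is complete, no copy of the member's data is missed.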
The other interesting thing that happened was governance workflows. So we've got the metadata platform, but metadata is changing. Along with the core metadata, the schemas, and the compliance tags, there are logs: when data is getting ingested, when access is being granted. And we had a few scenarios that we wanted to stay on top of: ownership of an asset must be locked down and be good, schema changes must be sane, deletion must happen in time.
A
Access
must
be
granted
in
accordance
with
our
policies,
and
so
what
we
did
was
we
took
the
mediator
platform,
it's
stream
first
and
supports
batch
integration,
so
you
can
do
change
processing
on
it.
So
you
can
write
a
deletion
monitor.
You
can
write
a
schema
change,
monitor.
You
can
write
an
ownership
monitor.
You
can
write
an
access
monitor
and,
as
these
things
change,
you
can
assert
on
these
things
that
you
want
to
keep
happening
in
your
ecosystem
and
those
things
can
then
fire
off
issues
or
alerts
or
actually
lock
down
data
sets.
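In spirit, each such monitor is just a consumer of the metadata change stream that asserts an invariant and raises an alert when it breaks. A minimal sketch, where the event shape and handler names are assumptions rather than the real API:

```python
ALERTS = []

def ownership_monitor(event):
    """Invariant: every dataset keeps at least one owner."""
    if event["aspect"] == "ownership" and not event["new_value"]["owners"]:
        ALERTS.append(f"{event['urn']}: ownership dropped to zero")

def schema_change_monitor(event):
    """Invariant: schema changes must not silently remove fields."""
    if event["aspect"] == "schema":
        removed = set(event["old_value"]["fields"]) - set(event["new_value"]["fields"])
        if removed:
            ALERTS.append(f"{event['urn']}: fields removed {sorted(removed)}")

def process(stream):
    # In production this would be a stream-processing job over change events.
    for event in stream:
        ownership_monitor(event)
        schema_change_monitor(event)

process([
    {"urn": "urn:li:dataset:tracking", "aspect": "ownership",
     "old_value": {"owners": ["alice"]}, "new_value": {"owners": []}},
    {"urn": "urn:li:dataset:profiles", "aspect": "schema",
     "old_value": {"fields": ["id", "email"]}, "new_value": {"fields": ["id"]}},
])
print(ALERTS)
```

Because the platform is stream-first, the same pattern covers deletion and access monitors too: each one subscribes to the relevant aspect's changes and either alerts or triggers a lockdown action.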
A
Orc
chart
that's
been
pretty
exciting
as
well,
so
I
talked
about
three
things:
search
and
discovery,
ai
model,
reproducibility
feature
reproducibility
compliant
data
management
and
governance
workflows,
and
there
are
actually
a
few
and
I
just
ran
out
of
time,
so
I'm
not
going
to
get
into
them
data
quality
operations,
monitoring
and
we're
really
just
getting
started
with
what
we
can
do
for
the
whole
company
using
this
one
metadata
platform.