Description
Elizabeth Cohen and Maggie Hays sat down with Steven Po, Senior Data Engineer from Coursera - a global online learning platform - to hear about how he and his team are using DataHub to manage their increasingly complex metadata landscape.
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
A
B
Absolutely, yeah. So hi, everybody, I'm excited to be here with you all today. I'm Steven Po, senior data engineer within the data engineering team at Coursera, and previously I've worked in various companies ranging from early-stage startups to mature enterprises. For those who might not be familiar with Coursera: Coursera was founded by Daphne Koller and Andrew Ng in 2012, and today we're a global online learning platform and also a B Corp, and we offer anyone, anywhere, access to online courses and degrees from leading universities and companies.
A
B
Oh yeah, absolutely. So for a bit of context, our data stack at Coursera has evolved over time. We started off with a few online databases pulling data into Redshift, and then we exposed it to Looker, and all of that was orchestrated within an internal solution that we have. Over time, the number of tools that we have has expanded: we've added components like Databricks, Airflow, Amplitude, etc., and the number of data assets has grown over time as well. We're hitting, I think, solid five figures from a storage perspective as well as a visualization perspective.
So all of that calls for an increased focus on metadata around our data stack, and so we started looking into what open source solutions are out there. We came across a few, and we came across DataHub. While we were researching and conducting a POC, we found that DataHub aligns with our needs a lot, in the sense of data documentation, a unified search experience, lineage information, and additional metadata. After that POC, our interest in DataHub increased, and we were also very impressed with the vibrant and supportive community. Also, thanks to Acryl Data for helping to drive that open source community.
A
Yeah, awesome. So where are you all right now in your adoption journey? You're still in the evaluation stage, is that right?
B
Yeah, so we do have a roadmap for adopting some sort of metadata solution, but I think there were some specific things that we wanted to enable within our organization, and those fell into, I would say, maybe three major buckets: discoverability of data, data management, and data governance. Some example use cases: if we have datasets available, where are they, and who do we go to to ask questions? Even that would be a major win. In terms of data management, for example, if a production job goes down, who do we end up notifying in terms of downstream impact on Looker or whatnot? In terms of governance, maybe use cases around GDPR: what happens if we delete or anonymize a certain record in the online database, and how does that trickle down into all of our downstream visualizations? So all of those questions are things that we wanted to enable with DataHub.
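The downstream-impact and GDPR questions raised here are, at heart, graph traversals over lineage. A minimal sketch in Python, assuming a hypothetical in-memory lineage map and owner table (the asset names, team names, and both helper functions are illustrative only and are not DataHub's API):

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed in its value.
LINEAGE = {
    "mysql.enrollments": ["redshift.enrollments"],
    "redshift.enrollments": ["looker.enrollment_dashboard", "redshift.revenue_daily"],
    "redshift.revenue_daily": ["looker.revenue_dashboard"],
}

# Hypothetical ownership metadata: asset -> owning team.
OWNERS = {
    "redshift.enrollments": "data-eng",
    "redshift.revenue_daily": "data-eng",
    "looker.enrollment_dashboard": "growth-analytics",
    "looker.revenue_dashboard": "finance-analytics",
}


def downstream_assets(lineage, start):
    """Return every asset reachable downstream of `start`, breadth-first."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:  # guard against revisiting shared or cyclic nodes
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def owners_to_notify(lineage, owners, failed_asset):
    """Collect the distinct owners of all assets downstream of a failed asset."""
    impacted = downstream_assets(lineage, failed_asset)
    return sorted({owners[a] for a in impacted if a in owners})


if __name__ == "__main__":
    print(downstream_assets(LINEAGE, "mysql.enrollments"))
    print(owners_to_notify(LINEAGE, OWNERS, "mysql.enrollments"))
```

The same traversal would answer the GDPR question: starting from the table where a record is deleted or anonymized, the reachable set is the list of downstream tables and dashboards that need to be checked.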
A
Awesome. Can you share a little bit about what you enjoy most about the DataHub community?
B
Oh yeah, absolutely. I think, as alluded to earlier, I'm very excited noticing a lot of talented individuals, everybody willing to share their ideas and their experiences. Overall, very responsive, very vibrant. And also special kudos to Acryl Data again for helping to drive that community and increase adoption, and also for being supportive, as always, with our non-stop questions.
A
Yeah, it's been a lot of fun. I mean, a really big shout-out back to you all: you and your team have been just really great partners in giving us feedback and helping us understand use cases and flesh them out, so love right back to you.
So I'm curious, within your organization: I know we talked a little bit about discovery and a little bit about governance. I'm going to flip the question that we had on its head a little bit. Who are the end users of DataHub, either currently or targeted in the near future, within your organization?
B
Yeah, that's a good question. I think we would envision a phased rollout plan. Because we do have a lot of internal data team use cases, we would likely start with data engineering and data science as perhaps that first batch of users. But from a longer-term perspective, we envision that anybody who works with data as a critical part of their day-to-day responsibilities would be a consumer of DataHub. We're also envisioning that we can incorporate our production platform engineering, who would be able to help us input metadata, documentation, and so on and so forth, and also perhaps opening it up to business users, where they could input documentation in terms of what they're seeing and how certain datasets could be used for their use cases.
So I think, TL;DR: we'll start with the data team, we would like to roll it out to the broader company, and we would also like to roll it out with our content production platform engineers to help us prepare data as well.
A
That's awesome, great. And speaking of different features and use cases, what is your favorite DataHub feature or use case?
B
Yeah, I think that answer will change over time, as new features are constantly being added. To answer this question for now, we will start off with what we were looking for initially, which is the unified search experience. I find that to be very delightful, and the team finds it to be delightful. And I think the newly added lineage downstream impact is something that we have always held as a high priority. Again, shout-out to Gabe and to [name unclear] for working on this. I'm really excited about this. It will help us enable our operational use cases and also foster that collaboration between different stakeholders of our company. So I would focus the answer on those for now, but I'm sure that answer will change over time as new features are added.
B
A
Yeah, I'm really excited about the impact analysis. Man, it's something that I wish I would have had access to five years ago. I would have had a fundamentally different experience in my role in analytics and BI engineering. All right, so then, thinking a little bit into the future: is there anything on your radar for 2022 that you're excited to see in DataHub, the product?
B
Yes, that's a good question. In terms of specific features on the roadmap that we're noticing, there are a couple that we are excited about, such as column-level lineage. I know that's a work in progress, and I'm super excited for that one. For information, we had an internal poll around what use cases we were looking for within Coursera, and that actually came out really high, higher than I had originally anticipated. So super excited about that.
Role-based access controls, I think, would be awesome in terms of controlling which groups of users can view which schemas or which documentation. I think that would be very high on our list as well.
Selfishly, a Delta Lake integration, just because we're going to be using Databricks Delta Lake a lot, so that's exciting. And also integrations, or potential integrations, with Slack and Jira for collaboration use cases that we've seen: notifications for changes, and maybe surfacing previous questions or tickets on certain data assets. I think those were highly exciting features, among others.
And maybe in terms of themes, some themes that we're also excited for are sort of...
A
B
Some examples of that could be: now that we have that end-to-end picture of our data, is there a possibility where we could potentially automatically optimize how pipelines are scheduled to meet certain SLA criteria, reduce concurrency issues and costs, and maybe help suggest where we could potentially consolidate our data assets to avoid having multiple sources of truth? I think those are topics that we can consider as well.
The other theme would be governance: potentially looking into some sort of centralized data policy and access control, where we consolidate enterprise-wide data policies and make sure that we can control access to the data contents themselves in a centralized fashion. That would be super helpful.
B
A
No, those are some amazing use cases, and I'm also really excited to see how we can start to build out some of these alerting frameworks or recommendation frameworks to really manage all of the entities everywhere. Right now, we're very focused on surfacing entities regardless of platform, regardless of source. But how do we help people curate that a little bit better, and actually remove redundancies, or even remove high-risk data entities, so we minimize potential data governance or compliance risk? I just think there are some really exciting ways for the community to build out best practices or frameworks around those things.
A
A
B
Yeah, that was tough to answer as well. I think a lot of the channels were very helpful in terms of finding experiences and answers. So if I had to pick one, perhaps the announcements channel: it continuously helps us keep up to date in terms of developments on the DataHub product. So I find that to be very helpful, and it's just very exciting to see how the product evolves over time.
A
Or what advice would you give to someone who's joining the DataHub community or starting to work with DataHub? What advice would you give them in the early days?
B
Yeah, that's a good question. I would say, from our experience, number one: the documentation is very helpful, so we made some time to dig through that documentation.
What we also find particularly helpful is the Slack channel. The official DataHub Slack has all of the previous messages, and those are not cut off by something like "we only retain the last ten thousand messages," as Slack sometimes has. So I find looking through those Slack channels for previous responses super helpful.
And lastly, don't be shy about raising questions to the community. I've noticed that a lot of people raise questions, which I think is really helpful, because a lot of times we all have similar questions. For us, we sometimes jump on existing question threads and work with that and get more details from that perspective. So that would be my advice from my experience.
B
B
Absolutely, it was. Yeah, thanks for having me.