From YouTube: DataHub @ hipages Case Study: Oct 29 2021
Description
Chris Coulson from hipages shares their experience adopting DataHub to supercharge data discovery, lineage, quality, and ownership. Chris shares their core use-cases, an overview of their modern data stack, and how they are leveraging the DataHub lineage graph to identify the most influential datasets & assign ownership.
Join us at our next Town Hall - RSVP here: https://forms.gle/g8EpCLnohtPLLtdg6
So my team is responsible for data orchestration around the company and looks after all of the machine learning components, both internally and externally facing. We've recently been on a journey where we did a technology evaluation, found that DataHub was the best tech for some of the problems we wanted to solve, and subsequently deployed DataHub and integrated it with our Airflow instance.
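Just to make that integration concrete, here is a minimal sketch of the kind of Airflow-to-DataHub hookup this enables, using the inlets/outlets lineage support from the acryl-datahub Airflow integration. The DAG, platform, and table names are placeholders, not our actual pipelines.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from datahub_provider.entities import Dataset  # shipped with acryl-datahub[airflow]

# Declaring inlets/outlets on a task lets the DataHub lineage backend
# record table-level lineage each time the task runs.
with DAG(
    dag_id="copy_users",  # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    copy = BashOperator(
        task_id="copy_users_to_lake",
        bash_command="echo 'placeholder for the real copy job'",
        inlets=[Dataset("mysql", "hipages.users")],       # hypothetical source table
        outlets=[Dataset("athena", "lake.users_daily")],  # hypothetical lake table
    )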
I just wanted to talk you through some of the things we've achieved and some of the cool things we've implemented along the way. Before we do that, though:
Just to give you a quick introduction to hipages. hipages is a two-sided marketplace in Australia, where we connect consumers, people who want a DIY job done or want something built, with tradesmen who can do that work. The company was founded 16 years ago and we recently listed on the ASX. We are growing, which is really cool, and our data team is still growing as well. To give you an introduction to the data team: this is the whole team.
A
We
are
focused
on
helping
the
company
start
to
make
data
driven
decisions
and
the
way
we're
doing
that
is
by
opening
out
all
of
our
data
assets
and
hopefully
helping
people
from
all
parts
of
the
company
consume
data
and
care
about
data,
and,
through
that
start,
making
decisions
based
on
real
insights
and
information
that
we
draw
around
how
people
interact
with
our
site
or
how
people
how
some
of
our
consistent
customers
use
our
platform.
With that in mind, people use data across the whole company, but some questions come up continually, and they can broadly be broken down by the specialism people work in.
The first group to think about is our data analysts and data scientists. They're principally interested in where they can find data and how that data is generated: what tables are available, what information is in those tables, and how that information was produced.
When you talk to our data engineers, they're really interested in what happens if a pipeline goes down: who do they need to alert? Who is the downstream consumer of that data, both in terms of processes and, when it's surfaced up to dashboard level, who actually cares about that data? What we don't want is for consumers of the data to be the ones telling us there's a problem; we want to be ahead of that, and we want to be able to fix it.
The other problem we have is with complex, interdependent pipelines. It can sometimes be quite difficult to work out, when there's an upstream failure, what we need to re-run in order to recover all of the downstream data. So we need to worry about lineage from that point of view, and we also care about how we can prevent these failures in the future.
These are a collection of common questions and issues we saw across our data ecosystem, and as we thought about how to solve them, we could break them down into essentially four different problems. Those problems really highlighted that we needed some kind of system, a metadata cache, that would help us surface all of these things, and when we came across DataHub we found it was the solution for that.
So, talking through the different problems we were seeing and how we were going to solve them: the first thing that became clear is that we needed to find a way of making our data discoverable, so that people can search through the information and data we have available and find what they need.
The other thing we found is that we need to be proactive about quality, so we needed to start thinking about profiling, and understanding what our data looks like when it's in flight. The final thing is ownership. When we start thinking about ownership, we start making people accountable and encourage responsibility for their datasets and data assets, and we start encouraging them to document those assets properly and really explain them. That all comes together as we migrate towards a mesh architecture.
We actually developed our own Helm charts to bring up the persistent storage layers, and then used the community Helm charts to deploy the DataHub components. We also switched on things like the authentication layers, just to make sure it was all nicely secured, and it seems to be working well at the moment.
So, to go back to the problems we were trying to solve, let's first look at discovery. The thing with discovery is we wanted to make sure that people could find all the tables, or find all the dashboards that have already been created, so that we could start sharing ideas and knowledge, take it out of that tribal world, and bring it into a more discoverable pattern.
What we've used is the recipes available through DataHub to ingest all of our metadata, from both our consumption and processing workloads, and really start surfacing the information people need. To give you an example, we can now use the free-text search to find tables, dashboards, and processing steps related to specific keywords, and through that we're starting to share ideas in a better, more controlled way.
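For a flavor of what one of those ingestion recipes looks like, here is a minimal sketch of a MySQL ingestion run expressed through DataHub's Python Pipeline API rather than a YAML recipe; the host, database, credentials, and GMS address are placeholders.

from datahub.ingestion.run.pipeline import Pipeline

# Pull table and schema metadata from a MySQL source and push it to DataHub.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": "prod-db.example.com:3306",  # placeholder host
                "database": "hipages",                    # placeholder database
                "username": "datahub_reader",             # placeholder credentials
                "password": "change-me",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms:8080"},  # placeholder GMS address
        },
    }
)
pipeline.run()
pipeline.raise_from_status()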
The next thing to talk about is lineage, and we've done a lot of work in this space, principally because the way our system is architected made it quite easy to bring up lineage covering quite a lot of our processing stack very quickly. We'll break this down into the three main areas; we've done a bit more work than this, but just to give you a flavor, here are the three main things we've done. First, we've enabled RDS log inspection to give us lineage inside the production environment.
To break that down: we currently run a lot of cron jobs on our MySQL instance, and what we did is add labels to those cron jobs so that we'd be able to identify them in the RDS logs. Now we run inspection on the RDS logs to analyze all the queries being executed in that environment, which provides lineage for the queries operating in production without really touching the production code, and without having to inject extra dependencies or affect performance on the production side.
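As an illustration only, not our exact parser, here is a sketch of that inspection idea: assume each cron job tags its SQL with a comment such as /* job:nightly_rollup */, scan the RDS log for tagged statements, and hand each one to a SQL lineage parser (the open-source sqllineage package is used here for illustration).

import re
from sqllineage.runner import LineageRunner  # third-party SQL lineage parser

# Hypothetical label convention: each cron job prefixes its SQL with
# a comment like  /* job:nightly_rollup */  so it shows up in the logs.
LABEL = re.compile(r"/\*\s*job:(?P<job>[\w-]+)\s*\*/")

def lineage_from_log(log_lines):
    """Yield (job_name, source_tables, target_tables) for each labelled query."""
    for line in log_lines:
        match = LABEL.search(line)
        if not match:
            continue
        parsed = LineageRunner(line)
        yield (
            match.group("job"),
            [str(t) for t in parsed.source_tables()],
            [str(t) for t in parsed.target_tables()],
        )

# Example over a single hypothetical log line:
log = ["/* job:nightly_rollup */ INSERT INTO stats.daily SELECT * FROM prod.events"]
for job, sources, targets in lineage_from_log(log):
    print(job, sources, targets)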
The next thing we looked at is how we do our snapshots. We use a generation pattern to generate DAGs that copy tables from our MySQL instance into our lake, so we were quickly able to add lineage for all of the source-to-lake transformations that take data from the operational data store and put it into the data lake.
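A minimal sketch of what each generated copy task can emit, using DataHub's Python REST emitter; the platform names, table names, and GMS address are placeholders.

from datahub.emitter.mce_builder import make_dataset_urn, make_lineage_mce
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080")  # placeholder address

def emit_copy_lineage(source_table: str, lake_table: str) -> None:
    """Record that a lake table is a straight copy of an operational table."""
    mce = make_lineage_mce(
        upstream_urns=[make_dataset_urn("mysql", source_table)],  # operational store
        downstream_urn=make_dataset_urn("athena", lake_table),    # lake copy
    )
    emitter.emit_mce(mce)

# e.g. one call per generated copy task (hypothetical table names):
emit_copy_lineage("hipages.jobs", "lake.jobs_snapshot")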
The next thing we've done, as we migrate towards a mesh architecture, builds on something we've actually had in place for a long time: the idea of SQL templating. People can write SQL, and if it describes something like a daily snapshot, we can template it with Airflow using the temporal values; for example, we might say: do this lake-to-lake transformation on a daily basis.
By analyzing the templated SQL, we can then generate the lineage of that lake-to-lake transformation as a pre-execution step in the Airflow instance and emit the lineage into DataHub. The last bit is still in flight: tracing lineage all the way back from the dashboards, from the dashboard queries that generate all the insights and visualizations, right back through to the source.
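To make the templated-SQL step concrete, here is a sketch under stated assumptions: render the Airflow-style {{ ds }} placeholder the way Airflow would, then parse the rendered SQL for source and target tables (again using the sqllineage package for illustration) so lineage can be emitted before the query ever runs. The SQL and table names are made up.

import jinja2
from sqllineage.runner import LineageRunner

# Hypothetical daily lake-to-lake transformation written by an analyst.
TEMPLATED_SQL = """
INSERT INTO lake.daily_job_stats
SELECT job_id, COUNT(*) AS events
FROM lake.job_events
WHERE event_date = '{{ ds }}'
GROUP BY job_id
"""

def pre_execution_lineage(sql_template: str, ds: str):
    """Render the template as Airflow would, then extract table-level lineage."""
    rendered = jinja2.Template(sql_template).render(ds=ds)
    run = LineageRunner(rendered)
    return run.source_tables(), run.target_tables()

sources, targets = pre_execution_lineage(TEMPLATED_SQL, "2021-10-29")
print(sources, targets)  # -> [lake.job_events] [lake.daily_job_stats]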
We'll quickly rattle through the last two bits we've done. First, data quality. DataHub recently released data profiling steps for tables and we're trying to use those, but it turns out that it's quite expensive and computationally complex to analyze all of your tables.
So what we've done is try to identify some of our key tables and just profile those. We tried profiling in Athena and it was very expensive, both computationally and, because of the way Athena is architected, from an AWS billing point of view as well: it was taking a lot of time to process even though not a lot of data was being generated. So we needed to identify and target our most influential tables.
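Sketching how that targeting might look in an ingestion recipe, restricting DataHub's SQL-source profiling to an allow-list of key tables; the region, workgroup, and table patterns are placeholders, and the key tables themselves come out of the graph analysis described next.

from datahub.ingestion.run.pipeline import Pipeline

# Profile only an allow-list of key tables instead of the whole lake.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "athena",
            "config": {
                "aws_region": "ap-southeast-2",  # placeholder region
                "work_group": "primary",         # placeholder workgroup
                "profiling": {"enabled": True},
                "profile_pattern": {
                    # hypothetical key tables picked via the lineage-graph analysis
                    "allow": ["lake\\.jobs_fact", "lake\\.tradie_leads"]
                },
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()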
We took our lineage graph and started running some graph analytics over the top of it. When it comes to identifying key tables, you've got three types. The first is tables that we know are key to the business and are consumed by some of the most important decision makers around the company, so we profile those. The second is tables that are consumed a lot by different ETL processes.
If a table is consumed in a lot of different processes, that suggests it's a very key table, and that can be identified in the lineage graph by looking at the out-degree of the node, that is, the number of downstream processes consuming it.
The last thing you can do is look for influential nodes, using things like centrality measures on the graph. By doing that, we can identify tables that might not be consumed by a lot of processes directly, but are very influential downstream.
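Here is a small sketch of both measures over a toy lineage graph, using the networkx library; the edges are made up for illustration.

import networkx as nx

# Toy lineage edges as (upstream, downstream) pairs.
edges = [
    ("src.jobs", "lake.jobs"),
    ("lake.jobs", "lake.job_stats"),
    ("lake.jobs", "dash.jobs_report"),
    ("lake.job_stats", "dash.exec_kpis"),
    ("lake.job_stats", "lake.weekly_rollup"),
]
G = nx.DiGraph(edges)

# Heavily consumed tables: many direct downstream consumers (out-degree).
most_consumed = sorted(G.out_degree(), key=lambda kv: kv[1], reverse=True)

# Influential tables: high centrality even with few direct consumers.
centrality = nx.betweenness_centrality(G)
most_influential = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)

print(most_consumed[:3])
print(most_influential[:3])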
The last thing we can look at is ownership. Again, this is a project that's in flight and in part depends on the wider engineering team. What we can do is use our lineage graph to start looking for clusters of tables that all get processed together. By doing that, we can identify groups of processing steps, or groups of tables, that are commonly used together; from there we can start defining domains within our tables and then assigning ownership of those domains.
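Again as a sketch, community detection over the same kind of graph is one way to propose those domains; this uses networkx's modularity-based communities over made-up edges.

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy lineage edges; in practice these come from the DataHub lineage graph.
edges = [
    ("src.jobs", "lake.jobs"),
    ("lake.jobs", "lake.job_stats"),
    ("src.tradies", "lake.tradies"),
    ("lake.tradies", "lake.tradie_leads"),
    ("lake.job_stats", "dash.exec_kpis"),
]
G = nx.DiGraph(edges)

# Tables that are processed together cluster into the same community;
# each community is a candidate domain to hand to an owning team.
for i, domain in enumerate(greedy_modularity_communities(G.to_undirected())):
    print(f"domain {i}: {sorted(domain)}")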
What we've got here is a lineage graph generated using all the processes I've just described. You can see that this is a source table in MySQL; an Airflow process links it through, and from there you can start tracking all the lake-to-lake transformations and how it's consumed downstream. You can see that this table informs a lot of other tables, and the way it's consumed becomes quite complex.
You can see all these downstream processes here. And just to give you a flavor of what we're working on at the moment: this is a recent feature we've released, and it's being tracked by some dashboards in Redash. You can see here that the Redash dashboard tracks through to a chart, and that the chart consumes this specific dataset here, which we'll link in once we move this to production.
Just to say thanks to the team. I obviously didn't do all this work; this is the talented group of people that actually did the work I'm presenting, so thanks to all of them. And yeah, thank you for your time. Sorry if I'm not quite awake, it's quite early in the morning for me, so I'm sure I look very sleepy. If you need to get hold of me, reach out to me on the DataHub Slack channel. Other than that: we're hiring.