From YouTube: Grab & DataHub Community Case Study
Description
Grab team members Alex Dobrev, Harvey Li, and Amanda Ng share how they are using DataHub to democratize discoverability with computational data governance.
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
A: Hi everyone, and thanks to Maggie and the rest of the wonderful team at Acryl Data for having us today. I'm sorry this will be a recording, but our team is based in Singapore and we're trying to keep a healthy schedule. Just kidding; obviously, working in data engineering, we're quite nocturnal, but we just wanted to save you from our late-night phases. Okay, in any case, let's get started.
A: My name is Alex. I work on the data platform product team here at Grab, and together with my teammates Amanda and Harvey, today we're going to tell you about our journey to improve data discoverability with computational data governance here at Grab.
A: In order to cater for this, and after evaluating multiple options in 2021, we decided to invest in DataHub for three main reasons. First, the autonomy and flexibility to customize it according to the needs of the data community here at Grab and the specifics of our data infrastructure. Second, the maturity of the technology and the advanced metadata architecture. And third, and potentially foremost, the amazing community behind DataHub that is driving the solution forward.
A: I'd be surprised if there is another metadata management solution out there that has almost 5,000 people behind it. With that, let me pass it over to Harvey and Amanda for the more interesting and practical part of today's talk. Thank you.
B: Thanks, Alex, for the intro. Indeed, DataHub is so extensible, and it has a strong community that is pushing it towards the forefront of metadata management. At Grab, we forked the DataHub repo, and we're rebranding it as Hubble. Since DataHub was deployed at Grab a few months ago, tens of releases have been made internally. We established a streamlined release process to ensure we have an easy way to merge our code changes with new DataHub releases.
B: Our release cadence closely follows the community's. Each release is rebased against a recent community version, with additional features developed internally at Grab; some of those have been contributed back to the community. Next, Amanda and I will highlight some features that we extended on top of DataHub for data discovery and data governance use cases.
B: Scalability has never really been an issue for DataHub. Today we ingest 3.5 million metadata change proposals (MCPs) into GMS, either synchronously or asynchronously, on a daily basis; that translates to around 40 MCPs per second. We managed to ingest metadata for over 100,000 people within 15 minutes with parallelism in place.
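For a sense of what this looks like in practice, here is a minimal sketch of emitting a single MCP with the open-source DataHub Python emitter; the GMS endpoint and the dataset URN are placeholders rather than Grab's actual setup, and whether proposals are processed synchronously or asynchronously is decided on the server side.

```python
# Minimal sketch: emit one metadata change proposal (MCP) to GMS over REST.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder

mcp = MetadataChangeProposalWrapper(
    entityUrn="urn:li:dataset:(urn:li:dataPlatform:hive,demo.orders,PROD)",
    aspect=DatasetPropertiesClass(description="Orders fact table"),
)
emitter.emit(mcp)  # one REST call per proposal
```

Throughput like 40 MCPs per second comes from running many such emitters, or the ingestion framework itself, in parallel rather than from any single call.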
Besides the Presto-on-Hive plugin, which has now been open-sourced, we also created additional entities and aspects in the metadata graph. One of them is a generic entity type, which we call Others, to cover custom entities at Grab.
B
I'll
share
more
later
tooth
here,
I'd
like
to
share
how
we
scale
Hive
injection,
to
reduce
injection
time
from
over
20
hours,
using
original
hard
plugin
to
50
minutes
with
a
new
plugin
that
we
initially
developed
called
Preston
height
from
the
original
hype.
Plugin
metadata
is
fetched
by
looking
through
every
schema
table
and
view.
If
there
are
X
Gamers
white
tables
and
zip
views,
there
will
be
in
total,
X
Plus
y
plus
Z
described
queries
sent
to
high
server
2
to
parallelize.
The
injection
there's
no
effective
way
to
shut
new
and
existing
tables
as
well.
B
So,
instead
of
connecting
to
Hive
service,
we
fetch
metadata
directly
from
hack
meta
store,
DB
we're
starting
to
achieve
ingestion
in
parallel.
Now,
instead
of
using
X,
Plus
y
plus
Z
queries
as
we
saw
when
using
High
plugging,
it
only
requires
three
queries
to
fetch
metadata
for
schemas
tables
and
Views.
With
multiple
parallel
tasks
running.
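To make the three-query idea concrete, here is an illustrative sketch, not the actual plugin code, of reading schema, table, and view metadata straight from a MySQL-backed Hive Metastore using its standard tables (DBS, TBLS, SDS, COLUMNS_V2); the connection string is a placeholder.

```python
# Illustrative sketch: fetch all Hive metadata with three queries against the
# metastore database itself, instead of X + Y + Z DESCRIBE calls to HiveServer2.
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://reader:secret@metastore-db:3306/hive")

SCHEMAS_SQL = "SELECT NAME FROM DBS"  # one row per schema

TABLES_SQL = """
SELECT d.NAME AS schema_name, t.TBL_NAME, c.COLUMN_NAME, c.TYPE_NAME
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
JOIN SDS s ON t.SD_ID = s.SD_ID
JOIN COLUMNS_V2 c ON s.CD_ID = c.CD_ID
WHERE t.TBL_TYPE != 'VIRTUAL_VIEW'
ORDER BY d.NAME, t.TBL_NAME, c.INTEGER_IDX
"""

VIEWS_SQL = """
SELECT d.NAME AS schema_name, t.TBL_NAME
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
WHERE t.TBL_TYPE = 'VIRTUAL_VIEW'
"""

with engine.connect() as conn:
    schemas = conn.execute(text(SCHEMAS_SQL)).fetchall()
    tables = conn.execute(text(TABLES_SQL)).fetchall()
    views = conn.execute(text(VIEWS_SQL)).fetchall()
```

Because each query returns full result sets keyed by schema and table, the rows can then be partitioned into shards and converted into MCPs by parallel workers.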
B: One of my favorite things in DataHub is its super-extensible metadata model, where you just need to define the PDL schema and update the entity registry to add a new entity or aspect, and everything else is then taken care of by auto code-gen. Next, I'd like to highlight two examples we did on this front. The first is on the dataset entity: we added a time-series aspect to support sample events for Kafka topics.
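Grab's sample-events aspect is custom to their fork, so as an illustration of the same time-series mechanism, here is a sketch that emits DataHub's built-in datasetProfile time-series aspect against a placeholder Kafka dataset.

```python
# Illustrative sketch: time-series aspects carry a timestampMillis and are
# stored per point in time. Grab's sample-events aspect is custom; the
# built-in datasetProfile aspect is used here only to show the mechanism.
import time

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetProfileClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder

profile = DatasetProfileClass(
    timestampMillis=int(time.time() * 1000),  # the time-series key
    rowCount=12345,
)
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:dataset:(urn:li:dataPlatform:kafka,orders_events,PROD)",
        aspect=profile,
    )
)
```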
B
As
more
use
cases
are
onboarded,
we
often
have
to
make
the
Judgment
call
on
whether
to
reduce
some
existing
entity.
Add
new
aspects
through
an
existing
entity
or
to
create
a
new
entity
in
the
metadata
model.
The
current
subtypes
as
well
means
the
extensibility
of
an
entity
type.
For
example,
we
can
model
table
view
topic
or
as
data
sets
in
beta
Hub.
However,
there
are
occasions
when
some
entities
are
very
bespoke
to
internal
data
platforms.
B
Such
we
create
a
generic
entity
called
others
in
data
Hub
to
cater
to
all
these
unique
groups
of
use
cases.
It
has
an
aspect
for
generic
properties
that
support
different
rendering
types
so
that
a
property
can
be
rendered
as
table
string
code
or
Json
in
a
UI
in
a
screenshot
here.
What
you
see
is
another
entity
with
a
subtype
name
called
Skywalker
MLG
notes
are
entities
for
graphs,
contain
ranking
platform
in
short
data
hubs,
plug-in-based
ingestion
framework
and
schema
first
metadata
model
are
what
impulses
to
make
data
Discovery
democratize
across
all
sorts
of
entities.
C: Thanks, Harvey. I will now be going through data governance at Grab. Data at Grab is extremely valuable, as it forms the foundation of many of our critical systems and operations, so we really need to establish our data as something governed while still delivering the greatest value and agility from that data. At Grab, data governance anchors on three main points. Firstly, data is immensely valuable to Grab: great data unlocks great growth, encouraging innovation, improving operations and the quality of our products.
C: Secondly, data management is democratized. At Grab, there's no single role or department that owns every aspect of the data. It cannot be seen or treated in isolation; it's something that everyone in the company may be consuming from to make important decisions. Everybody in the company should be able to work with the data comfortably and confidently. And lastly, data must be agile.
C: Information classification at Grab is a process where owners identify the sensitivity level of the company's information, so as to apply an appropriate level of protection to that information as it's created, updated, stored, transmitted, or deleted. The level of sensitivity affects a user's access to the corresponding data. Prior to DataHub, we were relying on an enterprise software product to help manage sensitivity levels, and customizations were extremely limited, so all we could do was guide users through documentation and rely on them to infer and conclude the data sensitivity level.
C
This
manual
work
led
to
a
lot
of
invalid
classifications
against
the
organizational
rules.
For
instance,
the
table
was
intact
as
containing
Pi
when
the
column
is
tagged
as
containing
pii,
which
then
led
to
several
rounds
of
organizational
white
campaigns
from
our
governance
and
cyber
security
teams
to
manually
validate
these
rules
across
hundreds
of
schemas
and
100
000
tables
within
this
was
incredibly
taxing
for
the
team
taking
up
huge
number
of
men
hours
to
go
through
them.
Data
Hub
enabled
grep
to
pick
this
number
down
to
zero.
C
So
how
do
we
take
that
number
down
to
zero?
We
modeled
such
metadata
into
glossary
terms
and
nodes.
We
added
validations
on
the
react
application
and
on
the
entity
service
within
GMS.
These
allowed
the
validation
errors
to
be
displayed
both
to
the
users
using
the
user
interface
to
update
and
to
any
external
Services
ingesting
these
rules
directly
into
Data
hub.
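As a simplified illustration of the kind of rule these validations enforce (the actual checks live in the React app and the GMS entity service; the exact rule and the URNs below are assumptions), a consistency check between table-level and column-level PII terms might look like this.

```python
# A simplified, hypothetical rule: a table must be tagged as containing PII
# exactly when at least one of its columns is. Grab's real validations run in
# the React app and the GMS entity service; the URN here is a placeholder.
from typing import Dict, List

PII_TERM = "urn:li:glossaryTerm:PII"  # placeholder glossary term URN

def validate_pii_consistency(
    table_terms: List[str], column_terms: Dict[str, List[str]]
) -> List[str]:
    """Return validation errors; an empty list means the tagging is consistent."""
    errors: List[str] = []
    pii_columns = [col for col, terms in column_terms.items() if PII_TERM in terms]
    table_has_pii = PII_TERM in table_terms

    if pii_columns and not table_has_pii:
        errors.append(f"Columns {pii_columns} are tagged PII but the table is not.")
    if table_has_pii and not pii_columns:
        errors.append("Table is tagged PII but none of its columns are.")
    return errors

# Example: one PII column but no table-level term -> one validation error.
print(validate_pii_consistency([], {"email": [PII_TERM], "order_id": []}))
```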
C: So we have talked about creating these information classifications, and the next topic is how we determine who should own these access controls. We define these decision makers as technical data owners: individuals who are accountable for the security access controls of the datasets they manage, at a schema or database level. And since the role of technical data owner for a schema isn't trivial, like granting access to possibly hundreds of tables at one go, we have to enforce that this person meets the minimum requirements set up by our data governance office.
C: Considering that this is a highly specific and customized piece of logic, we decided to use the DataHub Actions framework. The Actions framework allowed us to read only the events we need, in this case ownership assignments for schemas, and to reassign the owner if he or she doesn't meet the minimum requirements.
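A minimal sketch of what such an action can look like with the open-source DataHub Actions framework follows; the class name, the eligibility check, and the reassignment helper are stand-ins for Grab's internal logic rather than their actual code.

```python
# Sketch of a custom DataHub Actions framework action that vets new owners.
from datahub_actions.action.action import Action
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.pipeline.pipeline_context import PipelineContext


def meets_minimum_requirements(owner_urn: str) -> bool:
    # Stub: Grab checks this against rules from their data governance office.
    return False


def reassign_owner(entity_urn: str, rejected_owner_urn: str) -> None:
    # Stub: revert the ownership aspect and notify the user (e.g. via chat).
    pass


class TechnicalDataOwnerAction(Action):
    """Reacts only to ownership additions and vets the new owner."""

    @classmethod
    def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action":
        return cls(ctx)

    def __init__(self, ctx: PipelineContext):
        self.ctx = ctx

    def act(self, event: EventEnvelope) -> None:
        if event.event_type != "EntityChangeEvent_v1":
            return
        change = event.event
        # Only ownership additions are of interest here.
        if change.category != "OWNER" or change.operation != "ADD":
            return
        if not meets_minimum_requirements(change.modifier):
            reassign_owner(change.entityUrn, change.modifier)

    def close(self) -> None:
        pass
```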
C: In this snippet, you can see that adding me, Amanda, as a technical data owner results in a Slack message being sent, and when the page gets refreshed, the ownership gets reassigned to Harvey. With these two customizations in place, we have ensured that the flow for data access is governed. And with that, we've come to the end of our presentation. All of this is possible thanks to the amazing team working on Hubble and the support from the DataHub community.