From YouTube: Conquer Data Governance with Acryl Data’s Metadata Tests
Description
Maggie Hays & John Joyce (Acryl Data) share a framework for conquering Data Governance with Acryl Data’s Metadata Tests during DataHub's October Town Hall.
Presentation Deck:
https://docs.google.com/presentation/d/1bE_rY9dZCfrDcfRVR4-0G1XTNTaXPnhe9nTikeOUR6Y/edit?usp=sharing
Learn more about DataHub: https://datahubproject.io
Learn more about Acryl Data: https://www.acryldata.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
Maggie Hays: All right, we are going to switch gears. We're going to talk about a feature in Acryl Data's Managed DataHub called Metadata Tests. After that, we're going to have a sneak peek of an upcoming feature called Saved Views, and then we will be talking about performance improvements, so you know what you're in store for. Earlier this month, John and I spoke in Budapest at a conference called the Crunch Data Conference.
What we're going to walk through is a subset of what we presented there; we'll also link to the full talk if you want to see it. But the idea is that we, collectively, as a community and as an industry, are all at the bottom of this Mount Governance, and we don't really know how to tackle it. We talk a lot about how to roll out governance programs.
We talk about what the key components of governance are, but the reality is that a lot of us are starting at the bottom of this huge mountain of governance debt. We have thousands, if not hundreds of thousands, of datasets and pipelines that we need to document, and it can be a really daunting task to figure out how you even start.
So what we're going to walk through are some examples of how you can start to iteratively address data governance, both through a working framework and through automated workflows within Managed DataHub.
When we talk about incremental, automation-driven governance, there are four principles for us to consider. First and foremost, we want to set really clear goals. We then want to narrow the scope, and then, surprise, narrow it a little bit more. We want to drive incremental action and then, as we go, measure progress to make sure we're hitting the goals we set. So what are some examples of this?
I've personally gone through this exercise of rolling out governance initiatives at a handful of companies, many times, and what I've found is that you really need a clear goal of why this is important, why governance is worth tackling. That could be coming from your organization, or from your leadership team.
It could be coming from your data team realizing there are some problems. The idea is to identify the problem and then set a goal for what a positive, targeted outcome looks like. Maybe within your organization there are a lot of questions or concerns around compliance; in that case, you might want to start by ensuring that everything has a security classification on it.
Maybe you're actually hitting a lot of issues with unreliable data, so an ownership initiative is the right path for you. The idea here is just to set a really targeted and concise goal. The next step is to narrow the scope.
If you think about the entirety of your data stack, I guarantee you will not be successful if you seek to add ownership to 200,000 entities in a week or a month. It is not going to happen.
It's just not going to happen, so narrow the scope. Some ways you can start to focus on the data that matters most: maybe you want to target a specific platform; maybe you want to target a specific domain, which can be really popular in companies that have adopted data mesh; or maybe you want to target entities that have high usage, regardless of platform or domain.
The idea is that, now that you have your goal, set a very finite scope for how to start addressing it, and then cut your scope a little bit more so you can start building out this workflow. More often than not, these are going to be initiatives where you're asking other teams to do something: you're asking them to own data, to document data, to classify it. So really what you're doing is testing the waters in the early stages.
How do you roll out these initiatives in a way that's successful? I've been most successful when I've teamed up with highly motivated stakeholders that are attached to that problem. We've set the goal up top; maybe compliance is the issue, so maybe you want to work with your compliance or legal team. The idea is that you should start with a small set of stakeholders that are both invested in the outcome and familiar with what the initiative should seek to do.
The other thing is really setting clear expectations about what you're asking them to do. Asking someone to be an owner of data? Sure, nobody will say no to that, because it's so nebulous and honestly doesn't really mean much. So when you ask someone to be an owner, what are you explicitly asking them to do? What are the expectations? It should be very clear what they're expected to do. And the last part is to create rapid feedback loops.
If you're having a hard time thinking through which stakeholders you might want to target, my advice is to start by making it very obvious: work with stakeholders that are really aligned with why this is worth doing. Make it easy: help them understand exactly what you're asking of them. But also make it collaborative, so don't go in and say, "Here's exactly what I'm looking for you to do, and here's how you should do it."
They already understand why they need to do it; they already understand what you're asking them to do; so don't over-prescribe how they do it. I've personally walked into a room with Google Sheets and asked engineers to document tens of thousands of data columns, and they literally laughed me out of the room. I will never forget that moment; it was horrible. So please do not over-prescribe the how.
Last but not least, you want to measure your progress. That goal you set up top: are you actually hitting it? Is it having the impact you're looking for? Do you still have the organizational support you need? Also, don't get too married to those goals. Circumstances change, and you're going to learn more throughout the process.
Just because you set a goal up top doesn't mean it's necessarily the right goal indefinitely, so walk into it with some flexibility. And when you are able to measure progress, automate that as quickly as possible. That way you can track things systematically and keep an eye on how things are progressing.
So those are steps one through four. Lather, rinse, repeat: iterate through a different subset of data, a different set of stakeholders. I realize I've walked through this incredibly quickly.
John Joyce: Yeah, thanks Maggie. What we're going to do now is look at a practical application of the steps that Maggie just introduced.
The first step: we're going to set clear goals. So imagine we have a company; we'll use the example of Long Tail Companions, which we usually use. Let's imagine that we've set clear governance goals at Long Tail: every dataset must have at least one owner assigned; it must have well-structured semantic documentation that describes the purpose of the data; and finally, it must have a classification, maybe labeling from a centrally managed taxonomy. Step two: we're going to narrow the scope.
Here we're going to look at Acryl DataHub's Metadata Tests feature, which basically allows you to first select a subset of the assets inside your ecosystem and then do something with them. So we're going to build a selection criteria that finds all of the Snowflake tables that are in the top 25 percent of most used and also have a significant unique-user count.
So we're basically taking the scope and criteria we've outlined and remodeling it in DataHub, using DataHub to help us find those data assets. We're going to look for those which have a query count percentile greater than 75, meaning the top 25 percent of most used. And finally, we're going to use another metric that DataHub will automatically surface for us, the unique users in the last 30 days, and we're going to say that must be greater than one.
Once we've defined this criteria, the next thing we're going to do is organize and track these assets. We're going to talk about rules in the next section, so we'll skip that for now and move right on to the actions piece. What we're going to do is automate the process of adding a glossary term to all of those data assets, and we're going to say that all of the data assets that were identified are in Tier 1.
So we're going to add what we call an action, which means that any asset that matches the criteria is automatically given a Tier 1 label, and any that falls out of the criteria is removed from Tier 1. We can use this to enrich our metadata in real time. So we're going to create a test here, and after some time you'll see that the test begins to run across your entire data catalog.
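As mentioned at the end of the talk, tests are represented in YAML or JSON under the hood. A test like the one just described (select highly used Snowflake tables, then apply a Tier 1 action) might be sketched roughly like this. This is a hypothetical illustration only: the field names, operators, and action types here are assumptions, since the published specification is not shown in this talk.

```yaml
# Hypothetical metadata test sketch -- not the published format.
name: tier-1-tagging
description: Tag highly used Snowflake tables as Tier 1
on:
  types: [dataset]
  conditions:
    - property: platform
      operator: equals
      value: snowflake
    - property: usage.queryCountPercentileLast30Days   # top 25% of most used
      operator: greater_than
      value: 75
    - property: usage.uniqueUserCountLast30Days        # significant unique users
      operator: greater_than
      value: 1
actions:
  passing:
    - type: add_glossary_term      # matching assets get the Tier 1 label
      values: ["Tier 1"]
  failing:
    - type: remove_glossary_term   # assets falling out of scope lose it
      values: ["Tier 1"]
```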
So once we've identified the scope of the assets that need to be governed for our initiative, we're going to move on to driving incremental action. This is the third step that Maggie talked about. Once we know the assets, we're going to identify the experts for the data in that scope, the people that should own those 91 assets. You can do this in two ways.
One is you can look at actual Snowflake access history manually, or you can use something like DataHub, which surfaces that for you. Or you can use tribal knowledge: you can go and do the manual work of finding those people, but I don't recommend that.
Then finally, you're going to have to actually do the work to get ownership, documentation, and classification on those 91 assets. There's no way around that: you have to have human intervention in this process. And then, finally, what we can do with DataHub is measure our progress against our governance goals. What we're going to view here is defining yet another metadata test that will allow us to track the data assets that meet our governance criteria.
The first thing we're going to do is again define a selection criteria. In this case we're going to identify all of the data assets that are tagged with Tier 1; you'll remember we grouped everything into Tier 1 first, and now we're selecting all of those things. In this case we're going to add some rules, which are basically conditions that all Tier 1 datasets must match. We're going to say it has to have a description.
It also has to have a glossary term from the Classification term group, and finally it has to have at least one owner. So we're just modeling our definition of success in governance inside DataHub. Now we can use DataHub to try it out on some sample data and see which of the assets in scope are actually matching; you can see one failed and one passed. In this case we'll just skip the actions, because really we just want to monitor compliance.
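A monitoring test like this (select Tier 1 assets, then assert governance rules against them) might be sketched as follows. Again, every field name and operator here is a hypothetical assumption for illustration, not the published format.

```yaml
# Hypothetical monitoring test sketch -- not the published format.
name: tier-1-governance-check
description: Tier 1 datasets must be owned, documented, and classified
on:
  types: [dataset]
  conditions:
    - property: glossaryTerms      # scope: everything we grouped into Tier 1
      operator: contains
      value: "Tier 1"
rules:                             # conditions every selected asset must pass
  - property: description          # has well-structured documentation
    operator: exists
  - property: glossaryTerms        # has a term from the Classification group
    operator: contains_any
    value: [Classification]
  - property: owners               # has at least one owner assigned
    operator: exists
```

No actions are attached here, matching the demo: the test only reports pass/fail so remaining work can be monitored.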
We want to know how many assets are compliant versus non-compliant, so that we know what's remaining. And you can see here that, over time, you'll be able to track those assets. You can actually say, "here are all the things that do not meet my governance standards," and watch that number tick down as you run your governance initiative. I think this is really useful when you're trying to iteratively track your progress and actually report on that progress to external stakeholders in governance.
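That reporting step can also be scripted. As a minimal sketch, assuming you can export per-asset pass/fail test results from the catalog as simple records (the record shape and URNs below are invented for illustration), you could compute a compliance rate like this:

```python
# Minimal sketch of tracking governance compliance from exported
# test results. The record shape here is a hypothetical assumption.
from collections import Counter

def compliance_summary(results):
    """Count passing vs. failing assets and compute a compliance rate."""
    counts = Counter(r["status"] for r in results)
    total = sum(counts.values())
    rate = counts.get("pass", 0) / total if total else 0.0
    return {"pass": counts.get("pass", 0),
            "fail": counts.get("fail", 0),
            "rate": rate}

# Example export: two compliant assets, two still missing metadata.
results = [
    {"urn": "urn:li:dataset:pet_profiles", "status": "pass"},
    {"urn": "urn:li:dataset:adoptions",    "status": "fail"},
    {"urn": "urn:li:dataset:payments",     "status": "pass"},
    {"urn": "urn:li:dataset:clickstream",  "status": "fail"},
]

print(compliance_summary(results))  # {'pass': 2, 'fail': 2, 'rate': 0.5}
```

Running this on each export lets you watch the failing count tick down as the initiative progresses, echoing Maggie's advice to automate measurement as quickly as possible.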
I think that's pretty much it for the demo. Finally, I'll just leave you with what is going to be available in open source. What we just saw was a demo of Acryl DataHub, which provides the entire experience: defining a test, running the test against your entire catalog, and reporting the results. What will be available in open source is a couple of things. First is the specification of the test format.
Under the hood, all tests are going to be represented in YAML or JSON, and we're going to publish that format. The second is the model itself in GMS for metadata tests and metadata test results, so that you could presumably ingest metadata tests, and their results, into DataHub. And then finally, UI support for actually rendering those test results that you saw at the end on the entity page. All right, and with that, I think we can conclude this one.