From YouTube: DataHub 201: Data Debugging
Description
John Joyce (Acryl Data) presents DataHub 201: Data Debugging - Learn how you can prevent and triage data issues using DataHub as part of your core workflows during the March 2023 Town Hall.
DataHub Public Roadmap: https://feature-requests.datahubproject.io/roadmap
Presentation Deck:
https://docs.google.com/presentation/d/1Xe1HZ11zpP7BPyXUwVYoHFEr7fHLjSUaZlKaae2KuM8/edit?usp=sharing
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
For my session, I'm going to try to keep it relatively quick (it's quite a few slides), but we're going to be talking about debugging data issues with DataHub. First we're going to talk about what the ecosystem looks like for most companies and why these issues happen in the first place, and then we're going to go through a few different common types of issues we see and how you can actually start to triage or debug them through DataHub.
What I mean by data debugging is the process of identifying and addressing errors in data. We'll talk about what types of errors we can see, but most often this is the expectations of a data consumer not being met for one of a few reasons.
Let's take a look at the data ecosystem pattern that we see most often at most organizations. Most often we have an online service or application. Let's say, in an e-commerce site, you would have purchases and products being served online, and typically these applications are going to be generating two types of data.
Why go through all this trouble? Well, it's because at the end of the day, someone's going to be making decisions based on the data we're producing. We're enriching data, we're adding value to it, so that some operator, whether it's an internal person like a CEO looking at a dashboard or a user of our product, can make some decision. So this is the data ecosystem and data transformation landscape that we see at most companies, at a high level.
So we're running this statement every single day, where we're inserting into our product purchases table by selecting from the purchase events table and joining it with the products table. You can imagine we're creating an enriched table called product purchases, and we own it.
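Concretely, the daily statement might look something like this; a minimal sketch, where the table and column names (purchase_events, products, product_purchases, event_date, and so on) are illustrative assumptions rather than taken from the talk:

```sql
-- Hypothetical daily enrichment job; all names are illustrative.
-- event_date is an assumed partition column; purchased_at is the
-- epoch timestamp column discussed later in the talk.
INSERT INTO product_purchases
SELECT
    e.purchase_id,
    e.purchased_at,
    p.name,
    p.description
FROM purchase_events AS e
JOIN products AS p
    ON e.product_id = p.id
WHERE e.event_date = CURRENT_DATE;
```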
All right. Now, what are the things that can go wrong on a day-to-day basis if we're trying to generate this data every single day? Well, the first thing is that we can have unexpected schema changes in the data that we depend on, in those upstream tables. The first example I'll give is a column removal. Let's imagine that the upstream products table has an ALTER TABLE statement made on it, say one where the description column was dropped. This is pretty significant, so it's actually maybe not all that common. Something that's more common is a column name change.
Let's imagine that the upstream products table has a name change where the description column is renamed to edited_description. Now, there are other types of schema changes that we may see, like column type changes; you can imagine a BIGINT column going to an INT column in a lossy fashion, something like that. But in most cases these are the types of changes we'll see.
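For concreteness, those upstream changes might be issued as statements like the following; a sketch, with the quantity column invented for the type-change case and syntax that varies by SQL dialect:

```sql
-- The three kinds of upstream schema changes discussed above (illustrative).
ALTER TABLE products DROP COLUMN description;                           -- column removal
ALTER TABLE products RENAME COLUMN description TO edited_description;  -- column rename
ALTER TABLE products ALTER COLUMN quantity TYPE INT;                    -- lossy BIGINT -> INT
```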
Now let's talk about how we would detect an issue like this. Actually, the query engine would catch this for us right at query time, before the impact of that change really gets to spread to any of the downstream use cases that are depending on our data, so it's actually kind of a nice property. Once we know that we've had an unexpected schema change, how can we begin to debug it? This is where I think DataHub can help; there are two features that can help us here.
One is lineage, and the second is a feature called schema version history. I'm going to illustrate the point by jumping over to my DataHub, where I've got this exact architecture set up. I've got my table here and I've got some input tables, and you can imagine that my query failed today. Now, how would I go about debugging it?
If I go to the schema tab, I'll be able to see immediately that there's this edited_description column, and maybe that doesn't look familiar; maybe it's why my SQL statement is breaking. What I can do is actually open up this column history and see that this edited_description column was just added, in fact only 14 hours ago. This is a bit fishy.
I can dig a little bit deeper and see that the previous version of the schema had a description column, so this has recently changed. This is one way that we can try to understand what's gone wrong when our query fails because of an unexpected schema change. It also applies to cases like a column type change, of course.
The second type of issue that we see is delayed data, which is a little bit different of a problem. You can imagine that the purchases table we depend on is updated every single day, and on March 18th, March 19th, and March 20th everything is good, but on March 21st the data doesn't appear.
So what can we do to detect a case like this? Well, let's revisit the SQL query that we're using to generate our data. One thing you'll notice right off the bat is that this query won't necessarily fail like the previous one would; it'll just generate zero rows. This is a bit trickier of a case, because now we're not catching it early. In fact, we may not even notice that anything is wrong.
So what are our options to detect this more proactively? Well, maybe we can pre-validate the input: we'll run a check that says, hey, if there are no rows from the 21st in our upstream table, then maybe we shouldn't run the job at all. Or maybe we'll do the opposite and post-validate: once we generate some rows, we'll make sure that there are more than zero of them.
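Those two checks might look something like this; a sketch, reusing the assumed event_date partition column and table names from the earlier example:

```sql
-- Hypothetical pre-validation: only run the job if the upstream
-- actually has rows for the target day.
SELECT COUNT(*) AS upstream_rows
FROM purchase_events
WHERE event_date = DATE '2023-03-21';

-- Hypothetical post-validation: confirm the job produced rows.
SELECT COUNT(*) AS output_rows
FROM product_purchases
WHERE event_date = DATE '2023-03-21';
```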
But you can imagine the issue with this: there may actually be valid cases where there aren't any rows produced. Maybe there were no purchases on the 21st, so it's not a foolproof way to protect against this. In fact, these types of issues are very difficult to catch. And option three, which is probably the most likely, is that somebody notices the missing data and comes to tell us: hey, why are there no rows today? I'm looking at this dashboard and it doesn't look correct. So it's sometimes possible to catch this near query time, when we're generating our data, but it's not always possible.
In fact, it's actually pretty rare that it's possible. But let's imagine that someone did come tell us, hey, something looks wrong. How can we use DataHub again to begin to debug this case? Well, we can again use the lineage feature, combined with two important DataHub features: one is called Operations, which allows you to understand the changes (the inserts, the updates, the deletes) that have happened on a table, and the other is Incidents, which allows us to proactively communicate that there is an issue on a dataset or a table.
So let's actually go and check out those two features really quick. Okay, the first one is Operations. You can imagine that I'm again the owner of this product purchases table, the upstream purchases table was delayed, and someone told me: hey, this is delayed. Well, now what I can do is come in and actually see the last updated time, and I can see that the last time it was changed is 3/20, so we're missing that 3/21 data. Obviously something is a little bit wrong.
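Outside of DataHub, the manual equivalent of this freshness check would be a query like the following; illustrative only, mirroring what the Operations view surfaces automatically:

```sql
-- Per-day row counts for recent days; the missing 3/21 partition
-- would show up as a gap. Table and column names are assumptions.
SELECT event_date, COUNT(*) AS row_count
FROM purchases
GROUP BY event_date
ORDER BY event_date DESC
LIMIT 5;
```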
I imagine that this table is going to be updated every day. So what can I do if I've detected that something is wrong? Well, maybe the best thing I can do is let my stakeholders know as soon as I possibly can, so that I can mitigate the effects. So maybe, as the owner of a table (let's say the upstream owner actually notices this first), what I can do is use this feature called Incidents to basically say that this SLA was missed.
So maybe a bug in application code caused an incident here. And the really nice thing about this feature is that once we create the incident, we will actually notify you, on Slack or on your preferred notification channel, that something has happened to an upstream dependency. So, for example, if you're an owner of the downstream product purchases table, you'll actually be notified when there's an incident on the upstream purchases table, and similarly you'll be notified once that issue is ultimately resolved.
Awesome, let's continue back to the slides. I know I'm going a little long; I'm going to try to speed things up a bit here. All right, so the third case, and this is probably the hardest to detect and the hardest to catch, is unexpected semantic changes to data. What do I mean by that? Well, let's talk about the first type, which is a column semantics change.
Let's imagine there's a purchased_at column in the purchase events table that we depend on, and it's traditionally been in seconds. But the engineers that own the upstream application decided that milliseconds is actually the industry standard, and maybe they should have been putting milliseconds in this column all along, so they make that change. Now all of the downstreams have to be able to deal with it. The second case is a row semantics change.
Let's imagine that the purchase events table has traditionally represented only online purchases, those made from our web store, and today we made a change that includes mobile purchases in the purchase events table as well. Now the meaning of a purchase event row has changed to include mobile purchases. So you can see the key characteristic of this type of change: the structure of the data didn't change; it's really the meaning of the data that has changed in some way.
So how do we detect cases like this? Well, it's the most difficult to detect, because our query will not only run successfully, it will also produce rows. It will actually run and produce data, and so this can possibly lead to cascading effects, where downstreams consume our data and in turn produce bad data.
So what are our options to detect these cases? Well, one is, maybe we could pre-validate the columns before we use them as the consumer, but that seems kind of expensive. Maybe we could pre-validate the row count, but again, is that really foolproof? Can we really know how many purchases were made yesterday? Or if there was a spike because of the holidays, can we really account for that?
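For the timestamp case, a consumer-side column pre-validation might look like this; a hypothetical sketch, but the threshold works because epoch-second values stay below 10^10 until roughly the year 2286, while epoch-millisecond values today are around 1.7 x 10^12:

```sql
-- Flag purchased_at values that look like epoch milliseconds
-- rather than epoch seconds (names are assumptions).
SELECT COUNT(*) AS suspect_rows
FROM purchase_events
WHERE purchased_at > 10000000000;
```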
This is very rarely possible to catch at query time, but conversely it can be very expensive if you don't catch it, because of the downstream effects it can have. All right, so if we know something has gone wrong, let's say we know that the data looks weird, how can we detect that there was an unexpected semantic change using DataHub? Well, you'll notice a theme here: we can use the combination of lineage and two other key features of DataHub.
Those are the data profiles feature and the assertions feature, if we're lucky, to debug these types of changes. So, one final pop over to DataHub. You can imagine that we know something's wrong with our table downstream of the purchases table. Maybe what we can do is come and actually look at the profile of this data, and maybe we look at the purchased_at column; this is the one that's been changed, and nothing looks particularly weird with the sample values.
So we go over to the history, and if we look at it, we can actually see that the values have changed quite dramatically; in fact, they've changed by a factor of a thousand. So that's one thing we can do. We can also look at things like the row count over time, and we can see that maybe some new events are coming in.
Maybe that's caused by the mobile purchase events being added. And if we're really lucky, our upstream data owner will be super responsible and will have defined some tests. This is what we call assertions in DataHub, and we can see that there was an assertion defined on that purchased_at column, which basically enforces that it's in seconds; previously it was passing and now it's failing. So this is less common; this is the ideal state.
What you'll notice is that we have this little Data Insights panel over here, which is provided by the DataHub Chrome extension, and when we click it, what it will do is basically cross-reference the dashboard with the information in DataHub. It will use the lineage graph to tell you that there's something wrong with an input table which is used to produce the adoptions dashboard, and we can see that there's one table upstream of this that has failing assertions.
From here, we can actually go in, see the failing assertions, and begin to triage that problem using DataHub. So I'm really excited about this particular feature set, or roadmap track, within DataHub, and there's one thing that we're working on right now which will make this even better: the ability to search and filter across lineage for entities that are failing their tests or have active incidents.