From YouTube: DataHub 101: Data Profiling
Description
Maggie Hays and Tamás Németh (Acryl Data) provide an overview of Data Profiling and Usage Stats in DataHub during the January 2022 Town Hall.
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
A: Now we're going to switch gears a little bit and talk about the basics. One of my personal favorite features within DataHub is the ability to quickly view the profile of a dataset, the shape of the data, to really minimize the amount of ad hoc discovery you need to do within it. You can just get a glance at what that data looks like. So Tamás and I are going to give you a rundown of all things data profiling and usage stats within DataHub.

Let's just start with the basics: why do we care about data profiling? I think the main driver here is answering questions like: Can I trust this dataset? Is it fresh? How large is it? Are there dupes I need to worry about? Are there null values I need to worry about? What's actually meaningful about it, i.e. which column is going to be a unique identifier versus a categorical variable that I can pivot the data around?
A: What you'll see here is a sample dataset, and in our Stats tab you'll see a quick, high-level view of all stats related to this table. By looking at the latest stats or the historical stats, I can start to answer: What does this data look like now versus some time in the past; has it evolved over time? How large is this dataset? Am I dealing with a ton of records, where I'm going to need to think about query performance or processing time, or is it small enough that I can just zoom right through it? How up to date is this data? When was it last changed, when was it last updated? That gives me a sense of data freshness.
Zooming in on the output, we can start to look at column-level stats, answering questions about data quality issues. For example, with this location description column, we see that there are about 8,700 null values within this dataset. A data practitioner can look at that and say, "oh no, that's untrustworthy" — but in the grand scheme of things, since there are 7.5 million records, that's only about 0.1 percent of the dataset; maybe it's not something I need to worry about.
A: I can start to look at what date ranges are represented within the dataset. This one contains records from early 2001 all the way to the end of 2021, which gives me a sense of the breadth of history included here. I can also start to understand whether there are duplicates: looking at our unique key, I see that it's a hundred percent distinct. So great, it is unique; I don't need to worry about dupes there.
A: Thinking about feature categories or reporting categories: which categories are actually meaningful? If, within a dataset of 7.5 million records, there are 61,000 categories of "block", that tells me it's pretty high cardinality, and maybe something that would be useful for reporting or modeling around. The other thing is that we see sample values, so you can get a sense of what data is actually in here.
A: So maybe you just see a column called "block" — what does that even mean? Oh, now I see that this is an actual physical street block within the city, or these are the types of categories that are going to be measured here.
A: One thing to note: this is all configurable. Some teams are worried about accidentally displaying the wrong information, so you can configure each of these to show the sample values or not; it's up to your discretion.

Then, on the other side of things, we think about usage stats. These start to answer questions like: How is this data generally used? What columns are most important? Even if the table's been updated recently, are people actually querying it? Is it something that is used within my community of data practitioners at my company? And then questions like: who's using this, so I can go ask them how to interpret its output? Going back to our Schema tab with the same dataset, we have this idea of query usage at the column level.
A: This isn't supported for all of our SQL stores, but where it is available we do surface it, so you can get a sense of the relative popularity of a given dataset.
A: You can also see its top users — who's querying it the most — and those are maybe folks you can go ask questions. Then we start to look at actual sample queries: in our Queries tab, we surface the most popular queries over a period of time, so you can start to understand what the most common calculations are, and also what has already been calculated.
B: Previously, I showed you these two lines. Now you can add even more lines to enable or disable all of this profiling, and as I go through them you will see what kinds of profilers we have. Basically we have options for everything, so it's really up to you what you want to turn on and off.
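The "two lines" refer to turning profiling on inside an ingestion recipe. A minimal sketch, assuming a BigQuery source (the source type, project name, and server address are placeholders — use whatever connector and endpoint you actually ingest with):

```yaml
# Hypothetical recipe fragment; connection details are placeholders.
source:
  type: bigquery
  config:
    project_id: my-project        # placeholder
    # The "two lines" that turn profiling on:
    profiling:
      enabled: true
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080 # placeholder DataHub GMS endpoint
```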
B: For all of the columns we calculate null counts, and you can enable or disable that with the include-field-null-count option set to true or false. For all the numeric columns we calculate the minimum value, maximum value, mean value, and median value; for integers we also generate the standard deviation if you need it; and for timestamp fields we calculate the minimum and maximum values. As you can see here, I think it's quite straightforward: you can enable and disable each of those. There is an option as well for the field quantiles.
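As a rough sketch, those per-metric switches sit under the same profiling block in the recipe (option names as I recall them from this DataHub release — verify against the current ingestion docs):

```yaml
profiling:
  enabled: true
  include_field_null_count: true    # null count per column
  include_field_min_value: true     # numeric minimum
  include_field_max_value: true     # numeric maximum
  include_field_mean_value: true
  include_field_median_value: true
  include_field_stddev_value: true  # standard deviation (integer columns)
```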
B: That covers the 5th, 25th, 50th, 75th, and 95th percentiles. It's actually disabled by default, because it's not shown in the UI; if you enable it, the back end stores and calculates this information, it's just not visible on the UI currently. Then there is the distinct value frequencies option.
B: That's calculated for low-cardinality numeric fields — sorry, low-cardinality fields actually — and this, as well, is not shown on the UI currently, so it's disabled by default. Then there's the field histogram: for numeric fields it can automatically generate a histogram for you; this, too, is not currently shown on the UI, so it's disabled. And there are the field sample values: this is what you saw in Maggie's screenshot, where you had those sample values.
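A sketch of those remaining switches, with the defaults as described in the talk (names assumed from the ingestion config of this release — check the docs for your version):

```yaml
profiling:
  enabled: true
  # Disabled by default because the results are not yet shown in the UI:
  include_field_quantiles: false                   # 5/25/50/75/95th percentiles
  include_field_distinct_value_frequencies: false  # low-cardinality fields only
  include_field_histogram: false                   # numeric fields
  # Shown in the UI; turn off if sample values may be sensitive:
  include_field_sample_values: true
```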
B: That's one you can control if you want to enable or disable it. So those are the profilers. Also, to make sure that these profiling queries run effectively and fast enough, we introduced a query combiner, which basically tries to make sure that all these profiling queries run in optimal batches, without doing too many round trips to your data warehouse; this is on by default.
B
You
also
can
set
a
limit
which
basically
add
the
limit
to
the
profiling
query.
So
if,
if
you
set
the
limit
thousand,
that
would
mean
only
thousand
lines
will
be
used
for
the
profiling.
That
also
means,
if
you
set
it
like
2
000,
and
then
you
have
like
2
000
lines
in
your
in
your
table,
then
in
the
end
the
total
account
with
the
a
thousand,
because
that
that
was
the
limit
for
what
we
used
offset
is
basically
just
add
the
sql
offset.
So
basically,
it's
not
start
from
that
offset.
B
It
starts
the
profiling.
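The batching and sampling knobs just described might look like this in the recipe (a sketch; the limit and offset values are arbitrary examples):

```yaml
profiling:
  enabled: true
  query_combiner_enabled: true  # batch profiling queries; on by default
  limit: 1000                   # profile at most 1000 rows
                                # (reported row count is then capped at 1000)
  offset: 500                   # skip the first 500 rows before profiling
```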
B: But if you don't want to mess with all of these, and you just feel that "hey, my profiling is slow", we have a nice option: turn off expensive profiling metrics. If you turn it on, it will disable the expensive profilers and also set the maximum number of fields to profile to 10.
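That single switch, sketched in recipe form (the field cap can, as I understand it, also be set on its own — names assumed from this release's config):

```yaml
profiling:
  enabled: true
  # One switch instead of tuning individual profilers:
  turn_off_expensive_profiling_metrics: true
  # The switch above also caps profiled fields at 10;
  # the cap is adjustable separately:
  max_number_of_fields_to_profile: 10
```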
B
One
thing
you
should
know
that
these
profilers
running
on
the
whole
table,
which
means,
if
you
have
like
a
huge
click
stream
data
on
hive
with
multiple
partitions,
then
you
should
think
size.
If
you
want,
you
really
want
to
do
profiling
on
on
the
top
of
that
in
the
future,
we
would
like
to
support
these
partition
tables,
but
now
you
should
know
about
this
limitation
and
if
everything
went
well,
then
I
think
the
profiling
finished,
as
you
can
see,
and
now,
if
I
go
to.
B
Table
and
all
the
stats
should
be
here,
But one thing you can't see here yet is the usage, and you might ask: okay, but how can I enable the usage statistics? That's one thing people sometimes get confused about: you need a different recipe and a different source, namely the usage sources.
B: You basically just install your source — in this case, you need to install the BigQuery usage source. Then, if you don't set anything, it will get the usage for the previous day. If you want, you can set a start and an end time, which I think can be quite useful if you want to bootstrap your data: you run it over multiple periods, and after that you only run it incrementally.
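A sketch of such a separate usage recipe for BigQuery (project name, window, and server are placeholders; with no window set, it pulls the previous day):

```yaml
# Install the usage source first, e.g.:
#   pip install 'acryl-datahub[bigquery-usage]'
source:
  type: bigquery-usage
  config:
    projects:
      - my-project                       # placeholder
    # Optional window, useful for bootstrapping history:
    start_time: "2022-01-01T00:00:00Z"
    end_time: "2022-01-08T00:00:00Z"
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080        # placeholder
```

It runs like any other recipe (e.g. `datahub ingest -c usage_recipe.yml`); for incremental runs, drop the explicit window and schedule it daily.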
B: And if I can tweak my editor... So then I just need to run the ingestion the normal way, and it should collect all the usage stats for that table.
B: If you want to get usage stats but you can only see the last two or three days, that's most probably because of the retention period you have in these systems. Now, if I just hit refresh, I should be able to see all the queries here, and if I go to the schema page I should see all this usage next to the columns, based on the queries. So at a high level — or maybe not that high a level — this is how you can use profiling and usage.
B: As you can see, it's super easy to use, and we are continuously improving it. For profiling, for example, if you used the profiler four months ago and you use it now, you might wonder how much faster it has become: we now use approximate queries wherever we can, and we disabled the profilers whose output you can't currently see on the UI. That made, I would say, a significant improvement to profiling performance.