Description
Harshal Sheth (Acryl Data) describes how Dataset Popularity is implemented in DataHub and gives a demo.
Harshal: Yeah, so I'm going to be talking about dataset popularity: how we use it in DataHub, as well as how we derive it from query logs and query activity.
So, first off: why do we care about this? Dataset popularity implies different things for different people. For the data platform owner, it enables them to understand what's going on within the enterprise: how is data being used, and in what systems is it being used?
It can also point you in the right direction for where to dedicate more resources, both in terms of people and in terms of compute power, so that you never have an issue as you operate.

For data engineers, dataset popularity means something a little different.
It helps them understand how people are using the things they're producing, kind of an impact analysis within the company, and it helps them prioritize among the data assets they produce: which ones are most important, which ones are actually getting used, and which ones could use more documentation to improve usage, because, you know, it's a great dataset that you produced.
It can also help you streamline the deprecation process. Let's say a dataset that you're trying to deprecate still has, you know, 100 queries a week. Well then, you probably don't want to deprecate it just yet. Instead, you want to look at the popularity and the usage, figure out who exactly is still querying it, reach out, and help them migrate to a different solution.

For data scientists, popularity and usage is a major trust signal.
It helps you understand: is this something that someone put out a year ago and hasn't touched since, or is it something that is being regularly updated and regularly used, something you can rely on, given that other people are also relying on it?
The other thing you can do is look at the other queries that people are issuing against that dataset and figure out which other tables are relevant: what is it commonly joined with, on which keys, and so forth.
So you can determine not just whether to query that dataset, but also how to. And then, helping everyone: we can use usage and popularity data to improve search rankings and improve the ordering of things in lineage visualization and so forth. So lots of product improvements for DataHub can also come out of the usage statistics.
So let's look at what we're collecting and how we're doing that. Right now we support BigQuery and Snowflake for usage stats. For BigQuery, we're using the BigQuery logs and parsing those out; for Snowflake, we're using the access history and query history views, joining these together and getting our popularity and usage data that way. For each dataset we can collect per-user usage frequencies: person A is using it this much, person B is using it this much.
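To make the Snowflake side concrete, here is a minimal sketch of that join, assuming the snowflake-connector-python package and access to the SNOWFLAKE.ACCOUNT_USAGE share (the ACCESS_HISTORY view needs Enterprise edition and elevated privileges). The exact columns and filters are illustrative, not DataHub's actual query:

```python
# Join Snowflake's query history with access history to see which user ran
# which query against which objects. Credentials below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
)

USAGE_QUERY = """
SELECT
    q.query_id,
    q.query_text,
    q.user_name,
    q.start_time,
    a.direct_objects_accessed  -- JSON array of the tables/columns touched
FROM snowflake.account_usage.query_history q
JOIN snowflake.account_usage.access_history a
  ON q.query_id = a.query_id
WHERE q.start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
"""

for query_id, query_text, user_name, start_time, objects in conn.cursor().execute(USAGE_QUERY):
    print(user_name, start_time, objects)
```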
We can also collect how they're using it, what queries they issued, with a lot of granularity here, even to the extent of which columns they're frequently querying versus which ones are not being used. And once again, we roll this up and can get the frequent queries across all the people using the dataset together as well.
Think about the scale of a large enterprise: say 500,000 queries per day against BigQuery or Snowflake or a similar data warehouse, and maybe 10k users. This is kind of our north star for what sort of scale we might want to support. They'd want to retain a decent bit of historical data, so they can view the historical usage over time, let's say a year here, but it varies across enterprises.
Some might only care about 30 days, some might care about many years. And the last thing is that we want to avoid refetching the same data from the same source system repeatedly.
What this means is, if we're collecting data, we only want to pull a given piece of the usage log or the query log, or whatever it is, once, and not have to pull it repeatedly.
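One simple way to get that property (a sketch of the general idea, not necessarily DataHub's internal mechanism) is to keep a high-watermark timestamp per source and only request log entries from completed time buckets past it:

```python
# High-watermark sketch for incremental log pulls. The state file and the
# one-hour bucketing are illustrative assumptions.
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

STATE_FILE = Path("usage_watermark.json")  # hypothetical state location

def floor_to_hour(ts: datetime) -> datetime:
    return ts.replace(minute=0, second=0, microsecond=0)

def next_window() -> tuple[datetime, datetime]:
    """Return the [start, end) window of log entries not yet fetched."""
    end = floor_to_hour(datetime.now(timezone.utc))
    if STATE_FILE.exists():
        start = datetime.fromisoformat(json.loads(STATE_FILE.read_text())["watermark"])
    else:
        start = end - timedelta(hours=1)  # first run: just the previous bucket
    return start, end

def commit(end: datetime) -> None:
    # Only advance the watermark once the fetched window is safely stored.
    STATE_FILE.write_text(json.dumps({"watermark": end.isoformat()}))
```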
So, given all this, here are some of the decisions we made. The first is that we're going to start with a batch-based system: you can configure it to run hourly or daily, whatever you'd like, and we'll pull the most recent queries in history.
We have a memory-efficient algorithm for pulling this in while preventing the memory usage of ingestion from blowing up, and then we do some pre-aggregation here on a per-dataset level: we roll it up so that we can get the frequent users of a dataset, the frequent columns used in a dataset, and the frequent queries of a dataset.
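In spirit, that roll-up looks something like the following sketch, which assumes the query-log entries have already been parsed into records with dataset, user, columns, and query fields (the field names are hypothetical, and the real ingestion code is considerably more involved):

```python
# Per-dataset roll-up sketch: count frequent users, columns, and queries.
# Input records are assumed to be pre-parsed dicts; field names are hypothetical.
from collections import Counter, defaultdict
from dataclasses import dataclass, field

@dataclass
class DatasetUsage:
    user_counts: Counter = field(default_factory=Counter)
    column_counts: Counter = field(default_factory=Counter)
    query_counts: Counter = field(default_factory=Counter)

def aggregate(records):
    usage = defaultdict(DatasetUsage)
    for rec in records:  # rec: {"dataset", "user", "columns", "query"}
        agg = usage[rec["dataset"]]
        agg.user_counts[rec["user"]] += 1
        agg.column_counts.update(rec["columns"])
        agg.query_counts[rec["query"]] += 1
    return usage

stats = aggregate([
    {"dataset": "all_entities", "user": "harshal",
     "columns": ["entity", "urn"], "query": "SELECT entity, COUNT(*) ..."},
])
print(stats["all_entities"].column_counts.most_common(5))
```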
We then take that information and push it through GMS into Elasticsearch, and that's where we store these aggregate statistics, do additional aggregations on them, and then surface them in the UI, as you might expect.
Shirshanka: I guess one more interesting constraint in the design that you probably had was not adding one more moving part to DataHub, like, oh, you've got to go run a Spark job or some other big-data processing job to compute this stuff, right?
Harshal: Yeah, I think absolutely. And that's actually a good segue into the demo piece. I wanted to show how BigQuery and Snowflake usage work. For the Snowflake one, I'm actually going to show how it looks when scheduled with Airflow, because that's the common use case here, you schedule it on a daily basis, and then how it looks in the UI.
So we can start with how BigQuery usage works. Right now I have a little recipe configuration. It works the same way as most other sources do: you just have a new plugin type called bigquery-usage, and you can put in the project id for BigQuery. I just have a playground instance that I'm using, and unfortunately I haven't queried this instance in a few days.
So what is it doing here? It's pulling the BigQuery usage logs from the Cloud Logging product from Google, then doing a little bit of pre-aggregation, and then dumping that into a file here.
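Run programmatically, the demoed recipe amounts to roughly the following; the same configuration can live in a YAML recipe for the datahub ingest CLI. The bigquery-usage config fields shown are from memory, so treat them as assumptions and check the source docs:

```python
# Roughly what the demoed recipe does when run programmatically; field names
# for the bigquery-usage source may differ by version.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "bigquery-usage",
            "config": {
                "projects": ["my-playground-project"],  # hypothetical project id
            },
        },
        # The demo dumps the aggregated usage events to a file instead of a
        # live DataHub server.
        "sink": {
            "type": "file",
            "config": {"filename": "./bigquery_usage.json"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```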
I actually added it to our demo instance here; it's pretty straightforward how it looks. This time we're running the ingestion using direct code, because we want to do that inside of Airflow, and it's remarkably similar: we first ingest Snowflake and then we add Snowflake usage in kind of a pipeline, so you get both of them at once. Once again you set your configuration, and then I wanted to get a bunch of historical data.
So I set the start time manually, but beyond this I might just leave it blank and it will automatically do the current day.
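The Airflow setup described here might look something like this sketch: two chained tasks that run the snowflake and snowflake-usage sources daily. The credentials, server address, and exact config keys are placeholders to check against the current source docs:

```python
# Sketch of the Airflow scheduling described above: two chained daily tasks
# run the snowflake and snowflake-usage sources. Config values are placeholders.
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from datahub.ingestion.run.pipeline import Pipeline

def run_ingestion(source_type: str) -> None:
    Pipeline.create(
        {
            "source": {
                "type": source_type,
                "config": {
                    "host_port": "my_account",  # hypothetical Snowflake account
                    "username": "datahub_ro",
                    "password": os.environ["SNOWFLAKE_PASSWORD"],
                    # start_time/end_time left unset: defaults to the current day.
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    ).run()

with DAG(
    "snowflake_usage_ingestion",
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    metadata = PythonOperator(
        task_id="ingest_snowflake",
        python_callable=run_ingestion,
        op_args=["snowflake"],
    )
    usage = PythonOperator(
        task_id="ingest_snowflake_usage",
        python_callable=run_ingestion,
        op_args=["snowflake-usage"],
    )
    metadata >> usage  # usage ingestion runs after regular metadata ingestion
```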
Now that we've run this successfully, we can see the little green box there, we can head over to the demo instance, and we can see where this is surfaced in a couple of places.
The first is that we immediately see the number of queries that have been issued against this dataset. We can see that it's 78 in this time period, and we can also see that I and another user have issued queries against it. So we can see the top users, and this is going to be ordered by frequency.
So you can see that I've done the most and who's done the second most. Beyond that, we can also see per-column stats: entity and urn are the two columns here, and we can see that the entity column has had 78 queries per month while the urn field has had only 43 queries per month.
So with your standard select count, group by entity kind of query, here's where we might guess that the entity field is being used more frequently than the urn field or other fields. And then we can also see that people are creating other generated tables that reference this all_entities table, so we can understand how people are using it, how they're joining on the urn, and so forth. We get much richer information about how people are using this dataset.
We're also looking at line charts, so that if you're expecting a certain dataset to be deprecated, you can watch the usage per day taper off as you migrate people over; and then finally, expanding our time-series metadata piece to add mechanisms for extending it with a similar no-code approach. So yeah, with that, I'd love to hand it over to Shirshanka.
Shirshanka: Awesome, thank you, Harshal, for that. If people have questions about how to use it, feel free to ask. My first question was: why are snowflake and snowflake-usage two different sources? There's actually a good reason for it: in some of these sources, the place where you get usage data from is actually different from the place you get the metadata from, and in some cases you actually need elevated privileges to get usage data out, so it makes sense to separate out those two pathways.