From YouTube: Apr 23 2021: DataHub Community Meeting (Full)
Description
Welcome: 00:00
Project Updates by Shirshanka : 03:03
- 0.7.1 Release and callouts (dbt by Gary Lucas)
Use-Case: DataHub at DefinedCrowd by Pedro Silva : 11:20
Deep Dive + Demo: Lineage! Airflow, Superset integration by Harshal Sheth and Gabe Lyons : 26:08
Use-Case: DataHub Hackathon at Depop by John Cragg : 38:58
Observability Feedback share out : 52:41
Product Analytics design sprint announcement by Maggie Hayes : 56:41
B: Occasionally we have half days every fortnight on Fridays, for sort of lockdown mental health kind of things, so people can chill out a little bit. It's quite good.
C: That is nice, actually. At Envision they tell us to leave at lunch every Friday, every week, and about a quarter of the people do, but I think it's good; a lot of people go and actually have family time.
D: Perfect, this is gonna be a super tight town hall. I have been warning all the speakers that I'm going to be keeping us all on time, because we have a ton of stuff to go through. So welcome, everyone. This is the April town hall; I mean, we're doing it after like five weeks, because I wanted to align it to the fourth Friday of the month going forward.
D: I think we added like another 20-25 in the last month, so that's amazing, and I think engagement has been off the charts. Really enjoying the conversations, whether it's small stuff, like someone's Docker image not really working for them, or the big stuff, like tags and taxonomies and ontologies and how do we keep it all together. So keep it coming, love the energy. Awesome. So, agenda today: quick project updates, and we're gonna have Gary talk about the dbt integration.
D
Pedro
is
talking
about
data
hub
at
define
crowd
if
harshal
and
gabe
are
awake,
they're
gonna
do
a
deep
dive
and
a
demo
on
lineage.
They
were
working
pretty
late
last
night,
getting
it
all
together
and
then
john
is
gonna.
Tell
us
about
the
hackathon
that
they
did
with
data
hub
at
depop
and
last
time
we
did
an
observability
a
share
out
with
mocks
and
got
some
community
feedback.
D
I'll
kind
of
share
out
what
happened
as
a
result
and
maggie
has
some
announcements
around
a
design
sprint
that
she's
running
so
super
exciting
awesome.
So, around midnight I cut the release, so we have a brand new release: 0.7.1. It's been like five weeks; we're trying to go to a monthly cadence, so pretty much every month you should expect an official release coming out. We had almost 140 commits in the last five weeks, I think, so our commit rate is actually going up, which is awesome to see.

I also pulled the number of committers and the diversity, and that's looking really nice. We actually had 12 different companies contributing to the project over the last five weeks, so that's great. In terms of highlights, I was trying to bucketize where all the features and contributions are coming in. Obviously product improvements is a big one, then there's operator tools, and then integrations. So, product improvements, quick highlights: we've got column-level matching going on.
If you have a description on a column and you want to just edit it, we support editing it, and we actually keep it separate from the primary schema description, so you can always have that as well. There are some discussions on how we're going to improve the UX to make it even nicer for conflict resolution. Nested schemas, for people who run Avro, Protobuf and that kind of stuff; you see this all the time.

So everyone knows the datahub ingest command by now, and we're starting to add other verbs to it, like check, so you can start checking the health of your cluster or the health of your local Docker installation using the DataHub CLI, plus a lot more metadata ops. We have a lot of ideas, and if people have ideas, definitely send them over; it's a fantastic place to hack on. What are the amazing things you want to do on the command line with DataHub? We can keep expanding that. Big news:
We finally mainlined our Helm charts, so Kubernetes Helm is production ready; we're using it for our AWS deployments, so we can support you on that. A bunch of work was done by the community in adding monitoring, I think DefinedCrowd worked on that. LinkedIn actually contributed some Neo4j writer improvements, because when they were doing backfills and stuff they found a bunch of bottlenecks in the Neo4j integration, so that Neo4j integration should be going faster now. And I think Klarna worked on adding SSL everywhere.
So you can lock things down, you know, and then start hacking on DataHub. Integrations include Airflow, dbt, Superset, Druid, Snowflake (better Snowflake: we had Snowflake last time, but we improved it this time), Glue, thanks to the Depop team for that, plus MongoDB and Oracle. Looker was actually contributed by SpotHero; it's still in the contrib folder, because we want to go back and improve it and make it official using the new ingestion framework. So lots of new integrations this time around, and we're looking to continuously add more logos.
D: Awesome. So I'll hand it over to Gary who, in a hackathon, I think at Envision, kind of popped up and said, hey, I'd like to work on dbt, and we're like, yeah.
C: Hi, my name is Gary. I'm a staff data engineer at Envision, and as Shirshanka mentioned, this is the result of a hackathon. When I was done, I decided that I wanted to contribute this work to DataHub. So, quick overview: dbt is a directed acyclic graph for SQL. It lets analysts and scientists create workflows in SQL. It's really easy to use, it's easy to learn, and it's really just SQL.

What I wanted to do is import the dbt-generated graph for lineage information, as well as some dbt-specific metadata, such as the model name and the hierarchy within the dbt models folder. I often find it's a challenge to map what I see in SQL versus what is actually executing in the dbt folder (some people can do it in their head; I find that to be a challenge), as well as the model type: whether it's a source, a model, a view, or ephemeral. Future iterations may involve pulling in model documentation, additional tags, and anything else the community finds helpful.
C: Well, I think the straightforward answer is, you know, if you're using dbt docs generate, that's a great tool if all of your assets are computed in dbt and all of those assets feed into dbt in some way; it works fine. But most organizations don't actually have that situation, so there are going to be things that exist outside of that. dbt docs shows the universe for dbt; DataHub is intended to solve the greater ecosystem. Does that make sense?
D: Yeah, cool. So this is what it looks like once you get it all integrated. I think Gary even checked in one of those files in the repo, right? Yep, cool. And this is yours again.
C: Yeah. So, in terms of how you use this integration: you run dbt in your regular pipeline, it executes and creates the SQL assets, or rather the data assets, in your data store. The output of that is going to be a manifest file, which lands in the target folder of your dbt project, and then, to pull in schema data and some other metadata, you can run dbt docs generate, which generates this catalog.json file.
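For readers who want to try this, here is a minimal sketch of wiring those two files into DataHub, written against the Python ingestion API rather than a YAML recipe. The Pipeline helper and the dbt source config keys (manifest_path, catalog_path, target_platform) reflect the ingestion framework of this era, and the file paths, platform and server address are placeholders, so treat it as illustrative rather than authoritative.

```python
# Sketch: ingest dbt's manifest.json and catalog.json into DataHub.
# Assumes the acryl-datahub Python ingestion API; paths and values are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "dbt",
            "config": {
                # Produced by `dbt run` and `dbt docs generate` in the target/ folder
                "manifest_path": "./target/manifest.json",
                "catalog_path": "./target/catalog.json",
                # Platform the dbt models materialize into (e.g. snowflake)
                "target_platform": "snowflake",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()  # a real job would also inspect the pipeline's status/report
```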
D: Cool. And just this morning I think someone on the community channel was asking about how dbt connects to Snowflake: you know, doing Snowflake ingestion and dbt ingestion, and the IDs are not quite lining up. So I think there's some very interesting work left to do in stitching together the dbt graph with the rest of the catalogs, but I think it should be pretty easy to do, and then I think we can get that whole end-to-end lineage flow that we all dream about.
D: Awesome, thanks Gary. All right, the next presenter is Pedro, and he's gonna talk about DataHub at DefinedCrowd. Pedro, you wanna...
E: All right, perfect. So, first of all, thank you so much, Shirshanka, for giving me the opportunity to present the work that we have been doing at DefinedCrowd with DataHub. For everyone else, my name is Pedro Silva; I'm a data engineer at DefinedCrowd. So, first of all, let me very briefly explain what the company is.
E
Essentially,
we
are
a
marketplace
for
ai
data
and
our
objective
is
to
make
your
ai
use
cases
smarter
and
the
way
that
we
do,
that
is
by
crowdsourcing
data
assets
specific
to
your
use
case
into
your
industry
as
a
service
with
certain
quality
guarantees.
E
These
can
be
things
like
speech
to
text,
translations,
audio
or
even
image
recognition
type
data
assets
regarding
the
company
we
were
founded
in
2015,
we
have
over
300
employees
and
through
our
series
and
series
b
funding
we
have
over
63
million
dollars
already
raised
important.
To
mention
in
this
talk
is
that
one
of
the
ways
or
the
way
in
which
we
crowdsource
data
sets
is
through
our
nevo
platform.
E
This
is
essentially
a
mobile
application
for
android
or
iphone
where
regular
users
can
download
the
app
and
earn
a
little
bit
of
money
by
performing
certain
tasks.
E: Regarding the architecture itself, what we have is Neevo generating certain metadata events. So, let's suppose that a certain unit of work has been assigned to a user and he has been working on it, so you have progress over time.
E: We transform our Kafka topics into views that are more consumable for our stakeholders, and these views can either be consumed via Druid or via Hive. The use case here is: in Druid we want to work with more real-time queries and an SQL-like approach, while Hive, through JupyterHub, serves those cases, data scientists for example, where they want to perform actual manipulations on the data and perhaps do some data cleaning and feature engineering for our internal machine learning models.

So, at a very high level, this is what we have for our day-to-day ecosystem. This is not everything, but it's what I feel is relevant to this conversation right now. As you can tell, this is a very centralized approach, and the whole architecture is owned and managed by the data engineering team. Our vision is to move towards a more data-mesh-like approach, which certainly has some benefits for us. Concretely, what we want is to achieve three things. First, data democratization: allowing data-driven decision-making by our stakeholders without bottlenecks or external dependencies.
E: These can be people like project managers, business analysts, and so on, or even C-level decision makers. We want them to be able to make those decisions based on data, but without being dependent on the team, and the reason is the scale of the team and the people who manage this infrastructure. We are around six people; this changes over time because we have people being allocated to different projects, but our fan-out ratio, if you will, is six to eighty. If you reduce this, I think it's something like one to twelve, and that's sort of how it works. Given this scale, we naturally want our users to be more self-serviceable.
E: Speaking of intuitive tooling, that's where DataHub itself comes in, right? Self-serviceability is something that's not really possible without having data discovery and data lineage over the assets that we have. And even for the data team, for us, we increasingly have a harder time keeping up with the growth of data assets. Given that the company itself is a data provider, our asset catalog is continuously growing, and for six people it's a lot of information to handle.
E: Their approach was not a perfect fit for the DefinedCrowd use case, though; we did also look for inspiration at companies like Netflix and Intuit that had other approaches. In the end, though, we did decide to go with DataHub, the reason being its extremely active community.
E: Also the ability to have strongly typed, dynamic metadata models, so being able to define certain entities and relationships and being able to modify that, if possible, to make it match our use case; and finally a push and pull ingestion model for the metadata, because we do have certain components of our architecture which are streaming-based and others that are batch-based, and possibly some will be created by us internally, and having that flexibility is very important to us. To give you a sense of where we are right now:
E: We've gone through the exploration and proof-of-concept deployment, and we are now at a production-level deployment, though with a very basic use case: in this scenario, dataset catalogs only, and only on our downstream databases, Druid and Hive. The reason we wanted to do this is to ensure that our direct stakeholders had access to information that they could understand.
E: If they work with Superset and JupyterHub, they are directly interacting with data that's available in Druid and Hive. This was work that involved three people, and overall it was an extremely positive experience; our initial rollout had over 20-plus data users, and their feedback has been quite good, though at this time, because of the lack of metrics, I can't really tell you if this number has changed over time; it is just the information that we have right now. And finally, DataHub as a system is relatively complex, right?
E: It's a large system with a lot of moving parts, but the community support has been exceptional, so in that sense I feel it has been an excellent choice on our part. Regarding contributions, and I know Shirshanka already mentioned this, we did contribute some things, particularly monitoring metrics and cron-based crawling support for metadata, all of this done in Kubernetes, because that is our default deployment mode, and finally support for Druid. That is not the end of our contributions, I hope, but we will see as time moves on. So, just to give you a sense:
E: This is the ecosystem that we had before, and this is what we have additionally with DataHub. You have your Hive and Druid installations, all running in Kubernetes, and then, through the DataHub metadata crawlers and the cron jobs, we crawl Hive and Druid into it. And finally, with regards to opportunities: naturally, we feel there are things to improve, as there always are in good projects, and in our case, for our use case, it's dynamic metadata models.
E: It is true that these models are flexible and that you can change them. However, they are hard-coded into the codebase, in the sense that if we wanted to change them we would need to maintain a fork of the project continuously and, from time to time, merge things from the original repository to gain all the niceties and all the features that have been released. The reason we want to do this is that not all our data stakeholders are tech-savvy, right? Some of them come from linguistics backgrounds, with little to no CS training, and for them semantics matters.
E: Other things include role-based access control: having granular, entity-level access definitions is something that I feel is quite important, so being able to say that a given data user has access to datasets A, B and C, but not D, E and F; this is something that is very relevant to us. And finally, field-level lineage based on jobs and pipelines, because we have certain datasets, or in a sense certain views, where a subset of the columns are generated by job A and the others by job B.
E: This is the sort of information that we want to make sure we surface correctly. Though I do have to mention that all of this is already in DataHub's backlog, so I don't think I'm saying anything new here, Shirshanka; it's just what we at DefinedCrowd would like to see in the future. And yeah, I guess that's it. I don't know if anyone has any questions. I do apologize if I sped through this presentation, but give me your feedback if you have any. Thank you so much.
D: Cool, thanks a lot, Pedro. Definitely plus one on all the pain you felt with some of those things, especially the metadata models; we feel it every time we add one small thing to them, so no-code metadata models are absolutely top of mind for us. RBAC is also on the roadmap. Field-level lineage as well: we've got the RFC in, so we'll figure out the implementation pretty soon.
D: One quick question for you, Pedro, and then we can do some quick questions from the community. You currently have DataHub crawling your Druid and Hive clusters, and you also have some Kafka Connect in your ecosystem, yes? Do you anticipate pushing metadata from that streaming ecosystem into DataHub in the future?
E: Ideally, yes. I do feel that the Kafka Connect integration is more at the level of lineage than of generating metadata assets themselves. Right now, as I understand it, the ingestion framework will generate dataset snapshots and possibly even user information from LDAP systems, and I feel it needs to be adapted to be able to provide these sorts of connections, or updates to aspects of metadata. Kafka Connect feels very much like a source of that type of metadata, and at the ingestion framework level we don't yet have that.
D: Agreed. I think, especially given the way in which we've done the Airflow lineage integration, we can probably use the same strategy for Kafka Connect, where there's framework-level integration with Kafka Connect and, when those pipelines start up, they emit metadata events that then connect the edges together. That, I think, would be great for the community as well. Are there any other questions? We have a minute.
F: The first method is, you know, pretty simple: using Airflow essentially as a cron system to just run ingestion on a schedule. Similar to how you define a recipe with the DataHub CLI, you can create a pipeline, give it your source, tell it to push to GMS via the datahub-rest sink, and just run it every day. It's pretty simple to do this and set it up with Airflow.
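As a concrete illustration of that first method, here is a rough sketch of a daily Airflow DAG that runs a DataHub ingestion pipeline and pushes to GMS over the datahub-rest sink. It uses Airflow 2.x import paths and the Python Pipeline API from the ingestion framework; the MySQL source and the server address are placeholders, so adjust everything to your own setup.

```python
# Sketch: Airflow as a cron for DataHub ingestion (source and server are placeholders).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

from datahub.ingestion.run.pipeline import Pipeline


def run_ingestion():
    # Equivalent of a CLI recipe: one source plus the datahub-rest sink (GMS).
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {"host_port": "localhost:3306", "database": "app_db"},
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()


with DAG(
    dag_id="datahub_ingestion_daily",
    start_date=datetime(2021, 4, 1),
    schedule_interval=timedelta(days=1),
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest_to_datahub", python_callable=run_ingestion)
```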
F
The
second
method,
if
you
hit
next
slide,
the
second
method,
is
to
emit
mces
via
a
data
hub
operator
directly
within
your
dag.
So
the
reason
you
might
want
to
use
this
is,
if
you've
got
say,
you're
generating
a
dag,
and
you
know
exactly
what
lineage
or
you
know
some
extra
information
about
a
given
data
set.
You
can
just
create
that
that
mce
here
construct
that
object
and
push
it
up
to
datahub
to
tell
datahub
about
whatever
you
know
within
that
air
flow
dag.
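A rough sketch of what that second method can look like is below. The operator's module path and parameter names (datahub_conn_id, mces) are assumptions based on the Airflow integration shipped with acryl-datahub in this era, the URN and MCE builder helpers come from the same package, and the dataset names and connection id are placeholders.

```python
# Sketch: emit a hand-built lineage MCE from a task inside a DAG
# (this would sit inside a `with DAG(...)` block like the previous sketch).
import datahub.emitter.mce_builder as builder
from datahub_provider.operators.datahub import DatahubEmitterOperator

emit_lineage = DatahubEmitterOperator(
    task_id="emit_lineage",
    datahub_conn_id="datahub_rest_default",  # Airflow connection pointing at GMS
    mces=[
        builder.make_lineage_mce(
            upstream_urns=[builder.make_dataset_urn("snowflake", "raw.events")],
            downstream_urn=builder.make_dataset_urn("snowflake", "marts.daily_events"),
        )
    ],
)
```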
F
Now,
in
order
to
use
this,
you
have
to
set
up
an
airflow
connection.
This
is
a
pretty
standard
thing
and
then
you
just
put
the
connection
id
in
the
operator
as
a
parameter
and
airflow
will
figure
out
the
rest
of
how
to
pass
the
credentials
into
the
the
emitter
operator
and
then
push
that
information
all
to
data
hubs
now,
a
third
way
to
to
integrate
air
flow
and
data
hub,
and
you
know
the
one
that
I'm
most
excited
about
is
via
the
lineage
back
end.
F
So
the
way
this
works
is
you
set
up
a
little
bit
in
your
airflow
config.
If
you
see
that
second
screen
shot,
you
configure
the
data
hub,
airflow
lineage
backend,
as
deleting
you
back
in
with
an
airflow
and
give
it
the
connection
id
similar
to
how
we
did
it
in
the
operator
case
and
then
in
your
operators
within
your
dag,
you
pass
inlets
and
outlets,
and
this
is
a
airflow
native
integration.
F
Every
single
operator
supports
inputs
and
outlets
in
the
right
version
of
airflow,
and
you
just
declare
your
data
sets
that
are
consumed
and
produced
by
a
given
job
and
datahub
is
able
to
view
and
visualize
all
of
that
metadata,
plus
it
fetches
a
bunch
of
extra
metadata
about
the
dag
and
the
tasks
itself.
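To make the lineage-backend approach concrete, here is a minimal sketch. The backend class path, the datahub_kwargs key and the Dataset entity helper are assumptions drawn from the integration docs of this era (verify against the current guide); the connection id, platforms and table names are placeholders, and the operator would live inside a normal DAG definition.

```python
# airflow.cfg (shown as comments to keep this a single Python snippet):
#
#   [lineage]
#   backend = datahub_provider.lineage.datahub.DatahubLineageBackend
#   datahub_kwargs = {"datahub_conn_id": "datahub_rest_default"}
#
# Operators then declare what they read and write via Airflow-native
# inlets/outlets, and DataHub stitches these into the lineage graph.
from airflow.operators.bash import BashOperator
from datahub_provider.entities import Dataset

transform = BashOperator(
    task_id="build_daily_events",
    bash_command="echo 'run the real transformation here'",
    inlets=[Dataset("snowflake", "raw.events")],
    outlets=[Dataset("snowflake", "marts.daily_events")],
)
```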
F: So you can view properties about, you know, what parameters were passed into the task, or when this thing was last run. Now, the one caveat here is that it requires a little bit of config, and it only works with Airflow 1.10.15 or newer, or 2.0.2 or newer, because the lineage backend was not supported prior to those versions. And so there you have it. If you want to learn more, I'm sure, I'll give you the next slide.
F: I've got a link to the docs where you can read about it, and then I'm on Slack if you have any other questions. Awesome, and thanks.
G: Sweet, awesome. So next up, I'm going to talk about how users of DataHub can take advantage of these connections that we have between entities to get a better understanding of the data that they have. I know that Harshal, in his spare time, made some analytics pipelines and dashboards about the demo data that we have for the demo DataHub project.
G: We have lots of datasets there, and as a little pet project he built some pipelines, and some dashboards too, to visualize metadata about them. I'm going to go through and explore this, understand what he's built, and use lineage to get a better understanding of, and more trust in, the data. So say I wanted to understand what the documentation coverage is for our demo data.
G: Well, which platforms do I need to improve the documentation for? I can go into the search bar and search for documentation, and I see a Superset chart that has been created that gets the completeness of documentation for some datasets. I can click into that, via our Superset integration that's been included in the new release, and go and see some basic properties like the metrics and the dimensions of the chart, as well as the sources that feed into that chart.
G: But how do I know that the datasets being talked about in the chart are the datasets that I'm interested in, and actually are our demo data?
G: This is when I'm going to need to go into the lineage view to see the whole picture of how this chart was created. So when I click this button in the top right of my entity, I get taken to a graphical view that shows, in the center, the chart that we're talking about. It also shows the upstream table dependency that the chart is reading from, and, downstream, what we saw before.
G: So I might make a mental note that, if this chart is what I'm looking for, I might want to go investigate this dashboard and see what other charts it has. And if I double-click on this dashboard, I can re-center the graph around it and see here all the other charts that are contained in that dashboard, charts that I might want to investigate later.
G
Going
back
to
this
upstream
table,
I
can
click
on
it,
get
the
full
name.
If
there
was
description
or
other
metadata
like
tags,
I
would
be
able
to
see
that
as
well.
But
all
I
know
is
this:
is
some
generated
snowflake
table
in
harshal's
pipeline?
How
do
I
actually
know
that
this
table
is
generated
off
of
the
data
that
I'm
expecting
and
and
the
dimensions
are
constructed
in
a
way
that
I
want
at
this
point,
I'm
going
to
hit
the
plus
this
plus
icon
to
further
expand
out
the
lineage
graph.
G: Exactly. So what we've done here is build a pipeline around the metadata included in our demo DataHub project. As you can see from the flow from left to right, we're starting with the raw data of the aspects, and then Harshal has created pipelines to produce derived tables off of these aspects so that they're easier to consume, until finally we have the table that all the Superset charts read from, which is a much more consolidated version of our metadata that charts can easily be built off of.
G: So now, after looking at this lineage visualization, I feel much more confident that the data feeding into the charts is in fact the data that I'm looking for, because I can see from the beginning that it is built off of these core aspect tables that I understand. And if I want to zoom in and look, I can then inspect an Airflow task. Say I might want to know, okay, so I know now that the source data and the flow of data seem as expected.
G: When I click on that, I'm brought to Airflow, and I can actually go in, inspect the code, and verify that this code is what I'm expecting and the transformations that Harshal's done are the transformations that I expect. Going back to our lineage graph: now that I've done my due diligence on the lineage flow, understanding from the beginning all the way down to my chart how this data is transformed, I can now finally go back to the chart.
G: On the chart profile, I click out via my 'view in Superset' button, and now I finally have confidence in this chart I'm looking at. It doesn't just say it's showing dataset documentation completeness; I understand, from the beginning, through the transformations, all the way to the chart, that this is what I expect.
G: Right, and it looks like the Snowflake documentation is lacking and S3 is barely there, but our BigQuery documentation and our Kafka documentation are looking pretty good. So now I know that next week I've got my work cut out for me, and I can go into my other charts, do investigations there, and draw more conclusions. So now, with the new release, now that we have the Airflow integration, dbt, Superset, Looker and other sources that are helping tie your different entities together, these lineage visualizations will help.
D: Awesome. I think people are just blown away, so huge kudos to the two of you for cranking it out yesterday and getting it to this polished state. This is really cool.
D: Awesome, cool. Let's move to the next section. One second while I go find my tab.
D: Yep, here we are. And, oh, another handoff: we have John, who's going to talk to us about the DataHub hackathon that the Depop folks did. And you know, it was pretty cool to see them pop in, literally pop in, to the community channel and say, hey, we're doing a hackathon, and in a few days they were contributing the Glue integration back to us, and the Klarna folks helped them out. So thanks for that collab; I think it was really nice to see that happen.
B: Can you see my screen? Well, yep, awesome. Yeah, a bit of a tough one to follow, that one; that looked really good, so I hope I don't disappoint you here. But anyway, my name's John, I'm the lead data engineer here at Depop, and yeah, I'm here to talk about the hackathon that we did with DataHub.
B: So, just a quick intro about Depop and who we are: we're a fashion marketplace for the next generation to buy, sell and discover unique fashion. We're an app, basically; we provide the ability to sell predominantly secondhand fashion and sustainable fashion. You can think of it as a bit like eBay mixed with Instagram.
B: That's what my mum said anyway when I joined. But yeah, lots of people in the UK are using Depop, and it's growing around the world as well, the US too. And we're growing very fast, and our data needs are growing as well.
B: So why did we look at DataHub? Well, we need to enable the business to use data in a self-service fashion, and we need a single location for all of our data needs. Shout out to the design crew who did that slide; it certainly wasn't me.
B: So I'm going to walk us through some of the problems that we're trying to solve here, which we can see through various Slack messages that we've had across the company. We've got issues with data discovery: somebody new joins and they want to know about data for our CRM, and they don't really know where to find it, which is a bit of a shame.
B: We want to know data about recently viewed items, or any data about banning people in the trust platform, and we don't have that single location for search, so DataHub would be pretty useful there. Then the data lineage aspect.
B: I don't really need to speak about this, as you've just seen a perfect demonstration of how that works, but generally: producers and consumers, and seeing where data starts and where it ends up, all the way through to our Looker instance. That would be very useful for our business users. And Depop's a startup, or a scale-up, and we've got lots of knowledge in our heads rather than in, sort of, documentation.
B: So the tribal knowledge is pretty rife, and this table has a column called active status which, over the years, has baffled many people in the business, including this guy, who said active status could just about mean anything. So documentation is pretty important for our users.
B: So what we did is we had a hackathon in the data engineering team, and the BI team as well, and we split up and tried to have a sort of head-to-head between Amundsen and DataHub. So what did we try and do? Well, both of them have local setups that use Docker, and we tried to go from zero knowledge of these products to getting as much production data into them as we could, inside two days.
B: So I will just change the screen, and I'm going to only show the demo part for DataHub, obviously; this is what we managed to do, and then I'll slip back in afterwards. So I'll stop now; I might need to re-share my screen, actually. Two seconds.
I: [DataHub] didn't have any Glue support at the time, so we spent the last two days figuring out how we could ingest data from Glue into DataHub, and we managed to do it, so that's good.
So if you look in Datasets, we have the browse view: we have Glue here, and then this goes down to a database level.
I: So, for example, if we just click at random on, like, daily compacted, here are all of the tables that are in there. If you search for product create and then go to the one in compacted, so yeah, this has the search as well. So, for example, you can see we added a description here.
Most of our data isn't really well documented; it doesn't have descriptions. But in the schemas, for example in Glue, you can have descriptions for each field. I know there's often confusion about, like, what a user ID actually is: is it the seller? So all of that can be documented. You've got the name of each field, the type on the left, and the descriptions, so that's the schemas. There's also ownership that we could pull out; all of ours are apparently owned by 'owner', so that's not that helpful, but that can be changed. And then Properties just has properties about the table that get pulled out, so just extra information in there.
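For anyone wanting to reproduce this kind of Glue ingestion, a rough sketch using the Python ingestion API is below. The source type name ("glue") and the aws_region/env config keys are assumptions based on the integration the team contributed around this time, the region and server address are placeholders, and AWS credentials are assumed to come from the usual environment or instance role.

```python
# Sketch: crawl the AWS Glue catalog into DataHub (values are placeholders).
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "glue",
            "config": {"aws_region": "eu-west-1", "env": "PROD"},
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
```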
J: Sure. So, basically, our team, Rob, Abby and myself, worked on integrating the Redshift tables and Looker. Similar to the Glue schemas, for the Redshift tables we basically pulled the tables from Redshift, and again it has the schema, the types, the names of fields.
J: If I search for a keyword here, I am able to see all the entities that have this particular tag, and I can even do that the opposite way, by going to the tag and then looking up everything that has this tag. Similarly, if I look for another keyword here, I want to see, well, in this case, for example, it appears in the table name, but in this case there is a match for a column.
J
So
it's
really
interesting
to
see
that
the
search
is
very
inclusive
for
the
looker
implementation.
So
there's
another
area
here
that
we've
been
able
to
integrate
a
particular
dashboard
here
with
a
description
to
scroll
in
I'm
able
to
see
the
obviously
tags
and
owners
and
everything
can
be
added,
I'm
able
to
see
the
actually
the
actual
looks
that
are
part
of
this
dashboard.
J: In this case we just provided a few examples, but if I scroll to one of them, I'm able to, first of all, see tags; I'm able to see the actual table, so the data source for this particular Look in Looker, and obviously, scrolling on, see that information. I don't want to click out, but yeah, there's a direct link to the Look, so that's really nice. I think I've covered most of it.
K
Here,
if
you
check
confirm
signups,
you
can
see
documentation
online
lineage.
J
Yes,
so
yeah,
this
is
a
redshift
table
and
any
documentation.
Let's
say
it's
an
etl-based
story,
any
logic
that
is
part
of
that
creation
of
the
table.
We
were
able
to
see
that
and
each
entity
here,
you're
able
to
see
the
upstream
national
dependency.
So
that's
very
useful
for
later
lineage.
B
Cool,
so
that
that's
the
majority
of
our
demo
there's
some
faqs
afterwards,
but
we
were
presenting
to
the
business.
So
I
wouldn't
show
you
those
so
yeah
what
we
achieved
during
the
hackathon
is
we
ingested
all
of
our
production
data
into
the
local
instances,
so
that
was
redshift
glue
and
kafka.
They
all
came
into
our
local
instance
of
the
data
hub.
We
also
linked
that
chart
of
looker
in
and
we
created
some.
B
We
used
the
metadata
change,
events
to
create
lineage
and
tags
and
documentation
and
owners,
and
we
created
a
merge,
the
pull
request,
which
was
pretty
nice.
So
I
think
that
the
most
important
thing
for
us,
and
probably
sir,
for
any
advice
I
could
give
people
who
here
who
haven't
decided
yet,
is
why
we
actually
picked
the
data
hub.
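For context on the metadata-change-event part, here is a minimal sketch of how hand-written metadata such as lineage (tags, owners and docs follow the same emit pattern) can be pushed straight to DataHub over REST. The emitter class and the URN/MCE builder helpers come from the acryl-datahub package of this era, and the server address and dataset names are placeholders.

```python
# Sketch: push a hand-built lineage MCE to DataHub's REST endpoint.
import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter("http://localhost:8080")  # GMS endpoint

lineage_mce = builder.make_lineage_mce(
    upstream_urns=[builder.make_dataset_urn("glue", "events_db.product_create")],
    downstream_urn=builder.make_dataset_urn("redshift", "analytics.confirmed_signups"),
)
emitter.emit_mce(lineage_mce)
```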
B: Most of the problems we had with Amundsen were around the lack of Kafka support, and when we tried to integrate that with DataHub, it just worked straight away. And, as you can see, we added the Glue integration, which was really easy; the process for adding a new ingestion type was super easy, it was very straightforward, the docs were set up nicely, and, as I think Pedro said earlier, the support from the team was just immense.
B: It was amazing; we were messaging at all times and we were getting responses, and pushing that PR was really trivial, and thanks to Klarna for helping us out there as well. Yeah, the aspect of data lineage is really important for us, because we have several layers of transformations at a business level, and Amundsen didn't really support that very well.
B
Looker
was
a
work
in
progress,
and
I
know
you
said
it's
in
the
in
the
contra
folder
at
the
minute,
but
we're
really
excited
to
see
that,
and
just
all
of
the
other
bits
that
you've
seen
already.
It
was
just
super
good
and
we
had
a
really
good
time
doing
it
and
contributing
back
and
we're
looking
forward
to
integrating
into
our
production
stack
in
in
the
next
couple
of
months,
so
yeah.
Thank
you
very
much
for
your
help.
B
I'm
really
pleased
to
be
working
with
you
and
it's
been.
It's
been
absolute
pleasure,
yeah
and
shout
out
to
the
team.
I
think
marie
is
here,
hey
maria.
Thank
you
very
much.
D
Thanks
john
yeah,
we
really
enjoyed
all
the
energy
that
the
pub
team
brought
into
the
project
so
keep
that
coming
awesome.
So
now
that
we
have
just
a
few
minutes
left,
I
wanted
to
do
one
of
the
things
that
we
had
promised.
We
would
do
for
the
community
and
that's
a
share
out
of
the
observability
marks
and
the
poll
that
we
ran
last
time.
D
Everyone
can
see
my
screen
right.
Okay,
so
you
know.
In
the
last
town
hall,
we
went
through
a
couple
of
screens
showing
hey.
This
is
what
datahub
could
look
like
if
we
added
observability
metadata
to
it
and
then
started
building
kind
of
data,
quality
style
visualization,
as
well
as
data
ops,
style
visualization
to
the
screens,
and
we
got
a
ton
of
feedback
thanks
to
everyone
who
participated
about
what
they
want
to
see
and
what
direction
they
want
the
project
to
go.
D
D: We actually got more than 25 responses, so really happy to get that kind of high-quality feedback from the community.
D
The
first
thing
we
asked
people
was
what
is
their
role,
and
so
what
is
very
nice
to
see
is
that
data
hub
seems
to
be
very
aligned
with
the
interests
of
the
data
platform
leads
and
the
data
engineers
who
are
trying
to
ensure
that
they
have
like
a
modern
data
catalog
with
the
architectural
strengths
of
data
hub.
D
But
then,
on
top
of
this
lineage
graph,
now
adding
observability
or
operational
metrics,
so
that
you
can
understand
end
end-to-end
quality
integrations
with
tools
like
great
expectations
and
enabling
alerting
based
on
these
metrics.
These
definitely
seem
to
be
popping
up
at
the
very
top
of
the
people
we
surveyed
from
the
community.
D
The
next
thing
we
asked
was
feedback
like
should
we
do
this?
Or
should
we
not?
You
know
and
overwhelmingly?
People
said
this
looks
amazing
and
we
should
just
work
on
it.
So
it
seems
like
the
community,
definitely
is
voting
for
these
things
to
be
in
the
product,
and
the
nice
thing
was
right.
After
that
we
asked
people.
How
will
you
help
and
I'm
so
glad
to
say
that
the
number
one
and
two
results
on
those
are?
D
You
know
I
will
storyboard
the
use
cases
and
I'll
contribute
to
building
out
the
backend.
So
that's
exactly
what
we
want.
We
will
set
up
time
with
all
of
you
to
storyboard
together
and
figure
out
how
to
even
run
the
project
if
people
are
interested
in
actually
working
on
this
together.
D
So
last
night
I
actually
created
this
slack
channel.
It's
called
design
data
quality,
it's
empty,
but
everyone
who's
interested,
please
jump
in,
and
we
can
use
that
as
a
way
to
take
the
conversation
forward
and
then
set
up
follow-on,
chats
we're
doing
something
similar
for
tags
and
taxonomies,
and
you
know
async
works
well
pretty
much.
Everyone
is
busy,
so
it's
good
to
get
kind
of
async
feedback
from
around
the
world
and
then
come
up
with
some
sort
of
a
global
picture
for
where
we
want
to
take
the
project.
D
So
thanks
for
all
of
that,
we
are
setting
up
the
slack
channel
and
we
will
set
up
in-person
discussions
with
people
who
would
like
to
go
deeper.
So
look
out
for
those
announcements.
D
It
is
an
opt-in
channel,
so
no
no
requirements
to
auto-join,
but
if
you're
interested
in
shaping
the
future
of
data
observability
on
data
hub
just
join
that
channel
awesome
now-
and
this
is
kind
of
a
very
interesting
segue,
because
all
of
the
discussions
that
happened
today
were
about
data
platform
teams
who
have
taken
kind
of
a
first
step
or
a
second
step
with
data
hub
at
their
company
and
they
have
rolled
it
out.
Some
people
just
finished
a
hackathon.
Some
people
have
actually
rolled
it
out
to
production
and
have
20
people
on
it.
D
But
what
we've
noticed
with
a
lot
of
data
platform
teams-
and
I
remember
even
at
linkedin-
you
know
over
the
last
six
years
as
we
kind
of
built
out
this
product.
We
had
these
repeated
moments
of
feeling
like
okay,
we
did
something,
but
did
it
actually
make
a
difference.
D
So
how
do
you
get
from
deploying
what
you
think
is
the
right
solution
for
your
company
to
actually
making
sure
that
your
entire
company
actually
loves
this
solution
so
that
that's
kind
of
the
challenge
that
we're
setting
ourselves
and
I
think,
as
a
community.
I
think
we
have
the
same
challenge.
We
pick
tools
and
then
now
it's
our
job
to
make
the
entire
company
love
the
tool.
D
L: Awesome. So hello, everybody, I'm Maggie, a senior product manager at SpotHero, based out of Chicago, focused on data services, so everything from data engineering to data science, data analytics and all of the complexities there. I've been working closely with folks from the DataHub community over the past year or so; I'm a huge proponent of the tool, and I just think, you know, you guys are really just kicking ass. I'm thoroughly impressed by all the great work that's coming through.
L: I wanted to provide some PM support as much as I possibly can, and so one approach that we're taking is running and facilitating a design sprint next week. For those of you who are unfamiliar with design sprints, it's really just a dedicated three-to-five-day session; we're going to focus it on three days, to identify a big problem, map out solutions, decide on a prototype, and then rapidly iterate and build that prototype out.
L
If
we
can
kind
of
come
to
a
consensus
of
how
of
how
to
solve
kind
of
like
this
big
gritty
problem,
so
you
know
shashanka
called
it
out
that
you
know
knowing
understanding
how
these
cut
well,
really
meta
tools
solve
these
bigger
problems
can
be
difficult,
so
we're
focusing
on
understanding
product
analytics,
so
really
understanding.
You
know
how
are
users,
interacting
with
the
tool,
what
are
kind
of
like
the
core
user
funnels?
L: How would we think about user adoption, or power users, or really a successful user flow or user journey throughout the product? So we'll be tackling that next Tuesday.
L
If
you
are
interested
in
participating
in
this,
there
are
two
ways
one
we
are
looking
for
for
folks
to
volunteer
for
30,
minute,
expert
interviews
and
really
the
the
task
there
is
is
we'll
just
be
asking
you
about.
You
know
how?
L
How
do
you
think
about
successful
adoption
or
meaningful
ways
to
track
or
understand
user
adoption
within
data
hub
at
your
at
your
company?
So
we're
looking
for
folks
who
have
either
you
know
implemented
the
tool
are
thinking
about
implementing
the
tool,
are
admins
of
it
and
just
have
your
perspective
on
product
analytics
there
and
then
otherwise.
L
We'll
be
scheduling,
and
I'm
a
little
bit
behind
here
and
getting
these
things
scheduled,
but
we'll
be
scheduling
kind
of
like
a
final
review
or
prototype
review
of
what
we
build
out
and
we'll
we'll
make
that
available
to
everyone
to
join
that
that
presentation
towards
the
end
of
the
week
so
feel.
D: And we'll figure it out; it doesn't have to be Tuesday morning, it could be Monday night, we'll figure it out. Awesome, thanks so much, and we're right on time. This is it: the release is out, check it out, kick the tires; if there are bugs, we'll fix them, and keep hacking. We'll take some more questions offline on Slack. I want to give you back your minus one minute. Thanks, everyone.