From YouTube: May 27 2021: DataHub Community Meeting (Full)
Description
Full version of the DataHub Community Meeting on May 27th 2021
• Welcome - 00:00
• Project Updates by Shirshanka - 00:01
◦ 0.8.0 Release
◦ AWS Deployment Recipe by Dexter Lee (Acryl Data) - 09:48
• Demo: Product Analytics design sprint [Maggie Hays (SpotHero), Dexter Lee (Acryl Data)] - 12:32
• Use-Case: DataHub on GCP by Sharath Chandra (Confluent) - 30:16
• Deep Dive: No Code Metadata Engine by John Joyce (Acryl Data) - 48:35
• General Q&A and closing remarks - 01:12:13
A
Hello, hello, everyone. I think we have all our speakers and we have quite a few people on the call, so welcome to the May edition of the DataHub community meeting.

A
Cool, so let's see what we did this month. First off, quick project updates on the release. We have the AWS deployment recipe that Dexter will walk us through. Then we have a demo of the product analytics design sprint that Maggie led, and we have, for the first time, a pre-recorded video from Sharath at Confluent that will walk us through how he's deployed DataHub on GCP. He's actually here, but he's on a low-bandwidth internet connection, so he's going to be there for questions. And then finally, John is going to walk us through our big highlight, no-code metadata, and if we still have time we will go through questions. All right, so first, a big update on the metadata space itself.

A
Last week we ran Metadata Day 2021 in collaboration with LinkedIn, and you probably saw a lot of folks who joined the Slack channel as a result, because we used our Slack community as a way to have conversations about the event. A lot of great content; I highly suggest you watch the video. We got a great group of experts and addressed a lot of the burning questions around how to do data mesh right.

A
There were a few controversial statements, like, you know, we should be confronting the mess and not running away from it, but a lot of good stuff, so go take a look at it. What was really nice to see is that everyone was aligned on essentially getting all of the domains to publish metadata out and getting it all connected up into a single metadata graph, which is very aligned with how we think about the world at DataHub.

A
So this is great; go to YouTube and check it out, there are lots of good nuggets in there. All right, so project updates: 0.8 is coming. We opted not to cut the release before the long weekend, just because we didn't want people to upgrade, run into issues and, you know, not get support over the weekend. So we'll cut the release right after the long weekend. Enjoy your holiday if you are taking it; if not, wait and we will get it done right after.

A
The stats look very similar to the previous release: about five weeks, trending at about the same number of commits. One interesting update: we've got 13 new committers into the project, and this particular release will have 26 committers from 18 different companies. So that's great, a lot of diversity; this is exactly what we want. In terms of the biggest highlights:

A
Of course, it will include the product analytics feature as well as no-code metadata, and there are a bunch of other highlights that I'll quickly walk through. Before that, our Airflow lineage integration is now official: the Astronomer team has published our provider on the registry, so it's now official that Airflow supports DataHub as a lineage backend, and we're actually listed as a featured partner.

A
So this is great. I think we'll see a lot of people using Airflow connecting it up with DataHub for lineage, and this is going to be great. So really, thanks to all of you for getting us over the hump and for all of the support; we'll probably do a longer write-up about the integration in a future blog post.
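As a rough illustration of what wiring Airflow up to DataHub for lineage can look like, here is a minimal sketch using the DataHub Airflow provider's lineage entities on a task; it assumes the acryl-datahub Airflow integration is installed and that the lineage backend is configured in airflow.cfg, and the dataset names and platform below are placeholders rather than anything from the talk.

```python
# Minimal sketch, assuming the acryl-datahub Airflow integration is installed and the
# DataHub lineage backend is enabled in airflow.cfg; table names here are made up.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path
from datahub_provider.entities import Dataset  # helper for declaring lineage endpoints

with DAG(
    dag_id="datahub_lineage_example",
    start_date=datetime(2021, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Inlets/outlets on the task are picked up by the lineage backend and emitted to
    # DataHub as task-level lineage between the two (hypothetical) BigQuery tables.
    transform = BashOperator(
        task_id="transform_orders",
        bash_command="echo 'run transformation'",
        inlets=[Dataset("bigquery", "raw.orders")],
        outlets=[Dataset("bigquery", "reporting.orders_daily")],
    )
```

The same provider also exposes an explicit emitter operator for cases where you would rather construct and send lineage yourself instead of relying on inlets and outlets.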
A
Okay, so product improvements: a couple of big improvements to search. We now support autocomplete across types, so as you start typing, you not only get recommendations for datasets but also charts, as well as other kinds of entities, depending on what the hits are. So it's pretty cool. It's table stakes for any nice search product, so we built it. Try it out; it's already live on the demo instance, so go take a look and play around with it.

A
Second one: it's almost like something we filled in that we didn't get to the first time. Pipelines now also include a visualization of the tasks that are part of the pipeline themselves; we actually organize the tasks within the pipeline based on a sorting of the dependencies between the tasks, and that's one of the things we added. A few more improvements, and this is thanks to the New York Times team, who have been playing around with the themes that are available in DataHub.

A
They added a few things, like making the logo a bit more friendly to customization, as well as a subtitle below DataHub. I'm actually very curious to hear what they are planning to put below DataHub, but this screenshot looks pretty good to me.

A
And coming up next is Business Glossary. It was one of the big requested items. The Saxo Bank and ThoughtWorks teams have been working really closely to build this with us, so it's great. These are screenshots from their internal production deployment, and next month they'll be talking about this in more detail at the town hall. It's actually in the code right now, but we're calling it incubating because we haven't yet published full-on documentation for how to use it. But this is roughly how it looks.

A
I'm really excited about this because it finally gives us the maturity that we've been looking for, for people to actually have curated glossaries that they can attach to schemas as well as datasets. All right, as usual, a lot of improvements and integrations on the systems side. Harshal has been doing a great job managing the community here, but a lot of people have made a lot of contributions. So thanks to all of you; I've listed out your GitHub IDs if you're here.

A
Thank you very much for all of the help. Big change: we've added transformers, so you can connect to a source and, as you're extracting metadata out, also transform it before it goes into DataHub. People are using it to add owners; people are using it to add tags to metadata as it's flowing through, and I think the sky is the limit, so we'll probably see a lot of new integrations there.
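As a loose illustration of the transformer idea, here is a minimal sketch of running an ingestion pipeline programmatically with a transformer that attaches a tag to every dataset it sees; the source type, connection details, and tag value are placeholders, and the exact transformer names should be verified against the ingestion docs for your version.

```python
# Minimal sketch, assuming acryl-datahub with the mysql plugin and a reachable DataHub
# instance; host, credentials, and tag values below are made up.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": "localhost:3306",
                "database": "orders",
                "username": "reader",
                "password": "example-password",
            },
        },
        # Transformers rewrite metadata in flight, e.g. adding a global tag to every dataset.
        "transformers": [
            {"type": "simple_add_dataset_tags", "config": {"tag_urns": ["urn:li:tag:pii"]}}
        ],
        "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
    }
)
pipeline.run()
pipeline.raise_from_status()
```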
In terms of systems, we've had deeper integrations: improvements in the integrations with dbt, and Looker migrated out of incubating into supported production.

A
So Looker is now fully supported; we actually added support for views and a few other things, so thanks for that. Hive also got better. We now support the Databricks Hive as well as the HDInsight Hive. They're kind of odd in the sense that they don't use the Thrift binary protocol; they use the HTTP and HTTPS protocols for transport, doing JDBC over HTTP, and so it was a little bit of work, but we got it done.

A
Hive should now be fully supported in all shapes and forms, and there are a few other improvements that I've listed out. The one I was very happy to see was schema inference added for MongoDB. MongoDB, as you know, is a document store and there are no real schemas in it; you can put whatever you want. So Kevin added the schema inference capability, and you can now connect DataHub up to a MongoDB instance and it will not only get the collection names but also infer schemas.

A
All right, a few upcoming roadmap items that I haven't published on the public roadmap, but since the community asks about this a lot, I thought I would drop it into this talk. One is Neptune; we get asked about it a lot, so it's going to happen in June for sure. This is a graph database that is supported as a managed graph DB on AWS. Column-level lineage: we're going to get to it in either June, or by July at the latest.

A
All right, so now I'm going to hand it over to Dexter to walk us through some of the improvements that have happened on the Kubernetes side, as well as the AWS deploy capabilities and the recipes that we just published. Dexter, you want to take it over?
B
Yep, hello everyone, I'm Dexter from Acryl Data. In the last few months I've talked with quite a few of you, trying to help out with how to set up on Kubernetes. So the first thing: we finally moved out of contrib, so please check out our new home for all things Kubernetes and DataHub, datahub-kubernetes.

B
The aim has been to make deploying on Kubernetes easier, and during my talks with everybody here I figured out some of the pain points that we had. One of the pain points was that setting up the storage layer on Kubernetes was difficult. So what we did was add a quickstart configuration for setting up the storage layer, including MySQL, Neo4j, Kafka and Elasticsearch, so folks can easily create it using the Helm charts that we created.

B
By the end of next week we are planning to publish our Helm charts to helm.datahubproject.io, so please wait for the announcements there. We're also planning on adding guides on exposing the DataHub frontend, which was the hard part: setting up ingress to expose the DataHub frontend to external parties. This is very specific to different platforms: GCP has their own way, AWS has their own way, so we're planning on adding guides one by one for the widely used platforms.

B
I want to give kudos to Pedro, Shakti, Zack, Ricardo and everyone else who has contributed to improving the Helm charts. All right, moving on to the AWS side of things: I created a simple quickstart guide for deploying on AWS.

B
It starts by talking about how to easily create a Kubernetes cluster on EKS, then deploying DataHub and its dependencies using our Helm charts on that cluster. Third is the big part, exposing the DataHub frontend using the Application Load Balancer controller. Of course there are other ways of exposing the DataHub frontend, but I wanted to focus on the AWS-specific way in the guide.
A
All right, and I will hand it back to you. So the next thing on our agenda is the first big item: the DataHub analytics design sprint that Maggie led for us. Maggie, do you want to take over the screen share and drive from there?
C
Yeah, that'd be fine, sounds good. Give me one second. So hello, everybody, I'm Maggie Hays, I'm a senior PM of data services at SpotHero, based out of Chicago.

C
Earlier in April, I think it was April, time is weird right now, who knows, it was within the last month or so, I teamed up with the folks over at Acryl Data to run what's called a design sprint. So I'll walk you all through: what is that, what does that mean, what did we do, what was the point of it, and then we'll move into a live demo.

C
So can y'all see my screen? Looks good? All right. If you've never heard of a design sprint, it's something that came out of GV, or Google Ventures, and it's basically a framework to rapidly move through discovery, ideation, solution prototyping and testing, solving hard problems with technology in five days. Granted, we did it in three days; there are a bunch of truncated ways that you can do it, but the original one was a five-day sprint, and I'll walk you through this.

C
If y'all are interested in learning more about this, there's a ton of information online. This is the book, this is kind of the main source of record, I guess, of what a sprint framework looks like; you can find it on Amazon and all that. Also on YouTube there's a channel called AJ&Smart, where they have videos that break down every single session. They call it Design Sprint 2.0, so it gives you a refresh of it there. So there's ample context and resources for you online if you want to run similar things in your own companies.

C
The role that I played in this was really facilitator, moving the teams through a bunch of different steps of this process.

C
On the first day, we tackled identifying and understanding our problem at hand, so that we could ultimately build a strong prototype around it. We asserted that our problem was that the owners and admins of DataHub do not understand how users are interacting with the tool. So that's a big problem, right? There are a lot of technical approaches you could take to solving that.

C
So what we started doing was taking a step back and understanding how to contextualize that problem into the bigger picture of the DataHub strategy. We talked about how this fits into the long-term vision of DataHub, and we rallied around this vision that in 12 to 18 months data platform owners will want to deploy DataHub at their organization because it gives them superpowers. So right away, when we start talking about solving this problem, we want it in the context of "DataHub is going to provide an immense amount of value". How do owners understand their user activity so that DataHub can give them data superpowers?

C
And then we identified what question or questions we would be asking at the end of this process to understand if it was a success, and we rallied around: are we providing data platform owners with actionable insight? Usage analytics is not all alike; just because you have usage analytics doesn't mean it's meaningful. So we wanted to make sure that we would be able to ask concretely: do you now have actionable insights, so that you can move towards this future value of DataHub in the long run?
C
The next thing we did is we started to break down all of the potential pain points in developing or solving this problem within the current stack, and we reframed these into what's called a "how might we". Really, it's just a way to flip a problem on its head and turn it into an opportunity. So we talked about: how might we make the analytics infrastructure easy to manage, so it's not another service for operators to manage?

C
How might we give clear insights where there's poor data quality coverage but heavily used assets? That way we're trying to solve this without adding too much burden on the owners or operators of the platform, and also giving insight into where you're seeing a lot of activity and there's actually opportunity to enrich that metadata, to give folks more power there.

C
Another thing we did was talk to our experts within the DataHub community. We wanted to make sure we had a well-rounded understanding of this problem set and of how folks even thought about how product analytics would fit into their management of DataHub. Sample questions in these user interviews were things like: what are the top questions you'd like to be able to answer around user activity, and what decisions would that inform?

C
We then mapped out the user experience within DataHub so that we had a very concrete understanding of where this solution fit into that workflow. We talked about how you would install DataHub as a POC, have some step of ingesting metadata, share it with your users, gather feedback, maybe do some iteration cycles here, and from there move into feature development and improving metadata, to then feed it back into this flow. We really targeted this idea of: we are assuming that the POC exists, there is metadata, there are active users, we are gathering feedback, and we are making decisions about user activity to inform future development areas, places to improve metadata, and ways to drive adoption. So again, this really just helps us have a laser focus on where this problem fits into the vision of DataHub, the user lifecycle, etc.

C
Then we moved into sketching solutions. You can see that these came in a variety of different ways: some folks are writing with pencil and paper, some folks are whiteboarding or mocking things up with the UI. The idea is that we just start visualizing what this solution looks like. Then, day two: decide on a solution. Again, we're deciding on a solution to tackle this one big problem.

C
Once we had all the solutions up here, since we're doing this remotely, there's a bunch of little emojis or thumbs-up to show areas where we think they're good ideas, and it's really just rallying around how we are actually going to solve this.

C
So this is day two. By the time we moved into day three, we started moving towards our prototype, and I think here, Dexter...
B
So the first thing was to standardize the way usage events are produced in the React app; please check out the event schemas there. We standardized the page view events, search events, browse events and so on, where we put in enough information for us to understand where these usage events are coming from and what these events actually mean. Second was to utilize existing components of DataHub. As Maggie mentioned before, we don't want to make operators' lives even harder by adding even more components to deploy.

B
So we wanted to use whatever components we've already deployed to support an initial prototype of the analytics product. The third was: while we wanted to have this default way of using existing components, we also wanted everyone to be able to plug in their own architecture for consuming these usage events.

B
Usage events are actually posted to a Kafka stream, so anybody can plug in any consumer of choice for data collection and analytics. Operators can also wire third-party analytics tools, like Google Analytics and Mixpanel, into the React app; please check out the doc for more details on how to do that.
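As a rough sketch of "plug in any consumer of choice", the snippet below tails the usage-event topic with a plain Kafka consumer. The topic name matches the one mentioned later in the talk (DataHubUsageEvent_v1), but the broker address, group id, and the event fields printed at the end are assumptions to verify against your own deployment.

```python
# Minimal sketch using confluent-kafka; broker address and group id are placeholders.
import json

from confluent_kafka import Consumer

consumer = Consumer(
    {
        "bootstrap.servers": "localhost:9092",  # placeholder broker
        "group.id": "usage-analytics-sketch",
        "auto.offset.reset": "earliest",
    }
)
consumer.subscribe(["DataHubUsageEvent_v1"])  # usage-event topic named in the talk

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Field names like "type" are illustrative; inspect real events for the exact schema.
        print(event.get("type"), event)
finally:
    consumer.close()
```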
B
Unfortunately, for now you have to fork the repo, but we are going to work on making that configurable. All right, moving on, let's go to the end-to-end flow. Each component you see here is an existing component in our DataHub graph. In our React app we have the user marked here; as the user interacts with the React app, it sends the events through the track endpoint in the frontend. The frontend collects these events and posts them to a Kafka topic.

B
It will listen to DataHubUsageEvent_v1 and process these events as they come through. Note that these events are not hydrated. So what we do, for example when a user URN comes in and we want to know the details about this user, is go back to GMS: we call the remote DAO or local DAO to get the details about the entities, hydrate the entity features, and package it all into a single document, which we send over to the DataHub usage event data stream on Elasticsearch.

B
So Elasticsearch collects all these usage events. On the frontend, we created a new analytics controller which sends filter and aggregate queries to the Elasticsearch DataHub usage event data stream, where it tries to count and do a bunch of time-series analytics and things like that, to build some bare-bones charts and tables that power our analytics service, and that is fed back into our React app at the end.
A
Dexter, just one thing: maybe take a minute or two at most for the demo, I'm just looking at the timing.
B
Okay, you guys see the data? All right. What I did here was modify the consumer job a little bit so that it prints out the usage events as they come in, and we are in the usual DataHub app.

B
You can see the search event that came in: it says it queried with the query "sample", as well as search view events that talk about how many results were on the search page, and as we click around, each of the actions you take inside the entity page, the search page and the browse page will translate into a certain usage event that comes in. Now these usage events are all sent over; you can see that the Elasticsearch connector is sending the bulk request to our data stream there.

B
Once we go to the analytics beta: each of these components is configurable inside the code. In the DataHub frontend we have highlight cards, we have time-series charts, and then we have tables as well as stacked bar charts. So we created these four main visual cards that we want to support, and then we implemented all of them. You can see here, this is searches last week, and then the top search queries that come in; you can see "sample" was a top search, with five searches, as well as section views across different entity pages, so we have lineage, we have ownership, we have schema and so on, and also actions by entity type. You can see we have "update tags" here; I updated a few yesterday just to show you guys. And then we have top viewed datasets. Of course, we will be continuing to add more charts here, so it'd be great if we could get feedback. I also wanted to go over the charts that we see for our own demo instance.

B
You can see we have, amazingly, 421 weekly active users, which is crazy; thanks for using the demo. You can see the searches that are happening as well as the various search queries. So we can gather a lot of signals about what users are doing on this platform just by looking at these few charts.
C
Yeah, one thing I'll add here. Dexter, could you hide the terminal there, so you can see a full view of the dashboard? Perfect, thank you. What we're trying to do is find ways to contextualize not only activity, but also where there is opportunity to really leverage the power of DataHub.

C
So if we're thinking about the number of datasets: we have 92 datasets and half of them have owners assigned, and that's great. What that means is that we're halfway towards having fully documented datasets within DataHub. So it's not even just "what are people looking at", but "what are people looking at that's specific to the value that DataHub is driving". The other part, speaking from my perspective as a product manager managing this type of tool:

C
I want to understand how I decide where to invest my team's energy. Are people only looking at datasets? Are they looking at pipelines now, and maybe our pipelines aren't well documented? Or in the actions that they're taking, are they adding tags, are they changing owners, are they looking at ownership detail, lineage, etc.?

C
That way I can start to narrow down where to have my team and my stakeholders invest in having more robust and more meaningful metadata in there. The other thing that we're thinking about is looking at the top search queries to understand what people are even looking for in here.

C
Is it specific terms? And one thing we were talking about is, as the set of data platforms expands, do we have people coming in and searching for something like Salesforce data or Braze data, some of these other tools that maybe aren't in there? That can be a leading indicator of other ingestion mechanisms or pipelines that we need to pull in.

C
So I think we can also start to leverage this idea of finding the gap between what people are searching for and trying to do and where we're not actually meeting that demand. And like Dexter said, if you have ideas or questions about how to make this more impactful or meaningful, I will route you over to the Acryl team, but we're definitely excited to see where this heads.
A
Cool, thanks a lot Maggie and Dexter. It was a great experience, and I was talking to Young and, you know, Nick and Ben over on the LinkedIn side as well, and they've actually built a very complex and very extensive analytics capability on the product stream too, so at a future date we can get into that as well. That includes sessions and a lot of deeper analytics, so it's pretty cool what people are doing with it. All right, so coming back, we are running pretty late.

A
What we have today is a very interesting talk from Sharath. He's actually in Idaho, backpacking or something like that; he drove a thousand miles, but he was dedicated enough to pre-record a talk for all of you. So I'm going to play that right away, and he's on the meeting, so he'll be around to answer questions.
D
Hi folks, my name is Sharath and I'm here to talk to you about DataHub deployment on GCP. Before we dive into it, maybe a brief introduction about myself. I am a data engineer at Confluent; I've been one of the first data engineers at Confluent, and that means I helped set up the tech stack, the data stack, and helped build out some of the tools that we use within the data science and engineering teams. So let's talk about today's agenda.

D
Let's get into it. At Confluent, our data warehouse stack is basically a Google shop. We use BigQuery as our data warehouse, we use Cloud Composer, which is Airflow, as an orchestrator for high-volume jobs, and we use PySpark on Dataproc, which is again a managed cluster on GCP. Essentially, within BigQuery we have multiple layers; think of these as schemas, where landing is one layer.

D
When we talk about onboarding: right now, we have the data warehouse set up in a way that all of the lineage that exists within the data warehouse, which is BigQuery, will be seen in DataHub. But eventually we want to onboard various engineering teams who have Kafka streams as their input, who could have their own siloed databases. Another example: right now we have data sources where an engineering team produces data into a Kafka topic.

D
We use a connector to push the data into BigQuery, and then there are multiple layers where transformations happen, and you see a final table that could be real time, real time plus batch, or batch processing over real time. This lineage would help us identify, okay, this reporting table has this Kafka stream as a source, not just the internal data warehouse tables, but also the source stream that the data comes from. And the third part is visibility.

D
I think it's needless to say that data lineage increases the visibility of how the data flows within the system. It will also allow the engineering teams to use these emitters to emit additional information like owners, or additional lineage metadata; these emitters can then be used by folks who want to understand how the data is flowing.
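To make the "teams emit their own metadata" idea concrete, here is a minimal sketch (not from the talk) of emitting an ownership aspect for a table with DataHub's Python REST emitter. The GMS address, dataset name, and owner are placeholders, and class and helper names should be checked against the Python SDK version you run.

```python
# Minimal sketch, assuming the acryl-datahub Python package; names and URLs are placeholders.
from datahub.emitter.mce_builder import make_dataset_urn, make_user_urn
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetSnapshotClass,
    MetadataChangeEventClass,
    OwnerClass,
    OwnershipClass,
    OwnershipTypeClass,
)

emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS endpoint

# Attach a (hypothetical) owner to a (hypothetical) BigQuery table.
ownership = OwnershipClass(
    owners=[OwnerClass(owner=make_user_urn("jdoe"), type=OwnershipTypeClass.DATAOWNER)]
)
snapshot = DatasetSnapshotClass(
    urn=make_dataset_urn(platform="bigquery", name="reporting.orders_daily", env="PROD"),
    aspects=[ownership],
)
emitter.emit_mce(MetadataChangeEventClass(proposedSnapshot=snapshot))
```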
D
So why DataHub? We did look at a few options. I looked at Lyft's Amundsen. We also looked at one of the proprietary options: we use Metabase as a BI tool and we wanted to derive lineage using Metabase, but that had very heavy limitations. And, you know, with DataHub's architecture having Kafka in it, we really wanted to leverage that and make sure that we use Confluent's Kafka and DataHub's internal architecture to power this tool, which can help drive not only the metadata but also real-time changes and alerts that could happen over it.

D
One of the other projects that we're doing is to build streaming applications, and I think this is a good example of how streaming applications can blend in with one of these metadata tools to give us good information. Just to maybe not dive too deep, but something else that I think is important is that a lot of data warehouses always fall back on things like, okay...

D
So before we dive into the deployment itself, just a high-level overview. The first step is using Helm charts to deploy DataHub, and it really took us less than 30 minutes to do that. The second is that, using GCP, we deployed ingress services. Third, a simple cron job to load BigQuery's metadata into DataHub; this cron job is as-is because we know that the tables and columns don't change very often, so we have a weekly cron job that pushes this data. And then there's the custom DAG that helps us emit all of the lineage data into DataHub.

D
So let's get into the DataHub deployment, quickly going over the environment. As we said, we are a GCP shop at Confluent; the data warehouse and internal data team use all of the GCP services, so it only made sense for us to deploy GCP's managed Kubernetes service and use that to deploy DataHub.

D
So we have a Kubernetes service that runs DataHub right now, and our internal services, for example Cloud Composer, connect with this K8s service to emit any information.

D
One of the things that we discussed: the first step of this is basically, when you go to datahub-kubernetes you have the quickstart guide there, and we had to make very few, almost no, changes to that guide. Just some infrastructure changes, where we had to increase and decrease some of the preloaded values for CPU and memory. But apart from that, I think everything else that we used from the charts helped us deploy DataHub on GCP Kubernetes seamlessly.
D
Once it was deployed, there were two things that we wanted, that we were required, to do. One is to expose the UI, and the second is to run a GMS service ingress that would help us connect to this DataHub service to emit any data. GCP provides a good way to create ingress services.

D
If you look at this slide here, we have all of these DataHub services and prerequisites that are running, and all we have to do is click on the two services, the frontend and GMS, select them, and just create ingress services on them. When you create these ingress services, GCP automatically gives you an IP address that you can use to interact with them. In our case, the IP address from the ingress service created for the frontend is used to interact with the UI, and the second one is used to interact with Airflow, which is Cloud Composer.

D
So going back: we deployed our Kubernetes setup using Helm charts into GCP, and we created two ingress services, which again is just two clicks away. The next step is to run a simple, tiny cron job that pushes all of the BigQuery metadata into DataHub, and this cron job can be scheduled however often you feel your database is refreshed. The final step is, I think, how you operationalize your lineage data into DataHub: how do you make sure that your lineage data is captured in DataHub? I'm going to take some time and walk through some of these things that we set up, but before that, just a few components that are required for this. We figured that GCP's audit query log is a very good source for understanding the lineage of data.

D
That means every query that is run against BigQuery is captured in the audit log. So any query where a source table is a table within the information schema and the destination table is also a table within the information schema, any query that matches this condition, it is safe to say is a transformation, recorded in the audit log.

D
The audit log would capture one row where you would have the destination table as t1 and the source tables as s1, s2, s3, and when you think about DataHub, that's exactly what it is doing: it's trying to take your source tables and map them to your destination tables, and the BigQuery audit logs give you this out of the box.

D
You don't have to worry about which SQL script runs, or which SQL script is responsible for loading which of these tables. The steps you need to get this audit log into BigQuery are also pretty simple: you have the GCP logs in Cloud Logging, again a service in GCP, and a logs router can be set up to push these logs into BigQuery. They land as basically nested rows in BigQuery, and we use Cloud Composer, again Airflow, to transform this data and push it to DataHub with the emitters.
D
Thinking of the first one: on the left side you would see here that we have a logging setup, and the destination of this logging would basically be BigQuery. And there's this query on the right; I have links placed to the queries here, and I can share them during the meeting.

D
This query essentially breaks down the audit log into source and destination, just two columns, source and destination: for every destination, what are the different source tables? Once you have this information, what you're basically trying to do is use the emitter task to embed each of these source and destination tables, create a DAG out of it, and use this DAG to push the data into DataHub.

D
So the next step was basically to take all of this log data, create a sort of source-and-destination hierarchy, and then, using the template that is given by DataHub, create an emitter task; using that emitter task we create a DAG that is then executed or scheduled however frequently you want. So let's take a quick look at the DataHub emitters themselves.

D
When you think about the query, this is the query that we were talking about: what is the best way to extract the log in a way that we get sources and destinations? Once we have that data, all we are doing is creating upstream and downstream URNs, basically saying that for every destination table, let's build these upstream and downstream dependencies and push this entire task as a DataHub emitter operator. So, going back, that's exactly what we're doing here.
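As a rough sketch of that emitter-operator step (not Sharath's actual DAG), the snippet below builds an upstream-lineage MCE for one destination table and hands it to the DataHub emitter operator from the Airflow provider. The table names, Airflow connection ID, and schedule are placeholders, and module paths may vary across acryl-datahub versions.

```python
# Minimal sketch, assuming acryl-datahub with the Airflow provider installed and a
# "datahub_rest_default" Airflow connection pointing at GMS; all table names are made up.
from datetime import datetime

from airflow import DAG
import datahub.emitter.mce_builder as builder
from datahub_provider.operators.datahub import DatahubEmitterOperator

# One destination table and its source tables, e.g. as extracted from the audit-log query.
destination = "reporting.orders_daily"
sources = ["landing.orders", "landing.customers"]

lineage_mce = builder.make_lineage_mce(
    upstream_urns=[builder.make_dataset_urn("bigquery", s) for s in sources],
    downstream_urn=builder.make_dataset_urn("bigquery", destination),
)

with DAG(
    dag_id="bigquery_lineage_to_datahub",
    start_date=datetime(2021, 5, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    emit_lineage = DatahubEmitterOperator(
        task_id="emit_lineage",
        datahub_conn_id="datahub_rest_default",
        mces=[lineage_mce],
    )
```

In practice you would build one such lineage MCE per destination table coming out of the audit-log query and emit them all from the same task.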
D
So this is the setup, just a quick overview of what we've done.

D
We first deploy DataHub on GCP using the Helm charts that are already provided in the quickstart guide. The next step is to create these two ingress services, which is a two-step process: just click, select the service, and create the ingress service. Next is to use the template that's already given as an example to load BigQuery metadata. And then we consume the audit logs and create emitter tasks that are then pushed into DataHub for the lineage.

D
Taking all of this into consideration, our current status is this: we have DataHub deployed in our sandbox. We don't have any of the fully managed services; for example, we don't have Cloud SQL or Confluent Kafka, and we're using all of the internal services that DataHub is packaged with. What we did do is enable OIDC authentication using Okta, and any user who wants to access DataHub now goes through Okta.

D
So there is that step of security as well, and the next part is metadata and lineage. This is pretty much automated, so we have metadata and lineage almost fully automated into our sandbox environment. And maybe this is a good opportunity for us to take a look at how this data looks, in its current form, in our sandbox environment. When we look at the data here, we have a lot of different tables, and for the purpose of this demo I've just selected one table, and this particular table has a hierarchy.

D
When we look at the hierarchy, we have a lot of downstream and upstream dependencies, and within these downstream and upstream dependencies we have nested upstream dependencies that also have subsequent dependencies; for example, this particular table can have more.

D
All of this rich dataset lineage can help when a new user is trying to understand where the data in a particular table flows: they need not worry about the code, need not worry about what the SQL does, and at a high level they can start looking at where these tables get populated and which tables they push data to and from. And the next steps, I think, for us: the next step is to productionize our current setup and make sure that we have the right versioning in our deployment.

D
We also want to make sure that we have our metadata and lineage productionized in a way that any time there's a new change, it's automatically pushed to DataHub. We want to start using managed services like Cloud SQL for MySQL, or even Confluent Kafka instead of the Kafka that's already installed as part of the charts. Also, we want to make sure that the metadata we emit is rich; for example, we want to add more column-level metadata.

D
We want to add owners to this. We also want to add processing information: right now we don't use inlets and outlets, so we're trying to see what's the best way to include the Airflow processing information in the edges. Then you have the complete view of which is the source table, which is the destination table, and which processing step actually pushed this data into BigQuery.
A
Thanks a lot. I wanted to make sure that we had time to get into the no-code metadata piece from John as well. Sharath is on the call, so if you have any questions about his deployment on GCP and how he's setting it up at Confluent, definitely ping him either here or on Slack. But thanks a lot, Sharath, for sending this over ahead of time and being able to attend even though you were remote. All right, with that, I will move things over to John.
E
Hey guys, can everyone see my screen? Yeah, okay, awesome. Thanks Shirshanka, and thanks Dexter and Sharath, those were really informative and great. Today I'm going to talk a little bit about a project the Acryl team has been working on in the past few weeks that we're calling no-code metadata.

E
I'm going to go through the problem we're looking to solve, the solution we came up with, and a little bit of deeper technical detail around that, and we'll fit a demo in there as well. So with that, let's get right into it. What's the problem we're trying to solve with no-code metadata? Well, the problem is that adding an entity to DataHub today is pretty hard. These are three PRs.

E
They each add two entities, and you can see the number of files that need to change just to add them to GMS, the backend layer we're going to focus on today: about 50 files. So we'll talk about two things. The first is just the sheer complexity of adding an entity today. It requires more than 25 files per entity in the complex case; in the average case it's about 25 files, and those files consist of models.

E
So these are the snapshots and aspects that we're all probably familiar with, search documents, relationship models, the rest.li resource values, URNs, and I'm sure I'm missing something here. Endpoints: we have entity and, optionally, aspect endpoint files, which are all separate. We have clients, which are just wrappers to actually talk to those rest.li resources once you've created the endpoints.

E
We have these things called data access object factories, so each entity has its own search, local, browse and graph DAO, which allow you to talk to the persistence layer, and as such we need each entity to have a configured factory to create these strongly typed DAOs. We have index filters, which are effectively just lambdas that take metadata change events and turn them into updates against the search index and the graph index.

E
This is the actual "onboarding a new entity" guide doc, visualized: you can see 27 steps listed here in seven different subcategories. It's just very difficult, and because of all of that complexity, it takes a long time to actually add entities, based on conversations with folks who have raised PRs to add entities.
E
We found that on average it would take one to two weeks for a new DataHub contributor to actually get up to speed with all the abstractions and add entities, and that's not even counting the back and forth that occurs on the PR after you've added 50 files, which we've seen span up to a month. So it's just too hard, right? That's the problem we're trying to solve, and that's where the no-code movement comes into play. We started to think about:

E
How can we make the process of adding an entity much simpler? Specifically, we started to rally around this goal that it should take no more than 15 minutes to add or extend a DataHub entity at the backend GMS layer. What we wanted to be in scope was the ability to read and write the new entity using a REST API, the ability to define searchable fields that are indexed in the search index, and the ability to define outward graph edges coming from that entity.

E
We wanted it to be declarative, which means you shouldn't have to write any Java code or imperative code at all, ideally, or at least we should minimize that requirement. The second thing is we wanted this to be extensible, because we are hopefully going to move towards a more declarative world; we'll have a DSL.

E
We wanted things to be extensible horizontally as new requirements pop up, and the third thing is we wanted it to be very usable: intuitive, hard to make mistakes with, well validated up front such that runtime bugs are hard to come by as a result of these changes. So now I'm going to go right into a demo of what adding an entity looks like in the new no-code world, after the work over the last few weeks, and then we'll go into how it all works under the hood.

E
I'm going to get out of the slides here and go over to this town hall demo doc I have here, and we're going to start with just modeling an entity. We're going to imagine we want to model an entity representing a service. We've talked about this before: a service is maybe like an online microservice that you want to represent in DataHub. The first thing we're going to do in the new world is actually model the aspects, and aspects are just...
E
We have the second annotation, which is pretty interesting, here on top of the name field, and that is the Searchable annotation. What this allows us to do is mark the name field as searchable and index that field based on these configurations. So we're saying this should be a field that can be partially matched, that's what we're saying here, and then we're saying we also want to support autocomplete queries against this field.

E
So if you're searching in the search box on the UI, you should be able to see autocomplete results based on the service name. The second aspect we're defining is just a set of properties. In this case we just have two simple properties, a description and an owner, and you can see again we have a Searchable annotation on description. In this case we don't provide any configurations, because we are okay with the default searchable configuration, which will simply make this a space-delimited search index field.

E
The second field has the next kind of annotation we want to talk about, the third annotation, which is the Relationship annotation. What this basically does is mark this field as representing a foreign-key relationship: an edge that extends out of the service entity and into a different entity.

E
In this case we don't put any bounds on what can be on the other side, but we do support the ability to add something like this, where you can say the entity type is corp user, which will then restrict that edge to only have a user on the other side. For this purpose, I think that makes sense here.

E
The second thing we'll need to do is just add the aspect union. This is exactly the same; nothing has changed from what happens today. We add a service aspect union which basically pulls together all of these individual pieces of metadata: the service key, the info aspect we just defined, and then there's this third thing that I'll quickly talk about, which is pretty fun, which is the new browse paths aspect. This is something that the no-code initiative has allowed us to address.

E
This is an ask that's come up repeatedly over the past few months: the ability to customize browse paths. Browse paths are what you see when you're navigating the explore hierarchy; we're seeing prod, snowflake, something else right now. All of those are generated based on hard-coded logic that sits in GMS. We've actually changed that, such that you can provide custom browse paths as a normal aspect, as you would any other metadata, and so here we're actually adding that aspect.
E
So we can provide browse paths, and we'll demo exactly how you would query for that in just a moment. The final big thing we have to do is define the entity model. This is the snapshot model that everyone's used to. We have one final new annotation that we put on the snapshot model, which is the Entity annotation: in the same way that the aspect annotation allows us to define a common name for an aspect,

E
the Entity annotation allows us to define a common name for an entity, which is globally unique. It allows us to get away from using the fully qualified model name as the de facto name for the entity, and we can talk about the benefits of that in a little while. The second piece of metadata we specify here is the key aspect.

E
So we have this ability to serialize and deserialize the service key into a struct that you can use, such that when you're querying for an entity you will always get back both an URN, which you shouldn't have to look into, and a service key aspect which you can then pull fields out of. So it's sort of this idea of a virtual aspect. And then the final thing is just adding that service snapshot to the list of all snapshots.

E
This is again nothing different from what we do today, and that's pretty much it. So, four steps, and the entity is now added. We redeploy GMS and we redeploy the MAE consumer and a few other containers, and we should now be able to interact with that entity, and that's what I'm going to show now. I've already modeled this entity and redeployed my own local versions, to save you guys some of that awkward silence time, but we're going to go ahead and try to write an entity.

E
The first thing we're doing is using a newly created generic entities endpoint, which allows you to read and write any entity in DataHub, and we're going to write this service snapshot into it. Some things to call out: the description is "my demo service", and there's an owner marked here; remember, that's the foreign-key relationship.

E
This is going to be indexed in search, and then we have these custom browse paths: "my custom browse path 1", "my custom browse path 2". An interesting thing to note is that we've also made it such that you can specify multiple browse paths, so you can actually access this entity from multiple explore traversals, which I think is a pretty cool feature. So we're going to go into this terminal and I'm going to paste this in, and hopefully everything works.
E
Okay, so nothing came back, which is good: it means there are no exceptions, and we can validate that by reading that entity using the exact same resource. This is the same set of endpoints, all generic; we'll go ahead and curl that, and you can see, okay, here we are, we've got our new entity back. It has all the data that we think it should, and we can also call out that we have that service key coming back.

E
So now you can actually ask it "what's your name" in a much cleaner way. And we're going to actually search. Here we have a search endpoint; again, it's generic, you can search across any entity. In this case we're going to search across the service entity in particular, and we're going to pass in "my demo". If you remember, our description was "my demo service", and we'll get back "my demo service", so this is saying yes, this matched your search.

E
Okay, we've got one back; "my demo service" is the browse entity there, so this seems to be working. And then finally, we have this new endpoint called relationships, which allows you to fetch arbitrary relationships between DataHub entities, and in this case what we're going to fetch is an OwnedBy relationship that is incoming into a corp user. So basically we're trying to test the inverse relationship.

E
We're saying "get me anything that I own as user one". We're going to go ahead and run that, and you can see, okay, we got the service back, so the edge has been indexed and it's available via this generic relationships endpoint now. And that's basically the demo; that's the process of adding a new entity. It takes, you know, less than 15 minutes; I think I was talking through it, so maybe 15.
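For readers who want to poke at the same generic endpoints, here is a minimal Python sketch approximating the curl calls from the demo. The "service" entity, its snapshot and aspect names, and the URNs are hypothetical (they mirror the entity John modeled locally and do not exist in a stock deployment), and the exact request shapes for the rest.li endpoints should be checked against the DataHub API docs for your version.

```python
# Minimal sketch using requests against a local GMS; the service entity below is hypothetical.
import urllib.parse

import requests

GMS = "http://localhost:8080"  # placeholder GMS endpoint
service_urn = "urn:li:service:my-demo-service"  # hypothetical URN for the demo entity

# Write: ingest a snapshot through the generic entities endpoint.
snapshot = {
    "entity": {
        "value": {
            # Hypothetical snapshot/aspect names matching the entity modeled in the demo.
            "com.linkedin.metadata.snapshot.ServiceSnapshot": {
                "urn": service_urn,
                "aspects": [
                    {
                        "com.linkedin.service.ServiceInfo": {
                            "description": "my demo service",
                            "owner": "urn:li:corpuser:user1",
                        }
                    }
                ],
            }
        }
    }
}
requests.post(f"{GMS}/entities?action=ingest", json=snapshot).raise_for_status()

# Read the entity back, then run a search against the same generic endpoints.
print(requests.get(f"{GMS}/entities/{urllib.parse.quote(service_urn, safe='')}").json())
print(
    requests.post(
        f"{GMS}/entities?action=search",
        json={"input": "my demo", "entity": "service", "start": 0, "count": 10},
    ).json()
)
```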
E
And we're going to go back to talking about how we did it. So, quickly, I'll go over what the architecture used to look like. In a nutshell, the theme is that at every one of these segments you had to define components on a per-entity basis: at the client layer you have individual classes or components for each entity, datasets, users, charts, dashboards, and that propagates over to the resource, the rest.li endpoint layer, as well.

E
You had a different set of endpoints for datasets, users, charts, dashboards, and inside of those components you had another layer: individual DAOs for searching, for writing and reading to the key-value store, and for getting relationships, and these are each specific to the individual entity. So you can see this just scales with the number of entities.

E
If you head over to the MAE consumer side, which is responsible for updating the search index and the graph store as metadata changes come through the system, you'll see that we have the exact same pattern: you have a search builder that's specific to a dataset, you have a user search builder, you have a user graph builder, you have a dataset graph builder, and so it's just scaling with the number of entities.

E
That's the common theme across all of this, and now I'll show you what the "after" looks like. This is the after: we've revised that such that we have two generic sets of endpoints. One is about entities, so it provides the ability to read, write, search and browse any entity in DataHub.

E
We have the relationship endpoints, which allow you to effectively do the same but for edges, and then we have these service classes that sit behind those, and those are the key: they're the generic read and write abstractions over the persistence layer. So we have the entity service, we have the search service, we have the graph service. And then heading over to the MAE consumer side, we followed a very similar pattern, where we have a generic entity search index builder.

E
We have a generic relationship graph builder, and this is all driven by those annotations in the model; that's how we compute what the updates need to be. I'm going to go quickly over this because we already talked through it; there are a few slides just going deeper that you can feel free to look through in your own time. But in a nutshell, here we have four new annotations, the Entity, Aspect, Relationship and Searchable annotations that we talked about, and it's very easy to just define them alongside the model as additional metadata.

E
I'll talk quickly about one of the key abstractions that unlocked all of this, which is the entity registry. The entity registry is really a runtime source of truth for both models and metadata, and it's what all of the services and the index builders now depend on to get information about the model and information about the metadata. So you can think about the storage configurations, the annotations: how should I build the search index?

E
How should I build the relationship index? A key part of this is that we decoupled the whole service and index-builder layer from the metadata models and configuration itself, such that in the future, you know, maybe we aren't using Pegasus, maybe we're using protobuf, maybe we're using something else, or maybe we have a completely dynamic entity registry where you can curl in a new schema like a database, and from the graph service's perspective, from the services' perspective,
E
and from the index builders' perspective, nothing would change, which is pretty exciting. So again, the services all allow you to do generic things to each entity and relationship. One key point here is that they're decoupled from storage technologies, so they're all based on DataHub-specific abstractions.

E
One interesting thing here is that these service classes we've introduced are now actually used both on the read path, from GMS, and on the write path, from the index builders. So the search service and graph service are common across multiple parts of the stack, which makes changing and updating things much easier, and we have one central abstraction that everything else depends on. So, just revisiting the non-functional requirements we talked about earlier and how we may have achieved them, starting with the declarative one:

E
We provide a DSL for defining models as well as storage configurations without any Java required, no coding. We defined an extensible model where it's easy to add new indexed field types; you remember the "text partial" one we saw earlier, and it's very easy to add something new there. It's easy to add new relationships, as easy as defining a new Relationship annotation, and it's easy to plug in new storage implementations. And then finally, it's usable.

E
You'll know at build time, so that we can avoid these runtime exceptions. And then there are a couple of fun features, we think, which are configurable browse paths, as well as moving away from the requirement to have these strongly typed URN PDLs as well as Java classes, which is the current requirement. So, quickly, the impact of this initiative: I think firstly the biggest impact is just the reduction in complexity across GMS.

E
That's what's unlocked from this. So I don't know if you guys want to call out or write in the chat how many files you think may have been made redundant or can be removed with the no-code initiative; if anyone has any guesses, let me know before I...

E
Now I'm going to talk a little bit about where no-code goes from here. We're going to look to actually move up the stack: right now, all of the work that you've seen is really limited to the metadata platform layer, so GMS and beyond, but we want to actually auto-generate the GraphQL API at the datahub-frontend layer, as well as explore the ability to dynamically generate UI configurations.

E
There are a couple of other things we want to do as quick follow-ups. We want to clean up all of that legacy code, so 271 files we want to get rid of if we can, and then we want to continue to expand on those APIs we've developed. They're very minimal, they're slim.

E
They do exactly what DataHub's UI needs them to do, but we can certainly foresee the ability to add new capabilities to those APIs as well. And then I'm just going to close by talking briefly about the vision of DataHub as we see it. We really want DataHub to become this sort of true metadata platform, where you can do things like dynamic model registration and storage configuration, perhaps even at runtime, as we touched on before.

E
We kind of think of this as having multiple layers, where we have the metadata storage engine responsible for access controls, index building, the commit log and maintaining that entity registry; then a set of APIs on top, both synchronous and asynchronous, so Kafka-based and REST APIs; and then on top of that, the client SDKs, which can then interact with all of those. And then finally, I'll just conclude by talking a little bit about how we get to no-code.
E
So how is this going to be released? The code itself is coming early next week, and it includes a few different things. One is a newly introduced DataHub upgrade container and CLI that allows you to actually perform the upgrade against a running instance of DataHub, which would be required to move to no-code. So there are two ways to do it: you can either restart the DataHub instance from a clean slate, and that'll just work, or you can run this DataHub upgrade CLI against a running instance

E
if you have a lot of data; we've tested this against a few hundred thousand rows in a SQL store and things look good. The second part is that we have all these guides that allow you to run the no-code upgrade against docker-compose deployments, Helm deployments, or manually if you have a different setup. But basically, in a nutshell, it'll be: deploy the new DataHub containers, run this migration script, and then verify and validate.
A
That's fine, John. I think we can all deal with the productivity gains you've gotten us; you've saved us a lot of time as well. A couple of things I wanted to point out: as we did the design exercise for this, we didn't want to boil the ocean too much, so a few things we actually kept the way they are. For example, the MCE schema stays the same for now, and even for strong types, we kept them as a constraint.

A
We actually wanted that: the models that you're using to define your entities and your aspects are still used to create serializable structs that you can use to send metadata over, so you don't lose strong types as a result of this; we just created generic entity endpoints. So I'm really excited about this, and I hope everyone is.

A
I saw a lot of good feedback in the chat, so we're looking forward to rolling this out next week and then helping you as you upgrade your ecosystems, and, you know, we'll be there online to help you out with any of this. We've tested it out internally quite heavily, and we've done lots of backwards and forwards compatibility testing, so we're comfortable that this works. But you only know when you finally run it, so we're looking forward to helping you all get over the hurdle.

A
We've actually introduced storage formats now as a concept, so the next upgrade is going to be much more seamless, because now we have versioned metadata around. So that's pretty much all we had. I was looking at the questions that came in just before the...

A
We are going to publish cloud-hosted documentation about how to run these systems on AWS and GCP as well as Azure; I think the AWS one is out. If the question is about the ETA of a cloud implementation, we are hosting DataHub on AWS, so contact us if you want the flexibility of having a hosted instance of DataHub run for you by us. On the struct type for ingestion via Hive: Harshal looked into it, and I think there are some details with how PyHive does it.

A
All right, so we'll see you in a month, but hopefully next week we get the new release out, and we're looking forward to hearing your feedback.