From YouTube: June 25 2021: DataHub Community Meeting (Full)
Description
Full version of the DataHub Community Meeting on June 25th 2021
00:00 Welcome
01:30 Project Updates by Shirshanka
- Release notes
- RBAC update
- Roadmap for H2 2021
19:01 Demo: Table Popularity powered by Query Activity by Harshal Sheth
34:14 Case Study: Business Glossary in production at Saxo Bank by Sheetal Pratik (Saxo Bank), Madhu Podila (ThoughtWorks)
50:00 Developer Session: Simplified Deployment for DataHub by John Joyce, Gabe Lyons
1:00:00 Closing Remarks
A: All right, looks like we have all our speakers, so welcome everyone to the June edition. This will be our last town hall for this quarter. I hope everyone is starting off a good summer, or winter, depending on where you are. We have a packed agenda as usual. We'll quickly go through project and community updates; a bunch of stuff happened in June. John is going to give a quick update on RBAC, and I'll do a preview of the roadmap for the rest of 2021, which I know was a big ask from a bunch of folks in the community. Then we have three big blocks of talks today.
The first is recent work by Harshal on adding query-activity-based popularity to datasets and tables, essentially anything that can be derived from usage logs. Then the Saxo Bank team and ThoughtWorks are presenting how they have implemented the business glossary using DataHub and what they're doing with it at Saxo Bank; there's some very interesting stuff there around schemas and protobufs. Finally, John and Gabe are going to walk us through improvements they've made to DataHub deployments, both single-node and multi-node.
On the community update: Acryl Data has launched. We are the company driving the open source project forward, collaborating with LinkedIn and all of you. There have been some press articles; TechCrunch and a few other outlets covered us, so do read about us. More importantly, we are hiring exceptional engineers and community builders, so tell your friends and anyone excited about joining this community and building the world's best metadata platform together. We're looking forward to the ride and the journey ahead.
The engagement on Slack continues to blow me away, both in terms of the quality of people coming in, the kinds of discussions we're having, and the responsiveness of the community. So thank you to everyone. It's not just us; I know there are so many other people who help out when questions come up, whether they are design questions or troubleshooting questions. Thanks for making this one of the most vibrant communities I've seen, and let's keep the bar that high. All right, let's move into more specific project updates. I wanted to hand the floor over to Young to talk about some plans LinkedIn has for the project in terms of Ember.
B: Thank you, Shirshanka. For those I haven't met, my name is Young; I lead our metadata and reporting teams at LinkedIn, so I'm responsible for DataHub at LinkedIn. I want to make a quick announcement. I know there's been a lot of activity on the React client going forward, and on the LinkedIn side we also want to announce that we're going to deprecate the Ember code base. We want to put out a call to action to the community: if any folks are still using it, please reach out to us. We'll start a Slack thread to see if anybody is still using it and help folks migrate over to the React side, as we'll be doing the same on the LinkedIn side and contributing as well. The tentative timeline is probably the next month or so. If we don't hear from a lot of folks, we'll probably do it a little sooner, but we do want to give people enough time in case they are still relying on it.
A: Cool, that's awesome. I know there are so many amazing features that have been built on the Ember side at LinkedIn, and it's always been a bit of a challenge to build those same features in the React app, so I can't wait to join forces and build them together in the future. This is going to be great.
Talking about releases: I held off on minting the release just before the town hall because there are one or two commits still coming in, but it's going to be called 0.8.4. It is not a backwards-incompatible release, so we're staying with a minor release. Looking at the activity since the last town hall: after that we published 0.8.0, which was the big no-code metadata release. Since then, in the past three weeks we've had about 100 commits, so we're still keeping up our rate of 120 to 130 commits per month, which is pretty good news. There are a couple of RFCs in flight, and I wanted John to walk us through the high level of what they're about and also give a little preview of what is to come on that front.
C: Yeah, sure. Thanks, Shirshanka. We have two RFCs that are currently open. The first is about adding the ability to collect user feedback inside the DataHub UI and actually read that feedback afterwards to understand how people are using the product. That's being driven by Melinda at The New York Times, and I'm really excited to see what comes out of it; I think this is going to be a feature that can benefit a lot of companies deploying DataHub. The idea is to be able to trigger in-experience pop-ups and small micro-surveys and collect feedback on how people are using the app, why they're using it, whether they got the information they were looking for, and so on. I think this will be useful not only for operators of DataHub but also as a source of feedback for the DataHub project more broadly, helping us collect aggregate insights and drive the roadmap forward.
The second RFC is around access control. The main idea is that we want to provide a way to manage access to the metadata itself that's stored in DataHub. This is not access control for the actual underlying data assets; it's access control for the metadata that DataHub has collected and aggregated on its graph. We have a design out. We just got some feedback this morning that was super insightful and useful, so we're still working through some use cases, requirements, and issues. I'd really suggest that anyone who's interested take a look at the design and comment on that PR, and we'll try to incorporate those suggestions. All the feedback we can get here is very useful, because we understand that every company has a different set of requirements and operates in a different environment, so we want to collect that aggregate pool of requirements before really implementing this. As a heads-up, the target is to start implementation in mid-July.
A: That's great, and I'm actually really looking forward to Melinda's work on the user survey as well; it's something we've talked about internally quite a bit. So a call-out for the community: please comment on the RBAC design, make sure it addresses, or will address, the kind of complexity you probably have in your enterprise, and make sure that by the time we get into implementation we've got all the requirements locked in. Awesome.
Quick overall highlights on the releases that happened in June, broken down by the usual categorization: product and platform features, developer and operations improvements, and integrations. On the product and platform side, we're going to see a talk from Harshal later about how usage stats can really drive the product and what we can do with them.
A third one, on the API side: we now have a versioned API for metadata gets. It was honestly a little embarrassing that DataHub keeps everything nicely versioned but we didn't expose an API to get those versioned metadata pieces back, so we added it. This is going to be the foundation for how we build schema history, as well as other kinds of visualizations, in the future.
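As a rough sketch of what that looks like from a client, the snippet below fetches one version of an aspect over HTTP. The endpoint path, port, URN, aspect name, and version number are assumptions based on a default local GMS quickstart; check the REST API docs for your release before relying on them.

```python
# Hypothetical example: read a specific version of the schemaMetadata aspect
# from a locally running GMS. Adjust the URN, aspect name, and version for
# your own deployment.
import urllib.parse

import requests

GMS = "http://localhost:8080"
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,demo.public.orders,PROD)"

resp = requests.get(
    f"{GMS}/aspects/{urllib.parse.quote(dataset_urn, safe='')}",
    params={"aspect": "schemaMetadata", "version": 2},  # version 0 points at the latest
)
resp.raise_for_status()
print(resp.json())
```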
No-code metadata has been hardening. We released it in 0.8.0, small issues keep coming in, and we're fixing them. So, John, what do you think: is it a seven out of ten, eight out of ten, ten out of ten? Where are we? How hardened is it right now?
I think there are a couple of issues still open that we're looking at, but I would give it an eight out of ten right now in terms of how good I feel about it. I want to see a few more folks saying they've tried it out and it looks amazing; we've got a few people reporting small issues. So I'm looking forward to getting more feedback: are you trying out no-code, how is it working for you, how many new entities have you added, things like that.
Moving on to developer and operations improvements: a lot of contributions have come in around hardening auth, which is great, so we've got new improvements for OIDC and improvements for Elasticsearch and Kafka. There's a GCP guide that Dexter wrote yesterday, which completes one of our commitments to the community. We already had the AWS guide and now we've got a GCP guide, so you should be able to deploy on pretty much any cloud you want, and I think we'll have an Azure guide soon. The other big thing: Neo4j was a sticking point for a lot of folks, and we were considering moving to Neptune. But as we looked at the details of how DataHub uses the API and how the graph is built, we realized that Elasticsearch can actually do just fine, especially for one-hop queries.
I know LinkedIn was about to move to LIquid when I was leaving, so that might have happened already, but on the default side we're going to say you can just run with Elasticsearch. That simplifies your deployment, and in many cases even your production installation becomes much simpler. A lot of work has also happened on hardening our Docker images. We were running on a really old base image (thanks to Grant for pointing that out), and we've done quite a bit of work; I think we're now all clear on the vulnerability side for all our Docker images. John and Gabe are going to talk about how much the single-node install has improved.
I think they might be able to run it on a Raspberry Pi now, or at least it's one of our goals for this year, so that's definitely on the list. Then, moving to integrations: a lot of work happened to integrate with Glue, and Kevin has been doing a ton of work there. We now support Glue for S3 as well, so if you've got S3 there's a nice recipe for how to integrate your S3 datasets using the Glue pathway, but we've also got support for Glue jobs. That was another thing that was asked right after we did the Airflow integration: "Hey, we've got Glue ETL jobs, can we get those ingested as well?" So we've got that done. On dbt, lots of features have been added, so we're getting better and better at covering the entire dbt graph, if you will. And finally, our first foray into ML: we've got an integration with Feast now, so for people who are considering using Feast for their feature store, this is a perfect time to try it out.
So that's pretty much it for the release highlights. The other big thing is the roadmap for the rest of the year. Some of this is carryover from things we had promised to do in the first half of the year, but a lot of it is new stuff we're taking on for the rest of the year.
I feel like we did a decent job of hitting our first-half milestones; it was a pretty ambitious plan. There were a few things we couldn't get to, like data profiling, dataset previews, and data quality integration, but we prioritized more of the foundational work, like no-code metadata and simplifying the single-node install, because we feel that gives us the right base to build from as we add a lot of new features on top of the platform. So we feel good about the trade-offs we made. Big things coming up: RBAC, as John said; the implementation of that will land very soon, and we'll start working on it in early July as mentioned. Business glossary: Saxo Bank will talk about it a bit later, but the way they have done it supports all of the modeling and all of the visualization, though it doesn't have edits yet, so we're going to add edits in Q3. Column-level lineage; data profiling and SQL-based dataset previews carrying over; and data quality integration with a few systems, not just Great Expectations but also AWS Deequ and dbt tests, since these are the systems we find in our community.
The way we envision that integration working is that you can come to DataHub and see all of the data quality rules that have run and their status, across all of these different tools. It shouldn't matter whether your tests are written in one system or another; you should be able to see them in the same way. Leading into that is another foundational improvement we're planning to make: building a metadata trigger framework. So what does that mean?
Of course, we'll build integrations with email, Slack, GitHub Actions, and things like that, which are the common ways people hook things together. I'm very excited about that: just like the ingestion framework made it super easy to get metadata into DataHub, the trigger framework will make it very easy for everyone to react to changes happening in the metadata ecosystem.
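The trigger framework itself is still roadmap, but you can already get a feel for the idea by tailing DataHub's own Kafka change stream. The sketch below is a rough illustration of that, not the planned framework; the broker address and topic name assume a default local quickstart, and real events are Avro-encoded, so a production consumer would go through the schema registry instead of printing raw bytes.

```python
# Rough sketch: react to DataHub metadata changes by consuming its Kafka stream.
from confluent_kafka import Consumer

consumer = Consumer(
    {
        "bootstrap.servers": "localhost:9092",
        "group.id": "metadata-trigger-demo",
        "auto.offset.reset": "latest",
    }
)
consumer.subscribe(["MetadataAuditEvent_v4"])  # assumed default change topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # A real trigger would deserialize the Avro payload, match it against a rule
    # (e.g. "owner removed from a tier-1 dataset"), and then notify Slack,
    # send an email, or kick off a GitHub Action.
    print(f"metadata change event received: {len(msg.value())} bytes")
```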
And one final thing: there's a lot of interest in the community in adding a metrics entity.
So that's what we're committing to for the Q3 roadmap from our side. Moving on to Q4, there's a lot of stuff we actually wanted to do in Q3 that we're going to do in Q4, so if you want to accelerate this roadmap, let us know and we can partner with you on it. There's the ML ecosystem: getting features, models, and notebooks all nicely modeled and visualized in the UI. Support for operational metadata: really supporting partition metadata, completeness, freshness, those kinds of signals. And support for the data lake ecosystem.
A
So
you
know
support
for
the
common
formats
out
there,
delta
lake
iceberg,
hoodie
hive
already
supported,
so
I'm
not
including
that
here
and
then
I
don't
know
how
many
of
you
attended
data
mesh
or
metadata
day.
There's
a
lot
of
interest
in
supporting
data
mesh
oriented
features,
and
I
know
a
lot
of
our
companies
in
the
community
are
actually
implementing
data
meshes
so
being
able
to
support
those
kind
of
features
in
the
product
like
being
able
to
see
a
data
product
on
its
own,
separate
from
a
data
set
being
able
to
see
analytics
on.
how my data mesh journey is going and improving over time: what percentage of my data products are driven by high-quality datasets, or vice versa. And then finally, collaboration features: being able to share knowledge across all the data professionals, and having conversations inline in the product as well as off-platform. So those are our Q4 roadmap items.
D: Okay, I hope everyone can see this. I'm going to be talking about dataset popularity: how we use it in DataHub, and how we get it from query logs and query activity.
First off, why do we care about this? Data popularity means different things to different people. For the data platform owner, it enables them to understand what's going on within the enterprise: how is data being used, and in which systems? If you've got, say, both Snowflake and BigQuery, you may want to understand which one is actually being used by more people and is more popular. For data producers, it helps you understand how people are using the things you produce, kind of like an impact analysis within the company. It also helps you prioritize among the data assets you produce: which ones are most important, which ones are actually getting used, and which ones could use more documentation to improve usage, because it's a great dataset that you produced.
It can also help you streamline the deprecation process. Say a dataset you're trying to deprecate still gets 100 queries a week: you probably don't want to deprecate it just yet. Instead, you want to look at the popularity and usage, figure out exactly who those users are, reach out, and help them migrate to a different solution.
For data scientists, popularity and usage is a major trust signal. It helps you understand whether this is something someone put out a year ago and hasn't touched since, or something that is regularly updated and regularly used, something you can rely on given that other people are also relying on it. The other thing you can do is look at the other queries people are issuing against that dataset and figure out, say, what other tables are relevant here:
what it is commonly joined with, on which keys, and so forth. So you can determine not just whether to query that dataset but also how to. And then, helping everyone: we can use usage and popularity data to improve search rankings and improve the ordering of things in the lineage visualization and so forth. So lots of product improvements for DataHub can also come out of the usage statistics.
Let's look at what we're collecting and how we're doing it. Right now we support BigQuery and Snowflake for usage stats. For BigQuery we're using the BigQuery logs and parsing those out; for Snowflake we're using the access history and query history views, joining them together and getting our popularity and usage data that way. For each dataset we can collect per-user usage frequencies: person A is using it this much, person B is using it this much. We can also collect how they're using it and what queries they issued. There's a lot of granularity here, even down to which columns they frequently query versus which ones are not being used. And once again, we roll this up so we can also get frequent queries across all the people using a dataset.
Some might only care about 30 days of history, some might care about many years. And the last requirement is that we want to avoid re-fetching the same data from the same source system repeatedly: if we're collecting data, we only want to pull a given piece of the usage log or query log once, and not have to pull it again and again.
So,
given
this,
we
are
some
of
the
decisions
we
made.
The
first
is
we're
going
to
start
with
a
batch
based
system,
so
you
know
you
can
configure
to
run
hourly
or
daily,
whatever
you'd
like
and
we'll
pull
kind
of
the
most
recent
queries
in
history.
We have a memory-efficient algorithm for pulling this in, to keep the memory usage of ingestion from blowing up. Then we do some pre-aggregation at the per-dataset level and roll it up, so that we get frequent users of the dataset, frequent columns used, and frequent queries of the dataset. We take that information and push it through GMS into Elasticsearch, where we store these aggregate statistics, do additional aggregations on them, and then surface them in the UI as you might expect.
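As a toy illustration of the per-dataset pre-aggregation described here (not DataHub's actual implementation), the snippet below rolls raw query-log rows up into top users, top columns, and top queries before anything is emitted downstream.

```python
# Toy sketch of per-dataset pre-aggregation of a query log.
from collections import Counter, defaultdict

# each row: (dataset, user, query_text, columns_referenced)
query_log = [
    ("analytics.orders", "alice", "SELECT entity, count(*) FROM orders GROUP BY entity", ["entity"]),
    ("analytics.orders", "bob", "SELECT urn FROM orders", ["urn"]),
    ("analytics.orders", "alice", "SELECT entity FROM orders", ["entity"]),
]

users, columns, queries = defaultdict(Counter), defaultdict(Counter), defaultdict(Counter)
for dataset, user, sql, cols in query_log:
    users[dataset][user] += 1
    queries[dataset][sql] += 1
    columns[dataset].update(cols)

for dataset in users:
    print(dataset)
    print("  top users:", users[dataset].most_common(2))
    print("  top columns:", columns[dataset].most_common(2))
    print("  top queries:", queries[dataset].most_common(1))
```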
A: I guess one more interesting constraint in the design you probably had was not adding one more moving part to DataHub, like "oh, you've got to go run a Spark job or some other big data processing job to compute this stuff," right?
D: Yeah, absolutely, and that's actually a good segue into the demo. I wanted to show how BigQuery and Snowflake usage work. For the Snowflake one I'm going to show how it looks when scheduled with Airflow, because that's the common use case here: you schedule it on a daily basis. And then we'll see how it all looks in the UI.
We can start with how BigQuery usage works. Right now I have a little recipe configuration; it works the same way as most other sources. You just have a new plugin type called bigquery-usage, and you can put in the project id for BigQuery. I just have a playground instance that I'm using, and unfortunately I haven't queried it in a few days.
So what is it doing here? It's pulling the BigQuery usage logs from Google's Cloud Logging product, doing a little bit of pre-aggregation, and then dumping that into a file.
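A minimal sketch of the kind of recipe used in this demo, expressed through the DataHub Python ingestion API rather than a YAML file, is below. The project id and config keys are illustrative; check the bigquery-usage source documentation for the exact options your CLI version supports.

```python
# Hypothetical recipe: pull BigQuery usage logs, pre-aggregate, dump to a file.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "bigquery-usage",
            "config": {"projects": ["my-playground-project"]},  # illustrative key
        },
        # The demo writes the aggregated usage events to a local file for inspection.
        "sink": {"type": "file", "config": {"filename": "./bigquery_usage.json"}},
    }
)
pipeline.run()
pipeline.raise_from_status()
```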
There it is. You can see a couple of instances here: the general datasets that I was using, and then, if we want to, we can take a more detailed look into the actual usage data that was produced. We have emails, frequent queries, and then the fields and each of their usage counts, and we have this on a per-day (per-bucket) and per-dataset basis. Cool. Snowflake works pretty similarly.
I actually added it to our demo instance, and it's pretty straightforward. This time we're running the ingestion using direct code, because we want to do it inside Airflow, and it's remarkably similar: we first ingest Snowflake and then add Snowflake usage as a pipeline, so you get both of them at once. Once again you set your configuration, and I wanted to get a bunch of historical data.
So I set the start time manually, but beyond this I might just leave it blank and it will automatically cover the current day.
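A sketch of that "direct code" pattern, suitable for wrapping in an Airflow task, is shown below. This is not the exact DAG from the demo; the connection details, config keys, and the one-off start_time backfill are placeholders to illustrate the snowflake-then-snowflake-usage sequence.

```python
# Hypothetical Airflow-callable function: ingest Snowflake metadata, then usage.
import os

from datahub.ingestion.run.pipeline import Pipeline

SINK = {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}}
SNOWFLAKE = {
    "host_port": "myaccount.snowflakecomputing.com",  # placeholder account
    "username": "datahub_reader",
    "password": os.environ.get("SNOWFLAKE_PASSWORD", ""),
}

def ingest_snowflake_with_usage():
    # 1) regular metadata: databases, schemas, tables
    Pipeline.create({"source": {"type": "snowflake", "config": SNOWFLAKE}, "sink": SINK}).run()
    # 2) usage statistics; start_time set once to backfill history, then left
    #    unset so each scheduled run covers the current day
    usage_config = {**SNOWFLAKE, "start_time": "2021-06-01T00:00:00Z"}
    Pipeline.create({"source": {"type": "snowflake-usage", "config": usage_config}, "sink": SINK}).run()
```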
Now that we've run this successfully, we can see the little green box there. We can head over to the demo instance and see where this is surfaced, in a couple of places.
The first is that we immediately see the number of queries that have been issued against this dataset; we can see that it's 78 in this time period. We can also see who has issued queries against it, so we get the top users, ordered by frequency: you can see who has queried it the most and the second most. Beyond that, we can also look per column. For the two columns here, entity and urn, we can see that the entity column has had 78 queries per month while the urn field has had only 43, so people seem to use the entity field more than the urn field, for whatever reason. And what might that reason be?
Well, we can hop over to the queries view and take a look at some recent queries that reference this table. You see your standard SELECT count(*) ... GROUP BY entity, which is where we might guess that the entity field is being used more frequently than the urn field or other fields. We can also see that people are creating other generated tables that reference this all-entities table, so we can start to understand how people are using it.
In terms of future UI improvements: we've got a bunch of time-series data for the usage statistics, and right now we're only showing something like queries per month. We could also add line charts, so that if you're expecting a certain dataset to be deprecated, you can watch the usage per day taper off as you migrate people over. And finally, we want to expand our time-series metadata piece to add mechanisms for extending it using a similar no-code approach. So, yeah.
A: Awesome, thank you, Harshal. If people have questions about how to use it, please ask. My own first question was: why are snowflake and snowflake-usage two different sources? There's actually a good reason for it: in some of these sources, the place you get usage data from is different from the place you get the metadata from, and in some cases you actually need elevated privileges to get usage data out, so it makes sense to separate those two pathways.
Once we add time-series metadata support to no-code, I think it will be pretty cool to see different kinds of systems being able to push usage metadata into DataHub. Awesome. Our next talk today is from Sheetal and Madhu, who are going to talk about the business glossary and their implementation of it in DataHub, as well as at Saxo Bank. Sheetal and Madhu, do you want to take it away?
E: Yeah, Shirshanka, I'll start quickly. Thank you, Shirshanka. A quick bit of context: Saxo Bank, in partnership with ThoughtWorks (Madhu is a data strategist from ThoughtWorks), were working together in an engagement, and we have contributed the business glossary work back to open source. We worked closely with Acryl and LinkedIn. Oh, I thought I'd already shared my screen. I'm sorry.
Okay, a brief theoretical context on the business glossary. A business glossary is a list of business terms with their definitions; it lays down the business concepts for an organization or an industry and is not specific to any one database or data store.
For us, there is a Saxo-specific glossary as well. How does it help us? Once it is out, it helps us identify the relationships between different terms. This is an example from FIBO. Eventually we want to target graphical representations, but we'll stick to tabular views in DataHub and our data workbench for now. Let me just lay out the pain point which led us to come up with this solution.
Why did we develop this? A couple of years back, when we started on this journey at Saxo Bank, I was conducting interviews across the organization to understand the pain points around data. The common problem raised by different system owners and system SMEs was data quality issues and inconsistencies, where they were spending a lot of time resolving tickets because of data flowing across systems. A quick, very common example:
I have dataset A in system A, dataset B in system B, and dataset C in system C, and a few data elements that name the same concept differently: in system A's dataset it's called 'account', in dataset B it's 'account number', and in system C it's 'account id'; they are the same thing. The ETL that flows from dataset A to B depends on a mapping sheet that has been created by systems A and B, and similarly the ETL into system C depends on mapping sheets too. If the SME leaves, or some knowledge is lost here and there and another version of the mapping sheet is created, the ETL process gets broken, validations fail, and 'account id' and 'account' are no longer consistent, which leads to a lot of issues. Now, how can we resolve this?
If you can point all of these (account name, account id, and account number) across these systems to the same business term and embed that in the schema, then, with it ingrained in the schema, the dependence on the mapping sheet goes away, as does the dependency on SMEs who may leave, and the data flowing across systems can stay consistent and correct. Quickly, how have we enabled it? Here is what we have done; this should look familiar.
This is the DataHub page where we have added tags and terms. These are the business terms which expand the metadata for data elements, and, as we'll show later, each one actually points to the FIBO URL.
It could be FIBO or anything else. Next, the design principles we have stuck to: we started with DataOps principles, which are based on communication, collaboration, integration, automation, and measurement. We believe the business glossary can be evolved while staying agile, iteratively taking care of business needs in the digitization journey. We will also show at the end, if we get time, how we wanted to make sure that technology is involved right from the start, when a business function is introduced into the organization, to enhance the metadata.
So apart from the data elements, we now also have industry-standard ontologies defined at the metadata layer. This obviously brings schema maturity, because the business terms are now engraved in the schema, and schema versioning: any change to the metadata regarding data elements, data types, or business terms will cause the schema to be versioned. It also enforces ownership on the producers, not only of the metadata but also of the business terms and their appropriateness and validity. Quickly, here is how we have actually realized the physical implementation, both for datasets and for business terms.
Our schema definitions are in protobuf, so the messages and fields that carry business terms use protobuf options to declare the ontology source they're using and its URL. With this, I'll quickly stop sharing and hand over to Madhu for the next set of slides and the demo.
F: Yes. Okay, thank you. Connecting back to where Sheetal talked about business terms: they define the business concepts and enable a common vocabulary within the organization.
I wanted to talk about how we relate datasets to business terms, which enhances the value of the elements and makes the datasets more meaningful. I've taken a simple example with a purchase order, which has elements such as id, revision number, status, employee id, vendor id, and a number of others like order line item.
Now, the vendor id can be mapped to the 'supplier identifier' business term; in another table the same concept may be called 'product supplier id' and also mapped to 'supplier identifier'. This enhances the value of the dataset, and a by-product is that if you define a certain business rule at the supplier identifier level, you can drive that business rule against all of these datasets. Beyond the association with the data, which enriches the value of the datasets, the business concepts or terms are themselves interrelated and hierarchical.
Some of them can be composed to create a new term altogether. Say we have purchase order date, value, and ship date: these are composed to create the 'purchase order' term. That kind of relationship helps you discover the datasets you're interested in and get to the right dataset.
In DataHub you already have aspects like ownership and schema metadata; now we are bringing in the business glossary with two new entities: one is the glossary node and the other is the glossary term. The glossary node is introduced to define the hierarchy of the ontology, so we can achieve a hierarchy very similar to the FIBO one.
To use an analogy: a glossary node is like a package, and you can have any number of hierarchical levels, while a glossary term is like a class definition that describes the business term. Then there is GlossaryTermInfo, which captures the definition of the glossary term and the source of the term: it can come from an internal organization or be borrowed from an external one, and you can even have a link to the external source so that people can navigate to it.
So the first thing we did was onboard these entities. Then we expanded the dataset by adding a new aspect to it, so that glossary terms can be related to the dataset; and since a dataset has schema metadata, which is an array of schema fields, we enhanced the schema field to associate business terms at the attribute level. With this you are able to attach a term to the dataset and to its schema fields, which helps the business user navigate to the dataset from the business concepts themselves.
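At the metadata layer, that association boils down to writing a glossary-terms aspect against the dataset (or a schema field). Saxo drives this from protobuf options through their GitOps pipeline; the standalone sketch below just shows the end result using the DataHub Python emitter, assuming a recent acryl-datahub client, with the dataset name, term name, and GMS address as placeholders.

```python
# Hypothetical example: attach a glossary term to a dataset via the REST emitter.
from datahub.emitter.mce_builder import make_dataset_urn, make_term_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    GlossaryTermAssociationClass,
    GlossaryTermsClass,
)

emitter = DatahubRestEmitter("http://localhost:8080")

dataset_urn = make_dataset_urn(platform="kafka", name="trading.purchase_order", env="PROD")
terms_aspect = GlossaryTermsClass(
    terms=[GlossaryTermAssociationClass(urn=make_term_urn("Common.SupplierIdentifier"))],
    auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:ingestion"),
)

emitter.emit_mcp(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=terms_aspect))
```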
One other thing we're currently working on in the design is that the terms themselves are related, as we saw in the previous example; beyond the hierarchy we are targeting, there are other relationships established between terms, such as 'is-a' and 'has-a' relationships. With that, I'll move to the Saxo implementation so that you have better context on how we actually implemented it.
If you look here, there is a type name, and there are attributes like account number and balance; the balance itself is of another type, balance amount. You can also see that the savings account is linked to a customer account; here we're trying to define that relationship.
F
This
is
of
a
term
or
type
customer
account
so
that
you
are
able
to
relate
things
so
that
you
can
okay,
even
though,
let's
say
example
like
in
organization,
your
independently
terms
can
evolve
over
them
and
realize
that
these
are
common.
You
can
relate
it
back
and
proto,
given
a
very
flexibility
so
that
you
can
actually
expand
the
definition
or
metadata
of
a
schema.
We
are
using
an
options
to
do
that,
and
there
are
other
cases
like
okay.
F
So
next
thing
is:
I
wanted
to
give
a
little
a
little
bro
overview
of
how
the
metadata
is.
Onboarded
saxo
is
adopted.
The
database
approach
to
the
new
data
platform,
where
domain
teams
are
response
for
building
the
data
products
and
also
annotate
about
their
metadata.
F
So
the
response
for,
like
you,
come
up
with
the
self-service
capabilities
where
users
are
can
be
declaratively
defined.
The
data
set
and
we
have
a
githubs
process
which
takes
this
thing
and
create
the
topics
in
the
kafka
and
register
schema
and
extract
the
this
metadata
templates
parse
the
files
and
pushing
into
the
linkedin
data
by
converting
into
a
snapshot
which
is
required
by
the
mc
schemas
with
that
I'll
quickly
move
to
the
demo
in
the
interest
of
time.
Here you can see what we at Saxo call the Data Workbench, a one-stop shop for data; this is its home page. Let me take you to the business glossary. We have a domain hierarchy: party domain, market domain, trading, common, and so on. Let me take a simple example: we saw the customer account example earlier. I can navigate to it either by searching or by going directly through the hierarchy.
If I go to the business term, I can see the definition of the term and its source, and I can navigate to that source, whether internal or external; here it points to FIBO, but it could be something else. You can also see the related datasets and additional properties. Under related datasets there are two datasets; you can navigate to one of them and land on the dataset home page, which is very familiar to everybody, where you have all the information.
A: We did the same thing at LinkedIn, but it's really nice to see a lot of companies doing something similar, and it's great to see these recipes emerging for how to manage schemas and metadata in Git, along with this kind of push-based architecture, to get metadata out and integrate it into a common base. So I highly encourage reaching out to them.
We will probably have similar support in the open source code base for having protobuf schemas and applying annotations to them. So do talk to them about how they've done it and try to implement similar practices at your organization; I think it's definitely a game changer.
Cool. Next up we have Gabe and John, who are going to talk about all of the hard work that has gone into simplifying the DataHub deployment, single-node as well as multi-node. John, do you want to share the deck on your end, or do you want me to walk through it?
It's just a couple of slides; do you mind just sharing them? Okay, no problem.
C: All right, thank you, Shirshanka. In the interest of time we'll keep it pretty short, to leave some time for questions at the end. In the past few weeks, Gabe and folks from Acryl have been working on simplifying the DataHub deployment.
We've heard from the community that it's very heavyweight and, surprise, we agree. So we're trying to do everything we can to simplify it along multiple dimensions, including resource consumption, overall complexity, and beyond. If you wouldn't mind going to the next slide, Sri, I'll start talking about what we've done. There are two broad buckets we improved over the last few weeks. The first one is really just the general experience of deploying DataHub for the first time.
We believe you can get a lot of value out of DataHub just from the default models, the boilerplate implementation, without changing anything. As part of that belief, we decided to let you deploy DataHub without actually having to check out the code. Previously you had to git clone, cd into the docker quickstart directory, and run that script.
Now we've changed it so that you can just install the DataHub CLI and run "datahub docker quickstart", and to ingest sample metadata you can run "datahub docker ingest-sample-data". It's really nice. What we do behind the scenes is fetch a few files dynamically: a full docker-compose file that includes all the default environment variables you need to spin everything up, as well as some other resources, and then we go ahead and deploy it.
We've updated the quickstart guide to make this the default mechanism for deploying DataHub. We've heard some feedback, fixed some issues, and I think recently things have been looking really good. Again, we're trying to push down the barrier to entry for deploying DataHub for the first time and minimize the time to value DataHub can provide. The second thing on the experience side is that we worked on improving the logging coverage.
We also set up default log-rotation settings, so all of the info, warning, and error messages go to a normal, daily-rotated file that can be found under /tmp/datahub/gms or /tmp/datahub/datahub-frontend. I think this is going to be really useful, especially for us when providing support to the community.
Now we can just ask folks to create a zipped log file and send it over when they have issues, which greatly improves our ability to help you. The second thing is that we've added a ton of debug logs. They are not part of those default logs, but we've added a short-lived, capped debug log as well that rotates every day, and that one is a lot richer.
C
It
has
information,
that's
really
specific
to
data
hubs
application,
so
we
kind
of
filter
out
the
other,
the
other
logs
and
only
log
data
hub
stuff.
So
hoping
that
will
help
us.
You
know
debug
those
critical
issues,
much
quicker
in
the
future,
just
note
that
that
pr
is
in
review.
So
if
you
guys
want
to
take
a
take
a
look
at
it,
that
would
be
greatly
appreciated.
The second area is the resource side. Previously it required roughly two CPUs, eight gigs of RAM, and two gigs of swap to deploy DataHub on Docker Desktop or with the Docker engine. We were able to get that down to one CPU, three gigabytes of RAM, and one GB of swap. So, to answer Shirshanka's earlier question: I think we can now run it on the second-tier model of the Raspberry Pi, the one with four gigs of RAM. But again, we're continuing;
this is going to be an ongoing effort, and we're trying to push that as low as we possibly can. This is something we've heard from the community: it's just too heavyweight, and it's annoying to have to go and change those Docker settings all the time and debug that. So, big progress there. The last thing is the actual container count.
I think we have a few areas we can improve, specifically on the DataHub containers themselves. We're hoping we can merge GMS and the frontend into one container in the default case, while of course still allowing you to deploy them as separate containers. One other big thing we recently did, which some of you have already discovered, is that we merged the two Kafka consumer jobs we had, the MAE consumer and the MCE consumer, into GMS.
So that's all now one deployable by default, which gets rid of two containers; we think that has been a great improvement, at least for us operating DataHub, so we're excited about it and will continue working on it. One last thing I'll call out is that the DataHub CLI quickstart is a lot more resilient than just running quickstart.sh, mainly because we've provided a wrapper that actually checks that all of the required containers are up and healthy, as well as pinging them.
I'll conclude with the little message we added there, which gives you a green "DataHub is now running" when you deploy it. It's very satisfying to see, especially if you're coming from the quickstart.sh world. With that, I'll pass it off to Gabe, who's going to talk about simplifying our persistence layer.
A: John, can I ask one question that I've heard from the community as well, and that has come up a few times? The docker-compose files that are sitting in the Git repo: are they still usable? Can I still go into them and just do docker-compose up, and will that still work?
C: Yeah, it'll still work. The big change we made there is actually splitting some of those tools I mentioned into a separate docker-compose file, so you can deploy the thin version as well as add that additional tooling if you want it, pretty flexibly. We haven't really regressed in functionality or in what we're supporting; we've just made the default much slimmer. But yes, all of those docker-compose files should still work.
C: No, we are actually still going to be publishing independent containers for the MAE consumer and the MCE consumer, and you can configure your deployment using environment variables. So you can switch off the MAE and MCE consumers inside GMS and then deploy the dedicated MAE and MCE consumers, if you are operating in an environment where you need to scale those services independently.
G: Awesome, thanks, John, for sharing all of that. I'm really excited about how much easier it is to start DataHub now, and I think folks are going to find that it's just so much easier to get things going. Shirshanka, if you go to the next slide, we'll talk about one other way we've made it even easier to deploy DataHub.
One thing we've heard from the community is that Neo4j can add extra complexity when running DataHub, and some folks want to be able to run without it. So we've provided the option to run DataHub using just Elasticsearch as the backend for our graph service. We've essentially abstracted the different graph methods into an interface, and we allow you to back it with either Neo4j or Elasticsearch. Right now DataHub only uses single-hop queries to power the frontend, so Elasticsearch and Neo4j are going to be about equally performant.
G
In
fact,
elasticsearch
might
be
slightly
more
performant.
In
some
cases,
some
folks
are
going
to
want
to
do
more
advanced
graph
queries
and
also
we
intend
to
add
more
advanced
graph
queries
to
data
hub
down.
The
line
in
that
case
for
very
large
graphs
that
are
running
very
complex
graph
queries.
Frequently,
you
may
still
want
to
use
neo4j,
however,
for
many
deployments
elasticsearch
will
continue
to
be
just
as
good
of
a
solution.
Switching just requires flipping a variable for which graph service implementation you're using: change it from neo4j to elasticsearch, disable the dependency on Neo4j in the Helm prerequisites, and you'll start using Elasticsearch as your graph backend. Soon we'll be providing migration scripts so that you can re-index your Elasticsearch graph backend with the existing data that's in Neo4j; for now, if you want those relationships, you'll just need to re-ingest the data.
And yes, we are running it that way on the demo. If you want to see how performant things are, you can go to the DataHub demo instance and you'll see that relationships load essentially instantly. I think you'll find that in many cases Elasticsearch is going to be just as good as a graph service backend, but moving forward we're going to keep supporting both of them. And if you ever want to add your own implementation for a different graph service, you can implement the graph service interface with any graph database and contribute it back; it doesn't have to be just Elasticsearch and Neo4j.
A: Awesome, thanks to both of you. I know we're at time, so just some closing remarks: we're super excited to be building with you all. DataHub is moving really fast, so hold on and hang on; things are getting more and more exciting, and we're really excited about the roadmap ahead. Do give us feedback on Slack and offline.
Try out the new features we've launched, and let's keep building. See you in another four weeks; I'll send out the announcements for when the July town hall is going to be. All right, with that, bye.