From YouTube: DataHub Community Meeting (Full): Feb 19 2021
Description
Full version of the DataHub Community Meeting on Feb 19th 2021
Agenda:
Welcome - 5 mins
Latest React App Demo + Tags Preview by John Joyce and Gabe Lyons - 10 mins
Use-Case: DataHub at Geotab by John Yoon - 15 mins
Tech Deep Dive: Tour of new pull-based Airflow + Python Ingestion scripts by Harshal Sheth - 15 mins
General Q&A from sign up sheet, slack, and participants - 15 mins
Closing remarks - 5 mins
C: Awesome, you might see something you like. We have some prototypes of tags.
C: Hello, welcome to the community; we haven't seen you before. Do you mind quickly introducing yourself?
A: Yes, my name is Juwan Petrini. I work as a senior data engineer at Klarna, together with Thomas Larson, who should be in this meeting as well. We are building a data catalog and we are basing it on DataHub; that's the reason I'm here.
C: Well, welcome. And John... John?
F: Hi, I'm the lead data engineer for a fashion marketplace company called Depop, and I'm here for similar reasons to everybody else: we're looking at starting a data catalog and investigating DataHub. I'm here with my colleague Maria, who's here somewhere as well.
C: Hi Maria, welcome. I see Thomas, also from Klarna, I think.
D: Yeah, hi. I don't know, maybe Juwan has already introduced us... yeah, yeah.
C: Cool. Oh, that's awesome, John from Geotab is here. I had heard that he might be a little delayed, but I'm glad to see that he could make it. (Yeah, I'm here.) Awesome. Suresh, hello!
H: Hi everyone, I'm Suresh. I'm heading a data engineering team for the Children's Hospital of Philadelphia. My interest today is obviously to learn more from you all; we would currently like to evaluate DataHub as an enterprise data catalog. I'm hoping to have another team member, Joe Marzio, here with me today, and we're very, very interested; I think there's a lot of stuff going on here. And thank you for forwarding that invite, Shirshanka. Thank you.
I: Hello, this is Nagashina. I'm working for LinkedIn on the DataHub project.
C: All right, welcome. This is our second community meeting of the year. We have settled on doing it every third Friday of the month; that's how we pick the dates.
C: This time we also have John from Geotab, who's going to talk to us about how the adoption journey with DataHub at Geotab has been, and Harshal joins us to give us a quick tour of the new ingestion framework that he contributed and to describe some of the design decisions he made. Then we'll take some of the Q&A from the spreadsheets and figure out logistics for the next meeting.
C
So
that's
the
plan,
a
quick
thing
that
I
was
trying
to
categorize
and
I
think
we
should
make
it
a
bit
more
formal
from
the
next
time
onwards.
I
was
just
looking
at
kind
of
the
developments
that
have
happened
in
the
last
month.
We've
had
close
to
50
prs
marched,
that's
probably
a
record.
I
do
need
to
go
back
and
look
at
kind
of
the
historical
trends
in
terms
of
big
things
that
have
landed
the
graphql
support
that
we
talked
about
last
month.
C: Another big thing that was merged was the ML models backend implementation. So thanks to Ryan Holstein for pushing that through, and to the LinkedIn team for actually shepherding and reviewing it very carefully and thoroughly. Then we have the Python ingestion rewrite, which we'll have Harshal talk about. And finally, we're also seeing Elasticsearch 7 support getting merged in slowly. It's still on a branch, but at least it's there for people to look at, and John Plaisted has been leading the charge on that with Na and Jyoti.
C
We
should
hopefully
see
that
getting
merged
into
masters
soon
or
maine.
That's
another
thing.
We
have
to
fix
all
right.
So
those
were
the
developments
in
the
last
month
and
I
would
like
to
have
john
joyce
take
us
through
a
quick
demo
of
the
react
app
and
let's
see
where
it's
going.
K: I'd say we're at probably 99% feature parity. There are a few gaps, which you'll be able to see in the app, but we've also extended some of the functionality that the Ember app has, or does not have, rather. So we'll start by just logging in. You're greeted with this kind of home search screen; this is the jump-off point for both search and the browse experience, which you're probably familiar with. We'll just start by searching some of the data; we have kind of a new, fresh search. One notable addition is this "All" tab, which allows you to see different entities grouped by type: you can see we have dashboards, charts, users and datasets. From here you can go right to the details pages, pretty similar to what the Ember app has. We have the schema and lineage; one of the notable exceptions in terms of functional parity I'll show in a moment.
K: Let's see... the biggest difference, I think, is that we've recently added dashboards and charts, so now you can actually see them. It's pretty minimal; we're looking to add new types of metadata to the model, but right now we just have, say, for a chart, the source datasets, and for a dashboard, the related charts, so we have this kind of parent-child relationship here. The last functional piece that I think we're really missing, that's big, is the datasets-you-own, charts-you-own and dashboards-you-own tabs in the user profile page. But that's pretty much it. Browse works as you would expect; you just walk through the file-system hierarchy. And then we just had a new addition, the logout button; thanks to the folks at Geotab for getting that done. And that's pretty much the demo. It's ready for deployment.
K: ...if you want to do a side-by-side feature comparison, if you're considering moving over. But one other thing I want to call out before handing it back: we're going to start doing some React office hours that I'll be leading, just dedicated time for people to jump in and ask questions about any part of the React app: deployment, features, roadmap, how to actually make changes. Again, this is an ongoing process; I think in the future we have some interesting action items we're trying to get to.
K
One
of
those
things
is
doing
kind
of
a
comprehensive
ui
revamp
of
the
different
entity.
Details
pages
you
saw
there
as
well
as
the
browse
experience
revisiting
that
so
yeah
I'll
kind
of
send
out
a
link.
It'll
probably
be
fridays
for
about
a
two
hour
block,
but
excited
to
continue
to
work
with
the
community
excited
to
take
more
feedback,
so
feel
free
to
reach
out.
If
you
have
any
feedback
concerns
considerations
around
the
react
app
and
with
that
I'll
hand
it
back
to
tisha
shanka.
C: Awesome, thanks a lot, John. I do know that people wanted to contribute stuff back to the UI, and the Ember application was something that was very hard for us to work with the community on. So I'm really glad we were able to make this push and finally get to the point where I think we can start getting contributions back from the community. I'm really glad that we got our first logout button; even small things like that make us very delighted. So I would love to see the office hours actually start making a difference. We'd love to not have to do all the pages ourselves; we would like to have people start contributing new pages, so that'll be great.
C
I
know
that
gabe
had
something
that
he
wanted
to
quickly
demo
in
terms
of
things
that
he's
working
on
almost
like
a
sneak
preview
of
tags.
So,
okay,
do
you
have
a
little
bit
of
time
to
do
that.
N: So the feature that I wanted to give a sneak peek of today is adding a little bit of richness to the existing tagging system. I'm working with Frederick and Mati from Wolt on developing this richer tagging system, and they're doing an amazing job driving forward the RFC that's hammering out the nitty-gritty details. I think that RFC is still a work in progress, so exactly how this would be modeled in the backend is still being developed, but this is an exploration of what the richer tagging system might look like in the UI.
N: So you can see this is a sensitive dataset that has some tags applied to it: it has this PII tag applied, and we're also able to tag this field, the contact field, by saying that it's an email. In this richer tagging system we might also want to associate some additional context with a tag besides just the name "email". So if we look at this email tag and click into it, it brings us to the tag's page, where we can describe it as, you know, personally identifiable information. The exciting thing here is that, with the ability to associate tags with other tags, we can start to explore this hierarchy of tags: we can click through to the PII tag and then see what tags it is associated with. On this PII page you can see, again, exactly what properties and what metadata we might want to associate with a tag; this is something we're still trying to hammer out, but it's an example of another piece of metadata we could associate. The tag lists demonstrate what tags are associated with what other tags internal to DataHub, but say you also want to capture the relationship between a tag and some definition external to DataHub: we can say here is our PII label inside of DataHub, and this is the external definition of it outside of the DataHub system. And from there we can continue to explore this hierarchy. So this is just a sneak preview of what this tagging system might look like in the future.
C: Awesome, that was pretty cool. I'm sure Madhu is getting excited, because he had submitted the business glossary RFC, and I think he'll see a lot of similarities to how we're thinking about tags. So there's definitely something here where we would basically want to make it as close as possible to how business tags or business glossaries should look, while still allowing a little bit of freeform capability for people to come up with lightweight tags that don't necessarily have to go through a ton of review.
C
So
we'll
we'll
search
for
kind
of
a
the
right
balance
and
so
love
to
work
with
the
community
on
figuring.
What
that
is
thanks
a
lot
yeah.
N: The hope is that the tagging system we end up with (and Frederick and Mati are doing a great job driving that RFC process) is something that could encapsulate the business glossary term, so it becomes somewhat of a more extensible version of exactly the RFC that you drove.
G: Hey, I'm curious: with the tagging system, what's the thought process in terms of how those tags are generated? Is it done via the UI, or would it be similar to the other metadata? How are tags generated and associated with one another?
N: So again, Mati and Frederick are working on an RFC that covers these technical details, but I think that having the capability to edit tags in the UI is something that we would eventually want to support, for sure.
C: There's also the governance-team persona that says: I want to control the number of tags and this kind of explosion, and there are some specific tag taxonomies that we want to make sure are known and understood, business terms that everyone understands. When I was at LinkedIn, we had a compliance taxonomy that we literally hard-coded for the company, and we said: these are the terms; you don't get to invent your own new term for what an email is.
J: I can quickly comment here. I'm from Wolt also, and Frederick has been mostly working on this RFC. But our use case also requires maybe two taxonomies, or two types of adding tags to entities: some types of tags can be added through the UI, but some tags we need to be able to control and audit, for example these PII tags, so we can't hide them by accident, or we have to have some kind of audit log there for those types of tags.
C: Cool, yeah. I think the RFC is open, so please add comments there, and we can make it the right fit for pretty much all the businesses that are trying this out. Maggie, since we have you: does the dashboard and charts page at least look like what you would expect as a first cut?
G: It looks great; that new UI is slick. Looks good.
C: Oh awesome, okay! So now, if the Geotab folks are ready, we can have John actually take over and drive.
O: So Geotab is a global leader in telematics, with about 2.1 million subscribed vehicles using our products and services; we're one of the very few telematics companies that make both hardware and software. For anyone who's not familiar with the term telematics: it means that we use IoT devices and OEM software to collect data from vehicles to provide various products and services that help our customers. Here are some examples of how we help our customers improve their fleet productivity and optimization, enhance driver safety, and achieve stronger compliance with regulatory changes.
O: Geotab spent quite a bit of time in 2019 on evaluations and PoCs with commercial products like Collibra, Alation and Talend, which all had a robust set of features from a data management and governance perspective. But it didn't take too long for Geotab to realize that they weren't for us.
O: Judging from the use cases that community members shared in previous town halls, most of us had a very similar list of open source products to evaluate, and from those we shortlisted Atlas, Amundsen and DataHub for our evaluation.
O: Functional and non-functional requirements were very important, but one key evaluation metric (I wouldn't say unique; I'm sure someone else also looked into it) that made us select DataHub was the approachability and technical capabilities of the leading dev team.
O: The leading dev team, LinkedIn, as most of us know, has a solid portfolio of open source projects that they designed and donated to the Apache Foundation, and the DataHub team has been very approachable, responsive and open during our evaluation phase, which mattered for a very small team at Geotab trying to tackle this problem.
O: For our first crack at DataHub, we onboarded a small number of datasets, just over 250, and had 60 users from one department try out DataHub. The result was somewhat disappointing: the adoption rate was very poor and the feedback was discouraging. In users' eyes, DataHub wasn't any better than how they searched for datasets in Google BigQuery.
O: For some it was useful, but there weren't enough datasets when they needed to find something on DataHub. So I asked myself: I was told that data discovery was a problem at Geotab, but it turns out the scope of the PoC was poorly established, and I made a very naive decision to blindly accept what someone else said and took the scope from the Collibra PoC, which was also an unsuccessful PoC.
O: So for the past few months I took it on myself to learn what's really going on behind the scenes. Just to give you some overview of what the data journey was like: Geotab grew very fast, 500% growth in revenue and size over five years.
O: Over the past few months I spent most of my time talking to people from other departments to understand where we are in terms of data management, and then made a proposal on what we would need to change from architectural, integration, security, compliance, operations and metadata management perspectives.
O: In 2021, one of our goals is to productionize DataHub. We're currently working closely with Shirshanka's team, John and Gabe, to learn more about their React app, assisting them bit by bit in building the React application, and once we're comfortable with the app in the testing environment, we're planning on productionizing DataHub at Geotab.
O: Internally, we had a debate on whether to allow anyone to push any datasets within Geotab, but the decision was to make only the production datasets available on DataHub. There isn't a right answer for choosing one over the other, but we decided to put more emphasis on production-level datasets, which follow our internal data office processes to ensure relevant metadata is captured on data integrity, ownership, security and compliance.
O: DataHub's generalized metadata model allowed us to start conversations with other departments at Geotab about modeling custom entities that they want to catalog, while capturing meaningful relationships with other DataHub entities. So basically we are discussing, and will be treating, DataHub as an internal open source project, so other departments' dev teams can also contribute internal features and custom entities.
O: We just started to contribute to the open source React application; we made a couple of contributions over the past couple of weeks, and hopefully the numbers will grow over time. We're not adding too much value at this point in time, but we're slowly shifting towards an open-source-first mindset: generalizing our use cases as much as possible to find opportunities to contribute back to the community while solving our internal problems at the same time.
O: These are some of my wish-list items before I close. I think I mentioned in the Slack channel that hopefully we can have the roadmap timelines updated on the open source Git repo. And one of the pain points when we were having discussions internally with other departments was that there wasn't really an easy way for us to quickly understand what entities, aspects and properties are currently available in DataHub.
O: That would let us minimize redundant effort when we create new custom entities. So a metadata-model-to-graph visualization, to help community members quickly see what entities, aspects and properties are available and what the relationships among them are, would be very helpful in my opinion. And column-level lineage is something we've been tackling internally, asking ourselves what would be the most efficient and automated way to first capture the column-level relationships.
O: That way, when the feature is available in DataHub, we can readily surface it. And a social feature has been one of the hot discussions internally; I know most of the commercial products have this feature. It's not the highest-priority item on our backlog, but I think it would be very valuable for the DataHub community as well. And that's about it.
C: Cool, that was great, John; thanks for sharing the journey. I can definitely relate to a lot of those challenges and concerns. The one thing that we've had quite a lot of debates about with a lot of teams, especially central teams, is exactly this rationalizing of: do we only put the clean data in DataHub, meaning the clean metadata in DataHub, or do we actually put everything in there, have the clean data rise to the top, and use that as a way to drive data governance? That's definitely on my mind; it's a big topic of debate in lots of communities as well.
I: Okay, if I can just quickly jump in: my team built the Dataportal at Airbnb, and we went through a similar decision-making process. There's something magical that happens when you have more than 200 weekly active users of your product: you'll find the right blend of trusted datasets and datasets that people want to be productive with. So I believe it's just about growing usage, and the dataset-quality questions will settle themselves once you get the experts using the tool.
M: Yeah, we encounter the same challenge here at Amazon with our clients, and what we found works better is to put the responsibility on the publisher of the datasets: they need to say whether the data is reliable, etc. We see a lot of customers, and even our internal teams, building things like a feature store, so the question becomes: is this dataset something that you can rely on for your reporting or BI?
M
So
we
push
it
to
the
publishers
and
the
subscriber,
and
we
just
create,
like
you
know,
json,
that
that
define
the
contract
between
the
publisher
and
the
subscriber
about
the
data
set.
So
we
try
to
you
use
technology
to
enforce
it,
but
what
I've
seen
that
you
always
need
the
men
in
the
middle,
like
the
data
steward
or
or
someone
from
legal
to
tell
okay,
can
you
actually
publish
this
data
et
cetera?
M: I'm happy to share, maybe in the next meetup, some of the architecture and how we solved it in several use cases. And again, like you mentioned Collibra and Alation: we looked at all these third parties with some customers, and we always get to this point where there is a man in the middle, or processes that need to be enforced somehow. So I agree with you on that.
C: Absolutely, we'll take you up on that offer, Roy. Yeah, great. Okay, all right: if there are no more questions, we'll go over to the next item on the agenda, which is having our newest contributor, Harshal, give us a tour of the new ingestion framework that he contributed.
P: Let me share my screen.
M: Yeah, sure. I'm Roy Ben-Alta; I'm based in New York. I've been with Amazon seven-plus years, and I've had several roles with Amazon, like product and SA; today I'm a director at AWS. What we do: we work with all the AWS customers and partners, helping them with their journey with data and machine learning on the cloud.
M: Another thing that my team does, because we cannot meet the demand of all the customers that need help, is a lot of open source as well. One of our most popular open source projects, which reached two million downloads, is AWS Data Wrangler.
M
It's
a
pandas
library
that
we
build
in
python
to
for
folks
that
don't
like
to
use
spark
and
they
love
pandas,
especially
the
data
scientist
community,
so
and,
and
we're
using
a
lot
of
open
source
in
our
tools,
of
course,
aws
as
the
service
team
that
are
building
products,
etc.
M: Some of our customers are Amazon themselves; Amazon has different business units, from Alexa to Amazon Go, and we actually help them as well, because they are using AWS. I joined Amazon in 2013, when Amazon Web Services had 15 services; today it's almost 200, so the platform has evolved over the years. I've been in data analytics for probably most of my career, and my background is distributed computing, so I'm a big fan of what LinkedIn did with Kafka and others.
M
So
you
know
jay
and
ania
from
past
life
where
before
confluence
so
always
happy
to
join
back
to
the
meetups
and
teams.
So
now,
in
my
role,
I
think
there
is
a
tight
correlation
of
things
that
we
can
do
together
on
the
open
source
so
happy
to
help
and
again
I'm
not
selling
any
aws
services
in
here.
So,
just
from
from
sharing
knowledge
and
and
learn
from
you
as
well
so
nice
to
meet
you
cool.
P: Awesome. Yep, so for the past couple of weeks I've been working on a new Python ingestion framework for DataHub. First off: why did we do this? The status quo was that the Python ingestion framework we had previously was really just a set of scripts, and people were already using that to ingest metadata into DataHub.
P
But
you
know
there
were
a
couple
shortcomings
there.
Specifically.
You
know
it
was
hard
to
ingest
via
both
kafka
and
the
rest
api.
If
you
wanted
to
get
instantaneous
feedback,
you'd
want
the
rest
api,
but
using
that
with
those
scripts
wasn't
possible,
and
another
thing
was
that
we
had
these
opaque
json
blobs,
that
you
would
ingest
into
data
hub
and
it
was
based
on
avro,
which
is
a
serialization
format.
Similarly,
protobuf
the
issue
is
that
it's
it's
pretty
difficult
to
use.
There
are
a
lot
of
sharp
edges.
P
People
run
into
issues,
not
know
what
the
schema
was
for
the
or
you
know
how
to
format
their
data,
and
so
we
get
a
lot
of
questions
around
what
what
is
even
possible.
P
With
that,
and
specifically
you
know
not
having
type
annotations
around
that
was-
was
something
that
a
lot
of
people
struggled
with
and
they'd
run
into
a
bunch
of
runtime
errors
when
they
try
to
try
to
execute
their
code,
but
they
wouldn't
actually
have
any
prior
warning
and
then
the
other
thing
was
that,
in
order
to
configure
your
metadata
ingestion,
you
need
to
go
and
modify
a
bunch
of
code,
which
was
not
the
ideal
situation.
Ideally,
you
just
modify
some
configuration
and
then
the
code
remains
the
same.
P
What
we
found
is
that
people
wanted
to
stick
with
python
because
of
the
the
ecosystem
around
it
all
the
open
source
projects.
You
know
things
like
airflow
and
numpy
and
tensorflow
and
so
forth,
and
so
they
were
used
to
this.
They
wanted
to
continue
to
use
it
to
ingest
data
into
data
hub,
and
we
wanted
a
principled
way
to
make
that
happen.
P
You
write
and
then
forget-
and
you
know
you
don't
necessarily
know
that
it
was
processed
correctly
until
you,
you
receive
the
audit
event
or
the
failed
metadata
event
back,
whereas
with
the
rest
api,
it's
a
little
bit
more
instantaneous
and
so
for
different
use
cases.
I
wanted
to
use
different
things.
P: We wanted to enable that. We were inspired by Apache Gobblin for the architecture of this, and I'll go into a little more detail on what that means. And the final thing we made sure to do when architecting this was file-based configuration: you write configuration in a YAML or TOML file, and then you can just run the DataHub ingestion framework against that config.
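For illustration, a recipe file along these lines pairs a source with a sink. This is a minimal sketch based on what the demo shows; the exact option names are assumptions rather than quotes from the repo:

```yaml
# recipe.yml -- illustrative ingestion recipe (option names are assumptions)
source:
  type: mysql
  config:
    host_port: localhost:3306
    username: datahub
    password: datahub

sink:
  type: console   # or a DataHub REST / Kafka sink, as he describes next
```

Running the framework against this file is then a single CLI invocation, and retargeting the sink is a config change rather than a code change.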
P: So how did we architect this? As I mentioned, it was inspired by Apache Gobblin. Right now we have two main abstractions: one is the source and the other is the sink. The sinks are the pieces that take an event of some sort and write it into DataHub, and that can happen over Kafka, over REST, or, for debugging purposes, you can just dump it to the console or write it to a file.
P
This
is
a
central
concept
in
data
hub
every
single
event,
or
every
single
change
in
metadata
is
modeled
as
a
metadata
change
event,
and
you
can
update
you
know
basically
do
all
of
the
operations
that
you
want
to
do
by
emitting
a
number
of
mcds
or
metadata
change
events,
and
then,
finally,
we
had
we
have
the
sources,
and
these
are
wide
and
varied
everything
from
databases
to
a
file
to
even
like
ingesting
the
metadata
of
kafka
itself,
and
as
long
as
the
source
can
create
a
metadata
change
event.
P
So
I
will
talk
a
little
bit
more
about
what
it
actually
takes
to
add
a
source
as
well.
P
So
you
know,
as
with
all
live
demos,
I'm
gonna
take
a
stab
at
it,
but
you
know
things
go
wrong
in
live
demos,
so
bear
with
me.
Hopefully
you
can
all
still
see
my
screen
here.
P
So
all
of
this
is
in
the
metadata
ingestion
directory
the
easiest
way
to
install
it.
We
can
build
the
schemas
from
the
rest
of
data
and
then
we
have
a
relatively
simple
set
of
commands.
You
can
just
copy
and
paste
this
into
your
terminal
and
get
set
up
immediately.
P: The first thing you might want to do is ingest some sample data, and the way to do that is "datahub ingest"; we included a number of examples.
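A minimal sketch of that setup-and-run flow; the paths and recipe names here are assumptions, not the repo's actual ones:

```bash
# From a checkout of the datahub repo (paths are illustrative):
cd metadata-ingestion
python3 -m venv venv && source venv/bin/activate
pip install -e .    # installs the `datahub` CLI into the virtualenv

# Ingest one of the bundled sample recipes:
datahub ingest -c examples/recipes/example_recipe.yml
```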
P
So
if
we
run
this,
we'll
see
oops
that
that's
not
supposed
to
happen,
as
with
all
demos
bear
with
me
one
sec,
I
will
worry
about
debugging
that
after
theoretically
it
will
work.
So
we
will.
We
will
fix
that
later.
P
But
what
what
you'll
see
is
you
know,
you'll,
go
to
data
hub
and
go
under
data
sets
and
we'll
see
some
stuff
here,
we'll
fix
the
ingest.
The
other
thing
that
that
I
did,
I
I
used
to
be
a
student
at
yale
and
I
ran
a
course
selection
tool
there
with
a
mysql
database,
and
so
I
actually
ingested
this
real
world
database
into
datahub.
P: This is the configuration; there's a username and password above where I've scrolled. We have the host and port, and then we have a filter rule, so you can filter out certain MySQL tables and allow the rest. Initially I just printed the output to the console, but we can actually write the events to DataHub, over Kafka this time, let's say. So (oops, got to change the recipe) we can run it.
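The recipe he describes would look roughly like this; the filter syntax and sink options are a sketch inferred from the demo, not copied from the repo:

```yaml
# mysql_to_kafka.yml -- illustrative recipe for the demo above
source:
  type: mysql
  config:
    host_port: localhost:3306
    username: reader
    password: example
    table_pattern:
      deny: ["mysql.*"]   # filter out MySQL's internal tables
      allow: [".*"]       # allow the rest

sink:
  type: datahub-kafka     # swap in `console` to just print the events
  config:
    connection:
      bootstrap: localhost:9092
```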
P: It will configure all of the ingestion, then go through and ingest a number of tables, and then you get a nice summary of what happened: it filtered out all of the MySQL internal tables, and here are all of the tables it actually fetched. And so, if we go into DataHub, we can take a look at, let's say, the students table, and we get the full schema.
P
We
get.
You
know
a
bunch
of
other
information
related
to
the
tables
that
we
just
ingested
yeah.
So
that's
the
that's
the
usage
side
of
things.
Let's
talk
a
little
bit
about
what
it
takes
to
add
a
source,
so
the
simplest
source
that
we
have,
let's
start
with
the
source,
is
py.
P: These are the three operations that a source needs to support, and if it supports those three, we can integrate it into the rest of the framework. As an example, here's a very simple source that just reads from a file.
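In spirit, the contract looks something like the following standalone sketch; the method names (`get_workunits`, `get_report`, `close`) are assumptions about the interface rather than quotes from the code:

```python
# Illustrative sketch of a file-backed source; names are assumptions.
import json
from typing import Iterable


class FileSource:
    """Reads newline-delimited JSON metadata events from a file."""

    def __init__(self, filename: str) -> None:
        self.filename = filename
        self.events_produced = 0

    def get_workunits(self) -> Iterable[dict]:
        # Yield one metadata change event (MCE) per line; the framework
        # hands each one to the configured sink (console, REST, Kafka).
        with open(self.filename) as f:
            for line in f:
                self.events_produced += 1
                yield json.loads(line)

    def get_report(self) -> dict:
        # Feeds the end-of-run summary shown in the demo.
        return {"events_produced": self.events_produced}

    def close(self) -> None:
        pass  # nothing to clean up for a plain file
```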
P
So
it
reads
that
and
then
you
know,
constructs
a
metadata
change
event
out
of
it
using
the
gen
classes
and
then
sends
them
into
the
rest
of
the
framework,
and
you
know
that's
that's
how
simple
it
is
to
add
a
source,
a
slightly
more
complex
one.
Let's
take
a
look
at
the
source
for
kafka,
so
this
ingests
metadata
about
the
topics
and
partitions
and
so
forth
in
kafka,
and
it
sends
them
into
data
hub,
and
so
here
it's
a
little
bit
more
complex.
You
connect
you
construct
your
your
consumer
and
then
similar
thing.
P
You
know
using
things
like
aspects
and
snapshots
and
all
of
these
things
that
we're
relatively
familiar
with
as
long
as
you
can
do
that,
then
you
know
the
source
plugs
into
the
rest
of
the
framework.
Accordingly,
all
right,
so
last
thing
I
wanted
to
talk
about
touch
upon
is:
where
are
we
headed
with
this.
C: Can I ask a question? Can you walk us through the codegen that you're doing from the Avro schemas? I thought that was one of the hard things about this project.

P: Yep, sure. Let's take a look here. The .avsc file, the Avro schema file, looks like a big JSON blob, with all of the fields and everything. Accordingly, we have a codegen system that I took from an open source project and modified, and I'm also working to contribute that back.
P: So take, say, the ML properties class; actually, let's do the dataset snapshot. When you construct it, the type system is attached automatically and it all associates correctly. This way, if you run mypy on this code, it completely checks every single assignment and every single constructor, so you know for sure that the code is at least semantically correct in terms of types. So that's a little bit on the codegen; and obviously it generates a truly massive file, about 5,000 lines.
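As a rough illustration of what the codegen buys you; the class names below are stand-ins for the generated ones, not quotes from the generated file:

```python
# Stand-ins for codegen'd classes, to show the typing benefit.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetPropertiesClass:    # stand-in for a generated aspect class
    description: str
    tags: List[str] = field(default_factory=list)


@dataclass
class DatasetSnapshotClass:      # stand-in for a generated snapshot class
    urn: str
    aspects: List[DatasetPropertiesClass]


snapshot = DatasetSnapshotClass(
    urn="urn:li:dataset:(urn:li:dataPlatform:mysql,db.students,PROD)",
    aspects=[DatasetPropertiesClass(description="Course enrollment data")],
)

# Because every field is annotated, `mypy` rejects mistakes such as
# DatasetPropertiesClass(description=123) before the code ever runs.
```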
P
So
you
know
glad
we
aren't
writing
this
by
hand
and
updating
it
did
that
answer
your
question.
P
Yeah,
so
let's
talk
a
little
bit
about
where
we're
headed
with
with
ingestion,
so
we're
going
to
do
a
more
formal,
rfc
process
on
this
relatively
soon
and
then
hopefully
also
publish
a
package
to
pi
pi.
So
you
can
pip
install
you,
know,
data
hub
or
something
along
those
lines
and
then
start
executing
this,
and
you
don't
need
to
do
the
code
gen
yourself
and
all
of
those
other
steps.
P: The other things are more functional improvements. For example, detecting when metadata is stale: if you've deleted a table in your source, let's say MySQL, we want to be able to detect that it disappeared and perform the associated deletion in DataHub. Right now it's a purely additive process; we can do updates, but deletes are not something that we support. Another thing is validating that metadata was actually ingested correctly: if you run with Kafka and you just send, let's say, 30 events to a Kafka topic or to the broker, you don't necessarily know that those all got accepted correctly, so doing that validation step would also be very helpful. And then there's the Java ingestion side: we have Python and we also have Java. If people really love the Python, we'll invest more in that; if people still want to be able to use the Java, we'll do that accordingly as well. And the last thing is standardizing a testing harness between the two, so we can test functional parity between the Java and the Python.
K: I have one, Harshal: I was hoping you could talk a little bit about how someone would productionize this, say in Airflow or something.
P: Yeah, sure. We include a couple of sample DAGs in our repo, so you can use this directly within Airflow. It's actually quite simple: let's say for MySQL, you just create a pipeline, give it a configuration and call run. As long as you can do this within a PythonOperator in Airflow, you can run it, and you'll get all the standard error reporting out of Airflow as you would expect.
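A minimal sketch of that pattern; the `Pipeline` import path and the config keys are assumptions, so check the sample DAGs in the repo for the real ones:

```python
# Illustrative Airflow DAG; module paths and config keys are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_mysql() -> None:
    # Assumed import path for the framework's pipeline runner.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {"host_port": "localhost:3306", "username": "reader"},
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()


with DAG(
    dag_id="datahub_mysql_ingest",
    start_date=datetime(2021, 2, 1),
    schedule_interval="@daily",  # recurring ingestion, not a one-time import
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest", python_callable=ingest_mysql)
```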
P
So
it
is,
it
is
production
ready
already,
and
you
know
you
can
use
it.
However,
you
like,
I
think
it
does
make
sense
to
run
it
within
an
orchestration
framework
like
airflow
or
even
cron.
If
you
want
something
simple
so
that
you
can
continually
get
updated
metadata
from
a
from
a
given
source,
instead
of
just
a
one-time
import.
C
Yeah
this
is
cool,
because
I
know
that
data
hub
has
this
positioning
almost
as
push-based
right
and
so
a
lot
of
people
think
that
oh,
it's
push-based.
So
I
cannot
pull
metadata
into
this
system,
but
I
think
this
kind
of
shows
the
people
like
how
easy
it
is
to
once
you
have
a
push
based
system.
You
can
always
add
on
a
pull
based
system
upstream
of
it
to
essentially
essentially
pull
metadata
into
your
system,
and
you
don't
have
to
really
choose
between
the
two
yeah
you
can.
C
You
can
do
both
if
you
want
all
right,
so
I
had
just
very
small
logistical
things
to
go
over
very
quickly.
C: Okay, a quick thing on the next meeting: we got a lot of folks from the EU who have struggled to join because it's a bit late for them. So for the next one we are going to move up the slot and make it 7 a.m. That still allows folks on the west coast to join, as long as they set their alarms correctly, and folks in the central and eastern time zones will be doing just fine.
C
I
think
it
also
works,
so
our
next
meeting
is
going
to
be
third
friday
of
march,
so
that's
turns
out
to
be
19th
again,
thanks
to
february
being
exactly
four
weeks
right,
so
19th
march,
we're
gonna
be
back
7
a.m.
That's
the
next
meeting!
C
That's
all
I
had
on
logistics.
There
were
a
couple
of
questions
on
the
town
hall
meeting
that
I
wanted
to
quickly
go
over
maggie,
hopefully
got
her
answer
on
dashboard
elements.
G: Just one clarifying thing there: is the idea that we should migrate over to the React app for dashboard elements?
C
Yes,
thanks
for
asking
we
essentially
it's
very
hard
to
continue
supporting
the
community
on
the
amber
app.
It's
always
been
hard,
so
the
react
app
is
where
we
will
be
doing
new
development
and
supporting
the
open
source
community.
So
that's
where
you
should
plan
to
move
towards.
C: That was the reason we were pushing to get to feature parity with the Ember app as well.
G: Got it, that sounds great. I just wanted to signal that SpotHero, the SpotHero data engineering team, is still planning on contributing back our Looker integration to the community. That's something we've had to punt on a little bit for some other work, but we should have that out maybe end of this quarter, more likely Q2.
C
That
would
be
great,
and
you
know
hopefully
it
it.
The
new
ingestion
framework
helps
you
with
rewrite,
like
writing
the
least
amount
of
code
for
it,
and
and
finally,
you
can
turn
off
that
hack
to
make
looker
dashboards.
Pretend
to
be
data
sets
so
yeah.
A: Yes, it helped me a bit, but I'm really still a DataHub newbie. I saw in your presentation before that you actually had this sort of schema for a dataset that was a table; it looked like a relational table, with the column descriptions and all that. So that would answer my question.
C
Yeah,
I
think
I
think,
for
those
classic
metadata
that
is
just
on
the
columns.
We,
the
the
standard
injection
script,
does
pull
it
in.
C: If you do something custom, where you have very custom table properties or column properties that you're squirreling away rich metadata in, then talk to us about how to convert and extract that metadata and make it more structured; we have some ideas around that. Harshal did not talk about the extractor concept: inside the source-to-sink pipeline, the extractor is basically a pluggable thing that allows you to extract more metadata than the default metadata extracted by the out-of-the-box source.
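In outline, the idea is something like this hypothetical sketch; the hook's actual name and signature in the framework may differ:

```python
# Hypothetical pluggable extractor; names and shapes are assumptions.
from typing import Iterable


class CustomPropertiesExtractor:
    """Turns raw source records into metadata events, promoting extra
    metadata (e.g. rich table properties) that a default extractor skips."""

    def get_records(self, workunit: dict) -> Iterable[dict]:
        event = dict(workunit)  # start from the default metadata
        # Promote metadata squirreled away in table properties into
        # structured fields on the event.
        props = workunit.get("table_properties", {})
        if "owner_email" in props:
            event["owner"] = props["owner_email"]
        yield event
```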
C: So those are ways in which you can customize the extraction of metadata on top of the default pipeline. (Cool, thanks, will do.) Awesome, we are at time, so I will let you all go. Thanks to the EU folks for staying up late for this, and we'll see you in four weeks.