From YouTube: Etsy Adoption Journey
Description
Vishal Shah, Data Engineer, describes the adoption journey of DataHub at Etsy.
Recorded at: DataHub August 2022 Town Hall
So for anyone who is not familiar, Etsy is a two-sided, global marketplace for handcrafted and antique pieces, founded in 2005. It is an e-commerce platform based out of Brooklyn, New York, with offices and staff around the world.
To give an idea of the size and complexity of our data: Etsy has over 7 million active sellers, 93 million active buyers, and over 100 million unique listings.
Our production data is stored across hundreds of shards in MySQL, and we have a data warehouse in BigQuery, among other data sources as well.
Our platform has seen tremendous growth over the years, and with that, so has our data; and with the data, so has our metadata, which takes me to our journey to a new data catalog.
Both tools had been in maintenance mode for some time, and we knew it was time to reinvest in data discovery. Cut to April 2021: the data engineering teams at Etsy wrapped up our data warehouse migration from Vertica to BigQuery and formed our new Data Discovery team.
We learned that the main issues were around discovery and trust. It was hard to find the right datasets, and it was unclear which datasets were reliable, where they came from, and how they were being used. Much of this information came from tribal knowledge, which, at this stage of growth, was no longer sustainable for us.
I want to call out that, while this wasn't complex on the engineering front (there really wasn't much engineering at all that entire month), these interviews helped our team learn about the problem space better and become invested in our team's mission. So the next step was to find a solution, and we split up into squads to investigate all of our options.
There were quite a number: what if we extended Schemer? What if we built a new in-house tool? What if we paid for a fully managed solution that we could get off the ground faster? Or what if we used an open-source solution? So we dove into 30 tools over the course of a month and POC'd a couple of them. Our proof of concept of DataHub involved setting up a local instance, adding a custom source, and adding a custom aspect to the metadata model.
So after months of research, we were finally ready to move forward with DataHub. Now for the fun engineering part. Along with DataHub, there were many technologies that our team did not have much experience with, but we did have the advantage of other teams at Etsy owning instances of GKE and Kafka, and having expertise in BigQuery, to guide us along the way. We also set up a Cloud SQL database and used managed Elasticsearch in our infrastructure setup. So we started small and integrated along the way for ingestion.
After months of implementation work, by April of this year we were ready to launch DataHub at Etsy. To date, we have over 600 total users, and about 45 are active each month. We've ingested 11,000 datasets across five data platforms, and the number increases each day.
We're also well on our way to turning off our in-house lineage tool to consolidate all of our data discovery efforts into DataHub. It wouldn't be quite a journey if there weren't any bumps along the way, so I want to take the next couple of slides to highlight some of our learnings as they pertain to implementation and governance.
The cool part about using an open-source solution was that we weren't left in the dark, having to figure out every solution by ourselves, or having to fully rely on another team to fix all the issues for us. We were able to fit DataHub into our ecosystem, and made a couple of upstream contributions along the way.
Maybe the one with the most impact for us was around the BigQuery ingestion. We have projects at Etsy used only for storage, and other projects used for running queries and jobs. This didn't immediately work with the BigQuery ingestion, because at that time it only took in a single project ID.
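For context, a DataHub BigQuery ingestion recipe is a small YAML config along the lines of the sketch below. This is a minimal, illustrative fragment: the project and server values are placeholders, and field names vary between DataHub versions (at the time described here, the source accepted only a single project ID per recipe).

```yaml
# Minimal sketch of a DataHub ingestion recipe for BigQuery.
# Values below are placeholders; exact config fields depend on the
# DataHub version in use.
source:
  type: bigquery
  config:
    # At the time, only one project could be ingested per recipe,
    # which is awkward when storage and compute live in separate projects.
    project_id: my-storage-project
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```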
However, I should add that not all changes were as seamless. For ingestions such as LookML, and some custom sources that read from GitHub repositories, we were surprised that we had to clone these repos into a custom image. And lastly, for MySQL ingestion, a hurdle that we came across was around how to profile sharded databases.
Profiling off of one host would not capture the full statistics for a table that is sharded over hundreds of hosts, and profiling against our production database was turning out to be a bit complex and expensive, and a concern for our production systems. So for now, this one is still on hold and something that we're discussing on our team.
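To make the sharding problem concrete, here is a minimal sketch (not Etsy's actual profiler) of why one host isn't enough: simple statistics like counts and min/max can be merged across shards, but only if you collect them from every host.

```python
from dataclasses import dataclass

@dataclass
class ShardProfile:
    """Column statistics as a profiler might collect them from one MySQL host."""
    row_count: int
    null_count: int
    min_value: int
    max_value: int

def merge_profiles(shards: list[ShardProfile]) -> ShardProfile:
    """Combine per-shard profiles into full-table statistics.

    Counts add across shards; min/max take the extremes. Note that
    distinct counts and quantiles do NOT merge this simply (they need
    sketches such as HyperLogLog), which is part of why profiling a
    sharded table is genuinely hard.
    """
    return ShardProfile(
        row_count=sum(s.row_count for s in shards),
        null_count=sum(s.null_count for s in shards),
        min_value=min(s.min_value for s in shards),
        max_value=max(s.max_value for s in shards),
    )

# Two shards of the same logical table: either one alone under-reports
# the row count and misses the true min or max.
shards = [
    ShardProfile(row_count=1_000, null_count=10, min_value=5, max_value=900),
    ShardProfile(row_count=2_000, null_count=0, min_value=1, max_value=450),
]
full = merge_profiles(shards)
print(full)  # ShardProfile(row_count=3000, null_count=10, min_value=1, max_value=900)
```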
So our approach is to start with ownership: to make adding an owner to a dataset extremely easy and accessible, directly in the creation code for that dataset, and then to add more governance rules from there. Even for lineage, for Airflow the question comes down to: do we ask users to add inlets and outlets manually to their DAGs?
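The ownership-at-creation idea can be sketched as follows. This is illustrative only: it builds the shape of a DataHub-style dataset URN and ownership payload as plain dicts, whereas a real pipeline would emit them through the DataHub SDK. The dataset name and owner username below are made-up placeholders.

```python
def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub-style dataset URN for a given platform and name."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

def ownership_payload(owner_username: str) -> dict:
    """Ownership metadata: a single technical owner, keyed by corp-user URN.

    In the approach described above, a call like this would live directly
    in the code that creates the dataset, so ownership is never an
    afterthought.
    """
    return {
        "owners": [
            {"owner": f"urn:li:corpuser:{owner_username}",
             "type": "TECHNICAL_OWNER"}
        ]
    }

# Hypothetical example values, not real Etsy identifiers:
urn = make_dataset_urn("bigquery", "my-project.sales.orders")
aspect = ownership_payload("jdoe")
print(urn)
print(aspect)
```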
So that's all I have for today. I want to give a huge thank you to the DataHub team for all of the help and support along our journey, with our many questions. Also a special shout-out to the Data Discovery team at Etsy for their support with this presentation. I can't wait to see all the cool stuff that we work on with DataHub, and if you have any questions or success stories that you would want to share with us, feel free to reach out to me on the DataHub Slack.