From YouTube: Jumio's DataHub Adoption Journey
Description
Ray Suliteanu (Jumio) shares the Jumio team's DataHub adoption journey, the problems they set out to solve, and what they have learned along the way, during the March 2023 Town Hall.
DataHub Public Roadmap: https://feature-requests.datahubproject.io/roadmap
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
Thanks, Maggie and the DataHub team, for giving me the opportunity to speak today. I'll just dive right in. Here's the agenda: a little bit about Jumio, sort of the obligatory kind of thing, and then mostly why we're using DataHub, how we're using it, and some discussion of adoption and future state.
So briefly about me: I've been at Jumio just about two years; it'll be two years in May. But I've been doing this quite a while, more than 30 years as a professional engineer, mostly doing back-end and enterprise software. I've done everything from a handful of individuals at a startup to multinationals like HP, with hundreds of thousands of people. And for those of you who know the University of California, Santa Cruz: I'm a Banana Slug, and my degree is in computer science.
But you also want to try and do that with a good user experience, particularly on the onboarding side of things, because conversion is important. At the same time, there's been a lot in the news about fake accounts and those kinds of things, and you don't want bad actors getting into your system either. Doing all of that is obviously pretty complex: there are a lot of different checks you can do and a lot of different data sources out there. Each one of these different kinds of sources could actually be a separate company.
Obviously, doing onboarding and fraud prevention yourself is expensive and costly, with a lot of different problems you could run into, and of course that's where Jumio comes in. We have our no-code orchestration platform that pulls together all of these data sources and combines that with our own AI and machine learning and APIs to help you address the fraud and compliance challenges that people have. To date we've processed over a billion transactions, supporting 200 different countries. Obviously, if you're talking about fraud prevention, onboarding, and identity verification, there are different types of documents and proof: in addition to just doing selfies, being able to scan passports or driver's licenses or other national identity cards. You can just imagine how many different types there are across all the different countries, and states within countries, and so being able to do that analysis and data extraction from all these different sources is the challenge.
The machine learning work that we've done is actually relatively new, given the overall history of Jumio of around 10 years; the machine learning aspects have only been around for about four years or so. It predates me, but it was sort of an add-on, if you will. And not just related to machine learning: there are a lot of productivity challenges.
Just data discovery, access to the data, understanding the data: the typical data challenges that companies have. Combine that with a global data set and workforce: as I mentioned, all the different data sources, the different countries, and the distributed nature of our teams, where the people doing machine learning are in North America, Europe, and India. That poses additional challenges.
When you combine that with the legal and regulatory challenges of where you can have the data, GDPR kinds of concerns, and not just GDPR but plenty of other regulations and laws coming up around the world, including CCPA in California and BIPA coming out of Illinois: having to deal with all these different legal and regulatory challenges as part of our data management has also been a challenge. And from a productivity perspective, some of the consequences of that have to do with duplicated work.
Because these distributed teams don't have the discovery and access that we're hoping they would have, there's a lot of duplication: people doing the same, say, feature calculations or data set generation. That obviously would, or could, be resolved if we had a way to make this information much more available. And as I mentioned, the compliance aspect is also a big challenge there as well. So what have we got today?
Our current, pre-DataHub setup: we're an AWS shop, and we use three different regions within AWS, the US, EU, and Singapore regions. We have a homegrown data set management tool and user interface that's built on Athena, and our data is primarily in S3. As I mentioned, doing identity verification and things like that, there's a lot of image and video data that we have, and so all of that's stored in S3, for example.
On top of that homegrown service, we don't have any kind of single sign-on, and along with that, only very basic role-based access control. The search is pretty simplistic: we don't have something like Elasticsearch or a similar search engine.
It's just your typical database queries, and the metadata is also limited to data sets. So you can search for data sets, but there's no other kind of metadata, whether that's models or model features or jobs, all the other kinds of metadata, and obviously none of the related linking that you might be able to get. And on top of that, essentially no governance, other than who can log in and work with the data sets.
So what did we do? We started this effort as part of a larger data initiative that I launched, probably a year and a half ago now. It was more broad than just "hey, we need a data catalog"; it was more of an overall strategic initiative, including data mesh.
Within the data mesh idea you have discovery, and discovery was one of the big gaps, so we started looking around at different open source and commercial products. We had a bunch of different factors that we were evaluating, and of course, what did we do? We ended up picking DataHub. You also see Acryl Data on the bottom there; as I'll mention in a little bit, we did actually look at the managed DataHub as well.
So yes, DataHub was our choice, and how did we end up doing that? We decided, for the time being, to use the open source DataHub and deploy it ourselves. One of the choices we made was to use the out-of-box Helm charts without any modifications. This had a couple of consequences, one of which is that we're using EKS, even though most of our services in production are actually using ECS. We do have some Kubernetes usage in Jumio, and given that the DataHub deployment was already working with EKS, we didn't want to take the additional effort to figure out how to get it working with ECS. It might have been easy, but we didn't want to give that a try, given the staffing, resources, and time that we had. And we also wanted to do serverless, so we chose EKS Fargate.
This had some interesting knock-on consequences for us. We use Datadog for logging, and with the out-of-box Helm charts, without changing them, we couldn't add the Datadog sidecar, for example. So we ended up looking at, and using, a thing called KubeMod, which allows you to more dynamically deploy, in this case, sidecars. That's what we did, so we could continue to use the out-of-box Helm charts. And as I mentioned, one of the big things we didn't have in our prior solution was single sign-on. Jumio uses Okta for its single sign-on service, and we've integrated DataHub with our corporate Okta.
This is all deployed in the EU region. This is something Jumio has done in general for most of its data-related systems: as an easy way to deal with data locality requirements, we just keep everything in the EU, or move it into the EU if it's not generated there. But all of the existing data sets are still in the other regions, in the US and Singapore.
What we wanted, obviously, is a single repository for everybody to do discovery, as opposed to the current situation where, if somebody wanted to find a data set and didn't know which region it was in, they would have to log into potentially three different locations to find it; or, if they needed data from all regions, they would have to go to every region.
We built what we're calling an adapter service, which is basically a Spring Boot application running in the EU in EKS that takes the metadata, transforms it into the DataHub format using the Java clients, and pushes it into DataHub. That's how we've deployed it. While it's not quite in production yet, we're in the final stages of testing and everything is syncing correctly; basically we just have to do our final bit of testing and back-filling of all of the existing data sets that are out there in the other regions, and then we'll actually be deploying it to all of Jumio. So, speaking of deployment and adoption: we are intending to open it up to all of Jumio.
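To make the adapter idea concrete, here is a minimal sketch of the transformation step. The actual service is a Spring Boot application using the DataHub Java client; this Python version, with hypothetical homegrown field names, just shows the shape of mapping a record into a DataHub dataset URN and a metadata-change-style payload.

```python
# Minimal sketch (not Jumio's actual adapter, which is Java/Spring Boot):
# map a homegrown data set record into a DataHub dataset URN and a
# REST-style metadata change payload. Homegrown field names are hypothetical.

def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub dataset URN, e.g.
    urn:li:dataset:(urn:li:dataPlatform:s3,bucket/path,PROD)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

def to_datahub_proposal(record: dict) -> dict:
    """Transform one homegrown record into a datasetProperties proposal."""
    return {
        "entityType": "dataset",
        "entityUrn": make_dataset_urn(record["platform"], record["name"]),
        "aspectName": "datasetProperties",
        "aspect": {
            "description": record.get("description", ""),
            # Record which source region the data set lives in.
            "customProperties": {"region": record.get("region", "eu")},
        },
    }

proposal = to_datahub_proposal(
    {"platform": "s3", "name": "selfie-images/us",
     "region": "us", "description": "Selfie capture data"}
)
print(proposal["entityUrn"])
# urn:li:dataset:(urn:li:dataPlatform:s3,selfie-images/us,PROD)
```

In the real service the payload would be emitted to DataHub's metadata service over REST; here it is just constructed, which is enough to show why a thin adapter can back-fill three regions into one catalog.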
We've had a lot of situations, and I've been in meetings, where people have asked, "what's this particular field or data about?", and I would have side conversations with colleagues on Slack like, "well, if we just had our data catalog, our discovery service, deployed, this product manager or support person wouldn't need to ask that question; they could just go and find out for themselves." But our initial target users are the data scientists and ML engineers within Jumio, because we are initially just deploying to get the data set information into DataHub, since, again, that's the only metadata management we have. But we have a lot of tools, as I was alluding to earlier, diverse tool chains; we use things like Databricks and Airflow and SageMaker, and getting all of that metadata into DataHub is where we're going.
Some of the enhancements we're looking for are actually exactly what Shirshanka was mentioning earlier around schemas, and that was key for us, because we have some use cases where we're interested in the schema and not so much the data sets. For example, as part of that KYX platform that I was mentioning, there's a rule engine, a typical business rule engine, as some of you might be familiar with. Those have data models that need to be defined: a fact model. "Facts" are the rule engine's lingo for the data. These fact models are backed by schemas; today they're actually JSON schemas, and the team using them has just sort of built their own little storage mechanism in S3 to load these schemas. But obviously it would be nice if they could put those into this common service that we're deploying, built on DataHub, and be able to just save and retrieve these schemas, so that the rule engine, for example, knows what the incoming data is.
So having schemas as a first-class entity is going to be great, and the schema registry support, where we can then also leverage that with other services that know about the Confluent Schema Registry, will be another great thing. One thing that recently came up as we were deploying is roles associated with groups; I submitted a feature request for that just recently, actually. It's not a showstopper, but it would certainly help us in managing the security model.
The final thing we're looking for: I mentioned at the beginning that the initiative that started the search for a metadata service was around data mesh, and so I had started a discussion around DataHub supporting this notion of data products. Maggie actually went and created a Slack channel for that, and there's been some good discussion; it's an ongoing conversation.
It's going to be interesting to see where that goes, but for Jumio that's still a roadmap item; it's been sort of slow-rolled due to a variety of situations. But where are we going from here? As I said, we need that schema repository.
The data that we have is relatively poor around schemas; it hasn't been a focus at Jumio, for a variety of reasons. So the work is just getting people to create schemas and have that documentation and field-level metadata: being able to specify things as PII, and having that strongly typed information. While we have the data sets today, the schema metadata that we have doesn't include documentation, and certainly not field-level documentation. We do know which fields are PII, but we don't have distinctions like: is this biometric data versus just address data, or something like that? So expanding on schemas is really what the focus is going to be at Jumio in the near term. Then, obviously, starting to integrate other metadata is the other top priority. As I mentioned, we're using SageMaker and Airflow and Databricks and several other tools, Kolena for model testing; hence that last bullet about data and analytics test automation.
It's one thing on the data side, with integration with things like Great Expectations, but on the model testing side, where there are tools out there like Kolena, which is one of the tools we're using, getting all that metadata, with the lineage and governance around it, is going to be key. And finally, hopefully we'll be able to actually switch to the managed DataHub from Acryl and not have to deal with all of this ourselves.
It was more of a business-level issue that kicked that can down the road, and we had been doing a parallel effort to stand up the infrastructure at the same time, just in case the can was kicked down the road, as happened. So one of the things I'm still looking for is ultimately to switch to the managed DataHub down the road. That's really all I had, given I'm trying to keep within the time frame here, so here's some contact information.
If you want to reach out, I'm happy to discuss any of this, obviously on the DataHub Slack, but feel free to reach out with anything else. Thanks a lot.