Description
Surya Lanka and Shirshanka Das (Acryl Data) give a demo of how stateful ingestion works in DataHub, ensuring that you ingest only net-new metadata to minimize redundancy and optimize ingestion performance.

A
Hello, everyone. One of the things that we've done recently and are about to roll out in the next week or so is support for stateful ingestion. It really improves our ability to not unnecessarily ingest old stuff again and again, and it gives new ingestion connectors the ability to essentially pull only relevant information every time they run. So, let's get into it.

A
So let's start with an example of a connector that we already have: the Snowflake usage source. Let's look at how it works today. The Snowflake usage source starts up, and it has in its configuration a variable called start time, and that start time can be specified in your configuration. So you can say the start time is yesterday, or the start time is the day before yesterday, and depending on what you specify, the Snowflake usage connector says: okay, I'm going to start from this time and pull all usage logs from this start time onward, bucketed at a granularity, and the default granularity is one day. So essentially, every time it starts up, the Snowflake usage connector pulls a window of data from the source, and that window is defined by the start time parameter.

A
Now, the start time can be specified in config, or, if you don't specify it, the source will just default it to now minus one day, so essentially it'll pull data for the previous day. Once it has done all of that, it does aggregations in memory and then pushes grouped usage stats, grouped by dataset and bucketed at this day granularity, into DataHub, and that powers a bunch of features. If you go to the DataHub demo, you'll see monthly queries, you'll see top users, and you'll also see column-level stats and recent queries.
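
As a concrete illustration of the recipe being described, a Snowflake usage recipe with an explicit time window might look roughly like the sketch below. This is only a sketch: the connection values are placeholders, and field names such as start_time, end_time, and bucket_duration should be checked against the DataHub ingestion docs for the version you run.

```yaml
# Illustrative sketch only -- connection values are placeholders and option
# names may differ slightly between DataHub versions.
source:
  type: snowflake-usage
  config:
    host_port: example.snowflakecomputing.com   # placeholder Snowflake account
    warehouse: COMPUTE_WH
    username: datahub_reader
    password: "${SNOWFLAKE_PASSWORD}"
    start_time: "2021-09-23T00:00:00Z"   # explicit window start ("yesterday")
    end_time: "2021-09-24T00:00:00Z"     # window end; defaults to now if omitted
    bucket_duration: DAY                 # the one-day granularity mentioned above
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080        # DataHub GMS endpoint
```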

A
So if you rerun today's job twice, the ingestion connector is stateless, and it just says: okay, I guess I'll pull data for today again. And this is not just a problem for the usage source. Pretty much every single source today is memoryless, which means it starts up, crawls the whole world, and ingests metadata.

A
So we wanted to let sources remember where they last left off, to allow for incremental ingestion, so that we can reduce load on source systems and only produce relevant changes. Of course, when sources are pushing metadata out, this is all trivial, because you're right there in the context of actually making a change, so you can push metadata out when it changes. But for sources that are crawled, you actually need state to be able to ingest only the things that have changed. We also want these sources to be really simple: self-healing cron jobs.

A
What were the requirements? Well, first, we need temporal storage: we need state that is remembered across time. Second, we want to be able to query that state to get the last successful run for a particular window of time. And of course, there are a few options for storing this state. You could store it in your local file system, or in S3 or some other blob store, or in a database, or you could store it in DataHub, and we were like: that's interesting.

A
So what we ended up building just leverages the same time series metadata support that we added to DataHub to actually store state about metadata ingestion runs. If that's not meta enough for you, I don't know what is. So here's an example of how the Snowflake usage ingestion changes once we add this capability.

A
The first thing it does when it starts up is get the last successful run state from DataHub, and DataHub internally has temporal state for all of the previous runs. Once it figures out the start time based on the last successful run state, it then issues the query to Snowflake and gets the deltas. It does the same thing as before: it groups the usage stats and sends them over to DataHub. But in addition to all the metadata it produces, it also produces another piece of metadata, which is the ingestion run state, and that includes things like the run id, the pipeline id, the status, how long it took, other metrics, as well as the start time and the granularity. So that's basically the window for each of these individual ingestion runs.
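
As a rough mental model of the run state just described, you can picture a record like the sketch below. The field names here are purely illustrative and are not the actual DataHub aspect schema; they simply mirror the items listed in the talk (run id, pipeline id, status, duration, other metrics, window start, and granularity).

```yaml
# Illustrative only -- NOT the real DataHub aspect schema, just the shape of
# the run state described above.
ingestion_run_state:
  pipeline_id: snowflake_usage_demo      # hypothetical pipeline identifier
  run_id: "2021-09-22-run-001"           # hypothetical identifier for this run
  status: SUCCESS                        # outcome of the run
  duration_ms: 84000                     # how long the run took
  metrics:
    events_produced: 1250                # other run metrics
  window:
    start_time: "2021-09-21T00:00:00Z"   # start of the ingested window
    bucket_duration: DAY                 # granularity of the window
```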

A
Surya is going to do a quick demo for you here, and just before he gets started, I want to give you an example of what we will be demoing. So today is September 24th in UTC time, and we wanted to show you what it would look like if ingestion was run two days ago, then rerun, and then we move time forward to now and run ingestion again. So we basically want to show how ingestion runs.

B
So for this demo we have done something very interesting: we have set up a Docker container that can fake time with a library, so you can see what it looks like. Before we do the ingestion two days ago, if I just ask, like this, what's the date, it says today is September 22nd.

B
And if I ask for minus three, what is it? It's the 21st. So it's actually changing the system time, basically. So now let's go ahead and do an ingestion. That's two days ago, okay, so from Snowflake.

B
Okay, done. So it has remembered what it did before for this day. Now we can actually, oh.

A
Yeah, it might be a good idea to just look at that recipe quickly.

B
So, okay, the only thing from here that will survive is this: the only additional piece of information this recipe needs is where the GMS server is. That's where the time series stats live, and in order to query these aggregations, or even push them, this is the endpoint we need. The other configuration is going to go away, so yeah, this is what it is. One thing that's interesting for you to notice is that there is no time specified anywhere here.
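
For reference, a stateful recipe of the kind shown in the demo might look roughly like the sketch below. Everything here is illustrative: the connection values are placeholders, and option names such as pipeline_name and stateful_ingestion should be verified against the DataHub ingestion docs for your version. The point is simply that the recipe carries the GMS endpoint but no explicit start time; the window is derived from the state of the last successful run stored in DataHub.

```yaml
# Illustrative sketch only -- check the DataHub ingestion docs for the exact
# stateful ingestion settings in your version.
pipeline_name: snowflake_usage_demo      # stable name so later runs can find their prior state
source:
  type: snowflake-usage
  config:
    host_port: example.snowflakecomputing.com   # placeholder Snowflake account
    warehouse: COMPUTE_WH
    username: datahub_reader
    password: "${SNOWFLAKE_PASSWORD}"
    stateful_ingestion:
      enabled: true                      # remember the last successful run in DataHub
    # note: no start_time / end_time -- the window comes from the stored state
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080        # GMS endpoint where the run state is stored and queried
```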

A
So, any questions so far?

A
Right. What is interesting, I think, is that even the DataHub ingestion itself will produce a pipeline instance, which will then allow us to look at the ingestion process itself as a pipeline that runs and produces metadata. So now you'll be able to look not only at your data pipelines and your datasets and the lineage across all of them, but also at your metadata pipelines, and make sure that they are running on time and that they are actually running successfully.

A
Cool. So, Maggie, I think we can go back to the slide deck. This will get rolled out to all the usage sources: Snowflake, BigQuery, and Redshift.

A
Another thing that we want to roll this out to is the dataset profiling sources. These tend to be our most expensive sources, and we want to start doing incremental profiling as well, and also start converting other sources, such as our SQL sources that are ingesting schemas and our local sources, to be stateful and only perform incremental ingestion. This will reduce your API bills as well as the time that it takes for these sources to run.