From YouTube: Dagster Declarative Scheduling of Software-defined Assets - Dagster Community Day - Dec 2022
Description
Dagster 1.1 introduces a system for scheduling data pipelines that allows you to escape writing workflows entirely. Instead, you directly specify how up-to-date you expect each asset to be, as well as how to determine whether source data has changed. Dagster then automatically schedules asset materializations to ensure that data arrives on time while avoiding unnecessary computation.
I'm Sandy, the lead engineer on the Dagster project, and I'm here to talk about functionality in Dagster 1.1 that introduces declarative scheduling for software-defined assets.

We use orchestrators to keep data assets, like tables and machine learning models, up to date. Scheduling data pipelines means managing change in data assets, which boils down to a few basic elements.
First, to compute our data assets, we typically run code. When that code changes, we eventually want to update our data assets to reflect the new logic. Second, we derive our data assets from upstream data. When that upstream data changes or grows, we eventually want to update our assets to incorporate those changes. And last of all, depending on how our data assets are used, we'll have different requirements for how up to date they need to be.

Most orchestration tools don't think in these terms; they think in terms of workflows, which causes a few problems.
First of all, it makes it awkward to express what should happen in some fairly common situations. For example, if some tables are hourly and others are daily, but those tables have shared dependencies, it's difficult to construct a set of DAGs and schedules that run in the right order and don't duplicate work.
Second, every time you add an asset, you have to find a DAG to put it in to get it scheduled. Then you have to worry about whether DAGs are getting too large and unwieldy, or too small and fragmented. And last, you get alerted when your task fails, not when your data is out of date, which is often what you actually care about. If the system can retry and self-correct before the deadline, then nobody needs to get paged.
Dagster 1.1 helps move beyond workflow-based scheduling by introducing a set of features that enable scheduling data pipelines in a declarative, asset-focused way. I'm going to take a deep dive into a few of these to give a taste of what this looks like.
One of these features is freshness policies. You can now construct policies that specify how up to date you expect your assets to be, and then use those policies for monitoring, alerting, and scheduling. To understand how this works, let's look at a simple graph of data assets.
There's a base table of events that we pull into our data warehouse from our app database, and two tables that summarize it for different business users. We want the first summary table to be pretty up to date: events are constantly streaming into our production database, and we want them to be pulled into our data warehouse and represented in this table within five minutes of when they arrive in production.
The second summary table is more expensive to compute, and we only actually care about looking at it once per day, at a team check-in meeting that happens at 9am. So we set freshness policies on these assets, and the way we do that is in code. Here's where we've defined our assets, and you can see the five-minute freshness policy on one and the 9am freshness policy on the other.
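The talk shows these definitions on screen rather than in the transcript, so here is a rough sketch of what a freshness policy carries. The field names `maximum_lag_minutes` and `cron_schedule` are modeled on Dagster 1.1's `FreshnessPolicy`; the rest is illustrative, not Dagster's actual code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class FreshnessPolicy:
    """How stale an asset is allowed to be, optionally evaluated on a cron schedule."""
    maximum_lag_minutes: float
    cron_schedule: Optional[str] = None

# The two policies from the talk: at most five minutes of lag for the first
# summary table, and "have yesterday's data incorporated by 9am" for the second.
events_summary_policy = FreshnessPolicy(maximum_lag_minutes=5)
daily_summary_policy = FreshnessPolicy(maximum_lag_minutes=24 * 60, cron_schedule="0 9 * * *")
```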
Looking at these assets in the UI, the five-minute one is currently late. It's five minutes late because the last time we materialized it was 10 minutes ago, so it's not incorporating all the data that we expect it to. The daily one isn't late, even though we haven't updated it for a while. That's fine, because we don't need it to be updated until 9am tomorrow.
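The lateness calculation behind that display can be sketched in a few lines of plain Python. This is an illustration of the semantics, not Dagster's implementation:

```python
from datetime import datetime, timedelta

def lateness(maximum_lag: timedelta, last_materialized: datetime, now: datetime) -> timedelta:
    """How far past its allowed lag an asset is; zero when it is still fresh."""
    overdue = (now - last_materialized) - maximum_lag
    return max(overdue, timedelta(0))

# The situation above: a 5-minute policy, last materialized 10 minutes ago.
now = datetime(2022, 12, 1, 9, 0)
late_by = lateness(timedelta(minutes=5), now - timedelta(minutes=10), now)
# late_by == timedelta(minutes=5): the asset is five minutes late
```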
Dagster's asset reconciliation sensor will avoid duplicating work when two assets depend on the same upstream asset. It knows that the same materialization of the login events asset can be used to help both of these downstream assets meet their freshness policies.
Achieving the same outcome without freshness-based scheduling would require a complex pyramid of jobs, schedules, and sensors. With freshness-based scheduling, we just tell Dagster how fresh we want our data to be, and it handles the rest.
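To illustrate the de-duplication, here is a toy planner, hypothetical rather than Dagster's actual algorithm, that collects the assets a run needs while visiting each shared upstream only once:

```python
def plan_run(late_assets, parents):
    """Return assets to materialize, parents before children, each asset at most once."""
    to_materialize = []
    for asset in late_assets:
        for parent in parents.get(asset, []):
            if parent not in to_materialize:
                to_materialize.append(parent)
        if asset not in to_materialize:
            to_materialize.append(asset)
    return to_materialize

# Both summary tables are late and share the login_events upstream:
plan = plan_run(
    ["events_summary", "daily_summary"],
    {"events_summary": ["login_events"], "daily_summary": ["login_events"]},
)
# login_events appears once; one materialization serves both downstream assets
```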
Something you'll notice here, which is new, is that we've assigned code versions to all of the assets in our graph. The code version represents the version of the function that computes the asset from its dependencies. If that function changes, we'd expect the contents of the asset to change as well. Because we've changed our function in this case, we're also going to bump the code version to reflect that it changed.
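A sketch of how code versions can feed into staleness: an asset is stale when its current code version differs from the one recorded at its last materialization, or when any upstream asset is stale. This is illustrative only; Dagster's real logic also accounts for data versions:

```python
def stale_assets(code_versions, materialized_versions, parents):
    """Assets whose compute function changed since they last ran, plus their downstream assets."""
    stale = set()

    def is_stale(asset):
        if asset in stale:
            return True
        if code_versions[asset] != materialized_versions.get(asset):
            return True
        return any(is_stale(p) for p in parents.get(asset, []))

    for asset in code_versions:
        if is_stale(asset):
            stale.add(asset)
    return stale

# We bumped login_events's code version from "1" to "2"; it and the summary
# table computed from it are now both considered stale.
stale = stale_assets(
    code_versions={"login_events": "2", "events_summary": "1"},
    materialized_versions={"login_events": "1", "events_summary": "1"},
    parents={"events_summary": ["login_events"]},
)
# stale == {"login_events", "events_summary"}
```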
For a source asset like a file, we can define an observation function that reports a version. One strategy is to take the modification timestamp of the file and use that as its version; an alternative strategy could be to take a hash of the contents of the file. In the web UI, we can then click this button to observe our source asset. That runs our observation function and picks up the latest version. In this case, nothing changes, because the file has the same modification timestamp as the last time that we observed it.
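The two versioning strategies mentioned, modification timestamp versus content hash, can be sketched like this (hypothetical helpers, not Dagster's API):

```python
import hashlib
import os

def version_from_mtime(path: str) -> str:
    """Use the file's modification timestamp as its version: cheap, but any rewrite bumps it."""
    return str(os.path.getmtime(path))

def version_from_hash(path: str) -> str:
    """Hash the file's contents: rewriting identical bytes keeps the version stable."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
```

If an observation reports the same version as last time, nothing downstream of the file needs to run.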