From YouTube: Dagster Hot Takes: Retries and Alerts
Description
Getting an on-call page is the worst. Unfortunately, most task-based orchestrators page teams frequently, whenever jobs fail. With Dagster, you can reduce this alert fatigue by using retry strategies and only getting notified when SLAs are violated.
Resources:
- *Star Us*: https://github.com/dagster-io/dagster
- *Example*: https://github.com/dagster-io/hooli-data-eng-pipelines
A
So today, I want to look at how Dagster is built to be robust so that, even if you have flaky upstream data sources, like APIs that occasionally return errors, as long as things are recoverable, the orchestrator, in this case Dagster, can actually move on with its life without interrupting or paging on-call, and only alert you when there are actual violations of the SLAs that your stakeholders care about.
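The idea of retrying recoverable failures before anyone is paged can be sketched in plain Python. This is only an illustration of the general technique (exponential backoff with jitter), not Dagster's implementation; in Dagster this behavior is configured on the orchestrator side rather than hand-written. `RecoverableError`, `run_with_retries`, and `flaky_fetch_users` are hypothetical names introduced for this sketch.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RecoverableError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. a transient API error)."""


def run_with_retries(
    fn: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying recoverable failures with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RecoverableError:
            if attempt == max_retries:
                raise  # retries exhausted: only now does the failure surface
            # The delay doubles on each attempt; jitter spreads out simultaneous retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise AssertionError("unreachable")  # the loop always returns or raises


# Usage: a stand-in for a flaky API that fails twice and then succeeds.
calls = {"n": 0}


def flaky_fetch_users() -> list:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RecoverableError("API returned an error")
    return ["user_1", "user_2"]


# sleep is replaced with a no-op so the example runs instantly.
result = run_with_retries(flaky_fetch_users, sleep=lambda _: None)
```

Because the third attempt succeeds, the caller never sees the two earlier errors, which is the behavior that keeps transient failures from turning into pages.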
So let's go ahead and look at an asset graph.
A
Here we have a set of assets that start by querying an API for some user data and some orders data, and then a number of dbt models live downstream of those two API raw datasets that are processed. And then further downstream, we have some forecasting assets that maybe represent ML models, as well as some marketing assets that maybe represent different reports.
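The shape of the graph described here can be sketched as a plain dependency mapping. The asset names below are hypothetical stand-ins (the real ones live in the example repository), and this uses Python's standard `graphlib` rather than Dagster's own asset definitions. It shows why a flaky raw API asset matters: everything downstream of it is affected.

```python
from graphlib import TopologicalSorter

# Hypothetical asset names; each key maps to the set of assets it depends on.
asset_deps = {
    "raw_users": set(),
    "raw_orders": set(),
    "dbt_users_cleaned": {"raw_users"},
    "dbt_orders_cleaned": {"raw_orders"},
    "dbt_orders_augmented": {"dbt_users_cleaned", "dbt_orders_cleaned"},
    "forecast_model": {"dbt_orders_augmented"},
    "marketing_report": {"dbt_orders_augmented"},
}

# A valid execution order: every asset appears after all of its upstream assets.
order = list(TopologicalSorter(asset_deps).static_order())
position = {name: i for i, name in enumerate(order)}


def downstream_of(asset: str) -> set[str]:
    """Return every asset that directly or transitively depends on `asset`."""
    affected: set[str] = set()
    frontier = [asset]
    while frontier:
        current = frontier.pop()
        for name, deps in asset_deps.items():
            if current in deps and name not in affected:
                affected.add(name)
                frontier.append(name)
    return affected


# If the orders API is flaky, these assets cannot be refreshed until it recovers.
blocked = downstream_of("raw_orders")
```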
A
So, let's look at the timeline for these assets. We can see that there's definitely some flakiness here. Typically, jobs are running every hour, but sometimes those jobs fail. And if we go and look at one of these runs, say the one that occurred at 12, I want to talk about some of the ways that Dagster helps you be robust to these types of failures. So here we have two.
A
So as a result, where we land is that we're able to successfully run our model and avoid getting undue pages. Now, that was at 12 o'clock. If we look at our runs table, we can see that subsequently we've had some failures, and maybe these are failures now that are actually impacting the team. So as opposed to 12 o'clock, where we were able to use our various retry strategies to succeed, now we have runs that have failed. And so what's happening? Are we getting an on-call page?
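One way to reason about that question is in terms of staleness rather than individual run failures, in line with the description's point about only being notified when SLAs are violated. The following is a minimal sketch under stated assumptions, not Dagster's API and not the answer given in the video: `should_page` and the three-hour SLA are hypothetical, chosen only to illustrate the rule.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_page(
    last_success: Optional[datetime],
    now: datetime,
    max_staleness: timedelta,
) -> bool:
    """Page only when the data is staler than the SLA allows."""
    if last_success is None:
        return True  # the asset has never materialized successfully
    return now - last_success > max_staleness


# Assumed SLA for illustration: stakeholders tolerate data up to 3 hours old.
sla = timedelta(hours=3)
last_ok = datetime(2023, 1, 1, 12, 0)  # the run at 12 succeeded after retries

# The hourly run at 13:00 failed, but the data is still within the SLA.
page_after_one_failure = should_page(last_ok, datetime(2023, 1, 1, 13, 30), sla)

# Failures have continued and the data is now 4 hours old: the SLA is violated.
page_after_sla_breach = should_page(last_ok, datetime(2023, 1, 1, 16, 0), sla)
```

Under this rule a single failed hourly run does not page anyone; only a sustained outage that leaves the data older than the agreed threshold does.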