From YouTube: Partitioned Data Pipelines in Data Engineering
Description
Partitioning is a technique that helps data engineers and ML engineers organize data and the computations that produce that data.
I'm the lead engineer on the Dagster project, and I'm here to talk to you about partitioned data pipelines. Partitioning is a technique that helps data engineers and machine learning engineers organize data and the computations that produce that data. Partitioning also makes data pipelines more performant by letting them operate on subsets of data instead of all of it at once.
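The asset demoed in the video isn't shown in the transcript, but a minimal sketch of an hourly partitioned Dagster asset might look like this (the asset name, start date, and output file are hypothetical):

```python
import json

from dagster import AssetExecutionContext, HourlyPartitionsDefinition, asset

# One partition per hour, starting from an arbitrary start date.
hourly_partitions = HourlyPartitionsDefinition(start_date="2023-01-01-00:00")

@asset(partitions_def=hourly_partitions)
def hourly_events(context: AssetExecutionContext) -> None:
    # The partition key names the hour this run covers, e.g. "2023-01-01-00:00".
    hour = context.partition_key
    # Stand-in for real ingestion: write just that hour's slice of data.
    with open(f"events_{hour}.json", "w") as f:
        json.dump({"hour": hour, "events": []}, f)
```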
If we go to the asset details page, we can inspect individual partitions of our asset. I'm clicking here on the partition that corresponds to a particular hour of a particular day, and I can inspect the metadata for that partition, such as the file where it's stored. I can also click into the run that materialized it to see the logs. We can launch runs to fill in or recompute partitions of our asset.
By default, this backfill will launch a separate run for each partition. However, we can alternatively choose the option to launch a single run that covers all partitions, which is helpful if we're using a parallel processing engine like Spark or Snowflake and we want to execute our backfill in a single query or job.
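As I understand it, the single-run option corresponds to the asset's backfill policy. A sketch of how that might be set in code, reusing the hypothetical hourly asset from the earlier sketch:

```python
from dagster import BackfillPolicy, asset

@asset(
    partitions_def=hourly_partitions,  # from the earlier sketch
    # Cover the whole selected range in one run, so a single Spark or
    # Snowflake job can process every partition in the backfill.
    backfill_policy=BackfillPolicy.single_run(),
)
def hourly_events_bulk() -> None: ...
```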
We can also schedule our asset so that each hourly partition will be filled in at the end of that hour.
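A sketch of what that schedule might look like, assuming a hypothetical job that targets the hourly asset from before; build_schedule_from_partitioned_job derives the cron cadence from the partitions definition:

```python
from dagster import build_schedule_from_partitioned_job, define_asset_job

# Hypothetical job that materializes the hourly asset from the earlier sketch.
hourly_events_job = define_asset_job(
    "hourly_events_job",
    selection=["hourly_events"],
    partitions_def=hourly_partitions,
)

# Runs once an hour, targeting the partition for the hour that just ended.
hourly_events_schedule = build_schedule_from_partitioned_job(hourly_events_job)
```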
Partitioned assets can depend on other partitioned assets, creating a partitioned data pipeline. Here's some code that implements this pattern: we have the hourly asset that we looked at before, along with another hourly asset that depends on it. Each hourly partition of this downstream asset depends on the corresponding partition in the upstream asset.
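The code shown on screen isn't captured in the transcript; a minimal sketch of the pattern, building on the hypothetical hourly_events asset from before:

```python
from dagster import AssetExecutionContext, asset

@asset(partitions_def=hourly_partitions, deps=[hourly_events])
def hourly_summaries(context: AssetExecutionContext) -> None:
    # With matching partitions definitions, each hourly partition of this
    # asset depends on the hourly_events partition for the same hour.
    hour = context.partition_key
    ...
```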
If we go back into the UI, we can select both of them and then click the materialize button to launch runs that materialize those partitions in order. Dagster can also handle dependencies between assets with different time partitionings.
Here's code that includes the hourly assets that we looked at before, along with a daily partitioned asset that depends on both of them.
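Again, the on-screen code isn't in the transcript; a sketch under the same assumptions. Dagster's default time-window partition mapping is what relates the daily partitions to the hourly ones:

```python
from dagster import AssetExecutionContext, DailyPartitionsDefinition, asset

daily_partitions = DailyPartitionsDefinition(start_date="2023-01-01")

@asset(partitions_def=daily_partitions, deps=[hourly_events, hourly_summaries])
def daily_rollup(context: AssetExecutionContext) -> None:
    # The default time-window mapping relates each daily partition to the
    # 24 upstream hourly partitions that fall within that day.
    day = context.partition_key
    ...
```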
Each daily partition depends on the 24 upstream hourly partitions for the same day. In the UI, we can select all of these assets and then launch a backfill over a selected time range. While it's executing this backfill, Dagster will wait until all the upstream hourly partitions are filled before filling in the corresponding downstream daily partition.
Partitions don't have to be time windows. Data for different countries might arrive at different times, so partitioning by country allows us to update the data for a particular country without touching the data for other countries. We define this asset by constructing a static partitions definition with a list of countries and assigning it to our asset.
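A sketch of such a statically partitioned asset; the country list and asset name are hypothetical:

```python
from dagster import AssetExecutionContext, StaticPartitionsDefinition, asset

country_partitions = StaticPartitionsDefinition(["usa", "brazil", "japan", "france"])

@asset(partitions_def=country_partitions)
def weather_stations(context: AssetExecutionContext) -> None:
    # Each run targets one country, so we can refresh a single country's
    # data without touching the others.
    country = context.partition_key
    ...
```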
So far, we've looked at time-partitioned assets, and we've looked at statically partitioned assets as well. What if we want an asset to be both? For example, maybe we have an asset that contains weather events from the weather stations that we tracked in our previous asset. Each day, we add weather events for each country, so we want a separate partition for every date, for every country.
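A sketch of how this two-dimensional partitioning might be expressed with Dagster's MultiPartitionsDefinition; the dimension names and asset name are hypothetical:

```python
from dagster import (
    AssetExecutionContext,
    DailyPartitionsDefinition,
    MultiPartitionsDefinition,
    StaticPartitionsDefinition,
    asset,
)

date_country_partitions = MultiPartitionsDefinition(
    {
        "date": DailyPartitionsDefinition(start_date="2023-01-01"),
        "country": StaticPartitionsDefinition(["usa", "brazil", "japan", "france"]),
    }
)

@asset(partitions_def=date_country_partitions)
def weather_events(context: AssetExecutionContext) -> None:
    # A multi-dimensional partition key carries one key per dimension.
    keys = context.partition_key.keys_by_dimension
    date, country = keys["date"], keys["country"]
    ...
```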
When we materialize this asset, we can choose both a country and a date to target. We can also launch a backfill that covers both our country-partitioned asset and our multi-dimensional asset. For example, we could backfill all the historical data for both the USA and Brazil in both of these assets.
In everything that we've looked at so far, the set of partitions is fully determined by the code that defines the asset. But in some situations, we need to be able to add and remove partitions dynamically. For example, consider a data pipeline that creates a derived file for every file that lands in a particular directory: as new files land, we need to create new partitions to represent them. Or consider a machine learning pipeline that we want to run with ad hoc hyperparameters to create a set of ML models that we can compare. Each time we launch a run with a new set of hyperparameters, we want to create a new partition to represent the new machine learning model that is generated by that run. In Dagster, we can handle these situations with dynamically partitioned assets.
Here's a dynamically partitioned asset with a partition for every release of the Dagster project itself. Dagster publishes a release roughly once a week, but some weeks have no release and other weeks have multiple releases.
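The asset's source isn't captured in the transcript; a sketch using DynamicPartitionsDefinition, with hypothetical names:

```python
from dagster import AssetExecutionContext, DynamicPartitionsDefinition, asset

# Partition keys are added and removed at runtime instead of being fixed in code.
releases_partitions = DynamicPartitionsDefinition(name="releases")

@asset(partitions_def=releases_partitions)
def release_metrics(context: AssetExecutionContext) -> None:
    # The partition key is a release identifier, e.g. "1.5.0".
    release = context.partition_key
    ...
```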
It's common to combine dynamically partitioned assets with Dagster sensors. Here's a sensor that monitors GitHub for new Dagster releases. When it finds a new release, it adds a new partition for that release, and then it requests a run to materialize that partition in the whole pipeline of assets that are partitioned by release.
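A sketch of what such a sensor might look like. The GitHub lookup is stubbed out, and the job and helper names are hypothetical; the key pieces are the request to add partitions and the run request keyed to each new partition:

```python
from dagster import (
    RunRequest,
    SensorEvaluationContext,
    SensorResult,
    define_asset_job,
    sensor,
)

release_job = define_asset_job("release_job", selection=["release_metrics"])

def fetch_new_releases() -> list[str]:
    # Stub standing in for a real GitHub API call.
    return []

@sensor(job=release_job)
def release_sensor(context: SensorEvaluationContext) -> SensorResult:
    new_releases = fetch_new_releases()
    return SensorResult(
        # Register a partition for each newly discovered release...
        dynamic_partitions_requests=[
            releases_partitions.build_add_request(new_releases)
        ],
        # ...and request a run to materialize each new partition.
        run_requests=[RunRequest(partition_key=r) for r in new_releases],
    )
```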
We can also add dynamic partitions through the Dagster UI. If we select one of our assets and click the materialize button, we can type in the name of a release and then launch a run for it.
So that was a whirlwind tour of Dagster's partitioning functionality. Using partitions in your data pipelines has some big advantages.
It helps you monitor and materialize the subsets of your data that you care about in a particular context. This gives you peace of mind that you're operating on the data that you need to be, and it avoids wasting computation on data that you don't need to touch. To learn more, visit dagster.io and look for partitions in our docs. Thank you.