Description
The Modern Data Stack gives us a wide range of free technology for data ingestion, data storage, data transformation, data orchestration, and data visualization.
Pedram Navid, Head of Data Engineering and DevRel at Dagster, walks us through one data pipeline design using a number of open-source solutions, including Dagster, dbt, dbt-duckdb, DuckDB, PopSQL, the Mastodon API, Evidence, Sling, and Steampipe.
The pipeline is used to analyze bird observation data from the Cornell Lab.
Github Repo: https://github.com/dagster-io/mdsfest-opensource-mds
Source Data: https://feederwatch.org/explore/raw-dataset-requests/
Hey everybody, my name is Pedram Navid and I am the Head of Data Engineering and DevRel here at Dagster, and I want to talk to you about building an open-source modern data stack.
So what is a pipeline? A pipeline is really a collection of jobs that we need to get done in order to make data valuable. In most cases you want to ingest data from various source systems: that might be a Postgres database, it could be some data sitting in the cloud as a zip file, or it might be an API from a third-party SaaS app. Either way, the data is out there and we need to bring it in so we can do some type of analysis on it. So that's step one.
Then on the storage side, that can be anything from a database itself, to a data warehouse, or even just Parquet or Arrow files stored on local disk. Transformation is the next thing we really care about, and this is all about taking the data we have and doing something with it.
Whether that's aggregating it, grouping it, filtering it, joining it, all that kind of fun stuff. And finally, we need an orchestration tool: a way to take all these various steps and put them in a sequence so they can run and we can monitor and observe them. Once we have all of that, then we can do visualization or activation, which is really the final mile: you've done all this work, and now what do you want to do with it?
So I'll talk a little bit about the state of open-source tools as it relates to each of these areas. The first one is obviously ingestion, getting that data out, and I think one of the underrated ways to do that is just a plain old Python script. It's a time-tested and true method of getting data from wherever it may be, and one that I think is still very valuable today. Outside of that, we have Meltano and Airbyte.
Meltano is a framework for getting data out of these systems based on the Singer protocol, which is also used by companies like Stitch, for example. Then there's Airbyte, which is similar in some ways to Meltano, but it has its own set of connectors, and those have different qualities to them as well. dlt is a newer framework, one that I'm pretty excited about. I haven't spent too much time on it, but I do like how light and versatile it is, and it makes it very easy to write some of your ingestion logic.
Then there are also tools like Steampipe and Sling. Steampipe is really nice: what it can do is take data that sits behind a REST API and expose it as SQL, so that you can write SQL queries against some of these SaaS applications instead of having to deal with REST APIs, which, you know, are fine, but we don't love them that much. Sling is a wonderful little tool that's kind of like a Fivetran-type tool.
Behind that we have storage, and again we have a lot of options. There's the time-tested and true Postgres, which has been around for what feels like a hundred years now. It works pretty well; it's maybe not the best tool for large volumes of data when you want to do analytical processing, but I think it's a very solid option. There are ClickHouse and DuckDB, both much better suited to analytical queries, and Parquet and Arrow files are available as well. These are wonderful, highly performant ways to write data to disk.
They have compression schemes, so all the things you want out of a modern file format. And then there's Databend, which is an open-source data warehouse in a box, in some sense. I haven't used it much myself, but I've heard great things, so it's another wonderful open-source tool to investigate. Once the data is in storage, we typically want to start transforming it, and I think dbt is the canonical standard by which we transform data in a data warehouse or in a tool like DuckDB or Postgres.
Josh Wills has created a great tool called dbt-duckdb, which is a wonderful adapter for dbt. Outside of that, there are also Polars and Arrow DataFusion if, for some reason, you don't want to write SQL. For orchestration, Dagster is, I think, probably the best, but there are many others in this space as well, Airflow being the most obvious one. And then on the visualization side we have Superset, we have Evidence.dev, we have Lightdash.
Project FeederWatch is a project by the Cornell Lab and Birds Canada that collects checklists of bird observations submitted by people, and every three or four years they publish a whole new checklist. These checklists can be quite large, as you can imagine, one row for every bird observation by a person, and so they can be up to a gigabyte each, and there are multiple of them. They're compressed, they're in the cloud as zip files, and so we'll download those. There are also site descriptions and species translations.
These are, you know, mapping files that you want to combine with the checklists so you get a fuller picture of the data. Then we have a PopSQL demo database; we'll just hit this demo database that's in the cloud as an example of a way to get data out of Postgres. And finally there's Mastodon; we'll use that as an example of a REST API that we want to pull data from using SQL. And without further ado:
What I'll do is pull up the repo for this project. First things first, I'm going to start Dagster by typing dagster dev in the terminal. That starts Dagster, and then I'll pull it up in my browser by visiting localhost:3000. The first thing we see here is an overview of all the assets that I've created. If I click on the global asset lineage, you get the full picture. We have the data coming in from the checklists.
This is using plain old Python: it downloads the file from S3, unzips it, and extracts the CSV, and we do that for two checklists. We do it for the site-description data and for the species data. Once we have that, we also use Sling to get the events and tickets data from that Postgres database in the cloud, and finally we hit the Mastodon API and get bird toots, which is really just the birds hashtag on a bird-related Mastodon instance.
Next up we have the prepared section. Here what we're going to do is take all these datasets we've created, all these raw files that we saved to disk, and ingest them into DuckDB so we can then do some aggregation and some more fun stuff with them. So we ingest the data first, and then eventually we combine it into the all_birds data, which unions and joins all the birds, and from there we create something like a top birds by year, just as an example of the kind of aggregation you might want to do.
For this entire pipeline, I can click Materialize All and it'll start from the beginning, and if I click on the run, you can watch its progress. What it's going to do is exactly what I just described: it's going to download the data, hit the API, hit Postgres, run using all the tools we've mentioned, ingest that data into DuckDB, and make it available to the dbt models.
While that's running, the next thing I'm going to do is show you some of the code I used to build this, and the first thing I'll walk you through is the resources section of our code. This is where we define how we want to connect with the outside world from within Dagster. So we have a resource, for example, for DuckDB, and we have resources for Sling and for dbt.
All of these are quite simple to define. DuckDB is already supported by Dagster, there's a library for it, and so really in one or two lines we've defined a DuckDB resource. We give it the path to the database file we want it to save our data in, and that's pretty much it when it comes to DuckDB.
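As a rough sketch (the file path here is an illustrative placeholder rather than the repo's exact value), that resource definition looks something like this:

```python
from dagster_duckdb import DuckDBResource

# A DuckDB resource pointing at a local database file.
# The path is a placeholder, not the project's actual location.
duckdb_resource = DuckDBResource(database="data/birds.duckdb")
```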
Now for Sling, here we have some wrapper code around an existing library, because Dagster doesn't come with a Sling integration built in.
We have to write a little bit more code, but as you can see, it's not much. We define a function that we want our assets to call; we say, give us a source table and a file, and we'll write to that file. By default we do full refreshes, which just pulls all the data from the source table, but we could conceivably do incremental loads as well, if we chose to.
We pass a config to the CLI and then ask it to run, and we can even process the messages coming in from standard out to find patterns we care about. For example, here we're getting the number of rows, and we output that as metadata. That's it on the Sling side.
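A minimal sketch of that kind of wrapper, assuming a hypothetical sling CLI invocation (the connection name, flags, and log format below are guesses, not the project's actual code):

```python
import re
import subprocess

from dagster import Output

def sling_sync(source_table: str, target_file: str) -> Output:
    """Sync one source table to a local file with the sling CLI."""
    cmd = [
        "sling", "run",
        "--src-conn", "DEMO_POSTGRES",           # hypothetical connection name
        "--src-stream", source_table,
        "--tgt-object", f"file://{target_file}",
        "--mode", "full-refresh",                # default to a full refresh
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)

    # Scan stdout for an "N rows" style message and surface it as metadata.
    match = re.search(r"(\d[\d,]*) rows", proc.stdout)
    rows = int(match.group(1).replace(",", "")) if match else 0
    return Output(None, metadata={"rows_synced": rows})
```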
dbt, because it's again supported by Dagster, takes just a couple of lines: we give it the path to our dbt project.
We tell it to parse the project and then create a manifest that we'll use in our assets, and that's really it.
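In dagster-dbt terms, that setup is roughly the following; the project directory and manifest path are placeholders, and the manifest itself is the file dbt writes under target/ when the project is parsed:

```python
from pathlib import Path

from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

DBT_PROJECT_DIR = Path("dbt_project")                  # placeholder project path
dbt_resource = DbtCliResource(project_dir=str(DBT_PROJECT_DIR))

# manifest.json is produced by parsing the dbt project (e.g. `dbt parse`).
DBT_MANIFEST = DBT_PROJECT_DIR / "target" / "manifest.json"

@dbt_assets(manifest=DBT_MANIFEST)
def dbt_models(context: AssetExecutionContext, dbt: DbtCliResource):
    # Run `dbt build` and stream events back to Dagster as materializations.
    yield from dbt.cli(["build"], context=context).stream()
```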
When we get into the assets, that's where the interesting stuff starts to happen. First, we've defined a function that downloads the data from S3, extracts it from the zip file, and then saves it to a location depending on the file name. That's the helper function itself; it's not used by Dagster except through these assets, and so we have these assets here, which are really the representation of that file.
So we create an asset by using the asset decorator. We use compute_kind to give it that nice little tag in the asset graph, we also give it a group to make it a little nicer to see in the UI, and then all we really do is ask it to download and extract the data; that's the helper function we had just above. We pass in some context and some constants so that it knows where to find the file, and that's really it.
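A hedged sketch of that pattern follows; the URL, paths, asset name, and group name are made-up placeholders rather than the project's actual values:

```python
import zipfile
from pathlib import Path
from urllib.request import urlretrieve

from dagster import asset

RAW_DATA_DIR = Path("data/raw")  # placeholder output directory

def download_and_extract(url: str, target_dir: Path = RAW_DATA_DIR) -> None:
    """Download a zip file and extract its contents into target_dir."""
    target_dir.mkdir(parents=True, exist_ok=True)
    zip_path = target_dir / Path(url).name
    urlretrieve(url, zip_path)            # fetch the archive
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)         # unzip the CSV next to it

@asset(compute_kind="python", group_name="raw_data")
def checklist_2023() -> None:
    # Hypothetical URL standing in for the real FeederWatch file in S3.
    download_and_extract("https://example.com/feederwatch_checklist_2023.zip")
```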
The asset downloads that file, extracts it, and then we're sort of done. We repeat that once for this checklist and again for the other checklist down here, and from there we can download the other files, for example the site-description data, the species-translation data, and so on and so forth. Once we have all that data put together, the next thing we want to do is join it in DuckDB, because right now it's all just sitting on disk as CSV files.
Take the two checklist files: we want to combine those. We have these two very large checklists, one for 2020 and one for 2023, but we don't want two separate tables, so what we can do is use our DuckDB resource to first read each into its own table and then create our birds table, which is just a union of the two. For reference, this is about 17 million rows, and it runs on my laptop in under a minute; that's really the power of the tools we have in our hands today.
And so all this code does is read the files using a CSV reader, join them, and create the tables. We do that once for the birds table, and then we do it again for species and again for the site descriptions.
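Conceptually, with DuckDB's CSV reader the birds table is built with something like the following; the file names, table names, and database path are illustrative, not the repo's exact ones:

```python
import duckdb

# Connect to the same database file the Dagster resource points at (placeholder path).
conn = duckdb.connect("data/birds.duckdb")

# Read both checklist CSVs and union them into a single birds table.
conn.execute("""
    CREATE OR REPLACE TABLE birds AS
    SELECT * FROM read_csv_auto('data/raw/checklist_2020.csv')
    UNION ALL
    SELECT * FROM read_csv_auto('data/raw/checklist_2023.csv')
""")

# The species and site-description tables follow the same pattern.
conn.execute("""
    CREATE OR REPLACE TABLE species AS
    SELECT * FROM read_csv_auto('data/raw/species_translation.csv')
""")
```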
So now we have these three tables in DuckDB, and we can move on to the toots. Toots are like Mastodon's version of tweets.
We hit the Mastodon API, and we want to query that API to get, you know, the top toots or whatever it is. To do that we run Steampipe. Steampipe is a command-line tool that runs locally on my laptop, so we have a subprocess that runs it, and then we pass it a query. Let's take a look at that query to see how simple it really is. We're writing SQL, and this is being converted into API calls to Mastodon.
So we're saying: select the content from Mastodon toots, give us a thousand of them, and then match it with some regex to find hashtags.
This is very interesting, because now we can write complex queries that are difficult to express as REST API calls but very simple to write in SQL, and then we can say: give me the aggregate of that, ordered by the count and by hashtag. So we get this nice SQL transformation layer, and all we have to do once we've written the query is pass it into Steampipe, and Steampipe will translate it and write the result to a CSV file for us, which is really, really nice.
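A rough sketch of that call is below; the Steampipe table name, the regex, and the output flag are assumptions standing in for the real query in the repo:

```python
import subprocess

# SQL that Steampipe's Mastodon plugin turns into API calls. The table name
# and regex here are guesses at the shape of the real query, not copies of it.
QUERY = r"""
with toots as (
  select content
  from mastodon_toot_local
  limit 1000
)
select lower(m.tag[1]) as hashtag, count(*) as ct
from toots, regexp_matches(toots.content, '#(\w+)', 'g') as m(tag)
group by 1
order by ct desc
"""

def fetch_toots(output_path: str = "data/raw/toots.csv") -> None:
    """Run the query through the steampipe CLI and save the result as CSV."""
    result = subprocess.run(
        ["steampipe", "query", QUERY, "--output", "csv"],
        capture_output=True, text=True, check=True,
    )
    with open(output_path, "w") as f:
        f.write(result.stdout)
```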
The next big step is to take that CSV file and, again, just create a table, so nothing too interesting here; we're just creating a DuckDB table from that CSV file. And finally, we're going to use Sling to get data from Postgres. Like I mentioned, we have this Postgres database in the cloud; it has, you know, events and tickets and that type of stuff, and what we want to do is sync that data from Postgres into a CSV that we can then load into DuckDB as well.
In just one line we can do that: we run our Sling sync, we give it the name of the table and a path to save the files to, and that's it. It's done, it saves the files, and we get some metadata around it as well. Once that's done, we can again create our tables in DuckDB by reading that CSV file, and we do that for the tickets and the events, and that's really it.
The last step here is to just run dbt, and because dbt is reading from DuckDB, Dagster knows this runs only after all the source data has been ingested and transformed and is ready for dbt to consume. If you look at our dbt models, there's not a whole lot there. We just have, for example, this all_birds model I created as a simple example; all it does is take the birds, the sites, and the species and combine them into one all_birds table that we'll continue to use in our BI tool.
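For completeness, the assets and resources above all get registered with Dagster so that dagster dev can discover them. This is a hedged sketch rather than the repo's exact wiring, reusing names from the earlier sketches:

```python
from dagster import Definitions

# Wire the assets and resources together; the names come from the earlier
# sketches and stand in for the project's actual module layout.
defs = Definitions(
    assets=[checklist_2023, dbt_models],  # plus the other checklist, sites, species, toots, and Sling assets
    resources={
        "duckdb": duckdb_resource,
        "dbt": dbt_resource,
    },
)
```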
Okay, so I'll exit out of this and go back to the Dagster UI. As you can see, it's now all complete, everything is done, and if we take a look here, it says it finished in about a minute, which is incredible, because this birds table, for example, is 17 million rows, and it was able to not only join that data but aggregate it and do all this fun stuff with it.
So that's quite interesting. The next thing we're going to do is take a look at the output of all this data. We created these tables with Dagster; now what we can do is open Evidence, which is a visualization tool for, you know, analyzing and looking at data. It can connect directly to your DuckDB instance, so I'll pull it up now, and as you'll see, it loads data from DuckDB and creates these live outputs using the queries we've written.
Here we're selecting from all_birds, for example, and we can see that table of our data. We can do top toots, we can do bar charts, we can do all kinds of stuff, and it's reading this directly from DuckDB. And it's so easy to update: we can go back into our terminal here and pull up the index file, and this is really all the code being used to generate the query and the visualization we just saw.
In here I have a very simple SQL statement, select star from all_birds, and that's what is being rendered right here. I can go ahead and change this, for example, to 100 rows, and if I go back, you can see it's already been updated and we've got 100 rows now. So that's really the power of this entire stack: as you can see, it works beautifully for getting data, transforming it, running it, and visualizing it, all locally on a laptop, all with free, open-source tools.
And I think that is really powerful and very exciting. So I hope you liked it. If you want to dig in more, you can go check out the repo for this project; it's called mdsfest-opensource-mds, and it has all the instructions you need to install it and run it locally on your own laptop. Thanks, and enjoy!