From YouTube: Build a poor man’s data lake from scratch with DuckDB
Description
DuckDB is so hot right now. Could it replace our cloud data warehouses or data lakes?
Pete Hunt builds a data lake from scratch with DuckDB and Dagster. Follow the tutorial here: https://dagster.io/blog/duckdb-data-lake
Combined with Dagster, S3, and Apache Parquet, DuckDB can become a powerful, multiplayer data lake that can serve the needs of many organizations with very little effort. Think of it as a free, simple SQLite.
We can see the rise of DuckDB for subsets of workloads that don’t have massive scale and can take advantage of its simplicity and fast performance.
Give it a go!
Actually, you already know what DuckDB is. It's like, so hot right now; everybody's talking about it. You can think of DuckDB as kind of like a SQLite for analytics. I think there are a couple of reasons why people like DuckDB and why it's so hot right now: it's very feature-rich, it's very fast, it runs locally, it integrates really nicely with Python, and it has a bunch of features that make it just really useful. And it's all MIT licensed.

I think the bigger reason, though, and the underlying reason people like it so much, is that it's just a breath of fresh air: it seems really simple and lightweight, and it makes sense. The modern data stack in 2022 is super complex and requires a bunch of different components, and DuckDB is just a real breath of fresh air. It's simple and it's free, and I think people like it for that reason. So we're going to talk about building a data lake on top of DuckDB. But before we do that,
we should probably talk about why this is even a good idea. I told you about a bunch of features that DuckDB has and a bunch of reasons that DuckDB is good, but why should we even embark on this project?
Well, despite the limits of DuckDB, specifically that it runs on a single machine, when it's combined with a couple of other technologies, like an orchestrator such as Dagster, S3, and the Parquet format, you can really build a data lake that scales pretty far, because there are so many high-performance machines available on AWS. DuckDB runs on a single machine, so if a single machine can't churn through all the data fast enough, then you can't use it.
But the question is: how many organizations are really running lots of queries that need more than, say, 2,000 GB of RAM and 128 CPUs? The answer is that there are probably a lot that do work at that scale of data; I've worked at a bunch of those places, and there are way more of them than you'd think. However, number one, there's a large class of businesses that don't process that much data, and number two,
A
There
are
lots
of
teams
within
large
organizations
that
want
to
take
advantage
of
a
lot
of
the
benefits
of
duckdb,
and
maybe
the
specific
problem
they're
working
on
can
fit
Within
These
constraints
so
and
when
you
get
to
use
duckdb
for
something
like
this,
you
get
the
same.
You
know
SQL
dialect
locally
when
you're
running
your
tests,
as
as
you
would
remote
and
you
have
to
pay
licensing
fees
to
some
SAS
vendor
in
order
to
to
run
your
queries
and
so
before
we
dig
into
this
like
in
depth.
I just want to highlight that this is totally experimental. We're not using it in production; in fact, we're not using any libraries here for this specific problem that are battle-tested or anything. We're going to build this solution from scratch. So think of this as a tutorial to help you learn these technologies, and potentially as a sketch of a future architecture for these types of data lakes.
I think it's a really exciting time to be in data engineering, and this is one of the really exciting areas that I'm pumped up to explore. So here's the thing we're going to build: we're going to take dbt's jaffle shop example, a classic example that a lot of data engineers and analytics engineers use, and we're basically going to plug DuckDB into Dagster, S3, and Parquet, and we're going to call it Duck Pond.
It's going to be like a little mini data lake built on top of these technologies. The code is available on GitHub if you just want to git clone it and look at it, the blog post is available as well, and we're going to get started right now. The first thing we're going to do is create a new Gitpod instance. This is a really cool app that I like; it's called Gitpod.
It creates a cloud development environment for you, with a standard set of packages and package versions and so on, so for tutorials it's really, really convenient. Just go to gitpod.new and it will spin up a container in the cloud for you, with an in-browser Visual Studio Code.
So you'll be working in exactly the same tools that I am, and we won't have to deal with Python dependencies or inconsistencies or anything like that. The first thing we're going to do is get the project started, so I'm going to type pip install dagster, and this is going to install the Dagster command-line interface.
Dagster is what's called a data orchestrator: it's the thing that combines and coordinates all of the different tools in your data stack, and so we're going to start with Dagster. It's convenient because I work there too, but really, your orchestrator does sit at the center of these things. So we're going to start with that.
We're going to be generating SQL strings, because we're using DuckDB, so we need a library to help us create safe SQL strings. And then, finally, for one of our examples we're going to scrape Wikipedia, so we need a couple of libraries in order to do that. We're also going to add some dev dependencies. There's this really cool library, or tool, called LocalStack.
A
If
you're
not
aware
of
it,
it
lets
you
run
a
bunch
of
AWS
like
mock
AWS
in
Services
locally,
so
we're
going
to
use
this
to
run
S3
on
our
on
our
local
machine
or
this
Cloud
Dev
environment.
In
order
to
to
set
things
up,
we
also
need
the
AWS
CLI
and
the
this
dialect
of
it
called
the
AWS
CLI
local,
which
points
basically
the
AWS
CLI
at
local
stack.
So
now
that
we've
saved
this,
we
can
pip
install.
You know what, I'll make the font size a little bit bigger so you can see it. Oops, I should probably switch to the right directory, so I can run pip install -e, and this is going to install all of these different packages. Except I misspelled the SQL escaping library, sqlescapy.
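For reference, the dependency list discussed above might look something like this in a setup.py. This is a sketch, not the tutorial's actual file; the project name is hypothetical, and the exact packages for the Wikipedia scrape are my best guess (pandas.read_html needs an HTML parser such as lxml):

```python
from setuptools import find_packages, setup

setup(
    name="jaffle",  # hypothetical project name
    packages=find_packages(),
    install_requires=[
        "dagster",
        "dagit",
        "duckdb",
        "pandas",
        "sqlescapy",  # safe SQL string escaping
        "lxml",       # HTML parser used by pandas.read_html for the Wikipedia scrape
    ],
    extras_require={
        "dev": [
            "pytest",
            "localstack",    # mock AWS services running locally
            "awscli",
            "awscli-local",  # `awslocal`: the AWS CLI pointed at LocalStack
        ]
    },
)
```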
And so this can take a little bit of time. Fortunately, the Gitpod environment's got a bunch of CPUs and it's in a data center, so it's generally pretty quick.
The next thing we need to do, once this finishes installing, is spin up our LocalStack environment. LocalStack, like I said, is a set of mock AWS services, and it's really convenient. You just run localstack start, which we installed as part of that Python
installation process, and it runs on localhost. So now I can go and create my S3 bucket. Just to give you an idea of what this architecture kind of looks like: we're going to be running a Dagster app in here.
For example, if you're using Snowflake or BigQuery, each of these Parquet files would correspond to a table partition or a full table in those systems. And where BigQuery and Snowflake couple compute and storage, in this architecture we separate them: S3 is going to be our storage, and DuckDB is going to be
our compute. I also want to note that there are a lot of generic S3 implementations, so you don't actually have to be tied to any vendor using this architecture. All of these tools are completely open source, and S3 is a generic API, so you can use DigitalOcean Spaces or MinIO as your storage layer here. So with that, we're going to go and create our S3 bucket.
I think you'll just have to trust me on that, and let's go in and start writing some code. Like I said, the core abstraction here is going to be Parquet files sitting on S3, and we're going to interact with them through SQL.
So first I want to create a little abstraction for representing SQL statements, because we're going to be doing a lot with SQL. I'm going to create a file here called duckpond.py, and in it I'm going to create a little class to represent a SQL statement.
This is basically just a data class, and you can think of it as representing a SQL statement template. So you might want to say select star from $my_table; my_table would be a binding that's passed to this constructor, and the SQL would be the template string. So you would construct it with something like my_table equals the value you want bound. I'll show you that implementation in a bit.
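Here's a minimal sketch of what that class might look like; this is my paraphrase of the tutorial code, not a verbatim copy:

```python
from typing import Any, Dict


class SQL:
    """A SQL template string plus the bindings for its placeholders.

    `sql` is a template like "select * from $my_table", and each binding
    maps a placeholder name to a pandas DataFrame, a Python primitive,
    or another SQL object (for nesting queries).
    """

    def __init__(self, sql: str, **bindings: Any):
        self.sql = sql
        self.bindings: Dict[str, Any] = bindings
```

So the example above would be written as SQL("select * from $my_table", my_table=df).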
Hopefully that makes sense. With this little abstraction, we're going to start actually building an example application. I told you that we were going to do the jaffle shop example, but we're actually going to start with something really, really simple: something that reads a couple of data sets from Wikipedia and then computes some basic statistics on them.
I'm going to just copy and paste some code in here. We're going to use Dagster software-defined assets to model each of those Parquet files, or tables, in our database. We'll start with a population asset.
What this does is read the UN list of countries by population from Wikipedia, using the pandas read_html function. Then, because Wikipedia uses a weird Unicode dash to show negative numbers, we replace that, and finally we return a SQL statement that selects all the columns from that data frame and returns the result. I'm also going to add a second asset that is based on the first asset.
You can see over here that we run a query: we select the continent and the average population change from this population asset, which is referred to up here, then we group by the continent and order it in descending order, from highest population change to lowest.
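A sketch of those two assets, paraphrased from the description above. The module path and the column names (continent, population_change) are illustrative; the real Wikipedia table's columns differ and need renaming in the cleanup step:

```python
import pandas as pd
from dagster import asset

from .duckpond import SQL  # the SQL class sketched above


@asset
def population() -> SQL:
    # Scrape the UN list of countries by population with pandas.
    df = pd.read_html(
        "https://en.wikipedia.org/wiki/"
        "List_of_countries_by_population_(United_Nations)"
    )[0]
    # Wikipedia renders negative growth rates with a Unicode minus sign;
    # normalize it so the column parses as a number downstream.
    df = df.replace("\u2212", "-", regex=True)
    return SQL("select * from $df", df=df)


@asset
def continent_population(population: SQL) -> SQL:
    # Naming the parameter `population` tells Dagster that this asset
    # depends on the `population` asset above.
    return SQL(
        """
        select continent, avg(population_change) as avg_change
        from $population
        group by continent
        order by avg_change desc
        """,
        population=population,
    )
```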
So this is hopefully pretty straightforward. You should go check out Dagster software-defined assets if it's not clear how this works, but basically, this decorator plus the name of this parameter tells Dagster to set up a dependency between these two assets.
The next thing we're going to do is write a test case, so that we're doing a little bit of test-driven development here. Obviously, we haven't implemented anything yet that actually talks to DuckDB, but let's just write a test so we kind of know where we're going.
Let's open up test_assets within the jaffle tests package, and we'll save this. Basically, what we're going to do is call those two assets that we talked about, assert that we get that SQL statement back, then actually run the query and assert that the expected continent is the top one and that the population change matches the known, expected population change.
By the way, this test does call the population asset directly, which does go to the network, so in a real production application we would mock that out. But for the purposes of this tutorial, we're just going to leave that in. You'll also notice that we import this, so far fictional,
DuckDB class: we instantiate an instance of DuckDB, we call query, we pass it that SQL object, which represents the SQL statement, and then we get back a data frame.
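The test might look roughly like this. The module paths and the expected values are hypothetical; the real assertions depend on what the Wikipedia data actually contains:

```python
from jaffle.assets import continent_population, population  # hypothetical module path
from jaffle.duckpond import SQL, DuckDB


def test_continent_population():
    # Call the assets directly, as plain Python functions.
    pop = population()
    assert isinstance(pop, SQL)

    result = continent_population(pop)
    assert isinstance(result, SQL)

    # Actually execute the composed query against an in-memory DuckDB.
    df = DuckDB().query(result)
    assert df["continent"][0] == "Oceania"  # illustrative expected value
    assert df["avg_change"][0] > 0          # illustrative expected value
```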
So we should probably go and implement that DuckDB class. I'm going to go back here into the duckpond module, copy and paste my implementation of the DuckDB class, and we'll take it step by step.
I should probably pull this up here. We import the connect function from duckdb; if you've ever used SQLite from Python, you'll recognize that it tries to mimic that API. Hmm, that's not right.
There we go. So this basically has a constructor, and you can pass in custom DuckDB options, which we'll use during this tutorial. We create an in-memory, or temporary, database, and we load what's called the httpfs extension. This lets DuckDB talk directly to S3, which is actually really important: if you have a Parquet file sitting on S3, or on any kind of HTTP server, DuckDB is smart enough to only scan the parts of the Parquet file that it needs in order to fulfill a query.
This means that if you have a giant Parquet data set sitting on S3, you don't necessarily have to download the whole thing in order to fulfill all your queries; maybe it only downloads certain row groups, or certain columns, that it needs. This is a really great feature of DuckDB, and if you combine it with the fact that it's got really great parallelism, you can do a lot with this architecture.
And httpfs is key to that. So we set that up, we also enable any options that are passed into DuckDB for us, and then we're going to do a couple of things.
First, we're going to collect all the pandas data frames that are referenced by the SQL objects that were passed in. If you recall (let me go back to our assets), we actually pass in a pandas data frame as a binding here to the SQL statement, so we need a way to tell DuckDB to query it. This is one of the magic tricks that DuckDB does, by the way: not every SQL engine can just query a data frame that's sitting in memory in your Python process.
One of the great advantages of DuckDB is that the interop between Python and SQL is great. So we basically have to write a function that collects all the data frames referenced in the query, and then we call db.register, which is a DuckDB function that says: register this data frame as a SQL view.
Then you can query against that view. And then, finally, we're going to run the provided SQL statement. Because it has placeholders in it, the dollar-sign placeholders, we need a function that replaces those placeholders with actual valid SQL, and we'll write that function in a second.
Finally, we return a data frame as a result from DuckDB. By the way, it's worth noting that, according to the DuckDB documentation, queries against in-memory data frames are actually fairly efficient: I don't believe data is copied, or if it is, it's done fairly efficiently. So it's just really cool.
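Putting those steps together, the class might look like this minimal sketch. It leans on two helpers, sql_to_string and collect_dataframes, which are sketched further down; the method and class names are my paraphrase of the tutorial code:

```python
import duckdb
import pandas as pd


class DuckDB:
    def __init__(self, options: str = ""):
        # `options` holds extra `set ...;` statements, e.g. S3 credentials.
        self.options = options

    def query(self, select_statement: SQL) -> pd.DataFrame:
        # A fresh in-memory (temporary) database for each query.
        db = duckdb.connect(":memory:")
        # httpfs lets DuckDB read from and write to S3 (or any HTTP server),
        # fetching only the byte ranges a query actually needs.
        db.query("install httpfs; load httpfs;")
        if self.options:
            db.query(self.options)

        # Register every DataFrame referenced by the (possibly nested)
        # SQL object as a view DuckDB can query.
        for view_name, df in collect_dataframes(select_statement).items():
            db.register(view_name, df)

        result = db.query(sql_to_string(select_statement))
        return result.df() if result is not None else None
```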
So I'm just going to copy and paste this implementation in here, and I'll pull these imports up to the top.
A warning, by the way: we're going to be writing some recursive functions here, so they can be a little mind-bending, but I'll try to walk you through it. We basically need to take a SQL object, which has a string with placeholders in it plus a bunch of bindings, and those bindings can refer to data frames, to primitive Python values (if you want to put a string in your query, for example), or to other instances of this SQL object.
That last case is for when you want to embed one query in another. For example, in here, in this first asset we pass in a data frame, and in this second asset we actually pass in a SQL object, which references the first one. We need to be able to turn all of that into a SQL string that DuckDB actually understands, because DuckDB doesn't understand these placeholders.
So what we do is iterate through all the bindings. If we find a data frame, we reference the view that is going to be registered for it: we create a name, df_ plus the ID of the data frame. Every Python object has a unique ID number associated with it.
If we pass in an instance of a string, or an int, or a float, or any other kind of primitive Python type, we basically just convert it to a string here, and we use the sqlescape function
in order to safely escape strings. The most interesting case is when we pass in another SQL object: we recursively call this function, so we can substitute in a valid SQL string for it.
Finally, in order to do the actual string substitution, we use the built-in Python string.Template class. That's the thing that takes those dollar-sign placeholders and interpolates them. You can look up the Python documentation for it; it's not that commonly used, but it's in the standard library and it fulfills our purpose.
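A sketch of that function, assuming the escaping library is sqlescapy (which exposes a sqlescape function, as mentioned in the dependencies earlier):

```python
from string import Template

import pandas as pd
from sqlescapy import sqlescape


def sql_to_string(s: SQL) -> str:
    replacements = {}
    for key, value in s.bindings.items():
        if isinstance(value, pd.DataFrame):
            # Refer to the view that will be registered for this frame.
            replacements[key] = f"df_{id(value)}"
        elif isinstance(value, SQL):
            # Nested query: recurse and splice it in as a subquery.
            replacements[key] = f"({sql_to_string(value)})"
        elif isinstance(value, str):
            replacements[key] = f"'{sqlescape(value)}'"
        elif isinstance(value, (int, float, bool)):
            replacements[key] = str(value)
        elif value is None:
            replacements[key] = "null"
        else:
            raise ValueError(f"Invalid binding type for {key}: {type(value)}")
    return Template(s.sql).safe_substitute(replacements)
```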
So, the next thing: we've already implemented that, and the last thing we need to do to make this class work is implement the collect_dataframes function. So I'm going to go do that. Whoops.
We just need that type definition, and then I think I can just paste this in here. The collect_dataframes function, like I said, goes from a SQL object to a mapping of view names to data frames. If you remember from before, we created these view names, df_ plus the ID, so we need a function that walks through all of our different nested SQL statements and collects all the data frames that are used in those queries, so that DuckDB can know about them. So we again iterate through all the bindings, and if we see a data frame, we register it here in the return value.
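That function is the simpler half of the recursive pair; a sketch, using the same df_ naming convention as sql_to_string above:

```python
from typing import Dict

import pandas as pd


def collect_dataframes(s: SQL) -> Dict[str, pd.DataFrame]:
    # Walk the SQL object (and any nested SQL objects) and map each
    # referenced DataFrame to the view name used in sql_to_string.
    dataframes = {}
    for value in s.bindings.values():
        if isinstance(value, pd.DataFrame):
            dataframes[f"df_{id(value)}"] = value
        elif isinstance(value, SQL):
            dataframes.update(collect_dataframes(value))
    return dataframes
```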
I'm going to run pytest with the name of my package, and we'll see if it works.
The thing that you might find a little interesting, though, is that we have this working with DuckDB, but we haven't really brought in Parquet or S3 or any of the stuff that makes this a data lake yet. So, in order for us to do that, we need to implement something called an IO manager.
You can look up the documentation for Dagster IO managers, but basically, what an IO manager does is manage passing return values from one asset into another asset. When you're running in a test, for example (if I pop over to this test), you can see that we just call the function and then pass that value into the other asset. That's great for tests: it's really fast and really easy to test. But in a real production environment,
you're going to want to write the value to some storage and then read it back from storage when you actually run the second step, so you don't have to rerun all the steps of your pipeline every single time, and so you can potentially parallelize steps if multiple of them depend on the same piece of data. Dagster provides IO managers, which are the abstraction for doing this.
You register an IO manager with your app, and then it handles serializing and deserializing data, and reading and writing it to storage, for you. This keeps your business logic super clean and lets you transparently swap out your IO, which can be useful in different environments.
So I've just copied and pasted that in here. You can see that when we create an IO manager, we pass it the name of our S3 bucket, we pass it an instance of DuckDB, and then we pass in an optional prefix, to prefix the names of the Parquet files we're going to create. This is not the whole implementation; it's the first step, and I'm going to go through the implementation of this class step by step.
What we need to do is determine the name, or the URL, of the file on S3 for a given asset. Said another way: we've got this asset called population and it returns some data; where in S3 do we store that data?
In order to do that, we can get this context object (I'll show you where that comes from in a second) that represents the current asset, and we can call get_asset_identifier, which will get us a string representing the asset. Then we just construct a little Parquet file URL from the S3 bucket name, any prefix provided by the user, and that ID that corresponds to the asset. Now, by default, that ID is just going to be the name, population, but there are more complex Dagster features where you can add time-based partitioning, for example, and that will make the URL include that information.
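A sketch of the class so far; the two required IOManager methods follow below, and the names are my paraphrase of the tutorial code:

```python
from dagster import IOManager


class DuckPondIOManager(IOManager):
    def __init__(self, bucket_name: str, duckdb: DuckDB, prefix: str = ""):
        self.bucket_name = bucket_name
        self.duckdb = duckdb
        self.prefix = prefix

    def _get_s3_url(self, context) -> str:
        # The identifier defaults to the asset name (e.g. ["population"]),
        # but can include partition keys for partitioned assets.
        if context.has_asset_key:
            asset_id = context.get_asset_identifier()
        else:
            asset_id = context.get_identifier()
        return f"s3://{self.bucket_name}/{self.prefix}{'/'.join(asset_id)}.parquet"
```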
The next thing we need to do is handle writing to that storage. How do we go from the return value to a file sitting on S3? I will show you how; let me just copy and paste this in.
This handle_output function, or handle_output method, actually comes from the IOManager base class, and basically, given a context, which represents the asset, and what I'm calling the select statement (which is just the return value from the asset), it writes that value to storage somewhere.
I'll just walk through this step by step. If we're returning None from our asset (sometimes you've got an asset that's only logging something during debugging, and you don't need to store anything), we just return and do nothing. Otherwise, if we're not receiving an instance of SQL, we throw an exception: this Duck Pond system, with this particular IO manager, expects every asset to return a SQL object.
By SQL object I mean those objects that take a SQL template string and a set of bindings; it expects those to be returned. If you're returning anything else, you should be using a different IO manager. And then, finally, we just run a simple query.
We call query on our DuckDB instance, and this is the magic incantation to save something to S3 with DuckDB. Look at how simple that is: I just say copy this table to this URL in the Parquet format. By the way, the default here is to use the Parquet format with Snappy compression, but you can provide zstd or gzip or no compression as an argument. This is all provided by the httpfs extension that we installed earlier. But how easy is this?
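A sketch of that method, continuing the DuckPondIOManager class from above. Note how the nested SQL object becomes a subquery inside DuckDB's COPY ... TO statement:

```python
class DuckPondIOManager(IOManager):
    # ...continuing the sketch above...

    def handle_output(self, context, select_statement: SQL):
        if select_statement is None:
            return  # nothing to store (e.g. a debug-only asset)
        if not isinstance(select_statement, SQL):
            raise ValueError(
                f"Expected asset to return a SQL object; got {select_statement!r}"
            )
        # DuckDB streams the query result straight to S3 as Parquet
        # (Snappy-compressed by default).
        self.duckdb.query(
            SQL(
                "copy $select_statement to $url (format parquet)",
                select_statement=select_statement,
                url=self._get_s3_url(context),
            )
        )
```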
This is so great, and this is all the code that we need in order to write to our data lake. We're combining the great features of DuckDB, which are themselves built on the great features of Parquet, with the great abstractions provided by Dagster, to make this incredibly simple. And the read path is, in my opinion, even simpler.
Just like we have a handle_output method, we have a load_input method, which takes a context for the asset that's trying to load a parameter and returns the value of that asset. We create a new SQL object with a template string, and there's this read_parquet function in DuckDB that again talks to httpfs and does all of that magic around only reading the parts of the file that it needs, using HTTP range requests. All we do is pass it the URL of the file that we saved to up here.
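A sketch of the read path, again continuing the class above; in Dagster, an input's context exposes the upstream output's context, which is what lets us rebuild the same URL:

```python
class DuckPondIOManager(IOManager):
    # ...continuing the sketch above...

    def load_input(self, context) -> SQL:
        # Hand the downstream asset a query that reads the upstream
        # asset's Parquet file; httpfs fetches only the ranges needed.
        return SQL(
            "select * from read_parquet($url)",
            url=self._get_s3_url(context.upstream_output),
        )
```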
You can see that the URL we wrote to is the same as the URL we're reading from. And I think one thing that makes this really cool is that in our tests, we just pass those values directly into the function; we don't have to go through any sort of IO process, which makes your tests really fast. But then in production, we rewrite the SQL a little bit, totally transparently to the user, and you get the exact same semantics, but you also get a big, scalable data warehouse, which is really cool. Or at least a reasonably scalable data warehouse.
So now, before we're able to run this whole thing end to end, we need to actually connect this IO manager to our project. And actually, I think we have to get our indentation right first. Yeah. So, we have to connect our IO manager to our project.
For example, we need to figure out: what is our bucket name? How do we get an instance of DuckDB? Do we want to provide a prefix? And so on. So we're going to head on over to this repository.py file.
A repository is kind of what Dagster calls a project, and we're just going to add a couple of lines of code here. It's actually probably the most lines of code in this whole project, but hopefully they're pretty straightforward; no crazy recursion in here. First I need to add a couple of imports, and then I'm going to add what's called a DuckDB resource.
A resource in Dagster is basically a connection to some external service, and in this case it's connecting our in-memory DuckDB to the remote S3 instance: it's providing the credentials for S3, the URL, and so on. Well, I guess not the bucket name (we'll do that later), but it provides the access keys and such.
This is basically just the set of variables that we're going to set inside DuckDB in order to configure the httpfs extension to talk to S3. LocalStack uses an access key ID and secret that are both just "test".
We point it at localhost:4566, which is where LocalStack is running, and, just because of the way that S3 URLs are constructed and how SSL and DNS work, we need to set two more variables: one to disable SSL and one to use the path style rather than the subdomain style.
That's not super important, but basically, with real S3 you'd otherwise have to configure a bunch of DNS to make it work. Calling this .configured method takes the resource and applies a configuration to it, so now we have a new resource called duckdb_localstack, and we're going to use that resource in a second.
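A sketch of what that resource definition might look like. The s3_* settings are real DuckDB options for the httpfs extension; the resource name and config shape are my paraphrase:

```python
from dagster import resource


@resource(config_schema={"options": str})
def duckdb_resource(init_context):
    # Wrap the DuckDB helper as a Dagster resource so it can be
    # configured differently per environment.
    return DuckDB(init_context.resource_config["options"])


# DuckDB `set` statements pointing the httpfs extension at LocalStack:
# access key and secret are both literally "test", the endpoint is
# localhost:4566, SSL is off, and URLs use the path style.
duckdb_localstack = duckdb_resource.configured(
    {
        "options": (
            "set s3_access_key_id='test';"
            "set s3_secret_access_key='test';"
            "set s3_endpoint='localhost:4566';"
            "set s3_use_ssl=false;"
            "set s3_url_style='path';"
        )
    }
)
```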
The next thing we need to do: just like we brought in DuckDB connected to LocalStack and added it to our project, we also have to connect the IO manager that we wrote to our project.
You should just read the documentation for how this works, but we define an IO manager, which instantiates the class that represents our IO manager. We pass in the name of our S3 bucket, which is called datalake (you could potentially provide this via runtime config if you wanted to). We say that this IO manager requires a resource called duckdb, and then we pass that duckdb resource, which is an instance of the DuckDB connection, to our Duck Pond.
And then I just need to say that I want to apply the DuckDB instance and our Duck Pond IO manager to all of the assets in my project.
What I basically did was take load_assets_from_package_module, which loads all the assets in our project, and then with_resources says: hey, any time you want to use an IO manager, use the Duck Pond IO manager instead of the default one. And we also provide a resource called duckdb, which points at our LocalStack version of DuckDB, and that resource is used by, and passed into, the IO manager.
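Pulled together, the repository wiring might look like this sketch; the package and repository names are hypothetical:

```python
from dagster import (
    io_manager,
    load_assets_from_package_module,
    repository,
    with_resources,
)

from jaffle import assets  # hypothetical package holding the asset modules


@io_manager(required_resource_keys={"duckdb"})
def duckpond_io_manager(init_context):
    # The bucket name is hardcoded here, but could come from runtime config.
    return DuckPondIOManager("datalake", init_context.resources.duckdb)


@repository
def jaffle_repository():
    return with_resources(
        load_assets_from_package_module(assets),
        resource_defs={
            "io_manager": duckpond_io_manager,
            "duckdb": duckdb_localstack,
        },
    )
```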
So that's kind of a lot of typing, and a lot of keys for things, and I know that can be a little annoying when you first type it, but there are a lot of advantages here. Like, if you have a really complex project and you want some of your assets to be stored in one data lake and some other assets stored in another data lake:
for example, say you want to put some in the US and some in Europe, or some in your own data center with MinIO or something, Dagster gives you the flexibility to do that in a really elegant way, by just providing these resources to the proper set of assets. All right, so I think, if we've done everything correctly, we have created our example application and wired up our IO manager. Let's actually try running this thing.
You know, for real, like in production. So I just ran the command dagit; this is basically a user interface for Dagster. And if I go over here... all right, so I just had a quick problem with Gitpod, but I found a solution for it: you can just run gp url 3000, and this prints out the URL that you can use to access that port on your instance. And so you can see here, this is the Dagster
UI, and I've got two assets, the population asset and the continent population asset, and I can smash this Materialize All button, and it launches a run. Now we have Dagster running that computation for us: we fetch the data from Wikipedia and parse it with pandas, then we use DuckDB to query that data frame, take the results, and store them as a Parquet file on S3. Then we have the derived data, the continent population
asset. So now I can run awslocal, which is our AWS CLI pointed at LocalStack, and just ask: what is in my data lake bucket? You can see we've got those two Parquet files, and if we want to actually take a look at them, I'm going to need to install a quick dependency here to do it.
Yeah, this pyarrow dependency is just going to be used right now to view these files. I can download the file: I can say awslocal s3 cp with the path to the continent population file in the data lake bucket. Okay.
And so you can see now I've got that continent population Parquet file sitting there, and I can use pandas to read it. So I can say import pandas as pd, then read_parquet on the continent population file, and, if I match up my parentheses appropriately, yeah, you can see that we've got our data right here, created from our Dagster job.
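In other words, something like this; the local filename is whatever you gave the awslocal s3 cp command:

```python
import pandas as pd

# Inspect the Parquet file we just copied down from the (LocalStack) bucket.
df = pd.read_parquet("continent_population.parquet")
print(df)
```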
So I want to just recap what we saw. We've just built, in a fairly short amount of time and with not that many lines of code, a pretty useful data lake that lets you compose together different SQL queries, efficiently query data sitting on S3 in Parquet, orchestrate all of that with Dagster, and write tests against it efficiently.
This can very easily grow to more complex apps, and if you continue through the blog post, we implement a more complex one, the jaffle shop example. I'm not going to take you through all of the details here, but you can see in the blog post that we basically take the patterns we walked through here and expand them out to something that resembles an e-commerce store.
And so, as you can see, rather than reading from Wikipedia, we read from an example data set that's a CSV file, we do some additional pandas transformations, and our SQL queries are a bit bigger and a bit more complex, but it's basically all the same stuff that I showed you in that example. Maybe the one little addition in this final example is that the DuckDB resource that we pass in,
we actually use it to query directly at the end, so we can materialize our query results and inspect them. And so, yeah, this has been a really interesting exploration for us. Sandy and I have had a lot of great discussions about it, and even members of our community have come upon similar ideas. So with that, I want to just leave you with a thank you for sticking to the end.