Description
Protocol Labs' @mikeal, @chafey, and @rvagg will discuss tips and tricks and key underlying concepts for storing large data sets on Filecoin!
Keep up with events for the Filecoin community by heading over to the Filecoin project on GitHub:
https://github.com/filecoin-project
Check out the Filecoin community resources:
https://github.com/filecoin-project/community
And stay connected on Filecoin Slack:
https://app.slack.com/client/TEHTVS1L6
A: Yeah, welcome everyone. We're just going to give a few more minutes here to allow some more time for people to come in, but then we will get started.

A: Okay, let's get started. Welcome, everybody, and thank you so much for joining us today. We have a really exciting hour ahead of us with another Filecoin Master Class, and this happens to be one of the most requested ones we've had so far, so it's going to be a really great hour with a lot of great information.

A: So let's hop right into it. I'd like to start by introducing Mikeal, Rod, and Chris from Protocol Labs; they're going to be the ones taking us through how to prepare large data sets for Filecoin storage. So with that, Rod, would you like to get us started?
B: Okay, so I'm Rod Vagg, I'm here with Chris, and Mikeal is here as well. We are on the IPLD team. IPLD is the middle layer of the stack: it's concerned with the data layer and with how we connect pieces of data together in a distributed fashion.

B: One of my aims here is to get folks a little bit out of the standard IPFS frame of thinking and into thinking more about content-addressed data structures and how we link pieces together to form complex graphs, rather than just jamming everything into the IPFS way of building.

B: When Chris gives his talk after mine, he's going to cover how we used these approaches to process extremely large data sets in a very parallel way, so that they're suitable for storing in Filecoin and Discover. With that, I'm going to start by rewinding right back to the beginnings of content addressability and some of the basic terminology and concepts that you may or may not be familiar with, just to set the theme.
B: Let's start off by talking about Merkle trees. I'm assuming most people listening in have the very basics of content addressing: a content address, in our world at least, is a way of addressing content by its hash, its hash digest. Content and hash have a one-to-one relationship, and we can look up content by having its hash.

B: The term Merkle tree comes from a patent that Ralph Merkle filed in 1979, and the basic idea is that you can have hash digests embedded within hash digests, forming a tree-like structure. What this means is that at any point in the tree, the address, the hash, of that point authenticates all of the linked data underneath it.

B: So you can have a single hash that references many hashes underneath it. Importantly, the original patent diagram shows the classic-style Merkle tree, where you concatenate hashes to form a very pretty binary graph. A Merkle tree is not just a concatenation of hashes, although commonly in the real world, in Bitcoin and other places, you'll see the term Merkle tree used for a concatenation of hashes forming that very clean binary tree.
B: A DAG is something you might have encountered in Filecoin documentation; we use this term a lot. It stands for directed acyclic graph. It comes from graph theory and isn't necessarily to do with content addressability, but it's a useful concept. What it really means is that you have a graph of nodes. It's directed, so there's directionality to all of the links; in graph theory generally you can have bi-directional graphs as well, but a DAG is directed, so the links only go one way. And it's acyclic: there's no possibility of cycles within the graph.

B: So you can't link from a later node back to an earlier node. This fits really nicely with Merkle trees because of hash functions. A hash function gives us directionality; it's a one-way operation: you can only go, in theory anyway, from the hash of the data to the data. And there are no cycles: you can't link to data that doesn't already exist, data you don't already have a hash for. You can't leave a placeholder for a hash that gets filled in later, so there's no way to form a cycle.
B: So we often end up with terms like Merkle DAG, or DAG by itself, or Merkle tree. These terms get thrown around a bit, but that's what they mean. Now, to bring it into something a little more concrete: a classic thing people often want to do with content-addressed data structures is build a file system. This is very similar to the IPFS way of building a file system, and many other systems do similar things. In this example, we might have eight files.

B: We might just have an array of hashes, and that might be our directory; maybe that's all we're hashing. We hash those directories as well, whatever form they take, and then the hashes of those directories become addresses that we can point to. So our file system, in this very simplistic example, has ten independent chunks, each of which has its own address. You can see the directionality there, and the lack of cycles as well. So this is a DAG.
B: This might be familiar if you know anything about Git; this tree looks very similar, a little more complicated, but in the lower levels of the Git format we find exactly the same thing. You have what are called trees, which are essentially directories, and the trees point to blobs, which are essentially files. You can have trees pointing to other trees, but all the way at the bottom you have these blobs.

B: The blobs get hashed, Git uses SHA-1 at the moment, and those SHA-1 hashes get included in the trees, which point to other trees and blobs. You hash those, include the hash of the top tree within the actual commit, and then you hash the commit to get the commit hash that we're used to seeing. The commit itself contains not only a link to the tree and its blobs.
B: So it's built in the Merkle fashion, but the commit also includes other metadata, like the commit message, and it also includes a link to the previous commit. In this way we build a tree that grows over time and can mutate over time. In this example, on the left-hand side I've got the newer commits pointing back to older commits, and you can see that I'm mutating my files over time: in the first commit, on the right, I've got four files.

B: In the second commit I've added two more; in the third commit I've made a subdirectory, put some files in it, deleted a file, and added a new one, but I'm still pointing back to some of the original blobs, I've got some new ones, and I've got directory structure. This is a very familiar pattern in content addressing that you'll see repeated again and again: these acyclic graphs, using the hashes to authenticate the data all the way down at the leaves by including it at the top.
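A minimal sketch, not from the talk, of how Git derives a blob's content address in exactly this fashion: SHA-1 over a "blob <size>\0" header plus the file bytes, using only Node.js built-ins.

```js
import { createHash } from 'crypto';

function gitBlobHash(content) {
  const body = Buffer.from(content);
  // Git prepends an object header before hashing the bytes.
  const header = Buffer.from(`blob ${body.length}\0`);
  return createHash('sha1').update(Buffer.concat([header, body])).digest('hex');
}

// Should match `git hash-object` for the same bytes:
console.log(gitBlobHash('hello world\n')); // 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```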
B: So IPLD comes in here with some primitives that we think help with building these structures. The first one is the CID, which is our extension of a hash digest. CID stands for content identifier, and we use it as a self-describing content identifier, because a hash digest is really just an array of bytes: it could be any of a number of different lengths of standard hash digests, from different hash functions, and we also don't know what it points to without extra context.

B: We add a multicodec that tells us the content type, I'll talk about that in a minute, and that tells us essentially what the hash is pointing to: what is it a hash of, and what will we find when we get there? The CID also includes a multihash code, which tells us what hash function was used, and the multihash also includes the length of the hash, so we can see that right up front.
B: We know what the length should be and how many bytes to expect. There are some links on the slide, and in the Zoom chat as well, that will take you to the specifications for these things, and there are code repositories too. To break that down a little further into something that might be more familiar: CIDs can be represented as strings, and we use this thing called multibase, which is just a set of different ways of representing bytes as base-encoded strings.

B: Because we've got an array of bytes, we need to turn it into something we can actually print, so we use multibase to do that. The CID I'm showing you there at the top is in base32, and the "bafy" prefix is a very common one you'll see, for reasons that will be clear in a minute. The very first character tells us something important: it tells us what base the string is in.
B: In this one we've got the 'b', and in the multibase table we can look up 'b' and see that it means base32. We can use that to decode the string into bytes, and I've given you the hexadecimal there. In those bytes we have the original hash, the original content address, which I've put in bold on the right. The little prefix before it tells us a lot about what's going on.

B: Working through the prefix: first of all there's the version, CID version 1, which is the first byte, just 0x01. The second part is the codec, which is hexadecimal 0x70, and that tells us that the data we're pointing to is in dag-pb format, the typical IPFS UnixFS file format.
B: So we know what type of data this thing is pointing to, and when we get there we can decode it: we can pull up the dag-pb codec and say, decode this binary.

B: The next little bit is the multihash. It tells us the hash function, SHA2-256, which is hexadecimal 0x12, and then the number of bytes in the digest, hexadecimal 0x20, which is 32 in decimal. Then we decode those 32 bytes, and that gives us the hash we can use. When we load that binary, we can rerun the hash function and verify that it's the data we expected.
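A minimal sketch, assuming the JavaScript `multiformats` library, of pulling apart a CIDv1 the way just described: multibase prefix, version, codec, and multihash. The CID string here is just an illustrative base32 CIDv1; substitute any real one.

```js
import { CID } from 'multiformats/cid';

const cid = CID.parse('bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi');

console.log(cid.version);          // 1    -> CID version
console.log(cid.code);             // 0x70 -> dag-pb codec
console.log(cid.multihash.code);   // 0x12 -> sha2-256
console.log(cid.multihash.size);   // 32   -> digest length in bytes
console.log(cid.multihash.digest); // the raw 32-byte hash digest
```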
B: So that's what a CID is. They're very flexible; you can do all sorts of things with CIDs, and you'll see them a lot, sometimes in some novel ways too. Now, CIDs also have a version 0, which you'll commonly see around. It has been deprecated and we wouldn't advise actually using it in anything new, but you will encounter them: they start with a capital Q.

B: That tells us the multibase is base58btc, and we can use that to decode the rest. CID version 0 assumes dag-pb, it doesn't vary on the codec, and it also forces SHA2-256. So when you load a CID version 0, you know it's dag-pb and it will be SHA2-256.

B: This is the typical UnixFS-for-IPFS format. Everything newer is using CID version 1, and you'd be strongly recommended to use that because it carries a lot more information.
B: The second thing that IPLD brings is codecs. We use a table to map integer codes to codecs. A codec tells us how to decode and encode binary block data, and we have a lot of different codecs; a lot of them are not ours. We've got our own codecs, but codecs can be generic: JSON as a codec tells us how to decode strings into object data. CBOR is another one, designed specifically for binary, and it's much more compact. And then there's also just raw bytes.

B: If we have video or something where we don't actually want to decode it into a data model form, where we want to do something else with it, we use raw bytes. This is used a lot in Filecoin Discover, which Chris will be covering as well; there are a lot of raw bytes in there.
B: Codecs can also be other content-addressed formats that include their own implicit linking types, like Git and Bitcoin blocks. We can view them through an IPLD lens and interpret them with our own tools; we can produce CIDs when we instantiate data out of them. So we have codecs that will interpret these things and give us CIDs when we read them, but these are not formats we can write CIDs into; they've got their own hashes in them.

B: We just know what those hashes actually point to. And then lastly, we have the IPLD native codecs: dag-pb, dag-cbor, and dag-json are the main ones. dag-pb is used for UnixFS in IPFS, you'll encounter that a lot, and most of the data for Filecoin Discover is dag-pb and raw. dag-cbor is the recommended codec if you're building something new that's not just plain file data; it's compact and very flexible.
B: It will basically allow you to put almost any data shape into block form. dag-json is not something you'd be recommended to use for data you have a lot of, because it's not very efficient, but it's an interesting format to look at, at least. Moving on, let's get back to our example of the file system we're building. I want to point out the concept of a graph root, because this becomes a really critical concept when you're talking about managing large amounts of data.

B: In our example, directory one would be our graph root, and this means it's the single thing we need to hold on to in order to address the whole lot, like in the Merkle tree at the beginning. We only need directory one to be able to address and authenticate all of the files in our whole tree. We don't need references to all of these things in an index somewhere; directory one serves as our index, and we can traverse through the graph to get to everything we need.
B: We could also use directory two as our index if we only care about that subgraph. Perhaps we only care about those four files; we could just grab directory two and use that as our root for whatever purpose we need. The root is fairly arbitrary when it comes to these graphs, because graphs can be very large and can span across each other. We might only want subgraphs, or we could make an entirely new directory that references only part of our other graph, which might still exist.

B: In this example we care about four files down the bottom and two new files, so we've got one root that points at them. Roots are important if we care about mutability, and I know it sounds funny talking about mutability with content addressing, but mutability is something we can build on top of these essentially append-only data structures by caring about the root.
B: If we change the root, we can make things mutable. In this example, say I want to edit file number one. I can't edit the bytes in place, because the hash would change. So what I end up doing is making a new blob and hashing that, and then I have to include that hash in its parent, which is directory one, and that gets a new hash as well. Essentially I'm making two new nodes, but I'm still referencing eight different files.

B: One of them is different from the original, but I'm still referencing eight files and I've only created two new nodes. So I've got a mutable file system going on here, and a neat feature of this kind of behavior is that we can actually use it for snapshots: we could use it to roll back in time. If we built in some notion of garbage collection, we could get rid of the old nodes we don't care about. We can change it over time.
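A minimal sketch, my own illustration rather than the slide's code, of the "edit one file, bubble a new root" idea: raw blocks for file bytes, a dag-cbor map of name-to-CID for the directory, and a hypothetical in-memory `store` mapping CID strings to bytes.

```js
import * as dagCbor from '@ipld/dag-cbor';
import * as raw from 'multiformats/codecs/raw';
import { sha256 } from 'multiformats/hashes/sha2';
import { CID } from 'multiformats/cid';

// Encode a value with a codec, hash it, store it, return its CID.
async function put(store, codec, value) {
  const bytes = codec.encode(value);
  const cid = CID.create(1, codec.code, await sha256.digest(bytes));
  store.set(cid.toString(), bytes);
  return cid;
}

async function editFile(store, dirCid, name, newBytes) {
  // New leaf block -> new CID for the edited file.
  const fileCid = await put(store, raw, newBytes);
  // Re-encode the parent directory with the replaced link -> new root CID.
  const dir = dagCbor.decode(store.get(dirCid.toString()));
  dir[name] = fileCid;
  return put(store, dagCbor, dir); // only two new nodes were written
}
```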
B: IPFS does some of this natively, but it's important to think about roots, and how a root references the tip of your graph, the most important thing you care about. Mutability extends to anything you want to change in a graph. So this graph is some kind of arbitrary B-tree; maybe it contains file data, maybe it contains something else.

B: Maybe it's nicely formed, with rules about how it's formed. If I want to add, delete, modify, or do any operation on this thing, I just need to care about the root and how the modifications bubble up through the tree to give me a new root. Say I've done a bunch of edits in this tree and it's given me a new root: I've done some deletions, some additions.

B: I'm still referencing the bulk of the old tree, but I also have some new elements, and again I've got this snapshotting: I can look back in time, I can change which root I care about for which purpose, maybe I garbage collect old things. You can see what I'm getting at here, which is the concept that changes in these Merkle trees bubble up to the tip, it's the hashes that bubble up, and the tip is the bit you need to care about.
B: This next example is fairly simplistic, but it does extend to a real data structure that we have specified. I want to build a super large array, an array that can live in content-addressed land and be arbitrarily large, but that we can interact with in pieces. Perhaps this thing is so large I couldn't fit it in memory; perhaps it's so large I couldn't even fit it on my own disk.

B: It has to live out there in Filecoin, perhaps, or in some other content-addressed space. This thing is fairly generic: I'm not caring about how it's encoded, that's a separate concern; I'm building an algorithm here. The things I'm storing in this array are also generic: they could be links to other objects, or they could be simple values, or something else.
B: Let's start off with the naive case, which is to just put everything in one block. That obviously falls apart when you get to large sizes, because your blocks become unreasonably big. In IPFS, if you're storing things naively, the recommended maximum is about one megabyte; I think there's an actual maximum of two megabytes because of Bitswap, but the advised maximum is about one meg. Now, the size of your blocks will depend on your use case.

B: Perhaps you want smaller blocks for other reasons, because they're faster to load; there are various trade-offs with block size. But I can't just pack all of my elements into one block, so I need a way to extend beyond the block. What I'm going to do is say there's a maximum width for my array within a single block: any block in this array can only get up to a certain width, and I'm going to fix that in my example at five.
B: If I start with four elements and then add another one, I've got my maximum-sized array: it's full. In both the first and the second case I've only got one block, so that block is my root, and my root is my whole graph. When I want to mutate it, I get a new root, which is a new version of it. As long as I stay within five elements, I've still got just the one block.

B: If I want to push beyond five, then I have to start building a tree. In this algorithm I'm going to add a second sister block, also holding up to a maximum of five, and then I'm going to address both of those blocks in a parent block. I'm going to call this a height in my graph, so I've got a height of 2, and at height 2 I can fit five blocks of data, each with five elements, which gives me a maximum of 25.
B: The elements in my height-2 block are just links to my height-1 blocks; the height-1 blocks contain the actual elements I care about. If I overflow that capacity, I have to add a new height. In this second part I'm adding a new element, I've got 26 elements now, I've overflowed, so I have to add a new height, height 3, and my new capacity is 125, because now it's 5 to the power of 3. So you can see where this is going.

B: I can keep going arbitrarily large; it just means the tree also has to get taller at the same time, and all it does is take me to a new single root, one node I can hang on to that gives me the address of everything in the entire array. Now that's all well and good, except that I want to be able to get to individual elements without loading everything. So I have to have some way of getting from the root to the element I care about.
B: A block doesn't even know where it is in this list; it's just a free-floating block with five elements. I've got to be able to tell it: I want the fourth element along in your array. So at each level I've got to adjust my index, and there's an algorithm for doing that which gets you closer and closer as you descend towards the leaf, and it leaves you with blocks that are all just arrays.
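A minimal sketch, assumed rather than taken from the spec, of descending a fixed-width block tree like the one just described: width 5, `height` levels, with the height-1 blocks holding the actual elements.

```js
// Return the slot to follow at each level to reach element i (0-indexed).
function indexPath(i, height, width = 5) {
  const path = [];
  for (let h = height; h >= 1; h--) {
    path.push(Math.floor(i / width ** (h - 1)) % width);
  }
  return path;
}

// Element 26 in a height-3 tree (capacity 125 = 5^3):
console.log(indexPath(26, 3)); // [1, 0, 1] -> second child, first child, second element
```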
B: The hashes of those arrays give us the links, we use CIDs for that, and there's no other metadata included in any of these blocks; they're just arrays, which is a really nice format because it gives us some nice properties. It also gives us some unfortunate properties, but those are trade-offs we have to care about.

B: So let's look at some properties of this algorithm. A larger width means larger blocks but also fewer levels, so your traversals can be quicker but your block load times are longer; you've already got trade-offs in considering how wide your blocks need to be. There's also a mutation cost if you're adding one element at a time.
B: You're also discarding blocks at the edge of your array at a rate of up to the width, so there are garbage collection costs there as well. Get and size operations are efficient; they use roughly the algorithm I already showed, just one step per level down to height one.

B: For size, we traverse down the trailing edge to see how full each block at the end is. Appending data requires mutating a maximum of one existing node per level: when we append data at the end, the hash change bubbles up through the parent levels, and that's a nice property; it means we're not modifying huge chunks of the data structure.
B: Iteration is really easy: it becomes a left-to-right tree traversal, which is a really nice traversal. Slicing is only efficient if you perform it at the boundaries of the width. Say I want to take ten elements from the middle of this array and make a new array: that would be really easy if they fall within the width boundaries.

B: If I want to take elements 6 to 15, then I just need those two blocks plus a new parent block to address them; that's all I need. If I need anything else, I'm going to have to start shuffling things around, which is essentially rewriting the whole lot, so it gets messy. Prepending data is similarly costly because you end up rewriting things; there are ways to do it efficiently if you work at the width boundary, but you can see it gets messy.
B: This is the kind of thinking that goes into IPLD and how we build data structures on top of it, and hopefully that kind of thinking is useful for your application. For further reading, if you want to know more about IPLD and any of the things I've talked about, there were some links in the talk, but a lot of them you can reach through ipld.io.

B: If you go there, that's our documentation site; we're working on making it more informative. We have a specifications repo on GitHub, in the IPLD org, and that includes a spec for this array, which we call Vector, and also a spec for a HashMap, which uses a HAMT algorithm that you'll see again and again in this world.
B: The HAMT is a hash array mapped trie, and it's a way of building really efficient key-value stores across very large data structures; it's quite an elegant algorithm if you want to look into the details, just to learn how to think more about these things. That's it from me. I don't know if we want any questions, or whether we want to hop straight over to Chris.
D: Great, thank you. Hi everyone, I'm Chris Hafey, and I'm working with the IPLD team. Thanks, Rod, for that great introduction to IPLD.

D: I think many of you, coming from the IPFS and Lotus world, work with the APIs that are given to you, but there's a lot going on underneath the hood, which is what Rod just went through. One of the things the IPLD team is really focused on is these underlying building blocks, so you can build things like IPFS on top of them, and also your own custom applications.
D: Today I'm going to be talking a bit about how the IPLD team applied these primitives to prepare very large data sets for Filecoin storage, in a project we named Dumbo Drop. Here's how my presentation is going to go: a quick overview of Dumbo Drop, how we approached the problem, the architecture, and then some lessons learned and tips and tricks at the end.

D: Our hope is that you'll take away some tangible tidbits, at least a better understanding of how we approached a very large-scale data ingestion problem, and maybe some ideas on how you can do it yourself and solve your specific problems.
D: First of all, an overview: what is Dumbo Drop? Our goal was to process a very large amount of open data in a short amount of time for Filecoin. Mikeal Rogers is on the call; he's actually the brains behind this, I kind of just came in and cleaned it up a little bit, so most of this design work is his handiwork. But we're pretty happy with the amount of data that we've processed.

D: Over three petabytes of data has gone through Dumbo Drop, and it's amazing to see how it worked in terms of scalability. If you're a performance person, a scalability person, you like cloud stuff, you're going to get kind of excited by some of the things we've done here. But that was our goal: process a large amount of data in a short amount of time.
D: So how did we approach this? Most of our data was already in Amazon S3 buckets: there are public data sets like Landsat and whatnot, and we wanted to convert those into Filecoin. One of our strategies was to create our own custom application using the same underlying libraries that Lotus and IPFS use. If you've ever looked at the Lotus or IPFS projects, you'll see all sorts of package dependencies.

D: Hundreds of them, made by Protocol Labs, some by the IPLD team, some by different teams. What we found is that to really get the type of scalability and performance we needed, we had to go a level lower than Lotus and IPFS, so we went directly to the libraries and stitched together this whole pipeline ourselves. Number two is that we exploited AWS Lambda.
D: Lambda is a great AWS capability for serverless functions, and it allows us to get a very high level of parallelization at a very affordable price. The other thing is language selection: we actually didn't use any Go code, we wrote most of it in JavaScript, and the main reason is just speed. We could iterate really quickly; we didn't have a lot of time, but we wanted to move fast, and JavaScript allows us to move very, very quickly.

D: We used Rust as well, because there's some proof generation code written in Rust that's very high performance, which we didn't feel like rewriting in JavaScript, so we leveraged that too.
D: So what does the architecture look like? The first thing is that we have a massive amount of data that has to go through a whole pipeline: it begins in its raw state, and we have to produce a couple of output artifacts. What you're seeing here as vertical bars are the three stages: transform, aggregate, and CommP generation. Cutting across horizontally are the architectural layers: a storage layer where we store data, a processing layer where we do a lot of number crunching, and then an indexing and content database layer.

D: If I put the actual architectural blocks in here, you'll see there are three different S3 buckets. We had, I don't know, maybe 30 or 40 data sets in the three petabytes, so this is replicated over and over again, but for each source data set in a bucket we generate output data as IPLD blocks.

D: We had three different processing functions, one for each of the pipeline stages, and down here we used DynamoDB to store information about the whole processing state, as well as the CommP.
D: So that's our high-level architecture, and one of the takeaways is to think in terms of pipeline stages. When you're dealing with massive amounts of data, you don't want to do it all in one go; you want to break it into stages, and you also want to think of these as different architectural components.
D: So what happens? The data begins in the source data bucket, and the first phase kicks off. We pull data out of there, iterating over all the objects; we record the fact that we know about these objects in the database for tracking purposes, and then we convert them into IPLD blocks over here. As Rod said, there's a block size maximum, so sometimes we have to break a single file up into multiple blocks, and we do that.
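A rough sketch, my own rather than the Dumbo Drop source, of that transform step: stream a file's bytes and cut them into raw IPLD blocks of at most 1 MiB, each addressed by a CIDv1 with the raw codec and SHA2-256.

```js
import * as raw from 'multiformats/codecs/raw';
import { sha256 } from 'multiformats/hashes/sha2';
import { CID } from 'multiformats/cid';

const MAX_BLOCK = 1024 * 1024; // ~1 MiB, the advised maximum mentioned earlier

async function* toRawBlocks(byteStream) {
  let pending = new Uint8Array(0);
  for await (const chunk of byteStream) {
    // Accumulate bytes, then emit full-size blocks.
    const buf = new Uint8Array(pending.length + chunk.length);
    buf.set(pending);
    buf.set(chunk, pending.length);
    pending = buf;
    while (pending.length >= MAX_BLOCK) {
      const bytes = pending.slice(0, MAX_BLOCK);
      pending = pending.slice(MAX_BLOCK);
      yield { cid: CID.create(1, raw.code, await sha256.digest(bytes)), bytes };
    }
  }
  if (pending.length) {
    yield { cid: CID.create(1, raw.code, await sha256.digest(pending)), bytes: pending };
  }
}
```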
D: We keep track of all of that in the source DynamoDB database. This takes days to run, even in parallel against S3, and I'll explain why it's kind of slow as we go on, but that's the first step. The second step is aggregating it up: we take all the data we just produced, run it through another Lambda function, update the database, and store CAR files. That's like concatenating a bunch of IPLD blocks into a single big file.

D: A CAR is kind of like a tar file: it's a content-addressable archive, as opposed to a tape archive, same idea. It's a bunch of blocks back to back to back, and that's step number two. The third processing step is to take those CAR files and run them through the Rust proofs code; this runs in a Lambda, and we generate the CommP, which is part of the piece proof that Filecoin depends upon. I don't need to get into the details of some of this stuff.
D: In fact, this last part I'm not going to talk about at all; we don't even encourage you to get down to that level of detail, because you don't have to. We now have things like Powergate, which lets you target IPFS as the output of your pipeline and then use Powergate to move it directly into Filecoin, so I won't get into more detail on CAR files and CommP right now.
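A minimal sketch of the aggregation step, assuming the `@ipld/car` package rather than the actual Dumbo Drop code: write a set of blocks into a single CAR file with one root.

```js
import fs from 'fs';
import { Readable } from 'stream';
import { CarWriter } from '@ipld/car';

async function writeCar(rootCid, blocks, path) {
  // `blocks` is an iterable of { cid, bytes }, such as the output of the
  // chunker sketched earlier.
  const { writer, out } = CarWriter.create([rootCid]);
  Readable.from(out).pipe(fs.createWriteStream(path));
  for (const { cid, bytes } of blocks) {
    await writer.put({ cid, bytes });
  }
  await writer.close();
}
```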
D: So, a little bit of lessons learned. That's basically what we did: three petabytes of data ran through this system, across a large number of data sets. Some things we learned. One: there are limits on AWS S3 performance based on prefixes. The familiar name for them is directories, but Amazon calls them prefixes, and it will actually throttle your ability to read from S3 based on how data is partitioned into different directories or prefixes.

D: As I mentioned earlier, the longest phase we had was actually reading from the source data buckets, and the problem we ran into is that these source buckets weren't organized or designed with this prefix limitation in mind. What you'd have is one top-level directory called foo and then thousands and thousands of subdirectories and files underneath it. Well, S3 will rate-limit on that one foo prefix, and so there was only so quickly we could go.
D: We could only go into that directory, into that prefix, and pull out files for processing so fast, and that was a huge bottleneck we ran into. If you're like us and you have an S3 bucket you're reading from that you can't really change, there's not much you can do; but if you do have the opportunity upstream in your pipeline to partition your data with this prefix limit in mind, you'll have far better scalability and things will move much quicker.

D: Another thing: when you're dealing with extremely large amounts of data, petabytes like we're talking about here, you actually start running into reliability and performance issues in S3. The data is just so large, there's so much going on, it can be weeks' worth of processing sometimes, and Amazon will cycle servers, scale out your infrastructure, all sorts of things, and you have to deal with that.
D: We ran into a lot of hiccups we had to work around, where it would get slow or start returning errors we wouldn't expect and that aren't documented.

D: It would have been so much nicer if we could iterate through the whole bucket, take that as input from one stage into another, and directly generate the IPLD blocks as output. Random failures do occur, so make sure your pipeline is resumable, and build in retry logic.
D: I have a couple of examples of some really nasty code that's in Dumbo Drop to work around these things. Again, we were trying to do a lot in a short amount of time, and we don't exactly have the most beautiful code in some places; this is one of those areas where you look at it and go, what? It's just that we were getting errors we didn't know or understand.

D: AWS custom Lambdas are tricky: if you have to do a custom Lambda, using a non-standard language like Rust, the good news is there's a way to do it.
D: We had our Rust proofs code that we didn't want to port to JavaScript, so we had to figure out how to get Rust into a custom Lambda, and it was really challenging, because the base Docker image you build against is based on CentOS 7.6, which is quite old, and toolchains and whatnot change over time. It's hard to get current tech to work with old tech like that, so I spent quite a bit of time making that work.

D: The other thing is that there's a hard upper limit on Lambdas in terms of RAM and disk, and with what we're doing with large data sets we ran into both of those limits, in terms of RAM consumption and how much disk we were using, and by the way that includes temp files in the temp directory.
D: So we actually had to modify the Rust proofs code to be streaming, rather than writing the whole thing to disk, because we didn't have the disk to do it and we certainly didn't have enough RAM. It had to be turned into a completely streaming-based implementation.
D: Another lesson learned is that compute is way cheaper than storage. We're dealing with petabytes and petabytes of data, and the AWS S3 cost goes up quite a bit. We actually found in one month, when we were doing a lot of processing, that still only about a fifth of our cost was compute; four fifths of it was storage. In one case we did this massive conversion but didn't actually need the data for, I don't know, another couple of months, and we did the math and realized it was going to be cheaper to reprocess this huge data set in the future than to keep it in S3 for the next couple of months.

D: So we deleted it and reprocessed it a couple of months later, when we actually needed it. Storage is expensive, so think about this stuff: compute is cheap and plentiful, storage is plentiful too, but it's costly, so factor that into your design just like we did.
D: The other thing is that you need to be able to tune your concurrency parameters. Different data sets will run into different bottlenecks at different times, so if you hard-code things, or assume that because you got one data set working the next one will work the same way, you may find you need to tune some parameters to get optimal performance. Build that in; it's just good design.
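A small sketch of making concurrency tunable rather than hard-coded, using the `p-limit` package. The CONCURRENCY environment variable and `processObject` function are illustrative assumptions, not Dumbo Drop's actual configuration.

```js
import pLimit from 'p-limit';

// Read the concurrency level from configuration so it can be tuned per data set.
const limit = pLimit(Number(process.env.CONCURRENCY || 8));

async function processAll(objects, processObject) {
  // Every object is queued, but at most CONCURRENCY run at once.
  return Promise.all(objects.map((obj) => limit(() => processObject(obj))));
}
```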
D: It's especially important when you're talking about a highly parallel, highly concurrent system. And, you know, not all the work we do at Protocol Labs is our proudest work; this is an example, like I said, of us trying to move quickly and make stuff work in a short amount of time while dealing with some tricky things we ran into with AWS.
D: So we had to build in retry logic all over the place, with some kind of random timeout backoff, just to make sure that when you have a thousand parallel Lambdas running out there, they don't get into a kind of race, I don't know what you'd call it, where they're all hitting the same thing at the same time and overwhelming the system. So there's code like that, and there's another example where we just get random errors that aren't documented anywhere, errors that AWS gives you as it's scaling out or rebalancing under the load.
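A minimal sketch of the retry-with-random-backoff idea described here; the attempt count and delay numbers are made up for illustration, not Dumbo Drop's actual values, and `fetchObjectSomehow` is a placeholder for whatever call is flaking.

```js
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry(fn, attempts = 5) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= attempts) throw err;
      // Exponential backoff plus random jitter so a thousand Lambdas
      // don't all retry against the same service at the same instant.
      await sleep(2 ** attempt * 100 + Math.random() * 500);
    }
  }
}

// Usage: await withRetry(() => fetchObjectSomehow(key));
```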
D: So be prepared not just to solve the problem, but to keep your data pipeline running; it's not an easy thing. On to tips and tricks. One thing we did here is to consider building your own pipeline from the same libraries IPFS and Lotus use. You don't need to use IPFS or those APIs for everything; you can actually build a pipeline like we did.

D: We pulled in the libraries that they use to do things, and you can actually get much better gains in some cases by doing that, so don't be afraid to do it, although it does require some programming capability.
D: Lambda is a great tool if you haven't done serverless before; it's extremely powerful and affordable, and it's really a way to get incredible scalability, which, by the way, ties into what Rod was talking about: we have immutable, content-addressable data.
D: We talked about that prefix problem; one of the ways we got around it in our block store and our CAR store is that we use the content identifiers themselves as the object prefix. Content identifiers spread out very evenly, and therefore they don't run into that concurrency-limiting performance limitation S3 has. So we put the CID in front and then followed it with that same CID as the object name, and we got around this bottleneck in terms of performance.

D: Phase two, as you remember: phase one is to take all the data in the bucket and turn it into blocks, and phase two is to take that same data and turn it into CAR files. That second phase runs like 10 or 20 times faster, even though it's doing basically the same kind of work, because we don't have this prefix problem.
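A tiny sketch of the key layout described: using the CID both as the S3 prefix and as the object name so keys spread evenly across prefixes. The bucket name is a placeholder.

```js
function blockKey(cid) {
  const c = cid.toString(); // e.g. a base32 CIDv1 string
  return `${c}/${c}`;       // prefix = CID, object name = CID
}

// e.g. putObject({ Bucket: 'my-block-store', Key: blockKey(cid), Body: bytes })
```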
D: Fourth: transform your data into immutable IPLD blocks as soon as possible in your data pipeline. It just solves tons and tons of problems for you if you can have data that is immutable. This is a common theme: immutability is good, and the quicker you get into the immutable world, the easier and faster your life will get.
D: The last thing in terms of tips and tricks: be aware that AWS performance will change as it rebalances while your pipeline runs. One thing we'd run into is that we'd start up a big processing job and it would go really fast, then after a few hours it would get a little bit slower, and after a few more hours it would just come to a crawl; we'd be watching the logs going, what's going on? And then all of a sudden, boom.

D: When a given account is using a lot of resources, AWS will start tuning for your load, and while it's adjusting or rebalancing you get these hiccups in speed. So don't be afraid of that, get comfortable with it, because when you start moving into high-scalability, massive graphs like this, you start pushing the limits of what AWS can do in ways that no one else has.
C: Yeah, I did look at that early on. I mean, if you're really, really diligent about it, you can probably squeeze some more performance out of EC2.

C: The issue is that if you spin up a box with a bunch of cores that you're then going to balance work onto, the moment you stop using it you're just eating all of that cost. We were constantly tuning and tweaking and iterating on this as we were building it, so it was never going to be the case that we could always saturate all of our resources, and it was a massive amount of extra work to try to spin up some kind of clustering around all this. And honestly, Lambda keeps getting cheaper, and in general it's just cheaper to do anything with than EC2.
C: The whole model for Lambda seems to be: if we give you basically free processing, you will generate data that you then have to store, so the model is that Amazon just wants to eat all of your money in recurring storage costs. As an example, we processed two petabytes of data, and the storage bill for storing it that month was four times the compute bill to process it, and that storage bill was going to recur every month if we didn't get rid of it.

C: So there's just a huge delta there. Lambda being so flexible and so easy to use compared to EC2 was kind of an obvious win, and we weren't really going to split hairs on the cost difference versus EC2, because the storage was always going to cost us way, way more.
C: The main thing you have to make sure of is that any other services or storage you talk to are in the same region, so you don't hit the transfer costs as well, but other than that it's pretty good.
E: There we go, okay, sorry about that. I think you may have already covered it a bit, but I was just wondering if you could talk a little bit more about why, for Dumbo Drop, you're making the recommendation to transform the data into IPLD blocks as early as possible in the process.
D: Yeah, that's a good question. I probably didn't clarify the value of that. Once you move into an immutable world, you gain all sorts of properties you don't have in the mutable world. You have complete data validation: you can verify, bit for bit, that the data you're working with hasn't changed, and that is going to save you a lot of headaches when you're diagnosing AWS failures and whatnot. You know, did it fully process this file?

D: If not, how far did it get, and do I need to reprocess the whole thing? The sooner you move into the immutable world, the more knowns you have in your problem set than you had before, in a very variable, dynamic environment like AWS with large data. That's probably one of the biggest reasons, but I could list many more. The other thing is just tooling, right?

D: All of the Protocol Labs libraries work with CIDs and blocks, so as soon as you're in that world you can start using all of those tools to your benefit, as opposed to hand-rolling your own tools and whatnot.
C: Otherwise you basically just have to reprocess everything, because you have no guarantee, nothing you can check, to really figure out which data matched and which didn't. Whereas once you hit an immutable state, you then have new immutable states, or some immutable reference, for everything after that, and so it becomes really easy to go and find all the data you may have messed up. And you know you're going to have bugs, so you might as well plan for that early on.
E: I have another question too, which is for Slingshot, for people who are maybe just doing this for the competition or something. What would the recommendation be? Because I know you all were doing this for an insane amount of data.

E: And I'm not sure to what extent you wanted it to be a super repeatable pipeline or whatever, but if someone's just trying to do this as a one-off thing, what's the easiest way to do it now, especially given what Chris mentioned before, that Powergate is maybe a thing? I don't know how much that changes it, but what would be the recommendation now if you were just trying to do a one-off ingest?
C: Well, it always depends on the amount of data and the shape of the data. If you have a lot of data, and you know that if you were just to shove it into IPFS on your machine it wouldn't complete until after Slingshot is over, then you have to figure out a way to parallelize it. And then the question is: do you have a bunch of data that is really large files, or do you have a lot of data that is a lot of really small files?

C: That really changes the approach. But one big thing, and Chris mentioned this: do not do what we did, which was to actually create the CAR files, generate CommP, and use the offline flow for the deals. You're going to run into a lot of things that just don't work as well as you want them to, but also it's a really sort of unnecessary step.
C: Unless you're shipping drives to people like we are, it's much easier to just prepare the data in a format that IPFS can accept, then get an IPFS node up and connected to that data you processed, and now you can use Powergate to do the deals, right? That's way easier. Then, in terms of how you decide to parallelize the processing: use Lambda if you need that much compute.

C: I mean, we had to have them increase our concurrency limit from a thousand to ten thousand just so we could get through all this data quickly enough. It turns out that when you go over about 3,000, though, there are other infrastructure things that will keep you from actually going above that, so you can't use it for our use case quite as much as you would like.
C: Or maybe it's just using a bunch of cores on your machine and parallelizing that way. The thing to remember about these graphs is that if you're doing operations serially, it's very slow; it's one of the natures of these immutable data structures that they're really expensive to create and mutate serially, but they parallelize incredibly well.

C: So if you can figure out how to isolate sub-components of the graph, then you can parallelize all of those as much as you want. For us, that really became: oh, we can process every individual file. And then when we have a lot of small files, we'll hand, you know, 100 or 1,000 files to one Lambda and say go and do all of these; or if it's a large file where we know we're going to go over the Lambda time limit, we'll cut that file up into parts and give a part to each one. So it really changes what you want to do, based on the shape of your data.
E: Nice, thanks. And then, sorry, last question for me: what are some of the fields that you maintain in the database that you use for indexing all the information?
D: Yeah, so we need to store the paths to the source file and then to the output blocks and the CAR files. I actually have it all documented; I need to refresh my memory while I pull it up. Why don't you answer, Mikeal, if you remember it all?
C: Yeah, so one thing to remember that's a little bit unique about our data set, and the way we have to do these CAR files, is that each CAR file is about a gig, because that's how much we can process in parallel.

C: We take a bunch of those and do one deal with them, and then we have information about that deal, and we need to be able to figure out: okay, for this source data, for this one URL to a file, where is that in the Filecoin network? So one of the tables is just all of those origin URLs, for, I think, literally billions of files.
C: I think we're in the billions now. That points from the origin URL to which CAR file it's in; then we have a database of all the CAR files and their CommP generation, and from there you can go: okay, that's the CAR file, what deals did we do for that CAR file? Then we can find the deals and get the miner IDs and all of that, so we can work back from there to actually access any of the data.

C: One of the things Chris did that was really good was that we stopped using one table for all of the data and started doing a table per data set, so we have billions of rows, but they're now split across more tables; and then we have one table for all of the CAR files and CommP, because there are fewer of those, since we're compacting so many things together into one.
C: There are still millions of them, because we did about three petabytes of data, but it's not billions. Also, Dynamo is terrible, it's really bad, and in each of these rows we have to put a list of all of the hash IDs for all of the individual raw blocks, and that actually gets a little bit bigger than Dynamo tends to like having in a row.
E: Awesome, thanks, and this schema link is really helpful too. Thank you, Chris.
C: We had very early ship dates in our minds at the time, and it was literally, we don't think this will finish running unless I start running it now, so I was writing it as fast as I could. It was a very impressive thing that it was doing, but it was some of the worst actual code I may have ever written, so Chris did a great job cleaning all of that up.