From YouTube: Indexing and Interoperability on Filecoin and IPFS
Description
Learn more about indexing and interoperability across Filecoin and IPFS in this session from research engineer Will Scott.
Hi, my name is Will Scott, and I'm a member of the data systems team at Protocol Labs. We work on the data layer: a bunch of the data transfer protocols and stacks in the interplanetary stack that Protocol Labs helps maintain. This talk looks at content routing, how we find content in both Filecoin and IPFS, and, more broadly, at what that part of the system looks like today and some of the work we're doing to try and help scale it going forward.
I'm going to start by talking about where content routing is today, and then I'll move up the stack through this talk, so that by the end I'm talking more about how the providers who have content are publishing that content out to the network, and about those protocols.
Today, for something like IPFS, content routing happens through a DHT. What that means is that each piece of content is identified by a CID, a content ID, which is the hash of that content; the CID is the item that is both being requested and being published.
A
So
people
who
have
data
that
they
want
to
make
available
to
other
ipfs
users
will
publish
that
data
into
the
dht
by
finding
the
around
20
appropriate
nodes
that
make
up
the
mesh
and
and
putting
a
record
into
them,
indicating
that
they
have
this
content
and
so
it
it
is
a
abstraction
that
looks
a
lot
like
a
key
value
store
where
the
key
is
this
cid
the
hash
and
the
value.
Is
that
publisher
that
that
location,
Part of the way the DHT works, in order to prevent stale records (some publisher saying that they have something and then going away, but you still getting that answer a long time later and not actually being able to connect to them), is that records put into the DHT expire after a day. So publishers who want to keep content available in the current IPFS world republish every day, to indicate that they're still alive and still providing that content.
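As a concrete illustration (my addition, not from the talk), here is roughly what that publish-and-lookup cycle looks like against the go-libp2p Kademlia DHT. It's a minimal sketch: it assumes the host gets bootstrapped into the public DHT (omitted here), and the naive 12-hour reprovide loop stands in for the real reprovider logic in implementations like Kubo.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p"
	dht "github.com/libp2p/go-libp2p-kad-dht"
	"github.com/multiformats/go-multihash"
)

func main() {
	ctx := context.Background()

	// A libp2p host that participates in the DHT. (Bootstrapping into
	// the public network is omitted for brevity.)
	host, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	kad, err := dht.New(ctx, host)
	if err != nil {
		panic(err)
	}

	// The CID is the hash of the content: here, sha2-256 of some bytes.
	mh, _ := multihash.Sum([]byte("hello ipfs"), multihash.SHA2_256, -1)
	c := cid.NewCidV1(cid.Raw, mh)

	// Publish: put provider records on the ~20 closest DHT nodes.
	if err := kad.Provide(ctx, c, true); err != nil {
		fmt.Println("provide failed:", err)
	}

	// Records expire after about a day, so a provider re-announces
	// periodically to show it is still alive.
	go func() {
		for range time.Tick(12 * time.Hour) {
			_ = kad.Provide(ctx, c, true)
		}
	}()

	// Lookup: ask the DHT who can provide this CID.
	for info := range kad.FindProvidersAsync(ctx, c, 1) {
		fmt.Println("found provider:", info.ID, info.Addrs)
	}
}
```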
This has a couple of scalability limitations. The DHT setup grows very well if the amount of content grows with the number of participants: if you have something that looks like a blog, or everyone bringing their own library of data, then as more users join and the DHT grows larger, the data can also increase linearly at the same rate. But there are other quantities of data that are just too big for a DHT in this sort of paradigm.
We started facing these larger scales of data with Filecoin, and we are looking at what the next iteration of this protocol is that we can use to deal with this increased scale.
The focus, from my perspective, is that the thing we want to get right is the protocols; we'll keep the ability to iterate on the specific mechanism at a different rate if we've got good protocols on the two edges of this. If we have a good protocol for publishing and a good protocol for requesting and finding data, then we can allow ourselves some additional flexibility and freedom to experiment with different specific implementations of content routing.
What I'm going to talk through next are our thoughts on where we go on those two. The first is content routing: a user looking for a specific CID. We already have an interface in IPFS that is roughly what we might want, which is: when IPFS needs to find data, it makes a request.
Different backends can serve to implement this interface, and so I think we'll start to see this, or some abstraction that looks a lot like it, coming to a set of clients that are able to retrieve and provide data from some mesh of IPFS and/or Filecoin, and more broadly for content addresses. So that's our thought on the simple client side of this interface.
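For reference (my addition), the existing go-libp2p content-routing interface that this maps onto looks like the following; anything, whether a DHT, an indexer client, or a composite of several backends, can satisfy it.

```go
package routingsketch

import (
	"context"

	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p/core/peer"
)

// ContentRouting mirrors the interface defined in
// github.com/libp2p/go-libp2p/core/routing: the two edges of the
// problem, publishing and finding, expressed as Go methods.
type ContentRouting interface {
	// Provide announces that this node can serve the content for the
	// given CID.
	Provide(ctx context.Context, c cid.Cid, announce bool) error

	// FindProvidersAsync searches for peers able to provide the given
	// CID, returning up to count results on the channel.
	FindProvidersAsync(ctx context.Context, c cid.Cid, count int) <-chan peer.AddrInfo
}
```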
Okay, so then the next question is what we do to actually implement that. The work that my team is pursuing right now is an indexing solution, where we're trying to say: okay, can we just make something that happens to know the answer to all of these queries? This builds on a bunch of great work that's happened over the last couple of years. There was an initial implementation of a storage system
optimized for these CID hash-based keys, called storethehash, which Volker made. That has since been ported to Go, and we're now using it as the basis of a network index system that we're calling storetheindex. So let me give you a brief sense of what that's doing and how it actually works. It takes a CID and uses some bytes of that CID itself, which it assumes are essentially random (because it's a hash, in general). What it does is
use those bytes as a key into its index, which tells it whether it knows about that CID and where the relevant metadata entries for that CID are; it will then do reads to get those specific metadata entries. The metadata that we're imagining ends up being stored for a CID is: the provider who has it; where they are, the multiaddress (this is the full address info that we talked about; basically, it's how to find them); and then also what protocol they're speaking, plus relevant metadata that's protocol-specific.
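A rough sketch of the shape of such an entry (my own illustration; the real storetheindex value types differ in naming and detail):

```go
package indexsketch

import (
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/multiformats/go-multiaddr"
)

// ProviderEntry is a hypothetical shape for what an indexer stores per
// provider of a CID: who has the content, where to reach them, and how
// (over which protocol) to retrieve it.
type ProviderEntry struct {
	Provider peer.ID               // who has the content
	Addrs    []multiaddr.Multiaddr // how to find them on the network
	Protocol uint64                // transfer protocol code (e.g. bitswap, graphsync)
	Metadata []byte                // protocol-specific retrieval details
}
```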
That protocol-specific metadata is how you actually go about retrieving the data from that provider. This is one additional level of complexity compared with what we have had previously, because now we have multiple data transfer protocols, both Bitswap and GraphSync, as potential data transfer options, and we expect there may end up being additional ones; that's part of the extensibility and flexibility here.
The nice thing about this setup is that there's a fixed number of reads from disk, regardless of how many CIDs end up in the database. You do one read in the index to get a record chunk, and then you do a read to get the specific metadata entries that you want. The reason we do the extra lookup on metadata is that we expect that, for a given provider,
there are many CIDs. If you think about a Filecoin deal, that's typically 32 GiB of data, which has many, many CIDs within it, and for all of those the process that you should use to retrieve them is the same; so we can de-duplicate and spend a lot fewer disk resources.
If we do that deduplication, it means there are only two to three reads from disk for a given CID, regardless of how many CIDs are in the index. So the only thing we're really having to scale is disk usage, but we end up retaining a relatively low latency. We're optimistic this ends up being on the order of five or ten milliseconds to answer a query (sorry, microseconds, for the random-access disk seeks), even as the indexing database gets very, very large.
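To make the fixed-read property concrete, here is a toy version of the lookup path (my simplification; storethehash's real bucket and file layout is more involved). The bucket width and record framing are invented for the sketch.

```go
package indexsketch

import (
	"encoding/binary"
	"os"

	"github.com/ipfs/go-cid"
)

// lookup sketches the two fixed reads: one into the index file to find
// the record location for a CID's hash prefix, one into the data file
// to load the metadata entry it points at.
func lookup(indexFile, dataFile *os.File, c cid.Cid) ([]byte, error) {
	h := c.Hash() // the multihash; treated as uniformly random bytes

	// Read 1: use a few bytes of the hash as a bucket number, and read
	// the 8-byte data-file offset stored for that bucket.
	bucket := binary.BigEndian.Uint32(h[len(h)-4:]) % (1 << 24)
	var off [8]byte
	if _, err := indexFile.ReadAt(off[:], int64(bucket)*8); err != nil {
		return nil, err
	}

	// Read 2: load the length-prefixed record at that offset.
	offset := int64(binary.BigEndian.Uint64(off[:]))
	var lenBuf [4]byte
	if _, err := dataFile.ReadAt(lenBuf[:], offset); err != nil {
		return nil, err
	}
	rec := make([]byte, binary.BigEndian.Uint32(lenBuf[:]))
	_, err := dataFile.ReadAt(rec, offset+4)
	return rec, err
}
```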
Okay, so you've got this abstraction of a library that's able to serve individual indexing requests: it can answer, for a CID, who has it, relatively quickly, and it should be able to scale. But that's only part of the story. We've still got to answer: how does this indexing component actually learn about all of these CIDs?
And then I want to talk a little bit about how we see this evolving in the future, now that we've got interfaces; this is one solution, but it's certainly not the be-all and end-all. I'm sure there are already a lot of questions in your head: okay, if there's this centralized thing, aren't we building a decentralized web? And yes, of course we are.
So the other interface (and this is, I think, right now the focus, the thing and the protocol that we want to get right) is ingestion: how do we learn what data various storage providers, either Filecoin providers or large IPFS nodes, have, and get that reflected in the network index?
The current thought is that what we're expecting from storage providers is that they will have a hash chain of, basically, advertisements of what content they have, and they'll expose this list of advertisements, and the CIDs in those advertisements, as a thing that network indexers can get from them. The reason that we're structuring this as a hash chain of advertisements that are signed by the storage provider is that
this is part of that extensibility. It's not that the storage providers are pushing, or just saying "I've got this data", which would not be immutable and would allow them to provide potentially different views of their data to different indexers.
Okay, so what does it look like? Either a new Filecoin deal happens for a provider, or they get an additional set of CIDs through some other means.
They publish a new advertisement, and indexers learn about it; indexers also poll providers on some regular basis to see if they have new content. So gossipsub is a way to lower that latency in most cases, but there's a fallback to make sure you are keeping everything in sync.
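As a sketch of that announcement path (my addition; the real ingestion code lives in storetheindex, and the topic name here is an assumption), an indexer can listen on a gossipsub topic and keep periodic polling as the fallback:

```go
package ingestsketch

import (
	"context"
	"fmt"

	"github.com/libp2p/go-libp2p"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// listenForAnnouncements subscribes to a gossipsub topic on which
// providers announce new advertisements. The topic name is hypothetical.
func listenForAnnouncements(ctx context.Context) error {
	host, err := libp2p.New()
	if err != nil {
		return err
	}
	ps, err := pubsub.NewGossipSub(ctx, host)
	if err != nil {
		return err
	}
	topic, err := ps.Join("/indexer/ingest/mainnet")
	if err != nil {
		return err
	}
	sub, err := topic.Subscribe()
	if err != nil {
		return err
	}
	for {
		msg, err := sub.Next(ctx) // blocks until an announcement arrives
		if err != nil {
			return err
		}
		// The payload would identify the provider and the head of its
		// advertisement chain; the indexer then queues a GraphSync sync.
		fmt.Println("announcement from", msg.GetFrom())
	}
}
```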
Once a network indexer learns about this and adds it to its queue of providers that it needs to get updates from, it will use GraphSync to fetch, first, that advertisement, and then the list of CIDs that it does not yet have. There are a couple of reasons for doing it this way.
Among them, we expect that there's a bunch of deduplication in the CIDs: a deal that is made on Filecoin is likely made with multiple providers, and so, by having this structured through this IPLD setup, the indexer will only have to fetch a given list from one provider, not from all of them, and not duplicate that work, because it knows: oh, this is the same list of CIDs that I've seen in other places. So the indexer pulls that, and then adds to its index that these CIDs are now provided by this provider.
Okay, so the advertisements either introduce a new set of CIDs, or retract, saying "I am no longer providing some previous set of CIDs."
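Schematically, one signed link in such a chain might look like this (my illustration; the actual storetheindex advertisement schema is defined in IPLD and differs in naming and detail):

```go
package adsketch

import "github.com/ipfs/go-cid"

// Advertisement sketches one signed link in a provider's hash chain of
// advertisements. Field names are illustrative, not the real schema.
type Advertisement struct {
	PreviousID *cid.Cid // link to the prior advertisement; nil at the start
	Provider   string   // the storage provider's peer ID
	Addresses  []string // multiaddrs where the provider can be reached
	Entries    cid.Cid  // link to the advertised list of CIDs (multihashes)
	Metadata   []byte   // protocol plus protocol-specific retrieval metadata
	IsRm       bool     // true if this advertisement retracts the entries
	Signature  []byte   // provider's signature over the fields above
}
```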
Those are the two basic deltas. Then, in that advertisement layer, it says what metadata and protocol should be used to retrieve them, because we expect that for some large set of CIDs you've got one sort of mechanism. If you end up, as a storage provider, having different protocols or whatever, you would do that through multiple advertisements, each with the smaller set of appropriate CIDs that match that metadata. Other metadata that you might imagine is things like: is this available for free retrieval, or is it going to cost Filecoin to do the retrieval?
Okay, so an indexer has the CIDs and maintains them. But let's talk now about the expectation of the growth and evolution of the indexes. We're running one index right now, and one question is: as this scales, what's the plan? There are two ways that it can scale.
You can either scale in terms of having a lot more queries, or you can scale in terms of having a lot more data, and hopefully we scale in both. As we get more queries, likely what you want to do is have caches of recently queried CIDs that end up closer to clients; that takes load off of the actual primary index, because there will be a small number of CIDs that are requested many times and many CIDs that are almost never requested at all.
The index itself is then limited primarily by data growth: as those disks get really big, that gets expensive, or the index can't pull from the growing number of storage providers in a reasonable way. So you can imagine sharding it, where you end up with regional partial indexes: there's some index that looks at storage providers that are in North America, and a different one
that looks at storage providers that are in Europe. You can probably scale that another order of magnitude, potentially, just by sharding across regions; and then the local caches will send out parallel requests back to these multiple different partial shards, potentially preferring the local ones, because they want to connect clients with data that is closer to them by default. But if that fails, then they fall back to further-away shards, and that way you've got
a set of fairly tractable extensions that likely allow you to keep scaling this for a while. Okay, but that's all still within the scope of a single administrative trust region: there's some operator that's operating this whole system. And this is why the focus initially is on the protocols of both content retrieval on one side and storage provider ingestion on the other side: we don't want to be making the only network indexer, and we don't want to be running the only network index.
So there are a few different thoughts here. One is that the expectation is that pretty quickly we'll end up with federated replicas, where other entities are also running network indexes, and that gives you a couple of options. One is that clients can query multiple providers' indexes and can check them against each other, to make sure that no one is lying and that these are all operating in sync and have similar views into providers on the network.
The other thing, on the provider ingestion interface, is that you've now got this attested chain, which is not mutable, from each provider, saying what data they have. And that means you can do a dispute process where, if the index responds to a query either saying the provider doesn't have something when the provider does, or, vice versa, saying a provider has something that they don't,
you've got a process where you can show the attested record from the provider and the response from the network index and say: look, these don't correspond. You can figure out who is lying, or who has provided incorrect data, based on the underlying records from the provider. And that means you can now introduce a whole set of incentive games where you make it irrational for a network indexer not to provide the standard view of the network. So that moves us closer towards an untrusted setting, supporting potentially Byzantine indexers, where the indexer itself can be malicious and can provide incorrect information, because you've got a dispute resolution process for those who have grievances. We're trying not to fully specify a protocol for moving towards untrusted indexers yet, either.
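As a toy illustration of that dispute check (my addition, reusing the hypothetical Advertisement sketch from above): replay the provider's signed chain and compare the result with the indexer's answer.

```go
package adsketch

import "github.com/ipfs/go-cid"

// providedByChain replays a provider's advertisement chain, oldest
// first, and reports whether CID c should currently be listed for that
// provider. entriesOf resolves an advertisement's Entries link to the
// CIDs it advertises; it stands in for a GraphSync fetch.
func providedByChain(ads []Advertisement, c cid.Cid, entriesOf func(Advertisement) []cid.Cid) bool {
	provided := false
	for _, ad := range ads {
		for _, e := range entriesOf(ad) {
			if e.Equals(c) {
				provided = !ad.IsRm // a retraction flips the state back off
			}
		}
	}
	return provided
}

// A dispute then amounts to showing that providedByChain disagrees with
// the indexer's response for c, with the signatures on the chain proving
// the advertisements really came from the provider.
```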
You could imagine schemes, for instance, where, when I do a retrieval, there's some sort of finder's fee: the indexing subsystem, or the set of indexers that helped matchmake me to a given provider, gets a percentage of the retrieval cost, or gets some fee for doing that finding. And maybe there are indexes that are fast but only have some of the content, and they charge less of a fee, and then there are other, more expensive-to-run indexes that have all of the data but charge a higher fee.
So there's potentially some sort of market here, and before we've fully understood the set of constraints and incentives that are useful in structuring that market, we don't want to over-specify it. So we're expecting that that evolves, and that the right thing to get right in this iteration is those two interfaces, so that we can then iterate and change what component actually exists behind that networking interface without having to keep updating the protocol that either the providers or the clients are speaking. So, to finish off: we're working on indexing.
You can see the code: the network indexer, at storetheindex, and a reference provider, at the index reference provider. And I'm not doing this work alone; the whole data systems team that you see in these pictures below are really the ones to credit for this. So, with that, thank you. I'm happy to take questions on Slack or through any of these various Protocol Labs messaging channels, or you can reach me at willscott. Thank you.