Description
Network indexer roadmap - presented by @willscott at IPFS þing 2022 - Content Routing 1: Performance - https://2022.ipfs-thing.io
So this is sort of a state of the world on network indexing and storetheindex. It's a little bit higher level than what Andrew gave in the lightning talk earlier today, and also a little more forward-looking at the current paths we're thinking about. In particular, we've got a problem of scaling content routing, and we've got storetheindex as the current thing that's addressing it.
We can talk about where we want to go, what interfaces we've put in place, how we're thinking about evolving them, where we're heading, and the things that we don't know how to do right now.
This is a problem that we've brought up a few times, so I'm going to be quick about this part. When you look at the Hydra, it's got 2 to 4 billion records and 20,000 to 40,000 peers.
That means it's asking for about four megs of routing resources per peer right now. When we look at our network, storetheindex, we've got on the order of eight billion records, and those are coming from, if you think about Filecoin, maybe a thousand-ish miners or storage providers producing them.
So we've got a scale issue here, which is that the amount of content is growing a lot faster than the number of peers. That's one of the fundamental shortcomings of the DHT model of simply scaling the content with the peers: we can't just have a homogeneous pool of peers, we need to make use of the heterogeneity in resources, and the current DHT has some security issues you run into around Sybils and resiliency.
So there are a couple of things that we've, I think, become somewhat opinionated about in terms of what we need to make this problem tractable. One is that we think it's pretty important for the index providers, the producers of content, to have an ongoing interface where they say "this is the content that I have" and that statement stays accessible: they're responsible for tracking it and for letting it get pulled by the network, rather than just firing off "hey, I've got this CID"
as one-off messages. They need to remember the CIDs they've announced, in that string of advertisements, and be able to have multiple parties ask them about it. That gives them accountability: you can't go tell one party "I've got this CID" and then give a different manifest to someone else, because if we want any sort of reputation, you really do then need to actually serve the content
you're claiming. We need this ability to hold providers accountable, and it's also an interface that allows multiple competing indexing systems to pull: the content is now available, and it hasn't just been pushed into one network's system; another indexing system that's trying something different can also pull that content. That's what led to the structure that we're using in storetheindex.
Anyone producing content that they want indexed creates a chain of advertisements. The advertisements are signed, and each advertisement points to a bunch of CIDs, or rather multihashes.
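To make the shape of that concrete, here is a minimal Go sketch of an advertisement chain. The field names are hypothetical and simplified; real storetheindex advertisements are IPLD objects with their own schema, so treat this only as an illustration of the chaining idea.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// Advertisement sketches one link in a provider's advertisement chain.
// Field names here are illustrative, not the exact storetheindex schema.
type Advertisement struct {
	PreviousID []byte   // link to the prior advertisement; nil for the first one
	ProviderID string   // who can serve the content
	Addresses  []string // where to retrieve it
	Entries    [][]byte // multihashes covered by this advertisement
	IsRemove   bool     // true if this advertisement retracts the entries
	Signature  []byte   // provider signature over the advertisement (see the signing sketch later)
}

// adID derives a content address for an advertisement. A real system would
// use CIDs/IPLD; hashing the serialized form is enough to show the chaining.
func adID(ad Advertisement) []byte {
	b, _ := json.Marshal(ad)
	h := sha256.Sum256(b)
	return h[:]
}

func main() {
	// The provider extends its chain by pointing each new advertisement at
	// the previous head and publishing the new head.
	first := Advertisement{ProviderID: "provider-1", Entries: [][]byte{[]byte("multihash-1")}}
	second := Advertisement{
		PreviousID: adID(first),
		ProviderID: "provider-1",
		Entries:    [][]byte{[]byte("multihash-2")},
	}
	fmt.Printf("chain head: %x\n", adID(second))
}
```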
We have the index provider put out a proactive ping when there's new data available. That helps with latency, because it says "hey, there's an update available", and then the indexers fetch the delta of what they don't have yet: they walk back until the previous advertisement they've already seen, and that keeps them from getting overwhelmed.
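A sketch of what that indexer-side catch-up could look like, reusing the hypothetical Advertisement type from the earlier sketch: walk back from the announced head until an already-seen advertisement, then apply the collected delta oldest-first. The fetch, seen, and apply callbacks are assumed helpers, not storetheindex APIs.

```go
// syncDelta walks a provider's chain from the announced head back to the
// newest advertisement we have already ingested, then applies the missing
// advertisements in chain order. fetch, seen and apply are assumed helpers.
func syncDelta(
	head []byte,
	fetch func(id []byte) (Advertisement, error),
	seen func(id []byte) bool,
	apply func(Advertisement),
) error {
	var delta []Advertisement
	for cur := head; cur != nil && !seen(cur); {
		ad, err := fetch(cur)
		if err != nil {
			return err
		}
		delta = append(delta, ad)
		cur = ad.PreviousID // keep walking back toward what we already know
	}
	// Apply oldest-first so later adds/removes win over earlier ones.
	for i := len(delta) - 1; i >= 0; i-- {
		apply(delta[i])
	}
	return nil
}
```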
So if we end up with less headroom and we can't just respond to every ping, we've got a graceful degradation where we can collapse and do these fetches of multiple advertisements at a time. In particular, if someone is creating new advertisements every second and we decide that's really quite often, it's on the indexer side to tune that down and say "actually, we're going to pull every 10 seconds, or every minute" and get the content that way.
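A small, generic sketch of that coalescing, under the same assumptions as the earlier snippets (plain Go, hypothetical names, not the actual storetheindex ingest code): announces only mark the provider dirty, and a ticker drives at most one sync per interval, so a provider announcing every second still only costs one delta fetch per period.

```go
import "time"

// throttledSyncer collapses frequent announces from one provider into at
// most one sync per interval.
type throttledSyncer struct {
	dirty chan struct{} // capacity 1: "something changed since the last sync"
}

func newThrottledSyncer(interval time.Duration, sync func()) *throttledSyncer {
	t := &throttledSyncer{dirty: make(chan struct{}, 1)}
	go func() {
		tick := time.NewTicker(interval)
		defer tick.Stop()
		for range tick.C {
			select {
			case <-t.dirty:
				sync() // pull the whole delta once, however many announces arrived
			default:
				// nothing new this interval
			}
		}
	}()
	return t
}

// Announce records that new content is available without fetching anything;
// extra announces within the same interval are absorbed.
func (t *throttledSyncer) Announce() {
	select {
	case t.dirty <- struct{}{}:
	default:
	}
}
```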
So, where are we now? These are some distributions of how large the number of entries per advertisement is, and of how long these advertisement chains are growing.
And then, at the high end, you've got some getting into the order of millions, and that's someone like nft.storage, which I'm going to keep staring at through a lot of this talk as the one out on the edge of the storage-provider range of what indexes look like. The counterpoint is the number of entries in those advertisements, and what we're seeing there is again
this big jump area, which is the bulk of Filecoin deals: you've got between 32,000 and half a million CIDs in one of those 32-gig deals. That's about what we expect given the block size: if you're pushing in files and they're getting chunked into one- or two-meg blocks for a lot of that, then in your 32 gigs you're going to end up with under half a million CIDs. Then you've got a few that get up to that million-ish range, where it's something like the Filecoin state tree with a lot of really small blocks, and then you've got a few that are really small, where it's just a trickle of data, but there are very few providers doing that.
So this is per provider: if we counted the number of advertisements that had small entry counts, there would be a lot of those, but they're coming from relatively few providers. That's the way to read this, I think.
Okay, so where are we going? Right now, in terms of this year, I think the goal is: let's get it pretty fast and pretty reliable. We're not there yet.
I think we're doing pretty well on both of those numbers, and it's a matter of activating enough of the network to be able to say that we're actually doing this. When we look at storetheindex itself, for most of our lookups, at the point that your HTTP query or your network query reaches us, we get a response out in about one millisecond, pretty much all the time.
So we're reasonably happy with that, and then the question is: how much of the time do you hit us, and how long does the whole path take? For things like Hydra we're under 100 milliseconds. I think in order to get IPFS gateways to match this sort of target, we're going to need replication, so at least regional replicas, and we're going to need the Reframe operations hooked up so that we're not going through Hydras.
This goes to Juan's point about pushing stuff further out, which is great. We can say we're going to do three, or some set of, regional replicas, and at this level that's going to get us under 100 milliseconds. To get down to 10 milliseconds or something like that, we can take advantage of the fact that a lot of the content in the index
is never getting requested. So if we can just have a cache that keeps anything that actually gets asked for, and on a miss falls back to the full index the first time at least, that cache will probably be a few orders of magnitude smaller. And for content that only gets asked for once, maybe we're okay with that being a little slower than content that has been asked for before in the region.
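A minimal sketch of that on-demand regional cache, assuming a lookup keyed by multihash that returns provider records; the `origin` function stands in for a query against the full index, and all names are hypothetical.

```go
import "sync"

// regionCache keeps anything that has actually been requested in this region
// and falls back to the full index on the first miss.
type regionCache struct {
	mu     sync.Mutex
	local  map[string][]string               // multihash -> providers previously requested here
	origin func(mh string) ([]string, error) // lookup against a full replica of the index
}

func newRegionCache(origin func(string) ([]string, error)) *regionCache {
	return &regionCache{local: make(map[string][]string), origin: origin}
}

func (c *regionCache) Lookup(mh string) ([]string, error) {
	c.mu.Lock()
	provs, ok := c.local[mh]
	c.mu.Unlock()
	if ok {
		return provs, nil // hot path: asked for before in this region
	}
	provs, err := c.origin(mh) // first request in the region pays the full cost
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.local[mh] = provs
	c.mu.Unlock()
	return provs, nil
}
```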
A
So
that's
that's
sort
of
like
the
next
plan
that
doesn't
get
us
into
this
this
edge.
Yet
I
think
that's
going
to
be
this
incentive
problem
like
that's,
not
something
we're
going
to
try
and
talk
to
all
the
data
centers,
but
but
I
think
that's
that's
sort
of
like
this.
A
Where
we
would
like
to
go,
I
think
I
think
there
is
a
question
of
like.
Should
we
be
pre-caching
to
ipfs?
hosts. Is the on-demand cache, where we only fill it when we get a request, enough, or do we need to pre-build those caches and push them out in advance? Because even going down to regions, there is content that's hotter than whatever someone has already asked for in that region: there is a set of content that we know is going to get asked for when you spin up a new replica, and we should proactively push that. It shouldn't purely be on demand, like "oh, you got unlucky because your node restarted". So for the really popular stuff, like the front page of the New York Times,
you shouldn't have to fall back; we can push a bunch of that stuff in already. So we have to identify what this head of content-routing traffic is and get it packaged as an even smaller subset that is proactively pushed. That's going to be harder, but it seems like there's a tractable problem in there.
This is scoping down a little onto replication, the stuff we're working on right now. We want to be able to spin up new instances pretty quickly, where the critical factor for how long it takes to spin up a new replica is the bandwidth that node has: how fast can it pull down three terabytes of data? It's going to be hard to get around that, but it shouldn't be,
"oh, there's this storage provider that has data we care about, but it's really slow, so everyone who's starting up has to wait and re-pull the data from that source." We need to get to a point where the data doesn't have to come only from the authoritative origin; instead we can have archive snapshots, because this data is signed and I don't need to get it from the original source.
I can just go back to content routing and get the data from a Filecoin backup, or from a CAR file, or something like that. We need to be a little careful that we don't cause infinite loops of content
routing in doing that: we've already realized that if this goes into CIDs, then those themselves become provided records and we just make our lives harder. We would also like instances to be eventually consistent, so that if things did pause, there would be an even place to land. Right now, provider sets do not always stay equivalent enough that we're confident we have consistency, so we'd like to move towards stable snapshots. And then, as we finish this replication work,
the next thing we're starting to stare at and scope is how we get instances of either the index or these large caches near the gateways and make the gateway case work well. There are a couple of other things I'm going to throw under this replication heading that we haven't necessarily said when we're going to take on, but that we're thinking about. One is the consensus problem, and in particular one of the things we might do short term:
should there be a daily, or every-six-hours, consensus snapshot of what the index is for that period? That's not "every CID goes through consensus", but if you've got a stable snapshot where there is agreement about what the correct index is for today, or for those six hours, now you've got accountability for the indexing system: this indexer did not answer correctly against the snapshot that contained the record, and so you can start to penalize indexers that are wrong, because you know what correct is.
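If such a periodic snapshot existed, checking an indexer's answer against it is straightforward. This sketch assumes the snapshot is just an agreed map from multihash to provider set; how that agreement is reached is exactly the open question discussed later, so the types here are placeholders.

```go
// Snapshot is a placeholder for the agreed index state for one period:
// multihash -> set of providers that should be returned for it.
type Snapshot map[string]map[string]bool

// CheckAnswer reports whether an indexer's response for mh matches the
// snapshot exactly; a mismatch is verifiable misbehavior that could be
// penalized.
func CheckAnswer(s Snapshot, mh string, answered []string) bool {
	want := s[mh]
	if len(answered) != len(want) {
		return false
	}
	seen := make(map[string]bool)
	for _, p := range answered {
		if !want[p] || seen[p] {
			return false
		}
		seen[p] = true
	}
	return true
}
```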
That's a nice thing to have, and it is probably also useful for some of the privacy work, so it's likely either privacy or reputation that drives needing, or wanting, to do this. The other one is how much of this caching and proactive replication we want to take on: we'll probably have a cache that is on-demand fallback,
but whether we figure out hot content to proactively push is going to be another thing we need to work out on the performance side. Then, for scaling, we're working on checkpoints, as Andrew alluded to. We don't want to need a replay of full historical chains, and in particular for cases like nft.storage, where it's a really long chain, it would be really nice to just have a snapshot of all the content
that is current and can be re-aggregated. That's as much about letting providers not have to store all of that previous historical data as it is about the case where we've caught up but the next indexer hasn't; so there are reasons to have the chain rebundled for efficiency. We also think we will at some point need to shard our ingest layer: as we fetch from providers right now,
we have a centralized node doing that, and as the number of providers expands we'll probably want multiple nodes that take different subsets and then combine them. So that part probably splits up as the number of overall providers grows, and we need to figure out exactly how that then feeds into a single index.
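A sketch of the simplest version of that split, under the assumption that a provider's whole chain stays on one ingest node: hash the provider ID to pick the node, so each node polls a disjoint subset of providers and the results are merged downstream. Names are illustrative.

```go
import "hash/fnv"

// ingestShard deterministically assigns a provider to one of numNodes ingest
// nodes, so each node syncs a disjoint subset of providers.
func ingestShard(providerID string, numNodes int) int {
	h := fnv.New32a()
	h.Write([]byte(providerID))
	return int(h.Sum32() % uint32(numNodes))
}
```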
There's a question around performance here, which is: what is the need for a low-latency pipeline from content being newly published to being accessible, and is that all one system? Could we have a different streaming system for the first six hours, with maybe weaker consistency guarantees, and then content that's available in the next snapshot goes into the main indexing system? The main system has a more MapReduce-y, periodic feel, versus a streaming behavior for a much smaller but very churny amount of fresh content.
Those may be different systems, depending on where we end up. If I've got content that's newly available, I might want to push it out, and it may sit in some temporary cache and not get the same consistency guarantees until it gets put into a consensus snapshot.
And then the thing is also going to get bigger, and we'll have to figure out whether it ends up spread across multiple disks. We can already sort of handle this: we've got a lot of files, and it's easy enough to mount them across multiple disks, but we need to figure out what sort of sharding a data center or a rack will actually want, so people feel comfortable with those deployments as we get there. I think we're fine for a while, though; a ZFS pool will keep us going. I think it's really more that,
as we get higher query volume, we need to figure out whether the answer is more replicas, or whether the answer is "here is the one copy, and we've got lots of readers that can independently access it". There may be data-center cases where you do want the efficiency of having that one copy on fast storage with multiple compute nodes accessing it; that would be the sharding case, or the multiple-readers-over-one-instance case. And then we've got some trust questions.
We're not working on any of that right now, but the things on the horizon are figuring out signing and authenticity of records. In the Reframe provider spec we're trying to say that we should start signing the records that we're publishing; the current index writers are already doing this, so let's keep that ball rolling. And then we have some questions about whether we should be blinding or hashing or otherwise adding privacy in here, and what that looks like.
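As a shape for that, here is a minimal sketch using ed25519 from the Go standard library. The real records are signed with libp2p peer keys and a defined envelope format, so this only illustrates the sign-on-publish / verify-on-read flow.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

func main() {
	// The provider holds a keypair and signs each serialized record it publishes.
	pub, priv, _ := ed25519.GenerateKey(rand.Reader)
	record := []byte(`{"provider":"provider-1","multihash":"multihash-1"}`)
	sig := ed25519.Sign(priv, record)

	// Anyone holding the provider's public key can verify authenticity,
	// regardless of which replica or cache handed them the record.
	fmt.Println("record verifies:", ed25519.Verify(pub, record, sig))
}
```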
Some more immediate ones that we're likely to deal with at some point: we may have some malicious replicas. How do we handle those? Do clients cross-compare results from the different replicas that they can see, to make sure they're not just being given bad data?
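One simple client-side version of that cross-comparison, sketched with hypothetical lookup functions: query two replicas for the same multihash and only trust the result if they agree, trading extra latency and bandwidth for not trusting any single replica.

```go
import (
	"errors"
	"reflect"
	"sort"
)

// crossCheck queries two replicas for the same multihash and returns the
// providers only if both replicas agree.
func crossCheck(mh string, a, b func(string) ([]string, error)) ([]string, error) {
	ra, err := a(mh)
	if err != nil {
		return nil, err
	}
	rb, err := b(mh)
	if err != nil {
		return nil, err
	}
	sort.Strings(ra)
	sort.Strings(rb)
	if !reflect.DeepEqual(ra, rb) {
		return nil, errors.New("replicas disagree; treat the result as suspect")
	}
	return ra, nil
}
```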
There are some latency tradeoffs there, and some excess bandwidth, in making that the client's problem, but as you get to trustless operation you're either doing some crypto or you're making it the client's problem. And then there's also a feedback loop of how the clients find the right replicas in the first place. In the short term,
that may be some sort of gossip-based discovery of these instances in a permissionless world, or we may do some sort of consensus-based routing table of what's out there. I'm less worried about clients finding the fastest one.
I think we already have enough IP-geolocation-type databases that it's pretty easy for a client to make good guesses: given a list of multiaddresses, you can probably find the ones closest to you with reasonable precision, and then, if you keep some state about which ones have worked well previously, that plus your general IP map is probably going to do pretty well.
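A sketch of that client heuristic, with illustrative names: probe the candidate replicas concurrently (a plain TCP dial here stands in for a real request), prefer the fastest, and give a mild bonus to ones that have worked before.

```go
import (
	"math"
	"net"
	"time"
)

// pickReplica probes each candidate address, preferring fast responders and
// mildly favoring replicas that have worked well before. Returns "" if none
// responded.
func pickReplica(addrs []string, previouslyGood map[string]bool) string {
	type result struct {
		addr string
		rtt  time.Duration
		ok   bool
	}
	results := make(chan result, len(addrs))
	for _, a := range addrs {
		go func(addr string) {
			start := time.Now()
			conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
			if err != nil {
				results <- result{addr: addr, ok: false}
				return
			}
			conn.Close()
			results <- result{addr: addr, rtt: time.Since(start), ok: true}
		}(a)
	}
	best, bestScore := "", time.Duration(math.MaxInt64)
	for range addrs {
		r := <-results
		if !r.ok {
			continue
		}
		score := r.rtt
		if previouslyGood[r.addr] {
			score /= 2 // mildly prefer replicas that have worked before
		}
		if score < bestScore {
			best, bestScore = r.addr, score
		}
	}
	return best
}
```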
Some things we don't know how to do. The first one goes back to the DHT question: where do we root our notion of consensus about the correctness of an index snapshot, if we do that, and what is that baseline? Who learns what? Do we add limits on information leakage, and are we going to be able to do that efficiently enough that it makes sense, or is there going to be a tension there that we need to worry about? And then, what does the distribution of these indexes look like?
There's a website, cid.contact. The firewall here didn't like it this morning; I think we complained to them and it works now. There's also #storetheindex on the Filecoin Slack. And on research open problems: there is an open problem around privacy and private retrieval, which includes content routing, that we'll talk more about tomorrow. All right, happy to take questions.
[In response to an audience question] I think more integration. There are two sides: there's an ecosystem thing here, and there's only so many of those links that we can build ourselves. So be proactive: if you are building this, have it open source or generalizable, help us with docs, tell us where things don't make sense, so that the next people building those links can do it faster, because getting more clients and more providers is what makes this gain momentum and succeed.