From YouTube: How Elastic IPFS content is discoverable in the DHT - @ShogunPanda, @alanshaw - Content Routing 1
Description
How Elastic IPFS content is discoverable in the DHT - presented by @ShogunPanda and @alanshaw at IPFS þing 2022 - Content Routing 1: Performance - https://2022.ipfs-thing.io
A: Hi everyone, I'm Alan, and Paolo and I are double-teaming this talk, so it's going to be good fun. This talk is about how all of the provider records for Elastic IPFS content are discoverable in the DHT, so we're talking about all of the content that was ever uploaded to web3.storage and nft.storage.
A: First of all, let's talk a little bit about why we're talking about discovering content. The problem we have is that we're at scale: nft.storage and web3.storage have over 90 million uploads between them. That's over one and a half billion blocks. We have over 40 IPFS nodes in clusters trying to write provider records to the DHT. It's just a lot of stuff, and so we bought ourselves some time by turning on the accelerated DHT.
A
When
that
became
available,
we
we
switched
our
data
store
from
badger
to
flat
fs
for
to
eke
out
that
extra
performance.
We
also
recently
considered
actually
just
changing
our
providing
strategy
to
just
roots,
but
we
kind
of
didn't
really
want
to
do
that,
and
luckily
we
wouldn't
have
to
there's
also
wider,
as
I
think
will
touched
upon
as
well.
A
There's
wider
concerns
about
being
a
good
citizen
on
the
ipfs
network,
like
the
number
of
records,
we're
expecting
other
peers
to
store,
as
well
as
like
the
bandwidth
costs,
to
kind
of
send
in
send
them
to
them
and
continually
do
that.
It's
quite
it's
quite
a
lot.
A
I
think
will
said
it
was
like
four
meg
per
pier
of
just
provider
records
that
doesn't
seem
cool,
and
so
what
we
found-
and
it's
not
super
surprising-
is
that
at
some
point
there's
just
too
many
cids
for
a
single
node
to
provide
to
the
dht,
and
it's
also
kind
of
annoying
to
pinpoint
that
tipping
point.
A
If
you've
got
like
popular
content
and
your
node's
gonna
be
really
busy
with
reading
stuff
as
well
like
it
depends
on
things
stupid
things
like
systems
like
the
system
file
system
type
the
like,
if
you
rated
your
disks,
if
you
got
like-
and
it
varies
from
disk
to
disk
and
stuff
like
likes
like
that-
which
makes
it
really
difficult
to
know
when
you've
put
too
much
stuff
on
your
node,
but
in
general,
just
don't
let
them
get
too
big.
So
this
is
a
it's
not
super
easy
to
see
on
the
screen.
A
Unfortunately,
but
it's
a
graph
of
the
percentage
of
dht
records,
found
that
say
that
this
particular
peer
has
this
content
and
it's
over
a
period
of
about
a
month
earlier
this
year
and
it's
for
one
of
the
oldest
nodes
in
the
cluster.
That
node
has
got
about
70
terabytes
of
data
on
it,
and
it's
not
doing
great
they're,
very
low
low
percentages
of
found
provider
records
in
the
dht
and
just
so
you're
clear,
like
we're
checking
for
for
content.
A
We
know
is
stored
on
this
node,
so
we're
expecting
to
see
a
dht
record
on
the
dht,
the
y-axis.
Only
goes
up
to
55,
so
yeah
that
node
is
not
having
a
good
time.
A
So,
conversely,
with
like
20
terabytes
of
data,
things
are
a
bit
better,
still
not
great.
It's
got
some
brief
periods
of
awesomeness,
but
it's
mostly
bad
and
you
can
sort
of
see
this
slow
decline.
As
the
disk
fills
up,
it
gets
worse
and
worse
at
being
able
to
provide
stuff
to
the
dht,
so
there
you
are
so
for
comparison.
This
is
elastic
ibfs.
This
is
when
this
is.
This
is
happy
days.
A
This
is
when
the
and
nodes
came
online
and
started
schlepping
up
all
of
the
advertisements
that
we'd
been
writing
and
just
to
be
clear.
This
is
all
of
the
data
like
we
have
or
everything
and
it's
we
went
from.
Nearly
zero
percent
found
dht
records
to
like
almost
100
all
the
time,
and
it's
stayed
this
way
ever
since,
and
it's
continually
slurping
up
stuff.
This
is
like
hundreds
of
terabytes
of
data
billions
and
billions
of
cids
and
yeah.
So
it's
things
are
good
at
the
moment.
A
That's
really
cool,
but
how
does
elastic
ipfs
achieve
this?
Well,
it
makes
use
of
the
indexer
nodes
and
they
are
purpose-built
to
map
cids
to
content
providers
for
the
scale
of
the
file
coin
network
and
we're
we're,
I
guess,
relatively
small
comp
compared
to
the
the
whole
of
the
file
coin
network,
but
essentially
indexer
nodes
work
by
you
ask
them
like
who
has
a
cid
and
it
tells
you
who
has
it
pretty
simple
right.
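As a concrete sketch of that request/response, here is roughly what an indexer lookup can look like over HTTP. cid.contact is the public network indexer; the response shape below is simplified for illustration and should not be read as the authoritative schema.

```typescript
// Minimal sketch: ask a network indexer who provides a CID.
// The response shape is a simplified illustration, not the
// authoritative schema.

interface ProviderResult {
  Provider: { ID: string; Addrs: string[] }; // peer ID + multiaddrs
}

interface FindResponse {
  MultihashResults: { ProviderResults: ProviderResult[] }[];
}

async function findProviders(cid: string): Promise<ProviderResult[]> {
  const res = await fetch(`https://cid.contact/cid/${cid}`);
  if (!res.ok) throw new Error(`indexer lookup failed: ${res.status}`);
  const body = (await res.json()) as FindResponse;
  return body.MultihashResults.flatMap((r) => r.ProviderResults);
}

// Usage: print the peer IDs that claim to have the content.
findProviders('bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi')
  .then((providers) => providers.forEach((p) => console.log(p.Provider.ID)));
```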
A
Obviously,
someone
has
to
tell
the
index
of
nodes
before
you
ask
them
who
who
has
it
and
there
are
actually
multiple
ways
to
tell
an
index
of
node.
I
mean
two
ways
that
you
have
a
cid,
and
so
paulo
is
going
to
tell
you
specifically
how
elastic
ipfs
does
it.
B
Hello
again
folks,
so
as
alan
was
anticipating,
we
have
two
ways
to
ingest
data
to
the
indexer
node
using
the
direct
api
they
provide.
The
first
one
is
based
on
lib
p2p
and
gossip
sub,
while
the
other
one
is
a
two
phases:
http
based
one,
let's
analyze
the
first
one
leap,
b2p
and
gossip
sub
right
now.
Let
me
ask
you
one
question:
how
much
money
do
you
have?
B
So
will
impact
costs
as
well,
so
we're
gonna
throw
a
lot
of
money
out
of
nothing
basically,
but
there's
even
even
more,
even
if
even
lsc
you
have
infrit
money,
there
are
technical
limits
that
cannot
be
bypassed
very
easily.
The
main
one
is
that,
as
francisco
was
anticipating
this
morn
this
morning,
is
that
eipfs
from
the
outside
is
just
I
regular
node
like
every
anybody
else's
one
peer
id,
which
seems
to
be
just
a
machine,
but
actually
our
tons
of
machine
right.
B
The
implication
of
that
is
is
that
there
is
an
optimization
in
the
lib
p2p
stack.
Is
that
says
that
if
you
try
to
connect
to
a
destination
to
the
same
destination
from
two
different
sources
which
which
share
the
same
peer
id
lib
p2p
will
try
to
merge
the
streams
in
a
single
connection
and
drop
the
other
one?
Therefore,
the
communication
will
be
dropped
and
not
effective.
We
cannot
support
that.
So
let
me
introduce
you.
B
A
very
old
friend
which
is
http,
that's
lightweight
is
is
everywhere
everywhere
is
stateless
and
the
the
long-living
connection,
which
is
basically
http
live,
are
opt-in.
So
if
you
want
to
use
it
whether
this
also
applies
on
the
client
on
the
server,
the
server
can
refuse
a
keep
a
live
connection
from
a
client
if
they
want
to.
So
that's
the
ideal
situation
from
cloud
environments
because
it's
very
lightweight
and
you
can
drop
as
soon
as
you
need
it.
So
we
use
that
right
now.
B
How
do
we
use
that?
The
ingest
api
of
the
indexer
nodes
requires
us
to
provide
the
advertisements
and
entries
data
over
http?
That's
it!
That's
all
they
ask
us.
Moreover,
we
need
to
maintain
an
additional
head
link
to
the
latest
advertisement
that
that
has
been
published,
because
the
entire
idea
behind
this
indicator
node,
is
that
there
are
blockchain
approach
to
this
thing,
so
you
can
reconstruct
to
the
hell
to
to
the
from
the
edge
to
the
tail
of
the
cube
anytime.
B
So
if
you're
not
this
gone
drops
you
do
your
disks
damage
and
so
forth.
You
can
reconstruct
from
the
beginning.
Now,
I'm
not
saying
you
can
really
do
that
because,
in
order
to
reconstruct
billions
of
records,
you
might
probably
will
take
your
centuries,
but,
theoretically
speaking,
you
could,
if
you
want
to
really
waste
a
lot
of
time
there
anyway,
once
you
have
all
the
data
available,
the
http,
all
you
need
to
know
is
that
make
a
put
request
to
that.
B
Slash
in
the
ingest,
slash
announce
route
and
which
basically
will
signal
the
index
or
not
to
say
look.
I
have
some
data,
please
get
it
now.
B: This last thing is also the reason why, for those who attended Francisco's talk this morning, right after publishing data through web3.storage you cannot immediately fetch it using the DHT: we cannot predict when the indexer node will actually fetch the new advertisements and therefore publish them on the DHT. That's something we can't control. Usually it happens fast.
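A minimal sketch of that announce step follows. The /ingest/announce path comes from the talk itself; the indexer URL and the request body shape (head advertisement CID plus fetch addresses) are assumptions for illustration, not the exact wire format.

```typescript
// Sketch of the announce step described above: one PUT that tells the
// indexer node "I have new advertisements, come fetch them".
// INDEXER_URL and the body shape are illustrative assumptions.

const INDEXER_URL = 'https://indexer.example.com'; // hypothetical endpoint

async function announce(headAdvertisementCid: string, providerAddrs: string[]) {
  const res = await fetch(`${INDEXER_URL}/ingest/announce`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    // Illustrative payload: the CID of the latest advertisement and
    // where to fetch the advertisement chain from over HTTP.
    body: JSON.stringify({ Cid: headAdvertisementCid, Addrs: providerAddrs }),
  });
  if (!res.ok) throw new Error(`announce failed: ${res.status}`);
}
```

The indexer then fetches the chain on its own schedule, which is why new uploads are not instantly resolvable: the provider can announce, but cannot force an immediate ingestion.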
B: Yes, yeah. So what the indexer node does on its side is what we said here: one, fetch /head and see what the latest advertisement now is; two, fetch this advertisement; three, analyze the entries which this advertisement points to; then repeat, repeat, repeat, until you meet an advertisement that you have already processed, which is basically the old head of the queue. And then you're done: all the data is now available in the DHT, via a mechanism that will be explained later on; I don't want to spoil anything.
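In pseudocode, that walk might look like the following minimal sketch, where fetchHead, fetchAdvertisement, and indexEntries are hypothetical helpers standing in for the real HTTP fetches and indexing logic.

```typescript
// Sketch of the indexer-side sync walk described above. The helpers are
// hypothetical stand-ins for HTTP requests against the provider.

interface Advertisement {
  cid: string;          // CID of this advertisement
  previous?: string;    // CID of the previous advertisement in the chain
  entriesCid: string;   // CID of the entries file it points to
}

declare function fetchHead(): Promise<string>;                    // GET /head
declare function fetchAdvertisement(cid: string): Promise<Advertisement>;
declare function indexEntries(entriesCid: string): Promise<void>;

async function sync(lastProcessedHead?: string): Promise<string> {
  const head = await fetchHead();                 // 1. what is the latest ad?
  let cursor: string | undefined = head;
  // Walk backwards until we hit the advertisement we already processed
  // (the old head of the queue), indexing entries along the way.
  while (cursor && cursor !== lastProcessedHead) {
    const ad = await fetchAdvertisement(cursor);  // 2. fetch the advertisement
    await indexEntries(ad.entriesCid);            // 3. analyze its entries
    cursor = ad.previous;                         // then repeat
  }
  return head; // becomes lastProcessedHead for the next sync
}
```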
B: Now let's take a brief look at the different files involved. In this case, the /head file is a very simply structured file. For the sake of this talk, what you really care about is line number three, which says what the CID of the current latest published advertisement is.
B: That's it. There's also a signature and a public key, but we don't really care about those here.
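As a sketch, reading the head comes down to a single GET. The host and the field names here are assumptions for illustration; per the description above, the file carries the head CID plus a signature and public key.

```typescript
// Sketch: fetch the /head file and pull out the CID of the latest
// published advertisement. Field names are illustrative.

const PROVIDER_URL = 'https://ads.example.com'; // hypothetical provider host

interface HeadFile {
  head: string;   // CID of the latest advertisement (the line we care about)
  pubkey: string; // used to verify the signature; ignored in this sketch
  sig: string;
}

async function latestAdvertisementCid(): Promise<string> {
  const res = await fetch(`${PROVIDER_URL}/head`);
  if (!res.ok) throw new Error(`head fetch failed: ${res.status}`);
  const head = (await res.json()) as HeadFile;
  return head.head;
}
```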
B: Then we have the advertisement files, which carry much more information. On line 2 and, what is it, line 18, there is information about the current provider: in this case, that's the peer ID of the Elastic IPFS cluster and its addresses, which is a single address in this case. And then on line 7 there is the link to the entries file this advertisement points to.
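For illustration, the advertisement and entries files can be modeled roughly like this. The field names echo the IPNI advertisement schema, but treat this as a sketch rather than the normative format; all values are placeholders.

```typescript
// Rough model of the two files discussed above. Field names echo the
// IPNI advertisement schema but are a sketch, not the normative format.

// An advertisement: who provides the content, plus a link to the entries.
interface Advertisement {
  PreviousID?: string;  // link to the previous advertisement in the chain
  Provider: string;     // peer ID of the Elastic IPFS cluster
  Addresses: string[];  // its multiaddr(s); a single address in this case
  Entries: string;      // CID of the entries file this ad points to
  Signature: string;    // advertisements are signed, like the head
}

// An entries file: the batch of multihashes being advertised.
interface EntriesFile {
  Entries: string[];    // multihashes of the blocks (10,000 per batch here)
}

const example: Advertisement = {
  PreviousID: 'baguqeera...previous', // truncated, illustrative
  Provider: '12D3KooW...elastic',     // truncated, illustrative
  Addresses: ['/dns4/elastic.example/tcp/443/wss'], // illustrative
  Entries: 'baguqeera...entries',     // truncated, illustrative
  Signature: '...',
};
```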
B: Now, as Alan said, we have a lot of blocks, and a lot of new blocks get added every day. If we mishandle the concurrency, first of all the queue will explode very easily, before we even realize it; and second, if we lose any advertisement, that advertisement will be lost forever, because nobody will ever be able to find it: it's not advertised by anyone, so there is no provider associated with it. For the same reason, there are concurrency constraints, so we process everything in two stages.
B: The first one is that we take all the new blocks that we receive from the indexing part, which are in an SQS topic, and we group them together in batches of ten thousand. And since this is SQS, we don't care about the ordering: we know that each CID will end up in one entries file, and we don't care which one. Once we compute this entries file, we enqueue its CID in another topic, another SQS topic.
B
This
sqs
topic
is
fed
to
another
lambda
which
will
group
these.
All
this
entries
file
in
another
batch
of
ten
thousand
create
sorry
we'll
we
will.
There
will
be
an
execution
for
each
10
000
entries
each
entries
will
become
an
advertisement
file
and
gets
uploaded
an
internal
in
memory.
We
keep
the
sequence
of
these
advertisements
at
the
end
of
the
day,
we
update
the
ad
only
once
with
the
latest
advertisement
that
we
have
processed
now.
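A heavily simplified sketch of that two-stage pipeline follows, with the AWS plumbing (S3 uploads, SQS hand-off, head update) reduced to hypothetical helpers; the function names are placeholders, not the real Lambda names.

```typescript
// Heavily simplified sketch of the two Lambdas described above.
// storeObject / enqueue / updateHead are hypothetical stand-ins for the
// real S3 and SQS calls; batches arrive up to 10,000 messages at a time.

declare function storeObject(kind: string, body: unknown): Promise<string>; // returns CID
declare function enqueue(topic: string, message: string): Promise<void>;
declare function updateHead(advertisementCid: string): Promise<void>;

// Stage 1: a batch of up to 10,000 multihashes -> one entries file.
async function entriesLambda(multihashes: string[]): Promise<void> {
  const entriesCid = await storeObject('entries', { Entries: multihashes });
  await enqueue('entries-topic', entriesCid); // hand off to stage 2
}

// Stage 2: a batch of up to 10,000 entries files -> one advertisement each,
// chained in memory; the head is updated once per execution.
async function advertisementPublisherLambda(
  entriesCids: string[],
  previousHead: string,
): Promise<void> {
  let previous = previousHead;
  for (const entries of entriesCids) {
    previous = await storeObject('advertisement', {
      PreviousID: previous,
      Entries: entries,
    });
  }
  await updateHead(previous); // single head update, then announce
}
```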
B: The result of this is that we reduce the order of magnitude of the problem by ten thousand, which means that, in theory, to handle one million uploads, sorry, one billion uploads per day, we just need a thousand executions of this second Lambda, and we are able to do that pretty easily. That's it.
B: Each block becomes part of an entries file, each entries file is mapped to an advertisement, and each advertisement is bound to one execution of the second Lambda, what we call the advertisement publisher Lambda; it's all executed once. But in the first case we were grouping manually into batches of ten thousand; in the second case, ten thousand is a limit given by AWS: you can receive up to ten thousand messages per event delivered from SQS to a Lambda, and that's why we get to ten thousand. But each advertisement maps to one entries file. I mean, if you look back at the earlier slide, you can see on line seven that an advertisement maps to a single entries file. That's it.
B: Finally, this is an overall view of what I just said: the diagram of the architecture. It all starts on the top right, where there is the SQS multihashes topic. Then there is the first Lambda execution, which outputs to S3 and to SQS, which then goes to the second Lambda. That will eventually also update the advertisement head on S3 and finally, only once per execution, notify the indexer node, which at some point later in the future will say: okay, let me fetch everything from that moment.
A
Okay,
so
once
it's
in
once
it's
in
the
indexer
node,
how
does
how
does
how
does
it
become
available
on
the
dht?
So
so,
for
that
we
need
to
talk
a
little
bit
about
the
the
hydra
nodes
which
we,
I
think
people
have
already
already
talked
about
a
little
bit,
but
this
is
kind
of
a
bit
more
in
depth.
So
anyway,
how
does
it
become
available
on
the
dht?
Well,
the
hydro
nodes
are
just
ipfs
nodes
and
they're
designed
to
help
the
dht.
A
They
position
themselves
such
that
when
you
query
the
dht
you're
you're,
pretty
likely
to
run
into
one
and
they're
just
everywhere
and
they
kind
of
help
there
to
help
out.
The
kind
of
the
key
thing
is
that
they
share
a
data
store.
So
if
one
hydra
head
knows
about
something,
then
all
of
them
do
just
like
a
real
hydra.
They
share
the
same
belly
of
data
with
many
hits
and
so
yeah.
The
the
cool
thing
about
them
is
that
they
prefetch
provider
records.
A
So
when
someone
asks
asks
who
has
the
bear,
the
the
and
the
hydra
nodes,
don't
know,
then
they'll
respond
and
say
I
didn't.
I
don't
know,
but
behind
the
scenes,
what
they'll
do
is
they'll
actually
query
the
network
and
find
that
information.
So
next
time
they
can
be,
they
can
tell
whoever
asks
what
the
actual
answer
is
so
yeah
it's
this
key
kind
of
like
getting
provided
records
and
caching
them
so
that
next
time
they're
available
and
also
being
available
all
over
the
network.
A
So
the
difference
now
is
that
the
hydras
have
been
updated
and
now,
when
hydra
is
asked
about
that,
bear
it
actually
delegates
that
query
to
a
network
indexer.
A
If
it
doesn't
already
know
the
answer,
obviously,
and
then
so,
the
network
indexer
is
able
to
say
the
elastic
provider
has
that
teddy
bear,
and
so
by
virtue
of
the
hydra
heads
being
prevalent
on
the
network.
You'll
almost
always
get
an
answer
without
having
to
explicitly
issue
provider
records
to
the
dht
yeah.
So
here
we
go,
send
it
on.
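The behavior described here amounts to something like the following lookup path on a Hydra head: shared datastore first, then delegate to the indexer and cache the answer. All names are hypothetical, and the real Hydras are written in Go, so this is just an illustration of the logic.

```typescript
// Sketch of the Hydra lookup path described above: answer from the
// shared datastore if possible, otherwise delegate to a network indexer.
// All names are hypothetical stand-ins.

interface ProviderRecord { peerId: string; addrs: string[] }

declare const sharedStore: {          // the datastore shared by all heads
  get(cid: string): Promise<ProviderRecord[] | undefined>;
  put(cid: string, records: ProviderRecord[]): Promise<void>;
};
declare function queryIndexer(cid: string): Promise<ProviderRecord[]>;

async function findProviderRecords(cid: string): Promise<ProviderRecord[]> {
  // If any head has seen this CID, every head knows it.
  const cached = await sharedStore.get(cid);
  if (cached) return cached;

  // Otherwise delegate the query to the network indexer and cache the
  // answer so the whole hydra can serve it next time.
  const records = await queryIndexer(cid);
  await sharedStore.put(cid, records);
  return records;
}
```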
A
Yeah,
I
think
it
gets
cached
still
yes.
So
so,
if
there's
a
lot
of
requests,
it
eventually
is
going
to
catch
all
the
hot
set
in
hybrids.
Yes,
yeah,
there's
a
cash
yeah
yeah!
I
didn't
do
the
implementation
in
hydrogen.
I
don't
can't
speak
to
it
like
if
that's
then
cash
in
hydros,
yeah,
we'll
nodding
so
yeah.
A
Okay,
so
the
requests
that
hydra
make
to
the
indexer
nodes
are
actually
what
we
call
reframe
messages.
It's
a
spec
for
transport,
agnostic,
request,
response
messages
and
what
the
hydras
use
is
the
find
providers
method,
but
this
spec
is
kind
of
part
of
a
bigger
kind
of
delegated
routing
protocol
that
the
hydras
are
sort
of
participating
in,
so
that
that's
cool
and
then
so.
A
This
is
sort
of
ends
up
like
this,
where
the
our
elastic
bfs
implementation
is
basically
telling
the
network
indexes
what
cids
it
has
and
then
the
ipfs
nodes
on
the
network
are
able
to
find
that
information
via
the
hydras.
At
the
moment
this
is
kind
of
a
temporary
situation.
A
The
the
the
index
and
nodes
will
eventually
like
we
talked
about
earlier,
the
queryable
via
ipvs,
directly
and
and
other
ways
which
I'm
not
entirely
in
that
situation
to
explain
about.
But
this
is
how
elastic
ipfs
currently
works
and
gets
all
of
the
data
that
we
have
ever
available
on
the
dht
for
for
people
to
to
query
and
respond
to
and
have
nearly
100
provide
a
record
coverage
for
all
of
our
stuff
cool.
Thank
you.