From YouTube: A Deep Dive into Provider Records - Mikel Cortes
This talk was given at IPFS Camp 2022 in Lisbon, Portugal.
Hello everyone, my name is Mikel. I work at the Barcelona Supercomputing Center, and today I'm going to introduce you to a recent study that we performed in collaboration with Protocol Labs about provider record liveness. This talk is going to be about the findings and all the results that we got from that study.
So, a quick look at the outline. I'm basically going to introduce the topic that we are talking about: how content is published on IPFS, all the parameters involved, and the objectives of the study.
We will go through the methodology that we used to actually perform the study, and then we will get to the interesting stuff, which is basically the results and the takeaways that we take from the study.
For the introduction: it's funny, because I think everybody has friends that don't know about IPFS, or that don't know much about it, and I normally hear these people thinking that IPFS is a storage system, where you publicly advertise content, people in the network keep it for you, and everybody's happy. But that's not how it works. When you want to put something on IPFS, you don't replicate the content itself; you replicate the pointers to that content.
It's funny, because I think there were already two talks before mine that talked about this. It's a point that we want to emphasize: we don't replicate content. The real way IPFS addresses content is basically that when you want to publish some content, you are going to divide this content into different blocks.
I'm going to show you here a visualization tool, in case you want to upload a file and see how many blocks it will generate. So basically, from this content, imagine for example that it's a PDF file, you will generate all the blocks and you will identify each block by its hash, wrapped into what we call a CID, the content identifier.
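The chunk-and-hash step described above can be sketched roughly like this. This is a simplified illustration with a tiny block size; a real IPFS node uses a configurable chunker and wraps each digest in a multihash plus codec prefix to form the actual CID, rather than using a bare SHA-256 digest.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// blockSize mirrors the idea of IPFS's maximum block size (4 MiB on the
// real network), shrunk here only so the example output stays short.
const blockSize = 4

// chunk splits content into fixed-size blocks, the way IPFS splits a file
// before identifying each block.
func chunk(content []byte) [][]byte {
	var blocks [][]byte
	for len(content) > 0 {
		n := blockSize
		if len(content) < n {
			n = len(content)
		}
		blocks = append(blocks, content[:n])
		content = content[n:]
	}
	return blocks
}

func main() {
	blocks := chunk([]byte("imagine this is a PDF file"))
	for i, b := range blocks {
		// A real CID wraps this digest in multihash/codec prefixes; the
		// raw SHA-256 shown here is just the core of that identifier.
		sum := sha256.Sum256(b)
		fmt.Printf("block %d -> %s\n", i, hex.EncodeToString(sum[:8]))
	}
}
```

So a single file turns into as many identifiers as it has blocks, which is exactly why the talk counts CIDs rather than files.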
So instead of replicating the content itself, now we are talking about blocks and CIDs, and you will have as many CIDs as blocks your PDF generated. But you are not going to replicate them. What you will do is make a link between yourself, the one that is going to host that content or those different blocks, and the CID that you are keeping, in case anyone wants to retrieve that content or that block.
These links, as Yiannis and Joe were introducing before, are called provider records, or PRs for short, and these are actually the things that you will distribute to the network. The purpose of this is that you don't overload the network with a lot of storage: peers don't need to store much, just a simple set of bytes. And the way we do it is, as Yiannis was introducing, with the DHT and the whole lookup process.
So basically, for each of the CIDs we are going to look for the closest peers in the network, matching the hash of the CID to their peer ID, and we are going to select them as what we call peer holders, or provider record holders. Those peers are going to become the link between my content and me as the content provider. The K value for the k-buckets, which is 20, was already introduced; in the content publication we also have another K value. Sorry for having the same K, but we didn't choose it.
It is basically the replication value: how many peers in the network are we going to choose to store these provider records. The main reason for having a K value is that, as we are talking about peer-to-peer networks, the network size elastically increases or decreases, and we don't want those links between content and content providers to disappear. So we basically store these records on a number of peers, so that in case half of them, or almost all of them, disappear,
there is at least one of them which can link the content to the content provider. So, as a peer that wants to publish a PDF, what we will do is actually generate the blocks, generate the PRs, look up for each CID which are the K closest peers, and actually contact them to tell them: hey, I want you to keep these records in case someone is looking for this CID. This is basically what we do when we want to publish content on IPFS.
These PRs are not going to be there forever. We don't want IPFS to be overloaded with data that is no longer interesting to anyone, or that providers don't want to publish anymore. So there is a time-to-live for those provider records, or an expiration time, which is currently set to 24 hours. Any provider record that stays in the network and doesn't get republished is going to disappear, or is no longer going to be linked to the content provider, after 24 hours.
So, for example, if someone publishes something and 24 hours pass, it will probably disappear. As I said, the way of extending the life of these provider records is to republish them, which is basically making the exact same publication again: the whole process is repeated. You will have to search for the new closest peers to each CID, and you will contact them to give them the provider records, and that's how you extend the expiration.
If you want to publish content and keep it alive over its lifetime, you will have to republish it every 12 hours; at least that's what we do now in the current specification. So, what are the objectives of the study we were doing? Of course, the link between content and content provider is probably one of the key things of IPFS, of how it works; otherwise we wouldn't be able to retrieve content.
So, for example, if we had many peers joining and leaving the network, it could happen that the closest peers to that particular CID actually completely change over six hours, and therefore people trying to retrieve the content from the other side will actually end up at different closest peers, so the link between the content and the content provider won't be reachable.
We wanted to expand this study a bit, and we wanted to explore: okay, we know that K now is 20, but what happens if we increase it? What happens if we decrease it? We call them magic numbers: numbers that are there, they work, and we actually wanted to see whether it's worth changing them, or whether they were correctly set and we just didn't know why.
So whenever you want to fetch some provider records from the network and you reach one of those heads of the Hydra, if any other head of that Hydra already knew the records, it is already going to give them to you. You would be able to access a provider record that that peer, or that set of peers, actually got to store.
So we wanted to check whether they are keeping up the whole liveness of these records, what their performance in the network is, how important they are, and whether we could actually remove them from the network.
So yeah, a quick introduction to the methodology. We have a simple client which basically reproduces, in a controlled environment, the publication of the CIDs. We actually monitor the process of looking up the closest peers, and we monitor how we store the records: whether those 20 peers that are the closest ones to the CID are successfully connected, and whether they are really the closest ones.
Once we identified those provider record holders, what we do is that at every given interval of time, which in this case was set to 30 minutes, we actively connect to them and ask them whether they keep the records or not, and by doing so we actually know whether they are active in the network and whether they keep the records in the network.
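That probing step can be sketched as a single monitoring round. The `probe` signature and the fake responses are hypothetical stand-ins for the real dial-and-request logic the client used:

```go
package main

import "fmt"

// probe is whatever the monitoring client does on each round: dial the
// holder and ask it for the record.
type probe func(peerID string) (active bool, holdsRecord bool)

// pingRound contacts every provider record holder once and counts how many
// are reachable and how many still serve the record, the way the study
// probed its holders every 30 minutes for 36 hours.
func pingRound(holders []string, p probe) (active, holding int) {
	for _, h := range holders {
		a, keeps := p(h)
		if a {
			active++
		}
		if a && keeps {
			holding++
		}
	}
	return active, holding
}

func main() {
	holders := []string{"peer-0", "peer-1", "peer-2", "peer-3"}
	// Fake probe: peer-3 is offline, peer-2 is online but dropped the record.
	fake := func(id string) (bool, bool) {
		switch id {
		case "peer-3":
			return false, false
		case "peer-2":
			return true, false
		default:
			return true, true
		}
	}
	active, holding := pingRound(holders, fake)
	fmt.Printf("active=%d holding=%d\n", active, holding) // active=3 holding=2
}
```

The two counters match the two curves discussed later: peers that are online at all, and peers that actually still share the record.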
Every 30 minutes we also do another lookup to know whether those peers are still the closest ones to the CIDs, and we keep track of the number of hops that the DHT lookup is doing, so that we can also link it later on. It's important to note that we are not making any republish, so it's a plain publish. We keep track of the life of the records for 36 hours, and then we just leave them there.
We just leave it like stale content in the network. So, going a bit into the study: the current status is quite healthy. In this graph we can see the number of successful peer holders that we connected to whenever we were publishing the 10,000 CIDs that we published with the tool.
The nice point is that at the lowest part, at least 95 percent of the CIDs actually achieved 14 successful peer holders, which I think is still nice. Let's remember that as soon as one peer has the records, the content should be retrievable, so 14 should be fine. Then we started pinging them every 30 minutes.
Only five out of 20 peers are actually not accessible during the 36 hours of the study, and if we compare that to the number of peer holders that actually shared the provider records, we can see that it only drops to 13. So the content is still quite accessible, because once again, as soon as one peer shares the PRs, the content should be retrievable.
An important key point is that, looking at which peers are sharing their records, we can see that around hour 24 there is a sharp drop, which is the expiration time we were talking about before, and for some reason Hydra nodes actually take longer to prune that content. Keep in mind that they have a huge database, so it actually takes them longer to prune the stale records. Talking about the degree of the originally contacted peer holders that remain inside the closest peers over those 30-minute intervals:
we can see that it's quite stable. Of course, we take into account that the original peer holders should be the closest peers to the content at the beginning; it drops after a few hours to 15 by median, and it stabilizes there. But there is no point over the study where those original peer holders are not inside the K closest peers.
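This stability metric, the share of the original holders still inside the currently closest set, can be computed roughly like this (the function name is ours, not the study's):

```go
package main

import "fmt"

// inDegreeRatio: the share of the original provider record holders that
// still appear among the current K closest peers to the CID, the kind of
// stability metric tracked in the study.
func inDegreeRatio(original, currentClosest []string) float64 {
	if len(original) == 0 {
		return 0
	}
	closest := make(map[string]bool, len(currentClosest))
	for _, p := range currentClosest {
		closest[p] = true
	}
	still := 0
	for _, p := range original {
		if closest[p] {
			still++
		}
	}
	return float64(still) / float64(len(original))
}

func main() {
	original := []string{"A", "B", "C", "D"}
	current := []string{"A", "B", "C", "E"} // D churned out, E moved in
	fmt.Println(inDegreeRatio(original, current)) // 0.75
}
```

A ratio near 1 means the peers you originally gave the records to are still the ones a fresh lookup would find, which is exactly the retrievability property the study cares about.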
This means that anyone that wants to retrieve the content from the other side will actually end up talking, or trying to connect, to the same peers that we originally contacted, which is nice. Keeping up with the comparison between K values: of course, having to replicate things over 20 peers for each CID adds up, because each CID represents a block, which is a maximum of four megabytes of storage. So if you really want to share a big file, you really need to split it into many blocks.
So this is making the network a bit overloaded, and we see that the number of users is increasing but the number of nodes doesn't really match. So what we were trying to explore is whether the overhead that we push to the network by publishing could be affected by modifying this K value. So what we tried was starting from K=15, then K=20, K=25 and K=40. We were testing these values in the real network, but only with our tool.
So we are testing these values in a network that has K=20; it's only a local setting. But we can see that if we compare the percentage of peers that we successfully connect to when we are publishing the records, the percentages don't vary that much: it still stays at a median of around 90 percent, which is nice.
The bigger difference comes when we take into account the amount of time that it took us to actually publish the records, which is something that Yiannis already mentioned before. In this case, with K=20, it takes 12 seconds by median to actually make the publication of a single CID, which is quite high from a performance point of view; we would like to drop it to one second.
But let's be optimistic; it's going to be a long journey. We can see that, for example, if we drop it to K=15, we actually get around a two-second decrease. However, if we increase it to K=40, so we actually double it, we add around two seconds by median; it's quite a linear distribution. Comparing the onlineness of these peers versus the peer holders that are sharing the content, the percentages don't change much.
Basically, what this result is telling us is that the network will still behave the same for one K value as for another, with the difference that the absolute value will change; of course, 80 percent of 15 is not the same as 80 percent of 20. So if you look closely enough, you can observe that K=40 always has a higher value, while K=15, 20 and 25 are more or less stable on a single line.
However, K=40 sits a bit above, which is interesting. The fact that the k-bucket size is set to 20 means that a peer knows in high detail the 20 closest peers that are surrounding it. However, if you go beyond those 20 peers in that bucket, your 20 closest peers just start having more doubts about whether that 38th peer is actually among the closest ones or not, and that's the way the Go implementation works.
Still, no big differences. I think the values are quite fair, around 60 percent, and let's remember that only one of them has to be online, so I think it's quite good.
The in-degree ratio, on the other hand, shows the opposite: K=15, 20 and 25 still follow the same trend, while K=40 actually drops. And this is still the same thing: you prioritize peers that are stable in the network, sacrificing accuracy on their closeness, and that's the result on the in-degree ratio of peers that are closer to the content over time. You are sacrificing the precision of being the closest one for actually being more stable.
Something really interesting: in the DHT walk that we were doing to actually get the closest peers, we were keeping track of the maximum number of hops that we were doing in that process. And we realized that K=40 does not add many more hops to the depth of that DHT walk; however, it was growing in the width of that tree. We are not requesting more peers in depth, but more peers in parallel.
And yes, it's taking us more time to actually get a certain accuracy on which are the closest 40 peers, so in that case it's more of a drawback than an improvement. So, coming back to the Hydra point: I set up a tiny tweak in the DHT code, mostly to try to blacklist any Hydra peer in the network. We were trying to recreate what a network without Hydras would look like.
For those peers I was just banishing them and taking the next one in the queue, and if that one was a Hydra peer again, I would banish it as well. And for those peers that would actually escape this blacklisting during the lookup, what I was doing is that whenever my host was trying to open a connection to a peer and I could identify it as a Hydra, I would just create a new error, which is like: sorry, you are a Hydra, I don't want to connect to Hydras.
So this is mostly how I performed this study, and we ran it again with 10,000 CIDs. The experiment was the same, with just the difference that we are comparing here one version that was applying the filter and one that wasn't. The comparison shows that it still follows the same distribution; we are just avoiding Hydras in the network from our own perspective. It has a side note, which is that this Hydra filter is only for us.
So, for example, we are not building our routing table, the DHT routing table, with Hydras, but other peers are still getting the benefit from them; that's just a side note. But basically, what we can see is that the number of successful connections that we get at the beginning is still the same; it doesn't vary much, maybe just 0.1 percent. In terms of the max hops that it takes us to actually get the closest peers, we can see that we are increasing.
The reason is that, of course, we are not accessing those peers that have a huge database of which are the closest peers to a specific CID, so it's taking us longer to actually discover which are the closest peers.
However, it doesn't matter much in terms of how much time it takes us to publish the content: we are only adding one extra second, while we are avoiding a centralized part of the network. So I think that in that case it could be a nice trade-off, but I don't want to be the one shutting them down.
But here are the numbers, and yeah, it more or less follows the same pattern for successful connections and online peers over time. Of course Hydras, given the fact that they are a centralized point of the network, show a higher stability in terms of onlineness and sharing the records.
If we go back to the previous slide, we can see that the bottom line is pretty much a straight line: they are always active, of course. But in this case the average of active peers actually doesn't vary that much, and here we can see that the records actually expire after 24 hours, which is nice. Oh, I think I missed the title here.
So basically, this is the in-degree ratio. We are still seeing that we don't have so much stability on the Hydra side when getting the closest peers; it makes sense, because without the Hydras we don't have that central point when we are getting the closest peers.
So we are having sort of a drawback in terms of a more unstable in-degree ratio, but overall we are still above 70 percent, which is still fine, since we just need one peer which actually keeps the record. So yeah, that's more or less the summary. The takeaway that I want to take from this study, and that we want to give to the community, is that the provider record liveness of the network is actually quite good.
Over those first 24 hours we still have a really nice in-degree ratio of being the closest ones to the content, which means that it's quite unlikely that a peer holder would fall outside those closest peers within 24 hours, and the part of the network that is not centralized, which is mostly users running their own nodes, just go-ipfs or Kubo nodes, is actually quite stable as well.
This opens, or we want to open, this study to actually try to improve the network. As I was explaining before, we are experiencing some overhead on some nodes that are receiving too much traffic. For example, extending the lifetime of those provider records could be something we take into account to reduce the overhead, because right now big clients or big nodes in the network have to spend a lot of effort republishing every 12 hours. It also opens the window of whether we could decrease the K value from 20 to 15, which would also reduce the overhead on the network. Although, having the two options, the first one seems to me to have more priority, since it's less likely that we will get content in trouble just by extending the records' lifetime.
Yeah, just a quick mention of the contributions that we made. The whole report is under RFM 17; it's the report that we wrote for Protocol Labs. All these optimizations of the parameters that we could do, for example, are already applied, and there is already a pull request to the specifications of libp2p and to the Go implementation, and so on. I'm going to link you here the repo for the Hoarder, in case.