From YouTube: Thunderdome - Ian Davis
Description
This talk was given at IPFS Camp 2022 in Lisbon, Portugal.
Hi, my name is Ian Davis, and today I'm going to talk about Thunderdome. Thunderdome is named after the very popular 1980s hit film Mad Max Beyond Thunderdome, and just like in that film, Thunderdome is a gladiatorial arena where, instead of fights to the death, we compare different IPFS gateway configurations and implementations and battle them against one another to see how their performance compares.
That's what Thunderdome does. It runs experiments on demand, captures all the metrics from those experiments and exports them to Grafana so that we can view the results. It currently supports testing IPFS gateway implementations with HTTP traffic, but there's nothing in the architecture that stops it from testing other things; that's just what our focus has been on so far.
It's designed to run for long periods, like several days. Typically you're not going to get any good results from this kind of experiment over one or two hours; you really need a soak test for this kind of thing. Given an experiment definition, it starts all the necessary infrastructure, applies the load, and then monitors and captures the results.
So I'm going to talk a little bit more about what actually goes into Thunderdome and what it consists of. At its simplest, an experiment is just a bunch of containers which contain IPFS instances, and we fire traffic at them. Today those are HTTP requests, because we're trying to test the performance of the gateways, but in the future there could be other kinds of load. It could be Bitswap requests, or a different kind of request for a different protocol, for a different kind of software under test.
An experiment defines a particular request rate and a concurrency. The concurrency is basically the number of requests that are in flight at any one time. Each target, which is one of the container instances, is configured differently, and the experiment describes how those targets are configured.
It may be just that one config setting differs between each target, so you're doing a comparison with different values for a particular config setting, or it may be that each target has a different base image and so has different versions of the software, or entirely different pieces of software, being compared against one another.
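As a rough sketch of what an experiment captures, consider something like the following in Go. The field names and image tags here are hypothetical, not Thunderdome's actual schema, but they show the shape of the information: a request rate, a concurrency limit, a duration, and a list of targets that differ in exactly one dimension.

```go
package loadgen

import "time"

// Target is one container instance under test. In a real experiment the
// targets differ by exactly one thing: a base image or a single setting.
type Target struct {
	Name  string            // label used in the Grafana dashboards
	Image string            // container image, e.g. a specific Kubo version
	Env   map[string]string // per-target config overrides (hypothetical field)
}

// Experiment describes the load and the targets it is applied to.
type Experiment struct {
	Name        string
	RequestRate int           // requests per second sent to every target
	Concurrency int           // maximum requests in flight at any one time
	Duration    time.Duration // soak tests run for hours or days
	Targets     []Target
}

// Example: compare two Kubo versions under identical load for a day.
var compareKubo = Experiment{
	Name:        "kubo-0.15-vs-0.16-rc",
	RequestRate: 20,
	Concurrency: 100,
	Duration:    24 * time.Hour,
	Targets: []Target{
		{Name: "kubo-v0.15", Image: "ipfs/kubo:v0.15.0"},
		{Name: "kubo-v0.16-rc", Image: "ipfs/kubo:v0.16.0-rc1"},
	},
}
```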
Now, for the requests, in Thunderdome we take a feed from the public gateway that Protocol Labs runs, ipfs.io. We ship some of those logs through a service called Loki, which is run by Grafana, and then we have a component called skyfish which bridges Loki with some Amazon queue services, and dealgood then listens to those queues. So every time an experiment starts up, it subscribes to this queue of incoming requests and then forwards them on to the targets under test. The reason we do it that way is that skyfish can control the level of incoming requests. It can smooth that out, because the incoming requests from the public gateways can be a little bit bumpy, depending on where we're getting those requests from or what the distribution of requests is, and we want to make sure that it is as controllable as possible and that dealgood receives requests at a steady rate.
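The smoothing idea is simple to sketch. This is not skyfish's or dealgood's actual code, just a minimal Go illustration of turning a bursty stream of requests into a steady one; the Request type is a placeholder.

```go
package loadgen

import "time"

// Request is a placeholder for a logged gateway request (path, headers, etc.).
type Request struct {
	Path string
}

// smooth reads requests from a bursty input channel and forwards them at a
// fixed rate, so the targets see a steady load regardless of how unevenly
// the public gateway traffic arrives.
func smooth(in <-chan Request, out chan<- Request, perSecond int) {
	ticker := time.NewTicker(time.Second / time.Duration(perSecond))
	defer ticker.Stop()
	for req := range in {
		<-ticker.C // wait for the next send slot before forwarding
		out <- req
	}
}
```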
Now, all the components in the system are instrumented with metrics, and we use the Grafana agent to pull metrics out of the containers and out of the ECS machines that run the containers, but also out of dealgood, out of skyfish and out of the queues. Basically, all of this gets pumped into Grafana via Prometheus, and then we can build interesting visualizations of the results from those metrics.
So for the containers that we run as targets, all they have to do is expose a Prometheus exporter endpoint, and the Grafana agent will scrape those metrics and send them up to Grafana Cloud. Dealgood and skyfish implement the same kind of thing and export their own metrics: for dealgood it's the request and response timings, and for skyfish it's the behaviour of the incoming requests and the stability of that request feed.
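To make that concrete, here is a minimal Go sketch of what "exposing a Prometheus exporter endpoint" means, using the standard Prometheus client library. The metric name is made up for illustration and is not necessarily what dealgood exports.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A hypothetical histogram of request durations, in the spirit of the
// request/response timings that dealgood records for each target.
var requestDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "gateway_request_duration_seconds", // illustrative name
		Help:    "Time taken to serve a gateway request.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"target", "status"},
)

func main() {
	// The Grafana agent scrapes this endpoint and forwards the samples
	// to Grafana Cloud, where the dashboards are built.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":2112", nil)
}
```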
So we can build visualizations like this. This is the default timeline view of an experiment, and across here we can see things like, at the top left, the two numbers showing what our expected request rate was and what the actual request rate was. Then we categorize things like the number of good responses, the number of dropped requests, or the time to first byte, which is what we're particularly interested in for the gateway implementations.
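Time to first byte can be measured from the client side with Go's net/http/httptrace package. This is a generic sketch of the technique, not necessarily how dealgood measures it.

```go
package loadgen

import (
	"context"
	"io"
	"net/http"
	"net/http/httptrace"
	"time"
)

// ttfb issues a GET request and returns the time from the start of the
// request until the first byte of the response arrives.
func ttfb(ctx context.Context, url string) (time.Duration, error) {
	start := time.Now()
	var firstByte time.Duration

	trace := &httptrace.ClientTrace{
		GotFirstResponseByte: func() { firstByte = time.Since(start) },
	}

	req, err := http.NewRequestWithContext(httptrace.WithClientTrace(ctx, trace), http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	// Drain the body so the connection can be reused for the next request.
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return 0, err
	}
	return firstByte, nil
}
```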
This particular graph shows a poorly behaving experiment, which I'll come to in a little bit, and this is what we call the summary view. What we need to do when we're running Thunderdome experiments is to really take a long-term view.
There's no point looking at 30 minutes of metrics; we really want six-hour, 12-hour or even 24-hour averages. In this case we're seeing six-hour averages of various metrics, including time to first byte, and you can see some of them standing out as particularly slow.
I'll come to the reason for that in a short while. We also have another dashboard, which basically uses the metrics from skyfish to show us the incoming requests coming from our public gateways and how we've forwarded those on to each experiment in turn, just to make sure the experiments are getting the fair number of requests they've asked for. If they don't, obviously that's going to affect the outcome of the experimental results.
There are some challenges to this kind of infrastructure. Although it's simple on the surface, in that all we're doing is spinning up some containers and sending traffic to them, IPFS and P2P software, by its very nature, wants to connect to lots of different things and discover connections all the time, so nodes like to chat to their neighbours. But what an experiment is trying to do is isolate these things, because we want to have a fair test.
So we've got three or four copies of the same software with different configuration settings. We don't really want those cross-talking, because we don't want a block to appear in one node and then be instantly retrievable from another node just because it happens to be peered with it. So we do things like isolating the targets from one another using network ACLs, and that works pretty well.
What we really want to do is isolate at the peer level, so we're tracking some IPFS work where there are proposals for having rules around which peers can be connected to, so that we could block particular peers. What we'd do there is block all the peers in every experiment that we run, and probably block access to our own gateways as well, just to be sure that we're not getting any kind of cross-contamination that might affect the performance.
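For illustration, here is roughly what peer-level blocking looks like with go-libp2p's connection gater, assuming a recent go-libp2p and a denylist built from the other targets' peer IDs and our own gateways. This is the generic libp2p mechanism, not a specific Kubo feature or the proposal mentioned above.

```go
package isolation

import (
	"github.com/libp2p/go-libp2p/core/control"
	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/peer"
	ma "github.com/multiformats/go-multiaddr"
)

// denyGater refuses any outbound dial or inbound connection involving a
// peer on the denylist, e.g. the other targets in the experiment and our
// own public gateways, to avoid cross-contamination between nodes.
type denyGater struct {
	denied map[peer.ID]struct{}
}

func (g *denyGater) blocked(p peer.ID) bool {
	_, ok := g.denied[p]
	return ok
}

func (g *denyGater) InterceptPeerDial(p peer.ID) bool                 { return !g.blocked(p) }
func (g *denyGater) InterceptAddrDial(p peer.ID, _ ma.Multiaddr) bool { return !g.blocked(p) }
func (g *denyGater) InterceptAccept(network.ConnMultiaddrs) bool      { return true }

func (g *denyGater) InterceptSecured(_ network.Direction, p peer.ID, _ network.ConnMultiaddrs) bool {
	return !g.blocked(p) // checked once the handshake has revealed the peer ID
}

func (g *denyGater) InterceptUpgraded(network.Conn) (bool, control.DisconnectReason) {
	return true, 0
}

// Wire it into a node with: libp2p.New(libp2p.ConnectionGater(&denyGater{denied: deny}))
```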
Initially we've been focused on proving Thunderdome, ensuring there's no bias and making sure we can scale these request streams properly. We've picked some easy experiments as a baseline, and we have recently run an experiment to compare the latest version of Kubo against previous versions.
I'm also doing some experiments around altering the delay, when we're fetching blocks, between when we go out to the DHT versus when we just request blocks from peers. Our baseline experiments we call tweedles, after the old Tweedledum and Tweedledee poem by Lewis Carroll. The idea is that we have two identical instances of Kubo: they've got the exact same configuration and they receive the same load.
They should perform identically, and that's our test to see whether there's any bias in the platform, and on the whole, over the long term, they do perform identically. But of course this is a dynamic system, so each instance will have a different peer list, because they each collect different peers, so in the very short term there are sometimes variations in performance: requests may come in that only one of these otherwise identical nodes can service, just because it happens to have collected a particular peer.
The slide says two pairs, but we actually took three of each of the later versions: version 0.14, 0.15 and the 0.16 release candidate. We fired those up with the same configuration and ran two experiments, one at 10 requests per second and one at 20 requests per second, looking for potential regressions or improvements between those versions. At 10 requests per second there really wasn't any difference that we could spot.
Basically all the instances behaved roughly identically in terms of time to first byte and the amount of resources they were using. But at 20 requests per second we saw a lot of changes, and you can see here that the top of these graphs is version 0.14, I think the middle is 0.15, and at the bottom is the new release candidate. You can see that some of the earlier versions have some very pathological behaviour in terms of time to first byte, I mean very extreme numbers, and high numbers of goroutines.
Throughput was down, and there was quite high heap usage as well in this particular case, and over time you can see that there's a distinct difference between these instances. You can see that some of them were dropping quite a number of the requests being sent to them; they're supposed to be handling 20 per second, and some of them were dropping up to 10 per second at times.
The number of goroutines for some of these was very high, around 60,000, whereas they would normally run at around 20-odd thousand, and you can see a bunch of them running at 20,000, but some were elevated. Some had elevated CPU utilization, the same ones that had the high goroutine counts, and we also saw a high heap as well. Some of those machines just pegged 100% CPU throughout the whole of the experiment, and we see the same ones in the Bitswap metrics.
What we also see is elevated peer counts on some of these slow-performing instances: the peer counts are up at around 20,000, 10,000 and 12,000. Now, the way these are configured, they have a high water mark of about five thousand peers, so they should hover around that level.
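The high water mark here is the libp2p connection manager setting, which Kubo exposes under its Swarm.ConnMgr config. In go-libp2p terms the setup looks roughly like this; the numbers are illustrative, matching the roughly 5,000 high water mark mentioned above.

```go
package isolation

import (
	"time"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/p2p/net/connmgr"
)

// newGatewayHost builds a libp2p host whose connection manager trims the
// peer set back towards LowWater once it grows past HighWater, so a
// healthy node should hover around the high water mark, not far above it.
func newGatewayHost() (host.Host, error) {
	cm, err := connmgr.NewConnManager(
		3000, // LowWater: trim back down to this many peers
		5000, // HighWater: start trimming once we exceed this
		connmgr.WithGracePeriod(time.Minute), // new connections are briefly exempt from trimming
	)
	if err != nil {
		return nil, err
	}
	return libp2p.New(libp2p.ConnectionManager(cm))
}
```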
That's kind of typical of a setup for a public gateway. So we found that the poor-performing instances have very high numbers of peers, and their Bitswap wantlists are also high, even though they're not actually receiving or sending that much more Bitswap traffic.
But what we also saw was that the new release candidate did seem to avoid falling into this state, and you can see on the right-hand side that the Bitswap wantlist sizes were reasonable, around 50 most of the time for these three instances, and the number of peers was exactly what we expected for this kind of experiment.
As a follow-up, you can actually read the summaries if you've got access to Notion, where there is a public page. We followed up with a comparison of various commits between the two versions, and we narrowed it down to a point at which Kubo does appear to become more stable at high volumes of usage, and we think that's down to some changes in the routing logic.
We've run various other experiments recently. For example, we ran an experiment where we blocked access to these particular nodes from the public gateways, and the reason for that is that, because we're taking a feed of traffic from the public gateways, I wanted to test a theory.
The theory was that, if we're peered with those gateways, we may just retrieve blocks that have already been retrieved by a gateway 30 minutes previously. So this experiment was to test whether there was really any bias from being peered with any particular gateways, because obviously the instances can discover the gateways that have these blocks, and it turns out there isn't really any evidence for that.
In fact, the Kubo nodes actually performed slightly better when they were not peered with the gateways, which is probably down to the fact that the gateways run at very high utilization and, in fact, are a little bit slower at servicing requests.
We also have a bunch of experiments looking at the Bitswap provider delay settings, comparing whether we want to change the number of peers, whether we use the accelerated DHT client, and the size of the datastore, and also some very long delay settings for these. And we're currently running an experiment to look at what the optimal high water mark for the peer set is for a typical gateway node.
It seems that, as we saw before, a very high number of peers can also cause problems with performance, because there's a lot of servicing that has to go on in looking after those peers and maintaining those connections. So in fact it may be better to have a smaller peer set, even though that's kind of counter-intuitive for a heavily trafficked gateway, which is designed to fetch blocks as soon as possible for a very wide range of diverse requests.
For the future of Thunderdome, we have a kind of short-term roadmap. What we're going to be doing is automatically testing every Kubo release candidate against the previous release.
We're just going to set that up to be automated, so we'll always have that data and it will inform the release process, and we can hopefully identify performance regressions, or, even better, if we can find performance improvements or stability improvements, then that would be great as well. Then we have some work to do around making it even easier to run experiments. Currently, the only people who can build and run experiments are the people who own the infrastructure, which is currently the Thunderdome team.
What we want is for our collaborators to be able to define an experiment, run it on the infrastructure independently, and then gather the results themselves.
And then, of course, we want to test other gateway implementations, not just Kubo: to test with Iroh and js-ipfs, and anything else really. There's some cross-matching work to do around metrics, to make sure we're measuring the same things from each of these types of software, so that will take some time next year. Then, longer term, we want to measure other kinds of software, not just IPFS gateways: really anything that can accept traffic, respond to it in some way, and produce metrics that we can use.
I think that fits this kind of model. It could be sending Bitswap requests, or it could be a completely different kind of software; it could be anything where we can define what the load is and how to measure the application of that load.
Also, we want to decouple from AWS. Currently we rely on a couple of AWS components: we use ECS for our container system and we're using the queue services, and we want to allow this infrastructure to be deployed anywhere, so that anyone can quickly spin up tests of their own software, or our software, or whatever they want to do. And, of course, we want to run more experiments, get more data, and understand IPFS infrastructure in a better way.
So if you've got an experiment, come and find me. I'm Ian Davis; you can find me on the Filecoin Slack in the ProdEng channel, or on the IPFS Discord in the ProbeLab channel, or you can raise an issue on our GitHub repo, which is github.com/ipfs-shipyard/thunderdome, and soon you'll be able to run your own experiments.
But right now you have to ask me, and I will configure things for you on your behalf. And of course it's all open source, so you can take what we've done, build on it, learn from it, run it yourself, all those kinds of things. So that was Thunderdome. Thanks very much.