From YouTube: [Swarm Mini Summit] Observability in Swarm
Description
Anton Evangelatov - EF
Observability means making a complex system as transparent as possible to those who operate it. In this talk we will explain how we approach observability and monitoring within Swarm, what systems and tools we use, and discuss how we utilise them to bring simplicity, transparency and visibility to our production and staging environments.
Biography - Anton Evangelatov
Anton Evangelatov is a software engineer at the Ethereum Foundation, specialising in distributed systems, currently working on the Swarm project and as Ethereum Foundation DevOps. Prior to joining the Ethereum Foundation, Anton worked at a number of startups across Switzerland and Austria.
So, if you really want secure access to swarm, you should be running your own node, and hopefully this will also help you understand what your swarm node is doing once you actually run it on your own system. First, a bit of history of swarm from the perspective of a developer and someone who has been working on it: I joined the team in October 2017, so one year ago, and at the time the state of the code base was basically that we had no canonical master branch.
There were multiple feature branches, mostly conflicting with one another. The swarm gateways were running POC, but it wasn't really clear which commit or which version that was, and we had no canonical git repository. We had the ethersphere repository, but POC 2 was actually a specific commit in the ethersphere repository. So it was a bit difficult to understand which code was running on the gateway and what was actually happening within that node.
So basically it was a black box that you had to restart every now and then, and you didn't really have much visibility into what was happening. You would hit the gateway, you could upload content, you could download it, but every now and then it would get really slow, and you wouldn't really know what was happening.
We then decided to introduce a release process, which we now have, so now you get a new swarm version every two weeks, and you pretty much know that if something that used to work fine is broken, a problem has been introduced into the code base over the last two weeks. It's not difficult to find what happened, because of the small amount of changes. The last goal was to simplify deployment, so that our users know what version the majority of the nodes in the network are running.
Obviously we're not the only ones running swarm; the swarm gateways are just a subset of the nodes in the network, but as a new user of swarm, that's probably the place you're going to go. If you want to try swarm, that's the lowest barrier to entry right now. So what I'm going to talk about in this talk is mostly how we increase visibility.
How do we know what's happening in a swarm node when we run one? Observability can be defined by these three pillars: aggregation of logs, aggregation of metrics and statistics, and distributed tracing.
There is a really nice blog post on how Twitter approached that, which you can read. They explain how they went from a monolith to a distributed environment and the challenges that came with that. With swarm, we've always been in a distributed environment, so all of it is pretty much applicable right from the start.
For us, that means you need to understand how your requests go from one node to another and make sense of all the protocols that are running. There is another really nice blog post, from Peter Bourgon, who has a nice diagram of what he understands by observability. He has metrics, which are aggregate statistics that all nodes emit; we want to aggregate them and get a full view across all distributed processes. We also have tracing, which is pretty much request-scoped.
Let's say that you upload something or you try to get something from swarm: you're interested in your specific request to the network. Logging is similar to tracing, but in terms of events. For logs we are using the go-ethereum log package, which is pretty much a modification of the log15 library. It's a structured logging library for Go; nothing complicated there. It was already part of the code base.
It was just that we were getting logs only on standard out, so we were thinking of a way to aggregate the logs from all the machines, so that we have one common view and easy access to our logs. That's pretty clear. There is an example of how you would do it within the go-ethereum code base, and, yep, swarm is part of that.
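As a rough illustration (this is not the exact snippet from the slides; the message and key-value pairs are made up), structured logging with the go-ethereum log package looks like this:

```go
package main

import (
	"errors"
	"time"

	"github.com/ethereum/go-ethereum/log"
)

func main() {
	// Structured logging: a message plus arbitrary key-value context pairs.
	log.Info("chunk delivered",
		"ruid", "4d65822107fcfd52", // hypothetical request identifier
		"took", 42*time.Millisecond,
	)

	// By our definition, an error is something a developer must look at.
	log.Error("store failed", "err", errors.New("chunk validation failed"))
}
```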
We attach a unique identifier to requests that go into swarm, and then we store it in the context. From there on you can propagate it to any of the internal subsystems, and you know that a given log belongs to a specific request.
And how do we use that? This is just a simple screenshot of one of the systems we use. We search for a specific request identifier, and then we can see the whole path of the request within our network.
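A minimal sketch of that idea, with a hypothetical ruid key and helper functions (the real swarm code base has its own types and helpers for this):

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
)

// ruidKey is a private context key for the request identifier.
type ruidKey struct{}

// withRUID stores a fresh, unique request identifier in the context.
func withRUID(ctx context.Context) context.Context {
	return context.WithValue(ctx, ruidKey{}, fmt.Sprintf("%016x", rand.Uint64()))
}

// ruid extracts the request identifier so log lines can be tagged with it.
func ruid(ctx context.Context) string {
	if v, ok := ctx.Value(ruidKey{}).(string); ok {
		return v
	}
	return "unknown"
}

func fetchChunk(ctx context.Context) {
	// Every subsystem receiving ctx tags its logs with the same identifier,
	// so the whole path of one request can be found with a single search.
	fmt.Println("ruid", ruid(ctx), "msg", "fetching chunk")
}

func main() {
	ctx := withRUID(context.Background())
	fetchChunk(ctx)
}
```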
It's helpful for debugging. Let's say you run your own swarm node and you're wondering why you're getting results that you don't expect: that would be the first thing to check, basically taking your unique identifier and seeing where your request has gone within the code base.
So what do we log? Obviously errors, and when I say errors, I mean something that a developer must look at; that's the definition we use. We also log warnings: something that might happen, but is not necessarily an error.
We also log a lot of request- and response-specific information, but still, I want to emphasize the errors: everything that comes out as an error is something that we should be looking at and trying to get out of the codebase, because it's probably a bug. For aggregation we use OK Log, which is a very simple log management solution.
We just deploy it within our cluster and aggregate all the logs that we receive from the different nodes. That's what it looks like: you can pretty much just search and run different queries against the system.
For metrics we are using the go-metrics library; again, it's part of the code base. It's a fork of the famous Coda Hale metrics library. Some of you probably come from a Java background, or not, but it's a really popular Java metrics library.
The way to instrument the code with it is to just create, let's say, a counter and increment it. And here's an example of how you can measure latency: you record the time, you set your timer, and you update it when the event you're trying to measure has finished. It's quite simple.
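A short sketch of both patterns using the go-metrics API from the go-ethereum code base (the metric names are made up):

```go
package main

import (
	"time"

	"github.com/ethereum/go-ethereum/metrics"
)

func handleGet() {
	// A counter: how many times this code path has run.
	metrics.GetOrRegisterCounter("api.get.count", nil).Inc(1)

	// A timer for latency: record the start time, then update the
	// timer once the event being measured has finished.
	start := time.Now()
	defer metrics.GetOrRegisterTimer("api.get.time", nil).UpdateSince(start)

	// ... the actual work being measured ...
	time.Sleep(10 * time.Millisecond)
}

func main() {
	handleGet()
}
```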
Metrics help us detect issues within current and new versions of swarm, and they also show us the performance of our nodes.
It's very handy, when you deploy a new version, to compare the metrics you got up until that point with those after the deployment, so that you can see whether you have introduced regressions, or whether you have fixed bugs that you had in the past. That's something we use heavily.
So what do we measure with the metrics? Obviously infrastructure metrics, that's clear: memory usage, CPU, disk usage. And, most importantly, the application metrics: the errors that I talked about, warnings, different info counters.
We also measure application-specific things like the number of peers per node and all the different message exchanges between peers. That's where the interesting information comes along: it helps you visualize what a specific node is doing at a given time with respect to the different protocols it's running. In swarm, that would be the bzz protocol, the stream protocol, PSS or, yep, anything else. This is what it looks like in the dashboard; it's not very visible right now, but these are our dashboards.
The things that you see here are the different nodes that we're running. It's a handy way to visualize all the errors on a set of nodes; you can create alerts for them and monitor a large deployment of your network. In a developer's use case that would be only one node, let's say your local node, and you can visualize what your local node's state is at a given time. That's again the infrastructure side, and that's an example of application monitoring.
We also run this when we are doing simulation tests; that specific screenshot is from a run of the PSS simulation tests. At the time we were comparing different transport models, and you can see that at some point the metrics become flat, on the right side, where we're measuring the number of calls to specific PSS handlers.
It's pretty clear that at some point we just flatten out and we're not incrementing the counters anymore, which tells you that your process is not doing anything at that point. This might be expected or unexpected, depending on the simulation you're running; it basically gives you visibility over your system. The last pillar is distributed tracing. That's a system that we introduced within the swarm team: we decided to use OpenTracing.
OpenTracing is basically vendor-neutral APIs and instrumentation libraries; they have standardized the semantics of what they mean by span, trace, etc. There are different client libraries, and we're using the Go one. There are also popular tracers supported: those are the systems that actually aggregate the traces from your processes and produce a nice visualization so that you can make sense of your traces. I'll go quickly over the OpenTracing data model. Traces are defined implicitly by their spans, and a trace can be thought of as a DAG of spans. This is what it looks like.
You have one root span, and then you have a lot of children of that span. Let's say your request comes into the system: you create your root span, and then within the internal subsystems of your process you just attach spans to the parent root span. That's what the relationships between spans look like in a single trace. It's also useful to think of it in the temporal dimension.
Generally, the root span is going to have the full length of a request, whereas its child spans are going to finish within the root span, unless you fire up asynchronous jobs from, let's say, your root request. So that's the data model of OpenTracing. How do we use this in swarm? We have instrumented our code, and this is a screenshot of an HTTP GET request for a specific root chunk.
You can see the full length of the request and the internal systems it goes through; I'm not sure that's visible for you, but basically the root span is the HTTP GET file. Then we can see that we're hitting the API package, and then we go into the chunk reader, which is responsible for fetching the individual chunks that the root hash is built of.
Here is another example where we actually issue an HTTP GET request to a specific swarm node and trace it within our network, of course, so we can see which other nodes a specific request is hitting. Let's say you don't have all the chunks cached on your local node: you can visualize which other nodes you are hitting in order to retrieve the chunks for that specific content. Obviously that throws privacy out of the window, but it's there for debugging purposes.
We don't expect people to be running this in production, and even if they are, this is obviously just giving you information on the nodes that you control, and you're pretty much free to do whatever you want with your nodes. That's something you should keep in mind: when you are using a public gateway, you might actually be losing your privacy, so you should be running your own nodes.
How does that work in terms of Go, adding the instrumentation? Here's just a simple example: we have a handler function for one of the protocol messages, in this case the offered hashes message. When we handle that message, the only thing we do is define a span and extract the root span from the context. So we're attaching the handle-offered-hashes span to the root span, which is kept in the context, and the context is propagated to the other internal subsystems so that they have access to this instrumentation.
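Sketched with opentracing-go, the handler pattern looks roughly like this (the message type and span name are stand-ins, not the exact swarm code):

```go
package main

import (
	"context"

	opentracing "github.com/opentracing/opentracing-go"
)

// OfferedHashesMsg stands in for the real protocol message type.
type OfferedHashesMsg struct{ Hashes []byte }

func handleOfferedHashes(ctx context.Context, msg *OfferedHashesMsg) {
	// Extract the root span from the context and attach this
	// handler's span to it as a child.
	var opts []opentracing.StartSpanOption
	if parent := opentracing.SpanFromContext(ctx); parent != nil {
		opts = append(opts, opentracing.ChildOf(parent.Context()))
	}
	span := opentracing.StartSpan("handle.offered.hashes", opts...)
	defer span.Finish()

	// Propagate the context, now carrying this span, to the internal
	// subsystems so they can attach their own child spans.
	ctx = opentracing.ContextWithSpan(ctx, span)
	_ = ctx
}

func main() {
	handleOfferedHashes(context.Background(), &OfferedHashesMsg{})
}
```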
That's how you get the relationships between the different nodes. So how do you use all of that if you're a developer and you want to develop on swarm? The simplest way to do the tracing is to run a so-called aggregator. We talked about OpenTracing: you can run, for example, Jaeger or any other compatible tracer.
With a simple docker command you can start it on your local machine, and then you just have to start swarm with tracing enabled, that's the tracing flag. You need to give the endpoint, which in this case is localhost and the 6831 port, and you also have to give your node a name, in this case 'my local swarm'.
So the question is: is there any visualization that is better than the tracing, and is there something in the works? No, I'm not aware of such tools. Tracing together with the stats is how you get visibility over your systems. If someone in the audience knows of anything that adds more value, I'd be very interested in hearing about it, but that's the state of the art, as far as I know, at the moment.