Description
Failing forward to 1 million requests per second - Axel Liljencrantz, Mikael Sundberg
Many companies claim to have a work culture that celebrates failures, but few companies have tested that claim as thoroughly as Spotify did during our migration to Envoy. Come hear war stories of trying, failing, and failing some more with Envoy, and learn how to make sure you learn something new every time you fail.
A
In addition to having a unified perimeter, there is a long laundry list of features we want to get out of this setup: common metrics, authentication, rate limiting, client IP lookups, access logs and so on. And, as you might know, Envoy doesn't actually do all of those things. Our desired Envoy setup contains a Docker sidecar that runs a second service, implementing authentication, GeoIP lookups and smarter things.
B
Our test setup uses the same load balancer as production and an identically configured cluster, but with only one host, and various different core counts on that host. Finally, our test used a single upstream service named noop. Noop is a service whose reply time, status code and payload size can all be configured on each incoming request.
A
That's a pretty high number, and someone pointed out that the buffer size is one megabyte, and with some math (one megabyte multiplied by the number of open connections) you get a total buffer size of 13 gigabytes. That's quite a lot of buffering for Envoy to do. So we tried decreasing it to 32 kilobytes for each connection, and our requests per second increased from 30,000 to 60,000 on direct responses.
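The per-connection buffer limit is set on the listener (and the same field exists on clusters for upstream connections). A minimal sketch of what that change could look like in a static bootstrap, with a direct-response route like the one used in the test; names and addresses here are illustrative, not our actual config:

```yaml
static_resources:
  listeners:
    - name: ingress
      address:
        socket_address: { address: 0.0.0.0, port_value: 8080 }
      # Envoy's default is 1 MiB per downstream connection; cap it at 32 KiB.
      per_connection_buffer_limit_bytes: 32768
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                route_config:
                  virtual_hosts:
                    - name: all
                      domains: ["*"]
                      routes:
                        # Direct response, so the proxy itself is the bottleneck
                        # being measured rather than any upstream.
                        - match: { prefix: "/" }
                          direct_response: { status: 200, body: { inline_string: "ok" } }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```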
B
We got a suggestion, in the form of Harvey Tuch: SO_REUSEPORT. This configuration option in Envoy is described as such: it "makes inbound connections distribute among worker threads roughly evenly in cases where there are a high number of connections", which begs the question: when would you not want connections evenly distributed among workers anyway?
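For reference, roughly what enabling it looks like on a listener; this is a sketch, and depending on Envoy version the field is the newer `enable_reuse_port` (assumed here) or the older boolean `reuse_port`:

```yaml
listeners:
  - name: ingress
    address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    # Creates one listening socket per worker thread (SO_REUSEPORT), letting
    # the kernel spread new connections across workers roughly evenly.
    enable_reuse_port: true
    # ...filter_chains as in the earlier sketch
```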
A
So we have started doing fun things, like upgrading to version 3 of the xDS API, adding rate limiting and looking at CORS configuration for our clients. And we did take it slow, by rolling out gradually over an entire year, and we did spend a full week of performance testing before our last and final deployment, and still we failed to identify five major scalability bottlenecks. Maybe spending an hour looking at all the available metrics while testing our setup might have actually identified a few of these problems, but probably not all of them.
A
Looking back, this journey was a lot of fun, even though it didn't always feel like that while it was ongoing, and we for sure did learn a lot. So, some suggestions we thought we would share; they would most likely have helped us, so maybe they can help someone else. Make the default queue size per core, so you don't have to remember to change it when you change your machine type to one with a different number of cores. Make SO_REUSEPORT the default.
B
This often means having the right metrics. Next, do your best to reproduce all problems outside of the production environment. Not only does doing so give you much more opportunity to see what happens in various related failure scenarios; the act of crafting a test environment often shows you blind spots you didn't know you had. And finally: communicate. Ask for help. Broadcast your shortcomings to anyone who can be made to listen, like you. Even if your mistakes are embarrassingly dumb like ours, keep talking.
B
Thank you for all of the feedback and the thumbs up and whatnot. Let's see: have you guys looked at enabling the exact balance option on the listener? I'm gonna let you handle that one, because I don't know.
B
There was a question about whether the HTTP/1.1 issue was identified. So, there was no HTTP/1.1 issue. That was a suspicion that we had, that maybe the HTTP/1.1 stack was slower, or less battle tested, or less scalable, or something like that, and that turned out to be wrong. We are still using HTTP/2 from the load balancer to Envoy, obviously, and then from Envoy to our microservices we're talking HTTP/1.1, and they both seem to perform just fine.
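For background, Envoy talks HTTP/1.1 to upstream clusters by default; a cluster has to be explicitly opted into HTTP/2. A sketch of what that opt-in looks like, with an illustrative cluster name and endpoint:

```yaml
clusters:
  - name: microservice  # illustrative name
    connect_timeout: 1s
    type: STRICT_DNS
    load_assignment:
      cluster_name: microservice
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: service.internal, port_value: 8080 }
    # Without this block the upstream connections use HTTP/1.1; adding it
    # switches the cluster to HTTP/2 (e.g. for gRPC backends).
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
```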
B
Running into very similar problems at Twitter, I think, overall, I would expect people that have very large request volumes to have similar issues. And I think there is a very good start on how to run an HTTP proxy for a large organization in the docs for Envoy, but I think there are opportunities to improve the configuration, as well as improve that documentation, to make life even easier for large installations.
B
It's another way of forcing connection balancing. Well then, we should look into it and see if it works better or worse. Thanks for the tip.
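For context, the option the question refers to is the listener's connection balancer; a minimal sketch of enabling exact balancing, with the rest of the listener elided:

```yaml
listeners:
  - name: ingress
    address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    # Exact balancing makes workers hand off accepted sockets through a
    # shared lock, keeping per-worker connection counts nearly identical.
    connection_balance_config:
      exact_balance: {}
    # ...filter_chains as in the earlier sketch
```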
B
And Matt Klein asks: why are you using HTTP/1.1 to the backends versus 2? The answer to that is mostly legacy. Spotify has a very old network stack; it's about a decade old. We implemented our own transport layer instead of HTTP, because we had a lot of scalability problems with HTTP.
B
This transport layer, called Hermes, is basically very similar in most ways to HTTP/2. It solves the same problems in mostly the same way, and it tries to be very HTTP-like in its API, but it is older than HTTP/2.
B
We started that work slightly before Google started talking about SPDY publicly, and we are still transitioning away from this internal Hermes protocol. What we have today for our Hermes-based services is a library that you can use to accept HTTP traffic as if it was Hermes traffic, and we are instead moving to internally use HTTP/2 and gRPC, and then in the future hopefully HTTP/3 and so on, like modernizing our stack. But we're not there yet.
B
So, with regards to filters: we are not using much in the way of filters. We are using a few filters to filter out users who are not allowed on some resources, and so on. But the big thing that reduces our efficiency, I would say, is that we are running both Envoy itself and this decorator sidecar, which is implemented as an ext_authz gRPC filter, on the same 32-core machine.
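To illustrate that wiring, a sketch of an ext_authz HTTP filter delegating to a gRPC sidecar; the cluster name and timeout are illustrative, not our actual values:

```yaml
http_filters:
  - name: envoy.filters.http.ext_authz
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
      # Each request detours through the sidecar (the decorator) before
      # being routed upstream, adding the extra hops described below.
      grpc_service:
        envoy_grpc:
          cluster_name: decorator_sidecar  # hypothetical cluster pointing at localhost
        timeout: 0.25s
  - name: envoy.filters.http.router
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```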
B
So the three resource hogs on the machine are: Envoy itself, which uses about half the CPU; the sidecar, which uses slightly less, but still a significant amount; and lastly the metrics propagation, which uses about 1 out of 32 cores. All three of those are running on every single Envoy host.
B
That also means that you get a message in to Envoy, then it's passed out from Envoy to the other service, then back, and then on to the next hop, and then you get the reply. So there are six message passing steps, or something like that, not just the four that you would expect.
B
That decision is over a year old. I was very interested to hear the talk, like one of the opening talks, about using WebAssembly to make your own custom filters in Envoy.
B
We did not want to write our own C++ filters, because we, as a company, have too few developers who are super comfortable with C++, and then it becomes a "who owns this" problem, whereas we have lots of Java devs. But WebAssembly might help out with that. We don't know; we'll see.
B
If someone has a question that they posted that we didn't answer, it's not because we hate you, it's because we missed it, so please feel free to repost it in that case. Yeah.
B
Thanks a lot everyone for listening; this was great. I will now disconnect and go say hi to the third talker from this conference, Titus. So, bye bye.