Description
Don’t miss out! Join us at our upcoming event: KubeCon + CloudNativeCon Europe 2023 in Amsterdam, The Netherlands from April 17-21. Learn more at https://kubecon.io. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.
Lightning Talk: Protecting Envoy: Overload Manager - Kevin Baichoo, Google
How can Envoy protect itself from OOMs? Envoy has a number of different protection mechanisms out-of-the-box -- how do they work? When should you use them and how should they be configured? Let's find out! Kevin will conclude with some experimental results using these protection mechanisms.
So welcome to my talk on protecting Envoy with the overload manager. I'm Kevin Baichoo, a software engineer at Google and an Envoy maintainer. Many users run Envoy at the edge, and in edge deployments an attacker can disrupt your service either by taking the service out altogether or by taking out the pipe that leads to it, which in this case is the Envoy proxy.
The reason we ran into issues like those OOMs is that we weren't protecting some resources, and this is exactly what the Envoy overload manager tries to do. It works by measuring a particular resource and taking action as needed. Let's explore what the Envoy overload manager can do.
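As a sketch of what that pairing of monitor and action looks like in practice, here is a minimal overload manager configuration; the heap size and threshold are illustrative values, not the talk's configuration:

```yaml
# Hypothetical bootstrap fragment: a fixed-heap resource monitor
# paired with one overload action. Values are illustrative.
overload_manager:
  refresh_interval: 0.25s
  resource_monitors:
    - name: "envoy.resource_monitors.fixed_heap"
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
        max_heap_size_bytes: 2147483648  # 2 GiB
  actions:
    - name: "envoy.overload_actions.stop_accepting_requests"
      triggers:
        - name: "envoy.resource_monitors.fixed_heap"
          threshold:
            value: 0.95  # act at 95% of the configured heap
```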
First, timeouts. Timeouts are essential for distributed systems; they're how we ensure that resources aren't tied up indefinitely. For example, if a client sends a request for Foo, we'd want to bound how long that request can take. We don't want the client left waiting around hanging, and we don't want resources tied up throughout the system.
Envoy has the ability to reset expensive HTTP/2 streams. This is Atlas, from Greek myth. He holds up the world; for your traffic, that's Envoy. For HTTP/2 traffic, Envoy knows how many bytes it has buffered for a particular request and response, and it can use that information to drop the more expensive streams. As resource pressure increases, it can drop streams more and more aggressively to keep the proxy alive.
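A sketch of how this action could be wired up, using a scaled trigger so the resets ramp up with heap pressure (the thresholds here are illustrative, not from the talk):

```yaml
# Hypothetical action fragment: reset the most memory-expensive
# HTTP/2 streams as heap usage grows. A "scaled" trigger ramps
# the action between the two thresholds.
actions:
  - name: "envoy.overload_actions.reset_high_memory_stream"
    triggers:
      - name: "envoy.resource_monitors.fixed_heap"
        scaled:
          scaling_threshold: 0.80     # start resetting streams here
          saturation_threshold: 0.95  # reset most aggressively here
```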
Envoy has the ability to stop accepting connections. Downstream connections are often where Envoy's workload is generated from, so for an overloaded Envoy, disabling the listeners hopefully prevents additional work from being added and keeps it from crashing. Of course, this can harm both malicious and well-behaved clients.
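Configuring that behavior might look like the following; the 95% threshold is an assumed value for illustration:

```yaml
# Hypothetical action fragment: disable listeners (stop accepting
# new downstream connections) once heap usage crosses a threshold.
actions:
  - name: "envoy.overload_actions.stop_accepting_connections"
    triggers:
      - name: "envoy.resource_monitors.fixed_heap"
        threshold:
          value: 0.95
```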
Envoy has the ability to tell clients to disconnect. This is particularly important in fleet-wide uses. For example, in this given case there's one Envoy that has many clients. That Envoy is overloaded, and as such we might spin up some new instances. Well, those instances aren't doing anything helpful right now, because the clients are still on the overloaded Envoy, and they're having a lousy experience because they're on an overloaded Envoy.
Envoy can tell TCMalloc to return some of the memory it holds back to the OS if memory limits are near.
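A sketch of the corresponding action, with an assumed threshold:

```yaml
# Hypothetical action fragment: ask the allocator (TCMalloc) to
# release free pages back to the OS as memory limits approach.
actions:
  - name: "envoy.overload_actions.shrink_heap"
    triggers:
      - name: "envoy.resource_monitors.fixed_heap"
        threshold:
          value: 0.95
```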
So now let's shift into an experiment: we're going to try out static timeouts versus Slowloris. What is Slowloris? It's effectively a client being maliciously slow: it tries to tie up resources, for example by sending a request and not reading the response. In the following experiment, we have a client using HTTP/1 to connect to the Envoy.
It sends 60 KB worth of headers, and afterwards it maintains the connection and the stream by sending one byte every 15 seconds, in order to keep the stream active.
In this scenario, the attack could reach about 25K connections in this given experiment.
So this is a graph of the memory usage of the task, and you see all of those sharp spikes. Those are the points where Envoy crashed, and the reason we keep getting data afterwards is due to automatic restarts. We can see the same thing with the active client connections: this is a graph of the active client connections, and you can see that Envoy starts crashing at around 18K client connections under this traffic.
With these given configurations in mind, let's conduct the same experiment, this time with scaled timeouts. The timeout can scale between 60 seconds and 5 seconds, and the scaling starts at 60% memory utilization and saturates at 90%. What that means is that at 90% memory usage we'd effectively have turned the 60-second timeout into a five-second one. Here's the corresponding graph of memory usage with scaled timeouts.
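The scaled-timeout setup described here can be sketched with Envoy's reduce-timeouts action: the downstream idle timeout scales from its configured 60 seconds down to a 5-second floor as heap usage rises from 60% to 90%. Treat the exact field layout as illustrative rather than the talk's verbatim config:

```yaml
# Sketch of a scaled timeout: the idle timer shrinks toward its
# 5s floor as heap usage climbs from 60% to 90%.
actions:
  - name: "envoy.overload_actions.reduce_timeouts"
    triggers:
      - name: "envoy.resource_monitors.fixed_heap"
        scaled:
          scaling_threshold: 0.60     # begin scaling at 60% heap
          saturation_threshold: 0.90  # fully scaled at 90% heap
    typed_config:
      "@type": type.googleapis.com/envoy.config.overload.v3.ScaleTimersOverloadActionConfig
      timer_scale_factors:
        - timer: HTTP_DOWNSTREAM_CONNECTION_IDLE
          min_timeout: 5s
```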
We see again that there's a sharp rise in memory usage, but it levels off once we pass the 60% threshold; that's when we start scaling the timeouts, and at 90% we would have reached saturation. We're scaling the 60-second timeout down below the 15 seconds that the attack traffic uses to maintain the connection, so the effective timeout ends up somewhat under 15 seconds. As such, we're able to keep the proxy up and maintain around 16K client connections.
So there are some caveats, of course. It's very important, when you're using the overload manager, to configure it for your given workload and your requirements. Otherwise it might not help you; it could actually actively hurt you. Small deployments can run into trouble with TCMalloc fragmentation overhead, and traffic diversity matters: the overload manager might not be able to help depending on the traffic and its configuration.