Description
Don’t miss out! Join us at our upcoming event: KubeCon + CloudNativeCon Europe 2023 in Amsterdam, The Netherlands from April 17-21. Learn more at https://kubecon.io. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.
Lightning Talk: Protecting Envoy: Overload Manager - Kevin Baichoo, Google
How can Envoy protect itself from OOMs? Envoy has a number of different protection mechanisms out-of-the-box -- how do they work? When should you use them and how should they be configured? Let's find out! Kevin will conclude with some experimental results using these protection mechanisms.
So welcome to my talk on protecting Envoy with the overload manager. I'm Kevin Baichoo, a software engineer at Google and an Envoy maintainer. Many users run Envoy at the edge, and in edge deployments an attacker can disrupt your service either by taking the service out altogether or by taking out the pipe that leads to it, which in this case is the Envoy proxy.
The reason we ran into issues like those OOMs is that we weren't protecting some resources, and this is exactly what the Envoy overload manager tries to do. It works by measuring a particular resource and taking action as needed. Let's explore what the Envoy overload manager can do.
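As a sketch of what that pairing of monitor and action looks like in practice, here is a minimal overload manager configuration; the heap size and threshold are illustrative values, not the talk's configuration:

```yaml
# Hypothetical bootstrap fragment: a fixed-heap resource monitor
# paired with one overload action. Values are illustrative.
overload_manager:
  refresh_interval: 0.25s
  resource_monitors:
    - name: "envoy.resource_monitors.fixed_heap"
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
        max_heap_size_bytes: 2147483648  # 2 GiB
  actions:
    - name: "envoy.overload_actions.stop_accepting_requests"
      triggers:
        - name: "envoy.resource_monitors.fixed_heap"
          threshold:
            value: 0.95  # act at 95% of the configured heap
```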
First, timeouts. Timeouts are essential for distributed systems; they're how we ensure that resources aren't tied up indefinitely. For example, if a client sends a request for Foo, we'd want to bound how long that request can take. We don't want the client left waiting around hanging, and we don't want resources tied up throughout the system.
Envoy has the ability to reset expensive HTTP/2 streams. This is Atlas, from Greek myth. He holds up the world; for your traffic, that's Envoy. For HTTP/2 traffic, Envoy knows how many bytes it has buffered for a particular request and response, and it can use that information to drop the more expensive streams. As resource pressure increases, it can drop streams more and more aggressively to keep the proxy alive.
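A sketch of how this action could be wired up, using a scaled trigger so the resets ramp up with heap pressure (the thresholds here are illustrative, not from the talk):

```yaml
# Hypothetical action fragment: reset the most memory-expensive
# HTTP/2 streams as heap usage grows. A "scaled" trigger ramps
# the action between the two thresholds.
actions:
  - name: "envoy.overload_actions.reset_high_memory_stream"
    triggers:
      - name: "envoy.resource_monitors.fixed_heap"
        scaled:
          scaling_threshold: 0.80     # start resetting streams here
          saturation_threshold: 0.95  # reset most aggressively here
```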
Envoy has the ability to stop accepting connections. Downstream connections are often where Envoy's workload is generated from, so for an overloaded Envoy, disabling the listeners hopefully prevents additional work from being added and keeps it from crashing. Of course, this can harm both malicious and well-behaved clients.
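Configuring that behavior might look like the following; the 95% threshold is an assumed value for illustration:

```yaml
# Hypothetical action fragment: disable listeners (stop accepting
# new downstream connections) once heap usage crosses a threshold.
actions:
  - name: "envoy.overload_actions.stop_accepting_connections"
    triggers:
      - name: "envoy.resource_monitors.fixed_heap"
        threshold:
          value: 0.95
```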
Envoy has the ability to tell clients to disconnect. This is particularly important in fleet-wide uses. For example, in this given case there's one Envoy that has many clients. That Envoy is overloaded, and as such we might spin up some new instances. Well, those instances aren't doing anything helpful right now, because the clients are still on the overloaded Envoy, and they're having a lousy experience because they're on an overloaded Envoy.
Envoy can tell TCMalloc to return some of the memory it holds back to the OS if memory limits are near.
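A sketch of the corresponding action, with an assumed threshold:

```yaml
# Hypothetical action fragment: ask the allocator (TCMalloc) to
# release free pages back to the OS as memory limits approach.
actions:
  - name: "envoy.overload_actions.shrink_heap"
    triggers:
      - name: "envoy.resource_monitors.fixed_heap"
        threshold:
          value: 0.95
```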
So now let's shift into an experiment: we're going to try out static timeouts versus Slowloris. What is Slowloris? It's effectively a client being maliciously slow: it tries to tie up resources, for example by sending a request and not reading the response. In the following experiment, we have a client using HTTP/1 to connect to the Envoy.
It sends 60 KB worth of headers, and afterwards it maintains the connection and the stream by sending one byte every 15 seconds, in order to keep the stream active.
In this scenario, the attack could reach about 25K connections in this given experiment.
So this is a graph of the memory usage of the task, and you see all of those sharp spikes. Those are the points where Envoy crashed, and the reason we keep getting data afterwards is due to automatic restarts. We can see the same thing with the active client connections: this is a graph of the active client connections, and you can see that Envoy starts crashing at around 18K client connections under this traffic.
With these given configurations in mind, let's conduct the same experiment, this time with scaled timeouts. The timeout can scale between 60 seconds and 5 seconds, and the scaling starts at 60% memory utilization and saturates at 90%. What that means is that at 90% memory usage we'd effectively have turned the 60-second timeout into a five-second one. Here's the corresponding graph of memory usage with scaled timeouts.
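The scaled-timeout setup described here can be sketched with Envoy's reduce-timeouts action: the downstream idle timeout scales from its configured 60 seconds down to a 5-second floor as heap usage rises from 60% to 90%. Treat the exact field layout as illustrative rather than the talk's verbatim config:

```yaml
# Sketch of a scaled timeout: the idle timer shrinks toward its
# 5s floor as heap usage climbs from 60% to 90%.
actions:
  - name: "envoy.overload_actions.reduce_timeouts"
    triggers:
      - name: "envoy.resource_monitors.fixed_heap"
        scaled:
          scaling_threshold: 0.60     # begin scaling at 60% heap
          saturation_threshold: 0.90  # fully scaled at 90% heap
    typed_config:
      "@type": type.googleapis.com/envoy.config.overload.v3.ScaleTimersOverloadActionConfig
      timer_scale_factors:
        - timer: HTTP_DOWNSTREAM_CONNECTION_IDLE
          min_timeout: 5s
```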
We see again that there's a sharp rise in memory usage, but it levels off once we pass the 60% threshold; that's when we start scaling the timeouts, and at 90% we would have reached saturation. We're scaling the 60-second timeout down below the 15 seconds that the attack traffic uses to maintain the connection, so the effective timeout ends up somewhat under 15 seconds. As such, we're able to keep the proxy up and maintain around 16K client connections.
So there are some caveats, of course. It's very important, when you're using the overload manager, to configure it for your given workload and your requirements. Otherwise it might not help you; it could actually actively hurt you. Small deployments can run into trouble with TCMalloc fragmentation overhead, and traffic diversity matters: the overload manager might not be able to help depending on the traffic and its configuration.