Description
Failing forward to 1 million requests per second - Axel Liljencrantz, Mikael Sundberg
Many companies claim to have a work culture that celebrates failures, but few companies have tested that claim as thoroughly as Spotify did during our migration to Envoy. Come hear war stories of trying, failing, and failing some more with Envoy, and learn how to make sure you learn something new every time you fail.
A
In addition to having a unified perimeter, there is a long laundry list of features we want to get out of this setup: common metrics, authentication, rate limiting, client IP lookups, access logs and so on. And, as you might know, Envoy doesn't actually do all of those things. Our desired Envoy setup contains a Docker sidecar that runs a second service, implementing authentication, GeoIP lookups and smarter things.
B
Our test setup uses the same load balancer as production and an identically configured cluster, but with only one host, and various different core counts on that host. Finally, our test used a single upstream service named noop. Noop is a service whose reply time, status code and payload size can all be configured on each incoming request.
A
That's a pretty high number, and someone pointed out that the buffer size is one megabyte, and with some math (one megabyte multiplied by the number of open connections) you get a total buffer size of 13 gigabytes. That's quite a lot of buffering for Envoy to do. So we tried decreasing it to 32 kilobytes for each connection, and our requests per second increased from 30,000 to 60,000 on direct responses.
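The per-connection buffer limit is set on the listener (and the same field exists on clusters for upstream connections). A minimal sketch of what that change could look like in a static bootstrap, with a direct-response route like the one used in the test; names and addresses here are illustrative, not our actual config:

```yaml
static_resources:
  listeners:
    - name: ingress
      address:
        socket_address: { address: 0.0.0.0, port_value: 8080 }
      # Envoy's default is 1 MiB per downstream connection; cap it at 32 KiB.
      per_connection_buffer_limit_bytes: 32768
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                route_config:
                  virtual_hosts:
                    - name: all
                      domains: ["*"]
                      routes:
                        # Direct response, so the proxy itself is the bottleneck
                        # being measured rather than any upstream.
                        - match: { prefix: "/" }
                          direct_response: { status: 200, body: { inline_string: "ok" } }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```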
B
We got a suggestion, in the form of Harvey Tuch: SO_REUSEPORT. This configuration option in Envoy is described as such: it "makes inbound connections distribute among worker threads roughly evenly in cases where there are a high number of connections", which begs the question: when would you not want connections evenly distributed among workers anyway?
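For reference, roughly what enabling it looks like on a listener; this is a sketch, and depending on Envoy version the field is the newer `enable_reuse_port` (assumed here) or the older boolean `reuse_port`:

```yaml
listeners:
  - name: ingress
    address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    # Creates one listening socket per worker thread (SO_REUSEPORT), letting
    # the kernel spread new connections across workers roughly evenly.
    enable_reuse_port: true
    # ...filter_chains as in the earlier sketch
```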
A
So we have started doing fun things, like upgrading to version 3 of the xDS API, adding rate limiting and looking at CORS configuration for our clients. And we did take it slow, by rolling out gradually over an entire year, and we did spend a full week of performance testing before our last and final deployment, and still we failed to identify five major scalability bottlenecks. Maybe spending an hour looking at all the available metrics while testing our setup might have actually identified a few of these problems, but probably not all of them.
A
Looking back, this journey was a lot of fun, even though it didn't always feel like that while it was ongoing, and we for sure did learn a lot. So, some suggestions we thought we would share; they would most likely have helped us, so maybe they can help someone else. Make the default queue size per core, so you don't have to remember to change it when you change your machine type to one with a different number of cores. Make SO_REUSEPORT the default.
B
This often means having the right metrics. Next, do your best to reproduce all problems outside of the production environment. Not only does doing so give you much more opportunity to see what happens in various related failure scenarios; the act of crafting a test environment often shows you blind spots you didn't know you had. And finally: communicate. Ask for help. Broadcast your shortcomings to anyone who can be made to listen, like you. Even if your mistakes are embarrassingly dumb like ours, keep talking.
B
Thank you for all of the feedback and the thumbs up and whatnot. Let's see: have you guys looked at enabling the exact balance option on the listener? I'm gonna let you handle that one, because I don't know.
B
There was a question about whether the HTTP/1.1 issue was identified. So, there was no HTTP/1.1 issue. That was a suspicion that we had, that maybe the HTTP/1.1 stack was slower, or less battle tested, or less scalable, or something like that, and that turned out to be wrong. We are still using HTTP/2 from the load balancer to Envoy, obviously, and then from Envoy to our microservices we're talking HTTP/1.1, and they both seem to perform just fine.
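For background, Envoy talks HTTP/1.1 to upstream clusters by default; a cluster has to be explicitly opted into HTTP/2. A sketch of what that opt-in looks like, with an illustrative cluster name and endpoint:

```yaml
clusters:
  - name: microservice  # illustrative name
    connect_timeout: 1s
    type: STRICT_DNS
    load_assignment:
      cluster_name: microservice
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: service.internal, port_value: 8080 }
    # Without this block the upstream connections use HTTP/1.1; adding it
    # switches the cluster to HTTP/2 (e.g. for gRPC backends).
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
```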
B
Running into very similar problems at Twitter, I think, overall, I would expect people that have very large request volumes to have similar issues. And I think there is a very good start on how to run an HTTP proxy for a large organization in the docs for Envoy, but I think there are opportunities to improve the configuration, as well as improve that documentation, to make life even easier for large installations.
B
It's another way of forcing connection balancing. Well then, we should look into it and see if it works better or worse. Thanks for the tip.
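For context, the option the question refers to is the listener's connection balancer; a minimal sketch of enabling exact balancing, with the rest of the listener elided:

```yaml
listeners:
  - name: ingress
    address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    # Exact balancing makes workers hand off accepted sockets through a
    # shared lock, keeping per-worker connection counts nearly identical.
    connection_balance_config:
      exact_balance: {}
    # ...filter_chains as in the earlier sketch
```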
B
And Matt Klein asks: why are you using HTTP/1.1 to the backends versus 2? The answer to that is mostly legacy. Spotify has a very old network stack; it's about a decade old. We implemented our own transport layer instead of HTTP, because we had a lot of scalability problems with HTTP.
B
This transport layer, called Hermes, is basically very similar in most ways to HTTP/2. It solves the same problems in mostly the same way, and it tries to be very HTTP-like in its API, but it is older than HTTP/2.
B
We started that work slightly before Google started talking about SPDY publicly, and we are still transitioning away from this internal Hermes protocol. What we have today for our Hermes-based services is a library that you can use to accept HTTP traffic as if it was Hermes traffic, and we are instead moving to internally use HTTP/2 and gRPC, and then in the future hopefully HTTP/3 and so on, like modernizing our stack. But we're not there yet.
B
So, with regards to filters: we are not using much in the way of filters. We are using a few filters to filter out users who are not allowed on some resources, and so on. But the big thing that reduces our efficiency, I would say, is that we are running both Envoy itself and this decorator sidecar, which is implemented as an ext_authz gRPC filter, on the same 32-core machine.
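To illustrate that wiring, a sketch of an ext_authz HTTP filter delegating to a gRPC sidecar; the cluster name and timeout are illustrative, not our actual values:

```yaml
http_filters:
  - name: envoy.filters.http.ext_authz
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
      # Each request detours through the sidecar (the decorator) before
      # being routed upstream, adding the extra hops described below.
      grpc_service:
        envoy_grpc:
          cluster_name: decorator_sidecar  # hypothetical cluster pointing at localhost
        timeout: 0.25s
  - name: envoy.filters.http.router
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```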
B
So the three resource hogs on the machine are: Envoy itself, which uses about half the CPU; the sidecar, which uses slightly less, but still a significant amount; and lastly the metrics propagation, which uses about 1 out of 32 cores. All three of those are running on every single Envoy host.
B
That also means that you get a message in to Envoy, then it's passed out from Envoy to the other service, then back, and then on to the next hop, and then you get the reply. So there are six message passing steps, or something like that, not just the four that you would expect.
B
That decision is over a year old. I was very interested to hear the talk, like one of the opening talks, about using WebAssembly to make your own custom filters in Envoy.
B
We did not want to write our own C++ filters, because we, as a company, have too few developers who are super comfortable with C++, and then it becomes a "who owns this" problem, whereas we have lots of Java devs. But WebAssembly might help out with that. We don't know; we'll see.
B
If someone has a question that they posted that we didn't answer, it's not because we hate you, it's because we missed it, so please feel free to repost it in that case. Yeah.
B
Thanks a lot everyone for listening; this was great. I will now disconnect and go say hi to the third talker from this conference, Titus. So, bye bye.