From YouTube: Application Networking Day Session #7 Data Plane Resilience, No Problem. What About Control Plane?
Description
Featuring Eitan Yarmush. Envoy is an incredibly performant cloud native proxy which is quickly becoming one of the most used pieces of software across our industry. Due to Envoy's popularity and configurability, quite a few control planes have been created to dynamically configure Envoy, including Gloo, Istio, and others. There are now many control planes, but what makes a great control plane? In this talk we'll examine one specific aspect of configuring Envoy: resilience, meaning its ability to tolerate failures.
Yeah, we're going to take a bit of a step back, or not back exactly, but we're going to talk about Envoy here.
So we've been talking a lot about Istio and ambient and eBPF, but we're going to take a step back and talk about the technology, the proxy, that has powered Istio up until this point: the Envoy proxy. The cool thing about Envoy and this talk is that these concepts really apply to Kubernetes software development in general. So I think sharing our experience here, and specifically how it relates to Envoy, is very interesting and hopefully very useful to all of you.
So with that, we'll get started. Who am I? My name is Eitan Yarmush. I am an architect at Solo, and I've been with Solo for about four years now. I've worked on pretty much all of our products, past and future things I can't talk about... no, I'm just kidding. So, what are we going to go over?
Well, first I'm going to give a brief introduction about Envoy, specifically xDS. For those of you, I guess, let me take a poll: who in this room knows what xDS is? Okay, so don't worry if you don't; I'm going to go over it, I was just curious. And then I'm going to describe the problem statement and potential solutions, so you're just going to have to hold your breath for that one.
All right, so Envoy. Who in this room has heard of Envoy before? Yeah, that's what I expected. Okay, so Envoy is, as the slide says, an open source edge and service proxy designed for cloud native applications. Now, why is that? We hear Envoy, Envoy, Envoy; it's this amazing proxy.
Now, in my opinion, there are actually two reasons for that. One is its performance and speed: it operates incredibly well at scale. But the second, and most important, one is its configuration. The way it's configured, specifically with xDS, is designed for cloud native environments. We're going to talk about that in a second, but specifically we're going to be talking about the different control planes that exist and how those are built.
So there are many control planes out there right now. Specifically, from the Solo side there's Gloo Edge; that's our edge proxy. And then there's Istio, of course, which configures all of the sidecars and now the waypoint proxy, as well as ztunnel. And then there are other projects that exist in the CNCF landscape, like Contour, and plenty of others that have been homebrewed by lots of different companies, but obviously I don't know what those are, so we're not going to go into those.
But there are many. So, why Envoy? To talk about why Envoy, I think we should talk about how the typical proxy is configured. When you think about configuring a proxy historically, and we're going to use NGINX as an example, there's some config file that you have to give to that proxy, and the proxy, or whatever piece of software, has to read that configuration in. Now, let's say you want to update that configuration. Either the proxy has to restart to pick it up or, assuming there's some hot restart, it has to read in the file, close down some of its connections, and restart them. So when you're in an environment where the endpoints are not changing very often, that's not a big deal. But that's not how Kubernetes works: pods are coming up, they're spinning down, they're changing all the time. Endpoints are all over the place, so we need a proxy that's able to handle that.
Okay, so how does Envoy work? Well, the way that Envoy gets its config is that there is a control plane, or a "management server" in Envoy language, running somewhere that the proxies can talk to. I'm not going to get too much into the technology there, but there's a bidirectional stream between Envoy and the control plane over which the config is fed: Envoy asks for the config, and the control plane sends it back. So this is a very in-depth look at what that looks like.
This is merely meant to show the complexity and ingenuity of the model. I'm not going to dive too deep, but I will say that the Envoy proxy docs have an entire page about the xDS model, and I could talk at length about why I think xDS is amazing. For now I'm not going to go too deep on this, but this is, at a high level, how xDS works.
So what does that look like, more simplified? Envoy is going to say, "Right, I want config v1." Cool. Now the control plane says back, "All right, here it is, I have that for you." Now, in the good case, if the config is good, then Envoy is going to say, "All right, that was great, I can serve that. Thank you." Now, what happens when it's bad? Okay, it's the same first steps: Envoy reaches out, the control plane says "here it is," but this time Envoy says "nope."
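That accept-or-reject exchange is the xDS ACK/NACK handshake. As a rough, language-agnostic sketch (real xDS runs over gRPC with protobuf messages; the class and field names here are hypothetical stand-ins, not the actual API), the versioning logic looks something like this:

```python
# Simplified sketch of the xDS ACK/NACK exchange. The proxy either
# applies a config version and ACKs it, or rejects it (NACK) and keeps
# serving the last version it successfully applied.

class ControlPlane:
    def __init__(self):
        self.version = "v1"
        self.config = {"listeners": ["0.0.0.0:8080"]}

    def discovery_response(self):
        return {"version_info": self.version, "resources": self.config}

class EnvoyClient:
    def __init__(self):
        self.acked_version = None   # last version successfully applied
        self.active_config = None   # config currently being served

    def handle(self, response, valid=True):
        if valid:
            # ACK: apply the config and echo the new version back.
            self.active_config = response["resources"]
            self.acked_version = response["version_info"]
            return {"version_info": self.acked_version}
        # NACK: keep serving the old config; report the last good version.
        return {"version_info": self.acked_version, "error": "rejected"}

cp = ControlPlane()
envoy = EnvoyClient()

ack = envoy.handle(cp.discovery_response(), valid=True)    # good config: ACK
nack = envoy.handle(cp.discovery_response(), valid=False)  # bad config: NACK
```

The key property, which the talk relies on later, is that a NACK never takes traffic down: the proxy simply keeps the last config it ACKed.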
Now, how can this go wrong? In our initial discussion about NGINX, and how a proxy has historically been configured through files, the file wasn't going to go away. The file is on the file system, and unless you have some really crazy situation, the file system is probably going to be there. However, that's not always true for a separate application.
There are two failure modes: hardware issues and software issues. The red X is a hardware issue. Let's say for some reason the control plane is scheduled to a node, or it's running somewhere, where the hardware goes down, or for whatever reason you have a regional outage or a zonal outage. In this case, Envoy can no longer get its config, so it's going to stop serving the correct thing, and that's a real problem.
So again, the new model, the Envoy model, allows for many dynamic config sources, especially routes and endpoints, which are constantly changing, with zero downtime. That means zero downtime when you change any of the config; Envoy just continues to serve. Another plus is that you can have multiple config sources for a single proxy, which is not possible if you have a file. And then there's also a well-known proto API.
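To make the dynamic-config idea concrete, here is a minimal sketch of an Envoy bootstrap that pulls its listeners and clusters from a management server over ADS rather than from a static file. The `xds_cluster` name, address, and port are placeholders for wherever your control plane actually runs:

```yaml
node:
  id: envoy-1
  cluster: example
dynamic_resources:
  ads_config:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
      - envoy_grpc:
          cluster_name: xds_cluster
  lds_config:
    resource_api_version: V3
    ads: {}
  cds_config:
    resource_api_version: V3
    ads: {}
static_resources:
  clusters:
    - name: xds_cluster            # the one static piece: where to find the control plane
      type: STRICT_DNS
      connect_timeout: 1s
      http2_protocol_options: {}   # xDS is gRPC, so the stream needs HTTP/2
      load_assignment:
        cluster_name: xds_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: control-plane.example.svc.cluster.local
                      port_value: 18000
```

Everything except the pointer to the control plane arrives over the stream, which is exactly why the control plane's availability becomes the interesting problem.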
So it's really easy to create these control planes. But there are pros to the traditional approach. For one, the config is saved to disk, so it's tolerant to a restart; if the Envoy proxy restarts and the control plane is not up, you might be subject to some downtime. It's also simpler when you're deploying a smaller number of proxies: if you only have a few proxies and they're running from your file config, you don't have to worry about potentially configuring so many proxies all at the same time. And it is also simpler if the config is static, if it doesn't change very often, which, as mentioned earlier, is probably not true in a cloud native environment where pods are spinning up and spinning down all of the time.
So let's briefly talk about the standard running environment. In the normal case, you're going to have Envoy running in your Kubernetes cluster and your control plane running in your Kubernetes cluster. Now hold on: as mentioned, what can go wrong? Well, the node can go down, and if your node goes down, your control plane is going to be down and Envoy is not going to get any more config. Hopefully it'll reschedule, but having any downtime where Envoy can't get those updates is obviously less than ideal.
So what can we do about that? Well, luckily, Kubernetes allows us to scale up pods using a Deployment, or whatever other scheduling mechanism you might use depending on your Kubernetes distro, but the Deployment is the one that comes with Kubernetes. So you can scale it up, and if one replica goes down, you're much safer. And if a specific node goes down, or, let's say you're running a multi-zone cluster and one zone goes down, you're safe. Awesome. But we are still liable to failure here.
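A minimal sketch of that scaling step, assuming a hypothetical control-plane image, might look like the following: multiple replicas, spread across zones so a single zonal outage doesn't take every copy down at once.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: control-plane
spec:
  replicas: 3
  selector:
    matchLabels:
      app: control-plane
  template:
    metadata:
      labels:
        app: control-plane
    spec:
      # Spread replicas across zones so a zonal outage leaves some running.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: control-plane
      containers:
        - name: control-plane
          image: example.com/control-plane:latest   # placeholder image
```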
So what failures are we still liable to? Well, what if there's a software issue in the control plane and we stop serving config? We've eliminated the problem of having a node go out, of having a hardware issue, but we still have potential software issues. We want to continue serving, but if for some reason we have a panic, or our control plane is not functioning properly, Envoy will stop getting its config. So how do we deal with that?
Well, to talk about that, let's talk about how these control planes usually work, and this is very similar to how many control planes in Kubernetes work, whether they're for Envoy or anything else. You have a set of CRDs which are configuring the control plane, and then the control plane configures whatever end product, which in this case is Envoy. However, the control plane is liable to failure just like any other component in Kubernetes, and that's what this orange X means.
A
It's
definitely
not
orange
on
the
screen
screen,
but
it
is
orange
for
those
of
you
in
the
back
and
some
weird
color
here.
So
how
do
we
ensure
that
our
control
plane
is
always
serving
config
right,
even
though
it
is
receiving
config
via
crds,
potentially
from
other
users,
potentially
from
a
whole
nother
control,
plane
right,
the
the
abstractions
continuously
layer
on
each
other
and
all
of
a
sudden?
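As one concrete instance of that layering: a control plane like Istio watches high-level CRDs such as this one (the hostnames and subset are illustrative) and translates them into low-level Envoy route configuration pushed over xDS.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews-route
spec:
  hosts:
    - reviews
  http:
    - route:
        - destination:
            host: reviews
            subset: v2
```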
So let's talk about two ways that we can do this, and these are both coming from lessons that we have learned at Solo building Gloo Edge. The first is: we can make sure that the core logic of the control plane doesn't error. Now, what do I mean by that? We can ensure that there is no situation, no code path, in which the control plane doesn't continue to serve at least some config. This comes down to a trade-off between two different situations.
Do you want to serve the minimal config, or do you want to stop serving altogether? In our experience, serving minimal config is always better than not serving at all. What that has meant in practice is making sure that the actual logic, which takes in the new configuration, or whatever it is listening to, and translates it to the Envoy config, can never fail and always serves something. You know, ensure it can't panic; but, you know, we've all been there. So how do we solve that?
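The "always serve something" idea can be sketched in a few lines. This is an illustrative toy, not Gloo Edge's actual code: the translation step may fail on bad input, but the serving layer catches the failure and keeps handing out the last good snapshot (starting from a minimal one) instead of crashing.

```python
# Hypothetical sketch: a translation loop that never stops serving.
# On any translation error it keeps the last good snapshot alive.

def translate(crds):
    """Turn user-facing config (e.g. CRDs) into proxy config. May raise."""
    if not isinstance(crds, list):
        raise ValueError("bad input")
    return {"routes": [c["route"] for c in crds]}

class Snapshotter:
    def __init__(self):
        self.last_good = {"routes": []}  # minimal config: always serve something

    def update(self, crds):
        try:
            self.last_good = translate(crds)
        except Exception:
            # Serving the previous (or minimal) config beats serving nothing.
            pass
        return self.last_good

s = Snapshotter()
good = s.update([{"route": "/a"}])   # valid input: new snapshot served
after_bad = s.update("not-a-list")   # invalid input: last good snapshot kept
```

The design choice is exactly the trade-off above: a bad update degrades to stale config rather than to no config.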
What's another thing that we can do? Well, we can add a cache layer. So what does that mean? Depending on the control plane implementation, there may be quite a bit of complex business logic in it, and this business logic is liable to failure. Again, we can integration test and unit test and end-to-end test to our hearts' content, but things happen. So it's liable to failure, for one, but it's also potentially computationally expensive.
So if we have a control plane, and all of a sudden we want high availability for all of our proxies and we start scaling it up, our CPU costs might start going through the roof, because all of a sudden we're computing the config for Envoy a million different times. So why do we need to compute it so many times? If the problem we were trying to solve was just to make sure it never went down, then to make sure it doesn't go down we probably only need one percent of that. Okay.
So what can we do about that? Well, luckily, there's an open source project from the Envoy community called xds-relay which can help with that, and we'll talk about that in a second. So what does xds-relay do? Well, it helps in two major ways. The first is caching, and this is actually something that we added on top of xds-relay; the open source implementation does not do this. Essentially, it just takes the last config, and not the last good config, just the last config, and saves it locally in memory in the relay. So again, all the logic in the relay is just: get the served config and hold it for Envoy. No complicated business logic, just take and hold. And the other is aggregation.
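The caching half of the relay idea is simple enough to sketch. Again this is an illustrative toy, not the xds-relay API: a component that sits between Envoy and the control plane, holds the last config it saw in memory, and keeps serving that copy when the control plane becomes unreachable.

```python
# Illustrative relay cache: no business logic, just "take and hold".

class Relay:
    def __init__(self, upstream):
        self.upstream = upstream   # callable returning the latest config
        self.cache = None          # last config seen, held in memory

    def fetch(self):
        try:
            self.cache = self.upstream()
        except ConnectionError:
            # Control plane is down: fall back to the cached config.
            pass
        return self.cache

def healthy():
    return {"version": "v2", "clusters": ["svc-a"]}

def down():
    raise ConnectionError("control plane unreachable")

relay = Relay(healthy)
first = relay.fetch()      # cached from the control plane
relay.upstream = down
cached = relay.fetch()     # control plane down: served from the cache
```

Because the relay does almost no work per request, it is cheap to run many replicas of it, which is exactly the compute saving described above.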
So, gathering up configuration from multiple sources, so that Envoy doesn't have to be the one to do that, and then the relay serving the config to Envoy: those are the buzzwords on their layer. But again, you're potentially saving a lot of compute, because the work that this component has to do is very, very little. And again, the important thing here is not just xds-relay; it's the concept.
I really believe that this model applies to so many development scenarios in Kubernetes. These are the lessons that we learned from Gloo Edge and Envoy. So now that we've gotten towards the end: we have scaled our control plane and it's running in multiple zones, but we were finding that expensive, so we wanted to keep that cost down, and we also wanted to make sure that if the control plane did go down for any logic reason, the config was cached in xds-relay. Okay.
And that's it. Again, in summary, these are the lessons that we have learned building out Gloo Edge, and hopefully you can learn something from them for your general application development, or for learning more about how Envoy, xDS, or anything in there works. Again, I was Eitan; hope you enjoyed.