From YouTube: Istio 1.6 Locality Aware Load Balancing and Failover
Description
A quick video on locality-aware routing / load balancing / failover using Istio within a single cluster running in multiple availability zones.
Learn more
About Service Mesh Hub https://solo.io/products/service-mesh-hub/
About Istio https://istio.io
Community https://slack.solo.io
Now, this is a feature that has been available in Istio for some time. There have been some improvements to it recently, and what I want to show is how you can keep traffic constrained to its current locality, so that the source of the traffic stays in the same locality as the destination, but then, when things start to go wrong, you can start to expect spillover and failover and so forth. So let's jump right into it.
Nick Jackson, a buddy of mine who works at HashiCorp, actually wrote this fake service. What it does is allow us to simulate behaviors in a service, whether that's responding with a message, erroring out, being rate limited, consuming CPU, and so forth. It's a great little tool for testing this out, and we'll use it here in this demo.
So the first thing that we want to do is take a look at what that fake-service response looks like. If we go into the sleep pod and call our fake service, we get a JSON response that looks like this, and we can see the IP address of the host that returned it. So if I come back here and call it again, we see 3.30, then we got 3.35 and .31.
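To make that concrete, the call being made is roughly the following. This is only a sketch: the sleep deployment name, the fake-service hostname and the port are assumptions about the demo setup, not commands shown in the video.

  # Call the fake service from inside the sleep pod (names and port are assumed, not from the video)
  kubectl exec deploy/sleep -c sleep -- curl -s http://fake-service:8080/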
If we end up calling that a handful of times, like we just did, we see that it tends to load balance across all of the endpoints that Istio knows about, and that reflects the different services that are running. We can see these IP addresses here. Now, these are actually running across the different nodes that we have in our Kubernetes cluster, which means the traffic is going across all these different availability zones.
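A quick way to see that spread is to call the service in a loop and pull out which hosts answered. Again just a sketch, using the same assumed names as above; the ip_addresses field is how the fake service reports the responding host in its JSON body.

  # Call the fake service a handful of times and list which hosts answered (names assumed)
  for i in $(seq 1 10); do
    kubectl exec deploy/sleep -c sleep -- curl -s http://fake-service:8080/ | grep ip_addresses
  done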
Now, you might have reasons for wanting to constrain the traffic locally, so that if traffic originates in zone west-1c, it stays in 1c, right? So let's take a look at Istio's capability to automatically do that for us. It's actually built into Envoy, the proxy that's used under the covers here as the sidecar.
We've got to know which endpoints are healthy, build up our priorities based on their locality, and then, if endpoints become unhealthy, slowly start to spill over. And actually, if you come to the Envoy documentation and look at the priority page, you can see exactly how this works: priority zero is the highest, then P1 and so forth, and it explains the priority levels, the percentage of healthy endpoints, the over-provisioning factor, and so on that get used. It's all nicely explained there.
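As a rough worked example of that spillover math (the 1.4 default over-provisioning factor is my reading of the Envoy docs, not a number given in the video): if only half of the priority-0 endpoints are healthy, then

  P0 traffic share = min(100, 50 * 1.4) = 70%
  P1 traffic share = 100 - 70 = 30%

so roughly 30 percent of the requests spill over to the next priority level.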
One thing I do want to point out, because I've seen it come up and it's a little bit confusing: Envoy does zone-aware routing, or can do zone-aware routing, where the proxy itself picks which endpoints in an upstream to route to. But that's not what gets used in Istio; Istio does locality weighted load balancing. So that's what we're talking about here: the control plane knows where the different endpoints live, and it annotates them and weights them according to a priority that the control plane knows about; the proxy doesn't get to decide. So let's continue and enable the priority weighting.
We need to give it some health checking information. In this case we'll do passive health checking, with the outlier detection capability in Envoy that's exposed through Istio. In this case it applies to the endpoints that live in the load balancing pool.
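A minimal sketch of what that destination rule could look like, assuming a host name of fake-service and round numbers for the outlier detection thresholds; these values are not copied from the demo.

  # Apply a DestinationRule enabling passive health checking (outlier detection).
  # Host name and thresholds are assumptions, not values shown in the video.
  kubectl apply -f - <<EOF
  apiVersion: networking.istio.io/v1alpha3
  kind: DestinationRule
  metadata:
    name: fake-service
  spec:
    host: fake-service
    trafficPolicy:
      outlierDetection:
        consecutive5xxErrors: 1
        interval: 5s
        baseEjectionTime: 30s
  EOF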
So let's add this outlier detection config, the destination rule, to Istio. And now, if we check our priority groups, or rather our endpoints, again, we should see a priority attached to the endpoints. And we do, right: we see priority one for west-1a. Now, this is assigning the priority based on where this particular source service is running, and that is the sleep service.
West-1a is not where the sleep pod is running, and we see a priority of one; remember from the Envoy documentation that priority zero is the highest. We see again west-1a with a priority of one. Let's find another one: west-1b, this one's priority one too; also west-1b. Where's west-1c? Here we go, west-1c, and it doesn't have a priority weighting set on it.
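The endpoint view being scrolled through here can also be pulled from the sidecar with istioctl; a sketch, where the sleep pod name is a placeholder and the JSON output is what carries the locality and priority fields.

  # List the endpoints the Envoy sidecar in the sleep pod knows about; the JSON
  # output includes each endpoint's locality and priority (pod name is a placeholder).
  istioctl proxy-config endpoint <sleep-pod-name> -o json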
A
The
priority
will
be
zero
if
it's
not
specified
and
we'll
verify
that
in
a
second
now,
if
we
call
the
endpoints
or
call
our
fake
service
again,
we
should
see
that
it
gets
constrained
to
just
the
end
points
that
are
running
in
that
locality.
In
this
case,
we
can
see
it's
not
load
balancing
across
all
seven
endpoints.
It's
doing
it
across
looks
like
three
different
ones:
331,
341
and
3
30..
Does that match up? Here we see the fake services .30, .41 and .31 all living on the same node as the sleep service. So we're good: now we're starting to get that locality-based routing, so that we prefer the services in our same locale before we start to spill out to any others in the load balancing pool.
Now let's look at the clusters. We see all the clusters that are running in the Envoy proxy in the sleep pod. If we go to the fake service, we can see our configuration, and we can also see which endpoints are currently in the load balancing pool. We can see that the ones that run in us-west-1c have a priority of zero. Remember we said we'd verify that, and indeed it is zero; we see priority zero again here.
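What's being looked at here maps to the Envoy admin interface in the sidecar; a sketch, assuming the standard sidecar admin port and the fake-service name.

  # Dump Envoy's cluster/endpoint view from the sleep pod's sidecar
  # (15000 is the default sidecar admin port; deployment and service names are assumptions).
  kubectl exec deploy/sleep -c istio-proxy -- curl -s localhost:15000/clusters | grep fake-service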
So now, instead of returning a 200, the bad fake service is going to return a 500, and we'll see it here in K9s as it starts to update itself.
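For reference, flipping some fake-service replicas into a failing mode can be done with its error-injection environment variables; the deployment name and the exact variable names below are assumptions about the fake-service image, not taken from the video.

  # Make the "bad" fake-service replicas answer with HTTP 500s
  # (deployment and env var names are assumptions about nicholasjackson/fake-service).
  kubectl set env deploy/fake-service-bad ERROR_RATE=1 ERROR_CODE=500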
So now we have the bad fake service and the good fake service, some of them running in the same locale as the sleep pod. If we come back here to the stats and refresh the fake service, we see healthy in west-1c, healthy in west-1c, same thing here; that's three and four, and you see a priority of zero for those ones. So that's all good.
Now, let's start calling the service. All right, we just called it down here, the fake service, and we got a 500 from 3.42. So let's refresh for 3.42: now we're seeing it's been marked as failed, we've kicked it out of the load balancing pool, and 3.42 will be kicked out for 30 seconds. So if we call it again a couple of times: that one looks good, that looks good, .30 and .31 look good, .43 does not look good, and .42, we saw, did not look good. And as we refresh now,
3.31 was good, .30 is good, and we start to see that we're able to load balance while staying constrained locally, but the bad ones have been kicked out; we're not calling those. We're just using basic outlier detection here, and if we call it a couple more times, we might see that eventually one of the bad ones shows up again, but it'll be kicked out again, this time for even longer. And if it happens that all of them, or too many of them, start to get kicked out,
then we'll fail over to the fake-service workloads that are running in the different localities, the different availability zones in this case, automatically. Now, this is all applicable for a single Istio service mesh, where the control plane knows about the locality of these different workloads. So this will work for a single cluster, and it'll work for multiple clusters if you use a single Istio control plane. Now, there are reasons why you might not want to do that, including failure isolation and so forth.
It's probably better to run multiple Istio control planes, but in that case this locality information is only specific to one cluster, so for routing across clusters this doesn't work very well; you don't get that locality-aware routing across clusters. But that's where an open-source project we're working on at Solo, called Service Mesh Hub, fits into the picture. Service Mesh Hub is a management plane for multiple installations of a service mesh, including Istio, and in the latest version of Service Mesh Hub we have the ability to define failover and locality between multiple deployments of Istio, multiple Istio control planes, across multiple clusters. That'll be coming in an upcoming release, and I'll have more details and more videos about that. But until then, check out Istio and the locality-aware load balancing and weighted routing that comes out of the box.