Description
Running Istio across multiple clusters can bring a lot of value, but can be difficult. In this session, we review some of the benefits of a multi-cluster architecture, including single-pane-of-glass operations and global service routing, and what patterns and practices can be used to ease operations.
B: So let's start at the beginning, let's talk about the standard Istio deployment. As we can see here, it's pretty straightforward. We have a single cluster with a single Istio control plane and three different workloads: an ingress gateway that sends traffic to the account service, which sends traffic to the user service.
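A minimal sketch of that ingress routing using standard Istio APIs (the hostnames, ports, and names here are illustrative assumptions, not taken from the demo):

```yaml
# Istio Gateway exposing the mesh on port 80 (host is an assumption)
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: account-gateway
spec:
  selector:
    istio: ingressgateway        # bind to the default ingress gateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "shop.example.com"
---
# Route inbound traffic to the account service, which in turn calls user
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: account
spec:
  hosts:
  - "shop.example.com"
  gateways:
  - account-gateway
  http:
  - route:
    - destination:
        host: account            # in-cluster account service
        port:
          number: 8080           # assumed service port
```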
B: The last point we want to talk about is that, due to regulation, geography, or attempts to reduce latency, we may actually have to run clusters in different regions. And with that, I'll hand it over to talk a little bit about what we see as far as multi-cluster Istio.
A: So what does multi-cluster Istio offer us? Why is it so great? First of all, it offers us redundancy for the control plane and the data plane. So if the control plane were to go down, we have backups of the control plane, and similarly, if our workloads or members of the data plane were to go down, we have replicas of those as well.
A: Next, as we've already mentioned, we have flexibility in deployment, geography, and availability. We have customers running all over the world who are serving their own customers, potentially all over the world, and so they are running services in different localities and needing, again, to run their services in those different regions and zones.
A: So let's talk about a multi-cluster Istio deployment and what it really looks like. To do that, let's just quickly go back to the single-cluster deployment as a reference point: we have our gateway, our account service, and our user service, and they're all being managed by a single Istio control plane.
A: Now let's start to expand that picture. We again have our account service and our user service in cluster one, region one, but we're adding more clusters. We have cluster two, which has an order service and another user service, and we have a third cluster, and this one is in region two, so it has an order service, a user service, and an account service of its own.
A: So this way, again, we have split our services across multiple regions for redundancy and high availability, and as you can also see, we have Istio running in all of these clusters.
A: Now, what challenges does this bring up? Well, first of all, these clusters need to be able to communicate with each other, right? The whole reason for them to exist and to deploy in a multi-cluster setup is so that, potentially, they can communicate with each other.
A: How do we make that happen? Well, now we need to add gateways, right? We need to add gateways and/or ingresses managed by Istio, which allow our different services to communicate. This is often called federation, and now that we have added these gateways or ingresses, these services running in different clusters are able to communicate.
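With replicated control planes, this kind of federation is typically wired up with a ServiceEntry that points at the remote cluster's ingress gateway. A sketch, assuming a user service in cluster two reachable through that cluster's gateway (names and addresses are assumptions; 15443 is the port Istio has conventionally reserved for cross-cluster mTLS traffic):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: user-cluster2
spec:
  hosts:
  - user.default.global          # mesh-internal DNS name for the remote service
  location: MESH_INTERNAL
  ports:
  - number: 80
    name: http
    protocol: HTTP
  resolution: DNS
  endpoints:
  - address: gw.cluster2.example.com   # remote cluster's ingress gateway (assumption)
    ports:
      http: 15443                # gateway port used for cross-cluster traffic
```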
A: Well, we need to establish trust, and we do that not via shared certificates, but via a common root of trust, which then establishes trust between our clusters so that they can communicate with each other.
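In Istio, that shared root of trust is usually established by giving every cluster an intermediate CA signed by a common root, mounted as the well-known `cacerts` secret in `istio-system`. A sketch, with the certificate contents elided:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cacerts                  # istiod picks this up as its signing CA
  namespace: istio-system
type: Opaque
stringData:
  ca-cert.pem: |                 # this cluster's intermediate CA certificate
    # ...PEM contents...
  ca-key.pem: |                  # intermediate CA private key
    # ...PEM contents...
  root-cert.pem: |               # shared root certificate, common to all clusters
    # ...PEM contents...
  cert-chain.pem: |              # chain from the intermediate up to the root
    # ...PEM contents...
```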
So all this to say that when scaling up from one to n clusters, you are not simply adding cluster after cluster; you are also adding levels of complexity due to the connections that must be made and maintained between the different clusters.
A: So what is this added complexity? Well, as the name of our talk would suggest: one, configuration, managing sprawling Istio config for many clusters simultaneously; two, orchestration, ensuring that all of our control planes are running compatible versions, ensuring that all of our control planes are up, and being able to observe all of our control planes and know what's going on at any given time; and three, federation, exposing services running in one cluster to other clusters.
B: Thank you, Ethan. We'll now look at two deployment patterns for multi-cluster Istio deployments. Let's talk about the first one: a replicated control plane. In this method of deploying Istio, each cluster gets its own istiod control plane, right? Essentially, it means that each cluster has its own istiod that is only aware of the workloads running in that cluster. Communication across clusters is done using ingress gateways that the user needs to manually configure.
B: So let's talk a little bit about the pros and cons of this approach, where each cluster has its own istiod deployment. One pro is availability and, in addition to that, fault tolerance, and that's pretty simple to explain: because each cluster is completely isolated as far as istiod is concerned, a failure in one cluster will not impact other clusters at all. And the other pro is that each cluster is self-contained.
B: Now, let's talk about the other method for deploying Istio in a multi-cluster way, and that's a single control plane. As you can see in this example, we have an istiod deployed to cluster three, and it is configured to manage cluster one and cluster two in region one. And of course, you can also mix and match: in this example, we also have a replicated control plane for region two that manages the clusters in region two.
B: Right, so in this example, you can see that the istiod in region one, in cluster three, is directly managing cluster one and cluster two in region one, and there's no extra istiod for each cluster. So we have one istiod managing two clusters at the same time. Let's talk a little bit about the pros and cons of this approach. The pro is that, because we now give istiod access to the Kubernetes API in these two clusters, it can perform service discovery and endpoint discovery and help us with service federation.
B: So let's talk a little bit about the cons. In order for this to work, istiod needs access to the Kubernetes API server of each remote cluster, and we touched a little bit on this in the previous approach. Essentially, it opens up the API server a little bit, beyond the cluster itself, to an istiod that is running in a separate cluster, and you need to make sure that this is properly secured.
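That access is granted through a remote secret: a kubeconfig, stored in the management cluster, that istiod uses to watch the remote cluster's API server. Istio generates these with `istioctl create-remote-secret`; the resulting object is roughly shaped like this (names and contents are illustrative):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: istio-remote-secret-cluster1
  namespace: istio-system
  labels:
    istio/multiCluster: "true"   # tells istiod to treat this as a remote cluster
stringData:
  cluster1: |                    # kubeconfig with read access to cluster1's API server
    # ...kubeconfig contents...
```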
B: The other disadvantage in this approach is the config boundary: because each istiod is deployed independently in each cluster, essentially each cluster's configuration can only happen in that cluster. So, for example, imagine you have an authorization policy that allows the user workload to talk to the account workload. Now you have to replicate that configuration across every cluster, and if it ever changes, you need to make sure all of these configurations stay in sync.
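The user-to-account rule in that example is a standard Istio AuthorizationPolicy, and it is exactly this object that would have to be copied into every cluster (the namespace and service-account names here are assumptions):

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-user-to-account
  namespace: default
spec:
  selector:
    matchLabels:
      app: account               # applies to the account workload
  action: ALLOW
  rules:
  - from:
    - source:
        # only the user workload's identity may call account
        principals: ["cluster.local/ns/default/sa/user"]
```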
A: So what are the pros of running multi-cluster Istio? As we said earlier: redundancy for our control plane and our data plane, with no single point of failure (I'm going to keep saying this, it is so, so important; we hear our customers talking about this all the time), flexibility in deployment, geography, and availability, and flexibility in company policy and practice.
A: What are the cons? Because, let's be honest, there are some, and they're big. One, we have added network hops, since our traffic may be going through gateways when crossing cluster boundaries. Two, establishing trust; this is a non-trivial task. Three, observing and debugging all our services and our control planes across multiple clusters in a unified way. And four, something that we've touched on over and over again: configuration sprawl, keeping all of the sprawling configuration in check and in sync.
A: Now, what were the pros of the multi-cluster Istio deployment that we mentioned? Again: high availability, fault tolerance, and isolation, and clusters don't need access to each other's API servers. Now, what if we could wed the two of these into one thing? As you may have guessed, that's Gloo Mesh.
A: Well, that sounds great, but let's see an example. We're going to go with the example of implementing a reliable, highly available service, specifically with a replicated control plane. Remember, we discussed earlier that our customers tend to favor the replicated control plane. This config would be a little bit different with a single control plane, but for the purposes of this example, we're going to use replicated.
A: Now, in Gloo Mesh, you would need one virtual destination and one traffic policy. That's only two CRDs, a massive step down in terms of the number of config objects that you need to manage, not to mention that all of the Gloo Mesh objects would live in one cluster, as opposed to being spread across multiple clusters as in the other approach. So keeping it in sync with any kind of GitOps or CI/CD approach would fit with the current tools.
A: So with that in mind, let's do a quick demo. For those of you who have watched the keynote, you will have already seen this demo, but for those of you who have not, please enjoy it. It is tailored specifically to this situation and shows off the benefits and simplicity of the Gloo Mesh CRDs for this typically difficult scenario.
A: Now, before we get started with this, let's quickly go over the workloads and services that we have running in our clusters. In cluster one, we have the product page, ratings, reviews v1, and reviews v2, and those of you who are familiar with the Bookinfo app will notice that there is an instance of the reviews app that is missing. We can just go over to our other cluster to find it: in cluster two, we have reviews v3.
A: Now, the virtual destination has only a few parts to it. The first part is the hostname, and this is the DNS address at which this virtual destination will be made available to the rest of the mesh, or the rest of the virtual mesh, depending on how you decide to export it; we'll talk about that below. Next is the destination selector: this is how you select the services which will become a part of the virtual destination and which traffic will later be seamlessly failed over to. Then comes outlier detection.
A: This is how Envoy, or Istio, decides how or when a service becomes unhealthy, such that it will be removed from the pool when making routing decisions. Next, as I said earlier, is the export-to field. This is our mesh list; it can also optionally be a virtual mesh. This is how the user decides which parts of the system should have access, that is, be able to call this virtual destination. And then, lastly, there is the port.
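Putting those parts together, a VirtualDestination might look roughly like this. This is a sketch reconstructed from the description, not a verified manifest; the API group and exact field names may differ between Gloo Mesh versions:

```yaml
apiVersion: networking.enterprise.mesh.gloo.solo.io/v1beta1
kind: VirtualDestination
metadata:
  name: reviews-global
  namespace: gloo-mesh
spec:
  hostname: reviews.global       # DNS name exposed to the (virtual) mesh
  port:
    number: 9080
    protocol: http
  localized:
    destinationSelectors:        # services that join this virtual destination
    - kubeServiceMatcher:
        labels:
          app: reviews
    outlierDetection:            # when an unhealthy endpoint is ejected from the pool
      consecutiveErrors: 2
      interval: 5s
      baseEjectionTime: 120s
  virtualMesh:                   # export scope: who may call this destination
    name: my-virtual-mesh
    namespace: gloo-mesh
```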
A: We are selecting the reviews service as our destination and routing all traffic bound for the reviews service to the virtual destination which we created above. This will ensure that all traffic bound for the reviews service will, in fact, call our virtual destination.
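The accompanying TrafficPolicy, sketched under the same caveat (field names approximate the Gloo Mesh API and may differ by version), matches traffic addressed to the in-cluster reviews service and shifts it to the virtual destination:

```yaml
apiVersion: networking.mesh.gloo.solo.io/v1
kind: TrafficPolicy
metadata:
  name: reviews-to-virtual
  namespace: gloo-mesh
spec:
  destinationSelector:           # traffic bound for the reviews service...
  - kubeServiceRefs:
      services:
      - name: reviews
        namespace: bookinfo
        clusterName: cluster1
  policy:
    trafficShift:                # ...is redirected to the virtual destination
      destinations:
      - virtualDestination:
          name: reviews-global
          namespace: gloo-mesh
```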
A: And we can see here that we're getting our reviews. Now, something worth noting: the way that we are going to tell the difference between our reviews services is that reviews v1 will not return a color, reviews v2 will return the color black, as you can see here, and reviews v3 will return the color red. We can't see that yet, because our local services are still healthy.
A: So let's go ahead and just call that one more time, and we'll see that we have another call to the local service. So now, let's go ahead and make our local services unhealthy.
We're going to do that by injecting a sleep command into our local deployments. So first we'll start with v1, we'll just quickly wait for that to roll out, and then we're going to go ahead and do v2.
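One way to inject that failure (a hypothetical sketch, assuming the Bookinfo deployment and container names) is a strategic-merge patch that replaces the container's command with a sleep, so the pod stops serving and gets ejected by outlier detection:

```yaml
# e.g. kubectl -n bookinfo patch deployment reviews-v1 --patch-file sleep-patch.yaml
spec:
  template:
    spec:
      containers:
      - name: reviews             # container name assumed
        command: ["sleep", "20h"] # sleep instead of serving traffic
```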