From YouTube: gRPC August Meetup / Using gRPC in Gloo Mesh, by Lin Sun
Description
In this talk, Lin discusses how Gloo Mesh uses gRPC to solve some of the challenges with Istio multi-cluster. Background on how Istio multi-cluster works at a high level is provided, along with some of the challenges of sharing API server access across different clusters. Then she drills into how Gloo Mesh solves this challenge using a relay architecture built on gRPC.
Thank you so much for being here, and thank you everybody for being here with us as well. So whenever you want to start, the space is yours for your presentation.

Thank you so much! Okay, great! Thank you so much for the warm intro, I am so excited to be here. I hope you guys can see my screen now.

We do. We do.

Okay, awesome! Please let me know if there's an issue; sometimes I do have issues with Google Meet. So thanks for the wonderful introduction, I won't introduce myself any further.
So today we are going to talk about Istio multi-cluster service mesh. We're going to discuss some of the limitations we heard from our users of the Istio service mesh, and then we're going to explain how we tackle this problem at Solo.io to solve some of the challenges with Istio multi-cluster. And last but not least, hopefully I'll be able to show you a live demo, if the demo gods are with me today.
So today, if you go to istio.io, we support different deployment models for Istio service mesh multi-cluster. One of the most common models is this one, where you run multiple primary clusters on separate networks.
If you just think about two clusters, this architecture is probably straightforward; you may not have any concerns the first time you look at it. But the moment you start to think about what happens with 100 clusters, or even just four or five, the fact that each Istio control plane requires Kubernetes API server access to the other clusters, even though it's just read access, could raise security concerns, and it's also very hard to manage that configuration across five to 100 clusters.
So the concern we've been hearing is that it's very hard to sync and store all the Kubernetes API server credentials using traditional GitOps flows, which is what we recommend people use when they deploy Istio resources and Kubernetes applications. GitOps is the recommended way we want everyone to use, so you can recreate everything easily across different environments.
So in Gloo Mesh, what we invented is a different model to specifically solve the concerns we heard from our users. In this model we're proposing Gloo Mesh as a central management plane, and then inside each of the Istio clusters we run a Gloo Mesh agent that communicates with the Gloo Mesh management plane. There are two challenges we're trying to solve. The first challenge is that Gloo Mesh, as the management plane, needs to have the configuration.
It provides an abstracted configuration on top of Istio; it's more opinionated, to help users simplify their adoption of Istio, and we need to be able to propagate that configuration to each of the Istio clusters in a language that Istio can understand. The second challenge we're trying to solve is that each of the Istio clusters is not going to be static: as services get deployed, and as services get scaled in and out, the services and their endpoints could be changing.
So who is going to reflect those changes to the other clusters? This is essentially the job the Gloo Mesh agent is doing: it propagates those configuration and service discovery changes back to the management plane, so the management plane has all that information. This saves the user from giving Kubernetes API server access to each of the Istio control planes.
So what we are proposing, which is also what's implemented in Gloo Mesh, is the pull-based configuration model I just described, using a centralized management plane that doesn't require any access to any of the data plane clusters. It doesn't require any access to a remote Kubernetes API server, and it doesn't require any API server credentials on any of the Istio clusters.
So the data plane cluster connects to the central management plane cluster and receives the configuration updates. We really like the xDS protocol from Envoy, which is essentially what Istio uses to push configuration from the Istio control plane to the Envoy sidecar, so we're reusing this configuration model in this architecture as well. And as we talked about, credentials no longer need to be shared, which was the fundamental problem we're trying to solve.
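The relay protocol she describes is, conceptually, an xDS-style bidirectional gRPC stream. The sketch below is purely illustrative; the service and message names are invented here and are not the actual Gloo Mesh proto definitions:

```proto
// Conceptual sketch of an xDS-inspired relay service. The agent in each
// Istio cluster dials OUT to the management plane, so no credentials for
// the workload clusters' API servers ever leave those clusters.
syntax = "proto3";

service Relay {
  // A single long-lived bidirectional stream per agent: the agent pushes
  // local discovery state (services, endpoints) upward and receives
  // translated Istio configuration back on the same connection.
  rpc EstablishStream(stream AgentMessage) returns (stream ManagementMessage);
}

message AgentMessage {
  string cluster = 1;         // which workload cluster this agent serves
  bytes discovery_delta = 2;  // changed services/endpoints, serialized
}

message ManagementMessage {
  bytes istio_config = 1;     // translated Istio resources to apply locally
}
```

Because the agent initiates the connection, the management plane never needs inbound access to the workload clusters, which is the crux of the credential problem described above.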
So in order for each of the Istio clusters to register with the management plane, all we need to do is run the meshctl cluster registration command and specify the management cluster context, the remote cluster context, and the relay server address. The relay server is the management server that provides the gRPC service.
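The registration step she mentions looks roughly like the following. The cluster names, contexts, and relay address are placeholders, and the exact meshctl flags vary across Gloo Mesh versions, so treat this as a sketch rather than exact syntax:

```shell
# Register a workload (Istio) cluster with the Gloo Mesh management plane.
# --mgmt-context:          kubeconfig context of the management cluster
# --remote-context:        kubeconfig context of the Istio cluster being registered
# --relay-server-address:  gRPC endpoint of the relay (management) server
# All names and addresses below are hypothetical.
meshctl cluster register \
  --mgmt-context mgmt-cluster \
  --remote-context cluster-1 \
  --relay-server-address relay.mgmt.example.com:9900 \
  cluster-1
```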
So the agent needs to make sure it knows that endpoint. So how does this work? As we talked about, it's inspired by the xDS protocol, and it also uses a Kubernetes bootstrap token to establish the initial certificate; a certificate and key are required to establish the mutual TLS configuration between the Istio cluster and the management cluster.
So let's look at the two components we're describing. On the left side, enterprise networking is the component on the management plane, running on the Gloo Mesh management plane, and on the right side is the Istio cluster, where we have the Gloo Mesh agent running. At installation time we distribute the root cert to the agent, along with the location, the address, of the enterprise networking component, which I showed you in the meshctl registration command.
The agent is going to make the initial request with the bootstrap token, so that's going to be a TLS request to the enterprise networking component, and it would respond with the client certificate, which you see on the right side, allowing the agent to establish mutual TLS communication with the enterprise networking component from Gloo Mesh. With that, the agent can establish the remote resource gRPC stream, so that if there are any configuration changes, or any endpoint changes, it can surface those changes to the enterprise networking component.
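On the agent side, the bootstrap material typically lives in plain Kubernetes secrets. The secret names and layout below are assumptions based on common relay setups, not details given in the talk:

```yaml
# Hypothetical sketch of the agent-side bootstrap material. The agent
# presents the shared token over plain TLS (verified against the root
# cert); the management plane answers with a signed client certificate,
# and every subsequent exchange is a mutual-TLS gRPC stream.
apiVersion: v1
kind: Secret
metadata:
  name: relay-root-tls-secret          # root CA used to verify the relay server
  namespace: gloo-mesh
stringData:
  ca.crt: <PEM root certificate here>
---
apiVersion: v1
kind: Secret
metadata:
  name: relay-identity-token-secret    # shared bootstrap token for the first request
  namespace: gloo-mesh
stringData:
  token: <one-time bootstrap token here>
```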
This is for the case where the user deploys Gloo Mesh custom resources and the enterprise networking component needs to push the translated Istio resources back to the Istio cluster: it can push that configuration through the agent, so that the translated Istio resources are available on the Istio cluster for the user.
So can this model really solve our problem? I would say yes, most likely, for the two concerns we talked about earlier. First, the concern we talked about is giving Kubernetes API server access to each of the Istio clusters in a multi-cluster environment; in this model, there are no API server credentials that need to be configured.
[Audience] Okay, in this case, yes. So you said the server responds with the mTLS cert; you also need to send the private key that goes with the cert, right?

[Lin] Yeah, so we send the client certificate that goes with the cert, for sure. I would imagine the private key would also be part of it, because it needs to be, but I do have to double-check. We send the minimum required to establish the mutual TLS communication.
[Audience] Well, first of all, it's a signing problem, right? You could potentially have the private key at the initial bootstrap time, and then you just sign the key. I think that's most likely what happens here, so that you don't have to transmit the private key on the wire. I think that's your concern, is that right?

[Lin] Right, yes. I think this model is actually very much what we do in Istio, because Istio has the same problem between the Envoy proxy and the Istio control plane.
Okay, great. So I would like to show you guys a global failover demo with multi-cluster. We have three clusters in this environment; the first cluster, as you can see, has the Istio Bookinfo application.
The only thing is we don't run reviews version 3 there; it's only on the second cluster, and the reason I do that is really so you can easily see the global failover. You could potentially run reviews versions 1 and 2 on the second cluster too, but then it's not easy for you to see, because they look the same. So cluster one and cluster two have Istio running on Kubernetes.
So what we're going to do in the demo is trigger a failover scenario: we're going to fail the reviews service on the first cluster, and then we're going to see how the traffic is handled. If you don't do anything, obviously nothing is going to work; if you don't configure a global destination for reviews, there's no traffic failover, and nothing is going to work. But then we're going to show you how you can fix this by leveraging the concept of a virtual destination, so that it will fail over.
Okay, so in my environment, if I refresh Bookinfo right now, traffic is going to go to the first cluster, because I didn't configure anything special.
I didn't configure a traffic flow to allow the product page to go to version 3 of reviews, so everything is just local; as you can see, it's toggling between versions 1 and 2. And this is my Gloo Mesh environment, by the way. Let me make sure you can see the graph here too. We do have a graph where you can easily see the two clusters I was mentioning to you, exactly as in this diagram: I have reviews version 1 and reviews version 2 in the first cluster, and there's no traffic to reviews version 3 because I haven't configured anything yet. And in my environment, if you go to the Meshes view, you can notice I have two clusters, cluster one and cluster two, and if you view the mesh details, you can see all my workloads and everything.
So if you go to the last one minute, this is the window where you shouldn't see any reviews traffic, because I was prepping, which is why, if you go to the last 15 minutes, you might see some, because I was demoing before this meeting. All right, are you guys with me? All the traffic is on reviews version 1 and version 2, which are on cluster one. So, for what we're going to do next, let me clear out my screen here.
So what we're going to do next is disable reviews version 1 and version 2 on cluster 1, just to trigger a failover scenario. When those are disabled, what do you think is going to happen? If your service is going down, this is what you'd expect: you won't be able to get the reviews for the book. So this is expected. What we're going to do next is show you what you should have applied beforehand.
The reason we do this is that a virtual destination is a Gloo Mesh concept that provides a global destination regardless of where your actual service runs: it could be running on VMs, it could be running on cluster one or cluster two, in different clouds or in different zones and regions, but we bring it all together under this one hostname.
You could call it reviews.yourcompany; reviews.global is just the example here. The other thing we're going to need to do is apply a traffic policy that says: for reviews, when it's accessed from the default namespace, which is where the product page runs, we want to send the traffic to this virtual destination we just deployed. Because I don't care whether the reviews service is local; all I care about is availability. When the local service is failing, per my outlier detection, I want to detect that and go to the global destination, whichever that is; in this case it's cluster two.
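Under the hood, "prefer local, fail over globally" corresponds to Istio's locality-aware load balancing combined with outlier detection. The hand-written DestinationRule below illustrates the idea; it is not the exact configuration Gloo Mesh generates, and the thresholds are made up:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-global
  namespace: default
spec:
  host: reviews.global
  trafficPolicy:
    outlierDetection:              # eject endpoints that keep failing...
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 2m
    loadBalancer:
      localityLbSetting:
        enabled: true              # ...while preferring the caller's own
                                   # locality, so traffic only leaves the
                                   # cluster when local endpoints are ejected
```

Note that in Istio, locality-based failover only takes effect when outlier detection is configured, which matches the behavior she describes: local endpoints are used as long as they are healthy.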
So now, if we go back to the product page, I would expect reviews version 3, and I would expect only version 3, because reviews versions 1 and 2 are down. So what do you think is going to happen?
So if you do a get on cluster one, you should be able to see the pods. You can see they are terminating, but they also come back up, which is good. The reviews-v1 pods are terminating, and reviews-v2 as well.
Okay, it's also running, it's also running; sorry, my shell screen was hiding that. Okay, so now, if I hit refresh, what do you think is going to happen? I do think it's going to hit reviews version 2, which is good. I also think it's going to hit reviews version 1, which is also very good. Do you think it's going to hit reviews version 3?
The answer is no, as long as reviews versions 1 and 2 are healthy. The reason is that when we applied the virtual destination traffic policy, we actually said we prefer local, and only if local fails, per my outlier detection criteria, do we want to fail over to the global virtual destination. So if local is running, we always want to go to local. Why? Because local has lower latency, and who wouldn't want lower latency and to pay less for cross-network traffic?
So as you can see on the left side, the reviews traffic is not going to version 3 at all, which is exactly what we expect. So, through virtual destinations and traffic policies, we seamlessly handle a global failover scenario for the user. But that's just one thing; we are a gRPC community here, right? So I want to talk to you about some of the challenges we're working through with this.
Remember I talked about the Gloo Mesh resources earlier on? Once the user applies the Gloo Mesh resources, how do they end up as resources that Istio can understand? Because Gloo Mesh is not reinventing the wheel; we're still reusing the resources from Istio. This is the work that's done by the Gloo Mesh management plane, and the management plane pushes the configuration to the Gloo Mesh agent, as this diagram indicates. So the mesh configuration is pushed back to the agent, and then it gets onto the Istio cluster.
So Istio has the configuration. Let's look at the configurations on our Istio cluster; we have two clusters here, and this is the first cluster. If you are familiar with Istio, ServiceEntry is the key resource that helps you connect to services on a remote cluster, and DestinationRule is the key resource where you can configure things like outlier detection.
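For reference, a cross-cluster ServiceEntry of the kind she is about to show looks roughly like this. The addresses here are placeholders; the real one Gloo Mesh generates points at the second cluster's ingress (the ELB in front of her EKS cluster):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: reviews-global
  namespace: istio-system
spec:
  hosts:
    - reviews.global               # the virtual hostname callers use
  location: MESH_INTERNAL
  resolution: DNS
  ports:
    - number: 9080
      name: http
      protocol: HTTP
  endpoints:
    # Traffic for the remote instance enters cluster 2 through its
    # ingress gateway; the hostname below is a made-up ELB address.
    - address: example-elb.us-east-1.elb.amazonaws.com
      ports:
        http: 15443                # Istio's standard cross-network gateway port
```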
So we would expect, let's say, a ServiceEntry for the reviews service, for instance for the one on cluster two. Let's see; this is reviews.global, so you can see how reviews.global is configured, and, interestingly, it points to the ingress gateway on my EKS cluster, which is the second cluster, the one that runs reviews version 3.
so
that's
how
you
know
we
build
this
reviews
global
service
entry
for
you
automatically
behind
the
scene
without
you
needing
to
do
anything,
and
we
also
build
like,
let's
see,
I'm
expecting
reviews
that
ratings.
Let's
see,
okay,
I'm
expecting
this
one
yeah
reviews
that
default.
So
this
one
also
helps
you
to
connect
the
endpoint
right.
So
this
is
our
web
for
that
service.
A
Entry
and
also
the
service
entry
is
for
the
reviews
that
e4
does
cluster
to
the
global,
and
this
points
to
the
el,
the
elb
on
my
eks
cluster
and
for
this
particular
host
name,
and
then
we
also
have
a
little
bit
of
magic
on
the
cluster
too.
We
also
have
service
entry
on
cluster,
two
that
helps
you
to
map
hey
for
this
reviews.global.
A
here is how we actually map it to the actual pod on cluster two. So if you look at reviews.global there, it actually resolves to the pod IP for reviews version 3 on cluster 2. By the way, the reason I highlight the IP here is that this discovery information is streamed from the agent to the management plane, which is what enables the kind of global service failover scenario you just saw. So that's it for my demo; if you guys have any questions, let me know.