Description
SoloCon 2022:
[Lightning Talk] Business Continuity with Gloo Mesh
Speaker:
Marino Wijay
Developer Advocacy and Relations, Solo.io
Abstract:
Disaster Recovery and Avoidance are critical to ensuring that applications continue to be available. In this lightning talk, we discuss how Gloo Mesh leverages Traffic Policies to mitigate disasters and downtime.
Track:
Service Mesh and Application Networking
Samantha Kim:
Hi everyone, and welcome to our next lightning talk. I'm Samantha Kim, and I'm part of the marketing team here at Solo.io. I'm excited to introduce our next lightning talk speaker. Please help me welcome Marino back to the stage to talk about Business Continuity with Gloo Mesh. Marino, over to you.
Marino Wijay:
Thank you very much, Samantha. Hey everyone, again: welcome to my talk on Business Continuity with Gloo Mesh.
Let's understand what could contribute to an outage. Outages can be caused by a variety of factors, both technical and non-technical in nature. Some can be internal, some external; at the end of the day it doesn't matter, because either way it contributes to an outage, whether that affects a small part of your environment or the entire thing. DNS, for example, if not available, can cause substantial outages, as we are so reliant on things like name resolution.
So if we're trying to communicate using IPs and we don't have name resolution in place, things tend to break. What we tend to do with DNS servers is deploy them in a high-availability configuration, so that if one goes down, we have another one available to keep processing those name resolution requests.
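As a minimal sketch of that idea, assuming a Kubernetes environment, you might run a DNS service with several replicas so a single failing instance does not take name resolution down with it. The names, namespace, and image tag below are illustrative assumptions, not from the talk:

```yaml
# Illustrative sketch: multiple DNS replicas for high availability.
# Names, namespace, and image tag are assumptions for illustration;
# the config (Corefile) is omitted for brevity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns-ha
  namespace: kube-system
spec:
  replicas: 3                  # several instances; one can fail safely
  selector:
    matchLabels:
      app: coredns-ha
  template:
    metadata:
      labels:
        app: coredns-ha
    spec:
      containers:
      - name: coredns
        image: coredns/coredns:1.9.1
        ports:
        - containerPort: 53
          protocol: UDP
```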
In the case of storage, that system holds all the data that we consume in a company. All the data that we store and generate lives in some sort of repository, or it could be spread out, but it's storage: we're talking about storage arrays, storage networks, and even disks. So we have the possibility of running out of storage. We run out of capacity, and when we do, guess what: we have nowhere to write to, so the system comes to a halt.
Sometimes the network is the source of a potential outage. Network equipment can age and fail. Network cables can fail, transceivers for fiber optics can fail, someone can cut a fiber, and then your network has failed. There could also be some level of oversaturation to and from a particular environment, causing huge spikes in latency and, furthermore, contributing to a level of inaccessibility or a lack of availability.
And then this moves us on to the concept of availability, which, when you think about it, really motivates us to construct things like service level indicators, service level objectives, and even service level agreements, to ensure that we're always maintaining a level of availability and resiliency for our business and for the applications that run inside it.
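To make that concrete, an availability SLI is often expressed as the ratio of successful requests to total requests. Here is a minimal sketch as a Prometheus recording rule, where the metric name and the 30-day window are assumptions for illustration:

```yaml
# Illustrative availability SLI: share of non-5xx requests over 30 days.
# The metric name http_requests_total is an assumption, not from the talk.
groups:
- name: availability-sli
  rules:
  - record: sli:availability:ratio_30d
    expr: |
      sum(rate(http_requests_total{code!~"5.."}[30d]))
      /
      sum(rate(http_requests_total[30d]))
```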
And then finally, the last part is people. How do we ensure that we are enabling our workforce, our people, to recover from these failures, and that they have the right systems, the right tools, and the right technology to solve a lot of these failover problems? So what technology can I think of, or can anyone think of, that can actually solve a lot of these challenges?
Service mesh, all right? So let's dig into service mesh and what we're doing here. Up in this diagram, I have a single Kubernetes cluster, which can be running anywhere. It could be running on EKS, it could be something that you built on premises, or something that you built in another cloud.
But this load balancer is there to protect that control plane. If you lose a node inside of Kubernetes, or if you lose a control plane node, you still have a control plane that is functioning, one that can still pass instructions down to the worker nodes so you can schedule pods, run containers, and still do that level of networking. It allows you to keep shifting traffic or moving your services around as needed, but this is a function of the Kubernetes cluster itself, wherever you've deployed it.
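The diagram itself isn't reproduced in this transcript, but as a minimal sketch of the idea, a kubeadm-built cluster can point its API endpoint at a load balancer that fronts several control plane nodes; the hostname and version below are illustrative:

```yaml
# Illustrative sketch: the API server endpoint is a load balancer VIP,
# so losing one control-plane node leaves the control plane reachable.
# The hostname is an assumption for illustration.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.23.0
controlPlaneEndpoint: "k8s-api.example.internal:6443"
```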
Now, this is all great and well: we have recovery for our applications, and we have ways to route to multiple parts of our application within a cluster. But what about multi-cluster? In the case of multi-cluster recovery and failover, any single service or endpoint can fail. An entire cluster or environment can fail. You might even have cross-cluster traffic going on.
How do we actually achieve some cross-cluster resiliency in the event of a failure of, let's say, a service, or maybe a node in a cluster, or even the entire cluster itself? Since we've distributed our application, parts of it might end up in different Kubernetes clusters. So how do we go about routing to all of these different locations?
So we could leverage, let's say, a service mesh and Istio-specific objects like virtual services, destination rules, and even service entries to create the necessary policies to get to where we need to go.
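For context, here is a minimal sketch of two of those Istio objects, splitting traffic between two subsets of a service; the hostnames, subset names, and weights are illustrative, not from the talk:

```yaml
# Illustrative Istio VirtualService and DestinationRule.
# Host, subsets, and weights are assumptions for illustration.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: service-b
spec:
  hosts:
  - service-b.default.svc.cluster.local
  http:
  - route:
    - destination:
        host: service-b.default.svc.cluster.local
        subset: v1
      weight: 90
    - destination:
        host: service-b.default.svc.cluster.local
        subset: v2
      weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: service-b
spec:
  host: service-b.default.svc.cluster.local
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
```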
But having to do this for so many services at scale is impossible. So what do we do now? How about a CI/CD system? A CI/CD system might be able to solve part of that problem for us, but not entirely, because it doesn't have the complete awareness of all objects in all locations. And that is what brings us to something like Gloo Mesh, so I'll talk about Gloo Mesh in a second.
But what if we could further abstract, let's say, the Istio service mesh and treat all of our Kubernetes clusters as if they were all part of the same network fabric? We actually can, and we do so with Gloo Mesh.
So with Gloo Mesh, we can leverage something called a route table resource that basically tells us where all of our services exist: whether service A is in cluster one, or service B is in cluster two, or where all the copies of service B are across all clusters. The route table is going to tell us that.
And this route table is actually a direct translation of Istio's resources, specifically those virtual services, destination rules, and even service entries, but we're simplifying that configuration, because now all you need to do is configure the route table resource to specify where things are, much like a routing table in the networking world: you configure your route table to say, here's where this network is, and there's where that network is.
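Here is a minimal sketch of what such a route table resource might look like, assuming a Gloo Mesh 2.x-style RouteTable API; the exact fields, hostnames, and cluster names are illustrative and may differ from the shipped CRDs:

```yaml
# Illustrative Gloo Mesh-style RouteTable; field names, hostnames,
# and cluster names are assumptions for illustration.
apiVersion: networking.gloo.solo.io/v2
kind: RouteTable
metadata:
  name: service-b-routes
  namespace: gloo-mesh
spec:
  hosts:
  - service-b.global            # one hostname for all copies of service B
  http:
  - name: service-b
    forwardTo:
      destinations:
      - ref:
          name: service-b
          namespace: default
          cluster: cluster-2    # where this copy of service B lives
```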
This is the same idea. It actually allows us to effectively traffic-engineer where requests to our applications go, and even to specify alternative paths. So if there is a failure, let's say cluster one or cluster two goes down, and cluster one needs to access something in cluster two that no longer exists, cluster one can route to cluster three without a problem. And this is all made possible using Gloo Mesh and its management plane, which provides the management and abstractions on top of the Istio service mesh.
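The talk doesn't show the exact policy, but underneath an abstraction like this, Istio can express that alternative-path behavior with locality-based failover. A minimal sketch, where the host, regions, and thresholds are illustrative:

```yaml
# Illustrative Istio locality failover: if endpoints in us-east become
# unhealthy, traffic shifts to us-west. Outlier detection is required
# for failover to trigger. Host, regions, and thresholds are assumptions.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: service-b-failover
spec:
  host: service-b.default.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:
        - from: us-east
          to: us-west
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 120s
```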
The other thing that I need to mention is that Gloo Mesh actually unifies the root CA amongst all of your Istio instances. So now all of your different environments here, all of your different clusters, are sharing that same root certificate, or root CA certificate, which enables them to trust all services amongst each other.
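As a hypothetical sketch of how that shared trust might be declared, assuming a Gloo Mesh-style root trust resource; the kind, fields, and the generated-CA option shown here are assumptions and may differ from the actual API:

```yaml
# Hypothetical Gloo Mesh-style root trust resource: the management
# plane issues one shared root CA that every cluster's Istio uses.
# Kind and field names are assumptions, not confirmed by the talk.
apiVersion: admin.gloo.solo.io/v2
kind: RootTrustPolicy
metadata:
  name: shared-root
  namespace: gloo-mesh
spec:
  config:
    autoRestartPods: true
    mgmtServerCa:
      generated: {}             # generate and distribute a shared root CA
```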
What we aim to solve here with our Gloo Mesh technology is to provide you with a way to continue your business and its functions when you have some sort of outage, or when a small portion of your environment goes down. Istio provides a very powerful service mesh that gives us things like traffic management and other capabilities to connect our endpoints and services together. It also allows us to circumvent failures locally, and if we take that to the next step and leverage Gloo Mesh, we can streamline and simplify our configurations while also circumventing failures across all the different locations and sites.