Istio Community, 2 Aug 2022

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Istio August Meetup/ Demo: You probably DON'T need a Service Mesh by Abdel Sghiouar

Description

Service Mesh is becoming a key component in the Cloud Native world. It allows teams to connect, secure and observe complex microservices environments built on containers and container orchestration tools. But most Service Mesh tools are also very complex and require a lot of engineering overhead to deploy and maintain. In this talk, we will explore the considerations you have to take into account before you commit to a Service Mesh-based stack. I had the opportunity to help customers design around Istio (one of the most famous Mesh tools) and learned a lot through the years. This talk is a distilled experience of the learnings I had.

A

Hi everyone uh thank you for having me um usually when I do this presentation outside of a dedicated service mesh community. I I it takes me much longer time, because most people don't actually understand what service mesh is.

A

So I have to start with some introductions, but I'm gonna skip all the introductions for this specific meetup and this specific talk so the content, I don't have actually a lot of content, it's more like a detailed experience of couple of years implementing certain uh service mesh in certain projects, uh but I I want to make this as interactive as possible. So if you have any questions, just please feel free to ask me to chat.

A

um So the title of the talk is quite provocative and the idea here is to really um make people cautious about implementing service mesh in general, and I also modified it to make it specific to istio, um because you know different service mesh. Does things different ways, um but uh pretty much all the learnings from this talk really are applied to pretty much any implementation or any cloud products. To be honest, so I'm a cloud developer advocate at google. I've been uh doing before this.

A

A couple of years or five years of consulting as part of something called pso professional services so where, where we work directly with external customers of all sorts from startups to big tech companies to banks and financial institutions, helping them implementing things, I did quite a lot of gke and service measure. Google uh managed offering for communities.

A

And uh overall I did quite a lot of infrastructure experience right and usually when we talk to customers or we talk to people about service mesh. This kind of tends to be the slide that we use it to sell the technology.

A

The benefits of a service mesh right, like the the the fact that it allows you to connect a bunch of micro services in a flexible way, um secure the communication between these microservices control. um The policies between these microservices so basically control both policies in terms of who can do what and who can talk to what, but also control the traffic between these policies and uh this microservices and then observe um by getting a bunch of metrics and traces from this communication and dump it somewhere.

A

And I think what we don't think about too much is the drawbacks that actually, why not the the the the things you have to keep in mind, and I try to organize this- these things into four main categories.

A

um uh Think about them as like incremental steps um into I want to go from. I don't have a service mesh to now. I want. I have a service mesh. What are the four main things? I need to keep in mind right and so the first fourth main thing is the capacity and resources um a service mesh data plane and the service mesh control plane does consume cpu and memory. um So you have to keep that in mind.

A

I have seen implementations where we started with a few hundred pods in a micro service based environments and by introducing um to be to be honest by introducing earlier version of istio is to have improved drastically in the last few years, um or, I would even say the last few releases, but before it is to consumption lots of resources, it still does and it's not cheap.

A

But um I have seen situations where we by introducing a service mesh, we managed to get to a point where the service mesh side cars actually consume more resources than the actual application.

A

And here I'm talking specifically about side car based service mesh like istio. um They do introduce some network latency. That's because there is a proxy and if you go through proxy, there is a latency in the network that you have to keep in mind and accommodates.

A

They do come with some design decision and architecture decision that you have to sort of take on day one.

A

um There are certain things that, if you implement a service mesh with, you can just change them um like most tools are not as flexible as most people think they are, and then, once you take once you deploy your service mesh and start using it, it will start generating a bunch of data and a bunch of metrics and logs and stuff, and you need to deploy extra software to take advantage of those extra things that are available to you right, um and so that's that's what I called the auxiliary infrastructure.

A

So that's, basically, software that you have to deploy and maintain and manage, etc, etc.

A

So, let's start with the capacity and resources. This is kind of um straightforward. I guess for a lot of people um side, cars are containers, containers need cpu memory to run right. uh I just pulled some uh numbers from the last benchmark that is to have released um for a thousand services, uh which means uh 2000 sidecars in 70 000 uh mesh, wide requests per second, so 70 000 rps uh with 1.14 the last release. um The invoice proxy consumes approximately 0.35, vcpu and 40 megabytes of memory per thousand requests per.

A

Second now put this way. It might seem like not a lot but, in my opinion, 0.35 cpu for a thousand requests per second is quite high. um So if your services are not very chatty, you're, probably gonna be okay and you're gonna, probably be fine, paying almost half a cpu and 40 megabytes of memory to have mtls handled for you by the service mesh.

A

But once you start scaling up your application, both in terms of the number of total pods you run in your cluster and also in terms of the the amount of traffic you're handling, while that that footprint will start increasing, and you have to basically keep in mind both from the perspective of it costs money, but also from the perspective of capacity planning. um As you all probably know, most of service mesh implementations are done in cloud and cloud is not unlimited. Like right. Stock outs are very common uh people not being able to provision.

A

Virtual machines is a very common thing. In most cloud providers, different cloud providers are doing different things to try to handle it. But you know it's it's it's something you have to keep in mind is 2d itself, consumes one vcpu and 1.5 gigabytes of memory, which is not a lot, but if you have a big mesh, you will probably need to horizontally. Scale is 2d and that could also cost money right.

A

So that's the first thing, the little network latency now I mean some people might argue that 1.7 uh milliseconds and 2.7 under 1999 percentile latency, is not a lot. That's probably not a lot, but it it really depends on the application. It really depends on how responsive the app have to be. um So if you have a proxy, you will have latency, that's just a fact right. There is no no way around it. Now I have seen some sort of new or about to be either already released or about to be released.

A

Implementations in different service mesh. I mean other service mesh tools, not istio, trying to mitigate this by moving either to ebpf or by by well moving to epf for certain functions or by deploying a proxy at the node level.

A

Those are all good efforts to try to reduce latency overall and reduce footprints, but they come also with the drawback. If you do a node level proxy, then I I bet that most of you who works with customers, the main question that will be raised would be, and by customers.

A

Here I mean, like you, know: non-tech savvy customers, non-tech savvy companies like companies that are not not necessarily talking about specific industries, but you get what I mean so once you tell them how we're going to move from a sidecar to um to a node based proxy and the question is well: how do we secure the communication between the sidecar and the proxy in the node right, um because that portion is unencrypted in a sidecar based service mesh?

A

So if you move the proxy far away from the actual pod application, so how do you secure that that portion of the traffic, um if you move toward pbpf, then the same question comes up like how do we secure it? Who is responsible for the ebpf modules as part of the kernel etc? So these are all like. There are good efforts.

A

I think what I'm trying to say is a good effort to try to mitigate both the resource footprint and the network latency problem, but they come with some drawbacks, and um you have to basically be ready to answer this. These questions design an architecture I took here the most extreme case, in my opinion that I came across. uh There are many many other cases uh where, like you have to decide on day zero before you even implement service mesh. How is your target architecture can be? How are gonna?

A

Look like right um as as much as we want to believe that people who are doing cloud are flexible in terms of oh yeah, I mean if the architecture on day zero was not the right one, then we're just gonna. You know, delete everything and then create it again, because you know a thermal infrastructure and that's what cloud allows you to do as long as as as much as we want that to be through. That's not necessarily through all the time, a lot of people.

A

A lot of customers tend to think about cloud infrastructure as a static over time um and for them deleting everything and rebuilding it from scratch, because they have made a wrong architecture and design decision on day. Zero is probably not not even an option on the table right, so the extreme case I took I I that I came across was um doing a multi-cluster multi-cluster implementation, so multi-region implementation with clusters span across multiple regions on the same vpc, specifically google cloud, but I think it's kind of the same for all other cloud providers.

A

um You can like these kind of things you can just choose and and to be fair, we did it with the remote control plane, which is not supported in morning initial, so this kind of decisions you have to choose them from day one. um You can just switch things around um just like that. Right, like like there are certain decisions that you have to keep in mind and you have to take um from day one, and one of them mainly is the multi-cluster sort of multi-control plane or single cluster single control, plane type decision.

A

It became easier slightly with issue recently, but it's not and- and this becomes even problematic later on- if you want to introduce you know- hybrid connectivity with on-prem and cloud or um uh oh yeah and and all the other things I talked about before the resource for sprints the latency etc is is a factor in this design decision right, because the farther you have to shoot packets around regions, the more latency you're gonna have.

A

um The last one is the auxiliary infrastructure. I have a couple of more things and I want to talk about, but the one thing is the exhaust, the auxiliary infrastructure.

A

The way I like to think about it is that a service mesh doesn't live in a vacuum, and you need to run this on top of a cluster um specifically iso, and you need to deploy a bunch of extra software to be able to take advantage of all the things the service mesh gives you um like: tracing monitoring and login, the graph interface, the gui, to get the graph, etc, etc. So these are all software you have to deploy.

A

um I know that if some people are in this call from the issue community, they might argue that we made it easy in issue to deploy these things. Yes, we did but deploying something from the command line and running it in production of two different things right. um So these are all um interesting, like pieces of infrastructure that you will have to deploy, maintain fix, patch, etc, etc. So um so yeah, so keep that in mind. It's not like a it is here. Is a service mesh deploy?

A

It put put a bunch of labels, magic everything, just works right.

A

um So in a nutshell, I told you it's gonna be short in a nutshell. um The the kind of five lessons learned for me, at least, were do not take a simplest approach. um I've seen, um especially in some security conscious companies, um they just go like oh, we need security, can mtls, give us what we need uh to meet certain securities requirements. Yes, um is, is too painful to man. Sorry is mtls painful to manage by hand. Of course it is especially at scale. Can a service mesh give us mtls easily?

A

Yes, okay, let's use it right and uh specifically, in this case, what I want to actually mention is.

A

This problem is specifically exactly exaggerated if you are working in a distributed environment with some sort of central platform team, which is actually a trend that that we are seeing these days, which is one team handling, all your central infrastructure, all your central communities clusters and all your um service, mesh kind of control, plane, deployment, etcetera, etcetera and um uh the the problem we see. We see there is well one of the problems I see there is the platform team will brand the clusters. They will deploy the service mesh for you.

A

They will label the namespace for you, but you as an end tenant as a developer.

A

You don't probably even know there is a service mesh or even if you know you don't really understand how it works, and then you end up probably implementing certain things manually that in code that exists in the service mesh using a yaml file, so the most kind of extreme case I've seen before is people implementing retries by themselves, while they are using a service, mesh and service can do that for them or implementing their own canary deployment using some sort of sophisticated cicd pipeline.

A

While the service mesh can do that for them right, so these are all things to keep in consideration. If you end up in a situation where you have to deploy a service mesh in a centralized way, um I think you should be able to surface that information to everybody who's involved and let them know that there are things that can be done using a simple yammer file instead of implementing it in code.

A

There is another problem which is service, mesh compatibility with certain applications.

A

I've seen this a couple of times, but I think the one of the kind of common use case that you might have came across is using a data transformation, a data transformation pipeline. Sorry, uh an example that came to mind that this is not specifically to to to point out a specific tool is just an example.

A

Is the argo pipeline, the argo not cd, not the csd partner, but the argo tool that allows you to do data transformation on data um so that the way that works is that you have a control plane that you configure to spin up a pod to run a specific piece of custom code on a specific piece of data.

A

When we introduce a service mesh well, the problem there became a the the pod itself for the application takes less time to run than the service mesh than the sidecar, so the sidecar actually takes more time to uh to get up be ready before the actual application. Pod can access the network and run, and then finish so basically like by adding a sidecar we're extending the runtime of specific job from few seconds to a minute, maybe um and then we're increasing the footprint.

A

Obviously, um this is not a problem so, um for this kind of fast run uh dynamic environments that are either data transformation or ci cd or something like that. Introducing a sidecar based service mesh could be a problem. The other problem you, I think, you're aware of, is the the ordering in which things start. You have to create dependency between your actual application, pod or container and site car container, so that your application container does not try to connect to the network or resolve some dns or something before the site. Car is actually available.

A

um So basically, this problem of whether my application can run with the service mesh in a nice way is a problem you have to figure out before you actually commit to using a service mesh. The example I gave earlier we just end up excluding the argo pipelines from the service mesh because it was not compatible, but that's just one example.

A

As I said um in large organizations, one of the problems I've came across is people having theoretical and practical experience, especially in debugging, because one thing is deploying and labeling and having site current injection working and all those kind of things which is nice. But when things break, you have to have enough theoretical and practical knowledge to know how to debug them right, and I have spent countless hours in front of the terminal trying to figure out why certain things didn't work.

A

uh It still specifically issue ctl. The command line makes that slightly easier, but when you are trying to debug issue using the ctl, you have to understand uh what are you looking at, especially when you use the proxy status or whatever subcommand line? That tells you how the mesh is working? You have to understand these, like terminologies.

A

You have to understand them properly to understand why something is broken, um and this again is another problem that is exaggerated if you are working in distributed environments like in a central like central multi-talent environment, so the question is: who is the responsibility to debug and fix the service mesh when it breaks um it? Will implementing service mesh will increase your technical depth, regardless of whether you want or not, um because simply you have to while maintain, monitor and manage the service match itself and maintain all the other components?

A

I talked about the auxiliary infrastructure if you want anyway, and it will increase your operational complexity for sure, because you are having another layer of abstraction which makes things slightly more complicated.

A

I have written an article um which goes slightly more into details. This is quite old. It's from january this year, um you can find it to my medium blog and it it goes into more details about the stuff I talked about, so you can feel free to read it if you want and with that I'd say. Thank you very much for having me and if you have any questions.

A