From YouTube: 020 Istio at Scale
Description
In this talk, I'll share the lessons I learned while deploying Istio on multiple Kubernetes clusters with more than 1000 Pods.
I decided to run everything on KinD clusters and I had to tune many parameters at different levels (Linux kernel, Kubernetes, Istio, ...).
After that, I'll explain how multi cluster communication can be configured using either the native Istio discovery service (EDS) or Gloo Mesh. I'll compare both solutions and how they scale.
Finally, I'll demonstrate how Gloo Mesh can be used to provide high-availability of applications across clusters, zones, and regions.
My goal here is to speak about the lessons learned from the testing I did when I was trying to deploy multiple Istio clusters with a lot of pods. In fact, that means deploying many control planes and many pods, but without spending too much money.
If I want to summarize my target numbers: I wanted to achieve at least five Kubernetes clusters, so five Istio control planes, at least a thousand pods, and spend less than $100 a day. And I had different options.
There were GKE clusters on one side — and when I say GKE, that could be EKS clusters or whatever clusters provided by a cloud provider — and KinD clusters on the other, which means having a large VM and deploying multiple KinD clusters in this VM.
Both options are valid, I would say. I started with a GKE cluster; it was more expensive, but it was still attractive, so that was not the reason why I finally decided to go with KinD clusters. It's really more complex at the beginning with KinD clusters, but you get a lot of advantages, like being able to redeploy everything very quickly.
You can try out different network topologies, like having communication between pods through gateways or directly, and you can run it anywhere. So it was providing a lot of advantages, and I ended up being able to deploy eight Istio control planes. You can see here one Ubuntu VM, my eight KinD clusters, and MetalLB running on each of them so that I can create Services of type LoadBalancer and so on.
I became a little bit crazy during this testing, because every time I solved one issue, I found another one. The goal of this talk is that for you it shouldn't go like that: it should be very easy. You should be able to relax, just apply what I learned, and deploy everything quite easily.
The first issue I encountered was "too many open files", and it was quite straightforward to find when I was looking at the Istio logs. I got it at the very beginning, when I deployed the first 250 pods in the first cluster, because I had decided to go with 8 clusters and 250 pods per cluster.
That's what I ended up with, and I very quickly found documentation on the KinD website explaining which values to modify when you get this kind of issue. I put this shy guy here just to remind myself to tell you: don't be shy with the numbers you use here.
It's not a production environment, it's just for tests. And the reason I say "don't be shy" is that if you use, for example, the numbers I found in that documentation, then yes, you can deploy more, but you start to have the same issue again later, because you reach the limit again. So don't be shy: put high numbers so that this issue is fixed and you can move on to the next one.
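The limits in question are the inotify ones the KinD documentation points at for "too many open files". A sketch of what I mean by "not shy" — 524288/512 are the values the KinD docs suggest, and going higher on a test box is fine:

```shell
# Raise the inotify limits on the host running the KinD clusters.
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512
```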
Next one: again trying to deploy these 250 pods, and I saw 189 of them staying in the Pending state — only around 60 were actually running. I finally figured out that there are CPU and memory requests set by default on the istio-proxy sidecar when you deploy Istio.
So I changed the CPU request from the default value of 100 millicores to 10 millicores, and the memory request from 128 MB of RAM to 32 MB, just by changing that in the istio-sidecar-injector ConfigMap. And obviously, you do that, you modify this value, and then you figure out you need to restart again, because the pods you already deployed don't use the new value.
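One way to make that change, shown as a sketch — the exact layout of the ConfigMap varies between Istio versions, and the same values can also be set at install time through the IstioOperator `global.proxy.resources` settings:

```shell
# Edit the injector defaults; only newly injected pods pick up
# the lower requests, existing pods keep the old ones.
kubectl -n istio-system edit configmap istio-sidecar-injector
# In the embedded "values" section, set:
#   global.proxy.resources.requests.cpu:    "10m"
#   global.proxy.resources.requests.memory: "32Mi"
```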
So you either restart the current pods so they pick up the new value — but you know how it is in Kubernetes: it's very often faster to redeploy your cluster than to try to delete dozens of pods. So that's again one of the benefits of using KinD when you discover an issue.
The next issue is that you reach the maximum number of pods per node, which is 110 by default. You see, I deployed fewer than 100, but you already have the system pods — the pods for Calico, for MetalLB, and so on — so you reach this limit. You change that in the KinD config: there is a Cluster object that you can patch, and here I set it to 1,000 pods. But you guessed it: you need to restart again.
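A sketch of that KinD config patch, raising the kubelet's pods-per-node limit — the cluster name is my assumption, and the patch shape should be checked against your KinD version:

```shell
cat > kind-cluster1.yaml <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        max-pods: "1000"
EOF
# Recreating the cluster is required for the kubelet flag to apply.
kind create cluster --name cluster1 --config kind-cluster1.yaml
```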
But then what happened is that I started to get another issue. You deploy these pods — you have one YAML file that describes all of them and you submit it — and it processes the first 10 or 20 very quickly, then it starts to slow down and slow down, until it takes several seconds per pod at the end. And it's just because etcd is slowing down.
The approach I took here was: okay, let's just create a one-gigabyte tmpfs — a memory file system — and again update the configuration of the KinD cluster so that it uses this tmpfs as the backend for etcd.
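A sketch of that setup — mount a tmpfs on the host and map it over the node's etcd data directory; the paths, size, and cluster name here are illustrative:

```shell
# Back etcd with a tmpfs so its writes never hit disk.
sudo mkdir -p /tmp/kind-etcd
sudo mount -t tmpfs -o size=1g tmpfs /tmp/kind-etcd

cat > kind-cluster1.yaml <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /tmp/kind-etcd
    containerPath: /var/lib/etcd
EOF
kind create cluster --name cluster1 --config kind-cluster1.yaml
```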
The next step was probably one of the most difficult to solve. I had this fast etcd, I was able to deploy my first cluster with 250 pods, and then the second one with 250 pods, and then I believed: okay, that looks good. If I can do that, I should be able to deploy as many clusters as I want, as long as I still have memory available — because if it works with one, it works with two.
A
It
will
work
with
three
five,
whatever
and
and
now
in
fact,
at
some
point
after
a
few
clusters,
I
started
to
have
really
weird
dns
error.
Sometimes
it
was
the
nsl.
Sometimes
it
was
like
very
strange
issues
that
you
don't
really
understand
and
and
after
some
doing
some
research,
I
found
out
that
when
you
run
too
many
containers
in
the
same
operating
system,
you
get
this
issue
with
the
arc
cache.
There are different values there to manage how garbage collection of this cache is performed by the operating system. There is one that represents the minimum number of entries you keep in the cache. The most important one is probably the second, gc_thresh2, which represents the soft maximum of entries in the cache — after that, it starts to remove the old ones. And then you have the hard maximum, gc_thresh3.
I think the default for the first one is 128, which is very low — imagine, it cannot even keep cache entries for more than a hundred-something containers. It was still working for a few clusters, probably because even if the ARP cache was not big enough, there were not that many new entries to manage all the time. But with a thousand pods, that doesn't scale. So I used these values — you can use the same ones.
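The knobs in question are the kernel's neighbor-table (ARP cache) GC thresholds. A sketch with deliberately high values — the defaults are 128/512/1024, and the exact numbers below are my choice rather than canonical ones:

```shell
sudo sysctl net.ipv4.neigh.default.gc_thresh1=8192   # below this, entries are never GC'd
sudo sysctl net.ipv4.neigh.default.gc_thresh2=32768  # soft maximum
sudo sysctl net.ipv4.neigh.default.gc_thresh3=65536  # hard maximum
```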
I was not shy here, because the first time I just went from 128 to 256, but I was still having the problem — just a little bit later. Using these commands fixed that issue. After that, what I found out is that even though I have a huge machine with several hundred gigabytes of RAM, I would still not be able to deploy all my pods, because I wouldn't have enough memory.
If I do the calculation: I found out that each pod was using about 100 megabytes of RAM, and I wanted eight clusters with 250 pods each. At 100 megabytes per pod, that means about 200 gigabytes of RAM. That was a little bit too much — I was close to having enough, but it was still too much. So then, here is what I discovered.
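The back-of-the-envelope calculation above, written out:

```shell
# 8 clusters x 250 pods x ~100 MB per sidecar-injected pod.
CLUSTERS=8; PODS=250; MB_PER_POD=100
echo "$((CLUSTERS * PODS * MB_PER_POD)) MB"   # 200000 MB, i.e. ~200 GB
```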
You know that by default, Istio configures the sidecar proxies so that they have visibility of all the other pods in the cluster. That means there are what we call Envoy cluster entries in each sidecar for all the other pods of the cluster, and the more entries you have there, the more memory you use.
If you create an Istio Sidecar object, you can define which pods each sidecar can see. The special notation you see here, `./*`, basically means a sidecar will only see the other pods of its own namespace. And it's in the istio-system namespace, which means it applies to everything.
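A sketch of the mesh-wide Sidecar resource described here — the `kind`, the `./*` notation, and the root-namespace behavior are standard Istio; the metadata name and kube context are my assumptions:

```shell
kubectl apply --context cluster1 -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: istio-system   # root namespace, so it applies mesh-wide
spec:
  egress:
  - hosts:
    - "./*"              # only services in the workload's own namespace
    - "istio-system/*"   # plus the control plane's namespace
EOF
```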
So I went from about 100 megabytes to 70 megabytes per pod, which means I needed about 60 gigabytes of RAM less — good enough for my testing. And what's interesting is that all these different steps were just to prepare everything: deploying Istio on eight clusters with these 250 pods each. After that, I wanted to do the actual testing about multi-cluster communication and so on.
And it would have been a lot worse, because the way you do multi-cluster communication is, first of all, you need discovery in place so that one cluster knows about the pods of the other clusters. That means each Envoy sidecar would have had visibility of 2,000 pods. In terms of memory it would have been a nightmare, and I would have needed something like a terabyte of RAM.
I will show you that in the demo, but using this setup was really nice, because I was able to go from two and a half hours down to 45 minutes — exactly what I wanted. And the second really positive side effect is that because everything is in memory, I can reboot my machine and start from scratch; I don't need to wait for this very long cleanup.
You know, if you do a `kind delete cluster` and you use normal storage, it will take forever, because it has to go through the directory where all these containers have been stored, and it takes a lot of time to delete all these entries. With memory you don't care: you reboot, it's gone, and you start from scratch again.
So this is the achievement at the end: eight clusters, two thousand pods, in forty-five minutes. I didn't calculate the exact budget, but you can do the math: it's about three dollars an hour. The VM I use is a huge one, but still only three dollars an hour, so it's less than $100 per day. And if you use a preemptible version, it would be less than one dollar an hour. So it can be very, very cheap, and you can do very nice scale testing with it.
So the next step was: now I want to have this multi-cluster communication. And here is the way it works with the multi-primary design in Istio.
You enable something called the endpoint discovery service, and the way it works is that on each Istio control plane, you need to create one secret corresponding to the kube API server of each of the other clusters, so that one control plane will go and reach all the kube API servers of the other clusters to be able to discover the workloads. The second control plane does the same, and the third does the same, and so does the fourth, and so on.
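That wiring is done with Istio remote secrets — one per peer, on every control plane, so the count grows quadratically with the number of clusters. A sketch for a single pair, with assumed context names:

```shell
# Let cluster1's istiod discover workloads in cluster2.
istioctl create-remote-secret --context=cluster2 --name=cluster2 \
  | kubectl apply -f - --context=cluster1
# ...and the mirror image, plus every other ordered pair of clusters.
```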
It's not really nice in terms of the way it scales, but it also has other issues. If one of my API servers becomes unavailable and an Istio control plane restarts, it cannot start — you have to go and delete the secret before it can start. You also have a security concern: if one of these clusters is compromised, you have a secret for all the other ones.
With those secrets, you could really delete everything if you wanted — delete all the pods of all the clusters and create a big mess. So with Gloo Mesh, which is our management plane for managing multiple Istio control planes, we have a very nice design where, first of all, it's just one component — the Gloo Mesh management plane — that is responsible for discovering everything and making the other clusters aware of what it has discovered.
But the other thing that we have implemented recently, which is very nice, is that we now have an agent that runs on each cluster, and it is the responsibility of this local agent to watch the local API server and to pass the information to Gloo Mesh using a gRPC channel.
So Gloo Mesh gets all the info about the discovered workloads, and then it can use the same gRPC channel to tell all the different agents what it has discovered and how to apply all the policies and so on. So it's a lot more scalable, and also a lot more secure, because there is no exchange of the API-server secrets.
You can do much more with Gloo Mesh — obviously, it's not only about discovery. I will do a demo where I show you this environment that I built, but I will also focus in this demo on the global failover routing, which is really an amazing feature.
I could spend an hour, or two hours, or more just going through all the nice capabilities of Gloo Mesh, but I think you'll get a good first view of what it can do. So let's go for the live demo. I have here my environment — the VM I spoke about before — with Gloo Mesh running, and you can see I have my eight clusters.
I have my more than 2,000 pods deployed — everything I described before — and I will go through a lot more details quickly, but first let me just show you that on the CLI. I have different contexts.
I have one for the management cluster — I spoke about the fact that I have eight clusters at the end, but in fact I have nine: one for the management plane, for Gloo Mesh, and the eight others. I could have the management plane on one of these eight clusters, but it's kind of a best practice to have a dedicated one. And what you can see here is that I have just Gloo Mesh running; I don't have Istio at all.
We have the RBAC webhook, which is also very nice. I won't have time to go through the details here, but it can help you define who can do what: who can create what kind of policy on which cluster and which namespace, and which kind of capability — traffic shifts, header rewrites, all these different things.
So that's my management cluster. Now I have cluster one through cluster eight, where you can see that I have ten namespaces, and if I go to the first one here, in each namespace I have 25 pods — so 10 times 25, that's the 250 pods. That's what I have here. I also have something in the default namespace.
You can see that I have something called vd — for virtual destination. This is a small UI that we will use to demonstrate this global failover. Okay, the other thing I want to show you is: if I go to cluster1 and look at the nodes...
...the nodes carry a region and a zone — at least that's what I simulate. If I go to cluster 2, it's us-west-2: the same region, but a different zone. And then on number three, it's a different region and another zone. Like that, I have four regions with two zones per region — that's what I simulated here. And, as I was saying before, the idea is that this 172.18.x address here is the IP of the ingress gateway of cluster one.
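The simulated locality is just the standard Kubernetes topology labels on the KinD nodes, which Istio reads for locality-aware routing — a sketch, with assumed node names and label values:

```shell
kubectl --context cluster1 label node cluster1-control-plane \
  topology.kubernetes.io/region=us-west \
  topology.kubernetes.io/zone=us-west-1
```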
So if I do a GET on this app, it will basically send a request to this URL, and you see this URL is the first pod of the first namespace. What this application does is return information about the pod, like the name of the pod, but also information about the region and the zone. So this one says: okay, I am running in the us-west-1 zone.
So the idea is to see how we can use Gloo Mesh so that if this service becomes unavailable on the first cluster, I want the request to go directly to the next available zone, and if the next available zone is not available, I want it to go to the next region. As I've shown you, currently it just goes local, on this region, and the same for this one.
One thing I didn't mention, which is very nice with Gloo Mesh as well, is that we consolidate all the metrics. All the sidecar proxies send their metrics to the local agent, and the local agent passes these metrics to the Gloo Mesh management plane. On cluster one, I deployed Kiali, and I pointed Kiali...
...instead of pointing Kiali to a local Prometheus that scrapes the local metrics, I pointed it to a local Prometheus that scrapes the metrics from Gloo Mesh directly. And you see here...
...I see the communication in the last 10 minutes that happened between the ingress gateway of cluster one going to the echo service, and the ingress gateway of cluster eight going to the echo service as well — exactly what we've just demonstrated. And now, to have this high availability that I discussed, what we need is a few things. The first one is: if we look at our Gloo Mesh UI, we see that we have a virtual mesh; we have our eight clusters.
So what we did is say: we will create a new hostname for the echo service — something like that; let me take a look here and we will see it. But basically, we have made the 250 services highly available, so everything will automatically do this failover between regions and zones and so on. So here's the way it works.
The way I could use it is to just use this new hostname now, instead of the name I have used here, and I would have this high availability. But what we can do as well — and this is what we do in this demo, which is even more powerful — is that we created some policies. What these policies are doing is saying: when a request is sent to the local echo service, in the echo namespace...
...basically, I want it to go to my virtual destination, so that it's now transparent. That means that when I send a request here to the local service, it's behind the scenes a highly available service. To try it out, it's quite easy: we will fail the service here on the first cluster, and we'll see that it automatically goes to the next one. So let me do that here.
What I do is replace the container image used for the service — because it's a very minimal one — with a new one that just does a sleep for 20 hours, so that it cannot reply anymore and is considered a failing service by Envoy. And here, if I just go here, it will take probably 30 seconds for the new pod to start and the other one to terminate.
So you see here the new one is Pending, so I still have the old one running, and I just need to wait around 30 seconds or so. What I'm going to do in the meantime is go to my UI — the service perspective of my UI.
Here I want to do a curl to localhost:15000/clusters — these are the Envoy clusters, so all the entries that this sidecar knows about. It knows about the local services that are in the same namespace, because this is the way I configured my Sidecar object, but it also knows about these global services. So here I can see a global service, and the one I want is the one that's called "one".
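That query goes against the Envoy admin endpoint of the sidecar, which listens on port 15000. A sketch of doing the same from kubectl — the deployment name here is assumed:

```shell
# Dump the Envoy clusters known to one sidecar via its admin API.
kubectl --context cluster1 exec deploy/vd -c istio-proxy -- \
  pilot-agent request GET clusters
```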
So if I filter on that, I see all the entries that correspond to my eight services — the local name has been automatically replaced by that. And what's interesting is that if I look at the zones, for example, I can see that I have each endpoint in a different zone, because they are each on different clusters. And I can even see something else interesting, which is based on the zone and the region.
What Istio has done is set what we call a priority. You see here: us-west-1 is cluster one, so the priority for this entry is zero — I want to send all the requests locally first. us-west-2 is the same region but a different zone, so it's priority one: if I cannot go locally, I prefer to go there. And then all the other ones are in different regions, and they have priority two.
So that's basically why we will see that it goes to the next zone and then the next region. So if I go back — oops, sorry, that's not what I wanted to do — if I go back to my cluster here... let me look in the right cluster.
So again, if I look at my pods here, I see the pod has been replaced now, and if I go to my UI and try to access it, you see it now goes automatically to the next zone, because the local one is not available anymore. And I can do the same here and try to make the second one fail as well.
That's what I'm doing just now. It's still there because, as you have seen before, it takes some time before it's replaced by the new one, so we have to wait a bit. But what we can do as well is go to Kiali, and you see here it has automatically discovered that it now sends requests from one cluster to another. That's what's very nice with these global metrics gathered from everywhere: we know where the requests are going and so on.
So using Kiali is one example, but you can also just access all the metrics by yourself. We are also going to add some nice graphs based on these metrics in our UI in the near future.
Another thing I want to show you, while we wait for this failure to happen: I spoke about the metrics that we consolidate, but we also consolidate the access logs. And here it's not about shipping all the access logs — that could be a lot — it's really giving you the ability to specify what you want to gather. Let's say you have an issue at some point in time with an application.
So I said: okay, I want to gather all the access logs for my workloads that have this label — that corresponds to my first application in the first namespace — and I want these logs from all the clusters in one place. So I configured that, and now I can use this endpoint, which is on my management cluster, on Gloo Mesh, to gather these logs, and I can get a lot of very useful information here.
You see here, for example, I got this log from cluster 2, because this is where I sent my last request, and I can see a lot of very interesting information: the identity of the pod, information about my request, performance information, but also things like: it was a GET on this path, and this has been the response. And you see the same here on cluster 2 again, and you can continue like that.
At some point you will see on cluster one that we have a response like a 500 — you see a 503 — because that's where we failed the service ourselves. So now, if I go back and take a look at my pods — it's done — I can see that I'll be redirected now to any of the other clusters.
It doesn't matter which one now, because obviously there is no priority difference between the different regions. And if I go back to Kiali here very quickly, you see that it goes everywhere now — you see that I have requests going through all the different clusters.
So yeah, I think you have a good idea now of how it works. We've just shown a few capabilities of Gloo Mesh, but I think you can already see how it can simplify your life if you have multiple Istio clusters. I hope you enjoyed the talk, and we have some time for Q&A now. Thank you.