From YouTube: Istio Feb Meetup/ Demo: Scaling Istio in Large Clusters by Auto-Generating Sidecars by Cathal Conroy
Description
Demo: Scaling Istio in Large Clusters by Auto-Generating Sidecars
Sidecar resources are key to scaling Istio in large clusters, but configuring Sidecars requires a detailed registry of every given workload’s peers. What do you do if your org maintains no such registry? What if service names are dynamic? This is the story of how we used NetworkPolicies to auto-generate Sidecars for hundreds of services, spanning many technical orgs and teams.
So, a little bit about myself: I'm Cathal. I've been with Workday for three and a half years, always in the public cloud team, where we manage all aspects of Workday's public cloud offering: the Kubernetes platform itself, the service mesh, domain-specific operators, and everything in between. I'll touch briefly on our scale, walk you through our journey with Istio, cover some of the problems we encountered as we rolled out to our larger clusters, and explain how we used Sidecars to overcome those problems.
We've been running Kubernetes since 1.2, currently on 1.21, and Istio since 1.1, currently on 1.11. We've had Istio running in production since October last year. Historically, we used a combination of Linkerd and stunnel to encrypt HTTP and TCP traffic. Our org itself is split between Dublin, Ireland and Pleasanton, California; I think we have about 50 engineers across six teams.
At any one time we can have up to 200 dev clusters, each with up to a thousand pods, 40 nodes and 75 namespaces. Namespaces are going to crop up a few times in this talk, and there is a reason for that. I know namespaces are not typically a scaling constraint for Istio, so I will explain what's going on there and dig into it.
By July we had a requirement to run with low-TTL certs on our workloads, with a cert rotation mechanism and zero tolerance for service disruption, and so we decided to split our delivery between Istio ingress and Istio mesh, and we set ingress as our first goal.
Up to this point we were using ingress LBs external to the cluster, and we decided we wanted to run with Istio at the edge. So by January, development was in full swing: we were building our images from scratch on the just-released proxyv2, and we were fitting that, you know, enormous Helm chart that Istio provides into our platform CI/CD system. There it is in all its glory; hopefully not too many of you have had to work with that.
Come April, we had our first service using ingress in dev, and we had the Istio control plane rolled out to all dev clusters with sidecar injection disabled. By September, the majority of ingress services had migrated over in dev, the Istio control plane was running live in production, and Istio was handling ingress for a single service there, which was a big milestone. And by March, I think, most services were running on Istio ingress.
Excuse me. Then by January we had pretty much everything: Istio ingress was running live in production and we had the majority of services onboarded to Istio mesh. There is a good ten-month gap there, and we took that time to onboard everything to Istio: teaching teams why we were bringing in the tech, what the problems were, et cetera, and actually onboarding them. That alone involved about 180 services from, I guess, 50 or 60 teams, and that was enough of a challenge to inspire a talk all of its own, maybe another day. But anyway, by January we had all services onboarded in dev and we began our rollout to production.
Around that time, in February, we were scaling up our perf clusters to match our largest production clusters, to make sure everything was performant and all was good in the world. Unfortunately, we found it wasn't. Istio was running fine in dev and our much smaller production environments, but it came crashing down rather spectacularly in our large perf clusters.
We have two large clusters in particular which are significantly larger than the others, and so we paused our production rollout at this point while we figured out what was going on. Fast forward to October: mesh was rolled out everywhere in dev and production, and that's where we are today. So this talk is going to focus in on the problems we had between April and October, getting those last, largest clusters enabled with mesh.
As we scaled our clusters up to around three and a half to four thousand pods, we started encountering problems with Istio. istiod's HPA would scale as far as it could, istiod pods would begin OOMing, and as a result every data plane proxy in the mesh would start timing out trying to pull configuration from the control plane, and those which did manage to come up went into crash loops all over the mesh. This sent wave after wave of events to the control plane to process, and the control plane wasn't processing them fast enough, causing data plane config pulls to time out and slow down, causing more crashing. We ended up with a positive feedback loop which was essentially killing the Istio control plane and, in fact, the Kubernetes control plane too.
The first port of call, and the easy one, was to scale istiod, so we scaled it massively, horizontally and vertically, just to see if we could get things under control. We increased data plane memory to three gigs; bear in mind, you've got up to four thousand pods running with three gigs each, at least. That wasn't enough, and data plane proxies still OOMed all over the place. We played around with the pilot debounce settings, PILOT_DEBOUNCE_AFTER and PILOT_DEBOUNCE_MAX, to try and reduce the amount and frequency of config pushes from the control plane.
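Those debounce knobs are plain environment variables on the istiod deployment. A fragment showing the kind of override being described; the values here are illustrative, not recommendations:

# Illustrative istiod Deployment fragment: batching config pushes harder.
# PILOT_DEBOUNCE_AFTER / PILOT_DEBOUNCE_MAX are real istiod settings;
# the values below are examples only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istiod
  namespace: istio-system
spec:
  template:
    spec:
      containers:
        - name: discovery
          env:
            - name: PILOT_DEBOUNCE_AFTER   # wait this long after an event before pushing
              value: "500ms"
            - name: PILOT_DEBOUNCE_MAX     # never delay a push longer than this
              value: "30s"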
But ultimately, nothing worked. Here is a snapshot of our control plane: you can see the number of replicas we have, and these are peaking at around eight gigs of memory each. CPU, same again: we're running 12 CPUs. Essentially, no matter what we scaled to, the control plane would max out, and we were stuck with the issues we had.
This graph here shows the max proxy memory per namespace, and we're looking at this per namespace again for a reason which we'll circle back to. The worst offender, the one namespace there in purple, was maxing out at 8.5 gigs of memory for a single proxy, and the average in that namespace was 5.6 gigs. That namespace in particular was holding shared messaging systems, so Kafka and RabbitMQ, and we noticed that those high-traffic systems would have the highest memory footprint. We ultimately saw an enormous reduction in that memory as we introduced Sidecars. But it was pretty insane memory usage across the board, with the majority of namespaces maxing out around three gigs, which again was our limit at the time.
So, to Sidecars. Let's take an imaginary cluster, this miniature cluster, where blue boxes represent namespaces and green circles are pods, or workloads. By default, Istio provides every workload proxy with the configuration needed to communicate with every other workload in the mesh, which is fine for a cluster with three namespaces and seven or eight pods. But what happens when you have a lot of namespaces and a thousand pods, or five thousand, or ten thousand?
This requires a pretty enormous amount of configuration to manage: it takes a lot of memory to store and a lot of CPU to compute. In reality, in a typical microservices architecture, most services only ever speak to a very small subset of all their peers. They don't actually need to speak to everybody.
So, of course, Istio does have a mechanism to address this, and that is the Sidecar resource. It's a very simple Istio CRD: it has a workload selector to select who it applies to, and you get an ingress listener list, an egress listener list and an outbound traffic policy. We won't talk about the traffic policy today, it's not important here, but essentially, among other things, Sidecars allow us to limit the amount of configuration pushed to our proxies. The ingress configuration is actually derived from workload information, so your pod ports.
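As a rough sketch of the resource shape just described; all names and ports here are illustrative, not taken from the talk:

# Illustrative Istio Sidecar resource showing the fields described above.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: example
  namespace: a
spec:
  workloadSelector:            # which workloads this Sidecar applies to
    labels:
      app: example
  ingress:                     # derived from workload info (pod ports)
    - port:
        number: 8080
        protocol: HTTP
        name: http
      defaultEndpoint: 127.0.0.1:8080
  egress:                      # which peers we want configuration for
    - hosts:
        - "b/*"
        - "istio-system/*"
  outboundTrafficPolicy:       # not covered in this talk
    mode: ALLOW_ANY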
On the egress side, we can configure exactly which upstream clusters we want to get configuration for. So we need a definitive list of all the peers that a workload might speak to, which leads us to one important question: who does our service talk to? Who does any given service talk to? And, more importantly, as a cluster administrator, how do you get that information for 200-odd unique services running in your cluster which you don't even own? And so, yeah, a quick note on Sidecar workload selectors themselves.

A default Sidecar is one with no workload selector, typically called default. It sits in the namespace and applies to all workloads in that namespace which aren't matched by any other Sidecar. Undefined behaviour occurs when a workload is matched by more than one Sidecar, whether that's from multiple default Sidecars or from multiple Sidecars with workload selectors. Avoid it, don't do it, stay away from it: one Sidecar config per workload.
So, back to that question: who does my service talk to, or who does any service talk to? We needed some definitive source which described the relationships between all services, and you would think only the service teams could know that. Why would we know it? But we were on a deadline, everyone's on a deadline, and we were not about to approach 60-odd service teams and ask them to write a Sidecar definition for every one of their services.
Oh, and by the way, if you get it slightly wrong, you risk breaking network connectivity. So that wasn't going to happen; we had to provide a solution ourselves. Internally, Workday has a tool called Workday Registry, which is like a who's who of services in Workday.
You've got service owners, deployment platforms, dependencies on peers, et cetera, et cetera. But that has its own problems. It's manually configured, so everything is from human input, and so if I build a new feature with a new dependency and don't bother putting it into the registry, well, that's not going to be accounted for: I'm going to have no network connectivity. It's not exhaustive. It's typically used to describe the application-level relationships your services need to drive application behaviour.
It doesn't typically include things like logging sinks or metrics servers, et cetera, and we needed something that was exhaustive. Also, the registry predates Workday's adventures into Kubernetes and public cloud, so it had no concept of namespaces or services.
We run network policies in all our clusters, of course, and we run a deny-ingress-by-default setup, which means any pod-to-pod traffic at all in a cluster has to be described by some network policy somewhere.
Here's a sample netpol which we apply to each namespace; this is our default deny. So we figured that if we could map out every possible allowed traffic flow from network policies, we could invert that into a map of egress. Every service's network policies describe which services are allowed ingress to it, and inverting those flows tells us who needs egress to whom.
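The slide itself isn't captured in the transcript, but a per-namespace default deny-ingress policy of the kind being described is normally shaped like this (namespace name illustrative):

# A typical default deny-ingress NetworkPolicy, applied per namespace.
# The empty podSelector selects every pod; listing "Ingress" in
# policyTypes with no ingress rules denies all inbound traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: a
spec:
  podSelector: {}
  policyTypes:
    - Ingress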
So, as a start, we wanted to programmatically generate something like this. This is a Sidecar, a default Sidecar: there's no workload selector, so it applies to everything in namespace a, and what we're doing is specifying a list of hosts which wildcard to other namespaces. We're saying everything in namespace a is going to be able to speak to everything in b, c, default and f, and in doing this we're reducing our peer space from the entire cluster to just these namespaces.
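Reconstructed from that description; the namespace names are the ones read out, the rest is the standard Sidecar resource shape:

# Auto-generated default Sidecar for namespace "a": no workloadSelector,
# so it applies to every workload in the namespace. Egress hosts
# wildcard to the namespaces that netpols say "a" may reach.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: a
spec:
  egress:
    - hosts:
        - "b/*"
        - "c/*"
        - "default/*"
        - "f/*"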
Yes, it can still talk to everything in those namespaces, but it's still a massive reduction, and the algorithm to do that is very simple: for every network policy in your cluster, note the destination, in blue, and note your source namespaces, in orange. Each of those is a flow from a source to a destination; add it to your map.
And here's the same thing again with a pretty picture. For every network policy ingress peer: if it doesn't have a namespace selector, then the source is the current namespace, and if the namespace selector is open, add the netpol's namespace as a destination for everything in the cluster.
There are a few gotchas here. If the ingress peer list is empty or missing, the rule actually allows all traffic. If an ingress peer has no namespace selector, it matches the current namespace. If the namespace selector exists but is empty, it matches all namespaces. It's really important to get this right; again, without it you risk breaking network connectivity in your cluster. A sketch of the inversion, with those gotchas handled, follows.
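Here is a minimal sketch of that inversion in Go, using the Kubernetes API types. It is paraphrased from the description in the talk, not the actual generator code, and resolution of concrete namespace selectors is stubbed out:

package sidecargen

import (
	netv1 "k8s.io/api/networking/v1"
)

// buildFlowMap inverts NetworkPolicies into a map of
// source namespace -> set of destination namespaces,
// applying the three gotchas listed above.
func buildFlowMap(policies []netv1.NetworkPolicy, allNamespaces []string) map[string]map[string]bool {
	flows := map[string]map[string]bool{}
	add := func(src, dst string) {
		if flows[src] == nil {
			flows[src] = map[string]bool{}
		}
		flows[src][dst] = true
	}
	for _, pol := range policies {
		dst := pol.Namespace // the netpol's namespace is the destination
		for _, rule := range pol.Spec.Ingress {
			if len(rule.From) == 0 {
				// Gotcha 1: an empty or missing peer list allows all sources.
				for _, src := range allNamespaces {
					add(src, dst)
				}
				continue
			}
			for _, peer := range rule.From {
				switch {
				case peer.NamespaceSelector == nil:
					// Gotcha 2: no namespace selector means the source
					// is the netpol's own namespace.
					add(dst, dst)
				case len(peer.NamespaceSelector.MatchLabels) == 0 &&
					len(peer.NamespaceSelector.MatchExpressions) == 0:
					// Gotcha 3: a selector that exists but is empty
					// matches every namespace in the cluster.
					for _, src := range allNamespaces {
						add(src, dst)
					}
				default:
					// Otherwise, resolve the label selector to the
					// namespaces it matches (omitted in this sketch).
				}
			}
		}
	}
	return flows
}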
Why do we have a setup like this? It's the result of decisions from very early in our Kubernetes journey. This is not good practice, it's not something you should try to replicate; you should provide good namespace isolation between all of your services, and certainly we don't allow this today. But this was the reality, this is what we had to build around. So per-namespace
Sidecars were not going to be enough for us. Looking back at those Sidecars, we wanted to go a step further and define exactly which hosts a given namespace might want to egress to. This means that a namespace which only wants to egress to one or two services in that shared default namespace is only going to get configuration for those two. So we've gone from wildcarding every namespace to specifying exact services in those namespaces.
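In Sidecar terms, the egress hosts move from namespace wildcards to fully qualified service hosts, something like this (the service names are invented for illustration):

# Narrower auto-generated Sidecar: exact services instead of namespace
# wildcards. "kafka" and "rmq" are illustrative names, not real ones.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: a
spec:
  egress:
    - hosts:
        - "default/kafka.default.svc.cluster.local"
        - "default/rmq.default.svc.cluster.local"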
The biggest problem with this is that Sidecar egress listeners use FQDNs, while network policies use pod selectors, and so we needed some way to find which services sit in front of a given pod. Kubernetes provides no mechanism to do that; there is no native way to find which services may sit in front of a given pod.
Of course, Kubernetes services use pod selectors to select pods, and that's very straightforward: a one-way lookup. But network policies, our source of truth, can select pods via different pod selectors than the services use. And actually it's quite easy: a pod selector gives us a label selector, and all that's doing is listing pods in a namespace and reducing them to those whose labels are a superset of the label selector.
Similarly, we then take those pod labels and list all services in that namespace whose pod selector is a subset of our pod labels. So we needed to adjust our algorithm very slightly, as sketched below.
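A sketch of that two-step reverse lookup in Go. It handles matchLabels only (matchExpressions and deduplication omitted), and labelsSuperset is a hypothetical helper, not a client-go function:

package sidecargen

import (
	corev1 "k8s.io/api/core/v1"
)

// labelsSuperset reports whether labels contains every key/value
// pair in selector, i.e. labels is a superset of selector.
func labelsSuperset(labels, selector map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// servicesForPodSelector resolves a netpol podSelector to the services
// sitting in front of the pods it matches, within one namespace.
// Step 1: keep pods whose labels are a superset of the netpol selector.
// Step 2: keep services whose own selector is a subset of a matched
// pod's labels.
func servicesForPodSelector(selector map[string]string, pods []corev1.Pod, services []corev1.Service) []corev1.Service {
	var out []corev1.Service
	for _, pod := range pods {
		if !labelsSuperset(pod.Labels, selector) {
			continue // pod not matched by the netpol peer
		}
		for _, svc := range services {
			if len(svc.Spec.Selector) > 0 && labelsSuperset(pod.Labels, svc.Spec.Selector) {
				out = append(out, svc) // this service selects the pod
			}
		}
	}
	return out
}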
The problem with this now was that we were generating these as we bootstrapped our clusters, so they weren't dynamic. Of course, clusters are not static: namespaces change, and services, pods and network policies all change, and these Sidecar definitions need to respond and be regenerated accordingly. We needed something dynamic. So, introducing the sidecar generator, very appropriately named: it generates Istio Sidecars. It's an evolution of what we've just seen. Rather than running during cluster bootstrap, we run it in the cluster permanently as a deployment; it's long-lived. It watches namespaces, services, pods and network policies, responds to changes and regenerates Sidecars live in the cluster. We subscribe to the resource events we're interested in using a shared informer, and the shared informer triggers add, update and delete functions.
We do a bit of filtering there to ignore events that we don't care about, for example annotation changes, which are not going to impact our Sidecars. The event handlers decide whether or not they need to trigger a regeneration of Sidecars, and we essentially push these events into a channel of size one: if there's something in the channel, regenerate the Sidecars and take the item from the channel; otherwise, wait.
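A minimal sketch of that watch-and-debounce pattern with client-go shared informers; the filtering and regeneration logic are stand-ins for the real thing:

package sidecargen

import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func run(client kubernetes.Interface, stop <-chan struct{}) {
	// Size-one channel: at most one pending regeneration, which
	// coalesces bursts of events into a single rebuild.
	dirty := make(chan struct{}, 1)
	markDirty := func() {
		select {
		case dirty <- struct{}{}: // queue a regen if none is pending
		default: // one already queued; drop the event
		}
	}

	factory := informers.NewSharedInformerFactory(client, 0)
	handler := cache.ResourceEventHandlerFuncs{
		// The real generator filters out irrelevant changes
		// (e.g. annotation-only updates) before marking dirty.
		AddFunc:    func(obj interface{}) { markDirty() },
		UpdateFunc: func(oldObj, newObj interface{}) { markDirty() },
		DeleteFunc: func(obj interface{}) { markDirty() },
	}
	// Watch the resources the Sidecars are derived from.
	factory.Core().V1().Pods().Informer().AddEventHandler(handler)
	factory.Core().V1().Services().Informer().AddEventHandler(handler)
	factory.Core().V1().Namespaces().Informer().AddEventHandler(handler)
	factory.Networking().V1().NetworkPolicies().Informer().AddEventHandler(handler)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	for {
		select {
		case <-dirty:
			regenerateSidecars() // rebuild and apply all Sidecar resources
		case <-stop:
			return
		}
	}
}

func regenerateSidecars() { /* stand-in for the actual generation logic */ }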
And this is the output. It's a bit blurred, but imagine this with a couple of hundred Sidecars, creating this for every namespace in the cluster. This had a pretty enormous impact in our perf clusters: it completely stabilized the Kubernetes control plane and the Istio control plane. You can see the worst offending max proxy memory per namespace has dropped from 8.9 gigs to 1.3, which is an 85 percent decrease, and the worst offending average memory per namespace has decreased from 5.6 gigs to 650 megs, an 88 percent decrease. So, an absolutely enormous memory decrease in the data plane, and something similar in the control plane: the top graph here is CPU per istiod instance, where we saw a 75 percent reduction, and the bottom is the same for memory.
So, yeah, that's pretty much it. Moving forward, as we scale our clusters, we're probably going to have to optimize even further. Here, again, we're generating Sidecars per service; we could go even further and generate them per service per port, and the Sidecar resource allows us to do that. I don't know if we're going to see a huge optimization there; I don't think we have too many services which expose ports that aren't used by all of their clients.
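For reference, a per-service-per-port egress listener would look something like this; the service and port here are illustrative:

# Per-port egress listener: only port 9092 of the (illustrative) kafka
# service is pulled into this namespace's proxy configuration.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: a
spec:
  egress:
    - port:
        number: 9092
        protocol: TCP
        name: kafka
      hosts:
        - "default/kafka.default.svc.cluster.local"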
Ideally, we would generate a single Sidecar for every workload, but that requires a unique label for every single workload. We saw a pattern in our clusters where teams might use a layer label which would select multiple unique services, and then they would also use an app or a name label to select individual ones, and that kind of screws up how you apply Sidecars to individual pods.
But, you know, it's something we could solve; we just haven't needed to yet. A big one for us, and something that we're absolutely working on since discovering this, is eliminating those shared namespaces completely. The default one is a pretty bad example for us, and we've been pushing back heavily on the service teams involved to do that.
Now, we did find one issue upstream on the Envoy proxy GitHub where people were talking about lazy-loading clusters, routes and endpoints as they are queried, which would be pretty awesome. The conversation seems to have died off around March, so I don't think there's any active work on that at the moment, but that would be something really awesome to see in Envoy, and then obviously we would get to reap the benefits from that in Istio, too.