From YouTube: 010 Adopting Istio across 100 clusters at T-Mobile
Description
Learn about T-Mobile’s journey of adopting Istio across 100+ clusters to support microservices for fraud detection, billing, sales and APIs across many teams. The talk will cover things such as tenancy, install/upgrade, feature adoption, CI/CD integration, and architecture tradeoffs.
I'm also a maintainer of the MagTape project, which focuses on Kubernetes policy as code. But enough about me: at T-Mobile we have a team of rock stars that power our platforms and services, and I'm just the one here talking about it today. So a big shout-out to all the amazing folks that I get to work with to make all of what I'm about to cover possible.
We still had a pretty diverse environment, with a large Cloud Foundry footprint and a lot of applications that it didn't make sense to drive to Kubernetes from virtual machines and bare-metal systems. Security was a constant concern as our platform grew and new risk surfaces emerged. There was a definite learning curve in the container space for our developers, and we wanted to simplify that as much as possible. And just like anything your business comes to depend on, resiliency was a key focus in everything we did.
Now, as with any large project, we came together and worked on a list of goals to help us drive towards service mesh adoption. With only a small team to start with, we had to keep automation in mind from day one. We knew service mesh wouldn't be a good fit for all users, at least not at first, and we really weren't staffed to handle the support burden of our entire user base sort of being forced to adopt across the board.
We started our Istio journey before 1.0, and man, have things changed since then. As with any complex software, you need a good plan for lifecycle management; just getting it installed everywhere is not enough. The day-two operational burden can be huge, and this is the part of the process where we learned a ton, as it was sort of a new model with respect to how things could impact consumers of the platform, with the data plane essentially being a user-facing component. To help ease a lot of this,
we came up with a pretty formal process for promoting Istio changes and upgrades in our environment. Now, there's no magic here: we started by reading the release notes and changelogs, just like everybody else does, to see if there are any configuration changes or breaking changes that we need to be aware of and solve for in our own configuration. Next, we target installing the new release in a sandbox environment of sorts, and we run our suite of tests to validate things.
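There's no single right way to mechanize a promotion flow like that, but as a rough sketch with stock Istio tooling (the revision names, versions, and the `sandbox` namespace below are placeholders, not our actual pipeline), revision-based canary upgrades let a new control plane run side by side with the old one while the test suite runs against it:

```shell
# Sketch of a revision-based canary upgrade; versions/namespaces are placeholders.
istioctl x precheck                          # verify the cluster before upgrading
istioctl install --set revision=1-17-2 -y    # new control plane alongside the old one

# Point a test namespace at the new revision and re-inject its sidecars
kubectl label namespace sandbox istio.io/rev=1-17-2 istio-injection- --overwrite
kubectl rollout restart deployment -n sandbox
istioctl analyze -n sandbox                  # lint the resulting mesh config

# After validation everywhere, retire the old control plane
istioctl uninstall --revision=1-16-5 -y
```

The appeal of the revision approach is that rollback is just relabeling namespaces back to the old revision, rather than reinstalling a control plane.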
At a high level, we try and absorb as much of the mesh complexity into the platform and tooling as possible. We have common ingress and egress gateways established in a sort of centralized manner, and for this we have load balancers, DNS, and TLS pre-plumbed, so things just work without the mesh consumers having to worry about any of that.
This takes the concept that most folks are used to with an ingress controller for HTTP services and makes it possible for TCP-based services that support SNI. We get the same one-to-many name-based routing functionality as an ingress controller, instead of mapping node ports or burning through a bunch of load balancers.
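As an illustration of that pattern (the hostnames, namespaces, and ports below are made up for the example), a single TLS-passthrough gateway can fan many TCP backends out behind one load balancer, routing purely on the SNI name the client presents:

```yaml
# Hypothetical shared TLS-passthrough gateway; names and ports are placeholders.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: shared-tls-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: tls
      protocol: TLS
    tls:
      mode: PASSTHROUGH      # TLS is not terminated at the gateway
    hosts:
    - "*.mesh.example.com"
---
# Each team binds its TCP service to the shared gateway by SNI hostname.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: billing-db
  namespace: billing
spec:
  hosts:
  - billing-db.mesh.example.com
  gateways:
  - istio-system/shared-tls-gateway
  tls:
  - match:
    - sniHosts:
      - billing-db.mesh.example.com
    route:
    - destination:
        host: billing-db.billing.svc.cluster.local
        port:
          number: 5432
```

Because the mode is PASSTHROUGH, the connection stays encrypted end to end; the gateway only inspects the SNI field to pick a backend.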
Our service mesh story is far from over. We still have a lot that we want to do, but overall we're down with the meshness. We've started to play around with integrating non-Kubernetes services into the mesh, and we're following along with the enhancements coming from the Istio project itself, but we aren't far enough along to have developed any real opinions so far in our own testing.
If you don't, it's likely a mistake. There's been a lot of work over the past few releases focusing on stability, but it has been a real problem in the past, so we keep a trust-but-verify attitude these days as we go into upgrades. Sort of tying into release stability, we've seen default values change, and in general the API changes came fast and furious for a while. Most of the APIs are mature at this point, but definitely pay attention to the release notes.
Now, we've been bitten more than once by things outside the mesh being invasive, some in slightly annoying ways and some in catastrophic ways. Feel free to follow up with me offline and I can share specifics. Releases come often, and it can be really hard to keep up. We have a running joke that any time we start talking about a specific Istio release, there's probably a new Istio release being made. So keep that in mind.
Many enterprises have a separation of duties between those who build and maintain the mesh layers and the developers that consume it. Here are a few things to keep in mind that we found helpful. One: do not let buzzwords or the fear of missing out drive your adoption of service mesh. Talk through your needs and lead based on features.
Next, let's talk a little about stability, and here I'm referring to maintaining a stable service mesh offering after you've got it deployed in your environment. The first thing: automation is your friend. The Istio project moves extremely fast, and keeping up with upgrades is hard; the more automated your lifecycle management, the easier it will be to keep up. Set a realistic pace for yourself that works within your company's strategy, and know that skipping releases that solely implement new features that may not be important to you is probably okay.
Now let's take a brief moment to talk about usability. We saw a huge advantage in embedding our platform team with our application development teams, and this is a boundary that's closing more and more every day. Decisions on how the mesh should be installed and configured need to be made with awareness of the applications in your environment, so meet with teams regularly to identify what does and does not work. This isn't a one-time thing.
That being said, as we're testing out how we want to handle our multi-cluster mesh strategy, we found that there are still a lot of edges that can come up during upgrades, config changes, and environment-specific oddities. While the Istio project supports multiple topologies for tackling multi-cluster meshes, and some even offer more operational savings than what we have today, for us, we're sticking with multiple control planes to reduce our overall risk of impact.
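For reference, the multiple-control-plane approach maps to Istio's documented multi-primary topology, where every cluster runs its own control plane but shares a mesh identity. A minimal sketch of the per-cluster install values (the mesh, cluster, and network names below are placeholders, not our actual environment):

```yaml
# Sketch of a multi-primary install: one IstioOperator like this per cluster,
# same meshID everywhere, unique clusterName per cluster.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      meshID: mesh1                   # shared across all member clusters
      multiCluster:
        clusterName: cluster-east     # unique per cluster
      network: network-east           # drives cross-network gateway routing
```

The trade-off named above is deliberate: each control plane is a smaller blast radius, at the cost of running and upgrading more of them.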
On more than one occasion we found that noisy neighbors can be a real problem within a mesh instance. We've had mesh users affect each other across namespace boundaries, as well as non-mesh users affecting mesh users. The isolation boundaries are not rigid at all, and this was, while probably naive, kind of a surprise to us in the beginning. I'm happy to chat afterwards and provide some additional detail around these sorts of problems.
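One knob worth knowing about here, as a sketch of a general mitigation rather than necessarily what we run, is Istio's Sidecar resource: it limits which services each namespace's proxies can see and reach, which caps both pushed config size and cross-namespace reach (the namespace name below is a placeholder):

```yaml
# Restrict proxies in this namespace to their own namespace plus istio-system.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: billing        # placeholder tenant namespace
spec:
  egress:
  - hosts:
    - "./*"                 # services in the same namespace
    - "istio-system/*"      # control plane and shared gateways
```

This doesn't make the isolation boundary rigid, but it shrinks how much one tenant's config churn can ripple into another tenant's proxies.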