From YouTube: "The Salesforce Service Mesh: Our Istio Journey"
Description
#IstioCon2021
Presented at IstioCon 2021 by Pratima Nambiar.
Istio and Envoy are foundational building blocks of the Salesforce Service Mesh. This presentation walks you through our service mesh journey. I will briefly talk about why we chose the service mesh design pattern, how we initially built it using Envoy and our in-house control plane, and our subsequent pivot to Istio. I will discuss how we are currently leveraging Istio and our plan to increase adoption of Istio to further enhance our Service Mesh platform.
One of Salesforce's core values is trust: trust that our customers can have in us that their data is safe in the cloud. One of the core factors that plays into trust is security and compliance, and for network traffic that means mTLS with authorization everywhere, using a specific set of ciphers approved by our security team.
Our pre-Istio service mesh looked something like this. We focused on our data plane to begin with. In fact, we started with another open source data plane and then switched to Envoy, because Envoy was more performant and because it has a good control plane and data plane split, with a well-defined API between the two.
At that time, about four-plus years ago, there wasn't a good open source xDS implementation that we could leverage, so we built our own bare-bones xDS implementation to solve for the most common use cases. This control plane was backed by ZooKeeper as the service registry for announcements; the control plane interfaced with ZooKeeper and triggered xDS updates to the running Envoys.
Based on that, we built a resiliency test framework and ran through common scenarios like rolling upgrades, transient failures, and so on. We came up with good defaults for resilience policies to maximize the success of requests, and applied that configuration via our in-house control plane to all the Envoys in the mesh.
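Resilience defaults like these map naturally onto Istio's traffic-policy fields. Here is a minimal sketch in Python; the specific thresholds (ejection counts, intervals, pool sizes) are illustrative assumptions, not the actual defaults described in the talk:

```python
# Sketch: default resilience policies applied to every service in the mesh,
# expressed as an Istio DestinationRule. All numeric values are assumptions.

def resilience_defaults(service: str, namespace: str = "default") -> dict:
    """Return a DestinationRule-style dict carrying default
    circuit-breaking and outlier-detection policies for one service."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": {"name": service, "namespace": namespace},
        "spec": {
            "host": f"{service}.{namespace}.svc.cluster.local",
            "trafficPolicy": {
                # Eject endpoints that keep failing, which smooths over
                # transient failures and rolling upgrades.
                "outlierDetection": {
                    "consecutive5xxErrors": 5,
                    "interval": "10s",
                    "baseEjectionTime": "30s",
                },
                "connectionPool": {
                    "http": {"http1MaxPendingRequests": 1024},
                },
            },
        },
    }

rule = resilience_defaults("checkout")
```

In practice a control plane would emit one such resource per service, so every Envoy in the mesh starts from the same vetted defaults.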
Istio seemed to be solving the problems we were trying to solve, and therefore overall seemed like a good fit. Istio also had a strong community then, one that has grown significantly over the past few years and that we could rely on to evolve this control plane. In short, it felt like it could be the next Kubernetes for service mesh.
A good example of where this decision already paid off was when Envoy built the v3 xDS API and started to deprecate the v2 API. In order to continue to upgrade Envoy and pick up security fixes, we would otherwise have had to invest time in rebuilding our own control plane to support the v3 API.
A
One
was
mutual
tls
using
our
internal
ca.
Salesforce
requires
us
to
use
a
internal
ca
and
we
couldn't
use
citadel-based
certs.
We
were
already
running
with
a
heterogeneous
infrastructure.
Our
monolith
runs
on
bare
metal
and
we
have
quite
a
few
services
that
run
on
kubernetes,
dynamic
infrastructure.
We made some resiliency-related fixes to Pilot-to-Envoy communication, and we made the Envoy metrics service configurable so that we could ship metrics to our internal metrics system as we did the PoC. What we really liked was Istio's use of CRDs to configure the mesh. That made mesh configuration easier for us to read than raw Envoy configuration, and we could see that it would let us plug into our tooling and pipelines to generate this configuration and support higher-level use cases.
So this is the architecture we ended up with after the PoC. We ran Istio Pilot in our cluster and, as I mentioned, we intentionally chose to adopt Istio incrementally; therefore we did not bring up Galley, or Mixer for telemetry and authorization. At that time, we just stuck with the control plane.
We built a config webhook that would listen for service events and generate the mesh configuration, the Istio configuration, for all the services, including applying those resilience policies that I talked about.
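A config webhook like this essentially turns a service event into a set of Istio resources. The sketch below shows the shape of that generation step; the event format and the retry values are assumptions for illustration, not the talk's actual implementation:

```python
# Sketch: a service event comes in, and we emit the Istio VirtualService
# for it with default retry policies baked in. The event shape and the
# retry numbers are illustrative assumptions.

def on_service_event(event: dict) -> dict:
    """Generate a VirtualService for a newly announced service."""
    name = event["service"]
    namespace = event.get("namespace", "default")
    host = f"{name}.{namespace}.svc.cluster.local"
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "hosts": [host],
            "http": [{
                "route": [{"destination": {"host": host}}],
                # Default resilience policy applied mesh-wide.
                "retries": {
                    "attempts": 3,
                    "perTryTimeout": "2s",
                    "retryOn": "5xx,reset,connect-failure",
                },
            }],
        },
    }

vs = on_service_event({"service": "orders", "namespace": "prod"})
```

The payoff of the CRD approach is visible here: the webhook's output is plain, reviewable Kubernetes objects rather than low-level Envoy config.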
mTLS with authorization is a hard problem to solve, and we saw a lot of new types of use cases starting to leverage the mesh in order to meet this requirement, including quite a few off-the-shelf products. A few worth calling out are Qpid, our messaging platform; Solr, our search platform; and ZooKeeper and Redis for caching. Our monolith, which used to run on bare metal infrastructure, now runs on Kubernetes, and it uses a blue-green deployment strategy powered by Istio's traffic shifting rules.
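Blue-green cutover via Istio traffic shifting boils down to flipping route weights between two subsets of the same host. A minimal sketch, with the subset names ("blue"/"green") assumed for illustration:

```python
# Sketch: blue-green deployment expressed as an Istio VirtualService with
# weighted routes across two subsets. Subset names are illustrative.

def blue_green_routes(host: str, green_weight: int) -> dict:
    """Return a VirtualService sending `green_weight` percent of traffic
    to the green subset and the remainder to blue."""
    assert 0 <= green_weight <= 100
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": "monolith"},
        "spec": {
            "hosts": [host],
            "http": [{
                "route": [
                    {"destination": {"host": host, "subset": "blue"},
                     "weight": 100 - green_weight},
                    {"destination": {"host": host, "subset": "green"},
                     "weight": green_weight},
                ],
            }],
        },
    }

# Full cutover: shift all traffic to green.
vs = blue_green_routes("monolith.prod.svc.cluster.local", 100)
```

Rolling back is the same operation with the weight flipped back to zero, which is what makes this pattern attractive for a monolith.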
We also define declarative authorization policies in a config repo that gets fed into our mesh architecture and then gets converted into Envoy RBAC filters, which get applied and enforced at the sidecar of a service.
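A declarative allow-list like this can be lowered into Istio's AuthorizationPolicy CRD, which Istio in turn compiles into Envoy RBAC filters at the sidecar. A minimal sketch; the shape of the repo's policy input is an assumption:

```python
# Sketch: converting a declarative caller allow-list from a config repo
# into an Istio AuthorizationPolicy. The input shape is an assumption.

def to_authorization_policy(service: str, allowed_callers: list) -> dict:
    """Allow only the listed caller identities (SPIFFE-style principals)
    to reach `service`; everything else is denied by the ALLOW policy."""
    principals = [
        f"cluster.local/ns/{c['namespace']}/sa/{c['service_account']}"
        for c in allowed_callers
    ]
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "AuthorizationPolicy",
        "metadata": {"name": f"{service}-allow"},
        "spec": {
            "selector": {"matchLabels": {"app": service}},
            "action": "ALLOW",
            "rules": [{"from": [{"source": {"principals": principals}}]}],
        },
    }

policy = to_authorization_policy(
    "orders", [{"namespace": "prod", "service_account": "checkout"}])
```

Because the principals come from mTLS workload identities, this is the "mTLS with authorization everywhere" requirement expressed as data rather than application code.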
We also generate out-of-the-box health signals, or golden signals as they are called in the Google SRE book, using Envoy telemetry for all mesh services. Our SREs can then view them in a single dashboard to understand the health of the system in general. And in order to deploy Istio to public cloud infrastructure, we had to integrate it with our Spinnaker pipelines and build a process for upgrades, which is crucial for us since we actually run the latest version of Istio and Envoy.
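The golden signals mentioned above (traffic, errors, latency, saturation) fall out directly from Envoy's per-request telemetry. A toy sketch of the aggregation, assuming a simplified access-record shape:

```python
# Toy sketch: computing three of the four golden signals (traffic, errors,
# latency) from Envoy-style access records. The record shape is an
# assumption; real pipelines would consume Envoy stats or access logs.

def golden_signals(records: list, window_seconds: float) -> dict:
    """records: dicts with 'status' (int) and 'duration_ms' (float)."""
    if not records:
        return {"rps": 0.0, "error_rate": 0.0, "p50_ms": None}
    errors = sum(1 for r in records if r["status"] >= 500)
    latencies = sorted(r["duration_ms"] for r in records)
    p50 = latencies[len(latencies) // 2]  # median latency
    return {
        "rps": len(records) / window_seconds,   # traffic
        "error_rate": errors / len(records),    # errors
        "p50_ms": p50,                          # latency
    }

sig = golden_signals(
    [{"status": 200, "duration_ms": 12.0},
     {"status": 500, "duration_ms": 90.0},
     {"status": 200, "duration_ms": 15.0}],
    window_seconds=1.0,
)
```

Because the sidecar emits these uniformly for every service, the dashboard needs no per-service instrumentation work.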
Apart from adopting Istio for mesh communication, we also adopted a software load balancer at the edge in our Hyperforce deployments. In our data centers today, most of the traffic flows via F5 load balancers, but in Hyperforce, when we shifted to using a software load balancer, the ingress gateway was the obvious choice. For the most part, Istio configures Envoy as an edge proxy pretty well, with good defaults.
So we are able to use it more or less as is. Istio's Gateway CRD for configuring SNI-based routing rules at the ingress simplifies that configuration. Salesforce, however, has complex DNS and certificate requirements.
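SNI-based routing at the ingress can be expressed with the Gateway CRD by declaring one server entry per TLS host. A minimal sketch; the host names and credential names are illustrative:

```python
# Sketch: an Istio Gateway doing SNI-based routing at the ingress, one
# HTTPS server entry per SNI host. Hosts and secret names are illustrative.

def sni_gateway(hosts_to_secrets: dict) -> dict:
    """Build a Gateway with one HTTPS server per SNI host, each
    terminating TLS with its own credential."""
    servers = [
        {
            "port": {"number": 443, "name": f"https-{i}", "protocol": "HTTPS"},
            "hosts": [host],  # SNI match for this server entry
            "tls": {"mode": "SIMPLE", "credentialName": secret},
        }
        for i, (host, secret) in enumerate(sorted(hosts_to_secrets.items()))
    ]
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "Gateway",
        "metadata": {"name": "public-ingress"},
        "spec": {
            "selector": {"istio": "ingressgateway"},
            "servers": servers,
        },
    }

gw = sni_gateway({"api.example.com": "api-cert", "app.example.com": "app-cert"})
```

The credential names are exactly the hook where the DNS and certificate provisioning workflow described next plugs in.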
So we had to build a workflow around DNS and certificate provisioning in order to allow us to publicly expose our services via the ingress gateway, and this is what that looks like. We have a config repo where you can define your DNS, certificate, and SAN template requirements for exposing a service to the public internet. That gets fed into Kubernetes as CRDs, and we run a Kubernetes operator called Ingress Assistant that listens for these events and then triggers a workflow to create or update those DNS entries and provision public certificates using the SAN templates that were requested, delivering them to Vault, from which the ingress is able to read the certificates. Ingress Assistant then watches for the completion of these events and creates the Gateway CRD to bring it all together, configuring the SNI-based routing rules. That then gets picked up by istiod and delivered to the ingress gateway, and the public endpoint is functional.
We are looking to use the new auto-registration feature for bare metal services so that we can get rid of that ZooKeeper deployment I talked about, which we run and manage today as a service registry. We are also in the process of leveraging the DNS proxy feature for TCP multi-cluster support, essentially TCP mesh-style communication between services running on two different Kubernetes clusters, and we are looking to use the traffic shifting capabilities for, say, path-based routing and cross-region routing.
We are also experimenting with standing up versions of our monolith that are optimized for certain types of requests, and then using Istio's traffic shifting rules to route traffic to those specific subsets that are optimized to receive it. Salesforce has complex authentication and authorization requirements.
Since we are a multi-tenant platform, we are looking for patterns of authentication and authorization rules that we can move out of our application code and make features of the mesh platform, with the sidecar as the policy enforcement point. We are actually using WebAssembly for JWT minting in this flow, for example, and we are also using OPA for enforcing these authorization rules.
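To make "sidecar as policy enforcement point" concrete, here is a toy illustration in plain Python (not OPA's Rego) of the kind of tenant-scoped rule such an engine might enforce; the claim names are assumptions:

```python
# Toy illustration of a tenant-scoped authorization rule enforced at the
# sidecar by a policy engine such as OPA. Plain Python, not Rego; the
# JWT claim names ('tenant', 'scopes') are assumptions.

def authorize(jwt_claims: dict, request: dict) -> bool:
    """Allow a request only if the caller's tenant matches the target
    tenant and the caller holds the required scope."""
    same_tenant = jwt_claims.get("tenant") == request.get("tenant")
    has_scope = request.get("scope") in jwt_claims.get("scopes", [])
    return same_tenant and has_scope

ok = authorize(
    {"tenant": "acme", "scopes": ["orders:read"]},
    {"tenant": "acme", "scope": "orders:read"},
)
```

The point of moving such checks into the mesh is that every service gets the same multi-tenant isolation guarantee without re-implementing it in application code.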
We are looking at integrations for service protection and rate limiting features in general. We expect to use Envoy filters for building integrations for things like centrally controlled fault injection.
As Hyperforce gains momentum, we expect to run a service mesh spanning multiple Kubernetes clusters, so among the features we are watching very closely is Istio's multi-cluster support.
We are excited about WebAssembly as a technology for proxy extensions to solve new types of business use cases. I talked about using it for JWT minting; we will look at using it for dynamic routing, header injection, and protection against OWASP security risks.
We are watching for improvements in the DNS proxy feature, especially around StatefulSet support, and in general we are watching for improvements in the Istio product for better support of larger meshes, specifically around reducing proxy initialization time, optimized config delivery, and improved Envoy-to-control-plane load balancing.
We also hope to leverage Istio's egress gateway as our egress solution at some point, and we are looking for improvements to the upgrade process. We upgrade pretty often: since we make changes to the open source product for our business use cases, we need to bring those changes back into our deployments in a timely manner, so we actually run the latest version of Istio, and anything that would make those upgrades easier for us would be awesome.
We have come a long way in our journey to adopt service mesh, and Istio and Envoy in particular, but we also have a long way to go. I would like to end with a shout out to the Istio community. Istio has a strong, active community, and we have seen this manifest in a variety of ways; the very relevant features that have been added over the past few releases are a good example.