From YouTube: Migrating Airbnb to Istio
Description
Airbnb has been using an in-house service mesh called SmartStack since 2013. In the past 12 months, they migrated hundreds of production services onto their next generation of service mesh based on Istio. Come learn how they achieved the service mesh migration with minimal service owner involvement and zero downtime. They cover their migration strategy for different kinds of workloads and showcase the migration tool they built.
As introduced, we are engineers at Airbnb, and today we are very excited to talk about how we migrated Airbnb onto Istio. First of all, here is the agenda for today: we will start with a brief introduction, then we will focus on our migration strategies, and at the end we will have a quick recap and a Q&A session.
We finally decided to stop patching our SmartStack and started to search for a modern service mesh solution. After some evaluation, we quickly landed on Istio as the foundation. Internally we use AirMesh as the name for our next-generation service mesh, and we will use this term throughout the presentation. If you are interested in finding out more about why we ended up choosing Istio, feel free to check out our IstioCon talk from earlier this year. As of today, we have migrated almost all of our Kubernetes services onto Istio, along with about half of our inter-service production traffic, and our plan is to fully migrate to Istio next year and sunset our legacy system.
As a result, our mesh users do not directly interact with Istio custom resources. We provide a simple config file that our users edit. For example, here I show the mesh config file for the service banana. The service owner checks this mesh config file in alongside their service code. After code review, at CI time we generate the Istio custom resources from this config file, and the generated resources are managed by our deployment system. All config changes are made by a deploy, which is monitored and can be easily rolled back.
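The CI-time generation step described above can be sketched roughly as follows. The mesh config schema, field names, and service hosts here are invented for illustration; they are not Airbnb's actual format.

```python
# Hypothetical sketch of the CI-time generation step: a simplified mesh config
# (modeled as a dict) is turned into an Istio VirtualService manifest.
# The config schema and host names are illustrative, not Airbnb's real format.

def generate_virtual_service(mesh_config):
    """Build an Istio VirtualService manifest from a minimal mesh config."""
    routes = [
        {"destination": {"host": dest["host"]}, "weight": dest["weight"]}
        for dest in mesh_config["destinations"]
    ]
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": mesh_config["name"]},
        "spec": {
            "hosts": [mesh_config["host"]],
            "http": [{"route": routes}],
        },
    }

config = {
    "name": "banana",
    "host": "banana.prod.svc.cluster.local",
    "destinations": [
        {"host": "banana-production", "weight": 90},
        {"host": "banana-canary", "weight": 10},
    ],
}
manifest = generate_virtual_service(config)
```

Because the generated manifest is just data produced from the checked-in config, the deployment system can diff, apply, and roll it back like any other deploy artifact.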
We provide several different mesh objects in our mesh API. We provide App, which is a workload that only has outgoing traffic, and we also provide Service, which is basically an App with ports. VM App and VM Service are the EC2 versions of App and Service, and External is used for defining external services into the mesh, like our MySQL databases on AWS. We also have VirtualService, which allows users to control routing to a set of real services. We also allow extension and override between mesh objects. In this example, banana-canary and banana-canary-baseline both extend banana-production, so they get the same config as production, and this helps us reduce the verbosity of the mesh config file.
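The extension mechanism can be sketched as a simple inherit-and-override merge. The object names and fields below are illustrative, assuming a flat key-value config per mesh object.

```python
# Illustrative sketch of mesh-object extension: an object that "extends"
# another inherits the base's config and overrides only what it sets itself.
def resolve(objects, name):
    obj = dict(objects[name])
    base_name = obj.pop("extends", None)
    if base_name is None:
        return obj
    merged = resolve(objects, base_name)  # resolve the base first
    merged.update(obj)                    # the child's own fields win
    return merged

mesh_objects = {
    "banana-production": {"ports": [8080], "timeout_ms": 500},
    "banana-canary": {"extends": "banana-production", "replicas": 2},
}
canary = resolve(mesh_objects, "banana-canary")
```

The canary object ends up with production's ports and timeout plus its own replica count, so owners only declare what differs.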
I want to talk a little bit more about the VirtualService that we provide. The most straightforward use case of the VirtualService is static canary, which is how we started doing canary at Airbnb. At all times we route a certain percentage, for example ten percent, of traffic to the canary and the rest to production, and new changes are always deployed to the canary first to verify before proceeding to production. To achieve this kind of traffic routing, a user can simply define this VirtualService in their mesh config. A little bit more complex than static canary is ACA, which means automated canary analysis, and a lot of our Airbnb services have adopted it. For ACA, during non-deployment time all the traffic goes to production, but during deployment time we scale up the canary and canary-baseline pods and route a certain percentage of traffic to them to do a side-by-side comparison, and then we verify the metrics and check that everything looks good.
As shown on the top left, users define the mesh object keys and the percentages they want in their ACA deploy stage, and based on the user input our tooling will generate the Istio custom resource, which is a VirtualService, and deploy it during the apply-traffic-routing stage. After ACA, our tooling will also generate the resource to restore the traffic. So this whole process is completely hidden from the user; all they need to configure is the deployment config on the top left side.
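A rough sketch of the two generated traffic stages is below. The function names and route shapes are hypothetical, and it is an assumption here that canary and baseline each receive the same user-chosen percentage.

```python
# Hypothetical sketch of the ACA traffic stages: during the analysis, a
# weighted route splits traffic across production, canary, and the canary
# baseline; afterwards a second resource restores 100% to production.
def aca_routes(service, canary_percent):
    return [
        {"destination": {"host": f"{service}-production"},
         "weight": 100 - 2 * canary_percent},
        {"destination": {"host": f"{service}-canary"},
         "weight": canary_percent},
        {"destination": {"host": f"{service}-canary-baseline"},
         "weight": canary_percent},
    ]

def restore_routes(service):
    return [{"destination": {"host": f"{service}-production"}, "weight": 100}]

during_aca = aca_routes("banana", 10)   # 80/10/10 split during the analysis
after_aca = restore_routes("banana")    # everything back to production
```

Generating the restore resource up front means the rollback path exists before any traffic moves, matching the "hidden from the user" flow above.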
Besides making sure that the mesh API is user friendly, another top priority for us is safety. Every day there are hundreds and hundreds of changes going on, and we don't want to stop the business to do the migration. Instead, we want to migrate seamlessly while keeping the process safe at the same time.
In order to accomplish that, we provide the following features. We have edge-by-edge migration: we believe that a service should not require a leap of faith when onboarding AirMesh. It should be able to migrate each of its inbound and outbound edges one by one. For those critical edges, we support percentage-based traffic shifting from SmartStack to AirMesh. This allows side-by-side comparison of error rates and latency, and in case anything goes wrong,
we want traffic to be able to roll back quickly, within a second. Here's how traffic shifting works. First, we run the Istio proxy alongside the SmartStack sidecar in shadow mode. We then configure the Istio proxy to intercept traffic going to the reserved CIDR range for the new service mesh, and we also add the traffic shifting capability into Airbnb's standard client frameworks. After that, to shift traffic from SmartStack to AirMesh, a service owner can simply increase the traffic percentage using our dynamic config system. If anything happens, they just change the traffic percentage back to zero, and within seconds traffic will be routed back to SmartStack.
As traffic is being ramped up, service owners have access to this migration dashboard tracking the changes in error rate and latency. For critical services, we normally ramp up traffic gradually to a 50/50 split and leave it overnight for a side-by-side comparison to make sure there is no regression. Also, during gradual rollout, we can monitor the Istio sidecar's resource usage and adjust its CPU and memory during the process to avoid CPU throttling and OOM issues.
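The client-side shifting described above can be sketched as follows. The config key, store, and routing labels are hypothetical stand-ins for Airbnb's dynamic config system.

```python
# Illustrative sketch of percentage-based traffic shifting in a client
# framework: a dynamic config value decides, per request, whether the call
# goes over AirMesh or the legacy SmartStack path. Names are hypothetical.
import random

DYNAMIC_CONFIG = {"banana.mesh_traffic_percent": 10}  # stand-in for a live config store

def route_request(service):
    percent = DYNAMIC_CONFIG.get(f"{service}.mesh_traffic_percent", 0)
    if random.random() * 100 < percent:
        return "airmesh"     # intercepted by the Istio proxy
    return "smartstack"      # legacy SmartStack sidecar

# Rolling back is just flipping the percentage back to zero:
DYNAMIC_CONFIG["banana.mesh_traffic_percent"] = 0
```

Because the decision is a per-request config lookup rather than a redeploy, flipping the value back to zero takes effect within seconds, which matches the rollback behavior described in the talk.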
All these toolings are seamlessly integrated with the existing Airbnb development environment, so that a new service will just onboard AirMesh transparently. As we all know, Rome wasn't built in a day. The transition period of a migration can be very messy, so we provide full compatibility for a service to be in dual mode: that is, while being on AirMesh, the service can still fall back to communicating via legacy SmartStack. So, without stopping service development, we are pushing all the services to onboard AirMesh.
After this step, the service will be considered AirMesh ready: that is, the service is registered in the AirMesh control plane and is able to communicate with another AirMesh-ready service. Once the service is AirMesh ready, we are able to migrate its edges. That means we migrate an edge so that traffic from this service to another service flows over AirMesh. When all the inbound and outbound edges of the service are migrated, we consider the service AirMesh complete.
Now, if we start the migration, naturally we will migrate service A first to make it AirMesh ready. But in order to migrate the edge to service B, we also need to migrate service B, so that they can communicate with each other on AirMesh. Finally, we can migrate the edge between A and B. In order to make service A AirMesh complete, we will need to do the same thing again and again for all of its outbound and inbound edges.
So, as you can see, this process can be very long and entangled if we make the steps depend on one another. Instead, we clearly define each step to be independent, so we can easily pipeline the whole process. As you can see from this graph, we first migrate all services to be AirMesh ready in parallel; this paves the way for migrating any edges for any services. By pipelining, we greatly accelerate the migration speed as well as avoid many process complexities.
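The two-phase pipeline above can be sketched on a toy service graph; the services, edges, and state sets here are made up for illustration.

```python
# Sketch of the two-phase pipeline described above: first make every service
# AirMesh ready independently (parallelizable), then migrate each edge once
# both of its endpoints are ready.
edges = [("A", "B"), ("A", "C"), ("B", "C")]
services = {s for edge in edges for s in edge}

# Phase 1: readiness has no cross-service dependency, so it can run in parallel.
ready = set(services)  # stand-in for the per-service onboarding work

# Phase 2: an edge can migrate as soon as both of its endpoints are ready.
migrated = {(src, dst) for src, dst in edges if src in ready and dst in ready}

# A service is "AirMesh complete" when every edge touching it is migrated.
complete = {s for s in services
            if all(e in migrated for e in edges if s in e)}
```

Separating readiness from edge migration is what removes the entanglement: no edge ever waits on another edge, only on the readiness of its two endpoints.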
This pipelined approach also allows us to use a white-glove approach and better utilize economies of scale, as we make the migration process repeatable. With help from contractors, we achieved a high speed of migrating more than 40 services per week, and in this short quarter we have had tens of edges being migrated in parallel every day. As of today, more than 50 percent of traffic is on AirMesh already. But speed is not enough; we also want to make the migration process transparent.
This helps us make sure the edge is 100 percent accessible before actually shifting the traffic, and if there are errors, it will diagnose them and give suggestions on how to fix them. With all this migration tooling, not only do we make our migration faster and safer, it is also more transparent: service owners barely need to know anything about the AirMesh migration. The behavior stays unchanged; it just works, with a minimum of configuration changes, which are automatically generated.