Cloud Native Computing Foundation Online Programs, 3 Mar 2022

From YouTube: Multi canary release and load test

A

The topic of today's presentation is multiple canary releases and stress tests on production.

A

Let's start with multiple canary releases. After that, we will move on to stress testing in production.

B

Let's start with the simplest case. Canary release is a technique used to reduce the risk associated with releasing new versions of software. The idea is to first release a new version to a small number of users and then gradually roll out the upgrade. For example, in this diagram we test with 10 percent of the traffic first, then gradually move more traffic to the new version, and finally the old version is drained and taken offline. Throughout the testing process, we can label the traffic with various business tags, such as Android devices, location Beijing, etc.

B

Also note that user tags should not use IP addresses, which are inaccurate and inconsistent. Then we can specify canary traffic rules to route a certain part of the users' traffic to a certain canary. For example, an Android user from Beijing is routed to the 2.0 canary version of service A.

B

The single-service canary scenario is still limited. In reality, full-stack canary testing is more common. For example, a user client's request cannot be forwarded directly by the router to the canary version of the service, because that service sits far back in the call chain, separated from the client by other services. As shown in the figure, we have published a canary for the delivery service, with the user and order services in between. In this case, we need to do two things to ensure that the traffic is scheduled correctly.

B

The first is to pass through the user tags, and the second is to route traffic to the correct version of the next service at every point in the chain. If you consider the implementation level a little, you will notice that there are two categories of approaches to full-stack canary release: changing the code, or using a non-intrusive platform-level solution. Changing the code can be very cumbersome, verbose, and prone to bugs.
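
The first requirement, passing user tags through every hop, can be sketched as follows. This is a minimal illustration and not code from the talk: the header names `X-User-OS` and `X-User-City` and the helper `outbound_headers` are assumptions made for the example, and header-name case handling is left out for brevity.

```python
from typing import Optional

# Hypothetical tag headers; a real deployment would define its own names.
TAG_HEADERS = ("X-User-OS", "X-User-City")


def outbound_headers(inbound: dict, extra: Optional[dict] = None) -> dict:
    """Build headers for a downstream call, copying user tags from the inbound request."""
    headers = dict(extra or {})
    for name in TAG_HEADERS:
        if name in inbound:
            headers[name] = inbound[name]
    return headers


# The client request reaches one service, which calls the next, and so on.
client = {"X-User-OS": "android", "X-User-City": "beijing", "Cookie": "session=abc"}
to_order = outbound_headers(client, {"Content-Type": "application/json"})
to_delivery = outbound_headers(to_order)

print(to_delivery)  # {'X-User-OS': 'android', 'X-User-City': 'beijing'}
```

If any service in the chain forgets this copy step, every service behind it sees untagged traffic, which is why doing it by hand in each code base is error-prone.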

B

Let's use another practical example to illustrate the difficulty of multiple canary releases.

B

For example, there are now two services, order v1.0 and email v2.0. The order service calls the email service, and the email service uses a third-party email provider, the Tencent provider. We decided to add some information to the order entity as a test version for Android users only. Since the change to the order entity affects both services, both need to be changed, so we added order v1.1 and email v2.1 to apply this change. Then another team decided to replace the Tencent provider with the Google provider in the email service.

B

So we added email v2.0.1 to test users from Beijing only. At this point there is a dilemma: Android users and Beijing users overlap, namely traffic from Android phones in Beijing. Considering the complexity of the traffic and the inconsistency felt by users, this will lead to a heavy operational burden.

B

Let's take one step back and analyze different cases of canary release. Let's first concentrate on the figure on the left. For example, A and B rely on Z, while A is testing Android traffic and B is testing iPhone traffic. They test two different user groups, and if they both rely on Z, then Z takes on two different kinds of canary traffic and becomes a source of confusion. There are two better approaches, shown in the figures in the middle and on the right.

B

The first is to schedule traffic from A to Z, and the second is to schedule the traffic from A and from B to Z separately. The principle that makes many things easier is: one canary release, one traffic rule.

B

Even if the canary conflict problem is solved, there is still the problem that traffic rules may overlap. The previous example used Android and iPhone user traffic, but if one canary tests Android and the other tests Beijing, there will be a common subset of traffic that matches both canary releases.

B

So if the traffic is from a common subset, such as traffic from Android devices in Beijing, how should we route that traffic? In fact, this ends up being a mathematical abstraction of two set problems.

B

One: set matching. User traffic is a set, and the traffic rules of a canary release are also a set. Two: the multi-matching problem. When multiple canary traffic rules are matched, which one should be selected?

B

So let's use an example to demonstrate all the problems mentioned before. Consider the back-end service stack of a food delivery application. It consists of three microservices: order service, restaurant service, and delivery service. The order service has no canary. The restaurant service has two canaries: the first is for Android traffic from Beijing, and the second is for all Android traffic. The delivery service also has two canaries: the first is for traffic from Beijing, and the second is for all Android traffic.

B

When the system receives traffic with the Beijing user tag, it matches the routing rules of the delivery service's Beijing canary, and the traffic follows the green path.

B

On the other hand, traffic with the Android user tag matches the routing rules of the Android canary for both restaurant and delivery, and follows the blue path. But what happens to traffic that contains both the Android and Beijing tags? It matches all three canaries, and there is no unambiguous way to route the traffic.

B

This is how it looks in terms of sets. In terms of math, a canary rule is matched if the tags required by the canary are a subset of the tags carried by the user traffic. So how should we handle the multiple-matching problem?
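
The subset test can be written down directly. The sketch below models the three canary releases of the food delivery example by the colors used on the slide; the dictionary layout and the function name `matching_releases` are illustrative assumptions, not part of EaseMesh.

```python
# Each canary release is described by the set of user tags it requires.
# "blue" covers the Android canaries of both restaurant and delivery.
CANARY_RELEASES = {
    "red": {"android", "beijing"},   # restaurant: Android traffic from Beijing
    "blue": {"android"},             # restaurant + delivery: all Android traffic
    "green": {"beijing"},            # delivery: all traffic from Beijing
}


def matching_releases(user_tags: set) -> list:
    """A release matches when its required tags are a subset of the request's tags."""
    return sorted(
        name for name, required in CANARY_RELEASES.items() if required <= user_tags
    )


print(matching_releases({"beijing"}))             # ['green']
print(matching_releases({"android"}))             # ['blue']
print(matching_releases({"android", "beijing"}))  # ['blue', 'green', 'red']
```

The last call is the multiple-matching problem: one request, three candidate canary releases.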

B

A simple way to solve this problem is to specify a priority for each canary. Each traffic rule has a number indicating its priority, ordered from small to large, where a smaller number means a higher priority. As you can see from the figure, traffic tagged Beijing and Android matches three canaries, but because the restaurant Beijing-and-Android canary has the highest priority, priority 1, the red canary is selected. Even though priority solves the multiple-matching problem, there is still a problem caused by misconfiguration: the traffic shadow problem. In this example, the red rule is shadowed by the blue rule, because blue has been given the higher priority.

B

It means no traffic is ever routed to the Beijing-and-Android canary.
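
Priority selection and the shadow problem can both be illustrated in a few lines. This is a sketch under stated assumptions: the `Canary` class, `select`, and `shadowed` are hypothetical names, and the priority numbers are made up, with a smaller number meaning a higher priority as in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Canary:
    name: str
    tags: frozenset
    priority: int  # smaller number = higher priority


def select(canaries, user_tags):
    """Pick the matching canary with the highest priority, or None for primary."""
    matched = [c for c in canaries if c.tags <= user_tags]
    return min(matched, key=lambda c: c.priority, default=None)


def shadowed(canaries):
    """Return (victim, culprit) pairs: culprit matches everything victim matches and wins."""
    return [
        (v.name, c.name)
        for v in canaries
        for c in canaries
        if c is not v and c.tags <= v.tags and c.priority < v.priority
    ]


good = [
    Canary("red", frozenset({"android", "beijing"}), 1),
    Canary("blue", frozenset({"android"}), 2),
    Canary("green", frozenset({"beijing"}), 3),
]
bad = [
    Canary("red", frozenset({"android", "beijing"}), 2),
    Canary("blue", frozenset({"android"}), 1),
]

both = frozenset({"android", "beijing"})
print(select(good, both).name)  # red
print(shadowed(good))           # []
print(select(bad, both).name)   # blue
print(shadowed(bad))            # [('red', 'blue')]
```

In the `bad` configuration every request that could match red also matches blue, and blue always wins, so red never receives traffic.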

B

Finally, let's explain the technical details of EaseMesh and give an overview of how we implement multiple canary releases. First of all, all our services run in Kubernetes pods. The three services here correspond to three services in EaseMesh, and different versions of the same service are still part of one service, so a mesh service can have multiple versions running at the same time. In order to route traffic to the correct canaries, two things need to be accomplished.

B

The first is to pass user tags through the whole service chain without losing any user information. This involves the sidecar and the business application. The sidecar naturally knows all the canary traffic rules and the user tags, such as specific HTTP headers, and it forwards them with the traffic. The business application itself also needs to pass through the user tags, which can be done by our officially supported JavaAgent in cooperation with the sidecar, and does not require user awareness.

B

The sidecar notifies the JavaAgent to pass through all the information. For other languages such as Golang, since there is no comparable agent technology, a simple SDK is enough to forward the user tags. So EaseMesh also supports multiple languages for this advanced feature, as long as the user tags are available. The second requirement for EaseMesh canary releases is traffic routing. All components, including the EaseMesh ingress controller and the sidecar in each service pod, have the ability to route canary traffic to the corresponding canary version of the next service.

B

You can see that all the service components in this figure, whether receiving or sending requests, pass their traffic through the sidecar. When sending requests outbound, the sidecar observes the traffic characteristics and decides whether the traffic needs to be dispatched to one of the canary versions of the next service. This is all done by the sidecar, without the involvement of the agent or the SDK.

B

Let's now summarize the design principles of the platform and the best practices for operating it. The design principles are as follows. One: a canary service version can belong to at most one canary release.

B

Two: a request can be scheduled to at most one canary release. Three: the canary release must be explicitly selected by incoming traffic. Four: normal traffic that does not match any canary rule goes through the primary deployments.

B

Here are also a few best practices. One: traffic must be tagged using user-side information; a client IP address, for example, is not a good choice. Two: when tagged traffic overlaps, use explicit priorities to guide the traffic router. Three: a canary rule with a smaller scope gets a higher priority.
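
The third best practice can be turned into a mechanical rule. The sketch below uses the number of required tags as a stand-in for scope, which is an assumption of this example: a rule whose tags are a strict superset of another rule's tags always has more tags, so it is ranked ahead of the broader rule. The function name `assign_priorities` is hypothetical.

```python
def assign_priorities(rules: dict) -> dict:
    """Give narrower rules (more required tags) smaller priority numbers.

    `rules` maps a canary name to the set of tags it requires.
    Ties are broken by name so the result is deterministic.
    """
    ordered = sorted(rules, key=lambda name: (-len(rules[name]), name))
    return {name: rank for rank, name in enumerate(ordered, start=1)}


rules = {
    "android": {"android"},
    "beijing": {"beijing"},
    "beijing-android": {"android", "beijing"},
}
print(assign_priorities(rules))
# {'beijing-android': 1, 'android': 2, 'beijing': 3}
```

Rules that merely overlap without one containing the other, such as Android and Beijing here, still need an explicit human decision; the tie-break by name is only there to keep the output stable.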

B

This is all I wanted to show you today about multiple canary releases. Let's now move on to stress testing in production.

A

Now for the full-stack stress test part. The topic of this part is how to do stress testing in a production environment. Today's production environments have become very, very complex, just like the picture on the right. There are many components, ranging from dozens or hundreds to thousands, and these components are developed by different teams in different languages, which makes the communication between them very complicated.

A

No one can describe the relationships between all of them. This complexity, from a technical point of view, makes debugging difficult. In addition, the business has also changed a lot. For example, during the Black Friday promotion, the traffic pressure on online shopping systems is dozens or even hundreds of times higher than usual.

A

In order to know in advance whether our system can withstand such a high traffic load, we need to perform a full-stack stress test to get real performance figures. But because of the complexity mentioned above, it is very challenging to perform full-stack stress testing in today's systems.

A

Now let's look at the problems of traditional stress test methods. The first method is to build a test environment identical to the production environment for stress testing. In the era of standalone applications this was a very good solution, but in the age of the internet there are at least two problems. The first is money. We can count how many servers there are in our production environment and how much we would need to spend to buy them, and that is just the cost of the servers.

A

The cost will be higher when counting other hardware. Most companies would not be able to afford such a test environment. And even if duplicating the cloud resources for the test environment is not an issue, is it enough to get reliable results? I think the answer is still no, because it is difficult for a test environment to be exactly the same as the production environment.

A

There are several reasons. First, because it is a test environment, people will keep deploying test versions to it but forget to restore it after testing. Over time, the test environment will become more and more different from the production environment.

A

The second is that many development teams share this test environment, and if there is no excellent coordination mechanism, the tests conducted by different teams will affect each other's results. But the real trouble is the data.

A

That is, how do we ensure that the data used in the test is completely consistent with the production environment? For example, in a Twitter-like system, users like me generally have only a few dozen or a few hundred followers, so it is fairly easy to notify all my followers within a second when I post a message. But for a celebrity with millions of followers, the situation is very different.

A

Therefore, we cannot simply use simulated data for testing. The second point is the proportion of different users: users like me may account for 90 percent, while celebrities may be only one in hundreds of thousands. Only by simulating the proportion of users with different numbers of followers can we get a reliable test result. The easiest way is to take the production data to the test system for testing, but this brings the problem of data security. Production data generally contains a lot of sensitive information.

A

The risk of data leakage increases sharply if this data is brought into the test environment.

A

Because of these issues, people turn their eyes to the production environment and try to use its low-traffic periods for testing. But this is also a huge challenge, because it is an intrusive solution that involves modifying or even redefining business logic. Let's take an example. Assume an online shopping system with a user module and an order module. To test it, we need to modify these modules. First we need to add the test logic, and then we need to add the logic that detects whether we are in a test or not.

A

This looks very simple, just adding some if-else statements, but it is much more complicated in practice.

A

First of all, what exactly does "test" mean, and what kind of request can we treat as a test request? For the user module, we might be able to do this by adding a special prefix to the IDs of test users, or by specifying a range of user IDs in advance.

A

This should do the trick. When the request comes to the order module, we may still want to use the user ID to determine whether the test logic should be taken, but in reality, after a series of complex processing steps, the user ID may have been discarded, so the order module cannot see it at all. Then how do we write the judgment logic?

A

The second question is how our test logic differs from the production logic. It is easy to think of accessing different data sets or simulating a third-party service such as payments, because we don't want to actually spend money on testing. But what is really complicated is preparing data for subsequent components. This relates to the first problem: because the order module cannot see the user ID, the user module needs to mark the request sent to the order module so that the order module knows it is a test request.

A

However, in a complex system, it is not easy for the user module to know all the modules that the subsequent process will go through. So we have to spend a lot of effort to ensure the test state is correctly transmitted between modules, to avoid disturbing the production logic. Please note that this is just the work required for one function point, and there are thousands of function points in a normal system.

A

So the big question here is how much effort it takes to do all of these modifications. A bigger question is who can guarantee that all the changes that should be made have been made. If there are omissions or errors, the production data will be corrupted. How do we solve these problems?

A

We believe that the key lies in isolation, which is to isolate the production system and the test system along the four dimensions of business, data, traffic, and resources, to prevent them from affecting each other. Business isolation means that we should not add conditional judgments to decide whether to use production logic or test logic, but should separate them clearly from the beginning. Data isolation means that the same copy of data cannot be accessed by both the production system and the test system.

A

Traffic isolation means that normal requests and test requests can only enter their corresponding systems. The resources in resource isolation mainly refer to hardware. For example, the test system and the production system cannot be deployed on the same server, so that they do not compete for hardware resources such as CPU and memory. This is mainly a hardware issue, but Kubernetes offers a very good solution at the software level.

A

Let's take a look at the solutions offered by EaseMesh. First, because EaseMesh is built on Kubernetes, it achieves resource isolation with the help of Kubernetes. For business isolation, EaseMesh can replicate existing services. Except for an added shadow mark, the replicated copy is exactly the same as the original one. EaseMesh can also replace the connection information of various middleware, including MySQL, Kafka, Redis, etc.,

A

according to the configuration, and thus change the target of data requests, thereby realizing data isolation. When creating a service copy, EaseMesh also automatically creates a canary rule that forwards requests carrying the X-Mesh-Shadow header to the replicated service copy as test requests, and forwards other requests to the original service, to achieve traffic isolation.
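
The traffic-isolation rule described here amounts to a check for the presence of one header. A minimal sketch follows; the header spelling `X-Mesh-Shadow` is transcribed from the talk, while the `-shadow` name suffix and the function `route` are assumptions of this example rather than the EaseMesh implementation.

```python
SHADOW_HEADER = "X-Mesh-Shadow"  # spelling as heard in the talk


def route(service: str, headers: dict) -> str:
    """Send requests carrying the shadow header to the replica, all others to the original."""
    # HTTP header names are case-insensitive, so compare in lower case.
    names = {name.lower() for name in headers}
    if SHADOW_HEADER.lower() in names:
        return f"{service}-shadow"
    return service


print(route("coupon-service", {"x-mesh-shadow": "true"}))   # coupon-service-shadow
print(route("coupon-service", {"Authorization": "token"}))  # coupon-service
```

Only the presence of the header is checked here, since the talk does not say which value, if any, the header must carry.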

A

The above three isolations are implemented by the shadow service feature of EaseMesh. It should be noted that canary is also a feature of EaseMesh; the canary in the figure only means that the shadow service automatically deploys a canary rule. In addition to shadow service, we need another feature of EaseMesh to make a full-stack stress test possible: mock. Because we cannot replicate some third-party services for testing, such as the payment service mentioned above, we need to mock them. Now let's take a look at what will be demonstrated today.

A

This is a scenario where a user uses a coupon. We can see there are three services in it. The first is the coupon service, the second is the user service, and the third is the verification code service, which sends a verification code to the user's mobile phone. The coupon service and the user service each have their own database middlewares.

A

The entire system is deployed in Kubernetes. You may have noticed that our traffic entry is the mesh ingress, and that there is a JavaAgent and a sidecar with each service in the system, which means that these services are managed by EaseMesh. The JavaAgent is mainly there to hijack the various requests sent by the application, including both HTTP requests and requests to middlewares. The sidecar is implemented based on Easegress.

A

It mainly handles various kinds of traffic processing, as well as things like service discovery, monitoring, and tracing. It is this management by EaseMesh that makes it possible for us to hijack the various requests sent by applications and achieve the aforementioned business isolation, data isolation, and traffic isolation for stress testing in this system.

A

When a user request comes in, it first goes to our mesh ingress and then to the coupon service. The coupon service sends a request to the user service to verify the user's identity, and then, if that passes, sends a request to the verification code service to send a verification code to the user. So let's look at the steps we need to take for a stress test.

A

As a first step, we need to replicate the two database middlewares.

A

We can simply back up the databases and then restore them, and we do not need to do any desensitization of the data. Because all our data is still in the same security domain as the original system, simply backing up and restoring does not increase security risks. After the middlewares are replicated,

A

the second step is to replicate the services through the shadow service, which also automatically deploys a canary rule. As we can see, the coupon service and user service have now been replicated, and during the process we have also rewritten their connections to the middlewares through the sidecar and JavaAgent, allowing them to access the replicated middlewares instead of the production middlewares.

A

This rewriting can be done through the configuration of the shadow service or through a Kubernetes ConfigMap. For the test traffic, we add an X-Mesh-Shadow header. Any request with this header goes to the replicated services according to the canary rule we just deployed, following the orange lines, and normal user requests still go to the production services.

A

That is, they follow the blue lines. Now we have the coupon service and user service replicated, but not the verification code service, because it eventually calls a third-party service to send the verification code to the user's mobile phone. Although the cost of each verification code message is not very high, if we send a lot of requests in the test, it adds up to a big cost.

A

Therefore, we want to avoid sending the verification code. This requires the mock feature we mentioned just now, to mock the verification code service instead of replicating it. Generally speaking, we need to mock services like payment, but their implementation is complex, involving various verifications and encryption, which makes them difficult to mock. Therefore, we need to create a service in our system to wrap these third-party services. Because these wrapper services are inside our system, we can make the interface simpler by skipping a lot of the security verification.

A

So what we actually mock during testing are the wrapper services, not the real third-party services.

A

Now let's start the demonstration. I've prepared two scripts for today's demonstration, one on the left and one on the right side of my screen; the one on the right has a shadow suffix after the filename. Now I will run these two scripts.

A

We can see that the output on both sides is exactly the same. In a while I will also show the topology generated by our MegaEase Cloud system. From the graph we can also see that the two requests are processed in exactly the same way. But because MegaEase Cloud requires a little time to sync data, let's take a look at the content of these two scripts first.

A

We can see these two scripts are exactly the same, except that the one on the right carries the X-Mesh-Shadow header when sending each request. Both scripts execute get token at the beginning, because the demo system requires a user to log in first. After getting the token, they start sending the get coupon request.

A

We will also take a look at Kubernetes to check the pods. We can see four services from the pod information. We will focus on three of them: the coupon, user, and verification code services.

A

Let's also execute emctl, the command-line tool of EaseMesh, to take a look at the shadow services in the system. We can see that no resource is returned; that is, we have not deployed any shadow service yet. By now the data synchronization of MegaEase Cloud should have completed.

A

Let me refresh the page. As you can see from this picture, although the requests with and without shadow have both been sent just now, we can only see one execution path: the coupon service calls the user service and then calls the verification code service.

A

At the same time, the coupon service and the user service also access the two middlewares, MySQL and Redis.

A

Now I will deploy the shadow services. Please note that in the slides we said replicating the middleware is the first step, but for this demonstration I prepared the middleware replicas in advance, and in order to show the difference, I modified the replicated data. In practice we can just replicate the production data directly without any modification.

A

Now let's create the shadow services. We just run the emctl apply command. We can see it says that both the coupon shadow service and the user shadow service have been created successfully. Now run the kubectl command again. We can see that there are two more pods in the system, namely coupon shadow and user shadow, and if we run the emctl command to get shadow services again, we can also see that there are now two shadow services in the system.

A

However, although we see that both pods are already in the running state, it still takes a little time for our application to start, about a minute or two. So let's take this time to look at the content of the YAML file we just used to create the shadow services.

A

As we can see, there are two shadow services. The first one is named coupon shadow service and the second one is user shadow service, which are shadow copies of the coupon service and the user service respectively. As mentioned before, the shadow service supports rewriting the configuration of the middleware directly. We can also see this in the YAML: in the spec of each shadow service, we have rewritten the connection information for MySQL and Redis. In this way, we replace the middleware accessed by these two shadow services.
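
The idea of copying a service definition and swapping only its middleware endpoints can be sketched in a few lines. This is not the EaseMesh YAML schema, which the talk does not spell out: the dictionary keys, the host names, and the function `make_shadow` are all hypothetical.

```python
import copy


def make_shadow(service: dict, overrides: dict) -> dict:
    """Copy a service definition, mark it as a shadow, and swap middleware endpoints."""
    shadow = copy.deepcopy(service)
    shadow["name"] = service["name"] + "-shadow"
    shadow["shadow"] = True
    for middleware, connection in overrides.items():
        if middleware not in shadow["middleware"]:
            raise KeyError(f"{service['name']} does not use {middleware}")
        shadow["middleware"][middleware] = connection
    return shadow


# Hypothetical production definition with made-up host names.
coupon = {
    "name": "coupon-service",
    "middleware": {"mysql": "mysql.prod:3306", "redis": "redis.prod:6379"},
}
coupon_shadow = make_shadow(
    coupon,
    {"mysql": "mysql.shadow:3306", "redis": "redis.shadow:6379"},
)

print(coupon_shadow["middleware"])
print(coupon["middleware"])  # the original definition is left untouched
```

Everything except the shadow mark and the connection targets is identical between the two definitions, which is what keeps the business logic of the replica the same as production.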

A

It should be ready now. Let's execute the command and check the result.

A

Since it is a Java application, the first execution takes a few seconds longer. Okay, now the result is out. For a better comparison, I will clear the screen and then run the commands again.

A

As you can see, the difference is that the coupon name field has changed from Chinese to English. This is the result of modifying the database connection: the data is different, indicating that the two services are accessing different databases. Let's take a look at the topology of the system.

A

Now let me refresh the page. We can see that there are some gray nodes in the system, which are the replicas of the original services and middlewares, including the coupon service, the user service, MySQL, and Redis, and the middlewares accessed by the two replicated services are also the replicated ones. The only problem now is that the two coupon services, the original and the replica, both access the same verification code service, because we haven't mocked the verification code service yet. Let's do it now.

A

Run the emctl apply command again. This mocks the verification code service. Now let's execute the command with shadow again. You can see that the verification code becomes abcd, while when executing the command without shadow, the verification code is still 123456. Let's take a look at the content of the YAML file.

A

We can see that the request path is matched first, and then requests with the X-Mesh-Shadow header are matched. After a complete match, it directly returns HTTP status code 200 and the verification code abcd.
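
The matching order described here, path first, then header, then a canned response, can be sketched as follows. The rule layout, the path `/verification-code`, and the function `try_mock` are assumptions made for illustration; only the header name, the status code 200, and the body `abcd` come from the talk.

```python
from typing import Optional, Tuple

MOCK_RULE = {
    "path": "/verification-code",  # hypothetical path
    "header": "X-Mesh-Shadow",
    "status": 200,
    "body": "abcd",
}


def try_mock(path: str, headers: dict, rule: dict = MOCK_RULE) -> Optional[Tuple[int, str]]:
    """Return a canned (status, body) when path and header both match, else None."""
    if path != rule["path"]:
        return None
    if rule["header"].lower() not in {name.lower() for name in headers}:
        return None
    return rule["status"], rule["body"]


print(try_mock("/verification-code", {"X-Mesh-Shadow": "true"}))  # (200, 'abcd')
print(try_mock("/verification-code", {}))                         # None
```

A `None` result stands for "not mocked": the request goes on to the real verification code service, which is why the script without the shadow header still receives 123456.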

A

Well, now that all our preparations are complete, let's actually conduct a stress test. Because it is a demo environment, don't expect particularly high performance.

A

Let's change this test script and replace the last get coupon command with an ab command. Let's use 10 concurrent connections and send 2,000 requests to see what the performance of this demo system looks like.

A

A little bit slow; maybe I should have sent fewer requests. Finally, we get the result: 125 requests per second. This is a system that needs to be optimized for performance.
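
The three numbers quoted for this run (2,000 requests, 10 concurrent connections, 125 requests per second) determine two more figures. The small calculation below derives them; the variable names are illustrative, and the latency figure is an average implied by the throughput, not a value read from the demo output.

```python
requests_total = 2000
concurrency = 10
requests_per_second = 125

# Wall-clock duration of the whole run.
elapsed_s = requests_total / requests_per_second

# With 10 requests always in flight, each one takes on average
# concurrency / throughput seconds to complete.
mean_latency_ms = concurrency * 1000 / requests_per_second

print(elapsed_s)        # 16.0
print(mean_latency_ms)  # 80.0
```

So the run lasted about 16 seconds, and each request took about 80 ms on average.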

A

Now let's check the execution path through the topology graph. Since our topology graph aggregates data based on time, I need to adjust the time range a bit to use only the data from after we applied the mock.

A

As we can see now, the line from the replicated coupon service to the verification code service is gone, indicating that there are no calls between them. That's all for our demo today.

A

Back to the slides. What advantages does our shadow service have over traditional testing methods? I think there are five. First, zero code changes: everything is done through configuration, so no code modification is required and no new bugs are introduced. Second, low cost: when using cloud servers, the hardware resources used for testing can be requested before the test and released afterwards, so we only pay for the actual usage period.

A

Third, a clean environment: except for a few services that are mocked, the test system is completely consistent with the production system, which avoids errors caused by differences in business logic to the greatest extent. Fourth, true data: the data of the test system and the production system are completely consistent, which ensures the reliability of the test results. Fifth, security.

A

Although the production data is used in the test, the test system and the production system are in the same security domain, so there is no increased risk of data leakage.

A

That's all for today's sharing. You are welcome to follow our open source project on GitHub.

A

You are also welcome to join our open source community. Thanks.

A