Description
Canary deployment is a helpful tool that allows companies to put multiple versions of their products into production and control flow and access based on different sets of rules, clients, amounts, and operations. In this Kong Summit 2020 session, we discuss best practices for canary deployment and share our experience using Kong Enterprise — supported by Kuma capabilities — in achieving this.
Learn more about Kong: https://bit.ly/2I2DypS
A: We have different clients, from big banks to small startups, some of them in healthcare, all from different fields, but all of them have something in common. What they have in common is that they have a product in production that doesn't accept any downtime, and they need to adapt really fast, because everything changes. Whenever they need to do that, they have different approaches, but it needs to change fast, it needs to change easily, and the user shouldn't realize that something has changed. So what are the business cases for this?
A: First of all, they need to release experimental features to users and see if those increase sales, because someone says, OK, let's try out a new dashboard, let's try out something new, and see whether that increases sales or not. Or maybe what they are doing is just changing the infrastructure behind the product while trying to keep users from noticing that something changed. So the user shouldn't see any change, but the infrastructure should be better.
A: Maybe they are releasing a new feature for beta testing, or maybe they are just releasing a new feature to a small set of users. So what kind of deployment comes in here? That's where canary deployments come in. As you can see in this picture, a canary deployment is a pattern for rolling out releases to just a subset of users or servers, and letting them be the ones that try out the new version.
A: So, as you can see here, there are a lot of users, but those users can be either in treatment A or in treatment B. Whenever they go to treatment A, the 90 percent, they are just served the real application: the application that's been working, that we know works, where everything is OK. But whenever they go to treatment B, that's where we are trying out the new features.
A: Maybe you can just create an early-adopter program, or do what's called dogfooding, which is: OK, we are the ones that are going to try out our own product and see if everything works. If it works well for us, it should work well for the rest of the people. So, coming up next: what are the benefits of this? First of all, as you can see in the next slide, there is A/B testing: you can use canary deployments to do A/B testing.
A: Let's try out these two versions and see which one performs best. The next benefit is capacity testing: maybe you're rolling out something that might change a lot and might change response times, so when you do canary, you can stress a small part of it and see whether it works as expected or not.
A: The next one is feedback, because whenever you roll out something new, you expect feedback from real users, not just a UAT user or a testing user. You want real people to give you feedback, and you can roll it out right away.
A: Then we have no cold starts, because you are already running two versions: whenever you are confident in the new one, you just switch to the new one and start using it. You don't have to stop everything and roll everything out again; you just keep working with that one. We have two more: no downtime, as I just said, and the last one, which is about what canary is not.
A: Canary is not a blue/green deployment, which is something meant for a load balancer or similar: you just create two instances and move from one to the other. It's also not a rolling deployment; you don't just say, OK, let's start destroying the instances running the old version and moving to the new one. What you are doing with canary is you have two versions and you're trying things out on both of them.
A: So that's just the theory; let's jump into our show time. What we have here is a real-life example. FoodX is a mobile ordering solution that allows you to download the application or enter the web app, create an order for a restaurant, and pay through the application or the web application; the order goes straight to the restaurant, and after that you can retrieve it and share it with your friends.
A: So, as you can see in the two images here, you have different places in the places-nearby view, and when you provide an address, you can also get the places nearby that address. As Fede is going to show in the next few slides and in the demo, this is something we were having trouble with in one version.
A: We created a whole new version of it, but it was such an internal change that we needed to try it out not only in testing but also with real-case scenarios of what was going to happen, and that's where canary comes in. So, Fede?
B: Thanks, Nico. OK, based on what Nico just mentioned, here we have an architecture diagram covering all the FoodX business needs. As we can see, it's pretty straightforward. We have a backend on managed Kubernetes on a cloud provider, in this case Google, and then we have two kinds of applications: one is web-based and the other is the mobile version, and they are consuming services through a Kong Ingress Controller. The Kong Ingress Controller is the one responsible for exposing those functionalities to the outside world.
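To make that exposure concrete, here is a minimal sketch of how a service could be published through the Kong Ingress Controller on a 2020-era Kubernetes cluster; the `bff` service name, port, and path are illustrative assumptions, not taken from the actual FoodX deployment:

```yaml
# Hypothetical Ingress exposing the BFF through Kong.
# The kubernetes.io/ingress.class annotation tells the
# Kong Ingress Controller to handle this resource.
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: bff
  annotations:
    kubernetes.io/ingress.class: kong
spec:
  rules:
    - http:
        paths:
          - path: /
            backend:
              serviceName: bff   # assumed Service name
              servicePort: 80    # assumed Service port
```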
Then we have what is called the BFF. BFF is a pattern; the letters stand for backend for frontend. It's a pattern which allows you to have a tailored experience for each of the frontends you have in your architecture. And then we have the backends. The backends are also built in Python: the BFF uses the Flask framework, and the backend is built upon Django, storing the main data in a SQL database — a relational database as a service.
B: All of this architecture is being observed with Grafana, so we're going to see live metrics in Grafana of what's going on with these components. Then we have the problem with the first backend: as Nico mentioned, it was really slow. That was because all the processing before saving the data was being done on the backend side, and that brings a lot of issues.
B: So now we are going to set up a whole new version, backend v2. It is going to be the same stack, the same technology stack, but it does the processing near the database layer, in PostgreSQL itself. To accomplish this shifting of the traffic — sending ninety percent to one version and ten percent to the other version — we are going to use Kuma.
B: What is Kuma? I think you all should know about Kuma by now: Kuma is the Kong service mesh solution. As in any other service mesh technology, Kuma has the concept of a control plane, which is responsible not only for managing the data planes, but also for configuring them and observing what's going on at all times; it's like the brain of our service mesh. And then we have the data planes, which we can see here in the image at the bottom.
B: The data planes are the Envoy proxies that have been deployed as sidecar containers next to our application containers. Given this configuration, this sidecar setup allows us to observe all the incoming and outgoing traffic of our applications, and therefore we can configure security policies and routing policies. We can do canary, as in this demonstration; we can also do blue/green deployments; and we can observe all the applications — I mean, we can grab all the telemetry.
B: I mean logs, metrics, and traces, and centralize them in another backend storage. What happens if we don't have a service mesh to accomplish this need? Well, in that case we would need to put some business logic inside the application in order to know: OK, this traffic has been sent to me.
B: So it's not my turn to respond, so I have to drive the traffic to the other box. And that's not the ideal scenario, because it doesn't scale when I have to implement a lot of particular rules there; it's not the best way, not a good approach, to accomplish that. So, given this scenario, we're going to show the demo now. Here, in our cluster, we have — let me check.
B: We have the applications in this namespace. We have the BFF, as we told you, and we have the two versions of the backend. Of course, the second version of the backend is not receiving traffic at all at first, because, you know, the traffic isn't there: we didn't configure anything yet on the mesh side, on the mesh layer. So all the traffic is going to the v1 version.
B: We also have the ingress here, the ingress controller, and now we are ready to show you what the application looks like. The main idea of this whole demonstration is to show you that it is a seamless transition for the end user: users are never going to realize the changes on the application side. Here we can see what happens if I reload the application.
B: Now we are going to show you how, in Kuma, we can configure that traffic routing. It's a YAML — like a Kubernetes object, but from the Kuma API — and it is called TrafficRoute. What it says here is: OK, all the traffic that matches — that is coming from this source component observed by the Kuma service mesh and going to this destination component — send it all to the first version. We have here the version label that specifically says it's version one: send all the traffic, with a weight-based, probabilistic traffic setting, to the first version, and zero to the other version.
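As a rough sketch of what such a TrafficRoute can look like, assuming a Kuma 1.x-era schema; the mesh, service names, and version tags (`bff_default_svc_80`, `backend_default_svc_80`, `v1`, `v2`) are illustrative stand-ins, not the actual resource from the demo:

```yaml
# Hypothetical Kuma TrafficRoute: send 100% of the traffic
# from the BFF to version v1 of the backend, 0% to v2.
apiVersion: kuma.io/v1alpha1
kind: TrafficRoute
mesh: default
metadata:
  name: backend-canary
spec:
  sources:
    - match:
        kuma.io/service: bff_default_svc_80      # assumed source service
  destinations:
    - match:
        kuma.io/service: backend_default_svc_80  # assumed destination service
  conf:
    split:
      - weight: 100
        destination:
          kuma.io/service: backend_default_svc_80
          version: v1   # assumed version tag on the v1 data planes
      - weight: 0
        destination:
          kuma.io/service: backend_default_svc_80
          version: v2   # assumed version tag on the v2 data planes
```

Starting the canary is then just a matter of editing the two weights and re-applying the resource.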
So what happens if we change this? I'm going to run a script here to call the API, and we can see the result now in Grafana.
B: We are seeing two dashboards here. We configured a dashboard with the total requests, as a rate over 30 seconds, and then we configured the request time per service, on the BFF side. I mean, if I look at the BFF and I ping the BFF every couple of seconds, I'm going to see: OK, it's taking about nine seconds to bring all the data together to the view, right? This is based on percentiles, and the worst percentile, the p99, is showing that it's taking almost 10 seconds to accomplish that request.
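For reference, here is a hedged sketch of those two panels expressed as Prometheus recording rules; it assumes the mesh's Envoy sidecars are scraped in Prometheus format, and the metric and label names (`envoy_cluster_upstream_rq_total`, `envoy_cluster_upstream_rq_time_bucket`, `envoy_cluster_name`) are standard Envoy stats names that may differ from the demo's actual dashboards:

```yaml
# Hypothetical recording rules approximating the two Grafana panels.
groups:
  - name: canary-demo
    rules:
      # Total requests per service, as a rate over a 30s window.
      - record: service:request_rate_30s
        expr: sum(rate(envoy_cluster_upstream_rq_total[30s])) by (envoy_cluster_name)
      # p99 request time per service, from Envoy's request-time histogram.
      - record: service:request_time_p99
        expr: >
          histogram_quantile(0.99,
            sum(rate(envoy_cluster_upstream_rq_time_bucket[30s]))
            by (le, envoy_cluster_name))
```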
A: So, Fede, what you just did there: you changed the rule to say, OK, start sending some of the traffic to the old version and some to the new version. Almost all of our users right now are still going to the old version, but 30 percent of them, or something like that, are going to the new one — and it's not exactly 30 percent, because it's probabilistic.
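Under the same assumptions as the TrafficRoute sketch above, shifting that share of traffic is just a change to the split weights, for example:

```yaml
# Hypothetical updated split: roughly 70% of requests stay on v1,
# roughly 30% go to the v2 canary. The weights are probabilistic,
# so the observed ratio will only approximate 70/30.
conf:
  split:
    - weight: 70
      destination:
        kuma.io/service: backend_default_svc_80
        version: v1
    - weight: 30
      destination:
        kuma.io/service: backend_default_svc_80
        version: v2
```

Rolling back later is the same operation in reverse: set the weights back to 100/0 and re-apply.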
B: That's right. Because we moved to the new version which, as Nico just told us, has that massive processing layer being done in another layer, which is the better approach. So in this case we see that the times are getting better, and we see here that there are requests coming to both versions.
A: So none of the users realized anything, apart from getting a performance improvement. The data is still the same; everything is still the same. The only thing that changed is that everything works faster now, right? And if something was not working right, we could just roll back and go to the previous version, because it's still there, still running in the backend, and if something comes up, you can just roll it back.
B: OK, that delay of a few seconds is because it's going to a database in another region to grab the data, and we have latency between regions — acceptable for the demo, not for production, of course. And then we should see here that, OK, the time decreased considerably, and the yellow line started to receive all the traffic; I mean, all of the request rate is going to v2.
A: Awesome, great. Thank you, Fede, that's really cool. So Fede has shown us how canary works, and canary works almost every time, but there are a few things that you have to keep in mind; let's go through them. The first one is: do not over-rely on this, because it does not effectively mitigate the risk of silent defects.
A: What this means is, for instance, if Fede just introduced a new defect into version two, you would be trying that one out in production. So don't over-rely on this; also try it out in testing, try it out everywhere, because you are pushing code to production, right?
A: This is only possible when there are no contract changes, because if something changed and the UI needed a different response, it couldn't be as easy as it was here: you cannot just switch from one version to the other. It also increases the complexity, because, as Fede just showed us, he was just pushing something to the automated deployment mechanism and it was being deployed.
A: But you have to keep in mind that, apart from the routes, the ingresses, and everything else, you have something else: the route between the services. You have to keep that in mind too; if not, something moves and you don't know why. Apart from that — and this is really important, as Fede has shown us — you have to be looking at what's being changed, you have to be measuring the changes, you have to have a lot of metrics on this.
A: If not, it's just something that's been changed and you don't have the ability to see whether everything is working OK or not, if that makes sense.
A: This is one of the most important ones: database changes can present a problem. For instance, if you have a version that's faulty, then the database could end up with changes that might be corrupt — and that data is corrupt even for the latest version and the previous version. Every single version is looking at the same database, and if anything works in a way I don't expect, the database ends up corrupted.
And the last one, but not least: make sure you're routing the traffic through the mesh, not straight to the load balancer. What this means is that you are sending all the traffic through Kuma in this case, so if Kuma is the one taking care of this, it will be able to apply the canary. If the traffic instead goes straight to the service, you are not going through Kuma, and none of this will work.
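As a minimal sketch of what "going through the mesh" means on Kubernetes: Kuma injects its Envoy sidecars into annotated namespaces, so traffic between in-mesh services traverses the data planes where the TrafficRoute is enforced. The namespace name below is an illustrative assumption:

```yaml
# Hypothetical namespace with Kuma sidecar injection enabled.
# Pods deployed here get an Envoy data plane, so service-to-service
# traffic flows through the mesh and canary routing can apply.
apiVersion: v1
kind: Namespace
metadata:
  name: foodx          # assumed application namespace
  annotations:
    kuma.io/sidecar-injection: enabled
```

A client that bypasses the mesh — for example, by hitting a pod IP or an external load balancer directly — never reaches those sidecars, so the weights have no effect.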
A: One last thing that's also important is that you have to have stages: you have to keep in mind the duration of this, you have to keep metrics for this, as we said before, and you have to keep evaluating it. What this means is you have to plan. You can't just say, OK, let's go, spin things up, do a canary deployment, and everything will work, right? It might not.
A: We would like to thank you all for joining us. Thank you, Fede, also for the whole demo. We would like you to stay tuned with us for the questions right after this session. So thank you all for coming; I hope you enjoyed this as much as we did creating it.