Kong Kong Summit 2020, 10 Dec 2020

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: How iFood Improved App Resilience With Kong from the SRE Perspective

Description

iFood is the largest food delivery app company in Brazil and is using Kong to deal with all incoming traffic of its microservices. All of the workload creates a perfect yet challenging environment for SRE work, including developing automation, monitoring the system and helping improve resilience. In this session, we will show the environment and techniques we used to improve iFood stability.

A

So if I have to give a piece of advice to anyone that will be, things will break site. Reliability.

A

Engineering like us must remain calm and be prepared to act when outage is going on, and that is fun for us. Okay, maybe not that fun hi, I'm leandro carneiro.

A

I am part of as a part of the srn team of ifood and when I'm not dealing with outages or coding some piece of software to help us to be prepared. I like to drink some beer and walk my dog out.

B

Hi everyone- I am jean marcus, I'm part of the sr team at thai food chew. I love the animes pets. I have two cats and containers world.

B

So now I'm talking a little bit of ifood for those uh who have never heard about ifood, it's like a food delivery, app.

B

We are present in latin america, uh brazil, colombia and mexico, uh and in brazil we are the number one uh and, with the beginning of the current team caused by kovind, a thousand of restaurants are relying on a future survivor.

B

We are playing an important role here, making our systems reliable to ensure continuity of their business. uh Okay, uh talking quickly about of the future, you code, it you want it. uh So it means that the software engineers not just called the main functionality of their microservice, but also all the piece that make the micro service work, suck us terraform code for provisioning infrastructure work on emus file to expose the any point in points to the world.

B

There is an another good rule at thai food that says, give as much autonomy to the software engineers as possible. It means we should create tools to make it easier to push code to the production environment, to be able to give this autonomy. Our tools must also have some very derails to helping the engineers on their delivery path.

B

Okay, that's it's! uh That is our agenda. We will discuss some talks uh like uh first, uh we will start talking about how I food fits how kong fits on my iphone at flu. Then we will we'll give a thinning overview of the micro service behind home. After that we want to expose how we could increase the resilience of the platform touching in the microservice code, just bitcoin.

B

We were also talking about what we call a safe mode. It's a plan b when things go wrong and you finish we'll. We will give some look at monitoring. Conga.

A

Okay, so kong at iphone.

A

Here's a graphical view to be easier to demonstrate how kong is a thing at ifood when users are hungry and they want to buy some delicious hamburger. They open the ifoot app on their phones and they fire requests to our platform.

A

Before the request goes through the platform they they are received to the web, application firewall so waff or and the cdn, and then it enters to our platform. As you can guess about the icons on the graph, we are using aws, so congee, the external load balancer for kong is our first barrier from those incoming requests, but kong also can handling uh the requests that goes from one micro servers to other. It's not really a need.

A

We don't need kong between one microsoft, one microservice to other, but it's really nice to to have it and we're gonna talk about why so the same cluster of kong k handlers, they request north south requests and east to west uh requests, and we use like two fqdns in kong cluster to handle it. So one is for the the internal requests and the other it's for the external traffic.

A

So, okay, let's talk about some numbers, so here at ifood we have like 16 productive clusters. Think about 16 control planes. We are using ec2 angst instances under old scaling group with two load balancers as much as before.

A

A single cluster of kong can have up to 50 instances and more than 300 house, and some about 100 services and 400 plugins configured on it. So it's really massive for us. Of course, all those clusters are handling about 270 000 requests per second, with peaks of one more than 100 thousand requests per. Second, when people want to buy some dinner.

A

Okay and to all it worked how it works. Our ec2 instances have some pro meteos exporters uh file bit to ship the kong logs directly to elasticsearch and also chef client. So we are using chef as configuration manager here, our cookbooks from chef installs kong at the latest 2.1 version and some custom plugins.

A

We are using kong in databaseless mode. It means chef, also clones, our git repository with all the ymo files that kong needs to work properly.

A

We are not using deck yet so our chef cookbook merges all the ymo files into one interpolates, the variables, with sensitive information in posts to kong, slash config endpoint to ensure all the configurations are correctly loaded to cong.

A

We create some custom health check, endpoint that compares what was declared with what is loaded to kong, but all this future of you code that you want it.

A

All this future may be a little bit challenging for software engineers to also have to configure the yaml files for kong. You know we have a lot of emo files, so a lot of ways to people have some mistakes or misconfigurations, so think about that. We code, two pipelines. The first one is the validation pipeline.

A

It runs every pull requests and does some basic validation tasks such as linkedin tag enforcement uh check for install it versus declaration, plugins or duplicated routes or any resource names, and we also have some particular con. Configs uh validation such as rejects priority here at food. We use the rejects priority like every is less at this, the health, it's a plus one, to reject priority.

A

So if, if you have like slash orders, slash order, you you id it's going to be rejects priority of two: it's not uh really necessary to configure it, but we want to to be sure how the behaviors will be. So we enforce that. We also have the the preserve host enforcement so that that's really a nice one. Sometimes we put kong in front of f3 buckets and you know the f3 s3 buckets with websites configured on it.

A

They must to to have some uh some host name to to how to the request to it so based on some regular expressions we have here, we can apply the hijacks to the fqdn of the kong service host and figure it out if the reject priority is false or through. Okay, true is okay, but it must be false when you're talking about some s3 website buckets or some other uh layer, seven thing or proxy. That's handling the request so, based on that on the rejects, we enforce all their routes.

A

For that service to be preserve, host false and after the the pull request is approved and all the validations is made, we generate a git tag. This git tag it's used to a deployment pipeline, it's like a canarish deployment pipeline. It also have a auto row. Bag.

A

As you can see in this picture over here, we are using. You know, jenkins for now, and we have like in a highlight this type of run, shaft only in one instances one instance, and it means like a canary. We deployed the the change to one instance and we check if it's okay, if it's okay, like in the pipeline, we skip the row back uh step and go to commit and pub in publish and push all the changes to shaft and to all instances in the cluster.

B

Okay, uh we have talked uh a lot about kong. Let's.

A

B

Peek, uh how am I food micro service? Look like food, so in general, microservers, which are not just uh qr processors, looks like this designation. They are it's any point method to call. They are written in a variety of languages, sucks go rust, java and so on.

B

The microservice can publish messages to sns stocks, interactive with database write files, qesters buckets, consume, models of stage makers and what else in general, they are regular, microserves. Nothing really things here.

B

So uh there is one flow we will talk more about when a microservice receives a request. Let's say a new incoming order was placed the microservice. After all, the validations will create a new entry in the database. This information can be carried uh sometime in the future. No news here, as we know, as much as we have some read, replicas nodes are carried to a database, can be expensive, sometimes so, if microservice, when it's possible after writing, to new incoming order entry to the database, it also writes a file show estrous bucket.

B

We call all the data into database as hot data. They are a single point of turf and today, at the asterisk bucket, are to call it a quote data. It me it may be all taste outdate, sorry outlet, but they are still there.

A

Let's talk about how kong can help us to have some app resilience, improve it by using secret breakers and fallback.

A

So this is a review how kong works in a basic way. This is also a picture of our past. So on the left side we can see a user is requesting a particular order. Uuid.

A

The request goes to kong and kong matches the request with one of its house, the house points. The service cong does the proxy to orders that food at bars and goes back and that's awesome. It works. But what, if orders, the food that buy goes down so congruent return? Some 500 er to the client- and you know- that's not resilient.

A

Okay, so we could, you know, duplicate the orders.food.bar into orders, slash fallback.food.bar and use the upstreams and targets to split the traffic to 99 percent of the times goes to the the order servicing and the remain trafficking goes to the orders, fallback and yeah that that's idea that it's right, the idea is right, but we are missing one important part of this concept: the health shacks and at kong we have like two types of health checks.

A

We have the active health checks so think about aws elastic loads, balancers health, shack, the active health checks at kong works really close to that it keeps probing the service to get the status of the healthiness. So is it alive? Is it alive every time it keep is to probing the help check in it? It's it's. Okay, it's uh a way to do it, but kong also have the passive health check in it's really great stuff, so kong keeps tracking about the http codes of the microservices is responding to the client.

A

So imagine if cong receives a lot of 500 errors, marcus the the the target as unhealthy- and that's that's really.

B

A

It can help us with our task, so let's go back to the picture over there.

A

And yeah, so now we can see. uh I think it's the the other picture.

A

Okay, now we can see uh we have the health checks in green over there. We are not exposing all the the configuration because they are a little bit uh bigger to the screen, but they are there and the concept are there. So the concept is there, so kong now can understand when orders the food.bar goes down, because it receives a lot of uh 500 errors or some tcp failures and announced, and it can shift the traffic to others that are there's a slash fallback that food.bar.

A

So it means what what I mean is kong are handling the the traffic and proxy, but others that food up bar goes down and okay kong understand it because of the the http codes and stops uh the communication to that servers that that service or that uh that target right.

A

So after that kong will in the in the shadow in the behind the scenes, will probe the orders, the food.bar service until it gets healthy again and when it does kongs enables or restores the communication to orders the food that buyer. So the fallback is going to be active, 100 percent of the time when orders that food.bar is not so we have happy resilience there. It's really nice, no okay, but we still have some room to improvement.

A

So we have some challenges here at the passive health shacks kong also skipped tracks of tcp failures and timeouts to figure good thresholds to it was a little bit challenging for us and took us a lot of extreme experimentations and load tests to figure it out. What is a good threshold to its service or a target.

A

Another thing to keep in mind is at kong upstream. The health check part is the same part of the service part, so this is can maybe not true when using like uh actuator from spring booked activator uh frameworks like that they may sometimes have another part to get the the the healthy check, endpoint and another challenges.

A

From this example, we still have like duplicated resource to handle the order service and the order slash fallback services right. We also sometimes when we are not using a duplicated uh service.

A

The path may be not the same, maybe not matches the main app path, so they this can be a little bit challenging when using uh the secret breaking in congo, but for those challengings we may have some solution.

A

One alternative to the challenges uh can be using kong itself as the fallback target. It may be seen a little bit crazy, but let's check it out. So when you're using kong, as as the fallback target, we can use like the request, transformer plugin or the response, transformer plugin or any other plugin to kung works and response and deal with the request as the same uh way of the main app does. So we can use kong as the the fallback app so duplicator resources are not needed anymore.

A

So you know when I, when we said our uh microservice, they write data to the database and also to ask free buckets and you call the heart and the data hot and the coded data. So we can use kong to get this code data and when the main app is not working, so imagine that we can use cong and request transformer to get this code data or we can use.

A

I don't know- maybe some aws lambda plugin to trigger some lambda or get you know post the the the request body when it's supposed or put to some quill and the skill will be processed later. Anything like that will improve the app resilience.

A

So, okay, here's a a picture of what we are using at ii food. So we have the main app. It's still the orders of food dot bar. We still have the get. Slash orders, slash order you you id, but now the target the fallback target that handles like 0.0.1 of the traffic.

A

uh It's going to be kong. You know it's going to be the loca host column, 8, 000 uh and kong. Have this particular halt when we are calling kong fallback orders and is going to use the service as the s3.

A

So when orders dot food dot bar goes down kong handles the request, transforms it and goes to the f3 bucket and gets the code data and that's awesome- that's happy resilience we would to get without to change the the code for the microservices.

A

Right, but still things can go wrong. Things can go down when you know maybe the out date uh data, it's not enough for us. No, you have. We have to have some plan b, and that is what we call safe mode.

A

So, as we mentioned it earlier, we are using this git ops fashion way of do this stuff, so the changes flows is described in this image over there. So the engineer opens a pull request. The pull request goes through the validation pipeline. It gets validated and gets the reviews and eventually it gets emerged and a git tag.

A

The git tag is used to the deploy pipeline, that good, that does the camerish style of deploying updates the shaft and roll out all the all the changes to all kong instances. But you know in the outage it may be a little bit slow process to get these uh changes to the the colony. Although this is a really good flow, you know you have uh changes that are homologated by the pipeline and you have the reviews.

A

It is easier to roll back because you just need to deploy the the latest latest tag. It is have some documented changes, so you have like uh auditing because you know whom change it uh when and when, and we have like this scanner we mentioned before, but still a little bit slow process.

A

To have some shortcut we could use like you know, cog manager or a database mode with control plane in data plane. That's a new feature and it's really good, but we would to to still have this git ups way off do the stuffs, so we created a homemade solution to to do this shortcuts for us.

A

And that's what it would like, so the safe mode. Our plan b consists this of a predefined scenarios such as enabling uh rate limiting in out x, or, I don't know, disabling health uh in how to you know something like that. These predefined scenarios they lived at git as well, and the sr team implemented a single page application that renders the status of the scenarios retrieving all the information from the the chef and also from the git when a safe mode is enabled a request, goes through the back end of our solution.

A

The back end talks to the chef chef triggers a job to update all the kong instances and also gets posts, a message to a queue that a work or work worker will to consume the message, update, gits and also notifies the users at slack, so the engineers could run for it and check. What's going on.

A

Remember the single page, application, backend and worker were developed by the sre team. So please do not judge the layout it's ugly, but it's working. So we have all the switches over there. You can see that kung davi proxy. We have all these uh pre defined safe modes. So imagine here we have just the discovery legacy 25.

A

It means like 25 of 25 percent of the traffic. It's going to our platform at this con cluster, but we could have like rate limiting at acl route or something else just hit the save button and ta-da you have the the safe mode enabled.

B

Okay, guys uh as good as srv, we should talk about monitoring.

B

Okay, uh the frequently asked questions about kong r. uh How many 500 class errors does come closer are responding? uh Is the fallback app really working it's? It is working correctly to have answers to those questions. We use the congo prometheus book plugin.

B

uh Some metric is also, I don't know, such as kong, hpvp status and kong bandwidth, and they can help us to know if the service is respond and how it is responding.

B

From combi 2.1 version, con upstream's target healthy was added, helping us helping us to understand how the targets of the pristines are, and eventually you can plot graphics like this. In this case, uh it's showing the traffic by hpp response codes.

B

So in this screenshot we can see some graphics requests per second and http code, grouped by code.

B

Or these ones at the bottom, we can see the yellow peaks when the main app is unhealthy, as show in the upper graph.

B

So uh talking about of these two metrics, the metric hpp status can be useful to understand what is the percentage of 500 class errors in a particular route or serves the match. Chris kong upstream started healthy. I'm talking about this mattress because it was called by carnero. No I'm joking.

B

This magic is exposed the healthy state of um others or target of an upstream. This could be healthy, unhealthy then, as ever, it means once the the look up of the targets. Fk then fail and the last one uh health checks of uh once the health checks uh of the upstreams are not configured, so it means uh that the host order, dot, foo dot bar may have two ips 10, dot, 10.10 or uh and sorry 10.10.10.

B

A target is market as unhealthy when our all others are market. As unhealthy.

B

Okay, uh this is an example of a rule of promises to monitoring the percentage of ears in the target.

A

Yeah, the kong upstream target also can help us to alert, for example, when the upstream is added to the production environment, but it's not configured with the health check. You know so it can alert on it. So it says: oh, the.

B

A

Is off so you can see like things like that.

B

Yes, good good catch and.

B

Okay, this is an example of a notification received on his leg, uh so uh the promiscuous real fighter, the the alert. Then we receive it and is like so it's a simple integration to to make visible the the.

B

Alerts so that's it.

A

Yeah, that's that's all. We want to show you guys thanks for watching, let's go to the q and a part.

B

B