From YouTube: Reshaping Authentication to Improve Resiliency
Description
Decoupling a legacy application into a microservices architecture is always a challenge. Decoupling an application that has 70 million users and 500,000 requests per minute without impacting uptime is an even bigger challenge. In this talk from Kong Summit 2019, OLX Lead Software Engineer Rodrigo Orofino shares how OLX reshaped its authentication mechanism, making its platform more elastic and resilient.
Hello, thanks for coming. In this talk I'm going to tell you how we improved the resilience of our platform. OLX is one of the biggest sites in Latin America; we get loads of data and a lot of users.
First, let me introduce myself. I'm a software engineer; my name is Rodrigo Orofino and I've been doing this for almost 12 years now. I'm from Brazil, so sorry about my accent. I live in Rio, the host of the 2016 Olympic Games, and I work at OLX.
What is OLX? OLX is a marketplace: a place where you can sell used stuff you don't need anymore, and where you can also buy stuff for a better price, just like Craigslist in the US. OLX is a joint venture of two of the biggest investors in the marketplace business, Naspers and Adevinta. Adevinta is the lead investor of Leboncoin in France and Avito in Russia. Naspers is the lead investor of OLX in Europe, across whole countries and continents, and also letgo in the US. They are normally competitors, competing around the world, but in Brazil they came together to be OLX, and because of that we have the biggest marketplace; we practically don't have competitors.
25% of our items are sold in the first 24 hours, and almost 65% are sold in the first week. We are also the number one place to sell your car in Latin America, and also your phone. We don't trade in our phones the way you guys do here: in Brazil, you don't really buy a new phone without selling your old one on OLX.
It's a little bit late in Brazil right now, closer to 9 p.m., but since I started talking we have probably sold more than 100 items; we sell more than 50 items per minute. And the way we keep our platform so fresh is that we get a lot of new ads every day: a little bit more than half a million new ads daily.
OLX had a proprietary backend written in C by another Adevinta company, a Swedish company called Blocket. They started building it in 1996, so when OLX started, it was almost 15 years old. As you may notice, we were handling sessions pretty badly: they were stored in memory by the PHP app. At the time, that was not a big problem, because most of our features didn't require logging in. We didn't have chat or favorites, premium galleries, recommendations, nothing like that.
You basically just needed an account for selling an item: you would log in or register, put your ad there, and then you could log out; we didn't need the session anymore. But this started to change when we started growing. When Naspers and Adevinta came together to be OLX, they decided to get into the mobile market, so they asked the tech team to build a new mobile solution as fast as they could. So they did it: the tech team put together a new API and built the clients.
This API was talking to the same monolith, so it was using the same session strategy. Did that solve the problem? Nope. It turns out we didn't have features that were good enough for the mobile market, so we started writing new features, got new product managers, and they started thinking outside the box. We kept going and developed a whole new set of features, and this time a lot of the features did require authentication. So when they hit production, we had another problem, and this time the problem was a little bit bigger: scaling was complicated. Why exactly? We were still storing sessions in the PHP app, so using a load balancer was not as easy as it should be, and because of that we kept scaling vertically. We had an on-premise data center, and every time we needed to scale we just bought new hardware and put it there. This was not a good strategy, and at this point we needed to move to the cloud; we needed to keep things going.
We were spending a lot of development effort, and hiring people is as hard in Brazil as it is here, so we were wasting good engineers managing infrastructure. We decided to move to the cloud, and there we couldn't do that anymore. So what did we do? We decided to use sticky sessions, or session pinning.
Well, it's basically a strategy to make sure the load balancer will proxy all requests coming from one specific user to the same server instance. And how did we do that? We basically pin the session to the user when they log in: when the user logs in, we save a cookie in their browser, or something equivalent on mobile web, and now every time the user makes a request and it goes through the load balancer, we redirect it to the same instance again. Easy, right? Nope. We had just traded problems; in a few months we had new ones.
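The pinning described above can be sketched in a few lines: on the first request the balancer picks an instance and records it in a cookie, and later requests with that cookie skip normal balancing. This is a minimal illustration, not OLX's setup; the instance names and cookie key are made up.

```python
import random

INSTANCES = ["app-1", "app-2", "app-3"]
STICKY_COOKIE = "srv"  # hypothetical cookie name


def route(cookies: dict) -> tuple[str, dict]:
    """Return (instance, cookies-to-set) for one request."""
    pinned = cookies.get(STICKY_COOKIE)
    if pinned in INSTANCES:
        return pinned, {}                   # already pinned: same instance
    chosen = random.choice(INSTANCES)       # first request: pick any instance
    return chosen, {STICKY_COOKIE: chosen}  # pin it via a cookie


# First request gets pinned; later requests with the cookie reuse the pin.
inst, set_cookies = route({})
assert route(set_cookies)[0] == inst
```

Note the failure mode the talk describes: if the pinned instance disappears, the cookie points nowhere, the user is re-routed to a fresh instance, and the in-memory session (and login) is gone.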
Sessions used to last a lot longer on mobile; people are not used to logging in every time they need to use the app. But we were still using the same strategy, just sticking the sessions to the users. So in a few months we got a lot of bad reviews on the App Store. It was a problem: business was not happy about that, and product was not happy either. We had just developed new problems.
The problem was that sessions were being lost every time one of the servers was terminated. Users that were attached to that server just got logged out when the server went away, hence a lot of negative reviews. In the cloud it should be easy to scale out and in, so we were always putting new instances into the cluster and removing them, and every time we removed one, everybody pinned to it got logged out. Big problem. Increased traffic was also hard on our servers, because we could not rebalance the requests.
A
When
we
put
a
new
instance
in
production
in
the
closer,
only
new
users
would
be
redirected
today,
so
we
could
not
reach
distribute
the
requests
evenly.
So
it
was
not
really
Alaska
by
the
by
the
way.
We also had a huge codebase. Everybody was working on the same repo, so everybody had to use a dev box. By this time we had a lot of external dependencies, with people using different stuff there, so the developer experience was bad: you couldn't run the tests on your own machine anymore. You had to run them in our Jenkins, and the tests would take like two hours to finish, so a long feedback cycle. You could finish your feature, try to put it in production, open your merge request, and discover two hours later that it would break something. We also had a bad cycle time. Why? We were deploying only twice a week, and normally one of those deploys would be held back or canceled, because one team would find a problem and it would be easier to just take everything out.
They also decided to start using the T-shaped engineer model, so we would have cross-functional teams handling the whole development lifecycle: we would think it, build it, and run it. One of the first teams ever created was the Accounts team, with the mission to decouple the authentication and account endpoints from the monolith. Using an incremental, strangler-style strategy, the team would remove its parts from the monolith, so we would break the monolith apart part by part, and things should keep going.
The approach was: discover what the responsibilities are, take them out, make them work, and make the monolith work using the new microservice. It was pretty important that nothing could be broken, because our end users could not suffer; we already had a lot of bad reviews on the App Store, so it was not a good time for that. This way we could gradually migrate over to microservices without affecting our uptime.
This was our first attempt. We started by creating two new microservices: one for handling authentication and another for account information in general. The auth service, obviously, was responsible for authentication: it was now the new identity provider, and we were using Redis to store the sessions. The keys were the session IDs and the values were the logged-in user information, and things started to look good again.
We could scale horizontally again, no problems with the APIs, everything looked great. Business kept growing; we roughly doubled our number of users from one year to the next, and then we hit the proliferation of microservices. Now we were more than 100 engineers divided into teams, with everybody using whatever technology they wanted, so a lot of different languages. Each team had its own set of microservices, normally two or three. So in just a few months we went from this...
Incidents in production were a lot more common than we were comfortable with, and a lot of the time they were related to that Redis. It was a single point of failure: when things went wrong there, sometimes we just lost all the sessions and everybody got logged out of the platform. And why was that happening?
A
We
had
a
huge
volume
of
requests,
new
officers,
and
sometimes
we
had
requests
duplication,
because
if
two
different
micro-service
needed
to
make
sure
a
user
was
logged
in
they
couldn't
they
couldn't
make
sure
that
the
guy
was
out
in
cadiz,
so
both
would
call
the
officers
for
the
same
request
coming
from
the
end-user.
So
a
lot
of
request,
duplication.
We also had tight coupling: every service was calling our auth microservice directly, so improving anything in the auth service was hard, because we had maybe 50 different services using it, and it's hard to coordinate the effort for everybody to make updates. We also had a lot of code duplication.
A lot. Every microservice had the same authentication logic: when a microservice needed to make sure a user was authenticated, it would make a request to the auth service passing the session token (the first security problem right there), then check the response, and if the session was not valid, it should return 401.
Also, if the request was coming from a web browser, it was the responsibility of each team to make sure the right redirect flow would work, so they would come up with different strategies to do that. We were also not fault-tolerant: any problem in the Redis meant downtime for us, and there was nothing we could do. After seeing all of this, we had to accept that our system design was just bad; it was not going to work for us. So we got back to the drawing board, and this time we had new requirements.
We needed to keep things backward-compatible. We needed better performance. We needed to be more resilient. We needed to avoid code duplication for the other teams, so they could iterate faster. And last but not least, we couldn't lose the ability to revoke access fast. That means we couldn't go fully stateless; it would be a huge safety issue for us. So it was time for a new strategy, and after studying the problem again, we came up with one.
First, we created a new identity provider. This time we would persist the token data in a database, not just in Redis; we would use Redis just as a cache layer, to avoid reading from the database all the time. We also chose to move to a different token strategy: we would use two different kinds of tokens, a short-lived stateless token and a stateful refresh token. And of course, this time we used an API gateway, after two weeks of experimenting with some of them.
Kong has a lot of nice built-in plugins, and we are still using some of them. We tried to use them for our use case, for authentication, but we needed to keep things backward-compatible, and OLX was not using any kind of standard for that; we had proprietary stuff on top of proprietary stuff. So we decided to build our own plugin, making use of one of the good characteristics of Kong: it's easy to extend Kong with plugins.
These are the options of the plugin we built. As you can see, there are a few choices you can make here when adding a route to Kong. The most important options are whether authentication is optional, and which account information you need. So if you need the email of the logged-in user in your microservice, you put that there, along with anything else you need, and we take care of fetching it. Now I'm going to show you the flow.
Here is the flow of a request that passes through Kong using our plugin. First, the request comes from the web browser or mobile web, so it's coming directly from the client. It hits Kong, and Kong checks whether the route being accessed has the plugin enabled on it. If it has, it starts running the plugin. The plugin checks whether the route you are accessing needs any account information; if it needs any information other than the account ID, we make a call to the account service to fetch it. In parallel, we validate the stateless token.
Kong injects this information into the upstream headers, so the upstream gets all the account information it needs right there, and proxies the request to the upstream. Now the upstream can easily read the information it needs about the account from the headers, and that's easy to do in any language, so we don't have problems with that. The microservice does whatever it needs to do and returns to Kong. Kong then attaches the new stateless token, if it got one, to the response and returns it to the client.
But what happens if the user is not logged in? Well, if the user is not logged in, we have a few different cases. Remember this screen: the first thing we check here is whether authentication on the route is optional. If it is, we don't need to do anything; we just don't put any information on the headers, so the microservice knows the user is not authenticated.
If you keep it as required and you did not turn on the redirect feature, it will just return 401. That's what we normally do for the APIs used by mobile: we leave this option off. When we are using it for the web, we turn this feature on. If the redirect feature is turned on, the plugin has another job to do: it will generate a JWT for the request, containing mainly the HTTP method and the body.
It then redirects the user to the login page, but this time passing this JWT in the query string, so after the user logs in, the accounts team can take care of the redirect flow for you. We just reprocess the JWT and then handle the whole redirect flow, so the teams don't need to think about it anymore.
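Putting the three anonymous-user cases together (optional, required with a plain 401, and required with redirect), the plugin's decision could look like this sketch. The option names, the login URL, and the query parameter are all illustrative; the real plugin signs a JWT where this sketch uses a bare base64 blob.

```python
import base64
import json
from urllib.parse import urlencode

LOGIN_URL = "https://example.com/login"  # illustrative


def handle_anonymous(route_opts: dict, request: dict) -> dict:
    """What the gateway plugin does when no valid token is present."""
    if route_opts.get("auth_optional"):
        # Pass through with no identity headers; the upstream sees
        # an anonymous user and decides what to do itself.
        return {"action": "proxy", "headers": {}}
    if not route_opts.get("redirect"):
        # APIs (e.g. mobile clients): a plain 401, no redirect dance.
        return {"action": "respond", "status": 401}
    # Web: capture the original request (method + body) in a token --
    # the real plugin issues a signed JWT here -- and send the user to
    # the login page with it in the query string, so the flow can be
    # replayed after login.
    blob = base64.urlsafe_b64encode(
        json.dumps(
            {"method": request["method"], "body": request.get("body", "")}
        ).encode()
    ).decode()
    return {
        "action": "redirect",
        "location": f"{LOGIN_URL}?{urlencode({'next': blob})}",
    }


assert handle_anonymous({"auth_optional": True}, {"method": "GET"})["action"] == "proxy"
assert handle_anonymous({}, {"method": "GET"})["status"] == 401
```

Centralizing this decision in the gateway is what lets each team stop inventing its own redirect strategy, as the talk describes.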
We reduced a lot of the requests, because we eliminated the duplication, and we are also using the stateless token, so we avoid calling the new identity provider all the time. So we don't need a lot of instances anymore: we are just using six c5.large instances. That's c5.large, not xlarge, a big difference. We are also using Redis only as a cache, so we don't need 60 GB of memory anymore, just 15, so we saved a lot of money.
We also added an RDS database to store the token data for the identity provider, but this RDS was not getting a lot of reads, because Kong is pretty efficient at caching the services and routes, and we are caching most of our requests for refresh tokens in Redis. So it was not a big deal.
We also saw a nice boost in production: performance increased a lot. It takes less than 10 milliseconds for us to authenticate a request now, and only the requests that need a lot of account information get that latency. If you don't need it, only the Kong part runs, and we can normally do that in less than three milliseconds. So we kind of made the whole platform faster.
Uptime also improved a lot, but let me say first: this is not the uptime for OLX in general, just for the accounts and authentication flow. As you can see, it got a lot better, mainly because now we have fallback strategies; when things go wrong, we don't need to upset our users with downtime every time. We could also implement another fallback strategy, a kind of emergency pattern, where we extend the duration of the stateless token if things get really bad. In a really huge emergency, we can just turn on a flag and say, okay, every token is valid for one hour, and that buys us some time to get things back up.
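That emergency flag can be sketched as a grace window applied at verification time, so already-issued stateless tokens stay valid longer without reissuing anything. The flag name and the one-hour window are illustrative, taken from the talk's example rather than from any real configuration.

```python
import time

EMERGENCY_GRACE = 3600   # accept expired tokens for up to one extra hour
emergency_mode = False   # the flag we flip during an incident


def is_valid(claims: dict) -> bool:
    """Expiry check with an optional emergency grace period."""
    deadline = claims["exp"] + (EMERGENCY_GRACE if emergency_mode else 0)
    return time.time() < deadline


expired = {"exp": time.time() - 60}  # expired a minute ago
assert not is_valid(expired)         # normally: rejected
emergency_mode = True                # incident: buy ourselves time
assert is_valid(expired)             # now accepted, within the grace window
```

The design choice here is that the grace lives in the verifier, not the issuer: nothing in the stored tokens changes, so flipping the flag back restores normal behavior instantly.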
Another nice benefit was better API management. Right now we have a little bit more than 50 services registered in Kong using the plugin, but we have hundreds of services, and we are making use of the concept of consumers in Kong. We registered all the teams as consumers in Kong and gave them different API keys.
We know which teams are clients of our services, how long they take to respond, whether they are close to exceeding their rate limit, the response status codes they are giving us, all this good stuff. Debugging became a lot easier, and it was done just using the built-in plugins; we didn't have to do anything other than enable them. So it was awesome.
How do we deploy it? Okay, we are running an AWS ECS cluster. We are managing the cluster ourselves, using Lambda functions to scale in and scale out, and we deploy Kong as a container right now; we have a cluster for the accounts services. Before that, we were experimenting with EC2: the first time we put everything in production, we just used EC2, using Packer from HashiCorp to build the AMI and Terraform to provision it.