Hear from Jonathan Lane at Vonage
Hi everyone, I'm Jonathan. I'm a lead developer with Vonage in the API group, and my team is responsible for implementing Gloo Edge as our API gateway. Vonage is a leading provider of communications APIs. We cover a wide variety of communication channels, everything from SMS through internet messengers like WhatsApp and Facebook Messenger, as well as voice and video channels.
The APIs that we provide are used by our customers to add these features to their own applications. Because they're provided by a single company, the APIs are consistent, and the context of a conversation across all of those channels is available on each of the channels. So if you're implementing a contact center and your customer first contacts you via web chat and then through Facebook, the agent responding on Facebook has access to the initial contact that was made through the web chat.
If the customer then phones up, the contact center can present all of that information, all of the previous context, to the agent, who can then help that customer more efficiently. Now, to be able to provide this, these APIs need to be highly available and low latency, and for services like SMS the geographic location of those services is also significant, so that we have access to all of the mobile providers we need to reliably deliver these messages.
Some of these APIs will be serving thousands of requests a second, and across all of our APIs we're handling tens of millions of API calls every day. On the back end this is a service-oriented architecture, but as with a lot of companies, these services have evolved over time. Some have been acquired, and so they're not always as consistent as we'd like them to be when we present these APIs to customers.
In particular, there are differences in the way that authentication is handled in some cases, and there can be differences in things like the URL scheme for accessing the APIs. This is where our API gateway project comes in. We started the project in a bid to drive consistency in everything that we do and everything that we present to our customers, and that consistency comes in a few steps.
First of all, all of those APIs need to be behind a single API gateway. Then we can start to transfer some of the cross-cutting features, things like authentication and URL management, into the ownership of that gateway. Once we have that in place, we're able to align what we do there as a single concern for a single team, and the teams responsible for the upstream services no longer need to worry about which authentication schemes they should be supporting or how to present the URL consistently for our customers.
As part of this, the teams are also working through containerizing all of the services in an effort to move them into AWS, so that migration effort is running in parallel with the gateway work. This slide shows some of our customers and the market segments they operate in. They use a wide variety of APIs from us over a number of different communication channels.
I thought it'd be useful to say why I'm here talking about Gloo Edge as our API gateway and not some other gateway like AWS or Apigee, for instance. About two years ago we started looking at API gateways. We had a long list of criteria for what was important to us in a gateway environment. Flexibility, for instance: we're currently installed on IBM SoftLayer, but we're transitioning to AWS.
We wanted to choose a gateway that can migrate with us, that we can install anywhere, and that doesn't make any assumptions about which hosting provider or cloud platform we're currently deployed on. Routing flexibility is also very important. We currently route based on a number of criteria: this can be everything from a simple path-prefix match to a complex regular expression, or any header on the incoming request.
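As a rough illustration, routes with those kinds of matchers might look something like this in a Gloo Edge VirtualService; the domain, upstream names, and header here are placeholders rather than our real config:

```yaml
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: example-routes            # illustrative name
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - 'api.example.com'
    routes:
      # Simple path-prefix match.
      - matchers:
          - prefix: /v1/messages
        routeAction:
          single:
            upstream:
              name: messages-service    # hypothetical upstream
              namespace: gloo-system
      # Regex match plus a header on the incoming request.
      - matchers:
          - regex: '/v[0-9]+/sms/.*'
            headers:
              - name: x-channel         # hypothetical header
                value: sms
        routeAction:
          single:
            upstream:
              name: sms-service         # hypothetical upstream
              namespace: gloo-system
```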
We also route based on the specific URL the customer has used to access the service. Typically a customer will use a general URL and our load balancers will take care of finding the lowest-latency route to a data center, but sometimes a customer will want to access a service in a particular region. They might want to stay in Europe, for instance, and not be routed to the US, even if they happen to be in the US, or closest to the US, at that time.
Sometimes the customer will go even further and want URLs that access a specific data center directly. In that case, they'll take responsibility for making sure they're routed to the closest data center at the time, and also for failover in the event of a data center issue. But typically that's not the case; typically it goes through a higher-level load balancer. We also want to be able to extend this further.
A
Currently
we
sit
behind
nginx,
but
it
tends
to
be
difficult
for
their
routing
there
to
be
changed.
It
requires
the
involvement
from
an
ops
team
or
an
sre
team,
and
we
want
to
be
able
to
empower
our
service
developers
to
be
able
to
change
their
own
configuration
of
routing.
They
know
how
it
works.
They
know
when
they've
made
a
change
to
the
service
that
impacts
routing.
We
want
them
to
be
able
to
deploy
that
without
having
any
adverse
impact
on
what
other
teams
are
doing,
the
deployment
type
for
the
gateway.
As we move to AWS, all of our services are going into containers, and we're targeting Kubernetes as our deployment platform. Ideally, we want a gateway that's going to fit in with those plans; we don't want to be supporting another deployment technology just to be able to run that gateway. Rate limiting is also important. We currently have simple rate limiting implemented in NGINX, but again it's another feature that's difficult to extend and that requires other teams to look after it.
Vonage is an Advanced Technology Partner with Amazon, and as part of that we use a wide variety of AWS technologies. In particular, we're starting to roll our service teams out onto independent AWS accounts. It's important that the gateway we choose is able to work with us in this configuration: that we can route between AWS accounts, manage the config in that way, and make sure this is a seamless experience for the service teams.
So we did manage to get it working; it's in production, serving traffic for a couple of low-traffic services that we have. But the experience of building it, and maintaining it over a short period of time, made us think that maybe there was something more suited to us out there. In particular, building the C++ plug-in, we discovered that a number of the APIs we needed to integrate with were actually still in alpha at that point, and so over the course of a few weeks…
The config must allow the teams to own this without conflicting with the work being done by other teams. I've tried to show this here with the colors: the red boxes are owned by the API gateway team, the green boxes are owned by one service team, and the blue boxes by another team.
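Gloo Edge expresses this split ownership through route delegation: a VirtualService owned by the gateway team hands path prefixes off to RouteTable resources that live in the service teams' own namespaces. A minimal sketch, with invented names:

```yaml
# Owned by the API gateway team (the red boxes).
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: public-api
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - 'api.example.com'
    routes:
      - matchers:
          - prefix: /sms
        delegateAction:
          ref:
            name: sms-routes        # one service team (green)
            namespace: sms-team
      - matchers:
          - prefix: /voice
        delegateAction:
          ref:
            name: voice-routes      # another team (blue)
            namespace: voice-team
---
# Owned by the SMS service team, deployable without touching
# anyone else's routes.
apiVersion: gateway.solo.io/v1
kind: RouteTable
metadata:
  name: sms-routes
  namespace: sms-team
spec:
  routes:
    - matchers:
        - prefix: /sms/v1
      routeAction:
        single:
          upstream:
            name: sms-v1            # hypothetical upstream
            namespace: sms-team
```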
So in the end, it seems our requirements weren't so outrageous, because both of the gateways were functionally capable of doing what we needed. We decided to use Gloo.
In fact, since we've chosen Gloo, the support from Solo has continued to be excellent. They're very responsive to questions and to issues that we might raise on the platform, and that's questions not only about the product itself but also about the wider ecosystem of technologies that we're working with. That's particularly valuable if you find yourself working through something in the gateway that looks like it should work; the team at Solo are very capable of stepping in and helping.
Maybe we should look at something in Kubernetes, or maybe it's something to do with the way it's been deployed via Helm; their expertise really helps us get unstuck, not just with things that are purely the responsibility of the gateway. I can honestly say that, through my interactions with the team there, I really believe they are heavily invested in our success. So we chose Gloo Edge, and then we had to look at how we roll it out. We've been through a number of different Kubernetes environments.
There might be some small config changes required, but no issues. That allows us to really focus on the problem we actually need to solve: how do we take a large, multi-data-center, service-oriented application and migrate it to sit behind a gateway without interrupting the tens of millions of requests that we have to handle every day? Now, our services are split between SoftLayer and AWS.
Some of those services only exist in AWS, some only exist in SoftLayer, and some are in both, depending on which data center you happen to be talking to. So we need to be able to take the gateway and slide it into that network flow without interrupting anything, making it completely transparent to the service teams and to the services themselves. But at the same time, we do need to be looking to add some value, so we have to look at things we can take over the management of, like automated failover between data centers.
That means adding value for the service teams at the same time. For instance, if we take a single data center, we can migrate all the services there to sit behind the gateway, but only when they're called by a certain URL. Once we've done that, we can look to migrate the second data center in that geographic region. Then we can look at some of the higher-level URLs and how we start to migrate some of that traffic onto these regional URLs, putting some of the real traffic through the gateway. That can be done in a canary fashion: we can move a small percentage of the traffic, piece by piece, until we're happy that everything's working. Alongside that, once a service is behind the gateway, we can start to look at how we add value to it, because once we're there, we have the programmable features of the gateway.
We can start to offload some of the configuration to the service teams, whether that's around things like rate limiting or around the actual routing configuration itself. Now, those service teams are also looking to migrate their services into containers and onto AWS, and as they do that, this whole transition needs to be completely transparent to them as well.
So while they're on SoftLayer and we slide the gateway into place, they should notice no difference; the gateway is simply proxying the traffic back into SoftLayer, and everything should behave exactly the same. When they move into AWS, NGINX is no longer in play. All of those features they're relying on in NGINX will have to be replicated in the gateway, and again, from the service's perspective, they shouldn't notice that anything has changed, except that from our perspective we now have programmable routing.
Now we can say to the service team: here you go, you can look after this and you can start to make transitions. That can really help give the service team more confidence as they migrate onto AWS.
Now, as we move forward with this, of course we'll gain confidence, and the teams will gain confidence with the process, and we'll no longer have to migrate one percent of traffic at a time; we can start to become more aggressive. But this process really gives us the flexibility to be as cautious or as aggressive as we feel we need to be at the time. Here we can see a really exaggerated example of the sort of setup that we need to support.
A
But
if
the
customer
uses
a
geographic
url,
they'll
only
be
able
to
access
the
services
in
that
geo
if
they
use
a
data
center,
url
they'll
only
be
able
to
access
the
services
in
that
data
center.
In
this
case,
the
customer
would
be
responsible
for
choosing
the
most
appropriate
data
center
and
for
failover
between
them.
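One way to picture that URL hierarchy in Gloo terms is a virtual host per tier, where the geo-level host spreads across the data centers in its region and the data-center host pins to one; the domains and upstreams here are invented:

```yaml
# Data-center URL: this data center only, failover is the customer's job.
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: eu-dc1
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - 'api-eu-dc1.example.com'
    routes:
      - matchers:
          - prefix: /
        routeAction:
          single:
            upstream:
              name: services-eu-dc1
              namespace: gloo-system
---
# Geo URL: stays within the EU, balanced over both EU data centers.
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: eu
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - 'api-eu.example.com'
    routes:
      - matchers:
          - prefix: /
        routeAction:
          multi:
            destinations:
              - weight: 50
                destination:
                  upstream:
                    name: services-eu-dc1
                    namespace: gloo-system
              - weight: 50
                destination:
                  upstream:
                    name: services-eu-dc2
                    namespace: gloo-system
```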
I've said a lot about routing configuration. This is the feature set that allows us to take our system and move it behind an API gateway. It also means that our service teams can more confidently migrate onto AWS, thanks to the control they get over their routing configuration. But there are two other key features of Gloo that we make heavy use of: rate limiting and centralized authentication.
I'll have a lot to say about rate limiting in a minute, so let's start with centralized authentication. This is a really key way that we're looking to make our APIs look and feel more consistent for our customers. Today, authentication is handled by the service teams: they receive a request, they inspect that request for the authentication parameters, and then they send those to a centralized authentication service.
Now, it's one account with one vendor, but it might look and feel like two different APIs to that customer. By moving authentication to be part of the gateway workflow, we can control that experience: we can look in all of the locations where all of the services would expect to find the parameters. So, from a customer's perspective, nothing really has changed.
They use the services in the same way they always have, with the same authentication parameters. But from our perspective, everything's in one place, and it means that if we decide we want to add a new authentication method, or even deprecate an old one, we can do that in a single place and it'll take effect consistently across all of our APIs.
So what we're doing at the moment is federating all of these together. The API gateway forwards a request on to an authentication hub, which inspects that request and determines which of the authentication providers should handle it. That means that, across the whole of Vonage, the upstream services no longer worry about authentication; it's already been handled. From the customer's perspective, again, nothing changes, but it means that when we add authentication methods or deprecate them, that's done consistently across the whole of Vonage.
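In Gloo Edge terms, handing the check to an external hub like this is roughly what a passthrough AuthConfig does; a sketch, assuming a gRPC auth hub at a made-up cluster address:

```yaml
apiVersion: enterprise.gloo.solo.io/v1
kind: AuthConfig
metadata:
  name: auth-hub
  namespace: gloo-system
spec:
  configs:
    - passThroughAuth:
        grpc:
          # Hypothetical address of the federating authentication hub;
          # it inspects the request and picks the right provider.
          address: auth-hub.auth.svc.cluster.local:9001
```

A virtual host then references this config from its options (extauth configRef), so every route behind it is authenticated before the upstream ever sees the request.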
For rate limiting, we want to be flexible about not only the parameters of the request that we rate limit on, but also how multiple rate-limiting rules interact with each other. So I'm going to talk for a few minutes now about where those rules come from and why they're important to us. We've been working with the team at Solo on how this should work, and in fact there is now more than one way of implementing these rules.
Let's say that we have a service, and once we get beyond a thousand requests per second, that service is either going to crash or its performance is going to degrade to an unacceptable level. Now, of course, we can think about how we scale that service, and how we scale it elastically so we never reach that point, but practically it's still good to have a rate limit in place to protect the service, one that can scale as the service scales.
We can go a bit further than that. When we send a message, some of these go through an external provider. If we're sending an SMS, for instance, there's a wide range of mobile operators that we connect to in order to deliver that message. Now, let's say that one of those providers requires that we put a limit in place of 200 rps.
So all the requests going to that provider can't exceed 200 requests per second. Now, when a request comes in, we need to increment the counter for the 1000 rps service-protection rule, and we need to increment the counter for the 100 rps per-account rule. Then we need to inspect that request, and from the number it's being sent to, we can determine which provider it's going to go via. If it's this special provider, we also need to increment the counter for the 200 rps rule.
So again, every request is going to be inspected, and for every rule in the rate-limiting service that matches its parameters, the counter for that rule will be incremented until we hit a limit. It happens that sometimes we're going to increment two counters, if the request is going via a different provider, or three, if it's going via our special provider.
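Gloo's set-style rate limiting maps onto this quite directly: each rule becomes a set descriptor, and the alwaysApply flag marks the counters that must be incremented on every matching request even when a more specific rule also matches. A sketch with made-up descriptor keys:

```yaml
apiVersion: ratelimit.solo.io/v1alpha1
kind: RateLimitConfig
metadata:
  name: sms-limits              # illustrative
  namespace: gloo-system
spec:
  raw:
    setDescriptors:
      # Service-protection rule: counted on every request to this API.
      - simpleDescriptors:
          - key: api            # hypothetical key set on every request
            value: sms
        rateLimit:
          unit: SECOND
          requestsPerUnit: 1000
        alwaysApply: true
      # Per-account rule: also counted on every request.
      - simpleDescriptors:
          - key: account-id
        rateLimit:
          unit: SECOND
          requestsPerUnit: 100
        alwaysApply: true
      # Provider rule: counted only when the destination number
      # resolves to the constrained provider.
      - simpleDescriptors:
          - key: provider
            value: special-provider
        rateLimit:
          unit: SECOND
          requestsPerUnit: 200
```

On the route side, rate-limit actions then supply these key-value pairs (account ID, provider, and so on) for each incoming request.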
We want to go a bit further than that, though, and add more optionality to these rules. Let's say that one customer wasn't happy with 100 rps; they wanted 150 rps, so we implement that rule for them. But now we need to increment the 1000 rps counter to protect the service, and then either the 100 rps per-account rule or the 150 rps special-account rule, and maybe we'll also need to increment the 200 rps rule for our upstream provider.
So now we can say that when a request comes in, it increments the 1000 rps rate limit to protect our service, and then it increments either the 100 rps rule, if it's not our special account; or the 150 rps rule, if it is the special account; or the 250 rps rule, if it is the special account sending from their special phone number. So now there's quite a lot of functionality in this.
There are rules whose counter should always be incremented on every request that matches their parameters, and then there is a set of rules where we want to increment the counter for only one of them at a time, depending on how many of the parameters match and also which parameters match. So we want to be able to express that some parameters carry a higher weight than others, and that we should only increment the rule which has the highest weighting.
The way that we do this is to assign a weight to each of the attributes, increasing those weights in powers of two. Then, when we consider a rule, we look at all the attributes which make up that rule and add their weights together to get an overall rule weighting. By increasing the weights in this way, we get a unique weight for each set of attributes, and that matches the way we think about the specificity of a rule.
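To make the arithmetic concrete, here's a small worked example with hypothetical attributes:

```yaml
# Hypothetical attribute weights, increasing in powers of two:
#   account-id:   1   (2^0)
#   region:       2   (2^1)
#   phone-number: 4   (2^2)
#
# A rule's weight is the sum of its attributes' weights:
#   {account-id}                        -> 1
#   {account-id, region}                -> 3
#   {account-id, phone-number}          -> 5
#   {account-id, region, phone-number}  -> 7
#
# Because the weights are powers of two, every distinct attribute set
# sums to a distinct total, so the most specific matching rule always
# carries the highest weight.
```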
Any rule within that set which reaches its rate limit will reject the request with a 429. So these rate-limiting parameters form a hierarchy: at the base of the hierarchy are the very generic parameters, which will match a high number of incoming requests; at the other end are the much more specific parameters, which will match a subset of those requests.
We can apply the weighting rules to those parameters to instruct Gloo as to which rate should be applied to any given incoming request. Now, if we think about how that's expressed through Gloo: there have been a number of ways of looking at this over the last year or two, and in the early approaches there was a lot of repetition of configuration.
A
We
had
to
enumerate
every
possible
combination
of
parameters
that
could
be
supplied,
and
that
was
quite
difficult
to
work
with
once
you
got
beyond
two
or
three
parameters
on
an
api.
The more recent set-style rate limiting is much easier to work with, with no repetition in the config, and that's what we're adopting at the moment. But even with set-style rate limiting, it still requires knowledge of Gloo configuration, of working with Kubernetes, and of the way the configuration is applied: we write a Kubernetes CRD with this information in a Gloo format, and Gloo will then take that and apply it to the rate-limit service.
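Tying that back to the per-account example from earlier, the weighted, set-style form of those tiers might look roughly like this, assuming the weight field works as described above; the keys, the account value, and the phone number are all placeholders:

```yaml
apiVersion: ratelimit.solo.io/v1alpha1
kind: RateLimitConfig
metadata:
  name: account-tiers           # illustrative
  namespace: gloo-system
spec:
  raw:
    setDescriptors:
      # Default per-account limit (lowest weight).
      - simpleDescriptors:
          - key: account-id
        rateLimit:
          unit: SECOND
          requestsPerUnit: 100
      # Override for the special account; when both this and the rule
      # above match, the higher weight wins.
      - simpleDescriptors:
          - key: account-id
            value: special-account
        rateLimit:
          unit: SECOND
          requestsPerUnit: 150
        weight: 1
      # Special account sending from its special number: most specific,
      # highest weight.
      - simpleDescriptors:
          - key: account-id
            value: special-account
          - key: from-number
            value: '15550100'   # placeholder number
        rateLimit:
          unit: SECOND
          requestsPerUnit: 250
        weight: 2
```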
We don't necessarily want our service teams to have to think about all of this when they want to create a rate-limit rule. So what we've been working on at Vonage is a throttling configuration service. This exposes a RESTful API, which means that the service teams, or even our support team, can create their own rate-limit rules through a simple web API.
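That service is internal to Vonage, so purely as a hypothetical illustration, a rule-creation request might carry a body along these lines, which the service would then translate into the CRD form shown above:

```yaml
# Invented endpoint and fields, for illustration only:
# POST /rules
service: sms
description: Raise the limit for a special account in the EU
match:
  account-id: special-account
  region: eu
limit:
  requestsPerSecond: 150
```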
We specify only what we want for each of those rules. That allows us, the service teams, and support to think about this in a way that makes sense to us, while still being able to apply it to Gloo without the service teams having to understand how it translates. It just means that they, and support, can respond much more rapidly to required changes in rate limiting, whether that's blocking an account that's sending too much traffic, or supporting the sales process by allowing a customer to send more traffic in a particular region, for instance, or whatever the requirement around rate limiting might be.
So that's all I wanted to talk about today. I hope it's been an interesting look at how we at Vonage have been working with Gloo and Gloo Edge over the last couple of years to take our existing platform, a large, distributed, multi-region system serving a reasonably high amount of traffic, and put it behind an API gateway without any interruption, alongside a migration into AWS.