Description
Follow along as Waylay integrated Gloo Edge into their new SaaS platform to achieve on-demand, multi-tenant deployments. The talk focuses on the integration of authentication, metrics, and dynamic, claim-based routing.
Hello SoloCon, this is Waylay speaking. We are here in Ghent, and I'm glad to be speaking at SoloCon. I hope you're having a good time.
We're going to talk about how Waylay integrated the Gloo API gateway into our on-demand platform and also our other enterprise production platforms. To go over the talk: a short introduction by me, and then a second, technical part, which is taken up by Olivier, my colleague, who is sitting next to me here. So, the introduction: Waylay was founded in Ghent and is still headquartered there. Currently we are an IoT automation and data analytics platform, and what we actually provide is a rules engine.
Sorry, a bit of a dry mouth. So next up, this is our team. Our team is currently still growing.
This is definitely an older picture by now, so we're in full growth as we speak. To give you some idea of our platform, here are some numbers: half a million connected devices, and another number I can give is, for example, 313 million API calls that trigger about 1.5 billion service functions, which is 35,000 per minute. So yeah, we are quite busy people, and our clients are quite busy on our platform. Now, to go over the architecture.
We were early adopters of Kubernetes, and we are using Google Cloud. We used to use the Google Cloud load balancer together with the Kubernetes ingress controller, but that's something we wanted to move away from: we wanted to go for more complex routing, and we wanted to have a unified API gateway, for example.
As soon as we had some tenants, we automated these through playbooks on different virtual machines, but as already mentioned, we were early adopters of Kubernetes, so we moved all our config to Helm, and from there centralized config was pushed to the Kubernetes cluster, both for our multi-tenant and our singleton components.
There was still something missing there, because of course a tenant doesn't only consist of Kubernetes resources; there were also some cloud resources. So, to connect all those things together,
we went into our third phase, you could say, and that was where we adopted Terraform to hook into any cloud provider as necessary and do everything on the fly. Our fourth stage, you could say, is where we actually run into the need for this API gateway, or at least one of the needs, and that was our developer platform, which we call waylay.io.
What is it? It's a true software-as-a-service platform: easy sign-up, you start, and everyone can join in and create rules on the data that is coming in for them, create a community, get some feedback, advance the product.
Then, because we had these on-demand deployments, we of course also went into a usage-based billing process, so we needed to have information about those tenants. It was important that whatever was happening automatically gave us some feedback, and also that this platform was seen as one platform. So a unified gateway was one of the requirements for that API gateway: users would talk to one API, not to components left and right, and don't need to know the specifics of what is happening in the background.
Just a nice spec. We needed speed, we needed to scale, and we needed to delegate authentication to the edge, so to the API gateway and not to the components. We needed rate limits, and also the ability to allocate plans within those rate limits.
Metrics were very important: how can we retrieve information about which tenant is calling a multi-tenant component? Those were all problems that we wanted to solve. And of course, intuitive config: we, like everyone else, don't like it when config gets messy and is not easily readable. Then also extensive traffic management.
So the clear reason we chose Gloo was: it met all the requirements. There were great articles even before we tried it. While we did the trial, we already got super great support; there was a Slack channel open, so it felt like we were already involved, and adoption came from both sides, let's say it like that. The documentation was great, and I think one of the other great things was that it is based on Envoy, which is heavily maintained, with active development and wide adoption. And I think this is the part where I'll hand it over to my colleague, Olivier.
Hello everyone, thank you for this nice introduction. I'll continue with four topics which are more related to the technical part and how we got working with Gloo, integrating it into our waylay.io platform. The four topics I want to discuss are authentication, rate limiting, metrics, and claim-based routing. So let's start off with authentication.
Here on this figure you can see the flow, or the different stages, a request typically goes through when arriving at Gloo. For authentication I'll focus on the external auth stage, as we have a very strict requirement that all our paths, and thus all requests, have to be authenticated.
B
Not
only
are
user
requests
but
also
enter
microservice
traffic
heavily
relies
on
authentication
and
authorization
features,
so
we
have
two
options:
requests
can
either
have
basic
authentication
so
basic,
so
api,
key
secret
pairs
or
requests
can
be
authenticated
using
jwt
tokens
for
the
api
key
secret
pairs.
We
used
to
have
a
legacy
component,
doing
the
exchange
for
jwt
tokens,
appending
them
to
the
request
before
forwarding
them
further
on
to
the
other
microservices,
but
using
glue.
B
We
had
the
perfect
opportunity
to
deprecate
that
service
and
to
move
the
functionality
straight
at
the
api
gateway
level
by
implementing
an
external
out
plugin.
This
plugin
yeah
uses
the
provided,
golang
sdk
and
is
loaded
in
the
external
authentication
server
upon
startup,
and
it
allows
us
to
use
or
move.
B
Straight
into
the
glue
api
gateway-
and
this
has
the
advantage
that
we
have
a
very
lightweight
and
simple
approach
to
do
the
exchange
and
we
don't
need
an
extra
network
hop
or
an
external
service.
We
need
to
maintain
to
do
jwt
token,
fetching
quick
side.
Note.
The
external
art
plugin
provides
an
ideal
opportunity
to
implement
or
to
have
some
extra
business
logic
into
glue,
even
if
it's
completely
not
authentication
related.
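For reference, wiring such a plugin into Gloo Edge happens through an AuthConfig resource. The sketch below is illustrative only, not our actual config: the plugin name, file name, symbol, and endpoint are hypothetical placeholders.

```yaml
apiVersion: enterprise.gloo.solo.io/v1
kind: AuthConfig
metadata:
  name: waylay-auth           # hypothetical name
  namespace: gloo-system
spec:
  configs:
  - pluginAuth:
      # the compiled Go plugin (.so) is loaded by the ext-auth server on startup
      name: ApiKeyExchange               # hypothetical
      pluginFileName: ApiKeyExchange.so  # hypothetical
      exportedSymbolName: ApiKeyExchange
      config:
        # arbitrary plugin-specific settings, handed to the plugin at startup
        tokenEndpoint: https://accounts.example.com/token  # hypothetical
```

The AuthConfig is then referenced from the virtual service's extauth options, so every matching route passes through the plugin.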
So, a brief overview of how a request goes through the authentication flow: it arrives either with basic auth, so an API key/secret pair, or with a JWT token. In the latter case, the plugin just forwards the request to the validation stage, which validates the token based on keys from the JWKS URL of our remote authentication service. When an API key/secret pair is provided in the request,
B
The
basic
output,
the
external
outplugin,
sorry,
will
fetch
a
token
from
the
way
layout
service
and
we
can
even
implement
some
caching.
So
we
don't
need
that
extra
call.
Every
time
in
case,
there's
no
authentication
provided
at
all
or
the
api
key
secret
pairs
are
invalid.
The
response
code
will
be
in
unauthorized
or
401
status.
B
The
same
goes
for
the
validation
stage
if
it
fails
yeah.
Another
reason
we
heavily
rely
on
this
jwt
structure
is:
we
include
the
tenant
identifier
in
each
token,
so
this
provides
me
with
the
perfect
segue
to
the
next
part
rate,
limiting
which
is
most
commonly
used
to
protect
underlying
services
and
resources,
but
we
found
this
functionality
extremely
useful
to
implement
user
quota
depending
on
their
subscriptions.
So we use the rate limiting functionality built into Gloo. And how do we do this? As I mentioned, we have a tenant identifier for each request, and we couple this with the rate limiting setup of Gloo.
On the left side, you see a very basic, fictional example of a JWT configuration, and at the bottom you see the claims-to-headers step, which will extract the tenant claim from the token and append it to the request as a header with the key tenant-identifier.
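A minimal sketch of what such a JWT configuration with a claims-to-headers step can look like in a Gloo Edge virtual host; the provider name, issuer, JWKS URL, and upstream reference are made-up placeholders:

```yaml
virtualHost:
  options:
    jwt:
      providers:
        waylay:                                  # hypothetical provider name
          issuer: https://accounts.example.com   # hypothetical issuer
          jwks:
            remote:
              url: https://accounts.example.com/jwks  # hypothetical JWKS URL
              upstreamRef:
                name: auth-service               # hypothetical upstream
                namespace: gloo-system
          # copy the tenant claim from the validated token into a header
          claimsToHeaders:
          - claim: tenant
            header: tenant-identifier
```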
That header is what we then use in our rate limiting config, on the right side, where we describe some descriptors: for example, here a tenant descriptor with an associated rate limit (a usage count per time frame) and an optional value. There's also an action which couples the tenant-identifier header key to the descriptor we define here.
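Sketched out, the two halves of that setup look roughly like this. The descriptor key, header name, and quota numbers are illustrative placeholders: the descriptors live on the rate-limit server side (for example in the Gloo Settings resource), the action on the virtual host.

```yaml
# Server side: a descriptor with its limit
ratelimit:
  descriptors:
  - key: tenant-id                # hypothetical descriptor key
    rateLimit:
      requestsPerUnit: 1000       # illustrative quota
      unit: MINUTE

# Virtual host side: an action coupling the header to that descriptor
ratelimit:
  rateLimits:
  - actions:
    - requestHeaders:
        headerName: tenant-identifier
        descriptorKey: tenant-id
```

At request time, the value of the tenant-identifier header fills in the descriptor value, so each tenant gets its own counter.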
One thing to note: at the moment we don't have the possibility to use leaky bucket algorithms, so it's purely time-window based. Each minute, hour, day, or other time frame, the counters reset. But that's not a real issue for us at the moment.
B
It's
a
really
small
and
basic
example,
but
it's
really
powerful.
For
example,
it's
possible
to
extend
the
rate
limiting
configuration
with
extra
descriptors
or
even
child
descriptors.
So,
for
example,
we
could
set
up
a
rate
limiting
config,
where
each
of
our
microservices
has
a
an
upper
bound,
which
is
completely
different
than
the
global
tenant
upper
bound.
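That per-service bound could be sketched with a child descriptor like this (all names and numbers are illustrative): a request whose actions produce both the tenant and service entries is checked against the nested limit, while a tenant-only match keeps the global bound.

```yaml
descriptors:
- key: tenant-id
  rateLimit:
    requestsPerUnit: 1000    # global per-tenant bound
    unit: MINUTE
  descriptors:               # child descriptors nest under the parent
  - key: service
    value: rules-engine      # hypothetical service name
    rateLimit:
      requestsPerUnit: 200   # tighter bound for this one service
      unit: MINUTE
```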
In the same context of usage-based subscriptions, it was also a big requirement for us to keep track of how much traffic a tenant generates upstream and downstream to our services, and Gloo, via Envoy, already provides a lot of metrics out of the box. There are some categories displayed here. There are a lot, and really a lot, of metrics available for the whole gateway, but also on a per-upstream level, per HTTP status code, and so on, and so on.
So, as I mentioned, it's important for us to track the up- and downstream traffic for each tenant on our platform, and we do this using the built-in metrics. Metrics for up- and downstream traffic are automatically generated out of the box per upstream, which means we need to create an upstream for each tenant, for each service. I already mentioned that we want to streamline the user experience: we just want one generic, universal domain name for all tenants and all users, namely api.waylay.io.
Okay, this led to the following setup: we create tenant namespaces inside our Kubernetes clusters, with their upstream configuration and also some other config and services, and we replicate this for each tenant. So as our tenants grow, our namespaces grow, and the number of upstream configurations grows as well.
In the virtual service, this initially meant routing arrows, so to say, to the different namespaces for the respective upstreams, so that the Waylay services can be reached. This, however, doesn't really scale well. For example, imagine we have a thousand tenants: this would mean we have arrows, routing decisions based on, for example, a header value or other matching criteria, to all those different namespaces, and those are evaluated in order, which we really want to avoid because it's not very efficient. So this is not the way to go.
Luckily, we could take advantage of a recently introduced feature, and that is routing based upon a cluster header. Here we see a small example of how we do this in practice. At the top you see again the claims-to-headers code block, which results in extracting the tenant claim from the JWT token into the tenant-identifier header; we also used this earlier for the rate limiting configuration. In the bottom part, you see that we do a transformation.
B
This
transformation
will
use
the
tenant
identifier
and
will
create
an
upstream
header
and
yeah,
and
a
header
with
the
key
upstream
and
the
value
will
be
of
the
form
upstream
name
underscore
upstream
namespace,
and
this
is
the
schema,
glue
and
envoy
used
to
designate
an
an
upstream
without
using
an
actual
upstream
reference
name
and
reference
namespace.
Instead, we let Gloo decide which upstream the request should be forwarded to. Do note that we have to set clearRouteCache to true. That's because incoming requests already get a route assigned to them by the time they are evaluated at this stage, so we definitely need to reset the route cache so that the upstream header we set here is taken into account when the routing evaluation is redone.
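Put together, the idea looks roughly like the following sketch. The field names follow the Gloo Edge staged-transformation and route APIs as we understand them, and the header and upstream names are placeholders, so treat this as an illustration rather than our literal config:

```yaml
# Early-stage transformation: build the upstream header from the tenant claim
options:
  stagedTransformations:
    early:
      requestTransforms:
      - clearRouteCache: true          # re-run route matching with the new header
        requestTransformation:
          transformationTemplate:
            headers:
              x-upstream:
                # "<upstream-name>_<upstream-namespace>", e.g. tenant-api_tenant-123
                text: 'tenant-api_{{ header("tenant-identifier") }}'

# Route: forward to whatever cluster the header designates
routeAction:
  clusterHeader: x-upstream
```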
B
Okay,
so
this
results
in
the
following
diagram,
so
we
have
our
virtual
service
using
a
single
domain
name
as
we
required,
and
then
we
have
an
yeah
and
an
identifying
amount
of
namespaces.
So
it
doesn't
really
matter
how
many
we
have
it
scales
perfectly.
With
with
how
many
we
have
it's
just
a
mapping
using
the
cluster
header
which,
as
designates
the
upstream,
we
require
for
a
set
tenant
based
upon
the
the
information
we
include
into
that
header
at
runtime,
and
then
those
upstreams
correlate
to
their
respective
kubernetes
services
in
the
back
end.
B
So
this
allows
us
to
have
a
per
tenant
burst
service
upstream,
which
results
in
a
in
metrics
for
up
and
down
screen
traffic
for
all
of
our
tenants
and
that
satisfies
the
requirements
and
does
the
naming
dynamic,
claim-based
routing
okay.
This tenant identifier also allowed us to do rate limiting on the number of requests a certain tenant can make per time frame to our waylay.io platform.
Now, about our deployment: everything lives in one Helm chart, so we can quickly and easily deploy Gloo on a new cluster, pre-configured with all of the custom quirks we wanted to have, for example the external auth plugin, and so on, and so on.
This also allows us to make it very generic, which has the advantage that if we want to do an API update, we don't have to change all tenant configurations across a whole bunch of namespaces. No: we have one chart, one release, where all our paths and route tables are defined, which makes it really sweet to do updates, to quickly iterate, to do versioning of APIs, and so on. So that's a real big plus for our deployment.
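As an illustration of "pre-configured with our quirks": loading a custom ext-auth plugin, for instance, comes down to a few Helm values on the Gloo Edge chart, roughly like below. The plugin name and image coordinates are placeholders, and the exact value paths may differ between chart versions, so this is only a sketch.

```yaml
global:
  extensions:
    extAuth:
      plugins:
        api-key-exchange:                  # hypothetical plugin name
          image:
            registry: quay.io/example      # hypothetical registry
            repository: api-key-exchange-plugin
            tag: 0.1.0
```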
Then the tenant-specific configuration is, as already mentioned, created whenever a tenant is created or registered. So, together with some account setup and other Waylay-specific actions, some Upstream CRDs are also created in the tenant namespace, which allow Gloo to route incoming requests dynamically to a new tenant without us even having to update the Gloo deployment or the Helm chart, which is listed here as the second point.
So this is also really powerful, and it is what makes the routing dynamic and claim-based. Then lastly, we have the rate limit config. That one contains our subscription models, so at the moment we have three, and they contain the quotas.
Okay, so this was a quick overview of how we got started with Gloo, I think only a bit more than two months ago now, and how we actually got up and running very quickly and achieved all our requirements, together with a little help from the Gloo team when we needed some support. Then of course there are still some things on our roadmap, as we have just met our own requirements but are far from done. A few things we want to check out next are federation and high availability.
Also, alerting and monitoring are up next. For example, it would be nice to have some alerting whenever rate limiting quotas are exceeded, or for health monitoring, and so on. And then, of course, canary routing is also a very nice thing we'll have to look at next: for example, whenever we do new releases, we can make some gradual changes to the routes and see if everything goes well. That of course also comes together nicely with the alerting and monitoring, and also with internal routing:
we can couple it together with the claim-based routing to do some internal redirects to the canary releases. And then lastly, we have outlier detection and also fault injection, to harden our services, to test, to do health monitoring, and to make sure everything is up and running and working fine.
Thank you for your attention, and thank you for listening through my rambling; I hope you found it a good talk. In case you have some more questions, if you want to reach out, get to know what we do at Waylay, or play around with our platform, feel free to reach out. You'll find our contact information below. We also have some videos and some white papers for you to follow along with, and if you have any more questions, let us know; we'll be glad to try and answer them. Thank you.