Cloud Native Computing Foundation EnvoyCon 2020 - Virtual, 12 Nov 2020

From YouTube: Authorization with Envoy at Square - Jelle Vanhorenbeke

Description

Every organization has different authentication and authorization needs and it is not always clear how Envoy can help to abstract this from the application layer. In this talk we will show you how Square leverages Envoy's ext_authz filter and how our centralized authorization service has become the new source of truth for hundreds of services. We will cover how we migrated multiple authorization libraries to this centralized authorization service and how we rolled out these changes to production. This process has benefited other teams and allowed them to launch new features that were previously not possible.

A

Hi everyone, my name is Jelle, and today I'm going to talk about authorization at Square. I'm a software engineer on Square's Developers IAM team. I've been on this team for almost a year and a half now, and the entire time I've been working on authorization, which is what I would like to talk about today.

A

Here's a quick agenda of the topics I'm going to cover. First I'll talk a little bit about Envoy at Square, then give you a quick overview of our previous authorization architecture and some of the problems and challenges we faced with it, and then show how we're leveraging Envoy to do authorization. Before we do that, I want to talk about authentication and authorization.

A

Authentication is the process of verifying your identity. Are you who you say you are? Is this the authentic Sally? Authorization is the process of verifying that someone has the right permissions and is allowed to do what they want to do.

A

During this talk we're going to cover the second bullet point, which is authorization. I know that very often these two go together, and I would encourage people to try to think about them separately.

A

We are definitely trying to decouple these two at Square as much as possible, because I think that gives your architecture a little more flexibility.

A

Next, I would like to introduce you to SAFE. SAFE is our Session Authorization Framework Enforcer. It's one of the authorization frameworks available to service owners at Square. Next, Envoy SAFE, which is our Envoy-based authorization solution and the main solution I'm going to talk about today. It takes a lot of the things that were available in our SAFE framework and moves them to an actual service that we are leveraging with our service mesh. Next, Envoy at Square.

A

There's a great talk that two of my colleagues gave at this same conference two years ago, and a lot of the things they talked about then are now a reality at Square. We're at the next level, where we can leverage our service mesh to build a lot of these new, exciting features. Some highlights that are important for this talk: Square has a centralized control plane, and the control plane has a preconfigured cache of sidecar configurations, also known as snapshots.

A

And now a quick overview of how we used to do authorization at Square. We had multiple authorization strategies, so services could leverage different libraries to do authorization. Some of these libraries used protos. Others, such as SAFE, had ACL-like files where you could specify the different rules and authorization requirements. A third set of services used custom code with no additional library, all written in the actual application layer. On top of this, Square supports three major languages and several minor ones.

A

The same authorization solution is not available in all languages. What that means for a service owner is that if you have multiple microservices written in different languages, you may not be able to use the same authorization solution for all of them.

A

In reality it looks a little more like this. Even though we try to keep these libraries in sync and keep feature parity as much as possible, that is not always the case. Some features get implemented in one language and then get deprioritized; others still haven't been developed. So there's always a little bit of a difference, even between versions of the same authorization library in different languages.

A

On top of this, we have one permission set for our private APIs and a different one for our public APIs, and only one of our authorization frameworks is able to authorize against both sets of permissions and map them together.

A

What that means is that if you're using that framework, you can use this authorization layer for both types of APIs, while if you're using a different library, you still have to implement some authorization code for private APIs.

A

As you can imagine, even though this works, it presents multiple challenges. It's really hard to know whether all our microservices are running the latest version of our auth framework. It's also hard to roll out new features, because you have to implement them in multiple languages.

A

It's hard even to roll them out across all services, so you might have to implement the same feature in a different library as well.

A

On top of that, given the two different permission sets I just mentioned, it's complicated to use our public APIs internally. A lot of people would also reach out to us and ask: what is the right authorization strategy? What is the right framework to use? There was not always a clear answer to that question.

A

Besides all these problems, there was another challenge for our infosec team. It was extremely hard for them to do audits, because they would have to look at these ACL files, proto files, or even custom code to answer things like: is this endpoint exposing PII data? If it is, does it require the right permissions, given the data it's exposing? That was a very hard question to answer.

A

So we tried to come up with a few solutions.

A

First, we needed a consistent authorization strategy. We also talked about unifying both permission sets, so we could reuse our public APIs internally. Next, to solve the infosec problem, we thought about a centralized source of truth that they could use to look up resources and their requirements and see whether these permissions match what is expected from a security perspective.

A

Something else that came up: we needed a single authorization point. That way we can make sure that everyone is using the same code to authorize and is always using the latest version available.

A

So these were some of the solutions and motivations, and then we started thinking about how we would actually implement them. As I mentioned, some of them we would be able to address with a centralized source of truth, and some other issues we could fix by having a single authorization point. We also wanted a deny-by-default approach, although we were still not quite sure how we were going to get there.

A

What deny by default means is that if a resource or an endpoint has not defined any authentication or authorization requirements, it gets denied. You have to explicitly define these requirements before your endpoint will work correctly.
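A minimal sketch of the deny-by-default rule described here. The talk does not show Square's data model, so `RouteRequirements`, `is_authorized`, the permission name, and the example path are all hypothetical and only illustrate the idea that a route with no registered requirements is rejected.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RouteRequirements:
    """Requirements registered for one route (hypothetical shape)."""
    required_permissions: frozenset = field(default_factory=frozenset)


def is_authorized(path, caller_permissions, requirements_by_path):
    """Deny by default: a route with no registered requirements is rejected."""
    requirements = requirements_by_path.get(path)
    if requirements is None:
        # Nothing was defined for this route, so the request is denied.
        return False
    # Otherwise the caller must hold every required permission.
    return requirements.required_permissions <= set(caller_permissions)


# Hypothetical source-of-truth contents.
REQUIREMENTS = {
    "/v1/payments": RouteRequirements(frozenset({"PAYMENTS_READ"})),
}
```

With this shape, a request to a route that nobody has configured fails even if the caller holds valid permissions, which is what forces service owners to declare requirements explicitly.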

A

At that point, we had been looking at the external authorization (ext_authz) filter that's available in Envoy, because we had reached a point at Square where all these services had Envoy sidecars, so leveraging Envoy became a real option. For those of you who are not familiar with Envoy's ext_authz filter, it's personally my favorite extension.

A

It has made my life so much easier. The way it works is that the filter calls an external service and sends it the original request, and the external service then decides whether that request is authorized. If it is, the service returns a 200, and Envoy moves on to the next filter and eventually reaches the upstream application layer. The upstream application now knows that this request has been authorized.

A

If the authorization service decides that the request is not authorized because it lacks certain permissions, it can return an error such as a 403. In that case Envoy takes that response and returns it to the client, including the response body the authorization service returned.

A

This is what that looks like for a successful request. The client sends a request, which gets proxied by Envoy. Envoy calls the authorization service and receives a 200, then forwards the request to the application layer, which eventually returns its response to the client.
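The request flow just described can be sketched as a small simulation. This is not Envoy code or Square's service: `check`, `proxy`, the token lookup, and the response body text are hypothetical stand-ins that only mirror the behaviour described (forward on a 200, otherwise return the authorization service's response to the client).

```python
def check(request, allowed_paths_by_token):
    """Hypothetical authorization service: returns (status, body)."""
    allowed = allowed_paths_by_token.get(request.get("token"), set())
    if request["path"] in allowed:
        return 200, ""
    return 403, "missing permission for " + request["path"]


def proxy(request, allowed_paths_by_token, upstream):
    """Mimics the filter: forward on 200, otherwise short-circuit."""
    status, body = check(request, allowed_paths_by_token)
    if status != 200:
        # The denial, including its body, goes straight back to the client.
        return status, body
    # Authorized: the upstream application handles the request.
    return upstream(request)
```

For example, with `allowed = {"tok-1": {"/v1/payments"}}`, a request carrying `tok-1` reaches `upstream`, while any other token receives the 403 and never touches the application.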

A

As you saw earlier, we used a lot of library code at Square and we did not have an authorization service. So we had to build an authorization service to support the ext_authz filter.

A

For the authorization service, we decided to use a database as the source of truth for all authentication and authorization requirements for all routes at Square, and we backed that with an Envoy SAFE UI. This UI is what allowed service owners to configure their routes and their requirements, and essentially that's what would be enforced by Envoy.

A

This is a quick preview of what that UI looks like. We did go back and forth on whether we should use a UI and a database or ACL files that can be checked in to source control. The main reason we decided to go with the UI and the database is that we're still making a lot of changes.

A

We still want to make some improvements to our authorization model, and making these changes with a database is slightly easier than making them in static files. On top of that, if you want to make a change to your authorization requirements, with a database you can do that immediately.

A

If you're using ACL files, by contrast, it would require a redeploy of the authorization service. Since we had somewhere around 250 services that we were trying to migrate, every single time one of those services made a change, we would have to redeploy the authorization service for that change to show up in either staging or production.

A

Next, we had the solution in place where Envoy calls the authorization service for every single request. That's when we introduced the concept of protected and unprotected routes. Unprotected routes are routes for static content, blog posts, and images that do not need authorization. For those routes we really don't want Envoy to call the authorization service, because that's a waste of resources for both the authorization service and the request itself.

A

To do that, we gave service owners the option of saying whether their routes required authorization or authentication. Then we built an integration between our centralized control plane and our authorization service, which now sends the unprotected routes and their services over to the control plane.

A

So when the control plane is building a new snapshot for an Envoy sidecar, it knows for which routes it has to disable the ext_authz filter.
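One way a control plane could express this is Envoy's per-route override for the filter, the `ExtAuthzPerRoute` message with `disabled: true` under `typed_per_filter_config`. The sketch below builds such a route entry as a plain dictionary. Square's control plane is not shown in the talk, so `build_route`, the prefixes, and the cluster names are hypothetical.

```python
EXT_AUTHZ_FILTER = "envoy.filters.http.ext_authz"
PER_ROUTE_TYPE = (
    "type.googleapis.com/"
    "envoy.extensions.filters.http.ext_authz.v3.ExtAuthzPerRoute"
)


def build_route(prefix, cluster, protected):
    """Build one route entry; unprotected routes skip the ext_authz call."""
    route = {"match": {"prefix": prefix}, "route": {"cluster": cluster}}
    if not protected:
        # Per-route override that turns the filter off for this route only.
        route["typed_per_filter_config"] = {
            EXT_AUTHZ_FILTER: {"@type": PER_ROUTE_TYPE, "disabled": True}
        }
    return route
```

A route built with `protected=True` carries no override, so the filter configured on the listener still applies to it.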

A

Next, we had this in place and it was great, but now we had a migration challenge. We had the solution, but we still had 250 services that we needed to migrate. How do we get all those rules and authorization requirements into this central storage? This is when the team decided to invest some time in building migration scripts.

A

If you remember from the previous overview, different libraries used different formats: some used ACL files, some used proto files. We built different scripts that would extract the rules from these files and call temporary endpoints in the authorization service, so we could store that data in the authorization database.

A

This turned out to be a great solution, mainly because it was a very flexible approach. We were the decision makers, so we could allocate as many resources to this problem as we wanted and at the same time, it kept both authorization strategies in sync.

A

This was very helpful as we were rolling out Envoy SAFE. We had the ability to disable it, knowing that there was still a backup strategy: the library would be up to date and would have all the right requirements in place.

A

Next, we had a set of services that had their authorization requirements built into the application layer. Unfortunately, there was no easy way to extract that data and migrate it to the authorization service, so we had to work with service owners to have them manually migrate these routes. This is not ideal, mainly because teams have their own deadlines and schedules, so we had to work with that. Even though our teams were very supportive, there's still no automated way to keep both strategies in sync.

A

So now we have to ask teams to update their permission requirements in both places until we're fully rolled out and they can deprecate the code in their application layer.

A

Next, I would like to talk a little more about our rollout strategy. At this point we had a solution in place, we had a lot of data, and we were ready to try this and roll it out for multiple services. First, we introduced a logging-only mode. In this mode our authorization service always returns a 200, no matter what the actual authorization decision was, so Envoy never short-circuits the request.

A

We did this with additional metrics and logs, so service owners could compare the decision the authorization service would have made with what their existing library or application layer actually did. That allowed them to tweak their requirements.

A

They could tweak some of the permissions or configurations they had in place through the Envoy SAFE UI.
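A minimal sketch of the logging-only mode, assuming the real decision is computed first and only the status sent back to Envoy is overridden. The function name `respond`, the `log_only` flag, and the logger name are hypothetical; the talk does not describe how the mode is implemented.

```python
import logging

logger = logging.getLogger("authz")


def respond(decision_status, log_only):
    """Return the status sent back to Envoy, logging the real decision."""
    if log_only:
        # Record what would have happened, but never block the request.
        logger.info("log-only mode: real decision was %s", decision_status)
        return 200
    return decision_status
```

In log-only mode a would-be 403 is still visible in the logs and metrics, so owners can compare it against what their existing library decided before enforcement is switched on.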

A

Our next rollout strategy used the runtime fraction configuration. This allowed us to split the traffic and roll out on a percentage basis. We would roll out Envoy SAFE for five percent of a given service's traffic. That allowed us to make sure that the authorization service was hitting the right SLAs, that we were able to handle that QPS, and that we were not blocking any traffic we should not be blocking.

A

It let us do a more controlled rollout.
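A sketch of what a five percent rollout could look like, assuming the "runtime fraction" mentioned refers to the ext_authz filter's `filter_enabled` field (a `RuntimeFractionalPercent` with a `default_value` and a `runtime_key`). The helper name and the runtime key string are hypothetical.

```python
def ext_authz_filter_enabled(percent, runtime_key):
    """Build the filter_enabled fragment for a percentage-based rollout."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return {
        "filter_enabled": {
            # Runtime key that can override the default fraction.
            "runtime_key": runtime_key,
            # Fraction of requests for which the filter runs.
            "default_value": {"numerator": percent, "denominator": "HUNDRED"},
        }
    }
```

Calling `ext_authz_filter_enabled(5, "envoy_safe.enabled")` yields a fragment that enables the filter for 5 of every 100 requests; raising the numerator widens the rollout.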

A

To support this percentage rollout, we introduced it as part of an admin panel, and we pass it to the Envoy control plane through the same integration I mentioned earlier for unprotected routes. That did require some changes to our data model on the control plane side to support these different values, but it worked out. Next, I want to talk a little bit about some of the lessons learned.

A

First, in our UI we allowed service owners to use wildcards for given namespaces, mark them as unprotected or protected routes, and group certain permissions and endpoints. This introduced conflicts.

A

As you can see, someone would mark a wildcard, or all traffic, as unprotected and later add a more granular route with actual authorization requirements. So we had to build some logic to detect these cases and either notify the end user through the UI, saying, hey, you're introducing a conflict and might want to consider specifying a more granular route, or be very clever about how we organize these routes.
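A sketch of the kind of conflict detection described, assuming routes are stored as (pattern, protected) pairs with shell-style `*` wildcards. The talk does not show the actual route format or matching rules, so `find_conflicts` and the example routes are hypothetical.

```python
from fnmatch import fnmatchcase


def find_conflicts(routes):
    """Return (wildcard, shadowed) pairs for protected routes that an
    unprotected wildcard also matches."""
    conflicts = []
    for wildcard, wildcard_protected in routes:
        # Only unprotected wildcard entries can shadow a protected route.
        if wildcard_protected or "*" not in wildcard:
            continue
        for other, other_protected in routes:
            if other_protected and other != wildcard and fnmatchcase(other, wildcard):
                conflicts.append((wildcard, other))
    return conflicts
```

For the routes `[("/blog/*", False), ("/blog/admin", True)]` this reports that the unprotected wildcard covers the protected admin route, which is the case a UI could warn about.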

A

That matters when we send them as unprotected routes to the authorization service, and it's definitely something to keep in mind, because we missed it initially. Debugging also becomes a little more challenging, because service owners now have to rely on the logs we use in the service mesh or in the authorization service. They cannot add any custom logging or custom metrics; they have to rely on more generic output. Next, some shortcomings.

A

One thing we noticed with the ext_authz extension: as I mentioned earlier, it's my favorite filter, and I'm not the only one who thinks that. A lot of teams at Square are trying to use this filter not only for authorization but for other use cases as well. I think this filter is very versatile, so it can be used for multiple use cases and solve different problems.

A

As far as I know, there's no other filter that allows you to call a service and mutate the headers of the original request, and there is no out-of-the-box way to differentiate two ext_authz filters and deploy different configurations.

A

At the same time, there's no way to enable or disable them individually. That caused some conflicts between teams who are trying to use this filter: we cannot use them for the same services, because if they disable it, they're disabling our solution, and if we disable or enable it, we're also disabling or enabling the filter for their solution. There's also no way to bypass the filter for a given header.

A

That would have been useful in some cases to make sure we do not reauthorize or reauthenticate a request twice. You also can't change how the gRPC call is made, and that limits you if you want to have a microservice that implements two endpoints that could be called by the ext_authz filter.

A

That is no longer a possibility.

A

So, in conclusion, we decided to move all our authorization from app and library code into the service mesh, with a centralized source of truth. As for where we are: we are currently starting to roll this out in production. We are targeting a rollout of Envoy SAFE for close to 250 services, and we're expecting to handle somewhere around 20k QPS.

A

Our next steps are going to focus a little more on decoupling some of our authentication and authorization strategies. That will also allow us to implement a new and more flexible permission system, which will let us implement even more features on the application side.

A

That's all I had today. Thanks, everyone, for listening. This is my email. If you want to reach out, I would love to hear how your team is solving authorization, and if you're interested in working on some of these problems, Square is hiring. Thanks, everyone.

A

Hey everyone, thanks for listening to the talk. Let me know if you have any questions.

A

How long are you projecting this rollout to take for, say, 90% of all services? It took us about three to four months to get most of these routes into our database and have service owners review them, update them, and start rolling out in staging. We're currently working on our production rollout. We expect this to take somewhere between four and five months, easily, and probably a little longer.

A

What was the overhead like from running these authz calls not only out of process but over the network? That's a good question. We have additional latency on some of our requests, depending on the calls. I think our average is still below 10 milliseconds, which is pretty good. For a while we were making two calls, one to the authentication service and one to the authorization service, and that was almost doubling the additional latency.

A

Is mTLS part of any of your architecture with authz calls? Not for these authz calls. We're definitely using mTLS for some other authorization strategies that we have in place, but not for this particular design.

A

When DB updates are done through the UI, how do you push the changes to the authz service instances? The way that works is through the integration I mentioned earlier with the control plane. When somebody updates some of the routes or permission requirements through the UI, this gets synced to the database, and our authorization service and the control plane are constantly syncing.

A

So when an Envoy sidecar hits our centralized control plane, it gets an updated configuration. I think right now there's a max latency of five minutes.

A

How do you deal with the situation where SAFE and other teams use two separate ext_authz filters in the same pipeline? That is a problem we haven't solved yet. It's something we're working on. We are considering making changes, maybe to the ext_authz filter itself, so we can identify these filters, for example by giving each some sort of identifier, and enable or disable them depending on the configuration. Right now, we have not solved that problem.

A

Are authz decisions always path-based, or is there any application business logic? So far, it's all path-based. The main reason is that we try to keep as much application and business logic as possible out of the authorization service.

A

That's also why we're trying to decouple our authentication service from our authorization service, to keep some of that logic separate.

A

We are working on a different authorization model where you can enforce more business logic or business decisions and specify more granular rules. That's still under development, but once we have it we're hoping to have a little more flexibility, so you can enforce on certain parameters. Right now that still lives in the actual application layer.

A

So it's still possible that a service will get the authorized request and need to do some additional authorization logic, especially if that logic is very close to the data model in that service.

A

Did you look at OPA for authz? If so, why did you choose a DB model? We did not look into that too much. We chose the DB model mainly for what I mentioned during the talk: flexibility was the main reason we decided to move forward with a database model. We also wanted to have the UI integration, which is one of the features we were looking to implement.

A

Did you cache results for better throughput? We have not. It's something we've considered, and it's an optimization we might implement later on, especially since, as I mentioned, there is no way to actually skip the ext_authz filter, which is possibly a good thing. But yes, we might consider caching some of the authz results.

A

Does each of the authz calls hit the database, or is there a cache? We have a cache that has most of these path rules already preloaded.

A

Is SAFE authz done by SAFE? I'm not sure what that question means. We had a SAFE framework that was doing authz, and we took a lot of the way authorization was being enforced by that framework and moved it to a service. But it's essentially a reimplementation of some of those authorization rules.

A

SAFE and Envoy SAFE are different implementations, but they use the same concepts, the same permission sets, and some of the mapping between the permissions we had for private APIs and public APIs. So there are some shared concepts, but the authorization service has its own implementation and does not reuse the existing SAFE framework.

A

I think I answered all the questions. If I missed any of your questions, feel free to reach out or repost it and I'll try to answer it. I left my email in the slides, so feel free to reach out; I'm happy to chat a little bit more.

A

Awesome. Thanks, everyone.