From YouTube: Using SPIRE in Production at Uber - Andrew Moore
Description
Using SPIRE in Production at Uber - Andrew Moore
In this session we will provide an overview of how Uber uses SPIFFE and SPIRE for workload authentication and authorization in a diverse deployment environment. We will highlight the deployment architecture, operational practices, and benefits achieved.
So let's go ahead and get started. What I'm going to talk to you about at the start is just the scale that we deploy to at Uber, including SPIRE, the environments that we work with, and then how we integrate with SPIRE Server on various levels. After that, I'm going to go into how we actually onboarded our consumers.
[We wanted to] make that experience nice for them and make sure, as much as possible, that everyone's getting the identities that they need. So, for the scale: we have about a dozen independent data centers, with fewer than 10 SPIRE Servers per data center. Each one has tens of thousands of hosts in it, with a dedicated SPIRE Agent on each. Every data center, every day, is performing tens of millions of X.509 signings from SPIRE, and today we have over 500 services onboarded to SPIRE, which is less than 10 percent of the internal services at Uber.
For our environments, we have several in-house orchestrators; for better or worse, we build a lot of things in-house. Our hosts are both on-prem and provided by multiple cloud providers, and nearly all of our services are containerized and written in either Go, Java, or Python.
So, to go over our server deployment and integration. How SPIRE works in general has already been covered before, by Andrew Harding and Evan Gilman. We have the SPIRE Server cluster in the data center, and we have the SPIRE Agents deployed on every single host, attesting with SPIRE Server and syncing on which registrations they're responsible for. The SPIRE Agents are baked into an enforced image fleet-wide to ensure that they're present. Next to the SPIRE Server we have a low-level registrar that we own.
[Some things otherwise] would not be able to get identity. For instance, that would include orchestrators of workloads: somewhere there's that bottom-turtle orchestrator that's deploying the other orchestrators. That bottom-level thing needs to get an identity from SPIRE, so the low-level registrar will usually make the registrations for those. After that, the goal state is that these orchestrators manage the registrations for the workloads that they actually own.
We also have things like health checkers at the infra level. For instance, when a host comes up, they make sure that the SPIRE Agent on it has actually attested with SPIRE Server; when we're bringing a host down, for repair or just normal host lifecycle, they make sure that the agent gets evicted out of SPIRE Server. And then there are other special-interest services that just want to know various things about what's registered in SPIRE, for their own purposes.
[When owners do a] refactoring of their service, they shouldn't have to do a bunch of reading to figure out what it is they need to do. We shouldn't change their service behavior unless they want it changed. We have to allow a lot of customization, because every service has different needs, and we have to interoperate with the existing authentication and authorization at the company, keeping in mind that most services are containerized and we're limited to just a few languages.
Our solution for this was to create a library that wraps auth for services. So, of course, it wraps authentication with SPIRE Agent, or whatever authentication the service happens to have configured. It exposes functionality for using whatever identity has been assigned: for, say, signing various things, performing mTLS, or anything else where it's just useful for a service to have an X.509 with an associated private key.
This library also uses middleware to inject short-lived tokens into outbound requests. Those tokens have the SPIFFE ID of the calling service and are signed by that X.509-SVID. We also wrap all logic for authorization, and authorization decisions are made based on the calling SPIFFE ID and the called endpoint. Here's how this actually works.
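As a rough sketch of that outbound-middleware idea: all names below are hypothetical, and an HMAC stands in for the real signature (which, per the talk, is made with the X.509-SVID's private key) so the example is self-contained.

```python
import base64
import hashlib
import hmac
import json
import time

def mint_token(caller_spiffe_id: str, destination: str, signing_key: bytes,
               ttl_seconds: int = 60) -> str:
    """Build a short-lived token carrying the caller's SPIFFE ID."""
    claims = {
        "sub": caller_spiffe_id,              # who is calling
        "dst": destination,                   # who the token is intended for
        "exp": int(time.time()) + ttl_seconds # short lifetime, as described
    }
    payload = base64.urlsafe_b64encode(json.dumps(claims).encode())
    sig = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + sig

def inject_auth_header(headers: dict, caller_spiffe_id: str,
                       destination: str, signing_key: bytes) -> dict:
    """Middleware step: attach the token to an outbound request's headers."""
    headers = dict(headers)  # don't mutate the caller's dict
    headers["x-auth-token"] = mint_token(caller_spiffe_id, destination,
                                         signing_key)
    return headers
```

The point is that the business logic never sees any of this; the middleware decorates every outbound request on the workload's behalf.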
[...an SVID has been] fetched for it. Now, let's say that workload A wants to talk to workload B. For simplicity, we're going to say workload B is also onboarded with the security library; it doesn't have to be, it just makes things more convenient. So what workload A is going to do is send its request, as per its normal business logic, off to workload B, and what the security library is going to do on behalf of workload A [is inject] tokens, just based on our own needs. On the workload B side, we're not touching the workload B business logic at all. At this point, the security library on workload B's side is going to check: does the signature on the token match? Am I actually workload B, or did I somehow wind up getting a request or token intended for some other destination?
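Those receiver-side checks might look roughly like this. Again the names are hypothetical, and an HMAC key stands in for the SPIRE trust material the real library would verify the X.509-SVID signature against.

```python
import base64
import hashlib
import hmac
import json
import time

MY_SPIFFE_ID = "spiffe://prod.example/workload-b"  # this workload's identity

def check_token(token: str, trusted_key: bytes) -> dict:
    """Return the token's claims if it is valid for this workload, else raise."""
    payload, sig = token.rsplit(".", 1)
    # 1. Does the signature on the token match?
    expected = hmac.new(trusted_key, payload.encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sig):
        raise ValueError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(payload))
    # 2. Am I actually the intended destination?
    if claims["dst"] != MY_SPIFFE_ID:
        raise ValueError("token intended for another destination")
    # 3. Is the short-lived token still fresh?
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims
```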
If workload B is onboarded to SPIRE, that can include just verifying against the distributed SPIRE trust bundle. Then: is this SPIFFE ID allowed to call me on this endpoint? That's just authorization policies that are part of the library. And finally, assuming all of that passes, we finally make it to workload B's actual business logic, the code that the owners of workload B are actually worried about, and it just processes the request as normal.
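The per-endpoint authorization decision described here can be pictured as a simple lookup keyed on the called endpoint and the calling SPIFFE ID. The policy table and IDs below are entirely made up for illustration.

```python
# endpoint -> set of SPIFFE IDs allowed to call it (hypothetical policy table)
POLICIES = {
    "/rides/create": {"spiffe://prod.example/workload-a"},
    "/rides/admin":  {"spiffe://prod.example/ops-tool"},
}

def is_authorized(caller_spiffe_id: str, endpoint: str) -> bool:
    """Allow the call only if the caller's SPIFFE ID is listed for the endpoint."""
    return caller_spiffe_id in POLICIES.get(endpoint, set())
```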
There might be a question of: well, we're injecting these tokens, why are we not just using JWT-SVIDs? For one thing, X.509s just work better with our existing services today; if everything's just using the same standard, it just makes it easier to create the...
There's also the issue of resiliency. If, for some reason, a SPIRE Agent were to become even temporarily unhealthy, then callers would not be able to get their tokens anymore (ours are much shorter-lived than the JWTs, generally), and receivers of requests would also not be able to verify: if their SPIRE Agent is unhealthy, they wouldn't be able to validate JWTs anymore.
Moderator: Yeah, Andrew, can we read the questions?
Andrew Moore: Yeah. So there's "Curious why you chose to use a library architecture versus a sidecar" — I just answered that. And then someone had a string of things asking why we're not using other projects that might be available for Kubernetes. We're not on Kubernetes at Uber today, so we're not using something like Envoy, and we're not using other solutions like that, so the library architecture just worked out better.
[What we're] using is an in-house solution today, but we might move on later. If, for instance, we do migrate over to Kubernetes, then we might use something that's more compatible there, rather than an in-house solution.
There's "Roughly how many different trust domains are you operating, and how much traffic do you have crossing trust domain boundaries?" So our team is responsible for just two trust domains, and only one of them is a real production trust domain; the other is just used for testing and development. And then we do have some cross traffic with the team that is in charge of the personnel identity trust domain.
So we do have to handle verification of personnel identities crossing over and contacting things onboarded to the SPIRE trust domain. I don't think we have any other major trust domains today; that doesn't mean we won't in the future. Uber is also very enthusiastic about partnering with and acquiring companies, so we might have federation in the future, things like that, who knows, but not today. "Does the receiving end validate that the SPIFFE ID in the JWT matches the SPIFFE ID in the signing X.509-SVID?" Yes.
"So these thousands of workloads are mostly managed in the single domain; has that caused any challenges?" No, that hasn't caused any issues at all. Really, the only thing that's come up was our initial work to split out a test domain versus the real production domain. Beyond that, no; having all...
[Asked whether the security library is open source:] No, it is not. There's a lot of custom stuff related to Uber in it, and I would theorize that if we were to strip out everything custom to Uber in there, it would be a pretty trivial library, because ultimately you are just contacting SPIRE, getting your SVID, creating tokens, and injecting them into outbound requests. So that'd be more an exercise for the audience, I think, not something that we're going to open source anytime soon. Yeah: "Is SPIFFE the only identity management framework you're using at Uber?"
"If not, is there a transition to SPIFFE?" Right. So we do have some services, planned for deprecation or decommissioning, which didn't start out with the SPIFFE standard because it didn't exist yet, and we [have] our security library, which handles calls related to services onboarded to it; that kind of takes them over to being on the SPIFFE ID standard.
"Do you use any service mesh? What kind of topology do you use for multi-cluster federation?" So right now we're using SPIRE as the basis to stand up a secure service mesh. We do have other things currently implemented that are entirely in-house, but we're [moving] over to more secure things based on the SPIFFE framework. And we don't use federation today, but we could in the future.
"What were some of the bigger challenges you had in uptake of SPIFFE/SPIRE?" I'm going to assume you mean uptake in the onboarding of customers, Don Ross. So, challenges were, of course: everyone that's currently using the legacy software for authentication wants to just continue using that forever, so [why] are we changing anything at all? Beyond that, we knew that there would be a lot of uphill battles if we made onboarding difficult. So that's why we've made a lot of the upfront effort to make this security library as easy to onboard onto and configure as possible, and to make sure all our internal guides related to it are as straightforward and easy to understand as possible.