From YouTube: SPIFFE at GitHub - Eric Lee
Description
We've been rolling SPIFFE out internally at GitHub to empower teams to manage interoperable Production Identity documents. In this talk we'll give a brief overview of how we've deployed SPIRE and leveraged its plugin system to integrate with our internal systems and tooling.

GitHub's mission is to be the home for all developers. We are the world's largest source code repository hosting company, and hosting Git is also something we do.

The goal of this talk is to provide something of a practitioner story: a team trying to make this available internally at a company that has around eleven years of infrastructure.

GitHub is not fully running on a public cloud, but it runs on multiple public clouds, some of which may start with the letter A, and I really want to talk about two implementation details of how we operate SPIRE today.

The first is how we operate our agents, and the second is how we generate custom node selectors to support registration entries for vending SVIDs to workloads. I'll try to wrap it up with some takeaways, learnings, and outcomes that we've achieved on the team. And as a full disclaimer: this is how we do it, not how to do it. I'd like to thank Ben Bury from my team, who reminded me to give this disclaimer, because we don't want to present our work as what you should do.

In the past two years we've been ramping up the product offerings and generating more traffic worldwide. We've taken measurements internally of TCP flows, and they show roughly linear growth in internal traffic. So there are more things talking to each other inside the DMZ than before, and there are more data centers than before.

On top of that there's been a lot of hiring, and net-new services, products, and packages. I'm trying to remember what I can and can't talk about, but go look at the changelog; it's very well written. There are a lot of things coming out, and there's a lot of software behind what's coming out.

In addition, the past two years have brought acquisitions: GitHub was acquired, and npm and Semmle joined us. These acquisitions bring their own infrastructure, their own opinions, their own systems, and so we took great pains to be sympathetic to how our colleagues are coming to the organization and how they want to work with us.

So how do we plan for all this variation in what runs at GitHub and where it runs? We were initially interested in SPIFFE because it's extensible and open. For example, I think Evan and Andrew's presentation talked about the upstream CA plugins, of which SPIRE is itself one. We run Vault internally, and we don't necessarily want to build a parallel PKI infrastructure just to support SPIRE.

And point three, which I think didn't actually make the slide, is that we have workloads sitting behind L4 load balancers. So some groups use just JWT-SVIDs, some groups use X.509-SVIDs, and we use both. We can support your use case whether or not you're mediated by a load balancer.

Talking a little bit about the approach: good tools have gradations of power, and we're trying to make our platform offering as modular as possible. Visually you could think of it as a pyramid where, as you go up, the area shrinks, and that narrowing is curation.

A
So
at
the
very
bottom
we
have
these
interfaces
of
x509,
svid
and
jadasvid,
where,
if
teams
were
to
actually
conform
to
these
themselves,
they
could
potentially
just
be
in
spec,
because
this
is
an
open
standard.
This
is
more
of
a
utility
than
a
strategy
component
of
how
github
uses
technology
and
in
the
center
we
want
to
be
the
team
that
operates.
A
centralized,
spire
infrastructure
stands
up.
That team stands up the servers, manages the data store, manages the infrastructure automation for agents, and provides a workload API out of the box for teams in whatever execution environment they're running in, for teams that don't necessarily want to deal with raw infrastructure. At the very top, where we hope to land almost everybody, is development tools: shared libraries and packages. We've also, like I think everybody in industry has at one point or another, developed a sidecar; it's kind of like making your own web framework in 2020, everyone just does it.

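To make that "workload API out of the box" tier concrete, here is a minimal sketch of what a consuming workload can do. It assumes the open-source go-spiffe v2 library and an illustrative agent socket path; neither is taken from the talk, and this is not GitHub's internal library.

```go
// Minimal sketch: a workload fetching its X.509-SVID from the
// Workload API socket that the platform provides.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

func main() {
	ctx := context.Background()

	// The socket path is an assumption; it is whatever path the
	// platform exposes in the workload's environment.
	source, err := workloadapi.NewX509Source(ctx,
		workloadapi.WithClientOptions(
			workloadapi.WithAddr("unix:///run/spire/sockets/agent.sock")))
	if err != nil {
		log.Fatalf("creating X509Source: %v", err)
	}
	defer source.Close()

	// The source keeps the SVID rotated in the background; grab the
	// current one and print its SPIFFE ID.
	svid, err := source.GetX509SVID()
	if err != nil {
		log.Fatalf("fetching X.509-SVID: %v", err)
	}
	fmt.Println("workload identity:", svid.ID)
}
```
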
We've developed an external-authorization-speaking sidecar, external authorization as in Envoy, so we can actually use Envoy to inject and validate JWT tokens coming in and out of your service.

This use case is particularly applicable to dynamic languages, where we may not want to go too deep into the app. We don't want to do too much surgery on the workload, or be intimately involved in the internals of something whose authentication we're really just trying to mediate using SPIFFE.

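The talk doesn't show the sidecar itself, but conceptually an ext_authz-style check boils down to validating a presented JWT-SVID against the trust bundle. The following is a hedged sketch of that idea only, again assuming go-spiffe v2; the socket path, audience, and listen address are made up and this is not GitHub's sidecar.

```go
// Sketch of an ext_authz-style check endpoint that validates a
// bearer token as a JWT-SVID using the local Workload API.
package main

import (
	"context"
	"log"
	"net/http"
	"strings"

	"github.com/spiffe/go-spiffe/v2/svid/jwtsvid"
	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

func main() {
	ctx := context.Background()

	// JWTSource keeps JWT trust bundles from the Workload API up to
	// date; the socket path is an assumption.
	source, err := workloadapi.NewJWTSource(ctx,
		workloadapi.WithClientOptions(
			workloadapi.WithAddr("unix:///run/spire/sockets/agent.sock")))
	if err != nil {
		log.Fatalf("creating JWTSource: %v", err)
	}
	defer source.Close()

	http.HandleFunc("/check", func(w http.ResponseWriter, r *http.Request) {
		token := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		// Validate signature, expiry, and audience against the bundle.
		svid, err := jwtsvid.ParseAndValidate(token, source, []string{"example-audience"})
		if err != nil {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		// Pass the verified caller identity upstream.
		w.Header().Set("X-Client-Spiffe-Id", svid.ID.String())
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}
```
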
So, take one: we initially started experimenting with SPIRE running in Kubernetes.

We run Kubernetes internally, and we wanted to leverage that team's good work and all of their gains in reliability and operability, rather than managing VMs and metal ourselves. The reference architectures we've seen in the community are Kubernetes Services to run the SPIRE servers, and agents running as DaemonSets, one per node.

We observed some issues after kicking this around for the first month or so, in particular with agents. DaemonSets can't be made highly available.

They're unbounded in downtime between deploys, and you're relying on the kube scheduler to place a pod to replace the DaemonSet's pod, so that was a challenge.

Workloads also can't rely on SPIRE being available at startup, because of this non-determinism related to the scheduler. So all workloads, or whatever curation we provide to users, would have to implement some sort of retry or blocking mechanism to poll or wait for the workload API. Not the end of the world, but another small piece of complexity, rather than relying on the invariant of a workload API being there, ready and waiting for your workload on startup. And something that's kind of a subtlety is the dual of that race.

So we avoid the problem of pod scheduling order by avoiding the Kubernetes scheduler entirely: we make SPIRE part of the second-party software we lay down on a kube node before the kubelet starts to take work from the API server.

This mitigates some of the race conditions. It's probably still good, resilient practice to poll or wait for the workload API, but the problem is largely mitigated by just making sure the workload API is resident before the pod is started. And the dual maintenance goes away, because everything is just one set of infrastructure automation. That is actually the systemd logo; I went into Google image search. I think it's a green light being pointed to, or maybe it's the letter A, or maybe it's the missing VCR button, I don't know.

One last thing: systemd allows an ordering of units, so we can say that, for workload attestation, we would like to start after the kubelet starts.

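As a minimal sketch of that unit ordering: the unit name, binary path, and config path below are assumptions for illustration, not GitHub's actual automation.

```ini
# /etc/systemd/system/spire-agent.service (illustrative)
[Unit]
Description=SPIRE Agent
# Order the agent after the kubelet so workload attestation does not
# log spurious "cannot reach the kubelet" errors at boot.
After=network-online.target kubelet.service
Wants=network-online.target

[Service]
ExecStart=/opt/spire/bin/spire-agent run -config /etc/spire/agent/agent.conf
Restart=on-failure

[Install]
WantedBy=multi-user.target
```
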
That way there's no false signal of errors about being unable to contact the kubelet, things like that. Be kind, rewind, yeah.

So if we're running this as a systemd unit, how do we expose it to pods? A redacted, modified version of the wall of YAML for a deployment is on the left: we take the underlying domain socket, put it into a volume, and simply mount that into the container within the pod.

Kubernetes is kind of a matryoshka doll, but you know, pods live in templates inside of deployments; that's what this illustrates. Essentially the punch line is that the view from within the pod is identical to a workload running on metal or a VM.

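The slide itself isn't reproduced in the transcript, but the few lines it describes amount to a hostPath volume plus a volume mount. A sketch, where the names and the socket directory are assumptions rather than GitHub's actual values:

```yaml
# Illustrative only: names and paths are assumed, not GitHub's.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: app
          image: example-service:latest
          volumeMounts:
            # The few lines teams add: mount the agent's domain socket
            # directory from the node into the container.
            - name: spire-agent-socket
              mountPath: /run/spire/sockets
              readOnly: true
      volumes:
        - name: spire-agent-socket
          hostPath:
            path: /run/spire/sockets
            type: Directory
```
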
We also don't use mutating webhooks or any sort of pre-deploy machinery to place these. Our experience has been that just instructing teams to add these few lines for the volume, and guaranteeing that whatever cluster they're running on provides this domain socket, gives us a lot of mileage and avoids a lot of magic. Folks have told us they appreciate the transparency of how things work and what's actually going on.

The second thing I wanted to talk a little bit about today is generating custom node selectors. As I said earlier in the talk, GitHub runs in multiple clouds. We use multiple container orchestrators, and containers outside of orchestrators, which is also a lot of fun.

We run bare Docker for some workloads. The consequence of this is that we can build a service once and run it N ways in M places. If we think about how to vend identity to all of these workloads, there's one dimension that's the same, which is the selectors we can gather about the workload using workload attestation, whatever workload attestation mechanism we're using.

We had to do a little bit of extra work to propagate notions similar to what you get out of the out-of-the-box node attestors into our selector library.

One thing I can share about how we run our sites is that machines have their own per-machine certs. So we can leverage the x509pop node attestor and use some of that cert and key material to pull some notion of, and verify, the identity of something trying to phone home to the SPIRE server.

Using agent_path_template in the x509pop node attestor, with a little bit of Go templating, we pull out the common name from the per-machine cert, which does contain the fully qualified domain name of the machine we're trying to bootstrap an agent on. So we go from a spire-agent SHA-1 hash to an actual fully qualified domain name when you do a spire-server agent list. Let's keep going.

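On the server side that looks roughly like the following; this is a sketch based on the upstream x509pop plugin's documented agent_path_template support, and the CA bundle path and template value are assumptions, not GitHub's actual config.

```hcl
# Illustrative server-side NodeAttestor block; values are assumed.
NodeAttestor "x509pop" {
    plugin_data {
        # CA bundle used to verify the per-machine certificates.
        ca_bundle_path = "/etc/ssl/certs/internal-machine-ca.pem"

        # Template the agent SPIFFE ID from the cert's common name
        # (the machine FQDN) instead of the default fingerprint.
        agent_path_template = "{{ .PluginName }}/{{ .Subject.CommonName }}"
    }
}
```
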
The consequence of having only this one verifiable piece of information, because the server doesn't necessarily trust the claims of the agent during node attestation, is that we have to key off of this one datum, and from the server side consult some other trusted API to gather more information about the agent and where it runs. Is it a Kubernetes node? Is it a file server? Is it, you know, a bastion machine?

Things like that are not something we can take at face value from the node attestor, so we actually had to write a custom node resolver that pairs with the x509pop node attestor. We bundle it as an OS package, because the interface is just protobuf and gRPC, and we provision the server with knowledge of an allow list of what metadata to pull back, because every node has a set of metadata and, for the purposes of registration entries, we only care about a subset of it.

The real result of this is that we get extra selectors, specific to GitHub and how we run infrastructure, for use in registration entries. As a reminder, registration entries can be written with workload selectors or node selectors; that's the "what application" piece and the "where it runs" piece of how you vend identity.

This is a snippet of configuration to illustrate what my mouth noises actually mean, in terms of what it would look like on a SPIRE server. What I've tried to highlight is that this is also x509pop.

So we're not arbitrarily executing random things on the system path, and the way we distribute this plugin command is just as a base OS package, using our internal packaging machinery. Below that, the plugin data is interpreted in our own code for the node resolver: it's the allow list of node attributes to pull in and turn into selectors, because if we pulled everything in, there's no actual utility in knowing, say, what type of rack switch a machine is connected to; things like that are just superfluous.

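The actual snippet isn't in the transcript, so here is a hedged reconstruction of the shape such an external-plugin stanza could take; the resolver name, binary path, checksum placeholder, and attribute names are hypothetical, not GitHub's configuration.

```hcl
# Illustrative only: "gh_registry" and its attributes are hypothetical.
NodeResolver "gh_registry" {
    # The resolver is an external plugin distributed as a base OS
    # package; the server talks to it over gRPC/protobuf.
    plugin_cmd      = "/usr/libexec/spire/noderesolver-gh-registry"
    plugin_checksum = "<sha256 of the plugin binary>"

    plugin_data {
        # Allow list: only these attributes from the internal registry
        # API are turned into node selectors.
        allowed_attributes = ["site", "role", "environment"]
    }
}
```
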
An example of this: the one verifiable claim, the common name out of the machine cert, is used as the key into our internal registry API to pull out these other selectors, which I've highlighted in white with a kind of canary-yellow box.

It looked different when I was making it, but the point is the same: it's all highlighted there, and you can see these are prefixed similarly to how we structure the node resolver. So it's a gh prefix rather than x509pop:subject or x509pop:ca, and these are now available to us for writing registration entries, in addition to workload selectors.

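For illustration, registration entries keyed on such selectors could be written roughly like this with the stock spire-server CLI; the trust domain, SPIFFE IDs, and selector values here are made up, and the "gh_registry" prefix matches the hypothetical resolver name above.

```sh
# Hypothetical values throughout.
# Alias a set of nodes by the custom selectors the resolver produced
# ("where it runs").
spire-server entry create \
    -node \
    -spiffeID spiffe://example.internal/node-group/file-servers \
    -selector gh_registry:role:file-server \
    -selector gh_registry:site:dc1

# Vend an SVID to a workload on those nodes, keyed on a workload
# attestation selector ("what runs"), under that node alias.
spire-server entry create \
    -parentID spiffe://example.internal/node-group/file-servers \
    -spiffeID spiffe://example.internal/service/uploads \
    -selector unix:uid:1000
```
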
Some other observations we've made, on universalizing authorization: I think this is something we talk a lot about in the community. SPIFFE is for authentication, and authentication only, having no opinion about authorization, and that has actually been a point of leverage for us, because systems and teams have either invented their own, may have an interest in standardizing, or may not have an interest in standardizing, and fine-grained ACLs are not something we necessarily want to try to provide parity with.

In our opinion, at the top of the pyramid from one of the earlier slides, we really just want to give people documents they can verify are good or not good. There are no goals to build policy languages, or enforce policy languages, to replace what we have already. And I think I'm bumping up against time, but one other observation we've made is that being forced to write registration entries to identify the shapes of workloads is a forcing function for discussions about blast radiuses and security perimeters.

If you can't differentiate between two workloads cohabiting a machine with a registration entry, that means they have the same level of privilege, inappropriately so. Rather than looking at it as "oh, we can't isolate this from this other thing because they're running as the same user or they're in the same group," we inverted that and saw these exercises in discriminating between what gets which SVID as opportunities to improve our security posture.

That's invited some interesting conversations about everything from how we build machine certs to how we run certain things in Kube. So there's that, and I'm going to go to the last slide, which is just the silhouette, and stop sharing here.

I think we partner with teams, and it's situational. There are workloads that literally run in dual modes, so it actually necessitates them getting both documents even though it's the same executable; we have code paths where they're both libraries and they have main functions. So it depends.

You know, team to team to team. My perspective is that it's sort of like when you enter a code base with fifteen million unit tests, and you change one character and half of them break. Avoiding that kind of fragility, with wide enough registration entries, is probably the preferred approach, because if something is so brittle that you need a control loop to endlessly reconcile it by Docker image ID, or you need to keep track of nodes appearing and dying, then that's probably the wrong shape.

But every organization is different; mandates are different, priorities are different. We largely just partner with the teams, try to have the discussion, and facilitate what SPIRE can do for them.

That's great guidance, thank you, and certainly food for thought for the attendees. Thank you very much. To echo the last comment in chat: Eric, you're an awesome presenter. Nice work.