Cloud Native Computing Foundation Production Identity Day: SPIFFE + SPIRE North America 2021, 30 Oct 2021

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Secure, Efficient API Plane Traversal for Compute Resources on Exascale Super Comput... Tim Pletcher

Description

Don’t miss out! Join us at our next event: KubeCon + CloudNativeCon Europe 2022 in Valencia, Spain from May 17-20. Learn more at https://kubecon.io The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.

Secure, Efficient API Plane Traversal for Compute Resources on Exascale Super Computers Using SPIRE - Tim Pletcher, HPE

Cray Shasta exascale supercomputers use SPIRE for securing machine-to-machine control plane communication. In this presentation, we’ll discuss the requirements for securing critical communication on these supercomputers and why SPIRE was a good choice for this application.

A

um I'm tim fletcher, I uh I was at uh cray for a little over two years um uh during that time, hpe acquired us. uh While I was at cray. I was the software security architect for the shasta systems. um There were a few of us that are systems architects there, uh and so my remit was the access control framework uh that currently sits over the uh the shasta systems.

A

um I've since moved over to uh to the security engineering team at hpe, and I work with dan and a bunch of other folks in sunil, and I'm still involved with the hpc side of the house uh in a few dimensions. I'd like to say that, uh while I'm here, um there are a whole bunch of other folks that behind the scenes that worked um on this particular dimension of the implementation, uh zach chrisler kevin burns a whole bunch of folks. So uh so not a one-man show by any stretch of the means over there.

A

I'd like to start off talk a little bit about shasta systems. So if you're not familiar with hpc, it's basically a giant liquid cool computer at its highest end and while they used to be vertically integrated machines, they've evolved to be commodity.

A

Compute resources that are basically pulled together by a blazingly, fast network, interconnect and so you'll see that there is a lot of commodity in the compute nodes themselves, but, for example, our secret sauce at cray is the asic and the software that drives the network along with a few other things in the programming environment and the kernel module optimizations on the compute nodes themselves for network.

A

These machines are capable of of over an extra flop of computational uh activities, and I always like to write out the 10 to the 18th, because it makes me think about how fast and how much the number, how fast and how many numbers these things are crunching at any one particular time.

A

Historically, these machines will scale to tens of thousands of compute nodes and different installations will have different node counts associated with them right now. The slingshot network that that runs, shasta systems, network wise, is 400 gig, going to 800 gig with the next uh generation switches and cards that are coming down the road in a few years. It's a dragonfly network configuration. So it means that all the nodes are never more than a fixed number of hops away from each other and allows for a highly uh consistent and uh manageable high-speed network.

A

um There's truckloads of storage behind it um and uh and so yeah. They just they're just crazy right, they're data center sized computers, they consume power, measured in megawatts um and it's kind of funny when you think about it, because there are scenarios where the machines will have their own substation or you have to be careful about when you run them, because they'll brown out the neighborhood like the city, so they're really neat machines.

A

um So uh so one thing I would like to say about uh the the craze systems- and I guess I'll get to that. Our management system, our system management model basically is going to look fairly uh familiar to a lot of you or all of you.

A

We basically have a user access layer and that deals with access into the compute plane itself and that's provided either through the a container instance that's running on the management plan itself or via nodes that run off the the kubernetes management plane, but still have access into the api plan to run administrative functions.

A

um There's a there's, a bunch of functions around the compute nodes themselves. You need to manage the images you need to manage boots. uh You need orchestration of boot, configuration management for the nodes themselves and whatnot. Then there's a network management side, there's. Obviously two networks: one is the high speed network and one is the management network which is a fairly standard uh configuration and so you've got all the interaction that goes on there.

A

Hardware management is uh is, as you would see, in a lot of a large data center operations. um You got to deal with power management. You've got to deal with all the bmc endpoints firmware the whole nine yards and then, of course, there's a security, so uh personas uh non-person entities all have to be accommodated in the authorization, authentication and key management context.

A

So how does that work? um The the csm software the cray systems manage software is, I would say, a fairly generic and vanilla, um cncfe implementation, and that was by design. So we run on kubernetes we use sdo, we have vault, we have cert manager. I think any of this. You know you would take a look at and be like, oh yeah, of course, that makes perfect sense right if you're going to run an api plane in front of a big machine like this there's a whole bunch of different.

A

If you look around the side, there there's a whole bunch of different networks that uh that are present in this system. um They deal with the hardware, their vlan basically, and they deal with the hardware management side to get to the bmc's.

A

They deal with the api plane side to get to the services that need to run, and they deal with the customer facing side or the user facing side, whether that user facing side is for management or for high speed network access, and then we back it with dedicated seth hardware. Software.

A

So, along the way um we were, we have a gen one. uh Basically, when you think about the context of how um you need to transit the api plane um from a compute node, the the side where you're gonna have administrative traffic in and out is fairly straightforward, the the harder part for us because- and I say, fairly straightforward, because our access control framework uh basically is standalone, and so um we ship key cloak.

A

With this thing, you could basically stand the system up and run it completely in an air gap environment by itself, without talking anything else, and that was one of the original requirements, obviously because of where these machines run. So so our story around uh basic api interaction as an individual is pretty straightforward and good.

A

We use opa to deal with the aussie topic and key cloak issues of standard oidc tokens and away you go where that breaks down is that there are applications that run on the compute nodes. Obviously, that deal with platform operations, and so our gen 1 implementation of this was to basically use what key cloud calls a service account which is effectively just issuing a long-running oidc token and handing it out right, and it was a horrible implementation.

A

I'm still maybe a little bit embarrassed by it at the end of the day, but we had to get started somewhere um and we knew that we were going to have to build something specifically to accommodate this uh and, as it turned out, saitail and cray were acquired around roughly the same time and I started interacting with uh sunil and emiliano, and it became pretty clear pretty quickly that while we were pretty, we were very familiar with spiffy because of our use of istio.

A

We hadn't really considered spire, it wasn't on our radar screens and after a little bit of discussion internally, um the aha moment came and the the dots are connected and- and we didn't have to build anything- we really just needed to pick up the ball with fire and get going, and so we did and- uh and it was a great collaboration, I would say that it was probably one of the the easier implementation cycles I've been through in that whole process.

A

It took less than 90 days and we were up and running in the in the system um and so that included, uh that included the whole nine yards of applications talking across the uh the api plan.

A

So this is where we sit today um with the shasta csm. As you can see, there are obviously the key cloak is there for the uh the individual users and then the mpes are covered by spire.

A

Today, we'd use joint tokens to to test the compute nodes, and I'm going to talk a little bit more about that as we go forward um and uh and then the x as vids get issued and and away we go the applications.

A

Then, basically, trans are issued an oidc token from the spire server and they're evaluated appropriately as they come through. The api calls as they come through the gateway to point them at the right uh issuing server.

A

So um so this is our world and I'd say again, it's a fairly straightforward and standard implementation in most respects, with the addition of spire making our live lives a lot easier.

A

So where do we go from here? I'd say that the.

A

The state of the of the control plane or the access control framework um is good for a 1.0 implementation, but you can always improve, and- and we will seek to do that, and so the big place that we have a challenge is really that that node attestation in the compute plane, when the original, uh when the original specifications were cut for these shasta systems, uh tpms were not specified on the compute blades, and so we find ourselves in an awkward position.

A

They will be going forward, but we have machines that are fielded that do not have them, and so that's that's. You know objective number one is to get a better story than we have today with the joint token, um uh and then uh we really like the work that is going on with um with our effort to get uh spire and istio talking together uh in the community.

A

um We already have a central pki issuer uh system that runs in behind the platform um and uh and we're in the process of basically putting an operator in place to roll the certs from the issuer for istio. It would be nicer if we could just bolt in uh aspire to that task and call it done that'd be. That would be awesome.

A

The other thing that we're looking forward to is api driven workload, registration so as platform components come on uh the ability to register them automatically, as opposed to manually with a config file, which is what we do today um we'll be uh replaced, and that will be a good upgrade for the engineering teams to get up and running um the other uh topic that we're going to be looking forward to is federation.

A

So there are certain components that we have in the platform that are not baked into the csm software itself, so one of those would be the the fabric controller. The the slingshot software is a standalone system. It can run without csm, but the ideal scenario is that their their access control framework can be federated into ours, and we do that through spire. um That has been uh proposed. I don't know where they're at on that team side, but then that takes advantage of the federation capabilities, inspire which are excellent and pretty straightforward to implement.

A

As I said today, um so node attestation, uh I have the astute observer- might have noticed the use of joint tokens. Joint tokens are not ideal. um They they require you to to create an issuing mechanism that can be um a little bit challenging to have reach the security level that you'd really like it to, and we knew going in that this was going to be something that wasn't going to be our ultimate end. State.

A

um It just is what it is um so uh so we started work on the next, the the next phases of that, and we know where we're going to end up, I think, in the short term or in the medium term, and then in the longer term, um while we don't have the tpms in place or where they're not available.

A

So let's look at this. um Coal was kind enough to steal my thunder on the tpm attestation, so I'm going to skip through this one real quick, because I think you will look at this diagram and see something that's very much similar to what uh what he had presented. It's basically the flow to do the the uh the tpm based attestation from the compute node.

A

um So uh so again, uh so we'll go we'll go past this one for now um and then uh the next up for for us is probably going to be a move to the x 519 based uh node attestation, and so with that.

A

What we end up doing is injecting a certificate into the compute node at boot.

A

Time so recall that these are disclosed machines for the most part, and so so we have a process where we can inject that boot and the init rd phase, and we do that with uh with other payload components today, and so this will just end up adding the x name certificate into the mix, and then we start down the process of uh of the cert verification, generate the nonce and you go back and forth, and you end up with the uh with the spiffy id issued using the intermediate that the spire server holds, which that would which it has acquired from our pki as a service plant.

A

So that's up next for us we're also looking at the the acme ip based attestation. So this is another avenue that we can take.

A

It's still not where we'd like to be with the tpm side, but while we get to this process of of um the best that uh that we can, then this is uh something that we've we started to look at and I think we're watching the step, ca, acme server project. I think that's something that we hope gets there pretty uh pretty quickly and then we'll take a gander at that.

A

So uh so that's that's! That's fire in in a super computer. In a nutshell, we have time for questions. I want to leave a time for questions. If anybody has any, we appreciate the chance to share the experience that we've had. I would say again that you know if you haven't, if you haven't started working with spire. It really truly is a swiss army knife and it becomes more and more that as the plug-in universe gets bigger.

A

So I suspect this is going to be something that plays prominently in in a lot of engineering platform engineering going forward, so we're excited about it and we really appreciated. You know the timing that allowed this to come into the picture for us. So uh so with that I'll take questions.

B

All right fire.

A

B

C

Have you considered other hardware-based attestation methods.

D

Like keys or smart, like that, injecting the certificate.

A

uh Not really, in this context, um the uh the scale that you have to kind of do that, with, I would say, would be challenging.

A

Oh, tens of thousands.

A

It does kind of change the the the fleet management problem is real in in the large footprints with any of this type of thing. So so we try to. We want to find the way in the short term that that is most manageable for the system, administrators and because they have a big job and these machines run a lot. So we want to keep that that overhead low.

D

With respect to the tpms.

A

D

Any other ways to manage those at scale.

A

We are, we are working on some things um internally around that uh at hpe. It's a it's to me. It's that's the that's the killer, app right there.

A

um Historically, you know tpm operations and and when you mix them in with with with anything related to fleet management, maintenance events, you know any of that just is really painful, which is why a lot of people don't do it so um that to me in the context of of security engineering, especially when we start to really focus on hardware up attestation um is going to have to be dealt with, and so, when you look at some of the other things that are going on with platform certs and whatnot in spdm, uh you know the the need is gonna come for hardware to not only be um for every component in a box to be validated, and so so I suspect that this is going to come to the forefront more and more and when we do have people looking at that problem internally,.

B

Got a couple online questions here: oh.

A

I got one over there too.

B

Okay, can I come to you right after all, right.

C

Just one question this is from richard in our virtual audience richard asks: can you speak a bit to what types of scale you've seen with spire for large node boot events.

A

um I will say that uh it's improved, um we, we seem to uh there there's one scenario that um that dings us on the boot cycle, um so we've we've we've run up to, I think, probably close to- I want to say 8 000 nodes at this point, give or take a little bit, um and so uh one of the applications that runs in the context of our world is a heartbeat mechanism. So um so what what happens is like when you run through this boot cycle?

A

All these things start heartbeating almost immediately and it's kind of a thundering herd problem, and we see we supercomputers have thundering herd problems in a few different areas. So we end up actually with the heartbeat thing. Turning that off during the boot cycle- and I think we're we're meeting most of our targets with the with the boot time is today. It is a contractual requirement in the super real world to boot in a certain speed. So it's a it fixes. It's prominent in the discussion.

A

Did that answer your question.

B

It was an online question, so I believe so. Okay.

A

B

A comment from anne antu.

A

Oh, I think I might know you tell her. I said hello, yes,.

B

She would like to uh liaise with you following the talk. Oh.

A

B

Be good to see her, I am passing on now another in-person question.

D

I wonder if you were incorporating spire into the to the user or production workloads that you're running there and how that may tie into not.

A

Today, so so, a good way to think about these machines is that they're, big iis implementations, and so it's very it's a from an analogy perspective. It's just like hey you boot up, here's, your vpc and then workload manager. Software comes in and dispatches applications into the compute plane at runtime.

A

That's not to say that we haven't had requests from uh the customers on this for paz type services. I think I think, as you see more, um if you, as you see more modern approaches to to running jobs up in the compute plane itself, you'll start to see spire make its way in there, but we don't provide that as a service and the core platform.

D

A

This is all platform based. This is all basically think of it. As behind the scenes platform operations.

D

So I've been building.

B

D

My platform, kindle's kind of next level slaughter, scheduler.

D

This seems like a to do that interested.

A

D

Seeing how together to tie this story, yeah yeah, absolutely.

A

And I think there would be I'd be there I'd, be surprised if there was an interest from some of the labs community around this as well there's uh some of the labs. Actually, you know expose these machines to a lot of researchers and they come in and do their thing and so um secrets management has been one. That's been a topic of discussion, um you know, and we kind of said.

A

Well all right, you know that's these are all viable um and, as we get a little bit more mature in the platform, then we can start to look at that. The past topic um there's also you know the topic of some type uh of get service, so they could do git ops up in the compute plane with you know, whatever they're gonna run. So I think that's only a matter of time before it starts to show up, but um I think you're starting to see different approaches to workload management come in too right.

A

So you have singularity containers, you know it's it's going to change and it's a modernization. That's going on. I think across the board, in hpc.

B

Any further questions in person or virtually.

B

Going once going twice sold, thank you so much tim. That was awesome. Oh thank you. How cool is that.