From YouTube: Securing the Cloud with ZFS Encryption by Jason King
Description
From the 2019 OpenZFS Developer Summit
slides: https://drive.google.com/open?id=14uIZmJ48AfaQU4q69tED6MJf-RJhgwQR
B
You know, we use ZFS, obviously. The big request that has come up from customers is protecting data at rest on all these systems, and so the obvious answer is to use ZFS encryption. But, as with many things dealing with encryption, one of the big challenges is then: okay, how do you manage the keys for all of that? And so that's what I and some others have been working on: how to deal with that within Triton and SmartOS.
B
So there's the KBMAPI, which is the service, and then on each compute node there's a daemon. And the key part, which is kind of the interesting part, at least what people find useful, is that we're using PIV tokens, mostly YubiKeys, but anything that implements the PIV standard, basically to protect the key for the zpool. One thing that we've done is: on each compute node, we just use a single key for the entire zpool, and so we encrypt the entire zpool, not just individual datasets, because we use so much snapshotting and cloning.
B
All of our images for containers and for virtual machines are basically ZFS send streams that we clone and snapshot, so trying to get any finer-grained just doesn't really buy you anything and becomes incredibly complex to try to manage. Just for those that missed it: with ZFS encryption, you need that single encryption root for all your clones and whatnot. And so the PIV token, I liken it to something like two-factor authentication; some crypto purists may disagree that that's the best conceptual way to think about it.
B
But basically, you use that to encrypt the actual symmetric key that you're using for the pool. That way, only the thing with the private key can decrypt it. In this case, I use the terms PIV tokens and YubiKeys interchangeably, just because there's some terminology overload, so it makes it a little easier to keep straight. Basically, on the devices themselves, they have public and private key pairs, and so we use one of those keys.
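The wrapping idea described above can be sketched as follows. This is a toy illustration, not the Triton implementation: it uses textbook RSA with tiny, insecure parameters purely to show the shape of the scheme, where the token's public key wraps the pool's symmetric key and only the private key, which in reality never leaves the PIV token, can unwrap it.

```python
import os

# Toy textbook RSA with tiny, INSECURE parameters, purely to
# illustrate the shape of the scheme: the token's *public* key
# wraps the pool's symmetric key, and only the *private* key
# can unwrap it.
P, Q = 61, 53
N = P * Q   # public modulus (3233)
E = 17      # public exponent
D = 2753    # private exponent (E * D == 1 mod lcm(P-1, Q-1))

def wrap(key: bytes) -> list[int]:
    """Encrypt each key byte with the public key (done on the host)."""
    return [pow(b, E, N) for b in key]

def unwrap(wrapped: list[int]) -> bytes:
    """Decrypt with the private key (in reality, done inside the token)."""
    return bytes(pow(c, D, N) for c in wrapped)

pool_key = os.urandom(32)          # stand-in for the pool's symmetric key
stored = wrap(pool_key)            # this ciphertext is safe to store on disk
assert unwrap(stored) == pool_key  # only the private-key holder recovers it
```

A real deployment would of course use the token's full-size RSA or ECC key with proper padding; the point here is only the asymmetry between wrapping and unwrapping.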
B
So basically, during the boot process, what happens is, early on, you set up the administration network before we import any of the storage. Then, if the pool is encrypted, it will contact this KBMAPI service to request the PIN for the YubiKey that's on the system, basically to provide the PIN to unlock the pool. And with that, the request itself is actually signed by the YubiKey, so the token has to be present to even get the PIN.
B
So someone can't just say, "oh hey, give me the PIN to this server"; the token actually has to be present. Then once it has that, it can decrypt it, you can load the key, and now that the pool is already imported, we can load the key, mount up all the file systems that are there, and proceed on normally. Let me check my notes here... okay.
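The signed PIN request can be illustrated with a minimal sketch. The function names and the shared-secret HMAC here are assumptions for illustration; the real flow uses a signature made by the YubiKey's private key, but HMAC from Python's standard library shows the same idea: the service only releases the PIN when the request proves possession of the token's secret.

```python
import hashlib
import hmac
import os

# In the real system the token's signing key never leaves the device;
# here a shared HMAC key stands in for it.
token_secret = os.urandom(32)

def sign_request(body: bytes) -> bytes:
    # Done by the party holding the token: sign the request body.
    return hmac.new(token_secret, body, hashlib.sha256).digest()

def release_pin(body: bytes, signature: bytes, pin: str):
    # Service-side check (hypothetical shape): only a request signed
    # with the token's key gets the PIN back.
    expected = hmac.new(token_secret, body, hashlib.sha256).digest()
    if hmac.compare_digest(expected, signature):
        return pin
    return None

body = b"GET /pivtokens/<guid>/pin"
assert release_pin(body, sign_request(body), "123456") == "123456"
assert release_pin(body, b"\x00" * 32, "123456") is None
```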
B
This protects from... so you have to worry about the PIN leaking. And then we generate a random PIN, we register it with the service, we generate the pool key and create the ebox to store it, and then we create the zpool. What we do for the ebox itself: it's just a serialized structure, and we basically encode it and store it as a user property on the root dataset of the pool.
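Storing a serialized structure in a ZFS user property can be sketched like this. The property name, field layout, and GUID value are invented for illustration; the real ebox format is defined in the design documents mentioned at the end of the talk.

```python
import base64
import json

# Hypothetical ebox contents: the wrapped pool key plus the metadata
# the talk mentions (the token's GUID and recovery material).
ebox = {
    "version": 1,
    "guid": "55AA00000000000000000000",  # PIV token GUID (placeholder)
    "wrapped_key": base64.b64encode(b"\x01" * 48).decode(),
    "recovery": {"parts": 10, "threshold": 5},
}

# Serialize and encode so the value fits in a ZFS user property,
# e.g. `zfs set com.example:ebox=<blob> pool` (property name invented).
blob = base64.b64encode(json.dumps(ebox).encode()).decode()

# Reading it back at boot is just the reverse.
decoded = json.loads(base64.b64decode(blob))
assert decoded == ebox
```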
B
So then we can just read that, and since it's all encrypted, obviously without the YubiKey, without the PIN, you can't get it; you can't decrypt the zpool key without that. It also contains some additional metadata: it records the GUID of the token, as well as some bits for recovery, which is the other piece, and the one that has been the really long part of the work. Because an obvious problem with that setup, of course, is: what if the token gets lost, or gets damaged, or just gets erased?
B
And then what do you do? So one thing we do have is a recovery procedure, where basically we do a form of key escrow. When we create that ebox that I talked about, we actually create two copies of the key: one protected by the PIV token, and the other one we split into n parts, where the value of n is decided by the operator, along with a threshold value m, which can be less than or equal to n.
B
The idea is that the key is split up into n parts, and you need at least m of them to recreate the key. The idea, then, is that employees, or, you can have sort of a break-glass type thing, depending on what the operator wants to do, where the key is split up and protected by PIV tokens that are assigned to people, like personal YubiKeys. And then what happens is there's a challenge/response process.
B
So if that does happen, you say, "I need to recover this compute node," and it'll give you a number of challenge phrases based on the configuration. Then you get however many people you need, based on whatever policy you've set, and they take the challenge string; there's software that they run on their laptop or desktop with their own personal YubiKey. They pass it in and provide the PIN for their personal YubiKey.
B
It gives a response, and once you have enough of those, it can extract the key, unlock the pool, and things can boot up. Of course, at that point, you can replace the YubiKey with a new one, or whatever you need to do. That way, if the token that's on the box is damaged, your data is not gone, although you should always still have a separate disaster-recovery backup plan; obviously this isn't a substitute for that.
B
Otherwise, losing the token would leave all the data inaccessible. The other bit about that: the split is done using a Shamir secret sharing scheme, if you're familiar with that. The idea, of course, is that if, say, you have ten parts and your threshold is five, to pick some numbers, then even having four of those doesn't give you any information about the final key. You have to have at least five, so you can't get, like, part of it.
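A minimal sketch of Shamir secret sharing over a prime field shows why four of five shares reveal nothing: the secret is the constant term of a random degree-(m-1) polynomial, and fewer than m points leave that constant term completely undetermined. This is an illustrative implementation, not Triton's.

```python
import random

P = 2**127 - 1  # a Mersenne prime large enough for a small secret

def split(secret: int, n: int, m: int) -> list[tuple[int, int]]:
    """Split `secret` into n shares with threshold m."""
    # Random polynomial of degree m-1 with the secret as constant term.
    coeffs = [secret] + [random.randrange(P) for _ in range(m - 1)]
    def f(x: int) -> int:
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def combine(shares: list[tuple[int, int]]) -> int:
    """Recover the secret from >= m shares: Lagrange interpolation at x=0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * -xj % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

key = random.randrange(P)
shares = split(key, n=10, m=5)
assert combine(shares[:5]) == key   # any 5 shares recover the key
assert combine(shares[3:8]) == key
# 4 shares interpolate some other polynomial; matching the real key
# is overwhelmingly unlikely.
assert combine(shares[:4]) != key
```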
B
Behind all that, that's also just trying to protect the key. And then, of course, new people come in, people leave organizations, people lose their YubiKeys. So the other thing we had to do is build all this plumbing so that you have this policy: how many parts, your threshold value, which tokens you use, and the ability to update it when changes occur because of all those events.
B
That's also, if you saw some of the discussions about using channel programs, part of why we're looking into those. Since we're storing this as a property, being able to essentially atomically update a property, change the pool key, and do things like that, then make sure that it either happens or it doesn't, means you're not left in some intermediate state that then requires manual cleanup, just because that's never fun. So that's the basic, very high-level overview.
B
You
can
get
way
off
into
the
detail,
so
so
that's
all
the
I
just
what
a
cover
just
with
the
presentation.
The
two
links
here
are:
some
of
the
design
are
the
two
design
documents
that
go
into
much
more
detail.
The
first
one
is
a
more
theoretical
abstract
in
terms
of
the
concepts
RFD
77
173
is
more
into
the
details.
One
thing
we're
late
to
this
is
boom
bees
actually
intended
to
do
more
outside
of
just
managing
ZFS
keys.
B
But that's all I was going to talk about for this, just because it's probably the most relevant part for people here. You can freely browse those, and, like I'm saying, they go into far, far more detail. So that's basically it; like I said, I tried to keep it, hopefully, short. So, if there are any questions...
A
Cool, thanks Jason. It's really cool to hear about. You know, I think we design a lot of stuff, including the encryption stuff in ZFS, to be able to do anything, but a lot of times it takes a bunch of work on top of that to make it solve real problems. So it's really cool to hear how you've done that with the encryption stuff. We do have time for questions, if anyone here has them; I know there are some folks online.
B
And, I mean, usually it's kind of a two-factor setup, but using the YubiKeys seems to interest most people; that's kind of also maybe the most, I'd say, different, or maybe unique, thing about it. And again, the threat models here were things like: someone steals the drive, someone steals the server, or you dispose of the drives, and you're just trying to protect the data. Obviously, there are threat models that this doesn't cover.
B
It's usually the bits... I've done demos before, although they're kind of anticlimactic, because the whole point of all this integration is that it should pretty much look and feel just like the encryption wasn't there; it's hopefully all in the background and things just work. So when I demo it, it's like: okay, I set up the compute node, tell it it's encrypted instead of not, and it sets everything up. Aside from doing, like, a `zfs get` or whatnot, it all looks the same.
A
One question, which is: so I get that each compute node has the YubiKey, like, basically permanently installed in it, so if it reboots, it's going to request the PIN from the PIN server. How do you manage the PIN server? Like, if that dies and reboots, does it also just come back automatically, or is that more of an event where it needs manual intervention?
B
Obviously,
there's
security
implications
for
that.
We
may
look
into
other
techniques
to
alternate
to
secure
that
the
head
node.
Just
because
were
the
reasons
for
choosing
UV
keys
versus,
say,
like
an
HSM.
Is
that
they're,
like
an
order
of
magnitude
cheaper,
you
know
you
be
keys,
are
40
50
bucks
versus
500
each
which,
of
course,
you
know
times
a
few
thousand
machines
or
more?
What
adds
up,
but
you
know
in
terms
of
the
headnotes
and
sets
you
know
one
or
two
machines.
B
So you could manage or protect things there in a way that you maybe don't use for the rest of the nodes, and there is some discussion about that, actually, in RFD 77. Again, it's just trying to optimize, because obviously you don't want encryption to be like: oh yeah, but it's going to cost you an extra thousand dollars per compute node, yeah.