Description
We explore the security model exposed by Rook with Ceph, the leading software-defined storage platform of the open-source world. Digging progressively deeper into the stack, we examine options for hardening Ceph storage that are appropriate for a variety of threat profiles.
A: Alrighty, so we're going to try to give you the light subject at the end, since you're already sleeping. Maybe the subject is not light, but it's a relatively short talk at least; I think the record is 17 minutes, and we're not trying to beat it, but we're not going to bore you too much. It's nice to be back in New York. For me it's especially special: the first time I spoke publicly about stuff in my career was eight years ago, in New York.
The last time I came out to see... [inaudible].
A: So it feels nice to be restarting here. I have been fortunate to spend nearly all my career in open source and, like almost everybody, have had terrible marketing managers; but I also had two excellent ones, and one of them gave me a lesson that I will never forget: it is marketing malpractice not to introduce yourself. So that's me in one slide; those are things I worked on.
B: Everyone, I'm Sage McTaggart, I use they/them pronouns, and I work on cybersecurity for Ceph at IBM. I did my undergrad at UMass Amherst and graduate school at UC Santa Cruz. I've done research in a wide variety of areas, ranging from programming languages to file systems and all sorts of other stuff, and I was working on security for Ceph and ODF at Red Hat as well.
A: So we both successfully escaped from academia, except I still have a taste for corduroy, apparently. There are slides introducing Ceph and Rook that we're going to jump over; they're built for a different audience. Those became clearly not applicable when someone asked the question about the AIX client, so let's jump straight into security. The big picture for security is that security practices harden a specific point of the infrastructure.
A: Cherry-picking practices without a model of the threat and of the attacker is just not a viable strategy. The joke usually goes that to get a really secure computer you have to cover it in concrete, shut it down, and throw it to the bottom of the ocean; but then it's not very useful. In other words, in practical terms, absolute security is not usable, and maybe not even attainable, so you have to define a threat model.
A: Are you facing script kiddies, or the GRU, or the dreaded privileged insider? These are very different scenarios. Some of these want to steal your data; others want to cryptolock your data and hold you for ransom; others may be satisfied with deleting things at random and the disruption that causes, like a transient denial-of-service kind of thing. You need to define what threats and personas you're protecting against, and what the priority is, so that you can pick your battles. If everything is a priority, then nothing is.
A: The public security zone is an entirely untrusted area of the cloud. It could be the internet as a whole, or just networks external to your cluster that you have no authority over. Data crossing this zone should make use of encryption. Note that the public zone, as I just defined it, does not include the storage cluster front end: the Ceph public_network, which sounds the same but is not the same, defines the storage front end and properly belongs in the storage access zone.
A: Now, going down the list, the Ceph client zone refers to networks hosting Ceph clients like the object gateway, the Ceph file system, or block storage. Ceph clients are not always excluded from the public security zone: the canonical example would be to expose the object gateway's S3 or Swift APIs in the public security zone, so that data can be retrieved from the outside.
A: Finally, the cluster zone refers to the most internal network, providing storage nodes with connectivity for replication, heartbeat, backfill, recovery and the like. This zone includes the Ceph cluster's back-end network, called the cluster_network in Ceph. Operators often run clear-text traffic in the cluster zone, relying on physical or VLAN separation of that network from all other traffic.
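As a sketch of how those two networks are declared in Ceph (the subnets here are placeholders, not recommendations):

```ini
# /etc/ceph/ceph.conf (or set via `ceph config set global ...`)
[global]
# storage front end: client <-> daemon traffic (storage access zone)
public_network = 192.168.10.0/24
# back end: replication, heartbeat, backfill, recovery (cluster zone)
cluster_network = 192.168.20.0/24
```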
A: That, going back to the previous example, would not be a valid choice if your threat model includes adversarial privileged insiders. These four zones are separately mapped onto networks, or combined, depending on your use case and the threat model you adopt, and so you have diagrams like these, where you can look at what services you have and what networks they span. Now, at the edges, you have components spanning the boundaries of two networks, and by definition, because they're spanning two zones with different levels of privilege, you need to secure them to the requirements of the highest-privilege network.
A: In many cases, the obvious thing to look at is the security controls: are things properly secured? Is there an obvious misconfiguration that was missed? Where possible, exceeding the security requirements at integration points is a good idea, which, given that this is a storage product, is easier than it would be on a generic operating system or a compute product.
A: The complete opposite would be, again, the example of the object gateway: it needs to access all the OSD nodes to get at the data, it needs to access the monitors to know where the cluster map is, and it will likely need to access the outside to serve data out. So we have all varieties in terms of the daemons that we have, but we don't have to apply the most permissive policy to everything.
B: So product security at IBM follows a secure development life cycle, with the goal of reducing risk and improving security for Ceph. We are always suggesting improvements; we're pen testing more regularly now; we're manifesting all of our dependencies, just like at Red Hat; we're reviewing vulnerabilities, tracking weaknesses and exploits, all that good stuff. We'll still be doing security releases and reviewing all new releases, just as before, checking everything for CVEs and vulnerabilities, except now we're at IBM and using their systems.
Previously, at Red Hat, this was split into two different roles, incident response and security architect, which you might have seen in prior versions of this talk; now it's a little bit different.
B: In addition, we're going to continue to expand our process to improve code security preemptively, and eventually start following IBM standards to fix vulnerabilities and ensure compliance; these are oftentimes even more extensive than Red Hat's. So, in addition to us still being devoted to upstream and everything else, all of this will result in a more secure Ceph, because we have even more rigorous requirements, even with three releases going on. So, again, just to reassure everybody.
B: Now, of course, new collaboration produces new challenges, and lots of really fun and clever goals that we're still figuring out; fingers crossed it works out fantastically. Please feel free to reach out with any concerns you have, or any collaborations with IBM that you want to talk about; for all of this stuff, please reach out.
B: So let's now talk a little bit about how Ceph actually implements encryption, not just what happens when we get a vulnerability report. On the server side, operators overwhelmingly choose to encrypt data at rest using LUKS. You don't necessarily have to have encryption, but it's an option that we highly, highly recommend. All the data and metadata of a Ceph storage cluster can, by using LUKS, be secured using a variety of dm-crypt configurations, and almost all of our customers choose to do this; you should choose to do this.
B: We enable a general security best practice by locating our monitors on separate hosts from the OSDs: we ensure anti-affinity between the keys and the data that they encrypt. This means that your drive host is physically separated from your decryption key as much as possible, so that if somebody steals one drive, they don't have both the key and the data; they have one or the other, and that makes it a little bit harder to crack.
B: The object storage gateway also has some additional encryption capabilities. It includes encryption at ingestion time, and we have the use of per-user keys, as opposed to just per-drive keys. So if you want to revoke a user, you don't have to re-encrypt everything and send out a whole new key. We allow key rotation with tools like Vault.
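As an illustration, RGW's server-side encryption can be pointed at a Vault KMS with settings along these lines (the address and auth method are placeholders for your own deployment):

```shell
# configure RGW to fetch per-object keys from HashiCorp Vault
ceph config set client.rgw rgw_crypt_s3_kms_backend vault
ceph config set client.rgw rgw_crypt_vault_auth token
ceph config set client.rgw rgw_crypt_vault_addr https://vault.example.com:8200
ceph config set client.rgw rgw_crypt_vault_secret_engine transit

# an S3 client then requests SSE-KMS per object with its own key ID, e.g.:
# aws s3 cp file.bin s3://bucket/ --sse aws:kms --sse-kms-key-id project-key
```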
B: Now, what about encryption in transit, now that we've covered encryption at rest? Network communication can be secured by turning on the secure mode of the Ceph protocol, with messenger version 2 that was introduced in Nautilus. Now, clear text can be fine in this case; it's not a huge security risk, because typically a network where you're using the Ceph protocol is physically or logically isolated from access, so you aren't just having people monitoring it on the wire and sniffing your packets.
B: It is a little bit of a concern if you're in the cloud on a shared cluster, perhaps with a Kubernetes deployment, or if you run a little more nervous about security, or you want, for whatever reason, to have encryption on the wire and your threat model includes it; thus we implemented encryption here. There are a lot of issues with compatibility and overhead for back-end protocols, so it really depends on how you're setting up your network and how you're setting up your deployment.
B: That being said, in most cases the performance impact is pretty insignificant for a properly designed cluster: your latency is going to be completely overshadowed by network communication, as long as you account for your CPU overhead. And we have some best practices here that go back to the security zones that we were talking about; you have your network hygiene.
B: You generally want to have these in addition. When we think about some more specific protocols, though: the S3 service is usually secured between RGW and the S3 client with TLS on port 443. You can totally use plain HTTP on port 80, depending on the nature of the data being served; if you want to make it public, you can. But it's usually secured, and we recommend TLS. Termination at a proxy is a special case.
B: The link between HAProxy and RGW is clear text, and it needs to be located in, and protected by, the security zones that we were just talking about. And, of course, standard network security practices apply, like firewalling individual nodes to expose only an allowlisted set of ports.
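For RGW itself, TLS on the Beast frontend is configured roughly like this (the certificate path is a placeholder):

```shell
# serve S3/Swift over HTTPS on 443; drop `port=80` to disable plain HTTP
ceph config set client.rgw rgw_frontends \
    "beast port=80 ssl_port=443 ssl_certificate=/etc/ceph/private/rgw.pem"
```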
B: We check that with pen testing; we're making sure that everything is generally pretty safe. But those are all best practices, and the Rook-specific angle isn't quite as relevant here, although Rook can use CRDs to encode many of these settings, such as configuring trusted certificates for the RGW web server. Rook also supports at-rest data encryption, as we discussed earlier, and in-flight encryption of the Ceph protocol since 1.9. The Kubernetes permission system also applies to the persistent volumes, so you get the permissions, quotas, and all of that from Kubernetes; there is nothing Rook needs to do here.
B: Rook supports a key management system in the CSI driver (the container storage interface), and that again allows individual volumes to be encrypted with their own key. Going back to the earlier point, that limits the scope per key; so if you need to revoke a user, you can. And this is all done so we can follow best practices easily: key rotation, revocation, limiting the scope of each key. It limits the scope of our unencrypted traffic and all that good stuff. Best practices are important to follow; we all see them on paper, but we've got to make them easy to implement.
B: We don't necessarily want our dashboard to be exposed to the world, but it does definitely need to be reachable by your operator's workstation to be of use; so the manager, which supports the whole infrastructure, has to be reachable on the storage access zone. So how do we do that with our control zones? SSH control planes. But then, who are we, and who is accessing what? We sort of have to talk about identity and access briefly, and we do use shared secret keys, which protects us from man-in-the-middle attacks by default.
B: Shared secret keys are done using AES, which, fun fact, is thought to be quantum-resistant; but you still need to follow some good practices. You have to grant keyring read and write permissions only to the current user and root. You want your client.admin user restricted to root only; you don't want all users to be root, that's a bad security practice, don't do that. And that brings us to RGW next.
B: RGW supports the key-and-secret model of AWS S3, and the equivalent model for OpenStack Swift as well. Talking a little bit more about RGW: in general, the administrator's key and secret should be treated with appropriate respect. To reiterate the point, you want to use your administrative users sparingly, to reduce your risk profile. The RGW user data is stored in Ceph pools, which should be secured as we discussed previously regarding data at rest.
B: This isn't required, but people generally do use it: we can couple with OIDC providers, such as Keycloak, backed by your organizational IdP, for even more granular role- or attribute-based access, and we'll continue working to make this more and more granular as time goes on, just to make security best practices even easier. And again, we support LDAP and Active Directory users, and we highly recommend using secure LDAP. We support OpenStack Keystone to authenticate object gateway users in OpenStack clouds.
B: What happens when there is a breach? How do we detect it? What happens when we want to just check our logs? What happens if we have a security requirement that mandates audits? Well, we allow operator actions against a cluster to be logged, and they're stored in /var/log/ceph/ceph.audit.log; you can check the information there.
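A sketch of what that looks like in practice (paths follow the default log layout):

```shell
# make sure the monitors persist the cluster and audit logs to file
ceph config set mon mon_cluster_log_to_file true

# administrative commands (who ran what, and from where) land here:
tail -f /var/log/ceph/ceph.audit.log
```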
A: So, one more: once data is deleted from a Ceph cluster, it generally cannot be recovered for practical use, but there are exceptions.
A
Additionally,
individual
data
blocks
that
use
to
constitute
an
object,
file
or
volume
are
often
still
present
on
persistent
storage,
like
with
any
type
of
storage,
really
until
they
are
overwritten
by
that
capacity
being
used.
So
you
can
end
in
the
case
of
ceph.
You
cannot
securely
delete
the
cluster
by
writing
a
ton
of
data
to
it.
It's
not
going
to
work,
or
this
is
not
going
to
work
reliably,
which
is
what
you
care
about.
A: Secure deletion is another common question. The easiest way to solve this is to do the right thing from the beginning, which is, again, encrypting data at rest; then, when you want to sanitize the disk, you throw away the encryption key, and that's it. Plus, there are plenty of storage media these days that provide that functionality in hardware, so you don't even have to manage it when you need to sanitize.
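With LUKS-encrypted OSDs, that "throw away the key" step can be made explicit when a drive is retired; a sketch (the device path is a placeholder, and the operation is irreversible):

```shell
# destroy every LUKS key slot on the retired OSD device; without the
# volume key, the on-disk data is unrecoverable ciphertext
cryptsetup luksErase /dev/sdb

# self-encrypting drives expose the same idea in hardware,
# e.g. via an NVMe cryptographic-erase format:
# nvme format /dev/nvme0n1 --ses=2
```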
A: You can also use a shredder or a degausser, but usually in those cases you cannot return the destroyed drives for warranty unless you have a special contract with the vendor, so that's another reason to use encryption keys instead. And then the other case is when you actually want to prevent deletion, absolutely the opposite scenario; there, Ceph has one facility, which is the use of multi-factor authentication in RGW, so that you can make it harder for someone to go and delete your data in an attack, by requiring a second factor.
A
There
is
one
more
thing
which
is
a
hardening
options,
so
these
are
very
highly
vendor
dependent,
but
they're.
Really,
the
availability
of
them
is
all
the
same
across
all
Linux
distributions.
The
question
is
whether
they
are
compiled
in
the
kernel
or
in
the
case
of
the
self-distribution,
if
the
binaries
are
compiled
with
it.
So
these
are
Red
Hats
choices.
A: As Sage mentioned, we can make use of FIPS 140-2, optionally; you configure the operating system for that, and Red Hat Enterprise Linux and Red Hat Ceph Storage binaries are compiled with those options. We don't want to go into a discussion of GCC flags; I think we're dense enough already. But those are the ones that we're using, and they generally get in the way of exploits that are buffer overruns or heap overruns.
A: On the kernel side, obviously you get the options that the distribution selects, but you can always build your own kernel and play with things like seccomp and PIE and the like, or the ASLR options. Quite a few of those are already in RHEL by default. Generally, what you want to do as the user is consult your vendor's documentation and figure out what is already in there; most of us don't have the time to do a garage experiment and build our own kernel.
A: So that is it, aside from some bookmarks for you covering the things that we didn't go into. Managing Kubernetes secrets is always a favorite; Rani Osnat at Aqua has written a very nice primer on that. Chapter 6 of Hacking Kubernetes, Michael Hausenblas's new book, has a very nice primer on storage security that is primarily Kubernetes-based. And a data security and hardening guide is coming from our product.
A: The Kubernetes documentation has very nice details on how to encrypt Secret data at rest, which kind of goes back to Rani's article at the beginning. And then the last item is for all the stuff that we didn't discuss about hardening the kernel or the binaries; it will explain all the mystery flags of GCC for those of you that are fortunate enough not to know them. And there is one more...
A: That's not on here, but the Ubuntu security team has a very nice table of the hardening options of the kernel, so you can go to their table and, if you're not a kernel security expert, learn what they are, and then you can Google them for your own distribution. And I think that's it; let's see if there are any questions. These are all the people that have contributed to the presentation so far. If you have things that we should be adding here, areas that we didn't address, or...
C: [inaudible question from the audience]
A: Okay, yes, that's a very smart question: you've got a hyper-converged setup.
A: Before you make it more complicated, because this is the limit of what people can take already: the OSD has been encrypted with LUKS; where is the key? The key is in the monitor. That goes back to what Sage was saying earlier, that we want anti-affinity between the monitor and the OSD, so that the key is somewhere else; you have to steal at least two machines, then.
A: The key is on the monitor machine. So how is it secured on the monitor machine? That is the part of the question that I get from the smartest people, like you. The answer there is that you're also encrypting the file system of the monitor, so that that key is encrypted at rest on the monitor, and that's usually okay. What's the key for that key? The key for the key is there on the machine; so if they can boot that machine, then you have lost.
A: It shouldn't matter; you still need to boot the host, okay, and...
C: Until a couple of years ago we didn't think that, you know, predictive power analysis was possible on certain CPUs, but now we know.
D: Yes, I have a similar question: if you use the encryption in the Ceph CSI, the passphrase is stored in the storage class, and the key is stored as metadata next to the image. From my understanding it's like this: if you have one thousand, ten thousand images with keys, and the key is always encrypted with the same passphrase, it should be possible to reverse-engineer this passphrase.
A: You're saying, in terms of doing a large number of known-ciphertext attacks?

D: Yes.
D: The data... and at the moment we have already made the decision to re-implement this part of Ceph CSI, because in the storage class, the passphrase used to encrypt the key of the image should not be stored in a world-readable way, or...
A: Usually the answer there is either (a) the protocol uses temporary keys, and so you don't care, as long as the sessions are not very long-lived; or (b) the protocol doesn't use temporary keys, in which case you need key rotation, like Amazon keeps bothering you, saying "rotate your keys" every, I think, three months. But I don't know what scenario LUKS would fall into here.
E: Hey, so we actually started encrypting with LUKS before Ceph did, I think. So this is maybe a dumb question, but is there a facility to rotate LUKS keys?
A: Built into Ceph? Oh, definitely not built into Ceph. So anything that's available for Linux we could use, but we don't have our own. On the most popular thing in terms of key rotation: usually then I get the question about object stores, and the vast majority of customers that want to do that use HashiCorp Vault, but obviously that's at another level.