From YouTube: State of Container Security | Urvashi Mohnani, Sally O'Malley, Red Hat | OpenShift Commons Briefing
Description
State of Container Security
Urvashi Mohnani and Sally O'Malley, Red Hat
OpenShift Commons Briefing
March 2020
B
Hey, I'm Sally. I'm a software engineer on the OpenShift workloads team. For the past few years, Urvashi and I have been getting together and giving a container security talk every few months. We'll get together with Dan Walsh and talk about what's new; Dan's always presenting on security. When there are enough cool features to talk about, we'll put them together and present them, and we like to share them with as many people as we can.
A
Okay, so before we dive into container security, let's talk about Goldilocks and the three bears, to recap the story. It's a story about a girl, Goldilocks, who ventures into the forest and finds an empty house. Being adventurous as she is, she enters the house and sees there are three types of everything there: three types of chairs, three bowls of porridge, three types of beds, et cetera. She decides to go sit on the first chair and realizes it's not comfortable; it's too hard. That was Papa Bear's chair.

A
She then moves to the second chair and realizes it's the opposite: it's way too soft, still not comfortable, and that was Mama Bear's chair. She then sits on the third chair and finds that it's just right, and that was Baby Bear's chair. As we progress through the story, we see that Goldilocks always leans towards the "just right" option for the things in the house: she goes for the bowl of porridge that's neither too hot nor too cold, she goes for the bed that's neither too hard nor too soft, et cetera.
A
So when it comes to container security, we can look at it with a similar lens. The first level is where we have all our security features enabled, but we run into challenges when trying to run our applications. This is the Papa Bear model of container security: too hard. And oftentimes, when we hit such snags, we end up overcompensating and disabling all of our security features. That's the Mama Bear model: too soft.
B
Please don't make Dan cry, right? So that's what our talk is about. But first, let's just talk about container images. When you run a container, there are three inputs to the system. First, the OCI image format: when a developer crafts an image, they include things like the entry point, the user, volumes, whatever goes in a Dockerfile. These get translated to a JSON file, the OCI image spec.
B
Next, there's the container engine. Much of the security of a container comes from hard-coded values in the container engine, like which namespaces get spun up, which cgroups get assigned, which seccomp profiles are applied, things like that. And the third input is from the user. Users can override and set aspects of running containers by passing flags to the run command, like volumes or capabilities or port forwarding.
B
Those are the three inputs, and then the container engine takes its own hard-coded defaults, the user inputs, and the information from the image, and creates an OCI runtime spec. That's the JSON file that the runtime then uses to launch the container. The runtime is runc or crun or Kata Containers.
B
So that's how that all works. Now, the middle part, the container engine, like I mentioned, has mostly been hard-coded, but at Red Hat we've worked to split the monolithic engine that was Docker, with the Docker daemon, into four different functions with distinct tools. CRI-O is for running locked-down containers in production; it's meant to run containers in Kubernetes pods. Podman is for running containers locally while developing and experimenting; the Podman CLI is pretty much one-to-one with the Docker CLI. And then Skopeo is the tool just for moving images between registries.
B
You can move an image from one remote registry to another without ever pulling that image down to your system. And then Buildah is for building container images. So the idea is that by setting security separately for each of these tools, you end up with the highest level of security, rather than the least common denominator that you had with Docker. And also, without the Docker daemon, these tools can run rootless. Red Hat is constantly experimenting with ways to run containers most securely.
B
For example, in OpenShift we now run image builds inside of a container with Buildah, and with Buildah there's no leaking of information from a Docker socket. This has made OpenShift image builds more secure. We also lead the way in running containers rootless; we'll talk more about that later. But speaking of rootless, the most secure way to run any container is by setting the user inside the container to be non-root.
B
This is the default in OpenShift: regular users can't run as root inside their application containers. And in almost every case, if you think you need to be root inside your container, you're probably doing something wrong. For example, a few cases where you might think you need to be root: if you're running a web service and you need to bind to port 80, you can instead use port forwarding from the host to run as non-root in the container. Another common reason for running as root is installing packages in your container.
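As a minimal sketch of that port-forwarding idea (the image name, port numbers, and server command here are just illustrative, not from the talk):

```shell
# Run an unprivileged web server that listens on an unprivileged
# port (8080) inside the container, and publish it on the host.
# Nothing inside the container needs root or CAP_NET_BIND_SERVICE;
# clients just talk to the published host port.
podman run --rm -d --name web \
    -p 8080:8080 \
    registry.fedoraproject.org/fedora \
    python3 -m http.server 8080
```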
B
You should never have a package manager inside your container. You should install packages at image build time, like within a multi-stage Dockerfile or with Buildah. These encourage minimal images that don't require root. However, there are containers that do need privilege: system containers, or special containers whose purpose is to manage things on your host. There are such containers in OpenShift and in Kubernetes, so for these, there are many ways to secure them: you can use Linux capabilities, seccomp filters, SELinux, user namespaces.
B
All of these can work together to lock down even your most vulnerable containers, and this is what we're here to talk about today. We're going to try to make it easy for you to run securely, because we all know that when it comes to security, if it's easier to disable a feature than it is to configure it, chances are it's going to get disabled.
A
All right, so one of the ways we currently enforce security is by limiting the power of root using capabilities. Capabilities are chunks of root power: each capability gives root the privileges it needs to carry out certain actions. For example, if I run a container and I disable all the capabilities in it, the root user inside it can't really do anything privileged.
A
Currently, we have 37 different capabilities, out of which we enable 14 by default when we run our container workloads. These 14 were originally defined by the upstream Docker project back in around 2013. But do we really know what these 14 capabilities are? The answer is usually no, but here is the list of the 14. After further examining what exactly these capabilities do, we found that a few of them are not entirely critical to running your container workloads. The first one is AUDIT_WRITE.
A
Audit
right
gives
you
the
ability
to
write
certain
information
into
the
auditing
subsystem,
the
back
in
the
day
when
containers
were
starting
off,
people
thought
that
the
only
real
way
of
running
jobs
in
your
container
was
to
SSH
into
it,
and
for
that
to
be
possible,
you
needed
the
SSH
daemon
running
inside
the
container.
Now
the
SSH
daemon
won't
run
unless
it
had
the
audit
write
capability
enabled
now.
We
know
that
that's
not
entirely
true.
A
We
have
tools
such
as
part
of
an
exec
or
docker
exact,
that
let
us
do
exactly
that,
and
there
are
really
no
other
applications
or
tools
that
need
the
SSH
daemon
running
inside
the
container.
So
why
have
this
enabled
by
default?
For
all
the
container
workloads
I
can
one
is
make
node
make
node
gives
you
the
ability
to
create
device
nodes
on
the
system.
This
can
be
pretty
dangerous
as
it
can
be
used
to
attack
the
kernel
and
the
reason.
A
The
main
reason
that
this
is
enabled
by
default
is
that
certain
a
bunch
of
packages
need
to
make
a
device
nodes
when
they're
being
installed.
But
if
you
have
a
different
tool
to
build
your
container
images
such
as
bilder
and
you
have
a
different
tool
to
run
your
containers
in
a
production
cluster
such
as
cryo,
you
can
run
your
containers
more
securely
in
your
production
of
production
environment
by
disabling
this
capability
by
default,
while
having
it
enabled
in
your
build
tool
without
compromising
the
functionality.
A
The third one is SYS_CHROOT. This just gives you the ability to chroot inside the container. No real application uses this, so we're not sure why it needs to be enabled by default for all containers. The fourth one is NET_RAW, and NET_RAW gives you the ability to create any type of raw IP packets.
A
All right, we have some demos for you. So let's look at a demo where I can drop the NET_RAW capability without compromising the ping ability in my container. Here I am running a basic image, and ping works as expected, because my NET_RAW capability is still enabled. Now, using the --cap-drop flag, I'm going to drop the NET_RAW capability, and as expected, it will not work: I can no longer ping in there.
A
So if we want to drop this capability but still have ping, there is a way around it, by enabling this sysctl here. What this sysctl does is that if your group ID falls in the range of 0 to 1000, you get the ability to ping inside your container. So let's try that out: I'm running a container here where I have dropped my NET_RAW capability and I have enabled that sysctl, and as you can see, ping works as expected.
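A minimal sketch of what that demo looks like (the image name is illustrative; the sysctl is `net.ipv4.ping_group_range`, which lets processes whose group ID is in the listed range use unprivileged ICMP sockets):

```shell
# With NET_RAW dropped, classic raw-socket ping fails:
podman run --rm --cap-drop=net_raw \
    registry.fedoraproject.org/fedora \
    ping -c1 8.8.8.8    # fails: raw sockets are not allowed

# Enabling the ping_group_range sysctl lets ping fall back to ICMP
# datagram sockets, so it works even without NET_RAW:
podman run --rm --cap-drop=net_raw \
    --sysctl 'net.ipv4.ping_group_range=0 1000' \
    registry.fedoraproject.org/fedora \
    ping -c1 8.8.8.8
```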
A
Right, so as you saw, we said that we can further reduce the default capabilities, down to a list of ten, for all our container workloads. But usually, as an end user, it can be confusing as to what exactly the capabilities are that we need to run certain containers. The image developers are the people that best know exactly which capabilities are needed to run the containers they are building. So say I am an image developer and I know that my container only needs the SETUID and SETGID capabilities to function as expected.
A
I can set this up as a label or an annotation on the image when I'm building it, and when my container engine, such as Podman, launches this container, it will know to start it with only the SETUID and SETGID capabilities and not the default 14. We have a demo for this as well. So here is a Dockerfile, and in it, as you can see, I've set that label saying that my container only needs the SETUID and SETGID capabilities.
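A sketch of what that might look like, assuming the `io.containers.capabilities` image label that Podman checks (base image, tag, and command here are illustrative):

```shell
# Build an image whose label tells the engine it only needs
# SETUID and SETGID (Containerfile written as a heredoc for brevity):
cat > Containerfile <<'EOF'
FROM registry.fedoraproject.org/fedora
LABEL io.containers.capabilities=setuid,setgid
CMD ["sleep", "1000"]
EOF

podman build -t minimal-caps .

# Podman reads the label and starts the container with just those
# capabilities instead of the default list:
podman run -d --name demo minimal-caps
podman top demo capeff    # shows the container's effective capability set
```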
A
So when I run this container and use the podman top command to see what capabilities it has, you can see here that it's only the SETUID and SETGID capabilities. So now let's run a container image that doesn't have such a specification, and we will see that it runs with all 14 of the defaults. Now, what happens when an image developer says that the image needs capabilities that fall outside of this list of 14?
A
So here I have a Dockerfile where I'm saying that I want my image to run with the NET_ADMIN and SYS_ADMIN capabilities. Podman will run your container, but it will log an error saying that you're not allowed to run with these capabilities, as they don't fall in the default 14. It will still run your container, but with the default 14: you will not have NET_ADMIN or SYS_ADMIN enabled, just the default 14 list.
A
However, if you run the same image and you specify these capabilities yourself using the --cap-add flag, you will see that Podman actually launches the container with the two capabilities, even though they fall outside the default 14. The reason we do it this way is that we don't want users to end up pulling random images off the internet and running them without realizing which capabilities they have enabled. Anything that falls outside of the 14 is usually an enhanced capability that container workloads don't usually need, so why would you be running with it?
A
But if, as an end user, you really believe that you need those capabilities, then Podman will not stop you from doing that. So this was just a way of showing how we can further lock down our containers and move more towards the Papa Bear model, by leaving it to image developers to restrict which capabilities are needed to run their containers.
B
Great, so we showed how easy it is to set Linux capabilities. How about limiting syscalls? Well, processes communicate with the kernel through syscalls, so one way to attack a host is to gain access to the kernel through syscalls. Just by turning on seccomp in the kernel, you go from having eight or nine hundred syscalls down to about 450. Seccomp is a kernel feature; it was added in 2005. So just by turning that on, you're already better off. When running containers, just about everybody runs with a default seccomp filter.
B
This was developed upstream by Docker, by Jessie Frazelle, and it whitelists about 300 syscalls. You can find it on your system, under /usr/share/containers. That's better, but can we do even better? Recently Aqua Security did a study where they looked at the containers out there and found that most only require 40 to 70 syscalls, and at Red Hat we found the same to be true. The problem is that it's really difficult to figure out which syscalls a container needs; they're not the same.
B
Those 40 to 70 are different for each container. So that was the problem we looked at last summer. We had an intern on the runtimes team, through Google Summer of Code, and he worked with Valentin Rothberg and Dan Walsh to come up with a tool to do just that: it will generate a seccomp profile based on a container that you give it. So here's the way it works.
B
It's an OCI hook, the oci-seccomp-bpf-hook. An OCI hook is a helper program that gets launched by the runtime just after a container is created but before it starts. It hooks into the kernel through BPF, and it watches all the syscalls on your system. It records those syscalls that are in a given container's PID namespace, and when that container exits, a seccomp profile is generated with the whitelist. A seccomp profile is just a JSON file that's a whitelist of syscalls.
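To make the "just a JSON file" point concrete, the general shape of such a profile looks roughly like this (the syscall list here is a made-up minimal illustration, not a real generated profile):

```shell
cat > /tmp/example-seccomp.json <<'EOF'
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "fstat", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
EOF
# Any syscall not on the whitelist fails with an error
# (the defaultAction above returns EPERM-style errors).
```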
B
All the recorded syscalls that the container used go into the whitelist. We'll show this running in a demo; it's pretty cool. But the idea is that an application developer can run this hook through their entire CI/CD pipeline, testing every code path and use case and edge case. You can run it in a test or a production environment for a few months and just continuously watch the seccomp profile until it stabilizes and there are no new syscalls being added.
B
So let's see how it works in this demo. Here's just a look at the hook itself: you can see it's in oci/hooks.d; that's where the binary is. For any container that's launched with the trace-syscall annotation, a seccomp profile will be generated, and the user passes a path to where they want that file generated.
B
So now we'll just run a simple Fedora image, and we're going to pass that annotation to tell the hook to start, and we just ran ls. Now we can look at the profile that was required just to run ls in a Fedora container. You can see there are about 30 or 40 syscalls required just to run ls.
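A sketch of that round trip, assuming the `io.containers.trace-syscall` annotation and `of:` output prefix used by the oci-seccomp-bpf-hook project (image name and output path are illustrative):

```shell
# 1. Trace: run the container with the hook's annotation; the hook
#    records every syscall the container makes into /tmp/ls.json.
podman run --rm \
    --annotation io.containers.trace-syscall=of:/tmp/ls.json \
    registry.fedoraproject.org/fedora ls /

# 2. Enforce: run again with the generated profile; only the
#    recorded syscalls are allowed.
podman run --rm \
    --security-opt seccomp=/tmp/ls.json \
    registry.fedoraproject.org/fedora ls /
```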
B
So now we can turn around and run that container again using that generated seccomp profile, and you can see it works just fine. But behind the scenes, that container is completely locked down: only those shortlisted syscalls, the 30 or 40, are allowed, out of the 300. So great. What happens if we need to run ls -l? Let's check it out. We'll use this same profile to run ls -l, and it errors out, because apparently ls -l requires more syscalls. You can see that in the audit log; hopefully you can see it.
B
Yes, there are some syscalls it was trying to use when it stopped, and there are a few listed, but there are probably more, because that's just all that the audit log caught. So let's go back and run the container again with the annotation, to catch any new syscalls that are required for ls -l. Here we have a new file generated, and we can run with that new file, run ls -l, and hopefully it'll work.
B
Now let's look at the difference between the two seccomp profiles, one for ls and one for ls -l. I found this interesting. The plus signs are the added syscalls that you need to run ls -l. What I found interesting is connect: I was surprised that you need connect to run just ls -l. What socket is being connected to here? Well, when you run ls -l...
B
If you look at the output, your UID is mapped to your username. So instead of outputting 0 for a file's ownership, it says root, or instead of saying 1000, it will say somalley. That lookup uses NSS, which in the background uses nsswitch and the SSSD daemon, and that's where the connect syscall comes into play. That's an interesting tidbit. But that's how the hook works; you can download it and check it out for yourself. But to implement this...
B
The runtimes team has been working on a plan, and what they've come up with is that an image developer should ship the seccomp profile within the image and include a label on the image that will tell the container engine: this image has a profile, use it. The reason is that, again, application developers know best how their containers should run, so it makes sense for the developer to include this with their image. So again, just like with capabilities...
B
Just like with capabilities, say you have a syscall that's outside of the default: you might be met with a decision. The container engine might error out and tell you that it won't run this image, because it has syscalls outside the default, or you can tell the engine: hey, I trust this image, let's run it anyway. So those are some things still being worked out, but the hook itself is on GitHub. Please download it, try it out, it works great. And that's it for the hook. Yep, yeah.
A
Okay, so another way we currently enforce security is using SELinux, which is a tool we all know and love. The way SELinux works is that it's a security model based on type enforcement, where files and processes are given different types, and access is restricted based on which types you can access. In the past seven years, almost every container CVE that has occurred has been a file system breakout, and guess what: SELinux has blocked each and every one of them. So SELinux is the best tool to protect your file system from container escapes.
A
A problem with SELinux occurs when we use volumes. When we are mounting volumes into a container, we're essentially taking a part of the OS and exposing it inside the container. Since container processes are only allowed to read files that have the container_file_t label, a lot of these system directories on the host are not accessible, because they don't have that label.
A
A way to make this work is to use the lowercase z (or uppercase Z) mount option when you mount a volume into your container. What this does is relabel the content on your host, so your container processes can then access it. Now, this all works fine if the directory that you're mounting into your container is going to be solely owned and used by your container.
A
But if you have other applications running on your host that need to access the same directory, they will end up breaking, because they won't recognize the new label that the content has been relabeled with. The only way to make this work, then, would be to disable SELinux confinement on your container, causing you to run in a very insecure fashion and pushing you all the way towards the Mama Bear model of security.
A
So we have a quick demo where we can look at how this works. Here I am running a container, and I'm mounting in my home directory as read-only and the /var/spool directory as read-write. But these are system directories, which do not have the right label for container processes to read them.
A
So, as expected, when I try to read the home directory in my container, I get permission denied, and the same thing with /var/spool. And since I mounted that one as read-write, if I try to write to it, I wouldn't be able to, because I haven't relabeled the content. So now let's use udica to generate a custom policy for this container: I run my container with the volumes that I want mounted in, and let udica inspect it and create an SELinux policy for me.
A
So when you run this, what udica does, basically, is create a new label type for your container process, and this is what gives you access to the content on your host. And then it's very simple: it tells you exactly what you need to run to load this new custom SELinux policy. As you can see down here, I have run that command, and it takes a few seconds to load.
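A sketch of the udica workflow described here (the container name, policy name, and mount paths are illustrative, and the semodule line is roughly what udica prints for you to run):

```shell
# Run the container with the volume mounts you need:
podman run -d --name mycontainer \
    -v "$HOME":"$HOME":ro -v /var/spool:/var/spool \
    registry.fedoraproject.org/fedora sleep 1000

# Let udica inspect it and generate a custom SELinux policy:
podman inspect mycontainer | udica my_container

# udica tells you exactly how to load the policy, roughly:
semodule -i my_container.cil /usr/share/udica/templates/base_container.cil

# Re-run the container confined by the new type, no relabeling needed:
podman run -d --security-opt label=type:my_container.process \
    -v "$HOME":"$HOME":ro -v /var/spool:/var/spool \
    registry.fedoraproject.org/fedora sleep 1000
```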
A
The great thing about udica is that you don't need to be an SELinux policy expert to figure out exactly what customizations are needed to get this to work without relabeling the content; it does everything automatically for you. So, my new policy has loaded, and I'm going to run the container again. For this, I'm going to set the new label here that I got from udica, and there goes my container, running.
A
Now, if we look at the process running on our host machine, we will see that it is running with this new label, as expected. So now, when I exec into the container and try to do exactly the same thing I was doing earlier to access my home directory, it works fine: I no longer get permission denied. And when I go to /var/spool...
A
I do the same thing: I can read it, and now let me try writing to it, and you can see here I am able to write to it. So udica lets you create custom policies that let you volume-mount into a container without having to use relabeling and without having to disable SELinux confinement completely.
B
That's really cool. We've talked about Linux capabilities and SELinux and seccomp filtering; let's talk now about user namespaces. As I mentioned earlier, Red Hat has been leading the way in driving forward user namespaces in Linux. Just a little background: a namespace is what gives an isolated view of your system with regard to a set of resources. So, for example, in a container you're in a PID namespace: you only have access to processes in that PID namespace. Similarly, within a user namespace...
B
...you only have access to the range of UIDs and GIDs that are in that user namespace. This provides just an extra layer of isolation, and a privileged root user inside the container can then be mapped to a non-privileged user on the host. So if a process were to break out of that container, it wouldn't be a privileged user on the host. That's the idea, and in fact, UID separation has always been the standard security tool in Linux systems.
B
So Podman and Buildah do some really cool things with user namespaces. User namespaces are the reason why you can run these tools rootless, and they're also really effective at providing separation. You can imagine, if you had a Kubernetes environment, it would be a huge boost in security if every container was separated by user namespaces. But sadly, nobody is using user namespaces for container separation yet.
B
There's still some work, though, some issues to work out, and again, Red Hat has been and is leading the way with this work. One problem is that there's still no support in Kubernetes for user namespaces, so UID shifting with volumes in Kubernetes is difficult; it requires kernel support that isn't quite there yet. When mounting a volume from a host to a pod, the ownership of files is just not automatically updated, and the community has been working on this for years.
B
So you need to chown all those files, and that is prohibitively slow. But the container storage team, led by Nalin, and the kernel storage team, with Vivek, have been working on this, and they recently added a new feature to overlayfs to make chowning and assigning file ownership in user namespaces much faster. So things are moving forward. Also, Giuseppe Scrivano has been working on a prototype in Kubernetes using user namespaces. If anyone can figure it out, it's these three guys.
B
Also, as an aside, Giuseppe rewrote runc in C over, like, a weekend, allowing containers to run with cgroups v2, but that's a different story. The runtimes team at Red Hat is working to move this forward; it's just taking some more time. I do have a demo, though, that shows how user namespaces are really effective at separating containers, and in Podman this is easy. Here I run a container and map UID 0 inside the container to 100,000 on the host.
B
Now I can run a second container and map UID 0 inside the container to 200,000 on the host, for the next 5,000 UIDs. With podman top, you can see now that on the host this is running as 200,000, while inside the container it's root, and you can see with ps that all the processes are owned by 200,000.
B
So you can see now: if a process were to break out of the first container, it would be 100,000 on the host. It wouldn't have elevated privileges, and it would have no access to the second container running as 200,000. It wouldn't even see that container; the container storage is separated. So that's just an example of how Podman uses user namespaces.
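The two-container setup described above can be sketched like this (the offsets are the ones used in the talk; the image name is illustrative, and --uidmap takes containerUID:hostUID:count):

```shell
# Two containers, each in its own user namespace:
podman run -d --uidmap 0:100000:5000 \
    registry.fedoraproject.org/fedora sleep 1000
podman run -d --uidmap 0:200000:5000 \
    registry.fedoraproject.org/fedora sleep 1000

# Inside each container the process runs as root (UID 0); on the host,
# the same processes show up as UID 100000 and 200000 respectively:
ps -eo user,comm | grep sleep
```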
A
All right, that is a really great tool, but as we saw in the demo, every time we run a container and we want it to use a user namespace, we have to set a specific UID map for each container we run. As an end user, that can be pretty tedious to keep track of: if you're running hundreds of containers, you need to know which ranges...
A
...you've already used and which are still available. So, to make it easier on the end user and to help them move towards Papa Bear easily, we have a new flag in podman run called --userns. When you set this to auto, Podman will automatically pick a different user namespace for every container that you run, and it will guarantee uniqueness. There are still some issues that we're working through, and we plan to keep improving it in Podman until it's feature-complete and we're happy with its stability.
A
So here I'm just going to run a container and set that flag to auto, and when we look at the user it's running with on the host, we see it's running as user 1 billion. Now, if I run another container in a similar way and we see what user it's running with on the host, it's running as 1 billion and 1,024. The reason it picked an offset of 1,024 is that the default size Podman automatically picks is 1,024. But let's say your container needs a wider range, say a range of 5,000.
A
We can do that with the same flag: just add a colon and size= whatever size you want. So here I wanted to have 5,000, and when I run my container, we will see that it started the UIDs from 2,048, which is 1,024 later than the container I ran before this, because the range of that previous container was 1,024. And if I look at the UID map in this container, we will see that the range here is set to 5,000. So this is still a work in progress.
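A sketch of the --userns=auto usage shown in the demo (image name illustrative):

```shell
# Let Podman pick a unique user namespace for each container:
podman run -d --userns=auto \
    registry.fedoraproject.org/fedora sleep 1000

# Ask for a wider range (5000 UIDs) for a container that needs one:
podman run -d --userns=auto:size=5000 \
    registry.fedoraproject.org/fedora sleep 1000

# Inspect the mapping a container was given:
podman run --rm --userns=auto \
    registry.fedoraproject.org/fedora cat /proc/self/uid_map
```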
B
All right, so finally, the last thing I want to talk about is the containers.conf file, a feature being added in Podman now. It's a central location where you can set security configurations system-wide, for all of your containers and all of your tools. So, for instance, the distro might put a containers.conf in /usr/share/containers, an administrator could include a containers.conf in /etc/containers that overrides the shared one, and then a user could override that further with a containers.conf in their home directory.
B
Some things that you'd use containers.conf for: for example, removing those four capabilities that we talked about earlier, the four that nobody really needs. You could remove them system-wide, for all of your containers and all your tools, in containers.conf. If you wanted to enable ping, you could add that sysctl back in for all of your containers, and you wouldn't have to remember that long command with the specific sysctl that Urvashi showed in the demo. So that's one way you'd use containers.conf.
B
Another way: some of these commands, like the Buildah ones, have tons of flags; you could have ten or twenty flags and parameters that you need. If a containers.conf file contains those flags, that just makes it easier for a user to run that image. The same thing goes for high-performance computing and very high-security environments, where there's a lot of configuration required with every container; adding a containers.conf file just makes the configuration easier.
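A sketch of what the per-user override described above might contain, assuming the `default_capabilities` and `default_sysctls` keys in the `[containers]` table (the list here is the default 14 minus the four discussed earlier, which leaves the list of ten):

```shell
# Write a user-level containers.conf that drops the four capabilities
# nobody needs and re-enables ping via the sysctl, for every container:
mkdir -p ~/.config/containers
cat > ~/.config/containers/containers.conf <<'EOF'
[containers]
default_capabilities = [
  "CHOWN", "DAC_OVERRIDE", "FOWNER", "FSETID", "KILL",
  "NET_BIND_SERVICE", "SETFCAP", "SETGID", "SETPCAP", "SETUID",
]
default_sysctls = [
  "net.ipv4.ping_group_range=0 1000",
]
EOF
```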
B
Yeah, okay. So let's just run a Fedora container, and here are all the capabilities, the fourteen defaults. Now let's edit our containers.conf file down to just those ten capabilities, taking out the four we talked about earlier that nobody needs. I'm going to pass this containers.conf in an environment variable to the podman command; you won't normally have to do that, this is just for the demo.
B
I think that's it; that's the end of the demo and the end of our talk. We do want to thank Máirín Duffy, who did all the artwork on the slides. Anything else there, Urvashi?