Cloud Native Computing Foundation CNCF Webinars, 17 Dec 2020

From YouTube: Rootless Containers in Gitpod

A

Okay, good morning, good afternoon, or good evening, depending on where you're joining us from. Welcome to today's CNCF webinar, Rootless Containers in Gitpod. My name is Christy Tan and I'll be moderating today's webinar. We would like to welcome our presenters today: Christian Weichel, Chief Architect at Gitpod, and Alban Crequy, Director of Kinvolk Labs at Kinvolk. A few housekeeping items before we get started. During the webinar, you are not able to talk as an attendee. There is a Q&A box at the bottom of your screen.

A

Please feel free to drop in your questions and we'll get to as many as we can at the end. This is an official webinar of the CNCF and, as such, is subject to the CNCF Code of Conduct. Please do not add anything to the chat or questions that would be in violation of that Code of Conduct. Basically, please be respectful of all your fellow participants and presenters.

A

Please also note that the recording and slides will be posted later today to the CNCF webinars page at cncf.io/webinars. With that, I'll hand it over to Christian and Alban to kick off today's presentation. Take it away.

B

Thanks, Christy, for the intro. So welcome. Today we're going to talk about rootless containers in Gitpod, and to dive right in, we first have to talk about what Gitpod is. Gitpod is an open source project that automates development environments. You can think of it like this: just as a CI system automates regular builds, Gitpod automates the provisioning of development environments for pretty much every developer.

B

So it has ready-to-code dev environments, meaning all your tools are there, the code is downloaded, the code is compiled, and you are ready to go. You can start working with the click of a button. It does that behind the scenes by provisioning Kubernetes pods. So each workspace that you start within Gitpod is actually a Kubernetes pod, and we want those pods, those workspaces that you can start in Gitpod, to feel pretty much like your local machine, except you get a new local machine for every task

B

you want to do. So there's no previous state that can impede what it is you're trying to do. When we started out, and for a long time, one of the big differences between your local machine and what Gitpod would give you was what you could do within such a Gitpod workspace.

B

For example, there was no sudo, meaning you couldn't install things after your workspace was running.

B

You could only do that in the Docker image that you would bring to the workspace. Also, there was no Docker, which in a cloud native environment is a bit tricky. So what we really wanted was to enable those two key features: one where you could have root in your workspace and be able to install things after the fact, once it's running, and one where you would have Docker and could do docker build, docker-compose, etc.

B

And this talk is pretty much about the technologies that made this possible and how we enabled this in Gitpod. So now this is possible. Now you can run Docker, you can do apt-get install, and you basically have root within your workspace.

B

The most naive way of doing this is simply giving you all the privileges within the workspace container. You know, we could just run as root, so to speak. But the clear and obvious downside is that everyone inside their workspace would effectively be root on the node. They'd effectively have all the privileges they'd need to potentially escape the container and to have really a lot of privileges on a node that is shared with, say, 25 other users.

B

So clearly this is not an option, and we need some good way of isolating those workspace containers from each other, but also from the node. This is where Linux isolation tech comes in, and I'll hand over to Alban to talk about that.

C

Thank you. So there are different ways to isolate the pods more from each other and from the host. One way is to think about a VM-like container runtime. Those are, for example, Nabla containers, gVisor, Kata Containers, and Firecracker. These technologies provide improved isolation compared to what Linux containers offer, and they work in different ways. For example, Nabla containers use unikernels. This means that for every new workload there will be a different unikernel built specifically for that workload. Or there is gVisor. What it does is

C

reimplement the Linux system call interface, so it's really implemented in Go. So when your application makes a system call, instead of talking to the Linux kernel, it talks to this interface, this application kernel. Then there are Kata Containers, which build lightweight virtual machines and are compatible with several hypervisors, for example QEMU or Firecracker. Those different VM technologies provide more isolation, but they also bring more limitations compared to normal Linux containers.

C

There could be compatibility issues, or they could have decreased performance, for example with network traffic or file system I/O. What we want in general is to have higher density. That means being able to put a lot of pods on the same node without too much overhead. So, next slide.

C

There is an alternative approach, which is not to use VMs but to use what is called a user namespace.

C

So the user namespace is a feature of the Linux kernel, among other namespaces, for example the network namespace, the PID namespace, and so on. Currently that's a feature that is not provided by Kubernetes. So Kubernetes works like you see in the picture on the left. It has worker nodes, the kubelet, and so on, that don't use user namespaces.

C

What a user namespace does is isolate users. That means the user root inside the container is not the same as the user root on the host, so it provides some isolation.

C

There are different ways to use user namespaces in Kubernetes. Here I present three different ways to use them. The second one from the left is called KEP-127.

C

What it does is add a new field in the pod spec, a bit the same way that you have hostNetwork in the pod spec to say whether or not you want to use a new network namespace in your pod.

C

It adds a new field for the user namespace. I have marked in red in this picture where the user namespace will be located in this architecture. So that's a KEP, which means a Kubernetes Enhancement Proposal. That's not something which is merged in Kubernetes yet. That's something we are working on with others in the community. Another way to use user namespaces is the next one, KEP-2033, the so-called rootless mode, because it allows you to run the different Kubernetes components without being root.

C

For example, you can run the kubelet without being root. You can run the container runtime without being root. So in this way you have a user namespace that goes around all the components of Kubernetes. The last solution is the one retained by Gitpod, where we don't touch Kubernetes, so we can use upstream Kubernetes without modification. Inside the pod, inside the workload, it makes use of user namespaces, so it creates the user namespace at this level.

C

In this way, it works on current kubernetes without patches.

B

Thank you. So how do you create a user namespace? This is an example walkthrough of how to create such a thing, and it starts with the unshare syscall.

B

There are other syscalls that can also create the user namespace itself. Then, once you have that user namespace, you need to establish the UID and GID mapping that maps a user ID from within that namespace to a user ID outside of that namespace. This happens by writing to files in the proc file system. Lastly, you need some execve syscall to get a hold of the capabilities inside that user namespace, and you basically get the full capability set at this point, including CAP_SYS_ADMIN, CAP_NET_RAW, etc.

B

You can try this yourself with this command. It lets you observe the steps that happen to make this work. So unshare -U, that's an uppercase U, creates the user namespace; -r maps your current executing user to UID 0 inside that namespace; and the strace in front just traces what's happening.
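
As a rough illustration of the mapping step described above, the following Python sketch formats the lines that end up in the uid_map and gid_map files under proc. It is only a sketch under stated assumptions: the helper names (`id_map_line`, `id_map_paths`, `write_id_maps`) are hypothetical and not taken from the talk, and the function that actually writes the files is defined but never called, because it needs privileges over the target namespace.

```python
import os


def id_map_line(inside_id: int, outside_id: int, count: int = 1) -> str:
    """Format one uid_map/gid_map line: '<inside> <outside> <count>'."""
    return f"{inside_id} {outside_id} {count}\n"


def id_map_paths(pid: int) -> dict:
    """Paths of the proc files that hold the mappings for process `pid`."""
    return {
        "uid_map": f"/proc/{pid}/uid_map",
        "gid_map": f"/proc/{pid}/gid_map",
    }


def write_id_maps(pid: int, uid_line: str, gid_line: str) -> None:
    """Write both mappings (hypothetical helper, not called here).

    Writing these files requires sufficient privileges over the target
    user namespace, which is the problem the talk goes on to discuss.
    """
    paths = id_map_paths(pid)
    with open(paths["uid_map"], "w") as f:
        f.write(uid_line)
    with open(paths["gid_map"], "w") as f:
        f.write(gid_line)


# Roughly what mapping the current user to root inside the namespace means:
# id 0 inside corresponds to the caller's own uid/gid outside.
print(id_map_line(0, os.getuid()), end="")
print(id_map_line(0, os.getgid()), end="")
```

Keeping the formatting separate from the privileged write mirrors the split discussed in the talk, where the unprivileged side decides what it wants and a more privileged party performs the write.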

B

So this is all fine, except in a Kubernetes pod we would need to give quite far-reaching capabilities to make this work. To write these two files, you need CAP_SYS_ADMIN in the outer namespace, and because at this point Kubernetes does not provide user namespaces yet, this outer namespace would need to be the node as a whole. For security reasons, we don't want to provide CAP_SYS_ADMIN on the node inside the workspace.

B

So we need to find a solution to that. The way we built this within Gitpod is that the root process that we start inside a container is what we call supervisor, and supervisor ring

B

zero is the thing that gets started; that's the command of the workspace container. It then starts the user namespace as supervisor ring one. Once we have that, we make a gRPC call to a daemon service that runs on the node, which we call workspace daemon. This service then has the capabilities on the machine to actually write those files, and we pass the PID of the process that identifies the user namespace

B

we want to write this UID and GID mapping for. That's all nice, except now we have to do PID translation. The reason for that is that containers in general are, in essence, a collection of namespaces and other isolation tech, and one of those namespaces is a PID namespace. This is why any process that you start as the root of the container becomes PID 1, and it's not the actual PID 1 on the node, say systemd or init or something like that.

B

So the PID that we receive from supervisor ring one will not be the PID that workspace daemon sees in the node namespace, so we need to do some translation here. Outside of the PID namespace of the container, this might be something completely different. In order to do this PID translation, there are a few ways this could be done. There is no syscall yet that can just do this translation for you.

B

There are some tricks using Unix pipes, but it's also in the proc file system. If we look at the status file of a process, we see that there is an NSpid entry which lists all PIDs in the child namespaces, from the perspective of the process that's looking at this file. Because we know that the PID we're looking for must be a child process of the workspace container, we can look at the children of that workspace container, look at their status files, and this way identify the correct PID.
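
The NSpid lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not Gitpod's actual code: the function names are hypothetical, and the status file contents are passed in as strings (keyed by the PID as seen from the node) so that the logic can be followed without reading a real /proc.

```python
def parse_nspid(status_text: str) -> list:
    """Return the PIDs on the NSpid line of a /proc/<pid>/status file.

    The first entry is the PID in the reader's PID namespace and the last
    entry is the PID in the innermost namespace of the process.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


def find_node_pid(statuses: dict, pid_in_container: int):
    """Find the node-side PID whose innermost PID matches `pid_in_container`.

    `statuses` maps a node-side PID to the text of its status file, for
    example for all children of the workspace container.
    """
    for node_pid, status_text in statuses.items():
        chain = parse_nspid(status_text)
        # More than one entry means the process lives in a nested namespace.
        if len(chain) > 1 and chain[-1] == pid_in_container:
            return node_pid
    return None


# Made-up sample data: two children of a workspace container.
sample = {
    48000: "Name:\tsupervisor\nPid:\t48000\nNSpid:\t48000\t1\n",
    48211: "Name:\tsupervisor\nPid:\t48211\nNSpid:\t48211\t17\n",
}
print(find_node_pid(sample, 17))
```

Restricting the candidates to children of the known workspace container matters here, since the same inner PID can exist in many containers on one node.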

B

So now we can create a user namespace and we can establish the UID and GID mapping within this user namespace.

B

Now we're left with a problem, even though this is working really well. If we look at the file system, we see that the UIDs all of a sudden don't make sense anymore, at least at first glance. But thinking about it, this is exactly the expected behavior, because on the file system we have some files that belong to the actual, proper UID 0, and we have some files that belong to a user that has a mapping within this user namespace. And the ones that actually belong to the proper UID 0,

B

they are shown as 65534 in here, because we don't have a mapping that maps a user inside the namespace to UID 0 outside. To illustrate, what we would like to have is a file system that, from within the user namespace, looks like this: you have a whole bunch of files and folders that belong to UID 0, and you have some that belong to, say, some other user,

B

in this case 33333. In this example, we have a UID mapping where UID 0 inside the namespace is UID 10000 outside of the namespace, and UID 33333 inside the namespace is 43333 outside, so basically just plus 10000.
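
To make the arithmetic of this example concrete, here is a small Python sketch of how an ID is translated through such a mapping. The plus 10000 shift comes from the example above; the range length of 65536 and the function names are assumptions made for illustration. Unmapped IDs are reported as 65534, which is the value the talk mentions seeing for files owned by the real UID 0.

```python
OVERFLOW_ID = 65534  # what an unmapped owner shows up as inside the namespace


def to_inside(outside_id: int, id_map: list) -> int:
    """Translate an ID on the node to the ID seen inside the namespace.

    `id_map` is a list of (inside_start, outside_start, count) tuples,
    the same three numbers that make up a uid_map line.
    """
    for inside_start, outside_start, count in id_map:
        if outside_start <= outside_id < outside_start + count:
            return inside_start + (outside_id - outside_start)
    return OVERFLOW_ID


def to_outside(inside_id: int, id_map: list):
    """Translate an ID inside the namespace to the ID on the node."""
    for inside_start, outside_start, count in id_map:
        if inside_start <= inside_id < inside_start + count:
            return outside_start + (inside_id - inside_start)
    return None  # no mapping exists for this ID


# Assumed mapping: inside 0..65535 corresponds to outside 10000..75535.
shift_map = [(0, 10000, 65536)]

print(to_outside(0, shift_map))      # root inside the namespace
print(to_outside(33333, shift_map))  # the workspace user from the example
print(to_inside(0, shift_map))       # files owned by the real root
```

The last call shows why an unshifted root file system looks wrong from inside: its files are owned by IDs that fall outside the mapped range.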

B

So in order to get this view on the left from within the user namespace, we would need to have a file system on the node that actually looks like this, right, that actually has this UID shift implemented. But in reality, the file system that we need to do the shift for is the root file system of our container. This root file system was put there by the snapshotter of the container runtime, which doesn't know about this UID shift and also doesn't care. So in reality, the file system looks exactly like

B

we would want it to look from within the user namespace. So we need some process that dynamically does this UID shift for us, and there are a few technologies that can do that. For example, there is fuse-overlayfs, which has the benefit that it can be used without any privileges outside of the user namespace. You can use it completely from within the user namespace, because FUSE can be mounted within user namespaces, and the rest that's needed is a userland process.

B

There is very little upfront cost. All you have to do is start a process. But the runtime cost is comparatively high, because it has to go through userland. On the upside, it is not very platform specific.

B

There is also overlayfs metacopy. Metacopy is a mode in overlayfs where it just copies the metadata to the upper dir. So what we could do is mount an overlayfs on top of the file system that we would like to shift and then do a chown on that file system. This is exactly where the upfront cost comes in: this chown is potentially very expensive if the root file system is large. The runtime cost is comparatively low. In terms of platform

B

specificity, overlayfs, to my knowledge, can only be mounted from within user namespaces on Ubuntu, because they have a non-upstream patch that ticks the right box, so to speak, on overlayfs. And lastly, speaking of Ubuntu, Ubuntu has support for a file system they call shiftfs, which can do this UID shift at mount time, so to speak.

B

It doesn't completely work from within the user namespace, because you need something that they call a mark mount, and this you can only do with privileges in the outer namespace. But it has very little upfront cost; all you need is a mount. The runtime cost is very low; it is quite fast and it runs entirely in kernel space. But it is very platform specific: it only works on Ubuntu. For gitpod.io, which is the SaaS offering, the SaaS version of Gitpod,

B

we ended up going with shiftfs, because we have control over the environment that this runs in and we deeply care about workspace startup time and performance.

B

So now that we have the UID mapping established, we're using the same trick that we used to write the GID and UID maps to actually create this mark mount, this privileged operation that we need to do. So we make another gRPC call to the workspace daemon, which then creates that mark mount for us.

B

Once we have this mark mount, we can use it to mount the shifted file system. Then we do bind mounts for dev, proc, etc., other bits of the file system of the container, and then start supervisor ring 2, which basically does a pivot_root to this new file system. This is how, inside ring 2, you are inside this user namespace but are also looking at a shifted file system. So to you, all file system permissions and ownership look correct.

B

This is all nice, except we cannot just mount proc for this new file system. But we want to do that, because supervisor ring 2 also creates a PID namespace to hide this mechanism away and also to prevent users of the workspace from escaping this new root file system. We cannot mount proc, because if we look at proc within our container, we see that there's a bunch of files that have a mask placed on top of them.

B

So in the proc file system there are a bunch of files, a bunch of objects, that are singletons within the kernel and are not namespaced, for example /proc/kcore or /proc/sched_debug, which might even leak information about other namespaces, and hence other containers on the node. So what Kubernetes, or more specifically the container runtimes, do is mount masks on top of the files and folders in proc in order to prevent workloads from accessing those files. And in the kernel there is a check that checks

B

if such a mask is present, and if so, it prevents users from mounting proc, because that would essentially render those masks useless.
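
One way to see these masks from inside a container is to look for mounts that sit below /proc. The Python sketch below parses text in the format of /proc/self/mountinfo, where the fifth whitespace-separated field is the mount point. It is an illustration only: the function name is hypothetical and the sample lines are made up, assuming the common pattern of masking files with a bind mount of /dev/null and directories with a read-only tmpfs.

```python
def mounts_below(mountinfo_text: str, parent: str = "/proc") -> list:
    """List mount points strictly below `parent` in mountinfo-formatted text."""
    found = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # not a valid mountinfo line
        mount_point = fields[4]
        if mount_point.startswith(parent.rstrip("/") + "/"):
            found.append(mount_point)
    return found


# Made-up sample in mountinfo format: proc itself plus two masks on top.
sample_mountinfo = "\n".join([
    "100 90 0:50 / /proc rw,nosuid,nodev,noexec - proc proc rw",
    "101 100 0:6 /null /proc/kcore rw,nosuid - devtmpfs devtmpfs rw",
    "102 100 0:52 / /proc/scsi ro,relatime - tmpfs tmpfs ro",
    "103 90 0:53 / /dev rw,nosuid - tmpfs tmpfs rw",
])
print(mounts_below(sample_mountinfo))
```

On a real system the same function could be fed the contents of /proc/self/mountinfo; a non-empty result for /proc indicates that something is mounted over parts of it.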

B

In order to work around that, and to never offer an unmasked proc to the workspace container, we again rely on workspace daemon to make that mount for us. The way that works is that we call out to workspace daemon, passing in the PID of the target PID namespace.

B

We do the proc mount, establish the masks, and then move this entire masked proc mount into the mount namespace of supervisor ring one, of the new file system that we're creating.

B

That's nice. So now we have root inside our workspace, it feels like root, and things like apt-get are working. But Docker isn't working yet. Rootless Docker has seen a lot of work, first and foremost by Akihiro Suda, who has worked relentlessly on things like RootlessKit and, in general, on making Docker work in a rootless mode. Our friends from Kinvolk, Alban and his colleagues, have also done a lot of work in this space.

B

So how do we make this work? The key issue here is that Docker needs a lot of capabilities with regards to networking. We can provide those capabilities by wrapping Docker, or the Docker daemon specifically, in a network namespace, and to do that we need to provide some networking to the outside world, so to speak. For this we use slirp4netns, which is a userland mechanism to make a network namespace's

B

connection to the outside world work without needing privileges in the outer namespace.

B

For proc mounts, because the containers that run in this Docker daemon will also need specific proc mounts (among other things, they also have PID namespaces), we use the same trick that we use to create the proc mount for the workspace container as a whole, or for supervisor ring one.

B

We basically call workspace daemon and ask it to mount proc for us. Now, this isn't quite as easy as it might sound, because we need to catch the right moment to do that, and we do this by interjecting into runc.

B

So as part of the OCI runtime spec, the container

B

orchestrator, so to speak, in this case Docker or containerd, will actually create the OCI runtime spec, and in there it will have something like mount proc. We sit in between there. We modify the OCI runtime spec and add ourselves as a hook in the container lifecycle to actually do that proc mount.
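
The idea of rewriting the OCI runtime spec to add a lifecycle hook can be sketched as follows in Python. This is an assumption-laden illustration rather than the mechanism Gitpod ships: the talk does not say which hook stage is used, so `createRuntime` is picked here only as an example, and the hook path and arguments are hypothetical. The only part taken from the OCI runtime spec is the general shape, a `hooks` object whose stages hold lists of entries with `path` and `args`.

```python
import copy
import json


def add_lifecycle_hook(spec: dict, path: str, args: list,
                       stage: str = "createRuntime") -> dict:
    """Return a copy of an OCI runtime spec with one extra hook entry.

    `stage` is an assumption for this sketch; the original spec is left
    untouched so the caller can compare before and after.
    """
    patched = copy.deepcopy(spec)
    hooks = patched.get("hooks") or {}
    hooks.setdefault(stage, []).append({"path": path, "args": list(args)})
    patched["hooks"] = hooks
    return patched


# A heavily trimmed spec, as a container engine might generate it.
spec = {
    "ociVersion": "1.0.2",
    "mounts": [{"destination": "/proc", "type": "proc", "source": "proc"}],
}

# Hypothetical hook binary that would ask a node daemon to mount proc.
patched = add_lifecycle_hook(
    spec,
    "/usr/local/bin/proc-mount-hook",
    ["proc-mount-hook", "mount-proc"],
)
print(json.dumps(patched["hooks"], indent=2))
```

Working on a copy keeps the generated spec intact, which makes it easier to reason about what the wrapper changed before handing the result to the real runtime.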

B

Okay, so much for how this looks on paper. Let's have a look at how this looks in the real world.

B

So this here is a Gitpod workspace that runs in my browser, in a browser tab. Obviously there's a full-blown container behind it. This is what we've just been talking about, and so in here I can

B

do things like this. So I can just install,

B

install new software, for example, but I can also run

B

sudo docker-up, and this will start the Docker daemon with the process that I just described, and now I can

B

I can just run Docker containers, right. So I just started Alpine. I can also do that

B

with ports, and then Gitpod will realize that this port is now served. At the moment there is nothing actually running on it. But if I

B

if I start a web server in here,

B

right, then I can access this service that now runs inside a Docker container in my workspace. So networking also works across this.

B

Right, back to Alban.

C

Thank you. As you have seen, there are different ways to make it work, but there are some difficulties, and it might be easier in the future to implement such an architecture

C

if we have more things in the Linux kernel, and I will talk about a couple of them now. One patch set that is currently being reviewed is idmapped mounts. That's something that does something similar to shiftfs, but instead of being an Ubuntu patch, it is being pushed upstream and is currently being reviewed. Once we have that, I'm hoping it will be easier to do this kind of shiftfs operation.

C

That will be useful both for the rootfs of the container, to be able to have this different ownership of files, which will improve performance both in time and in disk space, and another use case is for volumes. When you have a host volume in Kubernetes, you do a bind mount from the host to the container, and you want to be able to have this shifted ownership on those files.

C

So that's one thing I would like to have. On the next slide there is another feature that I'm enthusiastic about. It's called seccomp notify, and it's a kind of new seccomp architecture with a seccomp agent.

C

So what is the use case for that? As we've seen before, there is this gRPC interface between the workspace and the daemon outside that has some methods like prepare user namespace or mount proc, which run a privileged operation like mount. I'm enthusiastic about seccomp notify because it will be able to provide a proper interface for this kind of thing. So what you will be able to do is have the workload

C

do the mount system call normally, and then seccomp notify will intercept that and send a message to the seccomp agent, which will run the mount system call on behalf of the container.
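
A seccomp policy that routes the mount system call to an agent could look roughly like the profile built by the Python sketch below. This assumes the JSON layout used by the OCI runtime spec for seccomp, including the `SCMP_ACT_NOTIFY` action and the `listenerPath` field; as noted in this talk, that support was still a work in progress at the time of the talk. The socket path is hypothetical, and the permissive default action is chosen only to keep the example short.

```python
import json


def notify_profile(syscalls: list, listener_path: str) -> dict:
    """Build a seccomp profile that defers the given syscalls to an agent.

    Every syscall not listed is allowed, which is far too permissive for
    real use and is done here purely for brevity.
    """
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        # Socket on which the agent would receive the notification fd.
        "listenerPath": listener_path,
        "syscalls": [
            {"names": list(syscalls), "action": "SCMP_ACT_NOTIFY"},
        ],
    }


# Hypothetical agent socket; only `mount` is deferred to the agent.
profile = notify_profile(["mount"], "/run/seccomp-agent.sock")
print(json.dumps(profile, indent=2))
```

With a policy like this, the process in the container calls mount as usual, and the decision, or the mount itself, is handled by whichever agent listens on that socket.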

C

So on the next slide, I will explain that a bit. At the top right you see a seccomp policy. That's where you define, for each system call, whether you want to allow or deny access to that system call. But with this seccomp notify feature you have an additional action you can take, called notify. What it does is say that every time the process in the container uses that system call,

C

it will defer the decision to an application called the seccomp agent, and then this agent will be able to take a decision or run the system call on behalf of the container. Here is a diagram where you see runc at the top left. When you use runc, or it is the same thing in Kubernetes when you start a pod, what happens internally?

C

It will fork and exec a couple of times to create this child process, and then it will execute the seccomp system call with this notify feature, and then you get a file descriptor to be able to get the events, in this example the mount system call. That file descriptor will be passed to the seccomp agent, which will be able to run actions on behalf of the container. So when the container does a mount,

C

the seccomp agent will do that. What it means is that it has the potential to make things simpler for Gitpod, because we could just use Docker inside the pod normally, and when it does the mount system call, it will

C

automatically call the seccomp agent to do that, without having to implement this gRPC interface. Okay, and on the next slide there is a summary of the different future technologies, in the Linux kernel or in general, that I think are interesting.

C

So first, in Kubernetes, the support for user namespaces. There are two Kubernetes Enhancement Proposals for that, which I described before. And on rootless containers: if you go to this GitHub page about rootless containers, you will find a lot of repositories, lots of interesting projects, like RootlessKit; Usernetes, which is about running Kubernetes without being root; slirp4netns, which Christian talked about; and bypass4netns, which is the same thing but with more performance, using seccomp notify.

C

So seccomp notify in Kubernetes doesn't exist yet. That's something that is a work in progress, but here are some different pull requests. There is work in progress to make it available in the OCI runtime spec and in runc. In crun and Podman, the work is already done. And at Kinvolk, we are working on this seccomp agent, which is a generic seccomp agent to make it easy for you to use this seccomp notify feature.

C

Thank you. That's my last slide.

B

Yeah, so briefly, to sum up: Gitpod provides dev environments that are built for the cloud and automatically provisioned. User namespaces are the key tech to provide root within these workspaces. And then, thanks to all the amazing people that actually make this stuff work,

B

first and foremost Kinvolk, also Akihiro Suda, and the community as a whole.

B

Thank you very much.

A

All right, thank you both for a great presentation. At this time we're going to move into our Q&A segment. So if you have a question for our presenters, feel free to submit it either through the chat or through the Q&A box at the bottom of your screen. It doesn't look like we have anything submitted yet, but we'll give folks just a few seconds here to submit theirs.

A

Okay, it looks like we might have a shy group among us today. Alban and Christian, I know at the beginning of your slide deck you have your Twitter handles. Do you want to go back to that slide, just in case folks do think of questions later, as a place where they could reach you?

A

Perfect, awesome. So you can see both of their Twitter handles here on the screen, so feel free to reach out with questions. I'm sure they would love to chat with you more about this cool thing called Gitpod. Well, that'll do it for us today here at CNCF webinars. Thank you again, Alban and Christian, for this presentation, and thank you all for tuning in. A reminder that the recording and slides will be posted later today to the CNCF webinars page. Stay safe out

A

there, continue to wear a mask, and we'll talk soon.
