From YouTube: Keynote: Cloud Native Superpowers with eBPF by Liz Rice
Description
eBPF has been called “Superpowers for Linux”, and in this talk Liz discusses why it’s a foundational technology for a new generation of cloud native networking, security and observability tools. The questions this talk answers include: (i) What is eBPF? (ii) How does eBPF enable the instrumentation of applications, without having to modify applications or their configuration in any way? (iii) What can we do with eBPF in the cloud today? Even if you’re not a Linux kernel aficionado you’ll leave this talk with an understanding of how eBPF enables high-performance tools that will help you connect, manage and secure applications in the cloud.
We mostly write our applications in user space, and we're given abstractions that protect us from the system calls that need to be made to the kernel. For example, our application might want to do something that interfaces with hardware: maybe that's writing to the screen, or receiving a network packet, or writing something into a file.
All of these things require access to hardware, even accessing memory, and user space can't do this directly. It has to ask for help from the kernel. The kernel provides that interface between user space applications and the hardware they're running on, and it also coordinates the multiple user space processes that are running simultaneously.
So our user space applications make system calls to ask for help from the kernel, but we typically don't write system calls directly in our programming languages. We're given higher-level abstractions: for example, reading and writing to files will map to read and write system calls at the system call interface.
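To make that mapping concrete, here is a minimal sketch in Python (not from the talk): the functions in the os module are thin wrappers that correspond almost one-to-one with syscalls, while the higher-level file object adds buffering on top.

```python
import os

# High-level abstraction: a buffered file object.
with open("/tmp/demo.txt", "w") as f:
    f.write("hello")                         # eventually issues a write(2) syscall

# Thin wrappers that map almost 1:1 onto system calls.
fd = os.open("/tmp/demo.txt", os.O_RDONLY)   # openat(2)
data = os.read(fd, 5)                        # read(2)
os.close(fd)                                 # close(2)
print(data)                                  # b'hello'
```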
So when things happen, there are events in the kernel. That could be network packets arriving; it could be a user space application making a system call. All sorts of events are constantly being triggered within the kernel, and we can attach eBPF programs to these events so that whenever the event happens, our eBPF program can run.
So here is my very basic hello world example in eBPF. The code that's going to run in the kernel is here: it's a very simple C function. eBPF programs are functions, and all it's going to do is write out some tracing. Let's change this!
Okay, so it's going to trace out "Hello KCD Chennai" whenever my eBPF program is triggered, and I'm going to attach it here. The rest of the code is Python. It's using a Python framework called BCC, which is quite a nice way to get started with eBPF programming, because it makes it very easy to load programs into the kernel and attach them to events.
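The slide's code isn't reproduced in this transcript, but a minimal BCC hello world along these lines looks roughly like the sketch below. The attachment point is an assumption for illustration (the execve system call, which is consistent with the trace output described next).

```python
#!/usr/bin/python3
# A minimal BCC "hello world": load a tiny C function into the kernel
# and attach it to a system call event.
from bcc import BPF

# The eBPF program itself: a small C function that emits a trace line.
program = r"""
int hello(void *ctx) {
    bpf_trace_printk("Hello KCD Chennai!\n");
    return 0;
}
"""

b = BPF(text=program)                     # compile and prepare the program
syscall = b.get_syscall_fnname("execve")  # resolve the kernel symbol for execve
b.attach_kprobe(event=syscall, fn_name="hello")

# Stream the kernel's trace pipe: one line per execve on the machine.
b.trace_print()
```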
And we immediately see tracing being generated by system calls that are running on this virtual machine. In another terminal window on the same machine, I can run, let's say, ps, and we can see the process number 74282.
Other types of event might give us information about a network packet, or the socket buffer being passed from an application into the kernel. We get contextual information about whatever it is that triggered the event that our eBPF program was attached to.
We might also write code that communicates with BPF to extract things like metrics, so eBPF is often used in observability. We can attach a program that increments a counter every time an event happens, keeping that counter in what's called a BPF map, so that user space can read that information and display the metrics.
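As a sketch of that pattern (assuming, for illustration, that we count execve calls per user ID), the kernel side increments an entry in a BPF map and user space polls the map to display the metric:

```python
#!/usr/bin/python3
# Kernel side: bump a per-UID counter in a BPF map on every execve.
# User space side: read the map every couple of seconds and print it.
from bcc import BPF
from time import sleep

program = r"""
BPF_HASH(counter, u64, u64);   // the BPF map: key = user ID, value = count

int count_execve(void *ctx) {
    u64 uid = bpf_get_current_uid_gid() & 0xFFFFFFFF;
    u64 zero = 0;
    u64 *val = counter.lookup_or_try_init(&uid, &zero);
    if (val) {                 // the verifier requires this null check
        (*val)++;
    }
    return 0;
}
"""

b = BPF(text=program)
b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="count_execve")

while True:
    sleep(2)
    for uid, count in b["counter"].items():
        print(f"uid {uid.value}: {count.value} execve calls")
```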
Normally, if you want to make a change to the kernel, it takes a very long time. Not only is it complex to change the kernel (it's 30 million lines of code), but making a change also requires that the entire kernel community is on board with that change and thinks it's a good idea. And not only that: even once a change lands in the upstream kernel, it still has to make its way into the Linux distributions people actually run.
All the different flavors of Linux are distributed with kernel versions that were typically released one, two, three, four, even five years earlier, so it can take a really long time for features in the kernel to make it into the production environments that enterprises are using. And this is why eBPF has suddenly become so popular, and why we're seeing a lot of tools being built on eBPF: the eBPF functionality that's required within the kernel to enable this whole platform is now sufficiently stable, and it's in the kernel releases that are typically being distributed today.
I think all Linux distributions are now shipping with enough of the eBPF platform built into them that we can, for example, run Cilium, where two or three years ago quite a lot of production versions of Linux were running an older kernel that wouldn't have had sufficient eBPF capabilities.
We can write a program that hooks into the event of receiving a network packet, look at that packet, and see if it's crafted in the format that a "packet of death" vulnerability requires in order to be exploited. If it is a packet of death, then our eBPF program can simply discard it. That means the kernel never gets to process that packet; the vulnerable code that's unable to handle it never gets hit, and the packet is harmlessly discarded.
Here I'm looking for ICMP, which is also known as ping, or TCP packets. Don't worry too much about the details: this code looks at each network packet to find, first of all, whether it's an IP packet, and if it is, it looks at the protocol type and traces out a message if it's either ping or TCP. Also, currently, in either case I'm returning XDP_PASS as the return code from this function, and what that says to the kernel is: just carry on handling this packet as you were going to do. (A sketch of a program along these lines appears after the demo below.)
And I can start pinging that address from outside the container, and we'll see, every second, a response coming back to that ping request.
So I've got a Makefile that compiles the C code, detaches any previously existing program from the eth0 interface in the container, and then loads the version that we've just compiled onto that interface. So, every time a network packet is received on the eth0 interface, it should trigger my eBPF program. I'm going to start catting out the trace output, and you can see here that every second we're getting a message telling us that a ping packet has been received.
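The demo's Makefile and C source aren't reproduced here, but a hedged sketch of the same flow, written with BCC rather than a separately compiled object file, might look like this (the interface name eth0 and the function name ping_filter are illustrative):

```python
#!/usr/bin/python3
# Sketch of the XDP demo: inspect each received packet, trace a message
# for ping (ICMP) or TCP, and let everything through with XDP_PASS.
from bcc import BPF

program = r"""
#define KBUILD_MODNAME "ping_filter"
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>

int ping_filter(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // The verifier requires explicit bounds checks before reading headers.
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    // First of all: is it an IP packet?
    if (eth->h_proto != htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    // Trace a message for ping or TCP.
    if (ip->protocol == IPPROTO_ICMP)
        bpf_trace_printk("Got a ping packet\n");
    else if (ip->protocol == IPPROTO_TCP)
        bpf_trace_printk("Got a TCP packet\n");

    // A packet-of-death check could return XDP_DROP here instead,
    // discarding the packet before the kernel's stack ever sees it.
    return XDP_PASS;
}
"""

b = BPF(text=program)
fn = b.load_func("ping_filter", BPF.XDP)
b.remove_xdp("eth0", 0)       # detach any previously loaded XDP program
b.attach_xdp("eth0", fn, 0)   # attach the freshly compiled one
b.trace_print()               # cat out the trace output
```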
When an eBPF program is loaded, the kernel's verifier checks that it's safe to run. Not only that: a given eBPF program can only look at memory that's appropriate for it. So, for example, if it's triggered by one process, it can't go off and look at memory owned by another process.
And it also checks that we never dereference a null pointer in an eBPF program. You have to explicitly check that your pointer is not null before you dereference it. So the verifier is used to make sure that our eBPF code is going to be safe to run, and this is one reason why eBPF sometimes gets called sandboxing; to some extent, that's true.
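For example, here is the kind of check the verifier insists on when reading from a BPF map (a minimal sketch using BCC; loading the function is what triggers verification):

```python
from bcc import BPF

# A map lookup may return NULL, so the verifier rejects any program
# that dereferences the result without checking it first.
program = r"""
BPF_HASH(counts, u32, u64);

int checked_access(void *ctx) {
    u32 key = 0;
    u64 *value = counts.lookup(&key);
    // (*value)++;        // rejected: possible null pointer dereference
    if (value) {          // explicit null check satisfies the verifier
        (*value)++;
    }
    return 0;
}
"""

b = BPF(text=program)
b.load_func("checked_access", BPF.KPROBE)   # the verifier runs at load time
```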
So on any given host, whether that's a virtual machine or a bare metal machine, there's one kernel, and that kernel looks after all of the user space applications, whether or not they're running inside containers, inside pods. In a Kubernetes environment they typically are running in containers inside pods, and those pods, or the application code within those pods, constantly want to do interesting things: accessing the network, reading or writing to files, or, in Kubernetes' case, creating more containers on this host.
All of these things require support from the kernel, so the kernel is involved in, and aware of, everything the pods do. That means that if we instrument the kernel with eBPF programs, they can be aware of everything that's happening inside those user space applications running within the pods. So we can write observability tools that can see events regardless of what pod triggered them, and this is why eBPF is such a powerful tool for observability: eBPF programs have this view across the entire node, and that enables really deep observability tooling.
So we can report not just what IP addresses packets are going to and from, but also map them to: what's the pod name? What's the service name? What's the node? What's the namespace? In a cloud native environment that's much, much more powerful, because in a Kubernetes environment IP addresses don't mean much for very long: a pod can be created and destroyed dynamically, and its IP address could be reused for a different pod in the future.
Another example of really deep observability that eBPF enables is Pixie, which supports all kinds of different observability measurements. This is just one example, a flame graph, showing how CPU is being used by all the applications across, in this case, the entire cluster, because it can coordinate information from multiple nodes.
So let's look at how networking works in a traditional, pre-eBPF environment. Each of our pods is typically running in its own network namespace, and that means it's running a network stack that's separate from the host's network stack. The pod is connected to the host through a virtual ethernet connection, so a network packet that's coming into this host and destined for that application has to traverse the host's networking stack and then the pod's network stack on top.
With eBPF, we can take that packet and pass it straight to the pod's networking namespace. This makes the path for that network packet dramatically shorter, and makes for faster networking. We can see this both in a flame graph and in benchmarks. This is taken from a blog post we did last year, some benchmarking work, where you can see the time taken when a packet is received.
The blue line on the left is a baseline of node-to-node, host-to-host traffic, without any pods or containers involved, and we can see that we achieve nearly as fast networking speeds using eBPF, because we're able to bypass so much of that additional networking stack. Whereas in the legacy mode, that is, without this shortcutting, both Cilium and Calico are able to handle fewer requests per second. So eBPF is making a significant improvement in the speed at which we can process network packets.
Another really important aspect of eBPF is that not only do eBPF programs have this ability to see across the entire node, they can do it without our having to make any changes to the applications. We don't have to change the way the application is configured, and we don't have to write any code within the application.
The eBPF program running in the kernel immediately gets visibility into those programs. Even if the application was running before we loaded the eBPF program, it's visible to eBPF tooling. Nathan LeClaire did a really great cartoon about how we can use eBPF for much more efficient instrumentation than the sidecar model.
With sidecars, you have to have the sidecar injected into the pod so that it can share the namespaces of that pod and see what's going on inside it. In order to inject that sidecar container, it has to be defined in the pod's YAML. That probably isn't done manually: you probably have some automated process to inject the sidecar, perhaps in admission control, perhaps even in your CI/CD system.
So what if something goes wrong? If the YAML doesn't get injected correctly, then the sidecar will not have visibility over what's happening inside that pod. It could be a misconfiguration, it could be a bug; anything that causes the sidecar not to be injected means the pod is invisible to that tooling. With eBPF, that gap doesn't exist, because the kernel sees everything on the node.
So if you have an attacker who has perhaps compromised the node and started running some malicious workloads, in pods or not in pods, it doesn't matter: it's going to be visible to eBPF code running in the kernel. Whereas if you're relying on the sidecar model, your attacker is probably not going to instrument their pods with your observability tooling or your security tooling.
The other very common use for sidecars is with service mesh, and eBPF is enabling service mesh models that don't require sidecars. We launched the Cilium service mesh beta towards the end of 2021, and we've had hundreds of people sign up to use it, and the feedback has been phenomenal. People are very excited about it for a couple of reasons.
The first is that we don't need a sidecar in every pod, and that reduces the complexity and the resource usage. Take, for example, injecting a network proxy into every pod: if that proxy has to have routing information, that routing information is duplicated in every sidecar, whereas if we have one network proxy per node, then we only need one copy of that routing information.
It also makes it much less complex to manage, and we've had a lot of feedback from users who really want to avoid the administrative overhead of dealing with a sidecar in every pod.
If we're using a sidecar, a packet coming from the application has to go through the loopback interface within the pod's namespace so that it can reach the network proxy, which runs in user space, and then the proxy can send the network packet out through the pod's network namespace and then through the host's networking stack. So it's quite a convoluted path for every network packet.
With eBPF and the Cilium sidecarless model, we can dramatically shortcut that. If traffic doesn't need layer 7 termination, it can be sent very much like the non-service-mesh case: it can go from the application, through the eBPF network connection, directly to the physical interface.
So I hope that's given you some insight into why I'm so excited about eBPF, and why I believe it is the foundation for this new generation of cloud native networking, security, and observability tooling. If you want to find out more, there are two free-to-download reports that we've published through O'Reilly. One is "What is eBPF?", an introduction that I wrote recently; the other is a report on security observability with eBPF, written by Natalia Reka Ivanko, a colleague of mine at Isovalent, and Jed Salazar.
Both of these reports you can download from the Isovalent website. If you want to check out Cilium, it is, like all CNCF projects, available on GitHub, and it's very welcoming: we would love to get new contributions. There's also a very, very active Slack community, which you'll find if you go to the Cilium website and follow the links to Slack from there. You'll find a community full of people who will help you out, answer questions, and get you started on your Cilium journey.