From YouTube: Episode 33: eBPF KPNG proxying w/ Stoycos.
Description
Join us with Andrew Stoycos to go over the latest addition to the KPNG family, the eBPF proxy!
B
B
A
Okay, they're all busy. Yeah, so I'm Andrew Stoycos. I work over at Red Hat in the Office of the CTO, mostly on upstream Kubernetes, and more recently I've been exploring the wide and dangerous world of eBPF. And so here we are.
B
Tell him to get out of the meetings, whatever meeting he's in, and leave. Okay, where are you? You're here, right? This is super; this is an exciting show, because we haven't done a KPNG update for a couple of episodes. So for folks that don't know about it: KPNG is this new back end. It was originally written by Mikaël Cluseau, and then we adopted it as a community, we've taken it over, and we've added all these back ends to it.
B
Some of them are hacked together; the iptables one literally transmogrified the code from upstream into KPNG, and we adapted it.
B
B
Exactly. So there are two different ways to build a proxy in KPNG. One is by using this data model, where you directly respond to services and endpoints, and then there's another one where you send the entire state of the global network data model directly over gRPC down to a back end, which totally allows you to run separate back ends from front ends. The interesting thing, as an example of what that looks like, is that the entire state space of a Kubernetes network looks something like this, okay? So you can send this entire thing as a data structure down to a back-end proxy that's running somewhere else.
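To make the full-state idea concrete, here is a minimal Go sketch of what such a global network model might look like. The type and field names are illustrative assumptions, not KPNG's actual gRPC/protobuf model.

```go
package main

import "fmt"

// Hypothetical types sketching the "global state" a KPNG front end could
// stream over gRPC to a back end. KPNG's real protobuf model differs.
type Endpoint struct {
	IP   string
	Port int32
}

type Service struct {
	Namespace string
	Name      string
	ClusterIP string
	Port      int32
	Protocol  string // "TCP" or "UDP"
	Endpoints []Endpoint
}

// GlobalState is the entire Kubernetes network state space: every service
// and the endpoints backing it, sent as one data structure.
type GlobalState struct {
	Services []Service
}

func main() {
	state := GlobalState{
		Services: []Service{{
			Namespace: "default",
			Name:      "nginx",
			ClusterIP: "10.96.131.115",
			Port:      80,
			Protocol:  "TCP",
			Endpoints: []Endpoint{{IP: "10.244.2.2", Port: 80}},
		}},
	}
	// A back end receives the whole thing and re-derives its rules from it.
	fmt.Printf("%+v\n", state)
}
```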
B
A
B
No, oh god. Something went on YouTube and some bots started spamming us. All right, how do we get rid of that bot? All right, sorry, folks.
A
B
Whatever you want. It was pretty bad, okay, so I'll edit that out later, or whatever it was, if it showed up in the comments. Anyways, here we go. So Stoycos made an eBPF backend, which is, I mean, really exciting. Mikaël did the nftables one originally, right? So for folks that are new to the kube-proxy: there are different backends that write proxying rules, and that's how the service IP addresses wind up getting forwarded to pods in any Kubernetes cluster. And Stoycos...
A
It's just a PoC right now. It works with ClusterIP and the TCP and UDP protocols, so we still have a long way to go. But it's a good start.
B
So let me fetch the latest here, and then we'll run KPNG. It's really easy to test out; for folks that want to try it, all you've got to do is run this hack, like this. We just have to run this and we'll see if it works. We did a very quick test of it earlier and it looked like it wasn't working, but we'll double check. I think the other reason is because of my system, but we can figure that out.
A
Here it is. The interesting thing on the control plane side of things with KPNG is that the nftables back end does implement this kind of full-state model, where any time there's an update to services or endpoints, you get the full state of the world, so it's essentially a level-driven controller. Whereas the iptables backend, as it stands right now, implements more of an event-driven controller, where you're responding to endpoint and service events. And you know, with writing Kubernetes controllers, the suggested way is to always do it level-driven.
A
So it's cool to see both approaches, but it would be nice if, as a community, we could converge on one way or the other, at least within KPNG. Well, I...
B
Yeah, so exactly, there are two ways to build one of these. So as this comes up, we'll put this on the side over here. So this is coming up, and then, yeah, okay, so it's coming up. So when we run test_e2e.sh, it creates a kind cluster, and inside that kind cluster it disables the kube-proxy, right, because you can run kind with the kube-proxy disabled. Then that comes up, and then it installs KPNG as an alternate kube-proxy afterwards.
B
We'll see that in a second. Dimitri, I see you here; since this is a live stream, usually the camera's on in a live stream, so you've got to hang out with us. But I'm really glad you're here, because Dimitri is writing the kube-proxy updates for the Windows kernel space, right? So we have this one here.
B
Maybe he's even got an update for us. So we have this other one, which is the Windows kernel space. I made this MR originally and now Dimitri is kind of taking it over, and with this one I'll show you all what Stoycos is talking about. So this is the Windows implementation, and yeah, we can see here SetService, right. You can see here.
B
The SetService, DeleteService: this is why Stoycos is saying this is an event-driven implementation, where every time a service is created, we update the Windows proxy rules; every time an endpoint is created, we update the Windows proxy rules. This is kind of...
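For contrast with the full-state style, here is a rough Go sketch of that event-driven backend shape: SetService / DeleteService style calls that rewrite proxy rules as each event arrives. The interface and method names are illustrative, not the exact KPNG or Windows-backend API.

```go
package main

import "fmt"

// Hypothetical event-driven backend interface: the front end calls these
// methods as individual service/endpoint events arrive.
type EventBackend interface {
	SetService(namespace, name, clusterIP string)
	DeleteService(namespace, name string)
	SetEndpoint(namespace, svcName, endpointIP string)
	DeleteEndpoint(namespace, svcName, endpointIP string)
}

// windowsLikeBackend just logs; a real back end would rewrite its proxy
// rules on every call.
type windowsLikeBackend struct{}

func (b windowsLikeBackend) SetService(ns, name, ip string) {
	fmt.Printf("update proxy rules: service %s/%s -> %s\n", ns, name, ip)
}
func (b windowsLikeBackend) DeleteService(ns, name string) {
	fmt.Printf("remove proxy rules: service %s/%s\n", ns, name)
}
func (b windowsLikeBackend) SetEndpoint(ns, svc, ip string) {
	fmt.Printf("update proxy rules: endpoint %s for %s/%s\n", ip, ns, svc)
}
func (b windowsLikeBackend) DeleteEndpoint(ns, svc, ip string) {
	fmt.Printf("remove proxy rules: endpoint %s for %s/%s\n", ip, ns, svc)
}

func main() {
	var be EventBackend = windowsLikeBackend{}
	be.SetService("default", "nginx", "10.96.131.115")
	be.SetEndpoint("default", "nginx", "10.244.2.2")
}
```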
A
B
B
Okay, anyways. The point is, wrong or right, this is the way the iptables proxy works in current Kubernetes, and it's also how it works in KPNG, because we basically ripped all that code out of upstream and got it working in KPNG. But that's not how KPNG was really designed to be run. The way it was really designed to be run is that you take the entire state space, and then this is what you're calling a level-driven controller. Is that right? Yeah? Okay, I can show what I'm talking about.
B
A
We need more folks to start breaking everything I've written, for sure. That's the fun part.
B
C
B
B
C
A
The link you put, sample-controller, is that the actual Kubernetes reference, the go-to one? Okay.
A
The thing is, where do they say level-driven controller? I don't know if they do here, but if you actually look through it, I'm pretty sure it's a level-driven controller, and it's on the main Kubernetes documentation page.
B
Okay, let's see... it failed. All right, cool. All right, so we can see here that we created this new cluster, and then...
B
Then it deleted the existing kube-proxy, and then we loaded the compiled images that have this from Stoycos's fork in them, and then this sort of came up. But let's see... kubectl... okay, so we can see, yeah, we can see that these are all in CrashLoopBackOff. So let's look at the KPNG... So a regular kube-proxy, right, has these running on every node, so: kubectl get nodes. We can see here that there are three nodes here. It's...
B
D
A
So when...
D
A
A
A
B
Yeah, so exactly, we should see it when we did the describe.
A
B
A
B
So this container, for folks that are new to KPNG: this container, as Andrew said, and I'll say it again, this thing is just watching the API server. So this is this part over here, right? It's watching the API server and it's just saying: hey, what services and what endpoints are there? Because I'll write those networking rules for you. That's half of what the regular kube-proxy does. The other half is that it goes and writes iptables rules on your node, right?
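That watch half is ordinary client-go machinery. The real kpng server wires these events into its data model and gRPC API; the sketch below only shows watching Services with a standard informer, and the in-cluster config and 30-second resync are assumptions for the example.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Assumes we run inside the cluster, like the kpng server pod does.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	svcInformer := factory.Core().V1().Services().Informer()

	// On every service event we would feed the change into the data model
	// that the back ends consume.
	svcInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			svc := obj.(*corev1.Service)
			fmt.Println("service added:", svc.Namespace+"/"+svc.Name, svc.Spec.ClusterIP)
		},
		DeleteFunc: func(obj interface{}) {
			fmt.Println("service deleted")
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop
}
```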
B
A
So essentially, BTF is a part of BPF that encodes debugging and type information into the compiled BPF programs, and my guess is that your kernel is pretty old. So what does uname -r show on your machine? Okay, it's uname -r.
A
B
D
A
D
Give it to me. Let's see, all right, let's see here. All right, how do I give it to you? Okay, so...
B
A
So, let's check out what kernel my machine is running. As I said before, I'm on 5.18.6, okay, compared to Jay's 4.15, so that's pretty old; there are probably a lot of dependencies missing on that old of a kernel. And remember, this is kind of a funny deployment scenario for an eBPF program.
A
So if you don't know what eBPF programs are, there are a bunch of great resources online, and I'm happy to point them out to you. But essentially, when you run eBPF programs on, or within, Docker containers, really at the end of the day all that matters is the kernel version on your underlying system. So that's why this is important: it failed to spin up on Jay's cluster. But let's check whether things are working over here.
A
Awesome, in this case everything is up and running, a little bit old. I actually haven't had to tear this down in a while, which is nice, but we have our KPNG pods, all three of them.
B
Okay, so now we're trying to do this again, we're doing it in his environment, okay. So you've already got this; he's already run the steps that I just ran, except he's running it on a...
A
A
A
A
A
B
A
So why don't we... it sounds like you might just... okay, anyways, go on. Why don't we check out these back ends we've been talking about for a while? Obviously here we have iptables, IPVS, nftables, userspace, and so on. Jay showed earlier in this program how iptables, for example, implements sort of a... let's see, where is it in here? I can't even remember... sort of a, yeah, right here: SetService, DeleteService. It's essentially event-driven. BPF...
B
B
A
But anyway, I was talking about this callback function, which is how we respond to events in the eBPF back end, and you can see every time...
A
Yep, essentially it isn't the smartest, but it's a true level-driven controller, and thankfully KPNG has also given us some libraries to help out with figuring out what we need to do on each reconciliation cycle. One of these libraries is this lightdiffstore library, which essentially just allows you to figure out if something has changed in a given service or a given endpoint on every reconciliation cycle. So that's really nice, it makes things pretty easy, and we're...
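The real lightdiffstore lives in the kpng repo (and uses a B-tree underneath, as discussed a bit later). As a rough illustration of the idea only, not of its actual API, a diff store boils down to something like this:

```go
package main

import "fmt"

// diffStore is a toy version of the idea behind kpng's lightdiffstore:
// you write the full desired state every cycle, and it tells you what
// actually changed since the last Reset.
type diffStore struct {
	prev    map[string]string
	current map[string]string
}

func newDiffStore() *diffStore {
	return &diffStore{prev: map[string]string{}, current: map[string]string{}}
}

// Reset starts a new reconciliation cycle: current becomes the baseline.
func (d *diffStore) Reset() {
	d.prev = d.current
	d.current = map[string]string{}
}

// Set records a key/value for this cycle.
func (d *diffStore) Set(key, value string) { d.current[key] = value }

// Updated returns keys that are new or whose value changed since Reset.
func (d *diffStore) Updated() []string {
	var out []string
	for k, v := range d.current {
		if old, ok := d.prev[k]; !ok || old != v {
			out = append(out, k)
		}
	}
	return out
}

// Deleted returns keys that existed last cycle but were not written this cycle.
func (d *diffStore) Deleted() []string {
	var out []string
	for k := range d.prev {
		if _, ok := d.current[k]; !ok {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	s := newDiffStore()
	s.Set("a", "alice")
	s.Set("b", "bob")
	fmt.Println("updated:", s.Updated(), "deleted:", s.Deleted())

	s.Reset()
	s.Set("a", "alice") // unchanged this cycle
	fmt.Println("updated:", s.Updated(), "deleted:", s.Deleted()) // "b" shows up as deleted
}
```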
B
A
C
A
So diffstore is kind of the core one; I think Mikaël was using that to prototype a bunch of different things with different data structures, and then lightdiffstore is actually what he intended to be used by us. So let's check out this test and see what it does. Basically, we create a new lightdiffstore, and the first thing we do is store two entries. It's basically a map, so key a, value alice; key b, value bob. And what lightdiffstore...
A
B
Yeah, exactly. So this is a way of, for some history here: the way the kernel-space kube-proxy works is that it basically calculates the difference in memory every single time a change happens and asks: has the network state space changed or not? If so, it has a little data model in memory that it modifies, and it sets a changed bit, and then every, like...
B
I don't know, I think it's like every second or something, the existing kernel-space kube-proxy goes and looks at that state space and rewrites networking rules. So the way KPNG handles this sort of similar problem of calculating the differences, when you're using this proxy type (there are two types of proxies in KPNG, as we've shown): when you're doing the type that actually reads the entire networking state space every time, you have this problem, which is that you're reading in the entire state space.
B
So you need to calculate a diff somehow, because you have the entire state space from a minute ago and the one from now. So this is the tool that Mikaël put into the project to allow us to calculate this diff. So you write, write, write, write, you hit a reset, and then you write, write, write, write again, and now you can see the difference between the first batch of writes and the second batch of writes.
A
A
Because diffstore, when I asked Mikaël about it, was essentially his attempt to just prototype things, and here he used a couple of different data structures, and lightdiffstore is literally just using a B-tree at its core, I believe. So if we go in, it's...
A
A
I had to figure out how to use a lightdiffstore in the eBPF backend, so that's where we're at. But this is kind of cool. I can see that this test is failing, because I forced it to fail, but you can essentially see we add two new entries, one for a, corresponding to a value of alice, and one for b, corresponding to a value of bob. Then, when we print the map's updated entries, we can see that those two entries were added. So put this in the context of services.
A
B
B
A
So there's a lot going on here. Basically, I started by looking at what Cilium has done and seeing what we had to do to pull it out of their control plane. So that's what we've done here. Right now, what I'm showing you is still on the control plane side of things; we're not in the data plane yet. So, in the control plane...
A
A
At the end of the day, we're storing the information we need into a lightdiffstore, which is called the eBPF service map here, and after that we say: okay, if anything's changed, if the diff store's updated or deleted set has any entries...
A
Yeah, it's essentially what I've used to represent, like, an eBPF controller, right? So it holds some things like references to our BPF programs and the IP family, and then the most important thing is that it holds basically an in-memory cache of what's going on in our services and endpoints world, right, yeah.
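As a rough Go sketch of what such a controller might hold (field names here are assumptions for illustration; the actual code is in Stoycos's PR):

```go
package main

import "github.com/cilium/ebpf"

// ebpfController is an illustrative sketch of the state the eBPF backend
// keeps between reconciliation cycles: handles to the loaded BPF objects
// plus an in-memory view of the services/endpoints world.
type ebpfController struct {
	ipFamily string // "ipv4" or "ipv6"

	// References to the loaded BPF program and maps.
	connectProg *ebpf.Program
	serviceMap  *ebpf.Map
	backendMap  *ebpf.Map

	// In-memory cache of services and their endpoint IPs, keyed by
	// "namespace/name", used to decide what to write into the maps.
	services map[string][]string
}

func main() {
	_ = ebpfController{ipFamily: "ipv4", services: map[string][]string{}}
}
```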
B
A
D
A
All we're doing is making a key that's made out of the namespaced name for a service (the namespaced name is literally just the name of the service and the namespace it's in, so it's unique), the specific port for that service (so a single service could have multiple entries for multiple ports), and then the protocol for that service, right, for that service-to-endpoint mapping.
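In Go terms, that key could look roughly like this (an illustrative sketch, not the PR's exact types):

```go
package main

import "fmt"

// svcKey identifies one (service, port, protocol) tuple in the diff store.
// A service with several ports gets several keys.
type svcKey struct {
	NamespacedName string // "namespace/name", unique per service
	Port           int32  // one specific port of the service
	Protocol       string // "TCP" or "UDP"
}

func main() {
	k := svcKey{NamespacedName: "default/nginx", Port: 80, Protocol: "TCP"}
	fmt.Printf("%+v\n", k)
}
```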
A
A
A
B
A
A
Yeah, we'll go look at that a little bit later, but the way that my Go program confers with my C program is with this concept of eBPF maps, okay. So basically, what I do in Sync is load those eBPF maps with the data that represents my current services and endpoints. This part is for deleting, but if we look at all of the entities that were updated since the last iteration, we are then going to add corresponding entries for a given... what?
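Writing those entries from Go is done through eBPF map updates; here is a hedged sketch of that mechanism using the cilium/ebpf library. The pin path and the key/value layouts below are assumptions for illustration and would have to match what the C side actually defines.

```go
package main

import (
	"encoding/binary"
	"log"
	"net"

	"github.com/cilium/ebpf"
)

// v4Key mirrors a hypothetical C-side map key: VIP plus port. The real
// layout in cgroup_connect4.c may differ; this only shows the mechanism.
type v4Key struct {
	VIP  uint32
	Port uint16
	Pad  uint16
}

type v4Value struct {
	BackendIP   uint32
	BackendPort uint16
	Pad         uint16
}

func ipToU32(s string) uint32 {
	return binary.BigEndian.Uint32(net.ParseIP(s).To4())
}

func main() {
	// Assumes the map was pinned by the backend under bpffs; the path is
	// an assumption for this sketch.
	m, err := ebpf.LoadPinnedMap("/sys/fs/bpf/v4_svc_map", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer m.Close()

	key := v4Key{VIP: ipToU32("10.96.131.115"), Port: 80}
	val := v4Value{BackendIP: ipToU32("10.244.2.2"), BackendPort: 80}

	// Put inserts or updates an entry; Delete would remove entries for
	// services that disappeared since the last iteration.
	if err := m.Put(&key, &val); err != nil {
		log.Fatal(err)
	}
}
```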
A
A
I can't see his... I can't...
A
Periodic syncs? We're not doing any periodic syncs, I don't think, unless...
A
He's doing this full state, I mean, yeah. This is totally full state, so...
A
B
D
A
A
B
I know, yeah, so I mean this is full state, yeah, this is full state, so there's no periodic anything in it. I'll say it again, because there's a lot of new people here, I'm sure, who are going to watch this: in the old, existing kernel-space kube-proxy, and in the KPNG iptables implementation, we do periodic syncs. In other words, there's a data structure that's sitting there figuring out what the differences are in memory, and then every, like, one...
B
...second, if there is a difference in the pods or the services in all of Kubernetes that needs to be written out, so that you can write new iptables rules (like a pod got deleted and a new pod came up, or a service got created, or whatever), those new iptables rules are written, and there's a thing that's periodically running that. So there's...
A
D
A
...me, so that idiots like me could copy people. No, no, it made sense. It made sense then, but it has made it a bit confusing for people who are new, coming in to write a new back end. I want to standardize this: I want to do it either with a SetService-style API or with a generic callback, full state every time, sort of thing, right.
B
Okay, that's fair! So, okay, keep going! Where were we? We were just about to load those eBPF maps, so...
B
A
A
It's basically a rule-based proxy: as you write more rules, for every packet you essentially have to loop through all the rules and make a match or not. You interact with it in user space with just a simple iptables command, and it's handling NAT, interacting with conntrack; it's handling a lot of things.
A
B
A
A
B
B
A
A
A
D
A
We need to fix that. Now we're in a tools container, and we can use bpftool to see what's going on on our system. So if we run bpftool map, we can see we have three different copies of both a service map and a backend map, and this is how we share information between our kernel and our user-space program. You can also do really interesting things like dumping maps.
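bpftool is the quickest way to poke at these, and the same pinned maps can also be dumped programmatically with cilium/ebpf. A small sketch, assuming a hypothetical pin path:

```go
package main

import (
	"fmt"
	"log"

	"github.com/cilium/ebpf"
)

func main() {
	// The bpffs path is an assumption for this sketch; `bpftool map`
	// shows the real map IDs and pins on the node.
	m, err := ebpf.LoadPinnedMap("/sys/fs/bpf/v4_svc_map", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer m.Close()

	// Iterate raw key/value bytes, the same data `bpftool map dump` prints.
	key := make([]byte, m.KeySize())
	value := make([]byte, m.ValueSize())
	it := m.Iterate()
	for it.Next(&key, &value) {
		fmt.Printf("key=%x value=%x\n", key, value)
	}
	if err := it.Err(); err != nil {
		log.Fatal(err)
	}
}
```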
A
A
The reason we see all three of those in kind is because, at the end of the day, all these BPF programs are running on a single node, and there is no concept of namespaces for BPF programs yet, so we see everything. It's just an unfortunate fact of running this in kind; at least, I haven't found a solution around it yet. Does that make sense?
D
A
B
B
Okay, so those are... that's running on node three. So I have three nodes in my cluster, and inside the third node in my cluster I have two eBPF maps that have been written for me by Stoycos's back end. One of them is called the service map, the other is a backend map, and each one of those maps has a bunch of bindings, a bunch of networking rules, in it. Now you're going to show us all the networking rules, so which one are we going to do first, 133 again?
A
B
A
Yeah, we can even try to... well, we don't know, that's fine, no, yeah, just keep going. Cool. So that's how we see our maps. Let's now go talk about the actual BPF program that's doing all the fun data plane bits. That's down in the bpf directory, and it's titled cgroup_connect4.c. This is where all the magic's happening, and remember those maps we were just talking about? Well, here's where they're defined: we have our v4 service map and v4 backend maps.
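Those are ordinary BPF hash maps. Expressed through cilium/ebpf's MapSpec from the Go side (key/value sizes and max entries below are assumptions; the authoritative definitions are in cgroup_connect4.c), they amount to roughly:

```go
package main

import (
	"log"

	"github.com/cilium/ebpf"
)

func main() {
	// Roughly what the v4 service map amounts to: a hash map keyed by
	// (VIP, port, backend slot) with a small fixed-size value.
	// Key/value sizes here are assumptions for this sketch.
	spec := &ebpf.MapSpec{
		Name:       "v4_svc_map",
		Type:       ebpf.Hash,
		KeySize:    8,
		ValueSize:  8,
		MaxEntries: 65536,
	}

	m, err := ebpf.NewMap(spec)
	if err != nil {
		log.Fatal(err) // needs CAP_BPF or root and a reasonably recent kernel
	}
	defer m.Close()
	log.Println("created map:", m)
}
```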
A
B
B
A
A
It's all built into the bytecode, and you can... the bytecode is actually seen here in the .o files. You can go pretty in-depth.
B
A
Whatever. So I'm not going to go super in-depth into this C code, because it's not extremely exciting, it's pretty easy, but I want to talk about the concept behind it. So, whenever a client on a node, or in this case something in a container on a node, whether that's curl, a Go client, a C client, we don't care, whenever a client tries to reach out to an IP, one of the syscalls that happens is a connect4 syscall. And let me show you kind of a proof of that.
A
A
But I did, in here... maybe I didn't, actually. I have it somewhere. I know where it is.
A
B
A
Okay, so whenever a client on a Linux machine reaches out to a server, a connect4 call is made, and whenever that connect4 syscall is made, this BPF program is called, okay? Does that make sense? So here's where that actually happens, whenever a client tries to reach out to something, whether that's with curl or any other client.
A
A
Now this function, sock4_forward, at a high level grabs the traffic, looks at the destination IP and the port, and says: do we care about this traffic? Ninety percent of the time we don't care, right. All we care about for a service proxy is whether the traffic is destined for a VIP and a given port. Does that make sense?
B
A
D
A
It's invoked whenever a connect4 syscall is made, and then once it's invoked, it looks at the values in our eBPF map, which are basically a list of VIPs, and it looks at the actual destination IP of the packet and says: is the destination IP of the packet one of these VIPs I know about? Does that make sense? Okay.
B
So you're really... so it's a generic API call to the kernel, yep. You just make a generic call to the kernel and you're saying: here's a map, here's a map of VIPs. Then that data is sent to the kernel, and because eBPF was written to deal with networking, it knows how to deal with that; eBPF was written to read these IP address rules.
A
...this out. So, as I was saying before, whenever a connect4 syscall is made, this sock4_connect program is called, and then this sock4_forward program is called. So let's look at the sock4_forward program. What's passed to the user here is a bpf_sock_addr, and the bpf_sock_addr is, whoa, look, a bunch of information about our packet: family, IP address, destination IP address, etc., etc.
A
Right, so now, in our BPF program, we make a key, and the key literally looks at the destination IP address of the packet, the destination port, and a backend slot, which isn't super important, it's kind of an internal subtlety. And we do a lookup into our BPF maps, which is done here, and if we care about that IP and port, then we go do other things.
A
A
The next thing we do is look up a back end. So we know we have a service we care about, but we need to figure out which endpoint we want to direct this connection to. The first lookup we did was to see if we had a VIP we cared about, i.e. a service; the next lookup we do is to find a back end for that VIP.
A
B
A
B
A
A
And then the last thing we do is literally rewrite the destination IP and port of the packet. Remember, this is the first packet in a connection, and this is happening before the packet has even been directed to the Linux networking stack. So we're kind of cheating; we're tricking the Linux networking stack into thinking that a client is just talking to a pod. So really, we're tricking...
A
B
A
...not doing something wrong; it's not always trickery, but in this case it's kind of cool, and it's what Cilium does. It makes a lot of sense. So, right now I have a single NodePort... sorry, I actually have a NodePort service, and I should not, because we don't support NodePort yet.
B
B
...that you're inside the cluster network, and you're not writing any kind of forwarding rule or anything; you're rewriting the packet, but you're rewriting it at a point where the packet has no problem being routed to that new place, because it knows where it is, right. Yeah.
A
A
A really basic nginx deployment: two nginx pods backing this service, with a cluster IP of 10.96.131.115. We can go over into my temp shell, which is just a temporary client pod, and verify that this works, right? So I'm going to curl the VIP... ta-dah, it works. But let's go check out what's actually going on, like, what does the node see? That's what I want to see. So what node is my client on?
B
This is like a personal self-affirmation, okay. So James, it's good to see you, man. I just want folks to realize the gravity of what Stoycos has just showed us. This is the kubernetes-sigs/kpng project. This is very soon going to be eBPF supported as a proxying option in a project that is maintained by the Kubernetes community upstream, and wholly upstream.
A
I do my curl, and obviously we get a bunch of stuff, but let's look at the first packet of this connection. This is like the coolest thing ever, right, the first packet of the connection. Let's look again: what IP did I just curl? 10.96.131.115, right. Look at what the Linux networking stack sees: it does not see that IP anywhere. All we see is 10.244.2.3 going to 10.244.2.2. It's literally seeing pod-to-pod. That's it, that's the exciting part.
A
A
You're such a big nerd, I can see how happy you are. Right, I'm a huge nerd. I actually found a kernel bug in this, and I'm going to try to post a kernel patch. We've also implemented some logging, so whenever our eBPF program catches a packet, we can kind of see what the backend address was. Note this is in hex, and it's in network byte order.
B
A
B
A
Yeah, and that's the cool part. And you know, Cilium has been doing this forever, but with Cilium, really, all their code is tightly tied to their control plane, right? What we've done here is pull that design out and apply it to the KPNG control plane, so now everyone can use it pretty easily, right, really easily. So yeah, one last thing I'd like to talk about: we looked at our C program...
A
We looked at our control-plane Go program, and a really interesting thing that I do here is use Cilium's ebpf library, which is their public Go eBPF bindings, basically (which, of note, they don't even use in their production code, which is kind of interesting). With it I can manage the compiling and loading of my BPF programs and maps in Go. That's a pretty powerful thing too. Okay.
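A condensed sketch of that load-and-attach flow with the cilium/ebpf library; the object file name, program and map names, and the cgroup path below are assumptions for illustration, and the backend's real generated bindings differ.

```go
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

func main() {
	// Parse the compiled BPF ELF (the .o bytecode we saw earlier).
	spec, err := ebpf.LoadCollectionSpec("bpf/cgroup_connect4.o")
	if err != nil {
		log.Fatal(err)
	}

	// Load the programs and maps it declares into the kernel.
	coll, err := ebpf.NewCollection(spec)
	if err != nil {
		log.Fatal(err)
	}
	defer coll.Close()

	prog := coll.Programs["sock4_connect"] // program name is an assumption
	if prog == nil {
		log.Fatal("program not found in collection")
	}

	// Attach to the cgroup v2 hierarchy so the program runs on every
	// connect4 syscall made from that cgroup.
	l, err := link.AttachCgroup(link.CgroupOptions{
		Path:    "/sys/fs/cgroup",
		Attach:  ebpf.AttachCGroupInet4Connect,
		Program: prog,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer l.Close()

	// Maps declared in the C code are now reachable by name from Go.
	svcMap := coll.Maps["v4_svc_map"] // map name is an assumption
	_ = svcMap

	select {} // keep the program attached while this process runs
}
```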
B
And that's how you're doing all this. So you're using the Golang SDK to load all the code; the Go SDK compiles the C code, loads it into the kernel, and does all that for you.
B
B
A
A
Yeah, I think we have to, actually; that's probably another reason it was failing on your machine. But if we use llvm-objdump on these ELF files, you can get more information about them. So remember our BPF program? Here it is, cgroup_connect4, right. That's a cool little tidbit I use, so...
B
A
I don't have some of the stuff, but a lot of it. They have scripts that build their bytecode on the fly, right, so they don't have a go generate, which, for everyone's info, is just wrapping clang. They literally have a script that runs clang and compiles their BPF C code into bytecode.
A
No, because bytecode is portable, and it's gotten more portable, especially with BTF. Like, the goal for BPF is to have compile-once, run-anywhere bytecode, and we're getting closer to that. So that's kind of why I included just pre-compiled bytecode in here. But in reality, probably for the current state of the kernel, and the current state of folks not being on the bleeding edge of the kernel, it makes more sense to build on the fly. Does that...
B
You know, a middle-of-the-road solution is that you could put this .o file on, like, an S3 bucket, and then you could have a configuration thing in the deployment that said where to pull the .o file from. Or you could have some kind of a thing that mounts it; you know, you could use go-bindata, my favorite.
B
A
Yeah, I definitely am going to go back and think about how I do that, but yeah, anyway. This library is nice, though, because it allows us, in Go, in our control plane world, to maintain a reference to our v4 service map and our v4 backend map, the two BPF maps we were looking at, and it also allows us to maintain a reference to our program, right. So it generates these Go bindings that give us Go representations of our program and maps.
A
B
B
I just love dropping it. Look, somebody just joined super late. I don't know who you are, but you missed the show, I'm sorry. So, okay, let's wrap up here. So, in summary, let's go back to your stuff, so here's the... wait, I'm sure I'm on the wrong screen, even though I kind of wanted to share that one too.
B
But this is the one I really wanted to show, the original. Here's your code. So, wrapping up: here's the KPNG architecture. We saw that Stoycos went and implemented a full-state backend, which takes the entire Kubernetes networking state space, which is...
A
Which one? My YAML file that I always show people... I lost it. Wasn't it in the main directory?
B
Oh, it's at the bottom, right, yeah, global state. So he's got this, right? He's got the whole state space flowing into here, right? So he's got that whole thing and he is sending it in to the back end, his eBPF back end, which he has just written in this beautiful PR here, right. That's...
A
B
So now, here's his... now he's generating all this bytecode, right, and these are binaries. So he's got these in the PR, and he's going to change this, though, somehow; you could encode them with go-bindata or whatever you want. I don't think it's a big deal to merge this as is, I'm just saying. No, I need to change it.
B
Okay, so he's got these binaries, and then... so then this backend sync says: okay, it receives the updated state space, and then it uses the diff store thing that he showed you earlier to say: oh, I have a new service, it's 12.0.0.2 or whatever, I need to write a new service out. So then that thing decides to write a new service out, and it writes to an eBPF map that's already there, yeah, okay.
B
B
So then he writes a key to the map to update whatever the thing is that he's updating, the eBPF rule, and then the next time a packet comes in, if it's destined for one of these ClusterIP VIPs, right, so if it's destined to this new thing here, then before we even hit the kernel networking stack, that packet's IP address gets rewritten, right. So you have a packet diagram, right, so that's rewritten...
A
B
A
B
B
A
No, it thinks it's the pod where the client called the connect4. So if you were in a temporary client pod and you curled a service... So what if I'm coming in from the outside world, if I was doing, like, a NodePort, or if I was doing an external IP? Here are some harder things: you would hook into a different BPF hook point, and that's, like, my next... one of our next items on the list. That's...
A
...NodePort yet, okay, all right. It's a lot harder! Yeah, well, not a lot harder, but you can't really just trick the kernel anymore, because you're coming in and out of a network, whereas here you're pretty much in the same network, you're just defining VIPs, right; you're in the cluster overlay, you're in the pod network.
B
A
A
B
B
B
B
A
B
Andrew Stoycos's tour de force of the KPNG eBPF implementation, okay. And big thanks to our Cilium friends, because, like, if... big thanks to the Cilium friends: if the eBPF community and the Cilium folks hadn't done all of that, a lot of this stuff in the CNI, and sort of thought about it and worked through the kinks and made libraries and containers and all the rest of it...
B
...you know, this would be a lot harder. We wouldn't, yeah, we wouldn't be here today. So, all right, and of course thanks to Mikaël for building all the KPNG stuff for us to put these back ends into. So, all right, thanks everybody, we'll see you all later. Thanks everybody for showing up; like and subscribe.