From YouTube: CNL: Setting up monitoring for Calico’s eBPF Data Plane
A
Hello everyone, welcome to Cloud Native Live, where we dive into the code behind cloud native. I'm Annie Talvasto, a CNCF ambassador as well as a senior product marketing manager at Camunda, and I will be your host tonight.
A
Every week we bring a new set of presenters to showcase how to work with cloud native technologies. They will build things, they will break things, and they will answer your questions, so join us every Wednesday to watch live. This week we have Chris from Tigera here with us to talk about some amazing topics. One thing happening in the cloud native sphere: remember to register for KubeCon Europe; now is really the time to secure your spot, so get over there. And, as always, this is an official live stream of the CNCF and, as such, it is subject to the CNCF code of conduct.
A
So please do not add anything to the chat or questions that would be in violation of that code of conduct; basically, please be respectful of all of your fellow participants as well as the presenters. With that, I'll hand it over to today's speaker to kick off with your introduction, Chris.
B
Hi there, thank you. My name is Chris Tompkins. I'm a lead developer advocate at Tigera. Tigera makes a piece of software called Calico, which is an open source CNI for Kubernetes. I did a previous session on Cloud Native Live, and you can find lots of other sessions on YouTube about Calico generally, so in the interest of time, and seeing the things we want to see today, I won't go much further into that.
B
My job is to help the open source community hear about everything that we do with the product, and likewise to make sure that our teams and our developers are hearing from the open source community. I sit in that middle.
A
Perfect position, yeah, that's a great spot to be in. So let's get started: briefly, what's the Calico eBPF data plane?
B
So I did a previous session on this, so I'm going to be really quick on it today.
A
That should be showing now. Perfect, yeah.
B
There we go, perfect. People will be pleased to hear I'm not going through every part of this diagram, but what we're looking at here is the flow of traffic through a Linux node. This image is courtesy of Jan Engelhardt; you can see it on Wikipedia. What it actually shows us is what happens as a packet comes into a Kubernetes node, or any Linux host.
B
It comes in on the left-hand side, it works through this complex flowchart, and then packets leave on the right-hand side. The reason I've got this diagram up is because there's a fair amount of complexity going on here. People who are familiar with computer networking will know that you've got the link layer, the network layer, and the protocol and application layers, and most of the packet processing happens in the sections in the middle.
B
But eBPF is a technology that allows code to be attached to hooks in front of and after this flowchart. I don't know if you can see my mouse pointer or not; yes, I think you can.
B
Great. So with eBPF, you can attach our networking code at these hooks before, and these hooks at the end of, the packet flow, and actually do the networking outside of this flow. In doing so, you avoid needing to go through most of the complexity in the middle of the diagram. Just to restate it: what we're seeing here is a Linux node, we use eBPF, and we've reimplemented Calico's Kubernetes data plane in eBPF as one choice.
A
Perfect. So what advantages does it bring?
B
The cool thing is that because you're attaching at the start and the end, you're avoiding some of that complexity. I really wanted to avoid having a lot of slides, so I should say up front: I've only got, I think, three more slides after this, and then we'll dive straight into a live demo. But the key benefits apply across all environments.
B
You basically get improved performance, and that happens because when you attach code at those eBPF hooks, you're running the code inside the Linux kernel, with the performance that you would expect to gain there, and you can cut out bits of code that you didn't intend to run. So you get better performance, which means less CPU or more throughput; those are opposite sides of the same coin.
B
You get native Kubernetes service handling, which we'll talk about more in a moment, and then you get source IP preservation and direct server return. I'm going to dive into all of these, but very superficially, because I've done much longer talks on all of these things, and the previous Cloud Native Live session discussed them. If people want to see them in detail, they should check out those older sessions, but they need to be said today.
B
So, just really briefly: the data plane replaces kube-proxy's functionality. I should take a step back. In Kubernetes there's a service called kube-proxy; it runs on every node, and its job is to manage the services that allow traffic to flow in and out of the cluster.
B
So when we rewrote the data plane in eBPF, we had to replace kube-proxy's functionality, but instead of that being a negative, we actually re-implemented a ton of the code and improved upon it. As well as the performance improvements that I mentioned before, I can jump forward to here; you can see that there are three different ways to implement kube-proxy.
Technically, it's not kube-proxy, but it's that same functionality. What we're seeing in this graph is that, as you add more services, with IPVS and eBPF the connect time remains constant regardless of how many services you have, and the eBPF data plane is even faster. But the iptables data plane, the old way of doing things, got increasingly slow as you added more services to your cluster. So you've got performance, and you've got that TCP connect time advantage.
B
If
you
only
had
a
couple
of
services,
you
wouldn't
notice
it
and
you
wouldn't
care,
but
if
you
have
a
large
number
of
services
or
a
lot
of
session
churn,
then
you
really
start
to
care
about
this
and
then
the
last
big
advantage
really
is
that
this
one
source,
ip
preservation,
which
is
where,
as
an
external
client,
comes
into
your
cluster?
So
at
the
bottom
of
the
diagram,
we've
got
these
two
kubernetes
nodes.
Now
this
could
be
50
kubernetes
nodes.
B
It
depends
on
how
many
is
in
your
cluster,
but
as
your
external
client
comes
in,
their
traffic
hits
coupe
proxy.
So
what
we're
seeing
here
is
the
coup
proxy
way
of
doing
things
without
our
ebps
data
plane,
and
you
can
see
that
the
first
thing
that
happens
is
the
coup,
proxy
destination
maps
and
sourcenaps
the
traffic,
and
it
does
that
to
make
sure
that
the
traffic
gets
forwarded
across
to
the
service
pod
correctly,
but
also
the
source
that
is
required
to
make
sure
that
the
traffic
returns
back
through
coop
proxy.
B
So then the traffic gets forwarded on to the service pod and, as a result of the destination NAT and the source NAT, the pod never sees the IP address of the external client. The side effect is that if, say, you have an auditing requirement to capture your client's source IP address on the service pod, you wouldn't be able to do that with the kube-proxy implementation.
B
I'm speeding through this a bit but, like I said, if people want to hear this same content in a bit more peaceful way, I would suggest going back to that previous session. So when you enable Calico eBPF, instead of that source NAT happening, the BPF code running on the Kubernetes node forwards the traffic across without needing the source NAT.
B
That basically means that the service pod that's actually serving the client's workload does see the real IP address of the external client, which means that if you want to block a certain country, or a certain set of users (I'll avoid saying the obvious example at the moment), you could do so using this code. So that's it at a really high level; I'll move on to the next slide in a moment.
B
Actually, those are the advantages: performance, source IP preservation, and direct server return, which I kind of alluded to here. You can see that the return traffic doesn't have to go via the ingress node, and that has advantages in terms of latency and throughput. And then finally, you get less latency to services on connection setup.
A
Perfect, really great, extensive advantages there. And everyone, as I said in the chat as well: leave all of your questions throughout the presentation in the chat box of your streaming service, and we will get to them during the presentation as well as at the end. But yeah, Chris, what targets are then available?
B
Okay, so given what we've just said, we're going to build a cluster that looks like this in a moment. I'm actually going to build it live, and we'll probably see it explode. But the crux of this is that we're replacing the data plane.
B
Any time you have a distributed system, you want to be monitoring all the components of it, right? Both to capture logs and so on when things go wrong, but also proactively: distributed systems are complicated, so we tend to proactively monitor as many components as possible. When I was thinking up this session, what I wanted to capture was this idea: what components can we actually monitor in this data plane, what data can we capture, and how do we go about doing that?
B
In a moment, when we build this cluster, you'll see that it's got four nodes, and there will be some service pods running on each node. Because we're using the eBPF data plane on each node, the logic that actually helps traffic come in on a service and get redirected to a pod, either on the same node or on another node, is implemented in eBPF, using maps to store the data that needs to be stored.
B
Now, there are three components that can be monitored here, specifically for the data plane, and I'll show you how to set up all three. The first is Typha. I should be honest with you: I should have had this on the diagram, and it slipped my mind to add it. In any Kubernetes deployment you have the Kubernetes API, and that's something that lots of components within the cluster need to speak to.
B
Any time you have lots of things talking to one thing, you have to be conscious of whether that one thing is going to become overloaded. So in Calico there's this thing called the Typha daemon; a couple of instances run on different worker nodes, and it's basically a service that sits between the Kubernetes API and the resources that need to talk to it. That's the first thing that we can monitor with the eBPF data plane on Calico, so we'll be monitoring Typha. Then the second one is the Calico kube-controllers.
B
They are, again, pods that run in the cluster, and their job is to perform actions based on the cluster state. Maybe a new node gets created or, excuse me, a new pod gets created; it's the Calico kube-controllers' job to set up the networking to match the desired state, so we'll be able to monitor that as well. And then for the final one, you've got this agent called Felix, which is our agent that runs on every node. So Felix is the thing...
A
Great, there are a few questions from the audience already. Okay.
B
No, it doesn't make service mesh obsolete, but there's a larger conversation that we could have, and I have had several times. The most recent time we talked through this in detail, I thought it was on a session we do called Calico Live, but no, excuse me, it wasn't; I knew I was wrong, that's why I was hesitating. It was the DevSecOps London Gathering.
B
We had a chat about this on their podcast, which will be released soon. A service mesh adds a lot of extra functionality above and beyond the CNI; they can complement each other, but this doesn't replace that functionality. What this replaces is the functionality of the traditional iptables CNI that we run.
B
If you want to dig into this in more detail, look back on Calico's YouTube or on the Tigera blog; you'll find articles discussing the different data planes and their strengths and weaknesses, because we offer more than one choice, basically.
A
Perfect, and then a question from a gentleman: which is it, does eBPF replace kube-proxy, or is eBPF an implementation of kube-proxy?
B
Yeah, I was a bit vague about that, so I'm glad he picked me up on it. As you'll see when we do the actual demo in a moment, we literally turn off kube-proxy.
B
The same functionality is happening inside the data plane, and therefore it no longer needs to run as a separate part, so it replaces it. The eBPF implementation isn't kube-proxy, but it does the same job, with slight improvements like the source IP preservation. I guess I could address one more thing, which is that, depending on the nature of your Kubernetes cluster, how you turn off kube-proxy varies.
B
In the case of what we're going to do today... well, that's a good opportunity for me to talk about what we're going to build, actually. As you can see, we're going to use Google Cloud. We're going to build a four-node cluster, with the pod subnet 192.168.0.0/16 and a 10.x.x.x/24 range for the nodes, and in the case of GCP, because this is just vanilla Kubernetes, we're going to turn off kube-proxy by stopping the daemon set.
B
But if you run something else like K3s, where kube-proxy can't be turned off because it's not running as a separate component, then there are different ways to disable that functionality, and the Project Calico documentation for eBPF, which I think I may have a link to at the end, actually tells you how.
A
Perfect. So would this be a good moment to ask: what's the process for getting access to these metrics?
B
Yeah, it would be ideal, so let's dive in. I'm going to do the demo now, but before we do, this is the last slide, I think. There is an oddity, and I thought I would point it out at this point: here we have three nodes. I don't know why I've shown three nodes here rather than four as on my previous diagram.
B
I apologize, it's a bit unnecessarily confusing. Usually the job of a service is that traffic hits the service and then gets directed to whatever the thing you're monitoring is. But actually, when we use Prometheus, which is a time series database that we're going to set up in a minute, things work differently.
B
We don't use a service like that. We use the service to discover the location of the Felix agents, but what I've tried to show here with the dotted lines and the crosses is that we don't actually use the service for polling. We only use it to discover where the things we want to monitor are, and then the polling traffic actually goes directly to the Felix agent on each node.
A
Yes, there are so many; it's great to see a lot of people engaged. So there's a question: how do you install the agent in GKE, or any other managed Kubernetes?
B
Well, actually, we are going to dive straight in now, so watch this demo and see if it answers that question, because I think it probably will, and if not, we can answer it again at the end. So I should have a terminal here; let's see if this works. I was apologizing to Annie before we started that I'm having a minor problem with my IT setup today.
B
I am not on my usual setup, which means that I'm going to be looking down here a lot more than normal, so I apologize for not looking at the camera while I'm talking; hopefully you can put up with my side profile. So, to start with: I've built this four-node cluster, and I built it literally 45 minutes ago.
B
As you can see, it's got a control-plane (master) node and three workers, and this is the same one that I showed in the diagram.
B
The details are exactly the same, but you can see that the status is NotReady, and that's because I wanted to show the whole process, including enabling the eBPF data plane on this cluster. But again, I'm going to do the eBPF enablement part quite quickly, because we've already done it in a webinar, so although it's nice to see it again, I didn't want to spend too long on it.
B
If we look at the pods running on the cluster, we can see that the nodes are NotReady, and the two DNS pods are in a pending state, because there's no CNI; there's no networking on this cluster at all yet, no Calico, nothing.
B
Oh, thanks so much for pointing that out. Let me see if I can figure out how to fix it.
A
Great catch from the audience there; Sam was saying it as well.
B
I wouldn't have caught that without you telling me. Okay, I think I've removed that; let me try re-sharing.
B
Here we go: we've got a four-node cluster, a master node and three other nodes, and everything's NotReady, which is what we expect. The reason it's not ready is that there's no networking on this cluster yet; there's no Calico, no other CNI. We can see the pods, and the DNS pods are pending because they have no networking; everything else is running, but there's no mention of Calico here.
B
So all I did was create a Tigera operator resource, which runs as a pod, and the Tigera operator's job is to bring the cluster into conformance with the networking that we mandate.
B
I have this other piece of YAML here, and this is an Installation resource. It's a custom resource, and it tells the Tigera operator that we want to install Calico, that we want to use this IP addressing and block size, and VXLAN, which is the right encapsulation to use when you're using the eBPF data plane. The only other customization at this point that's important is this Typha metrics port.
B
I mentioned that one of the three things we were going to monitor was Typha's metrics, and by including this key-value pair here, we're telling Calico that we want Typha, the fan-out agent for the Kubernetes API, to respond on port 9093.
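For readers following along, an Installation manifest along these lines would match what's described. This is a sketch based on the Tigera operator API; the pod CIDR here is illustrative, and `typhaMetricsPort` is the operator field that exposes Typha's Prometheus metrics (9093 is the conventional port).

```yaml
# Installation custom resource consumed by the Tigera operator.
# Apply with: kubectl create -f installation.yaml
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Expose Typha's Prometheus metrics on this port
  typhaMetricsPort: 9093
  calicoNetwork:
    ipPools:
    - cidr: 192.168.0.0/16     # illustrative pod subnet
      encapsulation: VXLAN     # recommended encapsulation for the eBPF data plane
```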
B
So with all that said, if I now just feed that YAML in... excuse me, I just had a missed call from my daughter. If she calls back, I may have to take it, because she's only 12 and she's on her way home from school; I'm sure the audience will understand.
B
Yeah, thank you, exactly. I'll tell you what I'll do: I'll ask my younger daughter to call her, and they can talk to each other about what's happening.
B
Let's push on for a moment and then we'll come to those questions; hopefully everything's under control over there. Okay, so what I've done is I've fed that manifest, which we saw before, to the Tigera operator, and as a result the operator has gone away and created the calico-node pods (you'll notice that there's one per Kubernetes node), it's created Typha, the fan-out agent, and it's created the kube-controllers.
B
If you recall, these are the three things that we're going to set the monitoring up for: we're going to monitor the Calico kube-controllers, we're going to monitor Felix, which is part of this pod, and we're going to monitor Typha, the fan-out agent. While I was doing a little check there, those things were coming up anyway, so we didn't really lose any time. So we have enabled Calico on this cluster now, which is why the DNS servers are running, but we haven't yet enabled eBPF.
B
So I'm going to do that now. One nuance of this is that we need Felix, the agent, to be talking directly to the Kubernetes API, because it usually talks to the Kubernetes API through kube-proxy, and you can spot the problem: we're taking kube-proxy away. So what we do is have a quick look at an existing config map for kube-proxy, and we find out where the Kubernetes API server lives.
B
So here's its address, and here's its port. The next thing we do: I prepared some YAML beforehand that looks like this, and we're going to apply a config map in the tigera-operator namespace that's going to tell Calico that it needs to talk directly to the Kubernetes services endpoint, with the details here.
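The config map in question, per the Calico eBPF install guide, is named `kubernetes-services-endpoint`; the host and port values below are placeholders for whatever the kube-proxy config map reported for your cluster.

```yaml
# Tells Calico components to reach the API server directly,
# bypassing kube-proxy. Apply with: kubectl apply -f api-endpoint.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: kubernetes-services-endpoint
  namespace: tigera-operator
data:
  KUBERNETES_SERVICE_HOST: "10.0.0.10"   # placeholder: your API server address
  KUBERNETES_SERVICE_PORT: "6443"        # placeholder: your API server port
```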
B
Okay, so we can see that as soon as we've applied that YAML, the Tigera operator is restarting the other components, essentially to tell them to reconfigure. Now, I do this last command kind of superstitiously, and I'll probably get told off later for doing it by my colleagues, but a long time ago you used to need to restart the Tigera operator pod to make it re-read this config map that we created.
A
Yes, perfect. And to quickly answer Rahim about the terminal screen looking a bit blurred: I am not experiencing the same issue, at least, so you could maybe try refreshing your browser or closing tabs and so forth. But then, to get to the technical questions, we had one: is it possible to deploy both kube-proxy and eBPF in parallel in order to migrate? In other words, how would you replace kube-proxy with eBPF on a running, in-production cluster?
B
That's a really good question. The good news is that enabling the eBPF data plane is non-disruptive.
B
I always say the same thing when I discuss this, which is: in theory, you could enable the BPF data plane on a running cluster and, as long as you met the prerequisites, you wouldn't have any outage. Basically, as it changed over, any new flow would start to use the new data plane, and any old flow would use the old data plane.
B
In practice, I've been networking for 20 years and I'm too cautious; why make your life difficult? It's best just to use the data plane from the start. Having said that, depending on your appetite for risk, it is actually possible to non-disruptively switch over, as long as you follow this sequence, which I'm going to go through now.
B
It will switch over seamlessly, and I'll address why as I go through. At this current point, we've told Felix to talk directly to the Kubernetes API, and then we've restarted Felix, but that doesn't disrupt anything, because Felix programs the iptables data plane, the old data plane, so restarting it has no impact.
B
It has no impact on performance or production services. So now that we've stopped the Felix agent from talking to kube-proxy, the next step is to remove kube-proxy. That's what we're going to do now, and I mentioned earlier how this can be done in different ways depending on your deployment.
B
In this case, I'm going to patch the daemon set. The kube-proxy daemon set is just a construct that tells Kubernetes to run something on every node. So what I'm actually saying here is: take the daemon set for kube-proxy, patch it, and modify it to say I only want it to run on nodes which are not running Calico, which is the same thing as saying I don't want it to run at all, because they're all running Calico.
B
And then the last step to enabling the eBPF data plane is to run this command, and what that's actually doing is patching the Installation and merging in this new bit of config, specifying that we want to use the BPF data plane. And that's it; you can see that calico-node is restarting again.
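The merge patch in question, from the Calico eBPF how-to, switches the operator's Installation over to the BPF data plane:

```shell
# Select the eBPF data plane; the operator rolls the calico-node
# pods to apply the change
kubectl patch installation.operator.tigera.io default --type merge \
  --patch '{"spec":{"calicoNetwork":{"linuxDataplane":"BPF"}}}'
```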
B
You can see that one of them has restarted and the other three haven't yet, but they will in a moment. And that's it. So now, just to discuss where we are and make sure we're all on the same page: we're running a four-node cluster, it's running the Calico eBPF data plane with VXLAN encapsulation, kube-proxy is gone, Felix is talking directly to the API, and we have Typha, the fan-out agent.
B
We have two Typha instances running, and they are exposing a metrics port, because the original point of this session was to show the metrics. So we've turned on metrics for those two, and we're now going to go ahead and show you how to actually see those metrics. Did you want to address any other questions before I dive into that?
A
Yes, there was one other question that was put in. Right at the beginning of the demo there was a question from Jonathan: does a repo with these commands exist, or a manual for installing on our clusters?
B
Yes, there is. If you go to the Project Calico docs page, I remember it as docs.projectcalico.org; there's actually a new URL now, but that old docs.projectcalico.org address will still get you there.
B
You'll find that the documentation describes pretty much the exact steps that I'm taking now, and it tells you how to do that on different cluster types as well. So you just have to identify what your cluster type is, and it will tell you how. There's also a blog post.
B
If you search the Tigera blog, about six weeks ago I did a blog post which is similar content to what we're doing today. So if you prefer to consume it in blog form, you could take a look at that.
A
Perfect, thank you so much. And from my side, of course, thank you to all of the people who asked questions, from Antoine and everyone else; keep them coming.
B
Typha's running, so let's have a look. You can see that all the nodes have restarted, which is intentional, and we can see that we're running two Typha pods; these are their IP addresses. Now, Typha runs with host networking. That means it doesn't have a pod IP address of its own; it uses the host's IP address. This command is quite long.
B
Because it's running in cloud, what we're going to do is SSH to the controller node and run curl against the IP address of the first node that's running Typha, on the port that we specified. All we really want to do by doing this is show that there are some metrics there.
B
Okay, so a ton of metrics came back, and all we've really been doing at this point is proving that Typha is there and it's responding with metrics. So that's cool. The next thing we do is create a service to expose them. Now, recall what I said about how, unlike a normal Kubernetes service, this is actually going to be used to discover the agents, but it's not actually going to be used to poll them; that's done separately, and I'll show you how later on.
B
So I take some YAML and apply it directly, and all I'm doing is creating a Typha metrics service in the namespace, saying that it should address any pod that has the label kubernetes-app: calico-typha, and it's just addressing the metrics port.
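A service along these lines would match the description. The name and namespace follow the Calico monitoring tutorial (on an operator install, Typha runs in `calico-system`); it is headless because, as the speaker notes, Prometheus uses it only for discovery.

```yaml
# Headless service used by Prometheus to *discover* Typha instances;
# scraping then goes to each pod's host IP directly.
apiVersion: v1
kind: Service
metadata:
  name: typha-metrics-svc
  namespace: calico-system
spec:
  clusterIP: None              # headless: no load-balanced virtual IP
  selector:
    k8s-app: calico-typha
  ports:
  - port: 9093
    targetPort: 9093
```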
B
We can see that we're able to hit the service directly and get some metrics back. Just to reiterate that point: when we actually monitor this, this is not how we'll do it; we won't ever hit this service for metrics. It actually illustrates the problem nicely, because if you hit this service, you might end up on one Typha agent or you might end up on the other, but we want to monitor both, right?
B
We don't want to monitor the service, because we don't want to monitor one Typha agent or the other; we want to monitor both. But this is a way of testing that the service is there and that it's working, which it is. So, Typha component done; we'll do the same thing for the kube-controllers.
B
Oh, I don't have calicoctl installed; one sec. I had to rebuild my laptop a few days ago, so I just need to install it, which won't take a second.
B
amd64... why did it get that one?
B
I just found an error in our documentation, which I will fix as soon as we finish this call. Okay, I think I can sidestep installing calicoctl for now anyway.
B
Basically, all I was going to do was run calicoctl to find the port on which the Calico kube-controllers respond with Prometheus metrics. Luckily, I already know the answer: when I run that command, it would have output port 9094.
B
I was slightly worried; she doesn't usually call on the way home. But right.
B
Yeah, exactly, it's fine.
A
I think I might have, yeah. But that's great: you either discovered a thing to fix, or...
A
...it's the regular kind of demo. I've done the demo where I'm doing something and it just didn't work, and then later on I realized that I did something completely wrong.
B
Right, so just to remind ourselves where we were: for Felix, what we need to do is slightly different. We need to turn on Prometheus metrics, because they're not on by default, so we patch the configuration and turn on Prometheus metrics, and we should then find them on port 9091 by default.
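The patch being described can be expressed as a FelixConfiguration document (this follows the Calico monitoring tutorial; the same change can be made with `calicoctl patch felixconfiguration default`):

```yaml
# Felix's Prometheus endpoint is off by default; this cluster-wide
# setting turns it on. Metrics are then served on port 9091 by default.
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  prometheusMetricsEnabled: true
```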
B
So we can hit any node that we like. Again, we SSH into the controller and curl the first node, but any of the nodes will work on port 9091, and here we go, we've got some metrics. So now we're getting to where we can start to see interesting visual things, so I'm going to quickly create a service again.
B
Great, so now we've got all three types of metrics that we needed, on ports 9091, 9093 and 9094, and we can start to pull them into Prometheus. So we'll create a namespace called calico-monitoring.
B
There's nothing going into it yet. We need to create a service account and a cluster role for Prometheus, and then bind those two together. This is quite a lot of YAML, so let me just grab it, paste it in, and then we can discuss it.
B
Oh, I did a little bit more than intended there; never mind, that's okay. So here we can see: I copied in some YAML and created a cluster role. This is Kubernetes role-based access control: we created a user called calico-prometheus-user, and this user can look at the /metrics URL and do a get on it. We created a service account, and then we bound those two things together. So we've just created some role-based access control.
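The RBAC being described looks roughly like this, following the Calico monitoring tutorial (the names mirror what the speaker reads out):

```yaml
# Minimal RBAC for Prometheus: a service account, a cluster role that
# can read endpoints/services/pods and GET /metrics, and a binding.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: calico-prometheus-user
  namespace: calico-monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: calico-prometheus-user
rules:
- apiGroups: [""]
  resources: ["endpoints", "services", "pods"]
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: calico-prometheus-user
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: calico-prometheus-user
subjects:
- kind: ServiceAccount
  name: calico-prometheus-user
  namespace: calico-monitoring
```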
B
I won't go into the detail of this; if people want to see the exact detail, then the best place is probably a combination of our documentation and the blog post that I mentioned from about six weeks ago. So that's the Prometheus configuration file, and now we create a Prometheus pod, and this pod is going to read the configuration file.
B
And so we're creating a pod, again in the calico-monitoring namespace; we're labeling it appropriately, we're telling it which service account to run with, using the service account we created before, we're using the vanilla Prometheus image, and we're feeding it the config from the config volume, and it will respond on port 9090.
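The pod just described can be sketched as follows. This is an illustrative manifest, assuming the Prometheus configuration lives in a ConfigMap named prometheus-config in the same namespace (the ConfigMap name here is an assumption):

```yaml
# Vanilla Prometheus pod reading its config from a mounted ConfigMap.
apiVersion: v1
kind: Pod
metadata:
  name: prometheus-pod
  namespace: calico-monitoring
  labels:
    app: prometheus-pod
spec:
  serviceAccountName: calico-prometheus-user
  containers:
  - name: prometheus
    image: prom/prometheus
    args:
    - --config.file=/etc/prometheus/prometheus.yml
    ports:
    - containerPort: 9090   # Prometheus's default listen port
    volumeMounts:
    - name: config-volume
      mountPath: /etc/prometheus
  volumes:
  - name: config-volume
    configMap:
      name: prometheus-config
```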
B
So you can consider now that we had those three components that we want to monitor, and we've added an extra layer above that: a time series database. So now we're going to create a service to allow things to connect to that time series database.
B
So the time series database is Prometheus, so we're creating a new service, selecting the Prometheus pod and making it available on port 9090. So let's just test that the service is actually working.
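The service just created can be sketched like this. This is an illustrative fragment; the service name is an assumption, and the selector must match the label on the Prometheus pod:

```yaml
# In-cluster Service fronting the Prometheus pod.
apiVersion: v1
kind: Service
metadata:
  name: prometheus-dashboard-svc
  namespace: calico-monitoring
spec:
  selector:
    app: prometheus-pod
  ports:
  - port: 9090
    targetPort: 9090
```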
B
And now, at this point, I'd usually show you in a browser, but because we're sharing just a window and not the whole screen, I'm not going to easily be able to do that, so let's just curl it for now. I think that will have to do for the moment, and we'll see.
B
So what I've done here, just to reiterate: I've created this Prometheus time series database, I've created a service for it, and now I've created a port forward from my local laptop into the cluster running in GCP, forwarding port 9090 on my local laptop to port 9090 on that service in the cluster. So, finally, if I run this curl against my local laptop on port 9090, I should see... yeah, here we go, I see a response from the server, so this shows that Prometheus is running.
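The port forward and check just described can be sketched as two commands. This assumes the service is named prometheus-dashboard-svc, which is an assumption here:

```shell
# Forward local port 9090 to the Prometheus service in the cluster,
# then probe it from the laptop.
kubectl port-forward -n calico-monitoring \
  service/prometheus-dashboard-svc 9090:9090 &
curl http://localhost:9090
```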
B
Okay, fantastic. So let me see if I can do that now; apologies while I figure out how to do that.
B
We can work through it; it just means that we won't actually be able to see the graphs, which is a bit of a shame, but we're a little bit low on time anyway, so I think we'll just push on. So what we've done is we've created a Prometheus time series database that is now scraping the endpoints that we created, and it's listening on port 9090.
A
Yes, there are three questions that have come up so far. So: standard Calico uses BGP without any encapsulation, if I understand correctly; why does the eBPF data plane need VXLAN?
B
VXLAN, yeah, that's right! So that's a really good question, and the answer is: it doesn't. To take a step back and not talk about eBPF for a moment: with the traditional iptables data plane, Calico can use BGP and no encapsulation, or it can use VXLAN, or it can use IP-in-IP encapsulation. Which of those is the correct choice depends on your environment, but in most environments the correct choice is VXLAN; actually, any of those three are possible.
B
Once you switch to eBPF, you can still use BGP and no encapsulation, and that will work just fine, or you can use VXLAN encapsulation, but you shouldn't use IP-in-IP encapsulation, because its performance will not be as good as VXLAN. So just to reiterate: once you move to the Calico eBPF data plane, you want either no encapsulation with BGP, which will be great, or VXLAN encapsulation. Either of those options should work fine.
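The recommendation above corresponds to settings on Calico's IPPool resource. A minimal illustrative sketch, with the CIDR and pool name as placeholder values:

```yaml
# Illustrative IPPool for the eBPF data plane: VXLAN on, IP-in-IP off.
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 192.168.0.0/16
  vxlanMode: Always   # set to Never if using plain BGP with no encapsulation
  ipipMode: Never     # IP-in-IP is not recommended with the eBPF data plane
  natOutgoing: true
```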
A
Perfect, and then the other one was: does eBPF replace my networking definition, like the default CNI from my cloud provider?
B
It depends, unfortunately; that's the only real answer I can give in the time we have. If you go to the Calico website, there are several courses we have that go into a lot more detail about this, if you want to dig into it. But let's say you're deploying your cluster in AWS.
B
One option is that you use AWS's EKS service, and that's an entirely managed service. So it's Kubernetes, but you don't care about any of the workings of how it works; you just accept that Amazon makes it work for you. But that's not what I'm doing today. I am running in Google, as it happens, but instead of using their managed service I've built my own cluster, so eBPF is, in this case, replacing the data plane.
B
So I suggest, if that's still not clear: I made a video a few months back called "The Importance of Data Planes", I think, and that one goes into a lot more detail about what a data plane is, why you should care about it, and how you can switch between them.
B
Similarly, if you have a particular environment, like an AWS environment perhaps, we actually have free course content that will give you a ton of information about what the various options are and what the pros and cons are. One thing to keep in mind, though, is that all of these are advanced options. If you just deploy Calico on a vanilla Kubernetes cluster, it will work; what we're doing here is a more advanced configuration to bring out some of the benefits.
A
Perfect, and then the last question so far is: is it possible to get the IPAM block address in eBPF? I think there were issues in API version one, said the commenter.
B
I'm not 100% sure on that one, and maybe we can follow up on it; I'm on the Calico Users Slack and on LinkedIn. I don't want to give the wrong answer on that, so I prefer to give no answer.
A
B
Rather than the wrong answer: if the poster can reach out to me on Twitter, the Calico Users Slack, or LinkedIn, I will find out what the right answer is and come back to you on that one. I'm not totally sure about that, to be honest.
A
Perfect, so head over to Twitter for Chris to get a clarification there. And then, if you have anything to finish the demo, we can obviously go there.
B
Then we added a service in front of each of those, and then we added a time series database, which is Prometheus. So the last thing that we need to do is add a visualization tool, and for that we're going to use Grafana.
B
Obviously, there are more options than just Grafana, but it's by far the one most people are aware of. So what we're going to do is stick Grafana on the front of this. What Grafana does is, it's just a visualization tool that allows you to take the data in the Prometheus time series database and visualize it in ways that are useful. So we create a config map, and that's called grafana-config.
B
And it just tells Grafana where to find the URL for Prometheus; I'm not sure it tells it anything else important at this point. I think that's the key thing it shares. So that's the config.
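The config map just described can be sketched as a Grafana datasource provisioning fragment. This is illustrative: the file key, the service name, and the datasource name are assumptions, and the URL assumes the Prometheus service created earlier:

```yaml
# Grafana datasource provisioning pointing at the Prometheus service.
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-config
  namespace: calico-monitoring
data:
  prometheus.yaml: |-
    apiVersion: 1
    datasources:
    - name: prometheus
      type: prometheus
      access: proxy
      url: http://prometheus-dashboard-svc.calico-monitoring.svc:9090
      isDefault: true
```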
B
We can apply some default dashboards, so we're just applying a manifest from projectcalico.org (you may need to change the version number in the URL to match your release), and those are just some vanilla dashboards. Sadly, we're not going to be able to see these dashboards because of the screen sharing issue, but you'll have to take my word for it: they're very beautiful. The last thing to do is to actually create the Grafana deployment itself.
B
Here, and I highlighted the API server; we're not monitoring that, I don't know why I highlighted it. Then we have Prometheus, which is the time series database, and then we have Grafana, which is the actual visualization tool. So if all of that has worked correctly, which hopefully it has, you should be able to start a new port forward from my local laptop, on port 3000 this time.
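That last port forward can be sketched as follows. This assumes a Grafana service named grafana in the calico-monitoring namespace, which is an assumption here:

```shell
# Forward local port 3000 to Grafana, then open
# http://localhost:3000 in a browser to view the dashboards.
kubectl port-forward -n calico-monitoring service/grafana 3000:3000 &
curl http://localhost:3000
```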
B
If we hit that URL, the Calico Felix dashboard, then if we were doing this in a proper browser we'd be able to log in and see the graphs for the cluster. But sadly we can't at this point; luckily, we're out of time anyway, so I think I would have slightly run over if I had been able to do that. So yeah, that's all I can demo today, given the limitation we've got. Are there any other questions you wanted to cover?
A
B
So I'm going to take that in the context of eBPF: Calico and Cilium are both CNIs for Kubernetes that use eBPF to implement their data plane. I won't speak to the details of Cilium, because I am not on their team, so I don't know the details as well as they would, but essentially we're both using eBPF to create a data plane for Kubernetes.
B
In the case of Calico, we have other choices that allow us to suit other environments, not only eBPF. For example, you might choose to use the iptables data plane because you want battle-tested code that's been run in production for approximately five years. So both Calico and Cilium implement a Kubernetes data plane using eBPF; I think that's the best answer.
A
B
Oh, good question: the blog post that I mentioned. I don't think I have a way to... oh, actually, I can put it in the chat, can't I? Let me share the URL in the chat, one second.
B
In there; you know, something that may work well in the written form, if people appreciate that.
A
I really enjoyed the session.
B
A
Yeah, of course. And, as always, thank you to everyone for joining the latest episode of Cloud Native Live. It was great to have this amazing session on setting up monitoring for Calico's eBPF data plane, and really amazing interaction today. Thank you, everyone, for commenting, sticking with us, and bringing a lot of good comments and questions to the table; love the interaction, Jay. And, as always, we bring you the latest cloud native code every Wednesday, so tune in next week as well; we have a great session coming up then too.