Cloud Native Computing Foundation KubeCon + CloudNativeCon North America 2022, 11 Nov 2022

From YouTube: When the Logs Just Don’t Cut It: Root-Causing Incidents Without Re-Deploying Prod- Phillip Kuznetsov

Description

When the Logs Just Don’t Cut It: Root-Causing Incidents Without Re-Deploying Prod - Phillip Kuznetsov, New Relic

Speakers: Phillip Kuznetsov
We’ve all been there: your pod is crash-looping, you check the logs and you realize you forgot to log something important - now you’re unable to figure out what went wrong. You try to reproduce the problem locally with no luck: it only seems to happen in production. What do you do? Do you re-deploy to production with more print statements? You could burn hours doing that while you risk more problems. What if you could instead get that same data without the headache of restarting prod? In this talk, I’ll show you how to magically collect this data using bpftrace. Bpftrace lets you capture lots of useful data (function arguments, return values, latencies of individual functions - just to name a few) without re-deploying pods. Bpftrace is very powerful, but can be complex to work with, especially in multi-node environments like a Kubernetes cluster. I’ll show you how to cut past these problems by walking through a demo incident. I’ll show you some tips and tricks for working with bpftrace on Kubernetes, including how to leverage Pixie to easily deploy and collect data from bpftrace scripts.

A

Hello, everyone. My name is Phillip Kuznetsov, and I will be presenting "Root-Causing Incidents Without Re-Deploying Prod." A little bit about me: I am a software engineer at New Relic, where I spend most of my time building Pixie. Let's just jump right into it. So here's the situation. Imagine we're working at an e-commerce company called Online Boutique. We're selling a bunch of hip, trendy items, and things have been chugging along pretty well. Recently, we've had no problems. Code is shipping, everything's great.

A

Until today. The front-end service is panicking. Something is happening there, and we don't really know what's going on. What's worse is we haven't released code to the front-end service in weeks, so there's no way that any changes in the front-end service are responsible. This is a classic problem in microservices: you have dependency trees across multiple different services, so a change in one service can cause a break in another service.

A

So there is one thing we did change recently, which we highly suspect is the problem. We recently added this new product item called Sticker, a simple Online Boutique sticker, and we don't know why this is causing us any problems.

A

So we're looking at the logs, and we see this one log with the panic that says one of the specified money values is invalid. It's a bit weird that one of our money values would be invalid. Okay. We keep seeing this error, and it keeps getting returned by this Sum function. Before I dive a little bit deeper into what the Sum function is, you should understand why we need to have a Sum function in the first place. Why don't we just add these together? It comes down to the way money is represented inside of our service. Let me jump ahead.

A

It's split up into a units section and a nanos section. The units are the value before the decimal point. The nanos are the value after the decimal point, but represented as nano units, so a billion nano units are one unit.
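To make the units/nanos split concrete, here is a minimal Go sketch. The `Money` type, its field names, and the `FromCents` helper are assumptions made for illustration; the talk does not show the service's actual type definition.

```go
package main

import "fmt"

// nanosPerUnit is the number of nano units in one whole unit (one billion).
const nanosPerUnit = 1_000_000_000

// Money is a hypothetical stand-in for the service's money type:
// whole units plus a fractional part counted in billionths of a unit.
type Money struct {
	Units int64 // value before the decimal point
	Nanos int32 // value after the decimal point, in nano units
}

// FromCents builds a Money from a cent amount. It exists only for this
// illustration: one cent is 10,000,000 nano units.
func FromCents(cents int64) Money {
	return Money{
		Units: cents / 100,
		Nanos: int32(cents % 100 * (nanosPerUnit / 100)),
	}
}

func main() {
	sticker := FromCents(99) // the 99-cent sticker from the talk
	fmt.Printf("units=%d nanos=%d\n", sticker.Units, sticker.Nanos)
}
```

With this layout, 99 cents is stored as 0 units and 990,000,000 nanos, which is why two money values cannot simply be added field by field without handling the carry between the fields.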

A

Unfortunately, whenever we see this error from the Sum function, we don't know which value is actually invalid. What is actually causing this problem?

A

A valid money type, as I mentioned before, is one that has matching signs between the units and the nanos, so both positive or both negative, and where the nanos are, in absolute value, less than a billion.
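The two validity conditions can be sketched in Go as follows. This is an illustration under assumptions, not the service's code: the type and function names are hypothetical, and treating a zero field as compatible with either sign is an assumption inferred from the talk's later remark that 0 units with 99 cents of nanos is a correct value.

```go
package main

import "fmt"

const nanosPerUnit = 1_000_000_000

// Money is a hypothetical stand-in for the service's money type.
type Money struct {
	Units int64
	Nanos int32
}

// signsMatch reports whether units and nanos agree in sign.
// Assumption: a zero in either field is compatible with any sign.
func signsMatch(m Money) bool {
	return m.Units == 0 || m.Nanos == 0 || (m.Units < 0) == (m.Nanos < 0)
}

// nanosInRange reports whether |nanos| is strictly less than one billion.
func nanosInRange(m Money) bool {
	return m.Nanos > -nanosPerUnit && m.Nanos < nanosPerUnit
}

// isValid combines both conditions described in the talk.
func isValid(m Money) bool {
	return signsMatch(m) && nanosInRange(m)
}

func main() {
	fmt.Println(isValid(Money{Units: 0, Nanos: 990_000_000}))   // 99 cents
	fmt.Println(isValid(Money{Units: 1, Nanos: -10_000_000}))   // mismatched signs
	fmt.Println(isValid(Money{Units: 0, Nanos: 1_500_000_000})) // nanos out of range
}
```

A check like this only returns true or false, which matches the problem described in the talk: the caller learns that a value was rejected but not which units and nanos were passed in.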

A

For some reason, some value showing up in our system is not meeting one of these conditions, or both of these conditions.

A

Unfortunately, it's not that easy to tell what happened in our environment. Ideally, we'd just go and add a log in here and print the value out, but in many environments, especially production environments, that's not really an option. We might try to reproduce this locally, but sometimes a bug doesn't really appear locally. It's hard to reproduce, and again, we're searching for that root cause here. Sometimes deploying to prod just takes too long.

A

There are situations where you might just go and add this log in and hope that you discover what's going wrong. I guess the severity of this incident is rather low, so it would be okay to go and add this log in, but that also brings its own risks. Maybe, while we're adding this log, we sneak in some code that actually breaks things, such as some buggy code. And then, for many of our production environments, we have to comply with certain rules, and deploying to production willy-nilly, especially just to add logs, is really out of compliance and something we don't want to do.

A

If only there was another way. I wonder where that way is going to come from. Okay, so I'm going to show you bpftrace, a great tool that's going to help us add these logs in production. bpftrace is a way to get kernel-level visibility inside of your running applications. As it mentions here, it's also a high-level tracing language for Linux eBPF, and eBPF is really just a way to run sandboxed programs inside of your kernel, sandboxed meaning something that guarantees safety and security for your executing programs.

A

Specifically, eBPF was designed to allow kernel developers, and actually people who are not just kernel developers, to extend the capabilities of the kernel and get access to the high-quality data, as well as the privileged context, that the kernel provides. The basics of eBPF, and I guess bpftrace as a result, are these: we're going to add this probe called a uprobe, and that's going to intercept the running program, run the eBPF program, and then that eBPF program is going to ship out the data that we want to collect.

A

So, in our case, that data is the arguments to this IsValid function. You can think of this as analogous to a debugger, specifically a breakpoint in a debugger. You add the breakpoint to your code. Your code runs until it hits that breakpoint and pauses. You can go poke around the variables, get whatever data you want, and then, when you want, you resume the execution and it continues onward. bpftrace and eBPF are very similar in that regard. Instead of you poking around, though, we have a program that specifies the poking around, I guess.

A

So here's an example of a uprobe. Oh sorry, a bpftrace script, specifically on this IsValid function that we want to go and add a log to. We specify the type of probe, uprobe. We specify where we want to go collect this data, on the symbol, and then we have this body of the function, which goes and figures out how to grab the data that we want, and then we print it out.

A

Typically, when you deploy bpftrace, you SSH into a node and you run the bpftrace script with the bpftrace CLI. But in a Kubernetes environment, where you have many nodes, this can be really difficult, or at least just tedious. You have to go and find the specific node that is running the specific pod that you want to trace. You have to find the binary path for that specific pod's server that you're running, specify that, and then you can deploy your bpftrace probe.

A

At Pixie, what we've been working on recently is making this process a lot easier. Instead of you SSHing into your specific cluster and your specific node and finding the specific binary that you want to instrument, what we want to provide, or what we have provided, is a way to specify a set of pods that you want to attach to. You specify the labels that identify which pod you want to attach to, and you specify the namespace, and then Pixie will take care of the rest of adding that bpftrace program onto that specific pod.
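The manual workflow described above comes down to filling a binary path and a symbol name into a uprobe definition. Here is a minimal Go sketch that assembles such a bpftrace program as text. The path `/path/to/app` and the symbol `main.isValid` are placeholders rather than values from the talk's slides, and the probe body only prints a fixed message; it does not attempt to read the function's arguments the way the script in the talk does.

```go
package main

import (
	"fmt"
	"strings"
)

// uprobeProgram returns the text of a minimal bpftrace program that attaches
// a uprobe to one symbol in one binary, using bpftrace's
// "uprobe:<binary path>:<function name>" attach-point syntax.
func uprobeProgram(binaryPath, symbol string) (string, error) {
	if binaryPath == "" || symbol == "" {
		return "", fmt.Errorf("binary path and symbol are both required")
	}
	var b strings.Builder
	fmt.Fprintf(&b, "uprobe:%s:%s\n", binaryPath, symbol)
	b.WriteString("{\n")
	b.WriteString("  printf(\"probe hit\\n\");\n") // placeholder body
	b.WriteString("}\n")
	return b.String(), nil
}

func main() {
	// Both arguments are hypothetical; on a real node they would come from
	// locating the pod's process and its executable.
	prog, err := uprobeProgram("/path/to/app", "main.isValid")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(prog)
}
```

The binary path is the part that has to be looked up per node and per pod, which is the tedious step the speaker describes handing over to Pixie.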

A

So together, bpftrace and Pixie make it easy to add print statements in your production Kubernetes environment.

A

All right: let's try a demo.

A

Actually, before I do that: this is Online Boutique. You can see kind of what's going on here. It's just an e-commerce site. What I'm going to do is add this bpftrace program to go and check our IsValid call, and so I'll show you that. Actually, before I do that, here's the data we get out. If you remember that structure I mentioned earlier, which splits up the float value for money into two different integer values, we have our units value here.

A

We have our nanos values here, and you can see, right at the top here, we have 99-cent items, so that is the sticker. So that looks good. That means that when we put in the 99-cent item, things are working. So the problem is probably not at that entry point; something in our code is probably manipulating the value and causing problems. Okay, so I have to go trigger the error, and here's what happens when you try to add the sticker

A

to your cart: you add it and there's a crash. You just see an error: runtime panic. No fun, no stickers for us. But now, if we run this again, we can see that we have this really weird representation. Hold on, let me make this a little bigger. Let me run it again.

B

Looks better.

A

We have this really weird representation. We have units and nanos that are not aligned in their sign, and IsValid is returning that this is not valid, and then we see an error come out in our logs.

A

So this is really bad, but what's really good is that this actually gives us a big clue about what's happening. I mentioned that one billion nanos are equal to one unit. So if you convert this negative 10 million nanos into units, that's negative one cent. So you have one dollar minus one cent here: that's 99 cents. So I think I have a better idea of what's actually happening here.
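The arithmetic in this step can be checked with a few lines of Go. The helper name `totalNanos` is an assumption made for illustration; it flattens a units/nanos pair into a single count of nano units so that the two representations can be compared.

```go
package main

import "fmt"

const nanosPerUnit = 1_000_000_000

// totalNanos flattens a (units, nanos) pair into one signed count of nano units.
func totalNanos(units int64, nanos int32) int64 {
	return units*nanosPerUnit + int64(nanos)
}

func main() {
	observed := totalNanos(1, -10_000_000) // the odd pair seen in the trace output
	expected := totalNanos(0, 990_000_000) // the usual way to write 99 cents
	fmt.Println(observed, expected, observed == expected)

	// One cent is nanosPerUnit/100 nano units.
	fmt.Println("cents:", observed/(nanosPerUnit/100))
}
```

Both pairs flatten to 990,000,000 nano units, or 99 cents. The traced value therefore has the right magnitude and only its split between the two fields is wrong, which points at arithmetic on the value rather than at bad input data.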

A

This 99-cent value is entering one of our Sum calls and it's being converted in a really wrong way. I'm trying to think of where we actually call this, but maybe I won't show you that. I'm going to skip ahead a little bit here. I've seen that Sum is called a number of times. One time, one of the arguments is a zero value.

A

The next argument is that 99-cent value that we've added. When I was digging into this earlier, I noticed that this condition right here, which is basically another validity check after we've summed the values together, is actually incorrect. When the units value is zero but the nanos value is greater than zero, we don't pass into this if statement here, even though that value, which would be 0 units and 99 cents, would actually be correct.

A

So we miss this branch, and then we end up down here, where we increment the units, so zero becomes one, and then the nanos get shifted over to this negative 10 million.
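The failure described here can be reproduced with a short Go sketch. This is a reconstruction from the spoken description, not the code shown on the slide: the names, the exact shape of the condition in `sumBuggy`, and the corrected condition in `sumFixed` are all assumptions.

```go
package main

import "fmt"

const nanosPerUnit = 1_000_000_000

// Money is a hypothetical stand-in for the service's money type.
type Money struct {
	Units int64
	Nanos int32
}

// settle normalizes a raw (units, nanos) sum. When the caller says the signs
// agree, overflow in nanos is carried into units; otherwise one unit is
// borrowed so that both fields end up with the same sign.
func settle(units int64, nanos int32, sameSign bool) Money {
	if sameSign {
		units += int64(nanos / nanosPerUnit)
		nanos %= nanosPerUnit
	} else if units > 0 {
		units--
		nanos += nanosPerUnit
	} else {
		units++
		nanos -= nanosPerUnit
	}
	return Money{Units: units, Nanos: nanos}
}

// sumBuggy mirrors the behavior described in the talk: units == 0 with
// nanos > 0 is not recognized as a same-sign case.
func sumBuggy(l, r Money) Money {
	units, nanos := l.Units+r.Units, l.Nanos+r.Nanos
	sameSign := (units == 0 && nanos == 0) || (units > 0 && nanos >= 0) || (units < 0 && nanos <= 0)
	return settle(units, nanos, sameSign)
}

// sumFixed shows one possible correction: zero units count as matching either sign.
func sumFixed(l, r Money) Money {
	units, nanos := l.Units+r.Units, l.Nanos+r.Nanos
	sameSign := (units >= 0 && nanos >= 0) || (units <= 0 && nanos <= 0)
	return settle(units, nanos, sameSign)
}

func main() {
	zero := Money{}
	sticker := Money{Units: 0, Nanos: 990_000_000}
	fmt.Printf("buggy: %+v\n", sumBuggy(zero, sticker))
	fmt.Printf("fixed: %+v\n", sumFixed(zero, sticker))
}
```

Under these assumptions, adding zero to the 99-cent sticker with `sumBuggy` yields 1 unit and -10,000,000 nanos, the same mismatched pair that showed up in the trace output, while `sumFixed` leaves the value at 0 units and 990,000,000 nanos.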

A

So here's that script again. Takeaways: we were able to insert some logs into our running Kubernetes pod, and we were able to determine the root cause of this tricky incident just by seeing some of these logs come out and give us a little more visibility that otherwise would have taken a long time or a lot of effort to get. There's a lot to learn about these tools.

A

I've shown you one example of where bpftrace can be used. Specifically, we're using a uprobe, which stands for user-space probe, but there are other places where you can probe as well, such as the kernel, and a bunch of static tracepoints that kernel developers, as well as library developers, have added. And there are many great tools already listed out here that you can go try today.

A

On top of that, I talked a little bit about Pixie. I've shown you a little bit of Pixie and how you can combine the tools together to get a nice Kubernetes experience with bpftrace.

A

The eBPF application landscape is pretty interesting. There are a lot of great tools, from networking to observability to security, and you should check these out as well. They give you nice access to the kernel-level visibility that eBPF gives you, and they add a bunch of great features on top of that.

A

And finally, there's a bunch of great links on learning how to write bpftrace in a bunch of different contexts.

A

Today, I showed you some Go code. There are some challenges with writing bpftrace for Go, but they're something you can overcome, and it's very valuable once you have this tool in your toolset. I guess with that, I'll be at the Pixie booth in the Project Pavilion, and you can come and talk to me and my colleagues more about what Pixie does and what bpftrace does. I think that's all.

C

Questions? Just raise your hand.

D

I was just hoping you could go back to your, yeah, your stack trace code slide.

A

Sorry, the stack trace or the...

A

This one? I can also show it in the application, in Pixie, if you want. Sure.

E

Thank you. Okay.

F

So, do you have to run a Pixie agent on every node, and how privileged does it need to be within the cluster?

A

Yeah, I think you need, I forget the exact names for them, but yeah, you need pretty high privileges to run Pixie. And we have a DaemonSet, so you need to run a pod on every single node of your cluster.

F

Are there any security risks associated with that? Like, can you see that being abused as an attack vector?

A

It's always possible, I guess, but we try to provide some guarantees for access control and everything like that, and you can host Pixie entirely inside of your own cluster, so that everything is maintained within that cluster.

C

What should I keep in mind to leave my applications debuggable? I see that you have some symbols there, like with the variable names, and when I'm producing a production-quality binary, I typically strip the debug symbols, you know, the symbols, the optimizations. So what should I take care of? What would my future me wish that my current me would do, you know?

A

Oh, with the symbols and everything like that. Yeah, I think we recommend you keep symbols around. I think that's the biggest thing, and DWARF info is typically helpful as well.

A

I actually don't know the exact difference between the two, but I think having those two around is very helpful for integrating with a tool like bpftrace.

C

What about other technologies? If I want to do something similar for Java, is that as easy, or what else is required?

A

I know Java is a bit more challenging. There are some benefits in that the programs are JIT-compiled as well, but I actually don't know the details of what you need to do for Java. We have actually implemented a separate feature, not the bpftrace feature, but a profiling feature for Java, so there's an option to go look at our commits there, and we have a Slack channel.

A

You can go and talk to us there about how we did that in Pixie, and maybe that'll help you with figuring out bpftrace on Java or something like that.

C

Wonderful, thank you.

A

We have a lot of time for questions.

G

Oh, thanks. This might be a totally insane request, but could we see what running bpftrace directly on the node would look like, for contrast?

A

Yeah, I can try that out. Okay, well, the hard thing is I have to go and find that binary. So I mentioned it was hard. I didn't lie.

A

Let me see. I think there's an easy way for me to grab this data, so...

A

Process stats.

A

I think this is going to take too long. I'll just show you what the rough idea would be. So let me just copy this part out.

A

So we have this uprobe here, and Pixie doesn't require this, but let's say vanilla bpftrace does: you would typically add the path to the app binary, basically. Then you would have to SSH into your node, make sure bpftrace is installed, and then run the script. So I guess I'd be running that. I'll just save it: is valid.bt.

A

I can do it on this dev machine; it just won't create any data.

A

Oh man, okay, I thought this would be easy. Okay! Well, before we do that, I'll just show you how to run the bpftrace script first, so this is an approximation. Let's see, is valid.bt, so you'd run something like that. I actually have to adjust it so that it is...

A

Okay, so that's attached now, and if I can get the front-end service running, then it should work. In the meantime, I could take another question, I guess.

I

How long will the trace that's injected last?

A

We set a TTL in Pixie, but if you notice here, when I was doing it locally, it just runs until you Ctrl-C it, so until you, say, send it a signal. Yeah.

I

So if the pod crash-loops and restarts, will it get injected again?

A

Yeah, so I think that's the next thing we're working on for this particular feature: just making sure it comes back on. We just released this feature; well, I think we're actually releasing it on Tuesday. So follow us on Slack and we'll update you on that. But it's coming out soon.

A

Oh, sorry, to that point about releasing it: we have a kprobe feature already. We're just adding support for user-space probes, with this selector thing, on Tuesday.

J

About your DaemonSet: can you start the DaemonSet after you've found an issue and you want to debug, so that it's not running all the time on the cluster?

A

You can. You could deploy the DaemonSet after you've run into the issue. It doesn't require you to run it continuously. So if you want to add it afterwards, you can totally do that, and we've made our deploy process pretty fast, so you can deploy Pixie and get bpftrace data within, I think, three or so minutes, plus the time it takes to write the bpftrace script.

J

Yes, this can address a lot of security concerns, because normally it's not installed, and then you install it only for the time that you need it for debugging.

G

J

It would be made in production they're, easy, very easy.

A

Sorry say that one more time.

J

So if you only install it for a short debugging time, in production, to figure out a serious problem, that's something very good, because then the security guys are much happier, right?

A

Yeah, because you can delete it and make sure somebody doesn't see it as an opportunity. Yeah.

G

Do we need to make any changes to running pods for Pixie to be able to observe them?

A

As long as your pods have symbols, both Pixie and bpftrace should work. There might be a few more caveats on that as well; it's in our docs. But I know that symbols are important and, to some degree, DWARF info is helpful, though I don't think DWARF info is actually a hard requirement. Oh, and to be clear, if you're running Go or something like that, your symbols should already be included if you just use the default build.

F

Oh I think up here.

E

Yeah, really just kind of a follow-up on the symbol thing. I was just wondering if there is a way to use an external symbol store, so you don't have to embed the symbols in your production images, but you can have them available somehow for your learning purposes.

A

We have not designed anything like that before, but we've thought about it, and I think that's a really cool idea, so it might be something. Actually, we have Pete back here. He's somebody who works closely on this, and he can talk to you about this afterwards. Oh, I think there's a question back there.

A

Sorry, routing people.

B

Hey, so if I understand it correctly, if you have access to the Pixie agent, or at least that UI, you can run any code you want. I guess I was just curious how you control access to those agents and secure it. I guess, what's the security story for that?

A

There are a few different options. There is the data collection side, where we offer the ability to redact key PII-containing information, so you can run in restricted mode. With regards to bpftrace specifically, we have adding more access controls on the roadmap. So I think a temporary solution for, say, getting the observability power of Pixie while we are working on the access control side of things is this: you can pipe that data into an OpenTelemetry collector and create a view on that side.

A

Basically, you can pipe that data to an OpenTelemetry collector, send it to Prometheus, and then use Prometheus views, or Grafana, or some other tool like that. That way you can get that data while preserving the permission layer, I guess. You can also just remove certain users from your system.

A

You can hide certain users or prevent certain users from signing up.

K

Okay, I can respond about symbol stores. I'm also on the Pixie team, so, back to the question about that: we are aware that there are already existing symbol storage technologies. That is not on the roadmap yet, but it would be a really neat thing to enable, both for bpftrace and also for the profiler, and in general it would be super useful, I can see that. So should anybody join us on Slack and want to discuss symbol stores, we are absolutely interested, and looking to find the capacity to somehow improve along that dimension.

A

Okay, oh, there are some questions over there. By the way, we have T-shirts at the door, I think. Is that true, Michelle? Yeah. If people want T-shirts, feel free to grab them.

H

Can you explain a little bit how you target what you want to instrument? Because here on the probe, is it going to probe all the methods that have that name, or is it smart enough to use labels to get specific pods and, within the pods, maybe a container as well?

A

Yeah, so we have this full symbol name, first of all, so I guess it's a little bit more specific. But yeah, you can specify the exact pods that you want to attach to. There is this label selector down here: we have our label name and then the label value, so app is equal to frontend. So it's only attaching to the front-end service that was having the issue. We actually have this code in a lot of the other services as well, but we wanted to ignore those, because those services were not having that same problem.
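The label-based targeting described in this answer can be illustrated with a small Go sketch of equality-based label matching. It shows the general idea only and is not Pixie's implementation; the `Pod` type, the pod names, and every label other than `app=frontend` are made up for the example.

```go
package main

import "fmt"

// Pod is a hypothetical, minimal view of a pod: a name plus its labels.
type Pod struct {
	Name   string
	Labels map[string]string
}

// matches reports whether every key/value pair in selector is present in labels.
func matches(selector, labels map[string]string) bool {
	for key, want := range selector {
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// selectPods returns the names of the pods whose labels satisfy the selector.
func selectPods(selector map[string]string, pods []Pod) []string {
	var names []string
	for _, p := range pods {
		if matches(selector, p.Labels) {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	pods := []Pod{
		{Name: "frontend-1", Labels: map[string]string{"app": "frontend"}},
		{Name: "checkout-1", Labels: map[string]string{"app": "checkout"}},
		{Name: "frontend-2", Labels: map[string]string{"app": "frontend", "tier": "web"}},
	}
	// app=frontend is the selector mentioned in the talk.
	fmt.Println(selectPods(map[string]string{"app": "frontend"}, pods))
}
```

Selecting by label rather than by node or binary path is what lets the probe skip other services that contain the same code but are not showing the problem.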

C

Okay, the other hand, on the other side.

A

Yeah, we still have five.

B

Minutes of time.

B

Folks who are leaving, please just leave quietly. Okay.

C

Is there a performance penalty in having that as part of, you know, a Kubernetes DaemonSet or a process in Kubernetes, instead of having that as bytecode weaving or something like that as part of my build process?

A

You're saying you, like, insert an agent or something?

C

Into your code. So instead of having that applied to a running pod, I'd use the same script in my build pipeline, so that those probes are injected into the binary itself before I run it.

A

We've never done a toe-to-toe comparison between the two, as far as I'm aware. But I imagine you can continually optimize, say, the bytecode injection, in the same way you can continually optimize the eBPF injection. At the end of the day, you're kind of doing the same operation, so I imagine that the performance is comparable. I don't have a great answer to that, unfortunately.

C

Good. So, just to be clear, then: you are doing that process only once? You're not continuously reapplying the same weaving to the bytecode, right? When you click run, it does it only once?

A

Oh yeah, it deploys only once, and then it's kept around for the TTL that you specify down here. Yeah.

C

Yeah, and then it does another processing of the pod or the binary to remove the probes?

A

Yeah, we have a manager service, you can call it that, that will go and remove the probe that's running. Okay.

G

I think it's done.

A

All right, thank you all again. Thanks for the great questions.