From YouTube: Live Kubernetes Debugging with Elastic Stack
Description
Kubernetes Community Days Bengaluru'21
Your Kubernetes app is down. Your users start ranting on Twitter. Your boss is standing right behind you. What do you do? This talk walks you through a live debugging session without panicking:
- What do your health checks say?
- Where does your monitoring point you?
- Can you get more details from your application's traces?
- Is there anything helpful in the logs?
- What the heck is even deployed?
We are using the Elastic Stack in this demo with a special focus on its Kubernetes integration with metadata enrichment and autodiscovery in combination with APM / tracing, metrics, logs, and health checks.
Hi and welcome. I'm Philip, and without further ado, let's jump into today's topic. My session is about live Kubernetes debugging with the Elastic Stack. I work for Elastic, the company behind Elasticsearch, Kibana, and Beats; maybe you've heard of the ELK Stack or the Elastic Stack, and that's what we do. So we're deeply ingrained in the whole area of debugging, monitoring, and telemetry, sometimes called observability. Whatever you want to call it, we probably have some tools to help you, and I will show you what you can actually do with them.
I'm running a very simple application, the Spring Boot Pet Clinic, on Kubernetes, but I don't want to focus too much on the application itself. It's simple, and we have included some intentional errors to make things break a bit so we can actually debug them. What is a bit more interesting is the actual architecture.
Here you can see we have an ingress coming into NGINX, which is backed by Node and a React application on top of that. From there it fans out to my Spring Boot application, again behind NGINX, or to a Python application, and those are backed by MySQL and Elasticsearch respectively. We'll start monitoring and debugging those with the Elastic Stack. If you've never seen the Elastic Stack, this is what it looks like: you have Beats, which are lightweight agents or shippers.
They can collect data such as log files or metrics and then forward them either to Logstash, if you need a big, heavy tool that is very powerful (to keep it simple today, I will skip that), or directly to Elasticsearch, which is what we will do. Kibana on top of that can then visualize what is going on.
So here we have Kibana, and you can see we have collected around 27,000 or so log events. We can open one of these up to see what we have. You can see this was collected over TCP, and we have some information about the operating system. You can see that it's running on Kubernetes, and we add the Kubernetes metadata from the API server, so you can see what the pod name is, which node it is running on, what the namespace is, all of these pieces of information that you might want to use.
You could filter down on those, but since we are mostly interested in the infra namespace, I will just leave it at that for now. You can see what we have collected here: we got a redirect when we tried to request the /api endpoint, and this went through NGINX. But since these are too many events to go through manually, let's filter down a bit: we have the kubernetes fields, and then we can use the labels.
After a few moments you can see that we now have only 8,000 or so events, and those still include a lot of the services that we're running. They will include a lot of NGINX and MySQL logs, and we could, for example, just filter those out as well. I know that those are being collected by the so-called modules, recorded in the event.module field, so I can use a more structured filter here.
A
Here,
I
could
say,
is
not
one
off
and
then
I
could
just
exclude
the
ones
that
I'm
not
interested
in,
for
example,
I
could
say
I
don't
want
nginx
and
my
sequel,
and
now
I
filter
down
to
that
and
out
of
my
8
000
or
so
events.
This
will
filter
it
further
down
now
we're
down
to
7000
events,
and
when
I
open
one
of
these
up
here,
you
can
see
we
have
a
lot
of
metadata
that
you
can
collect
if
it's
useful,
but
you
don't
have
to
what
I'm
interested
in
right.
What I'm interested in right now is this one here: you can see there is some debug message that probably comes from my Java application. And yes, this was written by Jackson, the library, so it has been generated by my Java application. Before we dive deeper into that, the question is: how do we collect all of that information?
If you want the blueprints for how to get that, in the Beats repository under deploy/kubernetes we have the configurations for running Filebeat, for example, to collect those log files for you. Obviously, you run that as a DaemonSet. Let me quickly show you my configuration. Here I have my Kubernetes infrastructure; I'm heading straight to production, and I won't show you how we deploy the application itself.
Most of what is here is just the right configuration for Filebeat: where it should collect the information, where to get passwords from, where Elasticsearch is running, what configurations to mount, and so on. All of that is here. What is more interesting is that I have a ConfigMap, and that one actually contains the configuration that I've applied to Filebeat. Here you can see we're running this on Kubernetes with autodiscover, so it automatically knows what is running if you run it on Kubernetes and set it to autodiscover with the kubernetes provider.
We can just filter down on that, which might make our life a bit easier, and then we also define a so-called multiline pattern. If you've ever touched Java, you know that it writes these pesky multi-line statements, and if you break them up, a single line from a stack trace doesn't really mean anything. So what you have to do is keep stack traces together, and that's exactly what I'm defining here.
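To make the multiline problem concrete, here is a minimal sketch (the class and message are made up, not from the demo) of the kind of output a single Java logger call produces; the multiline pattern in the Filebeat configuration is what keeps those physical lines together as one event.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Minimal sketch (hypothetical class, not from the demo) of why Java logs need multiline handling.
public class OwnerLookup {
    private static final Logger log = LoggerFactory.getLogger(OwnerLookup.class);

    public static void main(String[] args) {
        try {
            throw new IllegalArgumentException("unknown owner id: 42");
        } catch (IllegalArgumentException e) {
            // This single call emits several physical lines:
            //   ERROR OwnerLookup - lookup failed
            //   java.lang.IllegalArgumentException: unknown owner id: 42
            //       at OwnerLookup.main(OwnerLookup.java:...)
            // Filebeat's multiline settings are what stitch them back into one event.
            log.error("lookup failed", e);
        }
    }
}
```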
Oh, that's not where I want to go; I mean the resources. In Logback I defined my log level, and this is where I said this is the level at which I want to log. And here you have the pattern: the log level, the actual logger, and then the message, which could be that stack trace, and that would then look something like this. Let's switch to another view that we have here; this is called the Logs UI.
This is a bit more like the tail -f that you maybe miss or that you want to use: it just streams the log events that you want to see, but you can still filter down here. We just mentioned that our application is using tags, and the Java log messages were tagged with petclinic-server, so I will filter down on that one, and then we will only get the relevant Java messages.
Now you can see these are the Java log messages that we have collected, and I could also open one of them up. If I do, you can see again that we have all the metadata around it. What is also interesting is that we have a log level. So, for example, I could just say: please only give me the errors, because debug is nice, but I'm not going to look through all the debug messages right now; I'm only interested in the errors.
The question now is: how did we get the debug level broken out here when the actual message was just one big string? You have probably guessed it: regular expressions. I know there is the saying that the plural of regex is regrets, and that might be true, but here we are using them to parse this apart. So how did we actually do that?
If I head back to my configuration in the ConfigMap, there was one piece of information that I added here: I have defined a pipeline in which I describe how to parse this log message. I hope you remember the Logback pattern of how we were writing the logs out, because now we need it. That petclinic-server pattern is in my ingest pipelines, under petclinic-server, and here I have written the right regular expression, or grok pattern, which is a named regular expression, to actually parse that message.
So here you can see: the log level part extracted the log level, then we broke out the handler, and then we extracted the reason, whatever came after that. This is how we break the message up and how we got the nicely broken-out log level and then the reason, which is just the rest without the log level, for example.
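Grok is essentially a named regular expression. The exact pipeline from the demo isn't shown here, so as a rough sketch, the same kind of extraction expressed as a plain Java regex with named groups might look like this (pattern and sample message are illustrative):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {
    // Roughly what a grok pattern like
    //   %{LOGLEVEL:log.level} %{DATA:handler} - %{GREEDYDATA:reason}
    // does, expressed as a named-group Java regex (illustrative, not the exact demo pipeline).
    private static final Pattern LINE = Pattern.compile(
            "^(?<level>TRACE|DEBUG|INFO|WARN|ERROR) (?<handler>\\S+) - (?<reason>.*)$",
            Pattern.DOTALL); // DOTALL so a multiline stack trace stays in "reason"

    public static void main(String[] args) {
        String message = "DEBUG org.springframework.samples.petclinic.owner.OwnerController - Fetching owner 42";
        Matcher m = LINE.matcher(message);
        if (m.matches()) {
            System.out.println("log.level = " + m.group("level"));   // DEBUG
            System.out.println("handler   = " + m.group("handler")); // the logger name
            System.out.println("reason    = " + m.group("reason"));  // the rest of the message
        }
    }
}
```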
That is how we collected this. By the way, if you want to make your life easier, we would recommend that you use a structured log format such as JSON, and we have actually prepared one for various programming languages by now. If you want to log in Java, there is one called ECS Logging Java, and ECS stands for Elastic Common Schema. It's basically a naming convention that we use across everything we collect, so we know what the semantic meaning of the different fields is.
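As a rough illustration of what that gives you (the encoder setup lives in the Logback configuration, which isn't shown in the talk, so treat the class name and field values as illustrative): with ECS Logging Java plugged in as the Logback encoder, an ordinary logger call comes out as one JSON document per line, already using Elastic Common Schema field names.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class VisitService {
    private static final Logger log = LoggerFactory.getLogger(VisitService.class);

    void scheduleVisit(int ownerId) {
        // Plain slf4j call; with the ECS Logging Java encoder configured in logback.xml,
        // this is written as a single JSON line roughly like (field names per ECS, values illustrative):
        // {"@timestamp":"2021-06-19T10:15:30.123Z","log.level":"INFO",
        //  "message":"scheduling visit for owner 42",
        //  "log.logger":"VisitService","ecs.version":"1.2.0"}
        // No grok needed: the log line can be ingested as-is.
        log.info("scheduling visit for owner {}", ownerId);
    }
}
```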
That saves you the pesky parsing. I always tell people: yes, some people like writing regular expressions, but that's a bit like Stockholm syndrome, where you get so used to doing something that at some point you accept it as the only right way to do it. Or maybe it's a bit of job security, because while you can maybe write your regular expressions, nobody else can read them. So it's a bit like Perl.
People might complain that things are not as rosy as they could be, and the next thing you might want to do is APM, application performance monitoring, or tracing. To give you an idea of what that looks like, I pick my Spring service, and you can see it here. This is an agent that you add to your application, and it collects timing information from the app. How do you include that? This is not a DaemonSet.
The agent is, depending on your programming language, either part of your build process, or you can just attach it at runtime, which is the case in Java, where you have the concept of a Java agent. For example, in my case, in my Java application I have a Dockerfile, and in that Dockerfile we add the Java agent right in that place.
We just baked it into the image, and then you can add some configuration parameters where you say: okay, the APM server, which is where you actually send that information, is located here, and this is where you forward that information to.
Okay. Once you have collected that information... oh, and by the way, you don't have to bake it into the image; you can also attach it at runtime. If you want to see how to do that in Kubernetes, we have a very nice blog post where we use an init container to attach the agent when the container comes up, so you don't even have to bake it into your images; you can just attach it at runtime if you want to. I'll skip the details because the blog post describes that very well.
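Besides the -javaagent flag baked into the Dockerfile and the init-container approach from the blog post, the Java agent also offers programmatic self-attach. A minimal sketch of that option (not what the demo uses) could look like this:

```java
import co.elastic.apm.attach.ElasticApmAttacher;

public class PetClinicApplication {
    public static void main(String[] args) {
        // Self-attach the Elastic APM Java agent at startup instead of passing -javaagent.
        // Requires the co.elastic.apm:apm-agent-attach dependency on the classpath;
        // settings such as the server URL and service name are typically supplied via
        // environment variables (e.g. ELASTIC_APM_SERVER_URL) or elasticapm.properties.
        ElasticApmAttacher.attach();

        // ... then start the application as usual, e.g. SpringApplication.run(...)
    }
}
```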
I will focus on what you can get out of the tracing. With the APM agent, for example, you can see where you spend your time, and most of the time it's roughly a 50/50 split between my database and my application. We only have these two weird outliers here, and when I hover over them, you can see them down here as well.
We have very similar outliers down here, so let's zoom into that area. Now I could, for example, say I want to see both, or just focus on one; I'll pick this one here. You can see most of the things were very fast, but right here we're suddenly spending a lot of time in the application, like 99 percent in the application, and our response times are suffering tremendously while the request rate is actually pretty stable. So it's not that we have more load in the system.
Maybe we just have something bad in our application, and now we can look at what is happening here. For example, let's start with getOwners, just to show you what that can look like. Here in getOwners you can see how long your requests took. Let's pick one of the slower ones; but this one only took 60 milliseconds.
Let me head back. Here you can see the impact, and the impact is basically the product of how many times that transaction is run per minute and its average duration; a transaction that runs a hundred times a minute at two seconds each weighs more than one that runs twice a minute at ten seconds. For example, you can see that update owner got very slow on average, so let's figure out what is going wrong there.
Here you can see that most of the requests were again very fast, but there was one outlier that was really slow, and when I head to that one, okay, we can see a 400; that's not good. Then we can see where we spent our time in the timing diagram: in validate zip code. We spent almost 40 seconds in the validate zip code method. That's weird; we should probably check that out.
So I'm heading to the class with validate zip code: in line 33 we're returning something, and in line 12 we're looking at some code. Let's quickly search for that. This is the class I'm interested in; this is where we're doing something, and this is the return. But what is the actually interesting part?
It is this regular expression here, and it is a very bad regular expression. You can actually see, if I close this one and click on the entire transaction, what input parameters we were using. For example, this was the request body that we sent, and if you look at the zip code, it looks like a very weird zip code; it's a very long one, and that's exactly the problem here.
Short zip codes, with four or five digits, for example, will be very fast, but the longer the zip code gets, the slower the regular expression becomes. Now maybe you will say: Philip, this is very contrived; who writes bad regular expressions and brings their application down? Unfortunately, that happens more often than you might think.
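The exact expression from the demo isn't shown in the talk, but a classic sketch of this failure mode is a nested quantifier that makes the regex engine backtrack exponentially on long input that almost matches but ultimately fails, so validation time explodes with input length:

```java
import java.util.regex.Pattern;

public class ZipCodeCheck {
    // Hypothetical "bad" pattern, a classic catastrophic-backtracking shape: the nested
    // quantifier in (\d+)+ makes the engine try exponentially many ways to split the digits
    // whenever a long input almost matches but ultimately fails.
    private static final Pattern BAD = Pattern.compile("^(\\d+)+$");

    // A harmless alternative with the same intent: 5 digits, optionally a dash and 4 more.
    private static final Pattern SAFE = Pattern.compile("^\\d{5}(-\\d{4})?$");

    public static void main(String[] args) {
        // ~28 digits plus a trailing non-digit: already takes seconds with BAD,
        // and the runtime roughly doubles with every extra digit.
        String zip = "1234567890123456789012345678!";
        long start = System.nanoTime();
        boolean bad = BAD.matcher(zip).matches();
        System.out.printf("bad pattern:  %b in %d ms%n", bad, (System.nanoTime() - start) / 1_000_000);
        System.out.println("safe pattern: " + SAFE.matcher(zip).matches()); // fails fast
    }
}
```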
For example, there was this Cloudflare outage a while ago, and when you scroll down somewhere in their write-up, they say something like: oh, the CPU spiked because of a bad regular expression. That's just one of the many cases where you will run into bad regular expressions, so this is more real-world than you might think. Thanks to APM you can actually figure out what is going wrong in your systems, because you get timing information that would be very hard to get otherwise.
Also, to compare where logs fit in when you have tracing: traces are kind of logs as well, but the point of traces is that they have much more context. A log statement is just output from one specific line that you manually instrumented, saying: please write out this piece of log information. A trace has the entire call stack and the timing information, so it has much more context around it. Logs are much more common, but traces are a very good addition to get more information and a broader insight. Sometimes you also want metrics, for example, because metrics are very nice for getting an overview of how your system is doing.
Let's look at a dashboard. Here we have a combination of Metricbeat and Packetbeat data: Metricbeat collects metrics either from the operating system or from an application like MySQL, and Packetbeat is a lot like Wireshark.
Okay, now we can suddenly see a spike here, and when I hover over it, you can see this is the instance that I'm interested in: the pet clinic MySQL instance, cc59-something. You could just exclude this one, or you could, for example, filter down; we have built this filter for the infra pet clinic MySQL,
the cc-whatever instance it was. Once you filter down on that one, you will see, okay, this is the only one that is left, so this is our spike here. What could be the reason for this weird spike? We could look at the general metrics for our system. This is the general resource usage from the operating system, and you can see these are all the hosts that we have in our system.
You could also filter that down to specific Docker containers, of which we have a lot, or to the pods that Kubernetes is providing, and you can already see, when I hover over this one here, okay, this one has a higher CPU usage than the others. This is pretty much the same thing we have seen in the other views. Let's look a bit at the actual metrics for this instance, which you can do here.
All of this information is provided by Metricbeat, set up pretty similarly to what we have seen with Filebeat, so I'll skip that configuration file. You can see here this is that specific instance ID, and you can see how much CPU usage you have; here it started growing and it's higher, but otherwise memory usage is low and the network traffic is also not very exciting.
So from the host perspective I'm not really sure what is going on. But when I go back to that dashboard, you can actually see that we have other good pieces of information. For example, here we have which processes are running, and you can see, when I hover over that, a db backup is running. Down here you see the processes ranked by how many resources they're using.
One of the things that I think is important is to go from having one little island of tools for every single thing that you want to collect to having this bigger map, and that's kind of the idea of the Elastic Stack. We try to provide this map: you have logs, which you're probably already using; if you have metrics, you can include and correlate those; and then you can add traces on top of that.
If you want to try all of this at home, we have demo.elastic.co, which has quite a few features. You just head over to it, you get to a dashboard, you can pick what kind of use case you're interested in, and then you can just dive in and play around in Kibana and see if it makes sense for you. I hope it does.