From YouTube: Real-time troubleshooting of K8s applications
Description
Don't miss out! Join us at our upcoming event: KubeCon + CloudNativeCon Europe in Amsterdam, The Netherlands from 18 - 21 April, 2023. Learn more at https://kubecon.io The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.
A
You are not able to talk as an attendee, but there is a chat box on the right-hand side of your screen. Those of you saying hello, thank you, and please continue to do so; also leave your questions in the same spot and we'll get to as many as we can at the end. This is an official webinar of the CNCF and, as such, is subject to the CNCF code of conduct. Please do not post anything, or ask questions, that would be in violation of that code of conduct, and please be respectful of fellow participants and presenters. Please also note that the recording and slides will be posted later today to the CNCF online programs page at community.cncf.io, under Online Programs. They're also available via the registration link that you used today, and the recording will be on our Online Programs YouTube playlist. With that, it's over to our presenter and Nick to kick off today's presentation. Y'all, take it away.
C
Hey, thanks Libby, thanks for the introduction, and thanks everybody for joining today's live webinar and demo. Let Nick give a shout-out as well.

All right, so I'm going to jump straight ahead to the topic at hand. Hopefully you guys can see my screen and everything looks good.
C
So today's topic is going to be an interesting one: how can we do real-time troubleshooting of Kubernetes applications, especially when those applications start showing problems and performance issues? I'm going to dispense with the legal notice, etc., and I thought we should give you a little bit of background on who we are, in case you don't know who OpsCruise is.
C
We are a relatively young company based out of the Bay Area, and our focus is almost exclusively on how to provide observability for cloud-native applications. Because it's cloud native, we are an active participant and member of the CNCF community, and, as you will see, we are pretty much built entirely on CNCF and open source instrumentation. I don't want to read through this, but you can see we've been working with a number of customers and a number of partners.
C
We are venture-backed, and the other thing I'll point out, because we are focused so much on open source and the CNCF, Prometheus being one of its first projects, is that Julius Volz, who you will know if you are following Prometheus and CNCF instrumentation, is on our advisory board. We're glad to have him. So we'll go straight into a little bit of background on Megazone, and I'm going to hand this off to Nick. Again, as I said, you can obviously find out more by looking at our website, opscruise.com. So, Nick.
B
Thanks. So Megazone Cloud was founded in 1998. We mainly focus on helping customers utilize the cloud better. We are headquartered in Seoul, Korea, and we have offices here in Palo Alto, as well as Toronto, Canada, Tokyo, Hong Kong, Vietnam and Shanghai, and we recently opened one up in Australia as well; that's the latest office. Our main focus is our customers' use of cloud and helping them use it properly.
B
We work with various partners, such as OpsCruise, through ISV partnerships in Korea, and one other thing that I do here in Palo Alto is look for leading-edge technology companies to bring their technologies to Asia, to help the customers in Korea as well as reduce the gap between the U.S. and other parts of the world.
B
We also provide a service called HyperBilling to help customers with the billing on their cloud usage, so multi-cloud billing services, as well as SpaceONE, also known as Cloudforet, which is a Linux Foundation project that helps manage multiple clouds in a single portal. Next slide.
B
So the reason why we started working with OpsCruise was that internally we were facing challenges. Like I mentioned, we have Cloudforet and SpaceONE, which is provided as a SaaS product.
B
We wanted to make sure that we stay within the SLOs and provide the right level of quality of service to the customers, and we saw the same pattern with our own customers as well. One of the largest mobile telecom companies, a mobile service provider in Korea, was actually working with us and asking for help on their Kubernetes environment. We were trying to solve their problem: they were facing having to use multiple Kubernetes tools, it's siloed, it's hard to maintain and operate, and they needed to translate the metrics and perform complex correlation.
B
The problem with this correlation is that unless you know exactly what you want to do and you know every aspect of the metrics, it's very difficult to do, and getting that information and providing it to DevOps, to enhance the DevOps practice and skills, is not an easy thing.
B
It takes way too long for newcomers to get trained and start using the environment and the tools properly, and keeping up with new releases of the various open source projects is not an easy task. And it's not just us; it's our customers, and also globally: it's hard to keep the talent when you hire new operations people or DevOps engineers.
B
It takes at least three to six months for them to understand what it is and what they need to do to keep the environment up, make it better and upgrade it. So the solution we were looking for was something that we could automate but easily adopt, and that we could train others on easily.
B
We also wanted a single pane of glass with all the integrations and telemetry coming into the environment, and it doesn't hurt to get machine learning and AI-assisted troubleshooting, because we all know we don't have enough people to do all the troubleshooting by ourselves; if we can get help, that's always better. And we wanted easily understood SLOs and quality of service for our DevOps and service owners. So that's why we partnered up with OpsCruise, to try to help our customers with these challenges.
C
It's all yours now. Thanks, Nick. I think you set the stage, and hopefully many of you recognize or empathize with the issues that Megazone and their clients were facing in Kubernetes. I would say, having been working in this area for six or seven years, it's not an easy problem. So, to get right to the heart of it: Kubernetes application performance troubleshooting is not just about Kubernetes.
C
It's about everything that sits above it and below it. Think about what has happened with cloud-native services: just the number of objects, containers, services. We've seen three thousand, five thousand containers in a single cluster, and a large number of nodes. And it's not just that: you have service-to-service calls, you have SaaS entities that are not managed by Kubernetes, there could be external calls to APIs, and then, of course, the reason you go cloud native and agile is so you can constantly make changes.
C
You can make changes to the services, you can scale out, scale in, change the code version, and so on. So now, on top of this complexity in scale, you have dynamic changes that affect everything, because every time you add a new service or take one out, your dependencies, who's talking to whom, have changed. And if you don't do the deployment right, how do you know that caused the problem? Or at runtime something happens that you didn't know about in the infrastructure, or something else changed, or the configurations you set changed.
C
All of that is not just a Kubernetes problem. It goes all the way up: obviously there is the dependency on infrastructure below, but it also goes all the way up to the microservices and therefore the applications. So the question is: if the application does fail, how do you know whether it's at the code level, something in a third-party service, Kubernetes, or the infrastructure?
C
And so that makes it complex. The good news, and this is part of the reason we love working inside the CNCF ecosystem, is that all of the telemetry you need is there: metrics from Prometheus; flows, whether I'm looking at layer-4 bytes and packets or even layer 7, whether it's from Istio or eBPF, the extended Berkeley Packet Filter, which tells us request rates, response times and error rates; and events from kube-state-metrics.
C
So you can capture the changes you made; the logs that are coming in, whether at the application level or for specific container-level issues; and, of course, traces via OpenTelemetry. So if you look at it, between Prometheus, Istio, Kubernetes, Fluentd and, of course, open source like eBPF, Loki, Jaeger and Zipkin, you have all of this telemetry, as well as information on the configuration out there. So what's the challenge? Just like the number of objects: for every object there are metrics, flows, events, logs and traces.
C
So we have a fundamental cardinality problem when we are trying to debug and figure out, in real time, how we help Ops know what's going on, to troubleshoot a problem when something is slowing down in Kubernetes. That's the focus right now. As I said, and I'm just re-emphasizing, all of the things that we are looking at, all the metrics and telemetry, are already available, so I'm going to look at that quickly. If you look at all the standard metrics, you'll recognize most of these logos.
C
But of course you also want to capture cloud-level metrics, because that gives you the infrastructure information on the VMs you are using, or the persistent volumes and the storage. What we want to do is leverage all of these, and this is how you should think about it: all of that is available, so leverage it to figure out how the pieces are tied together, because that's going to tell you, contextually, what those things mean across the different telemetry.
C
What you want to do is stream processing. I'm not going to go into a whole lot of detail here; you can always look this up, and maybe some of you are doing something similar. Collect all the telemetry and bring it together contextually, so you know how the pieces align and link to each other, and you're not looking at metrics, logs, traces, flows, or any events happening across the infrastructure, Kubernetes and the app in isolation, because then you are doing all the work.
C
If you can pull that together and get the topology, meaning who's talking to whom at the application level, down through the dependencies to the infrastructure, that gives you better context around what's going on. And then, of course, look at the flows, understand behavior, and, as this sequence says, be able to isolate the cause. So you have all this information, and the context across this telemetry and the configuration changes is very important. I'll emphasize this again: the whole idea is to get enough information across all of this.
C
So we know what the state of the application was at the time the problem was detected. This is where analytics comes in, and this is where you need automation, because there is no way one or a few SREs are going to be able to do this manually without some automation; that's the whole point. The point to make here is that all of the data is available; as close to real time as you can, can you pull out the insights you need, so that by the time a problem does happen,
C
you are able to figure out what the problem is. So let's talk a little bit about what that means for automating cause isolation. If you're a computer science geek and you've looked at this problem, there's something called a non-polynomial-complete, or NP-hard, problem. What does that mean? It means that the number of possible combinations across the data that you have, when trying to get a sense of what is happening in time and space, makes this very, very hard. But you know what's interesting?
C
I've talked to customers and people who are dealing with this, and they say one of their best sources for isolating a problem is their senior SRE who's seen it all. So, because it's not a simple problem and it's highly non-linear, think about how we solve this problem as humans. We leverage information about the IT stack. We know that when a container is working and it is sitting on a node, it is using the node's resources, and we know that those resources are coming from the cloud.
C
We know that if there's a shared service, that shared service can become a bottleneck when multiple things are talking to it, especially things like databases, etc. These are aspects of knowledge that really good SREs use. They also do what I call, as you can see in the second bullet, following the breadcrumbs: they will look at the dependencies, who's talking to whom, because they know that if the alert is happening somewhere, there may be a slowdown somewhere else.
C
They start by looking at the alerts, then at the metrics, logs and traces, and, let's say, if you have a service that has a certain kind of behavior, say it's I/O intensive, they'll start looking there and asking, hey, could this be the problem? So expert SREs who do cause isolation use all of this information. So why not follow that?
C
Instead of trying to look at everything and throwing all the information at you without being able to narrow it down, the whole idea is that an automated system has to follow these breadcrumbs: understand, use knowledge, use the information appropriately, and narrow down to a very small set of objects, which gets you closer to the likely cause. I would venture to say that perfect cause isolation in real time is not theoretically possible.
C
However, if you have enough information and you have used the information correctly, you can get to it very, very quickly, and that kind of decision system is what most SREs use. The block diagram on the right is basically saying: when you see the alert, look at where the alert is, the source, and then start looking around it. What is the performance? Who was it interacting with? Was there an issue, depending on the type of alert, in Kubernetes? Was it in the infrastructure?
C
Were there problems with saturation? Based on that, you start looking at and eliminating the possible causes that you don't have to look at. So that's what the dynamic decision system is, and that's what we're going to talk about: how to leverage all the information and extract insights that will help isolate the problem.
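[Illustration: a minimal sketch of one pruning step of such a decision system. The dependency map, metric values, thresholds and service names below are all hypothetical; the point is only the shape of the logic: start at the alerting service, keep neighbours that look unhealthy (slow, Kubernetes events, saturated node), and drop the rest.]

```python
# Hypothetical, minimal sketch of one "follow the breadcrumbs" pruning step.
# Inputs are plain dicts standing in for real telemetry; nothing here is a
# vendor's actual data model.

# Who calls whom (edges point from caller to callee).
dependencies = {
    "web": ["cart-cache", "catalog"],
    "cart-cache": ["cart-server"],
    "cart-server": ["postgres"],
    "catalog": [],
    "postgres": [],
}

# Per-service signals gathered from metrics, kube events and infra checks.
signals = {
    "web":         {"p95_ms": 6500, "k8s_events": [],                   "node_saturated": False},
    "cart-cache":  {"p95_ms": 6100, "k8s_events": [],                   "node_saturated": False},
    "cart-server": {"p95_ms": None, "k8s_events": ["ImagePullBackOff"], "node_saturated": False},
    "catalog":     {"p95_ms": 40,   "k8s_events": [],                   "node_saturated": False},
    "postgres":    {"p95_ms": 12,   "k8s_events": [],                   "node_saturated": False},
}

SLO_MS = 4000

def suspicious(name):
    """A service stays on the suspect list if it is slow, has Kubernetes
    events, or sits on a saturated node; otherwise it is eliminated."""
    s = signals[name]
    slow = s["p95_ms"] is None or s["p95_ms"] > SLO_MS
    return bool(slow or s["k8s_events"] or s["node_saturated"])

def walk(alerting_service):
    """Breadth-first walk from the alerting service, pruning healthy branches."""
    suspects, queue, seen = [], [alerting_service], set()
    while queue:
        svc = queue.pop(0)
        if svc in seen:
            continue
        seen.add(svc)
        if suspicious(svc):
            suspects.append(svc)
            queue.extend(dependencies.get(svc, []))  # only follow unhealthy branches
    return suspects

if __name__ == "__main__":
    print(walk("web"))  # ['web', 'cart-cache', 'cart-server']
```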
So, before I go into demo mode and show you how this works, as an example I want to give you an idea of what you're going to see in the demo and what instrumentation is being used.
Our deployment architecture is shown here. Effectively, all the blue that you're seeing, as you'll recognize, is open source instrumentation sitting inside the Kubernetes cluster: cAdvisor, node exporter, your familiar Prometheus components, deployed as DaemonSets in the cluster, and Promtail is being used to collect the logs for Loki.
C
So that's also a DaemonSet. I would add that in the node exporter, in order to understand flows, and not just bytes and packets coming into and out of the node and into and out of the containers, we are leveraging eBPF. So our node exporter also uses an eBPF collector, so we can look at what L7 metrics, sorry, request rates, error rates and response times, are happening at the level of a container within a node. We can get that information.
C
We also have Jaeger in most of our environments, so we can collect traces. So the four primary things that you're seeing here are Jaeger for traces, Prometheus for the metrics and flows, Loki for the logs, and then, because we want to look at changes that are coming in, we're also capturing kube-state-metrics; that's one way to see what changes have happened within the cluster. Then what we do, essentially, once this is instrumented, and you can do this yourself with a Helm chart, is that we basically run four, plus one additional,
C
gateway pods, as we call them, one for each type of telemetry. So we're collecting the metrics through the orange one you see, the Prometheus gateway, which collects all the metrics from Prometheus. The log gateway bar is basically collecting all the logs through the Loki collector. The Kubernetes gateway is collecting kube-state-metrics and changes, so we know what the configurations are and what changes are being made. And then, finally, the fifth one you're seeing is the cloud gateway, because we want to know what the infrastructure is, so we understand the dependencies.
C
All of these essentially collect data and, just as you may be familiar with, scrape on a classic interval that you set up, whether it's 30 seconds or one minute, depending on your scenario and bandwidth. The data is collected, compressed, and sent to a SaaS service, which is shown below in the orange hexagon, which is where our controller is.
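[Illustration: a rough sketch of the general scrape-compress-forward pattern described here. The Prometheus URL, the query and the ingest endpoint are placeholders, not the product's actual API.]

```python
# Rough sketch of a scrape-compress-forward loop, roughly the pattern the
# telemetry gateways follow. The endpoints and query below are placeholders.
import gzip
import json
import time

import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"   # in-cluster Prometheus (assumed)
BACKEND_URL = "https://ingest.example.com/v1/metrics"      # hypothetical SaaS ingest endpoint
SCRAPE_INTERVAL_S = 30                                      # 30s or 60s, per the talk

def scrape():
    """Pull one batch of samples via the standard Prometheus HTTP API."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": "container_cpu_usage_seconds_total"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

def ship(samples):
    """Compress the batch and forward it to the backend."""
    payload = gzip.compress(json.dumps(samples).encode("utf-8"))
    requests.post(
        BACKEND_URL,
        data=payload,
        headers={"Content-Encoding": "gzip", "Content-Type": "application/json"},
        timeout=10,
    )

if __name__ == "__main__":
    while True:
        try:
            ship(scrape())
        except requests.RequestException as err:
            print(f"scrape/forward failed: {err}")
        time.sleep(SCRAPE_INTERVAL_S)
```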
C
The controller is basically taking all that information through the staged pipeline I showed previously, processing it and extracting from it, bringing it into context, discovering the topology, figuring out what the dependencies are, both at the service level and down from Kubernetes to the infrastructure, and then looking at the data and figuring out, using machine learning, what the expected behavior is. So that's roughly the scenario.
C
So, today, the specific use case that I want to talk about in the next 20 or so minutes is an application slowdown that we have detected and how we can analyze it, especially in the Kubernetes kind of scenario, a Kubernetes problem that caused an application slowdown, and how we do that. Now, I want to make a note here that in the case we are looking at, we are not using tracing.
C
Clearly there are ways of using tracing for this; that's a whole other problem, and it's a separate capability, something we have called trace paths. In fact, there is a CNCF live webinar on that; if you're interested you can follow up with us. But in today's case we're going to do root cause analysis for a Kubernetes issue affecting an application, with no tracing enabled. All right, so at this point I'm going to switch my screen to the demo tab; let me see if I can do that.
C
All right, I think that's the one. Libby or Nick, if my screen is showing up, just confirm.
C
So, in fact, if I hover down, you can see this basically shows requests coming in. We have an example load balancer talking to nginx, the load balancer going into this container, this service going into another service, etc., all the way down to, hey, there is a Postgres database, and it's actually running on AWS. So this application map that I'm showing you here is being built from the data that we are getting, and obviously you can organize it.
C
You can organize it by labels on the application, or by the namespaces that are running on it. In fact, there are different applications running here that I'll show in a minute. It's running on a five-node Kubernetes cluster, and I'll show that in a few minutes, along with the different paths. This is a really small application; this is the test bed that we use, with pods, containers and SaaS services. It is in fact running on AWS, as you can see from the load balancers, and there are actually multiple clusters, but they're connected together.
C
It's a multi-cluster environment, but we won't focus on that today. And if you want to know what the different namespaces are, I think I can, no, not this screen, let me share this tab. If you can see this, I can actually search by namespace, and what you're seeing here is the application called shopping cart, a small e-commerce application. You can also see a robot-shop application, which is an IBM application, and the OpsCruise,
C
sorry, there is some slowness here. Yes, so I'll start again. What I've done is I've tried to show you the different application namespaces, what's been deployed in this cluster, and what I was showing was that today we'll focus on this little e-commerce app called shopping cart. And there is more there: there is our OpsCruise deployment in its own namespace, robot-shop, which is another e-commerce application, and online boutique, which is used for tracing, all of these.
C
You might want to check that. So, to give you an idea of why this matters: when I look at any of these containers, for example nginx, with traffic coming in from the service, in fact, if you look at this as I've highlighted, you can see, because we collect not only the Prometheus metrics but also the flow metrics, the average response time between this container and its corresponding service. This is actually very useful if you think about it.
C
A lot of enterprises use this kind of environment and ecosystem for monitoring, and with OpsCruise they can see that dependency right away. Why do I say that? Because when we're doing root cause analysis, if I'm not involved with this cart service and there's a problem here, then I don't have to look here, or in some other application service like this one; I don't need to look at it. But if the data is flowing through where there's a problem, I know what to look at.
C
I have narrowed down the focus, and I can see visually what's going on. That's one key part: knowing the topology and the dependencies is the first thing that most of us use. This is what an expert SRE will say: I don't need to look here if the problem is over there. Well, how do I know that in real time, as Kubernetes is changing? I should be able to see this dependency and how the data is flowing.
C
So let's go one level deeper here. I'm going to zoom in on this shopping cart app; hopefully you can see while I'm doing that. If I look at that, and I'm going to shift my screen here so we can see it: the key here is being able to see, for this container, all of the telemetry in context, which is what I was saying earlier. I'm going to move this up a bit. So, metrics:
C
What are the metrics coming in? Obviously from Prometheus, etc.; I can get that. What are the events? If there are no events, I know this one doesn't have any, but often there are lots. What are the logs in there? I can look for specific logs; for example, is there a problem on this container, on this nginx, has something happened? Anything that I'm recording I can look for, errors, etc. I'm not going to go back, but I have all of that in context, in what we call a quick view. And what is it talking to, in case you're
C
All right, let me get onto a better connection; sorry, Libby. What I was getting at is putting things in context. We talked about metrics and logs, but also the connections, because this is what we want to know when we're trying to see who's talking to whom and where the problem is. So in this case, just to quickly summarize: I know what is inbound, I can disambiguate that, and I even know how much data is coming in.
C
I know what else is coming in, and actually, if you look at this, if I click on this, let me see if I can find the right one, it might be hard to do because I'm trying to show it on this pod itself, there are multiple connections, because there are different ports involved. And outbound, as you can see, it is talking to this other nginx controller, and in fact you can see that on the screen.
C
So what we were trying to point out is that if there are trace spans involved and I want to go look at the trace, I don't have one explicitly here, but we can get trace paths as well as in-service performance. More interestingly, what about the configuration? We can pull in the actual Kubernetes manifest, so we know what has been designated: how much the resources are, what volumes it is talking to, whether things are healthy, what the rate is at which we are scraping, the timeout settings, all of that information, including the namespace.
C
Everything is right at your fingertips, so you don't have to go switch to kubectl commands. This is important, and any changes that happen, we will update and present here, so having everything together in context is very, very useful.
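[Illustration: the same information, the pod spec, resource requests/limits and probe settings, can also be read with the official Kubernetes Python client instead of kubectl; the namespace and label selector below are examples only.]

```python
# Minimal sketch: read a pod's spec (what you'd otherwise get from
# "kubectl get pod -o yaml") with the official Kubernetes Python client.
# The namespace and label selector are examples only.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("shopping-cart", label_selector="app=nginx")
for pod in pods.items:
    print(f"pod: {pod.metadata.name}  node: {pod.spec.node_name}")
    for c in pod.spec.containers:
        req = c.resources.requests or {}
        lim = c.resources.limits or {}
        print(f"  container: {c.name}")
        print(f"    image:    {c.image}")
        print(f"    requests: {req}  limits: {lim}")
        if c.readiness_probe and c.readiness_probe.timeout_seconds is not None:
            print(f"    readiness timeout: {c.readiness_probe.timeout_seconds}s")
```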
I'll do one more switch before I get into the root cause problem. For this application, as I showed, there are multiple containers in this environment. What does the node map say? The node map shows where those application pods are placed.
C
There are five nodes here, and I can see which of these nodes have which containers and, more interestingly, how much is being used. So another view that we're able to pull together from that is the usage against the requests: which containers have the highest requests, what the request and limit are, etc., for both CPU and memory, on a node-by-node basis, so we can see whether you're over-provisioning or under-provisioning. So, for example, here the request on this one, this Prometheus node
C
exporter, is set at 200, and the usage is already exceeding that, so the reason it's red is because it's burstable and it might be prone to eviction. So that gives you another view, to help right-size the environment.
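[Illustration: a sketch of computing the request-versus-allocatable view per node with the Kubernetes Python client. Quantity parsing is simplified to cores and millicores, just enough for the example.]

```python
# Sketch: sum container CPU requests per node and compare with what the node
# can allocate, similar to the node-map view described above. Quantity parsing
# here only handles plain cores and millicores.
from collections import defaultdict

from kubernetes import client, config

def cpu_millicores(quantity):
    """'200m' -> 200, '2' -> 2000. (Real parsers handle more suffixes.)"""
    if quantity is None:
        return 0
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)

config.load_kube_config()
v1 = client.CoreV1Api()

allocatable = {
    node.metadata.name: cpu_millicores(node.status.allocatable["cpu"])
    for node in v1.list_node().items
}

requested = defaultdict(int)
for pod in v1.list_pod_for_all_namespaces().items:
    node = pod.spec.node_name
    if node is None:
        continue
    for c in pod.spec.containers:
        req = (c.resources.requests or {}).get("cpu")
        requested[node] += cpu_millicores(req)

for node, alloc in allocatable.items():
    used = requested[node]
    flag = "  <-- check: requests near/above allocatable" if used > 0.9 * alloc else ""
    print(f"{node}: {used}m requested of {alloc}m allocatable{flag}")
```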
So, going back to this application: what happens when we have a problem? In order to look at that, I'm going to jump into, not this, but the alert view, so I'm going to switch this screen again.
C
Let me know if it's showing up. I actually picked out an alert, 105.915, so Nick, I'm going to rely on you: can you see that my screen has changed to the alert window? Yes, it has. All right, so the primary example that we'll use today is an RCA analysis that we are doing on a service level objective breach. We do this automatically because we're collecting flow metrics on that ingress on the shopping cart I showed you, and it was run a little while ago.
C
So I can just go through this if I click on it. This is where things start getting interesting in terms of how this is automated. We capture alerts automatically based on explicit alerts from Kubernetes and the infrastructure, predictive alerts using ML, which I'll talk about in a few minutes, but also when there are delays on the service level indicators, and I'll go back in a minute and show that to you. Actually, I should do that now.
C
Let me go back to this example and show you that, for this service, the one feeding into that SLO on the ingress, there is something called SLO/SLI. You can see that something has been set: the suggested value is this one, and that's done automatically, by analysis by the system. We can also look at what the current max is.
C
Someone has manually, the user has, set it at four seconds, and you can obviously change that; if you're looking at an outbound connection that is customer-facing, you can set it accordingly. So that's where we can set SLOs, right in this app map. And so, what we're looking at now, sorry, I'm switching back to the tab here. Can you see my tab again?
C
Yeah, so I'm going to share this tab. What I was going to show you, and I think I missed it on the other tab, is that on the data coming in on the ingress side we can detect an SLO/SLI breach. And while we can do this by using machine learning to determine what the expected value should be, here someone has set it at four seconds, and this is what we'll use as an example of where the breach is. So the service level objective for this application,
C
on the ingress side, has been set at four seconds. All right, now I'm going to switch tabs, so bear with me, and then I'm back to that alert that I showed you here. I apologize for jumping back and forth, but I'm trying to give you the context of what the SLO was and what the system does in this automated way. So for that SLO breach on that shopping cart app, there is a breach detected because, on the flow,
C
if you look, it says it's automatically detected because, as you can see, against the four-second SLO we are at 6.657 seconds, and there's a lot more detail here. Obviously this one is a very simple rule; you can see how you can set it up: the max across what's coming in, on the graph, based on the response time on the flow, and if it's more than 4,000 milliseconds, that's what triggers it.
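[Illustration: the rule described here, breach when the max response time over the window exceeds 4,000 ms, expressed against the Prometheus HTTP API. The metric name is a placeholder, since the talk doesn't give the actual flow metric names.]

```python
# Hedged sketch of the latency SLO rule described above: breach if the max
# response time over the last 5 minutes exceeds 4000 ms. The metric name
# "flow_response_time_milliseconds" is a placeholder, not a real exporter metric.
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"
SLO_MS = 4000
QUERY = (
    'max_over_time('
    'flow_response_time_milliseconds{namespace="shopping-cart", direction="ingress"}[5m])'
)

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    _, value = series["value"]          # instant vector sample: (timestamp, value-as-string)
    observed_ms = float(value)
    if observed_ms > SLO_MS:
        print(f"SLO breach: {observed_ms:.0f} ms > {SLO_MS} ms for {series['metric']}")
```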
C
It's a latency-based alert, and all of the details and the aspects of it that we already have in context are given here. But the most important thing from the user's perspective is: we saw an SLO breach; the question is, how do I know why this breach happened? So what we've done in this system is that the decision tree running in the background, the AI engine, does this analysis, and this is where it starts getting interesting. So Nick, I'm assuming my screen has changed.
C
Can you see the screen? I'm on the analyze tab. Now, this is interesting. What you're seeing here is done automatically, in the background, whenever an SLO alert kicks off; remember, the decision plan is kicked off when we detect a problem. It is actually saying, okay, for that service there are five connection paths, and the highest-latency path has been pulled out. The whole idea is that, knowing the context, we've narrowed in on the path, and everything that's red here is basically saying all of the hops have high latency along that path.
C
So I'm going to close that, and you can actually see laid out here, in this high-latency path: nginx coming into its pod and container, to the web server service, its pod and container, to a cart cache, a caching element, its pod and container, to the cart server, its pod and container, down to the database server. And then, oh, I think I am actually on, am I on the right one? I might be on the wrong one, actually. Let me go back and see if I got the right alert here, the one I want to look at.
C
Okay, I think I have a different alert here. The one I want to show you is slightly different, so let me go back and retrace. That one is also an interesting one, but it is not a Kubernetes one. The one I want to use, and I apologize, I'm doing this in real time, I think it's this one, which has a Kubernetes-related issue. This one also has a breach, this one even higher, the same case at a different time: 15.454 seconds, exceeding the four-second service level,
C
and so forth, all the way to the back end, the cart server. What's really interesting to note is that this is not the only path. Normally a user would go and ask, hey, what are the possible paths that this nginx container, sorry, this service, depends on, and there are three paths. You could have more; you could have 10 or 15, depending on how complex the application is. But the reality is that what slowed down is the path that has the highest latency, and, not surprisingly, that's the one that's got the other alerts.
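[Illustration: a toy sketch of the path-ranking idea, enumerate the dependency paths below the alerting service and keep the one with the largest end-to-end latency. All names and numbers are made up.]

```python
# Illustrative sketch of ranking dependency paths by end-to-end latency and
# keeping only the worst one, as the analyzer view does. All data is made up.

# Edge latencies (caller, callee) -> observed response time in ms.
edge_latency_ms = {
    ("nginx", "web"): 900,
    ("web", "cart-cache"): 2400,
    ("web", "catalog"): 30,
    ("web", "payments"): 55,
    ("cart-cache", "cart-server"): 2900,
    ("cart-server", "postgres"): 150,
}

# Adjacency derived from the edges above.
children = {}
for caller, callee in edge_latency_ms:
    children.setdefault(caller, []).append(callee)

def all_paths(node, path=None):
    """Enumerate every root-to-leaf dependency path starting at `node`."""
    path = (path or []) + [node]
    if node not in children:
        return [path]
    paths = []
    for nxt in children[node]:
        paths.extend(all_paths(nxt, path))
    return paths

def path_latency(path):
    return sum(edge_latency_ms[(a, b)] for a, b in zip(path, path[1:]))

paths = all_paths("nginx")
worst = max(paths, key=path_latency)
print(" -> ".join(worst), f"({path_latency(worst)} ms)")
# nginx -> web -> cart-cache -> cart-server -> postgres (6350 ms)
```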
C
The system can then isolate that and say: let me just show you the one that's really relevant, and that's key. So that's the first thing, the decision system and automatic causal analysis. Let's walk through that.
So obviously this is a situation where it was slow; we've already detected that. If I go to the web server, it says, hey, this is slow because its response time is higher, so the service is slow. That's not surprising. If you analyze it, it'll say, yep, it's slow because the response time coming back from downstream has slowed things down.
C
So that's not surprising; this slowdown was expected. What about the next one? That is also high. This one also has an SLO breach, this cart cache. Let's look at that again. It also has its own analysis tab, and it is also higher than expected, and this one, I believe, is an ML-based alert, actually, because the response
C
time has been higher than expected, and it captured that. It says, okay, there are further dependencies further down, so something is slowing everything down; going back, going back again to that source. Let's go down to the next one. What about this? This one was low. What about this cart cache? This says the network metrics are not normal, and if I click on it, you'll see there is something called an RCA tab, where it's analyzed beyond just being high, meaning triggered based on the expected service level for response time. If I click on it,
C
this is where it starts getting interesting. What you're seeing here is what we call our fishbone analysis. That means it's categorizing that container's behavior based on memory, CPU, file system, any I/O dependencies, the demand side, meaning what is coming in and the responses to it, the supply side, meaning, in our case, what is going on downstream, and even configuration changes. And if you look, one of the things that sticks out is, hey, the number of errors has actually increased in this case. That's one.
C
Second, the packets here decreased: the data that was going outbound has decreased, sorry, a 100% decrease, so there's nothing coming in and nothing going out either. So this is the suspicion for why the network-level metrics look wrong, and this has been detected just by having an expected-behavior model that was learned over the past. The reason this matters is that threshold-based alerts are usually high-water marks, and the challenge with that is that when things drop below normal, you typically won't detect it.
C
But if you know the expected behavior, and data comes in but data doesn't go out, or isn't received on the downstream side, this is where an expected, predicted behavior model helps.
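[Illustration: a toy comparison of a high-water-mark threshold versus a band learned from recent history; the learned band catches the drop to zero in outbound traffic while the threshold never fires. Data is synthetic.]

```python
# Toy illustration: a high-water-mark threshold misses a drop to zero, while a
# band learned from recent history (mean +/- k*stddev) flags it. Data is synthetic.
import statistics

history = [480, 510, 495, 505, 490, 502, 498, 507, 493, 501]   # normal outbound packets/s
current = 0                                                     # traffic suddenly stops

# 1) Classic high-water-mark alert: only fires when the value is too HIGH.
HIGH_WATER_MARK = 1000
print("threshold alert fired:", current > HIGH_WATER_MARK)      # False -> missed

# 2) Expected-behavior band learned from history.
mean = statistics.mean(history)
std = statistics.stdev(history)
k = 3
lower, upper = mean - k * std, mean + k * std
anomalous = not (lower <= current <= upper)
print(f"expected band: [{lower:.0f}, {upper:.0f}] packets/s")
print("behavior-model alert fired:", anomalous)                 # True -> caught
```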
So, going back to what we detected: the reason we flag this, and the analysis says there's something wrong, is the fact that there are more errors coming in. So, going back along that chain again, that's what we detected here. What does the cart server say?
C
We are starting to see something here. It says the cart server does not have any pod to serve requests. So if you are an SRE and you know this dependency: why is this slowing down, and why is there also no data? Well, if there is no pod here, there is no data coming back and no requests going in, so that itself is telling us that the network metric anomaly detected on the ML side and this pod not serving requests are connected. So we have a problem
C
that's further away, downstream, until we come here. If I go to this container, it says, well, in fact, I already gave you the answer: there's an ImagePullBackOff error here, and if you look at this, it says, hey, the cart server container is terminated. In fact, if I click on it, it'll just say, actually, I don't have a container there; it's terminated.
C
So if it's terminated, of course, what that says is that I'm not going to have anything to perform the cart work and respond back. If I go to the container, now we can start looking at what has happened. The analysis will now say what is going on here: there is no pod running. And now, if I click on analyze, we use the same fishbone, except it's not for a container;
C
this one is specific to the Kubernetes problem. If you look at it, the same fishbone that you saw on the container side here has classes for pod scheduling failures, node failures, startup failures and runtime failures, and this is predefined; remember, we talked about curated knowledge. Someone who is familiar with Kubernetes knows how startup failures, deployment failures and runtime failures can happen. And if you look at this, it says the pod is constantly in transition, and it's saying the container is not ready. We are collecting this from the Kubernetes state events.
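[Illustration: the signals mentioned here, container not ready, back-off restarts and the bad image name, are visible in the pod's container statuses and namespace events through the Kubernetes Python client; the namespace is an example.]

```python
# Sketch: surface the "container not ready" / image-pull signals the fishbone
# uses, straight from the pod status and events. Namespace is an example.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("shopping-cart").items:
    for status in pod.status.container_statuses or []:
        waiting = status.state.waiting
        if not status.ready and waiting is not None:
            # Typical waiting reasons here: ImagePullBackOff, ErrImagePull, InvalidImageName
            print(f"{pod.metadata.name}/{status.name}: "
                  f"not ready, reason={waiting.reason}, message={waiting.message}")

# The same story shows up in the namespace events (what the state/event gateway reports).
for event in v1.list_namespaced_event("shopping-cart").items:
    if event.reason in ("Failed", "BackOff"):
        print(f"{event.involved_object.name}: {event.reason} - {event.message}")
```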
C
There is a back-off restart. Why is there a back-off restart? Because the image is not loading: it keeps going back and trying to pull it, the container is not ready, the pod doesn't come up, and what happens further down is that the service that calls it does not respond. There's one more thing that's detected automatically, dynamically: there is an image name problem, the name is incorrect, and because it's an invalid image name, not surprisingly, it continuously tries to pull it and never gets ready, which means that, as a result,
C
this service starts seeing network errors, and errors that increase the response time, and, as a domino effect, that propagates all the way back to the nginx at the front. I know we'll run out of time, but what I wanted to point out is that you have to be able to pull together all of this information in context, eliminate the paths, look at the dependencies and analyze each of them. If you noticed, we were really looking at flow metrics to understand why this was slow, all the way down to the cart, and ML-detected metrics to understand why that problem happened.
C
The nail was a bad image name, and that propagated all the way up to create an SLO breach. Okay, so I just wanted to give you that. I know I wanted to show you a couple more examples, but having specific contextual knowledge of the system and analyzing things in sequence, bringing all of that together, is the key here for us to be able to understand how this problem unfolds. So I'm going to go back and bring up this other deck here.
C
As I said, it can never be perfect, because you don't have 100% of the information at 100% time granularity, but you can use all of this information to solve the problem. In this case, we were not even using traces. So, just to summarize before we go into Q&A and throw it open for discussion: as you're probably well aware, if you're using Kubernetes, there are multiple issues that can impact application performance, and the cardinality, the complexity of space and time, and the dynamism make this RCA quite challenging.
C
We can't solve the problem by blind correlations; that's not going to help, because the number of possible correlations is going to be very large, so we have a cardinality problem there as well. But if you look at how to resolve this, and how to do it very well, it's really, as I said, to follow the breadcrumbs and eliminate the things that are not relevant, and that's where a decision system is needed, one that works at runtime and does that automatically for you, so you're not spending the time. Leveraging curated knowledge is absolutely important.
C
There is no such thing as blind correlation with blind ML that works; it'll just lead you down false paths. That means understanding the full telemetry and how the configuration is changing. And, finally, the message for all of you who are following the CNCF: you can leverage all the open source CNCF instrumentation.
A
Look, there was a question from Oliver: does OpsCruise show metrics for services in the cloud provider that hosts the Kubernetes cluster?
C
Exactly, yes, we do. Let me share this tab. Yes, of course you can, because you can collect that data. What we do, then, is use the data that we're getting on the infrastructure. I think the example I'm showing here, though, is what we collected, for example, from the load balancer, which you have to specify to the cloud, whether it's this one or another. So here's an example of collecting data from AWS on the load balancer that we showed; it's also visible, Oliver, in the app map.
C
Which we were talking about earlier. So, for example, these are Postgres, oops, sorry, Postgres metrics, and you can get the queue depth, etc., from the cloud vendor itself. Obviously it's not Prometheus or open source, so we have to be able to pull that in to do that. Hope that answers your question, Oliver.
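[Illustration: pulling an equivalent load-balancer metric from the cloud side is a standard CloudWatch call; a sketch with boto3, where the region and the load balancer dimension value are placeholders.]

```python
# Sketch: fetch ALB target response time from CloudWatch, the kind of
# cloud-provider metric discussed above. The region and load balancer
# dimension value are placeholders.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/shopping-cart/0123456789abcdef"}],
    StartTime=datetime.utcnow() - timedelta(minutes=30),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average", "Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'avg={point["Average"]:.3f}s', f'max={point["Maximum"]:.3f}s')
```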
C
I'll just, if there are questions, post this, in case folks want to follow up later, want to talk to us, or have a general question on CNCF metrics and monitoring, whatever it is.