From YouTube: OpenShift Coffee Break | Observability with Dynatrace
Description
Get your espresso ready for another OpenShift.TV Coffee Break as we welcome our special guest Henrik Rexed from Dynatrace to share with us real world production grade observability and monitoring stories with Dynatrace on OpenShift!
A: So hi, thank you for joining us today. Henrik, would you please tell us a bit more about yourself and what we are going to be talking about today?
B: Sure. First of all, it's a pleasure to be here. For the coffee break I brought some tea — I didn't know which type of beverage I was allowed to drink today, so I have the tea and I have the coffee, in case I need to switch from one beverage to the other. Otherwise, my name is Henrik Rexed, and I am a Cloud Native Advocate at Dynatrace. My main focus at Dynatrace is everything related to cloud-native technology, of course around observability. So my focus is observability on Kubernetes, OpenShift and related technologies as well. I'm very honored to be here and to present how Dynatrace can save your life from a broken cluster. As a little teaser about what I'm going to present: I have picked two real production outages that happened to a real big brand of the industry, and we'll explain how they happened, how we can resolve them and, of course, how Dynatrace is going to help you there.
A: Thank you so much, Henrik. I really appreciate the fact that it's going to be, I would say, pragmatic, because we're going to be talking about real-life situations and understand how the monitoring stack that you provide can help address those issues. Okay, so without further ado.
A: Maybe just to set the context: I've been working on Kubernetes and OpenShift for many years — about six years now — and I have been in those situations where you are trying to understand what the hell is going on, and because those technologies use so many layers, sometimes it can get really hard to get back to the root cause and really understand whether it is coming from the cluster, from networking, from storage, from an application — is this a performance issue, is this a service that is completely down, etc.? So I hope that by the end of the session we can better understand how to troubleshoot those kinds of issues and better investigate, I would say, the problems that we face when we are working with a Kubernetes cluster or OpenShift, for instance. So okay, let's go ahead.
B: All right, let's do that. I already introduced myself, so I'm not going to go over that again, but just a small teaser: prior to Dynatrace I was working as a performance engineer, trying to optimize systems and understand issues — that was my main goal previously. I still do that with observability, of course, but because performance engineering is still in my heart and I love performance engineering activities, I am producing some content for a podcast and YouTube channel called PerfBytes.
B: So if you want to learn about performance engineering, check out PerfBytes. Otherwise, a year and a half ago I started another YouTube channel called "Is It Observable" — you can hear it in the name, it's about observability. I'm producing content about how to get started with tools like Fluentd and Fluent Bit, what a service mesh is, a couple of things always related to observability in general, so check it out. The channel is quite young and it deserves feedback to improve the content.
B: So what are we going to learn over the next couple of minutes — or almost an hour? A couple of things. The two production issues that I have selected today come from a well-known brand of the industry, and they are both related to service mesh. Service mesh is not bad — quite the opposite, it's great — but there are a few things that you have to consider when you start working with a service mesh. So we will see those problems in detail, and we'll see how we could have avoided them if we had been using Dynatrace. It will also be an excuse for me to share the latest and greatest news about the Dynatrace product: we'll see that Dynatrace has an AI engine called Davis, and we'll see how Davis can detect those problems based on Kubernetes data. All right.
B: So that's one of the things a service mesh does: it manages the communication of your microservices. When you start building a microservices architecture, first you need to provide features — you focus on building those features in the code — but your microservice is going to interact with other services in your cluster. So there are a couple of things that have to be handled, whether we code them or not. There is the retry logic, and the retry logic is very simple: I have service A and service B, I try to reach out to service B, service B is not responding, I need to do a couple of retries and, after a number of retries, I will throw an error. The second use case that we need in microservices is authentication: if I need to authenticate with service B in a specific way, I probably need to code that piece in my code, so I will have to handle that. And last but not least, there is also a crucial component of security: SSL certificates. I may not code that — I will probably ask someone to generate a certificate — but again, it's great to have a certificate, and we have to rotate it to make sure that it stays secure in the long term.
B: Of course, there are other features that are important: we need to get some observability out of our microservices; if we do blue-green deployments we probably want to be able to do traffic splitting; and there are many other types of features related to microservice communication. So the main advantage of a service mesh is that it handles those features without any extra line of code.
B: Basically, you build your application container with only the features that you need, and all the rest is managed by a sidecar proxy that handles those different features — observability, security, authentication and so on — through that sidecar proxy. The service mesh handles the communication by adding new CRDs, and those CRDs, as we'll see, help us configure the service mesh itself and inject the right proxy rules into our different pods.
A: In fact, yeah — sorry, Henrik, maybe just for the people who are not really familiar with what CRDs are?
B: Yeah. What I was saying is that if you want to configure your different proxies, of course you're not going to touch every single instance of each pod and add an extra container there by hand — no, it doesn't work like this. A service mesh has a control plane; it's the master component of the service mesh, I would say, and you interact with the control plane.
B: You define: I want to do traffic splits, I want to define virtual services, I want to do SSL. You configure everything from the control plane and then, when you deploy a new pod or a new workload in your cluster, the control plane injects the proxy automatically into your pods, which is fantastic — it means I don't have to code anything. The features that I need are injected automatically on our workload. All right.
B: So now you know most of the details required to follow the story. Once upon a time, a big-brand booking platform that you have probably heard of — and probably used for your vacations or even for business trips — had a major production outage. And what was the issue?
B: All the nodes of the cluster were fully saturated. At the end of the working day everyone went home, and just before they left everything was perfect; when they came back the next day, everything was completely crushed. They discovered that morning that they were facing an abnormal number of pods in the cluster — more than 500 extra pods had been added to the cluster, and they were doing nothing.
B: So they were naming those pods "zombie pods": pods that stayed running in the cluster, doing nothing but eating resources. So what was the initial reason for that behavior? First, this company tries to provide recommendations on their website, and they do some analytics through a recommendation job. So they have some cron jobs running continuously in their production environment that collect some behaviors and provide the right recommendations, and they had been doing that for many years. And just before the incident, they had deployed a fresh new service mesh.
B: The nodes didn't have any memory or CPU left, and the cluster was very hard to operate; even more than that, other workloads had difficulty being scheduled, so it was impacting more than just that small namespace or that small application. So let's have a look at cron jobs. I don't know how much you know about them, but cron jobs allow you to schedule a job in your cluster, and the definition of the object in Kubernetes is very simple.
B: You define a template with the container that will hold your job — that will basically run the batch that you have designed — and the idea is that, as designed, it's scheduled: Kubernetes launches a pod, it runs, and once the pod has ended, the pod is deleted and the memory is released.
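To make the shape of the object concrete, here is a minimal CronJob sketch of the kind being described — the name, schedule, image and command are hypothetical placeholders:

```yaml
# Minimal CronJob sketch: a batch container that runs and exits.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: recommendation-job          # hypothetical name
spec:
  schedule: "*/15 * * * *"          # run every 15 minutes (illustrative)
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: batch
              image: example.com/recommendation-batch:1.0   # placeholder image
              command: ["/bin/sh", "-c", "run-analytics"]    # placeholder command
```

Once the batch container exits, the Job is considered complete and the pod is cleaned up — which is exactly the assumption the injected sidecar breaks, as described next.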
B: The problem is that when you have a sidecar proxy, your job ends, but in the pod you now have two containers: one container that has finished its task, and a second container, which is the proxy — and the proxy is a long-running process, so it never ends on its own. So even if our batch has ended, the proxy is still running, and therefore our pod is not deleted by Kubernetes.
B: The second one is the concurrency policy "Replace". If you define this, Kubernetes, instead of adding a new pod, will replace the existing one. So in the end you will have at most one pod running for that given job — or more, if you allow parallel jobs — but again, it's going to have far less impact than thousands of pods continuously being scheduled in the cluster.
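As a sketch, that mitigation is a single field on the CronJob spec (the rest of the object is as in the earlier example):

```yaml
# Limit the blast radius of stuck jobs: replace the previous run instead of piling up pods.
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Replace   # alternatives are Allow (the default) and Forbid
```

Note that this does not fix the root cause — the sidecar still keeps each pod alive — it only caps how many stuck pods can accumulate.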
B: All right, so that's the really simple configuration aspect. Now, how can we actually resolve this? Because in the end we need to force the proxy to end. The first solution — which is not the acceptable one, but let's have a look at it — is using a file: I have my app running and, when my app is ending, I write something into a text file, and the proxy looks at that file on a very regular basis.
B: And if there is something interesting in the file, it will basically stop the job. All right, that seems doable, but again, if I'm using Istio, it means I would have to rewrite the Envoy code, and I don't want to go in that direction for sure. So what I will do instead is take solution 2, which is perfect, because in fact, if I look at it, all the service meshes of the industry — Istio, Linkerd and the others — have thought about that problem.
B: They have an endpoint on their proxy to which you can basically send an HTTP POST request, and it will kill the current proxy. So in the end my app is running; once I have finished my batch, I send an HTTP POST locally, inside the pod — depending on the service mesh, you will have different endpoints to interact with — and this shuts the proxy container down, and boom, problem resolved.
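A minimal sketch of that pattern, assuming a Linkerd-style proxy whose admin port exposes a shutdown endpoint; the exact port and path differ per mesh and version (for example, Linkerd's proxy admin endpoint listens on 4191, while Istio's pilot-agent exposes /quitquitquit on 15020), so check your mesh's documentation before relying on them:

```yaml
# CronJob container that runs the batch, then asks the injected proxy to exit.
containers:
  - name: batch
    image: example.com/recommendation-batch:1.0   # placeholder image
    command:
      - /bin/sh
      - -c
      # Run the job, then POST to the sidecar's local admin endpoint so the pod can complete.
      - run-analytics; curl -s -X POST http://localhost:4191/shutdown
```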
B: Okay, so now we've seen how we can resolve it, but how can Dynatrace help you here? A couple of things. First, Dynatrace, as you probably know, is an observability platform. In the cloud-native space, and especially on Kubernetes and OpenShift, we have an operator that we deploy in the cluster, and that allows us to collect the various pillars of observability.
B: So the logs, of course; Kubernetes events, which are a rich source of information; metrics; traces; and way more — also continuous profiling, if you want. What is the value of that operator? A couple of things: you can deploy it from a UI, you can deploy it from Helm charts — there are plenty of different ways of deploying it. We need to have visibility of the various objects of the cluster, because we have a couple of different components that will be deployed.
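For orientation, the operator is typically configured through a custom resource. A heavily simplified sketch might look like the following — the resource kind, field names and capability values reflect my understanding of the Dynatrace operator and may differ across versions, so treat it as illustrative rather than authoritative:

```yaml
# Illustrative DynaKube custom resource: full-stack OneAgent plus an ActiveGate
# that talks to the Kubernetes API (field names may vary by operator version).
apiVersion: dynatrace.com/v1beta1
kind: DynaKube
metadata:
  name: dynakube
  namespace: dynatrace
spec:
  apiUrl: https://<your-environment-id>.live.dynatrace.com/api   # placeholder tenant URL
  oneAgent:
    cloudNativeFullStack: {}        # instrument nodes and application pods
  activeGate:
    capabilities:
      - kubernetes-monitoring       # query the Kubernetes API for cluster/workload health
      - routing
```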
B: So the first thing that we deploy is an admission-controller rule, so that every time a pod is scheduled we are able to inject the right code module that will instrument your code. The other thing we can do — in fact we are quite flexible — is deploy it with different observability modes. We have the full stack, where we instrument your code and also look at your nodes.
B: We have an application-only mode, where we just instrument your pods and that's it, so it's pretty flexible. But in the end we always have a component called the ActiveGate, and this ActiveGate is a component that basically interacts with the Kubernetes API to get the health of the nodes, the workloads, the namespaces and more. So this component is crucial, and there are plenty of features related to that ActiveGate component. Dynatrace also provides out-of-the-box alerting for OpenShift and Kubernetes.
B: If you look at a cluster, a cluster is like an onion: there are different layers. You have the user sitting outside of the cluster; you have the cluster itself; you have the nodes and the namespaces, and within a namespace you have workloads, which spawn pods, and pods have containers. We have built predefined alerting that helps you keep track of what's really going on in the cluster.
B: First, if you're outside the cluster — meaning you're a user — we want to pay attention to the response times and the failure rates, because if there is something like a memory problem, or heavy throttling, then the response time will be impacted. So our AI engine, called Davis, will automatically look at those components.
B: So you'll be aware of what's going on, and the same for the workloads. Even better: if you have some workloads in a pending state, it's a sign that there is a memory problem or a configuration issue; if your pods are not ready, or they keep restarting in a CrashLoopBackOff error, you sometimes want to be alerted quite fast — those will be handled automatically. All right, I get your point, I know that you want more — that alone is not enough to operate your cluster — but don't worry.
A: And so just a quick question, Henrik: is there a way to define your own metrics — well, not metrics, but your own alerts based on some existing metrics?
B: Of course, of course. Dynatrace is a huge platform — to be honest, today you can do tons of things — but we have lots of customers that operate tons of clusters, and they don't want to go and build their metric expressions and then add the thresholds, because it would take a lot of time. But of course, if you know exactly which metrics you want, you have the ability to do your own custom anomaly detection.
B: We call it alerting, but of course it's anomaly detection, and it will also be taken into account. What is great as well — and that's pretty much what I was going to say on that slide — is that when you operate clusters with Dynatrace, you probably have more than one cluster; you probably have 10, 20, 50, 60 or 100 clusters that you have to operate.
B: Usually what happens is that I have some alerting that is predefined globally for Dynatrace, but it happens that some clusters are special — let's say this is the production cluster managing our AI discovery or our analytics — and I have different types of thresholds, different types of requirements, so I want to change the alerting just for that cluster. You have the option to customize the alerting at the cluster level, but also at the namespace level.
B: So it's pretty flexible — we have different levels of configuring those alerts: either through the out-of-the-box alerting like I mentioned, or through a metric expression where you extract the metrics and define the alert on them; you can also extract something from the logs, convert it into an event, and that event could become an alert as well. So it's very flexible for creating alerting in Dynatrace.
B: One thing that we did: this out-of-the-box alerting was something that we provided to our users — I think they have had it in their hands since last week or something like that — and we also created predefined dashboards. For your Kubernetes clusters we also provide out-of-the-box dashboarding. Of course you can create your own dashboards — you don't have to stick to those — but those dashboards are designed to give you pretty much a cluster overview.
B: A workload overview, the namespaces, and, what is great, we have improved the user experience so that from a dashboard you can easily jump to the right screens, and from the screens you can go back to the right dashboards with the right filters. So there are a lot of small improvements, but it helps specifically when you have to troubleshoot. So enough of talking — let's go to something live and let me show you the screens. For this I have prepared an environment.
B: I have deployed a couple of things — of course, this is just a suggestion, and here's the GitHub repo related to this environment. I have two namespaces with two applications. One is the OpenTelemetry demo application provided by the OpenTelemetry community, so it's fully instrumented with OpenTelemetry, and I'm using the OpenTelemetry operator to be able to deploy OpenTelemetry collectors to forward the traces and the metrics back to Dynatrace. I also have Linkerd here — I decided today to use Linkerd.
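As a rough sketch of the collector wiring being described — an operator-managed collector that receives OTLP from the demo application and forwards it to a Dynatrace tenant — something like the following; the endpoint URL, token handling and exact CR fields are assumptions based on common OpenTelemetry operator usage, not taken from the demo repo:

```yaml
# Illustrative OpenTelemetryCollector resource forwarding traces/metrics to Dynatrace via OTLP/HTTP.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-gateway                # hypothetical name
spec:
  mode: deployment
  config: |
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    exporters:
      otlphttp:
        endpoint: https://<your-environment-id>.live.dynatrace.com/api/v2/otlp   # placeholder
        headers:
          Authorization: "Api-Token <token>"   # placeholder; store the token in a secret in practice
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlphttp]
        metrics:
          receivers: [otlp]
          exporters: [otlphttp]
```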
B: For this specific talk, this use case — the story would be the same with Istio, but it's just to highlight the usage of a service mesh and reproduce the pattern that I just explained. And by the way, just this morning I almost killed my cluster, so I was trying to recover just before we connected. All right, so let's see that in detail.
B: Where is my browser? Oh no — where is it? Here it is, okay. So first, let's start everything from the dashboard perspective. One small note: when you are using OpenShift, you have an advantage compared to, I'd say, a user who uses another flavor of Kubernetes.
B: OpenShift gives you details on the master nodes — lots of things that are very important, especially when you have to operate. So we have a predefined dashboard for the control plane of the OpenShift cluster; we also have a dashboard for etcd. But in my particular case I want to show you the dashboard that we have for the cluster. So here I have two clusters deployed in this environment.
B: I'm able to easily jump to my cluster details, and from the cluster details page I can see the cluster overview: how many nodes I have, the workloads, the various events happening in the environment. And what is great is also the node perspective. By the way, you can see here that this morning, when I connected, I saw that I was about to lose everything, so I was able to recover just before it was too late — but that's a small note.
B: So here, if I look at the nodes, I can see their status and the usage of each node — in my case I'm using cloud-native full stack, so I have the nodes instrumented and the app instrumented. Then, if I jump into the node perspective, I can see all the details of what's going on. In fact, in this particular cluster, for example, we can see that we have a spike here — you can see it in the CPU usage, CPU user.
B: Oh, now I need to zoom out. Okay, let me go to the last two hours — I'm surprised. Let's call it a demo effect. So let me grab this piece here, where we have those spikes in blue, and you see the system load here quite often has some spikes. So let me grab that CPU load and look at it here.
B: What I'm going to do here is say: I have some pattern that I see in the graph, and I want to understand what is actually causing it. So I'm going to ask the AI engine, Davis, to analyze all the metrics. No correlation — all right, that's a bad example. So let's take this spike here; let's do this traffic one.
B: Davis is analyzing all the processes that are actually running on this host and it will correlate, and it says: first we have a process CPU usage — it's the load generator service. So we can see that it has already attached it to the right process. So now, just by looking at this specific graph, I know that this behavior was related to one specific process — and in our case, because we are running in a cluster, it's going to be a pod.
B: So once I have that information, I can go to where we have this load generator service. In fact, here I have limited the impact — because if I hadn't limited it, I would probably have 50 pods running here — but I have limited the impact, and this is the cron job that I explained, the one that was basically consuming resources. As you can see, at the moment we already have three of those pods running and eating my resources, so that's basically how I recovered this morning, to avoid losing my cluster entirely. And what is great here: you can see that on this particular workload we already have an alert saying that the workload is not ready, and, as you can see, there are also exclamation points at the top, and we have all those alerts that are going on.
B: First, I had a notepad ready — you can see that for all the various workloads we have a couple of pieces of information. We also have a failure-rate increase, and that's different: it means that something has impacted the user perspective. So here Davis is trying to analyze and figure out who is responsible for that behavior. As you can see, with all the patterns — at least the ones you're aware of — we have pending pods.
B: We also have this hipster-shop namespace that has a resource-quota saturation, so I can easily go back to the right namespace and look at that problem in detail. As you can see, I was not able to schedule the load generator services on this one.
B: So, depending on where you are, you can easily jump to a workload, to a service or to the pod. Here I'm on the workload screen; I can easily get the service response times, or I can also look at the pod itself, jump into the pod screen giving me the CPU, the throttling, all the different details, and look at the logs and the events of this particular container, this particular pod, and so on. What is great is that we also have the notion of services, so here I can see the definition of that particular service.
B: And if it's been instrumented — here we have a relation with the application fully instrumented by Dynatrace — once it's been instrumented I can see the traffic that has been happening on this particular pod, and I can even go further and say: I want to look at, let's say, this particular request, /cart.
B: I want to look at the distributed traces. So here we'll see all the traces that have been collected during that time frame. I'll grab this one, for example, and I will see exactly all the steps related to that specific transaction. So you can do tons of things, from the infrastructure perspective all the way to the application layer. All right, so that was the first story. I'm looking at the time, because I talk a lot — but we still have time, all right.
A: So, Henrik, if you don't mind, can we just pause quickly there, because you've shown a couple of really interesting features that provide a lot of value when trying to troubleshoot those issues.
A: As most people who work with Kubernetes know, there are already some community tools that go hand in hand with Kubernetes to do monitoring and to provide some kind of dashboard with graphs, resource consumption, etc. — like Prometheus and Grafana, and the combination of those — but I think what you showed here is really something powerful, because those other types of tools don't really provide the drill-down capability, where you say: okay, I see there's a spike here.
A: But let me drill down and really try to understand what's going on. They will give you the information: okay, today at 10 a.m. there was a spike in CPU. But how do you correlate that to the actual workloads that triggered that spike in CPU, or that spike in memory consumption?
A: And I think here, because you are just clicking through, etc., it's really a powerful capability, I would say — especially knowing that there's some analysis happening in the background to say: okay, this is the normal behavior. Because you have trends, I guess, and the AI tools are observing that this is the normal behavior, nothing has happened — and now we can detect that there's a spike because there's a big shift compared to the normal behavior, I guess.
A: That's also where the AI, you know, magic happens: it's able to understand that abnormal behavior, and it's also able to correlate it to other things that happened elsewhere and say: okay, this thing that you see here is basically caused by something that was down there. It would take, I would say, a lot more investigation to be able to get there if you are using traditional tools, I guess.
B: Yeah, I mean, that's true. I think alerting based on thresholds is great, but sometimes you also want to understand whether you had a change of behavior, like you mentioned, and Davis has a baselining approach where it detects changes over time and, if there is a change, it will basically try to understand why we have this change. But moreover — I mean, this is a demo environment, so...
B: ...well, it's my own tenant — but if I click on a real alert impacting users — maybe I have another example, let me check — when we have a problem impacting users, what Dynatrace will do is try to estimate the number of users impacted by that problem, and also, if you're using cloud regions, you will see: okay, only these users — say, thousands of users out of whatever total — were impacted, specifically in the region of New York.
B: So it will also help you figure out who has been impacted, and it goes even beyond that. I think when you are troubleshooting, usually there is fire, there is pressure, so you need to be able to take decisions in time, and I think if you have the right tooling and the right information in your hands, then you can make the right decision. I also didn't mention it, but once we have a problem with a pattern that we have detected, that we understand and control...
B: ...we can trigger a remediation. So we can reach out to Ansible and say: hey, do this — because I see, I don't know, a pod saturation on the quote service, do this, because I have a specific playbook that I want to run — and that is possible. So you can even think about auto-remediation processes fully triggered by the AI, which I think is fantastic, because you can recover in a faster way.
A: That's really awesome — and actually that was one of the two questions that I wanted to ask for that section.
A: So that's, I would say, the first one, and it's pretty amazing to be able to do that, because, say, for instance, you said that the pods are being spread across the nodes, and, as we probably know, there is some IP allocation that happens for the pods, and that creates some files down there in system files for the containers, etc., and getting rid of that takes some simple troubleshooting.
A: Well, of course you have to kill the pods, but you also have to remove those allocated IP addresses if that doesn't happen properly via the kubelet, etc. So that's one example where having that Ansible integration — where you say: okay, we have reached our limit on pod scheduling capacity, so there's this thing that we need to trigger to be able to remediate that issue — is really, really good, you know, to be able to associate those.
A: My other question was: of course, as you said, the sooner you know what happened, the better off you are — the more time you have, of course, to address the issue — and so my question was regarding notifications, emailing. What systems can you connect to in order to send out, basically, a critical issue? Do you connect to things like Slack, or any messaging or, I guess, emailing systems?
B: Yeah, we have a Slack integration, and we can send notifications in various ways. We used to have — I mean, this was a couple of years back, people are less interested in this now — this Davis component that was running on Alexa and Google Home and others, so you could say, "Hey Alexa, give me the status of my environment," and Alexa would do it, and then, "Alexa, open the dashboards and show me the problems."
B: So I mean, it was nice. I don't think people are really using it in real production.
B: But it's a way to showcase that we are able to send notifications to various systems, and then you can make different types of usage of those notifications.
B: So, I'm just looking at the time — we have 20 minutes, so I'm going to try to be a bit faster on the second one. The second horror story that I have is, again, the same company, again related to recommendations, but here it's more on the UI side: how we are going to display the recommendations to the users.
B: So here is the problem that happened: the recommendation service relies on a back-end service, and this back-end service, for weird reasons, started to crash in the middle of the day. When they started to troubleshoot and understand the problem, they discovered: oh, we have placed a CPU limit on a given workload, and that CPU limit has caused an OOM kill. That's weird, right?
B: When the problem occurred, we can see — we have extracted the graphs showing the consumption — that we had a spike of CPU usage, but at the same time we had an increase of memory usage as well. So how could that happen? In fact, a couple of things: the ad service is Java-based, and Java-based means garbage collector.
B: So let's say that we are running with a memory usage which is quite high; to be able to release the memory, the Java app will naturally launch garbage collection to clean up the memory actually used by the Java application. And here we are using a service mesh, so when we do the communication, the pod has to go through the sidecar proxy, and the recommendation service — there are several communicating services — will reach out to the ad service.
B: But at that particular moment the ad service was running into memory issues, so the garbage collection was trying to run — but there was a CPU limit and, as you may know, when you have CPU limits defined and you reach the limit, you get throttled. So if I run a garbage collection and I am being throttled, it means that I'm not able to clean up the memory, because my task keeps getting paused: I try to clean up, and I'm paused; I try to clean up, and I'm paused — so the memory is never released.
B: So what was actually happening is that the sidecar proxy sitting next to the Java application did not have any CPU limit. So when requests were coming in, even if the pod was suffering, the traffic was still being sent to the pod — and sent, and sent, and sent — until a moment where there was no way of freeing up the memory. So what happened? We reached the memory limit, and Kubernetes basically killed the container. So this is how it could happen.
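To make the mismatch concrete, here is a sketch of the shape of the pod after injection — the application container carries requests and limits, while the injected proxy carries none, so nothing slows the traffic down while the throttled JVM falls behind (names and values are illustrative):

```yaml
# Illustrative pod spec after sidecar injection: limits on the app, none on the proxy.
containers:
  - name: adservice                       # Java-based application container
    image: example.com/adservice:1.0      # placeholder image
    resources:
      requests: { cpu: 200m, memory: 256Mi }
      limits:   { cpu: 300m, memory: 512Mi }   # GC gets throttled against this CPU limit
  - name: proxy-sidecar                   # injected by the mesh control plane
    image: example.com/mesh-proxy:stable  # placeholder image
    # no resources section: the proxy keeps forwarding traffic unconstrained
```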
B: Oh yeah — the very famous OOMKilled event, that's true. And the only way to avoid that situation is to define precise requests and limits for your sidecar containers. But, as I explained at the beginning, you don't define that yourself, because the sidecar is injected through the control plane. The good news is that every service mesh has annotations that you can add in your workload definitions where you can define the actual CPU and memory limits for the proxy.
B: So for Istio there is an annotation for the Envoy proxy, and Linkerd also has specific annotations, and this is a way of controlling it. Which means: if I define the right CPU limits in this particular case, then I avoid the situation where my proxy is basically causing the death of my container.
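As a sketch of those annotations — to my knowledge Istio uses `sidecar.istio.io/*` annotations and Linkerd uses `config.linkerd.io/*` annotations to size the proxy, but verify the exact keys against the mesh version you run:

```yaml
# Illustrative pod-template annotations that size the injected proxy.
metadata:
  annotations:
    # Istio / Envoy sidecar sizing (assumed keys; check your Istio version)
    sidecar.istio.io/proxyCPULimit: "200m"
    sidecar.istio.io/proxyMemoryLimit: "256Mi"
    # Linkerd proxy sizing (assumed keys; check your Linkerd version)
    config.linkerd.io/proxy-cpu-limit: "200m"
    config.linkerd.io/proxy-memory-limit: "256Mi"
```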
B: So how can Dynatrace help you? Similar to what I showed you at the beginning: we use Davis — Davis is our AI, we have had it for a couple of years — and we started to add Kubernetes events into the algorithm that Davis takes into account. So everything related to throttling, for instance CPU throttling: as you know, if you are throttled, then your program is being paused, so you are not able to do actual work, and usually it impacts the response times as well.
B: The second thing is: if you have OOM kills, or out-of-memory events, what happens is that your pod is killed and you generate some failures, because the pod is not responding anymore, so you will probably get some errors from other services. The same thing if you have evictions — of course, there will also be an event. But there's also something very important: whether you have an OOM kill or some throttling, it may mean that I have decided to change something.
B: We can see where the problem has appeared, so it even shows you where that behavior is actually happening. The second thing I mentioned: we also have these screens — sorry, I showed them before — where we are adding more and more things; we will have the response time, the failure rates and more. What we want is that, from the moment you have a service mesh embedded in your application, you get this visibility at the service level.
B: You can see that we have 16 users impacted and 90 affected service calls, so we have the details of which services are actually having the problem, and in the end we see that the root cause here is an OOM kill. So it has detected that we have a failure-rate increase in this particular case, and that failure-rate increase was introduced by this OOM kill.
B: So in the end, by looking at the problem card provided by Davis, you know exactly: okay, I have an impact on my users, and those users were impacted probably because the workload definition is different and we have some more problems. So there are a lot of things that we provide through Davis.
B: The exclamation points are basically all the problems that Davis detects, and then you can walk through the problems and look at the response-time degradations or all the other related problems. But there's also something else that we have introduced — not related to our topic today, but I just want to mention it. If you were wondering what the icon next to it is: we also do runtime application scans.
B: So here, for example, we see that we have eight vulnerabilities in our environment, and we can drill in and look at all the containers that are running. This is not offline scanning, it's runtime scanning, and based on this we can figure out problems. For example, here we have HTTP request smuggling, so we can see the problem in detail, where it's been happening, and look at the details of it. So we also cover the security aspects on top of the troubleshooting.
B: So if troubleshooting could be impacted by a security issue, you will also be aware of it through that module. All right, I think that's it for this, so very briefly, the key takeaways. First, as you saw here, when you use a service mesh there are things that are introduced due to the fact that we're using sidecar containers. Sidecar containers are great — they provide lots of great features — but they can also have side effects, so you just have to be aware of those.
B: So, very simply: define resource quotas on the namespace — I mean, you know, it's important and it will prevent a lot of disasters. Defining limits on your sidecar containers is also crucial because, as you saw, if you don't do that, for some reason the sidecar could introduce some problems. If you use cron jobs, again, if your service mesh is going to inject the sidecar proxy, make sure to create the right process to be able to actually stop the job and not keep it running forever.
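For the first takeaway, a minimal sketch of a namespace-level guard rail (the names and sizes are placeholders; size them to your own workloads):

```yaml
# Illustrative ResourceQuota: caps how much CPU/memory and how many pods a namespace can consume.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: recommendation-quota     # hypothetical name
  namespace: recommendation      # hypothetical namespace
spec:
  hard:
    pods: "50"                   # e.g. stops a runaway CronJob from piling up hundreds of pods
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
```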
B: Otherwise, in terms of Kubernetes, of course, if you don't use a system like Dynatrace, make sure to have the right alerting on the cluster, on the nodes, on the namespaces and on the workloads, and make sure to also consider events in Kubernetes, because events are a rich source of information for understanding what is actually happening.
B: One more thing: I have this "Is It Observable" channel covering different topics. If you want to see more about service mesh, I have dedicated content about it, so check it out. Again, the channel is quite young and the content could be improved — it deserves feedback — so check it out and reach out to me.
B: Let me know if you have some recommendations for me. A few small housekeeping notes, just for those who are not aware: we have a joint webinar between Dynatrace and Red Hat happening on Tuesday the 29th, so next week — check it out if you want to see "Automating the Cloud Native Enterprise with Dynatrace and Red Hat" and register. I think a colleague of mine, the awesome Christopher, is presenting, so I would definitely recommend connecting to it; he always has great stories. A couple of additional resources:
B: Of course, we have produced a lot of content about Kubernetes in general, OpenTelemetry and things like this — check it out, it's available on the Dynatrace YouTube channel. If you want to learn more and see more of the product in live action, we have a series called Performance Clinic that has been renamed Observability Clinic, so check it out — there's a lot of content that might be interesting for you. All right, if you have any questions...
A: I've seen a couple of features where I'm not sure I totally understood what was meant, so I'm rephrasing to make sure that's what you said. You said, for example, there's an issue with the OOM kill that happened after a change of a specific setting — for example, you decreased the memory setting in the YAML files, etc.
A: Is my understanding correct that Davis is able to hold the definitions of the different YAMLs that have been used, compare them over time, and correlate what works and what doesn't work, and say: okay, we had this YAML definition, it was working fine; now we have a new YAML definition with this change in this field and it doesn't work — and it says: okay, this is maybe what introduced the error? Is that how it works, or did I get it wrong?
B: I don't have all the details, but yes, the idea is this: sometimes, by making a workload definition change, we don't measure the impact properly, and we can introduce throttling or OOM kills — or even, say, I make a human mistake: I make a typo in the name of the image, so instead of version 1 I put version 12 because of a typing mistake, and we generate a CrashLoopBackOff error because of that. Then Davis will notice.
B: It will look at the definition that we had previously and say: hey, maybe that's the issue. So the idea is to say: maybe the issue is related to this. Same thing if we have heavy throttling impacting users — in that case it would say: okay, the throttling is heavy, and in fact we just saw that yesterday you changed your workload, so it may be related to the fact that you changed the CPU limits, for example. And the same thing for OOM kills.
B: So if I change the memory limits, I may also get OOM-killed because my pod is facing an out-of-memory situation. So I think it's still useful to be aware that our change has somehow impacted the behavior of our services, so that we can basically do the right thing faster than if we had to dig into all the metrics that we have.
A: Okay, that's really nice. So yeah, it's sort of what I understood.
A: I just wanted to make sure that was it, because it's really nice to be able to compare data over time and provide that feedback. Otherwise you would have to go and dig into maybe your Git repos — if you're using GitOps to deploy and things like that — and start comparing manually to see what the diffs are, etc. Here it seems like Davis is doing that sort of comparison for you and hands you some hints about what could be the issue.
A: So that's really cool. My other question — I'm waiting for questions in the channels, but since I don't see any coming, I'm asking my own ones, and hopefully they will be useful for the audience as well: is Dynatrace able to provide some hints about things like, "Oh, you forgot to define CPU requests and limits, etc., for this workload"? Is it able to say, "Okay, maybe you should try this or that"? Do you have that kind of approach, or no?
B: At the moment we don't do that. It's not something that we're looking at right now; we are more trying to take into consideration all the objects and components and the impact of those components in a given situation. So we are mainly focusing at the moment on increasing the value that we provide when observing Kubernetes clusters.
B: We have not yet touched this use case of helping users tune their workload definitions, so that is not there yet. I know that there is a partner of Dynatrace called Akamas — they come from the performance engineering world and they were doing auto-tuning. When you do performance tuning activity, you want to tune, and to do that you run a load test, you look at the measurements and the results, and then you make some change.
B: They have an AI engine that does it automatically for you, so in eight hours they are able to achieve something like 25 percent of performance improvement or cost reduction with the AI. And now they have introduced a new feature where they do live tuning of your clusters: they take some sensitivity tests, they run them, and then, based on those results, they make a decision to make a change. They don't always change things directly, because maybe people would be concerned about that.
B: They will make a commit in your repo and then, of course, if you approve it, it will be committed, and then, if you have a GitOps deployment process, it will be deployed automatically in the cluster, and then you will be able to see the benefit of the recommendation provided by the Akamas AI engine.
A: All right, okay, cool — thanks a lot. Final question: we've seen a lot of capabilities around using the tool to actively monitor the workloads — you know, you as a person using the dashboard, etc. — but how can you plug that into, say, a DevSecOps pipeline to try to identify some of those issues earlier, during the staging and other phases, before actually getting into production environments?
B: That's an excellent question. Dynatrace has a couple of open-source initiatives that we have launched over the years. There's one called Keptn — Keptn has now changed, so we have version one and version two.
B: Version two is called Keptn Lifecycle Toolkit, and the first one was a platform that could do either CI/CD, or just remediation, or just quality gates. And here what I want to emphasize is the quality gates — we have that component.
B: The notion is that you may want to evaluate a dev or test release and be able to promote your release from a dev cluster to staging and then, if staging goes well, make it a release candidate for production — and do that automatically. Keptn can do that: it has this component of quality gates.
B: You define your SLIs, you define your SLOs. My SLI — my indicator — could be in Dynatrace, so it could come from a dashboard or from a metric expression query, a way of querying a specific metric; or it could be in Prometheus, so we can do a PromQL query.
B: So basically you define your SLIs, you define your SLOs, and you can have maybe 10 to 15 or 20 SLOs covering availability, network, security — the different requirements that you cover with the various SLOs, with different weights — and in the end every SLO gives you points, and you can say: okay, if I have a score in the dev environment of 90 out of 100, then yes, automatically deploy to staging. So then you can think of lots of automated processes where you deploy, you run some checks, and you can go further with a new version.
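A rough sketch of what such a quality-gate definition can look like — this follows the Keptn v1 slo.yaml format as I understand it; the SLI names, criteria and weights here are invented for illustration:

```yaml
# Illustrative Keptn SLO file: each objective scores an SLI, and the total score gates promotion.
spec_version: "1.0"
comparison:
  compare_with: single_result
  include_result_with_score: pass
objectives:
  - sli: response_time_p95              # hypothetical SLI defined in the matching sli.yaml
    pass:
      - criteria: ["<=+10%", "<600"]    # no worse than 10% vs. last run, and under 600 ms
    warning:
      - criteria: ["<=800"]
    weight: 2
  - sli: error_rate
    pass:
      - criteria: ["<1"]
    weight: 1
total_score:
  pass: "90%"        # e.g. promote to staging automatically above 90
  warning: "75%"
```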
B: With Keptn Lifecycle, we are basically part of the cluster — we changed Keptn to make it easier to deploy and easier to configure — so we hook into the Kubernetes scheduler. Every time you deploy, we do pre-deployment checks, then we deploy, and then we do post-deployment checks, and those checks are very flexible: you can run jobs, you can run scripts — there are plenty of things you can do. So, for example, if I'm deploying my service and it relies on my database...
B: ...if the database is not there, then obviously my service will start, but because the database is not there, I will probably get a CrashLoopBackOff error — it will keep restarting, trying to reach the database. Basically, we can do pre-deployment checks where we say: oh, there is a dependency on the database — okay, pre-deployment check: is the database here? Yes, it is — okay, so we can deploy.
B: So the Keptn Lifecycle Toolkit now is pretty much designed to make your life easier in Kubernetes, and the great thing is, because we are part of the scheduler, you can use GitLab, you can use Jenkins, you can use any CI system on the market — from the moment something is deployed to Kubernetes, we do these checks before and after, and it makes your life easier to manage deployments in Kubernetes.
A: And you know what, actually — I have a special interest in that stack, and I think, if you're okay with that, we will want to invite you again for another session — maybe you, or a colleague who works on that topic — to go into more detail about this specific topic, because it's at the core of, you know, when you are trying to automate your CI/CD or DevSecOps pipelines.
A: All right, so thanks — thanks so much, Henrik. We are just a bit over time, so since there are no further questions, I will, again, thank you very much for your session. It was very valuable, and I hope our viewers found it valuable as well.
A: Again, if you are not already subscribed, please like and subscribe, and keep in mind that there are other sessions on OpenShift TV. You have the links to Henrik's channel as well; if you want to subscribe, feel free to go there and check out his awesome content. Thanks so much for joining us today and see you soon. Thank you guys, bye-bye.