A
Hello everyone, welcome to Cloud Native Live, where we dive into the code behind the cloud native world. And welcome from me as well: I am Ali Talvasta, a CNCF ambassador and a product marketing manager at CAST AI. I'm very happy to be here, and especially happy to be here with amazing topics and presenters. Every week we bring a new set of presenters to showcase how to work with cloud native technologies.

A
They build things, they break things, and they answer your questions live, today as well. You can join us every Wednesday at this time. This week we have presenters from OpsCruise: Cesar and Alok are here with us to talk about next-generation observability using open source monitoring. But before we get to the topic, I also want to remind you to join KubeCon + CloudNativeCon North America Virtual, October 11th to 15th, to hear the latest from the cloud native community.

A
That's already next week, so now is high time to get your tickets if you haven't yet, and see you there as well. Also, this is an official CNCF livestream and as such is subject to the CNCF code of conduct, so please do not put anything in the chat or questions that would violate that code of conduct.
B
The topic today is one that's always been of interest: how do you get observability for cloud native applications? And specifically, the question that I think is always on everyone's mind, given where we are, is: how do we leverage all the technologies, especially the monitoring and instrumentation coming out of the CNCF and open source, to achieve that? So I'm going to set this up with a short introduction, state the problem, and give you our philosophy and approach to solving it by leveraging CNCF and open source technology, as an example of what all of us can do as we move to cloud native or actively run applications in production.
B
With the legal side covered, let's start with what exactly the observability problem is, and specifically with the new sets of challenges that cloud native applications create. There are a couple of different factors. Most of you are aware of them, but they're worth highlighting. Probably the number one is obfuscation. The microservices and managed cloud native services that make up an application have dependencies all the way down to infrastructure-as-a-service and platform services, but now we have Kubernetes in the middle, so there is some obfuscation: you don't see those dependencies. That also creates what we call multiple points of performance loss: a service can be used by multiple services, even while it's being brokered and allocated by Kubernetes or by the cloud vendor. So that's one problem.
B
The second, related problem is dependencies, and they exist at two levels. We talked about applications down to infrastructure and platform services; the other level is dependencies across the services themselves. This is because you have a very large number of objects: you could have thousands of microservices talking to each other, with long dependency chains, and it's not obvious when those dependencies come into play or how they may impact each other. After all, cloud native applications are fundamentally complex distributed systems, and of course it's not just managed Kubernetes and managed services.
B
You have SaaS as well as APIs. What makes it even more challenging is dynamism. What was great for agility, the ability to add, remove, and change services and to auto-scale, means the structure and those dependencies are dynamically changing. This creates a significant visibility challenge, really an observability challenge: knowing what is actually going on at any time, even before you account for the load changing as well. So, highly complex.
B
The good news is, and this is where CNCF and open source instrumentation and monitoring come into play, that we have data at almost every level. But given the scale and complexity, the sheer amount of data makes understanding what is happening extremely hard. So the real problem has become scale, complexity, and getting to the right insights. So what do we need? This is where we start.
B
We need to extract that structure automatically, because at that scale we can't do it by hand. So that's the first requirement: capture the structure and dependencies dynamically. Second, if you want to understand what those dependencies mean, you need to understand what the applications do. A database works differently from a queuing system or, say, a load balancer. So when you're looking at them, you can apply knowledge of typical IT operations: shared services can have issues like noisy neighbors; Kubernetes can restart applications, and a pod has to be ready before traffic arrives; there are allocation limits. All of this is the lens a subject matter expert uses, and we need to embed it. When we look at the application, what is its current state? If it's dynamic, we need to understand, for every component, what is coming in, what is going on, what the workloads are, what resources it is using, and what services it is calling. That, in the end, is what you really need to provide.
B
So then the question is: how do we get the data needed to build this application understanding? This is why we embrace the CNCF and open source. What we essentially have to do, and what we are doing, is building an analytics layer that processes the information, and it's not just simply about metrics.
B
It's about traces, about the flows between services, about changes in the configuration in Kubernetes as well as the cloud, and about the logs that provide this information. If you look across the landscape today, think about OpenTelemetry: all of these are now available as open source instrumentation. You don't need proprietary agents anymore. You can deploy Prometheus, you can deploy Grafana, and you can collect it all.
B
You can get traces with standards like Jaeger, metrics with standards like Prometheus, and flows with eBPF, and so on and so forth. So our thesis, our strong belief, is: embrace open source and the CNCF, and leverage this information to do what we need to do, which is understand, contextually, what's going on in the application, processing the data into a real-time understanding of it.
B
What we're going to show you today in the demo is how we take the data coming in from this open source monitoring and essentially build out that structure, what we call the application graph, and then, as we understand the interactions and dependencies of each component that comprises the application, build out a behavior model.
B
Now, this behavior model is not simple. We can't predefine which metrics to build it from. In fact, we make no assumptions: whether it's a generic container or a known container like a database, or say a queuing system like Kafka, we use all of the available information to extract what the unknowns are and what influences them at any time, so we can predictively understand what is expected of that component. Once we learn this, deviations will tell us if there's a problem.
B
In fact, we want to do this in situ, while the application is running: observe it, understand the behavior across all the applications, and use it to detect deviations that indicate problems; in fact, it should tell us about emerging problems. And then, because we know the structure, understand the behavior, understand how components are supposed to interact, and know how Kubernetes plays a role, we can do global dependency analysis.
B
That means checking the configuration, checking the changes, cross-checking with the events and the logs, but also looking at what the expected metrics are, because that helps you isolate the problem domains and isolate the faults. If we can pull this off, we reduce the space and time the ops teams are searching through and reduce the effort it takes to resolve problems.
B
We can now also pull in traces on top of the flows to add more granularity and visibility. So that, in a nutshell, is the approach we are taking. Think of it as real-time telemetry processing; from all of that, the idea is to provide actionable insights and, if you will, the actions you can take to correct problems as you see them. And of course, the best way to show this is a demo.
C
Absolutely, thank you. Thank you, Alok. That's a really important slide you have up there, but we're actually going to jump back to it in just a second. I'm going to share my screen.
C
Awesome. So this is the OpsCruise landing page. Now, before we jump into a lot of these things, what I will do is actually deploy OpsCruise into a cluster while we're here, and I just want to show the simplicity of deploying not just OpsCruise but also all the underlying tools Alok was mentioning: Loki, Prometheus, and so on.
C
All of those are just a couple of commands away. This is already a running cluster, but I'm actually going to deploy into a separate cluster that I have here, so we're just going to switch to that really quickly. Let me clear my screen, and first we'll look at kubectl get nodes.
C
This is just a single-node cluster; the point is really to show the deployment. It's an EKS cluster that I've got running here, and I'm going to show the pods we have running. It's very bare-bones: just the AWS node components, CoreDNS, and kube-proxy.
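As a minimal sketch of the inspection steps being run here (the cluster name and region are hypothetical, not taken from the session):

```bash
# Point kubectl at the demo EKS cluster (name/region are illustrative)
aws eks update-kubeconfig --name demo-cluster --region us-east-1

# Confirm the single worker node is Ready
kubectl get nodes

# A bare EKS cluster shows only the aws-node, coredns,
# and kube-proxy pods in kube-system
kubectl get pods --all-namespaces
```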
C
So if I switch over into my VS Code, you'll see we have a few commands: we're going to add a couple of repos and run a helm repo update. By the way, Helm is our preferred choice for deploying most of these tools; it makes everything easy. It's the Kubernetes package manager, so definitely check it out if you haven't used it.
C
Then we have a couple of commands: a helm upgrade --install for the OpsCruise components, and another helm upgrade --install for the actual underlying open source tools. Again, Prometheus is going to get deployed, as well as node-exporter and kube-state-metrics; we'll talk a little more about the architecture right after this. And then we're also going to deploy Loki itself.
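A minimal sketch of the Helm sequence Cesar describes. The OpsCruise repository URL and chart name below are assumptions for illustration (the real ones aren't shown on screen); the Prometheus and Loki charts are the standard community ones:

```bash
# Register chart repositories and refresh the local index.
# The opscruise repo URL and chart name are placeholders.
helm repo add opscruise https://example.com/opscruise-charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install (or upgrade) the OpsCruise gateway components
helm upgrade --install opscruise opscruise/opscruise \
  --namespace opscruise --create-namespace

# Install the open source collection stack: the Prometheus chart
# bundles node-exporter and kube-state-metrics by default
helm upgrade --install prometheus prometheus-community/prometheus \
  --namespace oc-collectors --create-namespace

# Loki for log storage, with Promtail shipping container logs to it
helm upgrade --install loki grafana/loki-stack \
  --namespace oc-collectors --set promtail.enabled=true
```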
C
Let's give that a go; this is just reporting status. It has successfully updated the repositories; the OpsCruise gateways are being deployed (it found the release doesn't exist, so it's a fresh install, now successfully deployed); the underlying open source pieces are deploying; and finally Loki is deployed as well, and we're all set. That will take a moment to settle in, so I'll look at that cluster again.
C
And we can just see the pods starting to come up now, so we'll check back on that in a little bit, but right now we're going to go back to the existing cluster. Actually, before that, I'm going to bring back the slide Alok was just sharing, because I do want to talk a little bit about the architecture. One sec while we bring that up; in the meantime, are there any questions that have come in?
C
Fair enough. So I do want to share that last slide Alok was showing, which is our architecture. We just deployed, but what is it that we actually deployed?
C
As Alok was mentioning, the whole purpose of these tools is to observe, and to observe intelligently and easily, without the need for heavy, typical proprietary agents. I think the industry has really standardized on a subset of tools, a lot of them from the CNCF, and that is what we leverage. The monitoring layer, or I should say the data collection layer for monitoring, is really standardizing and becoming very easy to ingest; there's not always a need to go with heavy proprietary tools.
C
So we're leveraging that. In this example architecture, this is a five-node Kubernetes cluster, and across the top are just your workloads, whatever your applications are: you might be running Node.js or NGINX or MongoDB inside Kubernetes; whatever you're running, that's across the top.
C
In the next two layers, the reddish and blue colors, you have the open source components. You have Prometheus as well as Loki for metrics and logs, and then in the blue you have the exporters and collectors. For example, we leverage node-exporter for node-level metrics and cAdvisor for container-level metrics, and it's of course important to not only look at the container itself; this is why we use both. You don't only need container metrics, but also that whole infrastructure layer of the actual Kubernetes nodes running the workloads.
C
So you need visibility into both. Another really cool tool is the kube-state-metrics exporter, which gives you the state of the objects inside Kubernetes. All those exporters, of course, feed into Prometheus, which makes the data collection really simple. Promtail, a component of the Loki stack, grabs all the logs from the actual workloads that are running, all the container logs, pod logs, and node logs as well, and feeds them into Loki.
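For reference, the scrape wiring described here boils down to Prometheus jobs like the following; a minimal sketch using static targets and conventional default ports (an in-cluster deployment would normally use kubernetes_sd_configs service discovery instead):

```bash
cat > prometheus-scrape.yml <<'EOF'
scrape_configs:
  - job_name: node-exporter        # node-level metrics (default port 9100)
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: cadvisor             # container-level metrics (standalone cAdvisor, port 8080)
    static_configs:
      - targets: ['cadvisor:8080']
  - job_name: kube-state-metrics   # state of Kubernetes objects (port 8080)
    static_configs:
      - targets: ['kube-state-metrics:8080']
EOF
```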
C
Now, those are the open source components we're leveraging here; this is what we just deployed with those commands you saw. And then, of course, you've got the actual underlying pieces. We've simplified it here, but the actual pieces of the cluster are running inside Linux nodes: you've got the kubelet running on each one of those nodes, and then you've got a Kubernetes API instance.
C
Like we mentioned, it's super important to have events, and super important to have the objects that are inside your Kubernetes cluster, so we also query the Kubernetes API directly to do discovery of those objects as well as event collection: all the Kubernetes events, whether it's replica sets scaling or a failure to schedule a pod onto a node because of an image failure; all the different types of Kubernetes events, we grab all of them.

C
The other thing I mentioned earlier was the gateways. Because all this data is already being collected, we need a way to grab it and feed it out to OpsCruise. So what we do is run these super lightweight singleton pods that you see here; it's basically one pod per telemetry type.
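The same event stream being collected here can be inspected by hand; a quick sketch of surfacing the kinds of failures Cesar lists:

```bash
# Warnings surface scheduling problems, image-pull failures, and
# similar faults across every namespace, oldest first
kubectl get events --all-namespaces --field-selector type=Warning \
  --sort-by=.metadata.creationTimestamp
```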
C
You have the metrics gateway here, which leverages Prometheus's remote-write capability: Prometheus writes the metrics out. We also grab the Kubernetes objects using the Kubernetes gateway. The cloud gateway pulls from your cloud, whether you're using EKS, AKS, or a GKE cluster; whatever the variant is, we will go in, because you need insights into more than just the cluster itself.
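The remote-write hookup mentioned here is plain Prometheus configuration; a minimal sketch, with a hypothetical gateway endpoint (the real OpsCruise URL isn't shown in the session):

```bash
cat > remote-write.yml <<'EOF'
# Fragment of prometheus.yml: stream every scraped sample out to an
# external metrics gateway (the URL below is illustrative)
remote_write:
  - url: http://opscruise-metrics-gateway.opscruise.svc:9090/api/v1/write
EOF
```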
C
The cluster is a great starting point, but you also need insights into the other services tangential to your cluster: things like the load balancers handling the connections coming into your cluster, or maybe RDS instances you're using, say, on AWS, those cloud databases you're calling from your cluster. It's important to highlight those and be able to observe them in context as well. So again, the cloud gateway, as well as Jaeger for tracing: all of those are just super lightweight pods that package the data up and send it off to OpsCruise.
B
And on the monitoring plane, because we are sitting on the host, we are not touching the containers, not putting in sidecars, and we don't have to touch the application code. This is, again, leveraging what anyone running production can already deploy; all we have to do is collect from those open collectors, with the minimum amount of touch, which simplifies both the deployment process and the data collection process. Also, the data stays where it is; we don't have to store it away and lock it up.
C
Yeah, exactly. I think one of the big things, I mean, we are talking about a mixture of OpsCruise as well as the tools themselves, which things like the CNCF have enabled to exist, so Prometheus et cetera. And even though we are talking about OpsCruise, OpsCruise is pluggable.
C
The fact is that with these modern architectures for observability tools, including OpsCruise's, even if you don't have this layer there, all this data is still there, and that's the important piece. As I was mentioning, the ease of use and the commoditization of the actual collection tools has made this really impressive: your data is there, it's easy, and you're not tied to a specific vendor or to some sort of proprietary implementation.
C
All these open source tools allow you to collect and keep that data and leverage it as needed: for business analytics, in this case observability, but also for whatever else, capacity planning, etc.
C
So I'm going to jump back in; I'll just move this out of the way.
C
Yeah, I want to make sure that this is up and running; it looked like it came up pretty fast, but I will check again. So all of these are running, and you'll notice, and I'm going to show this to you inside the cluster as well, that we have a couple of different namespaces: the collectors namespace and the actual OpsCruise gateway namespace.
C
You'll notice things like cAdvisor, Loki, Promtail, and kube-state-metrics as well; the Prometheus instance is up and running, here's node-exporter, and then our gateways are there too. So let me jump into the demo cluster itself, where we're going to get into some more of the cool things we're doing with those open source tools and all the data we're getting. But this is just the cluster itself again.
C
I just really wanted to show that deployment. I'll refresh my screen.
C
Make sure everything's up to date, and here is the actual deployment we were just looking at in the command line. As we saw, it's a single-node cluster inside EKS, so you'll see that node, and you'll see these components: you have Loki, Promtail, the OpsCruise gateways, node-exporter, etc.
C
So we're now building this really interesting view, which I'll show details on, all within a couple of minutes, while we just reviewed the architecture. This is all done out of the box; again, super cool that we're able to leverage the open source tools for grabbing all of this. Now I'm actually going to jump into the actual demo cluster itself so that we can see more detail.
C
This is our view, and what you're seeing here is a lot of data being collected and represented. If you're familiar with eBPF, we're leveraging the Linux kernel's eBPF capabilities to build this view where you see the data flowing across. We do support tracing, and there are a lot of really awesome cases for tracing, but there are also a lot of cases where you might not want to do tracing, maybe to avoid overhead, et cetera. So the eBPF capabilities of the Linux kernel really allow you to build this kind of topology and structure view without forcing the need for tracing.
C
I think that's one of the really cool things that modern Linux implementations have allowed us to do.

C
Now, before going into all the details, I just want to show one more time the underlying pieces, in a slightly more complex cluster. You'll notice there's a bunch of filters and such across the top. Again, the open source tools make this really easy, because when you deploy these tools they send things like labels off to Prometheus, and we can stitch those labels together and make this data really easy to ingest.
C
So now we can actually leverage the filters attached to your workloads to slice into different entities: you can filter by app, or you can filter by namespace, and all of this is just incredibly easy to do now with these modern tools. I've built a view of just the underlying pieces using these filters, so I'm just going to apply it: this data collection layer, which shows us the OpsCruise components as well as the open source tools.
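The same slicing works in plain kubectl via label selectors; a small sketch with illustrative label keys and values:

```bash
# Filter by app label within one namespace
kubectl get pods -n oc-collectors -l app.kubernetes.io/name=prometheus

# Filter by a Helm release label across every namespace
kubectl get pods --all-namespaces -l release=loki
```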
C
So in this case we have a five-node cluster; these are the five nodes and some of their data. Again, a lot of these components run as DaemonSets, so if I zoom in a little to node-exporter, you'll see five instances of node-exporter and five instances of cAdvisor.
C
Kube-state-metrics is a singleton. You can see all of this, and how Prometheus actually goes out and scrapes the metrics from those endpoints; you'll notice the arrows going outbound, because that's the direction the traffic flows. Then you'll see the OpsCruise gateways: Prometheus, as I mentioned, leverages remote-write to feed the Prometheus gateway for OpsCruise, and on the left side you'll see the Loki components.
C
So these are the actual pieces running inside the cluster, which is where we're getting all that data, and you'll see they're posting out to our Amazon instances, which is where we're housing this particular demo. Excuse me, sorry about that. So I'm going to go back, just clear that filter, and start showing some of the really cool things the underlying toolsets allow us to do.
C
Again, we're leveraging eBPF to show you this view of the flow between the different components. You'll notice the different pods, for example, and you'll also see, as I mentioned, that it's important to have a view into the other pieces your infrastructure is touching, not exclusively Kubernetes. We're running in AWS, so you'll see things like this Elastic Load Balancer, and we can actually click into it; this is what we call our quick view.
C
If you click into the actual load balancer, you get data relevant to a load balancer: its DNS name, its private and public IP addresses, the ports that are exposed, along with metrics. Now, this is a metric snapshot; we can look at some metrics in a bit.
C
I'll show you how the context for that works, but again, all this is being pulled from the underlying source, in this case CloudWatch, and it works for all the entities. So here we have actual pods; let me pick something a little more interesting, maybe an NGINX box.
C
If I click on the NGINX pod, again you see all the data from the underlying tools. If I hover on this, you can see connection data and architecture data, things like the performance we're observing: this NGINX controller is calling out to the NGINX service with a response time of 57.23 milliseconds on port 30000. So architecture validation becomes super easy because of the data we're collecting from those tools. And if I click on the pod itself, it brings me into that quick view again, which you'll notice is pervasive throughout the platform; again, we're leveraging the native data from Kubernetes.
C
So the label that was attached to this pod in the manifest is automatically picked up, and as you saw earlier, the view I built was based in part on some of these labels and namespaces. But why is it important to have all this data: things like the namespace, the IP address, the start time?
C
All these things are important, and just off the top of my head I can give you a lot of different examples. It's important to have the start time to make sure that the latest ConfigMap you applied is actually in place.
C
If you know you applied the ConfigMap on October 6th, which is today, and you're seeing that the start time was February 16th, you know that ConfigMap is not in use, because the pod itself has never bounced.
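That check is easy to script; a sketch with hypothetical pod and namespace names:

```bash
# If this timestamp predates the ConfigMap change you applied today,
# the pod never restarted and is still running the old configuration
kubectl -n demo get pod nginx-ingress-abc123 \
  -o jsonpath='{.status.startTime}'
```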
C
Another example is namespaces. A lot of the companies we work with have giant clusters: they might have 50- or 60-node clusters, even hundreds of nodes in a cluster, and that might be a single cluster across the entire enterprise just for non-prod. So what happens there? You might have, say, seven instances: prod, pre-prod, stress, QA1, QA2, QA5, all these individual instances. Well, how are you going to determine which slice of the application you're looking at?
C
Usually that's going to be segmented by namespace, so it's important to be able to not only look at that, but also leverage filtering on it inside your observability platform. The other thing, of course, is the context that Alok was mentioning. It's really important to have context, and I'll actually click on a container to get slightly richer data.
C
So one of the really cool things we do is stitch all this data together, again a facility that's available to us because of the underlying tools.
C
The underlying tools do such a good job of sending over labels, et cetera, that being able to stitch this data together richly allows us to do some really cool things, like contextually giving you access to things like metrics. So if I'm looking at this ingress controller container, as you see in the upper left corner, I can click on metrics, and of course metrics matter, because you need to know what's happening with your workloads: what does my CPU look like?
C
We have a view for that, and I'll show you some of what we do with all this data, how we make recommendations and highlight pieces that could use higher or lower resource settings, but we'll jump to that in a sec. Again, you get whatever data is available for this particular entity: network received bytes, memory utilization, and so on.
C
One thing I'm seeing here, just by looking at this, is that this pod is severely oversized, but I'll go back. There are also things like events; I won't jump into every single one of these in the interest of time, but there are Kubernetes events and logs, and I think logs are an important one. So let me click on logs: again, we're looking at this NGINX controller, and now we're straight into the logs for it. The context is important.
C
Now, I'm just giving you a sneak peek of what's actually under the hood. In reality, while it is kind of fun to go in and explore all of this stuff, the ML is really what brings a lot of it together; I just want to show you what's underneath. So again, you'll see some of the links; connections is just a table view of what this is talking to or what's talking to it, and you can see the Elastic Load Balancer, as you'll notice by this little arrow.
C
Now, I'll jump into the ML in just a bit and show the real magic behind that, but I want to show a couple of other views first, so I'll go into the node view. This is just another view of the underlying data. It's important to be able to see what's running where: you might have a particular host that's problematic, and you might want to know what pods are running on top of it.
C
I mentioned we have five nodes: one, two, three, four, and the fifth all the way here on the right. Here we're breaking up these views into, basically, a kubectl get pods with a filter on each individual node, but showing them all at the same time. So you can see the workloads; you'll see cAdvisor running here, CoreDNS, and you can also see data related to the node itself.
C
Say you're moving off of a Docker runtime and onto the CRI-O runtime, for example: you can check the actual config of the node itself and get details, like seeing that this is still a Docker container runtime, but also things like the kubelet version, the kernel version you're leveraging, and the operating system images. And again, just like we saw for the pods themselves, it's important to have the node-level metrics.
C
So this is again a high-level snapshot of the metrics for the node, but you can jump into it; there's a timeframe selector here, and it's just as important to understand how your nodes themselves were behaving at any point in time, whether they had some sort of spike, etc. Now, we have alerts that will automatically notify you of that, but it's great to be able to go in and explore at will.
C
Before I actually jump out, I do want to show the balancing view, because I didn't mention it. In our balancing view, we show you how many resources an individual pod is consuming. Hold on, let me refresh my screen.
C
I don't think I made a sacrifice to the demo gods today.
C
It should come up; I think it's just missing... oh, there you are. So we show you resource data for CPU, memory, and disk, and you can see, for example, let's just pick on cAdvisor: we have the cAdvisor pod, and you can see that it has no requests and no limits set. So that might be something to explore.
C
We can also see that it's BestEffort, and the current CPU utilization is 195 millicores, while the average is about 124 and the max is 220. So this can help when you're optimizing cluster workloads and trying to understand them.
C
Yeah, so again, this view really is specifically around that: being able to optimize workloads and make sure your limits are properly set. You might have some that are crashing, and we'll alert on that, but when you're looking to proactively go in and identify things, this view is perfect for it.
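Acting on that kind of finding might look like the patch below; a sketch only: the DaemonSet and container names are illustrative, and the request/limit values are derived from the usage numbers just quoted (average around 124m, max around 220m CPU). Setting requests moves the pod out of the BestEffort QoS class:

```bash
# Strategic merge patch: the containers list is merged by name, so
# only the resources stanza changes (this triggers a rolling restart)
kubectl -n oc-collectors patch daemonset cadvisor -p '
spec:
  template:
    spec:
      containers:
      - name: cadvisor
        resources:
          requests:
            cpu: 150m
            memory: 128Mi
          limits:
            cpu: 300m
            memory: 256Mi'
```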
C
Again, you know, with modern applications, we know that Kubernetes has won the orchestration battle, so tons of modern applications are of course running inside Kubernetes. So based on the Kubernetes API data and some of the other open source tools, we built this view exclusively for Kubernetes resources, where you can see things like the deployments, replica sets, and daemon sets that are in your cluster.
C
This is a nice map, but it's also clickable. For example, I can look at pods: I click on pods, and now I'm looking, again through the magic of those labels automatically collected by the underlying tools, at a lot of what builds these views. You've heard me mention it two or three times at this point, but it really is almost magic.
C
Being able to grab this data around namespaces, how many pods are running in those namespaces, and any failed or anomalous pods; anything with an auto-detected anomaly will show up here, but you'll also see the distribution of your workloads.
C
Right, so I can look at the namespace they're part of, their IDs, their status, the host they're attached to, and the IP address of the pods themselves, along with the labels attached to them, whether they're part of a ReplicaSet or a DaemonSet, and then quality-of-service data along with created time and start time. And again we come back to this quick view, built with the data from all those different tools.
C
You can look at all this data, the labels and all the metadata, and you're back into it. There are other views that are very similar, but what I want to highlight is the richness of the data and the contextuality: having all these tools, all the metrics, all the config, all the events, stitched together, and that richness provided to us in the context of all these views.
C
The next thing I want to jump into is the alert view, and Alok, I think this is a lot more pertinent to some of the stuff you were talking about, so please feel free to chime in.
C
With all this data we're now receiving, once we stitch it all together, the one thing we're hoping to drive home is the smart layer, as Alok showed in the slide, the smart layer on top of all these tools. Because it is important to have metrics, traces, network data, config data, and change data, but what you do with that data is the real challenge I think we're all facing today.
C
The data is often siloed. You might be using proprietary tools for one piece of data and open source tools for another, or even if you're fully open source, you might be looking inside Prometheus for one thing, directly inside Loki for another, and going to the Kubernetes command line for other things. So that is one of the challenges we're trying to solve; I think everybody's trying to solve it.
C
If you're into brain science at all, you'll know that context switching is a big problem; it's a big drain on us. So that's one of the things we're trying to avoid: it wastes resources and time, it makes your teams less effective, and it increases outage duration, which in many cases means lost money and lost opportunities.
C
In a healthcare environment it could mean losing health, losing important time to care for patients. There's a myriad of things it could affect, but the point is you need to be more effective; you need this context so as not to waste time and cycles. That's really what this screen represents.
C
It's really the culmination of all that data we were just showing, combined with the smart layer Alok was talking about. We have the ability to set thresholds, just like any tool.
C
You can set thresholds directly inside, say, Alertmanager in Prometheus and create alerts on that, and we have that capability too; I'll actually jump out here for a sec to show it. But that's not the philosophy we want to lead with. I mean, if you really want to, you can come in here, select a metric, and apply a threshold.
C
You can say, all right, create an alert if I'm over 200 milliseconds, and we'll even provide automatic threshold suggestions for workloads that have been running: this is the current max, etcetera, so the suggested threshold here is 0.35, because in this case the CPU doesn't go over that. That's a little bit of our ML in play, but static thresholds are not what we want to rely on.
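For comparison, the static-threshold approach being described corresponds to an ordinary Prometheus alerting rule; a sketch with a hypothetical metric name, using the 200 ms figure from the example:

```bash
cat > threshold-alert.yml <<'EOF'
groups:
  - name: latency
    rules:
      - alert: HighResponseTime
        # metric name is illustrative; 0.2s is the 200 ms threshold above
        expr: service_response_time_seconds > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Response time above 200 ms for 5 minutes"
EOF
```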
C
There are more significant things we can do around really stitching all this data together and around the behavioral models that are created with it. So let me find an alert.
B
While you're finding it, Cesar, I think one comment is worth making.
B
Instead of trying to guess, optimize, and tune thresholds, and instead of waiting for a workload to hit a saturation limit, it's better to detect when the problem is actually happening. This is where you want the intelligence to understand the behavior: is it working correctly, as opposed to waiting until it hits the limit and keels over and dies?
C
Exactly, and that's a real challenge. You and I were talking about this last night: especially with the ability to scale out that Kubernetes provides with replica sets, workloads are running in much tighter windows than they used to. So it's a lot harder to set thresholds nowadays, because workloads might be running in a really small window; everybody wants to maximize their resources, and some might be running at capacity. So when real deviations occur, a machine is going to find them much more easily than a human; otherwise you'll have one of your ops folks literally scouring through dashboards.
B
Let me double down on that. This is where I think a lot of people realize what has changed in Kubernetes with auto-scaling. Let's say you decided that CPU at 85 percent is what you're worried about, so you set a threshold; but demand increases and you can auto-scale, which means when you hit 85 percent you increase the number of replicas. That's not a problem, so why are you alerting? I know I can auto-scale, so that threshold is going to create all these false alerts.
B
Every time you auto-scale and scale back, the container is behaving fine, the application is behaving fine: increased demand increases resource usage, and hitting the limit just before auto-scaling kicks in does not warrant an alert; that's a false alert. If the application wasn't behaving correctly at a time when it's not supposed to hit 85, that's something I'm worried about. How do you detect that? That's the key, and we can't do this manually across hundreds and thousands of containers.
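The scale-out policy Alok describes is exactly what a HorizontalPodAutoscaler encodes; a one-line sketch (the deployment name and replica bounds are illustrative). The point is that crossing 85 percent CPU should trigger scaling, not an alert:

```bash
# Scale between 2 and 10 replicas, targeting 85% average CPU utilization
kubectl autoscale deployment webserver --cpu-percent=85 --min=2 --max=10
```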
C
Okay, thanks for that, Alok. So this is an example of one of our alerts. Now, the machine learning has done its work, but the work it's done is on the data being provided by Prometheus, and on top of that we're bringing in things like logs from Loki and context for you to look at.
C
Plus any events that are happening. So again, this whole rich construct, this really rich object model, is essentially what gets built by bringing all those open source tools together. I'll highlight just a couple of things; there's a lot of info on the screen, but the important part is that, for starters, we're seeing that some metrics are not normal in this particular cart-cache container inside this particular pod.
C
I won't go into these, because we have a more interesting view of the actual violated metrics, but I do want to call out what Alok was just mentioning. Many times you might set alerts on some metrics, but I've been in monitoring for something like ten years, and I don't recall ever seeing somebody go in and set a threshold on container filesystem reads, or some of these more esoteric or less well-known indicators of performance, and I think that's a real shame.
C
But again, this is something you don't have to do, because the ML will do it for you. You're already getting this data from node-exporter, from cAdvisor, from eBPF; you might as well actually leverage it, as opposed to just collecting it and then doing nothing because nobody actually knows what to do with it.
C
This is an example of all the metrics that were taken into consideration by the ML to trigger this particular alert, but we'll go to the more interesting view of the analysis. I'm actually just going to give you a little taste and then jump out of this, because it links to a larger piece. You'll notice we're highlighting some of the issues.
C
We have what we call a fishbone RCA, which shows you the different categories of metrics and configurations that are probably important to this scenario. The filesystem is not involved; it's all grayed out, green and gray meaning good or not involved. But I'll highlight a category: configurations have changed, the supply-side workload is having issues, which is why some of these are red, the demand-side workload is having issues, and so is CPU. We'll jump back to this.
C
I just wanted to give you a taste of this auto-detected anomaly found by the ML from all that underlying data we've collected. But let me step back for a second to show something here. We have another alert saying that we have an SLO breach on response time for this particular service, the internet service: it's saying we have an SLO value of 2500 milliseconds and we're actually responding at over 3500 milliseconds.
C
So we're at three-plus seconds of response time, which is of course important: somebody deemed this SLO important enough to set, and now it's being violated. So what our ML is doing is looking up the stack, down the stack, upstream, and downstream to identify where there are actually issues that could be affecting and causing this particular SLO violation.
C
Even just visually, we can see that there are some clear problem areas here in the red, and this is really no work being done by you; it's behind the scenes by the ML. I'll highlight what these red pieces mean in a sec; I just want to click on this NGINX, and we get a tabular view, again an amalgamation and integration of multiple data sources.
C
You have the actual SLO violation, over by 41 percent, and because of eBPF we can see the flow of the requests and identify the highest-latency path. In this case it goes from NGINX to the web server to the cart-cache, and from the cart-cache to the DB server, which you're seeing back here. Our ML has also learned what's normal, so the expected behavior for the cart-cache and for the DB server, and both are out of normal: you're at over a second and over 2.4 seconds respectively.
C
We know this is not some sort of increased-request issue, because we're also bringing in data related to the URL request count, and that is actually going down; so it's not an increased-request problem. So what is it? We could jump into each of these individual components if we wanted to, but we know we don't have to, and nobody's going to do that, because as operators we're trying to resolve issues as fast as possible.
C
So what I'm going to do is actually click on this red, and it takes me back into that alert we were just looking at. What happened, and I'll just go back a sec, is that for this particular SLO violation the ML actually brought in that completely discrete anomaly we were looking at earlier.
C
It's completely separate, but the ML brought it in and said: you know what, this is likely a contributor to, or a cause of, your SLO violation. And now we can actually go into those metrics we were looking at earlier, which, by the way, are charted automatically here. You don't get a message saying your response time is slow and then have to go to some other tool and a different dashboard.
C
It's all in here, charted automatically for you. You can see that the response time has increased by close to two thousand percent. You can see that the response bytes themselves have increased: the size went from one meg to close to eight. The max incoming response time for requests coming into this particular cart-cache container has increased by 1500 percent.
C
CPU utilization has increased by about 50 percent, but finally, the real piece is the image. This is now leveraging all the open source data and bringing it together to say you have an image regression: you were running version 0.6 and now you're running version 0.4. So that's really the issue: we have a bad image, a well-known failure point; it's broken. That's really the cause. I'm out of time, so I'll hand it back to you, Alok.
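As a manual cross-check of the regression found here, the running image tag is one field away (the deployment and namespace names are hypothetical):

```bash
# Print the image (and tag) each container in the deployment is running
kubectl -n demo get deployment cart-cache \
  -o jsonpath='{.spec.template.spec.containers[*].image}'
```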
B
I want to cap this with a couple of comments and leave time for the folks attending to have Q&A. If you looked at what we just did here, that response-time change against the SLO came from eBPF, an open source component that's already there; we didn't have to do anything.
B
The metric changes came from Prometheus and eBPF for every container, along with the flows. The configuration change we detect from kube-state-metrics, for example. When we see an event change, or there's a log or event, we can bring that in. We haven't shown the trace drill-down, which is still in the works and will be out shortly; you'll be able to pull in the trace at that point to see exactly what happened and confirm even further. But the point is that all of this data is there, freely available.
B
Open source monitoring has made available all the data we need, for telemetry and even for changes, dynamically. Let's embrace that and add this intelligence on top. You can do it; we are just showing you one way to do it and make your life simpler. That was the whole problem. So, if I were to summarize and then open up for questions, I'd emphasize exactly that; I think it's worth highlighting again.
B
We have all these open CNCF standards: OpenTelemetry is there, metrics, logs, traces, eBPF. And yes, if you have a service mesh like Istio, we can deal with that too, but you don't have to touch anything; Kubernetes even gives us the changes as they happen. Once you have that workflow, that's where the intelligence comes in, and that's what we're emphasizing.
A
Great, excuse me. A really great, amazing presentation, and a very good demo; always happy to see those. Thank you so much. As said previously, now is the time to ask your questions: you can leave them in the chat and I'll help moderate the Q&A. But let's get started, and let's kick off with a few of my own questions. I know you mentioned a bit about Prometheus, as well as other CNCF and open source tooling; do you want to expand on how these play into what you showed?
B
I think we did, and if you go back to that screen, Cesar, you noticed how we deploy Prometheus itself, just like you would today in your Kubernetes cluster as a daemon set, and enable the cAdvisor metrics and node-exporter. That's all we are pulling together.
B
What we've added, because we needed what we call layer-seven metrics, is that we now leverage eBPF as well, another open standard, which gives us coverage of networking not only at the bytes-and-packets level but also request rates at the URL level and response times; essentially, it gives us the golden signals. So nothing out of the ordinary, as long as we have that coverage.
C
We're using Prometheus as a time-series database, so leveraging the exporters easily sends that data in: it's scraped by Prometheus from all those exporters. So all the metrics you saw, whether they're network metrics, the cAdvisor container metrics, or node metrics, all of that is being fed into Prometheus, and that's where we grab them from. It's actually a really key component of the observability layer.
A
Perfect. We don't have too much time left, but I do want to ask a question, because we covered kind of the latest and greatest of observability and monitoring here. What do you think the next steps for this scene are going to be, whether for OpsCruise, what are the upcoming features, or for where the whole space will be moving in the future, and what will be the focus?
B
We are really working on cloud native. One of the things we didn't show is that we are adding tracing, OpenTelemetry with Jaeger, which brings more and more capability to the causal-analysis pieces, so teams can act on fixes. We also didn't show, for example, some issue types we are adding, whether they're related to Kubernetes faults and failures at start time and run time, or at the application level; we are adding capabilities that are more knowledgeable about known applications.
B
Application awareness: how does Kafka behave, which metrics to look at; even those have Prometheus exporters. So now we can drill down deeper to understand a specific problem with Kafka, or a specific problem with an open source database like MySQL. That gives us even more granularity to understand what the problems are, and of course we start adding traces.
C
In the overarching space, I think the whole concept of bringing multiple sources of data together is really where the industry is starting to converge, and adding that intelligence; OpsCruise is not the only company doing that.
C
I think we're doing some pretty cool things, but you'll notice more and more integration of the tools, because there's a lot of value to be had from not isolating them. I think the industry is starting to realize that, so that, and putting smarts into whatever platform you use, is kind of the next phase, the future.
B
For anyone going cloud native: embrace the open source monitoring, and then add the intelligence where you really need it. That's what we see in the community, and it's taken a while, but it doesn't matter whether you're an enterprise customer with a lot of legacy applications moving to the cloud or a new startup.
A
Perfect, and perfectly on time as well, as far as the timing of the session and the Q&A goes, so it's wrapped up nicely there too. Thank you, everyone, for joining the latest episode of Cloud Native Live. It has been really great to have OpsCruise's Cesar and Alok talking about next-generation observability using open source monitoring.
A
It has been an absolute pleasure, and we're amazed and happy to see so many attendees joining in as well; thank you so much for tuning in. We will be bringing Cloud Native TV every Wednesday going forward, but next week is KubeCon, so we will take a break, because there's already so much going on in the cloud native space next week; no need for our livestream.