Taylor: Hello, everyone, and welcome to Cloud Native Live, where we dive into the code behind cloud native. I'm Taylor Dolezal, a senior developer advocate at HashiCorp, where I focus on all things infrastructure, application delivery, and developer experience. Every week we bring a new set of presenters to showcase how to work with cloud native technologies. They will build things, they will break things, and they will answer your questions. In today's session, Alok and Cesar have joined us to talk about leveraging the CNCF observability tools for Kubernetes troubleshooting.
Alok: Taylor, so: I'm Alok. I'm the founder and CTO of OpsCruise, an observability company built on open source and CNCF telemetry, and I will also introduce Cesar Quintana, my colleague, who is the principal solutions architect at OpsCruise. Thank you. So the way we thought we would do this, before we set up the demo itself and go through that, the fun part, I've only...
Alok: Great. So, as mentioned by Taylor, we are talking about how to add intelligence and observability now that we have open source monitoring, right. You know, going through the standard confidentiality and legal notice, we'll skip over that.
Alok: You know, if you will, from top down, kind of what we call vertical, as well as across, and this is happening all the time. The third complexity is dynamism. We want to be agile, right; we want to add services, change any one component, scale out, scale in; you know, some things drop, something is brought up. All of this together is like a highly complex distributed system, and just looking at a couple of metrics is no longer sufficient. You know, things are changing.
Alok: You know, get to understand what's happening in real time, so they can detect quickly, find the real issues, and get back up and running. You know, it's the same things that you've heard: mean time to detect should be fast, don't waste time with false alerts, and get to the root cause, mean time to resolution, right.
B
B
C
B
B
So
they
can
actually
detect
the
problem,
isolate
it
and
and
analyze,
and
and
and
figure
out
what
the
resolution
should
be.
So,
if
you
think
about,
if
observability
has
to
be
really
intelligent,
they
have
to
establish
this
context,
this
understanding
and
surface
that
from
all
that,
you
know
effectively
called
noise.
That's
coming
in
all
the
data
that's
sitting
in.
If
you
can't
do
that,
then
we've
actually
made
the
life
of
a
typical
devops
and
sre
very
difficult.
So
that's
what
we
want
to
do,
so
our
thesis
is
help
leverage.
This.
Alok: ...six: why would this happen, given what I have seen? So essentially, think of it as almost anthropomorphizing what an ops person would do and what they understand. If we can put all of this in place and automate this pipeline, we have reduced the amount of work that ops spends today trying to understand: what does the application look like, who's talking to whom, and when.
Alok: ...the problem there, instead of setting thresholds; and if I do, how do I analyze it? If we can collapse that and reduce that, we have really done the right service to get the right level of intelligence in observability. So this flow, if you think about it, is what we will demo today, using what you're seeing on the left: essentially, build context, understand the application graph, understand the behavior to surface problems, detect it, and analyze it in context using all the telemetry we have.
Alok: Over to Cesar, because we want to get to the demo, and he'll tell you exactly how we leverage open source monitoring and use open CNCF OpenTelemetry to do this. So...
Cesar: Actually, Alok, if you could go back to that, we'll talk really briefly about those open source platforms that we're leveraging, if you could share.
Cesar: There it is, all right, let's go. Yeah, so again, everybody, my name is Cesar Quintana; I'm a principal solutions architect here at OpsCruise. And yeah, so, to add on to what Alok was mentioning, right: the whole premise of leveraging these open source platforms is that, you know, essentially the whole data collection layer has been commoditized, right. Observability data is now easier than ever to access, thanks to these, you know, powerful open source platforms, particularly around the CNCF, right.
Cesar: So what we set out in mind, right, is to build something and leverage these amazing tools to make everybody's life easier, right. So things like this: this is an example of our architecture, of how we're leveraging all this open source data and all these open source platforms. So, as you'll notice here, if you focus on that Kubernetes cluster square on the right side, what you'll see across the top, in the green, is your workloads, right.
Cesar: You know, pod one, two, three, four; these are essentially your own applications running whatever you're doing, whether you're running an e-commerce site, a financial trading platform, etc. This is what you're running inside your actual workloads. But underneath, in that light and dark blue, are the open source tools that are now so common throughout the IT landscape and in modern application environments, right. So, towards the bottom...
Cesar: ...in the dark blue, you'll see, you know, here in this reference architecture we're showing Jaeger, Prometheus, and Loki. It could be something else; this is just an example. We can leverage logs from other sources like Fluentd; I think somebody asked about Fluentd. It could be Loki, it could be Fluentd. And then we take metrics in from Prometheus, and then for traces we're leveraging Jaeger as a backend for our particular architecture.
Cesar: But we are supporting the OpenTelemetry libraries on the client side. That's one of the really cool things about the new standards: they're now well defined, which means that you could be using a mixture in your environment, of OpenZipkin and Jaeger, or the OpenTelemetry libraries themselves, and still have a unified backend where you're able to collect all that data and leverage it and use it, even though you're technically using disparate libraries throughout your enterprise. All right.
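The mix-and-match point Cesar makes can be sketched concretely: spans arriving in Zipkin-style and Jaeger-style JSON can be normalized into one schema so a single backend can group them by trace. This is a minimal illustration with small, made-up subsets of the two wire formats, not OpsCruise's actual ingestion code.

```python
# Minimal sketch: normalize spans reported by different tracing
# clients (Zipkin-style vs. Jaeger-style JSON) into one unified
# record, so a single backend can store and query them together.
# Only a few representative fields of each format are mapped here.

def normalize_zipkin(span):
    # Zipkin v2 spans use camelCase keys and microsecond timestamps.
    return {
        "trace_id": span["traceId"],
        "span_id": span["id"],
        "operation": span["name"],
        "start_us": span["timestamp"],
        "duration_us": span["duration"],
    }

def normalize_jaeger(span):
    # Jaeger JSON spans name the same concepts differently.
    return {
        "trace_id": span["traceID"],
        "span_id": span["spanID"],
        "operation": span["operationName"],
        "start_us": span["startTime"],
        "duration_us": span["duration"],
    }

zipkin_span = {"traceId": "abc", "id": "1", "name": "GET /checkout",
               "timestamp": 1000, "duration": 250}
jaeger_span = {"traceID": "abc", "spanID": "2",
               "operationName": "GET /checkout",
               "startTime": 1200, "duration": 300}

unified = [normalize_zipkin(zipkin_span), normalize_jaeger(jaeger_span)]
# Both spans now share one schema and can be grouped by trace_id.
by_trace = {}
for s in unified:
    by_trace.setdefault(s["trace_id"], []).append(s)
print(len(by_trace["abc"]))  # prints 2
```

The point is that the "disparate libraries" problem reduces to a field-mapping step once the formats are well defined.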
Cesar: So what you'll see here, how we've architected ourselves to be built, is again around these open source platforms, whether it's Fluentd, whether it's Loki and Prometheus, etc. They serve as, you know, your data collection and your data store. You don't have to go out and pay another vendor, you know, 10 or 15x for storing just metrics, right, when you can store them in your own infrastructure; we're all doing the same thing, right, just putting them inside of a long-term...
Cesar: ...you know, bucket, right, and so now that's under your control. And so we, for example, Promtail, right: if you start looking upward through the stack, in the light blue, Promtail will run as a DaemonSet and collect logs from all your nodes and from all your containers, right. And then you have, on top of that, node exporter, right.
C
Friction
has
an
export
for
prometheus
to
grab
the
metrics
from
from
the
nodes
themselves
and
going
above
that
you'll
see
c
advisor
collecting
data
from
from
the
containers
themselves
running
on
each
node,
and
then
we
also
leverage
ksm
exporter,
pretty
awesome,
grabbing,
kubernetes
object,
status,
data
and
all
those
are
going
to
be
fed
out
into
prometheus
or
to
loki
and
if
you're,
using
traces
again
to
jager
and
really
you
know
now,
even
just
with
that
you've
got
a
pretty
darn
functional,
observability
layer
right
now
you
have
metrics
and
they
have
traces.
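All of the collectors just named (node exporter, cAdvisor, kube-state-metrics) expose samples in the Prometheus text exposition format, which the server then scrapes. A minimal sketch of parsing that format follows; the sample metric lines are invented for illustration.

```python
import re

# Minimal sketch: parse Prometheus text-exposition lines of the form
#   metric_name{label="value",...} 123.4
# into structured (name, labels, value) samples, as a scraper would
# before storing them. Comment (#) and blank lines are skipped.
SAMPLE = """\
node_memory_Active_bytes 1.234e+09
container_cpu_usage_seconds_total{pod="web-1",namespace="shop"} 42.5
kube_pod_status_ready{pod="web-1",condition="true"} 1
"""

LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse(text):
    samples = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # not a sample line
        name, labelblock, value = m.groups()
        labels = dict(LABEL_RE.findall(labelblock or ""))
        samples.append((name, labels, float(value)))
    return samples

for name, labels, value in parse(SAMPLE):
    print(name, labels, value)
```

Because every exporter speaks this one format, a single scrape loop covers node, container, and Kubernetes-object metrics alike.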
Cesar: Now you can go into different places and look at your logs. But what Alok was mentioning earlier is that smart layer, right: now you want to leverage all those pieces of data, bring them in together, and do something really, really powerful with having all that context, all that configuration data that we can grab from the Kubernetes API, and then just bring it all together.
Cesar: On top of that, you have metric data, configuration data, performance data, and event data from your cloud environments, right. More and more applications are hybrid, right; they're using, you know, whether it's VMs and Kubernetes, or serverless and PaaS; you have all these really, really hybrid environments. Again, it's this whole extreme production of data, and having one place, and easy ways to collect it all, is really what these open source platforms have allowed us to do, right.
Cesar: But going back to what I was mentioning about cloud: you also want a place where you can grab your data and bring it in. Talking about, again, serverless, or function as a service; these PaaS layers, which are only constantly growing, right. You know, you have these cloud caches and messaging services, cloud databases, etc.
C
So
you
know
what
obscure
sets
out
to
do
is
not
only
grab
that
open
source
data
in
leverage
collection
platforms,
but
also
bring
in
the
cloud
data
and
and
mess
it
all
together
and
build
something
really
really
rich
and
then
provide
actionable
data
based
on
that.
So
what
I'm
going
to
do
is
I'm
going
to
show
you
a
a
demo
of
obscures?
Oh
sorry,
look
did
you
want
to.
Cesar: Let's address that, yeah. No, so, as mentioned, right, we can take logs basically from, whether it's Loki, Fluent Bit, which is usually the thing, or Fluentd; those are usually the pieces we run into, right. And absolutely, you know, the whole point is to build a modular, flexible platform where you can grab data from, you know, whatever your preferred variant of that is, right. So yeah, absolutely; OpsCruise particularly provides support for Fluentd, Loki, and a few others as well. Yeah.
Alok: The metrics in this approach will still work, of course, with open CNCF tooling. We don't have to do proprietary agents or proprietary instrumentation; we can be sitting outside without being intrusive. So think of it that way: the real intelligence of observability is not how the metric got to us, or what it is. As long as we have coverage, that's the key, and coverage of all of these is needed. You can't just go on metrics and logs and traces independently; it doesn't give you the whole picture, you know.
Cesar: Thanks, thanks for that. All right, so now I'll share. Alok, I think you might have to stop. I can...
Cesar: Okay, awesome. So yeah, so this is our landing page for OpsCruise, and you can see there are quite a few pieces of data here. You know, this screen might look familiar to any of you who have used APM tools before: this is a real-time service topology map.
Cesar: Excuse me; a flow of how your services are interacting with each other, and I'm zooming in more, of course. Now, this is not even...
Cesar: This is nowhere close to some of the busiest environments, but you can see that it does get busy really, really quick, and that's one of the cool things about, you know, having all the configuration data and the really rich data that the underlying tools like cAdvisor collect: we get a lot of really rich object data along with the metrics and logs. So things like being able to understand, you know, the configuration data of these pieces allows us to also extract things like labels and tags.
Cesar: So when you have a busy environment, you might only want to filter, for example, on a particular namespace, right. I might only want to look at, you know, maybe the opscruise namespace, and so that really helps you cut down on some of that noise when you're trying to isolate an issue. But going back to kind of our premise, you know, what we're showing here is a mixture of quite a few different pieces of data; you're seeing the eBPF pieces. Again...
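The namespace filter Cesar applies is, at bottom, just a label selection over the discovered entities; a minimal sketch, with invented entity data:

```python
# Minimal sketch: cut down a busy service-topology view by keeping
# only entities whose Kubernetes namespace label matches a filter,
# the way the demo narrows the map. The entities are invented.
entities = [
    {"name": "frontend",   "labels": {"namespace": "shop"}},
    {"name": "cart-cache", "labels": {"namespace": "shop"}},
    {"name": "collector",  "labels": {"namespace": "opscruise"}},
]

def filter_by_namespace(entities, namespace):
    return [e for e in entities if e["labels"].get("namespace") == namespace]

visible = filter_by_namespace(entities, "shop")
print([e["name"] for e in visible])  # prints ['frontend', 'cart-cache']
```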
Cesar: We talked about cloud, so this demo happens to be running inside of AWS, but, you know, whatever cloud you're running on, you're going to have that PaaS layer, very likely. So being able to collect that data and bring it all together with your Kubernetes environments, you know, all monitored in a single place, is absolutely powerful. So if I click on, for example, that AWS RDS instance, you know, again, we're talking about the metrics.
Cesar: So if you look at this right side, we're collecting all those individual metrics: the read IOPS and the throughput, etc. This is a high-level summary, but what's important is the metrics, right. So I can go in here and look at all the individual metrics; that's one of the pillars of observability, and that's just for one entity. Same thing for a pod, right: this is a pod and a container. So if I click on a pod, same thing: I'm bringing back all this configuration data, all these labels.
Cesar: You know, what time this was created, what host it's running on. It's important to understand all these things, because when you're troubleshooting, you know: well, what time was this pod running? It was supposed to have been restarted five minutes ago; did we actually perform the restart, or was there an issue, you know, doing that rollout of the application? Well, look, it's been running since, you know, a couple...
Cesar: ...that rollout wasn't successful, right. Again, we've got metrics as well, and each entity has its own pieces of data, and it's important to be able to look at that data, again, in context for, you know, whatever problem you're troubleshooting.
Cesar: In this scenario, I clicked on this container; it happens to be the Jaeger agent. I click on this container and I'm getting, you know, additional data that's contextual for that particular container: the ports that are being exposed. But on top of that, you know, being able to see how the infrastructure is working, what things are related to what. So, for example, we have this contextual access to these different pieces, right. So if I click on this three-layer view, right, what it does is...
Cesar: It shows me this particular container and the pod it's running in, some details about it: the IP address, the image name that it's using, as well as some high-level metrics such as CPU and memory. But it also shows me what Kubernetes node this particular container is running on, as well as some of the neighbors, and the CPU and memory metrics for those neighbors, and then what cloud instance this Kubernetes node is running on top of, right.
Cesar: So when you're troubleshooting: I know I have some instances in, let's say you're running AKS, and you have some nodes in one particular subnet, or one particular availability zone, that are having connectivity issues, and you're trying to diagnose. You know, right...
C
You
know
this
little
click
you
can
understand
if
your,
if
your
container
happens
to
be
running
on
one
of
those
notes
and
things
like
the
region
and
how
much
storage
is
attached
to
it,
but
not
only
that
again,
as
we
mentioned,
the
the
the
rigorousness
of
all
this
of
all
this
data
and
the
ease
of
collecting
makes
it
really
really
simple
to
bring
it
all
together
and
now
we
can
look
at
the
infrastructure
map
that
we
call,
which
is
essentially
a
cloud
map
and
now
we're
looking
in
the
context
of
this
particular
cloud
instance
and
we're
looking
at
this
ec2
virtual
machine
and
looking
at
the
configuration
of
that
in
the
text
right
and
I'm
just
kind
of
showing
behind
the
scenes,
the
the
all
the
open
source
data
that
we're
actually
collecting
and
how
even
that
open
source
data
by
itself
makes
really
powerful.
Cesar: But once we combine the intelligence, which I'll talk about in a second, that's where things really start to take off. But as we mentioned, we're collecting data from the Kubernetes API and from the containers, so that's where we're grabbing, you know, the individual container metrics and the node metrics. We also have an understanding, for example, at a per-node view, right. So, instead of looking at it from a kind of application-centered view, I can look at the node level. Let's clear out some of these filters now.
Cesar: So you see we have five nodes running, and now I'm looking at each individual node, and I can see the workloads that are running on top of that node. I can click on metrics and get the metrics for that particular node; it'll load in just a second, but I'll go back, and then we can actually look at the configuration for the particular node itself.
Cesar: Yeah, so again, we're collecting all the configuration and metadata, not only of the containers themselves, but even of the nodes that you're running on; so things like the memory utiliz... sorry, the memory capacity, whether the node is ready. So you'll see here max memory, max storage, what version of Kubernetes they're running.
Cesar: And so, you know, here we see that we're running version 1.17 of the Kubernetes node, which is actually fairly updated, and the kernel version of the operating system that it's running on, et cetera. So we're bringing, again, all this data together, which is really, really empowered by all these open source layer tools. We're not using custom agents, we're not doing anything, you know, special; it's just leveraging all this data, but bringing it all together in a single place.
Cesar: On top of that, you know, I mentioned it's important to cover things like PaaS services and serverless. So, again, we also collect that kind of data. So you'll notice here: you saw an RDS instance, and I think I might have shown a load balancer as well. In this case, in this environment, I have, you know, an API gateway running with an S3 call-out, actually via serverless. So you'll see this API gateway, and again, I'm grabbing the data from that particular API gateway.
Cesar: Just like for the containers, we saw that particular entity's metadata.
Cesar: Now here it is for the API gateway, and some of the metrics as well; and the same thing for the serverless functions, right. I can see the ARN for that particular serverless function, the region, and I can click on metrics to drill down to that. So the whole point is to bring something that's all together. And finally, you know... actually, before I show that: I also did mention traces, and let me actually share this screen, because I think I'm not...
Cesar: I do want to show the traces before jumping on to something else. There we go, so, again: we also have our trace map view that we just recently announced, and so when you're leveraging, as we mentioned, distributed tracing, you know, we can collect all that data, again, in a single space, and now what we're doing is collecting the individual traces, and actually we're doing something pretty cool, which is what we call the trace map, and an identification of these trace paths.
Cesar: Is this... hopefully this is a little bit better?
Cesar: Let me know if there are still visibility issues, but I bumped it up just a little bit. Yeah, so again, I'm just showing off the tracing capabilities, again, just bringing everything all together in a single place. You can see here: this is the trace map, showing the different interactions from the frontend to the ad service to the product catalog service. But one of the really cool things, that is kind of unique, that we've been able to develop...
Cesar: Hopefully, hopefully this is a little bit better. I think I've hit the limit of my zooming-in capabilities; sorry, guys, I always thought it was a little bit bigger. Hopefully this is somewhat readable for you. Okay, so we've got the traces, we've got the trace maps, and... oh, it looks like I'm getting some, too much noise on my machine, so sorry about that; I'm seeing that in the chat. Hopefully I've turned off the notification sounds here.
Cesar: Hopefully that will stop interrupting. Okay, so we've got the trace map view, but we're also discovering what we call the trace paths. So these trace paths are not just... sorry, guys, give me just one second, I'm trying to...
Cesar: My apologies to everyone. Okay, so let me head back here. Okay: you know, we have auto-discovery, essentially, of not only the transactions themselves, which you are used to seeing in distributed tracing platforms, but we are also grabbing the identification of the paths themselves. You might have a transaction...
C
You
know
for
one
of
these
products
that
you
know
might
be
a
slash
checkout,
but
you
might
have
a
different
types
of
checkouts
for
maybe
a
class
right,
maybe
you're
selling
a
class
on
your
email
converse
site
versus
a
product
right.
So,
even
though
you
know
they're,
both
called
checkout
one
might
go
to
ad
service
and
another
one
might
go
to
the
checkout
services
and
product
catalog
service.
Cesar: So even though they're both named the same, we identify those differences between them, and then also perform automated anomaly detection, and profile those transactions separately from each other, right.
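Keying transaction profiles on the whole trace path, rather than on the endpoint name alone, can be sketched like this; the trace data and latencies are invented for illustration.

```python
from statistics import mean

# Minimal sketch: two traces may share an endpoint name ("/checkout")
# yet follow different service paths. Keying profiles on the path,
# not just the name, gives each variant its own baseline.
traces = [
    {"endpoint": "/checkout", "path": ("frontend", "ad-service"), "latency_ms": 40},
    {"endpoint": "/checkout", "path": ("frontend", "ad-service"), "latency_ms": 44},
    {"endpoint": "/checkout", "path": ("frontend", "checkout-service", "catalog"), "latency_ms": 180},
    {"endpoint": "/checkout", "path": ("frontend", "checkout-service", "catalog"), "latency_ms": 176},
]

profiles = {}
for t in traces:
    key = (t["endpoint"], t["path"])   # the trace path, not just the name
    profiles.setdefault(key, []).append(t["latency_ms"])

for (endpoint, path), latencies in profiles.items():
    print(endpoint, "via", " -> ".join(path), "avg", mean(latencies))
```

Profiling the two /checkout variants separately keeps the slow catalog path from masking, or falsely alarming on, the fast ad-service path.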
So that is, you know, some of the tracing. We won't delve too far into this, because I want to show really some of the magic behind what we can do now that we have all that really rich open source data, right. So let me stop sharing and re-share my other screen; just give me a second here.
Cesar: All right, all right. So, you know, some of the things that we can do now that we have all this open source data is that we can now start doing anomaly detection: detecting misconfigurations, misbehaviors.
Cesar: You know, one of the things I actually did not show, if I go back here really quickly, is that we're also collecting configuration data, not only at this kind of high-level metadata kind of view, but we're also showing the entire manifest. So if I click, and I'll just show what I did there: if I click on detailed view for this particular pod, right, now I'm looking at the actual manifest for this particular pod.
Cesar: So I can look at the details of what exactly is going on, without having to go inside the command line and figure out, you know, kubectl get pod -o yaml. This is way simpler, and it also helps keep everything in context and keep you inside of a single place.
Cesar: But now, with all this really rich data, and, you know, the other thing we do is we have what we call curated knowledge, because on top of all this, you do need to understand how these systems interoperate with each other and what kind of dependencies they have on each other. That's why we do build that relationship view, leveraging all the data. That's why we want to know what container is running on what pod...
C
I'm
sorry
on
what
node
and
what
node
is
running
on
top
of
what
piece
of
infrastructure
is
that
we
know
when
a
piece
of
infrastructure
is
down.
We
know
that
it's
affecting
you
know
the
the
container
that's
hosted
on
it
and-
and
you
know,
there's
a
lot
of
nuance
and
variance
to
the
kind
of
problems
that
can
arise.
But
having
again
this
richness
of
this
open
source
data,
it
makes
it
all
possible.
So
I'll
show
a
couple
of
a
couple
of
things
that
we
do
here.
Let
me
find
an
alert.
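The relationship view described here is essentially a hosting graph; a minimal sketch of walking it to list everything a failed piece of infrastructure affects. The topology is invented for illustration.

```python
# Minimal sketch: a relationship view as a parent -> children graph
# (cloud instance -> node -> pod -> container), so that when one
# infrastructure piece fails we can walk down and list every
# workload it affects.
children = {
    "ec2/i-abc123": ["node/worker-1"],
    "node/worker-1": ["pod/web-1", "pod/cart-cache"],
    "pod/web-1": ["container/nginx"],
    "pod/cart-cache": ["container/redis"],
}

def affected_by(entity):
    # Depth-first walk collecting everything hosted on `entity`.
    out = []
    for child in children.get(entity, []):
        out.append(child)
        out.extend(affected_by(child))
    return out

print(affected_by("node/worker-1"))
# prints ['pod/web-1', 'container/nginx', 'pod/cart-cache', 'container/redis']
```

The same walk upward (child to parent) is what lets a container-level symptom be traced back to a node or cloud-instance cause.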
Cesar: I think I was looking at this alert a little bit earlier, so I'll explain a little bit what this is, right. So in this case, we have a deployment problem, right, on our particular web server deployment: we're supposed to have a total of three replicas, and in this case... and, you know what, I'll bump up the text a little bit, because I know that was asked before. So, we're supposed to have a total of three replicas.
Cesar: In this case, we've only got two available replicas, and this has been going on for a little bit. So down here, you know, we provide some details: it's part of the shopping-cart namespace, it's the web server deployment, and here are some, you know, additional key-value pair details. But we'll go to the fun view; I know some of you guys love reading JSON, but I kind of like the UI just a little bit more.
Cesar: So when I click on this analyze view, what it shows us is what we call the contextual RCA, which is our fishbone RCA, right. So in this case, what we're showing is failure categories, across the top and bottom, that are affecting this particular deployment. So again, all this is being collected just through, you know, querying the Kubernetes API, and then the relationship of collecting the events and the containers, and linking all those pieces together.
Cesar: So we have a replica set scaling issue, right: we're having an issue scaling up an additional replica of that particular image, and now we're getting, actually, a back-off restart as well; but this is all really associated with the startup failure, right. And if I click on that, what it's going to tell me is that I have an invalid image name, right: so opscruise is spelled with one i, and it looks here like somebody spelled opscruise with two i's, and so that's a bad image name.
Cesar: You know, it took us all of, what, you know, three, four seconds to figure out that one of our replicas isn't coming up because of a bad image name. So it's those kinds of things, the richness of the data, that allow us to build these really, really quick root cause analysis pieces into something like OpsCruise, right. So, yes, you can do this from the command line.
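Under the hood, this kind of fishbone bucketing can start from the waiting reasons Kubernetes reports in pod status and events; a minimal sketch, with a category mapping that is illustrative rather than OpsCruise's actual taxonomy:

```python
# Minimal sketch: bucket Kubernetes container "waiting" reasons
# (as reported in pod status / events) into the kind of failure
# categories a fishbone RCA groups by. Category names are invented.
CATEGORIES = {
    "InvalidImageName":  "startup failure",
    "ErrImagePull":      "startup failure",
    "ImagePullBackOff":  "startup failure",
    "CrashLoopBackOff":  "restart failure",
    "CreateContainerConfigError": "missing configuration",  # e.g. absent ConfigMap
}

def categorize(waiting_reason):
    return CATEGORIES.get(waiting_reason, "uncategorized")

# e.g. a replica stuck because the image name was misspelled
# (hypothetical status record, mirroring the demo's two-i's typo):
status = {"pod": "web-server-7d9f", "reason": "InvalidImageName",
          "message": 'couldn\'t parse image reference "opscruiise/web:1.0"'}
print(categorize(status["reason"]))  # prints startup failure
```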
Cesar: It's, you know, a little bit more work; it'll probably take anywhere from, I don't know, 30 seconds to a couple of minutes. But, you know, multiply this times a thousand, times 5,000, that can happen in a month.
Cesar: You know, that's a lot of time saved for operations teams, right. And you'll also notice other ones; some of these are more complex, and, you know, these are just building blocks to what I'm going to show you in a sec, of these individual kinds of problem detections and anomaly detections. But you'll notice other categories, so things like a missing ConfigMap, right: if you reference a ConfigMap in your manifest that does not exist...
Cesar: You know, you're going to have a failure of your pods, so we'll highlight those things, or failed volume mounts, or even bad image tags. I think, I think I might actually have a bad image tag in here that I was looking at; it's a very, very similar scenario, but for the cart server. If I click on analyze: yep, you know, same kind of symptoms, no replicas, scaling issues, we're having back-off restarts going into a crash loop; but, you know, in this case we have an invalid image tag.
Cesar: This particular image tag does not exist right now. The other thing that I didn't go too far into, but that really is absolutely key, is machine learning, right. So, for all the individual services that you deploy onto your clusters, what happens is that, with data being collected from cAdvisor and from node exporter and from the discovery pieces, what we do is create a really rich behavior model, right. We detect what is normal behavior for your individual services, right.
Cesar: So, you know, we don't just look at one or two metrics like error rates and response time; actually, each one of the entities that I've shown you has its own behavior model, and there's a bunch of others that I didn't show you as part of this demo, but things like, if you're using databases like MongoDB, or a JVM, or an nginx container, right, and then the generic containers themselves, the nodes themselves.
Cesar: They all have their own behavior models, and we pick up a mixture of a lot of different metrics to understand what is normal behavior; and then, when we find what is abnormal, we have these types of alerts that are prefixed by ML, telling us that there is some sort of ML-detected performance violation, right. So if I click on this, in this scenario, you know, again, I'm going to get some details as to what happened, right.
Cesar: I get, you know... let me zoom in a bit: network transmit bytes increased by 540 percent, and layer-4 bytes for the outbound traffic increased, and inbound transmit bytes decreased, actually. So we don't only detect increases, but also abnormal decreases as well. But just like in the other scenarios...
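Flagging deviations in either direction against a learned baseline can be sketched with a simple z-score test. The baseline values, window, and threshold here are invented, and a production behavior model would be far richer than this:

```python
from statistics import mean, pstdev

# Minimal sketch: flag a metric sample as anomalous when it deviates
# from its learned baseline in EITHER direction, so both a large
# transmit-byte spike and an abnormal drop would trigger.
def percent_change(baseline, current):
    return (current - baseline) / baseline * 100.0

def is_anomalous(history, current, n_sigma=3.0):
    mu, sigma = mean(history), pstdev(history)
    # z-score test: unusually high OR unusually low both count
    return abs(current - mu) > n_sigma * sigma if sigma else current != mu

history = [100.0, 104.0, 98.0, 102.0, 96.0]   # bytes/s baseline window
spike, drop = 640.0, 2.0

print(is_anomalous(history, spike), round(percent_change(100.0, spike)))
print(is_anomalous(history, drop))
```

Both the spike and the drop fall far outside three standard deviations of the baseline, while a sample near the mean would not.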
Cesar: If I click on analyze, I can get a fishbone representation of what exactly is going on with the metrics, and why the ML triggered an anomaly in the first place. And so I'm going to zoom out just one piece. Just like you saw for the, you know, Kubernetes-specific deployment scenarios, now, in this fishbone RCA, we're looking at a container view: this particular cart cache had, you know, some deviation in its metrics. And actually, before...
C
Looking
at
at
this,
I'm
going
to
go
back
just
to
the
to
the
summary
screen
and
show
you
down
here.
If
I
click
more
details,
you
know,
speaking
about
the
ml
and
all
the
metrics
we
take.
These
are
all
the
different
metrics
for
just
the
generic
container
model
that
we're
looking
at
right.
So
again,
it's
not
just
one
or
two
or
three
metrics
we're
looking
at
transmittal
bytes
packets
in
packets
memory
failures,
cpu
utilization.
Cesar: All this data, particularly for the container, is again provided by cAdvisor, an open source tool, right. So again, going back to the analyze button: now we're seeing the actual pieces that actually triggered the ML, and so now you'll notice that the fishbone has changed from our startup failure; now we're showing memory and file system and CPU, and so, right away, we'll show you in red. You don't have to go and look at a chart for this specific thing.
Cesar: It's here, right. So I'm seeing CPU utilization has increased by close to 50 percent. I'm looking at demand, which is incoming requests: the response time has increased by over 1,700 percent. I'm looking at outbound, the supply side: response time has increased by 2,200 percent for outbound requests from cart cache. And then, not only that, our response size has increased from one to close to eight megs, right. And then, bringing in the Kubernetes layer, it's this whole image change, right; so again, bringing in the data from the Kubernetes API...
Cesar: I can see that I've got a recent image change that's likely contributing to this phase. Now, again, I'm glossing over a few details, because, in the interest of time, I want to show you guys how we bring, you know, a couple of these things together. This is, you know, an ML alert, and, by the way, you can see this automatically charts those important metrics down here.
Cesar: So you can see their behavior during the time of the anomaly, and, as mentioned, you know, you can drill down into any logs that might be coming in. Actually, I should probably show that; I don't think I did. So here's another example of an anomaly, a database server, and you'll notice here that you have different contextual access, right: application state. I'll show that, but we have a time-travel capability where, with all this data, the metric data from Prometheus, the log data from Fluentd or from Loki, the trace data...
Cesar: ...all this, you know: we build that real-time map that you guys saw, and all the configuration data; we take snapshots every five minutes, and I'll show you guys that in a second. But you can go back in time to see how your system was configured during the time of this particular anomaly. In this case, this goes back a day.
Cesar: So if I click that, it'll take me back, you know, one day before, and show me the entire config of my entire state at that time; but we'll go into that in a second. I can click on metrics to understand the metrics for that database server, or any events that are related to that. In this case, I want to show logs. So if I just click on the pod, or the container logs... oh, maybe it wasn't logging there, but I do want to show that we have contextual access to the logs.
Cesar: Actually, let me find... I just want to show, because I think we did not actually show logs. Let me see; I think maybe node exporter will be logging. Sorry.
C
Keep going, yeah, absolutely. So the thing I wanted to show is logs, because I absolutely did not show that, even though it's super important. Anything that's logging, we pick that up from your standard out. If you click on anything, whether it's an anomaly or, in this case, just a pod (I have a pod open), you get what we call the quick view. From its quick view,
C
One
of
the
one
of
the
links
you
have
is
for
logs
right,
so
I
can
just
click
on
logs
and
that
takes
me
straight
into
the
logs
for
that
particular
service.
Now
this
is
pretty.
You
know
static
logs
here,
but
I
can
you
know
it
is
searchable
right,
so
I
can
look
for
requests,
for
example,
or
conversion.
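At its simplest, that kind of log search is a case-insensitive substring filter over collected lines; a toy Python stand-in (purely illustrative, not how Loki actually indexes or queries logs):

```python
def search_logs(lines, term):
    """Return the log lines containing `term`, case-insensitively."""
    needle = term.lower()
    return [line for line in lines if needle in line.lower()]

logs = [
    "2023-01-01T00:00:01 GET /cart 200 request served in 12ms",
    "2023-01-01T00:00:02 conversion event recorded for user 42",
    "2023-01-01T00:00:03 GET /health 200",
]
print(search_logs(logs, "request"))     # the first line only
print(search_logs(logs, "conversion"))  # the second line only
```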
C
So
you
know,
depending
where
we
get
it
from
yeah,
correct
yeah.
I
I
think
I
don't
have
a
problem
here
that
has
logs
right
now,
but
but
we
do
surface
that
as
well.
So
if
you're
having
an
anomaly,
you
can
go
straight
into
the
logs
for
when
they're,
active
and
that'll
show
now.
What
I
do
want
to
show
with
all
this
really
put
together
is
I
I
I'm
going
to
show
you
an
alert
right,
so
we
we
have.
I
showed
you
guys.
C
You
know
how
we
collect
all
the
different
data,
the
architecture
right
again
we're
leveraging
just
purely
open
source
tools
here
to
collect
the
data
from
from
you
know
whether
it's
vms
or
whether
it's
kubernetes
etc
from
the
application
level,
mongodb
exporters
or
nginx
exporters,
as
well
as
the
traces,
whatever
open,
telemetry
compatible
library,
is
all
basically
built
on
open
source,
but
now
right.
What
we
have
here
is
again.
C
I
also
show
you
the
anomalies
on
how
like
specific
kubernetes
detection,
and
then
I
showed
you,
the
ml,
how
we,
how
we
automatically
detect
performance,
deviation
and
again
lots
of
different
metrics.
So
in
this
scenario,
we're
kind
of
tying
everything
together
right,
so
I
have
a
response
time
slo
breach
on
the
on
my
nginx
server.
So
I'm
going
to
click
on
that
and
again
here's
some
some
details
right.
I
have
an
slo
of
five
seconds.
C
My
response
time
is
over
15
seconds,
so
I
want
to
see
what's
going
on
right,
if
I
click
on
analyze,
what
this
is
going
to
do,
I'm
going
to
close
we'll
come
back
to
this
summary
in
a
second
I'm
going
to
close
that
piece.
What
this
is
doing
is
now
we're
showing
a
slice
of
the
actual
app
map
that
we
were
looking
at
earlier,
but
now
it's
it's.
It's
focused
on
the
time
frame
and
in
the
context
of
this
particular
anomaly
right,
so
we
so
your
route
is
essentially
here.
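The alert condition itself (a five-second response time SLO against an observed 15+ seconds) reduces to a threshold comparison; a hedged sketch of how such a breach check might look, with invented names:

```python
def evaluate_slo(observed_seconds, slo_seconds):
    """Return (breached, ratio): whether the SLO is breached and how far
    over (or under) the target the observed latency is."""
    return observed_seconds > slo_seconds, observed_seconds / slo_seconds

# Figures from the demo: a 5 s SLO, response time above 15 s.
breached, ratio = evaluate_slo(15.3, 5.0)
print(breached, round(ratio, 2))  # True 3.06
```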
C
You
know
at
nginx
you're,
seeing
a
slowdown,
but
we've
also
identified
what
downstream
services
are
involved
right.
So
we
have
nginx
itself
right.
So
this
is
the
kubernetes
service,
the
part
of
the
container
and
same
thing
service
spot
container
for
web
server
redis
service,
whereas
pod
redis
container
they've
got
a
cart
server,
service,
pod
and
container
you'll
notice.
Immediately
in
the
red
we've
highlighted,
so
we're
doing
fall,
domain
isolation
as
well.
C
Nobody
had
you,
don't
have
to
call
the
nginx
micro
service
team
whoever's
managing
that
you
don't
have
to
call
the
web
server
micro
service
team.
Whoever's.
Managing
that
could
be
a
couple
of
the
same
team
could
be
a
couple
of
different
teams.
You
don't
have
to
reach
out
to
them.
You
don't
have
to
go
inside
your
tools
and
look
at
the
metrics,
for
these
particular
we're,
showing
you
they're
healthy
right.
C
So
what
the
data
has
shown
us
from
the
data
we've
collected
from
these
containers
as
well
from
the
network
data
and
the
configuration
data
and
combined
with
rml
that
intelligent
layer
of
the
operations
is,
we've
highlighted
the
red
pieces
right.
So
our
container
for
redis
is
red.
Our
card
server
service
is
red,
and
so
our
potting
container,
so
we'll
kind
of
take
this
in
the
chain
and
see
what's
going
on.
So
we've
identified,
we
have
an
slo
failure
up
here,
we're
responding,
really
really
slow.
C
Now,
if
I
click
on
the
next
piece
in
the
chain,
I'm
showing
you
know
that
redis
container
is
problematic.
If
I
click
on
that,
what
it's
going
to
do,
it's
going
to
show
us.
This
is
a
separate,
technically
a
separate
anomaly
from
the
nginx
one,
but
the
ml
has
detected
that
this
is
very
much
related
and
you'll
see
a
few
different
failures
right,
so
you'll
see
that
we're
getting
an
increase
in
throttling
on
the
cpu,
the
user.
Second,
solar
spin
on
the
cpu
has
increased
by
about
10
percent,
but
really
interesting.
C
Actually,
here
is
you'll
notice.
The
response
time
normally
as
at
2.94
milliseconds
right
now
we're
at
over
two
seconds.
This
was
automatically
detected
and
then
also
super
important
is
our
error
rate
right?
Basically,
you
usually
have
zero
errors
right
now
our
error
rate
has
jumped
up
sorry
to
36
out
of
every
single.
Basically,
every
single
request
has
essentially
gone
into
an
error
mode,
so
something
is
wrong,
so
we're
gonna,
we're
gonna
go
back
here
and
and
just
see
what
rca
is
pointing
at
so
redis
is
calling
card
server.
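The detection described here, normally 2.94 milliseconds but now over two seconds, is at heart a comparison of current behavior against a learned baseline. A deliberately crude stand-in (a multiple-of-median rule; the actual ML is of course more sophisticated):

```python
from statistics import median

def is_anomalous(history_ms, current_ms, factor=10.0):
    """Flag `current_ms` when it exceeds `factor` times the historical
    median: a crude stand-in for a learned performance baseline."""
    baseline = median(history_ms)
    return current_ms > factor * baseline, baseline

normal = [2.7, 3.1, 2.9, 3.0, 2.94]  # typical Redis response times (ms)
flagged, baseline = is_anomalous(normal, 2100.0)  # now over two seconds
print(flagged, baseline)  # True 2.94
```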
C
Now,
if
I
click
on
cart
server,
I'm
gonna
see
the
alert
very,
very
clear
this
service,
the
cart
server,
doesn't
have
any
pod
to
serve
requests.
That's
a
very,
very
clear
indicator
that
obviously
redis
is
experiencing
a
bunch
of
response
time
failures
and
now
error
rate
failures,
because
there
are
no
requests
behind
this
kubernetes
spot.
There
are
no
pods
behind
this
service
to
serve
any
requests
and
to
look
into
just
a
little
bit
more
detail
if
I
click
on
this
particular
pod.
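"No pods to serve requests" is detectable by matching the service's label selector against ready pods, which is how Kubernetes populates a Service's endpoints. A simplified sketch with plain dicts standing in for API objects:

```python
def pods_backing_service(selector, pods):
    """Return the pods that match the service's selector and are ready,
    i.e. the endpoints the service would actually route to."""
    return [
        p for p in pods
        if p["ready"]
        and all(p["labels"].get(k) == v for k, v in selector.items())
    ]

selector = {"app": "cartserver"}
pods = [
    {"name": "cartserver-7d4f", "labels": {"app": "cartserver"}, "ready": False},
    {"name": "redis-5c9b", "labels": {"app": "redis"}, "ready": True},
]
backing = pods_backing_service(selector, pods)
print(len(backing))  # 0: the alert condition, nothing to serve requests
```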
C
Now
on
the
cart
server
to
see,
if
I
can
clean
what's
going
on,
it
looks
like
I'm
having
to
back
off
restart
and
if
I
look
at
the
details
of
that
alert,
it'll
actually
show
me
a
little
bit
more
detail
again.
These
are
all
separate
but
linked
together
problems.
If
I
click
on
the
analyze
tab.
Now
it's
going
to
show
me
the
real
root
cause
right.
We've
got
an
invalid
image
name
as
we
talked
about
earlier,
so
this
broken
image,
name
with
two
eyes
in
india.
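A pod stuck on a bad image surfaces in its container statuses with waiting reasons such as `InvalidImageName` or `ImagePullBackOff`. Here is a sketch of scanning a parsed pod object (shaped like `kubectl get pod -o json` output) for those reasons; the image name is invented to mimic the demo's double-"i" typo:

```python
IMAGE_FAILURE_REASONS = {"InvalidImageName", "ErrImagePull", "ImagePullBackOff"}

def image_failures(pod):
    """Yield (container, reason, image) for containers stuck on image problems."""
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting")
        if waiting and waiting.get("reason") in IMAGE_FAILURE_REASONS:
            yield cs["name"], waiting["reason"], cs["image"]

pod = {
    "status": {
        "containerStatuses": [{
            "name": "cartserver",
            "image": "cartserviice:v2",  # hypothetical typo: two i's
            "state": {"waiting": {"reason": "InvalidImageName"}},
        }]
    }
}
print(list(image_failures(pod)))
```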
C
This
is
this.
This
is
I'll
zoom
in
a
little
bit,
so
you
can
see
that
isn't
too
much,
but
you
can
see
that
obscures
india
here,
showing
two
eyes
right.
So
we
have
this
invalid
image
name,
and
that's
really,
you
know
the
the
the
root
cause
of
the
issue,
and
it's
and-
and
it's
all
shown
right
here
in
a
matter
of
seconds,
all
right
in
general-
is
experiencing
a
a
response.
Time
slow
down
radius
is
saying
that
we've
got
an
increase
in
response
time
and
alert
and
errors
hard
servers
saying.
C
Well,
I
don't
have
any
pods
to
serve
and
the
pod
itself
is
saying.
Well,
I
can't
start
because
somebody
gave
me
a
bad
image
name
all
this
in
a
matter
of
you
know
about
20
seconds
right.
It
took
me
obviously
close
to
a
minute
and
a
half
to
explain,
but
all
this
understanding,
the
kubernetes
level
up
to
the
application
level
and
how
they
are
affecting
each
other
is
all
powered
by
these
open
source
tools.
C
Plus,
you
know
the
intelligent
layer
on
top,
which
is,
in
my
opinion,
pretty
darn
cool
in
the
interest
of
time.
There
are
things
I
wanted
to
show.
I
didn't
want
to
show
time
travel,
but
I
think
we're
pretty
close
out
of
time.
I
want
to
open
it
up
for
questions
so
with
that
I'll
turn
it
back
to
you
local
time,
yeah.
B
Really, what observability has to do, to have this intelligence, is be able to understand the full context of the application, across everything, across all the dependencies; users should not have to do that. So that's what we need to fill in. And then understand the application profile, the behavior, so they don't have to worry about how to detect problems or set thresholds.
B
We
want
to
take
that
off
the
table
and
then
contextually
analyze
everything,
because
now
we
have
rich
data
in
this
whole,
distributed
systems
that,
whether
it's
infrastructure
or
kubernetes
related
or
down
to
the
application
they
all
think
together.
You
don't
you
don't
need
six
different
folks,
looking
at
traces
logs
events
alerts
to
do
that.
That's
the
role
of
observability
in
this
new
world
and
thanks
to
open,
telemetry
and
open
source
term,
it
is
possible
to
do
that.
So
think
of.
B
The proof point is that you don't need to worry about running multiple siloed tools to really build that intelligence, or to reduce the amount of effort needed. I'll pause there. That was the whole point: take advantage of OpenTelemetry and the open source tooling the CNCF has been fostering. We are firm believers in that, and I hope you can leverage it too.
C
That's a great question; let me see if I have it in this environment here. So for a typical deployment, into maybe an on-prem cluster, we leverage Helm, right? Again, another open source tool. So we leverage Helm; it's these commands, about three or four... well, it's actually five commands.
A
Inside
could
you
bump
that
up
just
a
little
bit
yeah?
I
absolutely.
C
Yeah, I absolutely will; thank you. So essentially, if you have these existing tools already (because, as I mentioned, these are essentially the de facto standard for open source monitoring in all these modern environments, and most of the people we run into already have them), it's actually a little bit simpler. But if you don't have these tools, I think it's this last command that will deploy
C
You
know
all
the
the
the
the
the
open
source
tools,
if
you
don't
have
them
underneath
already,
but
essentially
it's
through
helm
right
these
these
commands.
You
know
these
these
five
commands
and
that
gets
you
from
a
green
field
cluster.
To
up
and
running
literally,
I
mean
copy
and
paste.
I
mean
you're
up
and
running
in
about
three
minutes
four
minutes
and
you
have
the
entire
environment
that
I
showed
the
only
thing
that's
not
available
right
off.
The
bat
is
the
ml,
because
it
you
know
again,
it
takes
a
couple
of
anywhere
from
I've.
C
Seen
I've
seen
ml
alerts
come
back
in
a
couple
of
hours
to
you
know,
24
hours,
this
is
usually
usually
that
sweet
spot,
but
everything
else
that
you
saw
within
you
know
three
to
five
minutes
of
deploying
you're
getting
all
that
data.
So
this
is
how
simple
it
is.
A
Awesome
awesome.
Thank
you
so
much.
The
next
question
I
saw
was:
is
it
free
or
or
kind
of
what
levels?
How
does
it
work?
Is
it
a
sas
or
is
it
something
you
can
host
on
your
own.
B
Interesting. So you're talking about when you don't have... what if you can collect these metrics yourself, you're saying, and push them to us? It's a little trickier with this; it depends on the context. Maybe we'd have to dig into a little more specifics, you know, what data, because remember, in order to understand the application context, we pull everything, being able to see the dependencies. So seeing
B
That
so
it
will
probably
be
specific,
so
we
can
take
this
offline,
and
this
you
know
attendee
has
something
specific
that
we
can
follow
up
and
things.
C
Yeah
yeah,
I
will
mention
you
know
it
it.
You
know
it
depends
on
how
your
edges
it
just
compare
like
if
you
flat
out,
don't
have
like
access
to
to
like
export
metrics
right,
I
mean
again,
you
know,
obscures
itself,
isn't
really
doing
much
on
the
collection
side.
It's
really
around.
C
You
know
having
prometheus
on
the
cluster
and
having
loki
or
fluentd
on
the
cluster
to
collect
that
data.
Really,
if
you
don't
have
access
to
there,
you
know
that
that's
that's
something
to
be
explored,
but
speaking
about
edge
itself.
You
know
we
have
recently
published,
like
a
joint
blog
with
verizon,
where
we're
talking
about
and
again
I
don't
know
what's
going
on
with
my
dnd
button.
I
know
somebody
mentioned
it.
I
don't
know
what's
going
on,
I
do
not
disturb.
I
promise.
I
turned
it
on,
but.
C
You
know
we
when
you're
running
kubernetes
clusters,
for
example
at
the
edge
or
running
workloads
at
the
edge.
It
absolutely
is
a
supported
model
right
again,
that
joint
blog,
I
mentioned
is
is
is
launching
a
kubernetes
cluster
on
aws
wavelength
and
you
know
with
kubernetes
and
observability
built
in
with
you
know,
with
the
op
screws,
and
you
know
it
functions
perfectly
fine.
But
again,
if
you
have
like
a
really
really
locked
down
edge,
and
that
might
be
something
we
can,
you
know
talk
offline
and
feel
free
to
reach
out.
A
Awesome
awesome.
Thank
you
next
question
next
question
is:
is
aws
bottle
rocket,
supported.
A
Yeah
amazon's
operating
system
bottle
rocket.
I
believe
that
it's
it's
kind
of
built
for
containers
and
and
running
things
on
that
front.
My
initial
thought
would
be
yes
because
of
the
interfaces
that
you've
chosen
to
bind
to.
You
know
cni
csi,
all
of
those
things,
but
so
long
as
those
are
supported.
That
should
be
good,
but
not
sure.
If
that
might
might
correlate
to
the
system,
metrics
might
be
the
specific
question.
C
Yeah
correct,
so
it
should
be.
You
know
I
don't
I
don't
remember
if
there
was
actually
somebody
that
is
using
aws
bottle
rocket,
but
again
we
actually
don't
build
in
necessarily
too
much
into
the
into
the
os,
because
as
long
as
they're
running
like
the
minimum
required
kernel
like
on
the
on
the
on
the
nodes
right,
which
is,
I
believe,
kernel
415
of
linux
and
up
yeah,
I
mean
we,
we
shouldn't
have
any
any
issues
supporting
that.
C
If
you
want
to
explore,
I
I
highly
suggest
signing
up
for
the
for
the
free
version
and
it
should
work.
I
don't
see
why
it
wouldn't
so
yeah
that
that's.
A
Awesome, awesome. The next question is: does the SaaS support SSO and SAML login?
C
Yes,
absolutely
yes,
so
I
believe
the
free
version
does
doesn't
have
like
that
quality
of
life.
There's
there's
some
of
those
things
that
you
know
are
like
more
enterprise
features,
but
yes,
absolutely
many
of
our
customers
are
using
like
azure
id
or
they
might
be
using
octa,
etc.
We
absolutely
support
that.
A
Excellent
next
question
is:
I
promise
keep
keep
peppering
you
until
we're
out
of
time
what
resources
does
ops
cruise
require,
I'm
guessing
that
might
be
pertinent
to
like
kubernetes
kubernetes
primitives,
like
node
storage
deployments,
config
maps.
C
Right. Typically, you know, for the actual open source collection tools, each one has its own requirements, but they're really small. I mean, we're talking about hundreds of millicores to run the open source collectors, like cAdvisor and node exporter; those are all really, really lightweight. The only piece that utilizes more resources is really Prometheus, and that just depends on, you know, how many objects you have in the cluster. We typically, for a small-size
C
Cluster
recommend,
maybe
like
in
maybe
like
a
two
two
cpu,
eight
or
twelve
gig
machine,
but
you
know,
as
you
scale
up
and
the
amount
of
objects
that
you
know.
C
I think I've seen (and I might be misremembering, so if you want more details, like really hard numbers, please reach out), I think we've seen close to five thousand containers being monitored by, at that point, maybe a 64-gig, four-CPU node powering Prometheus, and it's not fully used. But sometimes, when you scale out and you get a big spike in objects,
C
That's
really
when
it
uses
that,
but
prometheus
is
really
the
biggest
one
and
you'll
find
this.
You
know
it's
not
an
options,
it's
a
prometheus
piece,
but
other
than
that,
all
those
components
I
mean
you're,
talking
extremely
extremely
small
resource
requirements,
really
negligible
on
your
clusters.
A
Absolutely
absolutely
well
with
that.
Unfortunately,
we
are
at
time.
I
really
do
appreciate
everybody
reaching
out
and
asking
those
questions
like
all
good
things.
Streams
have
to
come
to
an
end,
so
we
are
at
that
point,
but.
A
if there is anyone looking to reach out to either of you, is there a good place to direct those questions?
B
Sure
you
can
reach
us
to
info
obscures.com
to
be
generic
enough,
and
you
can
also
ping
us
on
the
website.
Upscrews.Com
itself.
You
know
we
should
be
easy
to
find
us.
We
also
on
linkedin.
If
you
want
to
look
outside
our
office,
page
love
to
chat
with
you
guys.
B
This
was
interesting
and
I
know
exciting,
given
where
things
are
going
with
open,
telemetry,
open
source.
A
Well,
thank
you
both
so
much.
Thank
you.
Everyone
for
joining
the
latest
episode
of
cloud
native
live.
It
was
great
to
hear
from
eloque
and
caesar
we
really
again
love
the
interaction
and
questions
from
all
of
the
audience
join
us
next
week
to
hear
about
how
we're
going
to
be
building
stability
in
kubernetes
with
andy
suderman
of
pharaoh
ends.
Thank
you
all
for
joining
us
today
and
we
will
see
you
soon
have
a
go.