From YouTube: Machine Learning for k8s Logs and Metrics
A
All right. Like I said, thank you for coming today, and welcome to Machine Learning for Kubernetes Logs and Metrics. I'm Libby Schultz and I'll be moderating today's webinar. We'd like to welcome our presenter today, Larry Lancaster, founder and CTO of Zebrium. A few housekeeping items before we get started: during the webinar you're not able to talk as an attendee, but there is a Q&A box at the bottom of your screen. Please feel free to drop your questions in there, and we'll get to as many as we can at the end.
B
Hey, thank you so much. Hey, everybody. Today we get to talk about something that's near and dear to my heart: machine learning on logs and metrics, in particular when we're deploying in Kubernetes. And so with that, let's get started.
B
So machine data is my life. I've wasted far too many years dealing with machine data, at a number of companies, in a number of roles.
B
At the end of the day, what I've found is that there's a lot of value that can be gotten out of telemetry, but it really is a long mile to walk to pull that value out. The vision that I thought became realistic maybe five or six years ago grew out of that.
B
So I'm going to frame the problem space, as I see it, in terms of how things are converging. Twenty years ago, if you think about it, there was shrink-wrap software. If you had an incident, you had one affected user and one monolithic application.
B
If you were stretching it, you might have had ten different log files to look at, and the way you would dig into root cause is that you would index those. If you were lucky you might not even have to; you might just search them, because for any given incident they might be small enough that the volume of data wasn't even a problem.
B
It's always been true, as long as I've known it, that for detection of incidents, metrics are sometimes the best way to go. But if you want to get to the root cause of a new incident, something that's new, a new problem, chances are you're going to end up in a log file at some point. So this is an important piece of the incident management and triage workflow.
B
If you compare that to today, you have a SaaS world. Now you've got one operational incident, maybe a hundred thousand users affected, and maybe a hundred services that could potentially be taking part.
B
You have maybe a thousand log streams coming through with all this telemetry, so using the telemetry becomes inherently more difficult. And yet if you look at how people are using that telemetry today, oftentimes it's still the same approach. So the question is: what do we need? Do we need something better, and if so, what is it? To step back and look at it:
B
If you take a look at the state of DevOps, there's a report done annually that tries to survey the trends in the field, and one thing that got called out in 2019 is that, because of the complexity of a typical deployment today, MTTR has plateaued. In other words, even the elite shops,
B
with all their scripting, rules and so on, have hit that plateau. What's typically driving MTTR now is new problems, and the reason for that is the complexity of software today: something is going to break. This is reflected again in the same report: what you're seeing is that no matter whether it's a small shop or a large shop with an elite team,
B
the new incident, the new thing, the unknown problem, has become the driver of downtime today. So our vision is autonomous root cause.
B
So to me, that's why machine learning on telemetry is very important, and it's probably going to become a lot more important in the coming years.
B
It's an amazing ecosystem. Someone can come in and deploy our software and collectors, for example, with a couple of Helm charts, or with a couple of kubectl commands. It's just absolutely amazing how little configuration can be required to deploy an application. So while there's been a lot of complexity that's come along with the microservices environment and the decentralization of software,
B
there's also been a birth of new flexibility, coming mostly out of the metadata that these deployment systems contain.
B
So I'm going to want to be able to monitor an arbitrary application, and the truth is, it would be nice if all of our software were running in a JVM somewhere, but that's not always the case. We need arbitrary runtimes to be supported, and arbitrary infrastructure: it could be that I need to monitor a Linux instance in AWS, or some bare-metal server somewhere.
B
So there are a lot of combinations now that make up the complexity we talked about a minute ago, and taking these things into account becomes important if you're going to do effective automation of root cause. Another interesting thing comes up as well.
B
If you take one application stack and deploy it in one environment, say, for example, an Atlassian stack, which a lot of people use: there are still a lot of people running it on-prem, or at least in their own VPC, and there will be for a long time, because it holds source code and some people just aren't comfortable putting that in a cloud environment. And there are a number of applications like this, sensitive databases and so on.
B
So it doesn't really work to have the model where I go and, quote-unquote, learn about Postgres logs, what's normal and what's not, in environment A, and then take that learning and distribute it to a thousand users who are all using Postgres in completely different ways.
B
The complexity, from the perspective of doing machine learning in that case, is that there needs to be a lot of custom learning done in whatever environment the solution is going to be deployed in. I talked a little bit about zero required configuration for setup, and how useful Kubernetes has been for deployment in that respect.
B
There are lots of other things you don't want to require either. You don't want to require an end user to sit down and train a system: "yes, no, yes, no; here's a thousand log events, is this one interesting, yes or no?" We've found, at least, that people aren't willing to do that kind of work, so that also bounds your solution space quite a bit. There are a lot of things
B
you don't want to require of a user to get started, even though they may very well end up wanting to do some of those things; you don't want to have to require them if you're going to call your application autonomous. So in that sense, is it really too much to ask? I think
B
it's gotten to the point where it's not too much to ask to be able to see real value immediately. There's always going to be a period of adjustment, and a period of going in and giving some amount of coarse-grained feedback; you're always going to need that. You're going to want to let people bring their alert rules, and I'm not saying these things aren't valuable.
B
In fact, in some cases they're critical, but my contention is that your system can't require all of that just to show value. So why am I saying that we need to be so flexible? Because if I'm requiring a person to go in and do training, that's probably not going to scale indefinitely, and any assumptions I make about a stack may not hold. So, given that, where do we start?
B
Let's say we're doing a very general machine learning task on a set of telemetry. Where do you start? What set of telemetry would you start with? Everyone may have an opinion on that. My opinion is that if you're targeting root cause analysis, then you have to start with logs. The thing about logs is that they're difficult, they're difficult to work with, but there are reasons that logs are so valuable when you're root-causing a new incident type.
B
Oh yeah, thanks Gavin, I'm going to do that right after this slide, thank you very much. So a free-text log is going to tell a real story about what's happening, and if you look at these log lines I've put here as an example,
B
what you'll notice, at least if you have some experience, is that when you read them they kind of tell you what happened, without a lot of rules having to sit behind all that. We don't need a lot of metadata to tell that story out of the text.
B
Oh, actually you're right, Gavin, I think I skipped that spot; I was going to stop a couple of slides ago. So really quickly, everyone, I'm going to let Gavin jump in. Gavin is our head of product, and he's going to start up the Kubernetes demo; then at the end we're going to look at how things panned out. So hold on, let me go ahead and stop sharing. Gavin, apologies for that.
C
Thanks, Larry. As you just mentioned, I'm going to run this demo in two parts. In the first part, which I'll do now, I'll just show you the demo environment that I'm using and then I'm going to break it; in the second part I'll come back and show you what the Zebrium machine learning picks up. So if I share my screen now, I'm going to take you into Google Cloud's demo microservices app. It's a little web app.
C
It's running a shop. I'm going to purchase a barista kit as we speak, just to show you what it looks like running, and there we go. I've also got a whole bunch of services running: Istio, Prometheus and Kiali, and in Kiali we can see what the data flows look like and what's going on on the network, or rather the service mesh, here. You can see there's a lot of activity.
C
The reason I set that up prior to the webinar is that I wanted to give it a little bit of time to learn the basic patterns in these logs. The way you sign up for Zebrium is you go to the Zebrium site, click Get Started Free, fill in your name and so on, and then set a password, and that will take you into a screen that looks something like this.
C
Now I'm going to go ahead and break my application. If you look here, these are the pods that are running for the app, and I'm going to essentially kill the product catalog service pod by scaling it down to zero replicas. We should see it die in a moment; it's busy terminating now, and it should disappear.
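For anyone following along, this break step is easy to reproduce outside the demo. Below is a minimal sketch using the official Kubernetes Python client; the deployment name and namespace are assumptions based on how the Google microservices demo is commonly installed, so adjust them for your cluster (it is just the scripted equivalent of `kubectl scale deployment ... --replicas=0`).

```python
# Minimal sketch: scale the product catalog deployment down to zero replicas,
# reproducing the "break" step of the demo. Assumes a working kubeconfig and that
# the deployment name/namespace below match your install (both are assumptions).
from kubernetes import client, config

config.load_kube_config()                  # use load_incluster_config() inside a cluster
apps = client.AppsV1Api()

apps.patch_namespaced_deployment_scale(
    name="productcatalogservice",          # assumed name from the demo app
    namespace="default",                   # assumed namespace
    body={"spec": {"replicas": 0}},        # zero replicas effectively kills the service
)
# Patching replicas back to 1 restores the service afterwards.
```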
If I go back to my web app now and try to do something, it's not working particularly well, and in a moment you'll see Kiali react.
C
It should start to turn red as it detects things failing. So essentially my app is completely broken now. What Zebrium will see is a change in the patterns coming through; we have absolutely not built any rules to detect this kind of problem. I know it's a fairly trivial problem, but just bear that in mind. So let me hand back to Larry, and I'll come back towards the end and show you the incident that Zebrium should pick up from this. Larry, back to you.
B
All right, that was fantastic, Gavin, thanks. What's interesting about the demo Gavin just started: I think it was a week or two ago, a service provider went in, signed up, and decided they were just going to see what happened if they did X, and we thought it was so simple and cool to show that we ended up appropriating it.
B
So you have to wonder why, in general, people don't use logs much for monitoring. I should be careful: it's not true that people don't use them at all for monitoring; there are plenty of alert rules built on logs all over the place. It's just that, in general,
B
the direction for monitoring, at least for finding out when things are broken, is to use metric alerts, and part of the reason is that logs are generally higher volume, and there are some other conceptual problems with logs that make them difficult to work with. If you think about what log monitoring tools look like today, generally you're going to sit down and build all of this automation around them, so it's a very tedious and manual process. But still, we know that just under the surface, when the manual work is applied, there's a lot of value there. So when we think about what makes logs difficult to work with:
B
Let's say I'm trying to root-cause some issue that just happened. It's slow and painful when I'm having to search for keywords and I don't know what they should be. Maybe I'll look for a spike in log volume; maybe that's going to tell me where to look first. But then you find out,
B
oh yeah, a yum update ran in the background and that spike was from something else, and I still don't know where to look. So it becomes "let me type in things like fail, bad, abort, whatever," just trying to find a place to start looking. It's a real pain. Logs are also fragile: formats change. I don't know how many times I've experienced this,
B
but you'll set up a rule on some logs and go away happy with yourself; the next time that event happens, you're going to catch it and do something special with some value that's a parameter in that log event, or whatever. And then someone who doesn't owe you an explanation at all, someone upstream, decides
B
to do something really helpful and nice, actually, which is to fix a spelling mistake, and the next thing you know, your little rule silently breaks. These are the kinds of frustrations that log monitoring surfaces. And finally, it just gets annoying. Sometimes, if you try to set simple rules,
B
you'll blow up with "okay, I'm going to do this every time I get an error." But now I've deployed some new part of my stack, or a new version of something, or something completely irrelevant has happened, and something is spewing
B
hundreds or thousands of error events into the log that really don't matter. Now I have to write rules to suppress that, or buy an AIOps tool to try to group them into one thing without ringing my pager all night. It's a real pain.
B
So to me, logs have been stuck in this rut for a while, and I think it really boils down to being stuck in the index-and-search mentality. Index and search are just ways to speed up the manual work, and as long as you're doing the manual work, you're going to be required to maintain that work, and you're going to be subject to the limitations of your own process as you look around trying to find what's important.
B
So it's a self-limiting approach. We talked a little bit about this already, and I did mention that applications are bespoke; this is actually an important one. Let's say someone gives me some package that's going to look for errors in Postgres, and I decide to deploy it, since I have Postgres in my application.
B
The semantics of what those logs mean, and what's normal and abnormal, are completely specific to me, so that sense is going to have to be learned on my data, and there isn't going to be some giant multi-petabyte training data set to do it with. All of these things make it difficult to apply machine learning, in general, to actual monitoring and root cause problems.
B
So let me step back and think about the very simple essence of what I actually want to do with these logs, and whether it's possible. The way I think about it is the junior SRE problem. Let's say it's day one: you're a junior SRE and you walk into a shop. There are a few things in the stack that you're familiar with, and a giant wad of stuff
B
that you're completely unfamiliar with, having never worked with this application or stack before. Over time you start to learn what's normal, and you do that in very simple terms, at least I do. I should point out that this is my approach to the junior SRE problem. Really I'm looking for two things, and my experience starts to crystallize around two very important recognition tasks.
B
One is that I need to be able to recognize when something is bad. I know that sounds trivial and silly, but it isn't: there can be errors and warnings getting spewed that don't matter at all. As I get to know my application better, though, I'll start to realize
B
when something bad is actually happening that I'm supposed to care about, and part of that recognition ends up being how widespread the observed badness is. As an example,
B
let's say I'm looking at one log, one container type, one service, and it's got some regular cadence of errors; but then I see a few fatals, and at the same time I see errors cropping up somewhere else, in another service. That's going to be a cue that something bad is happening. So you get this correlation across log streams; that's important.
B
Something else I think is useful in finding root cause, in getting to the bottom of things, is having a sense of what's rare: "I've never seen that happen before." That, I think, is equally critical to root-causing a new issue. So just these two concepts, what's bad and what's rare: if you can start to get a handle on those and figure out when they're happening around the same time, you have a good chance of surfacing root cause.
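To make the "bad and rare, around the same time" idea concrete, here is a toy sketch (not Zebrium's algorithm) that flags rare event types and error-severity events per stream, then looks for minutes where several streams are flagged together. The event tuples, thresholds and severity labels are illustrative assumptions.

```python
# Toy sketch of the "bad + rare, co-occurring" idea; not a production algorithm.
# Input: events as (minute, stream, event_type, severity) tuples.
from collections import Counter, defaultdict

def find_incident_windows(events, rare_threshold=3, min_streams=2):
    type_counts = Counter(etype for _, _, etype, _ in events)  # global frequency per event type
    flagged = defaultdict(set)                                 # minute -> streams with a flag

    for minute, stream, etype, severity in events:
        is_rare = type_counts[etype] <= rare_threshold         # "I've rarely seen this before"
        is_bad = severity in {"ERROR", "FATAL"}                # "this looks bad"
        if is_rare or is_bad:
            flagged[minute].add(stream)

    # Incident candidates: minutes where several independent streams are anomalous together.
    return sorted(m for m, streams in flagged.items() if len(streams) >= min_streams)

events = [
    (10, "frontend", "conn_refused", "ERROR"),
    (10, "checkout", "rpc_timeout", "ERROR"),
    (11, "frontend", "request_ok", "INFO"),
]
print(find_incident_windows(events))  # -> [10]
```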
B
I'm going to talk a little bit about what we're doing, but there may be other approaches, and I'm going to discuss some of those after I get through a few things about how we approach this problem, because right now there really is no single approach to it. We happen to think we have the best one, but that's us; there are other people tackling this problem, and I want to talk about that too.
B
We do a complete relational structuring of logs, and we do it at ingest. It's not as if there's some batch process going on to figure out what the different event types are, and what the parameters in the events are, out of my piles of log data; it has to be done at ingest. There are a number of reasons why, but probably the most important is this:
B
when I see something new, that's when it's going to matter the most. So if I'm going to do something like structure these logs, that's all great, but it had better do something reasonable with the first or second occurrence of an event type. That's important for us as we go through this. The idea is very straightforward, and in this particular log I'm showing here, I'm just pulling fields out of a mocked-up
B
JSON log. I've never seen a JSON log this simple before, I'm sure they exist, but it's meant to get the concept across. In fact, the more free-text a log is, I've found that both our approaches and other common approaches have an easier time structuring it naively, and the reason, ironically, is that the text of the log message contains a lot of locality that gives you information.
B
But the most important thing coming out of this is that we know what kind of event this is. In other words, there will be one event type, meaning that when I see a log like this, it's this type of event, and that's probably the most critical thing to walk away understanding. If you're dealing with a structured log, one of the first things you have to find out is whether they've given you some context
B
that captures what kind of event this is; there's probably an attribute that embodies that, and that's what you need to hone in on. So, given that, let's apply the requirements from earlier. We don't want to have to assume that we know the prefix formats of the logs, and we don't want to assume we know the grammars.
B
We don't want to assume that we need to know keywords, because again, this could be your own bespoke application, and in that case there aren't going to be any known prefix formats or event grammars; the system has to learn all of that. And if you're able to do that, you can embrace free-text logs.
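As a toy illustration of structuring free-text lines without any known formats, the sketch below masks obviously variable tokens and hashes the remaining skeleton into an event-type id. This is only a naive first cut of the idea, not the multi-stage pipeline described later, and the masking patterns are assumptions.

```python
# Naive inline structuring sketch: map a free-text log line to an event-type id by
# masking variable-looking tokens. Real pipelines do far more; this only shows the idea.
import hashlib
import re

VARIABLE = [
    (re.compile(r"^\d{1,3}(\.\d{1,3}){3}(:\d+)?$"), "<IP>"),   # IPv4, optional port
    (re.compile(r"^0x[0-9a-fA-F]+$"), "<HEX>"),                # hex values
    (re.compile(r"^(/[\w.\-]+)+/?$"), "<PATH>"),               # filesystem paths
    (re.compile(r"^\d+(\.\d+)?$"), "<NUM>"),                   # integers / floats
]

def event_type(line):
    skeleton = []
    for tok in line.split():
        for pattern, placeholder in VARIABLE:
            if pattern.match(tok):
                tok = placeholder
                break
        skeleton.append(tok)
    template = " ".join(skeleton)
    return hashlib.md5(template.encode()).hexdigest()[:8], template  # stable id plus template

print(event_type("connection from 10.0.0.7:443 failed after 31 retries"))
print(event_type("connection from 10.1.2.3:80 failed after 2 retries"))
# Both lines yield the same template, hence the same event-type id.
```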
B
Then I can say: the rate of this kind of event in this stream is this, and the rate of that kind of event in that stream is that, and all of a sudden maybe I have upticks in the rate of those, or a very tight correlation in their times of occurrence that I wouldn't usually have. You can imagine expanding that beyond event types to severities and errors: all of a sudden the rate of errors here went up, and not only that, those events are tightly correlated with events I'm seeing in this other stream over here. Once you've taken that first fundamental step of structuring the events into event types, you can start to do this kind of correlation analysis, and to us this has proven to be a transformational step. If you do well enough at it, you can start to identify what I would call incidents, or at the very least clumps of stuff, and that's going to get you to a root cause indicator.
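A hedged sketch of what that kind of cross-stream correlation can look like is below: compare each event type's per-minute count in each stream against its own trailing baseline, and treat windows where several streams deviate together as incident candidates. The window size and threshold are illustrative only; this shows the shape of the computation, not the actual model.

```python
# Sketch: flag minutes where per-event-type rates jump well above a trailing baseline
# in more than one stream at once. All thresholds here are illustrative.
from collections import defaultdict
from statistics import mean

def rate_anomalies(counts, history=60, factor=5.0):
    """counts: dict[(stream, event_type)] -> list of per-minute counts, oldest first."""
    anomalies = defaultdict(set)                              # minute index -> deviating streams
    for (stream, _etype), series in counts.items():
        for t in range(history, len(series)):
            baseline = mean(series[t - history:t]) or 0.1     # avoid dividing by zero
            if series[t] > factor * baseline:                 # sudden uptick in this stream
                anomalies[t].add(stream)
    # Minutes where multiple streams deviate together are incident candidates.
    return {t: streams for t, streams in anomalies.items() if len(streams) >= 2}
```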
I think we already talked about this, but there's a bunch of stuff we just can't require in order to do that correlation up front. At least from our perspective, and maybe to a fault, you can't have any rules built into the system that know to look for a particular keyword, because if you do any of that, it may work in one instance, but it isn't going to generalize.
B
So that's been our approach, and I think that discipline is important, because otherwise the next thing you know, you have a database of a thousand alert rules, and if someone buys that, they're no better off than they would have been with their own database of alert rules. That's one thing it's important to avoid.
B
I want to talk a little bit about other attempts that have been made to structure logs for root cause, or just for detection and monitoring. One set of approaches is deep learning. There are a number of papers on this, and an academic community that's been really interested in it. There are a couple of things I would say about that.
B
One is that if someone's going to send their log data to a SaaS service, cost is going to be especially important to them. You're not going to be
B
racking some refrigerator-sized appliance to do actual deep learning on every data set you get. And at the same time, as I've touched on already, it's very difficult to take what's normal in one stack and environment and generalize it to another; depending on exactly what the stack is made of and exactly how it's used, normal can mean very different things. So there are a number of conceptual challenges like that which, I would say, are in the way.
B
I do think the time will come when deep learning approaches will be very successful at tackling telemetry from a naive standpoint; I just don't think we're quite there yet. In fact, some of the natural language models are probably closer than the deep learning models that have been more popular in the literature for the last few years.
B
In any case, another thing you'll see is the use of a particular algorithm, usually LCS (for those of you in the audience who care about such things, longest common substring), in different implementations, some online, some batch. Essentially the idea is that this algorithm decides what your catalog of event types is. There are a couple of weaknesses with that.
B
It doesn't really have an innate sense of types, so depending on your implementation, you have to build in something extra to be able to tell that these are different things, because this token here is always an integer and that token there is always a file. That matters. But I think a bigger, more conceptual barrier is this:
B
to get a good structuring out of LCS (and you see this in a number of machine-learning-for-logs packages on the market today), it takes a lot of examples of a given event type to do a good job with it; otherwise the event gets put into an "other" bucket. And because of the Pareto nature of logs,
B
you always end up with some massive swath of your log data, including the most important events, sitting in that "other" bucket, not yet effectively categorized. So I think, in practical reality, it's important to have a continuum of approaches that you bring to bear depending on the cardinality of each event type you're seeing.
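For readers unfamiliar with the LCS family of log parsers, the sketch below shows the basic move: match an incoming line against stored templates by longest-common-subsequence similarity, and wildcard the positions that differ. It also exhibits the weakness described above, since a line with too few look-alikes simply lands in its own bucket. The threshold and merge rule are simplifications, not any particular product's implementation.

```python
# Simplified LCS-style template matcher, to illustrate the approach and its limits.
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length over token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

class TemplateStore:
    def __init__(self, threshold=0.7):
        self.templates = []            # each template is a token list; "*" marks a parameter
        self.threshold = threshold

    def add(self, line):
        tokens = line.split()
        for idx, tmpl in enumerate(self.templates):
            if lcs_len(tokens, tmpl) / max(len(tokens), 1) >= self.threshold:
                if len(tmpl) == len(tokens):
                    # Merge: keep matching tokens, wildcard the differing positions.
                    self.templates[idx] = [a if a == b else "*" for a, b in zip(tmpl, tokens)]
                return idx
        self.templates.append(tokens)  # unseen shape: a new, possibly one-off, bucket
        return len(self.templates) - 1

store = TemplateStore()
print(store.add("connection to db-1 timed out after 30 ms"))   # -> 0 (new template)
print(store.add("connection to db-2 timed out after 12 ms"))   # -> 0 (merged, params wildcarded)
print(store.add("disk sda1 is 91% full"))                       # -> 1 (rare line, its own bucket)
```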
B
So, summarizing: you have to structure first, and you have to do it inline at ingest time, otherwise you can't respond to an incident. You have to have a multi-stage structuring pipeline, at least in our view, to respect the Pareto distribution of event types in real-world logs. And the good thing about using a correlation model, once you get past the structuring and start doing this incident detection and root cause report generation, is that it benefits from more data sources, more streams of logs and/or metrics.
B
I haven't talked much about metrics, but you'll see a little of that in a minute. The more of these streams I have with detectable anomalies, the better job I can do of cross-correlating and picking out a point in time; I get better resolution the more data I have, and that's an important dimension
B
of an effective solution. And then finally, you'll see an example of this in a bit: a lot of you may have heard of GPT-3. It's a natural language model; there are competing models, and there are free downloads of similar models that you can use pre-trained or train yourself.
B
Here's an example where we had a stack and some incidents were detected. What you're seeing is that the color represents the severity of the event: green is debug, blue is info, and then you've got your warnings and your errors, which are the yellow and the red, across these different services.
B
What we have here is minute-by-minute time, and the size of the markers represents how new the events were, or I should say how rare: it's a representation of the inverse of the typical frequency of these events. And what you're seeing is this thing here.
B
It's interesting when you get an autonomous monitoring stack working on logs, because you can deploy something like Litmus, the chaos engineering tool, and it will deploy some tests and break some stuff. The interesting thing is that the tool itself logs what it's doing; it says, I'm initializing this, I'm selecting a pod to kill, and so on. Well, guess what: if your autonomous monitoring solution is doing a good job, it's going to realize that that is actually the root cause of the outcome.
B
This is pulled from a blog on our website, just some recipes to go and test this and see what happens. You can see what we're doing: we're pulling out the rare stuff here, we're pulling out the bad stuff here, and we're pulling features out of metrics that give you more flavor, to help confirm that you're on the right track to root-causing the problem.
B
Here's an example where we're doing the same thing, except we've grabbed a description of the problem from GPT-3. We pass in our root cause report, we get back a description, and we put it in here. Here you can see that the first thing that happened was the OOM killer being invoked, and then all hell broke loose. And yes, in fact, this was an OOM problem, and that's the description that got put in here. This is the kind of thing I think Gavin's going to show.
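The mechanics of that step are simple to sketch. The snippet below uses the legacy completion-style OpenAI Python client that GPT-3 was originally exposed through; the model name, prompt wording and report text are placeholders, not Zebrium's actual integration.

```python
# Sketch: ask a GPT-3-style completion model to describe a root-cause report in plain English.
# Model name, prompt and report contents are placeholders; treat as illustrative only.
import os
import openai  # legacy (pre-1.0) completion-style client

openai.api_key = os.environ["OPENAI_API_KEY"]

root_cause_report = """\
kernel: Out of memory: Killed process 2714 (java)
kubelet: Container payment-service was OOMKilled
prometheus: node_memory_SwapFree_bytes dropped to 0
"""

response = openai.Completion.create(
    model="text-davinci-003",  # placeholder model name
    prompt="Summarize the likely root cause of this incident in one sentence:\n" + root_cause_report,
    max_tokens=60,
    temperature=0,
)
print(response.choices[0].text.strip())
```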
B
This is another example of that same incident, in a more modern interface, where you can see the swap free bytes metric from Prometheus dropping, and we pull out that anomaly. We feed metric anomalies directly into the same correlation model; they just go in alongside the log anomalies, so you get it all stitched into one correlated report. With that, I'm going to pass it back to Gavin.
C
The first event in this case, and again it's kind of beautiful because it exactly nailed the root cause, is the Kubernetes message about the product catalog service being scaled down to zero replicas, and then we see the dead container over here. I'm going to drill into the incident, and this gives you a little more detail about what happened and what made up the incident from our machine learning's perspective.
C
So that's it. If anybody's interested, I think Larry has the link in his presentation: we've documented how to bring up this demo app in a Minikube environment, so anyone can test it, break it any way you choose, and see what Zebrium picks up. Thank you very much, and I'll pass it back to Larry.
B
Okay, thanks Gavin. Let me bring this back up. So we've had some recent validation in the market. MayaData, who make the Litmus toolset I mentioned, also have OpenEBS, which is essentially storage software. Basically, what they did was they went in.
B
They said, okay, we're going to replicate the outages that our real customers actually had. They picked six or seven of them, replicated them in their environment, and we picked those up with root cause indicators. So that was cool. DZone wrote a great article about us, and the author basically put in the charts, spun up the software, did something to break something, and then saw it show up.
B
That was cool too. Sweetwater, the music equipment retailer, is a customer of ours right now, and they concluded that we've dropped their root cause time for new incidents from three hours to 15 minutes, which has made a huge difference in their ability to deliver high-quality service to their users. Everyone's encouraged to join us on this journey, and I'm always happy to have discussions
B
with anyone about log machine learning in general, not only detection; any ideas people have, I'd love to talk through. We have a Slack community as well that you can participate in. I think that's about it. I want to thank everyone here for their time and their interest in log machine learning on Kubernetes.
A
All right, thanks everyone. We have about 10 minutes left and two questions so far in the Q&A box, so if there are any more, go ahead and pop them in there, and I will hand it back over to you, Larry, to get started.
B
Okay, so by default, when you come in and become a user, you go into our deployment in AWS, which is in the western US, but we have spun up instances in other geos on request, so don't be shy; we don't mind doing it.
B
GPUs are not required. And there really is no fixed training period: basically there's a bunch of parameters being estimated, and those estimates are poor ten minutes after you install the software, but after a day or two they're probably really good. There's a continuum over the first few hours where it gets more and more useful.
B
Sometimes you'll see spurious stuff popping up at the beginning. That's the approach we've taken, because a lot of people just want to spin it up, run some chaos tests, and be done, so we let them do that. But if you do that, you're going to get some noise, and some things may not get picked up exactly right.
B
Okay: do you have plans for running this on-prem as well? Yes, actually, that's a very good question. As it turns out, a lot of people really want this run on-prem when it has to do with logs; they're a little bit afraid of PII and that sort of thing. It depends on the user.
B
We have a project that we kicked off this week to build exactly that, so you should get in touch with us if you're interested in being one of the first pilot users of the on-prem version. We're packaging it as a kind of virtual appliance.
B
"What API do you need access to in order to collect the logs for our k8s cluster?" Hey, Roddy, do you understand this question? I'm not quite sure what's being asked.
D
I think the simple way to answer that is that you don't need to worry about it. Our collector deploys as a DaemonSet; it picks up container logs automatically from the container log output, and it does use Kubernetes APIs to pick up events. If you're interested, you can examine our documentation on GitHub, or you can ping us for more details. But the short answer is you shouldn't have to worry about it; you don't actually need to do any work. It's just a Helm chart, and the DaemonSet automatically collects everything.
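For the curious, the "uses Kubernetes APIs to pick up events" part can be approximated in a few lines with the official Python client. This is a generic sketch of streaming cluster events, not the actual collector code.

```python
# Generic sketch: stream Kubernetes events from inside a cluster, the kind of signal a
# DaemonSet-based collector would forward alongside container logs. Not vendor code.
from kubernetes import client, config, watch

config.load_incluster_config()   # use config.load_kube_config() when running outside the cluster
v1 = client.CoreV1Api()

w = watch.Watch()
for item in w.stream(v1.list_event_for_all_namespaces, timeout_seconds=60):
    ev = item["object"]
    print(ev.last_timestamp, ev.involved_object.kind, ev.involved_object.name,
          ev.reason, ev.message)
```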
B
Are there any competitors for log anomaly detection, and if so, how do you compare and differentiate from them? Yeah, there are. Ajay, do you want to address this one?
D
Sure. Log anomaly detection has been an area of interest for about a decade. Larry mentioned some of the academic research in the space, LCS and deep learning and so on, but there are also projects and even commercial products that have attempted anomaly detection to various degrees. They haven't gone as far as we go in terms of correlating anomalies and detecting incidents.
D
We have one such comparison, with the Elastic machine learning pack, on our website, and we even have a short video comparing them side by side.
D
That might be the best place to start; it's under the blogs page. Just look for Elastic and you'll see a side-by-side comparison of our machine learning with Elastic's.
B
Ah, now I understand Ann's question. So, right: I mentioned we have this multi-stage pipeline for structuring, and the first stage is actually heuristic.
B
I'll pick a couple of tokenizations and ask which tokenization gets me the most compatible-looking things. And by reachability clustering I mean: with a given tokenization I might have, say, 18 tokens; if I can build up a set by reaching from one example to the next with only two parametric tokens differing, then I'll use that set as a cluster. So it really is a kind of clustering.
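A toy version of that reachability idea, under the stated assumptions (group lines by token count, link two examples when they differ in at most two token positions, take connected components as clusters), might look like the sketch below. It is meant only to illustrate the clustering step, not the production heuristic.

```python
# Toy reachability clustering over tokenized log lines: lines with the same token count are
# linked when they differ in at most `max_varying` positions; connected components are clusters.
from collections import defaultdict

def reachability_clusters(lines, max_varying=2):
    groups = defaultdict(list)                        # token count -> list of token lists
    for line in lines:
        groups[len(line.split())].append(line.split())

    clusters = []
    for examples in groups.values():
        parent = list(range(len(examples)))           # simple union-find over the examples

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for i in range(len(examples)):
            for j in range(i + 1, len(examples)):
                differing = sum(a != b for a, b in zip(examples[i], examples[j]))
                if differing <= max_varying:          # "reachable" via few parametric tokens
                    parent[find(i)] = find(j)

        members = defaultdict(list)
        for i, tokens in enumerate(examples):
            members[find(i)].append(" ".join(tokens))
        clusters.extend(members.values())
    return clusters

print(reachability_clusters([
    "job 17 finished in 210 ms",
    "job 93 finished in 4 ms",
    "cache flush started on node a1",
]))
# -> [['job 17 finished in 210 ms', 'job 93 finished in 4 ms'], ['cache flush started on node a1']]
```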
B
Oh wait, there's one that just came in, and we've got literally 60 seconds, probably. So, right: this is an approach that's used in tracing. You might see logs show up with a trace ID, that sort of thing, and the question is, what's the difference? The difference is that I don't need support for it throughout my stack and I don't need to go trace anything. Remember, the whole idea was to create something that can work without alert rules.
B
If
I
then
require
tracing
to
go
through
every
relevant
code
path,
it
feels
like
it
felt
like
to
us
like
a
self-defeating
kind
of
prospect.
It's
just
trading
one
set
of
work
for
another
or
one
set
of
limitations
for
another,
and
that's
why
we
tried
to
avoid
requiring
that.
I
hope
that
made.
A
Okay, well, thanks everyone for coming and for participating. Thank you for the robust Q&A and all the back and forth, and thanks to the panelists for helping answer questions through the chat. Just a reminder: this will be up on the website later today, and we look forward to seeing everyone at a future CNCF webinar.