Description
Logs typically record the source of truth during an incident, but their sheer volume and messiness make incident detection and root cause analysis extremely challenging. As a result, logs are usually searched reactively, relying on a mix of intuition and brute-force effort. But there's hope: machine learning can be used to automatically detect anomalous log patterns and correlate them with root cause.
In this webinar we’ll discuss and demonstrate an approach that utilizes unsupervised machine learning to structure and categorize streaming log events and then learn normal and anomalous log patterns. The end result is reliable auto-detection of incidents and their root cause.
Sanjeev Rampal: Okay, I'd like to thank everyone who's joining today. Welcome to today's CNCF webinar, Using Machine Learning for Autonomous Log Monitoring. I'm Sanjeev Rampal, a principal engineer at Cisco and a CNCF ambassador, and I'll be moderating today's webinar. We'd like to welcome our presenters: Larry Lancaster, the founder of Zebrium, and Gavin Cohen, Vice President of Marketing at Zebrium.

A few housekeeping items before we get started. During the webinar you are not able to talk as an attendee, but there is a Q&A box at the bottom of your screen. Note that this is different from the chat window. Please feel free to drop your questions into the Q&A box and we'll get to as many as we can at the end. This is an official webinar of the CNCF and, as such, is subject to the CNCF code of conduct. Please do not add anything to the chat or questions that would violate that code of conduct; basically, please be respectful of your fellow participants and presenters. With that, I'm going to hand it over to Larry to kick off today's presentation.
Larry Lancaster: Machine data is my life. I've spent most of my career taking telemetry from products in the field and turning it into tools, business intelligence, and deliverables back to end users. What I'm going to talk about today is motivated by that history: having delved in and built platforms on top of machine data so many times, I kept getting to the same point.
When I look at what comes out of that in terms of monitoring solutions, I see that, for one thing, it often ends up being slow to figure out what's happening when there is an incident, just because you end up having to go digging through the logs yourself anyway.

So that's annoying and suboptimal. The fragile part I touched on, but how many of you have ever had to build scripts or regexes and parsers or tools on top of log data? What will often happen is that you'll have something that's working, and then a developer does a really nice thing, like improving a log message, and your parsing breaks.
When that happens, it makes dealing with logs and using them a manual process. And finally there's alert fatigue, which I think stems from how little visibility the tools typically have into the actual semantics of the data. What you'll see is people setting up alerts along the lines of: okay, if I get more than a thousand errors something is wrong, and if it's less than a thousand I know everything is okay.

That's the state of things, and it annoys me. The reason is that, at least for me, logs have always been used for root cause. It's almost always the case that you're going to go to the logs at some point. In fact, if it's a new situation, a new problem, an incident of any kind, logs are very good; you'll end up in a log file.
So given that logs clearly have the information we're looking for, why aren't they better at helping us monitor things in the first place? I think it boils down to this manual quality. When I say logs are stuck in index-and-search, what I mean is that there's a person doing the searching. There's a problem with that, although it hasn't always been a problem, especially when it's your own app.

So let's look at that model. Twenty years ago, when I started in the valley, there was a shrink-wrap model. Of course it wasn't always delivered in a box, but you get the idea. You had an incident, and there was a user or a customer, maybe a few users at one customer, that had hit a bug, and there was a support department. Everything was completely different back then: you had an incident on a monolithic application.
There might be up to ten log files, and ten would be pushing it, that people would need to be familiar with and look through. You would have them indexed, you would search through them, you would figure things out, and that was great. But things have changed. Nowadays an incident can affect tens of thousands of users, you've got dozens of services at least, and you've got potentially a thousand log streams.

To me this is unacceptable, and I believe the future doesn't have to be like this. I feel we've gotten to the point now where, at least to a large degree, we can let systems help us pinpoint the root cause indicators in those swaths of log data, without us having to go through the same process we've been doing for decades. Think forward twenty years: do you really think we want to still be doing this twenty years from now?
My answer is no, so let's get started with that. What I wanted from a tool is something that will characterize incidents before I notice them. That's a really ambitious statement, so let's talk a little bit about what it means and what it's taken.

What it means is automatically detecting incidents when they're happening, when weird things start to occur. For example, if you were to hire someone new into a DevOps role, they would be checking out the system, monitoring alerts, and looking around, and when things start going haywire they will typically have a good sense that something is going wrong. Even if they don't know what the problem is, they may well be able to figure out that, hey, something's going on here.
So let's start with that: let's automatically detect incidents without a whole bunch of alerts and rules having to be configured, and then let's go find the material within those logs that's germane to root cause indication. What's interesting from this perspective is that we've accumulated probably 110 to 120 real-world incidents, across more than 30 stacks, that people have been generous enough to share with us.

We found that there are some fundamental ways software behaves when it's breaking that let a system go in and find root cause clues for you. I talked a little bit about why this is so hard to do from the perspective of monitoring software: you've got ambiguous parses, you've got formats changing, and you need experts to interpret the data because apps are bespoke. So what do I mean by all that?
What you'll find is that the connectors and pre-built integrations that tools give you for getting at log data are mostly concerned with the prefix, the stuff at the left of the log line. There you'll have things like tags or function names, timestamps, and severities. Some of these prefixes are really long and complicated, some are very short and simple, and that's all very important information.

So we need to grab all of that, but everything to the right also needs to be structured if you want a machine to be able to tell you things like: there's a number in here, it's going up, and it's not usually this high; or this very particular kind of event is happening a lot more often than usual. You need to get at the semantics of that data programmatically, and that's what has been hard.
So the first thing we do is structure the logs to a relational level, such that if I wanted to run queries about what was going on in my logs, I could. That's not how you use the product day to day; we have a database on the back end that holds that information, and customers can get access to it, but generally that's not the point. We build things on top of it.

The point is, imagine in your mind the table that gets created from a certain kind of log event. Here's a very simple example of a very specific kind of log line, and you can see that over time there are some numbers changing, and it's fairly clear to the eye what those columns should be called. That's what we want the system to work out.
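To make the idea concrete, here is a minimal sketch, in Python, of splitting a free-text log line into a constant template (the event type) and its variable parameters. The token patterns and the example line are invented for illustration and are far simpler than what the presenters describe.

```python
import re

# Illustrative token patterns: anything matching one of these is treated as a
# variable "column" of the event type rather than constant template text.
TOKEN_PATTERNS = [
    (re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}$"), "<TS>"),   # ISO-ish timestamp
    (re.compile(r"^\d+(\.\d+)?$"), "<NUM>"),                          # integers / floats
]

def structure(line):
    """Split one log line into (event_type_template, parameter_values)."""
    template, params = [], []
    for token in line.split():
        for pattern, placeholder in TOKEN_PATTERNS:
            if pattern.match(token):
                template.append(placeholder)
                params.append(token)
                break
        else:
            template.append(token)        # constant text stays in the template
    return " ".join(template), params

etype, params = structure("2020-03-01T04:17:02 postgres stopped after 512 ms")
print(etype)    # <TS> postgres stopped after <NUM> ms
print(params)   # ['2020-03-01T04:17:02', '512']
```

Repeated occurrences of the same template then land in the same virtual table, with one column per placeholder, which is what makes queries and anomaly detection over the variables possible.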
You don't want to have to go and check whether somebody has a connector for X, because for your own application logs nobody is going to have a connector. You want a system that can just get in and learn the grammar the way a person would, and then keep iterating on its understanding of that grammar in the background, without you having to worry about it. That way you can embrace free-text logs.

Structured logging is cool, and we can certainly handle it, but I think it's annoying, at least from a developer perspective: structured logs are really hard for a human to read, and there's no reason we should have to re-translate our entire infrastructure for the benefit of machines. Why can't they just translate it for themselves? A big chunk of your stack is always going to emit free-text logs, and we believe that's important and valid.
Once you've done that structuring, you can do anomaly detection on that data. As I was mentioning earlier, once we've got the data structured properly we have ways of looking at it that have yielded some amazing results for us, in terms of the things that just tend to happen when software breaks. People always ask what kind of models we use, so without getting too deep into it, here is what I'd like to say.

In this part of our software stack we're not doing any deep learning, because this all happens in real time. We start ingesting logs in line, we structure them in line, and everything just gets better in the background. There's no batch processing, so it's not going to cost you an arm and a leg to run the service.
The way we're actually able to pull these incidents out is by looking at point-process statistics on the event types. Think of each event type: in a given log file there may be a thousand unique event types, a thousand of those tables I mentioned, virtually speaking, a thousand kinds of things that can happen in the environment and that are expressed in the logs.
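Treating each event type's occurrences as a point process suggests simple statistics such as arrival rates. The following is a minimal sketch under that assumption; the window and threshold are invented, and this is not the presenters' actual model.

```python
from collections import defaultdict

def rate_anomalies(events, window=60.0, threshold=5.0):
    """events: iterable of (timestamp_seconds, event_type).
    Flag event types whose recent arrival rate far exceeds their baseline."""
    per_type = defaultdict(list)
    for ts, etype in events:
        per_type[etype].append(ts)

    flagged = []
    for etype, stamps in per_type.items():
        stamps.sort()
        span = max(stamps[-1] - stamps[0], window)
        baseline = len(stamps) / span                       # long-run events per second
        recent = [t for t in stamps if t > stamps[-1] - window]
        recent_rate = len(recent) / window                  # events per second lately
        if recent_rate > threshold * baseline:
            flagged.append((etype, baseline, recent_rate))
    return flagged
```

A real implementation would also model periodicity and inter-arrival distributions per event type, but the point-process framing is the same.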
There are other people approaching the same space, and I think it's a fascinating space; there are a lot of brilliant people out there looking at ways to structure this kind of data. There's what I'd call a community of academics that looks at it one way, and there are some folks in industry who look at it in different ways. The deep learning angle was interesting to me because I have a story about it.

There's a large tech company that you would all know the name of, and they have a CTO for their services division. When I went and spoke with him he said: yes, this is a problem for us. We've got all these logs from all these different products, we gather the logs together, and we decided we wanted to go do some learning on that data and try to understand what's happening in customers' environments.
So what we did, he said, was buy all of our senior engineers DGX-1 workstations, send them for training on deep learning, and set them loose on the data. What ended up happening was that after six months we abandoned it, because we found they were having to spend all their time structuring the data rather than actually building models on top of it.

So it's been my learning that if you structure the data right, you have a lot of options, and you don't have to jump to the most expensive, trendy one right away; you can try other methods as well. So we try to use a Swiss Army knife machine learning approach.
It's interesting when you look at log data, and you might find this interesting too. If I take a terabyte of some application stack's log files out of some environment and look at it, typically about half of the event types I see within that corpus will appear only once or twice. So what does that tell you?

It tells you that if I've got this vision in my head where I'm going to onboard someone, start looking at their data, and suddenly have this massive corpus from which I've learned exactly what every event type looks like, with all its permutations and distributions, then I've got another thing coming. It doesn't work that way, so you need to do something different.
Basically, we have a four-stage pipeline, and depending on how many times we've seen an event type, a different stage of that pipeline has the primary effect. There's a layer that says: if I've only seen this event type once and there are numbers in it, I'm going to assume those numbers are parameters until proven otherwise. The next step is basically reachability clustering: these lines look alike, so I'm going to assume they belong together until proven otherwise. Then, once I've had a few examples, a naive Bayes classifier kicks in with a global fitness function that basically says: okay, here's the blob of stuff that's related, and here are the columns, and now we're really sure of it. So on the back end we shuffle things around between these buckets, and eventually it hardens into a structure.
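A minimal sketch of that staged idea, routing an event type to a stage based on how many times it has been seen, might look like the following; the stage names come from the talk, but the thresholds and the glue code are invented for illustration.

```python
def choose_stage(occurrences_seen):
    """Pick the structuring stage that should dominate for this event type."""
    if occurrences_seen <= 1:
        return "assume_numbers_are_parameters"   # single example: treat numbers as variables
    if occurrences_seen < 10:
        return "reachability_clustering"         # few examples: merge near-identical lines
    return "naive_bayes_refinement"              # many examples: settle the columns

def ingest(line, assign_type, stages, seen_counts):
    """Route one incoming line through the stage appropriate to its maturity."""
    etype = assign_type(line)                    # provisional event-type id for this line
    seen_counts[etype] = seen_counts.get(etype, 0) + 1
    stage = choose_stage(seen_counts[etype])
    return stages[stage](etype, line)            # each stage refines the schema for etype
```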
The nice thing about statistics that cross-correlate among event streams is that the anomaly detection gets better the more complexity you have, because there are more cross-correlated streams to work with.
So with that, I'm going to hand it over to my colleague Gavin, who is an absolute whiz with the demo and can do it in a time-efficient manner, and then we're going to come back and answer some questions. I'm going to go ahead and stop sharing.
Gavin Cohen: So, assuming everyone can see this, what you're looking at here is the overview screen that appears immediately after logging in. Just to set some context for what I'm about to demonstrate: we've deliberately ingested just a tiny data set, 24 MB and about 180,000 events, and there was absolutely zero manual configuration. No one built rules, and there was no pre-learning.

There was zero knowledge of this data set until it came in. Essentially our ML went through the events, and what we uncovered, from an overview perspective, is a bunch of exceptions, some events with error or high severity, but the really interesting stuff is here: 155 anomalous events, meaning events that broke pattern compared to what we would expect based on that small data set, and then one incident. The incidents are the things we really care about.
Incidents are the things that bubble up as correlated sets of anomalies that we believe are not happening by chance; they're happening because something is changing in the behavior of the software. What I can do here is click on the incident, and you're taken to a root cause description. It seems reasonable: "Postgres stopped" is what it calls it, and that comes directly out of one of the events. If I click on it, you get the detail of what we found, so let me explain this a little bit.

The data was collected from a set of Kubernetes pods running the Atlassian stack in AWS. What we found is that in this pod, postgres-master, we see a couple of messages saying that Postgres stopped, and then in a different pod we see a correlated set of events coming from Jira, a different application, and in particular this one saying it's having Postgres issues connecting to the database. So it's a clearly related set of events.
Out of the 180,000 events we ingested, we detected just five that encompass this incident, and it turns out this was exactly the problem that occurred in this situation: something shut down Postgres, and immediately afterwards the rest of the applications running in the other pods started noticing. So that's what we see, and that's what it detected.

If you take this a little bit further, we make it really easy to confirm the diagnosis, and maybe troubleshoot more or find out what really happened. So with that, I click on the button to browse the incident, and I'm taken into what is essentially a log manager view, but filtered at the moment.
What you're seeing at the top of the screen is a set of visualizations of the data set; the entire space of this data set encompasses the 180,000 events, and we break it down into some visualizations that I'll come back to in a minute. Because I clicked on the incident, the view is filtered to just the five events that make up the incident, so we can see them all together, much as they would look if you extracted those lines out of the different log files.

I can turn off that incident filter, and now we stay in the same place but see the incident events surrounded by all the other events from all the other pieces of the environment. In this case you see other log lines from Confluence, and if you scrolled around you'd see all the other Atlassian components spinning out messages. Now you can see the incident in the context of everything, and this is another mechanism for quickly understanding and troubleshooting.
We call this view an x-ray. The x-axis is time, and the y-axis is the event space spanning all the different log sources. We draw a colored rectangle everywhere we find an anomaly, with its position on the y-axis corresponding to where it came from, that is, which particular event type it came from. The brighter the rectangle, the more anomalous the event, so the most anomalous are these very bright white marks you can see just above my mouse pointer.

What you see here is very typical: there are always anomalies. There will always be events that break pattern, and you don't want to alert on those or create incidents from them, because there isn't enough context to say whether they are or aren't problems. Sometimes they're problematic, and sometimes they just break pattern; new events occur for whatever reason.
But if you go to this section here, which is where we found the incident, you see this really tight band of correlated anomalies, and that's the trigger: when we see sets of very high-likelihood anomalies, the very bright rectangles, correlated across different parts of the application, or different log files, log types, or log streams, that's our indication that there's an incident. All of these things broke pattern, they're all anomalous, they're tightly correlated, and there are some very high-probability anomalies in there. That's our incident.
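A minimal sketch of that grouping logic, under assumptions of my own (a fixed time window, a minimum number of distinct streams, and made-up score thresholds), could look like this:

```python
def find_incidents(anomalies, window=30.0, min_streams=2, min_score=0.9):
    """anomalies: list of (timestamp, stream, score), sorted by timestamp.
    Group anomalies that arrive close together, then keep only groups whose
    strong anomalies span several independent log streams."""
    groups, current = [], []
    for ts, stream, score in anomalies:
        if current and ts - current[-1][0] > window:
            groups.append(current)               # gap too large: close this group
            current = []
        current.append((ts, stream, score))
    if current:
        groups.append(current)

    incidents = []
    for group in groups:
        strong = {s for _, s, sc in group if sc >= min_score}
        if len(strong) >= min_streams:           # the "tight band" across streams
            incidents.append(group)
    return incidents
```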
So that's how we picked these five events out of 180,000 to make up the incident: we saw this band, we pulled out what was most anomalous within it, and that became the incident. The root cause is identified roughly by the leading edge, the first anomalous event, the one that seemed to trigger everything else.

If you remember, in the incident we showed you the root cause, Postgres stopping, and then the symptoms, which in this case were Jira noticing that it couldn't talk to the SQL database. And actually, if you go further and look at some of the yellow, the other anomalies around it, you'll find that they relate to some of the other applications that also started having problems once Postgres stopped. So that gives you a good sense of what's happening under the covers.
The very last thing I'll show you speaks to what we did to get to this point. Larry, you spent a bit of the presentation talking about how we structure the data, and there are a few interesting things you can see here. Look at this log line as an example: wherever there's a blue piece of text, we've discerned that it was a variable part of the event. In this case all of these blue pieces are variables, and in particular, look at this Postgres "stopped" message.
The word "stopped" is actually a variable, meaning we've seen an event of this type elsewhere with a different value in that position. So I can chart it, and what you see nicely here is that at 4:17 we get the "stopped" for Postgres, and then two minutes later we get a "starting" and a "started": the same event type that we categorized for the machine learning, with different values for that variable. To get this I didn't have to parse anything myself.
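Once lines are grouped into an event type with a variable slot, charting that slot over time is just a filter and a plot. A small sketch follows; the records are invented to mirror the Postgres example from the demo.

```python
import matplotlib.pyplot as plt

# Invented records mirroring the demo: (time, event_type, value of the variable slot)
records = [
    ("04:17", "pg_state_change", "stopped"),
    ("04:19", "pg_state_change", "starting"),
    ("04:19", "pg_state_change", "started"),
]

times  = [ts for ts, etype, value in records if etype == "pg_state_change"]
values = [value for ts, etype, value in records if etype == "pg_state_change"]

plt.plot(times, values, marker="o")     # categorical values plotted over time
plt.xlabel("time")
plt.ylabel("state variable")
plt.title("Values of one variable slot for a single event type")
plt.show()
```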
You also need to be able to do things like search, and you have full regex searches. As an example, I'll go here and search for the text "milliseconds" (sorry, I mistyped something there), and I'm taken to an event where there's a match for the text in my search bar. Once again, here's an example of an event that it found in the search.

The blue is the variable text, and this time you can see there are a couple of metrics that it has found as variables. So I can pick one of them, display a chart, and get an interesting plot of that value across all the events of that type and how it changes. I might want to look at this one that looks like an outlier.
I can click on that and get taken there, look around, and so on. This is all about being able to learn the structure of the underlying events and then pull this data out without having to manually build any parsing rules. There's a whole lot more, but I'm going to stop here and hand it back to Larry. Thank you.
Larry Lancaster: So let's talk a little bit about where we are right now. We're picking out application incidents, Kubernetes incidents, and even some security types of incidents, and this seems to be working pretty well. Recently we've had some exciting validation: MayaData, who run managed Kubernetes clusters among other things, reproduced a slew of real-world incidents.

It's an amazing company. Using Litmus, a chaos engineering tool they're involved in, they recreated those incidents, and we were able to pull out 100% of them and put a root cause indicator onto the incident page without anyone having to tell the system anything. So this vision I've outlined for you, while it sounds ambitious, is actually coming true, and that's been very exciting for us.
It's also very easy to deploy, especially in Kubernetes. We deploy a metrics scraper, and then we look for anomalies in those time series and cross-correlate them with log events; that's coming in an upcoming release very soon. What we're saying there is that this should be a one-stop shop for incident and root cause detection for the unknown unknowns, the things for which you may not have created rules, or for which it isn't reasonable to do so.
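As a rough sketch of what cross-correlating metric anomalies with log events can mean in practice (with invented thresholds and a made-up pairing window, not the product's actual method):

```python
def metric_outlier_times(samples, z=3.0):
    """samples: list of (timestamp, value). Return timestamps of strong outliers."""
    values = [v for _, v in samples]
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5 or 1.0
    return [ts for ts, v in samples if abs(v - mean) / std > z]

def correlate(metric_samples, log_anomaly_times, window=30.0):
    """Pair each metric outlier with log anomalies seen within +/- window seconds."""
    pairs = []
    for mt in metric_outlier_times(metric_samples):
        nearby = [lt for lt in log_anomaly_times if abs(lt - mt) <= window]
        if nearby:
            pairs.append((mt, nearby))
    return pairs
```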
That's the kind of thing we feel it's time to do, so that machines can start helping people do their jobs, and people can level up and do more strategic work than dragging through log files. Thank you very much for your time. Here's my contact information, we'd love to hear from you, and we're going to open it up for Q&A now.
Sanjeev Rampal: That was great, thanks Larry and Gavin. We now have some time for questions; if you have a question, please drop it into the Q&A tab at the bottom of your screen. I see one question there from Nikhil. So Larry, the question is: what is the actual learning algorithm that is used? Are you using some kind of neural net?
Larry Lancaster: Sure. I touched on this a little during my presentation. There are really two separate sets of analytics, two sets of machine learning, that we do. The first is on the structure of the data, and I talked about the Swiss Army knife we use there: a continuum of approaches, applied more or less depending on the frequency of a given event type.

At the first stage, for an event type we've barely seen, there's a lot of conservative assumption-making like that. The next step is reachability clustering, which takes a more global view of the lines that have been seen, and then the next stage is a naive Bayes classifier with a global fitness function. Finally, when we go back and amend the structure learning we've already done, we like to use LCS, longest common subsequence, for that. LCS is actually kind of the state of the art for learning log structure.
LCS has some weaknesses with low-cardinality data, so for me it's like polish on a car: it's the last thing you do. With that palette of tools, we've found the approach to be very effective. And then there's the question about anomaly detection.

As you saw, there are a couple of phases there. The first phase is determining that something is anomalous in and of itself, and by that I mean that an event of a specific type happened anomalously in isolation. That has to do with, for example, whether it's an event type we have lots of examples of, so that we have a good sense of its periodicity and of the distribution of the values of the parameters within that event; then you can speak specifically to that. And there are other things as well.
Finally, you look at these events as independent point processes and develop statistics from those processes: their autocorrelations, their cross-correlations, and also the correlations of their activity as a sequence of events versus the values you see in their parameters. What you end up with is something that lets you really start to hone in on incidents. Hopefully that gives you a good explanation of what works for us.
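For a flavor of what scoring an event "in and of itself" might involve, here is a minimal sketch that compares a new occurrence's inter-arrival gap and one numeric parameter against the history of the same event type; the 6-sigma squash and the rest of the details are invented, not the presenters' model.

```python
def self_anomaly_score(history_gaps, history_values, gap, value):
    """Return a rough 0..1 surprise score for one new occurrence of an event type."""
    def z(x, xs):
        mean = sum(xs) / len(xs)
        std = (sum((v - mean) ** 2 for v in xs) / len(xs)) ** 0.5 or 1.0
        return abs(x - mean) / std

    gap_z = z(gap, history_gaps)         # arrived much earlier or later than usual?
    value_z = z(value, history_values)   # embedded number outside its usual range?
    return min(max(gap_z, value_z) / 6.0, 1.0)   # map roughly 6 sigma to a score of 1.0
```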
Sanjeev Rampal: Thanks Larry. Maybe I'll tee up a question of my own: how do you see this product used in combination with other tools? You mentioned you're going to be ingesting Prometheus metrics, so what would a target configuration look like? Does it coexist with a Prometheus-based stack, or an Elasticsearch or EFK kind of stack? Where do you see this coexisting with those technologies?
Larry Lancaster: You know, if you find that you don't need to buy a certain other tool that's costing you money, because now you've got something that will display the data for you, let you search the data and chart the data and all of that, and it's finding incidents for you, that's great; that would be a success for us. But I don't think that's really where we're focusing. We're more focused on building up the value of finding the root cause and giving it to you first.
Sanjeev Rampal: Excellent, thank you. Any more questions? Well, it looks like that's it. So thank you, Larry and Gavin, for a great presentation, and thank you to everyone for joining us. The webinar recording and the slides will be online later today. We look forward to seeing you again at a future CNCF webinar. Thanks, have a great day, and with that we are signing off.