A
I've deployed Online Boutique, a microservices demo app, on a Kubernetes cluster on my laptop using minikube. Now, the catch is that I've deliberately broken it. Let's see what happens when I try to buy something. I'm going to buy the Home Barista Kit. Uh-oh, the dreaded and very generic 500 error, and what I see here really doesn't help explain what happened. Now, in real-life deployments, when something breaks there are two key things: detecting that it broke and finding the root cause. For detection, most companies use some kind of monitoring or APM tool.
A
In this case I was using a Kiali service graph, but it could have been any other monitoring dashboard. What we see here is a lot of red, confirming things are broken, but once again it doesn't shed much light on what happened. In many environments, monitoring is integrated with an incident management tool, with rules that automatically trigger incidents when things go wrong. Here I'm showing a PagerDuty incident that has been created and shows up in a Slack channel. But now it's time for the tough part: we've detected that there is a problem.
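As a rough illustration of the kind of rule-driven alerting described above (not Zebrium's or PagerDuty's actual integration code), a monitoring check might trigger a PagerDuty incident via the Events API v2 and mirror it into Slack. The routing key, webhook URL, and error-rate threshold below are placeholders.

```python
import requests

PAGERDUTY_ROUTING_KEY = "YOUR_ROUTING_KEY"  # placeholder integration key
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def alert_on_error_rate(service: str, error_rate: float, threshold: float = 0.05) -> None:
    """Fire a PagerDuty incident and a Slack message when the error rate crosses a threshold."""
    if error_rate < threshold:
        return
    summary = f"{service}: error rate {error_rate:.1%} exceeds {threshold:.0%}"
    # PagerDuty Events API v2: creates (or deduplicates) an incident for this alert.
    requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": PAGERDUTY_ROUTING_KEY,
            "event_action": "trigger",
            "payload": {"summary": summary, "source": service, "severity": "critical"},
        },
        timeout=10,
    )
    # Slack incoming webhook: mirrors the alert into a channel.
    requests.post(SLACK_WEBHOOK_URL, json={"text": summary}, timeout=10)

# Example: a periodic check might call alert_on_error_rate("checkoutservice", 0.37)
```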
A
Hopefully, at the end of this process, which could take minutes or hours, you can find out what happened and uncover the root cause. Now imagine if this whole process could be automated. The way it would work is that you simply send your logs and, without any training, setup, or rules, you'd be able to see the root cause without any hunting.
A
So with that, I'll show you what the machine learning found when I broke the microservices demo app I showed you earlier. From the time I broke the app, and while the problem was happening, about a hundred thousand log events were generated. Without any rules, our ML distilled this down to just the seven events that you see on the screen.
A
The very first line that prints out when it starts is a warning message saying the OOM test is starting. You can see that as the first line in this root cause report. The cool thing is that it was picked up simply because it was a very rare, in fact probably never-seen-before, log event. On its own it would have been completely harmless, except that it happened to correlate with a whole lot of other things, which you can see as part of the summary on the screen just a couple of lines later.
A
So this is actually enough, in this case, to tell us the root cause, but we also try to show you the symptoms, in other words what happened when the problem occurred. To see that, I'm going to click the related events button, which pulls in the surrounding errors and anomalies that will hopefully explain the problem, or the symptoms that occurred. And in fact you can see here it's doing a really good job, because it's immediately pulling in a bunch of Kubernetes events, and you can see how all the other pods were impacted.
A
You can see, as it moves through, that there are failed probes on the different services: the ad service, the cart service, the checkout service, the currency service and so on. And if you go a little bit further down, you'll even see it pick up where Redis is impacted and restarts. This is all brought in automatically; remember, I haven't hunted or searched for anything here, I just clicked the related events button. So let me go back to the core events, because I'm also collecting Prometheus metrics.
A
In this case, you can see at the bottom that the machine learning tries to correlate any anomalous metrics with what it's picked up in the logs, and so it's pulled in a couple of stats that you see on the left, node_memory_Buffers_bytes and node_memory_Cached_bytes, which are highly relevant in this case. You can see them going from very high values and then dropping right down, presumably as the OOM killer killed off my rogue process.
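As a hedged sketch of how one might pull such a metric over the incident window using the standard Prometheus HTTP API (the Prometheus URL and time range below are placeholders, not part of the demo):

```python
import requests

PROM_URL = "http://localhost:9090"  # placeholder Prometheus endpoint

def fetch_metric(query: str, start: float, end: float, step: str = "30s"):
    """Pull a metric series over a time window via Prometheus's query_range API."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": query, "start": start, "end": end, "step": step},
        timeout=10,
    )
    resp.raise_for_status()
    # Each result holds [timestamp, value] pairs.
    return resp.json()["data"]["result"]

# Example: inspect cached-memory bytes around the incident window (times are placeholders).
# series = fetch_metric("node_memory_Cached_bytes", start=1626100000, end=1626103600)
```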
A
So these are really useful to corroborate a root cause report that you might be writing. Now, the final thing I'll show you is the coolest of them all. We actually take this log line summary, and remember, this is distilled down from the 100,000 events that occurred at the time, and we pass it through the GPT-3 language model with the right prompt.
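As a rough, hypothetical sketch of that kind of summarization call, not Zebrium's actual prompt or pipeline, using OpenAI's legacy completions API (the model name, prompt wording, and API key are assumptions):

```python
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"  # placeholder

def summarize_root_cause(log_summary: str) -> str:
    """Ask a GPT-3 completion model to turn a short list of log lines into plain English."""
    prompt = (
        "The following log lines describe an incident. "
        "Explain the likely root cause in one or two plain-English sentences.\n\n"
        f"{log_summary}\n\nExplanation:"
    )
    resp = openai.Completion.create(
        model="text-davinci-003",  # assumed model; the webinar only says "GPT-3"
        prompt=prompt,
        max_tokens=80,
        temperature=0.2,
    )
    return resp["choices"][0]["text"].strip()
```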
A
This is an experimental feature, and the reason is that we're still tweaking the way we use GPT-3. In general, GPT-3 is only as good as the internet as a whole and all the text it's been trained on. In this case it completely nailed it. There are times when it produces good English sentences that may not be completely relevant to the problem you're seeing, but in general we're getting a lot of use from these. The last thing I'll point out is the sentence that you see here.
A
That is truly a novel sentence; you can't actually find it anywhere on the internet. It was generated based on what we gave the GPT-3 model as a prompt. Now, with that, I'm going to hand it over to Aran Khanna. He's the co-founder and CEO of Reserved.ai, and he's been a fantastic customer of ours for about a year now. In fact, he tried one of our early beta versions, which was the first thing he saw, and he's continued to use the product since. So, thank you very much, Aran, and I'll hand it over to you.
B
Thank you so much for the awesome demo. To give a little bit of background on Reserved.ai: I'm the co-founder and CEO, and what we enable customers like Zebrium, and a lot of other large customers running on the cloud across Azure and AWS, to do is proactively forecast and manage cloud resources in a completely automated way.
B
We enable folks running across these very complex multi-cloud deployments to do things like commitment management, cash flow forecasting, and tax optimization, and, uniquely, we actually buy back over-committed resources from customers, essentially making a market. On a granular level, that means integrating with tons and tons of different APIs: over 300 APIs in AWS land, 200 APIs in Azure, and a ton of different APIs coming out of the Kubernetes clusters that we're monitoring for cost and attribution.
B
What we really found was that, given this wealth of data and the wealth of systems running in Kubernetes built on top of it, there were things constantly changing within the underlying primitives we were pulling from on the Kubernetes side, the Azure side, and the AWS side. While we had this stack running, each component was generating tons and tons of logs, and when there was an error, not even necessarily an error on our side but often an error on the vendor side or even the customer side, as things like IAM roles change, we were not able to very easily get the actual root cause out the other end, whether to forward it to our engineering team, customer success team, or sales team, what have you.
What that really meant was that our critical engineering resources were getting waylaid. As a startup, we like to move fast and build things on behalf of our customers, but at least once a week our engineers were going through all of these different kinds of debugging procedures to find root causes. Even worse was the fact that a lot of these root causes went unnoticed, partly because in many cases, like the out-of-memory case, we took the tack of just throwing more resources at the problem (kind of ironic for a cost-optimization company), until the volumes exploded to the point where we really had to look at them. So that was the state of the world before, and when I heard about Zebrium, honestly, I was a little bit skeptical, and I think my engineering team was too.
B
We use machine learning as well, but we use it in a much more staid way: building predictive models, doing expected-value calculations, and doing risk modeling and market making on the back end. Those are all established things that folks on Wall Street, for example, have been doing for years. This was something new, so I was, let's say, interested but skeptical about whether it could replace the specific DevOps knowledge that was needed before to really go in and figure out what was going on, with this wealth of data streaming in and the error showing up sporadically.
So this was essentially something that we decided to kick the tires on. We started the free trial with the Zebrium folks and installed it on our Kubernetes cluster. It was pretty quick, actually; I was able to do it as, you know, the semi-technical CEO, which was a testament to how easy it was.
B
I didn't even have to pull my CTO or my DevOps folks into the conversation, and literally in the first week AWS had an API change. If you build on the long tail of AWS APIs you'll know what I'm talking about: they'll just change all the time and not tell you about it if you're not on S3 or EC2, for example, and because we're built on that long tail, we have a number of systems there.
B
This was actually a really important thing to catch, because had it not been caught, if a customer went to a certain page it would have caused, you know, a complete error and essentially a service disruption. So this was what piqued my interest and made me say, hey, this starts to make sense; I think it's kind of working here.
B
It was seeing errors that we wouldn't have caught if we weren't looking at the logs, and as we dug into the system, as Larry was showing before, we saw that the correlations and the root causes were really pointing to the exact system, to the exact pod, in this massive array of different services, that was causing the underlying errors. So it actually led to a faster resolution on our side.
B
At that point we were starting to buy in a bit. That was sort of last year, essentially, as we were scaling up, and we've been running the system for over a year now. As we continued to run with it, we saw that the things being caught were consistent; it wasn't just a one-and-done. As we were building and seeing issues on the customer side and the vendor side, we were consistently getting these reports in our Slack channel from Zebrium. This is an example right here, where a customer had issues with their account because they were messing with an IAM role, and we were basically unaware of it entirely; a complaint from the customer would probably have been the forcing function. But because of Zebrium we got the Slack alert, saw that the customer was essentially messing with the role and had this big issue, and were able to escalate to our customer success team proactively, which is fantastic.
B
So we were really delighted by the fact that Zebrium was not only helping us with the steady-state operational pieces of our cloud infrastructure management, but really helping us surprise and delight our customers, because we can get ahead of a lot of these issues in this complex environment without our team having to build very sophisticated internal monitoring tools.
B
This was very much plug and play. And this is a more recent thing, as Larry was showing: usually when Zebrium sends an alert, I'm just shooting it along to my engineering team and saying, hey, go look at this, go look at this. But now I can actually start with these NLP summaries that are coming out and figure out for myself, hey, what's going on? Do I need to just shoot it to my CTO and have him, you know, route it to the right person?
B
No; often, because of these natural language summaries, I can actually understand, even as the CEO of the company, what the errors are, who is responsible, and who owns that piece of infrastructure, and have a much more targeted loop with them. And even now our dev team is starting to look at these, route them much more quickly to the right place, and easily understand the underlying root causes we're seeing in the stream of errors that we get from Zebrium. This is something that I thought was absolutely science fiction before I saw it live, because, as you saw in the demo, the logs are kind of nonsense to anyone, whether a layman or even a sophisticated engineer; they're not really well structured. So the fact that these natural language summaries could be generated with such high fidelity, and very often right (I've not seen a lot of cases where they're wrong; they're very often spot on), was a big draw for us to lean further into this system.
B
That view of the Zebrium product was really important for us to build over the year, to see how it could help us as it developed, not only to move faster on an engineering basis but really on a customer success basis as well. And that is something I didn't even expect when we first integrated with the product, but I was obviously delighted to see it as we moved further and further down the path of integrating Zebrium into our workflows.
A
So when an incident is created in those tools, Zebrium automatically augments it with root cause. The way it works is that you'll probably have some kind of monitoring or APM tool in place, and when it detects something it'll open an incident in, let's say, PagerDuty. Now, with our integration, as soon as an incident is opened or created in PagerDuty, PagerDuty sends us a signal, which is number two in that diagram, and we respond with a root cause report very similar to what you saw in the UI demo.
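Purely as an illustrative sketch of that kind of round trip, not Zebrium's actual integration: a small webhook receiver could accept the incident-created signal from PagerDuty and attach a root cause summary back to the incident as a note. The endpoint path, payload fields, lookup function, and API token below are all assumptions.

```python
from flask import Flask, request
import requests

app = Flask(__name__)
PAGERDUTY_API_TOKEN = "YOUR_API_TOKEN"  # placeholder REST API token

def find_root_cause_report(incident_time: str) -> str:
    """Hypothetical lookup: return a root cause summary for the incident window."""
    return "Root cause: OOM killer terminated a rogue process on the node."

@app.route("/pagerduty-webhook", methods=["POST"])
def incident_created():
    # Assumed payload shape: a webhook event carrying the incident id and creation time.
    event = request.get_json()["event"]
    incident_id = event["data"]["id"]
    summary = find_root_cause_report(event["occurred_at"])
    # Attach the summary back to the incident as a note via PagerDuty's REST API.
    requests.post(
        f"https://api.pagerduty.com/incidents/{incident_id}/notes",
        headers={
            "Authorization": f"Token token={PAGERDUTY_API_TOKEN}",
            "From": "ops@example.com",  # PagerDuty requires a requester email header
        },
        json={"note": {"content": summary}},
        timeout=10,
    )
    return "", 204
```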
A
You can also use Zebrium without an incident management tool. In this case, when something breaks, you just look at the Zebrium root cause dashboard. We're always proactively scanning for patterns that make up a root cause, so all you need to do is click on the relevant one and you'll see a root cause report that helps you troubleshoot the problem. If you don't see a relevant one, all you need to do is click the blue "scan for root cause" button and enter a time, and the machine learning does the exact same thing on demand.
A
To do this, it first needs to be able to accurately categorize all log events, so the first layer of our machine learning is structuring log events. This is done with unsupervised machine learning and doesn't require any manual training.
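As a loose, simplified sketch of what unsupervised log structuring can look like (Zebrium's actual approach isn't described in this talk; the regexes and grouping below are assumptions for illustration), a structurer might strip variable fields out of each line so that events of the same type collapse onto one template:

```python
import re
from collections import Counter

# Assumed variable fields to mask; a real structurer would learn these, not hard-code them.
PATTERNS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<uuid>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b0x[0-9a-f]+\b"), "<hex>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]

def event_type(line: str) -> str:
    """Collapse a raw log line onto a template by masking its variable tokens."""
    for pattern, token in PATTERNS:
        line = pattern.sub(token, line)
    return line.strip()

def structure(lines):
    """Count how often each event type occurs; rare types feed anomaly scoring later."""
    return Counter(event_type(line) for line in lines)

# Example:
# structure(["connection to 10.1.2.3 failed after 3 retries",
#            "connection to 10.9.8.7 failed after 5 retries"])
# -> Counter({"connection to <ip> failed after <num> retries": 2})
```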
A
Once the events are structured, the next layer of machine learning is anomaly detection. There are lots of things that go into anomaly scoring, but the two big ones are events that are rare and events that are bad, like errors or high-severity alerts, criticals and so on. As each new event comes in, we essentially give it an anomaly score; as an example, the rarer the event and the higher its severity, the more anomalous it is.
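A minimal sketch of that idea, assuming a score that simply combines rarity with a severity weight (the weights and formula are illustrative, not Zebrium's actual scoring):

```python
import math
from collections import Counter

# Assumed severity weights; higher means "worse".
SEVERITY_WEIGHT = {"debug": 0.0, "info": 0.0, "warning": 1.0, "error": 2.0, "critical": 3.0}

def anomaly_score(event_type: str, severity: str, type_counts: Counter, total: int) -> float:
    """Score an event: rare event types and high severities both push the score up."""
    frequency = type_counts.get(event_type, 0) / max(total, 1)
    # Rarity term: -log(frequency); never-seen-before events get the largest rarity value.
    rarity = -math.log(frequency) if frequency > 0 else math.log(max(total, 1)) + 1.0
    return rarity + SEVERITY_WEIGHT.get(severity, 0.0)

# Example: a warning with a never-seen-before template (like the "OOM test starting" line)
# scores higher than a frequently repeated info line.
```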
A
Once we've done that, we can look at the metric space for any metrics whose anomalies correlate with the times of the log lines we've just pulled together. This is really cool, because you don't have to curate or tell us which metrics to look at: you can point all your metrics at us and we'll pull in the anomalous ones that are correlated with what we find in the logs.
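A rough sketch of that correlation step under simple assumptions (z-score spikes inside the log-anomaly window; Zebrium's real method is not described in detail here):

```python
from statistics import mean, stdev

def anomalous_in_window(series, window_start: float, window_end: float,
                        z_threshold: float = 3.0) -> bool:
    """Return True if a metric series has an outlier sample inside the log-anomaly window.

    `series` is a list of (timestamp, value) pairs, e.g. from the Prometheus sketch above.
    """
    values = [v for _, v in series]
    if len(values) < 3 or stdev(values) == 0:
        return False
    mu, sigma = mean(values), stdev(values)
    return any(
        abs(v - mu) / sigma >= z_threshold and window_start <= ts <= window_end
        for ts, v in series
    )

# Metrics that pass this check (e.g. node_memory_Cached_bytes dropping sharply as the
# OOM killer fires) would be pulled into the root cause report as corroborating evidence.
```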
A
As you saw in the demo earlier, it's really effective at bringing in corroborating metrics, and all of this is wrapped up into an automatic root cause report that you can see either inside your incident management tool or inside the Zebrium UI. Our ML is able to detect a very broad range of root causes for a very broad range of problems.
A
The other important thing here is that this is not meant to be an exhaustive list of what we can detect; it's just a set of examples across some common categories that we've seen happen in real-life situations across our customer base. Zebrium helps automatically uncover root cause without you having to go hunting through logs, and with that we come to the conclusion of this webinar.