Red Hat OpenShift OpenShift Commons AIOps SIG, 29 Apr 2019

From YouTube: Overcoming Tomorrow's Operational Challenges with AIOps, Phil Tee, Moogsoft, OpenShift Commons AIOps SIG

Description

Overcoming Tomorrow's Operational Challenges with AIOps
Phil Tee, CEO, Moogsoft
OpenShift Commons AIOps SIG
April 29 2019

B

Perfect, super. Well, thanks for inviting us to come along and speak. We're super excited to see Red Hat really dive into the AIOps stuff. It's an area and a market that we at Moogsoft have been in very early, going all the way back in the history of the company to 2012.

B

We've long been convinced that AI, in the particular guise of machine learning and data science, has a pivotal role in how people operate modern infrastructure. And I can't go forward without making a naked pitch for what's going to happen on May the 15th, which is our AIOps Exchange in San Francisco. If you've not heard of it, please Google it and please register; we'd be delighted to have fellow travelers along. So, without further ado, I want to talk a little bit about

B

What we do with AIOps to help people with operational challenges that derive principally from the modern cloud-resident stack. My personal journey with this has been pretty lengthy. I've been involved in service assurance and operations since the early 1990s, having co-founded Micromuse, which wrote the product Netcool, and a bunch of startups in this space, culminating in Moogsoft. And this little slide right here kind of gives you a brief history of time

B

In terms of what I've seen change and morph over that period. If you go back to the original mainframe era, which was sort of dying out as I entered the market, service assurance was actually done extremely well. Part of the reason for that was a single vendor. It was very, very simple, it was pretty low scale, and change cycles were glacial, to say the least.

B

Then, of course, we've all lived through the distributed computing wave, where it started to erode the ability to deliver service in a very straightforward way, just because scale and complexity started to enter the environment, with a multiple-vendor, standards-driven, best-of-breed type approach. And really what's going on in the latest phase, the one we're all living in, is the cloud-resident, digitally transformed enterprise. These are spectacularly complex and high scale. We are seeing almost a sort of chaotic

B

Deployment paradigm, where large enterprises have moved to a fully agile, DevOps application development toolchain, so change cycles are almost instantaneous. That's at very, very, very large scale, and it requires something very different to be done. This is kind of almost a dimbo slide in a lot of ways, just to punch the point home that in yesterday's datacenter, if you were writing an application

B

You wrote an app to run on a specific operating system, to run on an actual physical server. There are two to three interactions between those three components. You have to worry about making sure you have the right version of the operating system and the right version of the hardware, that the app is built in the right way, that the app is built to optimize the bare metal. But still, it was pretty contained.

B

Then you look at today, and actually this is a slice through our cloud in terms of how we deploy our application. The app is deployed by Jenkins and configured there; that is driving Kubernetes, which is driving Docker, which runs on a VM that is controlled by a hypervisor, which runs on an operating system, which runs on bare metal that belongs to Amazon, right? All these are very different interactions, and the dependencies there are

B

Driving two orders of magnitude more considerations just for one application. And of course, as our friends at Federator.ai were pointing out, we're also navigating through the whole multicloud and instance management world as well. Net-net, the complexity is a complete phase transition, a step change from what it was before. And the so-what about that is

B

This is there for a reason, and the reason is digital transformation, where you've gone from organizations being able to happily live in the world of a much more regulated economy

B

Where, say, a Wells Fargo competes with a Bank of America, which competes with a Citibank, whereas now they're having to compete with digitally transformed startups. This is across every segment, every part of the economy, and it drives a step change in how you manage service. It really comes down to this scale and complexity that is driven by a massive upsurge in the number of moving parts in one application. Application infrastructure is much more chatty. The number of data points is measured in billions.

B

The number of failure modes goes up to this 10 to the 120 number, which we worked out from a large enterprise and Yahoo a while back. By the way, I'm an ex-theoretical physicist, and the reason that number is of interest is that it is actually more information than is stored in the observable universe. To give you an idea of how bad it is: the complexity, in terms of what people are having to deal with, is hugely significant.
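The talk does not say how the 10 to the 120 figure was derived. As a rough illustration of how quickly failure combinations grow, the sketch below assumes (purely for illustration) that every component is either healthy or failed and that all combinations are possible, then asks how many such components are needed before the number of system states reaches 10 to the 120. The function name and the two-state assumption are not from the talk.

```python
import math


def components_for_states(exponent, states_per_component=2):
    """Smallest n such that states_per_component ** n >= 10 ** exponent.

    Assumption (not from the talk): components are independent and each
    has a fixed number of states, e.g. 2 for "healthy" or "failed".
    """
    target = 10 ** exponent
    n = 0
    # Python integers are arbitrary precision, so this comparison is exact.
    while states_per_component ** n < target:
        n += 1
    return n


# Exact count for two-state components and the 10^120 figure from the talk.
print(components_for_states(120))

# Closed-form check: n = ceil(120 / log10(2)).
print(math.ceil(120 / math.log10(2)))
```

Under that assumption, a few hundred independently failing components are already enough to reach the quoted figure, which is small compared with the number of moving parts in the stack described above.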

B

So what to do? Well, if you look across a lot of our customers, what they're seeing is this overwhelm of data and lack of information. It's driving all kinds of consequential issues with the quality of service assurance: the amount of data that help desks and SREs have to deal with, the amount of incidents that are actually being captured by monitoring, how you proceed to a resolution of an issue.

B

The restoration of service is very siloed and very linear, and the net-net is that it's horrible. So this is really the appeal for the insertion of AI and data science in operations. It kind of stems from this: back in the day, when things were a lot more predictable and a lot more low scale, you had this kind of state and measurement separation of the world, where you could analyze the system that you were monitoring, work out

B

What the potential failure modes of that system were, monitor for the occurrence of those failure modes, act, and then continue to do this in a continuous cycle. The time around, if you like, the orbit of that left-hand circle was a lot faster than the change cycle of the system that you were monitoring, and in any case the scale of the systems was such that you could take this rules-based approach.

B

Well, in the modern world, in the modern stack, you can't do that. It is so complicated that there isn't the ability to analyze for potential failure modes. You can't

B

Work them out; it is too complicated. So you can't go through this rules-based, model-based approach to monitor for the known failure modes. You really have to combine state and measurement and use the data that you're getting from the monitored systems to give you indications of where there may be service-threatening impacts, allowing you to have a much more fluid approach to doing it. So, long story short, you've got to use techniques of data science to look, in the monitored events and metrics from your systems, for

B

The clues as to where service may be being compromised. The way we do this, the Moogsoft incarnation of this insight, is really a pipeline of algorithms that starts on the left-hand side. We take anything: we take application logs, we take metrics from collectd and StatsD, we take traps

B

Take indications from element management systems; we really don't mind. We use a bunch of information-theoretic algorithms to suppress noise, because, by the way, most of that data is junk and useless. Then what we really do is look in those synthesized flows for groups of self-correlated events that indicate that there is an underlying causal thing that has occurred, which has caused the pattern of alerts to be sent to us. We mine that data for these groups of alerts.
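The two steps just described, suppressing low-information noise and then grouping related alerts, can be sketched in a few lines. This is only a toy illustration and not Moogsoft's algorithm: it assumes alerts are dictionaries with hypothetical `ts` (seconds) and `text` fields, scores each alert by the average self-information of its tokens (so messages made of very common words score low), and groups the surviving alerts by simple time proximity. The threshold and window values are arbitrary.

```python
import math
from collections import Counter


def token_information(alerts):
    """Average self-information (in bits) of the tokens in each alert's text."""
    counts = Counter(tok for a in alerts for tok in a["text"].lower().split())
    total = sum(counts.values())
    scores = []
    for a in alerts:
        toks = a["text"].lower().split()
        bits = [-math.log2(counts[t] / total) for t in toks]
        scores.append(sum(bits) / len(bits) if bits else 0.0)
    return scores


def group_by_time(alerts, window=60):
    """Group alerts that arrive within `window` seconds of the previous alert."""
    groups = []
    for a in sorted(alerts, key=lambda a: a["ts"]):
        if groups and a["ts"] - groups[-1][-1]["ts"] <= window:
            groups[-1].append(a)
        else:
            groups.append([a])
    return groups


# Hypothetical alert stream: a repetitive heartbeat plus a few rarer messages.
alerts = [{"ts": t, "text": "heartbeat ok"} for t in range(0, 80, 10)]
alerts += [
    {"ts": 100, "text": "disk latency high db01"},
    {"ts": 130, "text": "db01 connection refused"},
    {"ts": 500, "text": "payment api timeout"},
]

scores = token_information(alerts)
# Keep only alerts whose tokens are rare enough to be informative.
kept = [a for a, s in zip(alerts, scores) if s >= 3.0]

for group in group_by_time(kept, window=60):
    print([a["text"] for a in group])
```

In this made-up stream the heartbeat messages fall below the threshold, and the two `db01` alerts land in one group because they arrive 30 seconds apart. A real system would need far more than time proximity, which is why the talk goes on to mention topology and text as further signals.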

B

We call them situations, and because we're extracting this entire narrative of the impact, we're able to take a collaborative approach, inviting people into a virtual incident war room to try and remediate the thing, automatically trigger automation, and then mine the interactions for insights for future occurrences. I'm just going to cause the animation to finish; this is a slightly more detailed view of it. So what it really boils down to is sort of four pillars.

B

Entropy is the information-theoretic algorithm that we use to try and eliminate the noise. Correlation is the grouping; these are combinations of unsupervised machine learning and supervised machine learning. We can take indications from time, topology, and text in the data to form these clusters of alerts. We then have a series of information-theoretic and supervised machine learning algorithms to look for root cause inside of those groups of alerts. Some of them are fairly natural.

B

Some of them use terms that you might not be familiar with, like vertex entropy, which is actually fundamental research that we did, where we've delved into how you can take hints from the interrelationships of entities to look at where the highest likelihood is that a service-impacting incident would arise.
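The talk does not define vertex entropy, so the following is only a stand-in to make the idea of "taking hints from the interrelationships of entities" concrete. It assumes a simple degree-based local entropy: each node's share of all edge endpoints is treated as a probability p, and the node's score is its term -p log2 p. The topology, node names, and the scoring formula itself are assumptions for illustration, not Moogsoft's published measure.

```python
import math


def degree_entropy(adjacency):
    """Score each node by -p * log2(p), where p is its share of edge endpoints.

    `adjacency` maps a node name to the list of its neighbours (undirected).
    This is an illustrative degree-based measure, not a definition from the talk.
    """
    total_degree = sum(len(nbrs) for nbrs in adjacency.values())  # equals 2 * |E|
    scores = {}
    for node, nbrs in adjacency.items():
        if total_degree == 0 or not nbrs:
            scores[node] = 0.0  # isolated nodes carry no topological hint
            continue
        p = len(nbrs) / total_degree
        scores[node] = -p * math.log2(p)
    return scores


# Hypothetical topology: one switch with four hosts hanging off it.
topology = {
    "switch": ["h1", "h2", "h3", "h4"],
    "h1": ["switch"],
    "h2": ["switch"],
    "h3": ["switch"],
    "h4": ["switch"],
}

scores = degree_entropy(topology)
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 3))
```

In this small star topology the switch scores highest, matching the intuition that an alert on a highly connected node is a more likely root cause than an alert on a leaf. Note that -p log2 p is not monotonic for large p, so this toy measure should not be read as a general ranking rule.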

B

So there are some peer-reviewed journal articles around that. And then, as I said, we have this collaborative war room, where we're looking for things like situation similarity, the drawing of similarities across situations, prompting remediation hints, to accrete a sort of living runbook, if you like. And if there are any fellow Brits on the call, I always chuckle when I see the little bottom right-hand icon, which is supposed to be a brain

B

That looks like a kind of Graeme Souness wig to me. Anyway, think seventies permed heads.
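The situation similarity idea mentioned in the war-room discussion can be illustrated with a very simple measure. The sketch below assumes a situation is just a set of alert signatures and compares situations with Jaccard similarity, returning the closest past situation so that its remediation notes could be offered as a hint. The incident names and alert strings are hypothetical, and the talk does not say which similarity measure the product uses.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty situations are treated as identical
    return len(a & b) / len(a | b)


def most_similar(current, history):
    """Return (name, score) for the past situation closest to `current`.

    `history` maps a situation name to its collection of alert signatures.
    Returns None when there is no history to compare against.
    """
    scored = ((name, jaccard(current, alerts)) for name, alerts in history.items())
    return max(scored, key=lambda pair: pair[1], default=None)


# Hypothetical past situations and a new one to match against them.
history = {
    "INC-1": {"db01 connection refused", "disk latency high db01"},
    "INC-2": {"payment api timeout", "queue depth high"},
}
current = {"db01 connection refused", "disk latency high db01", "replica lag db02"}

print(most_similar(current, history))
```

Here the new situation shares two of its three alerts with `INC-1` and none with `INC-2`, so `INC-1` is returned as the closest match.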

B

Why am I not getting any joy here from my computer to be able to click on.

A

Are you trying to swap into a demo, or no?

B

I made the slide on for this. It's not really.

A

Perhaps you paused your animation.

B

That's fine. There we go. Let me just get back to this. Are you getting the slides?

B

So we like to think of ourselves as the AIOps leader, and I'll give you a little bit of the reason why we make that claim. We've been at this since 2011. We've filed 50-plus patents; I think 14 was the number granted as of Friday, and it's about 20 that will be granted this year.

B

On top of that, we've raised about a hundred million in funding since 2012, from Goldman Sachs, Redpoint, and Wing. We're about a year away from being cash-flow generative, so we'll never need to raise finance again. We're on our way to scale. We've got 150 companies using our software, and the team is comprised of a bunch of folks that really have been here before.

B

This will be my third at-scale company in this space, and we've got leaders from Splunk, Qualys, AppDynamics, and other businesses in there as well. If you look at the breadth of the IP that we've built, this is just a vista across places where we have either filed patents or produced a peer-reviewed paper. This is generated IP that is in the product: everything from alert significance

B

Timestamp patterns, text-based similarities, use of topology, use of entropy of graphs, and use of deep learning across correlation, entropy, feedback, and root cause all form part of our platform. And that's it from me. Again, I'll just say anybody is welcome to come join us on the 15th of May at the Four Seasons; just hit our website, it's pretty prominent. Any questions? And thank you for listening.

A

What city was that event in?

B

San Francisco. Our HQ is in Levi's Plaza, and we're doing this at the Four Seasons on Market Street.

A

Okay. If you want to post that to the SIG mailing list, I'll put that link in the chat as well, so you can post that for other members.

B

Yeah, we'll definitely do that. Thank you.