Red Hat OpenShift OpenShift Commons AIOps SIG, 25 Mar 2019

From YouTube: OpenShift Commons AIOps SIG What is AIOps Marcel Hild Red Hat March 25 2019

Description

What is AIOps Marcel Hild (Red Hat)
Recorded live at the OpenShift Commons AIOps SIG
March 25 2019

A

Thanks, Diane. My name is Marcel Hild, and I work in a group in Red Hat's Office of the CTO called the AI Center of Excellence. Obviously we're looking into all things AI-related, and I'm specifically focused on the broader theme of AIOps and what it means for Red Hat and the general community.

A

So in the past years we managed the problem of ever-increasing demand for computing power and applications in general by separating the concerns inside the data center. We needed to support a high-paced environment asking for agility in application development and operations.

A

So changes to applications were driven by the developers, and they change their systems multiple times a day, if you look at continuous rollouts and continuous deployments. So what we did: we disconnected the components and brought them together again via microservices. And if you look at it, the same concept applies throughout the whole stack: distributed compute, distributed storage, distributed applications, and orchestration of services. With cloud-native tooling such as containers and Kubernetes, we're basically able to scale out infinitely. Obviously, this comes at a certain price.

A

More components mean more complexity. But hey, in IT operations in the DevOps world, we need to know when something isn't working, and we need to know it now, because we've committed to those five nines of uptime, SLAs and stuff. So to control these complex systems, we need to introduce some sort of instrumentation; some telemetry is required. And so we produce more metrics, more logs, more stuff.

A

And again, we do this at every layer of the stack, because every persona brings different needs: developers need stack traces, operations folks need latency and uptime metrics. So how can a single human possibly comprehend such a system? We're creating complex systems of rules, alerts and thresholds, and guess what: we can't keep up with updating our alerts, because the systems being monitored change at a faster pace.

A

So it's a classic cat-and-mouse problem, so to say. But in our case we're in luck, because a smart person once said: every repeatable human task that is governed by a probabilistic relationship between input and output will be subject to automation at some time. And to be fair, with all the AI hype

A

Nowadays, that's the current state of the art: it boils down to precisely this, training the computer to do very specialized tasks. We excel at seeing patterns in huge piles of data with millions and millions of input features, and to the uninitiated

A

It might look like magic, but at the bottom it's a classification problem: if input A, then output X. And actually we're doing this in all these fields that have a great and powerful monetary background, like showing you the most relevant cat images, or great media coverage, like beating human players in almost every game out there. But what about operations?

A

What about ops? I think here we're just starting to apply all these techniques to our very own special field. In other words, if your website is slow because your storage is slow, a computer can tell you that. And even better, if your website is slow because somebody flipped a bit somewhere in a not-so-distant system, then with sufficient input and sufficient training data a computer can possibly also tell you that: yes, your website is slow, and if you flip this bit back, it's going to be fast again.

A

So what's AIOps anyway? Gartner coined this term some years ago, and it goes like this: AIOps platforms are software systems that combine big data and AI or machine learning functionality to enhance and partially replace a broad range of IT operations processes and tasks, including availability and performance monitoring, event correlation and analysis, IT service management, and automation. There's a lot of words in there, and I've provocatively highlighted these words: "AI replaces IT operations", because people tend to think like this, like it replaces truck drivers at some point.

A

Although the self-driving cluster will probably be marketed sooner than we think, I don't think that IT operations will be replaced anytime soon. But we will certainly use big data and machine learning to support our monitoring and automation needs. It's just another tool to make you more effective and efficient, a tool to support us and not to replace us. And indeed this is something that we at Red Hat believe in, and we are invested in it.

A

This is another good quote from one of our team members: the key next step for systems management and software development is the replacement of heuristics and fixed limits with learned models.

A

That's by Uli Drepper from our CTO office. And so not only for operations but also in our development processes, we have to apply that bit of machine learning support too. Again, this is how Gartner describes an AIOps platform: at the center there's big data and machine learning, and then it's a cycle of continuous insights being delivered to these three domains. Here, monitoring in the upper left corner will benefit from smart alerting and dynamic thresholds.
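
The idea of dynamic thresholds, as opposed to fixed limits, can be illustrated with a very small sketch. This is only an illustration under stated assumptions, not something shown in the talk: the function name `dynamic_alerts`, the window size, the factor `k`, and the sample latency values are all hypothetical, and a rolling mean plus standard deviation is a simple statistical stand-in for a learned model.

```python
from statistics import mean, stdev


def dynamic_alerts(samples, window=5, k=3.0):
    """Return the indices of samples that exceed a threshold derived
    from the preceding `window` samples (mean + k * standard deviation).

    Hypothetical helper for illustration; not part of any monitoring tool.
    """
    alerts = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        # The limit moves with the recent behaviour of the metric,
        # instead of being a hand-maintained fixed number.
        limit = mean(history) + k * stdev(history)
        if samples[i] > limit:
            alerts.append(i)
    return alerts


# Made-up latency samples in milliseconds with one spike at index 6.
latency_ms = [100, 102, 98, 101, 99, 100, 250, 101, 100]
print(dynamic_alerts(latency_ms))
```

A fixed limit of, say, 300 ms would miss the spike in this made-up series, while the threshold derived from recent history flags it. Note that right after a spike the window is inflated, so a real system would likely need something more robust.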

A

As seen before, the service desk will move from a reactive to a more proactive engagement model, with higher efficiency when it comes to troubleshooting and stuff. And ultimately your actions are highly automated, at some point with less and less human interaction. I think this perfectly aligns with the four phases of AIOps. So first: without data you're nothing. Data is the new oil, and so we need to get our data collection straight.

A

We need to make sure that we have systems that emit the required telemetry and that we're able to store it for a longer term than just your two days of retention period. Plus you need some tools for visualization; images convey meaning. A log file entry might be obvious to you, the author of that log-emitting system. But what about all the metadata in that entry? How do you paint a broader picture over time?

A

Then we need some tooling to help us discover patterns in that data and help us understand these patterns and correlations, because, make no mistake, there won't be a one-size-fits-all solution for everybody. You will still need to assist the computer, and the computer will support you in your understanding of that problem domain. You're still in the driver's seat. And after learning from the past, you want to apply your knowledge to some future events.

A

You want to know that your application needs to scale out before you actually hit all that traffic.
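
Knowing that you need to scale out before the traffic arrives amounts to forecasting. A minimal sketch, assuming evenly spaced samples and a roughly linear trend: the names `fit_line` and `steps_until`, the request rates, and the capacity figure are hypothetical, and an ordinary least-squares line is used here only as the simplest possible predictor.

```python
def fit_line(ys):
    """Least-squares fit of y = slope * x + intercept for x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = (
        sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
        / sum((x - x_mean) ** 2 for x in xs)
    )
    intercept = y_mean - slope * x_mean
    return slope, intercept


def steps_until(ys, capacity):
    """Estimate how many sampling intervals after the last sample the
    fitted trend reaches `capacity`. Returns None if the trend is not rising.
    """
    slope, intercept = fit_line(ys)
    if slope <= 0:
        return None
    x_hit = (capacity - intercept) / slope
    return max(0.0, x_hit - (len(ys) - 1))


# Made-up request rates, one sample per interval.
requests_per_s = [100, 120, 140, 160, 180]
print(steps_until(requests_per_s, capacity=300))
```

With these made-up numbers the trend grows by 20 requests per second per interval, so the assumed capacity of 300 is reached six intervals after the last sample, which is the lead time available for scaling out.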

A

And last but not least, you finally want to know the reason why something is failing, on the spot or post mortem. This is the classic needle-in-the-haystack problem: let the computer guide me to that flipped bit that caused my outage somewhere.
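
As a very rough sketch of how a computer could narrow down the haystack, one can rank recorded changes by how shortly before the outage they happened. This is a plain time-proximity heuristic, not a learned model and not a method described in the talk; the function `rank_suspects`, the timestamps, and the change descriptions are all hypothetical.

```python
def rank_suspects(changes, outage_ts, lookback_s=3600):
    """Return descriptions of changes made within `lookback_s` seconds
    before the outage, most recent change first.

    `changes` is a list of (timestamp_in_seconds, description) tuples.
    """
    in_window = [
        (outage_ts - ts, desc)
        for ts, desc in changes
        if 0 <= outage_ts - ts <= lookback_s
    ]
    # Sorting by age puts the change closest to the outage first.
    return [desc for _, desc in sorted(in_window)]


# Made-up change log; timestamps are seconds on an arbitrary clock.
changes = [
    (500, "storage firmware update"),           # too old for the window
    (4200, "feature flag flipped on checkout service"),
    (4500, "deploy of web frontend"),
    (5200, "cache resize"),                     # happened after the outage
]
print(rank_suspects(changes, outage_ts=4600))
```

In practice, proximity in time is only one signal; correlating changes with metrics and logs is where the training data mentioned earlier would come in.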

A

And I think what we're all currently missing is a larger community for AIOps. In every emerging technology we're seeing a multitude of projects and products addressing similar problems with different approaches, and this is good; variety is the spice of life. But remember the complexity of our problem space, how many components need to talk to each other. And I think so do we.

A

So we also need to be at least aware of each other, discuss and share our experiences and our reasoning for why we took specific approaches, and then find ways to collaborate with each other, be it by integration of projects or products, or maybe even by creating open source projects that solve a common problem, ultimately leading to standards that define the way our machines collaborate: standards for how to run things, how to talk to APIs, or how we store and query data. And I think here we can build on the success of the operations community.

A

There, standards like OpenMetrics or OpenTracing are slowly emerging. I think we can accelerate the speed of adoption of such standards and common open source tooling by having a voice in the definition of such standards, and I think a SIG is a great way and place to do such a thing. And with that, I would like to open it up for discussions.