From YouTube: AIOps on OpenShift with Sunny Siu and Tushar Katarki
Description
AIOps on OpenShift with Sunny Siu of ProphetStor and Tushar Katarki of Red Hat.
Filmed on October 28th, 2019 in San Francisco
A: But that's the exciting thing. I've been at Red Hat for several years now, doing OpenShift product management for three or four years, and I've done several of these Commons gatherings. So it's always great to see you all here. Thanks particularly to our customers and partners for coming and talking; we always appreciate that. So without further ado, let me introduce Sunny.
B: Yes, I'm a co-founder and president of the company ProphetStor. We develop AIOps solutions focused on the OpenShift workload, and thanks to Red Hat we get introduced to a lot of different customers who are trying to understand their workloads and optimize cost and resources on the cloud while supporting the OpenShift workload. I'm going to give you a little more detail after Tushar gives an introduction.
A: But just to tee it up with a teaser: earlier, when Discover was talking about their use case, they were talking about how you set CPU and memory requests, how much they need, and the limits for that. Some of the feedback that we hear from customers is that, although that's a good thing, it's actually hard for customers to guess what those values should be.
A: So not everybody knows how to set them, or what value to set for a particular app or pod for the CPU requests and limits. And the other thing is, even if they know, it changes over time. So I think ProphetStor has a fantastic solution for this, which you'll hear about. That's kind of the teaser.
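(For readers of this transcript: a minimal sketch of what "requests and limits" means here, using the official Kubernetes Python client. The image name and resource values are illustrative assumptions; choosing those values well is exactly the problem the speakers describe.)

```python
# A minimal sketch of CPU/memory requests and limits on a pod, built
# with the official Kubernetes Python client. The numbers here are
# illustrative guesses -- which is exactly the problem being described.
from kubernetes import client

resources = client.V1ResourceRequirements(
    requests={"cpu": "250m", "memory": "256Mi"},  # what the scheduler reserves
    limits={"cpu": "500m", "memory": "512Mi"},    # hard cap before throttling/OOM
)
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="demo"),
    spec=client.V1PodSpec(containers=[
        client.V1Container(name="app", image="registry.example/app:1.0",
                           resources=resources),
    ]),
)
# Print the manifest this would submit (no cluster needed for this step).
print(client.ApiClient().sanitize_for_serialization(pod))
```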
A: So, without further ado, let's get right into it, and I do have this clicker I can use. So what is AIOps? I mean, IT operations today is time-consuming, and that kind of gives IT a bad rap. How do you change that using AI and machine learning? That's AIOps: platforms and software systems that combine data and AI functionality to enhance and replace a broad range of IT operations processes and tasks, things such as availability and performance monitoring, event correlation and analysis, and IT service management and automation.
A: So why should we care about this? I think we intrinsically understand why this is important. For example, we want to know what's happening to a computer, and if there are alerts, if there are things that are going to fail, we want to know about that. And you can imagine, at a cluster level or a multi-cluster level, when there are hundreds of nodes and thousands of projects, there is no way to keep up with that manually. This is the classical case.
A: According to this analyst, I think this is IDC: 50 percent of IT assets will have the ability to run autonomously using embedded AI, and so this is the path to these smart IT and facilities systems.
A: So what is the evolution of the AIOps capabilities? What are the steps, what are the phases? This actually gives you a nice picture. On the left, it starts with monitoring: how do you observe that complex system? After that, what kind of insights can you get from the data that you have collected? Can you use machine learning to get those insights? And then, finally, can you act upon it using some automation? So if you think about the evolution on the right side, you'll see data collection and visualization is the first step. It's no different from any other machine learning pipeline and workflow.
A: Then you want to understand the patterns and discover them in that AIOps workflow, and then you want to do some kind of prediction. In this example, we talked about how you might be running out of CPUs at some point, so auto-scaling could be one example of it. Or it could be making predictions about network attacks, if you are worried about security. So that's another application. Those are some of the predictions that you can do, and then you do some kind of root cause analysis: you determine that this is the problem, and then you do a remediation of that.
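(A toy sketch of that predict-then-act step. This is not ProphetStor's or Red Hat's algorithm; it just fits a linear trend to recent CPU samples and decides whether to scale before capacity runs out. The threshold and sample values are invented for illustration.)

```python
# Toy illustration of the "predict, then act" loop described above:
# fit a linear trend to recent CPU utilization samples and extrapolate.
import numpy as np

def forecast_cpu(samples: list[float], horizon: int = 6) -> float:
    """Fit a linear trend to CPU utilization samples (0-100%) and
    extrapolate `horizon` steps past the last sample."""
    t = np.arange(len(samples))
    slope, intercept = np.polyfit(t, samples, 1)
    return slope * (len(samples) - 1 + horizon) + intercept

# One sample per 5 minutes; utilization creeping upward.
recent = [41, 44, 48, 47, 53, 58, 61, 66]
predicted = forecast_cpu(recent, horizon=6)   # ~30 minutes ahead

if predicted > 80:
    # In a real system this would call the cluster autoscaler or
    # patch a replica count instead of printing.
    print(f"Predicted CPU {predicted:.0f}% > 80%: scale up before it happens")
```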
A: So, real quick, what have we done? It's been an evolution for us, from an OpenShift perspective and a Red Hat perspective, in how we are approaching this. I'll start with the top left here. We first started with introducing and strengthening the observability piece. If you go back as far as OpenShift 3.7, we introduced Prometheus in tech preview, and then in 3.11 we GA'd it and had an operator for it. So Prometheus and the corresponding alerting stack are kind of the start of this observability story for us. Then, with the Istio service mesh, we now have OpenTracing.
A: Metering has GA'd in the most recent release, which allows you to generate reports based on CPU and memory usage, et cetera. Then we are doing this other thing called Telemetry, in which we are collecting this Prometheus data and not confining it to a cluster: we are aggregating it as a service, so that we can do some data mining with it, and I'll talk a little bit about that. By the way, the technology or platform that we use to collect it is Insights, which has been there for some time from Red Hat. Insights basically gives you some insights about the deployed system from a Red Hat point of view; I won't go into the details. So once we have the data, we collect it, we analyze it, we develop some insights, and then we bring that to the customer. So that's the first piece, the observability piece of it. The second piece of it is really this:
A: You can get an alert; okay, we got an alert, and now you can do some automation tasks with it using Ansible Tower, so that's an integration that we have. We have other things in the portfolio, like Red Hat Decision Manager, which is a rules-based system, and business process automation.
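(A hedged sketch of that alert-to-automation hook: a tiny HTTP receiver for Alertmanager's standard webhook payload that launches an Ansible Tower/AWX job template over its REST API. The Tower URL, job template id, and token are hypothetical placeholders, not details from the talk.)

```python
# Sketch: receive Alertmanager webhook posts and launch a remediation
# job template in Ansible Tower/AWX. Placeholder values throughout.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

TOWER = "https://tower.example.com"   # hypothetical Tower/AWX endpoint
JOB_TEMPLATE_ID = 42                  # hypothetical remediation playbook
TOKEN = "REDACTED"                    # OAuth2 token for the Tower API

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        payload = json.loads(body)    # Alertmanager webhook JSON format
        for alert in payload.get("alerts", []):
            if alert["status"] == "firing":
                # Launch the job template via the Tower REST API.
                requests.post(
                    f"{TOWER}/api/v2/job_templates/{JOB_TEMPLATE_ID}/launch/",
                    headers={"Authorization": f"Bearer {TOKEN}"},
                    json={"extra_vars": {
                        "alertname": alert["labels"].get("alertname")}},
                )
        self.send_response(200)
        self.end_headers()

HTTPServer(("", 9000), AlertHandler).serve_forever()
```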
A: Some customers I know are using some of these advanced techniques to create rules on how to react to some of these conditions, and then how to automate it using business process automation. And then there is the connected customer, which is really our program wherein customers with OpenShift 4 are sending us this telemetry data.
A: Just based on the telemetry data, we are able to determine that there are bugs in the system. So that's the connected customer. We want to do more than that, obviously, and that's where some of the Open Data Hub work is really helping, so this is something exciting for us. Then, finally, we have the automation piece, because of course now you want to automate all of this. We heard about operators and the Operator SDK; we talked about the immutable host, which is our operating system.
A: We have improved the install experience with operators in OpenShift 4, and we have a whole bunch of other things that we are doing in this space. So this is the example of what I was talking about, the OpenShift telemetry. This is the dashboard that we see at Red Hat for the connected clusters I was talking about. You can see that we get some information: there's a graph on the dashboard which shows the number of connected customer clusters.
A: Are there any errors in the system? So, double-clicking on this, the AIOps examples that we are doing are things such as log anomaly detection and outliers: we are analyzing the logs that we collect and determining if there are any outliers, and we use that to improve the system, the product itself.
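(A minimal sketch of the idea behind log outlier detection, not Red Hat's actual pipeline: reduce lines to templates by masking numbers, then flag templates that almost never occur.)

```python
# Frequency-based log outlier detection, in miniature: lines that share
# a template (after masking numbers/hex) are common; rare ones stand out.
import re
from collections import Counter

def template(line: str) -> str:
    """Mask numbers/hex so 'took 3ms' and 'took 5ms' share a template."""
    return re.sub(r"0x[0-9a-f]+|\d+", "<N>", line.lower())

logs = [
    "GET /healthz 200 took 3ms",
    "GET /healthz 200 took 5ms",
    "GET /healthz 200 took 4ms",
    "panic: runtime error at 0x7fd3 during upgrade",
]
counts = Counter(template(l) for l in logs)
for line in logs:
    if counts[template(line)] <= 1:   # rare template => outlier
        print("outlier:", line)
```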
A: The second one really is cluster roll-out monitoring. One of the important features of OpenShift 4 is the over-the-air updates, so we are able to push over-the-air updates to the connected clusters.
A: Now, what we determined is that if we monitor that upgrade process, and if there are any anomalies, if there are any problems with it, then we are able to detect that and take corrective action based on it. Similarly, we are doing anomaly detection with metrics, which is the Prometheus anomaly detection, and then we are doing the workload prediction and resource optimization, which is something that Sunny is going to talk about with their technology. So this is kind of the big-picture vision.
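(A minimal sketch of metrics anomaly detection against Prometheus, again not the actual Red Hat implementation: pull a series over the standard HTTP range API and flag points more than three standard deviations from the mean. The endpoint URL and query are assumptions.)

```python
# Pull a metric from Prometheus' range-query API and flag z-score outliers.
import time

import numpy as np
import requests

PROM_URL = "http://prometheus.example.com:9090"  # hypothetical endpoint

def fetch_series(query: str, hours: int = 6, step: str = "60s") -> np.ndarray:
    """Query Prometheus' /api/v1/query_range and return the first series."""
    end = time.time()
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": query, "start": end - hours * 3600,
                "end": end, "step": step},
    )
    resp.raise_for_status()
    values = resp.json()["data"]["result"][0]["values"]
    return np.array([float(v) for _, v in values])

values = fetch_series('sum(rate(container_cpu_usage_seconds_total[5m]))')
z = (values - values.mean()) / values.std()
anomalies = np.where(np.abs(z) > 3)[0]   # points more than 3 sigma out
print(f"{len(anomalies)} anomalous samples out of {len(values)}")
```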
A: I won't go into a whole lot of detail on this, but you can kind of guess what we are doing here. We are collecting the data both at the cluster level and above it, we are aggregating that, and then we are taking action with AI and ML. The other thing that's important is the word "we" here, and that's why I'm here with Sunny: obviously, a lot of these things we are not able to do on our own. On the other hand, with OperatorHub, we are encouraging our partners to take advantage of the open ecosystem to bring all these tools and technologies to you. So that's kind of the introduction to what we are doing. Right, over to Sunny, to double-click on exactly what's happening with ProphetStor's Federator.ai. Thank you. Thank you.
B: Alright, so in the remaining 10 to 15 minutes I'm going to give a bit more detail on how our solution works on OpenShift, and the value proposition. I just want you to see this slide, because it shows a survey by RightScale, now a part of Flexera. They did a survey earlier this year of around eight hundred enterprises, and about half of them, around 400, have more than 1,000 employees. And this is the conclusion: they said that cloud cost optimization is the number one priority for the majority of these enterprises.
B: And the other conclusion is that it doesn't matter how long you have used public or private clouds: cost continues to be the number one priority. Our solution tries to address this very important issue. What we have learned from our customers, OpenShift customers, and other partners is that usually the CIO will get a shock about the bills when the developers are just using the public cloud freely.
B: Okay, so our solution is trying to address that, and let me tell you a bit about how it works. Specifically, the pain point we address is that if you deploy your applications on the cloud, most users or developers will not know exactly what resources on the cloud are needed to support the applications. Right now it's all guesswork, and adding to this, your application workload is quite dynamic: containers are very dynamic and sometimes short-lived, but you deploy many of them on a weekly or daily basis.
B: For a machine learning workload, it would be GPU resources, which are very expensive and charged on an hourly basis. And if you are in a major enterprise, you may have many different divisions and projects, and each one of them might ask the CIO office for some cloud resources. Again, they don't know what kind of resources they will need to support the application, so it's all best guess, and it usually takes a long time.
B: Last time at Red Hat Summit, one of the major OpenShift customers from Europe, in the automotive industry, said that in certain cases some of the divisions only use 10% of what they get allocated. So there is a lot of wastage. This slide just shows a very high-level overview of how our solution works; Federator.ai is our solution.
B: We can show that, compared with native Kubernetes, not using any mechanism at all, some workloads can get up to 70% in cost savings, so the ROI is quite significant. And as I said earlier, if the customers allow us, we can execute and dynamically auto-scale the cluster for them. Just to summarize the reason why we can do this: we dynamically and continuously predict all the workload and resources on different time scales; the time scale could be the next hour.
B: And this is a screenshot of the actual solution. The upper part is the CPU prediction. The blue curve represents the actual customer workload, and our dotted line represents our prediction. As you can see, it doesn't exactly overlap, because we are just doing prediction, but it follows the general pattern quite well, and the one in the red box is actually our prediction. The green line represents our resource recommendation.
B: Customers are actually not really interested in the particular curve of the actual workload; they're interested in what resources are needed to support the application workload. In this case, the upper graph is the CPU, and the one below represents the memory workload. We give a margin, because we want to make sure it never runs out of memory, so we give a margin of 15 to 20 percent, and this can be configured by the user. And we are right now a Level 5 certified operator; thanks so much for the Red Hat support in getting this done.
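(A toy sketch of the recommendation arithmetic just described: take the predicted peak and add the configurable 15 to 20 percent safety margin so the workload never runs out of memory. The numbers are invented for illustration.)

```python
# Recommendation = predicted peak plus a configurable headroom margin.
def recommend(predicted_peak_mib: float, margin: float = 0.15) -> int:
    """Return a memory request (MiB) with headroom over the prediction."""
    return int(round(predicted_peak_mib * (1 + margin)))

peak = 870.0                        # predicted peak memory over the window
print(recommend(peak))              # 1000 MiB with the default 15% margin
print(recommend(peak, margin=0.2))  # 1044 MiB if the user configures 20%
```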
B: Of course, some customers negotiate a certain break, but the rest, 95% of the customers, will use the standardized costs. And as I said earlier, we provide much better workload management than the native Kubernetes mechanism. This is another screenshot: once we learn about the workload, because the cost varies with different workloads, we can compare providers. This is a very small cluster we deployed, and it's an actual customer workload.
B
It's
just
example
in
this
particular
case,
it
turns
out
that
Amazon
is
the
cheapest
so
and
just
want
to
show
you
some
use
case.
This
is
a
leading
market
research,
company
market
research,
firm
based
in
Boston.
They
are
migrating
a
number
of
on-premise
VMware
based
workload
to
your
oven
shift
on
AWS,
and
they
have
no
idea
how
to
sized
AWS
cluster
to
support
this
continue
eyes,
workload
so
running
our
tools
right.
B
This
is
the
one
a
different
use
case,
but
the
GPU
cluster
is
a
pretty
sizable
cloud
provider.
Is
a
government-funded
high
performance
computing
center,
allowing
all
the
GP
resources
to
be
used
by
the
enterprise
in
Taiwan,
University
and
research
labs?
They
have
over
2,000,
GPUs
and
I
have
to
say
they
spend
about
close
to
20
million
dollars,
buying
those
GPUs
systems
from
Nvidia
over
9000
CPUs.
B
So
it
turns
out
that
they
already
lk
over
90%
of
the
cheapy
resources
to
these
users
right,
even
though,
most
of
them
only
using
3%
of
it,
okay,
so
very
inefficient,
but
now
using
our
tools
to
analyze
or
the
work
law,
they
can
raise
from
30
percent
utilization
to
80
percent.
So
that's
very
significant
increase
in
utilization
and
the
return
on
the
investment
ROI
and
using
our
toys
is
almost
10
times.
B: Okay, and in addition to providing this GPU workload visibility and resource prediction, we also give them some performance anomaly detection. And that's basically all I have to say. If you want to learn more about the solution, go to our website or just send me an email. And again, I want to thank Red Hat for inviting me to give this talk. Thanks, Tushar.