From YouTube: "Beyond AIOps: How to Open Source Operations", Marcel Hild (Red Hat), OpenShift Commons Gathering 2021
Description
Beyond AIOps: How to Open Source Operations
Marcel Hild (Red Hat)
OpenShift Commons Gathering on Data Science
January 28, 2021
https://commons.openshift.org/gatherings/OpenShift_Commons_Gathering_on_Data_Science.html
To find out more about OpenShift Commons, visit https://commons.openshift.org
A: I'm going to talk about the topic "Beyond AIOps". I've been looking into AIOps as a manager now, having started off as a senior software engineer and then principal software engineer in the AI Center of Excellence in the Office of the CTO, working on all things AI and operations and how that evolved over time. So it'll be a little bit of a journey, because my perspective on it changed over time.
A: So let's start with the ground definition from Gartner, which says: "AIOps platforms are software systems that combine big data and AI or machine learning functionality to enhance and partially replace a broad range of IT operations processes and tasks, including availability and performance monitoring, event correlation and analysis, IT service management, and automation." That's a whole lot of words.
A: Some people might interpret that as "AI is going to replace IT operations". Maybe that's a spin some people out there are taking. But if you take a closer look at those AIOps features, it boils down to pretty "simple", in air quotes, stuff: finding a baseline for your metrics, simulating the future, finding some correlation with your incidents, doing anomaly detection, which basically means I'm predicting the future and, if reality deviates from the prediction, I'm seeing an anomaly. And hopefully it helps you find the root cause of your problems.
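That anomaly-detection recipe (fit a baseline, predict forward, flag deviations) can be sketched in a few lines of Python. This is a minimal illustration of the idea, not anything shown in the talk; the rolling window, the threshold `k`, and the sample metric values are arbitrary assumptions.

```python
# Minimal anomaly-detection sketch: the baseline is a rolling mean,
# and a point is an anomaly if it deviates more than k standard
# deviations from that baseline (i.e. from the "predicted future").
def detect_anomalies(series, window=5, k=3.0):
    """Return the indices of points that deviate from the rolling baseline."""
    anomalies = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mean = sum(recent) / window
        std = (sum((x - mean) ** 2 for x in recent) / window) ** 0.5
        # Flag the point if it is more than k std-devs away from the baseline.
        if abs(series[i] - mean) > k * std + 1e-9:
            anomalies.append(i)
    return anomalies

cpu_metric = [10, 11, 10, 12, 11, 10, 11, 50, 11, 10]  # spike at index 7
print(detect_anomalies(cpu_metric))  # -> [7]
```

Real AIOps tooling would replace the rolling mean with a proper forecasting model, but the "deviation from prediction equals anomaly" framing is the same.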
A: But it's not like you have an AI running your system or doing the job for you. It's not even the case that AIOps is a product in a box. I think it's more like a marketing term nowadays, like when "cloud" hit the market and everybody was doing cloud, and you'd talk to different people and everybody had a different opinion about what cloud actually means. So these days you see a lot of products, not just old ones but also new ones, in the monitoring space and the tracing space, slapping "AIOps" onto their products. And people think: oh yeah, I'm going to buy this product and then, boom, I'm cloud native, I'm AIOps, and my ops people will do more interesting stuff, because I now have a product doing the tedious work. But in reality it's more like running smaller experiments across your ops people and then developing the capabilities that leverage AI to bring you to the next level. I'm seeing it more as a cultural shift, similar to what we saw with dev and ops during the DevOps movement.
A: So the developer people started using the tools from the ops side of the house: with cloud native, you as a developer can deploy a whole complex scenario that previously was only available to the ops people, because you needed to spin up so many servers and such. Nowadays you can do that with just one command, cut and pasted from the internet, and boom, you have a multi-node deployment. And similarly, the ops people started using the tooling from the development folks, becoming YAML engineers and codifying their operational landscape.
A: So using the tools from those two domains brought us DevOps, and SRE folks are an implementation of DevOps. I think it's similar with AIOps: you see the DevOps people using tooling from the AI side, from the data science side. That might start with just using EDA tools for data analysis, Jupyter notebooks and the like, to detect the signals in the data.
A: So you start out with assisted AI, which basically means the AI helps you discover things. It tells you: look, here's an anomaly; here's a correlation between some things. But in the end you have to make the decisions. I think that's where we currently are. Wrapping that in a box gets you to augmented AI, where smaller parts of my operational domain are taken over by autonomous AI agents doing some rollbacks, or merging patches, or something like that.
A: And if you take that even further and you have an autonomous AI running your deployments, then at some point we're there: we have really massive scale, where a smaller number of people manage larger parts of the data center. And that's where we want to get to, right?
A: So analysts are saying 100 times cost reduction for operating infrastructure if we reduce the operational costs. And the idea here is building up the competence, encapsulating that competence in something that you create inside your ops team, gathering the observations, and then feeding that back into building the competence and building out the tooling. So every ops team needs to go through this cycle. It's not a vicious circle, it's a continuous improvement circle, right?
A: So that's kind of sad, right? Because in the end, with every environment, with every deployment, with every customer, you are taking your data and then you are training your model in your AIOps product. Nice, good. But you don't get a pre-trained thing from a vendor, because the only thing you get from the vendor is the tooling to train the models.
A: And for this we have to go a little bit back in time. Before open source, there was code, right? Code was the secret: we compiled it into binaries and we made money out of those binaries. On the left side, the code was more valuable than the operations of the code; operating it at scale was left to the folks in the basement, so that was more or less an afterthought.
A: So if the value in IT is in ops, and ops are proprietary, then open source has a problem. And this is something that Matt Asay, a cloud and open source executive at AWS, is asking: what happens if you open source everything? That's exactly what Yugabyte did when they dumped the open core model.
A: Instead, they released all of their code as open source. So they are saying that open core doesn't work, which is good, but also that releasing the complete stack as open source is the better example now, because essentially the value is in operating the software, and they are also giving people the Yugabyte Platform to run this database for their customers.
A: So are they really open sourcing everything? No. "Everything" is the code, but not the ops platform. That's still left to the customer, or you're buying it from a service provider. And don't get me wrong, there's nothing bad about that, right? You can move these capabilities out to people who do this for you. But democratizing software through open source brought so much innovation, and I think the same should be true for operations.
A: So we're trying to do this with the Operate First initiative. Operate First is an initiative to operate software in a production-grade environment, bringing users, developers, and operators closer together. Ideally, Operate First becomes a partner to "upstream first" as a basic tenet of our workflow, Red Hat's workflow being the open source workflow: upstream first, meaning that if we productize something, the work goes into the upstream project first.
A: So what we're currently doing with the Massachusetts Open Cloud and OpenInfra Labs at Red Hat is launching this initiative, where we want to operate upstream projects at scale. I mean, we're starting small, so we're not at that scale yet. But we want to embrace upstream communities to give them a chance to operate their projects in a cloud-native environment.
A: Hopefully we also identify the bugs and the shortcomings, or the edge cases, that a customer would run into, right? And we're spicing this all up with OpenTelemetry, OpenTracing, open ops: sharing all those best practices and tools and deployments with the community, so that we can replicate them into other deployments and learn from each other. That's a whole lot of words; in the end, what you have to imagine is: think of a cloud provider with full visibility into the operations. And it makes complete sense.
A: So wouldn't it make sense to have ops and developers working closely together in a transparent cloud, working on the ops piece, and then picking and choosing some of the obstacles that they run into, the tedious tasks, the chores that they have to do, and codifying that in an operator? Instead of the developer thinking, or the product manager thinking, "oh, this is something that the customer is usually doing, so we'll codify that in the operator", when it's not really a problem.
A: If it doesn't fix it, I'm going to report it, so I'm taking one step in the contributor funnel. And eventually, if I'm super involved, I can resolve the problem myself, because I'm contributing back to the code, because everything is open in open source software development. This is not the case in operations. In operations, every ops deployment is de facto a snowflake, and it's behind walls, which is completely fair, because you have to deal with privacy and such.
A: But if you think about AIOps, using AI to train your operational models and your operators that are running in auto mode: does it make sense to always train them from scratch? In this example here, enter transfer learning. That's an AI technique where you train your model on a certain set of data, then take the knowledge inside that model so that you can train a second model on less data, right?
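That transfer-learning idea can be shown with a deliberately tiny sketch, assuming nothing beyond the definition given in the talk: fit a one-parameter model on a large source dataset, then reuse its weight as the starting point for training on a much smaller target dataset. The datasets, learning rate, and step counts below are all made up for illustration.

```python
# Toy transfer-learning sketch with a one-parameter linear model y = w * x,
# trained by gradient descent on mean squared error.
def train(data, w=0.0, lr=0.01, steps=100):
    """Fit the weight w by gradient descent and return its final value."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

source = [(x, 3.0 * x) for x in range(1, 10)]  # plenty of data, true w = 3.0
target = [(1, 3.1), (2, 6.2)]                  # very little data, true w = 3.1

w_source = train(source)                         # learn on the big dataset
w_scratch = train(target, w=0.0, steps=5)        # cold start, few steps
w_transfer = train(target, w=w_source, steps=5)  # warm start: transfer

# After the same 5 steps, the warm-started model is already close to 3.1,
# while the cold-started one is still far away.
print(round(w_scratch, 2), round(w_transfer, 2))  # -> 0.7 3.02
```

The point is only that knowledge from the source task cuts the amount of target data and training needed, which is exactly the argument for sharing operational models instead of training each deployment from scratch.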
A: So AIOps in a community starts with discussing, collaborating, and locking in standards, like Prometheus becoming the de facto standard for metrics these days. For logs this hasn't happened yet, but I think we're on a good way there; the same goes for model exchange. Developing these standards is super crucial, because only with standards, once you have standardized on something, can you build on the same foundation. So let's grow collectively and codify that operational experience.
A: In the end, operations gets democratized. Democratize operations: everybody should be able to operate stuff at scale. Think of it as "brew install ops-center" or "git clone operations": don't start from scratch.
A: So then you can use your data as a competitive differentiator, or whatever your product is actually selling, and not so much your operational excellence. You should be building on top of the data that you collect from your customers, that the customers give to you, or that you know about your landscape, not so much on operating your cloud. And we're also prototyping this with Open Data Hub, which you heard about here today and which you're going to hear about later on. That's only natural, because it has some AI in it.
A: It's also a young project, which makes it good, because you can still influence them: they are still building out their operational ideas and capabilities, so we have the potential to influence them. And obviously you need users and workloads: users, users, users, right? Without anything happening on your platform, you won't produce any data and you won't produce any issues, and then you just have an idling cluster sitting there, which is kind of boring. So we're also doing these things with certain other workloads: container-native virtualization, Mesh for Data, OpenShift itself, ACM, the Advanced Cluster Manager, and other emerging-tech projects being onboarded there. But as it's a community, everybody out there can onboard and run their experiments there. We also do a lot of things with research.
A: We have a telemetry working group; there are a lot of threads going on. It's slowly, slowly starting, so it's the perfect time to chime in. My call to action here is: get your access-all-areas card for the ops center, because it's so easy to be onboarded there. Essentially, all you need right now is a Google mail address.
A: Maybe we'll change that to something different, but it should really be read-only by default for everybody out there, so that you can deploy your workloads, of course in collaboration with the community, and then we solve issues there, right? So you onboard via our onboarding infrastructure, you get compute, and in return you're giving away the data that your compute produces: your metrics, your logs, and such. Click on Operate First and you're going to land on this page here, where we have bucketed things into data science, users, operators, and blueprints. On the data science side, you can follow along with what we're doing there on the AI research bits. Most of it, I think all of it, has an ops touch, so you won't see image detection stuff there yet; it all has to do with how we operate.
A: Moving on to the operators bits: we have documentation for onboarding your workloads. We use Argo CD, so we're following a GitOps approach here, getting all the best practices in a cloud-native deployment model, so to say. Another interesting aspect of this, or perspective on it, is that if you want to make a pull request against a service or a running system, you need to replicate the setup somehow. So we also have tools to replicate the setup on CRC, which is a Kubernetes cluster on your laptop, or onto other environments. So hopefully we will have guides to deploy the same setup into...
A: I don't know: AWS, Azure, or bare metal deployments. So hopefully we grow this environment into other data centers over time, and I think that's a super crucial part, because, as I said earlier, SRE is usually a process that everybody, every customer and every project, has to set up on their own, and they have to write their best practices on their own.
A: We are really documenting them out in the open, so that you can actually do a git clone of the decisions that we're taking and the processes that we're documenting, and then you just adjust them to your needs and your demands, and you don't have to start from scratch; you can build upon our best practices here. And, as I said, it's a community: every page has this "contribute to this page" link, which then takes you to this slew of GitHub repos, because we are a little bit spread all over the place, from that Operate First...
B: Awesome, Marcel, thank you very much, and I'm really glad that you brought the Operate First cloud to our attention, because I hadn't seen it before today, so I really appreciate that. And the data science projects and workflow stuff on that site is just awesome. So I encourage everybody who's listening in now to take a look at that. I hadn't heard of it before today, so I have learned something new today, and I'm thrilled to see it. That's good, awesome. So, Audrey, you've joined us.
C: Marcel, I do. So, hey Marcel, how's it going? I'm going to take on the persona of somebody that's, like, kind of brand new to AIOps.
A: It's also tickets and issues, bugs and the like. That's why we're feeding our alerts into GitHub issues. So yes, we want to make everything open, so completely transparent. You might see some passwords encoded, you might see emails encoded, or phone numbers if somebody is on call. Obviously, we don't want to expose any privacy-identifying information, right? But if you're going on Facebook, you're also giving away your privacy-identifying information.
A: So maybe that's just the deal: if you are in there, use your internet pseudonym and avatar, and not your real name, right, if you're worried about this. I would really treat it as open as it can be, because that's usually the roadblock if you want to get access to data. Even within our company we have trouble getting access to the internal data, although it's all internal, because you have to go through infosec and all that stuff, since it might contain some sensitive data.
A: So let's do it in the open from the beginning. We're not there yet, but that's really the aim. In terms of big data processing: yes, Open Data Hub has Kafka in it, it has Spark in it, so we have the possibility to crunch big data. I hope that we have big data at some point; right now I think we have two issues, so it's not that big, but we're starting to collect stuff. And from the setup perspective, we are connected to the Northeastern side, I think they call it NERC or NESE or something, which is another research domain. So I think we have unlimited storage; at least that's what I'm assuming, right?
C: Okay, so I guess I would ask one more question. Do I have time for one more question, Diane? A little one? A little one, okay. This one could probably be kind of a yes-or-no one. So we're going ahead, we're gathering all this data. Do we have something in place that will reduce what I would call noise? There might be some kind of spurious data that comes up. Maybe there's some data where we could spot trends, where somebody's trying to do something with the system. Is that in the works for AIOps? Because otherwise we would just have huge amounts of data that, you know, I don't think we'd really need.
A: That's an absolutely great question, because that's the challenge for the AI community, I would say. I'm not really an AI researcher, but I know that we have the same problem of reducing noise in identifying cats in images. You don't want to identify the cat that's in the background as a cat, but only the cat that is actually moving, for example. Or like these adversarial attacks, where you only change one pixel, right?
A: So I also don't want my AIOps agent to evacuate a cluster only because somebody deployed something that has an emoji in a log message; that's an adversarial attack on the ops agent. So exactly: collecting all that data, and then having the scenarios where you have an outage or something, and then retraining your model so that it works better. So yes, I don't have it yet.
A: We have some PoCs doing something in that domain, and we're seeing interesting projects there, from research and also from IBM, looking at extracting log templates from log files, predicting time series, and correlating time series. But the problem most of them have is: "oh, we need to have your data, then we can train our model." Well, no, we don't have any data yet. And that's exactly the problem that we're trying to solve here: provide a common set of data.
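The log-template extraction mentioned there can be illustrated with a toy sketch: collapse log lines into templates by masking out the variable parts, so that lines produced by the same print statement group together. Production tools (the Drain algorithm, for instance) are far more sophisticated; the regex and sample log lines below are illustrative assumptions only.

```python
import re
from collections import Counter

# Toy log-template extraction: mask the variable tokens (numbers, hex IDs)
# so that log lines produced by the same print statement collapse into
# a single template that can be counted and compared across deployments.
def extract_templates(log_lines):
    """Return a Counter mapping each template to how often it occurred."""
    counts = Counter()
    for line in log_lines:
        template = re.sub(r"\b0x[0-9a-fA-F]+\b|\b\d+\b", "<*>", line)
        counts[template] += 1
    return counts

logs = [
    "connection from 10.0.0.1 port 443",
    "connection from 10.0.0.7 port 8080",
    "disk usage at 91 percent",
    "disk usage at 97 percent",
]
for template, n in extract_templates(logs).items():
    print(n, template)
# -> 2 connection from <*>.<*>.<*>.<*> port <*>
# -> 2 disk usage at <*> percent
```

Template counts like these are exactly the kind of structured signal that time-series prediction and correlation can then be run on.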
B: So that is a great aspirational goal, and I think it's doable if we do it as a community effort, so thanks, Marcel, for doing that. I'm going to queue up our final speaker for the day, Sherard Griffin, and then I'll come back on with some resource links as well. But I really want to thank everybody who's persevered with us today. It is fluid, so we've run over a little bit, but we should be able to wrap this up in the next 15 minutes.