Description
Get your espresso ready for the OpenShift Coffee Break as we welcome our special guest Liran Cohen, CTO at Sosivio, demonstrating Sosivio's integration with OpenShift for real-time monitoring, observability, automated troubleshooting, and root-cause analysis on Kubernetes.
Andrea: And welcome, everyone, to the OpenShift.TV Coffee Break. Good morning, good morning, everyone — and Stefan. We have two guests today, and I'll let them introduce themselves and the companies they work for; a little bit of a surprise here. The floor is yours, guys: why don't you introduce yourselves? Meanwhile, I'll have my coffee. It's good.
Stephen: Hey everyone, and thanks for joining. Stephen Thorn here, Solutions Architect at Sosivio — anything business, commercial or marketing related, I'm involved with.
Andrea: Excellent. So what did you guys prepare for us today? What does Sosivio do, for those who are not familiar with it?
Andrea: Let's share the screen. Cool.

Andrea: We can't leave without a demo on OpenShift.TV — we have to see the real thing at some point.
Stephen: So Sosivio has been around for a few years now. While we do work with OpenShift, we also work with any distribution of Kubernetes, and we work in completely disconnected environments, so everything we're talking about and demoing today is just another application running 100% in the cluster itself — no data offloading; it's not a SaaS tool.

Stephen: It does this in a very resource-friendly way as well. We have leveraged lean AI and ML to predict and prevent critical failures in Kubernetes environments. We have domain experts in both Kubernetes and AI — one of our many Kubernetes experts, Liran Cohen, is here, and we'll dive into the product a bit more in a second. We are a US-based company headquartered in San Francisco, California, and our R&D is led out of Tel Aviv, as you can see on the right side of the screen here.
Stephen: So what do we do? What's the problem that we're solving? I'm sure, unfortunately, some of you have dealt with troubleshooting on Kubernetes. You know that it can be a quick process, or it can take a few weeks to solve a single issue — it just depends on the complexity. The typical troubleshooting process is: you get an alert, maybe from a platform or a tool you have in your environment, or, worst case, a customer complaint.

Stephen: You then need to go in, prioritize, and understand the impact of that issue. Then you go through and identify the root cause, and once you do have the root cause — what's the remediation? You implement it, and hopefully that works and fixes it; if not, you're going back a few steps and trying to fix it again.

Stephen: Like I said, that can be pretty time-consuming. What Sosivio does is automate that process for you in a real-time fashion. I'll turn it over to Liran to talk a bit more about Sosivio.
Liran: Sure, yeah. First of all, if there are any questions from anyone in the audience, I'll gladly answer; if not, just to add something important: before Sosivio I used to be a principal architect and consultant for Red Hat. I used to travel around Europe and Israel, and actually this is where Sosivio was born.

Liran: The idea — what was missing when I was giving professional services to banks, insurance companies, automotive companies, telco companies — was that when something broke, it was very difficult to find out why, and we didn't have all the information. The information was in Kibana, Elasticsearch or Prometheus, and you had to go dig deep, try to do some kind of correlation, or pull it into an external file, just to extract all the information, put it on a timeline and try to understand what happened.

Liran: So Sosivio does all that for you in real time, meaning it fetches all the information from the relevant entities — pods, deployments, replica sets, whatever you're looking at — gathers it, puts it on a timeline, and then creates a sequence of events in a couple of minutes. In the meantime, if there are no other questions, Andrea, keep me posted.
Andrea: At the moment there are no questions from the audience. My question would be — you said, and maybe you're going to talk about it, that you can work in air-gapped, disconnected environments.

Andrea: Is there a procedure where you have to somehow bring information that comes from other — let's say anonymized — experiences or patterns into those air-gapped environments? That was my first question. You don't have to answer now; maybe it'll become clear as you go through.

Andrea: The second is: you probably have a way to estimate the resource requirements based on some parameters of the environment — the number of nodes, the number of pods, of metrics or whatever it is — and I'd like to understand that aspect a little. If people want to do this implementation, what additional resource requirements does it put on the cluster?
Liran: Okay, so let's start with the second question: resource consumption. Resource consumption is something I was really concerned about even before Sosivio, when I used monitoring tools — because if the overhead is 30-40% of cluster resources, that's a big deal, right? It's cost, you need more nodes, and then you start troubleshooting the nodes that are running the applications you use to troubleshoot other applications. That becomes crazy. So Sosivio's footprint is negligible.

Liran: Of course, it depends on the cluster size — the bigger it is, the lower the relative footprint, because we distribute the analysis over multiple pods; it doesn't happen in one big pod or in one storage location. But remind me of the first question, right, about...
Andrea: The first question was: in many cases where diagnostics is involved, especially if it involves ML, you recognize patterns, and there may be patterns that have not occurred in a given cluster but have occurred somewhere else, and you want to take that experience and bring it into that cluster.

Andrea: They may be more cluster-related than application-specific, but I would imagine there's some sort of bootstrap of intelligence — that's my mental image, so it may be completely wrong — and that you...
Liran: You're definitely right: the engines come with, let's call it, pre-trained data — sequences that they already know, malfunctions we have already detected and seen and that are embedded into the product. So you won't have to first find them on your cluster. But even if you don't have those, once a malfunction happens for the first time, we find the sequence and we store it inside your cluster, so it doesn't need to be connected anywhere to understand new sequences or new problems.

Liran: Once something happens, the pattern of severity between the different signals will appear in what we call the discovery engine, the sequence will be discovered, and from that point on that sequence will be detected every time. In the next version you will probably also get that sequence from us as part of the pre-trained data set for the engines.

Liran: And yeah — you know, most of the environments, probably on OpenShift, the ones that I used to work with, none of them had an internet connection. Sometimes it was even worse: the whole organization didn't have an internet connection, especially if you're talking about government or army organizations that are completely disconnected.
Andrea: Anything else, or shall we jump into the demo? No questions — you want to go to the demo first. Okay, perfect, let's just do that.
Liran: The main screen that we see right now has a lot of information aggregated to show you the state of the cluster. You have, for example, health scores, which I'll get into in a couple of minutes — they tell you what the status is.

Liran: It's divided into three different categories: platform, which is basically the hardware, the nodes and the OS of the cluster; application, which is what is actually running inside the pods; and deployment, which is basically the state of the deployments, replica sets and services — whether you have endpoints for each service,

Liran: whether you have all the replicas on the correct version for every deployment and stateful set, and so on and so forth. And these are the AI-predicted failures, meaning the actual sequences that we find, and you can see that they progress as time goes on: they start as a warning, they become urgent, until they reach full failure, which is the red. We'll show you exactly the list of them, how you see them and how you deal with them.
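[Editor's note: as an illustration of the kind of per-deployment and per-service check described above (all replicas ready and on the current version, endpoints present behind each service), here is a minimal sketch using the official Kubernetes Python client. It is not Sosivio code; the namespace name is only an example.]

```python
# Minimal sketch: flag deployments whose ready replicas lag the desired count,
# and services that currently have no ready endpoints.
from kubernetes import client, config

config.load_kube_config()          # or config.load_incluster_config() inside a pod
apps = client.AppsV1Api()
core = client.CoreV1Api()
ns = "demo"                        # example namespace

for d in apps.list_namespaced_deployment(ns).items:
    desired = d.spec.replicas or 0
    ready = d.status.ready_replicas or 0
    updated = d.status.updated_replicas or 0
    if ready < desired or updated < desired:
        print(f"deployment {d.metadata.name}: {ready}/{desired} ready, "
              f"{updated}/{desired} on the current version")

for svc in core.list_namespaced_service(ns).items:
    ep = core.read_namespaced_endpoints(svc.metadata.name, ns)
    addresses = sum(len(s.addresses or []) for s in (ep.subsets or []))
    if addresses == 0:
        print(f"service {svc.metadata.name}: no ready endpoints")
```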
Liran: Then there's the load average of the nodes, which we — and customers — found very useful to see at first glance. Then we have something very interesting here: the automatic application profiling that Sosivio offers. Here you see two states, the over-allocated and the under-allocated deployments in our cluster. We'll deep-dive into it — there's also a full screen for it — but right now I would like to go directly to the AI-predicted failures, the things that we see here.

Liran: If we click it, or if we go to our Command Center, we can see the full list of failures along with, let's call it, the predictive severity of each sequence, and you can see that we have many of those. In fact, we have 107 failures, 68 predictions that things are going to break, and two warnings where the sequence is just starting to build.
Stephen: Liran, it's frozen on the cluster overview page.
Andrea: Why don't you stop sharing for a moment and then start sharing again? Okay, let's see if... I don't understand why it's frozen.

Andrea: Okay, okay — meanwhile, let's see: is there a way you can continue with the presentation in the meantime?

Stephen: Sure.

Andrea: Let us know when you're ready again.
Liran: So we can see some of the failures that are here — there are a lot of failures; as I said, this is a demo cluster, but there is still a simulation that we run on it — and you see that it's constantly updating. Every time a sequence completes, meaning a failure starts and eventually fails, the information here updates. We can see things like the panics that you see here inside pods — and you can actually see the panic itself,

Liran: why it panicked. There are a lot of demos here: that one was a division by zero, which is something simple, and we have them for Python, Go, Java, Node.js and C#.

Liran: And there are also failures that are not related to the code itself — for example, a pod stuck in ContainerCreating, where you can see that basically the readiness probe has failed and so on; it couldn't create the container. So basically you can see everything that happens on the cluster.

Liran: Let's set this to show 30 issues and see if we have something else. Here we have a configuration issue — someone set the deployment to imagePullPolicy: Never — and this one is CreateContainerConfigError, meaning you're probably missing a PVC or a Secret; in this case the Secret is missing. So basically, everything that happens on the cluster that is critical for you to understand, know about and deal with, you can see in the Command Center.
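[Editor's note: the configuration failures mentioned here (imagePullPolicy: Never with no local image, a missing Secret or PVC causing CreateContainerConfigError) surface in the pod's container statuses, so you can list them yourself for comparison. A rough sketch with the Kubernetes Python client; this is not Sosivio's collector.]

```python
# Rough sketch: print pods whose containers are stuck in a waiting state,
# together with the reason Kubernetes reports (ImagePullBackOff,
# CreateContainerConfigError, ErrImageNeverPull, ...).
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

for pod in core.list_pod_for_all_namespaces().items:
    for cs in (pod.status.container_statuses or []):
        waiting = cs.state.waiting
        if waiting and waiting.reason not in (None, "ContainerCreating"):
            print(f"{pod.metadata.namespace}/{pod.metadata.name} "
                  f"[{cs.name}]: {waiting.reason} - {waiting.message}")
```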
Liran: In the mass of information that is Kubernetes, and with the different tooling for looking at it, it is very difficult to differentiate what is important and what you need to deal with from what is just information that someone spewed out — because, you know, developers often just put in log messages for their own troubleshooting later, and as far as the environment itself and your service are concerned, most of it is irrelevant.
Stephen: Do you want to talk about sequences and, kind of, the DNA of a malfunction — that aspect of it?

Liran: Yeah.
Liran: You need to get all the information from all of these moving parts and put it on a timeline. As you can see, this is just a four-event sequence, but we have much bigger sequences, up to 18 or 20 events in a sequence. Meaning: let's say you had a panic inside the pod, which returned a 500, and because of that the pod crashed and then it restarted — all of these things are put inside that sequence, and on the timeline that becomes the signature of this malfunction.
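[Editor's note: the "signature" idea — an ordered sequence of event types that identifies a malfunction independently of which pod it happened to — can be sketched very simply. This is only an illustration of the concept with made-up event names; it says nothing about how Sosivio actually encodes or matches sequences.]

```python
# Conceptual sketch: strip the "who" from a list of timeline events and keep
# only the ordered event types, so the same failure pattern can be recognized
# on any other pod or deployment. Event names are hypothetical.
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: float   # seconds since epoch
    entity: str        # the pod/deployment it happened to (dropped below)
    kind: str          # e.g. "panic", "http_500", "container_crash", "restart"

def signature(events: list[Event]) -> tuple[str, ...]:
    """Order events by time and keep only their kinds."""
    return tuple(e.kind for e in sorted(events, key=lambda e: e.timestamp))

timeline = [
    Event(10.0, "checkout-7f9", "panic"),
    Event(10.2, "checkout-7f9", "http_500"),
    Event(11.0, "checkout-7f9", "container_crash"),
    Event(15.0, "checkout-7f9", "restart"),
]
print(signature(timeline))
# ('panic', 'http_500', 'container_crash', 'restart') -- the same pattern can
# now be matched on any other workload, regardless of the pod name.
```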
Liran: So basically, once a failure happens, we take the events that happened, without the "who it happened to", and those events in turn can apply to any other deployment or pod or whatever entity it is.

Liran: I'm trying to find one here — for example, this is an OOMKill, and you can see that the sequence here is much, much longer: it's actually 13 events long, and you can see everything that happened to the deployment, on the node itself, to the pod. Because, for example, with OOMKills you have different types. You have an OOMKill that is instigated by the node not having enough memory, where there is nothing you can really do about it except move the pod somewhere else,

Liran: to another node — that is the only way to fix it. If you don't have a sequence of events, if you don't see exactly what happened, and you just see that it was killed by the kernel because the cgroup limit was crossed, then there's no way to understand how to fix it. Giving more memory to this pod on this specific node will not fix it —

Liran: sorry, for that pod it will fix it, but for pods that were killed because the node ran short of memory it won't: the node simply doesn't have enough memory to make the pod run.
Liran: That's it. Steven, did you want to mention anything else about the Command Center?
Stephen: I mean, I'm sure we're going to want to talk about the application profiling piece, so I don't know if you want to jump to that now. Yeah.
Liran: So back when we started sharing, we saw the over-allocated and under-allocated widgets on the main screen, and I want to dive into them for a second and look at the application profiling.

Liran: The way we do it is we look at pods, and basically deployments, from the second they started up until this point — meaning we look at the dominant behavior of that pod or set of pods. We determine, first of all, whether the dominant behavior is erratic, meaning there are spikes — memory or CPU, it doesn't really matter — and we also check whether it's calm, meaning it doesn't change much.

Liran: Based on that dominant behavior, we decide with an algorithm what you should put as the recommended values for that pod, set of pods or deployment. The reason we do that is that you can imagine a pod running like crazy, spiking all week, and then on Thursday night it stops spiking because there aren't a lot of users, or whatever. If you then try to profile it based on the last hour on Thursday or Friday — or even the last day — you will profile it wrong.
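[Editor's note: Sosivio's profiling algorithm is proprietary and based on the "dominant behavior" of a workload. The sketch below only illustrates the general idea of deriving a request/limit recommendation from observed usage samples, using simple percentiles, so the contrast with profiling "the last hour of a quiet Thursday" is concrete. The numbers and percentile choices are assumptions, not Sosivio's method.]

```python
# Illustration only: recommend a request near typical usage and a limit above
# the observed peak, given usage samples covering the workload's whole lifetime
# (so a calm Thursday night doesn't dominate the profile).
import statistics

def recommend(samples_mib: list[float]) -> dict:
    samples = sorted(samples_mib)
    p50 = statistics.median(samples)
    p99 = samples[min(len(samples) - 1, int(0.99 * len(samples)))]
    peak = samples[-1]
    return {
        "request_mib": round(p50 * 1.1),           # headroom over typical use
        "limit_mib": round(max(p99, peak) * 1.2),  # headroom over the peak
    }

# One week of made-up memory samples: spiky on weekdays, calm at the end.
week = [300 + (i % 50) * 8 for i in range(2000)] + [80] * 200
print(recommend(week))         # profile over the whole history
print(recommend(week[-200:]))  # profiling only the calm tail is misleading
```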
Liran: So this is how Sosivio does it, and you can see that, for example, for this pod the request is currently 80 megabytes, which is what we recommend, but the limit is much, much higher than what we recommend. With CPU, this pod probably doesn't do very much, so the recommendation is really low. And these are the values that we have seen: the average and maximum for CPU and memory during the dominant-behavior period.

Liran: Let's try to find something that is a little bit more spiky.

Liran: Now, just to explain a little bit how all of this magic is done — how do we get this sequence that you see here? This is a panic indicator, actually; this is the full sequence. But if we choose a sequence — let's choose one from here.

Liran: Inside Sosivio you also have the insight-to-metric view, which you glimpsed a couple of minutes ago when I ran through it. This gives you a lot of information that comes directly from our collectors, before any analysis — it is the raw data that goes through the data-swirling mechanism, which I want to touch on once we finalize the demo.

Liran: So basically there's a lot of information that Sosivio reads. We have our own proprietary collectors and our own proprietary database, which is a graph database combined with a document store, and that information comes at a very high granularity. For example, I'm sure every developer would love having something like this in their hands. Let's try to find something that is doing some work.
Liran: Yeah, I guess Chrome doesn't like my computer too much.
Andrea: Okay, so just to recap: how did we get here again?
Liran: So we spoke about how we get the information for the sequences, and why it's very difficult — I haven't seen any other tools that have this type of information at this granularity, which is what I was missing as a consultant, as I mentioned before. The way we do it is using the data-swirling methodology and, of course, our own proprietary collectors and database. Just to show you the granularity of information we have inside Sosivio: for example, we're looking at CPU right now, and you can see that every five seconds you get a read, and you can see that it's 0.302 cores.
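[Editor's note: a five-second CPU read in "cores" can be approximated by sampling the container's cgroup CPU accounting and dividing the delta of CPU time by wall-clock time. The sketch below assumes cgroup v2 and its default mount path; Sosivio's own collectors are proprietary and may work differently.]

```python
# Sketch: sample cgroup-v2 CPU usage every 5 seconds and print cores in use.
# Assumes /sys/fs/cgroup/cpu.stat is readable (run inside a container on cgroup v2).
import time

CPU_STAT = "/sys/fs/cgroup/cpu.stat"

def usage_usec() -> int:
    with open(CPU_STAT) as f:
        for line in f:
            key, value = line.split()
            if key == "usage_usec":
                return int(value)
    raise RuntimeError("usage_usec not found")

prev, prev_t = usage_usec(), time.time()
while True:
    time.sleep(5)
    cur, cur_t = usage_usec(), time.time()
    cores = (cur - prev) / 1_000_000 / (cur_t - prev_t)
    print(f"{cores:.3f} cores")            # e.g. 0.302 cores, as in the demo
    prev, prev_t = cur, cur_t
```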
Liran: So it's very, very precise. If something spikes, you will see it immediately. You don't need to dig deep, because all other tools — especially the ones based on the Prometheus node exporter — are doing averages, so it's very difficult to find something that spikes, or to see when you are being cut off while requesting a lot of CPU. The same goes for memory, by the way. Also, if you notice here, the red line is the limit that is configured for this pod.

Liran: The blue one is the request, and you also have the average and the maximum. If I remove the limit, you can see that it zooms in a little bit; if I remove the request, it will probably zoom in more. You can see the accuracy of our reads here. The same goes for memory. And by the way, the green is very important: that's the maximum CPU and memory the pod ever consumed. We also have the same for threads and throttling, which is taken directly from the CFS.

Liran: This one is not throttling right now. Then there are voluntary context switches and non-voluntary context switches, which carry their own indications: voluntary context switches basically indicate high I/O — either network or storage, most of the time — and non-voluntary ones actually indicate deadlocks or loops inside the code, things that get stuck or take a lot of time.
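[Editor's note: the voluntary/non-voluntary context-switch counters referred to here are exposed by the Linux kernel per process in /proc/<pid>/status. A small sketch to read them directly, independent of Sosivio:]

```python
# Sketch: read voluntary and non-voluntary context switches for a process.
# High voluntary counts often mean the process is waiting on I/O (network or
# storage); high non-voluntary counts mean it keeps being preempted, e.g. busy
# loops or CPU contention.
def context_switches(pid: int) -> dict:
    counters = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith(("voluntary_ctxt_switches",
                                "nonvoluntary_ctxt_switches")):
                key, value = line.split(":")
                counters[key] = int(value.strip())
    return counters

print(context_switches(1))   # e.g. {'voluntary_ctxt_switches': 1234, ...}
```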
Liran: So that's the granularity of information we have. We also have the network connections from each pod. It will update in a — oh, this one is not connected anywhere; let's try to find one that is. Let's take, for example, a CRUD service — it should give us some network information. You can see that all the HTTP information is here: we can see the APIs this pod is actually accessing, how many requests, what the maximum latency is, and whether it returns 200, 300, 400 or 500.

Liran: We can see the information this pod sends and, of course, the information this pod receives. So it's very convenient to see whether you have latency between pods here, and toward which API. Sometimes the latency doesn't emanate from network problems, but from a remote pod that takes time, or maybe a database that is heavily loaded. So it's very easy to see. As you saw a second ago, you can see a lot of APIs being accessed — let's find something with multiple APIs.

Liran: We had one a second ago and I closed it — yeah, here you see two APIs. Maybe one of these APIs is taking too long to respond, so you can also see the maximum latency per API, which is very, very convenient when you're working in such a distributed environment. If you go to the network connections, we can also see the state of connections — and this is very helpful for starting to detect load balancer or firewall issues between your cluster and somewhere else.

Liran: For example, if you had a load balancer with a connection limit, or a firewall that is just dropping connections, you'll probably see FIN_WAITs or CLOSE_WAITs — connections basically waiting to go to the next step but unable to. So that's also very convenient to work with and very useful information.
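[Editor's note: a quick way to see the connection-state distribution mentioned here (CLOSE_WAIT / FIN_WAIT piling up when a load balancer or firewall silently drops connections) is to count TCP socket states on the node or in the pod. The sketch uses the third-party psutil package; Sosivio gathers this through its own collectors.]

```python
# Sketch: count TCP connection states; many CLOSE_WAIT / FIN_WAIT sockets can
# hint at a firewall or load balancer silently dropping connections.
from collections import Counter

import psutil  # third-party: pip install psutil

states = Counter(conn.status for conn in psutil.net_connections(kind="tcp"))
for state, count in states.most_common():
    print(f"{state:12} {count}")
# Example output: ESTABLISHED 42 / CLOSE_WAIT 17 / FIN_WAIT2 9 ...
```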
Liran: The same metrics we have for pods we also have for whole nodes, of course, with load average and a lot of other information. But something very interesting here is that we can also see the DNS latency of that node, which affects all the pods on that node.

Liran: So that's also a lot of information you can use to determine whether the problem is in the pod, a node, the whole network, and so on. Again, the same granularity also applies to node information: what the usage is, what the maximum usage was, how much the aggregated limits of the pods on that node are, the same for requests and, of course, the average; and network connections again — sometimes nodes have VLANs or firewalls between them, or I've also seen IPsec encryption between nodes.

Liran: We also have a couple of other things that are worth viewing. Our health checks: to simplify it, when you're troubleshooting you don't just read logs, right? You run commands: you check the deployment, you check that everything is deployed, you actually run some curls, maybe a dig to check DNS resolution. So Sosivio runs a lot of these tests automatically and continuously on the whole cluster — on the platform parts, the application, as I said, and the deployment. These are the health checks.
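[Editor's note: one of the health checks described — measuring DNS resolution latency from a node or pod — can be reproduced with a few lines of standard-library Python. The service name and threshold below are only examples; Sosivio runs such probes continuously across the cluster.]

```python
# Sketch: time a DNS resolution the way a troubleshooter would run `dig`,
# and warn when it is suspiciously slow (threshold is an arbitrary example).
import socket
import time

def dns_latency_ms(hostname: str) -> float:
    start = time.perf_counter()
    socket.getaddrinfo(hostname, None)
    return (time.perf_counter() - start) * 1000

name = "kubernetes.default.svc.cluster.local"   # example in-cluster name
latency = dns_latency_ms(name)
print(f"{name}: {latency:.1f} ms")
if latency > 100:
    print("DNS resolution is slow on this node - check CoreDNS load")
```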
Liran: Let's see — all of these things are failing, or failed recently. For example, the Kubernetes API service is continuously restarting; that's probably because it's under a lot of load — this is the demo environment and we are actually simulating a lot of failures.

Liran: But when that happens, you usually don't see it: it happens, it restarts, and that's it — you just have slow API communication. You run kubectl and maybe it takes three minutes, or two minutes, or half a minute. So this can give you a lot of insight into what is wrong with your cluster. And again, this is just for the platform; we have the same thing for the application — you can see exactly which one is in which state — plus a lot of other tests that we have here, and, of course, the same for the deployment.
Andrea: I have questions. So effectively you have a monitoring capability, like these health checks — these are the actual failures that have happened — and then you have the capability of, let's say, predicting with a certain confidence that some patterns are occurring and that they may lead to failures of different types.

Andrea: Can that be somewhat externalized to, for example, ServiceNow, so that if people are not watching the console all the time they can get alerts, then come look at this and understand what the recommended remediation is?
Liran: We don't have an integration with ServiceNow at the moment, but Splunk, Prometheus and Grafana — we already have integrations for those. So if your NOC or your DevOps team has a screen in front of them that they look at, you can push all the information from Sosivio — all these failures — to it, and if there is a failure, you can click it, go into the specific dashboard and view it.
Liran: Something very important that you didn't see in the demo is that Sosivio is completely built on RBAC, meaning all the permissions you have on the cluster are applied in Sosivio. So if Steven is the admin and I'm a developer with permission to see only my namespace, and I use the same username — meaning I use OAuth to log into Sosivio, which we also support —

Liran: then, of course, the same RBAC that applies to me on the cluster itself will apply to me in Sosivio, which is very, very convenient, especially in production environments where you want to let people see things but you don't want to let them touch anything. That's one of our big customers — as you said, Stephen, a top-10 international bank; we can't say the name. It's a huge environment, a very big environment.
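[Editor's note: the "same RBAC as on the cluster" behavior rests on Kubernetes' own authorization checks. As a hedged illustration of how any tool can ask the cluster what the logged-in user may see, here is a SelfSubjectAccessReview via the Kubernetes Python client; this is not Sosivio's implementation, and the namespace is just an example.]

```python
# Sketch: ask the API server whether the current credentials may list pods in a
# given namespace - the same RBAC decision the cluster itself would make.
from kubernetes import client, config

config.load_kube_config()   # the user's own kubeconfig / OAuth token
authz = client.AuthorizationV1Api()

review = client.V1SelfSubjectAccessReview(
    spec=client.V1SelfSubjectAccessReviewSpec(
        resource_attributes=client.V1ResourceAttributes(
            namespace="dev",     # example namespace
            verb="list",
            resource="pods",
        )
    )
)
result = authz.create_self_subject_access_review(review)
print("allowed" if result.status.allowed else "denied")
```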
Liran: Actually it's multiple environments, and they use it, of course, in development and testing and all the rest, but they found a very nice use case: when something happens in production, they don't have to give the developer kubectl access — they can just open Sosivio, and the developer can debug whatever they want. They don't have permission to touch anything and they don't have direct access to the cluster.

Liran: That's one of the things they found very valuable and are really using on a day-to-day basis: they gave the tool to all of their developers, and they reduced — Stephen, give us the numbers — the requests to the DevOps people by...
Stephen: I think they said it helps them reduce by up to 90% the time it takes to fix these issues, because they have thousands of developers.

Stephen: They do have some expertise in-house, but a lot of things get escalated up to those in-house experts, and they're spending a lot of time dealing with developer issues which may not be super complicated and could be solved by the developers themselves — they just don't have that knowledge yet. So now they're seeing what's going on, they have the root cause presented to them right there, and they don't have to escalate nearly as many things anymore.
Liran: And that's a very, very important point, Andrea, because these guys know Kubernetes — I know, because we've spoken with them many, many times; they are experts. But how many experts do you have in an enterprise when you have, I don't know, three or four thousand developers?

Liran: Ninety percent of the requests to those DevOps and Kubernetes experts are gone, and that frees them to go do important things like automation or cluster upgrades, instead of just answering: "Oh yeah, you set imagePullPolicy: Never — remove it," or "the image tag is not correct," or "you put a capital letter in the repository name." These things happen on a day-to-day basis. Of course, for someone who is very familiar with Kubernetes,

Liran: it looks like, oh yeah, that's two seconds. But for developers who don't deal with Kubernetes — and in fact some of them don't even have kubectl access; they do everything through their CD pipeline — they just try to deploy and it doesn't deploy: "Andrea, my pod doesn't deploy." Okay, go to kubectl; now you have to do a describe, or a get -o yaml, or whatever it is you need to do to understand what happened. Instead, give the tool to the developer and they see it immediately.
Andrea: So not only, let's say, in production, but also in the earlier test phases, while developers are still involved, and they can catch things. And I guess the system can be trained to detect new patterns once a developer has understood what is happening — or how do you do that?
Liran: Well, the system always finds new sequences, but at the moment we actually curate them manually. Everything that happens on the cluster creates a sequence. What we do right now, as we develop our AI engines, is keep a set of sequences that we know are 100%-for-sure failures; we know exactly what those sequences stand for, and we give you recommendations for each and every one of them. In the future

Liran: there will be an engine that handles any sequence automatically — and by the way, you don't need to train the discovery engine; it finds everything. The problem is taking those lists of events that you've seen and translating them into something that a developer or DevOps person will understand.
Andrea: Okay, I'm checking — no, there are no questions related to our topic.

Andrea: Liran, you mentioned — we were talking about it, Steven and I, while you were restarting the share — the data repository that you have to have on the cluster. Do you have to install it? Does it rely on some database that you have to install as well, or is it its own type of repository?
Liran: So again — you have good questions today, Andrea. This is a good question, because there are a couple of things we forgot to mention. One: Sosivio does not require any prerequisites on the cluster. We don't need storage, we don't need a special cluster, a special node, a GPU or anything else. We just run as yet another application on your cluster: you deploy it, it creates a namespace with several pods in it, and that's Sosivio — you don't have to prepare anything for it.

Liran: The second very important thing is that you don't need to change anything in your deployments or your pods. There is no sidecar required for all the information that you've seen: everything is extracted by our collectors from the nodes, and there's no need for any instrumentation or preparation in the applications themselves or in the other pods.
Liran: It's actually an in-memory, as I said, graph and document store — a graph database combined with a document store.

Liran: It is highly available, meaning you can kill any instance of it and it immediately replicates itself back; it's actually very difficult to cause data loss. If you wanted to wipe the database, you would have to shut down the whole deployment, and even that takes time. Basically it is super fast: it takes all the information that the collectors ingest and streams it to the right locations. It does not require any disk, because it is in memory, and it's completely distributed.

Liran: Again, we do have requests for long-term storage — people want to keep information for, say, three days, three weeks, or three months — and that will naturally require a PV or PVC. But right now Sosivio doesn't save information for that long; we don't need it.

Liran: The idea, or the concept, is: if a pod died just now, the information from it is relevant for, let's say, 30 minutes or an hour; after an hour it's irrelevant, because either there's a new pod, or the pod is not there, or someone redeployed it. It doesn't really matter what happened — what does matter is the sequence. I don't care about all the logging information and signals, but I do care whether it was OOMKilled or whether there was a panic.
Liran: We do take information from the Kubernetes API, but not from Prometheus; we have, as I said, our own data collectors. It's basically a DaemonSet that runs on every node.
Andrea: Okay, okay, excellent. I don't have any more questions. Have you got anything else to tell us about today regarding the technology, or some use cases that have happened with your customers?
Liran: So, with this — do you guys see my screen? Yeah.

Liran: Sosivio actually helped us find a very hard problem that had been around for three months — in the Israeli Army, actually — running OpenShift on top of OpenStack. There was a problem with Neutron, with the SDN underneath OpenShift, and nodes were spiking to eight seconds of latency. There's no way to see that, right? It happens for three seconds,

Liran: a hundred pods are timing out, and you can't see it anywhere else. We caught it here. So that's one very important case. Also DNS issues, you know —

Liran: we had an issue where one pod was running an application that didn't use a DNS cache. There were thousands of requests to services, the DNS was bombarded on that node, and latency was something like 500 milliseconds to get a resolution. We found the culprit afterwards, but just knowing that the DNS has a problem and is loaded —

Liran: that is, first of all, reassuring: you understand that the problem is just on a specific node or set of nodes and not on the whole cluster. The second thing is simply finding who is bombarding the DNS, which was easy, because you can go to the DNS logs. Trying to think of other use cases...
Liran: I can go on about problems, but I think one of the customers — I won't say who — was running on Amazon, actually, not OpenShift; I know, that's a bad word here. Just to give you an idea: we ran application profiling and we found that they were about 80 percent over-allocated. You know how it is: developers want the pod to run, so if they need, say, half a core, they set a request of two or three cores and a limit of six cores, and everybody's happy.

Liran: But the company pays half a million dollars — I don't know the exact numbers — and if you just run application profiling and go pod by pod, you can see the waste. In some clouds we even saw requests for the Calico pods, or the SDN pods that are supposed to come with the environment: the request was half a core for each, which they almost don't even use. So just reducing that cut costs

Liran: significantly. Cost is a nice thing, but with application profiling, cost wasn't the initial intent — it's a nice side effect. The initial thought behind it was to not under-allocate, which is where the problems are: you don't want to be throttled and, God forbid, you don't want to be OOMKilled. We have a lot of examples, by the way, where we put it into a customer's environment and we see it immediately — the information flows into the product within 10 minutes; you already have results.
Andrea: So you'll add more and more of these sequences, which are basically doing an automatic diagnosis of problems as they come in, the more engagements you have with customers, and then you add them to the recognized patterns of the product — that's how I understood it. Let me go back to my initial question.

Andrea: Is there an idea of allowing customers, in the future, to train the system themselves — especially if it is application-specific? Is that something that...
Liran: We started talking about it — creating something that is open to the world, where everybody can contribute, which is cool, because, as I said, we do have all the sequences: everything that goes wrong on the cluster, we find it. Letting people say, "okay, I know this one, this is my application, this is how it fails, I want to keep that" — and someone else can take it, maybe for RabbitMQ, and this is how it fails in that scenario. Right.
Liran: Sorry — I think we're getting towards the end of the hour anyway, so let me finish up.
Andrea: No, no, I think it's quite interesting, especially, you know, in the spirit of open source. But like I said, I can think of a number of projects I've been working on — one recent one — that could have benefited a lot, primarily because we're talking about very, very large infrastructures, with hundreds of nodes and loads of applications running on them, with different characteristics, that make use of storage in very, let's say, creative ways. So something that allows you to do that...
Liran: We actually had some collaboration with Red Hat: on an older version of OpenShift we found something wrong with the kubelet at one of the customers, and we caught it immediately. So we shared all the information and had quite a productive session there — with the customer too, of course.
Andrea: And I'm thinking a little bit — there are cases where remediation, actually solving the problem, really does require a patch.

Andrea: And while you wait for that, the only thing you can do is watch, predict when it's about to break, and apply the sort of workarounds that will keep

Andrea: the system running. It's not ideal, but that's what you have to do; and if you don't have something like that, you have to do a fair bit of manual intervention to keep certain parameters and certain applications under control. And it's not easy.
Liran: And what do you do when the next version is deployed? Yeah — everything changes, right? You probably have a deployment every two weeks, or three weeks, or one week, and all the parameters suddenly change: CPU requests, memory, the sequences themselves, what the pod requires network-wise. Maybe it used to send just one HTTP request; now it sends 500. You don't know.
Andrea: So it looks like we don't have any questions coming from the audience — I don't have full visibility; there's one that I had to take care of separately. If you guys agree, we can come to a conclusion of today's talk: some final thoughts, and links I can give to our audience in case they want to know more about it. If you can give me some links, I'll put them in the chat.
Liran: So, first and foremost, this is an OpenShift session, and we have a certified OpenShift operator. You're welcome to go to OperatorHub and just download it and install it on your cluster. It doesn't put anything on the cluster other than the Sosivio namespace.
Andrea: People can go there, and they can probably get in touch directly with you guys — and definitely check the docs. So I guess that's all we had time for today. Let me just copy the email here — today I'm alone, so I have to do everything.

Andrea: And to all our listeners: I'll see you in the next episode of OpenShift.TV. Thank you all very much.