Cloud Native Computing Foundation Online Programs, 4 May 2022

From YouTube: Cloud Native Live: OpsCruise demonstrates use of CNCF observability tools

A

Hello, everyone, and welcome to another installment of Cloud Native Live, where we will build things and we will break things. A few housekeeping items before we get started: during the live stream, you will not be able to chat as an attendee, but there is a chat box on the right-hand side of your screen where you can drop any questions for our speakers.

A

Please feel free to drop them there. We'll get to as many as we can, either during or at the end, whichever way works out best. This is an official webinar of the CNCF and, as such, is subject to the CNCF Code of Conduct.

A

Please do not add anything to the chat or ask any questions that would be in violation of that code of conduct, and please be respectful of your fellow participants and our lovely presenters, who are Sridhar [surname unclear] and Luke Rota (I've totally butchered it, so I'm going to let you say it beautifully), with OpsCruise and Chicago Trading Company, to kick us off today.

A

I will hand it over to sridhar right now to get us started.

B

Yes, hi, welcome all, and you got my name okay, no problem. This is Sridhar.

B

We have an opportunity to sit down and discuss Chicago Trading Company's journey to the cloud, and observability along the way, with our own Luke Rotor, who's the manager of SRE and observability at Chicago Trading Company. I'm the founder and chief architect of OpsCruise, and I look forward to talking to you. So, Luke.

C

Hi, thank you. I really appreciate the invite to speak about CTC's cloud journey. I'm going to go into some deep dives on our technology, as well as the company itself and what it is that we do, and then how that intersects with OpsCruise. All right.

C

So next, who is Chicago Trading? We were founded in 1995, so we've been around for quite some time.

C

Our mission is to make markets better and provide liquidity when it matters most. So being in the market, and I'll talk about what that means, is paramount to us. Chicago Trading Company is a market-making proprietary trading firm. What does that mean? It means we represent both the buyer and the seller in the marketplace. Those marketplaces are still, today, some trading-floor venues, but largely electronic venues run by exchanges such as CME Group, BATS, the New York Stock Exchange, the Chicago Board Options Exchange, and Eurex in London.

C

There are many, many other exchanges across the world; these are just some examples. So we're interacting with the marketplace on a daily basis.

C

Our customers are anyone who is participating in the marketplace at any one time.

C

We are headquartered in Chicago and have offices in New York and London. Right now we're around 600 people or so, and we're rapidly growing; we've almost doubled in size in the last four years. So there's a lot going on at CTC, definitely exciting times.

C

Markets are really always on. At any one point in time there are a few market pauses, depending on which exchanges you're trading on, but for the most part the markets we trade in are open about 23 hours a day, and we trade in over 20 markets across the world. Some of those markets are even open on the weekends, though many of them are closed.

C

So we have been moving closer to an almost 24x7 trading environment. We aren't quite there yet, but may get there someday.

C

Depending on the part of the world that you trade in, there could be trading activity at any point in time, 24 hours a day and seven days a week, excluding some holidays.

C

So overall, we have a very narrow window of time in which we can release software changes. Observability is really important to us, because we have to understand the state of our software at any point in time, and I'll get into why that's important as we go. We have hundreds of applications that make up our trading platform, which our traders and quants use every day to run our strategies. Engineers, quants, and even traders write code.

C

Most of our code is written in Python, C++, and Java. So I'll go a little bit now into the technology stack and some of the challenges that we currently face.

C

Our current environment is made up of a mixture of things. We pride ourselves on the research that we do, our pricing, and our risk management.

C

We have a blend of systematic trading as well as human trading, and to accomplish this we have a complex set of trading systems that have traditionally run largely on-prem. There are a few driving factors as to why.

C

One of them is that we need to be co-located close to the exchange engines themselves. That's table stakes today in order to compete. So one of the reasons for our traditional on-prem data centers is being co-located.

C

That may change over time, depending on how market structure changes. There are exchanges that have struck deals with big cloud providers.

C

The New York Stock Exchange, as well as CME Group, have recently struck deals to extend their systems into the cloud providers. So we'll have to see where that takes us, but for right now there is still a need to be co-located.

C

Some other things that have traditionally kept us on premise are multicast-type protocols, customized hardware configurations, and data locality. These can be challenging at times in a cloud environment. Similarly, and combined with the on-prem point, there are low-latency requirements. We aren't necessarily a shop that is worried about every nanosecond, but we do need to compete on speed. Again, it is table stakes these days to have some level of speed in order to compete with others in the marketplace.

C

So low latency is important to us. There are customized server and switching configurations that aren't available in the cloud, and we have specialized algorithms that take advantage of this hardware capability. So low-latency workloads also traditionally live in our on-prem data centers and don't run in a cloud environment.

C

Next is our high compute. I touched on some of the reasons we run on-prem; high compute really opens the door to running in other places.

C

Like I said before, research is one of the things we pride ourselves on, and there's a lot of computing that needs to be done when you're researching and doing things like backtesting. Backtesting is a common practice in a trading environment to see how your strategies are working over time, and it requires a lot of compute.

C

To do that on-prem, we would have to really scale, and that can be very costly and difficult. That's one of the areas where the cloud has become front and center, because we can scale there quicker, we can take advantage of tooling and native cloud functions, and we can leverage economies of scale more easily in the cloud than we can on-prem.

C

So for backtesting and some other types of applications that are not latency sensitive, we've changed our posture and started to consider those applications for a cloud environment.

C

The fourth item here is monitoring and observability. Over the years we've used a mixture of third-party tools as well as some custom tools that we've written, and as we've made this pivot to move into the cloud and change our application architecture, those tools are being challenged and there are gaps in them. That's where a tool like OpsCruise can come in and provide not only technical value but business value.

C

As we've begun to adopt containers and change our architecture, we've really had to rethink our approach to monitoring and observability.

C

So what has the cloud native shift looked like for us? Markets are ever changing; they're always moving faster.

C

The data sets are ever increasing, and we are always trying to stay ahead of the curve. One of the things we've been challenged with over the years, and continue to be challenged with, is getting our ideas into production as quickly as possible so that we can understand the impact they're having on our business. Is it working? Is it not working?

C

So it's really important that we can iterate quickly. With that said, we also need to keep our outages low. Like I said at the start, being in the market is paramount to us. Not only do we provide a service to the markets for our customers, but there's also an opportunity cost for any trading firm when it's not participating in the marketplace. If you're not participating, there isn't an ability to capture opportunity.

C

So we had to think about how we can scale. We started moving down the path of microservices, where initially we broke up monolithic applications into smaller pieces, but they're still highly dependent on each other, so it has become more of a distributed monolith. So now we need to continue to modernize our application architecture, adopt a cloud-native approach, and start using things like containers, K8s, and cloud providers such as Azure and AWS.

C

So we really started to modernize once again, to reduce our slow iteration cycles and get our ideas into production faster.

C

By moving to things like containers and K8s, or in our case OpenShift, and leveraging our cloud providers for economies of scale, we have been able to start down the journey of really scaling out our applications, which was a limiting factor when we were solely on-prem.

C

But this has significantly changed the way in which we monitor and observe our applications, so we've had to really rethink: how do we fill these gaps? How do we know what a container is doing? How do we know how Kubernetes is working? There are a lot of things in play when you introduce these new technologies, and traditional monitoring and observability tools weren't necessarily built from the ground up to handle these situations.

C

So we've really evolved our focus on telemetry: a focus on instrumentation, what we should be logging, what our metrics are, and preparing for things like tracing. As we've done that, we've looked to focus on open source tools.

C

We've started to focus on open source tools because we're trying to solve some pain points, one of them being the swivel chair. We have logs, we have metrics, we have dashboards. A lot of traditional monitoring tools today do a pretty decent job of bringing all of these things together into one dashboard, but there's still a bit of context switching, and there is still a massive amount of data that you have to interpret.

C

It can be hard to understand which data you should be using and which you should not, and sifting through all that data is a challenge. That brings us to logging. What should we log? What should we not log?

C

As we know, logging data typically doesn't decrease. At least in CTC's case, we're always creating more applications, logging more, and trying to understand more about what's happening with an application. But you can only store this data for so long, it can become very costly to store it with a closed-source vendor, and it can be hard to assimilate this log data with other metric and tracing data.

C

Open source tools can definitely help with this, but then there's the question of skill sets. The open source tools are incredibly powerful.

C

They allow you to avoid vendor lock-in and they give you flexibility. With that, you do need to have some knowledge of these open source tools, and at CTC we're continuing to build our knowledge. It's certainly been a journey; we don't have all the knowledge, and it can be difficult to hire for those skill sets or build them internally.

C

As you start to build the knowledge and figure out what data you want to collect, you're still trying to assimilate that data. That's where something like a smart layer on top comes in, one that can natively plug into open source tools.

C

You can continue to use your investment in open source and the flexibility it provides, and in addition you get a smart layer on top of that. I'll talk in a little bit about how OpsCruise comes in and some of the business value it provides there.

C

All right, so next I will talk about the open source tools themselves. Maybe go back one slide, Sridhar.

C

So here's the layout of some of the open source tools out there, things like Prometheus, Loki, and Jaeger. OpsCruise works with all of these natively, out of the box. So even if you have these tools today, there's not much you really need to change, and you can leverage all of your investment in your current telemetry collection.

C

Telemetry collection itself has really become commoditized by a lot of these open source tools, so you don't have to lock in with a vendor when it comes to collection. You can get that in an open source way.

C

There's also OpenTelemetry. There are several standards and different protocols within OpenTelemetry that work with Prometheus, Loki, and Jaeger, and you can use it for logs, metrics, and tracing.

C

You can use the OpenTelemetry libraries within your application, so everything from a data collection and sending standpoint can be done in an open source way, and you don't have to worry about locking into a vendor. The only thing you don't get is a smart layer, that smart layer being things like machine learning that give you more insight into your data.

C

Traditional monitoring and observability tools don't always have those capabilities. They're very good at collecting the data and giving you the ability to graph it, but contextualizing the data is really the next evolution, at least from my perspective, when it comes to observability.

C

All right, we can go to the next one. So, CTC and OpsCruise. This is where the intersection really happens and where the business value comes in. One of the things OpsCruise provides is telemetry unification and support: they can bring together all of your logs, metrics, and traces.

C

All of that can be collected in an open source, non-vendor-provided way. They will gather that data, display it, and contextualize it. And, like I've said, it leverages the collectors that are out there today, like Prometheus, Loki, and Jaeger, so if you have prior investments there, those will not be wasted. It also has flow tracing in it.

C

This is a very unique thing: it's based on eBPF, so there's very little investment needed for you to understand how your applications are interconnected.

C

It uses this to present the application map, which Sridhar will go into, so you don't have to do any customized tracing to get an application map and see how everything is interconnected.
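As a rough illustration of the idea described here, the sketch below builds a service dependency map purely from observed connection records, with no tracing code inside the applications. The `Flow` record, the service names, and the byte counts are all hypothetical. This is not OpsCruise's implementation; in a real system such records would come from an eBPF-based collector rather than a hard-coded list.

```python
from collections import defaultdict
from typing import NamedTuple


class Flow(NamedTuple):
    """One observed connection (hypothetical record shape)."""
    src: str         # calling service
    dst: str         # called service
    bytes_sent: int  # payload size seen on the wire


def build_app_map(flows):
    """Aggregate flows into {src: {dst: total_bytes}} edges."""
    app_map = defaultdict(lambda: defaultdict(int))
    for flow in flows:
        app_map[flow.src][flow.dst] += flow.bytes_sent
    # Convert to plain dicts so the result prints and compares cleanly.
    return {src: dict(dsts) for src, dsts in app_map.items()}


# Made-up sample data for illustration only.
flows = [
    Flow("frontend", "cart", 1200),
    Flow("frontend", "catalog", 800),
    Flow("cart", "redis", 300),
    Flow("frontend", "cart", 400),
]

app_map = build_app_map(flows)
for src, dsts in sorted(app_map.items()):
    for dst, total in sorted(dsts.items()):
        print(f"{src} -> {dst}: {total} bytes")
```

Each distinct source and destination pair becomes one edge in the map, and repeated connections simply add to that edge's traffic total.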

C

It is done without any development time at all. Then there's architectural governance: it provides you an inventory of where your containers are running, how they're running, and what they're doing. You have a lot of visibility into that, and it really brings it all together. It's an easy way for you to understand the inventory of where things are.

C

At least for me, having spent a lot of time troubleshooting applications, as the environment becomes larger and larger, especially in a microservices environment, you can easily lose track of where things run.

C

So this is a really important piece, and there's a lot of business value here, because the quicker I can find something and understand what is going on, the quicker I can understand root cause and solve the issue. And then it brings ML into the fold, so I don't have to apply a lot of human power to understand and assimilate data.

C

The ML learns over time and can present to you issues that it has found. A lot of times it can take on the order of days for engineers to find a configuration that might be causing an issue in the environment. I personally just ran into this a few days ago: I've had engineers spending hours, or actually days, trying to find an issue within Kubernetes that a tool like OpsCruise, through its ML, could easily surface in minutes.
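To make "learns over time" concrete, here is a minimal sketch of one simple way a tool could learn a metric's normal behavior and flag deviations: a rolling mean and standard deviation with a z-score test. This is a generic textbook technique chosen for illustration, not a description of OpsCruise's ML. The window size, the threshold of 3, and the sample latencies are assumptions.

```python
from collections import deque
from statistics import mean, stdev


def find_anomalies(values, window=10, z_threshold=3.0):
    """Return indices whose value deviates strongly from the recent baseline."""
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(values):
        # Only test once a full window of history has been observed.
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip the test when the baseline has no variation at all.
            if sigma > 0 and abs(value - mu) / sigma > z_threshold:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Hypothetical request latencies in milliseconds with one spike.
latencies = [20, 21, 19, 20, 22, 21, 20, 19, 21, 20, 20, 95, 21, 20]
print(find_anomalies(latencies))  # -> [11]
```

Unlike a fixed threshold, the baseline here is derived from the metric's own recent history, so the same code adapts to services with very different normal ranges.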

C

All right, next.

C

Yep, all right. So the features that I really enjoy about OpsCruise: one is this application map. It's really intuitive, it's amazing, and it's out of the box. I don't have to do any custom tracing or any custom development, and I don't have to invest any development time. I can just install the agents, start my applications, and I get an application map.

C

The next thing that I really like about the OpsCruise tool is fault isolation and cause analysis. There's a lot of business value here. Anyone can manage open source tools and collect data; that's pretty easy to do, assuming you have the skill sets to do it.

C

What's not easy is assimilating that data and finding issues within it. It's really powerful when a tool can tell you something quicker than you could find it out researching on your own. In CTC's line of business every minute, every second really counts.

C

When there's an outage and time is ticking away, we're losing the ability to capture opportunity. This is where the business value of OpsCruise really comes in: being able to contextualize this data and find faults quicker. The quicker you find them, the sooner, at least for CTC, we're back in the market and capturing opportunity. All right, the next feature here is that it makes more data available. It pulls all of your log, tracing, and metrics data together.

C

It pulls it into one view, and really anybody can log in and see this data, so you don't have to have as much operational expertise in how a dashboard was built or how it's being presented.

C

All of that is pulled together in an easy-to-read view within OpsCruise. The final thing is being able to look back with topology and understand where in the topology of an application things may have broken down, or where you might be experiencing an error.

C

And then, and there's not necessarily a screenshot for this, one of the things that's really invaluable in my mind is that OpsCruise has an extreme amount of knowledge and expertise with open source tools and Kubernetes itself, as well as with running and operating in a cloud-native world. Their expertise is invaluable and they're an excellent partner. They can really help guide you with your challenges in either collecting or presenting data. I just wanted to mention that as an additional thing that I really appreciate about OpsCruise.

C

So I'll hand it off here to Sridhar, and he can jump into a more technical deep dive on the OpsCruise tool itself. I'm happy to answer any questions afterwards. Thank you.

B

Hey, thanks, Luke. That is great; thanks for speaking so highly about us. I'm going to spend a few minutes going through some slides that give you an idea of why we do what we do, and what we do. Though all of us know this, it's worth spending a minute catching up on what the fundamental challenges are today, where observability has to make some changes.

B

The first thing is that things are very complex. There are a lot of abstractions everywhere, and abstraction creates points of performance loss. Performance issues also, just like traffic bunch-ups, move around the system; they're no longer static and easy to catch. There are also dependencies: we build systems with a lot more dependencies today than we did some time ago, and that brings in a set of problems that you have to deal with. And finally, as Luke also mentioned, you've got to recognize that the speed at which we update and move products forward is increasing.

B

These are the three complexities that really undergird the observability goals of today. But all of this can be seen as a problem of disjointed monitoring. There's no shortage of data today; data comes out from everywhere we look, and you can bring the data together and say, I've got it here, I've got it in the dashboard here, I've got a dashboard there.

B

But that is the data-rich, information-poor situation that we have seen in the market. Because of that, and partly in spite of that, monitoring is manual and lacks closed-loop resolution, and that's a problem that needs to be dealt with.

B

So these were the challenges we thought we should try to solve as OpsCruise. As these complexities, dependencies, and dynamics increase, the problem is multiplied. Instead of building a small number of thick things, where your focus was on looking inside that thing, we have a larger number of thin things, and that changes the way things behave. So emergent behavior becomes part of the problem scenario in today's systems.

B

So what else do we need, in addition to the normal telemetry that we know and love, like metrics and logs? Clearly, traces have added a lot of value and are an important aspect of telemetry. But there are other things we need to know. We need to know the structure and dependencies of the application. It's not just about following a trace from one container to another. It's also: is it going to Lambda? What sort of RDS is it accessing and dealing with? So there's a dependency and structure that you have to deal with.

B

We also need to look at every element, such as a container, in a 360-degree view. What is coming in? What is going out? What resources is it using? What services does it provide? You've got to look at that completely. The third thing is that you've got to have a way to understand the behavior of the application.

B

SLOs and thresholds are useful and provide value, but they don't represent the behavior of an app or a container. They don't represent what it's all about, because things are non-linear. So having a better way to look at a container is important. And finally, in all of these systems, human expertise has to be captured and laid into the architecture so that it can complete the story and make it valuable for everybody.

B

So, having said this, what do we do? There are two things. One is that observing and following the standards is important: the more we have standards, the better it is for everyone, because quality goes up and costs come down. That's one thing we all agree we should do. Beyond that, there are so many great tools which conform to those standards and are also open source.

B

So we felt that we had to lay our strategy on top of these two things, the standards and the open source tools. We are not in the collection business. We don't even intend to be the organization which keeps your data for the long term; we only keep it for the purpose of providing operational support in the short term. So you have the freedom to move your data wherever you want and keep it for whatever other purposes you see fit.

B

So the types of data and tools are shown in this slide. In addition to the metrics, logs, and traces, we get configurations. Orchestrators like Kubernetes, which we support, and of course Nomad and others, provide a level of configuration information in a standardized way, which is very important for observability.

B

Then you obviously need changes and topology. As systems are increasingly constructed from small things, you need the topology, without which you can't deal with observability completely. And finally, the knowledge of each application: how do we understand each particular application? How does this one work differently from that one? These are the pieces of information we feel we have to add to the story of observability before we can handle it all together. So that's what we do.

B

We bring in all this other information, in addition to the metrics, logs, and traces, to give you a complete story. How do we deploy? This diagram has lots of colors in it and, like a Mercator projection, it may show our tools as much larger than your application. Really, we just provide a set of pods, some of which run as DaemonSets, as in the case of node exporter and cAdvisor.

B

These are all open source tools. Then we provide these orange gateways, simple, very lightweight pods which pull the data and send it to OpsCruise.
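Node exporter and cAdvisor expose their metrics in the Prometheus text exposition format, so anything that pulls from them ends up reading lines like the ones below. The parser is a simplified sketch using only the Python standard library: it handles plain `name{labels} value` samples and skips comments, but ignores timestamps, escaping inside label values, and other details of the full format. The sample payload is made up, and this says nothing about how the OpsCruise gateways are actually implemented.

```python
import re

# metric name, optional {label block}, then the sample value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse simple exposition-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Made-up payload in the shape a node-level exporter might return.
payload = """
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_cpu_seconds_total{cpu="0",mode="user"} 67.25
node_load1 0.42
"""

for name, labels, value in parse_metrics(payload):
    print(name, labels, value)
```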

B

We also listen to the cloud and all your cloud products. We also cover VMs, if you've got them, and most clusters have VMs working right next to them and sharing the environment.

B

So all of that we monitor as well. It's a simple deployment that is easy to do in a few minutes, and data is available for you to see almost instantaneously. Some of the problems we try to solve, and succeed at, are Kubernetes day-one and day-two problems. A lot of people are in the day-one and day-two situation, where you've got to figure out basic things, but Kubernetes itself also makes it very complex because of the interdependencies and multiple objects that it defines.

B

So we solve a lot of Kubernetes problems. We solve a lot of technology-related issues. We do some serverless as well, so that if an application is talking from a container to a Lambda function, you can include that in your overall observability in an integrated way. These are some of the types of problems that we deal with. In the interest of time I'm going to switch over to... oh sorry, I clicked the wrong button.

B

Yeah, so I'm going to log into our UI so you can see how our system works and give you a quick introduction to our product. What it does right up front is give you an app map. This is, you could say, the landing page for a system.

B

Okay, so we discovered this automatically. All of this is picked up by our system within five or six minutes of you installing it, and it automatically draws a picture of your environment, here in the "Death Star" view. When you look at each one of these things, and I'm expanding it a bit for you to see, you get to see the different pieces. For example, you have an RDS sitting there, we have ELBs in another part of the system, and we have different containers around the system.

B

So you can pick any one of them. For example, one second, let me just figure out where it is. Yeah, so you can pick any one of these. For example, pick a container, and the container information is immediately available for you to see. Not only that, it gives you all the metadata that you get from Kubernetes.

B

It also gives you the metrics that are picked up. You can also see its labels and, right there, its logs, connections, traces, and even a three-layer view. This 3D view tells you that this application is running, shows the information about that container, and also shows you which Kubernetes node it is running on and its neighbors, in case you have a noisy-neighbor problem.

B

At the same time it shows what infrastructure node you're running on, in this case AWS, with your instance name and the metrics for that aspect of your architecture. So you get all these things integrated and made available instantaneously, right from the system. This also happens for other things. For example, if you're looking at, say, RDS... everything is a little slow when this thing is going on.

B

So here you get information about RDS, which we also pick up and make available to you. You have different ways of viewing this. For example, this gives you a service view: the service interactions without the pods, so you can see how they are all connected together. You can filter them with all of these filters and save them, all those standard things that you see in dashboards. But just quickly moving on.

B

You get the node map, which discovers all your nodes and shows you the information that you need about them, including metrics and configurations. When we talked about providing more data to more eyes: it's much easier for people to use such a system than to get into Kubernetes. You don't even have to teach them the operational tools. The app folks can access the system and see what's happening with their part of the application.

B

Now, we also provide some analysis that helps with capacity planning. It looks at all the pods on a node, tries to estimate their CPU and memory usage, and tells you whether evictions are possible, whether you need to provide a better... Look at this: it's Burstable, right, so you may be evictable. These types of things are also provided here. I'm just running through this quickly to give you a quick idea. Similarly, we've got other things.
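
Editor's sketch: the "Burstable" label mentioned here is a Kubernetes QoS class, which Kubernetes derives from each container's CPU and memory requests and limits. The following is a minimal illustration of that derivation, not OpsCruise's capacity analysis. The `Container` type and `qos_class` function are hypothetical names, and quantities are compared as plain strings, so `"1"` and `"1000m"` would wrongly count as different.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    # Hypothetical stand-in for a container spec,
    # e.g. requests={"cpu": "250m", "memory": "128Mi"}.
    requests: dict = field(default_factory=dict)
    limits: dict = field(default_factory=dict)


def qos_class(containers):
    """Classify a pod as Guaranteed, Burstable, or BestEffort."""
    # BestEffort: no container sets any request or limit.
    if all(not c.requests and not c.limits for c in containers):
        return "BestEffort"
    # Guaranteed: every container has CPU and memory limits,
    # and each request equals its limit.
    for c in containers:
        for resource in ("cpu", "memory"):
            limit = c.limits.get(resource)
            # A request left unset defaults to the limit.
            request = c.requests.get(resource, limit)
            if limit is None or request != limit:
                return "Burstable"
    return "Guaranteed"
```

A pod whose only container sets a CPU request and nothing else comes out as Burstable in this sketch. Under node pressure, Burstable pods whose usage exceeds their requests are among the eviction candidates, which is the risk the speaker is pointing at.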

B

We have a k8s view, which gives you all the k8s things. You're looking at the nodes: you've got five nodes; click on the nodes and you can see all the node information provided. And then we have a service performance dashboard here. What we do here uses the same technique that helped us build the app map. What it's doing is...

B

I don't know what, right now, but I'll come back to it in a second. So we have events and logs, functions which are easy to understand. You can pick up and see all your logs, search on them, and look at the labels that they've been tagged with. These are functions that I think you're all familiar with.

B

These are Kubernetes events, and also the processing of these events in the system. I'm having an issue here. But what you see in the service performance dashboard is the URLs, right, for every service-to-service interaction, and... maybe that's what this is.

B

Sorry. Damn it. Excuse me, guys.

B

Yeah, so what we see here is, for example, you can pick up...

B

Here, for example, you can see traffic going from one place to another, and you can see the L4 data for that traffic, that is, the bytes and all that. And depending on whether L7 is available, you get to see the URLs that are in there too. So, let's search for cart.

B

For example, we have traffic here, and then you can see the URLs which are flowing. This is all done without any tracing and without touching your container, and there is no impact on your deployment.

B

This is done completely hands-off, right? So you can see the average response time and latency in all of these, which come as part of our solution.

B

As soon as you install the software. So in total we've got logs, events, time travel, search, Kubernetes. And time travel is something very unique. Every five minutes, we take a snapshot of the system and store it, so you can go back to any time in the past and look at how your topology looked. For example, you have a problem at 12 o'clock in the afternoon, and at four in the evening

B

you want to look at what the problem was, what took place at 12 o'clock. What you find is that the topology has changed: things have scaled, and your pods are not available anymore. So what would you look at? You need a way to go back in time and see the topology the way it was, and that's what our time travel feature provides.

B

You can go back and say: okay, there were four more nodes and 23 more pods at 12 o'clock, and those were the ones with problems, maybe not the ones which are running today, at this point. So that's our idea of time travel, which we think adds value in a dynamically changing world. So this is a quick summary.
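
Editor's sketch: the time travel idea described here (periodic snapshots, then looking up the topology as it was at a past time) can be illustrated with a small in-memory store. This is an assumption-laden toy, not OpsCruise's implementation: `SnapshotStore`, `record`, `at`, and `gone_since` are hypothetical names, a snapshot is reduced to a set of pod names, and timestamps are plain seconds that must arrive in increasing order.

```python
import bisect
from dataclasses import dataclass, field

SNAPSHOT_INTERVAL_S = 5 * 60  # the talk mentions a snapshot every five minutes


@dataclass
class SnapshotStore:
    times: list = field(default_factory=list)      # sorted snapshot timestamps
    snapshots: list = field(default_factory=list)  # pod-name sets, same order

    def record(self, ts, pods):
        # Assumes snapshots are recorded in increasing time order.
        self.times.append(ts)
        self.snapshots.append(frozenset(pods))

    def at(self, ts):
        """Return the latest snapshot taken at or before ts."""
        i = bisect.bisect_right(self.times, ts) - 1
        if i < 0:
            raise LookupError("no snapshot at or before this time")
        return self.snapshots[i]


def gone_since(store, past_ts, now_ts):
    """Pods that existed at past_ts but no longer exist at now_ts."""
    return store.at(past_ts) - store.at(now_ts)
```

If a store holds `{"cart-1", "cart-2", "web-1"}` at time 0 and `{"cart-1", "web-1"}` at `SNAPSHOT_INTERVAL_S`, then `gone_since(store, 120, SNAPSHOT_INTERVAL_S)` returns `{"cart-2"}`: the pod that was present during the earlier incident and has since disappeared.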

B

Finally, there are alerts. These alerts are generated in multiple ways. We generate alerts from logs and metrics in the traditional way; we have that too. And we also generate alerts from our machine learning, which, without setting thresholds, is able to tell you that there is something wrong in this container. Something is off; it's not normal.

B

And that's available here, for example. Here is an alert which has come up automatically, without any thresholds, saying that something is off here. It also gives you an explanation, saying that these three metrics have created this situation, where we think something is wrong or off. Now, we look at a lot more metrics. We look at about 34 metrics just for a container, and our machine learning algorithms

B

look at all of this, but then identify that, because of a combination of metrics, compared with our learned past behavior, you're not normal, and these are the three things you ought to look at. So that's our ML model. The other type of alert that we do is, for example, a chain model. We look at the chain of...

B

A response time problem has taken place, right, and our analysis system runs through and looks at all the containers that are in the path, prepares the information, and makes it available to you, saying: you had a problem in the SLO here, but we see that down the chain you have a problem in this cart cache, and here this cart server has a problem.
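
Editor's sketch: the chain analysis described here (start at the service that breached its SLO, then follow the call path through the components that also look unhealthy) can be pictured as a graph walk. The talk does not say how OpsCruise implements it, so the function below is only a simplified illustration; `fault_chain`, the `calls` map, and the service names used in the note are hypothetical.

```python
def fault_chain(calls, unhealthy, start):
    """Return call paths from `start` that pass only through unhealthy services.

    calls:     dict mapping a service to the services it calls
    unhealthy: set of services currently flagged as having a problem
    start:     the service where the SLO breach was observed
    """
    chains = []

    def walk(node, path, seen):
        # Follow only downstream dependencies that are themselves unhealthy.
        bad_children = [
            child for child in calls.get(node, [])
            if child in unhealthy and child not in seen
        ]
        if not bad_children:
            # No unhealthy dependency below: the chain ends here.
            chains.append(path)
            return
        for child in bad_children:
            walk(child, path + [child], seen | {child})

    walk(start, [start], {start})
    return chains
```

With `calls = {"frontend": ["cart-server", "catalog"], "cart-server": ["cart-cache"]}` and `unhealthy = {"frontend", "cart-server", "cart-cache"}`, calling `fault_chain(calls, unhealthy, "frontend")` returns the single chain `["frontend", "cart-server", "cart-cache"]`. The healthy `catalog` branch is pruned, which mirrors how the demo narrows many services down to one path.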

B

So if I just click on that and look at its analysis, I get to see more about that one segment, where the last piece in the chain had a problem. So you get to see the whole chain, and this is discovered on the fly, based on the problem that took place in that.

B

So you could have hundreds of containers, pods, and services running, but it isolates the tree down based on the current situation and presents it to you so you can see where the problems are, and then narrows it to an action that you take to fix it. Now, there are other normal things. Let me just give you one more: a replica count that is not okay.

B

It figures that out, does some level of analysis, and can give you some information. There are many kinds of problems [unclear]: image tags, invalid image name, volume mount failed, missing ConfigMap. All of this is analyzed and provided to you as part of our RCA process. All right, so that comes to the end of the quick demo that I wanted to do today. Let me go back to the slides.

B

Okay, so, any questions? It's time for questions. We're close to time here. So, any questions from anybody?

C

I think there are some questions here, Sridhar, around its interaction with Istio, as well as any kind of overall system performance impact, which I think Cesar largely addressed as well, and then, I think, a search option. Yeah.

B

Let me just... I just saw the questions now. Cool. Yes, let me just answer them one by one quickly. Yes, we do work with Istio. When Istio is there, we are able to pick up flow data from it. We set up certain configurations in Istio which allow us to pull useful metrics from it. Then, if you're already running the open source tools: there are customers who already use the open source tools that we suggest, so we don't have to install anything.

B

But if you don't, we would do that for you, and we just connect to your existing tools. So, if you're running the open source tools in a cluster, the data movers, or gateways, use up less than 1 GB. So that sort of addresses the question about system impact.

C

What about search? Any kind of search functionality within OpsCruise itself?

B

I mean, all our screens do have search. You can search in every screen, whether you're searching for logs, events, pods, alerts, or time travel snapshots. Search is built into all of them.

C

Right, yeah. I don't see any other questions here, but just to add a few things to what Sridhar was demoing, and why it's important, at least from a customer perspective, my perspective: being able to diagnose,

C

like he showed, what the container is doing, showing faults and things like that. Those are things that my engineers spend a lot of time trying to figure out.

C

So there's a lot of value if I can find those things quicker through a tool, and that's where OpsCruise really provides some uniqueness: in being able to find those issues quicker, as well as to understand things. Since we're in such a dynamic environment when it comes to Kubernetes, it can be hard to piece together what happened in time. So seeing what it was versus

C

what it is now is also very important in the troubleshooting process, and really aids with getting to root cause analysis quicker.

C

So this is where OpsCruise is really unique in observability overall. Rather than providing dashboards that help you find root cause but still leave you to figure it out on your own, it really contextualizes all that information for you. That's where it's really unique and set apart from other tools, at least in my mind.

B

You can view us as a smart layer that sits on your telemetry but leaves you with all the options to manage your telemetry yourself. You could be using your data for purposes other than just basic observability: it could be for product management, it could be for customer and market. There are so many reasons that you need to use your telemetry, so you don't necessarily have to leave it with us. You can keep it on your own cloud. What we take,

B

we use for operational support, and then we toss it.

C

Yep, and it's also really important from a user perspective too, because with a lot of observability tools, people really lean on folks in SRE and operations to determine what's going on and how to interpret the data. In OpsCruise, developers and operations can all play in the same tool, and it's very easy to understand

C

the information, and you don't have to have that extra human layer of interaction to interpret the data.

A

All right, are there any other questions from anyone in the audience who wants to pop into the

A

chat? Going once.

A

Well, Luke and Sridhar, thank you so much for your time today. Thank you for all of your content and for answering questions.

A

Thank you so much for being with us, everyone. If there's nothing else, we will go ahead and wrap it up for today and see y'all at the next installment of Cloud Native Live. And thank you both again so much.

C

Thank you, guys. Pleasure.

A

All right, see y'all later.