From YouTube: OpenShift Commons Briefing #55: Monitoring OpenShift & Detecting Performance Anomalies with CoScale
Description
In this session, CoScale's Peter Arijs and Samuel Vandamme demonstrate how to monitor your OpenShift environment with CoScale's container monitoring platform. CoScale tracks container metrics and lifecycle events, combined with detailed in-container application metrics, to give visibility into your full stack running on OpenShift.
Speakers: Peter Arijs, Product Marketing Manager, and Samuel Vandamme, Product Specialist – CoScale
A: So I'm going to let Peter start us off with a bit of an overview of CoScale, and Samuel is going to give us a deeper dive, with a demo, into their offering. The format for this is: if you have questions while people are talking, put them into the chat. Samuel or I, or one of the other folks who are on, will try to answer them, and once the presentations and the demo are done, we'll open it up for Q&A for everybody. So without any further ado, Peter, take it away.
B: Thank you very much, Diane, and you pronounced my name very well, so no problems there. And thank you, everyone, for joining. Let me say a few quick words about CoScale first. We offer, as Diane said, a monitoring solution. We call it full-stack performance monitoring, but it's really focused on microservices environments such as OpenShift, and our solution is a lightweight solution.
B: It's built specifically for production monitoring, and we use anomaly detection to find problems faster. We offer this as SaaS as well as on-premise, and we are firmly embedded in the container ecosystem, as an OpenShift Primed partner but also a Docker Ecosystem Technology Partner. So with that, let's talk a bit about how CoScale fits into the OpenShift ecosystem, and to do that, let's first have a look at the problem that OpenShift tries to solve.
B: When we look at the evolution of application architectures, we see a clear shift from monolithic applications running on physical servers or VMs in a data center towards much more agile development these days, with microservices that are supported by containers and cloud infrastructure. As we all know, on the infrastructure side, containers have become a fundamental building block for microservices. They offer an attractive way to build and package them and to ship them into production.
B
All
this
by
packing
all
of
the
dependencies
in
inside
of
our
containers
but
running
containers
in
production
and
half
scale
does
pose
a
new
set
of
challenges
compared
to
using
them
just
for
development.
You
have
to
start
worrying
about
things
like
orchestration
automation,
networking
and
storage
security,
hosting
disaster
recovery,
logging
and
monitoring
and
general
application
performance.
B: These are all questions you have to ask yourself when you move to production, and this is actually where Red Hat OpenShift comes in, because it offers a packaged container platform, built on Docker, Kubernetes, and various other components, that solves many of the issues I mentioned on the previous slide. As part of the platform there are also some basic logs and metrics, but OpenShift also has a strong ecosystem around it for more advanced capabilities, and this is exactly where CoScale comes in.
B: Now let's look into the monitoring aspect in a bit more detail. I think we all realize that monitoring is an important part of running an application in production, yet it seems that many people are still struggling with this when it comes to containerized applications. This is data from a recent survey by Cloud Foundry on the top challenges of running containers and microservices in production, and we can clearly see that monitoring is pretty high up there, just after container management actually: monitoring and troubleshooting microservices.
B: So let's look at these challenges in a bit more detail. The first obvious observation is, of course, that the number of containers is much higher than the number of servers, so the number of instances increases by an order of magnitude when we use containers. In a typical customer environment, we see customers use up to 10 or 20 containers per host, but we have even seen cases with 200 containers. So this is an immediate multiplication of the number of metrics to monitor.
B
The
second
aspect
is
that
containers
can
be
very
short-lived,
that
this
dynamic
aspect
also
introduces
challenges
in
rapidly
picking
up
matrix
from
containers
setting
relevant
alerts,
as
well
as
understanding
the
impact
of
container
life
cycles
on
performance.
A
third
aspect
is
when
we
compare
container
environments
with
monolithic
applications,
we
see
a
much
larger
diversity
of
application
technologies
used
across
containers
or
where
people
typically
use
the
technologies
that
best
suited
for
the
use
case
of
a
particular
microservice.
This
all
comes
together
in
an
overload
of
metrics
to
monitor
and
alert
on.
B: If you look a bit closer at how we would traditionally monitor a monolithic application, and compare that to a microservices application, we see that in a monolithic application we typically have three monitoring components. This is perhaps a bit simplified. At the infrastructure layer there are traditional system monitoring tools, where you look at typical resource metrics; then there's the application layer.
B: At the application layer you would typically use an APM tool, where you gain insight into the internals of your monolithic application. And finally, the end-user experience is typically monitored as well, using some form of browser instrumentation or another technique. Now for microservices, however, on a platform such as OpenShift, we see that an additional layer is introduced, and we now have a lot of smaller, lightweight, and loosely coupled application components that we need to monitor.
B: So, in order to understand application performance, we not only need to monitor these container instances themselves, but also the way they are orchestrated, the way they are tied to services, and finally also the services running inside the containers. This is actually where most APM tools start to have difficulties, and this is also the opinion of Cameron Haight, a research VP at Gartner. In one of his recent reports, he also claims that these new application architectures, including containers and microservices, are really stressing the capabilities of APM tools.
B: Now, why is this? Well, first of all, most legacy APM tools were designed maybe five or ten years ago, specifically for monolithic applications written in, for example, Java or .NET. And because of the nature of monolithic applications, understanding what's really going on inside your application, and the interaction between application components, requires you to have code-level visibility into the application.
B: In fact, most of these heavyweight monitoring tools will require you to install an agent inside your container, and this is really an anti-pattern, since containers should be limited as much as possible to a single process. You don't want to pollute your container by packaging an extra agent in there. A final aspect is that most existing tools have a hard time keeping up with dynamic environments, especially if they use static alerting. I'll tell you more about that a bit later.
B: So if you're looking for a monitoring tool for a containerized environment, what visibility should it really give us, and what metrics should we monitor? At the host level, we obviously still want to monitor resource metrics, the typical things: CPU, memory, disk, and so on. Typically you would also use an orchestration tool; in the case of OpenShift it's a flavor of Kubernetes, but there are other orchestrators out there. And at this level you want to monitor things such as the number of containers, how they are set up, and the relationships between services and containers.
B: This gives you more service-oriented visibility, like which containers run which service, or which containers are impacted when a particular service starts degrading. At the container layer itself, we also want to keep track of the relevant resource metrics, CPU, memory, and so on, as well as when these containers are started and stopped: their life cycles. And it doesn't stop at resource metrics.
B: Of course, we also want to know the requests going in and out of our containers, as well as application metrics from the services that are running in our containers. These could be things like NGINX or Redis or MySQL; all of these services you also want to monitor in quite some detail. And then finally, our application will serve some end users, and ultimately also a business, and we want to monitor relevant metrics from that perspective as well. These could be things like page load times or conversion rates.
B: So those are the sets of metrics that you want to monitor. How does CoScale handle that; what's our approach to monitoring microservices and containers? Well, we run one lightweight agent per host. It can be installed either directly on the operating system or in a privileged container, and with that agent we can get server resource metrics at the OS level. We can also get container and cluster resource metrics, typically using the APIs from Docker and the orchestrator, in this case OpenShift.
B: Now, there are other tools that do that as well, but CoScale actually goes one step further, because we have a very rich library of plugins for various application components, and we can configure these in such a way that, first of all, any new container that runs a service for which we have a plugin will automatically get monitored when the container starts, and, secondly, we will get very application-specific metrics from these containers without the need to install an agent in the container. This is quite a unique capability.
B: In addition, CoScale also has a real user monitoring component, where we use a little JavaScript snippet to get end-user experience metrics from the web browser. We also allow you to track unlimited custom metrics; we have various ways of doing that: scripting plugins or logging, leveraging our APIs. And on all of these metrics, and this is the important part, we run automated anomaly detection that lets us quickly detect abnormal behavior. A final point: we also track relevant infrastructure changes.
B: This provides extra context on what's going on in your environment: things like container lifecycle events or events from your orchestrator, but also things like new deployments or configuration changes. These are all things happening in your environment that can have an impact on performance, and by also capturing these events, with the various integrations that we offer, we provide that extra context. So this picture is a visual representation of the CoScale platform, with our lightweight agent and all of the plugins.
B: Well, not actually all of them; it's a representative part of the plugins that we support, plus the real user monitoring component and the integrations for various custom metrics and events. With this data we can obviously create nice dashboards, we are a monitoring tool after all, but we can also automatically detect abnormal behavior using our anomaly detection. So, anomaly detection: I want to spend a little more time on it, since it is one of the differentiating features of CoScale.
B: Why is it so important to use automated techniques such as anomaly detection? Just have a look at the explosion in the number of metrics to monitor when we compare a traditional monolithic application with a containerized environment. Basically, the number of containers acts as a multiplier on the number of metrics and alerts.

B: Not that static alerts don't work or are bad; they actually work very well for well-understood, consolidated metrics, for example the number of visitors on your site or some business metric that you have a good handle on, but not necessarily for those thousands of metrics coming from your containers and microservices. And these are not the only limitations; beyond the amount of data, there are other limitations as well, like how to deal with dynamic environments, which require you to constantly reset or reconfigure your alerts.
B: So if we look at the definition of an anomaly, which is basically a deviation from what is normal or expected, this means that if we can get pretty good at predicting expected behavior, we can also get pretty good at detecting anomalies. This is basically what we have focused on at CoScale. We principally look at the historic behavior of all the metrics we monitor and make a prediction based on that.
B: We also include a fair amount of domain knowledge in that, and if we see a deviation from this expected behavior, we give it an anomaly score depending on how large the deviation is, and then we alert when this anomaly score exceeds a certain threshold value. Now, this is a simple explanation; there are a lot more sophisticated things going on, but this is the basic concept that we apply.
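The loop Peter describes, predict the expected value from history, score the deviation, and alert past a threshold, can be sketched in a few lines. This is an illustrative toy, not CoScale's actual algorithm; the metric values and the cutoff of 3 are invented for the example.

```python
# Toy anomaly scoring: compare a new observation against the mean and spread
# of recent history, and alert when the deviation is unusually large.
from statistics import mean, stdev

def anomaly_score(history, observed):
    """Score how far `observed` deviates from the behavior in `history`."""
    predicted = mean(history)          # expected value from past behavior
    spread = stdev(history) or 1.0     # guard against zero variance
    return abs(observed - predicted) / spread

def should_alert(history, observed, threshold=3.0):
    """Alert only when the deviation is large relative to normal variation."""
    return anomaly_score(history, observed) > threshold

cpu_history = [31, 29, 33, 30, 32, 28, 31, 30]  # percent CPU, last samples
print(should_alert(cpu_history, 32))   # small deviation: no alert
print(should_alert(cpu_history, 55))   # large jump: alert
```

The point of scoring rather than thresholding the raw value is that the same code works unchanged for any metric, which matters when containers multiply the metric count.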
B: What we will also do is group metrics on which anomalies are occurring at the same time, to give you a better understanding of what's happening in your environment. This is an example screenshot, but I think Samuel will illustrate it a bit better in the demo later. On our anomaly timeline we have different metrics at the server, user, and business level showing abnormal behavior, and basically we see that certain services and certain containers are overloaded; this creates an increase in latency on our website.
B: We also see that there are more views on our website, and basically our conversion rate is impacted as well. So this consolidated view, giving you all the metrics together, gives you a really good view of what's happening in your environment, and a lot of context to understand what a performance problem actually is. We're also applying outlier detection, which is a different form of anomaly detection, where we look specifically at metrics from similar instances in a cluster, such as containers that are supporting the same service.
B: So if we see any container with different behavior compared to the rest of the cluster, we can also alert on it. In this example, we highlight containers with increased memory usage. In general, this kind of outlier detection requires less of a learning period than anomaly detection on time series data, but the basic idea really remains the same: you can quickly detect changes in performance without having to set up a lot of manual alerts. That's the basic premise of CoScale, so I'm going to end my part of the presentation here.
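Outlier detection across peer containers, as opposed to detection against a metric's own history, can be sketched with a robust statistic such as the median absolute deviation. This is a minimal illustration under invented container names and memory values; CoScale's real method is not described in the talk.

```python
# Flag containers whose metric deviates strongly from their peers backing the
# same service, using the median and median absolute deviation (robust to a
# few bad values, and needing no learning period).
from statistics import median

def outliers(samples, cutoff=3.5):
    """Return the names whose value is an outlier relative to the group."""
    values = list(samples.values())
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1.0
    return [name for name, v in samples.items()
            if abs(v - med) / mad > cutoff]

memory_mb = {"web-1": 210, "web-2": 205, "web-3": 198, "web-4": 940}
print(outliers(memory_mb))  # ['web-4']
```

Because the comparison is across instances at one moment in time, this works the instant a new container joins the cluster, which matches the "less of a learning period" point above.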
A: I think Peter might have to stop sharing his screen.
C: Perfect, thank you. So, welcome to the CoScale application. If you create a trial with us, this is one of the first screens you will see after creating your account. It shows you the four main components of the CoScale platform. We have our real user monitoring, as Peter talked about, and we have integrations with a lot of third-party services.
C: We also have a lot of ways to do really custom integrations, both with config management, as we have a command-line tool, an API, and other methods of really tying your systems together with our monitoring. But today I'm mostly going to talk about the agents, because the agent, of course, is used to get server data, and specifically for this demo, OpenShift information and Docker information. So I'm going to click through, and I arrive on our agents page. This is the page where you can see all your servers and all the agents you've configured.
C: You probably recognize here the most popular open source tools, and specifically, we of course have support for OpenShift and Docker. I'm going to go into a little more detail about the Docker configuration a bit later. Now, because I've selected an OpenShift plugin, or a Docker plugin more specifically, I get two options for installing a CoScale agent.
C: Specifically for OpenShift, we have a configuration available that allows you to just add a DaemonSet to your OpenShift environment, and then the agent will automatically be deployed on every server that is part of OpenShift. So here I've quickly opened the OpenShift web interface, and the CoScale project contains my agent; you see I have four servers, each running one of our agents. I'm going to quickly show you the configuration for it as well: here is the DaemonSet.
C: Now, what information do we get from the CoScale agent? Peter also mentioned it a little in the slides. Because OpenShift runs a Kubernetes environment in the background, you're going to see a lot of the same concepts. So we get the data from replication controllers, we get the data from the services, and we get all the containers: which ones are running, and where are they running?
C: We also have a very powerful event system. Here, for example, you can see our replication controller overview, and every time you have an event of insufficient replicas, meaning that probably a container has crashed somewhere, you can clearly see this with our events, and you can go and research what happened. Below, we have our container overview, once at the service level and once at the host level. So you see here, I have 54 replication controllers.
C: Some are running five containers. I can clearly see which are more the helper containers started by Kubernetes, and you may have noticed that we sometimes show a different color for a container. This is because you can select a metric for each of these widgets and then set a threshold, which you choose yourself. I think in this case we've selected thirty and fifty percent CPU usage, and then, depending on the value we get back from the container, we color-code the container here.
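The widget logic just described, a chosen metric, user-picked thresholds (30% and 50% CPU in the demo), and a color per container tile, amounts to a small mapping. The function and container names below are illustrative, not part of the CoScale product.

```python
# Map a container's reported CPU usage onto a dashboard tile color using
# user-chosen warning and critical thresholds.
def tile_color(cpu_percent, warn=30.0, crit=50.0):
    """Return the color for a container tile given its CPU usage."""
    if cpu_percent >= crit:
        return "red"
    if cpu_percent >= warn:
        return "orange"
    return "green"

for name, cpu in {"api-1": 12.0, "api-2": 41.0, "api-3": 97.0}.items():
    print(name, tile_color(cpu))
```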
C: This way you can quickly see if some containers are maybe using too much CPU. In this case, it's clear that we have ninety-seven percent CPU usage; we might be impacting all the other containers running on the same machine. So this really gives you a bit of an overview of the entire environment.
C: Now, the next dashboard is a little more focused. This dashboard, as you can see here at the top with our dimension system, has just the data from the MongoDB replication controllers. I can quickly change this dashboard if I want to see data from other replication controllers, but here I get a little information on the container life cycle: I can see when containers were started, when they were stopped, what the exit code was, and also on which machines they are running.
C: I see what the CPU usage is, the memory usage, and network received and sent. Here again, you can set your own thresholds, so it's a very visual way of seeing whether a container is performing as you would expect. And then we have the event system that we talked about already: every time a container is started, it first sends the ready signal and then the running signal, saying: okay, I'm ready to get traffic from other servers. The next part I want to show you is a little more general.
C: This is a dashboard made by one of our customers, and they've chosen to put a lot of their services together. They run a microservice environment and they have a comments API, so a comments microservice, a product API, and a checkout API, all running on OpenShift, and they've chosen to put this information very clearly on the first dashboard that they open. So this is their home dashboard.
C: Let's say I see that my page load time is a little too high. I can click through on this tile, as we call it, and I arrive on a dashboard that was created specifically with real user monitoring information. So I get the page views coming from there, I get the page load time, I get my most popular pages and my slowest pages. Now, it might be that you see a page here that's a little too slow.
C: Here again you have the option to click through once more, because this is still the front end for the user, and now I've clicked and I arrive at the microservice level. I get the web microservice data from the containers that are delivering this web request; I get the latency and I get the error rate. This is just to show you how easy it is to link dashboards together and make a system that shows you the information you need.
C: Here we also have a couple of the alerts that were in this time frame, the anomalies, free memory, CPU load, and then another way of using our event system. This customer has integrated with our Mailchimp integration, so every time they send a mailing campaign, it's added to CoScale, and they're able to link it to performance problems, maybe, or changes in the metrics. They do the same for software deployments.
C: Peter mentioned that CoScale is a lightweight monitoring platform: we aim for very low resource usage on the servers we are monitoring, and for that reason we've made certain decisions in our design process. For example, we're not going to push the CPU load or the CPU usage of every single process running on my machine. But sometimes that's very valuable information: here I see a clear spike in my CPU, and I would like to know what happened at this time. It's for this reason that we added the forensics system.
C: The forensics system is a small, lightweight anomaly detection running in the agent, and when there is a sudden change, it takes a snapshot, a picture of the system, and sends it back to our platform. Then I can research it: okay, this spike was caused by the Docker daemon, probably deploying a new image, or something else. Now I want to jump back to the agents page, because I said I was going to explain our Docker monitoring a little more, especially because we do in-container monitoring.
C: The idea there is that the plugins you saw in the beginning, which are available if the agent is just installed on the host operating system, can all also be used to monitor what's happening inside a container. So let's say I find an Apache container, a container running the Apache software: I can get metrics from that Apache and monitor how it's actually performing. To show how we do this, I'm going to quickly open the configuration of our Docker image, our Docker plugin, excuse me.
C: The first thing is that it scales with your containers. If you're going from one Elasticsearch container to 25, it's not a problem: our Docker plugin is going to detect that, it's going to start more Elasticsearch plugins, and the data is going to be gathered, and you're able to see the data coming from each individual container, or all together at the image level or at the tag level. So we really allow you to compare data from previous versions to the new version.
C: So it's really a powerful system. The second advantage is that, because we start that plugin within the container itself, the configuration becomes a little easier. To give you an example: here we have the configuration for an NGINX plugin. CoScale gets a lot of its information from APIs and status calls.
C: So we need access to the NGINX global status page, or status page, and you might have noticed here that I use localhost. I hope it's clear on your screen, but I don't need to mount any ports, and I don't need to do any special configuration to be able to monitor this image. Because we start the plugin within the container, this localhost is just the container itself, so this port, in this case 8000, is just accessible without any additional configuration.
C: The other advantage is that the same holds for the file system: you don't need to mount any local disks on your host machine to be able to access this access log. This will just work, and the moment your container stops, this access log will be deleted. But that's fine, because CoScale has by that moment already gathered all the information from it.
C: It's a really handy way to monitor live, running containers. Now I want to show you a couple of dashboards that show a little of the advantage of having this system. Here I have a memcached dashboard, with general metrics coming from memcached: connections to memcached, network bytes received, the commands, and hits and misses. Now, you see that the commands metric had some changes: we used to be around 800 commands a second and we dropped down to 400, but we had some spikes, which is a little strange.
C: So what I can do is zoom in a little, and I can clearly see here that two containers were running, and all of a sudden one of the containers started misbehaving a bit, because it crashed, so the other container had to handle a lot more data. And if I look at the events, I see there were too few replicas, one was missing, and a little later a container was started, so we see the new line popping up here, with no manual work.
C: The other example is our NGINX dashboard. Here again we get the general dashboard, which you also get if you create a CoScale application, with the number of connections, the number of containers, the average latency, the request rate, and a nice heat map that shows me the performance of my containers over time, so I can quickly identify those that are maybe not performing as I'd like. And then here we have one more dashboard that shows me information on the latency of my website and the latency of all my requests.
C: So here we have a lot of containers delivering my website. You see at one point we added some new containers, because it seems there was an issue; these were probably handled by OpenShift itself, and then these new containers start delivering the website to the customer, and the data starts rolling in.
C: Okay, so now the last thing I want to show you, because Peter also mentioned this, and I think it's a very good point: in these new environments you have so many metrics to monitor and so many containers that it becomes very difficult to set meaningful static alerts that don't overflow your mailbox. But at the same time, you still need some warning that something happened in your system, and there we think that anomaly detection can really add value in these container environments. So, Peter also showed this.
C: This is the same anomaly we saw in the presentation. We have the anomaly at three levels, and we group them, so you can see here there was an anomaly on latency, we had a couple of anomalies on the request rate, and then we had an anomaly on CPU on both of those servers. I'm going to show some examples coming from containers, but just to show you where the screenshots from Peter came from: we can see that the latency of my website went up, and we have a nice dot plot.
C: This is across different pages, by the way. Something to note is that CoScale automatically builds a tree of your application and does anomaly detection on all individual pages, so if one page changes, you'll still be able to see this with the anomaly detection. Then we also have an anomaly on CPU usage; you can clearly see it went from thirty to fifty percent. This is, I think, a very good example, because normally you wouldn't set a static alert at 50 or 55 percent.
C: You would set it at 70, 80, or even 90 percent. But still, this is abnormal behavior for your server, and you would like to see what happened at this time. With the forensics snapshot I can then quickly research that NGINX was using more CPU, and this, of course, makes sense: I have more visitors, so my web server has more work. A different example, but more at the business level: we did a large proof of concept with a customer in the U.S.
C: They sent us a lot of their business data and our anomaly detection was applied to it, and we were able to find small issues, like here in this case, where the number of orders per minute all of a sudden dropped. If we zoom in a little, you'll see that they dropped to almost zero. So this was a big impact for them, and with the anomaly detection they were able to identify and fix it pretty quickly. And this is the last example: I have two anomalies here, one at the user level.
C: That one is the request rate, and the other is at the server level. I'm going to quickly open the user one: we went from around nine and a half requests to 14, and you see the anomaly detection system was able to quickly identify this. And if we take a look at the anomaly on CPU usage, this was detected on an Apache container, and this is a very clear anomaly where we go from zero percent, or very low CPU usage, to very high in a very short time. Again, it was automatically detected.
A: Thanks, Samuel, that's a great overview of how CoScale works and showcases the anomalies. Let me see if I can find the first question we had. Luke was asking about custom metrics from apps: are they supported as custom plugins? Because you have a lot of preconfigured plugins in there, but if someone wants something specific for their own apps, how would somebody go about customizing a plugin or creating a custom plugin?
C: You point it to your script or binary, and then CoScale, or rather the agent, runs this script every minute or every five minutes; this you can set up yourself, and then you can push data back to CoScale this way. So this is really more of a poll. You can also push metrics with our command-line tool. Together with our agent, if you install it as a package, or, well, we have a container available with the command-line tool, and with that you can easily push data.
C: I'll show it in a moment. Then we also have a plugin which we call our log plugin, and this is a really powerful tool. If you have existing log files that contain information that you need, which could be a latency or just a number, you can use regular expressions to get that information out of there. This is really an easy way to get data without having to make large changes to your environment. Then we also have the option to push data through StatsD and the CoScale API.
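The log-plugin idea, pulling a numeric metric out of an existing log file with a regular expression, no application changes required, can be sketched as follows. The log line format and field name `request_time` are made up for illustration; they are not CoScale's configuration syntax.

```python
# Extract a numeric metric (request latency in ms) from log lines with a
# regular expression, the way a log-scraping plugin conceptually works.
import re

LATENCY_RE = re.compile(r'request_time=(\d+(?:\.\d+)?)ms')

def latencies(lines):
    """Return request latencies (ms) found in the given log lines."""
    out = []
    for line in lines:
        m = LATENCY_RE.search(line)
        if m:
            out.append(float(m.group(1)))  # skip lines without the field
    return out

log = [
    'GET /checkout status=200 request_time=12.5ms',
    'GET /health status=200',                      # no latency field: skipped
    'POST /orders status=201 request_time=48ms',
]
print(latencies(log))  # [12.5, 48.0]
```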
C: If you really want to go and do a custom integration, we have a very mature API available that you can use. I'm going to quickly show the command-line tool. So here's an example of the command-line tool inserting data: this is the metric name, the level, and then the value. And just so you know, you can find more information on this in our documentation, at docs.coscale.com.
B: This is Peter. I just wanted to say there are also a few good examples on our blog for working with custom metrics. And as part of that question, I also saw that this person asked about monitoring specific transaction endpoints, like a specific request, and for that it's also worth noting that we recently introduced a new feature, basically active checks, that you can configure in our plugins.
A: And the other question, which Frederick sort of answered in the chat as well, was whether the anomalies are based on standard thresholds or are configurable via thresholds and predicted baselines. Maybe you could talk a little about that; this is an important piece.
D: Let's say, for example, it's CPU usage, and it's mostly tightly related to the request rate that is coming in. Then we will create a model that contains both the CPU and the request rate, and we will make a baseline of that which evolves with time: you have the per-hour derivatives, you have per-day, and so on. And so we'll create a different type of analysis for each of these metrics. For example, memory usage is not that dynamic.
D: You typically see it rising and going down, but not as fast as, for example, CPU usage, so it's a completely different model that we use there. We will automatically detect, based on the metric and based on the data, which model is the best fit for this type of data, and then generate the analysis based on that. There is no configuration needed: you don't have to set thresholds, and you don't have to specify what your metric will look like. It will be automatically detected, and we will have an automatic analysis for it.
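The time-aware baseline described here (per-hour, per-day patterns) can be sketched with a toy hour-of-day model: learn the expected value for each hour from history, so that a request rate that is normal at noon can still be flagged as abnormal at 3 a.m. This is a simplification of the idea, not CoScale's actual models; the sample values and the 50% tolerance are invented.

```python
# Toy seasonal baseline: expected value per hour of day, learned from history.
from collections import defaultdict
from statistics import mean

def hourly_baseline(samples):
    """samples: list of (hour_of_day, value). Returns hour -> expected value."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {hour: mean(vals) for hour, vals in by_hour.items()}

def is_abnormal(baseline, hour, value, tolerance=0.5):
    """Flag values deviating more than `tolerance` (50%) from the baseline."""
    expected = baseline[hour]
    return abs(value - expected) > tolerance * expected

history = [(9, 100), (9, 110), (9, 90), (3, 10), (3, 12), (3, 8)]
base = hourly_baseline(history)
print(is_abnormal(base, 9, 95))   # normal daytime load: False
print(is_abnormal(base, 3, 95))   # same load at 3 a.m.: True
```

A static threshold would have to treat both observations identically; the seasonal baseline is what lets the same value be fine at one hour and an anomaly at another.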
B: There was a question that we answered offline which was actually a good question; I would like to answer it in public. It's regarding our architecture and where we store data, and we want to be very open about our architecture, so I opened a slide here. Basically, we use very modern application components: our metric data is stored in Cassandra, event data is stored in Elasticsearch, and we also keep some metadata in Postgres.
B
Our
entire
architecture
is
such
that
it
can
be
perfectly
horizontally
scaled.
It
can
also
be
deployed
on
premise
in
a
doc
rised
environment,
so
that
that
also
makes
it
very
easy
to
set
up
and
and
scale.
We
recently
did
a
proof
of
concept
where
we
actually
handled
over
a
million
data
points
per
second.
So
that's
some
more
context
on
on
our
architecture.
A: Wow, very nice. Maybe not quite infinitely scalable, but very nice. I think that maybe answers Waleed's question about where the metric data is stored: it's Cassandra, Elasticsearch, and Postgres, so that's using some of the latest and greatest bits that are also part and parcel of OpenShift as well. We'll give everybody a few more minutes to see if there are any other questions.
B: Yeah, I want to thank everybody for attending, and if you're interested in trying out our solution on OpenShift, as I said, we are a Primed partner and our solution is available for everybody. Just go to coscale.com, start a free trial, and you can try it out for yourself for 30 days. And if you have any further questions, our contact details are here at the bottom of the slide, so feel free to reach out.