Description
Frederick Ryckbosch, founder and CTO of CoScale, joins us for a discussion of the performance considerations of running applications on OpenShift in production, and how to address them with CoScale's container monitoring platform. A detailed demo will be provided, including installation and configuration for OpenShift-specific insights.
A: Well, hello everybody, and welcome again to another OpenShift Commons briefing. This time I'm really pleased to have with us the folks from CoScale, Frederick Ryckbosch and Samuel Van Damme, who have been with us before. They're going to talk about using some of their CoScale tools and services for proactive performance management of OpenShift, and so I'm gonna let them introduce themselves. The format of this session is that there's a chat, so ask questions.
B: Now we will start with a couple of scenarios that we've seen happen at customers, and what the effects are that you can see on the OpenShift environment. Now, I know that Fred has put in a lot of time this week to set up a very nice OpenShift environment. So Fred, could you show us a little bit of what you've set up for us today?
C: Sure. So let me go to the OpenShift UI for that. As you can see here, we have a lot of things running: we have MySQL, nginx and so on running. But what does this thing actually do? It's a word count application. So if we go to this URL here, you can see how it works. Here you can submit some words, and then you get some statistics about the words that were entered before: you can see the most used words, the most entered words and the most entered letters.

Actually, it's a very simple application, very basic. However, somebody, well, over-engineered it a little bit. He put an nginx in front; nginx sends traffic to the receiver, which writes things into RabbitMQ. Then there is a service that picks it up and puts it into MySQL, and then there are other services that process the data and put it into Redis, and that's served to the customer again. So you can see those services here, and we can see the workers below.

We have some workers for calculating the letters, calculating the words, a processor and so on. So this is what we're doing in this application. Of course, I also installed CoScale and we're monitoring this environment. Here you can see that we are monitoring OpenShift, but also all the things running on OpenShift: nginx, Java processes, RabbitMQ, everything I just mentioned.
C: For creating a route, I just click this "create route" button, so I created the route for nginx. This means that traffic going to this URL will be sent to my OpenShift nodes. Once the traffic arrives at my OpenShift node, the OpenShift proxy will have a look at the URL, and based on the URL, it sees that it is the nginx-words hostname, it will send it to the nginx service. Then the nginx service will receive that request.

If we have a look at the nginx service and the code, we can actually see that the nginx service talks to the receiver, and here we provide an environment variable. So this is some PHP code for getting an environment variable; using an environment variable, it sets the receiver host. In our case, on this cluster, the environment variable is just set to "receiver", because the receiver is in the same namespace as the nginx. OpenShift will add it to DNS, and nginx will automatically be able to resolve the receiver.
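The service-discovery pattern Frederick describes, reading the backend host from an environment variable and falling back to the bare service name that OpenShift's in-namespace DNS resolves, can be sketched like this. The demo's code is PHP; this is a minimal Python equivalent, and the variable name RECEIVER_HOST is an assumption for illustration.

```python
import os

def receiver_host(default="receiver"):
    """Resolve the receiver's hostname from the environment.

    OpenShift registers each service in DNS within its namespace, so
    when the variable is unset we fall back to the bare service name,
    which resolves to the receiver service in the same namespace.
    """
    # RECEIVER_HOST is a hypothetical variable name for this sketch.
    return os.environ.get("RECEIVER_HOST", default)
```

The same idea lets you point the pod at a receiver in another namespace simply by setting the variable to a fully qualified service name, without changing the image.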
C: It depends on who you talk to, I think. There are a lot of services in here and they all have their own purpose, so they're single purpose. In that aspect this is a microservice environment. However, some people would say that the data has to be isolated per microservice. So if you look at the Redis: the Redis is being communicated to by three different services, so some people would say that violates a microservice architecture. So perhaps not for everyone, but I would consider this microservices. OK.
C: So we are seeing both, basically. Some customers are doing completely new environments, very greenfield technology, using microservices to get things going, and they use OpenShift for that, to scale it easily and so on. Others are coming from more legacy already, and they are putting their monolithic applications into OpenShift, and then they try to split off parts. They say: OK, this component looks very isolated, so we'll create a separate container for that component, and they will split it off.

On the business side, most people are attracted to this because if you split everything into small components and the components are more isolated, you can iterate faster on those components. So you can make sure that if you have a new feature that you want to add to one component, you can add that fast, without affecting the whole system.

So you don't have to build the whole system again; it's a lot faster to get this into production. And from a technical perspective they're really interesting, because now these smaller services can be distributed across multiple nodes. You can scale them really easily, and it's more resilient against failures by that, because you have multiple instances running on multiple nodes.
C: So let me go to a CoScale dashboard. On this dashboard we can see the free space on one of the nodes in the cluster. I will mention this first: I have a cluster of ten nodes. I have two infra nodes, I have three masters and I have five nodes. Here we can see the containers that are running. For my "words" namespace: I selected this namespace, and we can see the containers running for that namespace.

Now, if you look at this graph here, you can see the free disk space on one of the nodes. You can see the disk was actually almost full, and then somebody started a process that started filling up the disk, filling up the disk, and at this point we notice that there is an event. We can see here that there is an event: the status of node 1 changed from disk space sufficient to node out of disk.

This means that OpenShift will not schedule new pods on this node. However, what we also see here is that for node 1 there are still containers running. So it's not because it went out of disk that OpenShift says: I have to remove all the pods from this node. These are the types of things that you really want to get visibility into, right?
C: Yeah. So that's why you need in-container visibility. You have to have a look at the services that are running inside your containers, whether these are performing as you expected, because when the disk is running full, it might have an impact on these services, and you want to be aware of that. You want to make sure that it's not only when things crash that you're notified; you want to know in advance. Okay.
C: If I look at the deployment, then, if I'm fast enough, we will see that the container is creating at that moment, and now it's started. So we actually went from four instances of nginx to five with one click. Of course, you can also do this through the CLI, so that you can automate this. Okay.
C
It's
so
that's
a
very
important
question.
I
think
your
monitoring
has
to
be
aware
of
these
things,
so
you
have
to
make
sure
that
you're
monitoring
to
knows
what
is
going
on
right.
So
if
we
have
a
look
here,
we
can
see
for
the
nginx
I
preview
previously
scaled
it
up
from
one
container
to
three
containers,
and
that
is
that
is
what
we
can
see
right
here.
The
yellow
line
shows
that
there
was
one
container
running
at
this
point
in
time
and
then
at
3
o
clock
in
the
afternoon.
C
I
scaled
it
up
to
three
containers,
three
pots,
but
actually
what
happens
you
can
see
here?
So
you
can
see
this
container
kept
on
running.
So
the
green
area
indicates
where
the
container
was
running
at
3
o
clock.
Another
one
was
started.
Actually
two
containers
were
started,
the
second
one
exit
again
and
open
shift
scheduled
another
one
for
me.
So
we
killed
this
one
and
Oh
chief
that
ok,
you
asked
for
three,
so
we
will
schedule
another
one
for
you.
You
can
really
easily
see
in
this
graph
what
is
going
on
when
our
container
started?
C: Okay, so I think if we go to a dashboard, we can actually see this. So this is my nginx, and I can see here, for the whole service, that the CPU load dropped. The average CPU load for the whole service dropped here. So if we open this up, then we can see: OK, for the "words" namespace there is a replica set nginx. If I open this, I can see all the pods for that namespace, for that replica set. Excuse me. So right here I can see that we had one container running.

We can, of course, also have a look at this at the service level. So if I go back, if I click forward now on nginx, here I can actually see some more in-depth nginx metrics. I can see the number of requests that are coming in. If I click on the number of requests, then I can see that there is also a drop in the average number of requests being served.

That's strange, but it's very logical, right? We can again see multiple containers joined and they took over the requests. If I now stack this graph, I can actually see that this is very normal behavior: I had about two requests per second before we scaled up the replica set, and we can see that there are now three containers and they are all serving equal traffic for that service.
B: That really makes sense, of course. Maybe a question: you gave us an example of a node actually failing, but I can imagine a lot of scenarios where you want to do this. I don't know, you need to maintain a machine, or you're seeing hardware errors on a specific one and you think it's better to take it offline.
C: So, for example, when you require maintenance on your machines. This can happen, right: you have a security update that has to happen to the underlying operating system, and for that security update you have to reboot. So at that point you want to evacuate that node. You want to drain the node; you want to say to OpenShift: drain a node. It will evacuate all the pods and it will reschedule them on different nodes. I don't know whether you can do this in the UI.

That would be good information, but I know you can do it from the command line. So there it is: oc adm, the OpenShift administrator tool, and there you have options, and one of the options is drain. So here we can see "drain node in preparation for maintenance", so we can just say oc adm drain and then my node 2. If I do this, yeah, I will get some warnings; I want to ignore the daemon sets, so let's do that.
C: Actually, I created a graph so we can see the number of successful requests to the receiver. I know that the orange one is the one that is running on node 2, because I looked it up before, and I know there's only one of those running, and it's this one; it's running on that node. If we now go forward in time just a bit, then we'll see that at 14:50 node 2 started draining. So this is at this point, and then some strange stuff starts happening right here.

We can see that the number of successful requests is very low in this area, and that is because there is no connection to Redis. There is no Redis at that point, and things start failing. However, since OpenShift manages to get the pod up and running again, we can see here that requests are restored, and this yellow container is a new container that is scheduled on a different node. We can see that node here.
C: In your client code you want to make sure that you connect to another node in the cluster and try the request again, so these retry kinds of mechanisms are also really important. And the cool thing with these tools is that you can actually see this behavior really easily: you can actually see what the impact of a node drain is, what the impact of a disk running full is, and so on, on my application level. Yeah.
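A minimal sketch of such a client-side retry mechanism, trying the next node in the cluster when a connection fails during a drain. The function names are illustrative, not from the demo.

```python
def request_with_retry(nodes, send, attempts_per_node=1):
    """Try each node in turn until one request succeeds.

    `send` is a callable taking a node name and returning a response;
    it raises ConnectionError when the node is unreachable, e.g. while
    it is being drained. The last error is re-raised if every node fails.
    """
    last_error = None
    for node in nodes:
        for _ in range(attempts_per_node):
            try:
                return send(node)
            except ConnectionError as err:
                last_error = err  # node unreachable, move on to the next
    raise last_error
```

Most cluster client libraries (the Redis and RabbitMQ clients included) offer this behavior through configuration, so in practice you would enable their built-in retry rather than hand-rolling it.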
C
However,
there
are
other
things
like
this
throughput
Network
throughput
that
are
not
on
which
you
cannot
set
quotas,
so
this
is
a
limitation
of
the
Linux
kernel
which
in
which
it
is
not
possible
to
to
to
do
these
quotas
today,
so
home
just
can
also
not
do
it.
So
this
means,
if
you
have
one
container,
that
is
a
very
disk
intensive,
so
it
writes
a
lot
of
stuff
to
the
disk.
It
can
actually
consume
all
of
the
bandwidth
to
the
disk
and
another
container.
C
On
the
same
note,
if
it
also
requires
band
bandwidth
to
the
disk,
can
experience
problems
from
that,
so
it
can
be
starting
on
on
through
boots,
so
one
container
can
affect
another
container,
and
these
are
the
kind
of
things
that
you
want
to
see.
So
you
want
to
know
for
all
of
your
containers.
Okay,
what's
if
you
are
using
watch
what
memory
are
they
using
against
their
Koda?
But
you
also
want
to
know:
what's
the
network
traffic
what's
disk
throughput,
you
want
to
keep
an
eye
on
this.
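Disk and network throughput are usually exposed as cumulative byte counters, so a monitoring agent derives per-container throughput from the deltas between consecutive samples. A minimal sketch of that calculation, not CoScale's actual implementation:

```python
def throughput(samples, interval_seconds):
    """Convert cumulative byte counters into per-second rates.

    `samples` are successive readings of a counter such as bytes
    written to disk; each consecutive pair yields one rate for the
    sampling interval between them.
    """
    rates = []
    for prev, cur in zip(samples, samples[1:]):
        rates.append((cur - prev) / interval_seconds)
    return rates
```

Plotting these rates per container over time is what reveals a noisy neighbor: one container's throughput spikes while another's collapses on the same node.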
C
It's
very
important
that
you
do
this,
that
you
have
a
historical
view
of
this,
so
that
you
can
see
which
containers
should
can
be
scheduled
together
and
which
containers
you
should
keep
on
different
notes.
Actually,
so
you
can
use
mechanisms
like
note,
affinity
and
note
labels
to
make
sure
that
heavy
containers
are
not
scheduled
with
other
containers,
but
you
need
data
to
come
to
those
conclusions.
Yeah
make
sense.
C: That's a good question. Most of the companies we talk to start with an internal application. They have this internal application that they use to test this, where they start with: OK, there is no real end user impacted if we do this. They try it out with that, but now we are seeing the push for more, and once they gain experience with that, they start going to more customer-facing applications, and we see a lot of customer-facing stuff starting to happen right now.
B
Very
exciting,
to
see
all
this
happening.
Of
course
now,
maybe
coming
back
to
open
ship
I
can't
imagine
that
there's
some
scenarios
where
a
container
is
misbehaving
or
or
doing
something
it
it
shouldn't,
but
that
from
opens
shifts
point
of
view.
This
isn't
really
clear.
So,
for
example,
yeah
I
can
open
shift,
handle
every
type
of
container
issuer
to
container
crash.
So.
C: When a container crashes, OpenShift will help you, right: it will reschedule it and make sure that the pod gets back up and running. There are, however, a lot of situations where the container is not crashing and health checks appear to be healthy, so OpenShift thinks: OK, this container is doing what it's supposed to do. But actually, you have to look at other metrics inside of the container, more performance metrics: what are the latencies of the requests that are being done, and so on?
C: So this is the RabbitMQ still running in our test application, and we can see here some very global metrics, very high level: the number of channels, the number of connections, consumers, exchanges, queues and so on; message rates, how many messages are coming in at a moment; what the memory is looking like. And we can see here that there's a strange trend going on, right? At this point everything is fine; there are not a lot of messages in the queue.

This queue is used like a job queue: you put something on the queue, and then somebody else will pick it up and process that data. But at this point we can see that something is going on. Work starts piling up, or messages start piling up: you can see the request rate goes up, and these messages are not being handled. This causes the memory to go up, so this container starts consuming a lot more memory.
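The "messages piling up" pattern Frederick points out can be caught programmatically by checking whether queue depth keeps growing across samples. A minimal sketch; the window size is an arbitrary choice for illustration:

```python
def backlog_growing(queue_depths, window=3):
    """Flag a queue whose depth has grown monotonically over the last
    `window` samples, i.e. producers are outpacing consumers."""
    if len(queue_depths) < window:
        return False
    tail = queue_depths[-window:]
    return all(b > a for a, b in zip(tail, tail[1:]))
```

A real alert would also consider the absolute depth and the consume rate, since a healthy queue can grow briefly during a traffic burst.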
C: It keeps on going up, and at some point it will hit a limit and it will crash or restart or so on. At that point, of course, you will lose that data. In our case the RabbitMQ is not persistent, so if it restarts, then we will lose that data. If I click here, then I can get some more detailed information, and I can actually see that there are two queues.

So here we have the queue called "junk" that contains a lot of messages, and then the "messages" queue that is actually used by the application, which is being cleared often: the work is picked up, and that goes OK. So you can actually see here how to debug this. You can see: OK, there's strange behavior going on, a queue is filling up, and we can get down to the queue level: OK, it's this queue that's filling up, and then we can mitigate that.
B: So these are large environments. I think, when talking to customers, they start small and they start scaling up after some time, after they have their tests with the system. So I can imagine that you maybe start with 4 containers, but after a while you have 20 containers running the same application, and it becomes, I think, very difficult to do something like this, right? How do you monitor 20 different RabbitMQs and pick up when they're not behaving as they're supposed to? Yeah.
C: Definitely. So you want to have good dashboards that provide you good visibility, but it's not possible to look at these dashboards all of the time, and having somebody dedicated going through all the dashboards all of the time, that's not something we are interested in. So we have an anomaly detection mechanism that will alert you when there are big changes in your system, things that you should actually look at.
C: We can see that one of the containers is experiencing strange behavior: it's going to 100% CPU. Perhaps it's in some kind of a loop it can't get out of, so it starts consuming a lot of CPU. CoScale will look at all of the containers in a certain service. It will see that, OK, these things normally look the same; they have the same type of behavior, as we can see here: it's very regular, the same kind of behavior.

If something pops out of that behavior, then we'll alert users. So for this one, we can detect it within one minute; we can say: OK, there's something strange going on, especially because it's a very large anomaly. You can see it highlighted here in pink; that's an automatically detected anomaly from CoScale. Okay.
C
Exactly
so,
we
have
four
metrics
for
both
the
operating
system,
the
orchestrator,
the
containers
and,
of
course,
the
applications
inside
the
containers
and
for
all
of
those
metrics.
We
make
models
and
we
make
sure
that
the
models
are
calibrated
by
the
data
that's
coming
in
and
if
new
data
comes
in
then,
if
it
isn't
normally,
you
will
get
alerts
for
that.
Okay,.
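CoScale's anomaly detection is proprietary, but the idea of calibrating a model on incoming data and alerting on points that fall outside normal behavior can be illustrated with a simple z-score check. This is a stand-in for illustration, not the actual algorithm:

```python
def is_anomaly(history, value, threshold=3.0):
    """Flag `value` if it deviates more than `threshold` standard
    deviations from the mean of the calibration `history`."""
    from statistics import mean, pstdev

    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is anomalous.
        return value != mu
    return abs(value - mu) / sigma > threshold
```

In practice the model is recalibrated continuously as new data arrives, and seasonal patterns (daily traffic cycles, for example) have to be modeled too, or every morning's ramp-up would look like an anomaly.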
B: And I can also receive emails for this then? Yes? Okay, pretty cool. Yeah, I think all the data we've been showing has been coming from CoScale, which is pretty clear, I think. Now, I think the people on the webinar will probably be interested in: okay, how do I install this, how do I monitor my own OpenShift environments, how long will it take? Yeah.
C: So if we go to data sources, we can see we have an agent, and at this point the agent is installed on all of the 10 nodes in the cluster. Let's say I was starting and I wanted to recreate this thing; then what do I do? I create a new CoScale agent. In this case I will deploy it as a container; I will deploy it on OpenShift. I will talk about how to monitor the images, the containers inside your environment, later on, so we'll skip that step.

I can give it a name. Then we get install instructions. So let me first scroll down a little bit. Here we can see the step where the CoScale agent is actually deployed, and we can see that we are using a daemon set. A daemon set is a mechanism in OpenShift that deploys a certain container on all of the nodes in your cluster. This way the CoScale agent container is running on all of the nodes in your cluster.

You can see here that we are mounting the Docker socket and some other stuff, to make sure that we can get metrics from Docker and that we can get metrics from the underlying operating system. Another thing to notice here is that we are using privileged mode. In order to read metrics from the underlying operating system, we have to have privileged mode, and in OpenShift you have to do some stuff to make sure that privileged mode works. To make it easy for our customers, we actually include the instructions to do that as well here.

So in the first step we set up a security context constraint, which allows us to run a privileged container. You can also see the other things that we are using; you can review that and give this to your security team to see whether they like this. Then here we create a new CoScale project on your cluster, we create a service account for it, we add the security context constraint to that service account, and then we deploy the daemon set. So the installation is actually as simple as copying this and pasting it into your terminal. We can just do this right here. Of course, this was already done, but you see how that works, and that's the whole installation.
C: When you do the installation with the daemon set, you will see the resources from the operating system, the metrics from Docker and from OpenShift. There's a bit more configuration required to get the other metrics; we will talk about that right away, but I can show you what is already in the OpenShift dashboard. A lot of the widgets that I have been showing are present on these default dashboards.

So if you install it, you get these dashboards immediately. You can see how many containers are running, how many nodes I have, how many replication controllers, services and so on. I can see the events that are happening in my cluster, and more information about builds, deployments and so on. The cool thing is that this dashboard gives me a high-level view, right?

It's very high level, but I can click through on this. I can click through on the nodes and get this kind of view, same thing as we saw before. We can click on one of the containers here and go into that dashboard. We can see that the container was running here; I can zoom into that; I can see the events for that container. I can also click through to other technologies.
B: I guess, when talking to customers, you put a lot of the knowledge you've built into the dashboards again? Yes, exactly. Okay, now I noticed, and maybe some other people also noticed, these dropdowns on top. You have, like, replication controller and namespace in this case. Can you maybe explain a little bit what that is, or what that does?
C: Yeah, sure. Let's go back to the dashboard first. For our nodes, we can see that here I have selected the "words" namespace, so I can see all of the containers that are running for that namespace. If I now click a different namespace, for example "coscale", in which the CoScale agent is running, I can see those. Okay. After that, we can also filter, so we can find a certain service. I can look for my Redis, see where that is running, and so on. Yeah, it's pretty cool. There's, however, more.

So if we have a look at this dashboard: this is a service metric dashboard. This is also a template that you get out of the box. You can see the average CPU, memory, network traffic and disk I/O for all of your services, which might not be that useful, but you can open this up and actually drill down. So you can see all of your services, but I can also go to Kubernetes, for example, and there I can see there are nodes.

There are master nodes and regular nodes, so I can select one of the masters, or I can select one of the other nodes. If I split this up for all nodes, I can see per node what the behavior is. So if I click on this one, I will see which node that is; I can then pick out that node to inspect it further. There are two other dimensions here: we can also see that there is a disk dimension and an interface dimension. The interface is the network interface.

So if I'm interested in, for example, the network traffic that is going between the nodes, or publicly, I can click on that interface and see data for that; same thing for the other interfaces, of course. So this allows you again to start from a very high level and then drill down on certain aspects.
B
Ok,
cool
yeah,
I
think
or
maybe
a
quick
question
on
this
before
I
forget
is
so
this
system
really
means
I
can
create
a
dashboard,
for
example,
form
application
and
if
I
have
a
development,
namespace
is
staging
namespace
in
the
production.
Namespace
I
can
quickly
compare
the
performance
between
the
tree,
so
I
don't
need
to
create
3d
dashboards
to
see
the
same
information
pretty
much.
C: Okay, so you can have a look at that. By default there are some alerts that have been set by CoScale: you have the average load of the CPU, the free disk space and so on, and memory. If I click on this one, let's say the free disk space, I can edit the event, and it's very readable: I can say, if free disk space in percent is less than 10% for five minutes for these servers.
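The "free disk space less than 10% for five minutes" rule can be expressed as a check over the most recent samples. A minimal sketch, assuming one sample per minute; the parameter defaults mirror the rule shown in the UI:

```python
def low_disk_alert(samples, threshold_pct=10.0, duration_s=300, interval_s=60):
    """Fire when free disk space stays below the threshold for the
    whole duration.

    `samples` are the most recent free-space percentages, one per
    `interval_s`. Requiring every sample in the window to be low, not
    just the latest one, keeps a brief dip from paging anyone.
    """
    needed = duration_s // interval_s
    if len(samples) < needed:
        return False
    return all(s < threshold_pct for s in samples[-needed:])
```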
C: In this case you know which servers those are, for five minutes, and I can then set it on a certain container. So I can do it on the image: I can say all the containers that are running the nginx image, I want to do it for those. But I can also do it on a more granular level: I can do it for a certain replica set only, or I can do it for the whole deployment, for example, or for services, and so on. So it's very modular.

You can do it for one namespace or for all namespaces; it's very easy to create alerts for a very specific thing. You can also see that I'm not selecting containers. If you drill down, you will see actual containers, but that's not very relevant, because containers come and go a lot, so you want to do it at more of an aggregate level.
C: How would I do this? Interesting that you ask this; I have a very good example of it. So here we have "high memory on the calculate-letters deployment". As I showed you before, this one is: if Docker memory usage in bytes is greater than 200 megabytes for everything that is in the calculate-letters deployment, then I want to trigger an alert. We can see here that it actually goes up a lot more than 200 megabytes, and for that I will set a certain rule. The rule here is that a webhook is being executed.

We can read that right here: the action is either "triggered", "acknowledged" or "resolved", so the webhook will be sent for an alert that's being triggered right now or an alert that is being resolved right now, and the server field contains the pod that is affected by this alert. So this information will be sent to that URL. So let's have a look at our OpenShift dashboard.
C: We can see that this service, the webhook service, is also running in our OpenShift deployment. That's this one right here. I can show you the code; it's a really simple Python program. What it does: it exposes a debug route for the heap dump, and it checks the token. This is a very basic form of security, right; it should be over HTTPS and it should check the IP range of CoScale and so on, so don't use it in production.
C
But
it's
a
very
simple
example
with
a
very
simple
security
mechanism,
so
it
checks
the
token
first.
Then
it
checks
whether
the
eight
where
the
server
is
present
and
the
action
is
present
present
if
the
action
is
triggered.
So
if
the
alert
is
created
right
now,
then
we
will
extract
the
pot
in
from
the
server.
So
we
can
do
that
with
a
regular
expression.
We
can
get
the
pot
name
and
then
we
do
a
heap
dump,
so
we
take
a
heap
dump
for
that
pot.
So
what
is
happening
in
our
system?
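The webhook logic just described, verify the token, act only on freshly triggered alerts, and pull the pod name out of the server field with a regular expression, can be sketched like this. The payload shape and the pod-name pattern are assumptions for illustration; CoScale's real webhook format may differ.

```python
import re

def handle_alert(payload, token, expected_token):
    """Decide whether an incoming alert webhook should trigger a heap dump.

    Returns the pod name to dump, or None when no action is needed.
    Raises PermissionError on a bad token (the demo's basic security).
    """
    if token != expected_token:
        raise PermissionError("bad token")
    if "server" not in payload or "action" not in payload:
        return None
    if payload["action"] != "triggered":
        return None  # ignore "acknowledged" and "resolved" notifications
    # Hypothetical server-field format: extract the pod name after "pod/".
    match = re.search(r"pod/([\w.-]+)", payload["server"])
    return match.group(1) if match else None
```

The returned pod name is what the demo then passes to its take-heap-dump step.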
C: We see that the memory usage is growing for a certain container. At some point the alert gets triggered, and we say: OK, trigger this webhook that will take a heap dump, and that heap dump can then later on be analyzed to see which objects are consuming the most space, and you can actually optimize your service with that.

So, if you look at the take-heap-dump method, the thing that is being executed: we take the dump, we upload the dump and then we do a cleanup. Taking a dump is done using jmap; this is a Java utility for creating a heap dump from your JVM. We have to fill in the Java process ID, so we use this command right here for that. We use curl to upload it to a certain FTP server, and then we remove the heap dump from the container. Yeah.
C
This
command
is
being
executed
using
cube
CTL,
so
we
do
cube
CTL
exact
in
the
pot
that
was
provided
by
the
alert.
The
alert
provides
the
container
that
is
having
that
that
problem.
So
we
use
that
here
to
execute
a
certain
command
inside
of
that
container,
to
get
the
heap
dump
and
put
it
on
to
an
FTP
server.
Does
that
make
sense
I,
don't.
B: I don't know that much about Java, but I guess people that work with Java every day should be pretty excited about this. Okay, cool. So you can take actions now. I remember you showing us a little bit of application data. Probably the people on the line will be pretty interested in: okay, how do you now get that in-container data into CoScale? So how do you monitor the applications in the container?
C: Okay, so let me go back to our agent page. The step I just skipped was this step, where you can define the images. So right here I have defined that I want to monitor the Redis image, with a tag, with a Redis plugin. If I click Edit, I can see how the Redis plugin is configured, and it is using a certain connection, localhost and a port, and it's doing an active check. So this is how I configured my Redis monitoring for this image. Maybe go back.
C: Actually, you can provide the environment variables like this. You can just put in the environment variable; we will detect that you provided an environment variable here, we will see that the containers that are running have that environment variable, and we'll fill it in at runtime. So there is no need to set a fixed password and username on your containers. You can still do that using the native mechanisms, using the environment variables, using the config maps, and then here you can just use the environment variables. Okay.
C: What is the trick that we are using there? We can see here that there is another button, "generate Docker labels", where you can actually configure a plugin. So let's do it: same thing as before, for Redis, let's configure the Redis plugin with just the basic defaults, and then I will get a label.

This is a Docker label, so this is a label that I can put into my Dockerfile, and whenever a container is started and the image has a certain label, CoScale will pick that up and will start monitoring as defined by this label. So this means that, as a developer, you can set how your container should be monitored.

So, for example, if you change something in your container, if you for example add a metric to your JMX metrics, you can just add it here, add the label to your Docker container, and the metric will be automatically picked up when the container is started. So there is no need to run to operations to ask them to add this metric for me, for this container; you can do it yourself. You can add the label on your container, and then things will be started automatically.
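An agent watching container starts could implement this label-driven configuration roughly as follows. The label key and the JSON payload shape here are assumptions for illustration; the actual label text is generated by the CoScale UI.

```python
import json

def plugin_config_from_labels(labels, key="com.coscale.monitoring"):
    """Read a monitoring-plugin configuration embedded in a Docker label.

    `labels` is the container's label map as reported by the Docker
    API. Returns the parsed plugin configuration, or None when the
    image carries no monitoring label (i.e. nothing to start).
    """
    raw = labels.get(key)
    if raw is None:
        return None  # image did not opt in to in-container monitoring
    return json.loads(raw)
```

Because the configuration travels with the image, the same label works unchanged on every cluster the image is deployed to, which is exactly the dev-to-production portability Frederick describes next.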
C
C
C
B
C
Actually a good question. The way CoScale starts the plugins is that the agent that is running on all of the nodes will start a plugin for the containers inside the namespace of those containers. That means that these plugins can see everything that is local to the container, so they can use localhost:8004 as seen from within the container, and you actually don't have to expose that port at all. Even so, this is a port that only exports the status interface.
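Because the plugin runs inside the container's network namespace, it can reach a status port on localhost even though that port is never exposed outside. A small sketch of such a reachability check, using an ephemeral local port to stand in for the container-internal status endpoint:

```python
import socket

def port_open(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo: listen on a localhost-only ephemeral port, like a status endpoint
# that is never exposed outside the container, and check reachability.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
print(port_open("127.0.0.1", port))  # prints True: reachable from "inside"
server.close()
```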
C
We also support that you can put /dev/stdout here, but if you have a container that does logging inside of the container, we can manage that as well: we can actually get to the file inside of the container, no reason to mount it anywhere or so on. So it's all very transparent; you can reason from the point of view of your container. Okay.
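One common Linux mechanism for reaching a file that only exists inside a container from an agent on the host is the container's root filesystem as exposed under /proc. This is a general technique, not necessarily exactly what CoScale does, and the pid and path below are placeholders:

```python
def host_path_for(container_pid, path_in_container):
    # /proc/<pid>/root is the container's root filesystem as seen from the
    # host, so an agent can tail a log file without any volume mounts.
    return f"/proc/{container_pid}/root{path_in_container}"

print(host_path_for(4242, "/var/log/app.log"))
# prints /proc/4242/root/var/log/app.log
```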
B
C
You just add the label on the container, and if the container is running on the first environment, it will be picked up there and monitoring will start; same thing for all the other environments. So there's no change between the different environments, which means you can actually test your monitoring on your staging environment, see that everything works perfectly there, then move it into production, and you will know that the configuration of your monitoring will be exactly the same. Cool.
B
C
So we pride ourselves on being a very lightweight monitoring solution. We make sure that everything runs very efficiently and that we don't put a lot of burden on your containers and on your hosts. This is really important because we're seeing a lot of containers running right now. It's not like before, where you had one process per machine and could afford 10% overhead because there was only one process.
C
B
A
C
A
C
I think that's a difficult one. We can run in some kind of degraded mode: if you don't have privileged mode, for example, the agent will still work and the plugins will still work, but some things that are shielded, for example I think the disk metrics are shielded in the proc filesystem, we cannot get without privileged mode. So the plugins will work and you will still gather data, but those specific metrics won't be available.
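A sketch of that graceful degradation: when a shielded /proc entry cannot be read without privileged mode, the collector falls back to reporting the metric as unavailable instead of failing. The exact entries involved are an assumption; /proc/diskstats is used here as the example the speaker mentions.

```python
def read_disk_metrics(path="/proc/diskstats"):
    # Some /proc entries are shielded without privileged mode; instead of
    # failing, fall back to a degraded mode where these metrics are absent.
    try:
        with open(path) as f:
            return f.read().splitlines()
    except OSError:
        return None  # degraded mode: disk metrics unavailable

print(read_disk_metrics("/proc/does-not-exist/diskstats"))  # prints None
```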
A
C
So in that case, the command that you get for installing has to change a little bit. Right here it tells you to use privileged mode; if you just leave that out, you will be able to deploy it in an environment where you don't have privileged mode, and you will get the degraded scenario.
C
A
The reason I'm curious about this is that I get a lot of people asking about monitoring solutions to use with OpenShift Dedicated or other hosted OpenShift deployments. So I think I hear you saying that we could use CoScale even if we were using someone else's hosted environment. Is that true?
A
That'd be very handy to have; a lot of people are asking for different monitoring tools, not just for Dedicated and Online. My experience with similar services is pretty limited to using them, like New Relic, to hack on and debug my own application. It's not at that operations level, the Kubernetes level, where you really shine, and it's pretty stunning how deep and operations-focused this is, but I think it also works for those of us who are writing applications.
A
I think we've asked every question here. Can you put your final slide up with how to contact you guys? That way, if anyone has any further questions, or anyone watching this video later has a question, that would be a great place to reach out to them.
C
Yes, definitely. The monitoring starts at the moment the service is started up. So if your services are very short-lived, we will actually start monitoring them at the moment they are started, and when they stop or die, the monitoring will be stopped for them. We also take a lot of scaling considerations into account for this: if you have a lot of these jobs, we make sure that it's still possible to visualize them and to see them over time, because you can get a lot of containers in that case.
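The lifecycle-driven monitoring just described can be sketched as follows; the event shape is an assumption modeled loosely on container runtime events, not CoScale's actual internals:

```python
# Monitoring follows the container lifecycle: start on a "start" event,
# stop on "die"/"stop", so even short-lived jobs are captured.
monitored = set()

def handle_event(event):
    cid, action = event["id"], event["action"]
    if action == "start":
        monitored.add(cid)
    elif action in ("die", "stop"):
        monitored.discard(cid)

for e in [{"id": "job-1", "action": "start"},
          {"id": "job-2", "action": "start"},
          {"id": "job-1", "action": "die"}]:
    handle_event(e)
print(sorted(monitored))  # prints ['job-2']
```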
A
D
Yeah, my question is more specific to the services that we are going to host on containers. With the new cloud-native architecture and the various microservices that we are going to build on these containers, there is a possibility, in most cases, that some of the microservices or applications that we developed may not be in use.
D
C
Yeah, definitely, it's a very good question. It comes back a bit to our word count application. If you have a look, it's difficult to do this at the OpenShift level, because judging from CPU usage, memory usage and so on, it's difficult to see whether a container is active or not. However, because we do in-container monitoring, we can see, for example, whether there are requests coming into these containers. So if you have microservices and you see that there are no requests for a certain period of time...
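The idle-detection idea can be sketched in a few lines: because requests per container are monitored, services with zero requests over the observation window can be flagged as candidates for retirement. The service names and counts below are made up:

```python
def idle_services(request_counts):
    """Return the names of services that received no requests in the window."""
    return sorted(name for name, count in request_counts.items() if count == 0)

counts = {"wordcount-api": 1520, "legacy-export": 0, "receiver": 87}
print(idle_services(counts))  # prints ['legacy-export']
```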
C
D
A
That would be great, awesome. All right, well, we won't detain you guys too much longer, because we did fill up that entire hour, but it was a spectacular show, and I'm so pleased that you didn't use slides through the entire thing. So thank you very much, because it was really useful information, and I know you guys have trial capabilities too. So folks on the call, if you want, give it a trial and check it out.
A
This is really a great service, and hopefully we can use it to gain some insights into our OpenShift deployments. So thanks Samuel and thanks Fred. For anyone who'd like to reach out, this podcast will be up online at the blog.openshift.com site shortly, and we'll also put it up on our YouTube channel. So thanks again guys, and have a great evening over there. Thank you.