Description
Cloud Tech Thursday explores the full modern open source cloud stack, from hardware to serverless. Learn about new ideas, projects, and releases around Kubernetes, OpenStack, hybrid cloud enablement, and many other topics.
Twitch: https://red.ht/twitch
B: Good morning, good afternoon, good evening. Welcome to another episode of Cloud Tech Thursdays. I am Chris Short, producer, host, and showrunner extraordinaire for this thing we call Red Hat live streaming. I'm joined today by Amy, Josh, and a special guest, Leif Madsen, to talk about STF and how to monitor your OpenStack cluster. Amy, why don't you introduce yourself, talk a little bit more about the topic, and hand it off to the others, that kind of thing.
C: All righty. Hi, my name is Amy Marrich. I am the OpenStack community person here at Red Hat, and we're joined today, as already mentioned, by Josh Berkus, who is the Kubernetes community person. Our guest today is Leif Madsen, who works on STF, which is also known as InfraWatch upstream. Leif, do you want to go ahead and introduce yourself and the project?
D: Sure. My name's Leif Madsen, I'm the cloud service telemetry architect at Red Hat, and Service Telemetry Framework is a project that I've been working on for three-plus years. The idea is that we install a set of microservice applications on top of OpenShift and we monitor our infrastructure as a service, our OpenStack. So today I'll just be going through the architecture of STF, some links to where the open source project components are available for anyone to make use of, and then I'll go through some dashboards and some live environment stuff, and I'm happy to answer any questions.
D: Ta-da! All right, so I'll just place the links at the beginning here, and I can get those over to the show hosts afterwards as well. Basically, github.com/infrawatch is the upstream location for all the source code that I'm going to be going through today, from our overview, and the rendered documentation, which is also written in the upstream source, is at infrawatch.github.io for the documentation.
D: So just a quick overview: Service Telemetry Framework basically receives monitoring data from OpenStack or third-party nodes and is a central location for storage, viewing of dashboards, and alerting for your system. What we make use of is collectd for, sorry, not for storing, for collecting the metrics and events for the infrastructure components.
D: The OpenStack-aware metrics and events come from Ceilometer. We also support multiple clouds going into the same monitoring infrastructure.
D: We also provide availability monitoring, such as container statistics for CPU and memory, and API health checks for the various OpenStack API interfaces, so Glance, Neutron, things like that. We've integrated Ceph metrics with the collectd ceph plugin, so if you happen to be running Ceph within your OpenStack infrastructure, we can also collect that information using the ceph plugin. We can send SNMP traps using Alertmanager.
D: We make use of the Prometheus SNMP webhook implementation for that. We make use of various storage backends that are provided by OpenShift, and so we've tested with OCS and things like that, just to make sure that's all working. Our visualization is done with Grafana, and all of this is operator-driven using the Operator Lifecycle Manager within OpenShift.
D: So this is the high-level overview of the architecture. On the left here we have our actual OpenStack deployment, and we make use of collectd and Ceilometer for collecting the events and the metrics. We then make use of AMQ Interconnect, which is also Apache Qpid Dispatch Router, and that's an AMQP 1.0 protocol-based message bus. We use that without brokers or anything, in order to just stream the telemetry and the event information.
D: So that's our transport protocol coming across the back end. We then make use of the Smart Gateway, which is actually made of two components, the sg-core and the sg-bridge, which I can get into later. But basically that's our middleware that sits in between the data storage and the transport layers. It takes the information from the bus; for metrics, it provides a scrape endpoint for Prometheus to scrape so it can collect the metrics for all your nodes, and likewise for events.
E: Right, we're all curious here. So collectd has been around for a while, and, particularly based on my experience with it, although it may have evolved since then, it precedes having any sort of standards around what its messages look like. Why collectd? How has it been useful as sort of the central collection point?
D: Yeah, mostly collectd because the intent, when the framework was originally being developed, was to make use of something that was, one, small and fast.
D: So it didn't have a lot of overhead, and, two, to be able to run it on things like other network devices. collectd can actually be compiled and run on some routers, some switches, some, you know, infrastructure components, so that was the main reason. It also just had a lot of plugins that were good for infrastructure monitoring.
D: So it had a lot of the information that we required, and the messaging thing hasn't really been too much of a problem for us. When it comes in, we're basically collecting that data with our Smart Gateway anyway, and we're exposing that data via the plugin instance, by the type instance, and there are various labels and fields that we make use of, and that's actually been quite consistent across the plugins.
D: So we haven't really had any issues with having to do any crazy manipulation or anything like that. The other thing was the events: we make use of the VES format, and the VES format is like an encapsulated VNF event standard, or something like that.
D: I can't remember what VES stands for, unfortunately, but we're making use of that for our events, so it has another encapsulation layer on it as well, so that when the events come in from collectd they're also somewhat standardized.
E: You didn't consider trying to push this upstream, so that the collector was actually using standard formats?
D: So everything we're doing is from collectd upstream. So all the VES formats and everything.
All
right
so
so
we'll
just
kind
of
go
through.
I
only
have
a
handful
of
slides
and
then
we'll
just
get
into
kind
of
the
sun's
alive
stuff,
but
the
intent
here
is
to
provide
kind
of
a
near
real
time,
event
performance
system,
and
so
we
can
collect.
You
know
various
pieces
of
information
from
various
events
and
telemetry
systems
we're
making
use
of
collect
the
installometer
currently,
which
is
our
collection
layer,
the
distribution
or
the
transport
layer
is
making
use
of
amq.
So
that's
the
amqp1
protocol
inside
of
openstack.
D: The message bus in use there is RabbitMQ, and that's the AMQP 0.9 protocol format, so they're actually different formats but a similar idea: you transport the information across the bus and get that into the central location. The big thing that provides us is a push gateway model for telemetry for Prometheus. So instead of going and scraping all 500 or 600 nodes of your system, you actually just end up scraping the one Smart Gateway.
D: So all of the data is collected and sent across the bus, and then that's exposed as a single scrape endpoint on effectively the local network for Prometheus. We get everything collected and sent to the central location, and then we just have a single scrape endpoint, so there's no real need to do discovery or anything like that.
D: You just basically start sending the data over and it becomes exposed for Prometheus to scrape. Events are obviously a push model anyway, so we collect those and send them across the bus, just so we have a single transport, and then we write them into our event storage backend, which is Elasticsearch.
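The single-endpoint model described here means the Prometheus side needs only one static target instead of discovering hundreds of nodes. A rough sketch of what that scrape config could look like (the job name, service address, and port are hypothetical illustrations, not taken from STF's actual generated configuration):

```yaml
# Hypothetical Prometheus scrape config illustrating the push-gateway-style
# model: one Smart Gateway scrape endpoint instead of 500+ node targets.
scrape_configs:
  - job_name: smart-gateway
    scrape_interval: 10s          # matches the 10-second interval mentioned later
    static_configs:
      - targets:
          # the single scrape endpoint exposed by the Smart Gateway service
          - cloud1-metrics-smartgateway.service-telemetry.svc:8081
```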
D: So this is a little bit more of a blown-up view, the same kind of idea. Here we have various collectd plugins that we can make use of. A lot of the reason for collectd is that it also has a lot of NFV-specific things, so for telco backends, you know, overlay networks, things like that. We have a lot of information here that we can make use of from an OpenStack perspective.
D: We also make use of rsyslog, and we're in the process of getting our logs potentially across the bus. We're doing a bunch of load testing right now to determine whether that's feasible, and so we're testing out at, you know...
D: ...a hundred thousand, you know, four million logs, things like that, and making sure that we're having as close to 100 percent delivery of those messages as possible. That work is ongoing, so at some point logs may show up here also, in the single transport layer, and then, yeah, it just comes into the bus here.
D: The Smart Gateway is basically the middleware. From a third-party integration perspective, because of the way that we're doing the transport layer and using the message bus in a distributed manner, both the Smart Gateway and other systems can connect to that same bus. So the Smart Gateway can collect that data and store it for you, and other systems can also listen for that data and react to it. Part of the reason this is set up in this way is to allow for closed-loop remediation.
D: So what you could do is have a process living alongside your OpenStack system, listening to the local message bus, able to react to that and do something. Let's say an error showed up, or a warning or something like that, that says I need to go and restart a service, for example. That same service can listen and react to it without actually affecting the data storage, which can happen much further down the line.
D: You don't necessarily always want to be reacting to the information after it's been stored, because that can actually be quite a long time, right? That can be a long, long loop, which we call the northbound loop: going all the way up into the storage domain, sitting there for several cycles to determine that something is actually wrong, then reacting to it, sending an alert, and then going and acting on it. That action may be...
D: ...done by a human, but if you wanted, you could have a remediation system that could react to that. So we have these various loops within the system: a very closed loop here, or a very small loop; then a longer loop that can come up into the actual storage domain; and then an even longer loop, where everything is actually in the storage domain.
D: Yes, so if you're going to do closed-loop remediation, you definitely have to make sure that your system is able to understand that something happened, was then able to react to it and make a change, and then it should technically also report back into the system in order to clear the condition, to say that I have resolved it or it's being resolved. The system that we have set up also allows for multiple clouds. So you may have several different data centers, or you may have one data center...
D: ...that has various small clouds that are broken out, maybe on a tenant basis or just for specific features, or maybe you just have a system with...
D: ...various small clouds, or medium and large clouds, right? So, instead of having multiple different monitoring systems, you can centralize that. Again, we use the transport; when that transport comes in, we actually have various groups of Smart Gateways, one per cloud, and then that can all go into the same storage backend, or different storage backends if you want to configure it that way. Excuse me. So this is just the various components: STF is actually made up of a bunch of different components.
D: I've been talking about the storage domain, our middleware, our transport layers, and our collection layers. On the OpenStack side of things, as I mentioned, collectd and Ceilometer are the data collectors. We're making use of the Apache Qpid Dispatch Router for our transport layer; in that diagram, that's the AMQ Interconnect Operator here, and that AMQ Interconnect Operator is what manages the lifecycle of the AMQ workload.
D: We also make use of cert-manager for the creation of certificates. We use the Elasticsearch and Prometheus Operators in order to deploy and manage the lifecycle of the data storage components. We use the Grafana Operator for managing the actual Grafana deployment, and then we have the Smart Gateway Operator, which is what actually manages the deployment, the stand-up, and the configuration of the various Smart Gateway components. And then we have...
D: Finally, the Service Telemetry Operator. The Service Telemetry Operator is what I call an umbrella operator: it is the thing that you install, and when you want an STF instance, it goes out and creates various objects inside of OpenShift. Those objects are then reacted on by the different operators listed here, and each of those operators then goes and manages its components.
D: So STF just says, I need an Elasticsearch, or I need a Prometheus. It will request that, and then the operator that has all the operational knowledge of how to stand up a Prometheus, how to stand up an Elasticsearch, how to stand up a QDR system, takes over. STF just makes those requests, those operators actually manage the stand-up of all of that, and then, when all of those components are up and running, you now have an STF instance, basically.
D: Yes, so if you were running this in OKD or OpenShift, you would see these as the installed Operators. Once you've gone through the installation process following the documentation for STF, this is what you would see inside of the Operator Lifecycle Manager page.
D: Sorry for that, let's try that again. There we go. So if I get to the proper endpoint here: oc get servicetelemetry.
D: And obviously you get STF. Sorry, I think I just typed it wrong, so: oc get stf default. This is what you would actually create as part of an STF instance. You basically say, inside of alerting, you ask for an Alertmanager and you request its storage backend, and then you can have a receiver, like for SNMP traps and things like that.
D: Here are our backends that define our events backend, so we basically say Elasticsearch is enabled and we're going to use persistent storage. For our metrics, we're going to say we're going to use Prometheus; that's been enabled with a scrape interval of 10 seconds, and then we've again set our storage backend and how long we're going to retain the data. And then there are the various clouds that we're going to monitor; these are basically our collectors.
D: Within the clouds setup, we're saying we have events and metrics collectors, what we're going to listen for, which is the subscription address inside of the Qpid Dispatch Router, and what the collector type is for that Smart Gateway. We basically define that, and we can have a list of various clouds: if I had a multi-cloud setup, I would have another entry down here.
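Putting the pieces just described together, a ServiceTelemetry object along these lines is what the demo walks through. This is a rough sketch reconstructed from the narration; the field names follow the walkthrough but may differ slightly from the released STF CRD, and the cloud name and subscription addresses are illustrative:

```yaml
# Sketch of a ServiceTelemetry object as described in the demo.
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  alerting:
    alertmanager:
      storage:
        strategy: persistent        # request a storage backend for Alertmanager
  backends:
    events:
      elasticsearch:
        enabled: true               # events backend
    metrics:
      prometheus:
        enabled: true
        scrapeInterval: 10s         # the 10-second scrape interval mentioned above
        storage:
          strategy: persistent
          retention: 24h            # how long to retain the data
  clouds:
    - name: cloud1                  # one group of Smart Gateways per cloud
      metrics:
        collectors:
          - collectorType: collectd
            subscriptionAddress: collectd/telemetry
      events:
        collectors:
          - collectorType: ceilometer
            subscriptionAddress: anycast/ceilometer/event.sample
    # a second cloud entry here would add another group of Smart Gateways
```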
D: So I've defined this as cop04, which is just a short form for the cloud configuration. Again, I can even have overrides. This is an example of a Grafana manifest, and I'm doing an override because I needed to change the base image. So the different objects that STF can manage, I can actually override if I need to. Again, Grafana is enabled, and this just sets some information for it, whether I have high availability, the transports, and things like that.
D: So: oc get prometheus default. You would basically have another object here; this is the object that the Prometheus Operator would react to. Again, it just took the information that I had in my ServiceTelemetry, saying I wanted a Prometheus storage backend, created this Prometheus object, which the Prometheus Operator reacted on, and that resulted in a Prometheus instance for me, and then it created the different storage backends and things like that. So here's my...
D: So this is a picture of the layout of the routers. These routers are the Qpid Dispatch Routers that are collecting the data. You can see this is basically STF, this is what's running on OKD, and then each of these routers is running on each of the nodes inside of my OpenStack environment. Each of these routers runs locally, and then the local clients connect to them.
D: The way Ceilometer works on OpenStack is that there are compute agents that run on the non-controller endpoints, and that information is sent across the RabbitMQ bus to the Ceilometer agents. The Ceilometer agents, via oslo.messaging, are then able to send that information across the AMQP 1.0 connection, which is the Qpid Dispatch Router that we're looking at here, and that information is then centralized back to STF.
D: This is just a picture of one of the dashboards, and I'll go through the various dashboards I've been working on this week. This is the result of all of that information. So I've got my various APIs inside of OpenStack that I'm checking, and I've just created a dashboard here that has various bits of information.
D: So I have these pre-recorded demos, but I'm happy to take any questions, and then I can get into any live environment stuff, if that's interesting to you.
D: So, worst-case scenario, you basically monitor to say, I'm not getting any monitoring data, and an alert is sent saying your monitoring system is not getting any information from the cloud that you're monitoring, right? That could be that the networking went out, services could be overwhelmed, the memory could have run out in the system. But ideally I would have enough information leading up to that event that, even when I got the alarm saying, well, your monitoring system's offline, I could ask: why is it offline?
D: All of the systems are actually running their own collectors, and they're running their own Qpid Dispatch Routers for the transport. So if you have networking, you will still get data; in the worst-case scenario, you just don't get any data, and that is definitely something you'd want to alarm on. What I can also do is make use of things in Prometheus, for example, to do predictive...
D: ...monitoring. So I can say, I've been watching memory going up, or I've been watching network utilization going up over the last hour, and as predicted, three hours from now, at the current rate, you will have overwhelmed the system. Ideally you react to things faster that way, without having to actually have the worst-case scenario where the system's offline and then you have to react to it that way.
C: And it's kind of an advantage versus the OpenStack Telemetry project, in that the telemetry itself, except for Ceilometer, isn't running on OpenStack, so your monitoring is now on another cluster. Therefore it's not actually affecting the system itself, which I think is a great improvement over what we had before.
D: Yeah, we actually don't recommend that. Our documentation will say you should not run your monitoring platform on top of the thing you're monitoring, because again, let's say I have that catastrophic network outage. Well, now the thing that's running on top of it can't actually notify me that it's out, right? The only other way to do that is to then have another monitoring system...
D: ...that's checking to see if your monitoring system's up and running, and then alerts you when the monitoring system goes away, which is kind of silly, right? But it's not totally impossible if you have something really small running somewhere. Ideally, though, I believe a lot in having an infrastructure cluster, where you deploy a very small cluster that's specifically for running things like monitoring. Maybe you run your undercloud, you run your OpenShift...
D: ...installer, you run your ACM, whatever systems you happen to want to be running to actually manage and deploy your clouds; those, I believe, should be in a separate cluster anyway. So that's the idea here: you don't run your monitoring on top of the thing you're monitoring, in an ideal world anyway, yeah.
D: Okay, so, I don't know, let's look at some dashboards, because dashboards are cool.
D: It's pretty straightforward. What I like to do is go to Prometheus first, and I will go and find something that I want to monitor, right? So I can click through this list, you can see this list here.
D: And I can execute that. That'll, you know, show me the amount of free memory across my systems. So then maybe what I would do is take this, go into my dashboards, and go Home, if I can get back to the main screen here.
D: Obviously I can set this name, set this panel, whatever the case may be, but that's pretty much it. Then usually what I end up doing, once I've done that, is I will save the dashboard, then I will export it. That'll save it to a file that gives me a JSON document, and then what I end up doing is: oc get grafanadashboards.
D: And so these are all the dashboards that I've created. The nice thing is that when you load these in, they're managed for you as objects inside of OpenShift, so I don't have to manually import them every time. If I restart Grafana, if it doesn't update or whatever, these dashboards will all be automatically loaded back in. So: oc edit. Let's just look at the Grafana dashboard.
D: 1.3, and so I just have to wrap it in this basic header: which API it is, it's the integreatly API; what kind is it, it's a GrafanaDashboard; a little bit of metadata, and this stuff's actually all created for you, so you wouldn't even have this stuff in here. I just name it, say which namespace it's part of, and basically that would be it, and then I'd have the spec. I say that it's a JSON blob; this bar just means multi-line.
D: So everything after that is indented, and then this is literally exactly what I just exported out of that file. So if I quit out of that, oh, I might have made modifications, so then I would do: oc create -f, you know, new-dashboard.yaml, for example, and then that would create this dashboard for me.
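The wrapper header described above could look roughly like this. The names here are illustrative (the dashboard name is made up, and the JSON body is a placeholder for whatever Grafana exported), but the apiVersion/kind follow the integreatly API the demo mentions:

```yaml
# Sketch of a GrafanaDashboard object: a small header wrapping the
# exported dashboard JSON as a multi-line blob under spec.json.
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: my-dashboard               # illustrative name
  namespace: service-telemetry
spec:
  json: |
    {
      "title": "My Dashboard",
      "panels": []
    }
```

Loading a file like this with oc create -f has the Grafana Operator import the dashboard, and deleting and recreating the object re-imports it, as the demo goes on to show.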
D: So if I go to github.com/infrawatch/dashboards, this is actually where our dashboards live in GitHub. So it would be oc create -f; these are the dashboards here, so oc create -f with the STF dashboard, for example. Once I do that, it would automatically be created for me inside of the dashboard. In fact, if I'm going to be really brave here: oc delete -f...
D: ...and then do a refresh here. This is the one that I created, but you can see my STF dashboard is no longer here. But if I go back to my console, oc create -f with the STF dashboard, and then I refresh this page, there's my STF view that wasn't in there before, and now there's my dashboard, and it's making use of all the data that I've already previously collected. You can see all the information about your STF backend.
D: So that's part of the reason I make use of OKD for this: it's really easy to manage these components. Instead of me going and having to write a whole bunch of stuff to manage and deploy all this for me after the fact, I make use of the operator model to do that. I can show you a little bit of how that works.
D: So, infrawatch/service-telemetry-operator. This is just Ansible in here. There's a lot of boilerplate for creating the actual operator itself, but ultimately it's just Ansible, and part of this is all of these components, or these playbooks, that I've created. So there's one for, you know, the Alertmanager, the certificates, the clouds, the Elasticsearch, the Grafana, things like that. And all that's doing, in fact, I will load it in something so we have some colors.
D: It just looks up a template, sets some defaults, and then it creates an instance of Grafana using the Kubernetes module inside of Ansible. That takes the object inside of that template, loads it into OKD, and then the Grafana Operator reacts to that and results in the creation of the Grafana instance. And then there are other things, anything else it needs, like looking up data sources; what this is doing is also creating the data sources.
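The pattern just described, a template lookup fed into Ansible's Kubernetes module, might look roughly like this. This is a sketch, not the actual operator source; the task and template names are hypothetical:

```yaml
# Hypothetical Ansible task from an operator role: render a Grafana
# object from a Jinja2 template and apply it to the cluster, leaving
# the Grafana Operator to react and create the actual instance.
- name: Create Grafana instance from template
  k8s:
    state: present
    definition: "{{ lookup('template', 'grafana.yaml.j2') | from_yaml }}"
```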
D: So: oc get grafanadatasources, I think, is the object name. There are the default data sources; default-datasources, -o yaml. That creates the data sources inside of my Grafana instance, which are defined here under datasources, so the Elasticsearch Ceilometer one, the Elasticsearch collectd one, the STF Prometheus one. These are all created for me as part of an STF deployment; I don't create any of this. This all results automatically just by enabling dashboards inside of the ServiceTelemetry object.
D: So anything you want to add, you just add to the Service Telemetry Operator, and then that can go off and work with other operators to actually deploy the components that you might need or want. Again, we have overrides that you can also pass into the ServiceTelemetry object. Like that Grafana manifest, for example: I did an override of Grafana, but I still had access to my data sources that were automatically generated for me and created as part of the Service Telemetry deployment.
D: Our documentation is even all auto-generated as well. Any time that I make a change to this infrawatch documentation, so this is our source in AsciiDoc, any time that's changed and it merges into the main branch here, this will actually update and you will get changes into this documentation.
D: So all of that is auto-generated, and you can see that for our upstream here: for the open source deployments, instead of OpenShift you use OKD, there are suggestions of different backends you can use for testing, things like that, and then it just goes through how to create the objects.
D: So when I deploy an STF, I literally just do this: copy and paste that, copy and paste that, copy and paste that, just keep going, and then I'm basically done. Once I get to the end of this, I will have everything that you just saw there, in terms of the operators that are deployed, the Grafana instance that exists, all of that. Then all you have to do is add your alerts; there's a file that we provide with these alerts that you can use.
E: Yeah, I can see putting that into a workflow. One of the other things I want to ask you a question about: a couple of times, when we were talking about things like degraded mode and other stuff, you mentioned predictions.
E: So is there any kind of a predictions feature in STF to say, hey, you're going to run out of memory in your cluster at this point?
D: Yeah, so that's actually part of Prometheus. There are a lot of different functions that you can make use of in Prometheus; this would be like predict_linear. I'm not going to try and create something on the fly, but you basically make use of these various things inside of Prometheus. There are lots of functions: there's just summing, so like adding things up; predict_linear; and then just various things when you write your alerts.
D: You write alerts that say, when something reaches this threshold, send a warning, and when you reach a further threshold, send something like a critical notification. So you can get these different levels when you send alerts, to say this is just a warning, we're getting up to the critical area, and then, when you surpass that critical area, you get your alarms that say, okay, you really need to react to something right now.
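The tiered predict_linear approach described here can be sketched as a PrometheusRule, which is also the object type shown later in the demo. This is an illustration, not a rule shipped with STF; the metric name, thresholds, and namespace are hypothetical examples:

```yaml
# Illustrative PrometheusRule combining predict_linear with tiered severity:
# a warning when extrapolation says memory will run out, a critical alert
# when a nearer threshold is actually crossed.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: memory-exhaustion-rules
  namespace: service-telemetry
spec:
  groups:
    - name: predictive
      rules:
        - alert: MemoryExhaustionPredicted
          # extrapolate the last hour of free memory three hours ahead;
          # fire if the projected value drops below zero
          expr: predict_linear(collectd_memory{memory="free"}[1h], 3 * 3600) < 0
          for: 10m
          labels:
            severity: warning
        - alert: MemoryCriticallyLow
          # the nearer threshold sends a critical notification
          expr: collectd_memory{memory="free"} < 1e9
          for: 10m
          labels:
            severity: critical
```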
D: These ones just happen to show up as an example, but it's just telling us that our virtual machine is still active, basically, inside of this project. If I change to a different OpenStack project, this one doesn't have any VMs in it, this one has two, and we can even see that down here. So for these projects, here are the instances that are living inside of them.
D: Oh, so we're just making use of the Alertmanager that comes from Prometheus, basically. You just load your alert rules in, and I can actually show you that really quick too.
D: Yeah, so, rules, here we go. So here are the rules, the OpenStack rules. This is just loaded into the console: oc get... I can't remember what it's called; I think it's prometheusrules.
D: There we go. So yeah, this is the file that we created, the OpenStack rules, and these are the expressions that we've created, plus any alarms or alerts that you want to create. So if we sit in this position for 10 minutes, then the label is basically severity: warning.
D: I believe the infrastructure node view shows if there are any recent alerts or whatever. These need to be tuned for the environment, obviously; the fact that I'm seeing a whole bunch of current alerts and recent alerts just means that it's flapping, because those queries are too aggressive for this particular environment. This is a demo environment, so it's always heavily overloaded, but that shows that the alarms can show up here. So that's what results in the alarms; they can show up on the dashboard.
D: If you make use of the SNMP trap functionality, then we have a little bit of middleware that sits in and listens for the webhooks being sent, as a receiver, from Alertmanager, and then it can convert that and send it to a system that can accept SNMP traps. But otherwise you just set up your Alertmanager just like you would set up Alertmanager for any other system. STF is just making use of those existing components; it's not doing anything magical.
E: Is it possible to plug third-party services into this somewhere? Like, for example, if somebody wants to use PagerDuty for alert management?
D: So there are only a few things that Alertmanager supports natively, and PagerDuty happens to be one of them. Primarily, if you want to interact with different types of third-party systems for sending alerts or warnings or whatever the case may be, generally you consume a webhook and then you convert and send it to whatever endpoint it is you want. But some of the various built-in ones are email, Slack, a few different ones, and PagerDuty happens to be one of them as well.
D: No, exactly, so that's actually part of Alertmanager, and that's what Alertmanager will do when it receives one. I don't actually have it working right now; I didn't set it up quite right when I was deploying this, because it's actually disabled for some reason, but this is the route. You set the route here, and then you can group by various jobs.
D: You can determine how long you're supposed to wait, how often you're checking, things like that, and then you would have a receiver. Now mine's set to null, because I don't have any receiver set up, but you basically set a receiver here that may be, you know, PagerDuty, it may be an email system, whatever the case may be, and that receiver is what ultimately results in the delivery of the alarms. It will also do the deduplication, suppression, and things like that. So Alertmanager...
D
So with Alertmanager, you can run, you know, two or three of them for high availability, and when it gets an alarm, it will determine if it's already been sent to a receiver by one of the other Alertmanagers, so you won't get it multiple times.
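A minimal Alertmanager route-and-receiver configuration of the kind being shown on screen might look like this; the grouping labels, intervals, and the PagerDuty integration key are placeholders:

```yaml
# Illustrative Alertmanager sketch: group alerts by job and
# deliver them to PagerDuty instead of a null receiver.
route:
  group_by: ['job']
  group_wait: 30s        # how long to wait before the first notification
  group_interval: 5m     # how often to check for new alerts in a group
  repeat_interval: 4h
  receiver: 'pagerduty'
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: '<your-pagerduty-integration-key>'
```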
E
D
Yeah, so really STF is designed as an infrastructure monitoring tool; it's not really meant to be the application monitoring tool. So you could either run something inside of your virtual machine that sends to another system that is designed specifically for application monitoring, or you can match the same pattern, where your virtual machine can actually run, say, a collectd or something like that. Now, for the virtual machines themselves, if you're just trying to get CPU, memory, I/O, things like that, you don't need to run anything inside the VM. The collectd virt plugin deals with that for you: it will talk to libvirt, and it gets information about all of the virtual machines. So if you're actually just trying to monitor the virtual machines themselves, then you don't need to run anything inside of the virtual machine; that's already dealt with for you using this collectd virt plugin.
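On the hypervisor side, enabling the collectd virt plugin looks roughly like this; this is a sketch for illustration (exact options vary by collectd version, and in an STF deployment TripleO generates this configuration for you):

```
# Sketch of a collectd.conf fragment enabling the virt plugin,
# which reads per-VM CPU, memory, disk, and network stats
# from libvirt on the hypervisor.
LoadPlugin virt

<Plugin virt>
  Connection "qemu:///system"   # local libvirt daemon
  RefreshInterval 60            # re-scan the domain list every 60s
  HostnameFormat "name"         # report metrics under the VM's name
</Plugin>
```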
C
D
So if you wanted to, as an administrator, to allow the workloads inside to also send information, you would just have to run the QDR, or be able to point your collectd at the QDR, and then you would just make use of the collectd plugins that you want to make use of. So, for example, say your example is: you know, I want to know that the Apache running inside of a virtual machine is still active.
D
I'd make use of the apache plugin inside of collectd, and then send that into the system and then basically monitor for that. But that's kind of out of scope, while technically possible, because you just have to send the data; once the data is sent and transported, then you just make use of it, right? But that is kind of at a different layer; that is not really the scope of this monitoring system.
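For the in-guest case described here, a collectd configuration could pair the apache plugin with the amqp1 plugin pointed at a reachable QDR. The hostname, port, and transport address below are assumptions for illustration, and the apache plugin requires `mod_status` enabled on the web server:

```
# Sketch of an in-guest collectd.conf: scrape local Apache status
# and ship the metrics over AMQP 1.0 to a QDR the guest can reach.
LoadPlugin apache
LoadPlugin amqp1

<Plugin apache>
  <Instance "web">
    URL "http://localhost/server-status?auto"
  </Instance>
</Plugin>

<Plugin amqp1>
  <Transport "telemetry">
    Host "qdr.example.com"   # hypothetical reachable QDR
    Port "5672"
    Address "collectd"
    <Instance "telemetry">
      Format JSON
    </Instance>
  </Transport>
</Plugin>
```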
D
Yes, so I can just monitor the virtual machine itself, and I'll know how much memory was allocated. So you can see down here, you know, how much total memory, how much is unused, how much is usable, how much is available to me, all of that kind of stuff. So I can make use of the libvirt stats to determine that a virtual machine that was given, you know, 16 gigs of memory is approaching its limit.
E
I mean, I will say, as a former DBA, the separation of infrastructure monitoring from application monitoring is not a decision I've ever understood. Because, for example, if I'm running a database and the queries start being inexplicably slow, frequently the reason is that the machine that they're on is running out of resources, and it really seems like you actually want that telemetry unified rather than in two separate systems.
D
Yeah, like I said, you just have to run collectd inside of the virtual machine, right? So that is a decision for the infrastructure administrator, to determine if that data is appropriate for their system and what they're monitoring. But if you just run collectd inside of the virtual machine, then there's absolutely no reason you can't collect data from inside of the virtual machine; you just have to configure it that way.
D
Right, like the deployment of the data collectors and the transport inside of the virtual machine is definitely out of scope of the lifecycle management system for your infrastructure, in this case TripleO, right? So I'm running TripleO; what I'm doing inside of the virtual machine is outside the scope of that system, which is what I'm using to deploy the data collection and transport.
D
If you happen to have a virtual machine image that you've created and uploaded to your data store, and when you launch that VM, those VMs are set up to automatically, no matter what you're doing, run a data collector and a transport system, and you configure it in such a way that it can connect, there's absolutely no reason you couldn't centralize all that data with the system.
D
Well, you need to connect it to a QDR somewhere. So you could either understand that there is a VM that might attach to the QDR running on the host, but then you have to make sure that your network routing allows the virtual machine to connect to the host that is running the VM, and that might be a security issue. So you may need to set up another QDR. Every node is running a QDR, which is very light, right? Those QDR routers are running in edge mode.
D
Yeah, in theory you could do it that way. Now, obviously, it just depends if you want to use the local QDR for your virtual machines, and you have to, you know, decide whether the security requirements of your system allow for virtual machines to connect to services running on the host.
C
And there are so many different ways you can monitor, and everyone is very passionate about what they think is best. That's why I was mentioning that if you had an issue, knowing the CPU was high or the memory was low could point you in a direction, but at the same time it is nice to know that things are responding, whether it's in the same system or not. You know: connecting to the database, are you getting a good response time? Connecting to your website, is it up?
D
Yeah, and obviously there's tenant implications and things like that. So if you have multiple customers or tenants inside of your infrastructure, STF's not designed to be a multi-tenant system, right? All of your data is collected in one location; it's not separated out, so there's no way to say customer ABC can only see the information collected by their VMs. That's not in scope for what we're doing.
D
If you are the only tenant, and you're running the cloud for the purposes of, say, running VNFs, and all of the workloads running on top of that infrastructure are specifically in support of your administrators, and you don't need that multi-tenant separation at the data store level, then yeah, I mean, run your virtual machine, run your VNF inside of it, and then have your data collector inside of it that goes into your main monitoring infrastructure, you know, like STF. But if you need some very fine-grained controls of who can see what, then probably STF is not the appropriate solution.
E
It seems like, actually, if that was your situation, you could go the other way, right? If you have an application monitoring system, then you could expose select data from STF to that system, right? Because if I'm in charge of the application, if I'm the DBA, I don't care how I get the correlated information behind what the CPU is doing and what the database is doing.
D
Yeah, so Prometheus is not multi-tenant aware, right? I don't have separate logins where I can say you get this subset of data; that's not how Prometheus works. But what you could do, if you really wanted to, is break those out into separate data stores. You could effectively create two ServiceTelemetry objects in the same namespace, and then have a cloud configuration for just that tenant's application, and then, when they send their data back, it would be listening on a different transport address, right?
D
So you'd have a different set of smart gateways on a different address space inside of the actual transport medium, and the applications would only send to, say, you know, application/dba or something, right? And then the smart gateways would listen to that topic on the transport and put it in its own data store, and then you could create, you know, routes just for that person, and then you could use, you know, OKD's login system to control who can access and log into the various routes, too, right?
D
So it just requires a little more setup than what is just kind of out of the box; you know, it takes more than the 10 minutes it takes to set up STF normally, but in theory it is totally doable.
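The per-tenant split sketched here might look something like a second ServiceTelemetry object whose cloud listens on its own transport address. The field names below follow the upstream CRD loosely and the cloud name and address are hypothetical; treat this as an illustration, not exact syntax:

```yaml
# Illustrative second ServiceTelemetry object in the same
# namespace, with a cloud dedicated to one tenant's application.
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: stf-tenant-dba
  namespace: service-telemetry
spec:
  clouds:
    - name: dba
      metrics:
        collectors:
          - collectorType: collectd
            # Smart gateways for this object would listen only on
            # this per-tenant topic on the transport.
            subscriptionAddress: collectd/application-dba
```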
C
D
No, we haven't really had that so much. So it's pretty much just: we monitor the GitHub system, you can open issues there, and you can always, you know, find us on IRC. I think we have a service-telemetry channel on, I want to say, OFTC. But generally, just go to the GitHub, open an issue, and we're pretty responsive there. Otherwise, yeah, we haven't had a lot of other folks outside of our team really making use of it, so we haven't really had that need for a guide or anything like that, but obviously we would if we started getting lots of contributions; that would be a great problem to have.
C
D
No, I think, I mean, it's definitely a little bit of a different system, and it's not built on top of the, you know, OpenStack services running directly inside of OpenStack, but the idea behind it is to allow information coming in from various different systems. So that's kind of why the architecture looks that way. And if anyone's interested in running it, you know, feel free to reach out, and I'd be happy to help anyone work through anything that they need.
E
Yeah, so thank you. Thank you, Leif. Thank you, everybody, for attending, or for watching this later on YouTube. Cloud Tech Thursdays will be taking a brief hiatus, and we'll be returning in approximately one month, on August 17th. And the reason why it's August 17th is we're actually becoming Cloud Tech Tuesdays.
E
The reason being that we wanted to move this to an earlier time slot, in order to be more friendly to European viewers, who have been complaining that this is way too late in the day for them to attend. So, yep, you'll see us in four weeks minus two days as Cloud Tech Tuesdays, where we will be meeting with the Kubernetes 1.22 release team to talk about how the release went, what's in the release, and all of those other things. So see you then, on Cloud Tech, on CTT, which we still are.
E
UTC time, I believe, is 1300. Yeah, hold on.
E
Yeah, so it's 1400 UTC, and 10 a.m. EDT. Okay, but you'll see more in various places.