From YouTube: OpenShift Commons Briefing: Operator Metering with Chance Zibolski and Rob Szumski (Red Hat)
Description: From the Operator Framework SIG, October 2018. Operator Metering with Chance Zibolski and Rob Szumski (Red Hat).
Chance Zibolski: I'm Chance. I work at Red Hat, and previously CoreOS; I came in with the acquisition and have been with CoreOS and Red Hat for a little over three years. Rob asked me if I would give a brief overview of what Operator Metering is and, if possible, also a demo. So I thought I'd start by giving the basic idea of what metering is, what its purpose is, what we aim to solve with it, and where you can find more information.
So the project name is Operator Metering, but I want to preface that with the fact that this isn't necessarily only geared towards operator use cases, though that is probably the best way to get better integration: if you do have an operator, you can leverage metering in a more Kubernetes-native way.
The basic idea of metering is that we work closely with your monitoring stack, and potentially other data sources, to collect data, store it long term, and then provide the ability to report on it over time and slice and dice it the way you need. Let's see; just to give a quick example of what everything actually looks like.
So what I'll do is actually go through this in more detail, but the rough rundown is that it starts with a Prometheus query: you start ingesting the data through the reporting operator, the reporting operator then has the ability to query it using SQL that either you write or we write, and then you get that query to run by creating a Report or a ScheduledReport, which actually says what you want to report on. So, in the background, I already have an installation of metering running.
By default it runs a set of pods for storing our data and querying it, and then also the part that runs collection and the actual queries themselves. The primary component here is the reporting operator: it's the one that does the data collection from Prometheus, and it's also what queries the database, which is Presto, to do all the real work on the underlying data. We use HDFS for storage, but that is something you can change.
You can actually also use S3 natively (as basically a file system is the way you can think of it), or you can use a local disk, and anything that's mountable by many pods, like NFS, GlusterFS, or CephFS, could also be used as a storage backend for this instead of HDFS.
Alright, so we have a number of custom resource definitions. The top one here is the Metering resource; it's the resource that tells it to install everything. Our metering operator installs the pods listed above (Presto, Hive, the reporting operator), and it does all that through the Metering resource, which is basically the config resource for installation.
These are the ones I would expect the end user to deal with: there's a ReportDataSource, which is basically incoming data or data that already exists, and I'll show you that; there are ReportGenerationQueries, which are the SQL queries that we saw before; there's the ReportPrometheusQuery, which is a PromQL expression for collecting data out of Prometheus; and then there are Reports and ScheduledReports, which are the parts that actually act upon that data. StorageLocations are a way of configuring
whether you want your data to be stored in HDFS, S3, or a local file system, for example. So, starting from the bottom up, I'll start with the data collection portion. We have these ReportPrometheusQueries, which are really simple: they are an actual Prometheus expression, PromQL. So let me make this a little easier to read.
I will make this slightly smaller and see if I can get a nice break in here. This is a large Prometheus query expression that gets the containers' memory usage and then groups it at the pod level, so you get pod-level information instead of just container-level, and then at the end we do a bunch of joining with other Kubernetes data so that we have the pod name, the node name, and the namespace. This query is just referenced by the configuration of the data source.
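A ReportPrometheusQuery of this shape might be sketched as follows. The API group/version, the resource name, and the simplified PromQL are assumptions reconstructed from the talk, not copied from a real installation:

```yaml
# Hypothetical sketch of the resource described above; field names assumed.
apiVersion: metering.openshift.io/v1alpha1
kind: ReportPrometheusQuery
metadata:
  name: pod-usage-memory-bytes
spec:
  # Simplified stand-in for the larger on-screen expression:
  # container memory usage summed up to the pod level.
  query: |
    sum(container_memory_usage_bytes{container_name!="POD"})
      by (pod_name, namespace, node)
```

The real expression in the demo also joins in other Kubernetes metadata, which this sketch omits.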
That is what eventually maps down to a real table via the query. So the ReportDataSource has a Prometheus query name, which is the name of the ReportPrometheusQuery we looked at before, pod-usage-memory-bytes, and that configures the operator to actually go and collect this periodically.
This section can normally have some extra options, like how often to poll and chunk sizing for how much data to grab at once, but by default it will just use some defaults. And then we have the status, just like every CR, which stores other information about this resource; in this case the table name field is set, indicating there's a database table created for this and that we're collecting the data. So now that we have a data source, we can actually query it from our database using a ReportGenerationQuery.
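As a sketch, a ReportDataSource tying collection to that Prometheus query might look like the following; the field names and status shape are assumptions based on the description above:

```yaml
# Hypothetical sketch; apiVersion and spec layout are assumptions.
apiVersion: metering.openshift.io/v1alpha1
kind: ReportDataSource
metadata:
  name: pod-usage-memory-bytes
spec:
  promsum:
    # Refers to the ReportPrometheusQuery by name.
    query: pod-usage-memory-bytes
status:
  # Set by the reporting operator once the backing database table exists.
  tableName: datasource_pod_usage_memory_bytes
```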
So there can be many ReportGenerationQueries that act on the data sources. That's the idea: the data source is the underlying raw data, and then you can have zero, one, or more queries that utilize that underlying data. That way you don't actually have to collect the data more than once to process it in different ways, which is obviously useful.
So a ReportGenerationQuery, just like everything else, has a name; all this other stuff is auto-generated because Kubernetes likes to fill in the metadata. We have a set of columns, which is basically what we expect this query to output in terms of a database schema: if you're familiar with SQL tables, this is roughly what it maps to, the columns in that table, plus some extra information for how to display them.
Reports can take in custom inputs, so you can override their default behaviors and program them dynamically a little bit; right now most reports only allow overriding the start and end dates. And then there's the actual SQL query, which is just ANSI SQL that has Go templates in it that get processed before the query is run, to allow us to do things a little more dynamically.
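Putting those pieces together, a ReportGenerationQuery might be sketched like this. The column list, the template delimiters and functions, and the table reference are all illustrative assumptions, not the real queries shipped with the project:

```yaml
# Hypothetical sketch; apiVersion, template syntax, and names are assumptions.
apiVersion: metering.openshift.io/v1alpha1
kind: ReportGenerationQuery
metadata:
  name: pod-memory-usage-hourly
spec:
  # Expected output schema of the SQL below.
  columns:
    - name: namespace
      type: string
    - name: pod_usage_memory_byte_seconds
      type: double
  # ANSI SQL with Go-template placeholders processed before it runs.
  query: |
    SELECT
      namespace,
      sum(amount * "timeprecision") AS pod_usage_memory_byte_seconds
    FROM {| dataSourceTableName "pod-usage-memory-bytes" |}
    GROUP BY namespace
```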
Is this the one? So we have an hourly report (two reports, actually: one for memory usage and one for CPU usage), and what these do is run the SQL query specified by the generation query field, and they run according to a particular schedule. We can do hourly, daily, monthly, whatever you'd like; we also support cron for the more flexible use cases as well. It will report on data starting at the reporting start time until the reporting end.
We don't have a reporting end, so it's going to report forever, which is what I want for this purpose, and it will retroactively go back and fill in data that's missing from the start, assuming we have the data collected from Prometheus already. As I showed before, this has been running for about ten hours since last night, so we actually have more than just a few rows of data. Before I show you that, though, we can see the status, which indicates where it's at in the report.
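A ScheduledReport along the lines of the one in the demo might look like this; the names, field spellings, and timestamp are assumptions reconstructed from the talk:

```yaml
# Hypothetical sketch; apiVersion and field names are assumptions.
apiVersion: metering.openshift.io/v1alpha1
kind: ScheduledReport
metadata:
  name: namespace-cpu-usage-hourly
spec:
  # The ReportGenerationQuery whose SQL this report runs.
  generationQuery: namespace-cpu-usage
  schedule:
    period: hourly          # hourly/daily/monthly, or a cron expression
  reportingStart: "2018-10-15T00:00:00Z"
  # No reportingEnd: keep reporting forever, backfilling missing periods.
```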
A
If
we're
back
long
for
anything,
you
checked
where
the
last
period
I
ran
for
is
and
then
because
I'm
on
open
ship.
But
this
works
with
regulators
as
well
I'm
using
routes
you
can
use,
load,
balancer
services
or
no
ports
as
well.
I
have
a
route
that
is
configured
to
expose
my
endpoint
at
a
particular
domain
name
here,
so
I
actually
already
have
a
command
set
up
through
create.
Yes,
you
can
query
it,
but
it's
not
really
anything.
I'm too worried about; this is a CI cluster. I set it up with auth using the OpenShift auth proxy, and I'm querying the Route that I just showed before, and this is the endpoint, the API v1 scheduled reports one. It's a bit hard to see, but I'm querying for a particular report, which is the namespace CPU usage hourly report, and I'm getting it in tab-separated format: we basically get the result in a tab-delimited format for each column. period_start is the start time for the given scheduled interval.
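The query from the demo might be reproduced with something like the following; the hostname, the exact endpoint path, and the query parameters are assumptions pieced together from the talk, so check them against your own installation:

```shell
# Hypothetical reporting-operator route exposed by OpenShift (assumed hostname).
HOST="https://metering.example.com"
REPORT="namespace-cpu-usage-hourly"
URL="${HOST}/api/v1/scheduledreports/get?name=${REPORT}&format=tab"
echo "$URL"
# With the OpenShift auth proxy in front, the request would carry a token, e.g.:
#   curl -k -H "Authorization: Bearer $(oc whoami -t)" "$URL"
```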
So it's an hourly report, so each period_start to period_end is one hour. The namespace is the namespace that we're calculating on; data_start and data_end are the min and max of the values in that time range. And then the pod usage CPU core seconds is the CPU usage at every instant in time, multiplied by the resolution of that data, all added together to get us an actual CPU core-usage-seconds figure, and then we do this for every hour.
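That core-seconds calculation can be sketched in a couple of lines of shell; the sample numbers are made up, and the two-column layout (a usage value followed by the sample resolution in seconds) is an assumption used purely for illustration:

```shell
# Each line: <cpu cores used at that instant> <resolution of the sample in seconds>
# core-seconds = sum over samples of (value * resolution)
printf '0.5 60\n0.25 60\n1.0 60\n' |
  awk '{ total += $1 * $2 } END { print total }'
# prints 105: 0.5*60 + 0.25*60 + 1.0*60
```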
So we've got 13:00 to 14:00, 14:00 to 15:00, and everything down to basically the last hour, 15:00 to 16:00. You can see we have the same with memory: the same kind of values, except they're in bytes and it's a different set of values for the pod usage information. This is actually all coming from node-exporter at the end of the day; Prometheus collects the node-exporter data, and we do some extra processing with the SQL and the Prometheus query to get it into this format.
This is a regular report, but the concept is that you can have custom inputs, where you could imagine having a query that is specific to a particular namespace, and you could add inputs to it that customize the behavior so that it filters out everything that's not the namespace you want. Maybe it's your CI test namespace, and you only want to report on that namespace. This is something that we just added and it's currently being worked on, so I don't have a great demo of it, because none of our queries utilize
these custom inputs very heavily yet, but that's something that we just released. And then, alongside this feature, we're also working on a concept of roll-ups, which allows you to calculate really granular reports, say at the hourly interval I was showing, and then roll them up into a daily report, which basically aggregates the hourly results.
Yeah, so that's the rough idea. I don't really have a whole lot more. Given that all of this is just custom resources, the real power here is that you can program it using Kubernetes, the same way you can program anything else in Kubernetes. If you have an operator that wants to interact with this system, it can do so using typical Kubernetes technologies like operators or kubectl.
Rob Szumski: So here's how it all comes together, a few examples of how you can use this in a real environment. Chance has shown us all the reports, all the stuff under the hood, but at the end of the day, say you want to do showback for a number of different teams: each team has three different projects, and they have a certain budget. You can run the reports Chance was just talking about and get the usage on Amazon.
We can actually correlate that to a dollar amount, which is really cool, using the Amazon billing API, and so you can get those into Excel and just sort them, group them by the different namespaces, and total things up manually; or, because these are all just CSVs, you can actually import them into your business intelligence tool of choice, whatever you want to use, make dashboards out of them, and have a more automated flow. You can also start doing a number of automated things:
email reports, that type of thing. And my favorite use case for this of all is that you can shame teams that are under-utilizing what they've reserved. So if a team is asking for more than 2x what they're actually using on the cluster, you can start shaming those teams: calculate the ratio of what they're using, list them out, say exactly which apps need to be yanked down to size, or even just go ahead and do that for them.
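That reserved-versus-used ratio is simple arithmetic over the report output. As a sketch with made-up numbers (the CSV column layout is an assumption, not the actual report schema):

```shell
# Columns: namespace, core-seconds requested, core-seconds actually used.
# Flag any namespace requesting more than 2x what it used.
printf 'team-a,7200,1800\nteam-b,3600,3000\n' |
  awk -F, '$2 / $3 > 2 { printf "%s over-reserved: ratio %.1f\n", $1, $2 / $3 }'
# prints: team-a over-reserved: ratio 4.0
```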
You could have automation running that's automatically adjusting people's resource limits, that type of thing. So it's pretty exciting. This is all using the cluster metrics that we have today, and that is one whole use case for this. But you can also export custom metrics from your operators, and that is kind of the other use case for this. So cluster monitoring and gaining insights from that is great.