Flatcar Container Linux Tech talks, 12 Oct 2020

From YouTube: Taking the Guesswork out of Your PromQL - Julius Volz, Prometheus

Description

Prometheus co-founder Julius Volz presents some usability challenges with Prometheus's query language PromQL, as well as approaches to make usage easier. PromQL is great for doing calculations on time series data. However, the language also has plenty of sharp edges and can be challenging to learn and work with for both beginners and more advanced users. The talk starts with a language overview, goes into examples of usability issues, and then presents recent efforts like PromLens (https://promlens.com/ by Julius' company PromLabs) and the new PromQL text editor that will be part of both PromLens and Prometheus soon.

A

So first of all, I'm Julius, a co-founder of Prometheus and the founder of PromLabs, a company that does services, consulting, and integration work around Prometheus, but is also building its first product right now, PromLens, which I'll also touch on today. So that's my background, and today I want to talk about the query language that we have in Prometheus, PromQL.

A

I won't really give an introduction to it, but rather talk about some of the challenges that users have with it, and then about how a new text editor and the product PromLens try to ease some of those pains. So, just taking a step back here, this is the overall Prometheus architecture.

A

The Prometheus server in the middle collects data from your devices and services, and then you can query that data to do dashboarding or diagnostics on it, or to calculate alerts.

A

So today we're going to talk about this part, the PromQL part. As I already mentioned, one use case for the data that is collected, and for the query language that goes with it, is dashboarding.

A

There you write PromQL expressions to show the overall state of your system and to make sure that everything looks healthy.

A

Another use case is ad hoc diagnostics: just asking a question against your infrastructure at a given moment. It doesn't need to be part of a dashboard in this case. For example, I was curious which container image deployed in my one-node Kubernetes cluster has the most CPU usage, which basically means sorting all the different image types by their CPU usage.

A

The third use for PromQL is alerting. Alerting rules you configure as a YAML file in Prometheus, with an alert name, some human-readable metadata, routing labels, and so on, but the heart of every alerting rule is actually the PromQL expression that I highlighted here. In this case, this is an example alert from the kube-prometheus project that alerts when the file system is about to run full.

A

So you want to make sure that these alerts really work, because if they don't work, you might have an outage that you're not detecting, and in the worst case you might be losing millions of dollars without noticing. So you want to prevent that.

A

So let's just look at an example of PromQL, building a query from very simple to more complex, and talk about the overall nature of the query without diving too deeply into each individual query concept. We might start out just selecting all time series that have a given metric name, in this case the counter of all HTTP requests, and this might give us maybe 10,000 different time series with different labels on them.

A

So we decide to scope it down to a particular job, that is, a set of targets that we're interested in. And since it's a counter, we might not want to look at it directly. We want to see how fast it increases.

A

So we look at the last five minutes of each of these returned counters (maybe there are only 1,000 now within this job), and then we calculate the per-second rate averaged over these five minutes. And then, constructing our query further, we don't want to actually see 1,000 individual rates. We want to sum them up by the different paths that these HTTP requests happen on. So we add a sum by path around this, preserving the path dimension.
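
As a rough sketch of the steps described so far, the query text might grow as follows. The slides are not reproduced in this transcript, so the metric name `http_requests_total`, the job value `demo`, and the helper functions are assumptions for illustration only.

```python
# Hypothetical names: the metric, the job value, and these helpers are
# assumptions for illustration, not taken from the talk's slides.
METRIC = "http_requests_total"


def selector(metric, **matchers):
    """Build an instant vector selector such as metric{job="demo"}."""
    if not matchers:
        return metric
    inner = ",".join(f'{name}="{value}"' for name, value in matchers.items())
    return f"{metric}{{{inner}}}"


def rate(expr, window):
    """Wrap a selector in rate() over a range window such as 5m."""
    return f"rate({expr}[{window}])"


def sum_by(expr, *labels):
    """Aggregate with sum, preserving only the given labels."""
    return f"sum by({', '.join(labels)}) ({expr})"


step1 = selector(METRIC)              # every series with this metric name
step2 = selector(METRIC, job="demo")  # scoped down to one job
step3 = rate(step2, "5m")             # per-second rate over five minutes
step4 = sum_by(step3, "path")         # one result per path

for step in (step1, step2, step3, step4):
    print(step)
```

Each step wraps the previous one, which is why the finished query reads from the inside out.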

A

Maybe now we're only getting back 10 different paths as a result, and now we might want to know: what is the ratio of bad 500 requests to the total requests for every path? And now we're introducing a binary operator here, making the query even more complex. Binary operations in PromQL are really awesome, but they're also kind of a sharp edge, because they do automatic joins between the left side and the right side.

A

They try to find series on the left and the right side that have identical sets of labels and then, in this case, divide those matching elements by each other and propagate that into the result. That only works if you get the labels exactly right and if the underlying data contains exactly the right labels, and there are modifiers you can apply to these binary operators that you also have to get right.

A

Otherwise, you will just get an empty result, for example, or an error.
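
The join behaviour described here can be modelled in a few lines of Python. This is a simplified sketch, not Prometheus's implementation: it only covers the default case where label sets must be identical on both sides, and the label names and values are made up.

```python
# Simplified model of PromQL one-to-one vector matching for "/".
# A vector maps a frozenset of (label, value) pairs to a sample value.
# Real PromQL has more rules (metric name handling, modifiers, errors on
# duplicate matches); this only shows the "identical label sets" join.


def labels(**kv):
    """Build a hashable label set from keyword arguments."""
    return frozenset(kv.items())


def divide(left, right):
    """Divide samples whose label sets are identical on both sides."""
    return {ls: value / right[ls] for ls, value in left.items() if ls in right}


errors = {labels(path="/api"): 2.0, labels(path="/login"): 1.0}
totals = {labels(path="/api"): 40.0, labels(path="/login"): 10.0}
print(divide(errors, totals))

# An extra label on one side means no label set is identical any more,
# so the join silently produces an empty result.
errors_by_status = {labels(path="/api", status="500"): 2.0}
print(divide(errors_by_status, totals))
```

The second call is the failure mode the talk keeps coming back to: nothing is reported as wrong, the result is simply empty.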

A

So let's go back one second here. We were just selecting requests with status code 500, and now we maybe want to see the ratio for any bad request whose status starts with five but has any other two digits after it. So we're changing this condition here, this label matcher on the left-hand side, to a regex matcher: five, something, something. We then do have to preserve the status dimension.

A

If we want to have the result keyed by every status as well, though, the binary operator doesn't quite work anymore. We have to tell it: hey, only match on the path label, because on the right-hand side we don't have the status, so we can't match on all labels anymore. But now we have more label cardinality on the left side than on the right side, so we also need a group_left modifier for this binary operator to tell PromQL:

A

Yes, it's okay, you can give me this result per status code and per path.
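
A sketch of what matching only on `path` with `group_left` amounts to, again as a simplified Python model rather than Prometheus's actual code. The label values are hypothetical; the point is that several left-hand series may match one right-hand series, and the result keeps the left-hand labels.

```python
# Simplified model of "left / on(path) group_left right".


def labels(**kv):
    """Build a hashable label set from keyword arguments."""
    return frozenset(kv.items())


def divide_on_group_left(left, right, on):
    """Many-to-one division matching only on the label names in `on`."""

    def key(label_set):
        return frozenset((k, v) for k, v in label_set if k in on)

    # Index the "one" side by its matching key; keys must be unique there.
    right_by_key = {}
    for label_set, value in right.items():
        k = key(label_set)
        if k in right_by_key:
            raise ValueError("duplicate series on the right-hand side")
        right_by_key[k] = value

    result = {}
    for label_set, value in left.items():
        k = key(label_set)
        if k in right_by_key:
            # The result keeps all labels of the left ("many") side.
            result[label_set] = value / right_by_key[k]
    return result


errors = {
    labels(path="/api", status="500"): 2.0,
    labels(path="/api", status="503"): 1.0,
}
totals = {labels(path="/api"): 40.0}

ratios = divide_on_group_left(errors, totals, on={"path"})
for label_set, ratio in sorted(ratios.items(), key=lambda kv: sorted(kv[0])):
    print(dict(label_set), ratio)
```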

A

Then we might want to transform these ratios into a percentage by multiplying by 100, and then maybe filter it down to just those path and status code combinations that have an error rate larger than five percent. This expression is exactly the same expression as this one; it's just indented a bit more nicely so that you can actually start to see the nested tree structure that PromQL consists of. We could indent it even further to reflect the evaluation order.

A

So this operator would actually be the root node of the entire operation, and then it goes down to this, which multiplies it with this, and we could draw this as an evaluation tree. So, in the end, PromQL is a language that has arbitrarily deeply nested expressions. An expression can evaluate to an instant vector, a range vector, a string, or a scalar numeric value. And as for the types of nodes: they can be aggregations.

A

They can be function calls, they can be binary operations, or they can be actual selectors of data. So everything needs to fit together correctly, and also fit the data that you actually have in your underlying Prometheus server, for this to work out in the end.
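
To make the "everything needs to fit together" point concrete, here is a minimal Python sketch of an expression tree with a type check. The node classes, the two-entry signature table, and the metric name are assumptions for illustration; the string type and most PromQL rules are left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

INSTANT, RANGE, SCALAR = "instant vector", "range vector", "scalar"


@dataclass
class Selector:
    metric: str
    window: Optional[str] = None  # e.g. "5m" makes it a range selector


@dataclass
class Number:
    value: float


@dataclass
class Call:
    func: str
    arg: object


@dataclass
class Aggregate:
    op: str
    by: Tuple[str, ...]
    arg: object


@dataclass
class Binary:
    op: str
    left: object
    right: object


# Assumed mini signature table: function -> (argument type, result type).
SIGNATURES = {"rate": (RANGE, INSTANT), "abs": (INSTANT, INSTANT)}


def type_of(node):
    """Return the value type of a node, or raise TypeError on a mismatch."""
    if isinstance(node, Number):
        return SCALAR
    if isinstance(node, Selector):
        return RANGE if node.window else INSTANT
    if isinstance(node, Call):
        want, result = SIGNATURES[node.func]
        got = type_of(node.arg)
        if got != want:
            raise TypeError(f"{node.func}() expects {want}, got {got}")
        return result
    if isinstance(node, Aggregate):
        got = type_of(node.arg)
        if got != INSTANT:
            raise TypeError(f"{node.op} expects {INSTANT}, got {got}")
        return INSTANT
    if isinstance(node, Binary):
        sides = (type_of(node.left), type_of(node.right))
        if RANGE in sides:
            raise TypeError(f"operator {node.op} does not accept a {RANGE}")
        return SCALAR if sides == (SCALAR, SCALAR) else INSTANT
    raise TypeError(f"unknown node: {node!r}")


rates = Call("rate", Selector("http_requests_total", "5m"))
tree = Binary(">", Binary("*", Aggregate("sum", ("path",), rates), Number(100)), Number(5))
print(type_of(tree))
```

Passing a plain selector without a window to `rate` raises a `TypeError` in this sketch, which is the kind of mismatch a linter can catch without touching any data.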

A

So this is easy, right? Well, I already knew that it's not always easy, but at the end of last year I asked on Twitter what people's biggest frustrations with Prometheus were, both for beginners and for advanced users, and there were many clusters of answers in the replies: about long-term storage, about scalability, and about some other features.

A

But the biggest cluster of those answers by far was about PromQL. Here you can see some of the examples. They range from having trouble with this binary operator vector matching (we had a couple of those, this one and this one and a couple of others), to surfacing the actual help metadata for metrics in the UI somewhere, to trying to understand some language concepts and working with the data, and so on. It just turns out to be a challenge. To make this a bit clearer:

A

Let's hop back to the example that we just had. Let's say we copy it into the Prometheus server that it's supposed to work on and look at it in the vanilla Prometheus UI, which I can make fun of because it was built by me as well.

A

You only see: okay, this entire expression gives you an empty result.

A

Well, you might be thinking this is totally great, because in Prometheus, if an alerting expression returns an empty result, everything is fine: in this case, there are no paths with an error rate greater than five percent. But you want to make sure that the empty result really is a result of only this filter at the very end, and not of something completely different in the query.

A

So what could be possible reasons for an empty result of this query?

A

So, for example, you might just be selecting a metric name that doesn't exist, in which case the rate would produce an empty result, in which case the sum would produce an empty result, in which case the binary operator wouldn't find any matches. It wouldn't even find series on the left- or right-hand side, so in total you get an empty result. You might also get the labels wrong. You might get your regex wrong. You might not actually have time series exported with the status 500 or 5xx label values yet.

A

Or maybe your scrape interval is too large for the rate window that you're supplying: the rate function requires at least two data points under the selected rate window to be able to compute a rate and output a point for a given series. Otherwise, there is no result for the series.
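
The two-data-point requirement can be illustrated with a deliberately simplified rate calculation in Python. This is a sketch under stated assumptions, not Prometheus's `rate()`: the real function also extrapolates towards the window boundaries, whereas this version just divides the increase by the time between the first and last sample in the window.

```python
def simple_rate(samples, window_start, window_end):
    """Per-second increase of a counter inside a time window.

    `samples` is a list of (timestamp_seconds, value) pairs sorted by time.
    Returns None when fewer than two samples fall inside the window.
    """
    in_window = [(t, v) for t, v in samples if window_start <= t <= window_end]
    if len(in_window) < 2:
        return None  # not enough points: no result for this series

    increase = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        # A counter that goes down is treated as a reset to zero.
        increase += cur - prev if cur >= prev else cur
    first_t, last_t = in_window[0][0], in_window[-1][0]
    return increase / (last_t - first_t)


# One sample every 60 seconds, counter growing by 30 per scrape.
samples = [(0, 0.0), (60, 30.0), (120, 60.0), (180, 90.0)]

print(simple_rate(samples, 0, 180))    # wide window: several samples
print(simple_rate(samples, 150, 180))  # 30 s window: only one sample
```

With a 60-second scrape interval, a 30-second window can never contain two samples, so the series drops out of the result without any error.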

A

You might also get the actual matching modifiers wrong here. If you just omitted these modifiers, you would also get an empty result, because you have the extra status label on the left-hand side and you don't have it on the right-hand side.

A

You're out of luck and you just get an empty result. So this is really troublesome, right? If you're just looking at this as an administrator and you want to make sure that this is actually working correctly, that you do get alerted if something bad is happening, then it's not that easy to tell at first glance.

A

Same thing for errors. If you input a PromQL expression that is more than completely simple, you might get some kind of weird, funky error, and you don't know exactly where in the expression it is happening. In this case, I can tell you: actually, I think it's because we have group_right instead of group_left. But you're wondering which one of these nodes or subexpressions is actually producing the error.

A

What do these different, weird syntactical constructs even do and mean, and what data am I working with? That's not exactly clear just from this.

A

So you know, even if you get a query right initially, you do want to make sure over time that it does the right thing.

A

The metric names and the labels can actually change over time, so you need to be able to understand and verify the query's correctness over time.

A

This is a bit of a tough thing to solve in the completely general case in a system that is not statically typed and compiled.

A

There, you can't very easily check automatically that some metric exported by some Python script over here really matches some alerting rule somewhere completely different.

A

So before talking about the new text editor and PromLens, I want to give a quick shout-out to Grafana's new Explore mode. Actually, it's not that new anymore; I think it's about a year old or so. It's in modern Grafana, and it already gives you a few more facilities for constructing queries and exploring the data in Prometheus with PromQL, and it has nice autocomplete features.

A

It doesn't really give you too deep an insight into the structure of the query itself yet. There's also the PromQL language server project, built by Tobias Guggenmos when he was an intern at Red Hat, and this is great. I mean, it's still experimental, but you can already install it locally and then use Visual Studio Code or Vim or other editors that support the Language Server Protocol and get nice autocomplete, inline linting, and so on.

A

The downside is that this Microsoft-defined Language Server Protocol is great for local editing but not so well suited for a web-based editor, and we do also want to have nice editing functionality like this built into the Prometheus server at some point. So this is not quite 100% the solution for everything yet. So today I want to talk about two new projects to improve this whole situation. The first is a PromQL text editor with contextual autocompletion that also gives you linting and snippets.

A

This part is actually open source. It's a collaboration between me at PromLabs and Augustin from Amadeus, and you can look at the source code in these two repos that are linked here. Contributions are super duper welcome. So that's already great, but it's not perfect yet. Then PromLens is a commercial product by PromLabs, and this is a query builder, visualizer, and analyzer tool for PromQL.

A

It gives you really deep insight into the actual structure of your query, and since PromLens actually includes this new text editor, we will only be looking at PromLens, but Prometheus will soon also include this cool new text editor.

A

So let's only take a look at PromLens.

A

The first thing is a cool new contextual autocomplete, so you actually get proper autocomplete with snippets, and it knows in which positions it's supposed to autocomplete label names, label values, metric names, and so on.

A

Then you get an offline linter that works directly in the editor. Since it's offline, it doesn't have access to the full Prometheus parser, but it has a lightweight offline parser that already detects a lot of common errors. In this example, it would tell you that you're trying to pass the wrong type into a function.

A

Then PromLens more broadly really tries to solve this problem of getting an idea of the data and the shape of the data you're working with. You can plug in any PromQL expression and visualize it as the actual tree structure, where every node in the tree is one full PromQL expression together with its children, and this shows you, for each of those nodes:

A

How many results there are, what the different label names are, and how many different values you have for every label name. You can then also visualize the actual data in a graph or in a table for the different nodes.
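
The per-node summary described here (result count, label names, distinct values per label) is easy to picture with a small Python sketch. This only illustrates the idea and is not PromLens's implementation; the sample label sets are invented.

```python
from collections import defaultdict


def summarize(series):
    """Summarize the label sets returned by one query node.

    `series` is a list of dicts mapping label name to label value.
    Returns the number of series and, per label name, how many distinct
    values occur across them.
    """
    values = defaultdict(set)
    for label_set in series:
        for name, value in label_set.items():
            values[name].add(value)
    cardinality = {name: len(vals) for name, vals in sorted(values.items())}
    return len(series), cardinality


# Hypothetical output of one node in the query tree.
node_result = [
    {"path": "/api", "status": "500"},
    {"path": "/api", "status": "503"},
    {"path": "/login", "status": "500"},
]

count, cardinality = summarize(node_result)
print(count, cardinality)
```

Running the same summary on every subexpression shows at which node the series count drops to zero.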

A

You can quickly spot where in your query there's actually an error, so you're no longer guessing where this is happening in a really complex query that isn't even syntax-highlighted.

A

It explains certain things to you in an Explain tab, so you can select any node of any type and it will do its best to explain what is happening there. In this case, it is visualizing the actual matching between the left-hand side and the right-hand side of the binary operation and also explaining what is happening. It can show you the help and type of a metric. Later on,

A

Hopefully, we can actually visualize that directly in the editor, and explain functions with their documentation, and also aggregators, and so on.

A

It includes a full form-based editor for any PromQL node type, so you can go into this tree, select any of the nodes, go into this form-based editor mode, and just go wild. This is great if you have a rough understanding of how PromQL works but you're maybe a data analyst, or just not super familiar with every little syntactic detail of PromQL.

A

All of those details are mapped into this form-based editor. But if you're more of a power user, you can also just switch any of the inline tree nodes into a PromQL editor with all the features we saw earlier, change exactly the things you need, and then switch out of that inline editing mode again to get back to your modified tree view.

A

Okay, these were just some features of PromLens. I'm also going to quickly demo it, just to give you a feeling for what it actually looks like to use it. I will probably need to zoom in a bit here for people to be able to see things. This might distort some things, but that's to be expected.

A

If it's really hard to read because of the zoom level, please let me know and I'll zoom in further, but otherwise I'll continue like this. The first thing I want to do is write the same histogram quantile query three different times. For example, I want to calculate the 90th percentile from a given histogram. I could just start out with a snippet and start typing it. This is one way of doing it.

A

A different way of doing it is to go directly into the form-based editor, go over to snippets, and say: hey, I want to calculate the quantile from a histogram. Now I get this tree view here with the placeholders, and I can directly jump into inline PromQL editing.

A

If I would like to, I can navigate with the keyboard in here and select the histogram that I want to look at, and then it already detects that, hey, you probably want to add a rate, because you don't want to look at the histogram over all of its time. So we're also adding a rate. This is the second way to get to the same result.

A

A different way to start might be just to type the histogram you want, and then it already detects: okay, underscore bucket, this is likely going to be a histogram. Do you want to add this very common structure around it? And you get the same thing. You could then adjust the labels and say, I actually want to preserve the status label here as well, so you add that, and then you get the status label as well. So this is one thing, and here is a different thing.

A

You might also want to look at page sharing: you can actually share pages. If I click on this link, it will load a page. I can share this entire page state.

A

I can demonstrate a bit of the drag-and-drop features and things like that. If I now, for example, edit this node inline and just remove the group_left, I will get an error here. In this case, it's not visualized properly yet, but it will be soon.

A

But if I have, for example, group_right where I was supposed to have group_left, then I also get some helpful hints about what a possible fix could be, and in the future there will also be more action buttons for that here. Okay, this is kind of synthetic data. Let's have a look at one realistic-ish example. I have a one-node Kubernetes cluster set up with the Prometheus Operator, deployed using the kube-prometheus Jsonnet files, and these include really crazy alerting rules.

A

So these are quite complex already, and those are the kind of use case that PromLens is targeting. For example, if we're looking at... yeah, I guess you can't read those. If you're looking at the file systems on given nodes and whether they're full or not, there are alerts for that.

A

So we could just copy one of those, for example. If you just look for this in Prometheus and you click on this alert, it's not going to be very intelligible. If we take this entire thing and paste it into PromLens, it's also not going to be immediately intelligible, but at least now we can work with it.

A

In this case we're seeing zero results everywhere, which is what you would see if you did something completely wrong. In my case, I actually just need to choose the right Prometheus server to evaluate this against, and then we will get some results.

A

And now I can start looking at this and see: okay, this actually does return results, but then this filter doesn't. And the predict_linear works, but then the filter also filters stuff away.

A

So this is good: it makes it obvious very quickly which parts of the tree don't produce data. I actually had a screen-share session with one of my customers where we went through their alerting rules and basically copied them all into PromLens, and, I don't want to say half, but almost half of them were broken in some way. So basically my answer for most of them was: yeah, sorry, this is never going to alert.

A

Is that what you want? And this is exactly what PromLens is trying to help you with. All right, so that's it for the PromLens demo, and the last bit is just: please try it out. You can try out a preview version at promlens.com.

A

It has a public repo with a README, changelog, and issues where we can discuss things, and there will be commercial plans available soon. All right, awesome. Thank you.

A

Yeah, I already answered one of them in chat: there will always be a free preview. I think the difference is that the free preview version will be licensed in such a way (it's still in flux) that you're only supposed to use it not with millions of time series, and maybe only for personal reasons, and then, if you want to actually use it commercially, there's going to be a license.

A

I'm currently working out those details with a lawyer, so all of that will be online soon, but I'm totally happy to just shoot you a 60-day trial license that unlocks the link sharing, the Grafana data source picker integration, and all that.

A

Does it work with extended queries, rates of rates? Oh, you mean subqueries. Yes, it does, or it should, at least. Is there anything I missed?

A

How much of PromLens's functionality will be available in Prometheus? So the editor is the main part, which I really think we have to have in Prometheus. The rest I cannot currently justify. PromLabs is just myself, and I can't currently justify completely open-sourcing it after having spent that much time building it.

A

Somehow I need to make a living and eat some.

A