Description
Sponsored Keynote - Connecting Prometheus and OpenTelemetry Data for Faster Troubleshooting - Ramon Guiu, VP of Observability, Timescale
The last few years have been fantastic for observability practitioners with the growth of Prometheus as the standard for metrics monitoring and the emergence of OpenTelemetry as a standard for application monitoring. Interoperability is key for standards to be adopted and successful. In this case, these two standards can make it easier for engineers to both instrument their systems and troubleshoot problems faster. In this talk, we will show the true power of Prometheus and OpenTelemetry working together.
Hello everybody, and welcome to Prometheus Day Europe 2022. My name is Ramon, I work for Timescale, and today I'm going to be talking about correlating data from different sources, in particular Prometheus and OpenTelemetry, for faster troubleshooting.
I've been working on building observability products for the last few years, and this is a challenge I've always encountered.
As I talk to users of those products, I find they don't typically use just one tool; they use a lot of tools. If you just look at the cloud native observability landscape and all the different tools that are there (which, by the way, are not the only ones that exist; that's just the part within the community), there are a ton of them, and you are probably using more than one. Quite often the challenge is that you're collecting data and getting that data into different systems, and you have to correlate it somehow.
So the first point is that interoperability is key, and in particular this is about data: how you can get the telemetry, the metrics, logs, and traces, flowing through different systems, so you can more easily correlate the data and, with that, hopefully also troubleshoot problems faster.
Luckily, the CNCF is sponsoring and supporting two standards that have a lot of adoption and a lot of momentum. On the one side, for metrics, we have Prometheus, with its exposition format and the OpenMetrics standard. On the other side we have OpenTelemetry, for metrics, logs, and traces. Prometheus is obviously very widely adopted, and OpenTelemetry, as a standard, has a lot of momentum: there is still a lot of building happening, but it's the second most active project in the CNCF and also the second by number of contributors. So there's definitely a lot of momentum on both sides. The question is: as time goes by, most of you will probably end up having data generated using those two standards, so how do you correlate that data together? That's what I'll try to cover here. Before I start, I just want to paint a picture of what a high-level architecture of this system would look like.
So you have your services and infrastructure, and you're generating Prometheus metrics out of them, and they go into Prometheus. In this case you also store them in Promscale, which is a long-term store for Prometheus, so you can do long-term analysis and things like that. But the key thing here is that, as you start adopting OpenTelemetry as well, you'll have metrics and traces that come from OpenTelemetry, and OpenTelemetry doesn't have a backend: you have to store that data somewhere.
The first consideration is the metrics. You already have Prometheus, you know how to use it, and you probably want to have your data in there. So the first thing you have to figure out is how to convert your metrics from OpenTelemetry into Prometheus. Luckily, there is a component called the OpenTelemetry Collector that does a lot of wonderful things. It can receive data in a lot of different formats via components called receivers, then process that data (to do things like sampling or batching), and then export it to a lot of different solutions via exporters, one of them being Prometheus. What this architecture, and this configuration of the OpenTelemetry Collector, shows is the Collector taking OpenTelemetry metrics, transforming them into Prometheus metrics, and sending them to Prometheus via the Prometheus remote write exporter. For traces, it only does some processing and then still exports them in the OpenTelemetry format (OTLP stands for the OpenTelemetry Protocol). In this case, traces are being stored in Promscale, and because Promscale supports both Prometheus metrics and OpenTelemetry traces, we're storing all the data there and connecting Grafana to it so we can query it. You can query all the metrics using PromQL, but Promscale is also built on top of Postgres and TimescaleDB.
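As a sketch, a minimal Collector pipeline along these lines could look like the following (the receiver, processor, and exporter component names come from the upstream Collector distributions; the endpoints are placeholders, not Promscale's actual addresses):

```yaml
# Sketch of an OpenTelemetry Collector config: receive OTLP,
# batch, send metrics out via Prometheus remote write, and pass
# traces onward as OTLP. Endpoints are placeholders.
receivers:
  otlp:
    protocols:
      grpc:

processors:
  batch:

exporters:
  prometheusremotewrite:
    endpoint: http://metrics-store:9201/write   # placeholder
  otlp:
    endpoint: trace-store:9202                  # placeholder

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

The metrics pipeline converts OTLP metrics into remote-write samples, while the traces pipeline batches spans and forwards them unchanged as OTLP.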
So you can also use SQL to query both metrics and traces and do some interesting correlation.
Let's talk about metric and trace correlation first. One very common way, or at least the one we most typically talk about, is correlation via exemplars.
Say you have some Python code that is instrumented with both Prometheus and OpenTelemetry. Here we're creating a histogram metric to measure the duration of API requests to our service, and here we're recording a new OpenTelemetry span every time the random endpoint (the random method in that API) gets called. To correlate them using exemplars, what we do is add additional metadata when we add an observation to the histogram we created for API duration. The exemplar is this piece here: a piece of metadata, a set of attributes (in this case just one), that references data outside the metric set.
In this case, that outside data is a trace: the trace ID. When you do that and you fetch the metrics from the Prometheus endpoint of that service, this is what you get: on the left side you see the typical exposed metrics in the Prometheus exposition format.
If you enable exemplars in Grafana (this toggle here, which I believe is enabled by default), and exemplars were sent to Prometheus or to Promscale (which also supports them), it will show the data points you see here. Those are the exemplars: individual traces and how long they took. If you put your mouse over one of those dots, or click on it, you'll get this pop-up, and if you click on this button you can jump straight to the trace.
The hope here is that you're getting an example of a trace that took a certain amount of time, within the percentile or the bucket that you're looking at, and then you can see where the time is spent, as long, obviously, as that trace is representative of all the traces that fall within that bucket or percentile.
So that's the whole idea here. Instead of trying to figure out which traces were generated while this metric had these values, you can jump straight from the metrics to the traces. The other way, which is probably simpler but still really important, is correlation via labels and attributes. OpenTelemetry has the concept of an attribute, and it's basically the same thing as a label in Prometheus.
So the only thing you have to do, if your service was already instrumented with Prometheus metrics, is this: when you add traces, don't forget to also add the attributes that you're using in your Prometheus metrics; in this case, endpoint and instance. The syntax to do this is very similar (again, this is a Python example). When you do that, you can do things like this.
For example, this is a dashboard where you have a filter at the top (you're filtering by service). At the top it's showing metrics: these could be queries using PromQL, shown in charts. But at the bottom, especially the two panels on the bottom right, it's showing queries on traces, so you can actually see the performance of your service with the three golden metrics, but also the traces, and the slowest traces.
So you can jump straight into those, and maybe even into errors: there is error information in trace data, so you can see which errors are the most common. In any case, you can correlate the data visually in a dashboard. There are other things you could do in the case of Promscale, because you have SQL: you could run a query that returns all the hosts where there were traces or spans with the most errors, and then do a subsequent query, or a join, to retrieve and plot in a chart the memory consumption on those hosts, so you can try to understand if there is a problem; maybe memory is growing or peaking at some points. Going even further, you could do another join, all of that in the same query, to retrieve the exact processes that were consuming the most memory at that point. That gets you very quickly from spotting a problem to a much deeper understanding of what could be its source. So using labels and attributes is actually very powerful, especially if you can do joins on the data.
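To illustrate the shape of such a join (the table and column names below are hypothetical placeholders for illustration, not Promscale's actual schema):

```python
# Sketch: build a SQL query that joins the hosts with the most
# span errors against their memory metrics. Table/column names
# ("spans", "host_memory", etc.) are HYPOTHETICAL placeholders.
def error_hosts_memory_query(limit: int = 5) -> str:
    return f"""
    WITH error_hosts AS (
        SELECT host, count(*) AS errors
        FROM spans                      -- hypothetical trace table
        WHERE status = 'error'
        GROUP BY host
        ORDER BY errors DESC
        LIMIT {limit}
    )
    SELECT m.time, m.host, m.memory_used_bytes
    FROM host_memory m                  -- hypothetical metric view
    JOIN error_hosts e ON e.host = m.host
    ORDER BY m.time;
    """
```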
Another kind is metric-to-metric correlation. The first thing to take into account is that OpenTelemetry metrics and Prometheus metrics have different metric types, so they need to be mapped, and here you have the mapping; I won't get into it in detail. Another thing to keep in mind is that there may be some types you cannot map. An example would be OpenTelemetry's exponential histogram, which doesn't have a way to map into Prometheus metrics. That mapping is defined, by the way, in the OpenTelemetry spec, and there were a lot of discussions between the OpenTelemetry and Prometheus projects to arrive at this mapping and definition of the metrics.
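As a rough summary of that mapping (simplified from the OpenTelemetry–Prometheus compatibility spec as it stood around this talk; consult the spec itself for the authoritative rules):

```python
# Simplified, non-exhaustive summary of the OpenTelemetry ->
# Prometheus metric type mapping (circa 2022).
OTEL_TO_PROM = {
    "Gauge": "gauge",
    "Sum (monotonic, cumulative)": "counter",
    "Sum (non-monotonic, cumulative)": "gauge",
    "Histogram (cumulative)": "histogram",
    "Summary": "summary",
    # No classic-Prometheus equivalent at the time of this talk:
    "ExponentialHistogram": None,
}
```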
That's something to keep in mind: make sure you're using metric types that you'll be able to convert, so that you can map them together, especially if you're going to be storing them in Prometheus. And again, with metrics, most likely the only thing available for correlation is labels, so once more you'll correlate using labels and attributes.
So here is the same idea in code. This is the same code instrumented with a Prometheus client library on the left and with the OpenTelemetry SDK on the right, and except for the beginning, the rest is actually fairly similar. You define the metric: there's a name, and a description (or documentation, in the case of the Prometheus client library), and then you just increment the counter, adding some labels to it.
In this case we add the name of the API endpoint, which is add_product, as an example. If you do that, then you can start correlating metrics, again in a dashboard; you could be filtering data from both Prometheus and OpenTelemetry.
In this case, most of those charts (Grafana panels) are from Prometheus metrics, but the one on the top right is actually from a service instrumented with OpenTelemetry that is reporting metrics. So you can show, filter, and see in the same dashboard, for a specific service, the telemetry coming from OpenTelemetry as well as the telemetry coming from Prometheus. So, just to wrap up: tool interoperability is key.
At the moment, I'm really happy that we're seeing so much momentum, obviously with Prometheus on the metrics side, but now with OpenTelemetry as well, especially for traces, because that gives us the tooling and the foundation that we need. Then you have to think about planning: when you're doing instrumentation, plan carefully to make sure that you'll be able to correlate the data in the future, especially by using consistent tagging across signals.
Maybe think about using exemplars, and also choose your metric types carefully, so you can do the mapping correctly. Thank you very much. By the way, we have a booth just outside, so if you want to talk more about this, I'll be sitting around and happy to discuss. Thank you.