From YouTube: Grafana Tempo Community Call 2023-02-09
Description
Tempo 2.0 discussion and celebration!
Join our next Tempo community call: https://docs.google.com/document/d/1yGsI6ywU-PxZBjmq3p3vAXr9g5yBXSDk4NU8LGo8qeY/edit
Learn more at https://grafana.com and if all of this looks like fun, feel invited to see if there’s a role that fits you at https://grafana.com/about/careers/
A
Okay, all right, welcome everyone. I think this is an exciting community call: it's the first one we've had since releasing Tempo 2.0, so this is great. Thank you, everyone, for coming.
A
We shared the Google Doc in the Slack channel, so if you haven't opened that, we have a light agenda in there, but you're welcome to add anything to it: feedback, issues, or discussion topics. Anything, we'd love to hear it. Otherwise I can go ahead and start off, and we'll just talk about Tempo 2.0. So, all right, we're all really excited about this.
A
This represents months and months of work by dozens of people. It was a huge effort, but we think it's a great release, the best release of Tempo: tons of new features in there, and some very large changes to how Tempo works internally. So we're really happy about that. The big feature in here is probably TraceQL, which is really cool. I mean, we knew.
A
We wanted a query language for a long time, but we kind of needed the new backend storage to power that; the existing one just wasn't going to be efficient enough. So the new backend is Parquet. That was actually released in 1.5, and I know.
A
Some people were interested enough to go ahead and try that, and that was really cool. And of course we were using it ourselves internally for a long time, several months, but it's finally the default. We feel like it's production-ready; we've been using it behind Grafana Cloud for months now, so we feel pretty good. So that's turned on by default, and so is TraceQL.
A
So I guess I just want to say thank you to the whole team and all the community contributors, everyone. It was a huge effort. Yeah, Zach? Cool, so there are a couple of links here in the document: there's a blog post from Joe about Tempo 2.0 that covers the new features and the basics.
A
There are some breaking changes, so when you're upgrading, I think we have a really good upgrade guide in the documentation that covers the breaking changes, and then other things like settings and defaults that have changed, which you may need to be aware of. And of course we have a patch release upcoming; there are a couple of smaller issues and things that we weren't able to get in there in time. There's also a link here to the 2.0.1 milestone, and I.
A
Don't think there's anything critical in there, except that there are some people who are impacted, and I think if you are impacted, you know it. So that's good. Oh yeah, we've got some activity in the document here. And then a couple of days ago there was another blog post, just kind of getting to know TraceQL, so that link is in there too: a good way to get into the basics of it and get up and running. Cool.
A
So I guess I'd like to do kind of a survey here: is anyone running Tempo 2.0 yet, or have you been using Parquet or TraceQL? Yeah, cool.
A
Do you have any feedback or thoughts you'd like to share?
B
Yeah, so with version 1.5 we enabled Parquet, and now we have it with 2.0, and I can say that at least in our ingesters I see that it's working better. Performance improved there. I also see the compactors working better; I don't know if I had some issues before, but now the blocklist looks better, and so far it just feels faster. Yeah, I like it.
A
Yeah, so the version that was in 1.5 was stable, but the performance wasn't where we wanted it to be, and that was a big focus of effort between 1.5 and 2.0: especially getting it to run better at higher scales, better use of memory, handling large traces, things like that. Cool.
A
Cool, yeah: "deployed v2 this morning." Okay, cool, that's great. Except for the breaking changes and things, I think it was pretty good.
A
All right, cool. So, okay, someone mentioned resource usage. Yeah, that's true. 3x? We were estimating maybe about 50% higher, which is what we've been seeing internally, compared to, well, this is with Tempo 2.0; in 1.5 with Parquet it was definitely higher.
A
3x is definitely more than we were expecting. Oh, on the S3 storage? Yeah. But for CPU and memory, I think we're seeing about 50% more resources, but I think it's worth it for the new features. Tempo is still very cost-effective as far as backends go, and I think it was a good choice. Cool.
A
Really, this is all we wanted to cover, and we can talk a little bit more about some of these things that are in here, but this was the big news: this is the first community call since Tempo 2.0.
B
I have seen that using TraceQL I can now query longer time ranges than before; I was limited. There is a setting where you can say, okay, I can query one hour or three hours or so in one go, but now I sometimes do, okay, look in the last six hours, where my default was one hour.
A
Yeah, the v2 backend just wasn't efficient enough, so it had to do a lot of work to search through; one hour was quite a lot of work. But with Parquet, since it's addressing the columns, and with the internal metadata, you can skip over things a lot more efficiently. Actually, 24 hours is something that could probably be higher in the future. Maybe that works for some cases, but, you know, as we improve things.
D
Yeah, I had a question about how to handle scaling up Tempo queriers. I noticed that with TraceQL the memory usage can really spike up, and when I throw a lot of simple queries at it, it really helps to be able to handle some larger, expensive TraceQL queries. Is there a recommended way to scale this up? I noticed there was some documentation about using Lambda functions, or, you know.
D
Other sorts of scaling functions like that, or is an autoscaler sufficient most of the time? Just trying to figure out what are some cost-effective ways, or any recommendations the team might have for handling that.
A
Sure, yeah, so there are a lot of knobs to turn with all this new stuff. One might be the TraceQL query itself.
A
So, just a short recommendation: use scoped attributes if you can. Instead of `.http.status_code`, `span.http.status_code` is more precise, and if you're following the semantic conventions, that's where the HTTP status code would be. When you do that, it lets Tempo know to only look at the span-level attribute column for that one, versus searching everywhere, which is both the resource level and the span level.
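To make the difference concrete, here is a minimal sketch of the two query styles being described (attribute and service names are illustrative examples following the OpenTelemetry semantic conventions, not anything from this call):

```
# Unscoped: Tempo has to check both the resource-level and
# span-level attribute columns for each attribute.
{ .service.name = "checkout" && .http.status_code = 500 }

# Scoped: each attribute is read from exactly one column,
# so the query does less work.
{ resource.service.name = "checkout" && span.http.status_code = 500 }
```

Both forms return the same spans if your data follows the conventions; the scoped form is simply cheaper to evaluate.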
A
So there are ways to make the query a little bit more efficient, and these are just early things; maybe it's not how it was intended, but it's something that works, so maybe we can make it easier in the future. Likewise with `.service.name`, which would be looking at the resource-level service name. I.
A
Think that's pretty common, but if you did `resource.service.name`, that would be efficient for the same reasons: Tempo only has to look at the resources, which is way faster. As far as scaling queries, Tempo is kind of unique: we have serverless, and we use that ourselves to achieve the scale that we need. That's a really good solution depending on your load, and it's cost-effective too, right?
A
Because with serverless, the queries may be infrequent enough to make serverless a lot more cost-effective than running queriers all the time.
A
So if you haven't done that, I think that would provide instant scale. As far as memory, I think our serverless functions run with two gigabytes of memory, and that seems to be okay. You can run thousands of serverless functions, and we do, so that's kind of an instant way to scale. But it does take some operational setup.
A
Yeah, you're right, we've seen querier spikes too, and I don't think that's usually from searching; maybe pulling back large traces can be a thing. As far as queriers, sure, you can scale them up and run a couple hundred querier pods and things like that, and they actually have more consistent performance than serverless functions, as far as startup time and tail latency.
A
But as far as whether there's anything to do about it, I'm not sure. Hey, can anyone else think of any specifics or best practices for scaling the querier pods specifically?
E
I mean, serverless is a good start, but yeah, as Marty said, it's a little more complex to set up. I know we've got some items that we're looking at in the next quarter to help smooth out the query path, but I don't actually know if that's going to help performance necessarily. Yeah, interesting: add more of them.
A
Query shards, so I'll type this in here, but I think that's the setting name. That one is particularly impactful too. It should be equal to or greater than the number of pods you have; if it's smaller than the number of pods, that means it won't actually scale up enough to make use of every pod.
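The speaker is unsure of the exact setting name, and it may vary by Tempo version, so treat this as a sketch only and check the configuration docs for your release; the value shown is arbitrary:

```yaml
# Sketch: query sharding in the query frontend.
# Exact key path and defaults depend on your Tempo version.
query_frontend:
  trace_by_id:
    # Number of shards each query is split into.
    # Keep this >= the number of querier pods so every pod gets work.
    query_shards: 100
```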
D
Oh, interesting, and you're referring to the Tempo querier pods? Yeah. Would you recommend, what was it called, we have the query frontend as well: is that something that needs to be scaled up similarly? I haven't noticed as many spikes there, and I'm not quite sure how it's all wired up, but is that something that needs to be scaled as well?
A
No, not really; just a couple of pods there is all you need, and that can go quite far. They're just kind of coordinators and proxies; they don't do a lot of the heavy lifting. I think two query frontends, just for high availability, can run a couple hundred querier pods just fine. Gotcha, yeah.
A
You know, there are just a lot of tunables for performance: row group size is a setting, maximum block size, maximum traces per block; there's kind of a long list of things to check. But for just that instant scale, I would probably go serverless.
D
Sounds good. Is there any documentation on how to tune these knobs, or maybe is there a need for that? Like, here are some recommended settings, or a guide, perhaps, on tuning that. Or maybe that's kind of hard to do, because it's variable depending on everyone's situation.
A
I think we did change some defaults in 2.0 that we've found work better; some of those defaults were getting outdated. It's kind of a tough thing: that variability in traffic does make it hard to choose one set of settings.
A
So, the grouping in TraceQL... yeah, maybe we don't know everything.
A
It would be able to count the number of times each query was used in a trace. I'm not sure that we have anything in the language for ranking them, like "show me the top five," but it could just show you the count for each one. What also might work is the metrics generator.
A
This is a feature that has been in Tempo for a while: we can generate metrics from spans. It's a separate component in Tempo, and you can customize the attributes that are turned into metric dimensions. So if your attribute was something like `db.query`, you could add that as a custom dimension, I think, to the span metrics. Depending on how long that query string is, that might not be great, I don't know. Mario knows more about this area.
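A minimal sketch of what that configuration might look like (the `db.query` dimension is just the example from the discussion, and key paths can differ between Tempo versions, so consult the metrics-generator docs for the exact schema):

```yaml
# Sketch: enable the span-metrics processor and add a custom dimension.
metrics_generator:
  processor:
    span_metrics:
      # Span attributes to add as labels on the generated metrics.
      dimensions:
        - db.query
```

With a dimension like this, each generated span-metric series carries the attribute value as a label, which is what makes "top five queries"-style PromQL possible downstream.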
G
Yeah, I will link the metrics generator docs so everyone can take a look, but yeah, in essence, with this component you can derive metrics from your spans to get span-level metrics, and additionally you can extract any attributes that are present in the spans as metric labels. So if you have an attribute in your spans that is the SQL statement or something like that, you could extract it and have it as a metric, and then operate with it.
G
Just like regular metrics, and yeah, you could do top five SQL statements. The only caveat or limitation is that this is span-level, so I think for your initial example it would work perfectly, since that would be a span-level question that you want to ask. But yeah, I think it's a different approach to getting metrics out of traces, and so far it's worked well: simple and useful.
G
No, that's the limitation I was referring to: this is span-level. It doesn't record metrics with knowledge of the entire trace; it's per-span only.
A
Okay, I guess otherwise we can wrap up, unless there are any other topics we'd like to discuss. Sure, yeah, Lucas?
C
Yeah, is there a way to filter what the distributors send to the metrics generator? Like, can I say, hey, I only want this service to go?
G
Yeah, so this is an item that we're actively working on. I didn't respond because I was trying to look for the issue, to give the most accurate information on the state of that, but yeah, it's something that we definitely intend on doing. Okay, I.
G
If you want to subscribe to updates. But yeah, essentially, we just want to filter down what we send to the metrics generator. Right now, if you want to cut down on cardinality or the metrics that you're sending, there are still workarounds that you could apply: since we embed the Prometheus agent in the metrics generator for sending metrics via remote write, you could apply relabel configs to drop labels, to drop metrics, and to do all the magic that relabel configs do.
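A hedged sketch of that workaround: the `remote_write` entry follows the standard Prometheus remote-write schema, which supports `write_relabel_configs`, but the URL, service names, and label names below are made-up examples, and the exact label emitted by the span-metrics processor depends on your setup.

```yaml
# Sketch: drop generated series for services we don't care about
# before they are remote-written. All names are illustrative.
metrics_generator:
  storage:
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        write_relabel_configs:
          # Keep only series from services we want APM stats for.
          - source_labels: [service]
            regex: (checkout|payments)
            action: keep
```

As noted in the call, the generator still spends CPU and memory producing the series; the saving is in what reaches your metrics storage.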
C
Yeah, that's good. I was just curious, because we do, I forget exactly, I think it's like one and a half or two terabytes of traces a day, but we only actually care about the APM statistics on just a handful of those services. So it's a lot less workload for the metrics generator to just not be sent that at all. When I was using another APM tool, I was doing the filtering at the OTel agent.
C
You know, send everything to Tempo and then only send a subset to this other tool that was building the APM stuff, and now I'm trying to switch from that APM tool to Tempo for APM. So I was just wondering if there's a way to do it, but it's not a huge deal, just mainly whether it's on the roadmap.
G
Yeah, makes sense. Then hopefully we'll have an explicit solution, just filtering in the metrics generator itself. For now, if you want to investigate, you can still apply relabel configs to the Prometheus agent, and that will do all kinds of magic, like dropping specific metrics or specific labels. I mean, it will still eat into memory and CPU in the metrics generator, but since you're not writing them, you're still saving a lot in your metrics storage.
G
Maybe this is too deep to get into relabel configs if you're not familiar with them; if you want to chat, maybe we can move the conversation to the public Slack or somewhere else and talk more in depth.
A
Okay, all right, well, I guess we can wrap it up. Again, happy to see all the new faces, you're definitely welcome. We're really excited about Tempo 2.0, check it out, and I guess we'll see you next month. Okay, bye, everybody.