From YouTube: Grafana Tempo Community Call 2023-04-13
Description
Join our next Tempo community call: https://docs.google.com/document/d/1yGsI6ywU-PxZBjmq3p3vAXr9g5yBXSDk4NU8LGo8qeY
What was discussed:
- Red Hat's Tempo Operator
- Tempo 2.1
- Trace by ID tuning!
- TraceQL aggregations built by a community member!
A
Okay, all right, welcome to the Tempo community call, April 2023. What a year! We're going to start with some Red Hat engineers showing us a Tempo operator demo, which is cool, and then we're going to review 2.1. So, is it Andreas? Is that right? Is that how you pronounce your name?
C
Okay, so hi everyone. In the last couple of months we were working on a Kubernetes operator for the Tempo deployment, and exactly a week ago we tagged our first release on GitHub. It's currently in the os-observability GitHub organization, but we plan to transfer it very soon to the Grafana org. Today I want to give you a quick look at, and a demo of, the current state of the operator. I prepared a small minikube cluster, and I set up cert-manager.
C
cert-manager is used to generate the TLS certificate for the webhook of the Tempo operator. Then I set up a load generator, so we'd have a few traces in Tempo once it's set up, and installed the kube-prometheus stack, so we have Prometheus and Grafana running. For the object storage I'm using a MinIO instance, and here is the Tempo operator.
C
We can see the storage: in this example I'm using MinIO with S3. Currently we support configuring S3, Google Cloud Storage, and Azure for a start. This is the name of the secret where the credentials for the object storage are stored, which will later be written to the Tempo configuration file. And the observability field is about observability of the Tempo operator and the Tempo components themselves.
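For reference, such a storage secret for MinIO/S3 might look roughly like this (the key names here are illustrative; check the operator docs for the exact keys it expects):

    apiVersion: v1
    kind: Secret
    metadata:
      name: minio-test              # referenced by name from the Tempo CR
    stringData:
      endpoint: http://minio:9000   # illustrative values
      bucket: tempo
      access_key_id: tempo
      access_key_secret: supersecret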
C
In this demo I enabled the operator to create ServiceMonitors for each type of component, and I also enabled the Jaeger query. That means the Tempo operator starts an additional container with tempo-query, which provides a UI.
C
So I'll just quickly go through the CR. In the images section you can specify which container image you want for the Tempo container. Of course, if you change the default setting, it needs to be compatible with the Tempo version that is the default for that operator version, because if you, for example, just upgrade Tempo here, it probably won't match the Tempo configuration file which the operator generates. But it's nice.
C
For example, if you have some customizations which are not yet released, you can build your own container image and use it here. Then there are some Tempo settings: the limits, which will be put into the Tempo configuration file, and the standard observability settings, which, as mentioned before, are for the operator and the Tempo components.
C
Then we can set the replication factor of the deployment. Resources are the Kubernetes resources for CPU and memory, then retention and so forth. Then the storage, where the storage secret is referenced; storage size is the size of the persistent volume claim used for the ingester component. And in the template we have one section for each component; in this case I just enabled the Jaeger query, so we have the Jaeger UI frontend. In the meanwhile, the load generator has hopefully generated a few traces.
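Putting the walkthrough together, the CR shown looks roughly like this (a sketch: the API group, kind, and field names follow the tempo-operator's published examples, but may differ in your release):

    apiVersion: tempo.grafana.com/v1alpha1
    kind: TempoStack
    metadata:
      name: simplest
    spec:
      storage:
        secret:
          name: minio-test          # the object storage secret shown earlier
          type: s3
      storageSize: 1Gi              # size of the ingester's persistent volume claim
      replicationFactor: 1
      observability:
        metrics:
          createServiceMonitors: true
      template:
        queryFrontend:
          jaegerQuery:
            enabled: true           # starts the tempo-query container with the Jaeger UI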
C
Yeah, maybe it's just slow. It showed up now, but it took a while. Google Meet also always takes a lot of CPU on my machine, so maybe I have some issue with graphics as well. But finally it's here, and you can see the spans. And because I enabled the ServiceMonitors, we also have metrics in our Prometheus instance, all the metrics the Tempo components expose. For example, we can query the build info, and then we have one for each component: the ingester, the distributor, and so on.
C
I also set up Grafana for this demo, and I imported the Tempo operational dashboard from the Tempo upstream repository, which shows us cool metrics. Some graphs are not visible because they use cAdvisor, which I didn't set up locally, but the Tempo ones are showing data. And of course, you can also use Grafana to query Tempo via the search.
A
In the latest Grafana, 9.5 (is that where Grafana is these days?), it's enabled by default; there's no more feature flag, I think. So I think if you get the absolute latest Grafana, it'll just be there. Cool.
C
Yeah, so basically, because there are some breaking changes in the configuration file, we kind of always need to sync the Tempo operator with the Tempo version. So a new operator version will always be required, unless by chance the config file happens to be exactly compatible and upgrading doesn't bring any issues.
C
So yeah, that was basically it for the demo. As mentioned before, we plan to donate the repo to the Grafana org soon.
A
Is there, or I guess in the link, the link will probably have a readme that shows how to actually make use of it now?
A
I forgot: we have a Grafana Slack channel. Cool, awesome. This is awesome work; thanks, Red Hat, for putting the time in and building this. I think it's cool that you all are participating in Tempo and helping us build this new tracing database. Just put it in OpenShift and make it the default OpenShift tracing database next; what's that take?
A
All right, awesome! Thank you so much. Let's move on to Tempo 2.1. I'll review the release notes a bit and talk through some of the big things that are going on here.
A
I'm going to share my screen so we can all see this, because I'll probably pop back and forth between the doc and the release notes a little bit. I think this is the right one; I hope so.
A
Yeah, okay, so 2.1: we cut RC0 yesterday. We had some pretty big merges right before we cut it, or we would have just done 2.1 directly. In particular, we... oops, that's not what I'm about to do.
A
In particular, we updated OTel, and it's a pretty big update, so we're unsure. It's probably fine, but we thought it made sense to be a little bit patient here and not just cut 2.1 immediately.
A
So there are some big changes that went in right before 2.1 RC0; we're going to let those hang out in our clusters for a little bit before we, you know, pull the trigger and do 2.1 itself. I'm on the road a little bit next week (I'm going to be in Chicago for the ObservabilityCON thing), but I really want to release 2.1 next week, and I think there's a good chance.
A
We're going to cut it next week, depending on how slammed I am with the ObservabilityCON stuff. But, breaking changes: we've removed search for the old v2-style blocks. This was announced with 2.0. We're going to leave the v2 blocks in and continue to support them for trace by ID.
A
But we removed all the search code related to that, and that includes a couple of options here and a metric. These just don't make any sense anymore, and they've been pulled from the configs and the metrics. Hopefully the old v2 blocks have fallen out of retention since you installed 2.0 and it won't matter, because you have a bunch of Parquet blocks. And if you're continuing to run v2 because you do your lookups through exemplars or through logs or something, then you can continue, and can always continue, to use it.
A
We changed metric names: tons of our metrics started with cortex_. I've been wanting to do this for a long time; Mario noticed it and finally fixed it. We had tons and tons of metrics like that just because we had these old Cortex dependencies, which are now Mimir and dskit dependencies, but we've changed all those prefixes to tempo_, which is mainly a clarity thing, right?
A
You probably don't even look at any of these metrics, because they're really weird metrics, but if you do, heads up: if you see something break or a dashboard change, you might want to keep an eye on that.
A
And then we added this idea... we're trying to figure out how to do SLO metrics, based on our SLOs for both throughput and latency on a query, and we felt like the best way to do that was to add this feature into Tempo.
A
So you can configure an SLO in Tempo, and it will track metrics for the number of queries that come in and meet the SLO, whether that's by latency or by total bytes scanned, and you can build metrics out of that. We're going to start using this in our Grafana Cloud stuff, which is why we needed the feature. This might be kind of niche, and I'm not sure who else might use it, but heads up in case.
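The knobs look roughly like this in the query-frontend block (a sketch from the 2.1 release notes; verify the option names against the docs for your version):

    query_frontend:
      search:
        duration_slo: 5s                  # a search counts as within SLO if it finishes in 5s...
        throughput_bytes_slo: 1.073e+09   # ...or sustains roughly this many bytes scanned per second

Queries meeting either target increment a within-SLO counter alongside the total, so you can graph the ratio.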
A
Let's see, other interesting things: Go was updated; hopefully that won't break anything too terribly. Feature-wise, we do have a number of new TraceQL features. We support kind, which is a cool one. I don't know why we didn't do this with 2.0; I think we just missed it, but we have kind. It's an intrinsic on the span: every span has a kind, so it can be, say, kind server, and you can search for that.
A
You know, kind client; there are also producer and consumer, and I think some others that I'm not going to remember off the top of my head. Consumer and producer tend to refer to queue-based traces (consumers pull from queues, producers push to queues), while server and client are the more traditional HTTP-request or network-style request types of spans. So it's kind of a cool addition. You can write a query like this now: kind equals server.
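For example (the first query is the one typed in the demo; the second is analogous):

    { kind = server }
    { kind = consumer }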
A
Cool,
oh,
we
added
arbitrary
math.
So
let's
put
some
Trace
keyl
stuff
in
here.
Well,
this
is
a
fun
one
to
add
because
it
makes
for
silly
queries,
but
we
added
arbitrary
math.
So
you
can
do
this
and
I
I'm
just
going
to
make
a
giant
tree
here
math.
So
you
can
do
like
you
know,
span
dot,
bytes
process
greater
than
10
times,
so
you
can
do
this
kind
of
work,
which
is
fun.
So
you
can
write
mathematical
kind
of
statements
in
your
in
your
query
and
it
will
do
the
right
work.
A
It'll correctly evaluate it. You could also do the opposite of this, so we could go, you know, bytes processed divided by...
A
If that makes more sense to you, yeah. It might be useful to write some queries that read a little bit better than having to type that full number in there, and you might find some other uses. Technically you can do stuff like this: span.jobs divided by span.bytes, or something, greater than some threshold number like three. If you have different numeric attributes, you can do math with those as well. So that's another addition to TraceQL in this release.
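Roughly what was typed, with attribute names that are just examples from the demo cluster, not standard conventions:

    { span.bytes_processed > 10 * 1024 * 1024 }   # reads better than typing 10485760
    { span.jobs / span.bytes > 3 }                # math between two numeric attributes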
A
We added new aggregates. Previously we had average and count, I think. We've added min and max, so min of some field greater than a value, and then we've also added max. I thought there was a third one; my changelog entry said... oops, it said min, max, and average, but we had average already, right?
A
That's incorrect; let's go look at that, because average was in 2.0. Sum, my bad! I should fix that changelog entry. So then, sum: I don't know why you would ever want to sum a bunch of durations, but hey, maybe you do, sum of duration greater than one hour. So min, max, and sum are our new aggregates in TraceQL, which is a cool addition.
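Illustrative examples of the new aggregates (the attribute name is made up):

    { } | min(duration) > 1s
    { } | max(span.bytes_processed) > 1000000
    { } | sum(duration) > 1h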
A
We also fixed some bugs. Before this fix, let's see, a number comparison would return nothing, which was kind of difficult to understand. We've fixed it so that all numbers compare to all other numbers. It was due to the data type: an integer compared to a float would always return false. So we fixed all the number type comparisons to work, and any number compared to any other number will convert correctly.
A
So this will now return traces where 2.0 would not have returned traces. Oh, and then I think Jenny, who is not on this call, fixed an issue where you couldn't write duration greater than 1.5 seconds; you had to write duration greater than, you know, one second and 500 milliseconds, basically. Now we've made it so you can do float durations, which is kind of nice and a little bit more succinct.
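That is, both of these now work, the first being the new, shorter form:

    { duration > 1.5s }
    { duration > 1s500ms }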
A
It doesn't really add new abilities, but I think it's a lot more natural to write 1.5 seconds than one second 500 milliseconds, which worked but was kind of annoying. So I think those are the TraceQL fixes and features. This is a big one too... where's the... well, I can't find it.
A
But there is a really nice improvement to TraceQL performance, where we pull a lot less data before we assert the conditions. Previously we would pull far too much data, then assert the conditions and throw a bunch of it away. We made a change to target the data that we pull a lot more tightly, and it makes for much more efficient TraceQL queries.
A
We do have a ways to go to continue to improve it, but in this release you should see some really nice changes. This particular performance improvement, which is particularly about memory and bandwidth, will also let you increase the target job size bytes in the query frontend; you can really beef this up quite a bit. We run it at 200 megabytes in our cloud offerings and, I think, around 700 megabytes in our internal ops cluster.
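That setting, roughly (assuming it's the query frontend's search job size option; check the configuration reference for your version):

    query_frontend:
      search:
        target_bytes_per_job: 209715200   # ~200MB per search job, the value mentioned for their cloud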
A
Is there anything else? Oh, Alton, who I don't see (I don't think he made it to this call), added the ability for the metrics-generator to attempt to upsample, basically, using the sample ratio. The metrics-generator just counts spans...
A
...exactly as it receives them, but OpenTelemetry has added, in one specific case, some information about the sampling ratio of the spans, like how many spans that one span represents, and with this PR you can use that to inflate your metrics to better represent the real traffic, the true traffic before sampling, which is kind of cool.
A
Oh, this is a funny one. This was broken before; you couldn't do this. This is a TraceQL query that didn't work, which was weird: we couldn't write count greater than negative one. That broke. So that's been fixed, and you can do negative values in your aggregates.
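The query in question was along these lines:

    { } | count() > -1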
And I think that might be roughly it. Zach, do you know anything else in here that's worth calling out? The check-config flag... oh, Azure workload identity, so a new way to auth to Azure backends.
A
I think those are the big ones. Yeah.
A
Right, this vParquet2 is kind of a cool change. It's not the default yet, but we're trying to build a Parquet file that works with a wider array of tooling, and we found that some of the choices we made, for whatever reason, are not compatible with off-the-shelf Parquet tooling. So Adrian worked on this, and it brings us closer to that. It also adds a couple of columns that we're going to use for structural queries.
A
Those are queries like the descendant operator, the parent operator, and the sibling operators. We need to add a couple more columns for these, and they were added in vParquet2, so he's working on populating those columns, and then we'll be able to add the functionality for those descendant operators, which I think are probably among the most exciting things about TraceQL. Definitely looking forward to that. Cool.
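For context, the structural operators being referred to look like this in TraceQL (descendant and sibling shown; these don't function until those columns are populated):

    { resource.service.name = "frontend" } >> { status = error }   # descendant
    { span.http.method = "GET" } ~ { status = error }              # sibling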
A
All right, I think that's a good overview of 2.1. Any questions about 2.1, or TraceQL, or just running Tempo, or anything else?
E
Actually, yeah. I was testing... I was talking with Marty because I was having some issues querying. I was doing some querying using some tags that don't have columns in Parquet, looking at the last 24 hours. I had six queriers with two cores and eight gigabytes of memory, and they were just dying all the time, just restarting. Then I tried with the main branch, so basically now the 2.1, and yeah.
E
The problem looks like it's solved now; I can even do a seven-day view that I couldn't do before. Okay, let's say I would like to configure the columns myself. For example, I know that Tempo has http.url (I think it was URL, no, well, star URL, yeah) as a column, but http.target for me is better to aggregate on in some cases, because it doesn't contain query strings.
A
Yeah, so Adrian is working on that project as well. I think he's trying to weigh... there are two directions we've been talking about going, and it's something that's on his list. One is where you configure it, so you statically configure "these are my special columns", and you get to name the columns that get pulled out into those extra columns.
A
He is really hot on trying to just make every column dynamic, though, which would be amazing but, I think, very difficult. So it's a discussion we're having internally, and we are moving forward with it. We totally agree it needs to exist; we want it for our cloud offering, because we have tenants who have all kinds of different columns.
A
...so the information you get back from the query is immediately useful, more useful than it is now. So, write queries and learn things about your traces, and don't jump to a specific trace, but repeatedly write queries to get aggregate information. You could do a grouping, right, like average duration grouped by service name, and I want a result set that immediately communicates that to you. Maybe you then jump over to some specific traces, but I want that data right there, immediately in front of you. That's something we've been working on internally; I've got a couple of design docs out on it, and I really want to move forward. I really want the TraceQL experience to be a learning experience, where you're writing a query and learning about your traces and the structure of your data and iterating on that, and as you're iterating on it you learn, and eventually you get to the results you want. And that comes from my own experience using Prometheus and Loki: when I'm writing a Prometheus query...
A
...I'm learning the whole time. I write a very basic query, and then I aggregate by this, and then I make a change; I do a histogram on a different value; I can pop the histogram quantile around a little bit. And that learning experience teaches you about your application. I really want the same for traces, so it's something I've been trying to drive internally some, and I think that's going to be the next step for TraceQL: a better experience in the frontend.
B
The other thing I'm excited about, that I don't think we're ready to share any screenshots of, is some work going on in Grafana that is going to make viewing large traces way better. We got a little preview of that. I don't think we're ready to show anything, but it's going to be awesome, so stay tuned.
A
Right, all right. Anything else?
A
That's a good question. In our ops cluster... let me go do some digging. We get a trace back in, what do you say, what do we say, Zach, a couple of seconds? And we're talking about...
A
I don't even know if we're at over a petabyte of data, billions of traces. It does require some tuning when you get much larger; there are some touch points, some things you can hit to improve...
A
...trace queries, and I do want to make better improvements to it. It's kind of an area that we've ignored a little bit while we've done TraceQL on Parquet, and I know it slipped from v2 in terms of performance, just the trace by ID search. But in our ops cluster I'd say four to five seconds for, yeah, billions of traces and a terabyte. So it's one of those things that I want to improve, but it's not glaring, and so I have not pushed on it.
D
So in our production cluster... basically, in stage we get data back in two seconds, but in the production cluster we have been dealing with this weird issue in the last two days. One of our clients went rogue and pushed a ton of data, so now we have 40-plus terabytes of data, but when we try to query the last hour, or the last couple of hours, or an hourly time range, the trace queries take more than 60 seconds, or sometimes just time out. So we tried tuning things like the workers.
A
The things I would look at, if you're kind of in this world, are definitely the query shards. Query frontend, trace by ID, query shards: set that to 250. There's a max of 255 because of the way Annanay wrote it three years ago.
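That knob, roughly (assuming the trace-by-ID sharding option under the query frontend; verify against your version's config reference):

    query_frontend:
      trace_by_id:
        query_shards: 250   # capped at 255 today, as discussed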
A
If anybody wants to take that on so the query shards can go higher, please do; it's an issue that's marked as a good first issue, and we just have never had time. But put that at 250; that's a good initial thing to do. Okay, then I think the next question I would have is how long your blocklist is. The longer the blocklist, the more time it's going to take, and I would encourage you to increase compactors. We've seen...
D
The blocklist is at this moment at 50,000. We are trying to bring it down, but yeah, it's at 50,000 for sure.
A
Yeah, so 60 seconds is way too long; we can get it below 60 seconds. But 50,000 is kind of approaching sizes where we have seen longer times for traces, though definitely not 60 seconds. Oh, the size of the trace also matters a lot: if it gets into the tens or hundreds of megs, then Tempo does struggle to return traces at that size. Okay.
A
I would just review the size of the traces. Yeah, we had a customer try to pull a 400 megabyte trace a couple of days ago, and they filed an issue like, "we can't get this trace in"; it's 400 megs. Like, sorry, Tempo cannot return your 400 megabyte trace to your phone. So the size of the trace does have a big impact. What I would do is, after you try those things, start a discussion on the GitHub; that's a great place to have kind of a long-running conversation about tuning.
A
We just went through one on compacting and pushing blocks, and I think we have one going on where somebody's looking at TraceQL sharding. So the best thing you can do is file an issue there and put in your config, give us some good metrics, and it'll probably go back and forth for a couple of weeks.
C
Just curious, you guys were chatting about the blocklist; is that in the compactor config settings, or...?
A
The compactor pushes the blocklist down, but there's no max blocklist length; it's really your retention, right, so it'll start deleting blocks after their retention is up. But then there's a metric called tempodb_blocklist_length, or something, I'm not sure off the top of my head, but that'll tell you...
A
...how long the blocklist is. And yeah, like I said, twenty to thirty thousand has always been fine for us, and it's always been our target; 80,000-plus is when we really started to see problems and started panicking a little bit.
A
It'd be interesting to do retention based on total count, like retention based on 20,000 blocks, whatever that turns out to be, and just delete the oldest blocks when you hit that. I don't know; not sure if anybody would use that or not.
D
Just out of curiosity, what is that blocklist? Is the blocklist the count of blocks which we have in the backend, in the storage? Or is it something else?
A
And the reason that impacts trace by ID search is because we have to look in every single block. And so compaction... if we didn't have trace by ID search, compaction would be way less important, because keeping that list down is what makes search better. Oh, another thing you can do is use memcached. With memcached you can cache the bloom filters, which reduces pressure on S3 or Azure or whatever, and should speed up your queries a bit as well.
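A sketch of that caching setup in the storage block (option names vary by version; check the configuration docs):

    storage:
      trace:
        cache: memcached          # cache bloom filters (and more) in memcached
        memcached:
          host: memcached         # illustrative service address
          service: memcached-client
          timeout: 500ms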
E
Maybe, if you have one minute, I can also show what I mentioned on the last community call. I feel it's very hacky, but I built a little thing in Grafana to aggregate the Tempo traces, well, spans, and I think it's a little bit aligned with what you were saying: how we analyze the data when there is an error, how we go into it to get familiar with what is going on. Maybe I can show it. It is not pretty.
E
I will show you this environment. This... oh, I need to give permissions to this.
A
He's been talking about this for a while, and I've been excited to see it. We might make you write a blog post about it or something. Fast, though, sorry bud.
A
What is your... sorry, I can't see your... what's your ingest rate, is it...?
D
I don't know the bytes; I know the samples yesterday were around 150k, and it went down. We got it down to 70,000 spans per second. I can quickly take a look at that; I don't know.
A
That's pretty good, 150,000. Sizable cluster; that's nice. Yeah.
D
So we have been playing a lot with queriers because, again, of the traffic. We were running like 10 queriers yesterday, and that didn't help, because, again, I don't know why...
D
...but somehow, even though the query frontend said the requests were over, like timed out, the queriers kept on searching for those blocks, and that kind of increased the load on the backend, and so we were getting read timeouts back again. So what we ended up doing was bringing it down to five and seeing if that helped; that kind of helped for a bit, but we are still seeing some issues. So yeah.
A
Let's chat in a discussion; I think you're on to something important. We run like 50 to 100 queriers, so we run tons of very small queriers. Okay, so that would be my recommendation. But I have always been suspicious that a context cancel does not correctly propagate through the entire system.
A
So if you're seeing that, I don't doubt it. If you get us some graphs and metrics and logs to help us track this down, I think we could probably help diagnose it and fix it for both of us.
E
Let me try again: can you see my screen now? Okay, cool. So normally we have these, right: we just write some queries. And there is a trick you can do, for example this, and then Tempo will return these attributes back. That's quite neat, because now I can use this to build aggregations.
E
So if I take this same query here... I just built this very silly thingy, but I can now do aggregation. So now I can see, for example, in this environment, a count by status code. The way it decides which attribute to aggregate on is the one that has the wildcard, so I could put whatever with a wildcard, and then it knows.
E
I don't have autocomplete, so that's not nice, but yeah, now I know it's some service called php-something that is doing all of this, so it already narrowed down the issue for me. A lot of times when I'm looking at errors, I just try to narrow the issue down to see what the root cause is, and so far this has helped me. It's quite simple, but like this I can do nice aggregation.
A
Was that a table or a graph, on a plugin? How did you do that processing?
E
So I'm using, maybe you can share it also, have you seen a plugin called Infinity? So Infinity is a data source, and it allows me to do HTTP requests, and then I have a small backend service. It's just a small Go application that is doing this aggregation for me and sending it back to Grafana already aggregated.
A
In my table... yeah, okay, well, that's cool. I think that's really neat, and it is really in line with what I'm thinking about next, so it's cool that you're already doing it. It's so worth doing that you wrote a whole service to do it and hacked it into Grafana; I think that kind of validates some of the thoughts we've had as we work on this. We'll definitely talk about it on the community call; I want your thoughts on it as we move forward with it, because you're, you know, using something very similar here.
A
All right team, how are we doing? Anything else?