From YouTube: Tempo Community Call 2021-02-11
A: But that doesn't let us do cool stuff like make dependency graphs based on real traffic, do aggregations, and get average real latencies based on our traces; stuff like that doesn't work. Our plan was to use Dataflow and just iterate over the chunks in our bucket, which works, sure, but it's going to get really complicated as compression gets added and a bunch of other things come in. So that's why I was like: I want to hear more about this query frontend, so I don't have to deal with Dataflow.
B: That's pretty cool. So Joe added compression here, so he's to blame for all your troubles with parsing blocks with compression. But I don't know how much the query frontend can help with that; really it's designed to be something that can help scale out the query path. So what the query frontend does today is split each incoming query into shards, and each shard is basically a range of blocks.

B: Every block has a UUID, and the shard says: search between this range of UUIDs. The queriers pick up each of these smaller shards and work on them in parallel, and if we have multiple hits from the different queriers, they'll send all of those results back to the query frontend, which will merge all of the split traces and return the whole trace. I see.
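To make the sharding described above concrete, here is a minimal Go sketch of carving a UUID keyspace into shard ranges. It is only an illustration of the idea from the call, not Tempo's actual implementation; the shard count and output format are made up for the example.

```go
// Illustrative sketch of splitting the block UUID keyspace into shards, as
// described above. Not Tempo's real implementation; shard count is arbitrary.
package main

import "fmt"

func main() {
	const shards = 4
	for i := 0; i < shards; i++ {
		// Blocks are named by UUIDs, so the keyspace runs from 0x00... to
		// 0xff.... Each shard owns a slice of that range; a querier that
		// picks up shard i only opens blocks whose UUID starts in its slice,
		// and the query frontend merges the per-shard results afterwards.
		start := i * 256 / shards
		end := (i+1)*256/shards - 1
		fmt.Printf("shard %d: block UUID prefixes %02x through %02x\n", i, start, end)
	}
}
```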
D: ...about how that could work with Tempo, how we can do that with our existing backend, and what we would need to add or change to make that work. So yeah, that's kind of on our long-term roadmap.

D: But it's certainly nothing that's going to happen in the near, near future. We generally want to make a few... I think there's honestly a very short list of changes needed to really feel comfortable with the key-value store being in the millions of spans per second, and we're already kind of working on that, maybe design docs or floating around ideas about where to take it next. And definitely a query language to do search and metrics is on this long, long list somewhere. That makes sense.
D: The service graph would be hard. So when I'm thinking of a query language, I'm thinking of things like: give me, you know, a p99 of a certain span's latency over a given time period. To generate a dependency graph, though, you would need a different kind of question, right: you need something like, what are all the connections between all my services?
D: ...you know, some kind of TSDB or time series data. That's a different query. That's very interesting; I wonder if something...

A: Yeah, that's one option. Another option is to have some kind of query language where you could say: give me all the trace data for these criteria, and then you can process it like you want, right, and just do any aggregation that you might want to do. As long as you have some kind of starting point, like give me all traces that include this service, you can kind of go from there, since, like, if you know your critical path, which you usually do...
D: I think the issue with that is just the sheer amount of data you would need. I mean, we take, I think, something like seven to eight thousand traces a second, and if you wanted to generate a dependency graph, you'd need to see a good percentage of those. So let's say you want to generate a dependency graph for the past hour: that's 8,000 times 3,600, roughly 29 million traces. That's a whole bunch of data to request from the backend and then start doing some kind of work on.
D: I think a really cool path here might be this: the collector, the OpenTelemetry collector, which the Grafana Agent (or at least the trace pipeline in the Grafana Agent) is built on, already has a PR up to add a metrics processor, where they would, you know, calculate metrics live based on spans streaming through.

D: I think it'd be really neat to extend that, or add a new processor, work with that community to add a new processor, that would generate the data necessary for a dependency graph, store it in Redis or something (who cares), and then query it out of there.
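As a rough illustration of the kind of processor being described, here is a Go sketch that counts parent-service to child-service edges from a batch of spans. The Span type and field names are invented for the example; this is not the OpenTelemetry collector's actual API, and a real processor would stream spans through and push the counts somewhere like Redis.

```go
// Illustrative only: count service-to-service edges from a batch of spans.
// The Span type here is invented for the sketch, not the collector's real API.
package main

import "fmt"

type Span struct {
	SpanID   string
	ParentID string
	Service  string
}

type edge struct{ Parent, Child string }

func serviceEdges(spans []Span) map[edge]int {
	svc := make(map[string]string, len(spans))
	for _, s := range spans {
		svc[s.SpanID] = s.Service
	}
	counts := make(map[edge]int)
	for _, s := range spans {
		if parent, ok := svc[s.ParentID]; ok && parent != s.Service {
			// Only count edges that cross a service boundary.
			counts[edge{parent, s.Service}]++
		}
	}
	return counts
}

func main() {
	spans := []Span{
		{SpanID: "a", Service: "frontend"},
		{SpanID: "b", ParentID: "a", Service: "cart"},
		{SpanID: "c", ParentID: "b", Service: "postgres"},
		{SpanID: "d", ParentID: "a", Service: "cart"},
	}
	fmt.Println(serviceEdges(spans)) // map[{cart postgres}:1 {frontend cart}:2]
}
```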
D: That's true, that would not work well for cross-cluster work unless you had some kind of centralized data store, right, that you're putting this all in. I don't know, that's a good point... which is Tempo.

I: Would you say your dependencies, like your dependency graph, is kind of stable over time? Or, like, how much change is there, let's say, over a given day or an hour?
A: Yeah, it's mostly stable. The main value I see in running this regularly is that you can notice when someone sneakily adds a dependency, or accidentally adds a dependency. That's why you want to run this, for example, on a weekly basis; but it's not this thing that you would need to calculate every minute, right, it's more... Great, okay!

A: Every week, I want to take the last... I'm going to take one percent of the last week's worth of traces and just have a trawl through and see what I get, something along those lines. And then the challenge is getting a representative set of traces to actually do that with. You can obviously take them all, that's the obvious answer, but if you can have some kind of query language where you can somehow get a representative sample, that's better.
I: Right, yeah. Amazon X-Ray has a service graph that is built up, but it requires a time range, so it's kind of similar: for this given time range, here's what the dependencies look like, so the dependencies change every time, it thinks. Yeah, that's fine! Now, Joe, do you know about the Jaeger dependency graph, how it's generated? Is it across all time? Is it maintained, like, statically outside and added to, or is it within a time range, like how X-Ray's is?
A: Yeah, and in our case it's also not just about the dependency graph, it's also about: okay, what can I expect will happen if this thing breaks? You can do all kinds of fancy things once you have a dependency graph, like computing aggregate SLAs; but basically, if this thing is at that level, and this thing is at that, what can I expect of the thing that uses both? Stuff around that level is what we actually think is pretty useful.
D: Yeah, yeah, I really like the idea of what you're trying... Are you, Luna, are you with... I think I was just told about somebody who's writing batch jobs against Tempo.

D: You know, of all the four hundred thousand companies, it's Embark doing it... Embark, is that right? Okay, very cool.
D: It's such a neat idea. I'd really like to meet with your engineering team and see what you all have done and get some idea of what's going on there. I really like the idea of batch jobs for this. When I was thinking... yeah, like, when I was thinking of a query language, I wasn't really thinking of something that would fill this need, but if it is a need, we need to kind of document it.
D: Do you all mind maybe writing up an issue on the repo, kind of showing the work you all have done and saying what you all would like? We can use that to kind of document this feature, like service graphs and how Tempo could support that, and maybe use that as a running document going forward.
A: Gladly. The person who is invested to work on this is actually my boss, our director of infrastructure and services, who did a bunch of stuff like this at DICE, so I'll hit him up and ask him to write something up for now.
D: Very cool, yeah. It's a neat idea and it was not the direction we were thinking, so definitely getting it documented would be great, so we can kind of keep an eye on it and think about how we could do it ourselves as well, roll it up into what... What language is it? Is it Dataflow? It's Dataflow.

A: So Dataflow is this Google thing that essentially lets you iterate over things in the GCS bucket, and since chunks are stored in the GCS bucket and have a fairly easily decoded, fixed encoding format...
A: Oh, easy, right: you just write a small Java class that can extract data out of them and you're golden. Very cool. Obviously we knew that that was going to break, right: as soon as the data format becomes more complicated and you start looking at encryption, compression and all that stuff, that falls apart. It's fine, sure. That's why my thought was more like, okay...
D: Yeah, it's just hard. So what we've been looking at, when we talk about a query language, is just the amount of data we have to scan to answer questions, right. We're currently... what are we at, close to 50 meg a second in proto? Is that right? Yeah, something like that. So, you know, 50 meg a second times 3,600 seconds is going to be, you know, 180 gig in an hour, call it 175 gig in an hour. So to answer a question over an hour definitively, you'd have to scan 175 gigabytes of data. Looking at some internal Loki metrics, I think they were hitting... I think they're hitting around 10 gigs a second; is that what we saw, Marty?

D: Sure, let's just say 10 gig, so that would be 17 seconds. Is that terrible? That's not terrible: 17 seconds to answer a question over an hour, to scan a full hour of the data, and that's without duplication, without replication factor, assuming you perfectly compacted your time period. But that's a little bit more than I want for an hour query.
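Written out, the back-of-envelope arithmetic above looks roughly like this. The ingest and scan rates are the rough figures quoted in the call, not measured constants.

```go
// Back-of-envelope arithmetic from the discussion above; rates are the rough
// figures mentioned in the call, not measured constants.
package main

import "fmt"

func main() {
	const (
		ingestMBPerSec = 50.0   // ~50 MB/s of proto coming in
		scanGBPerSec   = 10.0   // rough scan rate seen on internal Loki metrics
		windowSec      = 3600.0 // one hour
	)
	dataGB := ingestMBPerSec * windowSec / 1000.0
	fmt.Printf("data written per hour: ~%.0f GB\n", dataGB)       // ~180 GB
	fmt.Printf("time to scan it: ~%.0f s\n", dataGB/scanGBPerSec) // ~18 s
}
```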
D: So I think... I think something that's more aggregate, right, like...

A: Yeah, something... very true. I can't just tell... I can't just tell Tempo: hey, give me a representative sample of my traces that include service A. And the "make it representative" part is what's hard, right: actually figuring out, okay, these two traces look the same except for some small difference in timings, so roll them up one way, and stuff like that. It's like...
D: Super hard, that would be extremely difficult. What might make more sense instead of that is... so a trace tends to have, you know, the service name or other metadata that lets you know when it crosses trace boundaries, or sorry, service boundaries. So maybe if the question were, like, what are all the relationships and counts: show me every ser... er, just give me a list of parent and child relationships in a query, and the number of times each happened.

D: That might be something, right. That would be, I think, more tenable: it would be a lot less data to return and more directed at what was attempting to be done, which is build a service graph, right. Of course, what else? What other kinds of data do people use in service graphs? Do they want latency? Do they want error rates? What else are people... I've never really used a service graph that I liked.
A: There's a bunch of things. There's: what's my p50, p95, p99 latency between these two services? What's my error rate between these two services? How many times do I have to retry between these two services? And then also comes the question, how the heck do I tell a retry from a regular request? Like, there's lots of things there, and for a bunch of them the answer is: make some metrics, record your metrics, go home. That's a perfectly valid answer for a lot of them, right.

A: But that's not a valid answer for a bunch of them, like actually generating the service graph and the dependency tree; that level of stuff you can't do with metrics. So I think the focus should be: what can Tempo answer that metrics cannot, and kind of gear it towards that, because for everything else, record it, damn it, and you'll be fine.
D: Yeah, I agree 100%. Metrics are just too good at storing what they're supposed to store, in terms of compression, the size you can keep them on disk at, and how fast they are to query and scan and do all these things. Generating metrics out of traces is like a last resort: here was a metric we couldn't capture due to cardinality or some other reason, or we just don't have a metric for it yet, and we want to ask a question that our metrics don't answer just yet, or can't answer, perhaps. And that's where you want to go query something like logs, or Tempo for traces. So yeah, this is a cool idea; definitely get an issue up, and I think that'd be a good place to chat about it. And yeah, I really want to meet Luna's team and see what you've done internally.
D: Cool. I don't think I captured half of that in the notes; I tried to get some of it, but I only got a bit. Marty or Ananya or whoever, if somebody could also try to help keep track of what's going on in the meeting notes, I'd appreciate it. For anyone who showed up during that whole conversation: we are just getting started on our community meeting here.

D: Luna had some really good input on building graphs and other information out of distributed tracing, as well as on kind of pushing Tempo forward in terms of the queries that we can use with Tempo; right now it's just a key-value store. We have an agenda in the document linked; let me link it again in the chat for everybody who showed up after I did, and please feel free to add anything.

D: You can treat this kind of call as an AMA or a community... or like an office-hours kind of thing; we're just here to chat. I do have some agenda items if anybody doesn't have anything directly to add, and also this isn't really a canned presentation thing, so feel free to jump in and talk whenever and add whatever you... whatever you, you know.
A: Another potentially interesting topic I have is doing tail-based sampling with Tempo. I'm not sure if anything there is planned, but I'd be surprised if it wasn't discussed before. Like, in my perfect world, everything I have in my clusters just traces 100%, and I can actually decide whether to save the trace or not when I can look at it at the end, as close to Tempo as possible, right. Has any work been done towards something like that? Or... yep.
D: So, let's see here. The trace pipeline in the Grafana Agent is based on the collector; either is fine for pushing traces into Tempo, but recently JP from Red Hat added support for tail-based sampling in the collector. So we're looking at also getting that into the agent and kind of playing with it in our environment, and trying to get a feel for, I suppose, how far it can scale: how many spans per second is it going to handle, how much CPU does that cost, and memory?

D: And all these things... it's a little... it's not obvious how to do it, and I think we want to try to help, you know, make it obvious, put together some documentation to help with that. But it's split: are you familiar with the OpenTelemetry collector? Have you messed with this thing at all?

D: Out of necessity, right; familiar out of necessity, that's right. So let me find... oh, it's in the contrib repo, I think. So there's a tail-based sampling processor in the collector now, but it requires the full trace to go through one collector, because it's not, like... right. So let's see if I can...
A: Yeah, and in practice, unfortunately, the moment you deal with a cross-cluster request, you're back to: oh, I don't have my full trace.

B: ...pieces from these individual clusters and then run a tail-sampling processor there; that would be huge, like egress, like network cost, just shipping so much.
D: The first one... the first one just does the actual work of trying to batch up a trace in one piece and make a choice about whether to pass it on or drop it. And then the second one I linked actually attempts to route a trace to one collector: it will batch up a trace, wait however many seconds, and then choose one of a set of downstream collectors (or, is that upstream? whatever, downstream collectors) to send the whole trace to, so the second collector can make the choice, right, because it's supposed to have the entire trace. So you actually need, like, two layers of collectors.
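A minimal sketch of the routing idea just described: hash the trace ID to pick one downstream collector, so every span of a given trace ends up in the same place and a tail-sampling decision can be made there. This is purely illustrative, not the collector's real load-balancing component, and the endpoints are made up.

```go
// Purely illustrative: route each trace ID to one downstream collector so the
// tail-sampling decision can be made with the whole trace in one place.
// Not a real collector component; endpoints here are made up.
package main

import (
	"fmt"
	"hash/fnv"
)

func pickCollector(traceID string, collectors []string) string {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return collectors[int(h.Sum32())%len(collectors)]
}

func main() {
	downstream := []string{"collector-0:4317", "collector-1:4317", "collector-2:4317"}
	for _, id := range []string{"trace-aaa", "trace-bbb", "trace-aaa"} {
		// The same trace ID always maps to the same collector.
		fmt.Println(id, "->", pickCollector(id, downstream))
	}
}
```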
A: Yeah, my thought was Tempo should have some kind of rules here, or... well, actually, I guess practically you could put a collector straight in front of Tempo, running in the same cluster, that gets the traces instead of Tempo, makes a decision and sends them along. But obviously, if you're using Grafana Cloud, that doesn't work, and you're introducing an extra point of failure. But that is... that is expected.
B: So what Luna just mentioned really sounds like a great feature to add to Tempo, where you can say: hey, this is an allow list of conditions that need to hold for me to store anything at all to the backend; the latency of my trace should be at least so much. We can probably add these conditions when we marshal into the trace object at, I guess, the distributor level.
D: I wonder... something we talked about a long time ago, but haven't done, is to kind of have a way to keep certain traces for longer. So interesting traces, traces that have failed, traces that are latent or whatever: keep them for a month; and traces that succeeded and are under your latency requirements, which you don't care about, you know, get rid of them in two days or something like that. That'd be kind of cool, too.

D: That way you can kind of keep everything immediately, so your logs-to-trace-ID jump will still work all the time and you can feel comfortable with that; but then, when you're wanting to see certain information a month later, maybe you can kind of keep the things that make sense. We could kind of do that at compaction time. That's another really good idea, so yeah, you're welcome to make that issue as well; please do.
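Sketched as code, the retention idea floated above might look something like this. It is a hypothetical helper to show the shape of the rule, not an existing Tempo feature; the thresholds are made up for the example.

```go
// Hypothetical sketch of the idea above: keep failed or slow traces longer,
// drop fast successful ones early. Not an existing Tempo feature.
package main

import (
	"fmt"
	"time"
)

func retention(hasError bool, duration, slowThreshold time.Duration) time.Duration {
	if hasError || duration > slowThreshold {
		return 30 * 24 * time.Hour // interesting traces: keep for a month
	}
	return 2 * 24 * time.Hour // everything else: gone after two days
}

func main() {
	fmt.Println(retention(false, 120*time.Millisecond, time.Second)) // 48h0m0s
	fmt.Println(retention(true, 120*time.Millisecond, time.Second))  // 720h0m0s
}
```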
D: Yeah, we also kind of are trying to balance changes we make for the open source product against, you know, Grafana Cloud. For Grafana Cloud it makes way more sense to put this in the agent, right, because that's where people want it: they don't want to send us the trace and pay for that and then have us drop it two days later; if they're going to pay us to keep it, they may as well...

D: You might as well keep it the whole time. But for the open source option, for people who are running this in their own clusters, yeah, I think that would be really valuable: you could reduce storage costs quite a bit if you could do some kind of filtering or whatever.
D: ...at compaction, cool. Let's see, so let's talk a little bit through this agenda here. 0.5.0 was released; that was released, I think, shortly after the last meeting, and we're going to say 0.6.0 is about to be released now. Marty or Nina, do you want to go over the 0.5.0 notes?
I: Right, yeah. So I guess the biggest feature was support for Azure blob storage. That was a community PR, I guess, so that was really cool to see. The query frontend was in that release... gosh, it's been so long, I didn't realize it was in that release, but that is a very scalable way to query the data, right.

I: It's a separate module that's independently scalable. So we did have a breaking change in there for the communication signature between distributors and ingesters, and that means there's a new endpoint on the ingesters. The ingesters have to be rolled out first to add the new endpoint; we don't drop the old endpoint, so after upgrading the ingesters, the distributors can be upgraded afterwards to push to that new endpoint. Disk-based caching is removed; probably it's...
I: Memcached and Redis, yeah, yeah, yeah. So this was another change to kind of help control ingester performance, so it can target... well, I'm sorry, never mind, that's in 0.6. So, a change here to just control some burst settings, and then other fixes; I don't know if we want to read through all of those, but the link is there.
D: Yeah, and then upcoming in 0.6.0... Marty, you had some good performance improvements; do you want to talk about those? Yeah.

I: Sure, yeah. So, well, one thing we're kind of working towards... nice, oh yeah, that's awesome! There was... who had a 3D-printed Tempo thing? Or maybe that was a Grafana thing, I thought.

H: Yeah, Richie can hook people up with swag, so at least stickers, at least if you're external to Grafana. So yeah, nice, yeah.
I: Yeah, so in the next upcoming release, probably the biggest thing is compression of the storage, and that is significant. The default right now is Zstandard, but there are a couple of other algorithms there that may work better, you know, for certain data. The savings, I think, were a 75% reduction, or something like that, so I mean, it's significant.

I: Storage was previously the largest cost driver, so I think that'll really help. The overhead of Zstandard is that, you know, we'll use more CPU, right, so it is kind of the heaviest algorithm; but like I said, there are multiple ones there that may work better if you need that, yeah.
D: Yeah, real quick, I can share... can I paste this anywhere?

D: Not my... not my department. So, the compression ratio for "none" is of course 100% there, right, and we're taking a two-gig block and we batch them; I think internally... well, right now it's by bytes, but at the time that was 200 traces per compression page, essentially. Gzip was the most expensive one in terms of CPU, but it also wasn't even the best one: Zstandard had the best compression and kind of a middle ground in terms of CPU cost, and so that's why we chose it. It was a little rocky at first; we had our queriers and, what was it, compactors and queriers just OOMing constantly, a serious memory leak, which we fixed.

D: So what we consider kind of supported, I suppose, in Tempo is "none" (no compression) and Zstandard. Those are the ones we've run at scale, and those are the ones we feel comfortable advising people to use. If you want to use Snappy... I don't see any reason to use gzip; it's in there, but whatever. If you want to use Snappy or LZ4, it will read faster, but the compression ratio is not as good.
I: Cool. So there are some additional compression changes for the traffic between the distributors and the ingesters; that's another area, I guess, we keep looking at, because that's the critical path for data flow, and we're changing the compression there from gzip to Snappy. The options there are more limited, but Snappy performance is looking better.

I: Yeah, yeah. I guess, what else out of the next upcoming release would be good to talk through?
D: Excuse me... Ananya's finished exhaustive search. The query frontend was added in 0.5.0, but exhaustive search is a cool feature which is now on by default; you can't even turn it off at the moment.

D: You're required to run it exhaustively at the moment. Ananya, do you want to talk about kind of what that looks like?
B: Sure thing. So yeah, this was what we were looking at with the query frontend, right; that's what we wanted to do when we introduced the query frontend.

B: There could be long-running traces that are part of multiple blocks, and we wanted a way to combine all of them and return the full trace, and that's now possible with the query frontend. We're searching for traces across blocks in parallel, and it works really well. Our latencies are still as good as they were when we were at replication factor 2 with 150,000 spans per second; we're now close to 400,000 spans per second internally, and our latencies are, I think... the 99th percentile is around two seconds.
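As a rough picture of what combining a split trace involves, here is a small Go sketch that merges trace fragments found in different blocks and de-duplicates spans by span ID. The types are invented for the example; Tempo's real merge logic operates on its own proto trace types.

```go
// Illustrative only: merge fragments of one trace found in different blocks,
// de-duplicating spans by span ID. Types are invented for the sketch.
package main

import "fmt"

type Span struct{ SpanID string }
type Trace struct{ Spans []Span }

func combine(fragments ...Trace) Trace {
	seen := make(map[string]bool)
	var out Trace
	for _, f := range fragments {
		for _, s := range f.Spans {
			if !seen[s.SpanID] {
				seen[s.SpanID] = true
				out.Spans = append(out.Spans, s)
			}
		}
	}
	return out
}

func main() {
	blockA := Trace{Spans: []Span{{"s1"}, {"s2"}}}
	blockB := Trace{Spans: []Span{{"s2"}, {"s3"}}}
	fmt.Println(len(combine(blockA, blockB).Spans)) // 3 unique spans
}
```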
D: Just to reiterate for a second, and to give you an idea of our scale internally: we're searching 13 billion traces over a four-day... I'm sorry, 14-day retention window. So whatever... I didn't exactly catch the latencies, but whatever they were, they are fantastic, and it's over that size of data. So whenever you're building and running Tempo yourself, you're kind of making a call about how much volume you are ingesting...

D: ...what retention do I want, and that turns into how long my blocklist is, and that really is the driver for your query latency. So you could do significantly higher volume with two-day retention, or you could do lower volume at a 30-day retention; you can kind of balance that against your query latency.
A: A question on that: is there an upper limit to how long a trace can be in Tempo? Like, am I going to get in trouble if I record a five-hour trace, theoretically speaking?

I: Well, this actually helps fix that, right. Whereas normally that would have been split across multiple blocks, because there's that trace idle period time where it will be cut, this will actually bring it all together and combine it fully. So different parts of this had been released in different places, but this was finally going to... no matter how many blocks it's split across, it will bring it all together and combine it, right. Yeah.
D: Something I think we've seen a little bit more of in the community is wanting to trace longer processes, like CI pipelines or other things... batch jobs that take hours. Initially Tempo didn't handle that well, but with Ananya's recent changes it does, and so we're meeting that need as well with this release.

D: Let's see, so 0.6.0: exhaustive search and compression. We have two blog posts I wanted to highlight for people who are kind of getting started doing some instrumentation: Marty wrote a nice post here, and I wrote one from before that's certainly not as nice. I should probably make these links instead of this horrible giant thing. The first one is Spring Boot.
D: The second one is .NET. So if you're using any of those technologies, you can see some good examples there of how to set it up with Tempo and how to instrument applications. I think maybe the people on this call are a little bit past that; we're trying to create more material that gets people involved in Tempo. If there's any languages or frameworks that you all are aware of that we could add...

D: We kind of are trying to target popular things, right, to drive some attention to Tempo and get more people, more training, on instrumentation. But I don't know what else. Perhaps the question for the group is: what other kinds of frameworks or technologies do you think are very popular that could benefit from a blog post like this?
H: So another thing to think about as you guys are working on documentation: our lovely go-to-market cloud writer is basically working on making tutorials and stuff that pulls things together, but it really helps him to have raw material to start from. So if you guys make more raw material, odds are higher that he will be able to go in and help make something cool and multimedia from that. Sure, good.
D: Sure, sure. For a lot of these we're trying to just, I think, get people started. There's a group new to tracing, right, who don't really know the beginning of the process or what tracing is, and these are meant to be an opportunity to draw people in. We chose .NET and Java particularly for that reason: there's lots of, you know, enterprise Java developers, it's extremely popular, Spring Boot's very popular, .NET is very popular. We want to attach or connect to those communities and kind of give them material to get involved in tracing.
D: I don't know... I mean, for me those are two of the most popular, huge, you know, kind of languages. What else could we do, Python? Do people still make Python back ends?
A: Regarding Python, and Rust, which I mentioned as well: what could also be very interesting for those languages in particular is that they're often used to make things that talk to back ends, and initiating a trace from there can also be really interesting, since you can start getting into tracing that much earlier. Even if you don't entirely trust the client, you can do some interesting things, like starting the trace at the front end.
A: There's an OpenTelemetry Rust client of some sort, and there are also some language-wide efforts in just introducing tracing as a provider-agnostic concept, which are currently in the works. So it would be interesting to plug into that.

G: Sure. I mean, I don't really know what to say; we mainly just kind of followed through the examples we started with. We've got a Flask app that then, you know, talks to a number of data sources.
G: Postgres, Redis... I believe the Postgres one was the only one that we didn't get working fairly quickly. So we basically did a combination of the Flask app using the Flask auto-instrumentation, the various data stores using auto-instrumentation, and then the back-end Python stuff was all manually instrumented; that's not Flask-based. So, okay, I can probably find some code snippets. What's that?
G: ...go back and dig out our examples. I mean, I don't know, a lot of what we did was from the OpenTelemetry Python documents. So yeah, one thing with the Python stuff is definitely that it tends to matter how you're passing through the thread context and such, and that seems to be pretty app-specific in my experience. So I'm by no means a Python expert, but I can help provide some examples if that would help you guys along.

D: Cool, yeah. If you have some links to share in the document, please do; that's another...
G: If you use structlog, you build kind of like a logging context, and then... so you start your instrumentation, then you build the logging context for structlog, and then basically you process the rest of the request through that. And then the data source auto-instrumentation, obviously, will pass through the trace context, or... I forget the exact name, but anyway, the data sources should pick up the parent trace ID for you. It's mainly just getting the initial log message out that has that, and yeah.

G: That does depend... again, that's kind of Python-specific to what you're doing. If you're doing something with a lot of forking, then that can, you know, obviously change how you do it. But in the example... or, like, also, if you're doing Gunicorn with Flask, I don't even remember the workaround that we did for that; somebody else did that one. But passing that context back and forth by env vars is one possible solution, but yeah.
D: Okay, Python's a good target then; we can look at that. I'll put out my personal notes: I've always wanted to mess with some of these languages, and kind of building these blog posts and these repos gives me an opportunity to dig into some of these frameworks and languages I don't have a lot of experience with, which is cool.

D: Cool, let me share some internal metrics here, just to show how we are running Tempo; I'm going to do some screenshots, I guess, of our internal dashboards. We are around 450,000 spans a second, which I'm happy with. I really wanted to push towards a million, but... it's one of those things where we could, but we don't need to, and the cost would be more than we want to pay right now.
D: 0.6 is going to drop TCO a ton for those of you who are running Tempo, and then we're going to see further improvements, particularly in the way we query our backend, in the next... I'd say a month or two. So, about 450,000 spans a second; it's about, I think, seven to eight thousand traces a second. I think our average is... is it about a hundred? I always think a hundred, but now that I'm saying it out loud, I want to make sure the math works.

D: That's divided by two, because it's after the replication factor... oh, okay, cool, never mind. And then we're looking at, let's say, 420,000, so we're at... oh, 50. Okay, yeah, we're about 50 spans per trace. I don't know what other people are seeing in terms of spans per trace. We were at 100, because our query-path traces are much larger, but we started to sample a larger and larger percentage of our write path, which is smaller traces, so our spans per trace is going down.
D: Are people seeing... I don't know, I really don't know what the expectations are here; I've only done tracing at Grafana. So what can people share? Do you know how many spans per trace you have on average in other places?

G: Ours varies quite a bit. In some cases we have like three or four spans per trace, but then we've had traces that were probably... like, 80 is probably closer to the average, especially when you're instrumenting all the data store calls; but we've had traces as large as 1,700 on kind of longer-running things that do a whole lot of data store queries.
D: Right. So, let's see, our average is 50, but our largest traces hit several hundred thousand. In fact, if you look at that thing I linked, the spans per second, you see some yellow "refused" spikes. Those are spikes because, generally, Loki is crossing the 350,000-span border: we have a limit that puts a max on spans per trace, and it's always Loki. So those few spikes are somebody running a huge query in Loki, basically, but yeah.

D: Sometimes I wonder at what point it becomes useless to have a trace, at least to just look at it in Grafana: at what size does it not even matter that you have this whole trace? But I also think, for some of the things Luna was talking about earlier in terms of service graph...
D: ...building, having that data is valuable anyway, to do kind of analysis kind of work; maybe not staring at a couple-hundred-thousand-span trace, but just using it as data for some other job, metrics or whatever. We are close to time. I was going to talk about our roadmap to GA, but let's not get into that: 0.5.0 is released, 0.6.0 is coming up, and it will have really good TCO improvements.

D: Cool, okay. Well, thank you everyone for coming and showing up. I wish Luna didn't have to run; I wanted to thank her for getting involved and sharing her work at Embark, and hopefully I'll meet those engineers soon. Thank everyone else for also joining, and we will see you in a month, hopefully.