From YouTube: Tempo Community Call 2022-03-10
Description
- Metrics Generator Details
- Search improvements for 1.4
- Looking forward to Parquet!
A
Cool, so welcome to the community call, March edition, 2022. I can't believe it's the year 2022, but it is; here we are. We've got a couple of things today, and I think we even have a fancy presentation. We're going to talk about the metrics generator, and we have some good news there: that will be in 1.4, which, I guess, maybe we'll target for next month — or is it this month?
A
I don't know, I'll go back and look. We just generally do a release every other month, so maybe that's next month. And then we have some backend search news. I was mentioning this earlier, but I really feel like we've crossed a hurdle. We added a couple of features there that have seriously improved query times, and I think we have one more coming soon, so 1.4 is going to see way faster search speeds. And Marty and Annanay can talk about some of our Parquet work.
A
So I think — I don't really know, but I think metrics are first. Is that right? Oh, before we get into it: feel free to put anything in the agenda doc, and you're also welcome to jump in with questions, either in chat or you can unmute and ask. This is very casual.
A
We do have this fancy slide presentation thing, but you know, this is a conversation; don't be afraid to jump in, ask what you need to ask, and interrupt whoever's chatting. So with that, I think — is it metrics first? Koenraad, is it you?
B
So I'll start by talking a bit about the metrics generator. We already discussed this a couple of times in previous community calls, so this will just be an update, and I'll share a bit of the performance we're seeing in our internal clusters.
B
So I think last time we discussed the design document. Going further from that, we have now also merged the metrics generator code into the main branch, so you can run it already, if you dare to run the latest commits from Tempo. We run it internally and it's been going pretty well, but we're doing a lot of improvements at the moment.
B
I just wanted to do a quick recap of the architecture. The way the metrics generator works is we're adding this new component next to the ingester, which will receive a second, asynchronous request from the distributor. So the distributor writes to the ingester first, and then it does an async request to the metrics generator. It also has a ring to do load balancing, and the metrics generator will process the incoming spans and write metrics to Prometheus.
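For orientation, here's a minimal sketch of what those moving pieces might look like in config form. The code had only just landed on main, so the key names below are assumptions for illustration, not a documented config surface:

```yaml
# Sketch: the generator is its own component with a hash ring, and the
# distributor forwards spans to it asynchronously after writing to the
# ingesters. Key names are illustrative.
metrics_generator:
  ring:
    kvstore:
      store: memberlist   # same ring mechanism used by the other components
```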
B
I also have an architecture diagram of the internals of the metrics generator. That's been moving around as we discover stuff and figure out better ways to design this, but this is the architecture we're ending up with now, and it feels like a very good base. What it looks like internally is you have this pipeline from left to right, and there are three modules, or stages, in this pipeline.
B
So the first stage in this is the processors. The metrics generator receives all these spans that are being written to Tempo, and we use these metrics processors to convert the spans into metrics. A processor can, for instance, count how many spans there are with an error: it just keeps an updated metric — "oh, I saw a span with an error, okay, plus one" — over and over. The processor then updates the metrics and stores them in the registry.
B
We intend to make the processors kind of dynamic, so you can change them at runtime: you can add processors, you can disable them, and you can also change their config. Right now you can only enable and disable them, but we also want to make some configuration dynamic, so you can adjust it on the fly in case you're generating too many metrics or too many series. We currently have two processors.
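The two processors that existed at this point were service graphs and span metrics. Enabling them per tenant looks roughly like this — a sketch based on Tempo's overrides mechanism; exact keys may differ by release:

```yaml
# Sketch: processors are toggled per tenant at runtime via overrides.
overrides:
  metrics_generator_processors:
    - service-graphs   # edges and request/failure rates between services
    - span-metrics     # call counts and latency histograms per span
```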
B
So the next stage in this is the registry. The registry is this store of metrics: it just keeps track of all the metrics that are generated by the metrics generator. The processor uses the registry to store the metrics, and it will update them, increment them, and so on. The registry will then scrape itself, kind of like Prometheus scrapes other instances; it's a similar operation. It gathers the state of all the counters in the registry and then writes a sample to the storage layer.
B
And in this registry we can also enforce limits on the number of active series and some other properties. Then the final stage in this pipeline is the storage layer. What it does is buffer the samples as they get written to the downstream time series database. We use the Prometheus remote write protocol to send metrics to, for instance, Prometheus, Cortex, or whatever other database supports this protocol.
B
The storage component will buffer samples before sending them, and it also handles retry logic, queueing, and those things. What's actually pretty fun is that we didn't write this ourselves — this whole storage component with a WAL. We're using the Prometheus Agent: the Prometheus Agent is a specific mode to run Prometheus which has a minimal WAL implementation without the querying and alerting capabilities.
B
It's just a WAL to store samples and remote-write them, and we were able to use that code directly and plug it into Tempo, which is, you know, really cool, because it's a very good implementation which has been battle-tested already. It will also operate the same as Prometheus, so you have the same metrics and the same behavior as if you were running a Prometheus instance. Yeah, and that's kind of the architecture we have right now.
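Putting the storage stage together, a sketch of what pointing the generator's WAL at a Prometheus-compatible remote-write target might look like; the URL and path below are placeholders:

```yaml
# Sketch: the storage layer keeps a Prometheus-Agent-style WAL and
# remote-writes samples to any compatible TSDB.
metrics_generator:
  storage:
    path: /var/tempo/generator/wal               # local WAL directory (example)
    remote_write:
      - url: http://prometheus:9090/api/v1/write # Prometheus, Cortex, etc.
```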
B
So we still have a couple of PRs up to get this into place, but it's slowly taking shape. I can also share some performance stats. We've been running this in our internal clusters for, I guess, two or three weeks, and I can already share some stats from what we have observed.
B
So the first thing that we observed is that by adding the metrics generator, the distributor has to do more work, right? It has to send the second request to the metrics generator, and we saw that this currently has an especially high impact on CPU.
B
We saw a 30 to 40 percent increase in CPU usage across all the instances, and this is caused by the extra work that we have to do: we have to send a second request, we have to marshal this request again, buffer it for a bit, and stuff like that. We think we can still optimize this a bit with some better buffering logic.
B
The largest deployment we have it running in right now is a cluster that is ingesting about 1.2 million spans per second, about 200 megabytes per second, and we can handle this load by running eight metrics generators next to each other — so just eight replicas. That's, I think, about half the number of distributors and about a third of the number of ingesters.
B
So this component is already proving to be more efficient than the distributor, which is nice, because then you need fewer instances. In our cluster each one was consuming about two CPUs and about 1.6 gigabytes of memory, just to give an idea of what this uses.
B
We noticed that the CPU usage is correlated with the number of spans we're ingesting, but memory not so much. The CPU usage is expected: if you ingest more spans, you have to do more work to deserialize the data, to unmarshal the protobuf.
B
We noticed that if you go higher, the instance starts to struggle and starts to crash and stuff like that. But memory is not really impacted by the number of spans you ingest, which is also nice, because it means we can optimize the code a bit more.
B
So it's more about CPU efficiency; memory is not a big issue right now, it seems okay. And then, oh yeah, what I can also share: an issue with generating metrics from traces is that traces typically have higher cardinality, right? You can have a lot of span names and a lot of different tags within your traces, and when you instrument your code you might not be thinking about cardinality — you might not consider that you shouldn't put, say, a random number into your span name. That can happen.
B
And by generating metrics from traces, we have this risk of just causing too many series, which would bring down your time series database. So something we're working on is adding controls to limit the number of active series we're generating within the generator. I have two screenshots here showing the number of active series in the generator. The first one is without any controls: every time there's a new span coming in with new data, you'd start a new time series.
B
We would just keep sending them all the time, which means that if this instance runs for multiple days, this number always increases. You can see this always goes up: it starts at about 10,000, and it goes up to 15,000, 20,000, and you just keep sending all these time series all the time. So what we've now integrated into the registry is the ability to drop stale series.
B
This means that a series that hasn't been updated in, say, 15 minutes, you just delete: you stop sending it, and then you get a nicer, flatter line and a more stable system. So this is one way to deal with cardinality: if there's a spike in cardinality, there will be a moment in which there are more active series, but after a while it goes back down again once those series stop being updated.
B
A second thing we're adding is the ability to set limits. You can, for instance, limit the generator: hey, you can only emit 10,000 active series, and when there's one more, you just stop — you don't generate a metric for it. That would be a hard stop to protect the downstream database, or the bill you're paying for it. Yeah, and that's kind of it for what I have on the metrics generator.
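A sketch of those two safety valves — staleness-based eviction and a hard cap on active series — expressed as per-tenant overrides; both key names here are my assumptions rather than confirmed config:

```yaml
# Sketch: cardinality protections in the generator's registry.
overrides:
  # drop a series that hasn't been updated recently (name is a guess)
  metrics_generator_registry_stale_duration: 15m
  # hard cap: once this many series are active, new ones are not created
  metrics_generator_max_active_series: 10000
```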
B
I think we have a couple of big changes coming in the next week or two, and then we should be at a very good spot to release this with Tempo 1.4. So we'll probably be working more on documentation and sharing operational information, alerting, stuff like that. Cool.
A
Go back a few slides to your metrics — one of your distributors hit 40 gigs of working set. Is there really a limit over 40 gigs? How did that not go boom? What happened there, do you know?
B
Yeah, I don't know, I didn't really look into it. I thought it was just something to do with a rollout. Maybe it's the bucketing that didn't match well.
B
Oh yeah, these graphs are not from the million-spans-per-second cluster — this is from the smaller cluster — but in the cluster in which we were ingesting about a million spans per second, I think we had closer to a hundred thousand active series, and we noticed
B
it also increased all the time. It started at about a hundred thousand, and if you run it for a couple of days, it goes up to five hundred thousand. So every instance is sending 500,000 active series, which is just too much. We still have to figure out this ratio between the number of spans ingested and the number of active series; it's not always correlated.
B
If you're sending a lot of spans but they're all from the same system, you might have a small number of metrics, while if you're sending a couple of spans with very high cardinality, you might be generating a ton of series. So that's something interesting that we'll have to learn.
D
And we'd have to do some sort of analysis to figure out which active series are just enough to create the kind of dashboards and graphs we need.
B
All right, cool. I think we can move on to the next part of the community call, which is backend search.
A
That's me, yeah: backend search. So we're going to talk a little bit about some of the settings, or tunables, for search, and I'll share our internal settings as well as some new ones coming out in 1.4. So we'll talk a little bit about what's in 1.3.2, which is out now, then what we've changed in 1.4, and what kind of performance we're seeing because of those changes.
A
I'll share some internal metrics, kind of like Koenraad did with the metrics generator: given our spans per second, what kind of search rates are we seeing. And then we're going to talk about Grafana Cloud Traces a little bit. Coming soon — it's actually already turned on for a handful of customers. If you're interested in this feature and you're using Cloud Traces, you're
A
welcome to ask me on this call or DM me on Slack, and I'll turn it on for you as well. Across the board, we're targeting the week of March 21st. We need some things from the hosted Grafana team: they need to set a feature flag for us, and they're working out how they're going to do that.
A
So I think around the week of the 21st we'll finally have this enabled in Cloud Traces for everyone, which means everyone will have the full backend search experience. Well — my bad — if you're in Google Cloud, which almost all of our customers are. We're looking to roll that out to other clouds, probably in the next few months. So yeah, we can move forward.
A
So, in 1.3.2 — we've shared this diagram before. You have the query frontend: a query comes in and it's broken up into thousands of jobs, which the queriers then do their best to service, and they also pass these on. You can also configure it to use serverless, which is what we do given our volume. So in 1.3.2, the querier is a proxy for serverless.
A
Basically, the query frontend makes — not thirty thousand, not that many — three or four thousand jobs. The queriers consume those jobs one at a time and pass them on to the serverless functions, which then serve each one, one at a time. I've highlighted some settings: if you look in our docs, under the operational section there's a doc that outlines everything you're seeing here, but I'll just talk through it real quick. We need to just make some adjustments.
A
You'll notice we've changed the server timeouts, the HTTP timeouts — we're still operating relatively slowly, not as fast as we want, but we're improving — and then the concurrent jobs and the max outstanding per tenant just let you push a whole lot more through the query frontend. The defaults in the query frontend are still kind of tuned for trace-by-ID search.
A
We might want to actually update that and make the defaults work with search generally, as we've built it now. But the important ones are things like the target bytes per job, which divides your entire search request into individual jobs that are, in our case, about 10 megs each, which are then served by the queriers or the serverless functions. And then you can see some other options on there — concurrent jobs and max outstanding — for your queriers to service.
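A sketch of those frontend-side tunables in config form. The ~10 MB job size mirrors what was just described; the other values are placeholders, and key names may differ slightly per release:

```yaml
# Sketch: query-frontend tunables for backend search.
server:
  http_server_read_timeout: 2m      # raised from defaults for long searches
  http_server_write_timeout: 2m
query_frontend:
  max_outstanding_per_tenant: 2000
  search:
    concurrent_jobs: 2000
    target_bytes_per_job: 10485760  # ~10 MB of trace data per job
```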
A
At the querier level, the search external endpoints setting is what actually tells it to use serverless, and then there's max concurrent queries. By default, max concurrent queries is something like two — again, everything is tuned for trace-by-ID search — but if your queriers are only doing two at a time each, that's not going to do you any good; you need them to do way more. So we have that set to 100 right now in our largest cluster.
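And a sketch of the querier side; the endpoint URL is a placeholder and the key names are close to, but may not exactly match, the docs for your release:

```yaml
# Sketch: querier-side serverless settings.
querier:
  max_concurrent_queries: 100   # default (~2) is tuned for trace-by-ID
  search:
    external_endpoints:
      - https://<region>-<project>.cloudfunctions.net/tempo-search
```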
A
If we move forward — so in 1.4 we've added a few settings that are really changing the rate at which we can search the backend, quite a bit. One is we added hedging to the query frontend, actually just yesterday or something. In fact, it's kind of embarrassing that it took me this long to add this feature, because it tripled the rate at which we could search the backend.
A
Our latencies were just dominated by the long tail. I should have seen that weeks ago, but didn't; I was kind of caught up in some other details I was trying to unravel. Once I added hedged requests to the frontend, we just instantly tripled the rate at which we were searching the backend, by simply repeating requests once they exceeded a certain threshold. The defaults there are kind of tuned to search, which is five seconds for an individual job to be outstanding, and we'll try it up to three times.
A
So if it doesn't come back in five seconds, we'll just try again; if that happens again, we'll try again. These are tuned to roughly our p99 on search. Another feature we added: the queriers in 1.3.2 just proxy to the serverless, like we talked about, but we added a feature so the queriers will actually do some work themselves — there's no reason for them to just sit there and do nothing. So with search prefer self,
A
the querier will always be doing two jobs itself while continuing to proxy to the backend. It adds a little bit of throughput, and the queriers tend to have more horsepower than the serverless functions, so they can churn through the jobs just a bit faster. It helps some in reducing your latency and better utilizing your queriers and your resources while you're doing search. And then finally, in the ingester, we added this ability to — well, to take a step back.
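Roughly what those two knobs look like together in config. These features had only just landed on main, so treat the names and their placement as assumptions:

```yaml
# Sketch: hedged requests plus querier self-service.
querier:
  search:
    prefer_self: 2                    # querier keeps two jobs for itself
    external_hedge_requests_at: 5s    # re-issue a job still out after ~p99
    external_hedge_requests_up_to: 3  # at most three attempts per job
```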
A
We've made a change in 1.4 where the time range on the blocks is based on the span times. It used to be based on ingestion time, which wasn't great; now it's based on the span times instead.
A
This allows us to better set the range for the blocks, especially when we replay the blocks when the ingester restarts. But the problem we found in basing it on span time is that people send us spans from, like, three weeks ago, which didn't quite work, because basically every single block covered most of our retention, so any search attempted to access every block on the backend — they all said they had spans in those time ranges. So this ingestion time range slack will force the ingester to only update the block time range
A
if it's within a certain slack of ingestion time. The main gain here: when you used to replay the WAL, you would basically reset those block time ranges to whenever the WAL was replayed, which basically lost them in search. With these updates, when the WAL replays, it will use the span times instead, so you won't lose those; blocks will have authentic min/max time ranges, which is the big gain there.
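A sketch of that setting; the value shown is an example, and the exact name and location in the config tree are illustrative:

```yaml
# Sketch: only spans whose timestamps fall near "now" can move a block's
# min/max time range, so a stray three-week-old span can't stretch a
# block across most of retention.
storage:
  trace:
    ingestion_time_range_slack: 2m
```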
A
If you go forward — so, progress so far. We fixed the min/max time range in the block, which we talked about; this will make our search more authentic. I really wanted this in cloud before we rolled it out to everybody. And we added that ingestion time range slack setting, so it doesn't accept times from three weeks ago or something silly.
A
We've added a new object encoding — actually only the second ever — which is increasing the rate at which we can search. It's an improvement on 1.3.2: we've added the start and end times outside the proto. One of the biggest bottlenecks in search is unmarshaling the proto and then piling through every span.
A
Also, we use that start and end time for duration. So if you have a duration in mind and you specify that in your search, we can skip quite a few traces as well: if you say one to two seconds or something, anything under one second won't even be unmarshaled. And our current throughput at the tip of main, with all these improvements, is about 30 to 35 gigs a second.
A
We just immediately got 30 to 35 gigs a second, which is great. And then I have an experimental branch with some proto-unmarshaling improvements that's seeing 60 gigs per second on search pretty regularly — it's consistently at about 60 gigs a second. The improvements, unfortunately, use gogo proto, which is not maintained, so we're looking at possibly still doing that, using gogo proto and some features in gogo proto.
A
We're also looking at vitess — PlanetScale has a proto marshaller. Actually, just today I was spending some time getting our proto generated using their generator. So we might try the vitess proto generator as well, and we're looking to get around 60 gigs a second.
A
I think at that point, at 60 gigs a second, I'm probably going to let this rest. Maybe there'll be some small improvements, but I'll be happy with those speeds as long as we can get them consistently, and then look forward to maybe the next generation of the Tempo backend — maybe Parquet. We'll have Marty and Annanay talk about that a little bit next: our thoughts on Parquet and our chances of moving there. And to give some metrics here, at 220 megs a second, which is what we're doing internally,
A
it takes about 20 to 25 seconds to search one hour. Not great, but not terrible. That's an exhaustive search: if you actually specified some parameters, it'd come back a lot faster. That is searching every single trace in the backend in that hour and failing to match anything, basically. And it's pretty stable in those time ranges — 20 to 25 second searches for these larger installations.
A
For, you know, hundreds of megs a second, search will continue to feel like a batch job for a while. But for smaller installations — double-digit megabytes a second — you could definitely see significantly faster search times, I think. Cool. And then some metrics from our serverless.
A
If you want to go forward — yeah, so just to share some of this, because it's fun to look at. Earlier this week, two or three days ago, you can kind of see my pattern here: I'd make a small change and then run a test, then make a small change and run a test.
A
I was trying to unlock the combination of parameters that really helped us get to this 30 gigs a second. So here you can see, just to share some of this, the number of instances of Google Cloud Functions when I'm doing my searches: they're hitting five, six, seven thousand instances active at one time, and then you can see the idle functions too. It's kind of fun to see.
A
Next: this is the gigabytes per second as reported by GCS. This is compressed, so it's going to be less than the 30 we were talking about, because the 30 is uncompressed proto. So we're seeing about 10 gigs a second, spiking up to 18 gigs a second, coming out of GCS compressed.
A
We are currently using LZ4 on our backend. The default is actually zstandard, but LZ4 costs less to decompress, so we've been experimenting with it, because we're doing search. Honestly, maybe for 1.4, or perhaps 1.5, we'll probably revisit our defaults.
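For reference, the backend block encoding is a one-line setting; the specific lz4 variant below is an example choice:

```yaml
# Backend block encoding: zstd (the default) compresses better, while lz4
# decompresses much more cheaply, which matters during search.
storage:
  trace:
    block:
      encoding: lz4-1M   # one of the lz4 variants; default is zstd
```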
A
On the backend it's just way cheaper to decompress, which is a lot nicer for the serverless functions: with much less CPU and memory, they can churn through these pages a lot faster. And then finally, this is the number of requests per second we're hitting GCS with — these are reported by Stackdriver here — and we're seeing about 12,000 to 15,000 requests per second, on average, that GCS is returning. These are tiny read ranges over the block. Each of these serverless functions is slamming GCS as fast as possible to slurp out data. You can even see our polling cycle way down here: these tiny bumps every five to ten minutes, a very small bump — that's our polling cycle, when all the different compactors and all the different components are asking for the index.json or building the index.json. So yeah, that's serverless. I'd say things are feeling good after feeling rough in 1.3.2; I didn't really like where it was, but it was technically working.
A
I really feel like things are entering the territory where I feel comfortable with the community running it a little bit more, and I feel comfortable with it being in Grafana Cloud Traces. It's still not amazing at extremely high volume, but it's functional and usable, and I like where it's at. And then hopefully we'll look to the future for some better formats, to just reduce the amount of data we pull — we're pulling gigs and gigs and gigs from GCS to answer these questions, which is hopefully going to be unnecessary in the future.
E
Hey, can we go back a couple slides, to the number of active and idle instances? I remember that being a lot lower — that's almost 8,000, that was really cool. Was that the trick of having multiple sets of functions, to get past kind of the throttling?
A
Right, so for the serverless endpoints we actually have the ability to have more than one, and the reason is to get past Google Cloud quotas: Google Cloud will only run 3,000 instances of any one function at a time, so we run 10, which is unnecessary — I've just been trying every combination of parameters I can to see how fast I can get data out of GCS. We could probably do this with three or four. Basically, this is the sum across 10 different Google Cloud functions.
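That fan-out is just a list of endpoints in config; a sketch with placeholder URLs:

```yaml
# Sketch: several copies of the same search function, fanned out to get
# past the per-function instance quota.
querier:
  search:
    external_endpoints:
      - https://<region>-<project>.cloudfunctions.net/tempo-search-1
      - https://<region>-<project>.cloudfunctions.net/tempo-search-2
      - https://<region>-<project>.cloudfunctions.net/tempo-search-3
```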
A
It's still, depending on your query load, a fraction of the cost of running Tempo as a whole, but it can add up if you sat there and just beat it to death. Constant querying would cost quite a bit, but Tempo is not built for that: Tempo is built to ingest as cheaply as possible, because that's almost all you'll be doing with a tracing backend, and to query in a more expensive way.
E
Okay, cool, hey, yeah. So, if you were here a couple of months ago at a community call, or saw it, we kind of talked about how we had been looking into a new block format for the backend — we touched on that a lot here — and at the time it was flatbuffers.
E
We dug pretty deep into that, and we had some numbers and some interesting findings, but ultimately it didn't feel like it panned out. Since then we've been looking into columnar formats, and Parquet in particular, so we'll walk through that with some thoughts, findings, and status, and just kind of what we're thinking there. Cool. So, okay, basic stuff first: why columnar? Well, we kind of touched on that.
E
With protobuf, you have to deserialize the whole thing — you can't get past it — and that's not great. Flatbuffers was one way to jump around, but you know, it didn't work out. Columnar: you only read the columns you need, and naturally our filtering, our search behavior, is very columnar. So as an example: in a 600-megabyte block, did you know that just the span IDs are 160 megs? That's fascinating to me, because for searching we never use that data for anything.
E
So why Parquet? There are a lot of other columnar technologies out there, so why Parquet in particular? Well, it's a file, which is great; that works really well for us. As far as Tempo's blocks go, it'd be very easy to put a file in the backend and treat it as a file — it doesn't require heavy state or heavy installations, things like that. And it already has so many awesome things in it, which we would want anyway: different encodings for different data types — delta encoding, dictionaries, run length —
It
has
different
types
of
compression,
page
statistics,
so
you
can
skip
around
and
skip
over
blocks
of
data
balloon
filters.
These
are
all
super
cool,
and
so
they
all
come
together
and
make
it
so
super
fast,
and
hopefully
really
you
know
you
can
read
a
lot
less
data.
So
let's
go
sorry.
Look
around
here.
E
Okay, column design. There are kind of two top approaches to columns. Would you do a nested design, like a trace — a trace is a graph of batches, spans, attributes, things like that — or would you flatten it into just a list of spans? Both options are really interesting. We actually tried both, and it just ended up that the nested approach seems better. It seems to come down to the things that are at the resource or batch level,
E
right: flattening those down onto every span just kind of outweighs any benefits you might get from the other design. Examples of that are cluster, namespace, and pod attributes. So here's an example schema, just to show the graph: you have the trace root at the top, which has a trace ID; inside that — this is kind of matching the OpenTelemetry protocol (OTLP) — resource spans, and a resource that has attribute keys and values.
E
So the actual columns in here — some of these are virtual — are things like the trace ID, the resource attribute keys and values, and the span attributes. Cool, okay, that's cool, but obviously we've got to go way further; there are so many cool things you could do.
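To make that shape concrete, here's a rough YAML-style sketch of the nested column tree being described, loosely mirroring the OTLP structure; the leaf types are guesses for illustration, not the actual schema:

```yaml
# Illustrative only: nested Parquet columns roughly following OTLP.
Trace:
  TraceID: byte_array
  ResourceSpans:                  # repeated
    Resource:
      Attrs:                      # repeated generic key/value attributes
        Key: string
        Value: string
    InstrumentationLibrarySpans:  # repeated
      Spans:                      # repeated
        SpanID: byte_array
        Name: string
        StartTimeUnixNano: int64
        EndTimeUnixNano: int64
```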
E
So one is: the attributes that we're searching on — don't put them all in one column; let's put them into their own columns. That's really cool, because when you're searching you're targeting a lot less data, but the block itself also gets a lot more efficient, because you're putting the repeated values together into their own columns, versus generic keys and values.
E
So what would that look like? Something like this, as highlighted: the resource would have dedicated cluster, namespace, pod, and container columns. These are just attributes that we commonly use in our own stuff, so we can target those directly. That would be really cool.
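And a sketch of the same resource node once those well-known attributes are promoted — again, hypothetical names for illustration:

```yaml
# Illustrative only: well-known attributes promoted to dedicated columns,
# with the generic key/value pairs kept as a fallback.
Resource:
  Cluster: string      # dedicated columns dictionary-encode very well
  Namespace: string
  Pod: string
  Container: string
  Attrs:
    Key: string
    Value: string
```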
E
Okay — hey, Annanay, do you want to touch on these slides?
D
So under the spans, in the fields, you can see there are columns like alert ID, attempt ID, block — and each of these was actually a tag on the span. We figured out a way to sort of blow these out into their own columns, and that way you can actually search them in parallel and perform joins across them.
D
So you only end up reading a really small section of the block, and then you can actually see the traces that match, and perform joins across these columns to finally filter out the traces that match multiple conditions. So this is really cool. One of the problems that we ran into — can we go to the next slide, Marty? — was this: when your app has auto-instrumentation, you sort of end up
D
with this bizarre schema, where you have a million tags on your spans, and these get blown out into their own columns, and then it starts looking something like this. So we're trying to come up with an opinionated way of saying: hey, we're not going to blow out attributes that have a prefix like http.request. We're kind of playing around with that.
D
We don't know what our approach is going to be, but if this turns out to be too expensive, then we might just put these into the old way of doing it, where you put each of these into one column. So we won't provide search on them, but you can still reconstruct the trace out of them. This is still up in the air, but it's just something to think about: the writing of the schema gets a little expensive as we grow the number of columns.
D
So this might help reduce that a bit, but our results are still inconclusive around that. And yeah, I think that's about it for the dynamic schemas. If we go to the next slide, I just have one last thing to point out: all of this is possible thanks to these folks. The segmentio team has come up with a really cool Parquet SDK for Golang, and we were able to do a lot of these things.
A
Go back to the — oh, my bad! You had some more stuff.
D
Also, it's cool that we have so many languages internally and somehow all of the SDKs have converged around these — or maybe it's our internal OpenTelemetry conversion thing, so that could be true.
E
Yeah, okay, so we have some super early but cool benchmarks; we just wanted to share them, you know, just to get on the hype train. A super simple query: one cluster value and a min duration. There are two test cases here, comparing against the original proto and the new Parquet stuff.
E
So you can kind of see, maybe, in here it's reading 600 megabytes — so the entire block — and finding 55 traces. After converting that to Parquet and doing the same thing, we're reading two megabytes, same 55 traces. Super cool, and it's way faster: 22 seconds versus — this is 140 milliseconds. Yeah, so we've got some cool numbers here. Now, a more realistic query, just for comparison.
E
I actually forgot to paste in the time here, but it's like 200 milliseconds or something. It's looking for all of these — and of course they're not all in their own columns — cluster, namespace, pod, service name, duration. So this is filtering down a lot further; we're only reading 11 megabytes, so that's pretty cool. Yeah, that is sweet. So what else do we have? Okay, so there's a lot more to figure out; these are just current findings and thoughts, and kind of our current directions.
E
So, things to figure out. Dynamic columns add a lot of moving parts, but they seem like they're worth it, because it's just fantastic speeds and things like that; the results look really promising. If you go back, we just had strings everywhere, so we need to figure out every other data type. We have some progress on that, but there are still things to figure out — things like the fact that you can actually have attributes with the same name but different types, so we need to figure out stuff like that.
E
How does that even work? I'm not even sure how that would work with our current search UI, so we need to get together on things like that. Parquet column naming is actually more strict than the tag names in the data, so there has to be some sort of transformation — what would that look like? The example is http.status_code, and then just any collision in general.
E
That maybe has to do with the structure of the thing that we're creating. And of course, we're reading a lot less data, but the I/Os are still kind of high, so that's just something we need to dig into — that's a potential thing we need to figure out.
D
Also, each column can have its own compression type, so I don't know how we'll figure that out.
D
Yes, and the block takes longer to create. It's not in the critical path, because we do it async, but it still needs to be fast, because otherwise our rollouts will become slow; it'll just generally be a little sad.
A
Compression types and all that — so we'll probably pay a little bit more on write when we move to Parquet, but the benefits, I suppose, would definitely be worth it. Not only those awesome search metrics, but I love that if we just put raw Parquet in S3, it suddenly unlocks the ability to use any of a million tools that already exist to work with Parquet and object storage — and there are a lot of those. You can query it with Athena; you could write a Flink thing, a Spark one.
A
Right, I've never worked with those technologies, but you could write some of that to take your data and do anything you want with it. I think it really unlocks a world of awesomeness when we connect to, or use, this open standard, if we can get there.
E
So I honestly don't know if we're compatible with those tools yet — I don't think we are — but I haven't used them a whole lot, so we'll see. If we are, please let us know; try it.
D
I think there are Parquet tools, visualizers and stuff, that we should probably try using — try querying the Parquet file outside of Tempo.
A
Cool. Any questions, concerns, thoughts? I think we've been talking about some of these things for a while, but I also think we've talked about some very future things with Parquet, which is very cool.
A
I think the metrics generator coming in 1.4 will be an awesome feature addition for anyone running it, and we're going to get some of those features into Grafana Cloud. I think 1.4 is going to see way better performance on search, so those of you who are using search, or wanting to experiment with it, are going to see far better results with 1.4. I still owe the community a guide on setting this up in AWS.
A
Technically we have a proof of concept running in AWS, and I have not documented this anywhere. If you check our docs, there's Terraform for deploying to Google Cloud, but nothing for AWS. So I do owe you all better documentation on the serverless path, as well as some specific docs around how to get this running with AWS Lambda and S3. But other than that —