From YouTube: Tempo Community Call 2022-02-10
Description
- Discussion of Backend Search with Serverless Architecture
- Metrics generator updates
- Pyroscope devs demonstrate linking traces and profiles
A: All right, so welcome to the Tempo community call, whatever month and year it is, February 2022. I believe we've got a bunch of stuff to talk about today, including a group from Pyroscope, which is awesome. We're interested to hear about some continuous profiling details as well; I think they're working on some kind of trace-to-profile linking, and I'm personally really excited to see what they're working on and to get an idea of what this looks like.
A: But before we get started there, we're going to talk about our latest Tempo release from this past month. We're going to get into search and some more serverless discussion, how we're using serverless at Grafana, and then we're going to talk about the metrics generator again, where we are on that; we're making progress there as well. So with the releases, I think Zach is going to take a look at that. Oh, we have a presentation! Somebody put a presentation together, didn't they? Was it Mario?
B: Nice, yeah, I like to see presentations. On the release front, I just checked the doc; it looks like we did cover the 1.3.0 release last time, so I won't belabor that here. Do check the release notes, though: there are a few breaking changes that are worth pointing out. So yeah, take a look at that.
B: We did have a 1.3.1 release since, which is a simple bug fix for a case where using etcd as the key-value store could cause a panic. 1.3.1 will fix that, so you'll want to upgrade if you are using etcd. And that's all we have for the releases right now.
A: This is the super professional and well-produced Tempo community call.
A: Yeah, that's what Zach was talking about: 1.3.1, which we released a week or two back. And then, right, did we talk about the breaking changes? Zach, did you detail some of those?
B: Yeah, we did cover it in the previous meeting, I was just checking the notes, but we can talk about that a little bit if you want.
B: There are three breaking changes in the 1.3.0 release. One of them is that the OTLP gRPC receiver default port changed, so you'll want to make note of that: firewall rules, policies, etc.
B: So if you look in recent versions of your /etc/services file, you will see OpenTelemetry listed there, which is super exciting for them. Also a config change: there are two parameters that moved from the querier config to the query-frontend config. You will want to make those updates if you are using search max result limit or search default result limit; both of those live on the query-frontend now. And then also, we removed a deprecated gRPC endpoint on the ingester and its associated data encoding.
B: This was deprecated back in 1.0, so if you're running earlier versions, you need to upgrade to 1.0 first and allow the blocks to be rewritten with the new encoding. So that's the three there, and then we've got a whole pile of features, so I would highly recommend checking out the release notes for the remaining ones.
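For anyone making that change, here is a minimal sketch of the config move, assuming the parameter names from the 1.3 release notes; verify the exact spelling and nesting against the changelog before copying:

```yaml
# Pre-1.3: these limits lived under the querier block
querier:
  search_max_result_limit: 100
  search_default_result_limit: 20

# 1.3 and later: they move to the query-frontend block
# (exact nesting per the 1.3 changelog)
query_frontend:
  search_max_result_limit: 100
  search_default_result_limit: 20
```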
A
Cool
thanks,
in
particular,
130
included,
back-end
search
for
the
first
time,
we're
going
to
talk
a
little
bit
about
that
now
kind
of
what
we
have
going
on
for,
like
this
full
scan
of
our
gcs
or
s3,
or
whatever
object,
storage
and
we've
already
had
some
people
experimenting
with
the
cool
which
is
cool.
I
think
that
really
points
towards
how
excited
everyone
is
to
be
able
to
search
their
full
background
with
tempo
instead
of
using
logs
or
exemplars
to
jump
into
traces.
A: But there are going to be some hiccups here, and there's going to be some config we have to talk about. So let's talk about what we're doing in search, what we're doing at Grafana, and hopefully we can help you all.
A: Hopefully we can help you all get your search configured correctly. So go ahead, Mario, if you want to move forward.
A: So, that serverless help: I think Mario copied what was in our doc. If you look in the community call doc, there's a link to our docs, just our standard Tempo documentation, and it covers some of what we're going to go over today, real quick here: some of these config options that you might need, to help you get started with serverless and get your backend search working with serverless, which is what we're using at Grafana.
A
Also,
oh,
that
is
totally
wrong.
It's
not
grafana386!
I'm
sure
I
typed
that
wrong.
It
is
graffana,
836
nope,
sorry,
mario,
you
got
it.
That
was
you
bud.
Grafana
836,
which
was
just
released
yesterday,
supports
back-end
search.
We
need
to
update
our
docs,
I
think
we
say
a
future
grafana
version,
but
we'll
update
it
to
say
836.
A: And you need to include, and I'll include this here, a feature flag, backend Tempo search. So we're still marking this as beta, and the reason we're marking this as beta is that it's just not at the performance we want for our level of ingest, which is about 250 MB/s. I think once we reach that scale, we'll feel comfortable dropping that tag and saying it's, you know, GA. We also just want it to be a little easier to configure.
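Enabling a Grafana feature toggle looks roughly like this in grafana.ini; the toggle name below is a best guess at the flag being described, so confirm it against the Grafana 8.3.6 docs:

```ini
[feature_toggles]
; assumed name for the beta Tempo backend search toggle
enable = tempoBackendSearch
```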
A
It's
still
a
little
bit
gorpy
in
that
world
and
I'll
show
you
some
of
that
and
talk
about
why
it's
a
little
rough
right,
serverless
help,
and
then
we
also
added
lam
to
support
since
131
was
cut.
A: So this is kind of the architecture, the way we're searching, or the way we're making use of serverless. Currently, when you do a trace-by-ID search, the query frontend hits the queriers, the queriers ask the ingesters, and then they search the GCS bucket directly. With serverless, if you want to use serverless for search, instead of digging directly into the GCS bucket, we query a serverless endpoint; it's just an endpoint that you configure using this querier search external endpoints option.
A: So if you want to use serverless, you have to configure this value, which tells the querier not to do the search on its own, but instead to offload the work. With this configuration point, your querier is basically a proxy when you're doing these time-range searches with different criteria, like when you're looking for particular tags or durations or whatever.
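Concretely, the option being described looks roughly like this; the field name follows the Tempo serverless docs of the time, and the endpoint URL is a placeholder:

```yaml
# Hedged sketch: hand backend-search jobs to external serverless endpoints
# instead of having the querier scan object storage itself.
querier:
  search_external_endpoints:
    - https://us-central1-<your-project>.cloudfunctions.net/tempo-search  # placeholder URL
```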
A: So the query frontend builds up a whole bunch of jobs: it scans the bucket, it knows how many blocks need to be searched, and it creates a giant queue of these jobs. The queriers take the jobs one at a time; some of the jobs say go look in the ingesters, and some of the jobs say go query a certain time range in the backend.
A
If
you
don't
have
this
external
endpoint
configured
that
search
the
back
end,
the
query
will
do
it
itself,
you
don't
need
serverless,
so
the
queriers
are
set
up
to
completely
search
the
background
on
their
own,
but
at
certain
scales-
and
if
you
want
certain
like
latencies,
then
you
basically
have
to
have
some
kind
of
burst
capacity
which
we're
using
serverless
technologies
to
do
so.
These
jobs
that
say
go
scan.
The
back
end
hit
the
query.
A
If
that
search
external
endpoint
is
configured,
it
will
just
hit
the
endpoint,
it
becomes
a
proxy
for
the
query
and
then
you're
kind
of
offloading
this
into
a
serverless
technology
like
lambda
or
google
cloud
functions,
so
that's
kind
of
like
the
key
config
option
that
turns
on
serverless.
It
changes
your
query
from
attempting
to
do
the
work
itself
to
offloading
it
to
something.
A: Cool. Why don't we go ahead and give you some numbers about what we're really trying to hit before we drop this beta flag. We're doing 250 MB/s of ingest; because we use replication factor 3, that actually turns into 750 MB/s written into GCS. We attempt to compact that away, but that's kind of the search space at time zero.
A
Essentially
it's
three
times
what
we're
ingesting
because
of
our
replication
factor
and
then
over
time
that
kind
of
gets
reduced
as
we
can
pack
some
of
that
away,
but
still
it
is
kind
of
makes
a
challenge
for
recent
searches.
A: We have 202 query frontends, and if you make a request for a time range of a single hour, regardless of your conditions, it creates about 38,000 jobs: 38,000 small jobs for the queriers to do, each one like "scan this small range of a block", and either the queriers are doing those or the serverless function is doing those.
A: We have 50 queriers, and these queriers act as proxies for the time-range search. For the trace-by-ID search they still do the work, but for the time-range search, the backend search, they become this proxy, and we're seeing spikes to about 2,000 serverless functions when we're doing this search. So it starts at zero, of course; you see a graph down there of the count of active serverless functions in GCP.
A: It's sitting at zero, of course, and then you start asking questions, you start doing these time-range searches. The query frontend hits the queriers, the queriers start slamming the Google Cloud Functions, which then scale up very rapidly, and you're able to service your query, hopefully pretty quickly. And this is kind of the current bottleneck.
A
We're
hitting,
which
is
gcs-
is
really
only
allowing
us
five
to
eight
thousand
or
so
queries
per
second,
and
that's
what
we're
really
trying
to
move
past
right
now
is
and
it
kind
of
generally.
We
want
to
be
agnostic
about
our
backends
and
just
kind
of
think
of
that
about
them
as
generic
object
storage,
we
don't
care
we're
writing
blocks
to
them,
we're
carrying
back
it
doesn't
matter
but
to
really
push
the
limits
of
these
back
ends.
We
find
that
we
are
going
to
have
to
get
into
some
of
the
details.
A
And
how
do
we
push
it
harder
because
all
our
entire
infrastructure
for
search
is
capable
of
doing
this
250
mega?
Second
search,
everything
is
set
up.
We
just
get
bottlenecked
on
gcs
gcs
won't
we'll
just
start
throttling
our
queries.
It
starts
returning
429s
and
other
error
codes
indicating
it
just
won't
handle
any
more
requests.
A: One more slide, two more slides, question mark? Yeah, okay. This one's kind of a super boring slide, but it shows you some of the config points that you're going to have to look at if you start dealing with these huge jobs. So if you're doing 100 to 200 megabytes a second, you're going to have tens of thousands of jobs and you're going to need to make some of these changes. They're documented at that link in the Tempo community call doc, so don't feel like you have to digest this all right now.
A: So our read and write timeouts are up quite a bit, of course, because these become more like batch jobs, which is not the feel we want, but it's just the way it is right now. We have our HTTP timeout set at two minutes on the query frontend, and then we have to massively increase the number of jobs the query frontend will queue, so you'll see there's this max outstanding per tenant setting.
A
Where
I
think
it's
only
like
50
right
on
trace
buyers
to
d
search
or
100,
but
we're
going
to
like
10
000,
basically
100
000,
because
we
make
38
000
jobs
and
we
have
to
put
all
of
those
into
queues.
So
we
have
to
increase
that
quite
a
bit.
There
are
some
other
queue
kind
of
configuration
options
there
and
then
the
query,
also
by
default,
we'll
only
try
a
few
queries
at
a
time
and
we
need
to
up
that
massively
as
well,
because
the
query
becomes
this
proxy.
A
It
no
longer
is
doing
the
work
so
instead
of
the
default,
which
I
think
is
only
like
two
or
five,
which
was
a
default
setup
for
query
by
tr
or
traced
by
id
search,
you
have
to
massively
increase
that
value
as
well.
So
max
concurrent
queries,
I
think,
on
ours
are
like
a
thousand.
Each
of
our
queries
will
handle
a
thousand
queries
at
once,
but
handle
a
thousand
queries
means
proxy
these
to
serverless.
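Pulled together, the tuning described above looks roughly like the following; the field names match Tempo's config of the time as best we can tell, and the values are the ballpark figures from the talk, not recommendations:

```yaml
server:
  http_server_read_timeout: 2m         # backend-search requests run long, more like batch jobs
  http_server_write_timeout: 2m
query_frontend:
  max_outstanding_per_tenant: 100000   # default is around 100; must hold tens of thousands of jobs
querier:
  max_concurrent_queries: 1000         # default is single digits; each "query" is just a proxy hop
```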
A
Basically,
so
these
config
points
are
what
you
have
to
kind
of
hit
inside
of
tempo
to
unlock
tempo
and
unlock
all
the
bottlenecks
inside
of
tempo,
at
which
point
the
bottlenecks
become
your
back
end,
I
suppose,
but
we're
kind
of
trying
to
figure
that
out
right
now
and
then.
Finally,
I
have
just
a
quick
blurb
there
we
support
gcp
and
aws,
and
it's
a
simple
make
command.
This
just
makes
a
zip
file.
A: That's it for me. Okay, so serverless: we're very excited about it, but we're kind of just learning it as we go. We feel like for these extreme, high-load environments it's going to take a little more nuance, maybe a little more in the way of data structures on the backend, to get the speeds that we want, but it is functional and available now, and we use it internally. We're also rolling this out in cloud.
A
Our
volumes
are,
you
know
our
per
tenant
volumes
are
at
a
level
where
we
feel
comfortable
rolling
this
out.
So
all
of
our
cloud
customers
are
going
to
have
access
to
this
to
their
full
backend
search
and
we're
going
to
be
iterating
on
this
in
the
next
months
years.
Frankly,
to
improve
performance,
add
features,
kind
of
build.
This
query
language,
we're
going
to
start
talking
about
and
other
things
cool,
so
search
is
in
a
good
spot.
A
If
anybody
has
any
questions,
you're
welcome
to
ask
now
for
serverless.
I
there's
been
quite
a
few
questions
in
slack
and
in
the
github
issues
about
how
to
set
this
up
and
other
concerns
people
have.
If
you
have
questions
you're,
welcome
to
ask,
or
we
can
say
that
for
the
end
you
can
ask
in
the
chat
or
you
can
unmute.
If
that's
your
style.
F: That's very cool, Joe. I have one question: how did we get past the cold start problem in serverless? I saw, in the graph of the number of jobs that you showed us, it was at zero and then it would massively scale up to a couple of thousand. How did we get past that cold start?
A
Honestly,
cold
start
really
hasn't
been
an
issue
with
gcp
cloud
functions.
I've
kind
of
had
my
eye
on
it,
but
the
bigger
bottleneck
right
now
is
the
gcs
requests.
A
Another
field
I
set
was-
and
I
I
think
I
referenced
this
in
the
docs
in
the
docs
that
are
linked
in
the
tempo
community
call,
but
there
is
in
google
cloud
functions.
There's
like
this
min
instances
you
can
set,
so
it
will
always
keep
one
alive,
and
I
did
see
that
making
a
substantial
difference
in
cold
start
just
telling
it
to
keep
a
single
one
around
seemed
to
be
meaningful
over
zero.
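As a hedged illustration, on the gcloud CLI that knob is the --min-instances flag; the function name and region here are placeholders, and flag availability depends on your gcloud version:

```sh
# Keep at least one warm instance around to soften cold starts
gcloud functions deploy tempo-search \
  --region=us-central1 \
  --min-instances=1
```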
A
Basically,
something
else
I
noticed
was
if
I
do
a
code
update
and
I
push
a
new
package
and
then
I
update
my
function.
A
The
very
first
query
after
that
update
often
took
a
long
time
like
there
was
a
serious
cold
start
issue
there
versus
you
know,
even
if
I
waited
like
a
day.
The
second
query
on
that
same
you
know,
function
was
significantly
less
so
that
very
first
query
seemed
to
have
the
worst
cold
start
issues,
after
that
it
did.
Okay,
so
certainly
with
serverless
functions.
I
think
you're
always
going
to
be
trading
off
right.
A
Some
kind
of
cold
start
latency
in
exchange,
for
you
know
this
huge
kind
of
extreme
burst
capability,
and
I
think
that's
just
always
going
to
be
like
a
challenge
when
we
use
this.
A: Yeah, yeah, the worst one is that one. We should set up a job that just curls it as soon as we roll it out or something, right? The instant we roll out a new version, just curl it, so we take the hit and a customer doesn't. I don't know, it's a good idea. Cool, all right, I'll hand this over to Mario; we've been doing a lot of work on the metrics generator.
A
We've
been
talking
a
lot
about
this
in
the
community
calls,
and
I
think
we
have
some
updates
in
that
world
as
well.
C: Right, yeah. I mean, we gathered some feedback, and I think we were happy with the point we were at; we have a good base to build a solution on. So the design proposal has recently been accepted and will be merged soon, and we encourage everyone to take a look at it. Even though we're accepting it and merging it, please feel free to ping us on Slack.
C
If
you
want
to
chat
about
it,
comment
on
the
the
same
proposal
open
an
issue
open
up
your
against
the
design
proposal.
It's
not
definitive
and
written
in
stone
yeah.
We
will
be
iterating
the
project
and
and
making
changes
as
as
we
built
it
so
yeah.
We
appreciate
all
the
all
the
feedback
that
we
can
get
so.
C
Yeah
well,
the
next
slide
is:
maybe
you
should
have
come.
First
is
so
what
is
actually
the
metrics
generator
for
those
who
don't
know
just
to
refresh
everyone's
memory?
So
the
metric
generator
is
a
new
component
that
we
have
designed
to
solve
this
project,
and
this
new
component
is
used
to
derive
metrics
from
ingested
traces.
C: Initially we have designed two kinds of metrics that we're deriving. One is what we call span metrics, which are basic metrics from spans.
C
Essentially
every
span
is
equal
and
yeah
we're
trying
to
start
extract
as
much
as
as
much
relevant
data
from
his
pants
as
possible,
and
the
other
one
is
service
graphs,
which
is
this
feature
that
we
built
for
the
graphene
agent
in
which
we
inspect
traces
and
try
to
match
expands
as
requests
from
the
client
and
server
perspectives.
C
With
that,
we
can
build
metrics
that
represent
how
services
communicate
and
we're
able
to
display
like
a
complete
graph
of
a
distributed
system
and
in
grafana
breeze
based
from
those
metrics
yeah.
I
mean
this
was
the
quick
introduction
to
the
projects,
but
if,
if
you
want
to
discuss
more
dive,
more
into
technical
details
about
or
have
any
question
feel
free
to,
ask
we're
happy
to
to
go
into
them
and
finally,
well
almost.
Finally,
what
are
the
next
steps.
C
At
the
same
time,
we've
been
working
on
the
design
proposal.
We've
been
building
an
initial
implementation
based
on
that
designed
and
yeah
we're
opening
a
pull
request
with
this
initial
implementation.
We
also
encourage
everyone
to
take
a
look
and
and
and
give
any
feedback
it's
always
appreciated
and
as
for
when
this
will
be
released,
we're
targeting
tempos
1.4
release.
C
So
I
will
say
a
couple
of
months:
it's
a
fur
estimate
and
yeah
also
to
to
announce
that
as
well
as
on
this
being
on
open
source
tempo,
we'll
be
integrating
it
to
grafana
club
traitors.
C
G: Yeah, I'll quickly share my screen so I can do a quick demo of the metrics generator and how we are running it right now. We have the metrics generator deployed in a development cluster internally; it's currently just generating metrics all the time, and we're using this as a platform to try out our new code and see how it works. So, just to give everyone an idea of what these metrics look like, I can show them and the different graphs you can get from them.
G: These can also be generated by the Grafana Agent, but in this case we're now generating them with Tempo itself. So Tempo is ingesting traces and also generating these metrics at the same time. And just to give a quick example of what you can do with them: the span metrics are great for seeing historical data about your spans. These metrics are an aggregated, high-level view over all these individual traces.
G: So, to give an example, this is a histogram quantile, the p95 of the span metrics. I trimmed it down: I only show the spans whose name starts with "HTTP GET tempo api", so that way I only show spans that are part of the Tempo API in this case, and then you can see them here. In yellow, those are the trace-by-ID requests, so all the requests to fetch a trace; in green, those are the search requests; and then you can see what kind of pattern is there.
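The query on screen is, roughly, something like the following PromQL; the metric and label names assume Tempo's default span-metrics naming, so treat them as a sketch:

```promql
# p95 latency of Tempo API spans, computed from the generated span-metrics histogram
histogram_quantile(
  0.95,
  sum by (le, span_name) (
    rate(traces_spanmetrics_latency_bucket{span_name=~"HTTP GET /tempo/api.*"}[5m])
  )
)
```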
G: Is it slowing down? Is it slower than usual? That kind of stuff. What's also really cool: we're generating metrics from traces, so we have the full trace information at the moment we generate the metrics. That makes it very easy to inject exemplars, and so all these little green dots you can see...
G
Those
are
examples
that
we
added
to
the
metrics
about
the
traces
that
these
metrics
are
based
upon,
and
so
what
you
can
do
is,
for
instance,
see
oh
there's
a
slow
request
here,
that's
kind
of
weird:
you
can
see
the
labels
of
this
request
and
then
you
can
jump
to
tempo
to
carry
this
trace
and
then
you
can
immediately
see
like
oh,
why
was
this
request
slow?
G
Maybe
it
was
hanging
somewhere,
whatever
you
can
continue
exploring,
and
what
would
be
really
cool
is
something
we
will
try
to
implement
in
grafana.
Is
that
if
you're
able
to
go
from
a
span
back
to
spam
metrics
so
that
you
can
go
in
the
opposite
direction,
that
you
can
select
a
span?
And
you
might
wonder
like?
G
So
those
are
the
spam
metrics
and
the
other
set
of
metrics
we're
generating.
Are
the
service
graphs
and
so
the
service
graphs?
These
metrics
extract.
G
You
know
information
about
how
your
services
are
linked
together
and
how
the
requests.
You
know
how
many
requests
there
are
between
each
service
and
and
the
latency
between
those.
So
what
you
can
see
here
is,
for
instance,
we
are
collecting
traces
from
from
loki,
and
this
is
like
how
the
loki
components
are
talking
to
each
other.
G
So
there's
a
gateway,
a
distributor
adjuster
and
you
can
see
the
arrows
of
how
the
requests
are
flowing
to
the
system,
and
you
can
also
zoom
in
and
then
see
like
how
many
requests
this
node
is
receiving
every
second
and
also
kind
of
the
latency
and
stuff
like
that.
This
isn't
the
final
ui,
because
it's
a
slightly
older
version
of
grafana,
but
in
a
new
version,
there's
also
the
option
to
see
a
histogram
and
to
jump
to
the
histogram
like
in
a
separate
panel.
So
you
have
the
full
graph
view
there
yeah,
so
yeah.
C
All
this
is
improved
in
as
conor
was
saying
in
in
new
versions
of
grafana.
G: Yeah, this is an example of what the service graphs look like. You always have a client and a server, and then the amount of requests that happened and the latency of those requests. Every client-server pair is what we call an edge; it's like a jump from one service to another. But yeah, this is currently being worked on in Grafana, so expect to see improvements all the time, basically. Yeah, that's kind of what I had to show for the metrics.
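For reference, the service-graph metrics can be queried along the same lines; again the metric name assumes Tempo's defaults, and the label values are placeholders:

```promql
# Requests per second flowing across one client->server edge of the graph
sum(rate(traces_service_graph_request_total{client="loki-gateway", server="loki-distributor"}[5m]))
```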
A: Cool, excited about that. I really want to see it in our operations cluster internally; I think we're only running it in dev right now. I think especially the automatic exemplar thing will be a great way to see trends in your spans and then jump back to the traces that are having the issues. Very cool.
A
Speaking
about
linking
traces
to
other
forms
of
telemetry
data.
I
think
we
have
some
people
from
pyroscope
here
who
are
interested
in
sharing
some
of
what
they've
been
working
on,
which
I
believe
is
linking
traces
and
continuous
profiling.
So
I'll
give
this
over
to
ryan
and
yeah
show
us
what
you
got.
D: Cool, yeah, what's up everybody. I'm Ryan; there are a couple of other people from Pyroscope in here. If you don't know what Pyroscope is, it's, as Joe mentioned, an open source continuous profiler, and something that we've been working on a lot lately is this idea of being able to link different types of telemetry data to profiles. So I will show you; we kind of have the beginnings of getting this in, yeah...
D
I
guess
we
kind
of
have
the
beginnings
of
getting
this
at
more
of
like
a
production
state,
but
one
thing
that
we're
definitely
interested
to
learn
more
about
is
sort
of
how
we
can
design
this
in
a
way
that
would
be
most
useful
for,
in
this
case,
particularly
tracing
how
you
could
get
more.
I
guess
like
value
out
of
your
traces
by
being
able
to
see
the
profiles
associated
with,
you
know,
specific
spans,
and
so
I
don't
know-
maybe
let
me
see,
as
you
said,
the
button
always
moves.
D: Yeah, all right, cool. So yeah, I guess I won't get into too much; I mean, if you aren't familiar with profiling, we do have a demo page you can look at, but just real quickly for anyone who's not familiar: basically, this is a flame graph.
D
It's
a
visual
representation
of
sort
of
resource
utilization
in
this
case
cpu,
where
the
top
is
a
hundred
percent,
and
then
each
node
below
that
is
a
you
know,
function
that
you
know
this
function
calls
this
function
which
calls
this
function.
So
this
is
kind
of
like
a
pie
chart
on
steroids
sort
of
like
the
yeah,
the
the
path
of
your
code
and
resource
utilization.
D: For example, we created our own little ride-share example to represent this. There are multiple endpoints, car, bike, and scooter; you literally just order a car or a bike or a scooter, and that's what this tracing example is showing. There's a load generator generating the load, just so that you have something to look at for the traces and the profiles, and so that's kind of the structure of this application. Right now it's sitting in a branch somewhere.
D
I
guess
I
can
paste
that
in
the
chat
after
this
example
I'm
about
to
show,
but
basically
what
we
have
here
is
you
know,
so
we
have
this
running
in
grafana,
and
so,
if
you
do
like
search,
you
can,
as
I
said,
there's
these
two
there's,
the
load
generator
app
and
then
the
ride
sharing
app
the
ride,
sharing
app
is
more
interesting.
So
if
we
search
here,
you
can
see
a
bunch
of
different
traces
and
I'll
just
click
on
one
random.
D
One
and
again,
as
I
mentioned,
it
starts
off
in
the
load
generator
and
then
from
there
it
gets
into
the
actual
ride,
sharing
app,
and
so
the
thing
that
we
did
here
was
we
added.
We
added
this
like
pyroscope
profile
id
and
pyroscope
profile,
url,
which
for
now
opens
in
a
separate
tab,
and
so
basically,
what
you
will
see
is
that
in
this
tab
for
this
specific
span,
you
know
this
yellow
span
right
here
for
this
specific
span.
This
is
the
profile
for
again
that
specific
span.
D
So
in
this
case
you
can
see
it's
calling
this
car.order
car
function,
that's
what
it's
spending
most
of
its
time,
working
on
and
yeah,
I
didn't
really
get
into
it.
We
we
made
the
ordering
a
car,
go
extra,
slow
on
purpose
kind
of
to
exemplify
some
other
aspects
of
this,
but
basically
but
basically
yeah.
That's
that's
what's
happening
here
and
and
as
you
you
can
drill
down
as
of
right
now
we
only
show
the
the
profile
for
like
the
the
root
span
here.
D: The idea here, what we kind of hope we could do eventually: we do have a Grafana plugin as well, and it would be cool to be able to come into this Explore tab, similar to what somebody was just showing, where you were able to link a bunch of things to the traces.
D
It
would
be
cool
to
have
that
here
we
have
a
similar
query
language
to
prom
ql
or
it's
almost
identical,
and
so,
if
we
actually,
you
know
copy
and
paste
that
in
here
you
can
see
the
obviously.
This
is
not
a
flame
graph.
This
is
a
json
that
would
otherwise
be
a
flame
graph
if
that
was
a
supported,
visualization
type
here,
if
I
do
go
into
the
actual,
let
me
see
yeah,
I
guess
here
we
go
so
yeah.
D
So
basically,
if
I
did
go
into
here-
and
I
were
to
like
paste
that
exact
same
query
in
here
and
hit
apply,
then
you
can
kind
of
see
it
in
the
panel
over
here
but
yeah.
I
guess
I
don't
know,
as
you
may
know,
the
these
panels.
This
is
like
a
custom
visualization
panel
that
we
made
specifically
for
pyroscope,
whereas
the
panels
that
are
available
to
actually
like
visualize
data
here
are
sort
of
hard
coded
to
be
either
graphs
tables,
logs
traces
or
node
graphs.
D
I
believe,
if
I'm
looking
at
the
right
thing
here
and
so
anyways,
that's
something
that
we
kind
of
hope
to
eventually
you
know
potentially
lobby
grafana
to
allow
either
custom
visualizations
here
or
something
other
than
those
sort
of
blessed
visualization
types
or
add
profiling
to
those
visualization
types
there,
but
yeah,
that's
that's
kind
of
from
a
high
level
how
this
all
works.
D
We
can
get
into
more
of
like
how
it's
implemented
or
whatever
is
interesting
to
you
there,
but
I
would
say
that,
from
my
side
from
pyroscope
side,
the
thing
that
we're
most
interested
in
from
you
all
people
who
are
using
tracing
and
using
it
frequently
is
sort
of
what
are
the
yeah
like?
Does
this
sound
interesting,
you
know.
D
Would
this
be
something
that
you
see
an
extra
like
use
case
for
being
able
to
go
from
a
trace
or
a
span
to
the
profile
associated
with
that
and
just
kind
of
in
general
sort
of
you
know
yeah,
whether
it's
for
incident
solving
or
capacity
planning,
whatever
it
might
be,
just
improving
user
experience
in
general
and
being
able
to
kind
of
move
throughout
your
sort
of
the
life
cycle
of
whatever
your
application
is
doing.
D
Yeah
just
honestly
interested
to
get
your
general
thoughts
that
can
help
us
sort
of
plan
as
we
as
we
design
this
and
build
this,
that
it's
we're
building
it
to
be
something
that's
actually
useful
and
we
figure.
This
is
a
space
where
you
know
you
guys
are
most
familiar
with
chasing
or
probably
more
than
most
and
and
so
yeah
we'd
love
to
hear
your
thoughts.
A: Cool. I do have a question about the profile that I think will help me understand how I can use this. Generally with CPU profiling, right, you are sampling the stack so many times per second, and then you're building your flame graph out of that. So when you're doing this link between the trace and the CPU profile, is the flame graph somehow scoped down only to the request, or is it the state of the process as a whole at the time of the request?
H: It is scoped down to the request. Hi, I'm Dmitri, also from Pyroscope. In Go, there's actually a very cool way to basically tell the runtime that this particular function, or I guess this particular goroutine, belongs to this trace or something like that, and that's how we're able to really scope it down, so it doesn't include other things like garbage collection or other background threads in the profile that Ryan showed. Cool.
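What Dmitri is describing maps onto Go's runtime/pprof label support; here is a minimal sketch of the mechanism (the label key and function names are illustrative, not necessarily what Pyroscope uses internally):

```go
package main

import (
	"context"
	"runtime/pprof"
)

// handleOrder runs the request's work under a pprof label, so every CPU sample
// taken while this goroutine (and any it spawns) is running carries the span ID.
// A profiler can then filter samples down to just this one request.
func handleOrder(ctx context.Context, spanID string) {
	pprof.Do(ctx, pprof.Labels("span_id", spanID), func(ctx context.Context) {
		orderCar(ctx) // the expensive work attributed to this span
	})
}

func orderCar(ctx context.Context) {
	// ... work sampled by the CPU profiler ...
}

func main() {
	handleOrder(context.Background(), "abc123")
}
```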
A
All
right,
so
I
think
my
my
instinct
here
is,
I
think,
both
have
a
lot
of
value
like
some
people
do
a
lot
of
just
like
in
out
tracing
right,
just
the
requests.
You
know
the
http
request
in
their
http
request
out
and
internally
inside
the
process.
There's
still
a
lot
of
questions
about
what's
happening,
and
I
think
what
you've
done
here
is
awesome.
A
In
that
case,
like
you,
don't
have
to
do
any
specific
instrumentation,
you
don't
have
to
do
anything
manual
and
you
can
immediately
see
the
state
of
you
know
the
stack
all
of
the
stuff
that
you
did.
An
instrument
is
given
to
you
kind
of
automatically
and
for
free
by
jumping
over,
but
I
also
think
there's
a
lot
of
value
to
knowing
everything
that's
going
on
in
the
process
at
that
time.
A: I've been trying to use profiling more as a general tool, and I think this could help me bridge that gap: get me into Pyroscope more, instead of just bringing it up occasionally when I'm having issues; use it more regularly and, you know, get more insight into your processes.
D
Yeah
yeah,
one
thing
I
would
add
there
too,
is
just
that
yeah
so
like
that
you
know,
is
while
that's
sort
of
like
the
specific
profile
for
that
trace,
I
mean
the
other
thing.
That's
nice
about
you
know
continuous
profiling
is
there
also
is,
if
we,
you
know,
removed
that
those
specific
you
know
profile
id
from
the
query
parameters
there
you
would
get.
You
know
a
full
profile
and
you
can
also
you
know.
It's
also
tagged
appropriately
in
this
case.
D
It's
tagged
by
vehicle
and
region,
as
I
showed
in
that
kind
of
graph,
and-
and
so
you
can
kind
of
you
can
kind
of
zoom
out
that
way
as
well
and
be
able
to
see
sort
of
just
like
at
that
time.
D
And
also
you
know,
if
you're
using
kubernetes
or
something
along
the
or
yeah,
if
you're
using
kubernetes,
you
can
see
you
know
for
that
pod
or
for
that
name,
space
or
whatever
you
can
slice
and
dice
it.
How
you
want,
and
so
we're
kind
of
trying
to
figure
out
yeah
like
what
the
balance
is
there
between
being
able
to
see
as
high
as
you
can,
but
then
also
being
able
to
drill
down
to
like
very,
very
granular
as
well.
D
Yeah
and
then
also
being
able
to
yeah
like
compare
sort
of
like
the
full,
I
don't
know
yeah
demon.
You
want
to
talk
about
what
you
were.
You
were
talking
about
earlier
with
comparing
the
sort
of
like
full
path
of
the.
I
don't
know.
I
guess
exactly
what
you
call
it
like
the
trace,
the
full
path
of
the
span
versus
like
one
that
was
particularly
slow,
yeah.
H
So
this
whole
idea
with
you
know
attaching
profiles
to
spam
started
by
you
know
like
originally,
we
had
only
kind
of
like
full
view
of
your
application
right,
both
garbage
collection,
all
the
threads
everything
some
people
wanted
to.
H
You
know
kind
of
get
more
detailed
information,
and
so
now
we
have
this
feature,
but
now
that
we're
talking
about
this
another
thing
people
are
asking
for
is
kind
of
be
able
to
compare
a
particular
profile
for
a
particular
span
to
kind
of,
like
all
the
other
profiles
for
that
particular
span,
and
so
we're
starting
to
think
about
how
we
could
do
that.
Basically
compare
you
know
compared
to
some
sort
of
a
baseline,
so
it's
some
somewhere
in
the
middle,
where
it's
not
like
your
whole
application.
H
It's
not
a
particular
spin,
but
it's
kind
of
like
in
between
hope.
That
makes
sense.
A
Personally,
I
find
myself
staring
at
metrics
a
lot
because
I
like
that.
I
like
to
get
a
feel
for
the
process
and
what
it's
doing,
and
I
think
it
would
be
neat
if
I
also
had
a
feel
at
the
profiling
level,
like
just
every
other
day
once
a
week,
just
jump
in
there
and
just
see
what
it
looks
like
it
during
a
normal
period
of
time
and
get
a
get
a
intuitive
feel
for
what
the
state
of
your
process
is,
and
I
think
these
kind
of
features
would
would
bring
me
in
more.
A: Go ahead.

F: I had one more question, if you've got time. So yeah, I know you can't see me; I'm part of the Tempo squad at Grafana. First off, really cool product. I had one question around sampling, actually. It seems like you would almost be storing profiles for every request, and that's kind of similar to what we deal with in tracing. Also, Pyroscope seems to be a pull model, from what I've read.
F
So
is
it
like
like
how
do
these
profiles
for
every
request
actually
how
they
get
picked
up?
It's
almost
like.
We
were
dealing
with
this
with
exemplars
when
we
add
exemplars
to
the
metric.
We
ship
those
traces,
but
it's
not
necessary
that
the
exemplar
will
be
scraped
by
prometheus
and
landon
and
land
in
the
u.s.
So
do
you
actually
sample
100,
or
do
you
have
like
these
rules
in
pyroscope
as
well.
H
We
we
do
use
sampling,
profilers,
so
they're
sampling
in
terms
of
like
how
many
you
know
stack
traces.
Do
we
look
at
each
second,
but
when
it
comes
to
actually
storing
them,
we
store
every
single
one
right
now,
like
every
single
profile
for
every
single
profile
id.
H
It's
valuable
it.
It
is
a
lot
so,
for
you
know,
we
we
build
a
pretty
good
kind
of
storage
engine
compression
algorithm
to
be
able
to
kind
of
scale
vertically
and
like
even
if
right
now,
you
install
just
one
pyroscope
server
and
scale
vertically
you'll
be
able
to
get
pretty
pretty
far
yeah
for
the
future.
We
are
planning,
you
know
we're
planning
some
sort
of
more
scalable
solutions
so
that
you
could
do
yeah.
You
could
do
horizontal
scaling,
basically.
F
Very
good,
no,
we
use
bioscope
already.
We
wonder,
like
I
think,
the
highest
volume
cluster
uses
spice
scope
right
in
tempo
yeah
cool
thanks.
D
For
sure,
well,
yeah
we'll
definitely
try
to
add
a
a
tempo
example
as
well
to
the
we're
still
like.
I
said
we
haven't
like
really
been.
You
know
promoting
it.
Yet
we
want
to
do
some
like,
like
some
ui
stuff
around
this
and
some
cleaning
up
the
way
that
we
we
do
it
in
the
back
end
as
well,
but
but
yeah
I
mean
once
we
kind
of
make
more
progress
there.
D
Yeah
we'd
love
to
kind
of
share
it
with
this
community
and
and
have
have
you
all,
try
it
out
and
and
get
some
more
feedback
on
it,
but
appreciate
you
giving
us
a
little
time
to
talk
about
it
today.
A
Absolutely
thank
you
all.
It
was
awesome
if
anybody
has
any
questions
about
pyroscope
for
that
team
or
just
about
tempo
or
continuous
profiling
or
distributed
tracing
or
any
of
these
fields.
That
would
be
a
good
time
to
jump
in.
You
can
chuck
it
in
chat
or
you
can
unmute.
Otherwise,
I'd
like
to
thank
all
of
our
friends
from
pyroscope,
showing
up,
of
course,
as
well
as
everyone
else
for
coming
in
and
participating
in
the
community
call.
E: (question about the resource cost of running the metrics generator; partially inaudible)
G: Like, approximately, yeah. So by adding the generator, the distributor has to do a little bit of extra work, since it's sending requests both to the ingesters and the metrics generator.
G
So
there
will
be
a
bit
of
extra
cpu,
but
it's
not
you
know,
really
significant,
since
we
don't
have
to
do
any
extra
processing
that
we
don't
have
to
do
for
the
investors
and
the
generator
itself
like
memory,
usage
and
cpu
seems
fine,
like
seems
pretty
low
right
now,
but
we
haven't
been
able
to
test
it
at
like
a
good
scale.
Yet
so,
if
this
isn't
like
a
small
dev
cluster,
it
seems
to
be
using
like
have
the
memory
of
the
ingester
right
now.
G
So
that
seems
all
right,
but
we
still
have
to
test
it
at
scale.
It's.
E
A
separate
component
right-
this
is
a
yeah,
okay,
okay
cool.
I
had
forgotten
that
okay,
nice,
is
that
what
also
generates
the
service
graph
that
you
were
showing.
G: If you're ingesting a lot of traces with low cardinality, your memory usage will be lower than if you ingest just a small amount of traces with huge cardinality, so it's a bit tricky to estimate at this point.