From YouTube: Grafana Tempo Community Call 2022-09-08
Description
Join our next Tempo community call: https://docs.google.com/document/d/1yGsI6ywU-PxZBjmq3p3vAXr9g5yBXSDk4NU8LGo8qeY/edit
Discussed this month:
- Parquet/TraceQL progress
- Helm chart changes/Upcoming operator
- Azure fixes!
A: All right, so this is the September Tempo community call. We have a pretty good crowd here; appreciate everybody showing up. We have a lot of smaller stuff to talk about today, just kind of progress on Parquet and progress on TraceQL. We've got some news there. I'm going to farm a fair amount of this out because, like I said, I've been in meetings all morning and haven't had a chance to review any of this.
A: I guess I can talk a bit about Parquet, because that's kind of been my main push, and I will ask others to talk about some of these other things. So just be, like, super ready if you're on the Tempo team; even if you're not, maybe I'll call on you anyway. Let's start with Parquet in Grafana Cloud and our other internal clusters. We are currently running Parquet in all of our production clusters.
A: It's mostly stable. It's less stable in our ops cluster, where we have a lot more traffic, and we're kind of struggling a bit with different components, but for the most part things are going well, and we think maybe in two weeks we can really ratchet down a few settings that are going to make major changes to cloud.
A: So right now we are creating Parquet blocks, but our settings in cloud are still kind of tuned for v2, and the result of that is that querying is not going to be that much quicker than v2 for the moment. As soon as we get kind of this critical mass of blocks in our production clusters, we can change a lot of our settings over to more Parquet-friendly values. In fact the Tempo, what is it, Tempo 1.5 blog post had those settings. I'll go ahead and put a link to that in the docs, I'm sorry, in this agenda doc here. So there's just a handful of settings that make querying way better on Parquet, and we have some recommendations in that blog post.
A: So Parquet is moving forward. It's kind of a two steps forward, one step back thing, if I'm being totally honest. We make some progress, we add some stability, we get some speed, then, you know, a little problem comes up and we have to kind of reset a bit and get it fixed up. But things continue to move forward, and we are looking to cut 2.0 hopefully by the end of the year, hopefully November or so, with Parquet as a production-ready backend.
A: It is kind of ambitious. We'll just see how things go, and we'll of course always keep you all up to date on this call about how that's looking. Right now things are going slower than we'd like, and we'll just keep talking about it over the next few months as we go forward, but we are running it in production. So that is, you know, something, I suppose. TraceQL? Yeah.
B: Oh yeah, Marty jumping in here. For Parquet, we're making a lot of learnings as we go along. There are so many tunables and different things you can do with Parquet, like the group size, max block size, different encodings, compression algorithms. So we're trying a lot of different things; we're still finding what is working best.
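[For context, a minimal sketch of where knobs like these surface in Tempo's configuration. The exact key names below are assumptions for illustration, not a definitive reference; check the storage block settings in the Tempo docs for your version:]

    # Illustrative only; key names may differ between Tempo releases.
    storage:
      trace:
        block:
          version: vParquet                          # opt in to Parquet blocks
          parquet_row_group_size_bytes: 100000000    # row group size tunable (assumed key)
    compactor:
      compaction:
        max_block_bytes: 107374182400                # max compacted block size (assumed key)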
B: I would say that the defaults in Tempo right now are probably not the best, for sure, so we'll definitely be coming back and improving some of those. But I was wondering, is anyone using Parquet? Has anyone had a chance to turn that on, and have anything to share, good or bad?
A: I know that Alton has, because he's filed like three bugs; the single binary is panicking for him, but we'll figure that out. The distributed mode doesn't panic. And then one thing the Parquet library does is use a lot of unsafe pointers for speed, and so we have some garbage collection issues, and we keep going back and forth on the stability of the single binary.
A: On that note, oh, here, we should show this. We're currently seeing one-plus terabytes per second, I'm going to say "effective" terabytes per second, scanned with Parquet. And what I mean by "effective" is basically, you know, the v2 blocks would have to pull all the data all the time, right? They would just unmarshal proto and then assert your search conditions and keep a trace or drop a trace.
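[A rough, runnable sketch of the v2-style scan being described; this is illustrative code in the spirit of Tempo's Go codebase, not its actual implementation. The point is that a v2 block decodes every trace before checking conditions, so counting query speed against total block size gives an "effective" rate; a Parquet block reads only the columns the query touches:]

    package main

    import "fmt"

    // trace stands in for a fully unmarshaled proto trace.
    type trace struct {
        name       string
        durationMs int
    }

    func main() {
        // A v2-style block: every record is pulled and decoded, match or not.
        block := []trace{{"GET /api", 120}, {"GET /health", 2}, {"POST /api", 800}}
        var results []trace
        for _, t := range block {
            if t.durationMs > 100 { // assert the search condition per trace
                results = append(results, t) // keep the trace...
            } // ...or drop it
        }
        fmt.Println(results) // [{GET /api 120} {POST /api 800}]
    }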
A: So if we continue to use those metrics, just the total size of the block, we're scanning over a terabyte a second right now in our ops cluster. So queries are in a way better spot: they go much faster, we're pulling far less data, and we're just continuing to tune and improve this over and over again. And one time I did get around four terabytes a second. I don't know, it was some kind of fluke, but don't tell anyone that, because I can't reproduce it.
A: But one time I got around four terabytes a second, and I think it'd be awesome to see that more consistently, like four-plus terabytes a second on queries. So while we are, you know, struggling a bit with this, we're continuing to move forward. The promise of the backend continues to be there, and we have a lot of hopes moving forward for both TraceQL and just performance in general.
B: Hey, yeah, so we started implementing TraceQL. I think maybe we talked about this last time, or maybe we said we were going to start. Well, we have started. I don't know, maybe it's halfway; it's going well, I think. Actually, what we were thinking here is, I think we invited Andre. Hey, you have maybe some really cool UI work?
B: That's in progress too, so we're kind of working in parallel between the front end and the back end, and I thought there was some really cool stuff going on there, like with the TraceQL editor and things like that.
C: Cool, so yeah. All of this is still pretty much experimental. It's really not stable, bugs may appear, but yeah, there's been some progress on actually building the editor: autocomplete, syntax highlighting, which I think is pretty cool and should help everyone understand TraceQL well and write some really good queries. So the main thing is that, as soon as we start to build a span,
we have all of the completion options available: the scopes (resource, span), the intrinsic values that are possible, like duration, name, and status, and then all of the tags that we retrieve from the Tempo API. And I think this works pretty well. It autocompletes operators and the possible values, again retrieved from the Tempo API, and this should make writing queries a lot easier.
C: But yeah, I think this is working pretty well for just an experimental status.
C: But yeah, for the editor, this is our progress. It's again getting stuff from the Tempo API. But also, for the response, what we're trying to do is build something that shows us the traces and then which spans we matched using the query.
C: So we want to show a quick look at a few spans, like two, three, four maximum, inside that trace, where we can see which attributes matched, so that the user can have a quick look at how the query performed and what the results are. This will all be linkable, of course. This is just using some mock data, some random data that we generate, but it will be some real data very soon.
B: Hey, yeah, so I think this is part of the TraceQL language design. So I was bringing up a link here for kind of an intro to the language. All of those dots that precede the attribute name in the query mean to actually search both places. So it's searching the span level and the resources in that span; if they're specified in both, the span level takes precedence.
B: So it's searching everywhere, but with those parts of the language you can actually do "resource." and then it's scoped to look only at the resource, or the span. So by default it should be as easy to use as, kind of, the tag search that we have, but you can also dig in and be a lot more precise for that kind of stuff. And then I was going to post a link here for, kind of, the doc here. So this is an intro to the language.
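[A few example queries of the kind being described here; the syntax follows the TraceQL design doc linked above and was still subject to change at the time of this call, and the attribute names are made up for illustration:]

    Unscoped, checks both span and resource attributes:
        { .http.method = "GET" }
    Scoped to span attributes only:
        { span.http.method = "GET" }
    Scoped to resource attributes only, which lets the backend pull less data:
        { resource.service.name = "api" }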
B: It doesn't have, like, the full capabilities and everything that we have, but I think it's a good intro to read and kind of catch up on these concepts.
B: Oh yeah, another thing here: can you pop up the operators? We also have a lot more expressions there. So currently, for tags, sorry, for the equals sign, yeah, we'll be able to do integer math, string regex, so you can handle case sensitivity or insensitivity, things like that. It's going to be a lot more powerful, too.
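[A few examples of those richer expressions, again using the syntax from the design doc and subject to change; the attribute names are made up for illustration:]

    Regular expression match on an intrinsic field:
        { name =~ "GET /api/.*" }
    Numeric comparison combined with a scoped attribute:
        { span.http.status_code >= 500 && resource.service.name = "frontend" }
    Case-insensitive matching via a regex flag:
        { span.http.method =~ "(?i)get" }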
A: What's going to be kind of enabled here is that the amount of work you ask of your backend is going to be kind of transparent to you. It makes sense to the user, and you can train your users and yourself to write efficient queries. If you write a query that says ".name" or, say, ".tag", then it's going to look in both resource and span; it's going to pull a lot more data.
B: Yeah, I think that's a really cool part of the attributes. I guess we did touch on it; our idea there was to actually bring back the parts of the span that are relevant to the query. So there's still some things to figure out there, but basically, if you're filtering on an attribute, once we've scanned through and found that attribute, we'll go ahead and return it to the UI so you can see it. I think that doesn't make sense for things like equals.
B: You know, method equals GET, right? There's only one value. But if you have something like a regex or a greater-than, then you have a range of values, and we're bringing that back, so it's useful to see what the actual value was that it matched. Yeah, there's definitely a ton of stuff here, and it's pretty exciting.
A: Cool, let's move on to additional work. We've recently moved forward to Go 1.19. I think we're mostly on top of that. It's been out for, what, how long, a month or two? Not too long, or am I way behind? A couple of months.
I want to introduce one of our newer team members, Jenny. She's going to tell us about this. She super doesn't want to, because, you know, she's kind of new, but that's what she gets: welcome to the team, now you get to talk about Go 1.19!
F: Hi, I'm Jenny. So, one thing: I think the code was not that particularly different, but golangci-lint gave us a lot of linting problems. They also needed to upgrade their tooling to 1.19, and I don't think they predicted that there were going to be a lot of issues with that. So their first release with 1.19 support was kind of terrible, and now we're using their latest version, like 1.49.
F: They got rid of a few of those checks because they were deprecated and the original contributors were just not responding anymore, so I guess they're just getting rid of those checks. But it's fine now; everything passes on the CI job, so I think we're good to go.
A: Appreciate that work. I asked Jenny to upgrade us to Go 1.19, expecting it to be like a find-and-replace of 1.18 to 1.19, and it turned into like a week of work of tracking down a million weird golangci-lint issues, and ones that weren't even really issues. So thanks to her for getting us moved forward to 1.19.
D: Yeah, so for the last few months, you know, the other database teams have been sort of working on Helm, and it's been more important to the community; it seems like a lot of people deploy with Helm, even though we internally have been jsonnet users for a long time. So the Tempo team is a little slow to join the bandwagon, but we're getting there. So for the last few weeks I was taking a look at a little bit of a refactor, to try to align with some of the approaches in the Helm chart.
D: This is mostly, you know, around code reusability and a little bit of patterns that have been developed by the other teams. So, seeing that, I raised an issue with the community; if you have comments, feel free to look at the issue. And then I've got a big gnarly PR that does the refactor. That's currently working locally, and all testing is going very well. There are a couple of changes
I need to go back to, and some feedback to address in the PR, so that'll probably be coming in the next couple of weeks. Also, we included GET (Grafana Enterprise Traces) install support, so this will be a way that enterprise customers can take advantage of the installation and all of the foundation of the Helm chart that already exists today. So that's pretty cool. If you have feedback, like I said, feel free: drop it on either the issue that I filed or on the PR itself.
D: Yeah, I agree. I think we should balance that, and I'm kind of relying on feedback from the community about that approach. You know, I'm definitely building on the work of the community. This is not a rewrite in any way; obviously, during a refactor, certain portions of the code are going to get smushed, but yeah, this is definitely built on the work that the community has done over the years.
D: We want to, you know, encourage contributions in the future and that kind of thing as well, but as the company, as Grafana, gets more involved with the Helm charts, it's going to make sense for us to align in some ways, and so, you know, individual changes between the charts might be looked at a little bit differently in the future. But yeah, definitely a concern, and we do have some breaking changes coming in this change, so those are documented. Be on the lookout: if you're just a, you know, bump-the-version-and-run-it kind of person, you might want to review the changelog for this next upgrade.
D: And then, yeah, if there are no questions about that, I'll just give a brief mention of the operator. So, speaking of community, this is a community-driven project. I think it's building on the work that the Loki team has already accepted into their project, and so this is sort of modeled after that. But we've got an operator; there's a tempo-operator channel in the community Slack.
D: I would encourage you to join and give feedback if you have it. There's a design doc and a review and all of that, so all of that is starting to take shape, and, like I said, it's kind of modeled after what happened in the Loki project, so we're probably going to take a lot of nuggets from that adventure.
A: So, like Zach said, we don't use Helm, and we don't use operators either at Grafana, and we really rely on the community a lot to make these things a good tool, like a worthwhile tool to use. So I appreciate all of the input on the Helm chart. Please jump in on that PR that Zach has and give some feedback, and then, if an operator is something that interests you, we can all make this better. The best way to make this better is by chatting in that channel, giving us some feedback, and making use of it.
A: These kinds of things, we really do deeply rely on feedback from the community to make these particular components work, again because we are not consumers of them ourselves. But yeah, we are very happy to see some interest here. Operators are a cool feature of Kubernetes, and hopefully that project will move forward and we'll see something cool in the next few months.
A: All right. On that note, I want to introduce Suraj, who showed up on our team and solved a problem we'd had and ignored for about a year and a half, with only two or three issues in the repo on it. So he's going to give us some tech details on DNS lookups and how he fixed an Azure error that we've literally been ignoring for six months, occasionally going back and trying to fix it and band-aiding it. He got into the nitty-gritty details and fixed it.
E: Hello, hey, I'm Suraj. So if you are running Tempo on Kubernetes with an Azure backend, you probably would have seen these errors. We saw these errors in our clusters.
E: We ended up going deeper, and then we found out that this happens when we try to do a DNS lookup, and in Kubernetes DNS lookups are amplified. I attached a GitHub issue in the agenda doc that shares details, but mostly what happens is: there is something called ndots, and Kubernetes has ndots configured to 5 by default.
E: So if you don't have fully qualified domain names, Kubernetes will try to do local DNS lookups, and that results in, I think, around 11 local lookups before the name is finally resolved. That used to take a fair bit of time, and we used to see timeouts when trying to call Azure blob storage from compactors.
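[To make the amplification concrete: a typical resolv.conf inside a Kubernetes pod looks like the sketch below; the storage hostname is a made-up example:]

    search <namespace>.svc.cluster.local svc.cluster.local cluster.local
    options ndots:5

    # "myaccount.blob.core.windows.net" has fewer than 5 dots, so the resolver
    # first appends each search domain in turn (often as both A and AAAA queries):
    #   myaccount.blob.core.windows.net.<namespace>.svc.cluster.local.
    #   myaccount.blob.core.windows.net.svc.cluster.local.
    #   myaccount.blob.core.windows.net.cluster.local.
    # ...and only then tries the name as-is. A trailing dot, as in
    # "myaccount.blob.core.windows.net.", marks the name fully qualified and
    # skips the search list entirely.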
E: We did two fixes. One was: we increased max connections so that we don't, you know, create new connections every time. That made the problem better, but it didn't solve it completely. Another fix we did: we wanted to make the Azure backend store's URL fully qualified by adding a dot at the end of the TLD, but it turns out that needs a bit more work.
E: I have an issue attached, so you can look at that issue and see why that's the case. Mostly, Azure is not happy when you pass a Host header with a trailing dot. So as a workaround, we lowered our ndots to 3 in the Azure clusters, and the max-connections and ndots changes solved that issue for us.
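[The ndots part of that workaround is standard Kubernetes pod configuration; a minimal sketch:]

    # Pod spec fragment: names with three or more dots now bypass the search
    # domains and go straight to the upstream resolver.
    spec:
      dnsConfig:
        options:
          - name: ndots
            value: "3"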
But yeah, another PR that came out of this concerned all the Tempo services. If you're running Tempo in distributed mode, you have a bunch of Tempo-related services, right? And those services also do the DNS lookup stuff, so we made them fully qualified by default. So if you are using Tempo on Kubernetes, you probably want to do that; that will improve latencies, you know, because you get to skip all the local DNS stuff.
E: I have a follow-up issue on this one. We do want to support, you know, doing the fully qualified domain name lookups in Azure and the other backends also, because, from what we know, this should save us, you know, DNS lookup time when we are doing reads and writes, and theoretically we do a lot of reads and writes. Practically it also depends on the cluster load, but this should help us on that front.
E: Yeah, anyone who's running Kubernetes, and this is not just Tempo: if you're running any other applications that make, you know, API calls to outside services, you may want to, you know, make use of fully qualified domain names for them as well.
A: It would just fail some percentage of the time, and this fix, I guess, just massively reduced the number of requests we were making to DNS, so always a win, I suppose. Cool, thanks to Suraj for fixing that; it's been something that's been bugging us for a long time. It was difficult to prioritize, and we appreciate him jumping in and helping us. On that note, I'll open up the floor a bit. We have quite a few people here from the community. If you have any questions about upcoming work, or even just how to operate
what you're working with now, if you're having issues running Tempo, or just want to talk about anything that's going on in your tracing world, please feel free to unmute and ask, or you can just chuck it in the old chat, or you can even type it in the agenda. I guess, maybe, do people have permission to type in the agenda doc? I'm not even sure it's possible. I have permission to type in the agenda doc.
A: All right, maybe I'll ask Marty a question. Marty, what are you going to do this fall?
A: We submitted a paper for Tempo, and it got... what was it, Marty, what was our result? It was waitlisted, right? Yeah, yeah. We submitted one about the new columnar backend storage and some of the other storage types we'd attempted, flat buffers and proto and all this, and it was waitlisted, and we haven't heard anything back. So that's the way it goes, I guess.
A: Oh, Heds will be there? Cool, I'll see you, Heds. All right, well, if there are no questions or thoughts from anyone else, we can go ahead and wrap this up. I want to thank everyone for showing up; this is a pretty big call.
A: Just keep coming to these calls; we'll keep you updated on Parquet and TraceQL, and keep you in line with what we're working on and how far along we are. I expect every month there to be new news, and we'll hopefully deliver this soon, question mark, soon-TM, in the next few months, we'll see; we're working hard over here. I appreciate the support from the community and everything you all are doing. Keep following issues,
keep following PRs, keep doing what you're doing. I think we're building something awesome here. Everyone take care, and I'll see you next month.