From YouTube: 2022-10-20 meeting
Description
cncf-opentelemetry meeting-2's Personal Meeting Room
B: Josh's write-up of our last discussion is really good. I hadn't read it before now.
D: I read it yesterday. I wasn't there two weeks ago, and I thought the notes were well done.
C: But there was something I had briefly brought up towards the end of the last meeting, and I think Josh might have added it to this meeting's agenda.
B: I had made a comment that you could do this somewhat inaccurately.
C: So I had one more topic to chat about, which is more of an SDK-side tail sampling; I can explain what I mean by that. But before that, we can probably start with the linked traces and see if there is anything we can do to improve the situation.
C: So the challenge, or the problem, as we all know, is that linked traces (at least from what I'm seeing in Microsoft) are becoming a common pattern, particularly for long-running operations or anything that involves a producer-consumer pattern. Things like that are the standard use cases for linked traces. If we break it up into, say, two categories, head sampling and tail-based sampling: with head-based sampling, since the probabilistic sampling is based on a single trace, it feels like there is no easy solution to say, hey, my linked trace is probably not going to get sampled in, so I may not get a fully consistent group of all my linked traces together.
C: I haven't done a lot of thinking about it, but I just wanted to first brainstorm with the group, in case there have already been some discussions around this in the past. Since I joined the SIG only recently, I don't have all the context. So that's basically my summary of the problem, at least.
A: I don't recall discussing linked traces before, so this is a new topic. I haven't used linked traces personally, so could you explain to me: when we want to link to another trace from a span, I understand that this link is from a span?
C: So when the link is created, the destination usually already exists. You're basically saying: here is the span ID and the trace ID to which I'm linking. So usually that already exists at the time you're doing the linking.
B: So in a deterministic sampling world, you could tell at the point you created the link whether the trace was going to be sampled, and so you could avoid bothering to create the link if you knew that thing wasn't going to exist in your storage system. But that's kind of a negative decision; you could decide that.
B: I don't know, it's just an observation that you could maybe avoid trying to create a link. You could have a sampled span with a link to a non-sampled span; I don't know if that helps enough.
A: Going back to the example that you provided with producer and consumer, where some messages might sit in some queue for hours: I understand that when the message is picked up by a consumer, we want to create a new trace at that point, right? And since we know something about the producing trace, we could just copy the sampled flag.
B: It also assumes you have access to the entire trace, whereas what may have happened is that the producing trace ID was copied into some package of data that flowed through your processing queue. It's one more piece of information you would have to copy.
C: Yeah, it could work. I agree the sampled flag could be another data point, and then it becomes a new type of sampler. It's not exactly parent-based sampling, but it's kind of like link-based sampling, right? It looks at whether the thing I'm linking to was sampled. Exactly.
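A minimal sketch of what such a "link-based" sampler could look like. This is not the real OpenTelemetry SDK interface; the `SpanContext` and `Link` classes here are simplified stand-ins, and the decision rule (sample if any linked span was sampled, else fall back to a probability) is just the idea discussed above.

```python
import random
from dataclasses import dataclass

# Hypothetical, simplified stand-ins for the SDK types; the real
# OpenTelemetry classes carry more fields than shown here.
@dataclass
class SpanContext:
    trace_id: int
    span_id: int
    sampled: bool          # the W3C "sampled" trace flag

@dataclass
class Link:
    context: SpanContext

def link_based_decision(links, fallback_probability=0.0, rand=None):
    """Sample the new trace if any linked span was sampled.

    This mirrors parent-based sampling, but consults the links
    instead of the parent: if at least one linked SpanContext has
    its sampled flag set, keep the new trace so the linked group
    stays (more) consistent; otherwise fall back to an ordinary
    probabilistic decision.
    """
    if any(link.context.sampled for link in links):
        return True
    r = rand if rand is not None else random.random()
    return r < fallback_probability

# A producer span that was sampled in...
producer = Link(SpanContext(trace_id=1, span_id=2, sampled=True))
# ...causes the consumer's new trace to be sampled too.
print(link_based_decision([producer]))                                 # True
# With no sampled links and fallback probability 0, the trace is dropped.
print(link_based_decision([Link(SpanContext(1, 3, sampled=False))]))   # False
```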
C: It can work for some situations. I think the complexity comes in because of the end-to-end mappings, since different spans in a trace can link to different spans in different traces; there can be n different links from the same trace. Maybe not an 80% scenario, but for making it a general solution, I think the problem comes with: what if link one was sampled in but link two was not? And there are probably some more cases, since it's basically just a graph. Yeah.
A: Well, a somewhat related problem, not really in the problem space but in the solution space, is a desire to observe complete user sessions from a browser. We want not just a single trace, but all traces that belong to the same session. Here the solution that I think might work is that when we see the first trace of a session (so we need, of course, to have the session ID), we generate the R value as usual; but later, when a new trace is created for the same session, we just reuse the same R value instead of generating a new one. This increases the chances of keeping all the traces belonging to the same session as a whole, because they all share the same R value. So in general, I think we should think about providing a mechanism in the SDK to do that.
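The idea of pinning one R value per session can be sketched standalone. The geometric r-value drawing (count of leading zero bits of a 62-bit random number) follows the consistent probability sampling proposal, but the session registry and function names here are illustrative assumptions, not SDK API.

```python
import random

# Hypothetical sketch: one r-value per session, reused for every new
# trace in that session. In the consistent-probability scheme a trace
# is kept when its r-value >= p (p = -log2 of the sampling probability),
# so traces sharing an r-value survive or are dropped together.
_session_r = {}

def r_value_for_session(session_id, rng=random):
    """Return the session's r-value, generating it on first use.

    r is drawn geometrically: the count of leading zero bits within a
    62-bit random number (62 when the number is 0).
    """
    if session_id not in _session_r:
        bits = rng.getrandbits(62)
        _session_r[session_id] = 62 - bits.bit_length()
    return _session_r[session_id]

def should_sample(session_id, p):
    # Keep the trace when r >= p; every trace in the session agrees.
    return r_value_for_session(session_id) >= p

# All traces in session "abc" get the same r, hence the same decision
# for any fixed p.
decisions = {should_sample("abc", p=3) for _ in range(100)}
print(len(decisions))   # 1 -- always the same answer
```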
A: Well, I don't like this idea of going away from the R values. I understand they are not perfect, but as we discussed, using the trace ID and a hash function opens another box of problems, and we are not sure how it will go. But again, the problem that we are trying to solve is not really tied to the R values, because if we decide to go with random bits in the trace ID, again, we have to... well.
C: Yeah, so I saw someone ask in the Slack channel or somewhere; I saw something about this client-side or browser-based use, right?
B: I mean, we're seeing a ton of it right now. There have been a lot of people going: wait, real user monitoring (RUM or whatever) is awfully similar to telemetry, and are there ways we can sort of unify that stuff? And you have all sorts of interesting problems when you do this as a vendor. Honeycomb is dealing with authentication issues, and now a collector becomes a challenge: you get these front-end teams that are trying to do telemetry, and they don't have the ability to stand up their own collectors and things like that. It's really interesting trying to figure out how to solve some of these problems, but we are seeing a tremendous amount of interest in browser-based telemetry and traces and all that kind of stuff. So, I don't know.
B
Smart
whatever's,
in
the
browser
smart
enough
to
not
send
everything
but
to
sample
it
in
some
useful
way,
and
so
now
you're
back
to
head
sampling,
which
you
know
has
all
the
problems
we've
been
talking
about.
So.
D: So I'm curious, from the customer perspective: are the following two problems equally urgent? One is the negative case, where I do not sample a descendant span (or pseudo-descendant) if it would link to something that wasn't sampled; that's potentially addressable with memory, with a Bloom filter or perfect memory or whatever. The other question is: can I make sure that I sample a span or a trace which may give rise to spans that link to it arbitrarily far in the future? So I would ask about those two perspectives and what people use links for. I won't attempt to answer this, because I've never used links, but I understand some of the written motivations for them.
A: With sampling, we look at the algorithms here with some kind of academic angle. What we discuss here is a very pure technical issue, but in practice you will always have other components that play a part in making decisions.
A: So, for example, let's consider this linked-traces example. If we want, in our consumer, to create a new trace, and we see that we are linked with a trace that was already sampled, this is only a hint that we want to sample. We still might need to consider other things, like general throughput: if we are under a high volume of traffic, then we might not sample after all. So it's just a hint, and the mechanism that we are talking about is just making these hints possible; it's not mandatory for the consumer to follow exactly the decisions that were made upstream. And the same is true for regular sampling with regular traces. I talked about this last week: in general, you may have some nodes with an enormous amount of traffic, and because of that they have to make their own autonomous decisions about sampling.
B: Yeah, we had one large customer today asking about making independent sampling decisions. They have different regions that their systems are running in, and most of the time their transactions don't cross regions, but sometimes they do. So they want to avoid massive amounts of traffic going between regions, but still have both sets of regions making the same decisions, or at least mostly the same decisions. This is an interesting question when I think about it.
B: I think what I really want and can't have is this: downstream, when I'm processing that thing that I got five minutes ago, which sat in my Kafka queue for all that time, and now I finally get to it and it errors, I would like to know that I can go back and see the thing that caused it to happen. And obviously, unless I just hang on to everything until everything's done, I can't.
D: Yeah, I guess how I'm leaning toward summarizing this conversation is: we're realizing that some, perhaps most, use cases for span links share a lot of auxiliary requirements when it comes to sampling, as do parent and child traces. So it sounds like we're wanting the same sort of capability of deciding all at once.
C: Yeah, I think the session concept is a bit easier to wrap my head around. Basically, you start a session, you do a bunch of things within that session, and then you end the session. So there is a natural hierarchy, and achieving consistent sampling for everything within that session feels like a tractable problem to solve; it feels more self-contained. Whereas the linking thing is not that self-contained, because you start a trace, and then something happens to it, and that could be minutes later. So that feels like the more challenging of the two problems; session-based sampling feels like the easier one.
B: And you can think of the R value, as Peter's been talking about it, as a session ID. It kind of kicks the problem up one level, but it's a similar sort of thing: there is some number that is shared, as a shared context, that is then used for the sampling decision. Right, right.
A: Yeah, well, the case with sessions is easier not only because it's easier to understand, but also because all traces are created within a session, and we know that. With linking traces, the range of scenarios is different: it might be possible to link traces which both started independently and somehow got linked because of some SDK calls, and we cannot do anything about that, because it's too late.
C: Yeah, going back to Spencer's question: personally, I have heard a lot of requirements around linked traces. Session-based I haven't heard yet, but I can totally see why that could be important as well. I don't know; it depends on the resources. I don't know whether we have the bandwidth, but if we all agree these are problems, then we can potentially have different work streams trying to solve both. I don't know if it's a prioritization and resourcing question between the two, or whether we can put effort into both independently, because they seem to be different problems with different solutions.
B: I think (and you could toss this out, because it's just kind of percolating right now) if we had the concept of, let's call it an R value or a session ID or whatever it wants to be, we could basically say: if you want your links to be sampled according to the same rules as your initial traces, or as all the traces in a session, then you need to pass on this R value and propagate it through the system.
B
So
you
can
create
a
link
that
doesn't
contain
an
R
value,
and
in
that
case,
then
the
sampling
decisions
will
be
made
independently.
But
if
you
pass
on
the
R
value
as
part
of
the
link,
then
at
that
point
you
know
at
that
point
we
could
state
that
we
want
the
sampling
decision
to
be
made
in
the
same
way
yep.
So
this
could
become
an
optional
feature
of
linking
essentially
where
you
know.
B
If,
if
this
is
something
that's
meaningful
to
you,
if
you're
tracking
sessions
or
you'd
like
to
track
a,
you
know,
query
across
your
systems
that
propagate
it
over
time
with
multiple
traces,
then
you
need
to
pass
the
R
value
along
to
seems
like
a
plausible
pace
could
be
made
without
having
to
say
now.
Everybody
has
to
think
about
this
differently.
D: Yeah, and as I was trying to distill what we were talking about, I was looking into the OpenTelemetry API spec for adding a link to a span, and it says you give it the span context. So I double-checked what is in the span context: it contains, among other data, the trace flags, including sampled, and the trace state, including R values. So rather than sending additional hints, those are already part of the data that is being relayed in order to be able to create a link back to a span. So in this direction of solution, we're actually not talking about adding something to the side channel that you propagate metadata through.
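A small sketch of recovering those two hints from a link's context: the sampled bit from the trace flags, and the r-value from the tracestate. The `ot=r:...` key layout follows the consistent probability sampling proposal's tracestate convention, but treat it as an assumption here; this is a hand-rolled parser, not the SDK's.

```python
# Hypothetical sketch of what a sampler could read off an incoming
# link, given that a SpanContext already carries the trace flags and
# the tracestate. Key names ("ot", "r") follow the consistent
# probability sampling proposal, but are assumptions in this sketch.

def sampling_hints_from_link(trace_flags: int, tracestate: str):
    """Return (sampled, r_value) recovered from a link's context."""
    sampled = bool(trace_flags & 0x01)        # W3C sampled bit
    r_value = None
    for member in tracestate.split(","):
        key, _, value = member.strip().partition("=")
        if key == "ot":                       # OpenTelemetry tracestate entry
            for part in value.split(";"):
                k, _, v = part.partition(":")
                if k == "r":
                    r_value = int(v)
    return sampled, r_value

print(sampling_hints_from_link(0x01, "ot=r:11;p:3"))   # (True, 11)
```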
C: Thanks, Spencer, I will take a look at that. So I think one main open issue seems to be the R value proposal: keeping it, bringing it back to the table. So, Peter, is this something you would follow up on with Josh, or...?
C: No, I think in the last meeting Josh was going more towards trace ID randomness as the replacement for the R value and custom R value generation, so I think we need to...
A: Well, we definitely support this new random flag. We want to have it because it helps us in a number of corner cases that we had, in particular with sampling based on R values, which we couldn't do without when we want to sample with a probability different from a power of two. Now, with the random flag, we can do it.
A
So
it's
a
it's
a
big
win
and
there
is
still
this
problem
that
some
people
apparently
really
want
to
have
something
with,
let's
say,
75
percent,
which
in
my
opinion,
doesn't
make
much
sense.
But
of
course
we
cannot.
We
cannot
argue
with
that.
If
they
want
to
do
this
well,
they
can.
A
They
can
use
the
new
mechanism,
but
I
I,
don't
I,
don't
think
what
Josh
wrote
really
eliminates
our
values
at
all,
because
they
are
still
needed
well.
At
least
P
values
are
needed
and
they
both
can
coexist
in
in
in
in
in
in
some
form,
so
yeah
I
hope
we
will
go
get
back
to
discussing
this
in
more
detail
and
while
we
are
waiting
for
this
random
flag
to
become
standard
which
Josh
expected
to
take
about
a
year
just
a
little
bit
pessimistic
in
my
opinion,
oh
we
have.
C
The
yeah
I
agree
they
can
potentially
coexist
and
I.
Don't
know
whether
in
the
session
based
sampling
yeah,
we
would
need
to
propagate
something
which
is
common
across
entrances.
Okay,
I
see
yeah,
they.
A: So, going back to the session example: with R values it's fairly simple, because we just need to ensure that all these new traces have the same R value. With a random part of the trace ID it's much more difficult, but I'm sure we can work something out.
C
We
only
observation,
I
have
on
the
other
proposal.
The
non
r
value
based
proposal
is
in
the
in
the
latest
thing
that
Josh
has
written
up.
It
doesn't
necessarily
depend
on
the
w3c
spec
becoming
standard
and
implementing
right.
The
assumption
is,
it
assumes
that
the
last
seven
bytes
are
going
to
be
random,
even
if
that,
like,
irrespective
of
whether
that
flag,
new
w3c
flag
is
set
or
not.
A: So, if I remember reading this correctly, that meant that we should not wait for the W3C committee to accept this new random flag. We can just use it right now, and treat some portion of the trace ID as random if this random flag is set. We will still test the flag, even though it is a non-standard flag today.
A: This lives in the trace flags, next to the sampled flag. The sampled flag is just a single bit in a byte that is passed as flags, and now we are talking about using the second bit of that byte. I see.
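The flags byte being described can be sketched directly. Bit 0 as "sampled" is the existing W3C trace context definition; treating the second bit (0x02) as the proposed "random" flag is the assumption under discussion here.

```python
# The traceparent flags field is a single byte; bit 0 is "sampled".
# The proposal discussed here would use the next bit as a "random"
# flag. The exact bit position is an assumption in this sketch.
SAMPLED_FLAG = 0x01
RANDOM_FLAG = 0x02   # hypothetical second bit

def parse_flags(flags_byte: int):
    return {
        "sampled": bool(flags_byte & SAMPLED_FLAG),
        "random": bool(flags_byte & RANDOM_FLAG),
    }

# e.g. a traceparent header ending in "-03" would have both bits set:
# 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-03
print(parse_flags(0x03))   # {'sampled': True, 'random': True}
print(parse_flags(0x01))   # {'sampled': True, 'random': False}
```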
C: It's in the traceparent header, the last field.
C
So
Peter
my
impression,
is
slightly
different
based
on
the
discussion
in
the
last
meeting.
It
was
that
the
new
tracity
based
ratio,
Trace
ID
ratio
based
sampler,
will
treat
the
last
seven
bytes
of
the
trace
ID
as
random
and
originally
The
Proposal
was
what
you
said,
which
is
look
at
the
new
flag.
If
the
flag
is
set,
then
use
it
like
you
get
a
guarantee
that
hey
the
last
seven
bytes
are
random,
but
then
I
think
the
question
was
about
hey:
what
happens
if
that
flag
is
not
set?
C
What
should
be
the
fallback
algorithm
and
that's
where
I
thought
the
decision
was?
We
will
still
do
best
effort
in
assuming
that
it
is
going
to
be
the
last
seven
bytes
are
random
because
most
of
the
current
implementations
of
Trace
ID
generation,
including
AWS
x-ray
they're,
all
the
last
seven
bytes-
are
already
randomly
generated,
so
it
doesn't
hurt
to
rather
than
some
other
custom
fallback
logic,
it
doesn't
hurt
to
just
use
the
same
thing,
because
there
is
higher
probability
of
generating
consistent
sampling
with
that.
No.
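That best-effort fallback can be sketched as a threshold comparison over the low 56 bits of the trace ID. This is an illustration of the described idea, not the spec's exact algorithm: the function name and the rounding of the threshold are assumptions.

```python
# Best-effort sketch of the fallback described above: treat the last
# 7 bytes (56 bits) of the trace ID as random and compare them to a
# threshold derived from the sampling probability, whether or not the
# random flag is present.

def trace_id_ratio_decision(trace_id_hex: str, probability: float) -> bool:
    assert len(trace_id_hex) == 32          # 16-byte trace ID, hex-encoded
    low56 = int(trace_id_hex[-14:], 16)     # last 7 bytes = 14 hex digits
    threshold = int(probability * (1 << 56))
    return low56 < threshold

# Low 7 bytes all zero: kept at any probability > 0.
print(trace_id_ratio_decision("0af7651916cd43dd" + "00" * 8, 0.01))   # True
# Low 7 bytes all ones: dropped at probability 0.5.
print(trace_id_ratio_decision("0af7651916cd43dd" + "ff" * 8, 0.5))    # False
```

Because every participant derives the decision from the same trace ID bits and the same threshold arithmetic, independent samplers at the same probability agree, which is the consistency property being discussed.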
D: I had a question going back to the link question and the use cases for it. The downstream span that links back to the earlier one, which I'll call span B: is B almost always a root span, or is it ever part of some trace that was already started? Is B always a root span? That's it.
A
No
I
I
was
just
saying
that
the
API
allows
you
to
link
anything
to
anything
right.
C
Yeah
I
think
to
add
to
that
yeah
there's
no
guarantee,
but
in
the
consuming
example,
if
a
new
Trace
is
being
started
based
on
that
yeah,
it's
likely
going
to
be
a
root
span,
but
there
are
other
use
cases
where
it
doesn't
have
to
be
a
root
span
at
all.
So.
C: The challenge with this propagation (again, I just need to think it through) is where you have, say, trace one, span one; it is doing something, then it links to span two in trace two, and then trace one does some other things, and then some other span in trace one links to some other span in trace three. So basically it's just a random graph of things.
C
So
it's
very
I,
don't
know
if,
like
the
decisions
were
taken
independently
by
some
two
traces
and
then
somewhere
down
the
call
tree
some
linking
happens
then
with
multiple
links
in
place.
I,
don't
know
how
this
kind
of
a
consistent
like
what,
if
I'm,
linking
to
two
different
traces
and
one
of
them,
decided
to
sample
one
of
them
decided
not
to
sample,
and
there
will
be
all
those
kind
of
cases.
So
it's
not
that
straightforward
as
the
session
based
one
but
I
think
it's
an
interesting
problem.
D
And
I
did
just
verify
that
it's
the
case
that
that
the
sampler
interface
among
the
data
that
it
receives
to
make
a
decision
are
the
links
that
will
be
added
to
the
newly
created
stand.
D
So
there
is
like
an
API
I
think
to
mutate
existing
span
to
like
add
more
links
to
it,
but
links
that
are
known
at
span.
Creation
time
are
eligible
to
be
utilized
in
some
way
to
make
a
Samsung
decision
and
like
that's
it,
that's
a
specification
that
exists
today.
So
that's
cool.
A
Yeah,
just
just
a
observation
so
with
with
linked
traces,
we
can.
We
can
almost
certainly
say
that
there
will
be
never
a
solution
that
could
satisfy
all
use
cases,
because
you
can
think
about
a
system
where
every
Trace
is
linked
to
everything
right.
So
we
could
have
a
mesh
of
everything.
Every
possible
Trace
is
linked
to
something
that
already
exists,
and
obviously
we
cannot
support
sampling,
synchronous,
sampling
of
all
these
traces
together,
because
that
violates
the
purpose
of
sampling.
D
Yeah
I'm,
almost
thinking
of
this
as
a
like,
a
like
a
head
sampling
solution
to
this
problem,
but
then
like,
as
you
noted,
Peter,
maybe
like
the
tail
sampling
collection
layer,
is
where
more
systemic
requirements
around
you
know.
Maximum
throughputs
and
things
are,
are
satisfied
and
maybe
a
bad
layer.
It
may
reserve
the
right
to
like
further
shed
Data
before
forwarding
it
on
to
your
storage
system.
C: Yeah, just to add one more clarifying question to that: is that a combination of head-based sampling and tail-based sampling, where you did some head-based sampling and then you were doing additional pruning in some tail-based sampling system?
C
Would
say
you
used
only
like
a
pure
tail
based
sampling?
Then
you
have
the
luxury
of
getting
all
the
even
the
linked
traces
and
then
at
tail
sampling
time.
I.
C
Try
to
achieve
consistent
sampling
by
by
having
some
logic
around
hey,
like
I,
saw
the
stress
it's
linking
to
the
other
stress
and
when
that
Trace
comes
I'm,
going
to
make
a
decision
just
because
I
decided
to
sample
in
this
interesting
trace
and
I
know
that
this
links
to
something
else
I
would
try
to
sample
in
the
other
linked
trays,
also
at
tail
sampling
time.
So
I
think
that
kind
of
goes
back
to
this
issue
that
can
share
I.
C
Think
those
were
kind
of
the
challenges
on
how
to
do
it
at
tail
sampling
time,
but
that,
but
once
hit
something
is
involved.
Then
I
think
the
probability
of
getting
consistent
sampling
across
linked
traces
goes
down
significantly,
because
now
you
you're
not
going
to
be
able
to
get
the
full
group
of
all
the
right.
B
Either
there's
a
timeout,
in
which
case
we
say:
okay,
we
think
we
probably
got
the
trace.
We
make
a
decision
and
we
send
it
on
and
then
we
hit
we
track
that
decision
for
a
little
while,
so
that,
if
new
spans
come
in
as
part
of
that
Trace
we
can,
we
can
also
send
them
along,
but
that
the
period
for
which
we
keep.
B
That
decision
is
not
that
big,
because
we're
kind
of
assuming
you're
all
part
of
the
same
trace
and
that
once
you
have
the
full
once
you
have
the
root
span,
you
know
there
may
be
some
asynchronous
things
that
trickle
along,
but
you
don't
need
they're
not
going
to
be
very
much
delayed,
but
in
this
world
we're
talking
about.
Oh
I
may
put
this
thing
into
a
queuing
system
and
it
may
be
a
long
time
before
it
comes
along.
B
So
we
can't
hang
on
to
that
because
you're,
because
then,
if
we're
just
waiting
to
make
the
decision
for
half
an
hour,
then
that
means
you
can't
look
at
anything
in
real
time,
and
so
in
those
circumstances,
then
you
have
to
keep
the
trace
decision
around
longer,
and
that
was
what
I
was
talking
about
in
that
issue
of
like
well,
maybe
there's
a
shared
redis
cache
or
something
like
that
where
you
can
refer
to
it
and
say,
and
at
a
trace
level
did
I
keep
this
thing
and
and
have
some
sort
of
high
problem.
B
At
least
you
know
that's
where
I
was
talking
about
the
bloom
filter
thing
like
and
maybe
I
don't
have
to
make
this
decision
perfectly.
Maybe
you
know
fast
and
close
to
right
is
good
enough
and
so
throw
it
in
a
bloom
filter,
and
then
you
can
keep
that
around
for
a
fairly
long
time
reasonably
efficiently
and
and
so
at
least,
you
can
make
a
mostly
consistent
tail
sampling
decision.
In
that
case,.
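A small, self-contained version of that Bloom filter idea, remembering "we kept this trace" decisions with bounded memory. Sizes, hash counts, and the hashing scheme are illustrative choices, not a recommendation.

```python
import hashlib

# A tiny Bloom filter for remembering "we kept this trace" decisions
# for a long time with bounded memory. False positives are possible
# (we may wrongly believe a trace was kept), false negatives are not,
# which matches the "fast and close to right" framing above.
class BloomFilter:
    def __init__(self, size_bits=1 << 20, hashes=4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive several bit positions by salting a cryptographic hash.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

kept = BloomFilter()
kept.add("4bf92f3577b34da6a3ce929d0e0e4736")       # a trace we sampled in
print("4bf92f3577b34da6a3ce929d0e0e4736" in kept)  # True
print("00000000000000000000000000000000" in kept)  # almost certainly False
```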
C
Yeah,
if
we
have
appetite
for
this
starting
it,
we
can
just
started
I
just
want
to
get
your
thoughts
on
it.
So
one
of
the
things
like
with
the
tail
sampling
I
think
there
are
multiple
places.
Tail
sampling
could
be
done,
for
example,
like
a
hotel
collected
style
like
a
external
to
the
process.
I
think
there
is
some
interest
from
some
of
the
people.
C
I'm
speaking
to
in
terms
of
hey,
can
I
do
this
site
within
my
process
right
or
at
SDK
time
like
using
like
say,
like
a
custom
filtering
processor
within
the
at
the
at
the
time
of
the
end
of
the
span
when
the
span
is
there?
C
The
challenge,
though,
is
this
means
that
at
the
sampler
interface
like
the
beginning
of
sampling,
it
has
to
lie
and
say:
hey
I'm,
going
to
use
an
always
on
sampler
right,
I
want
everything
and
then
at
the
span
creation
once
the
span
is
actually
done.
Then
this
filtering
kicks
in.
C
So
instead
of
sending
it
to
a
system
like
Hotel
collector,
it's
actually
doing
the
decision
locally.
So
it's
not
going
to
be
a
consistent
trace
and
all
that,
but
but
at
least
those
failed
spans
are
going
to
be
emitted
like
they
like
customers
who
care
about
here.
I
need
to
get
every
single
failed
span
or
a
high
latency
span.
They
they
will
get
the
failed
span
and
like
that
is
the.
That
is
the
good
part.
C
But
then
the
trade-offs
are
the
sampled
flag
is
going
to
be
always
set
to
true,
because
the
always
on
sampler
was
kind
of
like
the
initial
decision
right,
because
you're
kind
of
doing
a
late
decision
at
the
end
of
end
of
the
span.
So
if
some
other
Downstream
service
just
blindly
does
parent-based
sampling,
that
won't
be
right,
because
now
it's
not
exactly
the
parent,
exactly
didn't
decide
to
sample
it
in
so
I
was
just
curious.
If
there
has
been
any
discussions
around
this
topic,
I
did
I
did
prototype
it
out.
C
I
mean
it
seems
to
work,
but
I
think
it
comes
with
a
whole
bunch
of
trade-offs
in
terms
of
it's,
it's
basically
like
a
hacky
approach
to
tail
sampling,
but
but
for
people
who
are
very
sensitive
about,
hey
I,
don't
want
to
go,
manage
my
own
Hotel
collectors
or
like
send
data
out.
It
feels
like
a
something
worth,
considering
some
some
kind
of
a
solution.
A
I'm,
not
sure
if
you
are
familiar
with
was
Baltimore,
did
he
he
wrote
rate
limited,
consistent
probability,
processor
I
think
this
is
the
name
which
is
based
on
our
values,
but
does
more
or
less
what
you
described
so
it
it.
It
makes
decision
for
for
sampling
after
after
the
spans
are
complete.
So
this
is
a
database
sampling
technically,
but
it's
working
in
the
agent
in
in
the
SDK
sorry.
So
if
the
criteria
for
for
selection
would
be
changed
like
you
described,
I
think
this
could
be
adapted
to
to
work
as
as
you
want.
C
Okay,
thanks
Peter
I'll
check
it
out,
but
you
said
rate
limited
consistent
probability,
sampler,
okay,
I
see
so
so
it
makes
some.
It
has
some
selection
criteria
when
the
span
is
complete.
It
will
yes,
so.
A
Right
right,
so
the
the
goal
was
to
limit
the
rate
but
again,
but
have
it
done
in
a
fairly
consistent
way?
Yeah.
C
I
see
cool
I
can
reach
out
to
chat
more
about
it.
Yeah,
in
my
case,
it
is
not
rate
limited
directly,
but
it's
more
like
a
at
the
end
of
the
span.
Completion
I
checked
for
hey.
Is
this
a
failed
span?
If
so
always
continue
the
pipeline
like
send
it
to
the
exporters,
if
not,
do
some
probability
based
sampling,
like
one
percent
or
something.
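The described SDK-side "late" filtering can be sketched standalone. The class names, the `on_end` hook, and the string status field are illustrative stand-ins, not the real SDK span-processor interfaces.

```python
import random
from dataclasses import dataclass

# Standalone sketch of the SDK-side late filtering described above:
# the sampler lets everything through, and a processor at span end
# keeps every failed span but only a small fraction of the rest.
class FilteringProcessor:
    def __init__(self, exporter, keep_probability=0.01, rng=random):
        self.exporter = exporter
        self.keep_probability = keep_probability
        self.rng = rng

    def on_end(self, span):
        # span is assumed to expose .status ("ERROR"/"OK") once finished.
        if span.status == "ERROR":
            self.exporter.export(span)          # always keep failures
        elif self.rng.random() < self.keep_probability:
            self.exporter.export(span)          # sample the healthy rest

class ListExporter:
    def __init__(self):
        self.spans = []
    def export(self, span):
        self.spans.append(span)

@dataclass
class FakeSpan:
    name: str
    status: str

exporter = ListExporter()
proc = FilteringProcessor(exporter, keep_probability=0.0)
proc.on_end(FakeSpan("checkout", "ERROR"))
proc.on_end(FakeSpan("health-check", "OK"))
print([s.name for s in exporter.spans])   # ['checkout'] -- only the failure survives
```

Note the trade-off discussed above: because the head sampler claimed "always on", the sampled flag seen by downstream services no longer reflects this late decision.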
B
But
essentially
the
thing
the
problem:
you're
bumping
up
against
is
a
thing
that
I
have
I,
think
was
I,
don't
know,
I'll
the
premise
of
span.
Immutability
is
I.
Think
a
a
pro
I
think
there
are
many
cases
under
which
it's
really
useful.
To
have
a
span
in
Flight
be
something
that
can
be
modified
baggage
comes
into
this
there
are.
There
are
a
lot
of
places
where
it's
like
hey
that
thing.
That's
up
there
that
haven't
actually
sent
yet
I
want
to
change
it
and
and
I
think
it
was
I.
B: It was naive, maybe, to assume that you could just say: here's this pure, pristine thing that we don't mess with ever once we've created it. And so that's where this comes in: I feel like it was a design decision made early on in this whole process that has complicated a lot of relatively common use cases.