From YouTube: 2021-07-15 meeting
B
Ed, since you kicked this off last week, I'd like you to keep doing so unless you feel otherwise. Sure, partly because I know if you let me start talking, I've got a very particular angle: I want to talk about probability sampling, and I think it's nice to have you holding up the bigger picture.
A
148, can we get a link to that? Yeah.
A
Oh well, we've got a good group at this point. Hopefully more people show up, but I think we can kick it off. So the two...
A
I guess three major things have been brought up when we went over what people wanted out of sampling and OpenTelemetry, and they seem to be: one...
A
How does OTLP transmit sampling information to the backend? Currently, sampling is something that happens in band, but there's no sampling information being sent to the backends. So, regardless of what kind of sampling is going on, the backends would like to know about it.
A
Two: in general there's interest in some kind of weighted or probabilistic sampling, ideally something that could become standard within OpenTelemetry, agreeing upon some shared way of doing it. People today can implement their own weighted sampling if they want, as a sampling plug-in, so the reason we're talking about it here is that we'd like to standardize it. And then the third is remote sampling, the ability to...
A
...supply rules. So not just weighted sampling but rule-based sampling, based on what Jaeger is currently doing, so that a control plane can enter the picture and you can do things more like tail-based sampling, where, based on what information you have, you're adjusting to specifically capture certain kinds of things and specifically suppress certain other things, in order to strike a balance between getting all the data that you know you need while perhaps dumping data that's expensive but useless. So those are the three subjects we've been talking about.
A
I think that is more than enough. I'm just curious, real quick: has someone got an itching desire to talk about something that's not one of those three things?
A
Sounds like no. Okay, good, just making sure I'm not shutting somebody down here. So, given that, I think we should dive in. Josh has been charging ahead on the probabilistic sampling front and has an OTEP on the subject. Josh, would you like to talk about your OTEP?
B
Yes, I will just project it for a minute here. I want to say it's been open for, I don't know, a couple of months now, and it started out as me just trying to introduce a bit of terminology that I wanted us all to use when we talk about probability sampling. But that opened so many questions that eventually I decided to fill in the rest of it, because I couldn't get away with a simple document that was originally just trying to talk about the term "adjusted count" and the term "unbiased."
B
What does it mean to be sampled in an unbiased way so that we can count? I was trying to get that far, and it just left too many questions for most people, so I continued writing. The problem now is that it's quite large. I've got collapsible sections to fold parts of it away here, so you don't actually have to read the whole thing.
B
But it is fairly extensive, and I've gotten a lot of feedback from Atmar, who's here, as well as Yuri, the Jaeger chief. I would say, from my understanding of the feedback, that it's mostly been positive and accepting of what this document's saying, with maybe questions about opinions that could go either way.
B
But I think we've reached the point where enough of the people who do understand have read it, and I'm not getting enough more feedback to go any further; I can't merge it without more. I don't know what to do other than ask for more people to review it. After all this background, we get into the really detailed discussion about trace sampling: why is trace sampling so different, or so special, compared with ordinary sampling, whatever...
B
...that means. And this is where we have to discuss the terms "head sampling" and "tail sampling," and the idea of propagating head probability, which is an old idea coming out of Dapper, and what that actually means. I don't want to try and explain everything I've written here in this room, because it's quite a long document at this point.
B
And we are left, in my opinion... if you read through all of this document, you'll get to the point of understanding how we can include probability, and its inverse, which is effectively a count. That concept, I think, is pretty established if you read through this. What we're left with is, unfortunately, no way of knowing when a trace is complete, and I think of that as the bigger problem.
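The "probability and its inverse is effectively a count" idea can be sketched in a few lines. This is an illustrative sketch, not any OpenTelemetry API; the function names are made up, and it only shows why summing inverse probabilities over the kept spans estimates the original span count.

```python
# Each span kept with probability p stands in for 1/p spans of the original
# population, so its "adjusted count" is the inverse of its probability.

def adjusted_count(probability: float) -> float:
    """Statistical weight of one sampled span: the inverse probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("sampling probability must be in (0, 1]")
    return 1.0 / probability

def estimate_total(sampled_probabilities: list[float]) -> float:
    """Estimate the pre-sampling span count by summing adjusted counts."""
    return sum(adjusted_count(p) for p in sampled_probabilities)

# Four spans survived sampling at p=0.25; each represents 4 originals.
print(estimate_total([0.25, 0.25, 0.25, 0.25]))  # 16.0
```

An unsampled span (p = 1.0) simply contributes an adjusted count of 1, which is why counting works uniformly across sampled and unsampled data.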
B
I didn't try to solve that problem in this document. I have a few ideas about how it could or should be done, but it's almost orthogonal to sampling. It says: I don't know when all my spans are arriving, and I have no way of knowing when all my spans are arriving. Before we started sampling, you could just assume all my spans are supposed to arrive; that's all you had. When you begin sampling, you don't have that, so now how do you know when a trace is complete?
B
I think that's our biggest problem today, and if you read on you'll find more, but it boils down to this idea of a trace ID ratio sampler. This was the one that OpenCensus really wanted; it was the thing that OpenCensus was promoting, and it solves some of our problems. We don't have to propagate a probability: the head probability does not have to be propagated, it can be independently recorded by everybody.
B
If you look at your trace ID and you have a consistent hashing function, now everybody in a trace can independently decide whether or not to record their span, and everyone can record their sampling probability. But then how do you know when you've gotten a complete trace? You might be missing some spans, and you know you're missing one when it's a parent: I've got a span, and I don't know my parent, and this trace was sampled, so maybe my parent wasn't sampled. But there's...
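The consistent trace-ID decision just described can be sketched as follows. This is a hedged illustration, not the exact algorithm of any SDK: it assumes every participant shares the trace ID and maps it to a unit interval with the same stable hash, which is what lets each node reach the same keep/drop decision with no propagation at all.

```python
import hashlib

def keep_span(trace_id: bytes, ratio: float) -> bool:
    """Map the trace ID to [0, 1) with a stable hash; keep if below the ratio.

    Every node that applies this function to the same trace ID with the same
    ratio makes the same decision, independently.
    """
    digest = hashlib.sha256(trace_id).digest()
    value = int.from_bytes(digest[:8], "big") / 2**64
    return value < ratio

# A W3C-format 16-byte trace ID (example value):
tid = bytes.fromhex("4bf92f3577b34da6a3ce929d0e0e4736")
assert keep_span(tid, 1.0)       # ratio 1.0 keeps everything
assert not keep_span(tid, 0.0)   # ratio 0.0 keeps nothing
```

Because the decision is a threshold on a fixed hash value, it is also consistent across ratios: any trace kept at a small ratio is kept at every larger ratio, which matters for the completeness discussion below.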
B
...also the opposite case, where I, as a parent, decided to be sampled, but somehow my children were not sampled, and I never collected them, and I don't know it. That is a case where we just have misinformation today. OpenCensus was trying to solve that by having a count of children; OpenTelemetry, very early on, over a year ago, just tossed that out because it was causing trouble. This is why it was...
B
...there: to be able to tell when you've got a complete trace when you're sampling. I'm going to stop talking, because I think I've presented the trouble. What's here is, I'd say, a complete proposal for probability sampling, and what we're left with is no way of knowing when traces are complete. But that problem was already there, so I want to turn it back to the group, maybe back to Ted. This goes as far as proposing text, and you know it's a draft, but I think it will work.
B
What I just unshared was trying to say that, for every one of the built-in samplers, we will output a span attribute named sampling.adjusted_count if it was probabilistically sampled, according to some rules about being unbiased. Then, in sort of standard operations, you should be able to see a stream of spans and turn them into metrics as you stream-process.
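The stream-to-metrics claim can be sketched concretely. This is an assumption-laden sketch: the attribute name sampling.adjusted_count follows the discussion above but is not asserted to be a final spec name, and the span records here are plain dictionaries rather than any SDK type.

```python
from collections import defaultdict

def spans_to_request_counts(spans: list[dict]) -> dict[str, float]:
    """Estimate per-operation request counts from a sampled span stream.

    Each span contributes its adjusted count; spans without the attribute
    are assumed to be unsampled and count as 1 (an assumption, see lead-in).
    """
    counts: dict[str, float] = defaultdict(float)
    for span in spans:
        weight = span.get("attributes", {}).get("sampling.adjusted_count", 1)
        counts[span["name"]] += weight
    return dict(counts)

spans = [
    {"name": "GET /users", "attributes": {"sampling.adjusted_count": 10}},
    {"name": "GET /users", "attributes": {"sampling.adjusted_count": 10}},
    {"name": "GET /health", "attributes": {}},  # unsampled, weight 1
]
print(spans_to_request_counts(spans))
# {'GET /users': 20.0, 'GET /health': 1.0}
```

The point is that the consumer never needs to see the dropped spans: the weights carried on the surviving spans reconstruct the count.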
B
Atmar has pointed out in a couple of comments here that you might not know, because there are cases where you want coordinated information about what's being sampled in order to make better inferences. That's a case where we could go further: we could have more information about what your sampling strategy as a process was, and then you could potentially know something about spans that were sampled in different locations and compare those pieces of information. That is beyond what I'm trying to put into this document.
B
It's really just saying that if we've done it correctly, you should put an attribute called sampling.adjusted_count. I want to say, asterisk: sampling.probability also works; it just requires inverting that number over and over again as you work with it, which I don't particularly like. I think it makes a struggle, or a hurdle, for users to understand what we're doing if we call it probability. I think it maps to user intuition when we call it count, but that is my personal opinion.
B
So, concretely, what's left in this document: it goes as far as documenting how we could propagate head probability, and how you can use that. When you're a parent sampler and you want to record adjusted count, you can get it from the head probability in your context.
B
That's one way to do it. The other option that we've discussed in this document... there's quite a bit of resistance to propagating head probability. It's never been done in the open-source world. It was done in Dapper, but it wasn't clear exactly why it was being done. If you are using a parent sampler and you want to encode probability...
B
...that's one thing you can do, and so this document says that. But I'm getting so much pushback against encoding head probabilities that I've just called it an option, to enable the kind of inflationary sampling that Dapper did, which is where you need to know you're being sampled even when you're not being sampled. That requires head probability to be propagated as well. But it seems that the broad state of the world, from where OpenCensus left us, was pushing us away from propagating head probability and towards using...
A
But if we do propagate head probability, does that reduce the need to know when traces are complete and to get a good count of how many spans are in your average trace?
B
Well, I don't think it necessarily helps in that corner case on the tail, at the leaf of the tree of a trace. Knowing head probability means you can output...
B
...it means you can output that probability. But I guess the premise I have is that we can't tell the difference between a span being sampled out and a span being dropped because of collection difficulty. So if we have no collection difficulties, then I think you're allowed to say, yeah, there's no problem with dropped spans or incomplete traces either. If you have perfect collection, I don't think there's an issue with incompleteness, but that doesn't seem like a realistic picture of the world.
B
Anyway, I like your perspective; that's actually a good reminder that we can't solve the halting problem. I don't want to completely steer this conversation in one direction for the entire hour, but last hour something came up, mentioned by Jaeger, that addressed some of this stuff, and I'm trying to draw a connection between that and this one here.
B
So one thing you could do, to know when traces are incomplete in a more realistic picture of the world, is to have two samples coming out of every process. One is your trace sample, which is going to follow all the rules of trace completeness, saying: if my trace ID matches my trace ID ratio, I will output a span.
B
The problem is, if you have your trace ID ratio set to zero, you'll never see a span. If you set it to a very, very small number, you may never get a span all day, right? So you're still waiting for that first complete trace, and the problem is you just don't know how long you have to wait for it; you're waiting for a span that's going to come through with the right trace ID.
B
Essentially, if we knew the trace ID that you were waiting for at every particular node, or for every particular span name, then we would have a second way of looking. Okay, so you have a trace, and it's got a parent named something, and I have a child. So if you start recording your parent's probability in a separate sample, that's kind of what I'm trying to get at: the separate sample is going to be a sort of system-wide sample of every span.
B
But it's going to tell you: what is my minimum threshold that I am sampling at? So you can have two signals. One says: I have traces, which may or may not be complete, coming through. And I have a per-node sample of each span that tells me its basic configuration, so that I can know at least something about every span. Yuri mentioned that Jaeger does this, that there are non-probabilistic samplers, like one per hour.
A
I think I missed... cool. So the part that I got is: if you set your probability low enough, then you'll just never get traces, right? There's a threshold where, if you go below it, you're basically saying you just turned it off for these kinds of spans, for this system. And in order to just know... it seems almost like a poor person's form of service discovery or something, but what it was saying is there should be a floor for sampling, which is: no matter what, we want...
A
We
want
a
a
record
of
everything
that
happens
at
some
minimum
once
per
hour
or
something
like
that.
Otherwise,
parts
of
the
system
just
disappear
and
there's
some
practical
reasons.
For
for
why?
That's
that's
helpful.
I
I
don't
know
what
the
practical
reasons
are,
but
you
know
intuitively,
I'm
like
yeah.
I
can
see
why
you'd
want
at
minimum
one
trace
your
system
to
have
like
indexed
one.
B
Sort
of
suggesting
that
it's
useful
to
sample
just
spans
on
their
own,
not
like
not
considering
their
trace
relationships
so
that
if
you
had
a
samp,
if
you,
if
you
did
a
tail
sample
of
every
span,
you
know
one
per
hour,
then
at
least
your
collection
infrastructure
sees
one
span
per
span
name
per
hour
and
if
you
encode
your
sort
of
minimum
trace
id
ratio
at
that
point
two
then,
across
your
system
you
know
minimum
trace
id
ratios.
A
...just going to, once per hour, get a span, because these nodes are not participating in some kind of algorithm where they do it at the same time. They're just deciding, okay, at minimum, I'm just going to choose to say... well, the reason I'm supposed to...
B
I was thinking this, and this might not be clear: in the world of tracing, you can also just focus on spans. They are a stream of events with attributes; they have a measurement, which is their duration, and they have a count, which is one. So every span equates to one histogram event, for example, and sampling spans is essentially saying: I want to collect, approximately, a histogram of just that span, to tell me something about that span independent of its traces.
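The "every span is one histogram event" idea can be sketched as follows. This is an illustrative sketch under stated assumptions: the bucket boundaries, the (duration, weight) tuple format, and the function name are all made up for the example, and the weights are the adjusted counts discussed earlier.

```python
def latency_histogram(spans, boundaries=(10, 100, 1000)):
    """Weighted latency histogram from sampled spans.

    `spans` is a list of (duration_ms, adjusted_count) pairs; the final
    bucket collects everything at or above the last boundary.
    """
    buckets = [0.0] * (len(boundaries) + 1)
    for duration_ms, weight in spans:
        for i, bound in enumerate(boundaries):
            if duration_ms < bound:
                buckets[i] += weight
                break
        else:
            buckets[-1] += weight
    return buckets

# Two spans kept at p=0.5 (weight 2.0) and one kept at p=1.0 (weight 1.0):
print(latency_histogram([(5, 2.0), (250, 2.0), (40, 1.0)]))
# [2.0, 1.0, 2.0, 0.0]
```

Because each bucket accumulates adjusted counts rather than raw counts, the histogram approximates the distribution of all spans, not just the sampled ones.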
B
So
if
we
had
that,
you
could
then
also
piggyback
in
some
information
about
your
trace
sampling
ratio,
so
that
you
can
know
when
your
choices
are
complete,
as
this
is
the
highest
level
idea,
I
just
wanted
to
sketch
it
like.
If
we
have
two
samples
you
can
maybe
solve
the
problem
of
completeness.
B
Yeah, and in the overall picture, another idea... I'm just talking about ideas, things you could do when you're sampling spans. You know, Lightstep has a product we call Streams, which is basically a stream view of your spans. So I'm sort of picturing my Lightstep product right here, but I'm also describing it for a bunch of people who don't know it. Every span event can be turned into essentially one latency measurement and put into a time series with count information.
B
So
if
you
had
two
separate
samples
going
on
your
system,
one
is
the
trace
id
ratio.
Well,
I
you
know
what
it
doesn't
even
have
to
be
two
samples,
the
trace
id
ratio.
Sample
is
independent
for
every
span
and
if
you
it's,
your,
the
use
of
the
word
floor,
ted
was
kind
of
the
key.
There
is
that
if
you
have
a
independent
trace
id
ratio
sample,
then
eventually
every
span
will
output
one
sample
over
time.
D
...on attributes on every span. So if you want to have that very flexible, then this approach wouldn't work. I mean, what's possible is maybe, if every sampler reports the minimum sampling rate it knows that it applies, on a regular basis, to a collector or wherever...
D
...and if you know the minimum sampling rate which is active at a certain time, then you can at least filter the complete traces. All trace IDs which are below this minimum sampling rate have to be complete. Maybe there are more, but at least you can filter those, right?
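Speaker D's filtering idea can be sketched concretely. This is a hedged illustration: the hash-to-unit-interval mapping is made up for the example (in practice it would have to be the same mapping the samplers use), but it shows why any trace ID hashing below the system-wide minimum ratio is guaranteed complete, since every participant sampling at or above that minimum must have kept it.

```python
import hashlib

def id_to_unit(trace_id: bytes) -> float:
    """Map a trace ID to [0, 1) with the samplers' (assumed) stable hash."""
    digest = hashlib.sha256(trace_id).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def guaranteed_complete(trace_ids, min_ratio: float):
    """Trace IDs every node sampling at >= min_ratio must have kept whole."""
    return [t for t in trace_ids if id_to_unit(t) < min_ratio]

tids = [bytes([i]) * 16 for i in range(4)]
# With min_ratio 1.0 all traces are guaranteed complete; with 0.0, none are.
assert guaranteed_complete(tids, 1.0) == tids
assert guaranteed_complete(tids, 0.0) == []
```

As D says, there may be more complete traces than this filter returns; it only identifies the ones that are complete by construction.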
B
I
think
that's
roughly,
the
idea
is
pitching
here
as
well.
Is
that
you're
you're
going
to
have
a
threshold
attached
to
every
span
and
that
as
long
as
it's
not
super
dynamic,
you
can
just
infer
then
for
a
period
of
time.
The
minimum
threshold
for
this
stand
was.
B
I
think
you
can
also
break
the
situation
by
saying
there's
a
particular
span,
name
that
has
different
thresholds
on
every
different
process
and
then,
like
that's,
why
I
mentioned
the
use
of
your
parent.
Like
the
at
some
point,
the
question
is:
did
I
have
a
parent
that
could
have
called
me
and
is
my
threshold
too
low
to
have
been
included
in
that
in
that
trace?
This
is
what
you're
really
trying
to
answer,
and
so,
if
you
knew
threshold
by
parent,
that
would
also
refine
that
information,
but
just
simple
threshold.
A
Yeah,
I
I
do
think
we
need
to
get
together
and
explain,
like
I'm
five
version
of
this
stuff,
because
our
end
users
are
gonna
need
to
understand
this
as
well.
A
Also,
I'm
gonna
need
to
understand
it
better,
but
part
of
that
is,
I
think,
the
ex,
when
I
think
about
explain
like
if
I'm
five
just
making
it
very
clear
what
the
goals
are
right,
like
it's
helpful
for
people
to
be
like
our
goals,
are
the
following,
because
when
people
think
about
sampling,
there's
like
a
simple
goal
of
I
just
don't
want
them
all,
because
it's
too
expensive,
but
it
gave
a
list
three
goals.
Yeah.
B
So
my
list
was
it
for
tracing,
especially
we
reduce
tracer
overhead,
meaning
we
don't
actually
turn
on
a
tracer
and
attach
it
when
you're,
not
sampled.
That's
a
pretty
big
goal
for
people
and
you
can't
in
some
solutions.
Just
don't
do
that,
like
you
can
sample
downstream
and
it
doesn't
help
you
with
tracer.
Overhead
produces
complete
traces
and
spans
are
countable
like
those
were
the
three
requirements
and
we
just
spent
a
lot
of
time
talking
about,
produces
complete
traces
and,
and
ideally
spans
accountable.
Just
falls
out
of
this.
B
You
know
simple
attribute,
name
sampling,
adjust
count
or
sampling
probability,
but
it
turns
out
that
reduces
tracer
overhead
and
produces
complete.
Traces
are
at
odds
with
each
other
yeah
and
and
this
question
about
whether
we
propagate
head
probability
is
connected.
But
it's
it's
also
kind
of
independent.
A
Yeah
yeah,
it
would
be
good,
maybe
we
can
work
together
or
maybe
this
is
a
thing
this
group
can
help.
Produce
is
just
like
try
to
draft
a
second
document
that
condenses
all
of
this
stuff
into
into
something
that
can
be
as
like
consumable
as
possible
to
people
who
are
just
trying
to
like
wrap
their
heads
around
what
what
we
want
to
do
here.
That
would
prime
them,
so
they
could
then
go
read
the
otep
and
go.
I
don't.
A
This would help, right? Let's come up with the condensed version of what we're trying to solve: these are the goals; if we do these things, we solve this goal; this creates this problem, so we want to add, you know, the second kind of sampling in there, to solve the problem that our first kind of sampling can create. Essentially, right.
B
It's
like
that's,
that's
more
complicated
than
it
needs
to
sound.
I
I
mean
if
what
you're
so
so,
I
guess
maybe
one
of
the
feelings
of
this
otep
was.
I
tried
to
take
a
neutral
stance
and
say
there
are
three
ways
that
are
known
in
the
industry
and
like
I'm,
going
to
describe
them
all
to
you,
because
it
seemed
like
there
needed
to
be
some
discovery,
but
I
I
just
spent
a
while
encouraging
us
to
focus
on
the
trace
id
ratio
solution.
B
The
way
it
will
work
is
by
recommending
the
use
of
trace
id
ratio
sampler
everywhere
and
we
will
come
up
with,
and
this
is
like
a
little
bit
of
work
that
I'm
gonna
that
I
can
do
or
we
can
do-
is
one
essentially
new
attribute
that
could
be
a
resource
attribute.
That
says
not
just
what
is
my
probability
effective,
but
what
is
my
minimum
threshold?
B
I
think
so
one
new
attribute
might
solve
this
problem
and
then
a
short
document
that
says
we
are
going
to
put
a
new
probability
attribute
we're
going
to
put
a
new
threshold
attribute
and
then
we're
going
to
turn
on
we're
going
to
recommend
the
use
of
chase
id
ratio
sampler
and
then
it'll
just
work,
it's
kind
of
what
we're
what
we're
kind
of
saying-
and
this
I
I
said
two
spa
two
samples
earlier,
but
really
what
it
is,
is
one
sample
being
interpreted
in
two
ways.
B
You
can
look
for
complete
traces
by
checking
to
make
sure
that
your
minimum
threshold
is
met
everywhere
or
you
can
just
look
at
spans
because
that
works
when
a
trace
id
ratio,
sampling
picture
and
you
can
develop
a
picture
of
every
span
and
for
every
span.
You
can
look
at
your
threshold
because
it's
on
a
resource,
and
that
gives
us
a
bare
minimum.
I
think
for
complete
choices
plus
sampling,
that
that
would
be
great
to
get
okay,
I
mean
I
can
take
the
action
I've
done
there.
A
I think it would be helpful. I would rather you write sampling than histograms; you can read the other one, but this is like: I propose we do these precise steps. That would be okay. Okay, I have two questions, and they're related, before I want to let other people talk about other things. One is propagating information in band.
B
Well,
my
understanding
is
that
there's
never
been
a
practice
of
counting
spans
in
the
open
source
world
and,
as
a
result,
the
meaning
of
sampling
developed
into
a.
How
do
I
pick
spans
and
how
do
I
pick
traces,
essentially
mechanism
and
it
essentially
lost
the
connection
with
sampling
statistics
and.
B
I
I
that's
all
the
explanation
I
have
really
is
that
so
this
idea
that
was
baked
into
dapper
and
requires
you
to
think
about
parent
probabilities
or
it's
a
little
harder
to
think
about
it's
a
little
hard
to
justify
the
sort
of
causality
you're
using
causality.
You're
saying
because
I
started
a
trace
means
you're,
because
I
injected
a
a
trace
context,
you're
going
to
extract
it
and
use
it.
That
is
happening
as
a
causal.
B
That's
a
cause.
One
was
caused
by
the
other,
and
the
probabilities
are
the
same.
In
other
words
like
you
have
to
reason
through
all
this
stuff
to
understand
how
the
probability
that
you
get
in
your
head
is
useful,
even
though
it
came
from
your
root.
I
it
just
requires
a
lot
of
logic
and
probability,
reasoning
that
I
think
the
average
user
never
has
encountered
in
their
life.
So
it's
just
like
this.
This
big
big
hurdle.
B
You already asked the question in the OTEP; this is very down in the details, technical, in the weeds now. But the question is: if you are a parent sampler and you don't know your head probability, what should you do? Because if you do nothing, it's going to look like I can count this as one span, but that's the wrong count. I'm proposing that we explicitly put in a zero; zero says...
B
You
can't
count
me
and
it
says
it
leaves
open
the
option
that
you
can
count
me
in
other
ways.
So
if
I
am
forced
to
the
parent,
sampler
will
be
used.
I
don't
know
my
head
probability.
I
I
should
record
a
adjusted
count
of
zero
to
say
that
you
can't
count
this
span
by
itself.
You
need
to
know
its
root.
B
After you assemble them into traces, you can turn them into a metric, because you can look up the root probability, or whatever was part of the remote sampling configuration. At your root, you know your probability, and if you encode that in span data, then after assembling traces you can go look at your span data for the root, extrapolate a count, and then every span in that trace gets the effective count multiplier of its root count.
B
What Lightstep was asking for (and this came as a surprise to us, I think) is that there was no practice in the open-source world of counting spans on arrival. And, I mean, the unfortunate history is that Lightstep built up its product without building in sampling from the beginning.
B
We
were
just
assuming
it
would
happen
later,
so
we
count
spans
on
arrival
one
to
one
and
as
soon
as
we
start
sampling,
we
just
want
to
know
the
count
and
we
didn't
have
it
and
then
we
here's
a
group
now
to
talk
about
it.
So
I
guess
the
quick
answer
to
what's
lost
head
is
that
the
parent
sampler
is
lost
and
you
have
to
assemble
traces
to
count
spans
got
it
so.
A
...it all the way back in order for this to work, which could be extra tricky when you think about the fact that the root span is often going to be on a web browser or a mobile device, or somewhere else that's annoying. I mean, we will probably end up in situations where actually the server-side span is something of a root span, and it's going to send information back to the client to be like...
A
Yeah, I guess I'm pointing out that there's some work in general to deal with the client problem, but in general it would be nice if we didn't have to rely on that one span being present, or it kind of foobars it. So...
B
And
then
there
really
are
two
ways
forward.
One
is
the
one
that
I
described
earlier,
which
is
where
we
do
trace
id
ratio
sampling
on
every
node
and
then
you
know
they're
independent,
and
that
way
you
don't
have
to
propagate
probabilities.
That
was
why
that
option
is
appealing
and
then
the
other
option
that
we
haven't
really
discussed.
B
Is
you
stick
with
the
parent
sampler,
but
you
propagate
your
head
probability
and
then
you
can
count
spans
because
you
take
the
head
probability
and
you
make
it
your
adjusted
count
and
then
and
then
you
keep
doing
that
through
your
propagation
and
then
every
span
encodes.
This
account
that
was
effectively
carried
all
the
way
from
the
root
all
for
the
purposes
of
counting-
and
this
is
where
the
yeager
community
kind
of
shrugs
and
says
I
don't
know
we
haven't
ever
done-
that.
A
But
I
mean
we
can
do
it,
the
only
you
know
the
only.
I
guess
I
got
another
final
question
here
which
might
tip
your
answer
to
this
stuff,
which
is
so
far
we're
also
presuming
that
all
of
these
systems
can
be
configured
to
match
each
other
right
like
currently
the
way
I
understand
this
system
that
you're
describing
is
there
would
have
to
be
some
consistent
configuration
information
spread
to
every
client
right
every
tracing
client.
B
In
the
trace
id
ratio-
sampler
approach,
I
believe
it
just
requires
independently
that
you
just
you
each
node
can
decide
on
its
own
how
to
limit
spans,
and
you
know,
you'll
get
some
complete
choices.
According
to
that
discussion,
we
had
in
this
other
solution
where
you
do
propagate
head
probability,
and
then
you
need
to
make
sure
everyone's
got
a
propagator
that
supports
it.
That's
I
think
the
bigger
hurdle
right
and
I
and
I
adjust
updating
everyone's
propagators.
B
You have to invent this head probability and put it in the trace state. All the samplers that are head samplers want to do that, so either your probability sampler or your trace ID ratio...
B
Sampler
should
encode
head
probability
so
that
your
downstream
children
can
use
parent
sampler
and
still
be
counted,
and
I
think
that
that
is
a
question
that
the
technical
community
is
a
little
uneasy
about,
like
it
means
modifying
all
of
our
samplers
and
means
propagating
this
new
piece
of
information
which,
which
is
of
questionable
value
to
a
lot
of
people
and
that's
and
and
then
I
think
this
is
why
the
open
census
system
had
moved
away
from
propagating
head
probability.
B
Yeah
I
mean
I
don't
know
like.
A
It
makes
sense:
okay,
yeah
because
I
was
gonna
say
we
we're
living
in
a
world
now,
where
you're
going
to
have
multiple
independent
systems
participating
in
a
trace
right.
So
I'm
going
part
of
my
system
runs
on
lambda,
right
and
so
part
of
my
tracing
is
going
to
go
through
api
gateway
and
all
this
other
aws,
stuff
and
and
they're
going
to
produce
some
trace
data.
For
me
that
I
want
to
get
and
they're
also
going
to
propagate
this
stuff
along.
A
So
the
two
questions
there
are,
you
know:
will
they
be
able
to
participate
in
sampling
to
some
degree?
This
has
always
been
a
tricky
problem.
A
No
one
really
wants
to
trust
the
the
sampling
bit,
though
I
think
that
that'll
go
away
for
other
reasons,
but
but
there
still
needs
to
be
a
way
for
for
independent
clients
that
can't
be
have
their
configuration
synchronized
to
be
able
to.
It's
got
to
be
workable
in
that
situation.
A
Basically
and
then
the
other
situation,
which
I
think
is
perhaps
temporary
but
temporary
on
the
the
time
span
of
years,
is
we
can't
presume
everybody
is
going
to
be
running
the
trace
context,
headers
because,
for
example,
on
lambda
again
in
aws,
they
are
going
to
convert
that
to
amazon
headers
because
they
just
from
a
practical
timeline
purpose.
It's
it
they're
able
to
get
like
tracing
output
into
those
systems
much
faster
than
they
could
go
into
those
systems
and
say,
like
change.
A
How
you're
dealing
with
headers
and
like
start
supporting
this
new
header,
that's
just
for
practical
purposes.
Turning
out
to
like
take
longer
for
the
reasons
tracing
is
always
difficult
for
everyone
to
implement.
Is
you
have
to
go
around
to
a
bunch
of
independent
teams
and
be
like?
A
Can
you
please
like
like
make
this
change
and
redeploy
it
not
for
you,
but
for
these
other
people
and
like
this
system
won't
work
until,
like
everybody
does
that,
and
so
that's
gonna,
that's
gonna
take
time,
so
any
system
that
anything
we
add
to
open
telemetry
today.
A
That
says,
like
this
feature,
only
works
if
you're
using
trace
context
is,
is
like
a
little
sketch
and
yeah
in
an
ideal,
really
good
point:
you
don't
have
to
rely
on
that.
It
would
be
nice
to
rely
on
that.
But
that's
like
the
kind
of
thing
where
maybe
like
a
year
or
two
from
now,
we
could
reliably
say
like
if
you're
not
using
the
like
standard
in
the
w3c,
then
you're
just
like
not
doing
http
correctly,
and
you
should
do.
C
So in that case: having graceful degradation of features. In other words, here is your sampling solution, but with the caveat that, if you're not using W3C trace context, you're going to lose some of this, and you may have to do more, for instance, coordination of probabilities from each system.
A
...should go for first, I think. That's what Paul's saying: let's just say propagating information is too fraught for now. So if we can add a solution today that doesn't require that, that does solve a bunch of our problems, that would be a good sampling system to add to OpenTelemetry today. And maybe that's not perfect; maybe once people get used to that kind of sampling, they will start to want some other features that we could provide if we started propagating information.
A
But
we
want
to
look
at
that
as
like
a
thing
we're
gonna.
Maybe
maybe
we
go
tackle
that
problem
a
year
from
now
or
something,
but
it's
not
the
thing
that
we're
gonna
try
to
jam
out.
You
know
over
the
summer.
B
I
feel
like
this
makes
sense
to
me.
The
idea
is
that
we
are
going.
I
it
sounds
like
we
are
going
to
promote
the
trace
id
ratio,
sampling
solution
that
I
will
write
a
like
try
and
write
another
document
that
explains
this
at
a
higher
level
and
then,
but
I
have
to
update
this,
I
think
we've
we've
touched
on
us
a
potential
solution
for
the
incomplete
traces
column.
B
I
think
that
needs
to
go
into
here
and
then
the
output
of
this
for
a
customer
like
for
light
step,
let's
say,
is
we're
going
to
recommend
to
our
customers.
B
If
you
use
parent
sampler
we're
going
to
get
receive
spans
that
we
can't
count-
and
I
think
what
we'll
do
is
hopefully
get
the
specs
to
say
to
put
information
in
those
spans
to
say
that
you
can't
count
these
so
so
we'll
be
accepting
we'll
be
accepting
it
there's
definitely
configurations
of
sampling
that
we
can't
count,
but
we
will
recommend
our
customers
don't
use
those
right
and
this
I
think
this
will
lead
us
in
the
direction
that
we
want
to
be
yeah.
A
B
That would be your adjusted count. So instead there would be several options: one is the adjusted count, if you know it, and otherwise it's information about how sampling was done, hoping that you can construct something useful in the back end. The only one I know of is to say that the parent sampler was used; therefore, track down the parent and figure out its count. So you could have a string. That's one of the nice things about OTel's attributes: they're typed values. So you could just say the sampling adjusted count is a number, if it's a meaningful number (including zero, which is meaningful), or it's a string that says:
B
The name of your sampler, "see parent", and that would be very explicit: go look at your parent. It means that your adjusted count is a string that says "see parent". I don't know, that actually seems reasonable, yeah.
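The typed-attribute idea being discussed can be sketched roughly as follows. Note the attribute semantics here, a value that is either a meaningful number (including zero) or a sampler-name string such as `"parent"`, are illustrative assumptions from this conversation, not the OpenTelemetry spec:

```python
# Sketch of the "adjusted count as a typed attribute" idea: a span's
# sampling attribute is either a number (adjusted count = 1/probability,
# with zero being meaningful) or a string naming the sampler to consult.
# The attribute shape and the "parent" sentinel are assumptions.

def adjusted_count_attribute(probability=None, sampler_name=None):
    """Return the typed attribute value describing a span's sampling."""
    if probability is not None:
        if probability == 0:
            return 0  # zero is meaningful: span represents no population
        return 1.0 / probability  # adjusted count
    # No locally known probability: record whom to ask instead,
    # e.g. "parent" -> track down the parent span and use its count.
    return sampler_name

print(adjusted_count_attribute(probability=0.25))       # 4.0
print(adjusted_count_attribute(sampler_name="parent"))  # parent
```

A back end can then branch on the attribute's type: numbers feed directly into counts, while strings tell it how to recover the count some other way.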
A
Just making sure there is enough structure in how we do this, so that if someone comes up with another sampling algorithm they want to use, or a custom sampling algorithm, this information goes to the back end in a structured way. In other words, we don't want different types of sampling to camp on the same stuff. The back end needs to be like:
A
"Oh, I know what this data means," or not. And so if it gets some kind of sampling information that it doesn't understand, it can just post a warning on the UI that says: you're sending me traces with some kind of sampling; here's what it said, and I don't know what it means, but it's probably screwing me up. I'm hearing from OpenTelemetry that you're doing sampling, and I don't know what that means.
B
B
Equals parent, or equals some string that you can presumably infer something from. And this could be the name of the sampler, which is actually specced out; it's a bit loose, since it's its own freeform string. I guess I could propose that, yeah. But if it's the trace ID ratio sampler, it'll include that threshold, which is what we're after. So this is starting to take on structure. So, yeah.
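As a rough illustration of why the threshold is "what we're after": a TraceIdRatio-style sampler typically compares some bits of the trace ID against a threshold derived from the probability, so reporting that threshold on each span lets a back end recompute the probability. The bit width and exact comparison here are assumptions for the sketch, not the spec's definition:

```python
# Sketch: a TraceIdRatio-style decision over the low 64 bits of the
# trace ID. Knowing the threshold is equivalent to knowing the
# sampling probability. (Illustrative assumptions, not the OTel spec.)

TRACE_ID_BITS = 64

def threshold_for(probability):
    """Threshold such that P(id_bits < threshold) == probability."""
    return round(probability * (1 << TRACE_ID_BITS))

def should_sample(trace_id_low_bits, probability):
    """Sample iff the trace ID's low bits fall below the threshold."""
    return trace_id_low_bits < threshold_for(probability)

print(threshold_for(0.5) == 1 << 63)  # True
```

Because every participant derives the decision from the same trace ID, all spans of one trace get the same verdict, which is what makes the ratio sampler's counts consistent across services.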
D
D
Currently we have this TraceIdRatio-based sampler and the parent sampler, but there's also the possibility to implement your own sampler, basically based on the interface. And if someone is using that, we don't actually know what's happening, right, and...
B
Yeah, that was kind of what Ted was just proposing, and I was thinking about just having a resource attribute that tells you the name of the sampling algorithm, or something like that, so that if you're doing something custom you can fill it in, and then you would know, Atmar, for Dynatrace customers, yeah.
D
B
B
So we really want to know trace-ID or simple probabilities. I agree, and you've pointed out a few of the reasons why.
B
D
The question I have is: do we really want to support independent sampler implementations, or just those defined in this thing?
B
Well, I think it's nice that we leave that open, and I feel like you probably have legitimate explorations or ideas about how you can make better sampling. But what I was hoping was that we get sort of the lowest common denominator specced out, so that no matter who's sampling, ideally we can count them. That's all we really wanted to get out of this, and I think that you could be doing better sampling tomorrow that we could still count.
B
D
The interpretation of the data is, if we recommend all OpenTelemetry users to just use the defined samplers...
B
D
B
I also feel, and maybe I'm wrong, all right, tell me what you think, that the idea is: if there is an unbiased probability, it doesn't really matter how it was computed, at first order. At first order I can make a count; I can make span metrics. And yes, there's more information I'm just leaving on the table there, but does that sound valuable to you?
B
What do you mean? Like... I mean.
D
B
D
D
It makes a difference. I mean, if you're interested, for example, in counting the traces which touched some service A and some service B, then it's difficult if you do not know whether the sampling decisions have been correlated somehow or not. If you're only using the TraceIdRatio-based sampler, then you know how everything works and you can extrapolate that number in a correct way. So I would... yeah, I think.
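The extrapolation being described can be illustrated with a minimal sketch: when each sampled trace's probability p is known and the decisions behave like the TraceIdRatio sampler's, each sampled item stands in for 1/p items, so summing 1/p gives an unbiased estimate of the population count. This is a generic Horvitz-Thompson-style estimate, shown here purely for illustration:

```python
# Sketch: extrapolating a population count from sampled traces whose
# sampling probabilities are known. Each sampled trace with
# probability p represents 1/p traces in the unsampled population.

def estimate_total(sampled_probabilities):
    """Unbiased population estimate from per-item sampling probabilities."""
    return sum(1.0 / p for p in sampled_probabilities)

# 10 traces sampled at p = 0.1 estimate a population of 100:
print(estimate_total([0.1] * 10))  # 100.0
```

The caveat raised above still applies: if two services sample with correlated but unknown decisions, you cannot combine their per-service probabilities this way for cross-service questions like "traces touching both A and B".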
B
Here's what I think: we can be a little bit more open, which is to say, we're going to recommend using only the built-in samplers, and we're also going to spec out how to convey a probability in case you've got your own fancy sampler. You can just give a probability, but you should also give the name of your algorithm, and we've got several reasons now why to do that. One is: we want to know when you're using something totally different that we don't know about.
B
So it's like a non-standard implementation. But the other is, just for the trace ID ratio sampler, we need to know the threshold on every span; we want to know that for completeness. So it sounds like two attributes: one is your adjusted count or your probability, and one is your algorithm, with any threshold that we need to know.
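The "two attributes" summary might look something like the following sketch. The attribute keys here are placeholders invented for illustration, not names from the OpenTelemetry specification:

```python
# Sketch of conveying sampling information as two span attributes:
# one for the adjusted count (or probability), one for the algorithm
# name plus any threshold. Keys like "sampler.name" are placeholders.

def sampling_attributes(algorithm, probability=None, threshold=None):
    """Build the hypothetical pair of sampling attributes for a span."""
    attrs = {"sampler.name": algorithm}
    if probability is not None:
        attrs["sampler.adjusted_count"] = 1.0 / probability
    if threshold is not None:
        attrs["sampler.threshold"] = threshold
    return attrs

print(sampling_attributes("trace_id_ratio", probability=0.25))
```

A back end that recognizes the algorithm name can count spans exactly; one that doesn't can still surface a warning that an unknown sampling scheme is in play, as discussed earlier.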
D
B
B
Vendors don't want to see that, and the spec can leave it open in case some great new idea comes along, but for now we will only know how to count the built-in stuff, and we will recommend away some of them as well, yeah. It sounds like... I know, Chad, you want us to get off this call in less than two minutes.
B
D
I mean, this also relates to whether you want to support just discrete sampling rates or not. Yeah, I mean...
C
B
Okay, we'd better get off this call, and I think we've...