From YouTube: 2021-07-08 meeting
A
In the Slack, people wouldn't mind just filling in the attendee list.
E
Looks like you are the organizer of this meeting, and I think you should kick us off today at least, and see what happens.
A
Exactly, we've got a good crew, happy to see people, and I thought I would kick us off. Well, actually, someone said: should we create a dedicated channel in the CNCF Slack? Yeah, totally.
A
Definitely, just go create that right now. But I thought I would kick us off by maybe going around the room and having people say, just briefly, what they are hoping to get out of sampling in OpenTelemetry.
A
I think that would be useful, because sampling is actually kind of a broad topic, and my impression when talking to people was that there are actually multiple different aspects of sampling that people are interested in. We've got people wanting sampling for purposes of backwards compatibility with current systems; some people are interested in pushing the field forwards. So I thought that might be a good place to start, so I can kind of just go around the room.
A
I guess, because Zoom is not stably ordered, I will just maybe call people out and they can say what brought them to the meeting. So I'm just gonna go by the order on my screen, and that'll start with Otmar from Dynatrace.
F
Yeah, thanks for organizing this meeting. The goal, as we see it, is that the specification is such that sampling is well defined, the sampling process itself, such that we can estimate things and extrapolate counts in an unbiased way. I think this should be the goal of every one of us, because biased sampling does not help, and we want to have sampling defined such that our analyses basically also work on sampled data.
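The unbiased extrapolation Otmar describes is usually done by attaching an adjusted count of 1/p to each surviving item; summing those weights gives an unbiased estimate of the original population. A minimal illustrative sketch (not code from any SDK; the field name `adjusted_count` is an assumption for illustration):

```python
import random

def sample_spans(spans, p):
    """Keep each span with probability p; record adjusted count 1/p on survivors."""
    kept = []
    for span in spans:
        if random.random() < p:
            kept.append({**span, "adjusted_count": 1.0 / p})
    return kept

def estimate_total(kept):
    """Unbiased (Horvitz-Thompson style) estimate of the original population size."""
    return sum(s["adjusted_count"] for s in kept)

random.seed(7)
population = [{"name": f"span-{i}"} for i in range(100_000)]
kept = sample_spans(population, p=0.01)
print(round(estimate_total(kept)))  # close to 100000 on average
```

Because every kept span carries its own weight, the backend can estimate counts without knowing which sampler produced the data.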
G
Hey everybody, yeah, much of the same. I work for Honeycomb, and what I'm really interested in is ways that we can represent sampling decisions in the specification, for many of the same reasons Otmar was mentioning. We do estimation of population size when doing aggregate functions, and having a sample rate in the span, or attached to the span, is really helpful for us.
G
I'm also interested in whether there is additional work happening to package standard sampling algorithms. We want to make sure, of course, that we're doing our part on the implementer side and making sure that they work with Honeycomb. I also have a curiosity about whether any additional sampling work is going to happen in the collector, because that's something we're keeping an eye on, but I know that's probably not front of mind for this group.
A
Awesome, okay. Next up we've got Przemysław. Sorry if I'm butchering your name.
C
No worries, that's good. So I work for Sumo, and yeah, like all the folks here, we're using OpenTelemetry a lot. I have a couple of reasons why I'm participating in this SIG. The first one is just like the folks before me have reported.
C
The second reason why I'm here is we had a bunch of discussions on tail-based sampling in the OpenTelemetry Collector. Juraci was leading the effort and had some really nice work done. We also did some work on our side; we called it the cascading filter processor. It has some adaptive sampling rate capabilities, so essentially you can say that you want to have maybe one percent of the traces or spans on the output, or less.
C
You can say that you want to have up to, let's say, 10,000 spans per second, and accordingly it will change the rate using sampling rules. So it's a bit more complex, but it also gives more capabilities, and I think this is one of the things worth discussing. And the third reason why I'm here, and this might be controversial.
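The adaptive behavior described above, changing the probability so the output stays under a spans-per-second budget, could be sketched roughly like this (an illustrative sketch only, not the actual cascading filter processor code; the class and method names are invented for the example):

```python
class AdaptiveProbability:
    """Recompute a sampling probability so that the expected output
    rate stays under a spans-per-second budget (illustrative only)."""

    def __init__(self, budget_per_sec: float):
        self.budget = budget_per_sec
        self.probability = 1.0

    def update(self, observed_spans_per_sec: float) -> float:
        """Called periodically with the measured input rate."""
        if observed_spans_per_sec <= 0:
            self.probability = 1.0
        else:
            # Output rate = probability * input rate; keep it <= budget.
            self.probability = min(1.0, self.budget / observed_spans_per_sec)
        return self.probability

ctrl = AdaptiveProbability(budget_per_sec=10_000)
print(ctrl.update(5_000))   # under budget: keep everything -> 1.0
print(ctrl.update(40_000))  # over budget: keep a quarter -> 0.25
```

A real processor would smooth the measured rate and apply per-rule policies, but the core feedback loop is this simple division.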
C
But I've heard this from some people: we should think about correlation with logs and metrics as well. I've heard more than once that people wanted to sample logs as well, and I think this goes beyond just sampling used for tracing and spans. Thinking about how this should work for logs and metrics, or logs specifically, could be helpful as well.
A
Definitely a theme here of wanting the back end to know what kind of sampling is actually going on; some interest in advancing the field of sampling itself, more advanced sampling algorithms; possibly stuff that could be boxed up so that the sampling occurs in the pipeline and the collectors, perhaps before it hits the back end, while still sending information to the back end so it knows what happened; and then ensuring that that sampling work potentially works across all different signals.
A
Sampling metrics, for example, is perhaps a different beast from sampling tracing, but I'm definitely interested in sampling logs. Okay, next up we've got Pavel. Good to see you.
H
…project, and in general I'm interested in tracing and sampling. More specifically here, I would like to discuss the proposal for a remote sampler, and personally I'm super interested in knowing more about tail-based sampling.
I
Sure. I'm also interested a lot in the remote sampling features that Pavel's been working on. At Grafana we use the Jaeger client heavily, and we can't move forward to OpenTelemetry without it. Also, often when I'm discussing with potential customers, this is a killer feature that I wish was in OpenTelemetry, so I'm very interested in that, and in correlation between logs, metrics and tracing.
E
Hi, I came to represent Lightstep's interests, mostly with stuff that's been discussed earlier about counting and being able to extrapolate from the sampled spans that we receive and the sampling choices that we receive, so I won't repeat everything else.
E
One of the particular interests that brings us to sampling spans is that we don't want to collect logs that are sampled on their own. We think there's a benefit to having a transaction be sampled as a whole, rather than individual log statements, so one of the things we think we can get by sampling spans is something you cannot get by sampling logs or by sampling metrics. But as a tech lead in the OpenTelemetry metrics community, we are also interested in sampling metrics, and exemplar selection and weighted exemplar selection, which are a form of sampling metrics, are all in scope here. So I'm interested in the technicals of how we can do probability sampling across the board, in particular how we do rate-limited probability sampling, which is particularly interesting to me.
D
Mostly all of the things that were said. I mean, coming from Jaeger, I think one of the big roadblocks for us to start recommending OpenTelemetry is the lack of remote samplers, because it's a big feature in Jaeger. So once it's available in OpenTelemetry in some form, even in the form that Jaeger currently implements, we would be able to officially change the recommendation from OpenTracing to OpenTelemetry. And then there's the longer term.
D
I would be interested in expanding the way that Jaeger controls this remote sampling, because right now it limits it by service and operation name, and we've had a lot of requests, both in Jaeger and in the companies that I worked with, for more advanced selection criteria for when you should select the sampling strategy. That ties back to all of the questions that Josh is raising in a PR, around the weights.
D
What happens if you have conflicting policies both applicable to your request, and how do you represent all of this? I think, like in the agenda, the rate limiting that was mentioned is another aspect. Jaeger never really did rate limiting of the samplers; even though we have a rate-limiting sampler, I'm talking about a different kind of rate limiting.
D
In addition to something: let's say someone says "I want to sample at a hundred percent", but then the system says "oh sorry, you only get a budget of five per second". That kind of rate limiting doesn't exist in Jaeger, like it does exist at Facebook. So I think it would be beneficial to have that in OpenTelemetry in some form, and that again raises the question of what it does to the sampling weights, for the adjusted counts.
A
I do think one thing we'll want to carve out in this meeting is: what are the practical things people need today to ensure that systems which are receiving OpenTelemetry data can function correctly? And then there's sort of this other category of moving the field forward, you might say: new features in sampling that would be awesome, but are a step beyond what people need to function today.
A
Definitely, weighting the first batch of work toward the things people need to function today would probably be good for this group.
D
Yeah, one more thing: I would really like to hear a convincing story about the value of exemplars, because I've read a bunch of posts, and I've seen this in Lightstep. Sorry, I'm still not convinced that this is a valuable workflow, so I would be happy to hear otherwise.
E
We're talking specifically about spans being used as exemplars in, say, time series. When we take spans and turn them into metrics, the spans become exemplars to us, and if they have correct probabilities or adjusted counts, we can then attribute a portion of the population to each individual exemplar.
D
I'm thinking in terms of exemplars as in Prometheus, where you attach sampled trace IDs to the metrics collection, and then you can overlay those on the time series as dots, because a number of tools support that. That's the workflow that I'm not completely convinced about, in terms of how beneficial it is.
E
That's a workflow that I also hear people talking about, and I think there's a future where maybe it's there. But the one that's here today, at least for the Lightstep users, is that we take exemplars out of the span stream, because every exemplar is like a histogram measurement of its own latency.
A
Yeah, we can have debates about whether that exemplar workstream is useful. I personally think it's useful, but I would put that stuff maybe in the second bucket, of things that would move the industry forward, versus things that are perhaps jamming up people trying to use OpenTelemetry today. But happy to have that discussion; I think that workflow, having used it, is really useful to operators being able to move back and forth. But that's fine.
A
We've got JP up next. JP, what brings you here?
K
I think pretty much everything has been said already. From the Jaeger perspective, I'm here to hear more about remote sampling, but I think that part is being covered quite well. From the Red Hat side, I'm mostly interested in tail-based sampling, or the discussions around that that started a few months ago. I'm the author of some of the components for things that might compose a tail-based sampler, and the discussion back then was whether or not it would make sense to split the current tail-based sampler into different pieces so that we can compose them better.
K
Some of the cases mentioned before, like starting with an initial sampling rate, then using decisions after that to filter things out, and then getting things back in to fill the quota. So my interests are mostly on the collector side of things, I would say.
A
Awesome, okay. Moving on, we've got Denise.
B
Nice to meet you; hello, everyone. I work for Timescale and I recently started contributing to upstream OpenTelemetry, so I thought I'd collaborate with the sampling SIG. I'm totally new to everything happening around sampling, so this is my first time learning about sampling and the efforts here.
A
Great, glad to have you around; welcome to the SIG. We've got Carlos next.
L
Yeah, hey, I'm just another guy interested in tracing sampling, specifically sampling for tracing. That's about it. Awesome.
M
Hey, Microsoft here. I'm primarily on the client side of things, so I work with a number of teams within Microsoft, just small ones like Teams and Office, and each team has their own methodology around sampling.
M
So I'm here to try and make sure that we have a smooth upgrade path going to OpenTelemetry, primarily looking at traces and logs, and effectively protecting servers, so effectively your back-off strategies. In the case where clients start throwing errors, whether they're in traces or logs, you don't want to start throwing thousands of requests at the server; we're going to back off gracefully, and stuff like that.
A
Awesome, yeah, that dovetails with some rate limiting discussions other people brought up, so that sounds like a good subject. That's a variety of stuff. It's good to see that there's, I think, a general pile of things people are interested in; it's not too diverse.
A
Perhaps next, now that we've gone around the room and generally said things: like I said before, it might be useful to quickly highlight the things that are big blockers, so I'd like to do that next, just to ensure we've got that written down. I'd just like to note what about sampling is currently a blocker, or painful, for existing backend systems.
A
I'll leave this as an open question: who's here showing up because they have a blocker? And I'll put one in for sure: Jaeger remote sampling. Jaeger has a remote sampling algorithm, and because OpenTelemetry clients don't currently support that, it's a blocker for Jaeger being able to move fully over to OpenTelemetry clients.
E
Meter providers are like tracer providers, and then one also provides some sort of configuration which says: I would like to view these metrics in the following ways, with the following aggregate functions and so on, over these time ranges. All the things that go into a Jaeger remote sampling configuration, or any other kind of sampling configuration that I'm aware of, end up looking a lot like that: what spans are you trying to select, with what probabilities?
E
What are your overall configurations, like rate limits and so on? So I really want us to have a discussion about view configuration as a first-class thing for spans, because it's not about sampling algorithms. It's about: what attributes are you matching on, what names are you looking at, and what properties are you trying to identify as you sample? The technicals are really interesting to me.
E
I don't think everyone needs to get down into the details of how we preserve correct probabilities as we make these sampling decisions, because when it comes to rate limiting it actually becomes tricky, and tail-based sampling is necessary, and that's super technical; it's all about algorithms. But I think most of us are here to talk about views, and I don't want to block the views with conversations about algorithms.
A
Great, so maybe just to clarify something here. With remote sampling, or, I think, "remotely controlled sampling", which is perhaps more accurate: one piece that's missing is a mechanism, right? Like, how does a control plane access these things and actually change the sampling?
A
What is the configuration language? But perhaps the other bit is: what are the configuration options themselves? So maybe we can get some clarity from the Jaeger team. It's not just a mechanism for remotely changing sampling; you also have the configuration options themselves and the type of sampling that Jaeger does. Maybe we could hear a little bit about that really quick, just to get an understanding of what you're trying to configure.
D
I can explain briefly. The options are fairly limited: you are restricted to essentially probabilistic sampling, and the selectors of the traces are based on just the service name and operation name. So in the configuration you can define a probability for different endpoints of a service.
D
Then there is an additional feature, sort of a reverse rate limiting, which is meant to guarantee that if you have a very low QPS endpoint, you still get some data out of it. It sort of bypasses the default sampling rate, with the rate limiter saying: well, at least once an hour.
D
Give me something, regardless of whether probability said yes or no. That kind of guarantees that the back end is at least aware of all of the endpoints that exist in the architecture, and that allows us to build the adaptive sampling feature, which runs on the back end and auto-calculates the probabilities. It's sort of an active control plane, rather than just a config-based control plane, which is like a next iteration.
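The combination Yuri describes, a per-operation probability plus a lower-bound guarantee that fires even when the coin says no, can be sketched as follows (an illustrative sketch of the idea, not Jaeger's actual implementation; the class name and parameters are invented for the example):

```python
class GuaranteedThroughputSampler:
    """Probabilistic sampling with a lower-bound guarantee:
    even at probability 0, at least min_traces_per_hour get through."""

    def __init__(self, probability: float, min_traces_per_hour: float):
        self.probability = probability
        self.min_interval = 3600.0 / min_traces_per_hour  # seconds between forced traces
        self.last_forced = float("-inf")

    def should_sample(self, coin: float, now: float) -> bool:
        """coin is a uniform random draw in [0, 1); now is a timestamp in seconds."""
        if coin < self.probability:
            return True  # normal probabilistic decision
        if now - self.last_forced >= self.min_interval:
            self.last_forced = now  # lower bound: force one through anyway
            return True
        return False

s = GuaranteedThroughputSampler(probability=0.0, min_traces_per_hour=1.0)
print(s.should_sample(coin=0.9, now=0.0))     # forced through: True
print(s.should_sample(coin=0.9, now=100.0))   # within the hour: False
print(s.should_sample(coin=0.9, now=3700.0))  # next hour: True
```

This is the guarantee that keeps the backend aware of low-QPS endpoints, which the adaptive control plane then uses to recalculate probabilities.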
A
I see. So let me make sure I got those last two points right. One point is rate limiting, but it sounded like reverse rate limiting; like you're trying to say a minimal amount of sampling must occur? Yes.
E
This will come up again. We have a trace incompleteness problem, and that's my big blocker, and I think it can be addressed with a solution similar to the one Yuri just mentioned.
A
Right, okay. But from the perspective of OpenTelemetry, the telemetry system, we're not yet talking about a control plane, a completely autonomic system, there. You're saying, from our perspective, there just needs to be a way to either push or pull configuration updates from some source, and that source has its own internal way of making those decisions. Yes? Great.
A
Okay, easy peasy. Are there things beyond what we just discussed with Jaeger that are a blocker, or painful, for existing systems?
E
I'd like to talk about trace completeness; I think it's our biggest issue here. The world, as I understand it, is actually quite convoluted. We have this notion of a "sampled" bit that's passed through the W3C trace context, and as far as I know that is the only officially specced-out way that we have to signal:
E
"I would like this trace to be collected." But it requires you to use the parent-based sampler, which means that we don't have adjusted counts or sampling rates on our child spans. We can correct that by propagating those probabilities, but that opens up cans of worms. One of the proposed solutions that the OpenCensus system had come up with several years ago was to stop using parent propagation and to start using the trace ID ratio sampler everywhere.
E
So you use a trace ID ratio sampler in your roots, but also in your children, and then what we do is coordinate the decision using the trace ID ratio. There's a deterministic, or consistent, hashing function that's going to have to be specced out, so that every node in your trace can independently decide whether it meets its rate limit for that particular trace ID. This is one way to solve the rate limiting problem.
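The consistency property being discussed, every node hashing the same trace ID so that a span kept at a low ratio is also kept at any higher ratio, can be illustrated like this (a sketch only; the actual hash function would have to be specced out, and SHA-256 here is just a stand-in):

```python
import hashlib

def keep(trace_id: str, ratio: float) -> bool:
    """Deterministic decision from the trace ID alone: every node computes
    the same value, so decisions are consistent across different ratios."""
    h = int.from_bytes(hashlib.sha256(trace_id.encode()).digest()[:8], "big")
    return h < ratio * 2**64  # keep if the hash falls below the threshold

trace_ids = [f"{i:032x}" for i in range(10_000)]
kept_low = {t for t in trace_ids if keep(t, 0.01)}
kept_high = {t for t in trace_ids if keep(t, 0.25)}
print(kept_low <= kept_high)  # True: every trace kept at 1% is also kept at 25%
```

Because the decision is a threshold on a shared hash, a child sampling at 25% is guaranteed to keep every trace its 1% parent kept, which is what makes independent per-node decisions line up into (partially) complete traces.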
E
It solves it for the children, which we haven't had a solution to, but it leads to a situation where you don't know what nodes you're missing in your trace. There are two ways we know of that this can be solved. One is maybe not obvious, but sort of clear: if you're missing a parent in your trace, you must be incomplete.
E
I know my parent's span ID, and if I go through my trace, everybody with the same trace ID, and I can't find my own parent, then I'm an incomplete trace. That's one technique we have, but it doesn't work for leaves in a trace: you don't know you're missing a leaf, because you only have the parent, not the child. So there needs to be some way, I think, to signal when we are missing a part of a trace.
E
Presently we have no way to know if a trace is incomplete, and it's difficult to count children; the inject and extract APIs kind of stand in the way. You don't know if there's going to be a child created using the OpenTelemetry APIs; you just kind of pass your context. So there's the question of how you could know, and the only solution that I've ever come up with when I think about this sounds like what Yuri just described:
E
I want a separate configuration that's sampling individual spans, not traces. For every span, I would like to know, like once per hour, or maybe just give me an independent sample, not by my parent, but every span gets sampled at some very low rate, so that eventually I collect information about all spans and I know their trace ID ratios.
A
And just to clarify: is that something Lightstep can currently do elsewhere, but we're blocked from doing with OpenTelemetry?
M
Okay, maybe another way: prior to joining my current team, I was on identity for the client, effectively tracing, and what I built there was effectively something similar to what we're talking about here, where we can say: okay, we're collecting all these traces and logs on the client, but we don't want to send all of them all of the time. But eventually we get to a critical situation, exceptions, errors, whatever, and we say: oh crap, we want everything.
M
We got an exception and we actually want to flag that; we want to effectively get everything, all the way up to the root of the parent, and make sure it gets sent, right? So I don't know whether that would help with this situation.
E
It could help. There's been a discussion about reverse context propagation, where on the way back from a synchronous request, at least, you could pass back: "oh, I decided to sample this and you should know that", or "I decided not to sample". That, for example, is information you could pass back that would let the parent know it has an unsampled child.
M
Yeah, or in my case, I wanted to make sure I got everything, so you're telling the parent: even if you're sampled out, you actually want it to be saved.
A
Yeah. I would love to avoid reverse sampling, or back propagation of information, but I could see that clients, like web and mobile clients, are the example of a place where back propagation is possibly the better answer, because of what you can't do there. On the server side, you can do tail-based sampling in a way that's fairly lightweight: you're sending it to somewhere in a local network where the trace is being assembled, or there's some other way to do it.
A
But when it comes to tail-based sampling for clients, where the end user is paying for network egress for all of the observability data you're sending, I can see that being a situation where something like back propagation might be necessary.
E
So the premise here is that each node wants to apply its own rate limit, so some child might say: I was asked to trace this, but I've got a total rate limit that I'm exceeding.
E
There's been a discussion; I gathered some information about a Dapper tracing algorithm from a decade ago that kind of addresses the situation you just described, so it came to mind when you mentioned it. It's this idea that you can have children inflating their ratio, meaning "I'm not being traced enough". It's like what Yuri mentioned with the reverse direction, and it's really complicated: it requires propagating probabilities even when you're not sampled, and that adds to the conceptual headache of that solution.
A
Paul, just to answer your question of where this comes up: one place is when buffers are getting filled too quickly in some service, so that service is dropping spans and other information in an unsampled manner; you want to know when that's happening. The other situation is when something in your pipeline is borked: you have services sending data to a collector, but that collector can't connect to the back end, or whatever, so all that data is getting dropped on the floor.
K
But that's very different from the case that Josh mentioned, right? I mean, those are error conditions, and if you're accounting for error conditions, then we have to account for them happening in other places as well. But the case where we are applying sampling rates for each span is very different; that's not an error condition. This should be part of the main algorithm, or it should be supported.
A
That's very true. Okay, do people have other things that they feel are painful, or blockers, for connecting a system today to OpenTelemetry?
G
Okay, it's probably wrapped up in some of these, but just to explicitly call it out: it's not a blocker, but it's a pain point for us right now, which is just communicating the adjusted count or sample rate to the back end. Currently we provide sampler plug-ins for people, and of course that means if we don't have it in a language, then people can't use it, or whatever.
C
Yeah, and actually this is something I was going to bring up. There was this PR, opentelemetry-specification 570, which I posted in the chat window, and this was a way to address that. I think it was closed automatically by the bot, but the idea was to use this sampling.probability tag. There was some discussion there about using two tags.
E
Right. I've formally proposed that we record the inverse of the probability; it has a lot of nicer properties, basically, and that's my OTEP 148. I think the reason that issue was closed is that it ran into this problem about probability, and the thing I just mentioned about the inflationary sampler kind of exposes that issue: you need to propagate sampling probability even when you're not sampled, which means you don't want to count that probability the same way. Using the adjusted count makes counting simpler.
E
You don't have to think about whether you were sampled or not, because not sampled means zero count. Otherwise, counting probabilities, you have to invert the number and then check if it was zero first, or something like that. The point is, it's harder to reason about probabilities than it is to reason about counts.
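The arithmetic point here can be made concrete (an illustrative sketch; the record shape and the `adjusted_count` field name are assumptions for the example, though the values follow the inverse-probability convention being discussed):

```python
# Each record carries an adjusted count: 1/p if sampled, 0 if not sampled.
# Counting is then a plain sum, with no "was this sampled?" branch and
# no division by zero to worry about.
records = [
    {"sampled": True,  "adjusted_count": 4},   # sampled at p = 1/4
    {"sampled": True,  "adjusted_count": 10},  # sampled at p = 1/10
    {"sampled": False, "adjusted_count": 0},   # propagated but not sampled
]

estimated_population = sum(r["adjusted_count"] for r in records)
print(estimated_population)  # 14

# With raw probabilities, every consumer must special-case zero before inverting:
def weight_from_probability(p: float) -> float:
    return 0 if p == 0 else 1 / p  # the extra check the speaker mentions
```

Storing the count rather than the probability moves that zero check out of every downstream aggregation.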
G
Right, and for what it's worth, this is what we're actually doing in our packaged samplers: we're using the attribute as a way of communicating the adjusted count. It's hacky, but that's how we're getting around it at the moment. We're just setting an attribute in the sampler when we return the sampling decision, but then we're not making sure that it's propagated across process boundaries, etc.
E
Yeah. The document says: look, it doesn't matter what algorithm we use; as long as it meets certain properties, you can use this number regardless of what sampling was done, and that means it works for head sampling and for tail sampling. Now, there is a question, I think, of whether it makes sense to use an attribute to describe something which is sort of a property of the collection, not of the span, and that's, I think, an independent question that OpenTelemetry has in front of it.
E
I'm on the side that it makes sense to use these attributes: even though they are not descriptive of the span, they are descriptive of how it was collected, and it's too much overhead to add new fields to indicate new stuff. There is discussion elsewhere in OpenTelemetry about coming up with a sort of schema for attributes: are they identifying, are they descriptive, are they non-descriptive?
F
Awesome. Yeah, for me it's also clear that we need to add this information, the sampling probability, or its inverse; it's the same kind of information, so it's just a matter of taste, in my opinion. But what we should also discuss is whether we want to restrict the sampling rates to discrete values, because that makes them easy to report. I'm thinking of using powers of one half; this way the sampling rates could be encoded with a single small integer. But yeah, this is what I want to discuss.
F
It also has the nice property that the reciprocal value is always an integer, which means that if you estimate or extrapolate integer quantities, then you end up with an integer again. That makes it easier for users of those extrapolated values, because they do not have to think about rounding to an integer: if the original quantity was an integer, then usually you also want to show an integer on the screen.
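Otmar's power-of-one-half proposal can be sketched as follows (illustrative only; the function names are invented, and the exact encoding would be up to the specification):

```python
# Restrict sampling probabilities to powers of 1/2: the rate is carried
# as a single small integer exponent, and the reciprocal (the adjusted
# count) is always an exact integer.

def probability(exponent: int) -> float:
    return 2.0 ** -exponent  # e.g. exponent 3 -> p = 1/8

def adjusted_count(exponent: int) -> int:
    return 2 ** exponent     # integer reciprocal, e.g. 8

sampled_latencies_ms = [12, 48, 31]  # three spans kept at p = 1/8
exp = 3
estimated_events = adjusted_count(exp) * len(sampled_latencies_ms)
print(probability(exp))  # 0.125
print(estimated_events)  # 24, an integer estimate, no rounding needed
```

Extrapolating an integer quantity by an integer weight always yields an integer, which is the display-friendliness property described above.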
E
The same comes up in metrics, because you have data formats like statsd where you're ingesting sampled metrics events and turning them into OTLP, which means, for a histogram, you have this bucket count, and it's an integer field. Logically, mathematically, probabilistically speaking, you have a fractional count, but there's nowhere to put a fractional count for a histogram count in OpenTelemetry, which means you are better off using integer reciprocal sampling rates.
E
However, I don't love them, because you can't do sort of reservoir tail sampling with rate limits; you end up with fractional probabilities and fractional counts as a result of unbiased sampling. So I don't feel strongly that we should restrict ourselves in that way.
A
And just for everyone: I have to drop off the call at this point. I'm giving an interview, I'm interviewing bhs, and I have to go prep, but this has been a great discussion; feel free to continue for the next 10 minutes. Just FYI, we have hit a map-coloring problem with the number of Zoom rooms we have that record things to the internet and do all that properly.
A
Until we get some more Zoom rooms, that means this meeting will overlap with the Swift SIG, which means if we don't all get off the call before they start getting on, then the recordings for those two meetings will get glued together on YouTube. Not that people are avidly watching all these things on YouTube, but it is a thing we want to try to avoid. So just FYI, please make sure you end a little bit before the hour.
E
So I would like to propose, if I may, with the little bit of remaining time: my plea is that we try and separate these topics. Sampling algorithms, expressing probability, how to do that with tail sampling and reservoir sampling: that's all technical stuff, and it's its own conversation. And then there's a conversation for tracing at large: how do we configure views? What's the language of a view configuration? And then, later on, someone will say: I have a way to touch it remotely.
E
I'm going to ask for help with metrics right now. I'm putting up PR 1730; it's been sitting for a while. It has only one approval because it's full of contentious stuff, and if you look at it, it talks about how to set up views for metrics. I think we should be talking about all these tricky questions for spans, and the outcome of this will be: you have a way to set up probability sampling according to all those algorithms and technicals that we can discuss elsewhere.
E
This is tricky, because in the case of metrics you have a meter provider, and a meter provider has a bunch of instruments. Now the views come along and say: I have names I want to match against those instruments, I have output names, I have attributes I want to match, and there's a bunch of different views that might be effective on one particular metric instrument. But then we're talking about what the default behavior is, which says: if there are no views, I want to do something.
E
The question is: what if I only want to set up one view? Do I lose all my defaults because of it? I think this is the question that we need to answer for traces. So I'd like your help: if you could review this PR and see what the questions are for metrics, I think we have all those same questions for tracing, and this is when we talk about Jaeger remote sampling configuration, or reading a YAML file to configure sampling.
E
If nobody else has a comment, I would recommend that we end this meeting.
E
In addition to the PR that I just showed, there is an OTEP about all those technicals that I'd like to get reviewed as well. It's been a long debate; Otmar and Yuri have contributed quite a bit, and I think it's pretty close.
E
It includes a spec-language proposal that says there will be an attribute named sampling.adjusted_count. It will be used anywhere you want to express sampling, so you could use it on a log, on a metric, on a span, and as long as it meets its definition, you can interpret it correctly; it doesn't matter which sampling algorithm you used. That's in OTEP 148.
G
Yeah, I'll take another look at OTEP 148, because there's been quite a lot of discussion since I last approved, and I may take a shot at summarizing what the key points in debate are right now. I think that's a good action item. I'm also going to drop off then; all right, see you folks.
E
Well, thank you all. I think we're going to continue this next week and I look forward to seeing you; there's lots more to discuss.