From YouTube: 2021-12-02 meeting
D
Good morning. I was just working on an agenda, so I didn't listen to you all talking.
A
Oh, that's okay. The Google Calendar has the Zoom link in the location of the calendar invite, but you can't click on it directly, I find.
D
I was wondering if this was also because OTel changed all of its Zooms and started using auto-generated ones, but this calendar invite hasn't been fixed yet. I think so; I might ask someone else about it. Good morning, everyone. I know, Will, that you were talking about configurable sampling, and we talked about having that be the primary agenda item today, as I'm looking at what other things we might put on the agenda.
D
There are two minor notes that I thought worth reminding the audience of, but I think we should not belabor them. So: the PR I've been working on with all of you, probability sampling using trace state —
D
it is what it is. It is kind of ready to merge, I think. There was one piece of feedback last night saying it could use more examples. That's great feedback; I like that kind of feedback during review. Ottmar pointed out at the last minute something that is worth understanding about how delegation can break probabilities.
D
I filed an issue to talk about it, because I think we should refactor some things, but that's for the far distant future. So I put that in the agenda — just a link there; we can talk about all those things. Last, I propose, Will, that we put you up ahead of all these other things.
A
Yeah — I plan to give an overview of how we do sampling and tracing at Autonomic: the mechanisms that we're using today, what I'd like to see for the future, and a bit about how we actually don't use probabilistic and adaptive sampling mechanisms like the ones first proposed by Jaeger.
A
So, just to give you a setting of our use case — a sort of high-volume tracing customer use case. Yeah, that's all.
D
Sure. I think we've been putting off this discussion. Configurable sampling includes Jaeger remote sampling; it includes Amazon's X-Ray system; it includes — I mean, I'm looking at Honeycomb in the audience — Honeycomb does stuff we're all interested in. How does the user get to control sampling?
D
And
you
could
you
could
phrase
that
as
a
question
for
the
client
itself,
but
I
think
many
of
us
are
going
to
like
immediately
say
yeah,
good
luck,
configuring
all
of
your
clients
with
a
configuration.
So
at
some
point
it
becomes
a
question
about
how
can
I
get
remotely
distribute
this
configuration?
D
It's a big problem space, but I think we're all interested in all of this, so I'd love to keep hearing more. Will?
A
Sure, okay — I'll get started then. Let me share my screen.
A
Okay, hi everyone. I'm William Tran; I work at a place called Autonomic. What do we do at Autonomic, and what are we tracing? We're building something called the Transportation Mobility Cloud, which connects vehicles to applications: unlocking your car with your phone, updating vehicle software over the air, or streaming vehicle telemetry to consuming applications.
A
So how are we tracing? Scalable tracing is challenging, and for us it's too costly to sample only at the tail — we can't collect everything where all the action is happening at the head and send it down a pipe, so we need to do some head sampling. And some traces contain hundreds of spans which are deeply nested; in those cases most spans are actually not on the critical path of interest.
A
And some use cases are more important than others. Some use cases are sampled at 100% to ensure we capture error conditions that are hard to reproduce, and we identify those use cases right at the very beginning, using the HTTP method and path at the entry point to our system. For other use cases, sampling just some traces is sufficient.
A
In all cases, though, we need analytics that represent reality. So here's our architecture, simplified to show just what's relevant to this discussion. We run our workloads in Kubernetes; our services are mostly Java-based, and we're using the Java SpecialAgent with Jaeger clients inside it. This stuff might be a bit outdated, but at the time we started tracing, OTel was just being born.
A
Yeah, exactly — the OTel Java instrumentation actually was a descendant of the Datadog Java agent, and the Java SpecialAgent had similar functionality, but it was a lot more open in terms of how you could modify it, I found. So we went with that and ran with it. It was also interoperable with many other OpenTracing tracers, and it was a project that came from LightStep.
A
Our Jaeger client sends things to the OTel Collector through the Jaeger receiver; then we have our own component that we've contributed to OTel Collector Contrib, called redactor; and then it goes through the Honeycomb exporter over to Refinery, which is Honeycomb's tail sampler, and then on to Honeycomb. So that's pretty much our pipeline in a nutshell. Some details about the components — yeah, the SpecialAgent.
A
We started tracing early — actually in 2019 — and things really started rolling at the beginning of 2020. We chose the SpecialAgent for the reasons I mentioned before.
A
We were trying out many different tracing back-ends, all at the same time — the OTel Collector was very helpful there as well — and we modified it and added some of our own auto-instrumentation. We do plan on moving to the OTel Java instrumentation in the new year. As for jaeger-client-java: this was the most fully featured tracer when we started. I know the OTel Java instrumentation has just recently reached feature parity and the Jaeger team wants to deprecate the client, but we really needed remote-controlled sampling.
A
We modified the jaeger-client-java samplers to convey an adjusted count and to support rate limits of less than one per second, and we wrote our own sampling config server, which serves pretty much the same stuff as Jaeger's remote-controlled sampling config, modified to allow configuring rate limits of less than one per second — the Jaeger data model didn't allow a rate limit below one per second, so we wanted to change that. So, to review Jaeger's remote-controlled sampling:
A
You get to apply these samplers, and they will be applied given the service and the operation name of the span — there's just a one-to-one mapping of operation name to a sampler instance. That actually worked out well for us, in having these samplers convey an adjusted count; I'll get into how we deal with that in a second.
A
So, redactor: that's a component we wrote and contributed upstream, to redact sensitive data for compliance reasons — that's the pull request there; we just have a skeleton up right now, and that's been merged. Refinery: this is Honeycomb's tail sampler, and it respects the upstream adjusted count; the little code snippet that makes that happen is where they can multiply.
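The multiplication mentioned here can be sketched as follows — this is an illustration of the arithmetic, not Refinery's actual code: if a span survived head sampling with adjusted count N, and tail sampling then keeps one in M of those, the surviving span stands in for N × M originals.

```python
def effective_adjusted_count(upstream_count: float, tail_sample_rate: float) -> float:
    """A span that carried adjusted count `upstream_count` from head sampling
    and then survived 1-in-`tail_sample_rate` tail sampling now represents
    upstream_count * tail_sample_rate original spans."""
    return upstream_count * tail_sample_rate

def estimated_total(sampled_spans) -> float:
    """Estimate of the original span population: sum of adjusted counts."""
    return sum(span["sample.count"] for span in sampled_spans)
```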
A
So I think the compelling thing here — and the reason we approached adjusted count the way we did — was really just to make everything work with Honeycomb. We saw this feature, we saw it would work, and said: let's use this, because we want to see a realistic representation of reality, and that's what Honeycomb can do for you.
A
Okay — rate-limited versus adaptive. Jaeger was beta-testing adaptive sampling when we started tracing, and I think they've just recently said adaptive sampling is now GA, but you need to use Cassandra. Adaptive sampling, in a nutshell, is where you want to produce a constant rate of output for variable rates of input.
A
The way Jaeger's adaptive sampling does this is by remotely controlling the sampling probabilities of the samplers to achieve a target output rate, so there's this whole distributed feedback loop that happens to enable that. We agree with the motivations for adaptive sampling — producing consistent output from variable input — but we weren't up for the complexity of running Cassandra and running that whole pipeline.
A
So: outputting the adjusted count in samplers. Rather than inferring the adjusted count as the inverse probability, we chose to convey it directly.
A
Each operation name has its own corresponding sampler, and we added a corresponding atomic integer counter for each operation. "Yes" decisions increment and then reset the counter to zero, and "no" decisions just increment the counter by one. This works regardless of the sampling algorithm — in some cases we do use probabilistic sampling, but this works for both — and we add this count to baggage, then copy that baggage key to every span as a tag called sample.count.
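A minimal sketch of the counter mechanism just described, assuming the skipped-plus-one semantics summarized later in the discussion. A plain dict stands in for the per-operation atomic integers of the Java implementation, and thread safety is omitted:

```python
from collections import defaultdict
from typing import Optional

# Per-operation count of "no" (unsampled) decisions since the last "yes".
_skipped = defaultdict(int)

def record_decision(operation: str, sampled: bool) -> Optional[int]:
    """'No' decisions increment the operation's counter; a 'yes' decision
    returns skipped + 1 as sample.count and resets the counter to zero."""
    if not sampled:
        _skipped[operation] += 1
        return None
    # The sampled span represents itself plus every span skipped before it.
    count = _skipped[operation] + 1
    _skipped[operation] = 0
    return count
```

In the pipeline described above, the returned count would be written to baggage and then copied to each span as the `sample.count` tag.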
A
And so this is kind of — I know I joined this SIG last year, when we were first talking about sample.count, and I regret not being part of the conversation around how this turned into probabilistic sampling until very recently — but hey, I'm here now. Okay, some caveats with our approach: the adjusted count is set at the beginning of the trace, and it's not modified until we hit the Refinery tail sampler. Those are just the limitations we've put on ourselves to have confidence that this mechanism works.
A
So we'll just assign a probability of zero to some part of a subtrace, identified by service and operation name — there are some limitations there, and I'll get into that later — so downsampling subtrees is not yet supported. That would be something like: okay, I want to see this subtree in maybe one out of every 10 traces that end up getting sampled. It is feasible to do downsampling, though; Refinery does it by just multiplying those sample rates.
A
And we wouldn't want subtrees to just appear on their own; we'd want the appearance of a subtree to be dependent on the sampling of its parent. That would be nice — I think the consistent probability sampling does this.
A
And sampling bias is possible with a periodic input to rate-limited samplers. I don't have a proof for this, but it just seems highly unlikely in our environment, given the characteristics of the input to rate-limited sampling decisions: where we make that decision, the volume we make those decisions on, and the data we're making those decisions on. It seems unlikely — but I can't prove it, whereas you have a very rigorous test, which I've seen, for the consistent probability sampling that proves there's no bias there.
A
Yes, I do see that there are guarantees of bias-free sampling here, but I think it goes a little too far in mandating p to convey adjusted count. Maybe your intent there is just that the scope of using p for adjusted count is probability sampling — if you're using probability sampling, then here's how you use p — I'm not sure.
A
If you mean, though, that p is the only way to convey an adjusted count, then we're precluding non-probabilistic samplers — and if you understand the trade-offs, those might still be appropriate for your use case. I would just like to see a more interoperable adjusted-count mechanism, or have it defined outside probabilistic sampling.
D
I think I understood this. This might be a tough topic to grasp, though, for everyone who hasn't been through the entire debate here.
D
So I wonder if I could summarize this. The feeling I got is that you just showed us a mechanism that computes what you called sample count by basically counting the number of yes/no decisions on a particular operation; then, whenever you sample, you essentially output a count of the number you've skipped, plus one. And I think the words you put on the previous slide are that this might not be perfectly unbiased, yeah.
D
A test case with a very contrived input would demonstrate why randomness is better, but I think you have a very good claim that this is good enough. And so, if you wanted to do something simpler, like you described, you wouldn't be able to use the adjusted count mechanism the way it's expected — partly because I specified how it should be unbiased, and so on.
D
So you might prefer to see us go all the way back to roughly where we were in the summer, when I originally proposed we use a span attribute to convey adjusted count — and we landed on: well, let's just use this trace state mechanism and record it with the spans.
D
It
sounds
to
me
like
we
could
add
a
portion
of
our
specification.
This
is
a
question
mark
really,
which
is
designed
to
help
essentially
in
this
type
of
situation,
where
you
have
another
way
of
computing
adjusted
accounts
which
may
be
biased,
but
are
good
enough
and
you
want
to
use
them
anyway.
D
If we had a semantic convention for a span attribute that was the adjusted count, then you could just use that — and that is actually one of the first proposals I had, so I understand it, I like it, I get it. What it would mean, I think, is specifying the rules of interpretation: you're looking at a span; it has an attribute that says sample count; it also has a trace state that says p-value. What does that mean? That's a question as well.
D
I take your point, though. I think it's a reminder that there was a lot of technical pushback on what we're doing, and there's also simple cost pushback, and adding a span attribute seemed to be more than we needed after all the debate — that's all I'm going to say; there's a lot of debate there.
D
However, it was going to be a logarithm again, because it's so few bits, and it wouldn't let you have an adjusted count of three, for example. So I can definitely see adding to our specification a span attribute, with semantic conventions, saying: this is my adjusted count — just trust me, I know what I'm doing. I just imagine a lot more questions coming up. You're using baggage, so it works?
A
Yeah, there are always obstacles. We actually take all of our baggage — except for a few metadata things that don't need to be revealed as span tags — and copy it all to span tags. It's kind of a way to do this denormalization at the clients, rather than somewhere further down the pipe, so that we can flatten out these key-values that help us correlate things.
A
And we find that a useful mechanism too, and it's no big deal that it's not part of any spec — we just went in there and did it because it works for us.
D
The metrics spec will tackle the question of how to configure attaching baggage to metric events in the next year, but I'm aware that it hasn't been done for spans, and I'm kind of surprised by that. Cool.
A
Yeah — so, just to sum this slide up: as long as there can be some room left for an adjusted count as-is, then I'm happy; and within the scope of probabilistic sampling, I love what I see — it all works and is very rigorous and efficient.
D
In the past I've referred to other ways of sampling which are perhaps not as good for most use cases we know about. But you can devise schemes that are not power-of-two-based sampling — you can imagine an adaptive reservoir scheme or something like that — that have adjusted counts which are not integers, and not necessarily powers of two either, of course. So I can certainly imagine I'm doing something esoteric:
D
I've plugged in my own sampling algorithm for whatever reason, and now my sample counts are floating-point numbers, and it would be cool if that worked with the vendors. There's one other esoteric corner case you run into when you start talking about this: people often want to turn their spans into a histogram, so you're counting latency measurements from your sampled spans.
D
If your sampled spans can have non-integer counts, now you need a histogram with non-integer counts, and we don't have that. So that's one of those other obstacles.
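The obstacle can be made concrete with a small sketch: building a latency histogram from sampled spans means adding each span's adjusted count — possibly fractional — to a bucket, so bucket totals stop being integers. The function and bucket bounds here are illustrative:

```python
import bisect

def weighted_histogram(latencies_with_counts, bounds):
    """Accumulate (latency, adjusted_count) pairs into buckets delimited by
    `bounds`. With non-integer adjusted counts, bucket totals are floats,
    which is exactly the difficulty described above."""
    buckets = [0.0] * (len(bounds) + 1)
    for latency, count in latencies_with_counts:
        buckets[bisect.bisect_right(bounds, latency)] += count
    return buckets
```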
D
Yeah — and I don't know what will happen with that. If we could just step back: it sounds like Will's suggestion is that it would be nice to have a broader, less-specified semantic convention to say: I know what my adjusted count is; please take this and use it, vendor, when you are counting my spans. I think that's reasonable.
D
I think that's worth filing an issue about and maybe coming back to. It's not something I personally need, given what I'm trying to accomplish for my own vendor's requests, but I proposed it in the past because I think it's fine — so that's good. But I think we may want to table that and come back to talking about configurable sampling in general.
D
I feel like everyone kind of knows what it is that we want, and we all have these examples in our head — Jaeger, and probably X-Ray — and what I'm hoping someone will do (it could be me, if my vendor wants me to, in the next year) is just sit down and write down
D
the specific proposed structure: the OTel sampling configuration. It might look exactly like the Jaeger sampling configuration, except that I expect certain semantic conventions to be updated — Jaeger has an operation name and we have a span name (those are the same); I think Jaeger has tags and we call them attributes (that's a difference). And we have parallels in the metrics space already, called Views, which are just getting to be stable in the OTel spec, which is where you say —
D
Hopefully we have a rate-limited one — there are probably several choices of rate-limited sampler that we know about, depending on whether you care about completeness or about rate limits being hard, and so on — but we essentially need a file format to do that, and I feel like all roads lead to configuration at this point in OTel right now.
D
As for a format in the SDKs: the collector has one, and that's probably the model we would look to. So yeah, I hope someone else can do this, because it hasn't been a priority, at least for us at LightStep — the priority is to get the counts done first — although our customers definitely want the same thing we just talked about. The only other concurrent parallel that's happening in the OTel group
D
that's related to this is that there is an agent configuration group talking about how we will configure agents. So if you had a file format, you could then go talk to that group about how to get it distributed to your agents, and then we could be talking about how to distribute it to your SDKs — and there are some open questions about whether agent management protocols can be used for SDK management.
D
Let's suppose they can't — that's too much complexity. Then we need something that's essentially a lightweight version of the agent management protocol, which is the same as what Jaeger has for its sampling endpoint: I'm an SDK; I've been configured with a destination for sending OTLP, but I also need a destination for getting my configuration. Is that the same as the collector I'm talking to? Is there a port I have to use?
D
Good — Will, the other Will. I was wondering if you would both make it here; sorry, I didn't notice — I didn't check my participants list. So Will is the other person who's been talking to me about this — both Wills — and so, X-Ray: I don't know the differences between the Jaeger and X-Ray approaches exactly, but this is what I'm interested in seeing.
C
Yeah, I was just going to say: we have myself and also Batik from the X-Ray team on the call, and we added an item to the agenda to discuss our remote sampling approach as well, which is something we were going to discuss later — but I don't know if now would be —
D
I think now is the perfect time — I'm moving you up in the agenda, so thank you. Will, I took some screenshots; I thought your slides were great. I'd be glad to have a link to them in the notes, if possible. So this is part two of our remote sampling discussion, then — thank you, Will.
D
Well, let's talk about it, why don't we? Now that we've discussed Will's backstory, I think it'd be nice to hear about the X-Ray system.
E
Hey everyone. I've attended the sampling SIG from a very early starting point.
E
Next, I'm just going to share my screen. I've sent a document to the Zoom chat; I'll share my screen to walk you through it.
D
Will Tran spoke about setting up Cassandra for the Jaeger centralized system; it looks like what we're going to talk about here is essentially Amazon's equivalent of that.
E
Yeah — so maybe you can go to the introduction.
E
Yeah, so I guess Will has talked about this — remotely configured sampling — in his slides. Basically, centralized sampling is where customers can configure the sampling configuration — say, reservoir size or fixed rate — somewhere in a remote console, and the SDK pulls in those configs and uses them to make sampling decisions.
E
The main goal of remote sampling is this: say you have multiple hosts in your fleet — five hosts — and the customer sets a configuration of five requests per second plus a five percent fixed rate. They want to trace five requests every second, but you still have five hosts in your fleet.
E
Typically this works well — or is easier to implement — in a one-host setup, where you can directly control how many requests you're going to sample in that second; with remote sampling it's a little trickier.
E
So the idea of centralized sampling is to distribute the sampling configuration across all your hosts. For example, if a customer defines five requests per second and you have five hosts, then maybe each host samples one request, and then the five percent fixed rate is applied, aggregated, on top of that. That satisfies the customer's criterion of sampling a particular number of requests within that second. That's the general idea of centralized sampling.
E
The X-Ray SDKs are open source and have provided this support probably since right after the GA launch, so it's been in production for quite a while now. The documentation mainly talks about how we can implement this
E
Using
the
open,
telemetry
and
as
as
basically
we,
we
would
also
want
to
define
like
like
a
like
a
spec
for
this,
so
that
you
know
everybody
can
utilize
this
pack
and
we
kind
of
like
build
a
general
model
for
like
configure
configurable
sampling
or
remote
sampling
or
centralized
sampling.
E
So I'm just going to walk you through the design of how the X-Ray SDKs implement centralized sampling. Can we go to the implementation design first, and then we'll go to the caveats? We've also implemented centralized sampling in OpenTelemetry Java, so, basically, here's how the general implementation flow works.
E
OpenTelemetry has a sampler interface, which defines each sampler. For centralized sampling we can create two decomposed samplers: one is the centralized sampler, which handles all the logic related to centralized sampling, and the other is the rate-limiting sampler. But before going into that, I'd like to walk you through the general workflow of how we do centralized sampling.
E
If you look at the centralized sampler: X-Ray provides two APIs, GetSamplingRules and GetSamplingTargets. GetSamplingRules pulls the sampling rules — pulls in the data
E
Like
you
know,
customer
data
that
has
been
set
by
the
customer
at
a
centralized
location,
which
is
like
a
reservoir
size,
fixed
rate
which
path
they
want
to
sample
like,
for
example,
if
it's
a
http
request,
then
there
would
be
some
paths
they
would
want
to
apply
for
service
type
service
name
all
those
parameters
which
we
definitely
have
some
equivalent
to
that
in
the
open
telemetry.
E
So using GetSamplingRules, the SDK gets that data periodically; and there's another API, GetSamplingTargets. The X-Ray subsystem computes the quota — basically, how many requests to assign each host: effectively, a reservoir size. The SDK makes those periodic calls to the remote back end; with the X-Ray SDK, currently, the fetching interval for sampling rules
E
It's
five
minutes,
because
we
don't
still.
We
still
don't
need
that
that
frequent,
but
for
sampling
targets
we
need
to
compute
like
more
frequently.
So
it's
default
period
is
10
seconds
so
that
that
would
continue
going
to
happen
like
in
the
background
of
the
sdk.
Now
like.
Basically,
the
main
idea
is
to
sdk
would
have
kind
of
contain
a
rule
cache.
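The two polling cadences and the rule cache might be sketched like this — a simple staleness check rather than real background threads, with the five-minute and ten-second intervals taken from the X-Ray defaults mentioned above and the fetch functions left as caller-supplied stand-ins for the two API calls:

```python
import time

RULES_INTERVAL = 300.0   # GetSamplingRules: refresh every 5 minutes
TARGETS_INTERVAL = 10.0  # GetSamplingTargets: refresh every 10 seconds

class RuleCache:
    """Keeps the SDK's local copy of rules and quotas up to date."""

    def __init__(self, fetch_rules, fetch_targets):
        self._fetch_rules = fetch_rules      # stand-in for GetSamplingRules
        self._fetch_targets = fetch_targets  # stand-in for GetSamplingTargets
        self.rules, self.targets = {}, {}
        self._rules_at = self._targets_at = float("-inf")

    def refresh(self, now=None):
        """Refetch whichever piece of state has gone stale."""
        now = time.monotonic() if now is None else now
        if now - self._rules_at >= RULES_INTERVAL:
            self.rules = self._fetch_rules()
            self._rules_at = now
        if now - self._targets_at >= TARGETS_INTERVAL:
            self.targets = self._fetch_targets()
            self._targets_at = now
```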
E
It contains the entire rule set, so that at any point the SDK is kept up to date with the recent changes on the remote backend. That's the idea behind it. Now, when there's a request coming in,
E
it's matched against the configuration that the back end has provided to the SDK — for example, host name, service name, service type, the URL path of the request. We're essentially matching the request attributes against the sampling configuration that has been set. If the request matches one of the sampling rules,
E
then, basically, it means those five requests are going to be sampled every second, irrespective of anything else. The customer sets a reservoir quota and a fixed rate; after the reservoir quota is consumed, we use the fixed rate to sample extra requests. It's kind of a guarantee for the customer: for this request path, five requests per second plus five percent are going to be sampled. So once the reservoir quota is consumed
E
for that second, the centralized sampler instantiates the other sampler — a rate-limiting sampler, kind of. For that sampler we can use whatever works for our case; when we do the implementation SDK-wise, there isn't a hard requirement.
E
We
could
have
used
like
3d
ratio
based
sampler
as
well
for
this
case,
but
it's
just
kind
of
like
some
probabilistic
sample
would
work
for
this
case
so
that
the
so
basically
the
major
idea
is
like
a
centralized
sample
would
basically
try
to
consume
the
reserve
quota.
Once
the
reservoir
board
is
consumed,
it
would
instantiate
another.
E
That's the case where the request is matched. For cases where the request doesn't match any of the sampling rules the customer has set, we have a default of one reservoir request per second and a five percent fixed rate. So that is the general workflow.
E
Both implementations — remote sampling with OpenTelemetry Java, and the X-Ray SDK — have two major caveats that we thought could be done better. One is that in the X-Ray SDK implementation we fixed the sampling-rules and sampling-targets fetching intervals, whereas in OpenTelemetry Java we let the customer — can you go up to the caveats?
E
Yeah — so with the OpenTelemetry implementation we're planning to provide a configurable interval that the customer can set when they set up the sampler. That would be nice for customers who aren't updating their sampling rules frequently:
E
they can manage some of the traffic going via the collector. The other major difference — not really related to this — is that the X-Ray SDK provides a local file of sampling configuration as a fallback.
E
It
would
use
that
file,
but
with
x-ray.
Sorry,
but
with,
I
think,
open
telemetry.
We
are
not
like.
I.
I
think
we
would
be
just
okay
to
like
provide
the
default
sampling
rules
so
yeah.
So
those
are
kind
of
like
major
caveats
can,
can
you
can
you
go
down
like
yeah,
just
still
go
down,
yep
yeah.
So
this
is
the
open,
telemetry,
sampler
interface
design
right
so
like.
E
Basically,
we
would
be
able
to
implement
like
a
standalone
sampler
like
centralized
sampler,
which
basically
implements
like
shoot
sample
and
description
methods
like.
Basically,
it
takes
the
sampling
parameters
like
kind
of
like
request,
and
then
it
would
feed
in
the
it
would
feed
in
the
sampling
configuration
that
it
has
been
received
and
makes
a
decision,
and
then
this
is
that
it
returns
the
response,
so
I
think
it
works.
I
think
basically
like
we
could
have
like
two
decomposed
samplers,
as
I
mentioned
above
so
this
is.
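The decomposition described — a standalone centralized sampler exposing the sampler interface's two methods — might look like this sketch. It is a hand-rolled stand-in rather than the real OpenTelemetry SDK interface; the method names mirror shouldSample/getDescription, and the rule and fallback shapes are invented for illustration:

```python
class CentralizedSampler:
    """Sketch of the decomposed design: rule matching lives here, and a
    caller-supplied fallback stands in for the rate-limiting sampler."""

    def __init__(self, rules, fallback):
        # rules: list of (matcher_dict, decision_fn) pairs
        self._rules = rules
        # fallback: decision_fn used when no rule matches
        self._fallback = fallback

    def should_sample(self, attributes: dict) -> bool:
        for matcher, decide in self._rules:
            if all(attributes.get(k) == v for k, v in matcher.items()):
                return decide(attributes)
        return self._fallback(attributes)

    def get_description(self) -> str:
        return "CentralizedSampler{%d rules}" % len(self._rules)
```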
E
This is the sampler interface design, which aligns with the implementation. And this is the Bernoulli sampling — we don't have to go through that — and we also don't have to go through the data model; it's minor detail, but this is the general data model we're going to use. There's one more thing I wanted to talk about — can you go down here? A little more — yeah. I wanted to talk about the matching data model.
E
So basically, whenever there is an incoming request, there would be some equivalent of data model fields like service name and service type in the OpenTelemetry data model.
E
The only concern here is that we wouldn't be getting all of these data model fields in the sampler. For example, some of the parameters are very specific to the instrumentation, like URL path and HTTP method attributes; we wouldn't have those in the sampler. We can't really define them ourselves, because it would be very specific to the instrumentation and how the instrumentation is going to emit those fields for us to use.
E
So one possible solution would be: whenever we define a sampler as shown above, we can take the resource, which would probably give us the service name and service type, and we can match against that; the matching of the remaining fields would probably be dependent on the instrumentation. I don't have any clear idea on what would be good for OpenTelemetry use cases, but I'm open to any suggestions.
A
Yeah, I totally see that relying on the existing data model is really specific to a particular use case. Imagine messaging systems: you might want to match on topic name, and that's none of these existing fields here. But if you want to open it up, attributes in OTel spans are just key-value pairs, and they could be anything, so you could open matching up to all attributes and throw in some key-values.
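The split just discussed, between resource attributes (such as service name, known when the sampler is constructed) and span attributes (such as HTTP method, only available per call), can be sketched as a merge-then-match helper. This is a hypothetical illustration; the attribute names are examples of OTel semantic-convention keys, not a specified matching algorithm.

```python
def match_rule(rule_attrs, resource_attrs, span_attrs):
    """Check whether every key/value constraint in rule_attrs is met.

    resource_attrs: known at sampler construction (e.g. service.name).
    span_attrs: arrive per should_sample call (e.g. http.method).
    Span attributes override resource attributes on key collision.
    """
    merged = {**resource_attrs, **span_attrs}
    return all(merged.get(k) == v for k, v in rule_attrs.items())
```

A rule requiring a key that neither the resource nor the span provides simply fails to match, which matches the concern above that instrumentation-specific fields may be absent in the sampler.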
A
The thing I would like to see fleshed out is how we deal with multiple potential matches for a sampling rule. You might have ones that are more specific or less specific. Do we automatically determine specificity and ordering through the configuration, or do we just...
D
Will, you're breaking up. I think Will just dropped off, but I can finish his sentence for him. The same issue comes up in the metric views specification, which I was talking about earlier, because the need for a configurable sampler is very closely analogous to the metrics case.
D
He was getting to the point that it's tricky when you have more than one rule matching. Do you go in order? Do you consider all of them? There's certainly probability logic implied, and certain answers may or may not run afoul of the probability rules.
E
Right, so we have a priority that we set for every rule in the remote configuration, in the back end, and if we have a first match, then we would use that rule to do the sampling.
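The priority scheme described, where each rule carries a priority assigned in the back end and the first matching rule wins, can be sketched as follows. The tuple layout `(priority, matcher, rate)` and the lower-number-wins convention are assumptions for illustration only.

```python
def select_rule(rules, attributes):
    """Return the rate of the highest-priority matching rule.

    rules: iterable of (priority, matcher, rate) tuples, where a lower
    priority number is assumed to win. Returns None when nothing matches.
    """
    for priority, matcher, rate in sorted(rules, key=lambda r: r[0]):
        if all(attributes.get(k) == v for k, v in matcher.items()):
            return rate  # first match in priority order wins
    return None
```

This sidesteps the specificity question Will raised by making ordering explicit in the configuration rather than inferring it from how specific each matcher is.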
D
For probability sampling, we have this consistent mechanism, and we have specified a mechanism for composing samplers. So if each rule is a consistent probability sampler, we should be able to compose them, because they can all consistently apply their probabilities. That problem may be solvable: the idea is that you can let all of the different rules be applied at once, and if any one of them decides to sample, you sample. Although that might not...
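The "sample if any rule samples" composition works for consistent probability samplers because every rule consults the same randomness. In the p-value/r-value formulation from the probability-sampling PR, a sampler with power-of-two probability 2^-p keeps a span when p <= r, so the union of several rules collapses to the smallest p-value. A rough sketch, assuming power-of-two probabilities:

```python
import math

def p_value(probability):
    # p-value = -log2(probability), assuming a power-of-two probability.
    return round(-math.log2(probability))

def consistent_sample(r, p):
    # Keep the span when p <= r, where r counts the leading zero bits
    # of the span's shared trace randomness.
    return p <= r

def compose_any(r, p_values):
    # "Sample if any rule samples": all rules see the same r, so the
    # union is one sampler with the smallest p (largest probability).
    return min(p_values) <= r
```

So composing a 1/32 sampler (p=5) with a 1/4 sampler (p=2) behaves exactly like the 1/4 sampler alone, which is why the composed probabilities stay well-defined.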
D
I think that's a good outcome. I don't want to distract us, but much like in the metric specification, there's a lot of complexity about how you specify these rules, and the priorities are implied.
E
Right, but I guess in that case it would still depend on where we are configuring the sampling configurations, because it would be different if we change the vendor; it would be very vendor-specific, I think.
D
So,
thank
you,
bautic
and
will
I
think.
D
As an engineer who's interested in this topic, I do have some questions about how the centralized reservoir is managed, but I don't think we should belabor those now, especially with only a couple of minutes left in the hour. I think the outline of the roadmap is pretty clear here: we want configurable sampling, and we want a way to describe a bunch of rules, where the rules are based on key-value pairs and the keys and values are semantic conventions that OTel has specified.
D
The
the
outcomes
of
these
rules.
Evaluation
are
probability,
sampling
or
other
sampling,
and
this
this
idea
of
a
fixed
rate,
a
minimum
rate,
is
a
non-probability
sample
or
the
way
we've
specified
it.
So
that-
and
this
was
actually
a
strong
use
case
that
was
given
in
the
beginning-
is
that
I
want
a
probability
standpoint
and
a
minimum
threshold,
and
so
we
can
do
those
so
that
would
be
a
standard
configuration.
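The "probability sampler plus a minimum threshold" configuration just described can be sketched as a probability sampler backed by a per-second floor. This is an illustrative composition, not a specified design; the one-second windowing and the class name are assumptions, and as noted above the minimum-rate branch is a non-probability decision.

```python
import random
import time

class ProbabilityWithMinimumRate:
    """Sketch: probability sampling with a spans-per-second floor.

    Spans kept by the floor are a non-probability decision, so they
    cannot carry an unbiased adjusted count (per the discussion above).
    """

    def __init__(self, probability, min_per_sec):
        self.probability = probability
        self.min_per_sec = min_per_sec
        self._window = None  # integer second of the current window
        self._taken = 0      # floor-kept spans in this window

    def should_sample(self, now=None):
        now = time.monotonic() if now is None else now
        window = int(now)
        if window != self._window:
            self._window, self._taken = window, 0
        if self._taken < self.min_per_sec:
            self._taken += 1
            return True  # minimum-rate floor keeps this span
        return random.random() < self.probability
```

With `probability=0.0` and `min_per_sec=2`, exactly two spans per second are kept regardless of traffic, which is the "minimum threshold" behavior.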
D
Then you run this configuration, you match your span, you get a bunch of rules, they decide to sample or not, you output a probability, and that's good. I think we should resume this discussion, but I think we should also find someone to champion it and make a proposal for us. I don't see it happening in the next few weeks myself, because we're entering the holidays and there's a bunch of other stuff happening in OTel right now, but this is definitely up next.
D
I don't think we have time to discuss any remaining issues on the agenda. There was this PR of mine, which has an approval from Otmar; thank you, Otmar. It already has a couple of approvers from the OTel spec approvers list, and I think it's going to merge.
D
We did discover some corner cases that are not exactly correct, but it's not because of the probability sampling; it's because of the existing parent-based sampler. If anyone is curious about what I just said, this link here is worth reading through. The last, and I think most exciting, part, with maybe only a couple of minutes left, is that Otmar has posted this consistent reservoir sampler. I believe the algorithm is similar to the partial tracing analysis algorithm, but I don't think I know the algorithm.
D
It is definitely a curiosity for me and something I'd like to know more about. Otmar, is this something that you've written up in one of your papers as well, or is this a proof of concept in code that you're going to...
B
...write up as a proof of concept. I still need to write some more unit tests, but the idea is basically: you have a random number, and to make reservoir sampling consistent, you keep just the smallest or the largest random numbers. Based on the r-value, you already know that your random value had a certain number of leading zeros, for example, and if you incorporate that in your reservoir sampling algorithm, then you make sure that you keep the spans which have larger r-values in your reservoir.
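The idea Otmar describes, keeping a fixed-size reservoir of the spans whose randomness has the most leading zeros (largest r-values), can be sketched with a min-heap. This is a guess at the shape of the algorithm from the description above, not Otmar's actual proof of concept; the random tiebreak for equal r-values is an added assumption so that ties are broken uniformly.

```python
import heapq
import random

def consistent_reservoir(spans, k):
    """Keep up to k spans, preferring larger r-values.

    spans: iterable of (r_value, span) pairs, where r_value is the
    number of leading zero bits in the span's trace randomness.
    """
    heap = []  # min-heap ordered by (r, tiebreak); root = weakest keeper
    for r, span in spans:
        tb = random.random()  # uniform tiebreak among equal r-values
        if len(heap) < k:
            heapq.heappush(heap, (r, tb, span))
        elif (r, tb) > (heap[0][0], heap[0][1]):
            # New span outranks the weakest span currently kept.
            heapq.heapreplace(heap, (r, tb, span))
    return [span for _, _, span in heap]
```

Because the surviving spans are exactly those above some r-value threshold (plus a uniform choice at the boundary), the result stays compatible with consistent probability sampling, which ordinary reservoir algorithms are not.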
D
Cool,
that's
really
exciting.
All
the
other
algorithms
I
knew
to
do.
Reservoir
sampling
were
not
going
to
work
with
consistent
sampling,
so
this
is
cool.
I
look
forward
to
more
on
that.
I
think
we're
out
of
time
right
now.
I
would
especially
thank
you
will
and
biotic
for
presenting,
and
I
think
that
we
have
a
little
bit
more
of
clear
road
map
for
what
what's
next,
and
I
think
I'll
see
you
here
again.
D
I don't know that these meetings are useful to have every week at this point. Does anyone have a feeling as to whether we should do this every two weeks?
D
I'm going to answer that myself: I think we should do this meeting every two weeks. I'm going to talk to someone in OTel about updating the calendar link to get the Zoom links right, and I will post in Slack when I succeed, hopefully, at rescheduling this down to once every two weeks.
A
And I'll try to get the slides linked in the docs; I just need to clear it with legal. Thanks.