From YouTube: 2022-11-16 meeting
Description
cncf-opentelemetry meeting-2's Personal Meeting Room
B
Okay, let's start. I think you have the first item.
E
Yeah, I generally wanted to find out what the next steps are on this PR that introduces event.data. I think it aligns with CloudEvents, and that's one thing, but in terms of implementing it, it needs the build tools to allow the data type of a map — or whatever data type we want to consider for it that is not a primitive data type. So how do we go about that?
B
I think we need that anyway, because we say that logs allow it. So regardless of whether we decide to go with event.data, I believe it's still needed for the logs. Maybe not immediately — maybe we could delay that — but the tooling will likely need it eventually. But for event.data, to be honest, I personally am not yet convinced. Unfortunately, I didn't have time to look into CloudEvents.
B
I wanted to do that first and then decide for myself, so I don't have a personal opinion on this topic yet, but I'm happy to continue discussing it — if anybody has any other thoughts.
D
Yeah, one thing we talked about last week is: do we just say we take event.data and prefix it with the events namespace? I'm quite happy with that as a potential option, based on what CloudEvents have already gone and defined. It's more than what we need for the RUM SIG, but I think it would serve OpenTelemetry in good standing for all of it.
D
Yeah, so we're effectively deciding the semantic convention for using a CloudEvent — we prefix the topic. On that: I'm actually not going to be attending the next two meetings, so I'm out, but Tigran, you're probably one of the best people to go have a look at CloudEvents. I'll be back in two weeks.
B
So yeah, with regards to that, I have just reopened the PR which was about supporting maps in attribute values everywhere. It was a draft, it got closed automatically, so I turned it into a proper PR, and I'm going to advertise it for proper review by everybody. Let's see if we can actually make it happen, because I see that the requests are coming from many different directions — a lot of people need this for a variety of purposes.
F
From my understanding, Tigran, what's the difference between this PR that you just reopened, to support maps and heterogeneous arrays, and Neville's PR — the one that I think eventually got closed due to staleness — that tried to...
B
I think this one is more far-reaching. The one that Neville opened was trying to navigate kind of within the limitations that we had; this one cuts away all those limitations. Instead, nested maps, all sorts of arrays, and all kinds of things are universally supported.
D
Yeah, my first version tried to address some of the 376 things which Tigran's PR was also addressing, by opening the door rather than saying it's fully supported — and I'm all for saying it's properly supported. I have actually got my PR reopened again, so it is sitting there, but I've linked to Tigran's.
B
Yeah, so this goes farther than that. This tries to go all the way to where we want it to be, in terms of supporting it universally.
D
Yeah, and if Tigran's gets in, mine is automatically closed. They're all addressable, although there may still be some clarification needed — there's probably still a size thing we need to do — but that'll be a separate conversation.
E
So that's it from me. For the RUM SIG, for the RUM events, we are looking for a mechanism to even define the shape of an event. So I'm just asking for suggestions, any...
B
Ideas? That's going to be some form of semantic conventions, right? So essentially you're looking to expand, in a way, the semantic conventions generator to give you more expressiveness, I guess, in terms of how you specify the values. Today it's more primitive — essentially the values are just numbers, strings, or something like that. But for the events, you need the values to also be some sort of more complex things. That's what you...
E
...want, right? No, I think it's more than that. I think the current semantic conventions are defined using namespaces. So, for example, a specific event can have attributes from multiple different namespaces, and different events could use the same attribute.
E
So we want to specify that, hey, event type A has these four attributes — a.b, c.d or something — and then event type two will have a.b again, and then some e.f or something.
D
It might be better explained this way: the events themselves are completely separate. While they might be sharing some of the common attributes — which is now a blocker for the HTTP semantic conventions — the individual events themselves aren't directly related. So you'd have, like, a page view event: the page view event has these 10 things; a resource timing event would have these, you know, 12 things. So we need somewhere to define those. So really the two levels that we're looking at: there's the shape...
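For concreteness, the "shape" problem described here can be sketched in a few lines. All names below are invented for illustration — they are not actual OpenTelemetry semantic conventions — but they show two event types sharing a common attribute namespace while each adds event-specific attributes.

```python
# Hypothetical event shapes (names invented): each event type lists the
# attribute keys it allows, mixing shared and event-specific namespaces.
COMMON_BROWSER_ATTRIBUTES = ["browser.name", "browser.version"]

EVENT_SHAPES = {
    "browser.page_view": COMMON_BROWSER_ATTRIBUTES + ["page.url", "page.referrer"],
    "browser.resource_timing": COMMON_BROWSER_ATTRIBUTES + ["resource.url", "resource.duration_ms"],
}

def validate_event(event_name: str, attributes: dict) -> list:
    """Return the attribute keys that are not part of the declared shape."""
    allowed = set(EVENT_SHAPES.get(event_name, []))
    return sorted(k for k in attributes if k not in allowed)
```

The open question in the discussion is where such shape declarations would live (semantic-convention YAML, JSON Schema, or a CloudEvents-style registry), not the mechanics of checking them.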
G
This does seem like a place to me where, if we are going to align with CloudEvents, this is probably going to be important, because they've done a lot of this stuff — they've got an event schema. So I think we want to at least be compatible, if not... it seems like this would be an opportunity for us to kind of delegate some of that stuff to them.
B
Can you guys maybe open an issue and show some examples of what you would like the schema to define? Because I'm not quite sure what it is that you want to do that's not possible using semantic conventions. It seems like you're saying you have a certain data model, and then you want the events to reference bits of that data model — if I understand correctly.
D
Yeah. Where do we define that? Because, as Anthony pointed out, it is the domain-name combination: each event will define what it means. Whether that's using JSON Schema or whatever is another thing, but we need some way to define it, so that every OpenTelemetry user can use that same event and not go and reinvent their own domain-name combination.
B
Okay, so can we write down the problem — what is it that we're trying to solve — in an issue, maybe with a couple of examples? Because I'm still not quite sure what it is that we want to achieve. And then, David, if you're saying CloudEvents do something like that, then let's definitely consider it. Maybe we can borrow it, or borrow ideas, or maybe borrow wholesale, I don't know.
G
I may misunderstand too — I'm certainly not an expert on this — but I just want to make sure that we don't create a definition that clashes with a definition that they have.
E
Okay, I'll go through the CloudEvents schema registry.
F
So I'm on the same page: the issue that you referenced, Nav — was that number 2933, or the PR that you referenced? It's by Ramsey.
E
It's about laying the foundation for the following structure, and it sort of starts giving you examples of what the definition might look like. Ultimately — if you open it, there's each of the things: page view, error, interaction and stuff — we hope to have a collection of attributes that is defined, probably, elsewhere, and it could be linked to that. And then there are attributes that might be specific to just one of these event types — page view, for example. But yeah, that's the idea.
F
So if I understand correctly: this markdown is expressing, in some form or another, the schema that you expect some of these events to follow, but you're running into issues representing this with the semantic convention YAML.
E
That is another issue, too, that we need to solve as sort of a parallel thing. We realize we need to start with the YAML, and then the MDs are generated, and there's this tooling around it. We kind of tackled it like this: let's get it in front of the group, get some feedback on whether this is the right place to put it and the right way to organize things. Because we realized — this is the first time anybody attempted to do this; some sort of a schema exists for traces and metrics, at least in my mind, but it's not very clear in people's minds for RUM. So this is to get the conversation going, but ultimately we want to get the YAML defined and the tooling built. That's a separate problem, I think — the technical problem can be solved separately.
F
Okay, gotcha. So: how to organize it within the OpenTelemetry specification, and then, separately, how to build the tooling so that it's auto-generated. Thanks.
B
Okay, so I see that the PR, 2933, has some examples. I don't know if they're sufficient to actually tell what the problem is. Maybe add them there in the PR — that's fine too — or open an issue; either way works. But I'd like to understand more about specifically what is not working with what we have, or maybe also what the limitations in the generator are that we want to overcome. If you have that already in the PR, then great, I'll read it — I didn't read it in detail.
B
So, next: attribute limits for log records.
C
Oh, that's mine. This is a PR I opened a while back, and mainly I'm just asking for it to be reopened, if somebody has that authority. It introduces some environment variables for configuring the attribute limits. It's mainly just bringing logs into parity with spans, so I'm hoping it's not controversial from the standpoint of introducing new environment variables.
C
It's had a few eyes, so basically I'm just asking for some more eyes, and making sure that folks feel this is good. At least in the .NET SIG, we have had customers needing this — needing the ability to configure limits on attribute size — so this would be useful.
C
And especially as we are considering map-valued attributes for both spans and logs, I think there probably is a discussion there related to attribute limits for map-valued attributes, but I'm hoping that's not a blocker for this — again, just because it's bringing things into conformance with the current span spec.
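The parity being asked for here is what span SDKs already do: read count and length limits from environment variables and truncate attributes accordingly. A minimal sketch, assuming log-record variable names that mirror the span ones (treat the exact names and defaults as illustrative, not as what the PR finally settles on):

```python
import os

def _int_env(name: str, default: int) -> int:
    """Read an integer environment variable, falling back on parse errors."""
    try:
        return int(os.environ.get(name, default))
    except ValueError:
        return default

def apply_log_attribute_limits(attributes: dict) -> dict:
    """Drop attributes beyond the count limit and truncate long string values."""
    count_limit = _int_env("OTEL_LOGRECORD_ATTRIBUTE_COUNT_LIMIT", 128)
    # Effectively "no limit" default for value length, as for spans.
    length_limit = _int_env("OTEL_LOGRECORD_ATTRIBUTE_VALUE_LENGTH_LIMIT", 2**31 - 1)
    limited = {}
    for key, value in list(attributes.items())[:count_limit]:
        if isinstance(value, str):
            value = value[:length_limit]
        limited[key] = value
    return limited
```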
D
My earlier iterations of that PR were trying to do a similar thing to what you were saying — I think I even commented on those about max depth and all the rest. Somebody — Christian, or someone else — got me to pull it out, but yeah.
C
If there's no other comments or discussion there, the next item is mine as well — if folks are ready to move on.
C
Yeah. The next item is just a comment that we have some TODOs in the spec for the log record batching processor, and I'm putting forth maybe the low-hanging-fruit kind of proposal: that we just adopt the same exact defaults for the batching processor configuration options as we have for the span batching processor. I don't know...
C
...how we, as a community, devised those defaults for spans. I can imagine there might be some questions about whether they should maybe be different for log records. If there is any feeling about that, I guess my question then would be: what kind of investigation would folks like to see done to inform that decision? Or should we just adopt the same defaults, for consistency?
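For reference, the "same exact defaults" proposal would mean starting the log record batching processor from the batch span processor's configuration values in the tracing SDK spec (to the best of my knowledge these are the spec's numbers, but verify against the current spec text before relying on them):

```python
# Batch span processor defaults per the tracing SDK specification
# (verify against the current spec before depositing these anywhere).
BATCH_SPAN_PROCESSOR_DEFAULTS = {
    "max_queue_size": 2048,
    "schedule_delay_millis": 5000,
    "export_timeout_millis": 30000,
    "max_export_batch_size": 512,
}

# Straw-man log defaults under the "adopt the same defaults" proposal:
# identical values, with schedule_delay_millis being the number the
# discussion questions for logs.
BATCH_LOG_PROCESSOR_DEFAULTS = dict(BATCH_SPAN_PROCESSOR_DEFAULTS)
```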
D
I think the actual transport size of a span is going to be smaller than the potential transport size of a log. So I think we probably want a smaller size for logs, and a smaller time for logs, so that we flush them more frequently.
D
Yeah, I don't like it for logs either, but for spans it really depends: if you're creating a few spans, that's one thing; if you're creating a lot of spans, maybe. Internally, we have different setups: we have a one-to-two-second flush time for our internal product, our external product is like 30 seconds, but we also have internal teams that configure that one to two seconds to be...
B
Yeah, to me — I expect capabilities like live tailing, right? You want to see the logs almost in real time. Almost real time means that you do something and you almost immediately see it, on a human time scale. That's sub-second to me, maybe no more than half a second. It doesn't have to go all the way down to single-digit or even double-digit milliseconds, but go above a few hundred milliseconds and it no longer feels like...
B
...it's live, sort of happening in live mode. And that's a real capability that some products have — live tailing. You want to see your logs as they happen in the application.
F
Is there any prior art we can take inspiration from? The spans are an example of something that we can reference — I think it's a reasonable thing to look to for inspiration, and we're saying they're a bit too high. Do any other log-forwarding systems have any default intervals that we can look at?
C
I was going to say — I think the default from New Relic products, I believe, is actually five seconds. So that might be an example of prior art. But yeah, I hear the use cases for needing faster.
B
Right, right. I don't see a huge danger of going too low here, because you're most likely putting something like a collector between your application and the backend, which can then batch the data from multiple applications. That's a very, very typical situation, so it's unlikely that you're going to be causing a very high load on your backend because of choosing a small batching interval here.
F
That has been the inspiration for the defaults before — the use case of running an SDK that is exporting to a local collector. That's why gzip is disabled by default.
B
And we're not consistent here, right? We're setting defaults that assume this goes to the backend, and that we don't want to overload the backend by sending too many small requests. But this is not consistent with our earlier messaging — you're absolutely right. We said that we're going to assume the default configuration is sending to a local collector. So this five-second batching interval is pointless in that case — it's actually harmful.
D
Yeah — and it probably depends on the runtime, so maybe we have recommended defaults. For a client, you definitely want that to be longer, because if you've got browsers, you can have millions of browsers going directly to your backend; there won't be a local collector running in the browser.
B
We
have
that
actually
differentiation
for
some
other
environment
variables.
We
say
that
the
sigs
May
made
this
side
may
make
their
own
choices
depending
and
I
think
that
was
for
the
compression
actually,
so
we
set
the
gzip
may
be
chosen
as
a
default
by
by
JavaScript
or
something
like
that.
So
maybe
we
want
a
similar
wording
here
for
this
one.
F
Okay, well, let's try to aggregate some prior art. We can reference the collector's defaults for its batch processor; we can try to see what maybe Fluent Bit does by default. The Java implementation of this actually uses defaults that are different from the batch span processor's, just as a historical artifact — I'd have to chase down the git history to figure out exactly why, but my guess is somebody just arbitrarily chose them.
B
Okay, I added the next one. It's about making it more clear that we're not building a logging API. I keep hearing questions from people who are confused by what we have — they think that it looks like a logging API, somewhat, a bit; it's called the Logger. So I don't know what sort of changes we need to make there, but we need to make it crystal clear that this is not the case. One possible approach is to use the kind of more recent terminology of front end and back end.
B
I,
don't
think
it's
very
widespread,
but
some
logging
libraries
use
that
and
what
we're
building
is
really
a
back
end
right.
So
it's
so
what
is
also
sometimes
called
a
Handler
or
what
is
called
a
back
end.
So
maybe
we
use
those
words
to
make
it
clear
and
then
I
don't
know
what
else
we
can
do.
Maybe
another
alternate
would
be
to
to
rename
back
to
what
we
had
like
instead
of
logger
call
it
the
log
in
enter.
F
I've included a link in the document to our current phrasing on this. We say that it is not a goal of OpenTelemetry to ship a feature-rich logging library. So that's there, but it's kind of buried in the readme, and if you're looking at the log API document, it's not...
B
Yeah, I can actually take this — I can try to add some wording there. Let me assign this to myself... it's already assigned to me. Okay, I'll create a PR.
F
I added this issue. A while back, we had a PR merged that made it possible for log record processors to mutate log records, and this is essential for supporting enriching log records with baggage, for doing things like redacting sensitive information from logs, and for other use cases — and that wasn't possible before that PR was merged.
F
Bogdan commented on that PR and took issue with making log records mutable, and so I just want to put this issue to bed. I think we should discuss this and either make changes or accept the log record processor as it is today. I opened this issue to kind of track that conversation; I've laid out a couple of options — they may not be the complete set of options — but take a look.
F
Let me know what you think, and we can have kind of a record of the stance on this topic.
F
Yeah, I think that's super reasonable. It's one of the options that I've laid out in this issue.
F
Yeah — when I was thinking about this more deeply, I determined that you could actually have a lock-free strategy that doesn't require up-front declaration of whether a processor is mutating. The implementation could track whether any of the processors actually do mutate, and then, based on whether or not any mutation has occurred...
F
...when it comes time for the batch log processor to do the handoff to the log exporter, it has to create, like, an immutable copy of that data to hand off to the log exporter — and based on whether any mutations have occurred or not, you know whether you need to take a lock or not when you're transforming that into an immutable copy.
F
No, I guess I'm just enumerating the possibilities. I think I'd probably be most in favor of declaring whether the processor wants to mutate the data, because of the simplicity of implementation that it provides. But, in the spirit of exploring the solution space, I included another option.
F
So, in this proposal, a log processor declares whether it intends to mutate the data or not, and based on whether one or more log processors intend to mutate the data, the implementation can change to either a lock-free version or one with locks. So it just provides, like, a hint to the implementation on, basically, the performance path to take.
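The "declare up front" option being described can be sketched as follows. The interface names here are hypothetical (the actual logs SDK spec defines its own processor interface); the point is only that a declared `mutates` flag lets the pipeline choose a locked or lock-free path once, at construction time:

```python
import threading

class LogProcessor:
    """Hypothetical base class: processors declare whether they mutate."""
    mutates = False  # the declared capability / hint described above

    def on_emit(self, record: dict) -> None:
        pass

class RedactionProcessor(LogProcessor):
    mutates = True  # this processor rewrites the record body

    def on_emit(self, record: dict) -> None:
        record["body"] = record["body"].replace("secret", "***")

class Pipeline:
    def __init__(self, processors):
        self.processors = processors
        # Only pay for a lock when at least one processor declared mutation.
        self.lock = threading.Lock() if any(p.mutates for p in processors) else None

    def emit(self, record: dict) -> dict:
        if self.lock:
            with self.lock:
                for p in self.processors:
                    p.on_emit(record)
        else:
            for p in self.processors:
                p.on_emit(record)
        return record
```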
F
Well, you could actually have multiple processors that do asynchronous operations on the data. So imagine having an asynchronous processor at the head of your pipeline and then a batch processor at the end of your pipeline. This asynchronous processor queues up the log records, and then a separate thread pulls them off the queue and performs some mutation or enrichment on them; and then, later down the line, the batch processor is similarly doing an asynchronous operation — and so it's non-deterministic.
B
Non-deterministic — why are we allowing that? That doesn't sound like a good design to me. Why is it not just a sequential operation from one processor to another, where you have to process it synchronously? Whether you have a queue or a background thread doesn't matter: you're blocking whatever method is called when we hand over to the processor, and you return when it's done, and then we hand it over to the next processor. So the result is a single owner of the data.
F
You know, for performance reasons they batch up the records on a queue, and they do as little work as possible while batching them up, and then a separate thread later pulls them off that queue and does some work with them. So there's...
B
Sure, sure, but I'm saying you keep the data ownership, so there's always only a single owner of the data. That's what we did in the collector — the batch processor works exactly like that. You give it a log record; it says, yes, I took the ownership; and then it does have the ownership until it needs to give it to the next processor or to the exporter. That's what happens.
B
It
can
return
asynchronously
right,
but
then
it
means
that
the
the
processor
now
owns
the
data
and
there's
no
need
to
lock
anything.
It
can
do
any
mutations
it
wants
and
then,
when
it
needs
to
hand
it
over
to
the
next
processor
or
to
the
exporter
it
gives
up,
gives
up
the
the
log
record
completely.
So
you
forget
the
log
record
right,
something
like
that.
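The ownership model argued for here can be shown in a few lines. This is a hypothetical sketch, not the collector's actual code: each stage exclusively owns the record while processing, mutates it freely without any lock, then hands it on and must not touch it again.

```python
class OwningProcessor:
    """Hypothetical processor: single owner of the record while processing."""

    def __init__(self, next_stage):
        self.next_stage = next_stage  # the next processor, or the exporter

    def process(self, record: dict) -> None:
        # We are the single owner here: mutate in place, no locking needed.
        record.setdefault("attributes", {})["enriched"] = True
        # Hand ownership over; after this call we must not touch `record`.
        self.next_stage(record)

exported = []
# A plain list stands in for the exporter at the end of the chain.
processor = OwningProcessor(exported.append)
```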
F
So if we went down that approach, we'd essentially be saying that log processors, when multiple are registered, are called synchronously and sequentially, and if any of these log processors want to do anything asynchronous with the data, it's up to them to make a copy of that data, so that they're not impacted by potential processing from others — no?
B
That's not what I'm saying — no, no, that's not what I'm saying. Yeah, they have to be called sequentially, but whoever makes the call doesn't have to be blocked, when it hands over the data, until the processing is fully finished. It doesn't have to be like that; we don't necessarily have to block. So — are you familiar with how the collector does this processing? If not, I can maybe show you where it happens in the code; maybe I can show you some diagrams we have.
J
I just want to say, I think you guys are talking about the same thing, in that the processor is synchronous. If you do add, let's say, a log record into a batch, and you want to then modify it later — like at the time of export — you're sort of taking on the risk, then, that you're modifying data that could impact other things.
J
And we have seen customers do this in .NET, for example. With logs, we have users that are running these regular-expression engines over the log data to scrub connection strings and credentials. So we said you should do that in a processor — but then you're going to delay your application any time it's logging. So the only way you can do that sort of asynchronously is to just use a batch exporter and then, as you're exporting on a dedicated thread, run that logic there.
F
Essentially, we describe the pipeline and we describe ways to protect yourself from non-deterministic behavior, and part of that protection is making sure that, if you're going to do asynchronous modification, you make a synchronous copy of that data first, before you queue it up to process it asynchronously.
H
I think the thing that's not clear to me here is whether these log processors act as a pipeline, or whether they are a set of independent operations on the same data. With the span processors in the trace SDKs, they operate independently on the same data: they're called sequentially, given the same data, but they don't feed one into the other, and they receive immutable data — so they don't really have this concern, because they can't mutate that data. But is the expectation that processors form a pipeline, or do all processors operate independently?
H
And filter and redaction processors are described to work the same way, at least in the trace SDKs — it's more complicated there, because you need to make a writable clone of the read-only data that you get, because we decided that these couldn't mutate. But I think the intent is that you set up processor chains if you intend for a set of processors to operate on the same data sequentially, rather than operating on copies of the same data.
B
Yeah, I don't see how you do it otherwise. If you want the data to be mutable by the processor, it has to be a sequential operation — it has to be a chain, and one processor has to finish its job before the next one starts. Otherwise you end up with concurrent modification, or you end up with copies which have different modifications done to them, and it's unclear what you do after that.
F
The way you have to do it in traces is in exporters: you have to create a series of exporters that are kind of chained — exporters that filter out data and call the next exporter that they delegate to, at least.
F
Yeah. Well, maybe there's a way — with this requirement that processors specify whether they mutate the data — to have symmetry with how spans are set up, but still allow processors in the log pipeline to mutate the data. So there's a happy medium without having to introduce the concept of a processing chain, where processors now have this new responsibility to call the next one in the chain.
B
Then you give it to the next one — that's what I was leading to. Why do we even need this in the first place, if there is no concurrent modification? And there shouldn't be concurrency. The reason this is a problem in the collector is because there we do actually have pipelines that can operate in parallel, and the same data may be fed into two different pipelines, so we do have true concurrency there. Here in the SDK, we don't allow that, as far as I understand, so there should never be a need for this.
B
In the collector, for the exporters, we say that it's the exporter's responsibility to do the cloning if they modify, because most exporters don't need to do that; whereas for the processors, we recognize that it's very common to mutate, so it's a declared capability of a processor. But in this case, in the SDK, I don't really understand why it's necessary. Just use the ownership model: you own the data, you can mutate it if you want, you don't need to do any locking or anything — you're done with mutation.
D
Yeah, I agree. I think we should define processing chains as a concept, if we need to. The only thing I can potentially think of that might need it is if you've got a processor that wants to, like, fork the request — so it's like a tee processor — but then it would be that processor's responsibility to clone the data to send it down two different chains. But yeah, I definitely like it — internally, in all our products, even in JavaScript...
F
All the problems go away if you only have a single asynchronous processor. If only one processor is queuing up these records and then processing them on some different thread, then that processing on the separate thread can itself be synchronous, and you don't have to worry about the performance impact of doing synchronous processing from the application's perspective.
F
That's what Mike was pointing out earlier. And so maybe the batch processor just needs to be adjusted to have a configurable set of, like, operators — mutations that can happen after the entries have been queued up to be processed asynchronously, when they're being pulled off that queue and before they're sent to the exporter.
F
So say the batch processor is at the head of your pipeline — let's say it's the only processor you have, and it's at the head. The batch processor takes log records and synchronously queues them up, and let's say that you are allowed to configure a batch processor with one or more additional processors. What it's going to do is synchronously queue up records and then, when it pulls them off the queue asynchronously, it passes them to the additional processors that you've configured it with — and so those can now just operate synchronously, without having to worry about impacting the performance of the application.
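A minimal sketch of that idea, with invented names (this is not an existing SDK API): the batch processor only enqueues on the caller's thread, and the extra configured processors run when the queue is drained, i.e. off the application's hot path. For determinism the draining is shown as an explicit `flush()`; in a real SDK it would run on the dedicated export thread.

```python
from collections import deque

class BatchLogProcessor:
    """Hypothetical batch processor that runs extra processors at drain time."""

    def __init__(self, exporter, extra_processors=()):
        self.exporter = exporter
        self.extra_processors = list(extra_processors)
        self.queue = deque()

    def on_emit(self, record: dict) -> None:
        # Hot path: just enqueue; no user logic runs on the app's thread.
        self.queue.append(record)

    def flush(self) -> None:
        # In a real SDK this would run on the dedicated export thread.
        batch = []
        while self.queue:
            record = self.queue.popleft()
            for proc in self.extra_processors:
                proc(record)  # safe to mutate: the batch owns the record now
            batch.append(record)
        if batch:
            self.exporter(batch)
```

This is also how the .NET credential-scrubbing scenario mentioned earlier would fit: the regex scrub runs as an `extra_processors` entry on the export thread instead of delaying the logging call.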
J
I kind of like that. I have a need right now: I have a couple of customers asking about filtering. They have multiple exporters, and they want different rules for which exporters send what — like high priority, low priority, volume and stuff like that — and they're trying to do it with a processor. The problem is that the processor has to mutate the data to pass that along. So the first processor says, "I'm going to filter this out," and then the next exporter doesn't have an opportunity to do its thing.
B
I don't think that's sustainable. We can't replicate the entire collector's functionality in all of the SDKs. There is a reason we implemented one single collector: we do it once. There are so many things that you can do — if we try to do that in all of the languages, we will not just bloat up the SDKs, but it will also be an enormous amount of work, multiplied by the number of languages. Essentially, I think we should avoid that.
B
Yeah,
maybe
I
guess
we
we
should
have
only
maybe
it's
too
late,
I
guess
it's
too
late,
but
we
should
have
only
added
the
things
that
they
that
must
be
in
the
SDK
things
like
batching,
or
something
like
that
and
delegate
all
other
processing
functionality
to
The
Collector.
You
either
send,
as
is
directly
to
your
backend.
If
you,
if
you
don't,
want
to
use
the
collector
or
use
the
collector
and
do
the
processing
there
any
sort
of
mutation
or
filtering
or
anything
that
needs
to
be
done.
J
...must provide a filtering processor — it just says, like, one of the use cases for processors is filtering. So people come and ask, "Okay, how do I do that?" And then we say: okay, if you want to filter everything from your pipeline, you're good; for any extra, more complicated logic — there be dragons.
F
Well, the dragons are just that it happens synchronously on the calling thread, so it has the potential to impact the performance of the application.
J
Well, I mean, even how our chain is set up, right? If you have five processors, they're all going to get called. So if that first one wants to filter out the thing, it doesn't have a way to, like, abort the chain. All it can do is set the recorded flag, or add a tag — you have to sort of add some information so that it flows through.
D
Like in App Insights, we effectively have the plug-in chain, and it is the responsibility of a plug-in to call the next plug-in. But to manage that, I also have a higher, overriding processing context, so that the plug-in doesn't actually call the next plug-in — it actually calls the next context. So if a plug-in wants to completely drop an event, it just stops, and then everything else stops. So it sounds like we...
D
We
probably
want
to
Define
that
for
the
logging
processes
and
say
you
know
either
the
batch
processor
was
at
the
head
or
we
just
say
that
you
know
this
is
how
they
should
be
implemented.
So
it
is
in
asynchronous
chain
where
processor
a
has
to
call
processor
B
rather
than
having
it
or
synchronous.
So
therefore,
process
array
could
call
processor,
B,
Secret
asynchronously,
all.
F
It's kind of a best-of-both-worlds situation: you can do synchronous processing if you want to, but if you need to, you can do asynchronous filtering and enrichment as well.
D
Would it be possible to create a chain log processor, and you just have that as the head? So effectively you're enforcing the chain without strictly requiring a chain — you have a processor which enforces it for you. So, rather than saying it's embedded in the batch processor, we just pull it out and call it a chain processor.
F
Yeah, and that could be possible too.
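The "chain log processor at the head" suggestion can be sketched like this (hypothetical names — nothing here is an existing spec interface): the SDK still registers a single processor, and that one enforces the sequential chain, including letting a stage drop a record entirely, which addresses the "can't abort the chain" filtering problem raised earlier.

```python
class ChainLogProcessor:
    """Hypothetical head processor that runs stages sequentially."""

    def __init__(self, stages):
        # Each stage is a callable: record -> record | None (None drops it).
        self.stages = stages

    def on_emit(self, record):
        for stage in self.stages:
            record = stage(record)
            if record is None:  # a stage filtered the record; abort the chain
                return None
        return record

# Example stages: drop DEBUG records, then tag whatever survives.
drop_debug = lambda r: None if r.get("severity") == "DEBUG" else r
tag = lambda r: {**r, "tagged": True}
chain = ChainLogProcessor([drop_debug, tag])
```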
F
So we'll see. I think everyone should chime in on the thread — or on the issue — with these different ideas. The fear in my head is...
F
If
this
diverges
too
much
from
the
design
of
Trace
tracing,
it's
going
to
it's
going
to
be
a
big
discussion
and
there's
going
to
be
require
a
lot
of
justification
for
why
the
design
should
diverge,
because
you
know
on
the
design
diverges
it
needs
to
be
vetted
and
it's
going
to
have
impacts
on
on
ergonomics
for
for
users,
because
they're
going
to
have
to
become
acclimated
to
or
to
different
ideas
and
different
signals.
So
that's
that's,
probably
a
bigger
chunk
of
work.
F
But we'll see. Unfortunately, I have to run, everyone. So I'll talk to you next week.