From YouTube: 2022-07-07 meeting
Description
OpenTelemetry Profiling WG
B: All right, looks like the flow has slowed down a little bit, so I guess we can go ahead and get started. First of all, welcome back everyone. I hope you had a nice bonus week to catch up, or relax, or whatever.

B: If this is your first time here: this is the fourth meeting to talk about profiling and adding profiling as a supported event type to OTel. Early on we mostly talked with a lot of different people who were involved with various facets of profiling about what their goals were, and now we're starting to think about what an ideal format would look like that can support the widest array of use cases.

B: We've been evaluating various formats from companies who are using custom formats, as well as public formats like pprof and, hopefully today, JFR, in order to better understand what the landscape looks like and what types of problems we can hopefully distill down — and to find one agreed-upon format, or a generally agreed-upon format, that will support the most use cases.

B: A couple of weeks ago we talked about some general high-level goals, about what we're ultimately trying to achieve. The main ones being: the ability to do data-center, system-wide profiling; the ability to connect profiles to other signals; representing profiles across native code and runtimes — I think we'll hear some more about that today; and then the ability to map existing formats to whatever format we ultimately decide is best. That's still a working list, and we're continuing to add to it.

C: All right, I've got a quick question. I saw the list of goals and I missed the last meeting, but one thing we could maybe add over time is the motivation for the goals: which of these goals are so that we, as vendors of profiling, can share technology and reuse solutions, and which of them are in the interest of users — what the motivating factor behind them is. That would be good to capture.

B: Okay, I will note that down.
B: Right, cool. Well, if you are not part of the Slack I'll paste it — oh yeah, also feel free to add yourself to the attendees list in the meeting notes; I will post it in the chat here. Also, if you're not in the Slack, I'll add a link to the Slack as well. But the Elastic / profiler folks — I keep saying "Elastic slash profiler", I don't know which name you would prefer — anyway, they created a doc about their custom format.

B: It's super detailed — thank you very much for adding that; it's also just cool to see as someone interested in profiling. I wanted to give you all a chance to perhaps summarize it. There's a link to the doc in the meeting notes for those who want to dig a little bit deeper, but perhaps, if someone from your side wants to summarize it, we can discuss some of the key points from your format.
D: Yeah, so I think the key decisions that we made in the format, which to some extent I would like to see in a future standardization, are the following. First of all, I'm actually a strong believer in not sending the entire stack traces, but sending a hash of the stack trace — and that's probably also the most controversial decision in the protocol, judging from the comments on the doc. The second thing that I found beneficial is trying to do columnar alignment of values in the format, versus a row-based or value-based alignment, just for better compression. And lastly, I'm partial to using protobuf, similar to pprof — it doesn't have to be protobuf, but I'm partial to using something to specify the protocol that can then be used to generate parsers for different languages, because we all know that everybody is working with heterogeneous languages on the back end, and protobuf, or something similar, has the benefit of being able to generate parsers for very different infrastructures.
B: Yeah, thanks. It seemed like, from the conversation both in the doc and in Slack, that the not-duplicating-stack-traces bit was the main point of discussion. Do you want to describe how you're doing that, the reasoning or motivation behind it, and how you got to the current state it's in?

D: We see stacks easily exceeding 128 or 200 frames, so you get a huge amount of data if you send out the entire stack frame each time. We noticed early on that if we want to stay within the envelope of performance and network bandwidth that we wanted to stay in, just sending all the frames all the time isn't really an option — at which point we decided: okay, we need a way to avoid sending the frames all the time.
D: So what we literally do is we hash the stack traces, and that hash forms the ID of the stack trace. You send out the stack trace the first time your client sees it, and on subsequent encounters you just send the hash and the count — how often you saw it. If we wanted to go crazy, you could squeeze more efficiency out of it by splitting the trace into the leaf function and the rest of the trace, because everything but the leaf tends to stay more constant over time.

D: So you could even get things to be more efficient if you wanted; we decided against that possibility. Long story short: we identify a stack trace by a hash of its individual components, and the first time a client encounters that trace it sends the entire trace to the back end; on subsequent encounters it only sends the hash. It doesn't have to remember hashes forever — if you just keep a local cache, that eventually gets filled and entries get replaced, and you'll send the same trace again at some later point, which isn't actually bad for resilience in case the first message got lost. So over time you converge to full knowledge of everything on the back end, even if you had intermittent faults elsewhere. The summary really is: take the stack trace, hash it, and then, whenever feasible, send the hash instead of the full trace.
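As a rough sketch of the hashing scheme described above — not Elastic's actual wire protocol; the Frame, FullTrace, TraceRef and Sender names, and the crude eviction policy, are invented here for illustration — the idea might look like this in Go:

    package tracededup

    import (
    	"crypto/sha256"
    	"encoding/binary"
    )

    // Frame is a single stack frame, reduced to opaque identifiers.
    type Frame struct {
    	FileID uint64
    	Addr   uint64
    }

    // FullTrace is sent the first time a trace is seen; TraceRef afterwards.
    type FullTrace struct {
    	Hash   uint64
    	Frames []Frame
    	Count  uint64
    }

    type TraceRef struct {
    	Hash  uint64
    	Count uint64
    }

    // hashTrace derives a stable ID from the individual components of a trace.
    func hashTrace(frames []Frame) uint64 {
    	h := sha256.New()
    	var buf [16]byte
    	for _, f := range frames {
    		binary.LittleEndian.PutUint64(buf[:8], f.FileID)
    		binary.LittleEndian.PutUint64(buf[8:], f.Addr)
    		h.Write(buf[:])
    	}
    	sum := h.Sum(nil)
    	return binary.LittleEndian.Uint64(sum[:8])
    }

    // Sender remembers a bounded set of recently sent hashes. When the cache
    // fills up it forgets old entries, so the same trace is eventually re-sent,
    // which also gives resilience if the first message was lost.
    type Sender struct {
    	seen map[uint64]struct{}
    	cap  int
    }

    func NewSender(capacity int) *Sender {
    	return &Sender{seen: make(map[uint64]struct{}, capacity), cap: capacity}
    }

    // Report returns either a full trace (first encounter) or a hash-only reference.
    func (s *Sender) Report(frames []Frame, count uint64) (*FullTrace, *TraceRef) {
    	id := hashTrace(frames)
    	if _, ok := s.seen[id]; ok {
    		return nil, &TraceRef{Hash: id, Count: count} // hash + count only
    	}
    	if len(s.seen) >= s.cap {
    		s.seen = make(map[uint64]struct{}, s.cap) // crude eviction: forget everything
    	}
    	if s.cap > 0 {
    		s.seen[id] = struct{}{}
    	}
    	return &FullTrace{Hash: id, Frames: frames, Count: count}, nil
    }

Setting the cache capacity to zero degenerates to always sending full traces, which is the memory-versus-traffic trade-off that comes up a bit later in the discussion.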
E: Let's make it clear that you lose something by doing that: what you lose is the ability of full inspection at the intermediaries. We made a different choice when we were designing OTLP for traces, metrics and logs — we decided to include all the required, fully necessary state with all of the messages that we send in OTLP, so that intermediaries don't have to reconstruct state.

E: It may not be necessary for profiling, so perhaps that's the valid choice for profiles — but it's a trade-off, and demonstrating that having this statefulness saves very significant volumes of data, and perhaps showing that intermediary filtering is not even necessary for profiles, would be a very strong argument in favor of the choices that you made.
D: Cool, yeah — one very quick note: intermediary filtering of stack traces is something that is exceedingly difficult to do for native code anyhow, because the collecting agent usually doesn't have the debug symbols locally.

D: So the only comment I would like to add to what you said is that if we want to filter and make different routing decisions at an intermediary based on particular stack frames and functions being present, that is not something that is easily feasible for native traces anyhow. But yeah, perhaps.

F: Yeah, so I have a question regarding client memory requirements in this scenario. I imagine once you start hashing strings you have to keep track of that information. Maybe this is not really relevant in the context of it all, but I wonder if you've done an analysis of the extra memory requirements, and whether that's a consideration — whether that's a trade-off or not, that kind of thing.
D: So the beauty of the hashing scheme is that you can set the memory requirements, meaning you decide how many hashes you're going to be remembering, and you can more or less trade off network traffic for lower memory requirements if you want. I'd need to check what we are reserving — I think it's on the order of a megabyte or two, maybe ten — but this is absolutely at the discretion of the client; in the limit you set it to zero and then you're sending stack traces all the time.
G: Yeah, just a quick question for Tigran. I was wondering — because I'm not that familiar with the other OTLP protocols — could you explain what the use case for filtering is, or what it means to filter in intermediaries?

E: I need to have both pieces of information at the same time — the samples, the counters, and the stack trace — and with this approach you're not sending them at the same time, right? You send the stack trace when you see it the first time, but the counts that are sent subsequently reference the stack trace by the hash; they do not contain the stack trace.

E: So to do this filtering, I have to keep the stack traces when I see them, and I then essentially have to keep this state forever, because I may see a reference to that particular stack trace later, in order to do the filtering.
E: Scenarios like this happen all the time. When you have a large organization, you use intermediaries to collect and batch all the data before it goes to your vendor of choice, and sometimes what you observe in your intermediary is absolutely not under your control — but you still know that it's pointless data. You don't want it to be sent to the vendor, because it costs you money.

E: So you just want to drop it — but you want to drop precisely what you don't need. Again, it depends on whether you have such a need — and I may be wrong here, maybe in the profiling world it doesn't happen — but it is very typical to do this sort of filtering or routing. Let's say I don't drop it, but I send it to some other cold storage, cheaper storage. So filtering, reduction, and rerouting are very common operations that you do at the intermediary.
G: Cool, yeah. I think the philosophical difference with what we're doing is that we fundamentally want to catch everything all the time, and if something is significant enough that it will be costing you on the back end, then it's probably significant enough that you want the profile from it, if you get what I mean. But I haven't thought it all the way through.

B: I don't know if anyone with their hand raised wants to respond or ask questions. Okay.
H: I wanted to expand on this. If we make a parallel with logs, which are very simple to reason about: most of the time you wouldn't filter on the log message, but you might filter on the log labels — I don't know, host name, or service type, or whatever. If you think in this context, we also have additional labels that we attach to stack traces, like the container or the hostname, and you can filter on those even if the stack trace is a hash.

D: Sorry, if I can butt in for a second — it is true that the hashing will make it difficult, or impossible, to do filtering and routing at the intermediary without additional client support.

D: A question that I posed on the doc is that, if the filtering is desired, there would be the option of just pushing the data-collecting client to attach a metadata label, because we already have the container name, the pod name and so forth attached to the counts. But yeah, it is a trade-off in the sense that, if you want to implement that filtering, then you will need client support.
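A sketch of the label-based filtering idea: an intermediary can route or drop profile messages purely on the metadata attached by the client (hostname, container, namespace, and so on), without ever resolving the hashed stack traces. The message shape and the rules below are hypothetical, not an agreed OTel format:

    package profilerouter

    // ProfileBatch stands in for a wire message whose samples reference
    // stack-trace hashes; only the labels are needed for routing.
    type ProfileBatch struct {
    	Labels  map[string]string // e.g. "host.name", "k8s.container.name"
    	Payload []byte            // opaque samples referencing stack-trace hashes
    }

    type Route int

    const (
    	Drop Route = iota
    	SendToVendor
    	SendToColdStorage
    )

    // Decide applies simple rules of the kind a collector-style intermediary
    // might be configured with.
    func Decide(b ProfileBatch) Route {
    	if b.Labels["deployment.environment"] == "dev" {
    		return Drop // data you don't want to pay the vendor for
    	}
    	if b.Labels["k8s.namespace.name"] == "batch-jobs" {
    		return SendToColdStorage // cheaper storage for low-priority workloads
    	}
    	return SendToVendor
    }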
F: I'll just add, on the topic of filtering: one thing we've seen security teams insist on is some sort of intermediary that would filter out data that they don't want to share with vendors. Some people are very concerned about specific labels or stack trace names, so that's another kind of use case for that.

B: Cool, yeah. Any other questions about that? Alexey?
I: Yep, on filtering: one thing is that, with profiling data, sometimes you cannot just drop the data, because, for example, for CPU samples, if you drop some CPU samples then it will essentially skew the profile. I think it cannot just be dropped — it needs to be aggregated into some kind of "ignored" bucket or something. Just mentioning that, for example, in our profiling tools we go through many hoops to make sure that, even if there is some back pressure and data cannot all be streamed out, you still provide some aggregated metric for it.
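A minimal sketch of that point: instead of silently dropping CPU samples under back pressure (which would skew the profile), fold the dropped weight into a synthetic "ignored" bucket so the totals stay correct. The Sample type and the sentinel hash are made up for illustration:

    package sampledrop

    type Sample struct {
    	TraceHash uint64
    	CPUMillis uint64
    }

    const ignoredBucket uint64 = 0 // sentinel "other/dropped" trace ID

    // Squash keeps at most max samples, accumulating the remainder into a
    // single ignored-bucket sample instead of discarding it.
    func Squash(samples []Sample, max int) []Sample {
    	if len(samples) <= max {
    		return samples
    	}
    	kept := append([]Sample(nil), samples[:max]...)
    	var droppedCPU uint64
    	for _, s := range samples[max:] {
    		droppedCPU += s.CPUMillis
    	}
    	return append(kept, Sample{TraceHash: ignoredBucket, CPUMillis: droppedCPU})
    }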
B: Cool. So yeah, just to summarize, as we think about moving forward — I know you briefly mentioned it, or someone did — as you're thinking about this standardized format, I just want to make sure I understand: the main pieces that you would like to see are some mechanism that you can use to not duplicate stack traces, and columnar alignment versus row alignment. Maybe you could expand on how that would work?

D: So I guess, if I had to rank my wishes: wish one would be deduplicating stack traces; wish two would be to use something that has a description language, similar to protobufs or whatever, that can then generate parsers; and number three would be columnar storage. The columnar storage is mostly because compression works so much better when you do columnar alignment, so that's really just an optimization for getting better data compression.

D: So essentially my first wish is in order to reduce the overall volume; the second wish is to allow people with heterogeneous back ends to automatically generate parsers and not be stranded having to write parsers from scratch, because nobody wants to write a low-level parser; and the third one is, again, for efficiency's sake.
B: Cool. And I know you also touched in the doc on the comparison to pprof and JFR — I think we're about to transition to some JFR talk anyway — so maybe can you briefly touch on where pprof and JFR fall short in the areas that you needed?

D: My biggest concern with JFR is that there is no official spec, and there are very few implementations of anything parsing or writing it. As a former Googler, my instinct is obviously going with gRPC and protobuf — that's just the damage that that does to you. Let me pull up the doc.
I: Yeah, if it's okay to ask: there was one thing that confused me a bit, or that I was just looking to clarify for myself. You mentioned columnar storage, and my impression was that the storage format is almost out of our control, because if we talk about OpenTelemetry we kind of rely on the transport mechanisms that OpenTelemetry provides.

D: So when I speak about columnar, it's mostly about the arrangement of fields in the network messages, meaning ideally you want to keep similar fields close together, versus having them row-oriented in the network messages.

D: And it's largely because LZ-based compressors want localized repetitions. We can discuss whether it's worth doing — we got about 20% better compression out of it using gzip — so again, it's a trade-off.
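A sketch of the row- versus column-oriented arrangement being discussed, expressed as Go structs standing in for wire messages (the real format would be a protobuf or similar schema; these names are illustrative):

    package layout

    // Row-oriented: one record per sample, different field types interleaved.
    type SampleRow struct {
    	TraceHash uint64
    	Count     uint64
    	TimeDelta uint64
    }

    type RowBatch struct {
    	Samples []SampleRow
    }

    // Column-oriented: parallel arrays, one per field. The i-th sample is
    // (TraceHashes[i], Counts[i], TimeDeltas[i]). Similar values — e.g. many
    // small counts, many small time deltas — end up adjacent on the wire,
    // which gives LZ/gzip-style compressors the localized repetition they want.
    type ColumnBatch struct {
    	TraceHashes []uint64
    	Counts      []uint64
    	TimeDeltas  []uint64
    }

The same arrangement is also what makes the allocation benefit mentioned next possible: a few large slices instead of one allocation per nested message.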
E: Sorry — you're also going to see less memory fragmentation and better memory allocation by doing the protobufs that way, because these arrays are usually allocated as single slices, whereas the nested messages are usually individual allocations. So you benefit in memory as well with that encoding.

B: Oh yeah, if you could dig that up and share it, we'd love that, because we'll talk a little bit about benchmarking and such, so it would be useful to know.

B: Anybody else? I guess — yeah, okay, we'll move on for now; if people have thoughts on that, feel free to add them to the doc. So we have the Datadog folks, who also mentioned wanting to talk some about JFR themselves.

B: I don't think we had anyone from your side here last week, so feel free to take it away and go in whatever direction you want to take it.
J: Yeah, let me share the screen first. I assume you can see the purple screen with the presentation. I'm going to try to do a very, very quick presentation — a very brief introduction, not going into anything deep. So it's an overview of JDK Flight Recorder, JFR.

J: It originated a long time ago in the JRockit JVM; then it was acquired by Oracle, then it merged with Sun, and then finally it was open-sourced — the whole implementation, with the writer and everything — in OpenJDK 9, and we got a backport to OpenJDK 8 in update 262. Datadog was also taking part in the backporting effort, and Oracle is also still shipping a closed-source version, Oracle JDK, which is still kind of active. The key feature of JFR is that it's fully integrated with the JVM.

J: It's completely hooked into the runtime, the compiler, the GC, what not. It's very lightweight; everything is event-oriented, so the idea is to take as little time and as few resources to write the data as possible.
J: Everything in JFR — the implementation and the file format — is in service of this goal. That's why we don't have columnar structures or anything like that: it's more difficult and more costly to maintain. And it's a fully self-describing storage format: you just need to know a few details about how the format is structured, and after that all the events, all the types, all the values — you can read them without any extra knowledge or external description.

J: So everything is in the recording itself. Everything is an event — that's the base unit of a JFR recording. Each event has a start and end timestamp, it is associated with a thread through a thread ID, it can have a stack trace (it doesn't have to), and we can put any number of other custom data fields on the event as we want.
J: The stack traces are actually deduplicated per chunk — I'm going to talk about chunks slightly later — so we are not sending the full stack trace for each event; we just send the stack trace ID, and then it points back to the stack trace. We call it a constant pool for that.

J: There is an XML definition, and it explodes into a bunch of C++ files that get compiled together, and it's hooked in really deeply in the JVM. You can also have user-defined events, for which there is a Java API: with those you can write your own Java events, and they will be integrated with JFR and emitted the same way as the rest of the recording.
J: So now I'm going to talk briefly about the storage format. As I mentioned, we have the recording — that's the top unit: you start recording at one time, you end the recording at another time, so it's a time-bound collection of events. The recording is internally split into chunks, and each chunk is a self-standing unit of information. It has a header, with some information describing the chunk, and then a metadata event, which describes all the types which are in this particular chunk.

J: The metadata event can be repeated, and the definition of the types in the recording is actually incremental — you can use subsequent metadata events to add more information, like when you register new events during the recording; the new types used there will be in that new metadata event. Similarly, we have the checkpoint event — the name is kind of confusing; it is a constant pool event.
J: The checkpoint events contain the constant pool data, so they contain the data for strings — for deduplicated strings — and for the stack traces, but you can create custom pools for any type, even for user types. So if you define your own user type, you can tell the format that this type is using a custom pool, and then everything should use pointers to constants instead of putting all the data directly into the event.

J: The type descriptions are based on built-in types — there are numeric types, boolean and string — and then you can define other types based on these built-in or primitive types, so they are composite types. A type needs at least a name and attributes, and each attribute has a name and a type. So there is a very simple definition language for the types, which are then used in the particular chunk. Constant pools, yeah — these are the cache for redundant values, so we point back to the constant pool.

J: So we don't need to store the same data over and over. There are built-in constant pools for strings and stack traces, as mentioned before; there might be more custom pools for other types, and then it's up to the producer of the recording to actually store the data in the custom pools, and up to the parser to read from the constant pool.
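A simplified sketch (not the real JFR binary layout) of how a parser resolves constant-pool references: events carry small IDs, and the chunk's pools map those IDs back to the deduplicated values. All names here are illustrative:

    package jfrsketch

    type StackTrace struct {
    	Frames []string
    }

    // Chunk holds the pools written at chunk finalization; every chunk is
    // self-contained, so no state from earlier chunks is needed.
    type Chunk struct {
    	StringPool     map[uint64]string
    	StackTracePool map[uint64]StackTrace
    }

    type ExecutionSample struct {
    	ThreadID     uint64
    	StackTraceID uint64 // pointer into the chunk's stack-trace pool
    	StartTicks   uint64
    }

    // Resolve looks the referenced stack trace up in the chunk the event was read from.
    func Resolve(c *Chunk, e ExecutionSample) (StackTrace, bool) {
    	st, ok := c.StackTracePool[e.StackTraceID]
    	return st, ok
    }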
J: The idea of the chunks being completely self-reliant and self-describing comes from the need to write events from multiple threads at the same time with the minimum possible contention. How the JVM, or JFR, does it internally: it will open a chunk per thread, using a size limit or a time limit for the chunk, after which the chunk will be concatenated onto the main recording — and during this time only one thread is writing data to the chunk.

J: Internally it's memory-mapped on disk, and then it just appends new events. Once the chunk is about to be appended, it's finalized: that means the constant pools and the metadata with the types get written to it, and then it's moved to the part where it's going to be joined with the previous chunks.

J: While this makes it very easy to write in a highly concurrent environment with very little contention, there are some ordering issues, because we are basically flushing the chunks at any time and the events are not physically ordered by the time when they happened. So when you are parsing, you need to rely on the timestamp of the event to restore the order.
J: For the timestamps, JFR uses RDTSC when it's available, so it's a cheap monotonic clock source — but the thing is that the ticks are not convertible to epoch milliseconds. So in the JFR recording we did the trick of storing the epoch milliseconds and the ticks in the chunk header, and we also store the tick frequency, so we can divide the number of ticks by the frequency and there we go — we get the milliseconds from the ticks.
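A sketch of that chunk-header time conversion. The field names are illustrative, not the actual JFR header layout; the point is that the chunk carries a wall-clock anchor, the tick counter at that anchor, and the tick frequency, and event timestamps are ticks measured against it:

    package jfrtime

    import "time"

    type ChunkHeader struct {
    	StartNanos     int64  // wall-clock time at chunk start (epoch nanos)
    	StartTicks     uint64 // tick counter value at chunk start
    	TicksPerSecond uint64 // tick frequency
    }

    // WallClock converts an event's tick value into wall-clock time using only
    // information carried in the chunk itself.
    func WallClock(h ChunkHeader, eventTicks uint64) time.Time {
    	deltaTicks := eventTicks - h.StartTicks
    	deltaNanos := int64(float64(deltaTicks) / float64(h.TicksPerSecond) * 1e9)
    	return time.Unix(0, h.StartNanos+deltaNanos)
    }

Re-anchoring like this at every chunk boundary is also what limits the slow drift between the tick counter and system time that comes up later in the discussion.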
J: To get even better — well, let's call it compression — we also store not the full ticks in the chunk, but just the delta of the ticks relative to the chunk start ticks. And almost all the integer numeric values are LEB128-compressed, or encoded.

J: By default it is used, as I said, for almost all integer numeric types, but it can be turned off by a flag, and this flag is also written in the chunk header — so you can have a chunk where you don't have this compression, and you can decide what makes sense.

J: What we observed is that this LEB128, or varint, compression or encoding was not that great for large numbers with high entropy. When we tried to use it for IDs, basically everything was encoded in nine bytes instead of eight, so it actually kind of grew in size.
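A minimal sketch of unsigned LEB128 (varint) encoding, the scheme being described: 7 bits of payload per byte, with the high bit set on every byte except the last. Small values such as tick deltas fit in one or two bytes, while a high-entropy 64-bit value needs more than the plain eight bytes, which is why it grew for random IDs:

    package leb128

    // AppendULEB128 appends the unsigned LEB128 encoding of v to dst.
    func AppendULEB128(dst []byte, v uint64) []byte {
    	for {
    		b := byte(v & 0x7f)
    		v >>= 7
    		if v != 0 {
    			dst = append(dst, b|0x80) // more bytes follow
    			continue
    		}
    		return append(dst, b) // final byte, high bit clear
    	}
    }

For example, AppendULEB128(nil, 300) produces the two bytes 0xAC 0x02, while a random 64-bit ID typically takes nine or more bytes.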
J: But it's really good for small values — when you have the deltas, or small values in general, it's pretty nice. So yep, this was a very, very brief introduction. There is a blog post by Gunnar Morling, who went in and reverse-engineered the full JFR file format, so if you want to go there you can take a look — it has all the offsets and the meanings of the values at all the offsets.

K: I just had a comment — hi, my name is Stefan, by the way, first time joining here. I used to work at Oracle, on the JRockit components where this was implemented. As you can see, it's very optimized for writing fast from many threads, and that's because everything is an event and we collect a lot of events — it's not just CPU or method samples, it's everything from GCs happening to lock contention, etc.
B: Nice, thanks — thanks for adding that. We got a pretty good, similar rundown on pprof before, so it's nice to have one for JFR as well. Alexey, you had something you wanted to add?

I: I have a question on licensing: are there any licensing aspects of using the JFR format? That's one. And the second dimension is timestamps: I wonder if you use timestamps between profiles and how you deal with that — I assume we cannot assume that time is synchronized between hosts; it can be off.
J: Well, first, licensing: the source for the part which writes the recordings is GPLv2 with Classpath Exception — it's OpenJDK, so it's open source.

J: There is no format specification, there is no patent on the format, there is no copyright, anything — so basically it's up for the taking; Oracle didn't spend any effort on protecting this.

K: There is a parser in Mission Control, which has been open-sourced as part of the OpenJDK project. There's also a parser built into the JDK as well — less performant than the one in Mission Control, but it supports the file format. So there is a Java API you can use to read the files.

J: And yeah, timestamps: JFR does not deal with time synchronization across hosts, so that should be done by the infrastructure. The timestamps are always valid within the chunk.

J: As I said, at the start of the chunk we capture the ticks and the epoch millis, and we base everything in the chunk on that — all the timestamps are derived from it. This is also done in order to fix the time skew: since with the RDTSC timer the ticks move slightly faster or slower than the system time, you get out of sync after a while, so you need to re-sync between the ticks and the epoch millis.

J: So we do it at the chunk boundary.
D: First of all, thanks a lot for the overview — it was super helpful to get an idea. I think an interesting parallel to point out between what we've been doing and what JFR is doing: JFR deduplicates on a per-chunk basis, if I understand correctly — yes — and that's somewhat similar to deduplicating in the manner that we do, except that, because ours is a network protocol, there are no chunks as such — it's just a stream of messages. So I think that was a really helpful thing to learn.
B: Cool, yeah. I don't know if you have any thoughts on this, but as we think about a standardized format: it sounds like JFR has a lot of stuff built into it already, and one of the things we said is one of our goals is being able to map existing formats to a different format.

B: I'm curious if you have any thoughts on the viability of JFR being somewhat flexible in that way — you know, if we decided on a format that's not JFR. I guess it obviously depends on how different it is, but I'm curious if you have any thoughts on that.
J: Well, there is one thing right now which JFR does not support out of the box, which, for example, pprof supports: labels. But speaking just format-wise, it is possible to support it — JFR has the concept of arrays, or sequences. Right now the stack trace — or the profile event, as we call it — doesn't have anything else associated with it, but we could associate a sequence of labels with that.

J: And I don't know — I was thinking about it, and I cannot come up with anything which would be impossible in JFR, or not possible with small changes to the format. I know there were more powerful features in the type modeling in the past; they are not used now, but the format — and the parser, at least in JMC — is kind of ready for that. So, judging from that, it had this in mind. But yeah, it would have to be...

B: All right, thanks. Felix, you had something you wanted to add?
C: Yeah, I have another thought on that, because it was mentioned that the design goal of JFR is essentially to write away stupid amounts of data with very low overhead.

C: I think the idea of trying to convert JFR files on the client side, before sending them somewhere, defeats that point, because you're going to spend a lot of cycles undoing the smart things that have been done, and you'll get some poor results out of that. And I think that's a long-term concern for getting detailed runtime data, not just for Java: Go essentially has a similar problem with pprof, where you get pprofs out of the runtime and that's not under your control.

C: Parsing pprof, however, is not as much of a problem because it's aggregated, so maybe that's not a fair point — but the future of Go runtime observability may include runtime tracing similar to JFR. There's a Go enterprise advisory board where they share a little bit of what they're planning to do for the Go runtime, and the words "Go Flight Recorder" have been shown on the slides, so there's a chance that the Go runtime will move to a very similar architecture.

C: Yeah, that's fair, but I think that, ideally, a standard would allow a side channel for including raw data, for people who want to do that. Also, there's always going to be data in JFR that might not be supported in whatever OTel standard is going to be cooked up, because it's a very rich data format — but I could be wrong.
K: Just one comment: it's optimized for writing a lot of events from multiple threads, but that doesn't preclude you from having a parser thread as part of sending it out, right? The key is avoiding the hot path. Sure, you don't want to waste a lot of CPU cycles in a separate thread either, but it's about making sure that you stay out of the critical path. If, for example, one event is lock contention, it's going to be timing, for every thread, taking a lock, plus an address for that lock, from multiple threads, and then writing that information down — and you need that to be much faster than whatever lock contention you're having, so you really don't want to mess with things there in the critical path. And it's one thing to do the measurement; it's another thing to transform that data into something else.

B: Yeah, well, that's part of why we're here. Alexey, you had your hand up too.
I: Yeah, well, I think it's kind of obvious, but I think the path of having a kind of side channel and then including in the format the raw data, whatever the runtime produces — I think that's also a slippery slope, because it removes the advantage of being able to build shared infrastructure, like profile-processing code that can be shared.

I: I would view the format we define more as the core, and then, of course, it may be extended to a certain extent with labels or attributes or whatever things we define — rather than just saying: well, you've got the side channel and you can put in a pprof proto, a JFR, or Go Flight Recorder formats.

D: I mean, both JFR and our format are essentially sequences of events with some amount of deduplication for groups of events, and it seems to me like we should be able to come up with something that resembles a protobuf, or a similar specification, for events, that is extensible for future events in a similar way that JFR is — at which point...
J: I guess it should be pretty possible to define the wire format so that you can actually describe the types — that's what JFR does as well. The thing which describes the types is itself just an event, so you would define a particular event type that defines the events.

D: This is an interesting question I have: is fully self-describing a necessary and desirable feature? Clearly, my bias is from the Google experience.

D: If you've got whatever protobuf specification you're using under some form of version control, and have it public, the benefits of self-describing are not immediately clear to me — but I think reasonable people can disagree on this as well.
J: Yeah, the thing is: maybe for only profiling information — you get samples of allocations, of CPU, of locks — maybe it's not that crucial. But in JFR, and in the JVM as we're used to living now, you can have any number of events: you can create events for your application, you can create events for whatever subsystem you have — but you need to describe them somehow.

D: But what I'm saying is that protobufs are future-extensible, in the sense that you can add new message types to an existing protobuf specification, and old parsers will just keep on parsing and skip the new message types, whereas new parsers will then have their message types. The architectural difference is that the JFR format essentially keeps the protobuf-like specification inside of the data it sends out, whereas the protobuf approach is to keep that separate.
J: Again, it depends. For the custom events, anybody can come in — you might have events for Apache, for Tomcat, for Spring, for Akka, for whatnot — and if you want to have a protobuf distributed for each new event, then, I don't know, it's getting pretty difficult.

J: You can create events for applications — you can do a kind of structured logging — but this is probably not profiling; that's what I was saying. This goes beyond profiling, and that's where we would need it — for profiling, where you would have, like...

D: I'm going to make myself unpopular — in the sense of unpopular to vendors — now: one could argue that if there's a public protobuf spec for a profiling format, that more or less forces every vendor, if they want to extend profiling, to send a pull request with the message type. So the absence of self-describing may be a hammer to bludgeon people with. But...
B: Then let's wrap up with a couple of comments — go for it.

K: Yeah, just a quick comment: the other thing to think about here is that JFR was built at a time when there wasn't much other profiling — it was like: okay, we want all of these dynamic events, so we wanted people to build events in there. This is going to be part of OpenTelemetry, so some of the events might be...

K: This is going to be part of tracing, or log events, etc., where some of these might fit — so there might be a way to think about it as more of a locked-down format. I think the other part, probably for a later meeting, is just thinking about: okay, is it a sort of micro-batching type of thing that needs to be done just to keep performance, or is it the tracing type of thing, where every event...

K: Oh yeah, you can batch them out, but every event happens by itself, right? But here it might be low-level enough that it requires some type of aggregating and batching — which is the columnar storage, or the messages, you've been talking about — and that probably is required. So it might be worthwhile thinking about the whole OpenTelemetry ecosystem — what fits where — and that might be one way of not having to be self-described.
B: All right, awesome — good conversation, thanks everybody for the thoughts. We've got about one minute left, so yeah. One thing we talked about last week that we haven't really made progress on: regardless of which route we choose with a lot of these things, I think a key piece of data that we're missing for all of these formats is some sort of benchmarking, and we kind of agreed on that last week. We don't have a lot of time here, but we talked about it internally at Pyroscope and we came up with a proposal for how we might be able to set up some sort of process for benchmarking these different formats.

B: So we can follow up on that — I'll paste it in the Slack and we can follow up offline, and perhaps talk about it next week, or hear if others have other ideas on how we can do benchmarking. Then, outside of that, just some other points: I'm trying to reach out to some more people from the JFR community, and people like the Go language maintainers, to get their perspective on all this as well.
B: I think it's worth chatting about potentially moving these meetings to every two weeks in the future, but I would just propose that as a discussion topic, maybe for next week. Does anybody else have anything very, very quick that they want to add before we wrap up for the day?

K: What we talked about was: what are the data that we expect to flow here? Because, coming from our background, it's everything; coming from Pyroscope or from others, it's probably more like method samples — and the truth might be somewhere in the middle. So it might be good to start filling out that list, so we know what needs to be supported by it.