Cloud Native Computing Foundation TAG Observability, 18 Oct 2022

From YouTube: 2022-10-18 CNCF TAG Observability Meeting

Description

Yuri Shkuro - Talk on the 6 pillars of observability data
* https://medium.com/@YuriShkuro/temple-six-pillars-of-observability-4ac3e3deb402
* https://research.facebook.com/publications/positional-paper-schema-first-application-telemetry

Meeting notes and more: https://github.com/cncf/tag-observability

A

Alolita Sharma: Hi, Daniel, good morning!

A

Daniel Golant: Morning!

A

Alolita Sharma: How are you? I think it's the early morning, so let's wait for some folks to join. Yuri did confirm that he was going to join in to talk about some of his thoughts on observability and OTel, so it should be pretty interesting.

B

Daniel Golant: Just got back from a vacation.

A

Alolita Sharma: Oh, good for you! Hope you had fun.

B

Daniel Golant: Well, I'm unemployed, so maybe not so good for me, but it was scheduled. I had to go and do it.

A

Alolita Sharma: Are you looking to do something in observability?

A

Daniel Golant: You know, out here, I'm in New York, and I think the gravity is just towards fintech, which is where I've been for the last few years. I've always been definitely more on the user end in this group. So really, I'm just talking to fintechs and larger companies where I can do this kind of work. But, you know, I don't have the expertise to actually participate, really.

A

Alolita Sharma: No, no. I think there are a lot of areas where end-user engineers end up building out the usability of many of these core tools, right? So it's not only building out the core features, it's also looking at, overall, how do you actually scale, how do you configure easily, and many of the other areas. So again, I'll ping you on chat, and let's chat about it.

B

Daniel Golant: Sure, sure.

A

Alolita Sharma: Hi, Henrik, how are you? Good! Let me just pull up our doc. As I was mentioning to Daniel, we have Yuri Shkuro joining in today to share some of his thoughts on observability. As you know, he's been talking a fair bit about the six pillars of telemetry data, so that should be interesting.

A

Alolita Sharma: Let me go and ping him.

C

Alolita Sharma: Hi, Matt.

A

Alolita Sharma: I just pinged Yuri, so hopefully he'll join shortly. I've shared the link to the doc. Let me just share our screen. Okay, so we have the agenda.

A

Alolita Sharma: We have KubeCon coming up next week, so I think many of us are going to be at KubeCon, and it should be fun. For those of you who are in the area, it's in and around Detroit. I just found out, you know, you kind of never think about it, but Niagara Falls, I guess, or Canada, is right across, you know, walking distance or a drive over. Of course, there is a lake in the middle. So there are some folks who are planning to take a day trip to Canada and back.

A

Alolita Sharma: Oh, it's gone, this week.

A

Alolita Sharma: There are lots of observability activities going on at KubeCon, so again, it will be great to see the larger group of participants. I think there are a lot of activities around OpenTelemetry, where typically I'm pretty busy or active.

A

Alolita Sharma: We have Yuri, wonderful! Hi, Yuri, good morning!

D

Matt Young: Hey! Welcome! Oh, on the previous topic, really fast: wasn't there a poll put out? I've been a little out of touch. I've started a new role, so I've been a little dark for a month while I'm doing some onboarding. But I do remember there was a dinner, an observability TAG dinner. Did you decide?

C

Henrik Rexed: Yeah, the first place didn't take any bookings, so we picked another place. It's in the Slack channel. I can send you the place, but it's going to happen. As of now, we have booked a table for thirteen people, so if you want to come and we have to adjust the booking, let us know so we can adjust it. The dinner will happen on Wednesday of the week of KubeCon, and we will meet at 7 pm. I will send the address on the chat, so you have the address in case you want to join us.

A

Alolita Sharma: Awesome, thank you, Henrik. I just noted it in our agenda doc, but Henrik, feel free to post any details for the location, and folks can ping you, whatever works.

A

Alolita Sharma: All right. Hold on.

A

Alolita Sharma: Okay, I think we can probably get started. Yuri, would you like to get started? Again, with great pleasure. Let me just turn off our sharing.

E

Yuri Shkuro: Yeah, I have slides that I can just share, instead of just, I guess...

A

Alolita Sharma: Yeah, that'd be awesome. Sure. But again, for those of you who don't know Yuri, he is one of the core maintainers of Jaeger. He's a subject matter expert in open source observability and has been involved in observability for a long time. He co-founded OpenTracing as a project and is also a core contributor and a co-founder of OpenTelemetry. So again, very happy to have him here today, and he will be chatting and sharing some thoughts about observability and the six pillars of telemetry. Welcome, Yuri.

E

Yuri Shkuro: Thank you. So yeah, my name is Yuri Shkuro. I'm an engineer at Meta. My primary focus is observability platforms and products for internal consumption. As was already mentioned, I work with the Jaeger and OpenTelemetry projects in open source, and I have a book on tracing, which is now three years old.

E

Yuri Shkuro: So today, or maybe a couple of months ago, before profiles came into OpenTelemetry, it was essentially three pillars: metrics, logs, and traces. There was some noise about other stuff, like events, but there wasn't really anything official. Then there's also another term in the industry that you may have seen, MELT, which adds events to the picture. Events is an exceptionally bad name, but I'll go over that. And then, with some of the things that are starting in OpenTelemetry, we're adding profiles. Events are being discussed too, as to what exactly it means to support events: we have span events, and we have some other events, potentially, in the logging space. When I was discussing this internally at Meta, one of my colleagues came up with this "temple" acronym, and then I said, well, what if we do the full word, TEMPLE, because the one signal that was missing in all the discussions is exceptions. So I wrote a blog post about it, going through all of them, and that's what I will go through quickly here. The ordering of the letters in the word doesn't mean anything. It's really because the word is nice; it's TEMPLE, but it doesn't imply any sort of priority or anything.

E

Yuri Shkuro: So I will actually go in a more traditional order through the signals, starting with metrics. Metrics, as many of you know, are numerical observations that are highly aggregatable. We do support dimensions on them, but we often drop those dimensions from the raw events, and aggregation allows us to drastically reduce the amount of data that we have to store and, at the same time, provide much longer retention as a result, because we can keep aggregating the data even after it is collected, by compressing it into a lower granularity time-wise, and you can aggregate in the dimension space as well.

E

Yuri Shkuro: Mostly, though, when we talk about metrics in the OpenTelemetry space, we're talking about operational metrics and not so much about business metrics. In fact, business metrics are more usually collected in the form of structured logs rather than as the traditional time series.

E

Yuri Shkuro: One thing that metrics are great for is monitoring, because they're highly accurate. They don't lose precision with aggregation, well, when it's done correctly. You can, of course, average averages, but if you don't do that, then you get very good numbers. But they're generally considered to be fairly bad for troubleshooting, because with aggregation you lose dimensions. And so you know there is a problem, but you don't know where and why.
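A minimal sketch of the "when done correctly" caveat above, assuming hypothetical per-host latency counters (the host names and numbers are made up for illustration). Averaging per-host averages weights every host equally, while merging sums and counts first keeps the aggregate exact, which is one reason pre-aggregated metrics usually carry both.

```python
# Hypothetical per-host request stats for one time bucket:
# host -> (request_count, total_latency_ms). Numbers are illustrative only.
per_host = {
    "host-a": (1000, 50_000.0),  # 50 ms average over many requests
    "host-b": (10, 5_000.0),     # 500 ms average over very few requests
}


def average_of_averages(stats):
    """Lossy: every host gets the same weight regardless of traffic."""
    means = [total / count for count, total in stats.values()]
    return sum(means) / len(means)


def merged_average(stats):
    """Accurate: merge the sums and counts first, divide once at the end."""
    total_count = sum(count for count, _ in stats.values())
    total_latency = sum(total for _, total in stats.values())
    return total_latency / total_count


print(average_of_averages(per_host))       # 275.0
print(round(merged_average(per_host), 2))  # 54.46
```

The same merge-then-divide rule applies when rolling fine-grained time buckets up into coarser ones for long retention.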

E

Yuri Shkuro: Logs are the very classic way of troubleshooting systems, and there are several categories of logs. Unstructured logs are the classic printf-style, free-form text logs. Then there are semi-structured logs, where, I think, the timestamp or the log severity can be isolated into a separate field of the log, or sometimes you can even give an API to the users where you say, just log random events in a semi-structured form, where you give names to the dimensions, right? So you can have region as a dimension, or you can have a customer ID as a dimension. That makes it semi-structured. I still separate that from fully schematized, fully structured logs, where you actually go with a schema-first approach, which is much more prevalent in business analytics, where you define the schema upfront, because it's really not just about yourself as the producer of the log, but much more about the consumers of those logs and how it affects the consumers.

E

Yuri Shkuro: One super nice feature of logs is that they are very local to the specific resource emitting those logs, and so they are very easy to charge for, from a capacity management point of view, if you like. If you have a service, you can go to that service and say, this is how much log volume you're consuming, right? Which is not as easy with some other telemetry types.
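A small sketch of the semi-structured style described above, where the caller gives names to dimensions such as region or customer ID. The `log_event` helper and its field names are hypothetical, not an API from any of the projects mentioned.

```python
import json
import time


def log_event(name, severity="INFO", **dimensions):
    """Emit one log record as a JSON line with named dimensions."""
    record = {"ts": time.time(), "severity": severity, "event": name, **dimensions}
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line


line = log_event("order_rejected", severity="WARN", region="us-east", customer_id="c42")
parsed = json.loads(line)
print(parsed["region"])  # us-east
```

Nothing here constrains which dimensions a caller may invent, which is the gap between this style and fully schematized logs.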

E

Yuri Shkuro: But at the same time, logs are generally very expensive, and so you give these knobs to people to say, you can do various severities and various retentions and all that. One other thing: because they're localized, they're hard to correlate across the architecture, or even within a node, right? When you have, I don't know, one thousand QPS on a service, all your logs come as a single pile, like a stream of consciousness, and it's very difficult to make sense of them without introducing some other mechanism, which really comes from tracing.

E

Yuri Shkuro: You can think of tracing as a special form of logs: structured and request-scoped logs. Or, more generically, I prefer to call a trace a workflow-centric log, because "request" narrows you more to the RPC space, whereas "workflow" opens up other avenues, where you have, I don't know, a CI/CD pipeline, which isn't really working on RPC, or messaging and data pipelines, where the workflows may be well defined but they're not RPC-based.

E

Yuri Shkuro: The critical thing that separates traces from other telemetry types is that they capture causality, in the form of a directed acyclic graph between the events that constitute the given trace.
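To make the causality point concrete, here is a minimal sketch assuming a hypothetical span record with a parent pointer (a simplification, not the OpenTelemetry data model). The parent links are the edges of the directed acyclic graph, and walking them recovers the causal chain behind any span.

```python
from collections import defaultdict

# Hypothetical spans of one trace: (span_id, parent_id, service, start_ms, end_ms).
spans = [
    ("s1", None, "frontend", 0, 120),
    ("s2", "s1", "checkout", 10, 90),
    ("s3", "s2", "payments", 20, 70),
    ("s4", "s1", "ads", 15, 40),
]


def build_children(spans):
    """Index the causal edges: parent span id -> list of child span ids."""
    children = defaultdict(list)
    for span_id, parent_id, *_ in spans:
        if parent_id is not None:
            children[parent_id].append(span_id)
    return children


def causal_chain(spans, leaf_id):
    """Walk parent links from a span back to the root of the trace."""
    parent_of = {span_id: parent_id for span_id, parent_id, *_ in spans}
    chain = [leaf_id]
    while parent_of[chain[-1]] is not None:
        chain.append(parent_of[chain[-1]])
    return list(reversed(chain))


children = build_children(spans)
print(dict(children))             # {'s1': ['s2', 's4'], 's2': ['s3']}
print(causal_chain(spans, "s3"))  # ['s1', 's2', 's3']
```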

E

Yuri Shkuro: In contrast with logs, traces are distributed, and as a result they are actually pretty difficult to apportion to specific services in terms of capacity usage, because the value of tracing comes from the fact that traces span a lot of different components. So who do you bill for that? Is it the first service, the one that started the trace and made the sampling decision? Or do you bill the service which says, oh, I'm going to add a hundred spans to this trace, all my internal spans, and that affects everyone else, right? I have not seen a good solution yet in the industry to this problem of how you properly deal with internal billing. I'm not talking about vendor billing.

E

Yuri Shkuro: One unique feature that traces provide is in terms of monitoring. People don't often think of traces as monitoring, but they do give you end-to-end monitoring capabilities, which are just impossible with other telemetry types. A simple example is a message delivery end-to-end timeline, right? Because you have to collect those markers at different points in the architecture. All other telemetry types are very localized, and traces are unique in that sense. But there are a lot of other kinds of use cases for traces that are not fully explored in the industry today. Root cause isolation is probably the most frequent one people think of, but some of the big companies that I know very effectively utilize traces for resource attribution, for product line attribution, something that's again very difficult with other telemetry types.

E

Yuri Shkuro: Now, I mentioned that "events" is a super bad name, because technically all telemetry starts with events. So when we talk about events as a distinct telemetry type, we're really talking about change events primarily, although you could extend that notion to some other stuff, like, I don't know, weather events, or a big football game in town that might cause a traffic spike, something like that, which might be interesting from the operational perspective.

E

Yuri Shkuro: But change events are definitely the most important category here, because they are very often responsible for over fifty percent of outages in many organizations. By change events, I mean code deployments, configuration changes, maybe routing configuration changes. Even something like autoscaling could be considered a change event as well, because those are not that frequent, although that starts to blur the boundary.

E

Yuri Shkuro: In terms of shape, events are nothing but structured logs, right? So a reasonable question is, why do we consider them a separate telemetry type? My reason for considering them separately is that they actually have very different requirements. One of them is that events have very strong identity. Meaning that if your service starts to throw an error with, I don't know, some specific error message, it's probably going to do it thousands of times. So if you lose five of those, no one cares. You still get a strong signal that there is a problem of this type, and then you can go and localize it. Whereas with events, if you did a deployment of a specific code commit and you lose that event on the way to the telemetry platform, then you're in a bad situation. You may not be able to troubleshoot your outage for much longer than if you didn't lose that event, right? So there is a much higher reliability requirement.

E

Yuri Shkuro: On the other side, when we look at logs, we're not looking for specific log instances. Usually we're looking at a more aggregate view of them, whereas with events we very often look into a very specific instance of an event as part of the troubleshooting. As a result, they also tend to be much lower volume than logs. But that's not always the case, though. It depends, again, on how you characterize what an event is.

E

Yuri Shkuro: Now, profiles are coming into OpenTelemetry, and that's great to see. They had a bit of a hard time even just describing what a profile is in OTel. I think the current definition is, I call it, "you know it when you see it", really. I mean, there is a definition, but it's still pretty vague. In my personal experience, I've noticed that profiles usually get much lower usage than other telemetry, because they're kind of a power user tool. Most engineers do come across profiling tools, and some never need them. Sometimes you do have a performance issue and you want to look at that, but it's not something that you do every day, unless you are a dedicated performance engineer whose job is to go across multiple systems and do this type of investigation.

E

Yuri Shkuro: One unique feature of profiles is that typically no instrumentation is required to collect them, at least in the way that we think about OpenTelemetry instrumentation, whether automatic or manual. The collection framework for profiling is usually integrated with the runtime itself, so you get it out of the box. Profiles tend to generate much larger volumes, because a single profile can be pretty large if you capture a bunch of stuff. The one thing that I think people don't often realize is that profiles are actually very well aggregatable, and that is a huge power when you do continuous, real-time profiling, or what's it called? What's the name?

E

Yuri Shkuro: Always-on profiling, right, in production. So it's not every second, but you consistently take profiles from production. Those things can be aggregated and give you a lot of useful information about the overall impact of different things. I've worked with organizations where they are using this to attribute these things even to a pull request. When you have a pull request, they say, oh, you're adding this change in this function, and this changes your fleet-wide CPU consumption by this many percent, right? That's a very awesome power of profiles: you can immediately see the performance degradation even at the authoring step.
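A rough sketch of why sampled profiles aggregate well, assuming hypothetical stack samples gathered from two hosts (the function names and counts are invented). Merging samples is just adding counters, and the merged table yields a fleet-wide share for any function.

```python
from collections import Counter

# Hypothetical sampled stacks, one tuple of frames (root first) per sample,
# collected from different hosts. Function names are made up.
samples_by_host = {
    "host-a": [("main", "handle", "parse"), ("main", "handle", "render")],
    "host-b": [("main", "handle", "parse"), ("main", "gc")],
}


def aggregate(samples_by_host):
    """Merge samples from all hosts into one stack -> sample count table."""
    merged = Counter()
    for stacks in samples_by_host.values():
        merged.update(stacks)
    return merged


def inclusive_share(merged, function):
    """Fraction of all samples in which `function` appears anywhere on the stack."""
    total = sum(merged.values())
    hits = sum(n for stack, n in merged.items() if function in stack)
    return hits / total


merged = aggregate(samples_by_host)
print(inclusive_share(merged, "parse"))   # 0.5
print(inclusive_share(merged, "handle"))  # 0.75
```

Comparing such shares before and after a change is one plausible way to estimate its fleet-wide cost; the talk does not describe how any particular organization implements this.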

E

Yuri Shkuro: But I think this is not a prevalent experience, at least from what I've seen. And finally, exceptions. That's the one that I think is completely missing from the OpenTelemetry discussions today. On one hand, again, the boundaries between the telemetry types are kind of blurred, and you can treat exceptions as a form of super-structured logs. They technically are defined in the OpenTelemetry protobuf format today, in some way. But we didn't pay much attention to the processing, and specifically to the SDK impact and collection.

E

Yuri Shkuro: The first time I ran across Sentry in production, Sentry was an open source exception-capturing thing, I was just blown away by how much information it gave me. I got a ticket from some other team saying, we're seeing this problem from the Jaeger SDK's recent release in Python, and instead of a stack trace, they gave me a link to Sentry. It took me like one minute to identify the root cause, because I was able to go and see, for every frame of the stack in the exception, what my local variables were. So I could reason really well about what was going on in the application, of course overlaying it with the source code. As a debugging experience, I don't know of any other telemetry type that allows you that kind of debugging, and this is very close to actually launching a debugger and running with a breakpoint, right?

E

Yuri Shkuro: So that's why I think that exceptions deserve a special letter in the acronym. They're also aggregatable. When you do have a well-established exception processing pipeline, it's very common to look at aggregates of those, saying, oh, I'm seeing a new type of exception suddenly popping up, as a time series, right? And the way those pipelines work is very special. They look at the stack traces. They sometimes do all kinds of clever things about fingerprinting them, maybe collapsing some of the stack frames that are not interesting, so that you can identify unique but also common patterns in those stack frames, and then they do this sort of grouping analysis and show them in aggregate. As a result, they also tend to have custom UIs, and I should have added custom SDKs, because again, the way that Sentry and its Raven SDKs are able to collect this information about an exception requires a very special SDK to be integrated into the application.
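A simplified sketch of the fingerprinting idea, under the assumption that frames are plain strings and that "uninteresting" frames can be recognized by a prefix. This is an illustration only, not how Sentry or any other product actually computes its groupings.

```python
import hashlib
from collections import Counter

# Frames considered uninteresting for grouping; a hypothetical choice.
NOISE_PREFIXES = ("threading.", "asyncio.")


def fingerprint(exc_type, frames):
    """Hash the exception type plus the stack frames that remain after
    dropping noise frames, so equivalent failures group together."""
    kept = [f for f in frames if not f.startswith(NOISE_PREFIXES)]
    key = "|".join([exc_type, *kept])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


# Hypothetical captured exceptions: (type, frames from outermost to innermost).
events = [
    ("KeyError", ["app.handle", "app.lookup"]),
    ("KeyError", ["threading.run", "app.handle", "app.lookup"]),
    ("ValueError", ["app.handle", "app.parse"]),
]

groups = Counter(fingerprint(t, frames) for t, frames in events)
print(len(groups))              # 2
print(sorted(groups.values()))  # [1, 2]
```

Counting fingerprints per time bucket would give the "new type of exception popping up" time series mentioned above.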

E

Yuri Shkuro: So that's basically all I have. As a conclusion, my point in this blog was that there are really more than the three pillars that people typically talk about. TEMPLE, maybe it will get adopted as a term, because I think it's awesome. The boundaries between things, as I mentioned, are pretty diffuse. You've seen that a lot of stuff can be classified as an event or as a log, but with all kinds of caveats. And, of course, these are just the data types that we're talking about, right? We're not talking about the actual observability solution that still has to come afterwards to aggregate all this stuff.

E

Yuri Shkuro: One other thing I want to mention is that I have a talk next week about another thing: we published a paper on schema-first application telemetry. I mentioned fully schematized, structured logs. We are taking this approach from the logs to all the telemetry that we produce, and I will be talking about that. And that's a wrap.

A

Alolita Sharma: Awesome, awesome. This is very, very cool. In fact, it's very valuable to have a more precise and focused discussion around telemetry data and the types themselves, because typically it gets munged into a much larger context. So thank you so much. Matt, did you have questions?

D

Matt Young: More a humble request, if there's time, or if you're able to today. Could you provide the next layer of detail on the paper you mentioned around schema-first stuff? We have, as part of the TAG, a Linux Foundation internship going on right now, about a month in, to generate ontologies for Kubernetes and a couple of other ancillary workloads on top of Kubernetes, in the service mesh space. It's a collaboration between the Network TAG and the Observability TAG, so I'm kind of curious if you could give us a little overview, if there's time and if this is the right space.

E

Yuri Shkuro: You mean now?

D

Matt Young: Yeah, but I don't want to put you on the spot.

E

Yuri Shkuro: I can, actually. I'm preparing for this talk as well, so I do have that. At a high level, what we are proposing is essentially a schema-first approach, which is the opposite of a code-first approach. Most telemetry today is produced code-first, where I just write some attributes to an SDK, and that's my source of truth about the shape of the telemetry that I'm emitting, right? And that provides absolutely no metadata about what that telemetry means to the consumers. It has no safety in terms of, if you change it, are you going to break your consumers? It doesn't give you any information about, well, I'm writing a number as a latency, so what are the units of that number, right? Those are very common problems that stem from the lack of metadata about the telemetry. And schema-first is not the only approach to metadata. This is the thing that we discuss in the paper, that there are other approaches.
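A minimal sketch of the schema-first idea, assuming a hypothetical `Field` descriptor and a hand-rolled validator; none of these names come from the paper or from OpenTelemetry. The schema is declared before any code emits data, so units and descriptions exist as metadata and a misspelled or missing field is caught instead of silently reaching consumers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    kind: type
    unit: str = ""         # e.g. "ms"; empty when the field has no unit
    description: str = ""


# Hypothetical schema for one telemetry event, declared before any code emits it.
REQUEST_SCHEMA = {
    f.name: f
    for f in (
        Field("region", str, description="Deployment region"),
        Field("latency", float, unit="ms", description="Server-side latency"),
    )
}


def validate(event, schema):
    """Return a list of problems; an empty list means the event conforms."""
    problems = [f"unknown field: {k}" for k in event if k not in schema]
    for name, field in schema.items():
        if name not in event:
            problems.append(f"missing field: {name}")
        elif not isinstance(event[name], field.kind):
            problems.append(f"wrong type for {name}")
    return problems


print(validate({"region": "us-east", "latency": 12.5}, REQUEST_SCHEMA))  # []
print(validate({"region": "us-east", "latancy": 12.5}, REQUEST_SCHEMA))
# ['unknown field: latancy', 'missing field: latency']
print(REQUEST_SCHEMA["latency"].unit)  # ms
```

In a code-first setup the typo `latancy` would simply become a new attribute; here the producer finds out before the consumer does.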

E

Yuri Shkuro: Maybe... Do you see that?

A

Alolita Sharma: Not yet.

E

Yuri Shkuro: We contrast various other approaches in the industry by how well they fit our goals for knowing the metadata about telemetry, right? Some of them are semantic conventions in OpenTelemetry, or OpenTelemetry schemas. There is also the vendor approach, where they just automatically enrich the telemetry that you collect with infrastructure dimensions, which, actually, you can see is pretty green across the board, except that it just doesn't support certain things at all, like any custom dimensions. You can't do custom metadata. So this is basically the gist of the paper. We go through what our approach to schema-first is.

E

Yuri Shkuro: At Meta there is a very important aspect, a cultural change that occurred already several years ago, I think with all the privacy and other big data requirements, where we said that as we produce all of the warehouse data, we really have to start with schema first, right? So it's very established already. It did not apply to application telemetry, and that's what we're introducing. We're saying the same schema-first approach works for application telemetry as well, and in general it works much better. You can see here it's mostly green across our evaluation criteria, at the slight expense of the developer experience. But we already have a bunch of tools in that area that actually make the change in developer experience not too bad, and sometimes we even treat it as valuable: the fact that you have to stop and think about the schema that you're producing, rather than just writing whatever you want. Because, again, data is about consumption, not so much about production. A very typical view of telemetry is, oh, I'm just going to throw stuff in, and then some cool telemetry platform is supposed to make sense of it and give me great solutions to investigate outages. And it doesn't work this way.

A

Alolita Sharma: Yeah, well said, Yuri. I think taking a user's standpoint is super important for all producers.

A

Daniel Golant: Oh, pretty cool. Is the paper available? I see it online in some places, but there's a link where I think I might have to pay. Is it available publicly anywhere yet?

E

Yuri Shkuro: Yes, it's available at Facebook Research.

A

Alolita Sharma: Oh, cool.

A

Yuri Shkuro: So you can just go to the website. I can drop a link; I can look for it too. Let me Google it. The DOI link should work as well, I think.

C

Yuri Shkuro: Daniel shared it, on the research page.

E

Yuri Shkuro: Yep, that's it.

D

Matt Young: Very cool. Thanks.

D

Alolita Sharma: I can't wait to read this.

A

Alolita Sharma: Very cool. Yuri, again, thank you so much. It's really nice to have you joining in the TAG meetings. And again, I hope that, with all the activity that is ongoing in OTel right now, a lot more work can actually be done around thinking of how to also correlate across some of the data types that you talked about today. Typically, correlation is left up to the user, or up to the tooling that a service may provide. But, on the other hand, what are your thoughts about pre-aggregation, and correlating some of the data before it even hits an analysis service, right? Do you think that's something that's feasible? I mean, it's feasible for some kinds of data and at some scale. But again, as you look at profiling data, as you look at events, as you look at exceptions, do you see standard rules that can be applied there for correlation at the collection layer, before it even hits analysis?

E

Yuri shkuro: well,, I mean this kind of the whole uh point of of of our paper is that? uh, you can it?, it depends on how you produce telemetry.. You can do this through, like with semantic conventions,, which is a way uh it's.. It's a weaker way than we would like uh,, because it's just like doesn't need some of the other requirements that we have., but but yeah, that like, if, if all of your telemetry is compliant with the c ninety conventions, reliably, then um!. That gives you like a power to correlate them.

E

Yuri shkuro:, right., yuri shkuro:, but again uh,, well,, yeah, and and so like. The semantic conventions are sort of a hard thing to do, because they're, centralized. and so, anytime, you, you have something completely custom. um,.

E

Yuri shkuro:, I don't know., I mean if, if I'm in certain, uh let's say uh sort of like some sort of like a customer id in in my particularly limited data set., and I want to say, yeah,, that's the same field.. The would, in my other data, set.

E

Yuri shkuro: could you do it with semantic conventions? You can, but it becomes less clear who is responsible for the data governance of that, right? Because it's not going to be OpenTelemetry, because it's completely custom. So you kind of need to stand up your own organization, saying, like, this is my data governance for telemetry. And that problem is kind of unavoidable, because we also have that same problem with schema-first.

E

Yuri shkuro: But with schema-first, at least, we provide the mechanism for how to do that, and it actually can be decentralized.

E

Yuri shkuro: It doesn't have to be one single data governance organization, and it's automatically recognized: once you put it as metadata in the schema, then both our back-end platforms and our front-end tools automatically recognize it. So you don't have to do the correlation at the ingestion level, just because you've essentially labeled your data already with, oh, these are the columns that you can correlate on.

E

Yuri shkuro: It is really metadata.
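
To illustrate the idea of labeling columns with metadata so that tools can find correlation keys on their own, here is a minimal sketch in Python. It is not the system described in the talk: the `annotated` helper, the `semantic_type` metadata key, and both example schemas are hypothetical, and Python dataclasses stand in for Thrift-generated structs.

```python
from dataclasses import dataclass, field, fields


def annotated(semantic_type: str):
    """Hypothetical helper: attach a semantic type to a schema field as metadata."""
    return field(metadata={"semantic_type": semantic_type})


@dataclass
class CheckoutLog:
    # Two teams name the column differently but label it with the same semantic type.
    customer_id: str = annotated("customer_id")
    message: str = annotated("free_text")


@dataclass
class PaymentMetric:
    cust: str = annotated("customer_id")
    amount_cents: int = annotated("money_cents")


def semantic_index(schema) -> dict:
    """Map semantic type -> field name for one schema."""
    return {f.metadata["semantic_type"]: f.name for f in fields(schema)}


def join_keys(left, right) -> list:
    """Return (left_field, right_field) pairs that share a semantic type."""
    li, ri = semantic_index(left), semantic_index(right)
    return sorted((li[t], ri[t]) for t in li.keys() & ri.keys())


print(join_keys(CheckoutLog, PaymentMetric))
```

Because the label travels with the schema, a back end or a UI could discover that `customer_id` and `cust` are joinable without any correlation step at ingestion time.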

B

Daniel golant: Yeah, I just want to clarify, because I think I'm getting what this, I'm forgetting the name, schema-first application telemetry means. But are you saying, you know, an approach, basically, where the event described, like the

B

Daniel golant: the interface through which you log at the high level, what we'd call "everything is an event", is unified, and then the event that you're logging, that you're sending out, itself maps to a schema which describes the way you store and process that data? Is that what you're getting at, basically, here?

B

Yuri shkuro: Yes and no. The sort of unified aspect, I would downplay that, because that sounds a bit like a boil-the-ocean thing that we never tried to do.

E

Yuri shkuro: So we do provide an incremental change to the APIs for telemetry, but we do not try to consolidate them. So a metric API will remain a metric API, right? And tracing is different. But we build common building blocks into those APIs, such that when you define a schema for your telemetry, like in Protobuf, or in our case in Thrift, then you get an auto-generated struct that you populate, which gives you all kinds of nice things, like

E

Yuri shkuro: safety, like verification, and all of that. And so

E

Yuri shkuro: that's what you populate. But then it's still up to the specific telemetry SDK what to do with that struct. So, like with metrics: our metric system, interestingly, is different from the way that most metrics systems exist today, in that it is more like a table than like a time series. So,

E

Yuri shkuro: in other words, if you think about your RED metrics: requests, errors, duration, right? Very often you see those as

E

Yuri shkuro: three independent time series coming out of the application, even though they have the exact same shape of the dimensions tied to them, right? Because they essentially describe the same business process, just different measurements of that business process. So what we're trying to do with our metric back end is to say, yeah, well, let's just model this business process as a table, essentially, so that you don't have one single numeric value in the metric,

E

but you have more than one, potentially a bunch of measurements. But all your dimensions that describe the actual

E

Yuri shkuro: instance of the business process, or the event, are defined once. And then, on top of that, we have a functionality to attach metadata to those dimensions, so that we can then correlate them with some other type of dimensions.
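
The "metric as a table" idea above can be sketched in a few lines of Python: one row per unique combination of dimensions, carrying all RED measurements together rather than as three separate time series. This is only an illustration under assumed names (`RequestDims`, `RequestMeasurements`, and `record` are hypothetical), not the metric back end described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestDims:
    """Dimensions describing one instance of the business process, defined once."""
    service: str
    endpoint: str
    region: str


@dataclass
class RequestMeasurements:
    """All measurements of that process live in the same row."""
    requests: int = 0
    errors: int = 0
    duration_ms_sum: float = 0.0


def record(table: dict, dims: RequestDims, ok: bool, duration_ms: float) -> None:
    """Update the single row for these dimensions with one observed request."""
    row = table.setdefault(dims, RequestMeasurements())
    row.requests += 1
    row.errors += 0 if ok else 1
    row.duration_ms_sum += duration_ms


table = {}
dims = RequestDims(service="checkout", endpoint="/pay", region="us-east")
record(table, dims, ok=True, duration_ms=12.0)
record(table, dims, ok=False, duration_ms=48.0)
print(table[dims])
```

Since requests, errors, and duration share one key, the three values cannot drift apart in their dimensions the way independently emitted time series can.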

B

Daniel golant: Interesting. I don't want to hold this up too long, but I am curious. You said you don't want a unified interface, and it seems like folks in the room agree that a unified interface is not a good goal.

B

Daniel golant: I'm curious, because what I've run into repeatedly is two questions. One is like: I have a situation, and I deal with the app level, mostly, right? I'm writing, you know, a mortgage system, and engineers are saying, I need to set an alert, and I want a log line for context,

B

Daniel golant: potentially. And then the data team also wants a piece of data out of this. And I'm logging, you know, I have an entire function that is just the data I'm sending out. Why can't I send one line that produces a metric, and then from the same data produces a log line, and also sends something to the data warehouse? And from there, teams have tried to implement their own unified logger that you can configure to send to different destinations.

B

Yuri shkuro: It sounds like you're saying this is something that most people in this room agree is a bad idea. Maybe at a high level, could you say why? I don't think it's a bad idea. It's kind of situational. I know that some companies have publicly talked about those kinds of SDKs that they've developed. I remember the name being something like "Near" from one of the companies, which is this kind of single unified event API that you emit to, and then behind the scenes

B

it can decide what

E

Yuri shkuro: to do: what goes into metrics, logs, traces, et cetera, right? So that makes sense. But that's also why I mentioned boiling the ocean. We actually looked at a project like that once, and we decided not to proceed, because when you already have thousands of services, pushing that kind of unified API is almost a non-starter. It is such a huge migration that you need to force on people, and the benefits are not there for that migration. So,

E

Yuri shkuro: whereas what we can do instead is, with the existing APIs, we can extend them with the schema-first capabilities, which already allow you to capture that. But yeah, the independent SDKs still mean that people are making this upfront decision: oh, am I emitting a metric, which will be aggregated and lose dimensions, or am I emitting a very rich log statement with all the fields? Right? Yeah, that's sort of an unfortunate side effect, which potentially you can get away from

E

Yuri shkuro: if you have a unified, just an event API, and let the infrastructure deal with how best to represent the telemetry.
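
The unified event API discussed in this exchange (emit one event, let the infrastructure decide what becomes a metric, a log line, or a warehouse row) could look roughly like the following Python sketch. Every name here (`UnifiedEmitter`, the three sinks, the `mortgage_application` event) is hypothetical and chosen only to mirror the example in the question; it is not any company's actual SDK.

```python
from collections import Counter


class UnifiedEmitter:
    """Hypothetical single entry point that fans one event out to many sinks."""

    def __init__(self):
        self._sinks = []

    def add_sink(self, sink):
        self._sinks.append(sink)

    def emit(self, name, **fields):
        event = {"name": name, **fields}
        for sink in self._sinks:
            sink(event)


metric_counts = Counter()
log_lines = []
warehouse_rows = []


def metric_sink(event):
    # Aggregate on a low-cardinality key only; all other fields are dropped here.
    metric_counts[(event["name"], event.get("status"))] += 1


def log_sink(event):
    # Keep every field as a key=value pair for debugging context.
    log_lines.append(" ".join(f"{k}={v}" for k, v in sorted(event.items())))


def warehouse_sink(event):
    # Keep the full structured record for the data team.
    warehouse_rows.append(dict(event))


emitter = UnifiedEmitter()
for sink in (metric_sink, log_sink, warehouse_sink):
    emitter.add_sink(sink)

emitter.emit("mortgage_application", status="rejected", applicant_id="a-17")
print(metric_counts, log_lines, warehouse_rows, sep="\n")
```

The caller no longer chooses between an aggregated metric and a rich log statement up front; the trade-off raised in the talk is that moving thousands of existing services onto such an API is a very large migration.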

A

Alolita sharma: No, great point. Thank you, really. Ryan?

F

Ryan perry: Yeah. I kind of wanted to ask about something you were saying earlier, when you were going through the different signals. I've been working with the profiling group on the OTEP and that kind of stuff, and I would say that a common, maybe not necessarily concern or criticism, but just a common thought or response that people have is that profiling does tend to be for power users,

F

Ryan perry: and I'm kind of curious, because I know you were instrumental in the beginning of tracing and that kind of thing, which, from my perspective, also seemed like it started out as more of a power-user tool. And so I'm curious: do you see similarities with the way tracing was, maybe, three, four, however many years ago? Because, yeah, I hadn't heard of tracing; I guess I'd used profiling before I'd used

F

Ryan perry: tracing, but I also, I guess, came later to the game. And tracing tends to be more of a, I guess, yeah, kind of a,

F

Ryan perry: I guess, more of a company-level thing; you probably aren't tracing things locally as much. Yeah, I'm just curious how you think the path of profiling compares at this point, or if it's even comparable at all, to where tracing was.

E

Yuri shkuro: Yeah, it's a good question. So one thing: I think profiling actually may have an easier path than tracing into day-to-day life, because the concept of profiling is actually easier for people to understand than a distributed trace. Trying to explain context propagation to someone who has never heard of it is just a pain. I've been through this like a thousand times; it's very difficult, and implementing it is also very difficult. Whereas with profilers,

E

Yuri shkuro: you already have a much easier path on the implementation side, because you don't need users to do anything. You just integrate with the runtime and, boom, they've got the profile, right? Whereas with context propagation, there's almost always some application-level instrumentation required, and that's a very big roadblock to the adoption of tracing. And so, yeah, tracing, I personally still think it struggles to gain mind

E

share among engineers, to say, like, as a

E

Yuri shkuro: valuable tool for various things, and ROI is really, I think, maybe one of the big problems for tracing. I don't think ROI is the problem for profilers. That's actually a way easier story for profilers. But the similarities, though, are that

E

Yuri shkuro: Oh, and I think another thing is, profilers are just much easier to make work, right? So I mentioned that there's a value in aggregating profiles, and

E

Yuri shkuro: there are not a lot of technical challenges in doing that, whereas there is absolutely no solution in the industry today for proper aggregation of traces. There is no

E

Yuri shkuro: query language in existence where, let's say, I can ask a graph-based question of the traces, actually using the fact that they are graphs, right? Like Tempo, they came up recently with some form of it. It's not really even mature yet, and it's not clear how efficient it's going to be. Whereas with profilers that's not the problem at all; aggregation is very straightforward there, and you can get immediate benefits from doing that.

E

So I think, yeah, profilers... The one challenge that I do see with profiles is just, like,

E

Yuri shkuro: a vast number of different formats. So the standardization effort will potentially be harder, because with tracing

E

Yuri shkuro: there weren't that many. I've literally encountered only two data models that tracing uses: event-based and span-based. And the industry has completely converged on the span-based one. Some systems, like Facebook Canopy, are event-based, for example, right? And there are some others like that. With profiles, from what I've seen in the OTEP activity, there are literally almost, like, hundreds of different formats.

C

Yuri shkuro: And I think that's probably the biggest challenge in the profiling space: doing the gap analysis and saying what is and is not possible in all the different formats, and whether it's possible to come up with a common one. I mean, Linux has some kind of profile or trace format, I forgot what its name is, but I don't know how well it represents all of the use cases.
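
As a small illustration of why aggregating profiles is described above as straightforward, here is a Python sketch that merges CPU profiles from several hosts. It assumes each profile has already been converted to the "folded stack" form (semicolon-separated frames mapped to a sample count); the host data and the `merge_profiles` and `top_stacks` names are made up for the example.

```python
from collections import Counter


def merge_profiles(profiles):
    """Sum sample counts per call stack across any number of profiles."""
    total = Counter()
    for profile in profiles:
        total.update(profile)
    return total


def top_stacks(merged, n=3):
    """Return the n hottest stacks, most samples first."""
    return merged.most_common(n)


# Hypothetical folded-stack profiles collected from two hosts.
host_a = {"main;handle;parse": 30, "main;handle;render": 10}
host_b = {"main;handle;parse": 5, "main;gc": 7}

merged = merge_profiles([host_a, host_b])
print(top_stacks(merged))
```

Merging is just addition keyed by stack, which is why a fleet-wide view is easy once profiles share a format; the harder part raised in the talk is agreeing on that common format in the first place.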

F

Ryan perry: Yeah, interesting. Okay, yeah, that makes sense. Thanks.

A

Alolita sharma: No, very, very interesting points. And especially Linux, I mean, Linux profiling is the most well-known use case, right,

A

Alolita sharma: for profiling, just because at the kernel level it makes full sense to be able to do that, and also for just plain development, code profiling, from a familiarity standpoint. And the thing is that in Linux it's called tracing.

E

Alolita sharma: Yes, you're right, but it's actually profiling. That's a very good point.

A

Alolita sharma: All right. I think this has been a pretty awesome discussion. And again, we're very, very grateful that you could join us today. Looking forward to your talks, and again, at SREcon, as you present this paper, it'll actually be quite a good discussion to have,

A

Alolita sharma: because it does change the traditional way of thinking, but it's actually very applicable when you're looking at more and more applications that are enabled with

A

Alolita sharma: tracing and, you know, observability. So thank you again.

A

Alolita sharma: And with that said, I think we are at time. Matt and others, do you have any questions or any areas you wanted to call out, because we're going into KubeCon again? I hope to see many of you there in person.

A

Alolita sharma: We have a whole set of observability events going on: one-day events, as well as ones focused on metrics, on tracing, on Prometheus, as well as OpenTelemetry.

A

Alolita sharma: And I think Yuri also has a talk on tracing and

A

Alolita sharma: Jaeger. Yuri, do you want to share some details on your talk at KubeCon? No, I don't have one; it's Jonas. Okay, okay. I guess he's covering for you. All right, good. But thank you again, everyone. And Matt, any other words you want to add?

D

Matt young: So I'll be kind of returning to public life, so to speak, or at least open-source land, after KubeCon. But I'm looking forward to seeing folks there.

D

Matt young: I think, thanks a lot for the last couple of months of organizing meetings and such. And I think we're actually ending with three minutes to spare. So, you know, causality is, you know, not necessarily correlation.

A

Matt young: But I'm expecting updates from you on the landscape graph at a later point. Yes, that's been a little bit paused while I onboard in my new role, but I do intend to return to that. And if anyone is interested in jumping in, I know some people have expressed interest, there's a whole bunch of stuff there that's just waiting for new people. There's a bunch of issues marked as

D

Matt young: "good first issue" and "help wanted", and stuff with a roadmap. Check it out if you'd like. Alright, thanks, everyone. I want to give a couple of minutes back, but thank you again for joining, and thank you so much.

C

Ryan perry: Bye-bye.