From YouTube: 2022-08-16 CNCF TAG Observability Meeting
Description
Liz Fong-Jones joined us for a discussion / talk on "Evolving and hybridizing signal types - my journey from metrics/logs to traces to profiles"
* OpenTelemetry Profiling updates (#otel-profiles)
* potential for collab w/ other tags (STAG)
* updates from https://github.com/cncf/landscape-graph
* we have a logo!
Meeting notes and more: https://github.com/cncf/tag-observability
A: You might want to tilt your camera, yeah.
B: In Indiana. I'm in Oakland, but I went to IU.
E: Hi folks, good morning — welcome, and thanks for joining in. We're super excited to have you today, and I think we're gonna wait a few minutes for Matt to join in, as well as some of the others joining in. How's everyone? Good, good. Ryan, how are you?
E: Good, good — super excited. Again, we kind of spread the word across different channels, but it takes a while for folks to join in. Hi Kevin, thanks for joining. Steve.
A: Exactly what I figured — you know, a small audience. I know you all, and we're all people who work in the space, as opposed to audiences who are like, "what is this observability thing anyway?"
E: I know, exactly. But I think it's great that, you know, Liz, you and Charity and the team did the book, because I think the book is actually very helpful for engineers who are getting involved in actually building observability into their applications, and just understanding the basic concepts and the advanced concepts. So good stuff — and really, thank you for making the book available online, because I think that's always a great game changer for folks to pick up the thing.
A: What's really surprising to me is that, despite making the book — or maybe because of making the book — available for free online, O'Reilly has said that it's one of their better-selling books.
F: Yeah, I think the formalization of structured events, as a thing that's actually defined, is another great addition of the book — especially that it's complemented with more pragmatic, practical advice about how to actually employ this with humans, right.
A: Yeah, that turns out to be one of the things I wanted to talk about today — kind of how I came around to structured events, because I was a very hardcore metrics person at Google.
A: I am going to break my rule about saying, you know, "when I was at Google, we…" I try not to do it — I have a swear jar for when I do it at Honeycomb — but here it's for setting historical context, not a "you should do this because we did it at Google." So I think it's okay, but give me a little bit of rope there.
F: Did you already do the disclaimer, Alolita? I joined a minute late — the TOC meeting ran two minutes late.
E: I just said hi and welcomed everyone, Matthew. Happy times.
F: Welcome, everyone. This is a CNCF-sponsored event; as such, the CNCF's code of conduct applies. Please don't do anything in chat or in the meeting that would be in violation of that.
F: I have a few things to briefly cover that are more administrivia, which I hope to kind of blow through — there are links, and follow-up can happen after. Liz, I'm not sure how much time you want to fill, but I did want to leave the bulk of the time for your talk. Again, I love that it's for practitioners, so impromptu and not slideware-heavy is actually an advantage here.
F: So thank you. Okay, I can share my screen briefly, just so anybody following along later can see.
F: Cool. So today, in the TOC meeting that's just one hour prior to this meeting — is my mic level okay? This is a brand-new —
F: Totally. Deadly, that's —
F: So the TOC has asked us — the TAG, you know, the TAG chairs and whoever else is interested, I suppose — to reach out and kind of assess the health of the Cortex project and make some concrete recommendations about what it might need. So there's a link there. And at long last — I think it's a year and running — we now have a logo, and we're getting the actual high-res SVGs from the CNCF creative folks.
F: I want to highlight something. I went and visited TAG Security last week to talk about the landscape graph, which I don't want to get into the details of here, for time, but there are links. They have a working group on secure supply chain — and why does this matter for observability? Well, when we build things, they bring in all kinds of dependencies.
F: Not only are there impacts to performance and runtime and all that, but from a security perspective, part of our charter is to help with the comprehension and observation of cloud native workloads, including what CVEs might be there as of this morning or last week. And the part of that graph project that is defining the data model for packages — RPM through deb, et cetera, all of the packages — they have a similar effort going on, in some collaborations outside the CNCF with other open source and Linux Foundation affiliated groups.
F: In addition, they've kind of proposed a similar graph project that's more in the requirements phase, so there's some overlap, and some collaboration that could happen between TAG Observability and TAG Security. They actually call themselves STAG, and there's some debate as to whether we should call ourselves OTAG, but that's probably something for Slack and a poll. And then it looks like that's all I had in the way of administrative stuff. I'm assuming, Ryan, do you want to talk to the next item — the OTEP for profiling?
B: Yeah, I can just briefly give an update on what's going on there. Basically, a couple weeks ago OpenTelemetry merged its project management guidelines.
B: There are a bunch of different efforts around various kinds of specifications and that kind of thing, so they wanted to create a more standardized process for the different check boxes you need to check and the order you need to check them in. The main ones they name as the minimal set of criteria are: one, a group of designers or subject matter experts — or just, you know, people who are somewhat interested, qualified, and ready to dedicate some time to working on the project. We've already established that we have a lot of people from a lot of different places: some open source projects, some vendors, some end users, so we have a good mix of people there. The second one is the TC needing to be aware — as of a week or two ago, we got our second TC member to sponsor the working group and the efforts there, so obviously they're aware. And the third one is that the spec approvers in the broader community need to be aware of progress being made, and that's the one we're currently working on.
B: I actually also just got out of the specification SIG — SIG, yeah, they still call them SIGs, I think, or TAG, whatever it is. The specification SIG. It's —
A: OTel has SIGs, but the CNCF has TAGs. Hope that helps.
B: Yeah, yeah. So I just got out of that meeting, and so we're now — and I put a link into the doc, which we're still in the process of finalizing — but basically we're in the final stages of getting that first set of check boxes done, where this becomes an official thing. After that, the next steps are creating a project tracking issue and a project board, where we actually got a little bit of a head start, as we've already started talking about writing actual code for this instead of just talking about it. We've talked a lot qualitatively about all the different ways you can collect profiling data and the different formats that different companies and projects are using, and we've already started to figure out what the qualitative — sorry, quantitative — metrics we'd like to have are, in terms of benchmarking and stuff like that. So basically I think we should be able to move pretty steadily, hopefully quickly, as that happens. In the meantime, we're finishing up this official spec and presenting it to the specification group, and then we'll go from there. So that's the update on that. The last section that I guess needs the most filling in — and it's not something that's meant to be done by any means; we can always continue to add to it — is the different use cases for profiling. And just FYI, that group meets every other Thursday, so not this Thursday but the one following is the next meeting. If anybody would like to join, feel free; otherwise check out the doc and let me know if you have any feedback, thoughts, or questions, or feel free to just comment in the doc.
B: Yep, yep — and oh yeah, I guess I didn't mention the structure that we went with there. Obviously a lot of the signals are similar, so we kind of used the structure from the logs proposal from a while back, merged that with the various vision, mission, and ambition statements from OTel itself, and combined those to create the outline for this, and then filled it in that way. So we really tried to be intentional about having it align with the overall goals of OTel as well — just FYI on how we came up with the points we came up with. The docs following this one will get a little bit more into the weeds: you know, what specific fields do we want to have in the profiling format?
F: Yeah, I've put a link in — at the last TOC meeting we talked about this specifically, and in the slides there, if you're curious about the actual meeting notes and whatever, there are jumping-off links as well. So we've been socializing all of this and really trying to get consensus prior to a formal thing, and it seems to be working.
F: Similarly, I think the rest of the time is yours, Liz. You said 20 or 30 minutes, and I think that'll take us to time. So thank you again for joining us here. If folks aren't familiar — I'll hold it up once more; everyone probably is, here — there's a new O'Reilly book out by Liz and Charity and George. With that I'll say: stay tuned for the PR to actually formalize this, but we're hoping to have a speaker series through the fall to see what happens, and we'll have a process to propose speakers. We're starting with authors, since there are a couple of books that have come out — this is one of them — and we hope to make this, you know, feedback from domain experts and practitioners, which is obviously in our charter and relevant. So Liz is the first.
A: Basically, what I wanted to talk about today is how I came to where I am on this journey of thinking about signal types — kind of what our objectives are in observability, how we can best realize them, and what I think is up ahead. I've been working as a site reliability engineer of some kind for the past, let's say, 17 years now — 17, 18 years sounds about right.
A: I started doing this when I was very, very young — when I was 17 years old — and I cut my teeth really trying to solve problems that we wouldn't necessarily think of as systems engineering or systems administration problems. I actually started off as what you could, I guess, describe as an abuse analyst at a game studio.
A: Right — it is a user pattern that we're trying to detect using the signals that we have, but unfortunately the signals that we had were not particularly good. What's really interesting is that you can start to see the similarities with what I do today, and the differences. The similarities are that I'm trying to identify unknown patterns caused by my users in my code, without necessarily having foreknowledge of what those users are trying to do.
A: What's the duration per user of those gameplay sessions? With clever use of sed and awk and grep, we were able to produce these interesting command-line text histograms that showed some users were completing hundreds or thousands of play sessions in the amount of time it would take a normal user to complete, like, ten play sessions. So that was my first exposure to what I didn't realize at the time was a high-cardinality problem: there are many, many thousands of users who may be logged into the game at any time, doing tens of thousands of different play sessions — how do we sift that signal from the noise? Fast forward about another five years, and I had joined Google at this point, and this was kind of my first exposure to what best practices look like — what doing things in anger, at scale, looks like.
A: What's interesting is that the tools were very different from what I used in my game studio job — I was not necessarily going to be grepping through logs. That was a thing that did not work at Google scale.
A: What Google had really embraced at the time, even as of when I joined in 2008, was this idea that everything should be recorded as a time series metric, because there's just going to be too much data to record and centrally index. You would use your metrics — which were potentially broken down by machine, then by job, then by data center and availability zone and all these varying things — essentially to do what I eventually formalized as the idea of binary searching the potential problem space. If we were seeing a high rate of HTTP 500s or a high rate of latency, what we would wind up doing is finding a number of different dimensions to break down by: maybe it's availability zone, maybe it's kernel version. I keep plugging in these values that I want to group by, to see whether all the lines move simultaneously, or where one line spikes and the other lines don't. That's kind of how you bisect where the problem is coming from inside your systems.
A: Sometimes it was sufficient to bisect it down to — okay, what you would think of today as a particular Kubernetes pod template spec. Once you know that it's a particular deployment ID that is all exhibiting this problem, or a particular availability zone that's all exhibiting this problem, that would enable us to say: okay, let's cordon that off — let's drain the bad availability zone that's clearly having some issues, or let's revert that bad release that's clearly having some issues. What's interesting is that at Google there was not necessarily a notion of caring about high cardinality in users. Basically the idea was: oh, that's an impossible problem — you would never want to, for both privacy and technical reasons, group by individual Google Search user; that's just impossible. That was the thinking at the time. But sometimes the metrics were not sufficient — in particular, the metrics were not sufficient in two different ways.
A: When you had issues of noisy neighbors, when you had issues of crashes on single machines — where you could tell that an abnormally high error rate was coming from a specific set of machines, but not necessarily why — your metrics were not sufficient. And what I found interesting was that, yes, you would fall back onto logs, but the logs were not centrally indexed. We would go — not SSH to the machine, but pull up a log viewer that would scrape the files off of the individual machine so we could look at them. That gave us this rolling circular buffer of logs that we could go to if we desperately needed it, but it was not a signal that we relied upon for our bread-and-butter work — it was only if everything else had failed.
A: But there's another interesting problem, which is: what about the problems that do not appear as single point sources of failure, or the problems where you don't know what to filter or group by, because there were millions of metrics at Google? I think one of the fascinating things I uncovered there — and kind of what pointed me down this path of tracing — was when I saw, for the first time, that we had a black-box probing service that would basically repeatedly hit the service.
A: It ran against a special table inserted into this tenant in order for us to be able to perform read and write tests against that one table, and that was set to always trace, and we were getting very high quality data out of that. It would tell us, no matter what, for these black-box probes they were issuing multiple times per second: where did the request flow get stuck? Did it get stuck in the underlying file system?
A: So that was really neat, but I think one of the challenges was: what happens if a request comes in from a user saying, "hey, Bigtable's slow," but it's not something that was necessarily forced to trace? How are you going to find that request? How are you going to find the needle in the haystack of a request that looks like that one — and that is traced? Because if it was not a black-box probe, it's not something you're manually forcing tracing on. And again, problems of Google scale: we're sampling one in a hundred thousand, we're sampling one in a million. So it's like, okay, you're looking for a p99 latency event that is also at a one-in-a-million sample rate.
A: Okay, so this is where we finally get to the idea of trace exemplars. By this point, the folks at Google had designed a second-generation metric storage system. We originally had this thing called Borgmon, which was very similar to — and kind of inspired — Prometheus. It's this idea of a pull-based protocol that goes ahead and scrapes a bunch of key-value pairs out of hosts; we're all very familiar with this format.
A: What was interesting and different about Monarch, the next-generation system, was that it was designed to be able to propagate additional information besides the key-value pairs. For instance, it had a native histogram type — and not only did it have a native histogram type with custom bucket widths and various other things to improve resolution, but the folks who designed that system had added the idea that you could attach exemplars to your histogram buckets, to pass on — even if you had aggregated away detail — some of the pre-aggregation detail.
A: So, for example, if I was aggregating a metric on request latency, and I was aggregating it at the data center level, I might choose, for any specific bucket — sorry if this is a recap for people who already know what exemplars are — but for a given bucket, if I was aggregating away the machine ID field, because that's no longer relevant and I'm just combining all these various machines together into one aggregate, composite latency histogram…
A: Besides leaving on, and not filing off, the cardinality of hostname, or even of user ID or other things like that, we had the idea of attaching trace IDs. You would separately make the decision about whether or not to sample, but if you did choose to sample — and you were also in the context of a metric — we would tie one trace ID that exemplified that histogram bucket of the metric and send it along; and when it got post-aggregated, you would pick at random one of the trace IDs that might have gotten kept and propagate that along. So the net result was that when you looked at a histogram, you could see trace IDs — you could see, for the first time, that higher-cardinality detail that you had previously had to file away because it was too noisy to create a time series for each individual tag.
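For readers following along, here is a minimal sketch of what that looks like with today's OpenTelemetry Go SDK rather than the internal Google system described above: recording a latency histogram while a sampled span is active lets the SDK keep that trace ID as an exemplar for the bucket the value falls into. Exemplar support varies by SDK version and configuration, and the instrument and attribute names below are purely illustrative.

```go
// Sketch only: exemplars in the OpenTelemetry Go SDK. Recording a
// histogram value with a context that carries a sampled span lets the SDK
// attach that span's trace ID as an exemplar on the matching bucket
// (subject to SDK version/configuration). Names are illustrative.
package example

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func handleRequest(ctx context.Context) error {
	tracer := otel.Tracer("example")
	meter := otel.Meter("example")

	// Histogram of request latency; bucket boundaries come from the SDK/view config.
	latency, _ := meter.Float64Histogram("request.duration", metric.WithUnit("ms"))

	ctx, span := tracer.Start(ctx, "HandleRequest") // sampling decision happens here
	defer span.End()

	start := time.Now()
	// ... do the actual work ...
	elapsed := float64(time.Since(start).Milliseconds())

	// Because ctx carries the (possibly sampled) span, the SDK can keep its
	// trace ID as an exemplar for whichever bucket this value lands in.
	latency.Record(ctx, elapsed,
		metric.WithAttributes(attribute.String("tenant", "probe")))
	return nil
}
```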
A: And this, I think, was for me the moment where I realized that we don't have to manually correlate all of these things. I don't have to keep a correlation ID in my head and then go to the machine and grep for that particular correlation ID. I don't have to manually look for trace IDs and then see whether the metric spikes at the same time — I can visualize these two things together at the same time.
A: So while I continued to rely pretty heavily upon metrics, the thing that really opened my mind in 2017 was this idea that we can better utilize signal types if they share the same vocabulary — if they share the same verbs and nouns — and we are able to jump fluidly between those signals, so that if there is something where we don't have sufficient resolution just from a metric or just from a log, we can jump to the relevant piece of context that will help us understand it.
A: So when we turned this on for Bigtable, which was the service that I was running at the time, what happened was kind of magical, in that we were suddenly able to diagnose, for the first time, "internal customer says there's a latency problem in this particular partition."
A: Maybe I should group by worker to find out whether that's a fluke — whereas previously you would have to magically know that the worker ID was the relevant field and dimension to be aware of. Or if you clicked into a couple of example traces and they all said that the underlying storage system was slow — great, now we know to flip to the storage system dashboard. So that really, really accelerated time to debug for my team — the Bigtable team, sorry — at Google, and for the team that was working on tracing.
A: So I think here's where a couple of things combined. First off, I'd had a number of vigorous Twitter arguments with Charity Majors over the years, and we developed a sense of respect for each other rather than hatred for each other out of that — it's kind of cool to find people that you can disagree with and not get upset at. But the other thing was that I'd seen how useful tracing was, based off of my experience with exemplars, but I was still not necessarily thinking of tracing as a primary data source.
A: Tracing can have variable sample rates. That was the thing that I had completely missed: at Google, the tracing systems were actually fairly inflexible. You would set a sample rate across the board — everything had to be sampled, say, one in a hundred thousand — unless you were manually specifying a specific request to trace through because, for instance, it was a black-box probe. But it turns out — think about this.
A: It just so happens that, instead of sometimes having no exemplars in a bucket, we might have one exemplar in a bucket for sure, or there might be two or three exemplars. And we might just say: okay, there's a sample rate of 50 on this one, there's a sample rate of 20 on this one, so we're just going to add those two numbers together and say that there are approximately 70 total events that meet those criteria.
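A small sketch of the arithmetic being described: each kept event carries the rate at which it was sampled, and summing those weights gives an approximate count of the underlying events. The type and field names are made up for illustration.

```go
// Sketch of the weighting arithmetic described above: each kept event
// remembers the rate it was sampled at (1-in-N), and the estimated true
// count of matching events is the sum of those weights.
package main

import "fmt"

type SampledEvent struct {
	TraceID    string
	SampleRate int // this event stands in for approximately SampleRate real events
}

func estimateTotal(events []SampledEvent) int {
	total := 0
	for _, e := range events {
		total += e.SampleRate
	}
	return total
}

func main() {
	kept := []SampledEvent{
		{TraceID: "a1", SampleRate: 50}, // kept at 1-in-50
		{TraceID: "b2", SampleRate: 20}, // kept at 1-in-20
	}
	// Matches the example in the talk: 50 + 20 ≈ 70 underlying events.
	fmt.Println("estimated events:", estimateTotal(kept))
}
```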
A: That really blew my mind, but it suddenly made sense too. This thing that I had been conceptualizing as "metrics are the primary source of truth; we sometimes use traces; traces can exemplify particular sets of behavior" — instead, it's "traces exemplify all sets of behavior." We make it cheap enough by sampling, and for a majority of people, sampling one for one or even one for ten is sufficient.
A: You don't have to go all the way down to one in a hundred thousand, which means you can get resolution to the nearest 10, plus or minus 5 — rather than saying either I get one event which represents a hundred thousand, or I get zero. For most people at most scales, it's actually possible to get higher fidelity than that and still be able to assemble histograms.
A: What can we do with this? The answer is that by keeping traces, and by aggregating them at read time, it opens up a few possibilities that we couldn't get with metrics, because metrics pre-aggregate. They say: I only want to break down by tenant ID; I only want to break down by hostname. And it solves a lot of the problems I'd had before of having to correlate — to see, did these two lines wiggle at the same time? Did the error rate by tenant spike at the same time as the error rate by hostname? If the two curves exactly match, then I know those two things are probably at least correlated, if not causally related. Whereas if I have the raw data that includes the tenant ID and the hostname, I can perform these operations to filter or group by both of those things simultaneously, rather than only one at a time and squinting.
A: At the end of the day, events are happening as our systems process requests, and we can choose to pre-aggregate the events and generate metrics, or we can choose to pass along the detail about each event and each request flowing through our system and post-aggregate the metrics. But they're still representations of that underlying data we're trying to express about what's happening inside our systems, and you can do it at varying levels of granularity. You can do it at the whole-system level.
A: You can do it at the individual service or individual request level, or you can even think about it at the line-of-code level — even though we know we can't keep a record of every request down to every single assembly instruction; that's way too much data if you're keeping 100% of it.
A: So I think the trade-off is that the more granular you need to get, the more you need to sample. But no matter how heavily it's sampled, if you have enough samples you can reconstruct a composite of what happened. And all of this is in service of answering the questions: who, what, when, where, how, why? Those are the questions we want to answer about our systems, and it does turn out that some debugging techniques are better suited to answering some of those questions.
A: For instance, metrics are really great at answering the "when" question — telling you, should I be getting out of bed for this? If you're interested in "where," that is a really great place for tracing — for tracing to tell you which services to look at, or where that unexplained gap in time is coming from. And if you're interested in "who," that's not a thing that's specific to a signal type.
A: But I think one thing that's really eluded me over the years has been the "why" and the "how." Tracing does give us some degree of why and how if the problem is a call to, for instance, an external resource like a database. In OpenTelemetry we generate trace spans for when you call a database, and with sqlcommenter we even have the idea of propagating through to the database, to tell it who called you, so you can trace back.
A: So one of the realizations I have come to over the past year is that we often need resolution beyond the request level, and beyond just the tags that we attach to the request. No amount of tagging a request — if the only level of granularity you can get to is the request — is sufficient for understanding how and why that request spun for 2.3 seconds before it called out to the database. What happens if the problem is not that the database was slow, but instead that you sat there thinking, or maybe blocking on something, or waiting on a lock — we don't know, but for some reason this request stalls for 2.3 seconds, and then it talks to the database at 2.1 seconds and immediately returns?
A: What was it doing? Yes, we can wrap this in additional trace spans, but that breaks the fundamental promise of observability. The fundamental promise of observability is that, without pushing new code, we should be able to understand any behavior of our system. So the problems that we've solved to date with observability have been chipping away, from various angles, at these questions — who, what, when, where, why, how — we've chipped away at needing to push new code to deal with them.
A: If you have adequate ability to debug and diagnose cardinality on the fly, you no longer have to push new code to learn "who" — great. If you have tracing, you no longer need to push new code to learn "where." If you have adequate time series to understand when behavior is happening, you don't need to say, "oh crap, there's a system outage — well, I'd better turn on the metrics."
A: So "when" is no longer a problem. But the "why" and the "how," I think, either need very high-granularity trace spans wrapped around every function and turned on on demand with feature flags — that's one valid way of doing it — or it turns out there's another answer. The other answer is continuous profiling.
A: This is why a lot of my attention over the past six to twelve months has been spent on profiling, because I think it answers that holy-grail question of which specific line of code is causing problems — which previously, yes, I trawled through logs for, if I happened to have a log statement that matched what I was looking for, but otherwise I would have to push new code with a log saying "I got here." But I think there have been two challenges with regard to profiling adoption.
A: Challenge the first is that it's in a similar maturity stage as tracing was three to four years ago — it was a disconnected signal that people were treating as this completely separate thing that had no relation to what they were working on before. And secondly, it required a lot of advance setup to collect and to analyze, and it felt to people that you had to have the entire system profiled or traced really well in order to get any value. But I don't necessarily think that's how we should be thinking about things. I think a lot of the previous attempts at tracing failed because we tried to say, "oh, you must Jaeger or Zipkin your entire system in order to get any value out," and it turns out that there is value in examining individual services, and in being able to generate non-distributed traces, to understand what's happening inside of your systems.
A: I think, if anything, that's true for profiling too. I think the other thing is reframing what the value is that we're getting out of this. When we framed tracing as being only for problems that involve many different microservices, I think we kind of lost the plot a little bit.
A: I don't think the answer is "you must have logs, and you must have traces, and you must have…" At this point we've hit a saturation point where people shouldn't have to collect them all like Pokémon; people shouldn't have to pay multiple times to store what's fundamentally the same kind of data. The right approach, to me, is that — similar to how there were separate use cases for tracing versus metrics five years ago — yes, there is some value in using profiling to identify cost improvements, but I don't think that's the whole story. A majority of software developers are not thinking about how much this is going to cost in production.
A: What they're trying to do is understand: is this going to deliver a good user experience in production? Something that spins for an extra 500 milliseconds or an extra two seconds is probably not going to break your bank, but it is going to result in blown service level objectives and unhappy users, and that's something we ought to be able to fix — and we cannot expect people to wrap everything in a trace span.
A: I don't think that's a reasonable presupposition. I think it's an extension of the OpenTelemetry auto-instrumentation work to say we should be able to auto-instrument your code to the function level without you having to lift a finger. The promise was that OTel is going to give you request-level tracing for free.
A: Why shouldn't OTel give you function-level tracing for free, where that function-level tracing is profiles that are highly sampled — per one-millisecond increment, say? Sure, you may or may not get a sample for a request that runs less than 10 milliseconds — that's fine — but for a request that sits there spinning for two seconds, yeah, you'll statistically get at least 20 profiles out of that, or, depending on how you're sampling everything else, 2,000 profiles. They'll tell you which line of code, which function, is slow. And I think that's how we connect the value of profiling to the average developer — the average developer needs this. We live in a world of "you build it, you run it," so developers should have service level objectives.
A: Developers should be able to debug their service level objectives to understand where things are going south — who, where, what, when, why, how — and part of that why and how is tracing and profiling. This is a new set of behaviors people are going to have to learn, but hopefully not a giant step, if we can make the user experience smooth — as seamless as it was for me at Google going from a metric heat map to a trace. If we can make going from a trace to a profile like that, I think that is the vision I have for the future of observability: truly being able to debug any problem in production, anywhere, down to the line of code, down to the user, and being able to fix it.
E: That's pretty awesome — thanks, Liz. All right, I think Liz was right on time, so let's open it up. We have a few minutes, if folks can just run over a bit: questions?
C: I recently went through the process of enabling profiling, and I wonder what your stance is on — not the runtime expense, the time expense of profiling, but also — it depends on the language and the profiler, but I've seen a quite high memory cost of profiling, for instance using pprof. So I wonder, as with tracing and OpenTelemetry, whether we should be working on something that is a bit more lightweight, or whether we can even work with something that's more lightweight?
A: So I'm not necessarily the best person to speak to that, because — full disclosure — when we ran pprof, funnily enough, on our main ingest service, we discovered we were spending ten percent of the process's time creating and sending trace spans. And we view that, at least at Honeycomb, as an acceptable expense: to spend ten percent of our time generating traces, because it turns out it enables us to debug high-cardinality issues that we otherwise wouldn't be able to.
A: So that is a choice we have willingly made, to sacrifice a little bit of performance for better visibility. That being said, if we really cared, we would head sample rather than tail sample the data — we would not bother generating the trace events in the first place. Instead we choose to generate all the data at the source and then tail sample it later. That is a choice that we have made. And I think, yes, you're right.
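As a point of comparison, this is roughly what choosing head sampling looks like with the OpenTelemetry Go SDK: the keep/drop decision is made when the root span starts, so dropped requests never pay the cost of span generation. The 1% ratio is illustrative, not a recommendation.

```go
// Minimal sketch of head sampling with the OpenTelemetry Go SDK: the
// sampling decision is made when the root span starts, so unsampled
// requests never pay the cost of generating span data. The 1% ratio is
// illustrative only.
package main

import (
	"context"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func newTracerProvider() *sdktrace.TracerProvider {
	return sdktrace.NewTracerProvider(
		// Respect the parent's decision; otherwise keep ~1 in 100 traces.
		sdktrace.WithSampler(
			sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.01)),
		),
		// Exporters/span processors (e.g. an OTLP exporter) would be added here.
	)
}

func main() {
	tp := newTracerProvider()
	defer func() { _ = tp.Shutdown(context.Background()) }()
	// ... application work ...
}
```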
A: Some organizations may not view investment in observability that way — being willing to tolerate some runtime slowness in exchange for being able to see what's going on. I think the way we solve that is, number one, as was the case with tracing: adjusting the head sample rate can really toggle that overhead versus granularity and fidelity, and I think the same is true for profiling.
A: Our analysis says that tracing is actually far more expensive than profiling in terms of percentage CPU hit. Even for the services where it's not as high a volume of traces — and therefore we're not spending ten percent of our time mangling traces — we'll see maybe a two percent hit from tracing, and maybe less than a 0.5 percent hit from profiling. So that's our experience, and we continuously profile everything — asterisk: stupid Go runtime bugs.
A: Unfortunately, we stumbled into a number of Go runtime bugs, because to our knowledge we are some of the first people to be exercising a lot of these code paths in anger across, like, 100% of production. But basically, I think that's a question of maturity, a question of effort — if enough people are invested in investigating this, if enough people are feeling this pain because they're using this in anger, we'll get those bugs squashed pretty quick. But yeah, in my view it is worth it, and even if it is not worth it, you can always put it behind a feature flag.
A: You can always turn on profiling temporarily, at varying rates. You can increase the profiling rate, or turn it from zero to, you know, one sample every ten milliseconds, even. It's just that limit of resolution: if you're profiling every ten milliseconds, you're never going to catch something that hangs for one millisecond, but you will get enough samples for something that blocks for two seconds.
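A minimal sketch of the turn-it-on-temporarily idea in Go: gate CPU profiling behind a flag and accept the resolution trade-off described here (Go's default rate is roughly one sample every 10 ms, so a 1 ms hang is invisible while a 2 s stall yields on the order of 200 samples). The environment variable is hypothetical; changing the rate itself would involve runtime.SetCPUProfileRate or an external continuous profiler.

```go
// Sketch of turning CPU profiling on temporarily behind a (hypothetical)
// flag. Go's default CPU profile rate is 100 Hz — one sample every 10ms —
// which is exactly the resolution trade-off described above.
package main

import (
	"os"
	"runtime/pprof"
)

func maybeStartProfile() (stop func(), err error) {
	if os.Getenv("ENABLE_CPU_PROFILE") == "" { // hypothetical feature flag
		return func() {}, nil
	}
	f, err := os.Create("cpu.pprof")
	if err != nil {
		return func() {}, err
	}
	if err := pprof.StartCPUProfile(f); err != nil { // samples at ~100 Hz by default
		f.Close()
		return func() {}, err
	}
	return func() {
		pprof.StopCPUProfile()
		f.Close()
	}, nil
}

func main() {
	stop, err := maybeStartProfile()
	if err == nil {
		defer stop()
	}
	// ... application work ...
}
```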
C: And I think that's an interesting point also, with profiling at the OpenTelemetry level — that developers could provide some information to the trace instrumentation at the call level, to say "this function should never take more than X, so only create a span when it does" — stuff like that.
A: Yeah, that one's a fun one, because it's kind of a post-facto-knowledge thing, right. But you can say: okay, if I've had at least one call to this function take more than X seconds in the past five minutes, then I'm going to turn on an increased sample rate for that. So you can dynamically adjust — but you can't catch it post facto if you never traced it in the first place.
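An illustrative sketch of that dynamic idea: remember when a function last exceeded a latency threshold, and trace (or profile) it more heavily for a window afterwards. This is not an existing OpenTelemetry feature; all names and numbers below are invented.

```go
// Illustrative sketch of the dynamic idea described above: remember when a
// function last exceeded its latency threshold, and trace it more heavily
// for a window afterwards. Not an existing OpenTelemetry feature.
package main

import (
	"sync"
	"time"
)

type dynamicSampler struct {
	mu        sync.Mutex
	lastSlow  map[string]time.Time // function name -> last time it was over threshold
	threshold time.Duration
	window    time.Duration
}

func newDynamicSampler() *dynamicSampler {
	return &dynamicSampler{
		lastSlow:  map[string]time.Time{},
		threshold: 500 * time.Millisecond,
		window:    5 * time.Minute,
	}
}

// Observe records the duration of a finished call.
func (s *dynamicSampler) Observe(fn string, d time.Duration) {
	if d < s.threshold {
		return
	}
	s.mu.Lock()
	s.lastSlow[fn] = time.Now()
	s.mu.Unlock()
}

// ShouldTrace returns true when the function was recently slow, so the
// caller can create spans (or profile) at a higher rate for it.
func (s *dynamicSampler) ShouldTrace(fn string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	last, ok := s.lastSlow[fn]
	return ok && time.Since(last) < s.window
}

func main() {
	s := newDynamicSampler()
	s.Observe("renderReport", 800*time.Millisecond) // one slow call...
	_ = s.ShouldTrace("renderReport")               // ...so trace it heavily for the next 5 minutes
}
```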
C: Well, it's just unexpected — in containerized environments, in companies where containers have memory limits, for example, and you want to dynamically enable "okay, let's trace this path," and your container keeps going out of memory because, well, you switched the flag that uses more resources. So some of these feature changes, sadly, always involve restarts, which then trigger the "okay, the problem doesn't show up after you restarted" anymore.
A: Yeah, that's why I am cautiously looking at eBPF approaches that run in a sidecar rather than directly touching the process, because it does allow that separation of instrumentation from the code that's under test. But I think that is a maturity question — for all that I complain about pprof and instability in the Go runtime, it is at least a standard thing that is produced by the Go authors and that is well supported. Ish.
D: Yeah, thanks — Liz, great talk, really appreciate your time. I have a question. You mentioned profiling is kind of where traces were on the maturity curve a few years ago, and I'm wondering — I guess what I'm wondering is, are there adoption strategies that you can drive inside your organization? Do you see any reflection there? The reason I ask is, I tried to introduce profiling — and I think this is a bigger problem in interpreted languages, where the usability is harder for the profiling tools. I tried to introduce it a few years ago.
D: Locally it didn't play nice with our test stack, so I gave up for a little bit. Then our vendor came out with a continuous profiling tool for production; I enabled it, and immediately, even with the sample rate, it drove our resource usage out of control — and also, because it was a beta tool, it poisoned all of our metrics for the month. And that was kind of it — after shot two, the organization was like, "we're not using profiling anymore." So where do you see this going?
A: It's 100% a usability question. With much love and respect to Jaeger for kind of paving the way of looking at a single trace — the value comes from being able to examine multiple traces in context. That's true for tracing, and so for profiling too: profiling is never going to succeed if the only people who use it are the Brendan Greggs of the world, the performance engineers of the world. Yes, you'll have organizations that think it's worth it for a select team of people to understand profiling and be able to drive cost reductions. But for a majority of organizations, the problem is not the cost of the AWS or Azure or GCP bill — the problem is that they're wasting developer time chasing down bugs. Why should we not be able to fix that? So I think it's about articulating the value — what problem people solve with it — and making the usability such that the average dev can use the tools.
A: I'm not aware of organizations where ordinary devs routinely look at Jaeger. You kind of have to have more layers of abstraction over the individual trace to be able to get value out of tracing, and similarly, you have to have more levels of abstraction over the profiles to be able to — or to give people the carrot to.
D: Just to clarify: without some sort of scale on the profiles that you can collect — like, for example, local profiling — local profiling usability is not necessarily a goal, in your mind, of any new profiling initiative?
A: I'll kick off, you know, go test -bench and go look at the profiles in go tool pprof, absolutely — I'm working on a benchmark right now. But to me, the way I should be approaching this is not in terms of signals; the way I should be approaching this is, what's the problem I'm trying to solve? And to me the biggest problem is: I have a trace — it used to be, I have a request taking five seconds.
A: I'm frustrated that I have a request taking five seconds that's blocked, within this individual request, for two seconds — and why is it taking two seconds? That's the motivating fire for me. And then the profiling is the "how": it's either profiling, or dynamically enabling trace spans down progressive levels of the function stack. But similar to how exemplars, taken to an extreme, approach sampled traces — when you turn on trace spans at finer and finer levels of granularity, that starts to approach profiling, and it turns out to be way more efficient to profile instead of creating trace spans for every function call.
D: Awesome, thank you.
A: One of my hopes is that the OTel profiling effort is going to standardize the agents, lower the overhead, make it tunable, and make it correlatable to the OpenTelemetry span IDs. That's a lot of the mission statement — when I think about "should this project be in OTel," it's: does this relate to signal collection and correlation under the common set of principles? That's why, for instance, we accepted sqlcommenter — it's a trace propagation issue. And I think with profiling, it's a trace correlation issue, for diagnosing things that go even further beyond traces. If there hadn't been that connection to tracing, it would be like, "okay, great, this is a performance tool — why doesn't this belong with eBPF?"
E: Yep, absolutely — and that's a very good point to call out, because I think that's not clearly understood, given the overuse of the word "profiling" itself.
A: Right, right — it turns out that profiling and eBPF are correlated, but you can use them to solve slightly different problems. For instance, eBPF is more than just profiling: you can use eBPF to live-debug things and look at variables. Conversely, there are other ways besides eBPF to accomplish profiling, like runtime support for pprof. So it's kind of overlapping Venn diagram circles, totally.
E: Totally — and that's a very good point. I think we are a bit over time; it's two minutes to ten. So if folks have one more question, maybe we can address it; otherwise we can give a —
C: I would have one last one: what's your thought on anomaly detection at collection time, or at span and profile creation time? Like, let's say you don't profile all the time, but if your metric exporter notices, hey, this deviates from the norm — let's collect profiles.
E: We're at time — and Kevin, thank you again. Liz, again, deeply grateful that you could join today, and really appreciate everybody joining in. A really awesome talk today — thank you. We'll be posting the recording right after it's available. Take care, everyone. Thanks. Thank you.