From YouTube: OpenTracing Monthly Call - 2018-08-03
E
Hi, so yeah, I'm Jonathan. I used to work on the Canopy team at Facebook. I recently moved to another team, and Michael recently joined the team as well, so we're using this also as an opportunity to kind of move me off of the advisory board and move Michael onto the advisory board. And so you have Michael. Hi.
C
I'm Michael. I think I'm unmuted, yeah. So I'm Michael, I just recently joined Facebook and the Canopy team. Before that I was actually at Comcast for a while, and one of the things I worked on there was actually our sort of internal tracing system as well, which was the sort of X-Trace-ish, you know, Dapper-style one, so I've kind of...
E
Yes, okay, so yeah, we're going to be talking about Canopy, which is Facebook's distributed tracing and analysis system. This is kind of an amalgamation of a couple of talks: we published a paper at SOSP 2017 last year, two of our engineers talked at QCon in New York, and we're focusing this mostly on the instrumentation and representation side of Canopy.
E
We have instrumentation available in a number of languages: C, C++, Python, Java, and, because it's Facebook, PHP; other languages are sort of supported through C or C++ bindings. Our instrumentation is integrated into both our common RPC stack that's shared across all services and, deeply, into our www stack, so the overall page load process both on the client and server, as well as some other common pieces of infrastructure, and then we're also able to ingest data from other sources. So we have tracing in our mobile applications.
E
It also combines an extraction and processing framework, so given a trace that we receive from some source, we're able to run custom user code to extract trace patterns and information from it and write them to datasets that we can then do aggregate analysis on. And then there is a separate team that works on performance visualizations, and they work on both single-trace and aggregate visualizations for these traces.
E
So this is what our overall model looks like. We have sort of five basic objects in our trace. We have the overall trace that encompasses everything. The trace is broken up into a number of execution units in an explicit manner, and an execution unit represents a sequence of trace data that comes from a single clock. In practice, this usually represents either a single host or a single thread within that host, but it can be used for modeling other primitives as well. Within an execution unit, the unit contains a number of blocks.
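To make the object model concrete, here is a minimal sketch of those five objects as plain Python dataclasses. The class names, fields, and string-typed ids are illustrative assumptions for exposition, not Canopy's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Point:
    """A single timestamped occurrence inside a block."""
    point_id: str
    timestamp_us: int
    label: str
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Edge:
    """An explicit, typed causal link between two points."""
    src_point: str           # id of the source point
    dst_point: str           # id of the destination point
    edge_type: str           # e.g. "rpc", "function", "continuation"
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Block:
    """A segment of work inside one execution unit."""
    block_id: str
    points: List[Point] = field(default_factory=list)
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class ExecutionUnit:
    """Trace data sharing a single clock: typically one host or one thread."""
    unit_id: str
    blocks: List[Block] = field(default_factory=list)
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Trace:
    """The overall trace that encompasses everything."""
    trace_id: str
    units: List[ExecutionUnit] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)
    annotations: Dict[str, str] = field(default_factory=dict)
```

Blocks and points carry the data; edges carry causality between points, which is what the rest of the discussion builds on.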
E
We have a simple case with four events; this is sort of the classic RPC call-and-response set of events. These sort of break down as: we have a call to, and a receive on, some back-end service; that back-end service sends some response; and then the parent service records a complete event when it receives the response from the RPC call. We then take these events and interpret them into the model.
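A minimal sketch of how instrumentation might emit those four events, with the model built later by interpreting them; the emit function, the event field names, and the way context is passed to the callee are assumptions for illustration, not Facebook's actual API.

```python
import time
import uuid


def emit(event):
    """Stand-in for serializing an event and flushing it to the trace backend."""
    print(event)


def record_rpc_call(trace_id, service_name, do_rpc):
    """Client side: a CALL event before the RPC and a COMPLETE event after it."""
    rpc_id = uuid.uuid4().hex
    emit({"trace_id": trace_id, "rpc_id": rpc_id, "type": "CALL",
          "service": service_name, "ts_us": int(time.time() * 1e6)})
    # The (trace_id, rpc_id) context travels with the request to the callee.
    result = do_rpc(trace_id, rpc_id)
    emit({"trace_id": trace_id, "rpc_id": rpc_id, "type": "COMPLETE",
          "service": service_name, "ts_us": int(time.time() * 1e6)})
    return result


def handle_rpc(trace_id, rpc_id, do_work):
    """Server side: a RECEIVE event on entry and a RESPONSE event before replying."""
    emit({"trace_id": trace_id, "rpc_id": rpc_id, "type": "RECEIVE",
          "ts_us": int(time.time() * 1e6)})
    reply = do_work()
    emit({"trace_id": trace_id, "rpc_id": rpc_id, "type": "RESPONSE",
          "ts_us": int(time.time() * 1e6)})
    return reply
```

Because each side only emits flat events, a backend can interpret the two sides' events under different instrumentation versions, which is the decoupling benefit described next.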
E
So one benefit that we get from having this decoupling of events from the actual model is that we're able to update cross-system instrumentation without having to carefully arrange releasing instrumentation versions to both services at the same time. In practice, given service release schedules, it's impractical to assume that the instrumentation version on both sides of the boundary is going to be the same, so you need to have some compatibility across the boundary, and this decoupling allows us to, say, interpret events on one side of the boundary differently.
E
So, coming back to the explicit edges, this is probably the biggest single difference between a span-based model and our model. All of these things can technically be represented within spans. We've found the benefit of having explicit edges is that we haven't needed to change the structure of the trace to add additional features, and so for this I can walk through an example that we have in our current system.
E
However, we also have a second causality hierarchy, which is the function hierarchy, and so, you know, the function call stack sort of represents relations between parent functions and the child functions that they invoke. We've found that this is useful for representing nested blocks: you can have one block that is entirely contained inside the execution of a parent block, and we're able to use edges to say that this child block is part of this parent block without having to represent that nesting structurally.
E
The third causality is actually an interesting one, and it occurs in, I guess, more and more languages over time, like JavaScript and other continuation-based languages. Here you can imagine our schedule function queues up some future that will then be executed later on, and so in this case our causality isn't necessarily between the root functions that we're executing. In this case we've scheduled some function, and then we have some common infrastructure stack which is pulling these entries off of the queue and executing them later.
E
It's also executing other futures that are not actually connected to the one that we scheduled. So these are all, you know, parent-child relationships, ish, but what we've found is that they represent different parent-child relationships, and having edges, and specifically types for those edges, allows us to say that the function hierarchy relationship is different from the RPC hierarchy relationship, which is different from the continuation hierarchy relationship.
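A small sketch of what typed edges could look like, reusing the hypothetical Edge object from the model sketch above; the enum values and helper are illustrative, not Canopy's actual edge types.

```python
from enum import Enum


class EdgeType(Enum):
    RPC = "rpc"                    # caller's point to the callee's receive point
    FUNCTION = "function"          # parent block contains a nested child block
    CONTINUATION = "continuation"  # schedule point to the later run of the future


def add_edge(trace, src_point_id, dst_point_id, edge_type):
    """Record a typed causal link; tooling can then treat each type differently."""
    trace.edges.append(Edge(src_point=src_point_id,
                            dst_point=dst_point_id,
                            edge_type=edge_type.value))
```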
E
So one other example where we've found that having the ability to create explicit edges has been useful is representing application flow. One of the common tools for understanding our traces is critical path analysis, particularly for browser traces, and so we ran into a problem where we end up with, say, our JavaScript requesting a couple of resources, and our JavaScript thread tends to be fairly busy, and so we would get traces that look like this.
E
However, if we have additional information from our application, say when we actually end up using these resources, we can actually see in this case that we use resource two immediately, but resource one we actually don't need for some, you know, non-trivial amount of time. So we could have actually delayed resource one significantly without affecting our overall time, but it looks like resource two is actually our blocking resource, and so in this case we want to represent some sort of application flow.
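One way to record that application-level dependency is an explicit edge from the point where a resource finished loading to the point where the application first used it, so critical path analysis can follow the real dependency. The helpers below are hypothetical and reuse the earlier sketch's objects; they are not Canopy's instrumentation.

```python
def on_resource_loaded(trace, resource_name, loaded_point_id):
    # Remember where the download completed; arriving bytes alone do not put
    # the resource on the critical path.
    trace.annotations["loaded:" + resource_name] = loaded_point_id


def on_resource_used(trace, resource_name, used_point_id):
    loaded_point_id = trace.annotations.get("loaded:" + resource_name)
    if loaded_point_id is not None:
        # "application_flow" marks application-level causality, distinct from
        # the RPC / function / continuation edge types above.
        trace.edges.append(Edge(src_point=loaded_point_id,
                                dst_point=used_point_id,
                                edge_type="application_flow"))
```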
E
This has also allowed us to experiment with different representations of application-based logic. So, for instance, we've also experimented with saying that certain events must happen in order for other events to even be considered, and so again, in the page load process, there are some synchronization points where we know that we won't receive...
E
So, coming back to our metadata, we have sort of the standard string-to-string annotation map that's common among, you know, a lot of tracing platforms. These can be attached to any object in the trace, so points, edges, blocks, execution units, or the trace: all of them have a metadata object associated with them.
E
This also allows us to distinguish between annotations that users add and annotations that we absolutely must have for, you know, loading or displaying the trace. Custom is then sort of a general bucket for any annotation data that users add through their own instrumentation, and then error properties are typically used for noting errors in trace construction, as opposed to errors in the overall execution of the trace. So, for instance, we might use an error to indicate that the trace instrumentation never closed.
E
The other feature that we have is typed counters, and these are an explicit, separate type from the string annotation map. These are counters that have a numerical value associated with them, a particular type, and then also a precision. And so this allows us to say that 1,024 bytes is distinct from 1,024 milliseconds, which is distinct from 1,024 kilobytes. But it does allow us to say that if the user records 1,024 bytes in one place and one kilobyte in another place, those two values are actually equivalent.
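A minimal sketch of a typed counter along those lines; the unit table, the field names, and the way precision is carried are assumptions for illustration (the talk treats one kilobyte as 1,024 bytes).

```python
from dataclasses import dataclass

# Scale of each unit relative to its base unit.
_UNIT_SCALE = {
    "bytes": ("bytes", 1),
    "kilobytes": ("bytes", 1024),
    "microseconds": ("microseconds", 1),
    "milliseconds": ("microseconds", 1000),
}


@dataclass(frozen=True)
class Counter:
    value: float
    unit: str        # e.g. "bytes"
    precision: int   # significant digits to keep when aggregating (not used here)

    def normalized(self):
        base, scale = _UNIT_SCALE[self.unit]
        return base, self.value * scale

    def equivalent(self, other):
        base_a, val_a = self.normalized()
        base_b, val_b = other.normalized()
        # Counters in different base units (bytes vs. time) are never comparable.
        return base_a == base_b and val_a == val_b


# 1,024 bytes == 1 kilobyte, but 1,024 bytes != 1,024 milliseconds.
assert Counter(1024, "bytes", 4).equivalent(Counter(1, "kilobytes", 4))
assert not Counter(1024, "bytes", 4).equivalent(Counter(1024, "milliseconds", 4))
```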
E
We've also extended it over time to more types as we've needed them, and so we've introduced sets of strings that can sort of be appended to over time; we've used this in particular on, say, execution units or traces. And then also a stack frame type, capturing either, like, sampled profiling data or the stack frame at a particular RPC call.
E
So, coming back to, I guess, putting this all together, in between the metadata and our events we've run into, I guess, some fun challenges in modeling. Going back to our old instrumentation, where we just had call, receive, response and complete events: each of these has some associated metadata with them, and one problem we ran into was, well, when we wanted to extend this to, you know, blocks and points and execution units, a call event now does more than just create an edge.
E
A call event actually ends up creating a point and an edge to the RPC service that you're calling to, and so there's an open question of where the metadata actually applies. Does that metadata apply entirely to the point that it creates? Does it apply entirely to the edge that it creates? Is there a mixture between them? We sort of made the decision that a call represents the edge and the point is sort of a side effect of that, and so the metadata gets applied there.
E
But this does mean that, you know, when users are using the old instrumentation, they can't actually attach metadata to the original calling point instead, and so this is why we sort of extended the instrumentation over time to allow more places for this metadata to apply. And with that I will hand it over to Michael. Do you want to try sharing your screen instead, or do you want me to walk through the slides as you talk?
B
Up from... what's that? Oh yeah, presenter. I know how to use PowerPoint, everyone. Don't do that, man, disaster.
C
Right, great. So yeah, I'll kind of pick up from where Jonathan left off. You know, it was kind of interesting: before I came here, you know, I worked with, well, it's been open sourced now, but a very span-based tracing system. We called it Money, as in follow the money, and then we had all these clever things around it, like the Money Bank was where all the traces lived and stuff.
C
So it was fun, but, you know, we did run into some of the modeling issues that Jonathan was talking about, actually two in particular that we kind of ran into there, and then I read the Canopy paper and then I quit and came here, you know, and we were like, hey, this could actually be useful. One is we had these sort of situations where we had a trace on a particular system.
C
And, you know, a bunch of stuff is going on in the system, and you just wanted to sort of attach, like, a profile of what was happening on that system at various levels to the trace, and, you know, kind of the best way we could think of to do that in the sort of span-based model was this:
C
You have sort of a top-level span that represented, like, the entire scope of the execution; you start profiling and then end profiling when that thing closes, and you try and attach that profile to that top-level span. But then, in turn, you had to know that that span was kind of special, right, like that was the one that had the profiling information.
C
It wasn't that bad, but it was actually kind of clumsy as far as the tooling we were building around it went, and in sort of the Canopy model it's actually kind of more natural to just annotate the execution unit that represents that, like, request handling, right, because we use that to sort of naturally represent, you know, here's the entire span of processing an individual request (but not span in the tracing sense; those are cousins). And another interesting one was, if you just had to stick some work in a queue and you wanted to understand...
C
...you know, how long it was in there and when it came out. You know, that's actually pretty naturally modeled by another execution unit, with points for enqueue and dequeue and edges for causality, right. Whereas, like, if you sort of put it into, you know, you could model it as a separate span, and you start the span when it goes into and out of the queue.
C
But it sort of means a very, very different thing than what most of the other spans do, where it's like an actual RPC graph. It's like, oh no, you just have to know, you know, as far as your tooling and stuff goes, it's like, oh well, that particular span happens to be one that represents, like, this thing sitting in a queue for a while.
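A sketch of the execution-unit version of that queue example, with explicit enqueue and dequeue points and causality edges instead of a span whose duration secretly means time spent in a queue; it reuses the hypothetical objects from the earlier model sketch and is not Canopy's actual API.

```python
import time


def record_enqueue(trace, queue_unit, item_id, producer_point_id):
    """Model the queued item as a block in the queue's own execution unit."""
    block = Block(block_id=item_id)
    enqueue = Point(point_id=item_id + ".enqueue",
                    timestamp_us=int(time.time() * 1e6), label="enqueue")
    block.points.append(enqueue)
    queue_unit.blocks.append(block)
    # Causality: the producer's work led to the item entering the queue.
    trace.edges.append(Edge(src_point=producer_point_id,
                            dst_point=enqueue.point_id,
                            edge_type="continuation"))
    return block


def record_dequeue(trace, block, consumer_point_id):
    dequeue = Point(point_id=block.block_id + ".dequeue",
                    timestamp_us=int(time.time() * 1e6), label="dequeue")
    block.points.append(dequeue)
    # Causality: dequeuing the item is what starts the consumer's work.
    trace.edges.append(Edge(src_point=dequeue.point_id,
                            dst_point=consumer_point_id,
                            edge_type="continuation"))
    # Time in the queue falls directly out of the two points.
    return dequeue.timestamp_us - block.points[0].timestamp_us
```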
C
So those are a couple of things that we actually did struggle with from the modeling perspective, that we were actually pretty interested in, when we read the Canopy paper, to sort of help us out with. So, you know, just kind of worth noting: it was kind of an interesting thing to sort of see it from one side and now start to see it from the other.
C
So that said, I wanted to sort of move on a little bit to talk a little bit about what we're doing now and sort of where we're focusing. You know, I guess Facebook's probably grown quite a bit in the past, you know, X years, and we've got a ton of engineering teams, right. So one of the things we're focusing in on is getting the back-end APIs that, you know, Jonathan alluded to into a point where they're safe and, like, clear and usable. So that means, for us, actually sampling isn't enough.
C
We also need rate limiting, and I'll kind of go into that a little bit later. We also want these sort of somewhat tailored, you know, high-quality APIs, instrumentation layers that are for, you know, back-end use cases. Right now, you know, what we really want is to have just, sort of, most of the complexity in dealing with, you know, the underlying model handled by the instrumentation layer.
C
And an end user, alongside, you know, the sort of default one, would sort of do that in an easy way. You know, so what we really want to do is just make tracing on the back end just, like, really, really easy for folks that are building back-end services.
C
The PHP instrumentation we have that Jonathan mentioned actually kind of does that to an extent already, but, you know, we sort of expose a lot more of the underlying guts to back-end folks right now. And then another thing that this actually ends up being useful for is having, you know, sort of different APIs that are good for different situations. So, you know, one of the things that we did: we were just working with one of our teams that has some, like, really, really stringent sort of perf requirements...
C
...as far as memory usage goes, and then, sort of, you know, they really worry about things like thread contention. The sort of flexibility of the underlying model is actually going to let us fairly easily create, like, a fairly tailored one, or, you know, hey, if you need to...
C
...if you have a really high-performance, you know, RPC system, and look, you don't want things going on behind the scenes that could cause additional thread contention or memory allocation, use this API. You know, so that's one thing.
C
And then the other thing we're working on is actually a sort of a revamp of... if anybody's read the Canopy paper, there's, I think they're referred to as custom extraction functions or something like that, but it's essentially a DSL for working with traces that happens to run in our back end. We're working on a sort of revamped, you know, expanded version of that that'll run, you know, in a separate set of processes elsewhere and, you know, then be based on sort of Python rather than this completely custom DSL.
C
So those are really the two things we're working on now. That's actually super important for us, because we tend to look at traces in aggregate a lot, and we just sort of compute, like, summary, you know, information about traces, often, right, but that's stuff that's sort of covered in the paper. But I think, the way we're going to be doing it, on the safety and API clarity side:
C
This is sort of an overview of what the instrumentation stack really looks like for us. At the bottom layer, we've got, like, a layer of sinks that do nothing but, you know, serialize the events that Jonathan mentioned and flush them somewhere; you know, we've got sort of an internal Kafka-esque system that is used on top of that.
C
There's a trace model that really, you know, represents that trace model as an object model, but doesn't let you do things that don't make sense, right; like, you can't create, like, a block on a point, for instance, right. It just makes it a little bit easier to work with. But, you know, when you do things with that model, it'll give you pointers, so you can sort of keep references to them in your code and then flush them.
C
You know, it'll flush things to events usually right away, but you can do some things before that if you need to. And then on top of that, we've got sort of a set of code that deals with creating instrumentation layers, right, because we don't want people to have to worry about things like, you know, propagating context, either through thread boundaries in their system or, you know, across system boundaries. We don't want people to have to, like, really, really understand that trace model deeply, understand which parts are active, right.
C
We just want the underlying instrumentation to take care of that. We obviously don't want people to have to do their own rate limiting, because they won't, and then our system will get, you know, knocked over, so that's not great. But then the whole idea being, on top of that, we've got this sort of instrumentation kit to build back-end instrumentations. We've got a set of instrumentations that are either built or that we'll be building on top of that, and then we really want to have most folks...
C
...leverage sort of this high-level API that really just lets them do, like, a few simple things, right, and all of the more complex pieces are handled by the instrumentation, right. Like, at the end of the day, you know, it'd act something more like a logging framework, right: like, you log a point, or you create a point, and it goes into the right place on the trace, right. It goes on the right block...
C
...or the right execution unit, and, you know, and so on. And then, you know, maybe exposing, you know, a little bit of additional stuff at that high level that, you know, handles, like, the 80% use case, and then, if we need to do something more sophisticated, you know, people would have to go sort of layers down in this API stack to do it, right. And then, like I'd mentioned earlier...
C
...another thing that we're talking about doing sort of on top of this is creating just a really, really performant but much more constrained API that just does that RPC trace model for some, you know, particular use cases. And it's kind of nice that, like, not only does the underlying, you know, event and object model sort of give us that flexibility, but, you know, sort of the instrumentation that we've got built up lets us do that too.
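A sketch of what that "feels like a logging framework" top layer could look like: application code just logs a point, and the context the lower instrumentation layers maintain decides which block and execution unit it lands on. Everything here (the names, the thread-local context, the no-op behaviour when tracing is off) is an assumption for illustration, not Facebook's internal API, and it reuses the earlier model sketch.

```python
import threading
import time

# Thread-local context maintained by the instrumentation layer (the RPC or
# www framework), never by application code.
_context = threading.local()


def set_active_block(trace, unit, block):
    """Called by the instrumentation layer when it opens a block."""
    _context.trace, _context.unit, _context.block = trace, unit, block


def log_point(label, **annotations):
    """The 80% use case: drop a point into whatever block is currently active."""
    block = getattr(_context, "block", None)
    if block is None:
        # Tracing is off for this request (not sampled, or rate limited):
        # the call is a cheap no-op.
        return
    block.points.append(Point(point_id=block.block_id + "." + label,
                              timestamp_us=int(time.time() * 1e6),
                              label=label,
                              annotations=dict(annotations)))
```

Application code would then call something like log_point("cache_miss", key="...") and never touch blocks, edges, or context propagation directly.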
C
But if you don't even know that, you know, or if you want to compare, you know, data in aggregate before and after some event, right, like, you know, a deployment or something, and see what's happened, right, or if you just want to compute, you know, summary statistics off of something that can only be derived from, you know, a trace, right: that actually ends up being, I think, a more common use case for us than just looking at an individual trace.
C
So really, this is a domain-specific processing system for getting at that sort of stuff, right, and we've had a bunch of things that are built on top of it. At a high level, you know, really what we've got is sort of configuration-based: we've got our, you know, internal configuration system that you sort of put, you know, this Python-based, you know, DSL into, and that's run by this whole set of sort of machines that will go in and run those on a per-use-case basis.
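For a rough idea of the shape of such a thing, here is a hypothetical Python extraction function that turns each trace into rows of a summary dataset; the decorator, the registration mechanism, and the row schema are invented for illustration and reuse the earlier model sketch, they are not the actual Canopy DSL.

```python
def extraction(dataset):
    """Register a per-trace extraction function under a dataset name."""
    def wrap(fn):
        fn.dataset = dataset
        return fn
    return wrap


@extraction(dataset="rpc_latency_summary")
def rpc_latency(trace):
    """Emit one row per block that has both a call and a complete point."""
    rows = []
    for unit in trace.units:
        for block in unit.blocks:
            points = {p.label: p.timestamp_us for p in block.points}
            if "call" in points and "complete" in points:
                rows.append({
                    "trace_id": trace.trace_id,
                    "unit": unit.unit_id,
                    "latency_us": points["complete"] - points["call"],
                })
    return rows
```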
C
All right, so, you know, in the old system that's mentioned in the paper, we kind of ran the moral equivalent of this all as sort of one, you know, set of infra that was also doing a bunch of other things. You know, in this one, the use cases are actually split out into separate, sort of, you know, chunks of machines for different, you know, use cases, essentially going back to different users of the system, right, that are going to be doing vastly different things.
C
So those are really the two big things we're working on for the time being. And then, yeah, the other important thing about this that I didn't mention, which is different from the stuff in the paper, is that this actually does allow us to do sort of ad hoc queries, right, which was not possible with the old system. So that's super useful, especially when you don't even know what you're... yeah, yeah, one more after that, yeah.
C
Anybody at, like, a big company has almost certainly come across the use case where somebody's like, hey, there's this stuff that'll, like, magically propagate, like, this ID around, and, like, I want that to propagate my session ID. So, you know, the notion of having a more abstract way of using the underlying thing, right, wherever you've got the tracing instrumentation, of being able to propagate context through, which I know is, you know, in OpenTracing, baggage, and the paper that, you know, Jonathan Mace I think wrote, like, has some really good stuff in it. But I think one of the interesting things is, like, once you open up that capability...
C
You know, like, how do you keep people from doing really, really bad things with it, and such bad things that you end up having to turn it off, is kind of an open question, and I think it's kind of worth thinking about, like: is there a difference between, you know, that sort of really common use case of, like, hey, I just want to propagate an ID sort of within my system boundaries for some sort of session or something, versus the broader thing.
C
The thing about the sort of Canopy data model is that it's actually really, really good at doing single-node traces as well, and we get, like, really detailed traces of mobile clients and www, and folks actually want to get, like, really detailed traces of their own back-end systems, you know. But it's like, well, how do you then take that really detailed view of a little piece of the system and work it into an overall, you know, distributed trace that's probably broader? You know, you sort of create a lot of noise for people.
C
I just want that broad view; but then sometimes, you know, the person that's, like, looking at that really detailed view might want to look at something, you know, a few layers back. So there's actually a couple of different ways that we've talked about modeling that, and there's some different things we've talked about doing with that in terms of, like, viz tooling, you know, and so on, but it's actually kind of just...
C
You know, it's kind of built into the question of how we actually get, like, a good, solid, end-to-end trace while still having, like, these chunks of, like, really, really detailed trace at various parts of the system in there, and, you know, they really are different things with different audiences. You know, so yeah. So those are the two things that I think we're thinking about and, you know, may or may not do anything useful with. Cool. So, anybody have any questions?
A
I have a question about metrics and aggregates, actually. I'm wondering, I mean, obviously you have an event-based system and you're rolling up some amount of aggregates out of that, you know, into your tracing system, but are you doing kind of all metrics extraction based on this system, or do you have a totally separate metrics system? And if so, are you kind of sharing... are you using the tracing system for context propagation, and, like, how do those two things relate to each other?
E
So we have, there's, an independent metrics-based system for sort of, like, operational management. The traces will tend to, the traces can share some of the data from those metrics. There are some caveats usually there, where, like, the metrics captured are system-level, but say we want to capture request-level metrics within a trace; but we can sort of pull in the same, like, overall system CPU utilization and things like that.
A
So, well, I mean, it sounds like you actually have things separated between system metrics, and then maybe your application-level metrics are coming out of the tracing system, but the degree to which you may want to dimensionalize some metrics, is that all happening... you know, that context tends to get propagated in the tracer, which is... I'm just wondering how that really...
E
...all of these requests, in some sampled fashion, to understand, like, request-based utilization through the system: we're currently using tracing for that. Like, this is fundamental at Facebook: like, context propagation and tracing are fundamentally tied together, for better or worse, and so that's where, like, as Michael said, we end up with these cases where people are like, man, I really need to propagate a context, and, like, let me turn on tracing, and we're, like, you know...
C
Yeah, and that, like, we don't have a decoupled, generalized context propagation system, and I think, you know, to that point, like, how do we do that safely with the number of teams that we have is sort of an open, sort of, operational question that, you know, we want to think about and see if we can tackle at some point. But yeah, it's interesting with this many different folks, sort of, you know... yeah.
C
The other big thing we do with the API cleanup is we're actually adding pervasive rate limiting as well, after, you know, after the sampling. Sampling actually ends up not being quite enough for us, because, you know, if somebody's doing a coin flip that they expect to be on, like, a tiny percentage of traffic, like maybe in a region that, you know, is being used for, like, you know, some experiment, and then suddenly a lot of traffic fails over to there...
C
...you know, suddenly you can sort of get this explosion of traffic just because of that. So we're actually adding rate limits both on trace starts and on trace size, you know, before we can actually sort of start new traces, to cover that. So that's actually one important safety piece. As far as generalized context prop goes, you know, I think it's something we've been talking about for a while and starting to think about.
C
I do think that opening up a sort of completely generalized system and, like, saying, you know, like, any engineering team in an organization of our size is free to, like, go and attach baggage to this thing is, like, a non-starter. We have, you know, we've got systems that, you know, are extremely memory sensitive, where, you know, the engineers on those teams would sort of rightfully just, you know, say very loud things. And then I think...
C
...if you look at some of the safety that was sort of in the paper that Jonathan wrote (don't know if he's around), you know, essentially it comes down to, like, having a principled way of, like, you know, capping the amount of size, the amount of data, that gets propagated. But then it sort of has this, like, downside of, like, oh, maybe you really, really rely on a particular piece of data, you know, and now it's not there. So, to the extent that I've thought it through, I do think...
C
...it's really, really worth thinking about how you separate out the use case where somebody wants to propagate an ID within their system bounds, and then sort of attach metadata to it after the fact that can be processed by, like, some other system, right, and have it emit that data, but sort of, like, then not have that ID cross, you know, be propagated outside of the bounds of their particular...
C
...you know, set of systems, and then separating that out from, like, the generalized context prop, which should be very strictly controlled and really only used for, like, a specific set of blessed things, with, like, a decent amount of process around putting things in them up front, right, and, like, some set of configurations that, like, can't be changed outside of, like, you know, review by some accountable team. So, you know, to the extent that I've thought through it, like, that's sort of where I've landed.
C
Yeah, you know, and I think stuff like that, I think, is, you know, it's good, but then I think you do sort of run, at least in our world, into this thing of, like, well, okay: what if, you know, the data in your most prioritized namespace is larger than, like, the amount of data that, like, the most, you know, sort of, the most conservative team is willing to accept? You know, so you kind of just, like, need...
C
...some... I think you do need to, like, sort of really tightly control, like, you know, the generalized thing, and then try and figure out how to build the more, you know, "hey, if you want to do something within your own system bounds" thing on the same, you know, context prop. But, you know, so I think we really like the notion of having, like, some sort of system bounds.
E
Like, you know, you may have some session ID that's propagating over multiple individual www requests, but each www request should have its own ID, and, like, making sure users understand, like, what are the boundaries where things cross over, and are able to do it in a safe way... I think, like, it's a very, very open question on our side, like, how to make that work.
C
So the way we're planning on doing rate limiting is we're planning on still doing it at trace start and not killing traces that are in progress. So, like, for instance, like, you know, from the point of view of, let's say, an individual node: you know, you have a one-in-a-thousand coin flip, right, but when we configured that coin flip we had, like, two nodes running, and then for some reason now there's a thousand running. What would happen is you do the coin flip...
C
...the coin flip would pass, and then there'd be an additional rate limit check, and we would have a set of centrally configured rate limits that say, essentially, okay, this, you know, this policy gets to start five traces per second. It would check the rate limit after that, and the rate limit would then fail, and that's also where it would check the trace size rate limit, right. So we're only going to do it at start. We don't want to try... you know, we're not going to try to get into, like, hey...
F
I mean, we've done something similar, but not for regular sampling, but for, like, the specific debug sampling that people were abusing, so we rate limit those. But the reason we didn't do it for the regular sampling is because we extract some extrapolations from the statistics we get from the overall traces, and there the probability of sampling is actually very important. So if you start rate limiting that, you cannot do extrapolations anymore, so yeah.
C
That's something we'd have to watch out for. We do that sort of thing in some cases as well. You know, the intent of the rate limits is just for sort of our own safety; we don't intend them to be things that are going to be hit under, like, normal operation, you know, and we're going to monitor them, right.
G
For us, it's also good for the case of, maybe not this, the whole new-tracing scenario case, but maybe I have this distributed request, and somewhere in that request we continue on because it didn't have instrumentation or something like that. And now maybe it's, like, a huge... like, of course everyone hits it and they want to add a ton of, like, information, and now their service has 10x the amount of data pumping out than it was before, and this gives us, like, a good way of understanding that.
H
One of ours is here right now; he was working on more of a sampling-based approach where we dynamically changed the rates based on how much they're outputting. I mean, we have both, as you said, because of the debug override that we allowed in the system, so, like, we still need that case. But yeah, I mean, changing the probability is a nice way of doing it, if you can; it seems like it's pretty static right now.
G
Or size, size is a better indicator. We've got such diversity in what traces look like, and especially when you're, I mean, not in the case where it's a new tracing scenario, but updating an existing one, like an update to an existing sort of thing that a ton of people hit, it's really hard to understand, like, is this going to make us fall over immediately.