From YouTube: OpenShift Commons Briefing #82: Distributed Tracing with Jaeger & Prometheus on Kubernetes
Description
Jaeger was inspired by Dapper and OpenZipkin and is a distributed tracing system released as open source by Uber Technologies.
It can be used for monitoring microservice-based architectures:
* Distributed context propagation
* Distributed transaction monitoring
* Root cause analysis
* Service dependency analysis
* Performance / latency optimization
In this briefing, Uber’s Yuri Shkuro and Red Hat’s Gary Brown, both core contributors to the Jaeger project, will give an introduction to using Jaeger with Prometheus on Kubernetes.
Find out more here: https://github.com/uber/jaeger/blob/master/README.md
A: During their presentation, you can ask questions in the chat, and we'll have an open, live Q&A at the end. All of this is being recorded, so don't try to scribble notes fast; there will be links at the end to the references for all of the stuff we're talking about. And with that, I'm going to let Yuri take it away and introduce himself. I'm looking forward to the discussion afterwards.
B: I'm also a member of the specification council for OpenTracing. Today what I'm going to talk about is really to demonstrate why OpenTracing, and tracing in general, is a big deal in the microservices world. I will do that with an intro into what distributed tracing is, because some people might not know exactly what it is; I will also show you a demo of an example application to really demonstrate why it's useful, and that's pretty much it. So basically, the way I tend to think about what distributed tracing is, is this.
B: It is a new way of monitoring for microservices. So we can ask: why do we need a new way? Why don't the old ways work? To answer that question, I want to show you a rendering, an artist's view of microservices versus a monolithic application. With microservices, the biggest difference, obviously, is that the pieces of the previously big application are now individual pieces that work independently of each other. So when we were monitoring a monolithic application, we would put some probe on it.
B: We took that synchronous picture and split it across many different process boundaries, and the old approach is broken. So what we really want to see here, when we monitor that system, is to be able to track a single request as it goes not only between multiple threads but across multiple process boundaries, and that's what distributed tracing really provides. It is the ability to trace a single transaction throughout your architecture, across process boundaries, threads, continuations, asynchronous calls, and all of these things.
And conceptually, the way it works is fairly straightforward. There is a concept of context propagation, where we say: if we have a microservices architecture with, say, five microservices, and the first service receives a request, we create a unique ID for that request and we stick it in a so-called context, which is like a virtual container associated with that request. That context is then propagated, by whatever means, through every single call downstream as part of processing that request.
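The context-propagation idea described above can be sketched in a few lines. This is only an illustration; real tracers such as Jaeger define their own context format and carry it in RPC headers, and the header name below is made up:

```python
# Sketch of context propagation: the first service mints a unique ID,
# stores it in a context that travels with the request, and every
# downstream call receives the same context.

HEADER = "x-request-id"  # hypothetical header name, not Jaeger's real one

def frontend(call_downstream):
    """First service: create the unique ID and propagate it downstream."""
    context = {HEADER: "req-1234"}   # the 'virtual container' for the request
    return call_downstream(context)  # propagated on every downstream call

def backend(context):
    """Downstream service: sees the same ID with no app-level plumbing."""
    return "backend saw " + context[HEADER]

print(frontend(backend))  # backend saw req-1234
```

In a real system the context crosses process boundaries via RPC metadata (for example, HTTP headers), but the shape of the flow is the same.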
B
As
far
as
pressing
that
request,
and
when
we
do
that,
it
allows
us
to
stitch
together
all
those
independent
pieces
of
execution
across
the
call
graph
and
build
a
timeline
of
that
same
request
where
we
can
see
well.
The
whole
request
took
that
much
time
and
it
sure
is
a
the
answer.
Is
a
culture
is
B
B
equals
C
and
D,
etc,
and
so
that
that
U
is
the
typical
view
that
tracing
systems
provide
based
on
the
tracking
of
the
requests
that
they
have.
B: So, but why should we care? Why is it a good idea to actually do these things? Now I'm going to jump into the demo. I will base the demo on Jaeger as an open source tracing system, and inside that repository there's an application called HotROD, which is a sample microservices application that I will be using here. So first I want to start the Jaeger backend; here I am in the Jaeger GitHub repo, and this is our main repository.
B
So
I
can
start
the
kind
of
Jaeger
back
and
a
single
with
a
single
command,
though,
and
I'll
just
give
it
a
second.
So
one
thing
it
shows
here
is
that
it
started
Jaeger
query
service
at
this
port
right,
so
that's
the
Jaeger
UI
that
we'll
be
using
later
and
then
I'm
gonna
again.
This
is
a
Sima
same
repository,
but
a
subdirectory
examples.
B
Hot
rod
So I can start this application as well, and I want to pay attention to the logs here, because it's starting a whole bunch of services: a route service, the frontend, a customer service, and a driver service. Just by looking at these logs, we can get a sense that this is apparently a microservices-based application, because it's starting a whole bunch of things; and the frontend is obviously the entry point.
B: So let's go to the UI of that application; I can make it a bit bigger, like this. Just as a quick intro, this sample application is like a mock rides-on-demand thing, where you have these customers, and you click a button, and the backend finds the car which is closest to that customer and says: OK, the car will arrive in two minutes, and it gives you the license plate number.
B: These are, likely, New York license plate numbers. It also gives a few things that will be useful later in the demo. One thing is that when I load this application, there is this client request ID, which is just a stable session ID on my page; if I reload the page, I'll get a new ID. There is also an ID for every request made from this application.
B: We can actually see what happens within the frontend service, which called three downstream services, and two of those apparently called some storage backends, like Redis and MySQL. We also see the counts of how many calls were made: just for that single web request there were apparently some 25 or 27 RPC calls within this microservices-based application. So that kind of gives us an architectural overview of the application.
B: But it doesn't tell us what the actual workflow and the data flow were, like which service was called first and how long it took. For that, we can go back to the main page of the Jaeger UI, and because the services emitted tracing data to the backend, we already have this information; for example, all the services are present and known to the UI. So if we search for a trace, we see that this is the one trace that was emitted by the system, and it says that there are some 50 spans in it.
B: Jaeger measured the duration from the backend's point of view; the UI saw some network delay between the UI and the backend, which is responsible for the difference. So when I go to that trace, I now see the picture that I showed on a slide before. It's a timeline view of the trace, which means that this axis is time, and every horizontal bar represents a unit of work performed by a certain service.
B: In particular, we can see from the top that the very first request was to the frontend service, to an endpoint called dispatch. Then, if you go down the parent-child relationship, we can see that this frontend service called the customer service at a customer endpoint, and the customer service did some MySQL operation; then the frontend called the driver service, and the driver service made a whole bunch of other calls, apparently to Redis.
B: First FindDriverIDs, and then a whole bunch of GetDriver requests, apparently to retrieve driver information. Some of them, we can see, failed; they're marked by the exclamation point, so they took longer, but most of them succeeded. And finally, in the end, after the driver calls, the frontend did a whole bunch of requests to the route service. So again, we don't really know the business logic yet,
B: but at least we see the data flow of this application. Then, once all these route requests were executed, in the end the frontend produced the result, which the UI later displayed. So this is a very simple walkthrough of the workflow of the application: just by looking at a single trace, it gives us a lot of context about what happened among the handful of microservices in this application. Now, a bit more detail about this trace.
B: Distributed tracing allows you not only to see that information, but also to drill down into the individual pieces, into every span; and again, a span is just a unit of work within the application which is instrumented with a particular kind of annotation. So we can, for example, expand that MySQL span, and you can see what's in there.
B: We can see the actual SQL statement that was executed; we also see the request ID from the UI (remember that request ID, this guy); and we also see some logs associated with that span. This information helps you if there's an error. In particular, let's look at the error case, where we see a Redis call failed.
B: If we drill down into that, we can see right in the logs that apparently it was a timeout on Redis, which is what caused that request to fail; and then the backend, sorry, the driver service, retried it with another request. So this is, again, just a quick walkthrough of the capabilities of the tracing system; this is very common functionality.
B: However, we still don't quite know the actual business logic within this application. For example, why did the frontend call the customer service? To understand that, we can actually turn again to logging and try to understand the behavior of the application based on the logs. But before we do that in a trace, let's take a look at the logs here, in the terminal, so I'll scroll; and this is what one single request looks like.
B: I find this very difficult to actually follow, with the exception traces and everything; and remember that we only did one single request so far. If this was a real production service and it was serving many requests per second, these logs would be a complete mess: everything would be interleaved, and there would be no way to tell what is actually happening, what the logic of the application is. So instead of looking at the logs in the terminal, we can look at the logs in the tracing system.
B: Specifically, if we look at the frontend service, the very top span, we can see that it has 17 logs, and if we expand that, we see all those requests that we saw in the standard output, kind of the same logs, but they are now very contextualized: I only see the logs from this particular span. Other spans, like the MySQL one, had their own logs; the Redis calls had their own logs. In the log output,
B
They
would
be
all
mixed
up
here,
I'm
only
seeing
what's
relevant
to
the
Spence.
That's
what
we
call
contextualized
Logan
that
tracing
provides
its
kind
of
allows
you
to
narrow
down
the
behavior
of
a
particular
execution
very
closely
and
by
looking
at
the
lot,
we
can
now
actually
understand
the
actual
actual
business
logic
that
the
application
is
doing
so
once
it
received
the
request,
it
says:
I'm
gonna
load
the
customer
information
by
customer
ID,
which
was
sent
by
the
UI.
Then
I'm
gonna
find
the
nearest
drivers
to
that
custom.
B: So again, the main point here is that the logs are contextualized to every individual span, and they're not mixed up with anything else. Also notice that the UI shows both logs and tags; this is a standard feature in OpenTracing. Tags are really the things that you want to assign to the whole span, kind of a description of the span. For example, a tag says I'm calling the MySQL service, and this span's kind is that I'm a client of the MySQL server side; whereas logs are really things with a timestamp.
B: So if you emit something at a point in time, then it's a log; otherwise, if it's a descriptor of the whole span, an attribute of the span, then it's a tag. This is the standard terminology in OpenTracing.
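The tag/log distinction can be sketched with a toy span object. This is not the real OpenTracing API, just an illustration of the semantics (the tag names `span.kind` and `peer.service` are standard OpenTracing tags):

```python
import time

# Toy span illustrating the OpenTracing distinction: tags describe the
# whole span; logs are timestamped events that happen within it.
class Span:
    def __init__(self, operation):
        self.operation = operation
        self.tags = {}    # attributes of the span as a whole
        self.logs = []    # (timestamp, fields) point-in-time events

    def set_tag(self, key, value):
        self.tags[key] = value

    def log(self, **fields):
        self.logs.append((time.time(), fields))

span = Span("SQL SELECT")
span.set_tag("span.kind", "client")      # true for the span's whole duration
span.set_tag("peer.service", "mysql")
span.log(event="query sent")             # happened at one instant
span.log(event="rows received", rows=42)
print(len(span.logs), span.tags["span.kind"])  # 2 client
```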
And finally, the last, though not the least important, thing about tracing is that we can see the overall latency of this request and what was on the critical path; this request took basically 750 milliseconds.
B: You can see that the MySQL query took over 300 milliseconds, so that's something for you to look into. The next thing you can see is that the loading of the drivers took another 200 milliseconds, and by looking at this staircase pattern in the trace, we can understand that all these drivers were requested from Redis sequentially.
So there's another potential optimization for this application: maybe we could just call them all in parallel and reduce that to just a few milliseconds instead of 200.
B: And finally, the requests to the route service; this is interesting, because we see that they are actually concurrent. We see a whole bunch of concurrent requests, but they're not all concurrent: in fact, there are at most three concurrent requests going to the route service, and as soon as one of them stopped, another one started.
B
This
stopped
another
one
started,
so
it
looks
like
there
is
some
executor
pool,
which
is
bounded
by
the
three
threads
and
that's
like
so.
The
parallelism
of
this
of
this
whole
segment
of
the
trace
is
limited
by
three
and
so
again,
potentially
another
optimization
point
to
improve
the
application
latency.
B: So now let's see how this application actually performs if we start doing a lot more requests, if I start clicking many times here. We can see that the latency is starting to climb: essentially, the more requests, the longer it takes. And notice that the request ID keeps incrementing, as I mentioned before.
So how can we use tracing to investigate this? I'm going to pick this driver ID, the license plate ID, and then try to search for a trace with this ID; tracing allows you to do that. I think the tag is driver ID.
B: Let me check the syntax; OK, looking at this span, it says driver equals the license plate, so I can search by the tag driver with that value. So now I get this trace, and we see it's the one that was actually very long, almost 2 seconds; this one is saying 1.82, close enough. When we look at this trace, immediately we see: oh, MySQL is taking an enormous amount of time here, 1.4 seconds, so clearly there is something wrong with this application.
B: There is some bottleneck. Let's actually use the logging feature of the tracing: if we jump into the logs, we can see, oh, this request was actually blocked behind four other transactions, and it was waiting almost a second until it acquired the lock and was allowed to proceed to query MySQL. What that means in practice: this is obviously a mock application, but what it simulates is a real environment where you only have one connection to the database instead of using a connection pool.
B: You could go and look for those other requests and see which one was actually the longest and caused all this queueing; the UI allows you to do that. But what's interesting is that if we look at the customer service, there is this HTTP request that was executed, and it says nothing about a request ID; it only says, give me the customer information. So the request ID came all the way from the frontend, from the JavaScript UI, but it wasn't passed as a request parameter to this service.
B: So how did this guy know about all these transaction IDs? The answer is that it's another feature of the OpenTracing API called baggage. Remember we talked about context propagation: tracing uses context propagation to pass around the trace ID, but context propagation itself is a more general concept.
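Baggage can be sketched as arbitrary key/value pairs riding on the same propagated context. This is an illustration of the idea, not the real OpenTracing API:

```python
# Sketch of baggage: the same context that carries the trace ID can
# carry arbitrary key/value pairs to every downstream service, without
# changing any request signatures along the way.
def frontend(call_downstream):
    context = {"trace_id": "abc123", "baggage": {"session": "sess-42"}}
    return call_downstream(context)   # session is never an explicit parameter

def customer_service(context):
    # The HTTP request itself says nothing about the session ID, yet
    # the service can read it from the propagated context.
    return context["baggage"]["session"]

print(frontend(customer_service))  # sess-42
```

The key property is that only the top of the call graph sets the value; every service below reads it for free.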
B: So the work is spread out and not blocked on this one transaction; and the actual transaction to the database is simulated by this sleep statement, which has a certain delay. Just for demonstration purposes, I want to go and reduce that delay, to make it a bit shorter, and see how this small change really affects the behavior of the application.
B: OK, so we started it again and reloaded the page; note my session ID changed now. And so again I do a whole bunch of requests. What we see now is that the latency is still climbing above the first one, but it's not as dramatic as it used to be; it doesn't go to two seconds. So if I take the longest trace again and try to search for it...
B: The call to the driver service is still the same 200 milliseconds, because I really haven't optimized that part, but notice how this segment changed: remember, we used to see three at a time, but instead we now sometimes see one, sometimes even less than one request being executed. So my whole request is being blocked, as we can see in the minimap.
B: I swear my laptop is usually faster; it's the video that's slowing it down. OK, so we got this application restarted and reloaded again, and now, because I optimized a whole bunch of stuff in this pool, I have to click really, really fast to actually get any sort of latency. You see, I request a ton, but they return immediately, and they're all way shorter than before. So let me take the longest one, just to see what is actually happening.
B: OK, notice that there are a lot more errors; I don't know why, that is interesting. The errors are actually random, so it's kind of surprising why there are more of them. But we can actually see the impact of that last change: we have ten drivers being requested, or ten routes being requested from the route service, and now we can see that they all execute in parallel, because we have essentially removed the contention on the resource pool, on the thread pool.
B: So again, I hope this is a demonstration of the tracing functionality, and of how tracing can help you quickly narrow down what the problems are in individual components of your architecture, in individual services, and how you can try to optimize them by looking at the relationships between the calls and the critical path.
B: It's an exercise, if you want to do it yourself. OK, so the final thing that I want to show here: I mentioned baggage, and I want to show another use case for it. This application actually emits a whole bunch of metrics, so if I go to this other port that the application exposes, we can see a whole bunch of metrics emitted. Some of them are, I think, before I search for... oh.
B: I actually don't have metrics from the tracer itself, so it's probably not configured; normally the tracer itself emits metrics about how many spans it starts or stops. Instead, what's configured here is the RPC metrics. We can see that all the services and all their endpoints are actually being measured by Jaeger and emitted as metrics; tracing in general does heavy sampling of the requests.
B: But what I really wanted to show here is this part. Notice that this is a metric which says how much time the route service calculation, the route service, spent, in seconds, on behalf of an individual customer, or on behalf of an individual web session; and remember that my web session ID is this one. So, well, it's kind of nice.
B: The route service really cares only about where we start and where we drop off; it just takes two coordinates, that's all it needs. And yet it is able to produce these metrics broken down by customer and by session ID, which are identifiers that are only available at the very top of the application. Essentially, the frontend service knows them, but it doesn't pass them explicitly to the route service.
B: And finally, one other thing that I want to go over in this presentation: I hope that you like this functionality and you think tracing is great. So how difficult is it actually to instrument an application to get all this data? The answer is, it's actually not that hard, and in fact, if we look at the source code for this application, there is surprisingly very little explicit instrumentation for tracing.
B: The reason for that is that the OpenTracing API is an open source API that any framework can use to instrument itself, in particular any RPC framework. And as a result, if we look at the source code for any of the services, say the frontend service,
we see that there is just one mention of OpenTracing for instrumentation, which really just creates a wrapper around the server; and once that's done, all the requests through it are automatically traced, and you don't need to do anything special. Similarly, there is another service here, I forget which one; I think it's the route service. Actually, no, maybe it's the driver service, so, the driver server.
B: Yes, the driver server is not based on HTTP; it uses TChannel, which is another open source RPC framework, and that framework is itself instrumented with OpenTracing. So what we can see in the code is that when I'm creating this new channel, the only thing I'm passing it is the tracer, and that's it; there's no more instrumentation anywhere in this service to actually enable tracing.
B: In fact, if we look at the handler, the function which is being called by the server, there is no mention of OpenTracing anywhere; it just gets a context object, which is the common way for tracing to propagate data inside the application, and the tracing happens behind the scenes automatically. Again, because OpenTracing is an open API that anyone can use, if you are writing your own RPC framework, or you're writing, I don't know, a Redis driver in a particular language,
you can write OpenTracing instrumentation either into your driver directly, or provide a wrapper, which is what happens with HTTP: there are, in the opentracing-contrib space, wrappers for the standard libraries which allow you to wrap HTTP clients and servers and not really worry about tracing. However, if you do want to trace explicitly, obviously OpenTracing allows you to do that, and there are examples in this application, like Redis, for example. This is not a real Redis.
B: This is a simulation of Redis, and so, to simulate that we're making some sort of RPC request, there is explicit OpenTracing instrumentation: we say, OK, start a new span here representing a call to Redis, and we're saying that this is an RPC-client kind of span, the tag that we've seen in the tracing example. And this is really the only place in this code where OpenTracing instrumentation is done explicitly, simply because there is no real Redis server.
B: The other thing to look at is the logging. Zap is a logging framework which allows structured logging: rather than formatting a string with a formatter, you provide key-value pairs explicitly, and it's a lot more efficient in Go; there are no memory allocations and so on. However, the really important difference here from normal logging is this part.
B: So, instead of just calling logger.Info: if we did that, then we wouldn't be able to associate logs with the actual context, because they would just go to standard out. This is just a little trick in this application, where the logger isn't really the normal logger; it's a wrapper around the logger which lets you do either: you can get a background logger, which doesn't require a context and can log your standard application lifecycle messages, or you can have something that is request-specific, as in this case.
B: It's obviously scoped to this particular request, find nearest car, so we get a different type of logger for that context. And as soon as we do that, there is some magic, which you can look at in the source code to see how it actually works, where the same log is written both to standard out and into your tracing span, and that's why I was able to show it in the UI.
B: When a log is associated with a span, you get contextualized logging, versus just a standard cloud of messages. So let me check... oh yes, that's the end; just the very final point: OpenTracing doesn't bind you to any particular tracing implementation. Here we used Jaeger, but if we look at how tracing is actually initialized, this is the only single place in this whole application which is specific to Jaeger: it imports and configures the tracer from the Jaeger client, this one, I guess, yeah.
B: We can see the Jaeger client; that's the only place where anything is actually specific to Jaeger. We instantiate the Jaeger tracer, and from that point on, the rest of the application is not aware that it has anything to do with Jaeger. If you want to swap it for Zipkin, or for LightStep, or for any other OpenTracing-compliant tracer, this is the place to do it, and it will work just as well.
B: Your UI will be different, obviously, but the actual instrumentation doesn't need to change. So that, I think, is the end of my demo; let me see... yes. So, as a recap, what we've done: I've shown that the instrumentation itself is pretty much off the shelf, and I didn't have to change a lot of stuff in my application; I can swap in another tracer, so there is vendor neutrality to the whole OpenTracing API; and tracing allows you to monitor transactions across multiple microservices, process boundaries, and different threads as well.
B: We can do things like latency measurement and latency optimization, finding critical paths, and analyzing the root cause of errors or delays in the execution. We can get very highly contextualized logging with tracing. We talked about baggage propagation and how it's a very powerful technique; in fact, at Uber we have a number of projects which are built strictly on top of baggage propagation.
B: They really don't even have anything to do with tracing, but they rely on the Jaeger instrumentation because they need baggage propagation. And I showed the RPC metrics quickly, but that's something that Gary will talk about more in the next session. Just a few words about Jaeger: Jaeger is a distributed tracing system; we open sourced it in April this year. It's OpenTracing inside, so it was built with OpenTracing from the beginning, and it can be used as a drop-in replacement for Zipkin.
C: I'll try to get through this demo really quickly, and thanks for the demo, Yuri. What I'm going to do is just show how we can use an open tracing system like Jaeger, but also capture application metrics, integrate with something like Prometheus, and have that all running on OpenShift. This example also runs on Kubernetes, and there's a GitHub repository located here where you can find the example and the instructions for running on both.
C: There's also a blog on the Red Hat developer program that explains how to run this on Kubernetes. There's a GitHub organization called jaegertracing where you can find the templates for deploying Jaeger onto Kubernetes and OpenShift, and, as I mentioned, the Java metrics component that decorates the tracer can be found in this organization here, in this repo.
OK. So what I've done is I've already deployed the example: there's the account manager and the order manager, and I'm using the Prometheus Operator, which is an extension project for Prometheus.
C: It is able to identify services that have been deployed, and, if there are multiple instances of the services, to update the Prometheus configuration to scrape the metrics from those services; and, of course, we've got Jaeger deployed as well. Before viewing the demo, I'll just quickly go through the application. As I said, this is a Spring Boot application; the main application itself, as you can see, has no tracing-specific code added here, for the account manager, and the same goes for the controller, the REST endpoint itself.
C: The metrics are being reported using a servlet which is exposed at this endpoint here, and the tracing configuration is basically using a component of the OpenTracing project called the TracerResolver. In this case, what we're doing is obtaining the tracer based on configuration information; so, in the same way that Yuri pointed out,
you just change the code in one place; and if you're using the TracerResolver, and the tracing implementation supports it, then it can be done without any code change at all. But in this case, we're decorating the tracer before it gets returned, using this component here, with a Prometheus metrics reporter. With Prometheus, the metrics are reported with a set of labels.
C: So, in the standard way, what we're doing is using labels to represent things like the service name, the operation, and various other fields that can then be used to categorize the metrics; but through this mechanism you can also customize and add your own labels as well. In this case, what I'm doing is adding a label from a baggage item.
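For illustration, metrics of the kind being described would appear in Prometheus's exposition format roughly like this; the metric name, label names, and values here are illustrative (matching the labels mentioned in the demo), not the component's exact output:

```
span_count{service="order-manager",operation="sell",span_kind="server",transaction="sale"} 7
span_count{service="account-manager",operation="getAccount",span_kind="client",transaction="sale"} 7
```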
C: So this is using the mechanism that we talked about, where application-specific information can be propagated with the tracing context through the chain of services being invoked. What this one is doing is adding a transaction label; this could be a business transaction, and the second parameter is just a default value.
C: So in this case, if a baggage item with that name hasn't been provided, we just use this value. The other thing, in terms of the tracer configuration, is that we need to tell it to ignore the REST endpoint /metrics, which is used to scrape the Prometheus metrics. As for the order manager, this one is slightly different: the application itself again has no tracing-specific code, but the controller does inject the tracer, and this is purely to be able to set the baggage item.
C: OK, so this is the UI for this particular application. You can see there are some transactions: we've got the order manager, which has the buy and sell endpoints, and that's invoking the account manager; so that's a simple invocation. And let's see, this one's showing an example of an error.
So if I look at the account manager, and if I have a look at the logs, you can see that a failed-to-find-account error has been reported. But because Yuri has done an in-depth demo of Jaeger, what I'm going to do is focus more on Prometheus.
C: This is using the Prometheus user interface, and I've set up some queries already. This first one is focusing on a metric called span count; that's just the number of spans that have been created at a particular point in the business process. If we have a look down here, you can see that there's a metric that's created for the operation sell, in the service order-manager, with span kind server.
C
So that's the server endpoint for that operation and that service. There are a number of labels that we're ignoring to simplify the information, so that, for example, you could view information based on pod, instance, job, namespace, the transaction label that we added ourselves, and also errors. But at the moment we're just aggregating, ignoring those particular fields.
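A query of the kind being described, assuming a metric named `span_count` with `service`, `operation`, `span_kind`, `transaction`, and `pod` labels as in the demo (the exact names are an assumption; check the metric names your tracer decorator actually emits), might look roughly like this in the Prometheus expression browser:

```promql
# Spans per second for the "sell" operation of the order manager, server side,
# summing away the labels we want to ignore (pod, instance, job, ...):
sum(rate(span_count{service="ordermanager", operation="sell",
                    span_kind="server"}[1m]))

# The same metric cut by the business-transaction baggage label instead,
# which slices across all services participating in that transaction:
sum(rate(span_count{transaction="sell"}[1m])) by (service, operation)
```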
C
So say we were interested in the "sell" transaction: what this does is cut through all of the services and focus only on the metrics being reported for that transaction type. So, for example, if you wanted to find out what the bottlenecks were within a particular business transaction, this would be a good way to focus in on that. And similarly, if you're interested in what's executing in a particular part of your infrastructure,
C
you can focus on the pods. Because the pod label also comes with a service name, that's quite useful, as you can see what services are running on that particular pod. But again, it helps you to locate whether there are particular problems in your infrastructure. And then, finally, I've got a graph that's basically looking at the error ratio for the different services.
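An error-ratio graph like the one described could be expressed along these lines, again assuming a `span_count` metric with `service` and `error` labels (hypothetical names, matching the demo's conventions):

```promql
# Fraction of spans marked as errors, per service, over the last five minutes:
sum(rate(span_count{error="true"}[5m])) by (service)
  / sum(rate(span_count[5m])) by (service)
```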
C
Okay, so that's just a quick demo. Just to recap: this demo is primarily to demonstrate the integration of the OpenTracing technologies with something like Prometheus for capturing application metrics, but within the context of a Kubernetes or OpenShift environment, where you can also implicitly capture information about where those services are running.
A
And do show these last slides; we have the resources slides up there. And then I just want to say: really, thank you for this. This has been wonderful, to see the interplay of all these different open-source projects and how they all interrelate, and there are a lot of them in here. This has been a very good way to showcase lots of different things.
B
We probably lost people, I tend to go pretty fast, so I just want to mention a few links here. For the OpenTracing project itself, it's opentracing.io, and then there is a Gitter chat room: if you have questions or want to discuss things, this is the link. And then this is the link for Jaeger, for the main repository.
B
We also have a chat room for questions and so on, and the demos that we've given actually have blog posts that essentially describe what's happening. In particular, HotROD has a very detailed walkthrough blog post that talks about much the same thing that I talked about, but with more examples and at a slower pace, obviously. And Gary's blog post that he showed is also here. So if people want to check them out later, and actually go to the repositories and look at the code, these are the links. All right.
A
B
Yeah, I can answer that; definitely an expert asking that. So sampling is trace-based: once the trace is sampled, it is essentially sampled throughout the whole architecture. And it is head-based, so the sampling decision is made at the very beginning, when the trace ID is generated the first time. That's the only way for us to actually ensure consistent sampling across all microservices. Having said that, we actually have various works in progress that are trying to add other ways of sampling things.
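The head-based, trace-scoped decision just described can be sketched as follows. This is an illustrative model, not Jaeger's actual implementation: the decision is made exactly once, when the trace ID is first generated, and the resulting flag is propagated so every downstream service agrees with it.

```python
import random

# Hypothetical sketch of head-based sampling: decide once at the root,
# propagate the flag, never re-decide downstream.

def start_trace(probability):
    trace_id = random.getrandbits(64)
    sampled = random.random() < probability  # decided exactly once, at the head
    return {"trace_id": trace_id, "sampled": sampled}

def join_trace(incoming):
    # Downstream services honour the propagated flag rather than rolling
    # their own dice, which is what keeps sampling consistent per trace.
    return {"trace_id": incoming["trace_id"], "sampled": incoming["sampled"]}

root = start_trace(probability=0.001)
child = join_trace(root)
assert child["sampled"] == root["sampled"]  # consistent across the call graph
```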
A
B
So this is a very interesting and very detailed question, if we really want to go into it. The short answer is yes and no, because the actual performance impact cannot be measured in isolation, just based on the tracing itself. It really has to be measured within a particular service, with a particular traffic pattern, because it's highly dependent on those. At Uber, at least, we usually run tracing with a fairly low sampling rate, because we have very high-volume, very high traffic, and so, because of the very low sampling rate,
B
our performance impact from tracing is completely negligible; there's nothing to speak of. But if you crank the sampling rate up much higher, then you will definitely start seeing some performance impact. However, the reason that question is actually very difficult to answer is that the performance impact is itself very hard to measure, because it's not just how much CPU time or CPU load you add to the service; there are all kinds of other implications, like how much memory pressure you create.
B
How much throughput is affected: span collection happens in the critical path of the application, of the requests themselves, but trace reporting happens in the background, and that background work is somewhat expensive if you sample a lot of data. So that starts affecting your application throughput and latency, and that's why you really have to try it out. I mean, with a low sampling rate, you're not going to have any performance impact.
B
A
B
So there are two parts here. Adaptive sampling, first of all, solves the problem of having very low-throughput endpoints, which would be affected if you have very low sampling rates in a tracer: some of your endpoints may be sampled, and some others may never be sampled, because they're just low QPS. So adaptive sampling takes care of that and guarantees a certain throughput of traces for every endpoint. And the second feature of adaptive sampling is... yeah.
B
One way is you can do that programmatically: OpenTracing has a standard tag called sampling.priority. If you set it on a span with a non-zero value, then it will be interpreted as a signal that you want to turn that trace into a debug trace, and it's going to be guaranteed sampled across the stack, and it also bypasses any down-sampling that may be happening at the collection layer. So that's one way. Or, if you don't want to do it programmatically, the Jaeger clients also support it.
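The sampling.priority escape hatch can be sketched like this. The tag name is the standard OpenTracing one mentioned above, but the span object here is a simplified stand-in, not the real client API:

```python
# Hypothetical sketch of the sampling.priority override.

class Span:
    def __init__(self, sampled=False):
        self.sampled = sampled
        self.tags = {}

    def set_tag(self, key, value):
        self.tags[key] = value
        # A non-zero sampling.priority forces the trace to be kept (a "debug"
        # trace), overriding the probabilistic decision made at the head.
        if key == "sampling.priority" and value > 0:
            self.sampled = True

span = Span(sampled=False)            # the head-based sampler said "drop"
span.set_tag("sampling.priority", 1)  # ...but we insist this trace is kept
print(span.sampled)                   # -> True
```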
B
I'm not sure what that means, but yeah. Basically, adaptive sampling works at the central collection tier, and it measures all the traffic that's coming from a particular endpoint of a particular service, and it has a target. If we say we want a hundred traces per second started by this endpoint, and we see a thousand, then we're going to reduce the sampling probability by ten times. So that's how it works. So I guess it's kind of like circuit breaking.
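The feedback loop just described can be sketched as a simple rate-proportional adjustment. This is an illustration of the idea, not Jaeger's actual adaptive-sampling code; the clamping bounds are an assumption:

```python
# Hypothetical sketch: scale the sampling probability by the ratio of the
# target trace rate to the observed trace rate, clamped to a sane range.
# With the talk's numbers (target 100/s, observed 1000/s) the probability
# is divided by ten.

def adapt_probability(current_prob, observed_rate, target_rate,
                      min_prob=1e-6, max_prob=1.0):
    if observed_rate == 0:
        return max_prob  # nothing coming through: open the sampler fully
    scaled = current_prob * (target_rate / observed_rate)
    return max(min_prob, min(max_prob, scaled))

print(round(adapt_probability(0.1, observed_rate=1000, target_rate=100), 6))
# -> 0.01
```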
A
C
I think there's potential for the OpenTracing model to support it, because it can handle multiple parent references, whereas the Dapper model is a single-parent approach. But I think more work may be required in the standard, maybe to define additional references with reference types. Yes.
B
Yeah, exactly. I mean, the references mechanism in OpenTracing does allow you to have multiple parents, but there hasn't been a lot of work put into that specifically; there's no reference type defined for that use case currently. But there are open issues, so if you want to provide an opinion, there's definitely an issue about that. And in another, similar situation, one of the related issues, when you want to link two different traces, you can also use the reference mechanism to link them.
A
So that is really all we have time for, and I really appreciate Yuri and Gary taking the time out today to do this. If you guys are interested in Jaeger or OpenTracing, as Yuri just mentioned, there are a lot of issues on the uber/jaeger GitHub repo that you can weigh in on and give feedback on.