From YouTube: CNCF Webinar Series - Introducing Jaeger 1.0
Description
This webinar will demonstrate how Jaeger can be used to solve a variety of observability problems, including distributed transaction monitoring, root cause analysis, performance optimization, service dependency analysis, and distributed context propagation. We will discuss the features released in Jaeger 1.0, its architecture, deployment options, integrations with other CNCF projects, and the roadmap.
Join us for KubeCon + CloudNativeCon in Barcelona May 20 - 23, Shanghai June 24 - 26, and San Diego November 18 - 21! Learn more at https://kubecon.io. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy and all of the other CNCF-hosted projects.
Right, thank you. So we will be introducing Jaeger, the distributed tracing system and a member of CNCF. Here is the agenda: I will briefly talk about what distributed tracing is; I will show you a demo of Jaeger with a sample application; we'll look under the hood in terms of architecture and some of the technical details of the system; I will talk about what is in the 1.0 release, what features, etc.; I'll briefly talk about the future work that we plan, and about the project governance and setup; and we'll also have time for Q&A. If your questions are not answered during the presentation, we'll have plenty of time at the end to discuss, so don't worry about it.
Just a brief intro about myself: I am an engineer at Uber and, as Mike mentioned, I am working on the observability team.
So it's very typical for us to get a trace which contains maybe, I don't know, 200 service calls within a single transaction, and that happens billions of times a day, because every second your app is telling something to the backend — where the car is going, what the road conditions are, etc. So, as the engineers, if we want to keep this complex system operational and monitor it, how do we do that, right?
However, these tools were designed in the days when distributed systems, and especially systems built on microservices, weren't that widespread, and so these systems have a problem with actually dealing with the complexity introduced by microservices. This is an example that I recently ran into: I was running some Go program and it crashed with this one line — there's no stack trace or anything, which is actually very unusual for Go, to crash in this way.
These tools tell us something about one individual instance in this whole graph, but they don't tell us the context of the event that we observe, right. And so, rather than monitoring every instance of every service separately, which is what logs and metrics tooling does today — that's like debugging your whole application without a single stack trace anywhere — what we really want to monitor are the distributed transactions that transpire within that system and involve multiple services in one request. And that's what distributed tracing systems do.
So yeah, basically, as the request traverses our architecture and multiple services, we keep passing the context around, which contains the unique ID, or trace ID, and we keep building the time sequence diagram on the right side. That diagram also captures things like the causality of individual requests — specifically, that the call to the E service came from A and not from something else, and that it came after the B, C, and D service requests had already finished. So we can build this understanding of what happens within the request.
So let me make this full screen. Can you see the monitor with the terminal? Yes. So I'm here in the Jaeger repository, and I'm going to run two applications. One is the standalone version of Jaeger, which includes all of the backend components in one binary, just so it's easy to run; it even also includes the UI. So when I run it, we will see that it is starting a number of services. For example, it says it's starting jaeger-collector; it's also starting jaeger-query; there's also an agent somewhere. We will talk about all of these when I get to the architecture.
Anyway, this is running. So if I go back to the website, the Jaeger UI — I can load this UI from that process. So that's accomplished. And the other application, which is also included in the Jaeger repository under examples/hotrod, is a demo application that illustrates the features of OpenTracing and of Jaeger as well. So when I start that one, one thing I want to point out is that it also contains multiple services.
Even though it's one binary, you could actually start these services independently if you wanted — I'm passing "all" here, but they could be started one at a time — and so there are a number of services within this app. They all talk to each other using remote calls over the network. The application also has a front end, and it looks like this; so now, just a quick intro to this front end.
So what we have here is basically a ride-on-demand mock application: you click a button, and the backend says "I'm dispatching a car"; it finds the closest car, with the license plate; it says when it's arriving; and it gives us some debugging information, like the unique request ID and how long this request took on the backend. I will come back to this web session ID later — there are interesting uses for that.
So the first thing I want to ask myself when I'm looking at the application is: what is the architecture of that application, right? And since we're talking about tracing as a monitoring tool, I want to use the tools, rather than going in and talking to whoever created that application, to understand it. And so, by executing one single request...
This standalone Jaeger keeps the data in in-memory storage rather than any persistent store. That's why, when it restarted, the data was lost — and that's what I wanted, actually. So, again, executing one single request and then going to this diagram — and now we get this; it was just laid out incorrectly the previous time. So, by observing the behavior of that application through its traces, we already got an idea of its architecture.
You can see that there is a front-end service, there's the route service, customer, driver, and the customer service talks to the MySQL database, right. So we haven't needed to go into the details of the application to understand any of this. However, it still doesn't tell us how the application itself works, what the logic within it is. We could look at the logs of this application — there are lots of logs, as usual — and this is the typical problem with logs: they seem like a good idea...
So here I have a few traces already captured by the system. Some of them are just from the front-end, I think; if we look at this one — oh, this is just loading the front page of the service. But here is the interesting one: this one is actually the request to the application itself that produced the ride for us, and that's what we're interested in. And so, when we look at that, first of all we see this.
This is the sequence diagram that I was showing in the slide before, and it shows us what's going on within this microservice-based application. We can see, for example, that the front-end made a call to customer, which made a call to MySQL; then the front-end made a call to the driver service; then it made a bunch of calls to the route service.
So already we are getting some idea of the actual business logic. But in this application we can also drill down into the logs, which are captured within the trace. These logs are slightly different from what we see in the standard output, because they are attached to individual spans within the trace, right. So this top-level span is really the request that came to the top application, to the dispatch endpoint, whereas the other ones represent individual operations done by the services — and so, for example, this one doesn't have any logs.
If I go to the MySQL query, then I do have logs — like acquiring some lock, with a transaction. I'm also seeing, in the tags of the span — I can navigate and say: okay, this is the SQL query that was executed by the operation, captured automatically by the tracing instrumentation. So what does this give us, and why is it different from logs? And there is one big difference.
The difference is that, even though this tool can also show you logs, these logs are contextualized to the individual operations in the service. So you are no longer looking at a bunch of aggregated log statements across all your services in no particular order; you are looking at the very specific sequence of events, which is much easier to understand. It tells us exactly what happened within this step of the transaction processing, and if we follow it, we will see the exact same thing that I described.
What I described was: getting a customer, finding the nearest drivers, and then finding the shortest route to the nearest driver. So that's the kind of understanding of what the application is doing that a distributed trace gives us. But really, as a monitoring tool, we also want to see: well, what are the problems within my application? Let me do some other requests.
The main timeline here shows each service, represented by a span for an operation in that service, and it gives the latency of that individual operation; and the hierarchy on the left provides the causality — which operation caused which other operations. So we can see, for example, that the call from the front-end to the driver service was a find-nearest operation, but that operation itself consisted of more calls.
First it finds all driver IDs within a radius, apparently, and then makes a bunch of calls to get each driver from some implementation on Redis. And so we can also see the performance profile of this request as it happened in the architecture, right. We can see, for example, that the SQL query took almost thirty, forty percent of the request. So if you were to optimize this thing — let's say our user-visible latency suddenly went up.
This is something that is very easy to see, even in the trace: well, this is definitely a problem; we need to dive into it and understand why this SQL statement takes so long. Another problem we can see — again, without really diving into a lot of details, just looking at the time sequence diagram — is that the call to find the nearest drivers first finds all the driver IDs, and then, for each driver, we go and query the individual driver ID, apparently getting some information with the location of the driver.
We see some spans marked in red: those are basically Redis timeouts, and we can see that information in the logs. So again, they also contribute to the latency — disproportionately, actually, compared to a successful request. And the last, final part of the Timeline view that I want to point out: let's look at the last segment of the trace, where the front-end calls the route service. Remember, the business logic here is that we got all the drivers and their locations.
Then, for every location, we ask: from that location to my customer, what's the shortest route? And once we get all ten routes, we pick the shortest one and say: okay, this is the driver we want to send to the customer. But the behavior we can see here is that there are these three parallel requests going on, apparently, to the route service — which is the good news, I mean, there is some parallelism.
What I was trying to show is that, with these individual requests — if you have many concurrent requests to the system, then you're not even going to get three at a time executed; sometimes nothing is executed at all, because whatever executor pool the system is using is limited to three, and that's what's throttling this whole thing. But actually, since we have time, I can show you a way to hack into this application and fix some of the performance issues.
Take the database, right: the database implementation just mocks the MySQL statement, and this is what we see here. We can see that there is actually a lock taken, just to simulate a misconfigured connection pool, right. So if we take out this lock, we will unblock the parallelism in this particular step. So I go back, I restart the application, and I execute a bunch of requests concurrently — start again, so let's do that.
So it still takes a while in some cases — let's look at a longer one — but now we see that the MySQL statement no longer blocks the overall request, even though the overall latency still increases the more requests are going on. And in fact, the reason this latency increases is, again, this part, plus this part: as I mentioned, the sequential execution of the Redis calls really needs to be fixed by using some sort of a thread pool, or a bounded number of concurrent requests.
There is another thing I can fix: the MySQL delay is actually simulated, so I can go and change that delay from 300 milliseconds, if we want to make it even faster — we pretend that we just fixed the performance issue with the MySQL storage in some way, right. And so, if we again go and execute a bunch of requests...
One trace is kind of the same as before, but this is what I was mentioning earlier: remember, we used to have a concurrency here of three at a time; now we don't even get one at a time, because the multiple requests all contend for the same thread pool, which is depleted. And so, again, the point of this exercise is that I'm not doing any deep analysis of the application — no profiling, no looking at the function calls — I'm looking at just the timeline.
Yuri, is it a good time for a question? Yeah, it could be the time. So Alex AG asks — he wants you to break the app, basically. I asked him to be more specific, and he said he'd like to see if Jaeger can catch exceptions: for example, if the database server is not reachable, he'd like to see that in the UI. Is that possible?
So, I mean, I guess this application is not really written like that — there's no separate database service here — but I think this is an example of a request which tries to go to the (pretend) database and fails because it's not available at the time. And so what happens really — I mean, it also depends on how the application is written.
This application is written in such a way that it's tolerant of Redis timeouts, right — it just retries the same operation — but it does log to the span, and it says there was an error. So if that's what you're investigating, you can even search for these errors: it marks the span, and it will show me the traces which contain an error, right. So that's how, I would say, you would find something being unavailable.
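The error marking described here follows the OpenTracing convention of an `error=true` span tag plus a log event, which is what lets the UI search for failed traces. Sketched below with a toy span type rather than the real OpenTracing API; the type and helper names are made up for the sketch.

```go
package main

// span is a toy stand-in for a tracing span, holding tags and log events.
type span struct {
	tags map[string]interface{}
	logs []string
}

func newSpan() *span { return &span{tags: map[string]interface{}{}} }

// recordError follows the OpenTracing convention: set the error=true tag
// on the span and attach the message as a span log event.
func recordError(s *span, msg string) {
	s.tags["error"] = true
	s.logs = append(s.logs, msg)
}

// hasError is the kind of predicate a query service can apply to let you
// filter for traces that contain errors.
func hasError(s *span) bool {
	v, ok := s.tags["error"].(bool)
	return ok && v
}
```

Because the tag lives on the span, the failure stays attached to the exact operation that timed out, instead of being a loose line in a shared log stream.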
Another thing: if there's a stack trace — currently Jaeger doesn't really capture stack traces in full fidelity. With the instrumentation you can log a stack trace into the trace, and it will show up in the logs, but the formatting is not going to be super great. This is something that we can work on in the future; we want to take some lessons from others there.
Good question. So it is possible to instrument Node.js servers; the browser JavaScript version is currently being worked on. We originally released only the Node.js version of the Jaeger client, so that one works fine for servers. For JavaScript on the front end, we just need to do some work on that Node client so that it can be compatible with browser JavaScript.
All right, so, last thing — oh, maybe not the last thing, actually. So one thing I just did: I reverted my changes to reintroduce the SQL bottleneck, and what I want to do is send multiple requests to this again. We see the latency climbing. So if I go and search for these traces again — I want to pick one of the long ones; this is probably the longest, right — and I want to look at the log of this thing.
So this is an interesting log. Obviously — I remember that in the source code there was this log statement, and it's not a normal Go log: it's an instrumented logger which takes a context, and the reason it takes a context is because the context is where the trace information is stored and propagated through the application. So it's a slight modification of the logger, but what it does is it's able to tell us not only the message itself.
However, what's interesting about this is — so it happens within the customer service, right; the customer service is what calls MySQL. If you look at the URL of the request to that customer service, it looks like this: it says, give me the customer with customer ID one-two-three. That's fine, but there's no information here about the transaction ID. So how does this service — which is like a database level below the customer service, and well below the front-end — suddenly know this transaction ID, which was available only in the front end up here?
So we have these session IDs, generated every time: when I reload the page I get a random, unique session ID, which is sticky for the HTML page, and then each request is made unique with that number — and that's what I'm seeing in the log, seeing these IDs here, right. So, first of all, why is this important? It's important because I can actually investigate with it.
Okay, I was waiting on the lock, on some contention — really, on the resource. So maybe I'm doing some large work, but most likely I'm just waiting; most likely it's the other guys who are doing their work, and that's why I'm waiting in the queue, right. And so I can go and find those transactions and investigate them.
What happened to them — why did they take so long? So that's one reason why it's important to have this. But the important part is that this whole transaction ID is really not even available in the API of either of these services: the customer service doesn't get it in the URL, and the MySQL service doesn't get it anywhere, because nothing passes that information here — but it still knows about it. And that's a feature not just of Jaeger; it's a feature of OpenTracing in general. It's called baggage.
So, if you remember the slide with the services, I said we propagate the context throughout the call graph, right. Well, I said we pass a unique ID within that context, but that's not the only thing we can pass. We can essentially pass any arbitrary key-value pairs and make them available throughout the whole call graph, just by using this transparent context propagation mechanism — and one of those key-value pairs, in my example, is this transaction ID, which is created by the front-end application.
It sticks it in the request, in the context, and that context becomes available to every single node within the application, right. This is a super powerful feature, because not only does it allow you to do things like debugging this lock contention, it also enables a lot of other things — maybe I'll speak to that a bit later. And just to illustrate the same point with another example...
By the way, it's called baggage — I haven't mentioned that yet — so this key-value pair within the context is called baggage, because you really carry it with your request as an extra payload. But take the route service. The route service, again, conceptually, is just a function — well, it's a service — which says: given two locations, find the shortest route and give me back some information about it. So again, it doesn't know anything about the customer, and it doesn't know anything about this transaction ID that was somewhere at the top, because it doesn't care.
However, if we look at the metrics emitted by this particular service — I have this expvar page, which is a functionality that allows me to have a web page showing the metrics. This route service emits these two metrics, where it says: I spent this many seconds on behalf of this customer, and this many seconds on behalf of this session ID. So, magic again: how does the route service know anything about the customer or the transaction ID, if it didn't have that as part of its API, right?
The answer is that it got them through the baggage, and it's able to calculate essentially the resource consumption — how much of this service's time is attributed to a session, or to a piece of data which is only available at the very top level, where the request entered the system, right. So at Uber we're using this for various things; I know that Google, for example, has also used this for quite a while.
One other thing I can show you is to dive a bit more into the metrics — and this is not necessarily a feature of the Jaeger backend; it's more a feature of the Jaeger client libraries. So again, looking at this output, this debug web page, I can see that there are a bunch of metrics which look like metrics from my service, right.
So there's the service name, the metric name — HTTP requests — so how many requests were received, at which endpoint (like the /customer endpoint), and for each status code, right. So this is a counter, and you can get the same exact information in the form of Prometheus metrics if you want to — it's just a configuration switch in the application.
It also counts errors and successes, and it does that for every service. So it's not so unusual to have these metrics; you can get them in many other ways. What is unusual is that the application itself is actually not emitting these metrics: none of the services, the way they are written, are emitting the metrics, and neither are the RPC frameworks that they use. What's emitting these metrics is actually the Jaeger client — because, if you think about what the OpenTracing API is, it's a way of describing your transactions within an application, right?
It happens to be called tracing, and it's used primarily for tracing, but it doesn't have to be. You could implement an OpenTracing tracer simply by emitting metrics from the trace and doing nothing else, right — essentially a wrapper. What the Jaeger tracer contains is an extra feature which says: oh, and by the way, if you ask, it can emit metrics for you automatically — because, really, instrumentation with OpenTracing is a superset of normal metrics instrumentation. You already count how long each request took, and in terms of labels you see how many errors there were and how many requests in total, so it's very easy to emit these metrics from trace instrumentation. So if you use a Jaeger client, you're going to get them for free if you enable this.
And since we're on this topic of metrics — this page was from port 8083; this is the HotRod application itself. Like I said, it's not currently configured to send data in the Prometheus format, but if you look at its help, it says there is a metrics option: you can say expvar or prometheus. So far we were looking at the expvar output, and I can switch it to Prometheus. But the Jaeger backend itself — this one that I was running — is configured by default with Prometheus metrics, and this is what we get; this is the port where the UI is running, and the whole backend.
So it gives me similar things — the same kind of RPC metrics — but it also gives me a whole bunch of metrics about Jaeger itself, about the individual Jaeger components: the agent, the jaeger-query service, the downstream collector, etc. So you don't have to do anything: if your Prometheus is running, you just point it at this URL, and you can build charts and alerts, and that's probably how you would monitor Jaeger. I think this is all I have in terms of the demo; I can pause here for another set of questions. Great.
So I don't have an opinion about Dynatrace versus AppDynamics — I haven't used the products myself. Well, these specific ones — I don't recall them coming out with support for OpenTracing, but a lot of other vendors did come out with support for the OpenTracing standard. What that means is that if, let's say, you have this HotRod application, right — it's currently sending traces to Jaeger.
But if we look at the source code of this application, there's only one single place in it that actually binds it to Jaeger. If you want to bind it to any other vendor — vendors like Instana, Datadog, New Relic all came out with support for OpenTracing — then if you want to send traces to those vendors instead, you can easily do that, usually by changing one single line within this application, and get similar behavior within their UIs.
As for the other one mentioned — I have not looked into that one. I know that it's open source, but it's also a vendor; they also support OpenTracing, so again, yeah, you can use that as well — Jaeger is just yet another option. I don't know how Jaeger actually compares with it. Thanks.
Not through agents as such — well, there is actually a generic OpenTracing Java agent, which allows instrumenting various things, so you could use that; again, it can work with any tracer. PHP: Jaeger does have a PHP client library currently under development; it's not official, it's still a community contribution at this point.
Okay, so the way distributed tracing works in general is that it can work over any protocol, as long as that protocol allows you to pass some sort of metadata, usually as key-value pairs, right. In fact, this particular application is using a custom protocol, TChannel, that Uber developed a long time ago.
We are kind of moving away from it, but it is custom — it's a binary format — yet it does support key-value pairs as part of the request, and so tracing just works. And so any other custom protocol which allows some sort of metadata to be attached to the request — those protocols can be traced. For example, Cassandra has a proprietary protocol, but it can still be traced with OpenTracing. Thanks.
So the Istio service mesh currently can work with Jaeger — you can even find example talks about doing this. So, yes, Istio is really independent. The only thing I should say about service meshes is that they are not magic bullets for tracing — you can maybe find my talk at CloudNativeCon in December where I talked about this.
The difficult part of distributed tracing is passing the context within the application. Passing the context between applications is actually the easy part — it just sticks it in some HTTP headers or something like that — but within the application you sometimes need to write your code in a careful way so that the context is not lost, right. And therefore, if you use a service mesh, the mesh can take care of all the things like: oh, I'm going to create spans on the server and on the client, do the causality, alter the headers, etc.
But if your application does not actually propagate those headers, then you don't get any tracing with service meshes, right — that's the gotcha that people may not realize. And so there is also another talk at CloudNativeCon where we show how you can get tracing with Istio alone, and with Istio plus OpenTracing instrumentation inside the application, and the second example shows how much richer the traces become. If you just put a service within the service mesh, the mesh can only give you very basic tracing.
So Jaeger is not meant to be a metrics system — I think Prometheus is really already doing all of those things; Jaeger is really about collecting the performance of transactions, rather than of individual pieces of the application. I'm just rereading this question — right, no. So, well, I guess one other thing that we do have plans for is to have some sort of custom metrics within the span, where you can...
Well, I mean, I see — so I guess I'm not sure what's meant by session storage. I mean, traces are definitely stored: in Cassandra you can have any retention period; we also store them in HDFS internally. So if you want to keep them for, roughly, a year, and then go and blame people — yeah, Jaeger supports persistent trace storage.
So I will have to hurry up now, because we spent a lot of time on questions, which is good. Let me go back to my presentation; I will try to speak fast and cover a bunch of stuff here. So what we've seen so far in the demo is that we can do distributed transaction monitoring, right — you can see how a transaction progressed through the architecture.
So, just a few words about Jaeger itself. We started it at Uber about two, two and a half years ago; it was open-sourced in April; it's been an official CNCF project since last year; and it has full OpenTracing support, including the client libraries and the backend. On the community side, we have about 10 full-time people working on it, both at Uber and Red Hat, and we have plenty of contributors beyond that. It is already used in production by companies. On the technology side, all backend components are implemented in Go.
As I mentioned, we have persistent storage backends: we officially support Cassandra and Elasticsearch. The example I ran here uses in-memory storage, so it's gone when you restart it. The web front-end is implemented in React and JavaScript, and the OpenTracing instrumentation libraries are available in the five languages shown here; plus, PHP and Ruby are in sort of a community development phase right now — the OpenTracing API for those is not finalized either. Yeah. Now, this might be interesting: the architecture.
You don't have to bring a lot of dependencies into the application, because the client is available in every language and we can keep the clients very lightweight. The Jaeger agent is the one that actually knows how to find the Jaeger collectors in the backend and communicate with them, and there's also a feedback loop that's used for adaptive sampling, which I'll talk about, on the collector side.
Collectors are fully stateless, so you can scale them any way you want, and the persistence really happens in big-data stores like Cassandra and Elasticsearch. People are experimenting also with InfluxDB and ScyllaDB; there's Amazon DynamoDB, I think, that someone asked for. We do not officially support those, but we're working on a plugin system where people can basically just contribute those as plugins and run them side by side with the main Jaeger binary. And then jaeger-query is another component.
It reads the traces from the database and formats them for the UI. And the data pipeline is not something that's in open source yet, but we're working on that; this is where all the aggregations happen, like building the dependency diagrams. I'll skip the data model — well, actually, I want to say a few words about sampling.
Sampling is a very important topic to understand. The amount of data that traces capture from transactions can sometimes actually exceed the business traffic itself, because it depends on how heavily instrumented your applications are and what you log and write into traces. So most tracing systems do not persist all the data in storage — that's just too expensive — and instead we sample it. However, sampling doesn't mean you can just randomly flip some spans into the storage and throw some away, because you want to sample consistently.
B
So if you sample one span of a trace, you want all the other spans within the trace to be sampled as well; otherwise you're just going to get garbage data as a trace. There are two techniques: head-based sampling, where you make a sampling decision right at the beginning of the trace and then have it respected by all the services, or you collect all the data first in some temporary storage, maybe in memory, and then you make the sampling decision at the end. Jaeger supports the first model; that's the classic Dapper model.
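The head-based model described above can be sketched in a few lines of Go. This is an illustration, not Jaeger's actual sampler API: the root service makes the decision exactly once (here by comparing the trace ID against a probability), and every downstream span inherits the propagated flag instead of re-deciding, which is what keeps a trace consistently kept or consistently dropped.

```go
package main

import "fmt"

// SpanContext carries the sampling decision along with the trace ID,
// so every service in the call chain respects the root's choice.
type SpanContext struct {
	TraceID uint64
	Sampled bool
}

// StartTrace makes the head-based sampling decision exactly once, at
// the root of the trace. Comparing the trace ID against a fraction of
// the ID space keeps the decision deterministic per trace.
func StartTrace(traceID uint64, rate float64) SpanContext {
	const max = ^uint64(0)
	sampled := float64(traceID) < rate*float64(max)
	return SpanContext{TraceID: traceID, Sampled: sampled}
}

// ChildSpan never re-decides: it inherits the propagated flag, which is
// what guarantees a trace is either fully sampled or fully dropped.
func ChildSpan(parent SpanContext) SpanContext {
	return SpanContext{TraceID: parent.TraceID, Sampled: parent.Sampled}
}

func main() {
	root := StartTrace(42, 1.0) // rate 1.0: always sampled
	child := ChildSpan(root)
	fmt.Println(root.Sampled, child.Sampled)
}
```

In a real system the `Sampled` flag travels in the RPC headers alongside the trace ID, so services never need to coordinate beyond propagating the context.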
B
There are vendors which do tail-based sampling (actually only one that I know of), but we are considering it for Jaeger as well. For our traffic at Uber it was a bit challenging to do full hundred-percent collection, even into temporary storage; that's why we haven't done it yet. So, moving on to what we actually released in 1.0: we officially released support for Cassandra and Elasticsearch, we made a bunch of improvements to the UI, and we enabled metrics.
B
We have the Kubernetes deployment templates, and there's also a Helm chart that people developed in open source, so it's actually pretty easy. There are the instrumentation libraries I mentioned, and then there is a backwards-compatibility layer with Zipkin: if you're already invested in Zipkin instrumentation, which is generally not OpenTracing compliant, you cannot really just go and switch the tracers to Jaeger, but you can still use those Zipkin libraries and just configure them to send data to the Jaeger backend, and we can accept Zipkin-formatted spans and represent them in the Jaeger data model.
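As a rough sketch of what that compatibility layer accepts, here is a minimal Zipkin-style JSON span payload built in Go. The field set is trimmed for illustration and the values are made up; the point is only that existing Zipkin clients emit a JSON shape like this, and the Jaeger backend can ingest it as-is.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ZipkinSpan mirrors a few fields of the Zipkin JSON span format.
// A payload like this, produced by existing Zipkin instrumentation,
// can be sent to Jaeger's Zipkin-compatible ingestion endpoint
// without changing the application. (Field set trimmed.)
type ZipkinSpan struct {
	TraceID   string `json:"traceId"`
	ID        string `json:"id"`
	Name      string `json:"name"`
	Timestamp int64  `json:"timestamp"` // microseconds since epoch
	Duration  int64  `json:"duration"`  // microseconds
}

// EncodeSpans marshals spans into the JSON array the Zipkin-style
// ingestion endpoint expects.
func EncodeSpans(spans []ZipkinSpan) (string, error) {
	b, err := json.Marshal(spans)
	return string(b), err
}

func main() {
	payload, _ := EncodeSpans([]ZipkinSpan{{
		TraceID: "463ac35c9f6413ad", ID: "463ac35c9f6413ad",
		Name: "get /api", Timestamp: 1510000000000000, Duration: 1500,
	}})
	fmt.Println(payload)
}
```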
B
So that is, as I mentioned, the storage format and the UI. One of the notable things is that we've spent a lot of time on performance, so that you can load very large traces; we've tried up to 80,000 spans. Rendering a DOM of that size in the browser is a bit challenging, but we made it work with some tricks.
B
I mentioned Zipkin. As for metrics, like I said, all the Jaeger components come up with Prometheus metrics by default, but you can switch to other backends; internally we actually have support for even more metrics backends (I think InfluxDB is currently already compiled in), but we're not focusing too much on that right now. Once we have the plugin system, we might support those better. Now, the roadmap is, I think, an interesting topic. Adaptive sampling is something that we already have running in production at Uber, but it's not open source
B
yet. The point here is that, as I mentioned, we do upfront sampling, and one of the challenges with that is that the sampling was usually configured as one probability per service. But if your service has multiple endpoints with different QPS, then that one probability is good for one endpoint and not so good for a low-QPS endpoint, and vice versa. So adaptive sampling actually breaks it apart to be per endpoint. And the second reason why it is called
B
adaptive is that the backend actually keeps track of how much data it receives from every service and every endpoint, and feeds that information back to the client, saying: you should adjust the probability, because you're not sending enough data, or you're sending too much. So this allows us to control how much data is getting into the Jaeger backend from all Uber services, for example. The data pipeline is our biggest focus for this year.
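The feedback loop just described can be sketched as a toy per-endpoint adjustment function. This is an illustration of the idea, not Uber's actual algorithm: the backend compares the rate of sampled traces it observed for one endpoint against a target rate, and scales that endpoint's probability accordingly, probing upward when a quiet endpoint sends nothing.

```go
package main

import "fmt"

// AdjustProbability is a toy version of the adaptive-sampling feedback
// loop: given the current sampling probability and the rate of sampled
// traces the backend actually observed for one endpoint, it returns a
// new probability steering the endpoint toward the target rate.
// (Illustrative only; not Jaeger's real algorithm.)
func AdjustProbability(current, observedRate, targetRate float64) float64 {
	if observedRate <= 0 {
		// No data seen: probe upward so low-QPS endpoints still get traces.
		return clamp(current * 2)
	}
	// Scale proportionally: observing twice the target rate halves p.
	return clamp(current * targetRate / observedRate)
}

// clamp keeps the probability in a sane range.
func clamp(p float64) float64 {
	if p > 1 {
		return 1
	}
	if p < 0.0001 {
		return 0.0001
	}
	return p
}

func main() {
	// A hot endpoint sampling 10x the target gets its probability cut;
	// a quiet endpoint that sent nothing gets probed at a higher rate.
	fmt.Println(AdjustProbability(0.1, 100, 10)) // shrinks
	fmt.Println(AdjustProbability(0.1, 0, 10))   // grows
}
```

The key design point from the talk is that this decision lives in the backend, which sees aggregate traffic, and is pushed down to the clients via the agent's feedback channel.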
B
It really is. So far I've been showing you examples in the demo where I look at one trace at a time, and that's useful provided that you can find that trace; but at Uber we're getting several billion traces a day, and there's no way anyone can actually look at all of them. We would, by and large, be in a state of not being able to find what we want to find. So aggregations come into play here, and we can actually do data mining and say: okay,
B
we see this kind of problem, maybe a long latency tail on this particular service, and these are the sample traces, so go look at them. That's a much more viable approach to investigating performance than trying to find those individual traces using just the search that you have in the Jaeger UI. And some of the examples of this:
B
So this is another rendering, pretty, but you could actually get this picture with just network sniffing, because all it really does is measure how services talk to each other; these are pairwise connections. What it doesn't show you is the deeper connections, and that's what we have here. So in this example, let's pick these three services (these are made-up service names): the dingo service makes a call to shrink, so it depends on that service, and then shrink makes a call to dog.
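The distinction between pairwise edges and the deeper question that follows ("does dingo ultimately depend on dog?") is just transitive reachability over the recorded call edges. A minimal sketch, using the talk's made-up service names:

```go
package main

import "fmt"

// DependsOn reports whether src transitively depends on dst, given the
// pairwise call edges a dependency diagram records. It is a plain BFS:
// pairwise edges answer the one-hop question, while a path search like
// this answers the full-depth one.
func DependsOn(edges map[string][]string, src, dst string) bool {
	seen := map[string]bool{src: true}
	queue := []string{src}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		if cur == dst {
			return true
		}
		for _, next := range edges[cur] {
			if !seen[next] {
				seen[next] = true
				queue = append(queue, next)
			}
		}
	}
	return false
}

func main() {
	// Made-up services from the talk: dingo calls shrink, shrink calls dog.
	edges := map[string][]string{
		"dingo":  {"shrink"},
		"shrink": {"dog"},
	}
	fmt.Println(DependsOn(edges, "dingo", "dog")) // true: dingo -> shrink -> dog
	fmt.Println(DependsOn(edges, "dog", "dingo")) // false: no reverse path
}
```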
B
The question is whether the dingo service actually depends on dog or not. From the previous diagram you can't really tell that (I mean, even from this one you can), but with the tool we have been working on internally, you can actually just type a search string, dingo, and it will hide the other services, and you will see whether there is actually a path from dingo to dog or not. And that allows you to do, again, even deeper dependency
B
analysis of the services: saying, what's my SLA for this service, what's my latency SLA? For that, maybe I need to look at all my downstream services to figure it out, and you don't know what your downstreams really are from the previous diagram; you only know one level. But with the more path-based diagrams, we can actually show the complete, full-depth links. And (I actually have one minute left) another way is:
B
we can, again, find problematic traces by looking at latency histograms rather than searching for traces by tags. With what we've built here, it draws you a latency histogram for an endpoint (I don't show which one here), so I filtered this for one of our services and it shows me the diagram, and it's interesting because the distribution is multimodal.
B
Normally you would hope to have one hump, but here we have a very low-range distribution of latency, then there's another hump at slightly longer latencies, and then there is a much worse hump over here, where there is a comparatively really long tail. What this diagram allows you to do is drill down into that one and say: these are the actual traces that represent this long tail, and you can go and investigate them; and it also shows which upstream callers are responsible for most of those traces.
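That drill-down can be sketched as two small steps: bucket trace latencies into a histogram (enough to see the multimodal shape), then pull out the trace IDs beyond a tail threshold to investigate. A toy version, not the real UI's logic:

```go
package main

import "fmt"

// Trace is a minimal record: a trace ID and its observed latency in ms.
type Trace struct {
	ID        string
	LatencyMS float64
}

// Histogram buckets latencies into fixed-width bins; the resulting
// counts are enough to reveal a multimodal distribution.
func Histogram(traces []Trace, binWidthMS float64, bins int) []int {
	counts := make([]int, bins)
	for _, t := range traces {
		b := int(t.LatencyMS / binWidthMS)
		if b >= bins {
			b = bins - 1 // clamp the long tail into the last bin
		}
		counts[b]++
	}
	return counts
}

// TailTraces returns the traces slower than the threshold: the ones
// you would drill into when investigating the long tail.
func TailTraces(traces []Trace, thresholdMS float64) []Trace {
	var tail []Trace
	for _, t := range traces {
		if t.LatencyMS > thresholdMS {
			tail = append(tail, t)
		}
	}
	return tail
}

func main() {
	traces := []Trace{
		{"t1", 12}, {"t2", 15}, {"t3", 18}, // fast hump
		{"t4", 95}, {"t5", 110}, // slower hump
		{"t6", 900}, // the long tail
	}
	fmt.Println(Histogram(traces, 100, 5))
	for _, t := range TailTraces(traces, 500) {
		fmt.Println("investigate:", t.ID)
	}
}
```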
B
So again, this information can be rather useful. If you're interested in contributing to Jaeger, we have plenty of open issues that are very easy for you to attack, labeled as beginner-friendly or asking for documentation. We don't have any CLA; you just agree to the Developer Certificate of Origin, which is the Linux approach: you just need to sign your commits with the -s flag, which puts your name and email address in the commit, and then your work can go in.
B
And finally, this is just a reference page. If you want to get involved, or if you have any more questions, you can come to the chat room on Gitter, jaeger-tracing; there are always people hanging around there. So if you have more questions after the webinar, feel free to ask them there or on the mailing list. We also have a blog on Medium with a bunch of posts about Jaeger.
B
Is it possible to trace apps that use Kafka? Yes, especially with the latest Kafka: the protocol now supports generic metadata as part of the record, so it's possible to instrument that and trace all the message passing; there's really nothing preventing it. There might be some funny issues with the time scale, because most of the traces that happen in the RPC world are very short, whereas with Kafka you can consume a message a day later, so the UI might look a bit funny; we don't focus on that.
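The record metadata just mentioned is where trace context can travel between producer and consumer. A toy inject/extract pair over a plain header map sketches the idea; the header keys here are illustrative (Jaeger's own clients propagate context under a header named uber-trace-id), and real instrumentation would go through the tracer's Inject/Extract API rather than touching headers directly.

```go
package main

import "fmt"

// Message stands in for a Kafka record: Headers is the generic
// metadata slot where trace context can travel.
type Message struct {
	Headers map[string]string
	Value   []byte
}

// Inject writes the trace context into the message headers before the
// record is produced.
func Inject(msg *Message, traceID, spanID string) {
	if msg.Headers == nil {
		msg.Headers = map[string]string{}
	}
	msg.Headers["trace-id"] = traceID // illustrative key names
	msg.Headers["span-id"] = spanID
}

// Extract reads the context back on the consumer side, so the consumer
// span can join the producer's trace.
func Extract(msg *Message) (traceID, spanID string, ok bool) {
	traceID, ok1 := msg.Headers["trace-id"]
	spanID, ok2 := msg.Headers["span-id"]
	return traceID, spanID, ok1 && ok2
}

func main() {
	msg := Message{Value: []byte("order created")}
	Inject(&msg, "463ac35c9f6413ad", "a2fb4a1d1a96d312")
	if tid, sid, ok := Extract(&msg); ok {
		fmt.Println("consumer joins trace", tid, "with parent span", sid)
	}
}
```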
B
Okay, what is the price? Overhead is a question that is very difficult to answer, because it's completely dependent on your application, the workload, and the CPU capacity that you have allocated to the application. So you can essentially get any number, and so any number, I feel, is meaningless. But I can say that in production, the more traffic we get, the lower the sampling probability we use overall, and so with that there is almost no overhead, especially in the languages which are sort of efficient, like Java and Go.
B
Whereas, for example, in Python we use the Tornado framework; it's an event-loop-based framework, and to propagate the context we're using the Tornado-native StackContext. That actually adds a lot of overhead, but it's not really the Jaeger overhead; it's just the overhead of the framework propagating the context around.
A
That's all the time we have; we're going to stop now. Thank you very much, Yuri, for that fantastic webinar, and thank you to everyone in the chat for being very active with questions; that was good. If you want to keep an eye on upcoming events, I've just put the CNCF events link in there. That's all we've got time for now. Thank you very much, and goodbye.