From YouTube: Let’s Turn Tracing on its Head
Description
Tracing has been around for a while and has proven its value for debugging performance issues and common day-to-day scenarios. Users commonly query for particular traces and analyze a waterfall view of a single trace. As with logs, though, the value comes not from a single event but from the aggregation of many events, and from breaking those aggregates down over different dimensions or tags. In the case of traces, it is possible to construct rich topologies and navigate the aggregate data from those topologies. In this Kong Summit 2019 talk, Omnition software engineer Constance Caramanolis introduces a different approach to consuming distributed trace data and the full value that can be realized from it.
So that is me. I work at Omnition; I'm a software engineer working on tracing and OpenTelemetry, which we'll get to a little later. But what really got me passionate about observability was my time at Lyft. I am one of the original maintainers of and contributors to Envoy, alongside Matt, and the benefit of that is that it gave me a lot of experience operating microservices and teaching other engineers how to operate them. I was on call for Envoy and other services at Lyft, and on-call sucks.
Microservices are great, right? They allow us to deploy code independently, among other things. So instead of having one deploy schedule, maybe once a week, you can deploy your own component whenever you feel like it, and you can use whatever technology you want. There's a lot of flexibility, and it's really powerful for engineers and applications to develop quickly. The downside is that things start looking like this.
I wish I could use yesterday's slide from the keynote, the really nice graphic that looks like stars all connected. It does get that messy really quickly, and so, as great as it is to develop on this kind of platform, it gets really hard to actually debug. So the three main points we're trying to solve with tracing are the following.
How do you actually know, if you have a metric saying "I have this error," that whatever example you pull up, the log or the trace, is actually correlated to it? Metrics are great, but they don't give you the context of who to wake up. So say you're all using Envoy now: you at least have consistent metrics, but then errors propagate across the system. So how do you know the difference between a service propagating an error and the service that actually originated it? And then, what is the customer impact?
How do you actually determine if the change you want to make at two o'clock in the morning is worth it? Do you make the risky change if it's for something that's used maybe once every three days, or do you risk making a change that will fix a bug right away that's impacting 90 percent of your customers? Not a lot of observability tooling right now will tell you this, unless you've wired it up manually yourself.
So right now most of you are probably using metrics and logs. They're great; they've gotten us this far. But as I was saying, they're missing context. A metric gives you one data point in time, and the same goes for logs. Logs can be very rich for that one instance of a service, but then how do you connect them across multiple services in the microservice world?
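To illustrate what does connect events across services, here is a minimal sketch of trace-context propagation via the W3C `traceparent` header. It uses plain Python and made-up service code, not any specific tracing library.

```python
# Minimal sketch (no real tracing library): a trace id carried in a W3C
# `traceparent` header is what lets two services' events be joined later.
# Header format: version-traceid-spanid-flags, e.g. 00-<32 hex>-<16 hex>-01.
import secrets

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str) -> dict:
    _version, trace_id, span_id, flags = header.split("-")
    return {"trace_id": trace_id, "parent_span_id": span_id, "sampled": flags == "01"}

# Service A starts a trace and calls service B with the header attached.
trace_id = secrets.token_hex(16)   # 32 hex chars
span_a = secrets.token_hex(8)      # 16 hex chars
outgoing_headers = {"traceparent": make_traceparent(trace_id, span_a)}

# Service B extracts the context; its own spans and logs now share the same
# trace_id, which is what a tracing backend uses to stitch both services'
# events into one trace.
ctx = parse_traceparent(outgoing_headers["traceparent"])
print(ctx["trace_id"] == trace_id)  # True
```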
Well, this is how we're doing tracing now. One is that we pre-aggregate the data. What this really means is that we generate some metrics, some SLIs, off of the traces. But what ends up happening is that once you've created this data, and the traces themselves usually tend to be sampled, once you get rid of most of the traces that actually produced it, you can't pivot on the data in any different way. You're really stuck with one way of measuring your traces.
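A minimal sketch of that pre-aggregate-then-sample pattern, assuming hypothetical span dictionaries rather than any particular vendor's pipeline:

```python
# Sketch of the pre-aggregate-then-sample pattern described above, using
# hypothetical span dicts rather than any particular vendor's pipeline.
from collections import Counter

spans = [
    {"service": "checkout", "error": True,  "tags": {"region": "us-west-1"}},
    {"service": "checkout", "error": False, "tags": {"region": "us-east-1"}},
    {"service": "currency", "error": True,  "tags": {"region": "us-west-1"}},
]

# At ingest we keep only a per-service error counter (the SLI) ...
errors_by_service = Counter(s["service"] for s in spans if s["error"])

# ... and then drop most of the raw traces to save cost.
spans = spans[:1]  # stand-in for aggressive sampling

# The counter still answers "errors per service", but a new question such as
# "errors per region" can no longer be answered: that dimension was never
# aggregated, and the spans that carried it are gone.
print(errors_by_service)
```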
A
Most
people
who've
ever
interacted
with
traces.
Actually,
probably
maybe
most
you
haven't
right.
Cuz
tracing
is
usually
viewed
as
you
look
at
one
example,
so
you
always
look
at.
Like
oh
I
know,
I
have
an
arrow
the
service,
maybe
I'll,
pull
up
a
trace
today,
instead
of
a
log,
and
so
you
see
in
terms
of
what
the
entire
call
stack
is
for
the
service
or
through
the
application.
But that only gives you a lot of data points; it doesn't really help you figure out where to look. And the next thing is sampling. With tracing we usually keep maybe a fraction of a percent, one percent, two percent, and if you use traces the way you use logs, usually the one you want is the one you didn't capture. So there's a better way; I'm going to call it the Omnition way of doing it, and it addresses all three of the points I made.
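For reference, this is roughly what the conventional head-based sampling being critiqued looks like. The sketch assumes the OpenTelemetry Python SDK, which postdates this talk; the class names below come from that SDK, not from the talk itself.

```python
# Roughly what conventional head-based sampling looks like: keep ~1% of
# traces, decided up front. Assumes the (post-2019) OpenTelemetry Python SDK;
# the class names are from that SDK, not from the talk.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.01)))  # ~1%
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("demo")
with tracer.start_as_current_span("GetConversion"):
    # 99% of the time this span is non-recording and is never exported, and
    # the trace you end up wanting is usually one of those.
    pass
```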
A
There
is
a
much
better
way
to
do
tracing
so
one
instead
of
sampling.
There
is
actual
way
for
you
to
capture
all
your
traces
and
store
them
right.
No
longer
do
you
actually
have
to
generate
them
and
choose
which
ones
to
drop
most
of
the
actual
burden
and
the
cost
comes
from
creating
it
on
the
application
side,
the
cause
in
terms
of
storage
and
transferring
it
is
actually
not
as
bad
as
people
think
it
is.
A
Instead
of
doing
the
pre
aggregation,
if
you
determine
later
on
in
terms
of
what
you
how
you
want
to
group
the
data,
now,
you
actually
don't
have
such
rigid
things
in
terms
of
like
oh
I,
looked
at
only
these
type
of
aggregated
data,
but
I
won't
look
in
a
different
way.
Since
you
have
all
your
traces,
you
can
actually
go
back
and
look
at
everything
on
a
different
pivot.
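A minimal sketch of that "keep everything, aggregate later" idea, again with hypothetical span dictionaries; a real backend would run this as a query over stored traces:

```python
# Sketch of "keep everything, aggregate later": because the raw spans are
# retained, the same data can be re-grouped on any tag after the fact.
# Hypothetical span dicts; a real backend would run this as a query.
from collections import defaultdict

def error_rate_by(spans, tag_key):
    totals, errors = defaultdict(int), defaultdict(int)
    for s in spans:
        key = s["tags"].get(tag_key, "unknown")
        totals[key] += 1
        errors[key] += s["error"]
    return {k: errors[k] / totals[k] for k in totals}

spans = [
    {"service": "currency", "error": True,  "tags": {"env": "prod", "region": "us-west-1"}},
    {"service": "currency", "error": False, "tags": {"env": "prod", "region": "us-east-1"}},
    {"service": "currency", "error": True,  "tags": {"env": "dev",  "region": "us-west-1"}},
]

# The pivot is chosen at question time, not at ingest time.
print(error_rate_by(spans, "env"))
print(error_rate_by(spans, "region"))
```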
And you're no longer looking at it as a funnel. If you have all of your traces for an entire application, you can visually represent what a service looks like, what the application looks like in terms of service dependencies. No longer do you have to reason from metrics like "this is the upstream request to that service, which calls another one, which calls another one." You can see things really nicely. So what do I mean?
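Here is a small sketch of how a service-dependency view can be derived purely from trace data: every parent-to-child span relationship that crosses a service boundary becomes an edge. The span layout is hypothetical.

```python
# Sketch of deriving a service-dependency graph from trace data alone: every
# parent-to-child span relationship that crosses a service boundary is an
# edge. The span layout is hypothetical.
spans = {
    "1": {"service": "frontend", "parent": None},
    "2": {"service": "checkout", "parent": "1"},
    "3": {"service": "currency", "parent": "2"},
    "4": {"service": "currency", "parent": "2"},
}

edges = set()
for span in spans.values():
    parent = spans.get(span["parent"])
    if parent and parent["service"] != span["service"]:
        edges.add((parent["service"], span["service"]))

# {('frontend', 'checkout'), ('checkout', 'currency')}: an always-up-to-date
# dependency view, with no hand-maintained list of who calls whom.
print(edges)
```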
This is a happy application. This is a visual representation of a demo application, but it is an actual visual representation of a live application, built using tracing data. There's no assembling of metrics involved, like, say, knowing what the upstream metrics are because we're using Envoy.
Once you look at a trace, you know who you're talking to, and you can build this from that. So now, whenever you're onboarding new people, you don't have to say, "hey, last week we called service ABC, make sure that API hasn't changed." You can just look at this and say: okay, well, now I'm calling service B, I'm calling checkout, the currency service, and this view stays up to date.
So whenever applications change, and microservice dependencies do change, because as soon as you get into the microservice world it's really fun to change things and change where the APIs are pointing, whatever static list you maintain will be out of date within days or weeks. So, kind of like what pagers like to do, errors are going to start happening, right? We have all these services that like to page themselves, and in a legacy system, or what I'm going to call a legacy way of doing observability, here is what happens.
These errors right now are paged based off of the success rate. That's what the red boxes here are, and this is what an application, sorry, not the application, the services, are seeing in terms of errors: upstream requests returning 5xx. The thing about that is that we actually don't know which service is causing it, so you will be paging all four of these service teams to investigate. Now, before we get to how to make that better, let's actually look at a trace. This is a typical view of a trace.
It goes across all of the services involved, and what's really unique about a trace is that you can actually pinpoint where an error started. So instead of saying that all of these services shaded red here are paging, since you know that at the bottom, in the currency service, GetConversion is the first one to actually throw an error, you can differentiate between the root cause and the services that are merely propagating errors. So when it comes to paging, you can say: I'm only going to page the currency service and let everyone else sleep.
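A sketch of that root-cause rule: within a trace, a span that errored while none of its children errored is treated as the origin, and erroring ancestors are treated as propagation. The trace below is hypothetical, loosely modeled on the demo she shows.

```python
# Sketch of the root-cause rule described above: within one trace, a span
# that errored while none of its children errored is treated as the origin;
# erroring ancestors are merely propagating. Hypothetical spans.
spans = [
    {"id": "a", "parent": None, "service": "frontend", "error": True},
    {"id": "b", "parent": "a",  "service": "checkout", "error": True},
    {"id": "c", "parent": "b",  "service": "currency", "error": True},   # GetConversion
    {"id": "d", "parent": "b",  "service": "shipping", "error": False},
]

children = {s["id"]: [] for s in spans}
for s in spans:
    if s["parent"]:
        children[s["parent"]].append(s)

root_causes = {
    s["service"]
    for s in spans
    if s["error"] and not any(c["error"] for c in children[s["id"]])
}
propagating = {s["service"] for s in spans if s["error"]} - root_causes

print(root_causes)  # {'currency'}: page this team
print(propagating)  # {'frontend', 'checkout'}: let them sleep
```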
So let's look at that again. Here's where we actually calculate the root cause, the currency service causing the errors, so we're going to stop paging on errors and start paging on root cause. This actually helps us identify who to wake up in the middle of the night, and more generally who to page. This is something that is unique to tracing; metrics, unfortunately, don't give you the context you need to determine this, unless maybe you added some headers and did some special processing on top of them, which is kind of ugly.
So now that we know that the currency service is paging, we're going to debug it. What could be causing the issue? Whenever I'm debugging, I usually try to come up with a hypothesis. Right now I see that the GetConversion path is returning errors. My first guess might be: maybe someone messed up that API. So, all right.
Let's run with that guess: I'm going to look for another instance of the GetConversion API and see if there were any errors. Well, this one, unfortunately, doesn't have any errors, and so, as Obi-Wan likes to say, these are not the droids you're looking for. But this is a very common pattern in debugging: we're usually pulling at threads, trying to figure out what the error could be. You look at logs and you look at all the different parameters: version, you know, version, input.
So, as we were trying to come up with a hypothesis (ignore the error text in the middle; it's supposed to be more to my right, but apparently PowerPoint decided to change that), what we're going to do now is this: since you have full fidelity, all of the traces, when you're breaking things down you can break down any service, or any part of your application, on these tags. So what I did here is try to find out: maybe it's an environment.
Maybe someone released something in development or staging or whatever, and it's something particular to that environment that's causing issues. But when I break down the currency service by environment, I see solid red everywhere. Here, solid red is bad and shaded red is not so bad; we'll get to that in a bit, but it's very much a visual cue for "do I know whether this is going wrong or not?" So we know that all of these environments are experiencing errors, which means this is not what I want to look into.
So, next one: what about region, right? That's another thing we can usually look into. Okay, so I know that us-east-1 has no issues, so I'm going to eliminate that from where I'm trying to debug. us-west-1 has some issues. The next thing I can do is break it down further within there. What I'm doing here is breaking down by instance, and I see that it's one particular instance that is returning all the errors, originating all the errors.
What's really cool about this is that you're just breaking things down. This is obviously a very static screenshot, but you're just looking at the service and at the different tags you've included on the trace, and these are just three or four clicks. Compare that to before: at least when I've done debugging using logs, I'd come up with a hypothesis, write a query for it, and then see what the results are. And these are just arbitrary tags; there's nothing special about instance or region.
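Reusing the hypothetical error_rate_by helper and span list from the earlier sketch, each of those "clicks" is just a different tag key; any tag you put on your spans (instance, user, version) can be broken down the same way.

```python
# Each breakdown is the same query with a different tag key; a tag a span
# doesn't carry simply falls into an "unknown" bucket.
for tag_key in ("env", "region", "instance"):
    print(tag_key, error_rate_by(spans, tag_key))
```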
They're just what's very common for most of us, what we think of as the most common culprits, so you could just as well break down by input, by user, anything like that. So now that we know how to find the root cause, let's talk about propagating this across the system. We know here that the currency service was the root cause of the errors, right?
So we know that. Now, the rest of the application graph you see here is actually scoped to anything that has a dependency on the currency service, or anything that interacts with it upstream or downstream. We want to find out: do the other services have any impact? So the checkout service is next, and what we actually see here is that it has no root cause. From this we're able to say that the checkout service is actually propagating errors across the system, rather than being the one causing them.
So once again, we don't need to wake these people up. Now, what about the front end, the most important one (depending on who you are, but for most people)? Is it originating errors? No, it isn't. And so with tracing, drilling in again, tracing can actually tell you the difference. The front end will be paging because it is seeing a lot of errors, but you don't actually need to page the front-end team anymore; same thing with the checkout service.
Now, more importantly: customer impact. How do you actually know if what you're doing, or what's broken, impacts anyone? With tracing, what you can do is this: we have this concept called workflows, where you can capture sets of transactions and turn them into a business workflow. So for Lyft it was requesting a ride, you know, payments.
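A sketch of the workflow idea: tag each trace with the business flow it belongs to (request a ride, checkout, payments) and report impact per workflow rather than per service. The trace records here are hypothetical.

```python
# Sketch of the workflow idea: tag whole traces with the business flow they
# belong to, then report impact per workflow rather than per service.
# Hypothetical trace records.
from collections import defaultdict

traces = [
    {"workflow": "checkout", "has_error": True},
    {"workflow": "checkout", "has_error": False},
    {"workflow": "browse",   "has_error": False},
    {"workflow": "payments", "has_error": True},
]

totals, failed = defaultdict(int), defaultdict(int)
for t in traces:
    totals[t["workflow"]] += 1
    failed[t["workflow"]] += t["has_error"]

for wf in totals:
    # "X% of the requests going through this workflow are hitting errors."
    print(wf, f"{100 * failed[wf] / totals[wf]:.0f}% failing")
```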
Then you can say that the 20% of your customers who are using this path are actually experiencing errors. Why this matters is that, at least in my experience, and from getting to talk to a lot of companies through my time with Envoy, it's always really hard to connect what a service four hops down does, and whether its bugs are actually impacting your customers. Tracing can help you do this.
"Can you update the library, please? I know this PR has been open for three weeks, but please just change this library version," or, you know, "replace this one thing here": that alone doesn't make things a lot better. What's also really cool about this is that it is actually being done by really big companies.
B: So you said we can ingest all the traces, but let's say I'm doing a million RPS, or 10 million RPS, for a service. The amount of trace data you get is huge, and network cost in the cloud is non-trivial; it adds up fairly quickly. We could have, like, a petabyte of traces a day very easily, yeah.
A: Billions of requests per second and all that, sure, but a lot of people actually aren't at that size. And part of it is, you know, especially if you do get to maybe a million spans per second or something like that, then depending on what your costs are, maybe you don't save all of them, but don't do one percent either; find something that's a little better. So it is definitely partly to challenge the idea that you need to sample right away; it depends on what your constraints are. I can't share the math on what it is, and I know I can't give you the whiteboard proof to show you, but it is actually not as bad as people think it is. When you play around with it, if you, say, try 10%, you'll actually be kind of surprised. Okay, follow-up.
B: So you said you can do the root cause analysis, but which microservice is actually causing it? What happens sometimes is that even unrelated microservices start throwing errors because one of them is throwing errors, right, and you can't really trace that with tracing. So even if, suppose, service A is throwing errors, and it does not even talk to service B, because service B is throwing errors, service A sometimes starts throwing errors too; so that correlation sometimes gets hard to, yeah.
A: I guess it depends on how strict you are; I feel that's a little bit about people relaxing the rules. I will say that I'm very traditional in my approach to forwarding errors. So if you see one, any form of error that you deem an error, 503s or whatever, don't rewrite it to a 200. Once you see a 503, you can either have retry logic for handling and fixing it, which might be a bit more of a corner case, but at least in my experience, most people, once they have an error, tend to just flush it through. It is something we can definitely adjust for and talk about, but it's mostly the common case, yeah. It also depends how strict you are: if you're bending the rules, don't do that; propagate errors the right way.
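As a sketch of that "don't rewrite a 503 to a 200" stance, here is how the failure could be recorded on the span so that root-cause analysis can see it, using the OpenTelemetry Python API (which postdates this talk); the handler and upstream call are made up.

```python
# Sketch of the "don't rewrite a 503 to a 200" stance, using the (post-2019)
# OpenTelemetry Python API; the handler and upstream call are made up.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("demo")

def call_upstream():
    return 503  # stand-in for a failing dependency

with tracer.start_as_current_span("GetConversion") as span:
    status_code = call_upstream()
    span.set_attribute("http.status_code", status_code)
    if status_code >= 500:
        # Record the failure on the span so root-cause analysis can see it,
        # and surface the same status to the caller (or retry) instead of
        # swallowing it.
        span.set_status(Status(StatusCode.ERROR, f"upstream returned {status_code}"))
```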
A: I think that's more dependent on how you want to use your traces. Oh yeah, the question, right: someone was asking me what my opinions are on baggage. I don't know if that's the right term, but some other tracing standards have provided the ability to add extra information, metadata or what have you, to your traces. I think it really depends on how you use it. From what I've seen, it hasn't been necessary, but then I've only seen the Lyft environment and a few other customer environments, and I haven't seen that need there. That doesn't mean I'm right; it just means that's been my experience.