From YouTube: Improve Deployment Velocity by Using Distributed Tracing and Lightstep to Understand...- Andrew Chee
Description
Sponsored Lightning Talk: Improve Deployment Velocity by Using Distributed Tracing and Lightstep to Understand and Debug Deployment Problems Quickly - Andrew Chee, Lightstep
Speakers: Andrew Chee
Modern software systems are becoming more complex. When problems happen during a deployment, it is very difficult to identify the actual root cause. See how distributed tracing and Lightstep's analytical capabilities help you quickly identify those problems so you can remediate deployment issues.
For more Continuous Delivery Foundation content, check out our blog: https://cd.foundation/blog/
As you see here, all of the services that report data to Lightstep are listed in the left-hand nav, and for each service we are able to monitor, using these distributed traces, certain key operations that occur inside the system. These tend to be the important transactions that occur in each service.
We are able to automatically detect when code gets deployed, and we can visually show you those particular deployments, as well as whether you have several versions of code running in production because of canary builds and the like, and we can compare the performance of those versions for you.
So as an example here today, you will see that I am looking at this inventory service within Lightstep. This is an e-commerce app that we use for demo purposes, and in this app one of the operations is to update the inventory.
So, for example, say a store is updating the inventory in its web store. As you can see here, a deployment marker tells us that a deployment happened, and soon after that, the response-time latency of that particular transaction on that particular service has gone up. So it's pretty easy to infer that the deployment may have introduced a problem here. But what is that problem?
And how do I go about fixing it? That is where Lightstep fits in. As we see here, the response time before the deployment was about 158 milliseconds at the p99, but after the deployment it's almost 1.2 seconds. Understanding the cause of this, which could be this service itself, upstream services, or downstream services, is very simple with Lightstep.
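The before-and-after p99 comparison described here can be sketched in a few lines of plain Python. This assumes nothing about Lightstep's internals; the latency samples are invented, chosen so that the post-deployment data shows a slow second mode like the one in the demo:

```python
# Toy sketch of a p99 latency comparison around a deployment marker.
# All samples are synthetic; the 120 ms / 1200 ms modes are invented
# to loosely mirror the 158 ms -> 1.2 s regression shown in the demo.
import random
import statistics

random.seed(7)

# Baseline: latencies (ms) observed before the deployment.
baseline = [random.gauss(120, 15) for _ in range(1000)]
# Regression: after the deployment, a slow second mode appears.
regression = baseline[:800] + [random.gauss(1200, 100) for _ in range(200)]

def p99(samples):
    """99th-percentile latency via stdlib quantiles (Python 3.8+)."""
    return statistics.quantiles(samples, n=100)[-1]

print(f"p99 before deploy: {p99(baseline):7.1f} ms")
print(f"p99 after deploy:  {p99(regression):7.1f} ms")
```

Grouping the same samples by a deployment or version tag instead of a time cutoff gives the canary-comparison view mentioned earlier.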
We simply click on that particular regression, and we can compare its performance to what happened before; in this case I'll choose an hour before. Because as problems occur, there are really two things we want to find out: why the problem occurred, and where it occurred.
We can see here that the yellow, which is the regression, has a second mode in its latency distribution, so something is going on here that the baseline did not have. But what is more important is that we capture every single request in your system, and we can actually follow those requests upstream and downstream.
Now, just from this very quick glance, we know what operation is being affected. And with Lightstep, because we're capturing these complete traces across all of your services, we can zoom in and ask which upstream services are affected by this problem. We see that this inventory service's write-cache operation is causing the latency.
We can then keep going up and see that this particular operation on this service is being called by the API gateway, which is in turn being called by the web and mobile front ends: Android, iOS, web, and so on. So, very quickly, in this distributed service architecture, where you have multiple services interacting together to serve requests to your customers, we can tell you not only where along these chains of services the problem is emanating from, but also which downstream and upstream dependencies may be affected by it.
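The upstream walk described here can be sketched as a small exercise over a toy trace: given spans that record their parent span, compute each span's self time (its duration minus the time spent in its direct children) and pick the biggest contributor. The span names loosely mirror the demo (API gateway, inventory service, write-cache), but every id and duration below is invented:

```python
# Toy trace: a chain of spans, each recording its parent span id.
# Ids, services, and durations are invented for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]   # None for the root span
    service: str
    operation: str
    duration_ms: float

trace = [
    Span("a", None, "web-frontend", "checkout", 1250.0),
    Span("b", "a", "api-gateway", "/inventory/update", 1220.0),
    Span("c", "b", "inventory-service", "update-inventory", 1200.0),
    Span("d", "c", "inventory-service", "write-cache", 1150.0),
]

def self_time(span, spans):
    """Span duration minus time spent in its direct children."""
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(s.duration_ms for s in children)

# The span with the largest self time is where latency emanates from.
culprit = max(trace, key=lambda s: self_time(s, trace))
print(f"latency emanates from: {culprit.service} / {culprit.operation}")
```

Walking child-to-parent links the other way yields the affected upstream callers (API gateway, then the front ends).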
All of this is done with just two clicks in Lightstep. So very quickly, we can tell that the problem we have in this update-inventory operation is actually coming from this write-cache operation in the system. The next question we want to ask is why.
With Lightstep, we are able to show which attributes may have changed from the baseline to the regression, which may help explain why this is happening. So, for example, as we can see here, some attributes have changed, and one very obvious one is that version 455 did not exist in the baseline but does exist in the regression data set. So now we can pretty clearly see that this particular regression was introduced as part of version 455.
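The baseline-versus-regression attribute comparison can be sketched in plain Python: collect the set of values each tag takes on either side, then flag values that appear only in the regression. The tag names and values here are invented stand-ins, not real Lightstep data:

```python
# Toy attribute diff between baseline and regression span sets.
# Tag names and values are invented for illustration.
baseline_spans = [
    {"service.version": "454", "host": "node-1"},
    {"service.version": "454", "host": "node-2"},
]
regression_spans = [
    {"service.version": "454", "host": "node-1"},
    {"service.version": "455", "host": "node-2"},
]

def values_by_key(spans):
    """Map each tag name to the set of values it takes across spans."""
    out = {}
    for tags in spans:
        for key, value in tags.items():
            out.setdefault(key, set()).add(value)
    return out

base = values_by_key(baseline_spans)
regr = values_by_key(regression_spans)
for key in regr:
    new_values = regr[key] - base.get(key, set())
    if new_values:
        print(f"{key}: {sorted(new_values)} appear only in the regression")
# prints: service.version: ['455'] appear only in the regression
```

The same set-difference idea applies to the log-statement comparison mentioned next: diff the log messages attached to baseline traces against those attached to regression traces.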
We do very similar things with log statements: if there are log statements attached to your traces, we can also tell you whether they represent any changes between the baseline and the regression. So what we've looked at so far is that this inventory service's problem originates from its write-cache operation.
So hopefully we've been able to show you how Lightstep can help you not only monitor deployments, but also act when problems occur as part of a deployment. You can obviously roll back the deployment, but in order to keep up speedy delivery of your software, you also need to understand what the cause of that problem is, so you can quickly fix it and re-release your code.