This is the agenda. First, I'll do a very quick introduction to OpenTelemetry and distributed tracing. Then I'll talk about Promscale, which is a free, open-source observability backend that runs on top of TimescaleDB and PostgreSQL. And finally, I'll show how you can use OpenTelemetry, Promscale, Grafana, and SQL to better understand your distributed systems, using a demo environment we've created that you can get up and running on your computer in just a few minutes.
If you want to dig deeper, I recommend you check out that blog post. Ready? Let's get started.

For those who are not familiar with it, I'm going to take a few minutes to introduce OpenTelemetry and distributed tracing. OpenTelemetry is a new standard for instrumentation that is hosted by the Cloud Native Computing Foundation.
In today's session we will focus on OpenTelemetry traces, since they hold a lot of valuable data for understanding distributed systems that metrics and logs cannot provide. But what is a trace? A trace is a connected representation of the sequence of operations that were performed across all the microservices involved in order to fulfill an individual request. For example, if you open an article from a news site in your browser, there would be multiple operations served by different microservices.
Read the article, read the comments for the article, and request ads to display with that article. Each of those operations is represented by a span, with its own subspans. A span can have zero or multiple children. All spans have just one parent, except the initial span in a trace, called the root span, which has no parent.
This is just a high-level architecture, where you see Promscale using TimescaleDB to store the data, and integrations with Prometheus, OpenTelemetry, Grafana, Jaeger, and any tool that speaks SQL. TimescaleDB is PostgreSQL with time-series superpowers. Technically it's a Postgres extension, so you also get access to all the capabilities Postgres provides.
So, as you can see, I've cloned the repo, and then I just run docker-compose, which will download all the different images, build them, and then get the environment up and running. This will take a few minutes, so we're not going to watch all of it now, but you can do it on your laptop; we've tested it with macOS, Linux, and Windows.
Okay, now let's go into Grafana and check all the different dashboards we've built that show how to use SQL to derive insights from tracing.
So here I'm already logged in, and I have a demo environment that has been running for quite a bit. By default, the demo environment comes with these six dashboards, and we will be looking at them now. One thing to keep in mind is that the first time you try to log in to Grafana, it will ask for a login and password; those are admin/admin, the defaults that are set.
If we check the architecture of the application, we'll see that there is basically one entry point, which is the generator. So this panel is basically measuring the request throughput for the generator. Throughput, that is, requests per second, is one of the golden metrics when measuring application performance.
So you'll see this is a standard time-series panel from Grafana, and what we're doing is this SQL query. You'll recognize the SELECT and FROM clauses. In the SELECT, what we're adding is the TimescaleDB time_bucket function. This creates the buckets to be displayed, and we then group by the time bucket, so this basically aggregates data into per-second buckets, and then we count everything that is happening
within each bucket, which gives us the number of requests. We're counting all the spans, since that's what we have in the FROM clause; we're querying spans, and count(*) gives us all the spans that meet this requirement, where parent_span_id is null. So these are entry requests into the system which, as I mentioned, are basically requests to the generator service, and this is the throughput that we see.
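To make this concrete, here is a minimal sketch of what such a throughput query could look like. I'm assuming Promscale's span view with columns like start_time and parent_span_id, plus Grafana's $__timeFilter macro; the exact names in the demo dashboard may differ.

    SELECT
        time_bucket('1 second', start_time) AS time,  -- TimescaleDB bucketing
        count(*) AS requests                          -- spans per bucket
    FROM ps_trace.span
    WHERE parent_span_id IS NULL                      -- root spans = entry requests
      AND $__timeFilter(start_time)                   -- Grafana time-range macro
    GROUP BY 1
    ORDER BY 1;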
It comes in sort of waves, and we see the max gets to 11 requests per second, but we also see that in some buckets there are no requests at all.
I think this one is a little bit more interesting than the other one. The other one was obviously showing the evolution of throughput over time, but this one gives us more detail. For example, let's focus on this table: this table is telling us, for each service and operation, what the error rate is.
So let's take a look at what this query looks like. It uses a subquery, which is an interesting thing: it's something that is available in SQL but not necessarily in the other query languages that observability tools offer. So in this case we have a subquery; we have an initial query that is doing a SELECT, again on the span view.
This is a view that Promscale exposes, but you can think of it as if it were a regular table; the distinction doesn't matter much for the purposes of explaining the SQL that we use. So we're querying this span view, and in the span view we have a service name, which is, again, the name of the service that emitted the span, and a span name. The span name is the name of the span, but what it typically indicates is the name of that specific operation.
And we're grouping by 1 and 2, which means we're grouping by service name and span name. So these two statistics are calculated grouped by service name and span name, and that's why we see what we see in this table. And then we're also using two variables as filters.
So if we go up: as I said, this is a subquery, so we have these results, and then the only thing the outer query is doing is taking the service name and span name and calculating the error rate. We could have actually done everything within the same query, but to make it easier to read, we just used the subquery.
And finally, we're ordering by error rate descending, so we show the operations that have a higher error rate at the top. With this information we can very quickly see that generator/generate is the one that has the highest error rate, but that is the top-level operation.
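As a rough sketch, an error-rate query along these lines, again assuming the span view and a status_code column that marks failed spans (treat the column and value names as illustrative), might look like:

    SELECT
        service_name,
        span_name,
        100.0 * errors / total AS error_rate
    FROM (
        SELECT
            service_name,
            span_name,
            count(*) FILTER (WHERE status_code = 'error') AS errors,  -- assumed status value
            count(*) AS total
        FROM ps_trace.span
        WHERE $__timeFilter(start_time)
        GROUP BY 1, 2                      -- per service and operation
    ) AS stats
    ORDER BY error_rate DESC;              -- worst offenders first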
So let's focus on the next level, and at the next level we'll see that process_upper is the other one that has a very high error rate. There are some other operations that have some errors, but the error rate for those is much lower. So probably we should go and check this method and see what's going on, why we have such a high error rate.
As we mentioned, you have these controls here at the top; if you wanted, you could actually filter down to some specific service or operation. The other thing we're doing here is looking at the evolution: this panel is similar, but it looks at the evolution over time.
So if we open this query, we'll see that it is pretty much the same; the main difference is that we're introducing a time projection in the SELECT, which is the time bucket. So we're calculating this stat, the error rate per service and operation, on a per-minute basis, and we're plotting it here over time.
Okay, let's move to the next one. The next one is latency, request durations. This is the third golden signal: as I said, there are three, so we have throughput and error rate, which we've already seen, and then latency, which we can see here.
Let's look at this chart here. This chart is showing the evolution of duration over time, but we're not looking at the average; we're actually computing percentiles. So how does this work? Well, again, let's take a look and see how the query works. Here again we're using the time_bucket function that TimescaleDB provides to group the data in buckets of one minute.
Then you see the GROUP BY clause, and what we're doing is looking at the percentiles: the 99th percentile, the 95th percentile, the 90th percentile, and the median or 50th percentile. To do that we're using the approx_percentile function provided by TimescaleDB, together with the percentile_agg function, which calculates a sketch on the duration in milliseconds. That is a data structure that then allows us to compute an approximate percentile on top of it in a way that is more performant, and we're just plotting all of those here. So again, we can use the power of SQL and TimescaleDB to compute those percentiles, and we could compute any percentile that we wanted here.
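A minimal sketch of that percentile query, assuming the TimescaleDB Toolkit functions percentile_agg and approx_percentile and a duration_ms column (the actual dashboard query may differ in details), could be:

    SELECT
        time_bucket('1 minute', start_time) AS time,
        approx_percentile(0.99, percentile_agg(duration_ms)) AS p99,
        approx_percentile(0.95, percentile_agg(duration_ms)) AS p95,
        approx_percentile(0.50, percentile_agg(duration_ms)) AS median  -- 50th percentile
    FROM ps_trace.span
    WHERE parent_span_id IS NULL          -- measure end-to-end request latency
      AND $__timeFilter(start_time)
    GROUP BY 1
    ORDER BY 1;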
Another thing that is interesting is this histogram of durations. If we look at it, it's showing us the distribution of latency for requests; again, because all requests go through the generator, this is for all generator requests. There is just one entry point into this microservices environment, and what we see is that, while the majority of the requests are processed in, let's say, two seconds or less, there are some that are extremely slow.
You even have requests that took 30 seconds. That's a lot of time; what may be going on there? Okay, so here at the bottom we have another interesting thing: we're listing individual traces. Again, a trace maps to a request and how it went through the system, so we're looking at individual traces, when they happened, and how long they took, and this query is actually showing the slowest ones.
So let's take a look at this. If we look at it, we'll see that we have a number of traces, with their start time and duration, as we saw in the panel in the dashboard. And this is what we're doing: we're displaying the trace id, and we're doing this replace on the text, which I'll explain in a moment; we also project the start time and the duration, and then the only thing we're doing is sorting. We are again using parent_span_id IS NULL, which means this is the root span and basically maps to a trace, a full trace.
So it's a very simple query: we're just searching for root spans and we're getting the top 10, the slowest ones, because we're sorting by duration descending.
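A simplified version of that query, under the same assumptions about the span view (and with the dash-stripping replace explained next), might read:

    SELECT
        replace(trace_id::text, '-', '') AS trace_id,  -- strip the UUID dashes (see below)
        start_time,
        duration_ms
    FROM ps_trace.span
    WHERE parent_span_id IS NULL       -- root spans, i.e. whole traces
      AND $__timeFilter(start_time)
    ORDER BY duration_ms DESC          -- slowest first
    LIMIT 10;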
Now, why are we doing this replace? Trace ids, when they get stored in Promscale, use a UUID format, so they have dashes in them. But you'll notice that this trace id here is underlined; that's because it is a link. We've made it a link.
If you click on any of those traces, it will open the Grafana UI to show the individual distributed trace, which is similar to Jaeger; it basically reuses the code from Jaeger. And so with this you don't need to copy and paste the trace id; you can just use this smart linking, thanks to the amazing, very flexible capabilities that Grafana provides. You can jump straight into that slow trace and try to understand
what's going on. As you see, a lot of those spans are very quick, but there are always a few of them that are slow, and if you check closely you'll see that those that are slow actually belong to the digit service, and it's actually the random digit function that is slow. You can see it very quickly here, so you could go back to the random digit method or function in your code and try to understand
why it is slow. So very quickly we've nailed down that the problem is related to this specific function, at least in this trace. We could look at other traces and maybe the problems would be different, but in this case that is the problem causing this trace to be slow.
This is actually something that is typically, or usually, impossible with the limited query languages that other observability backends provide. But because we can leverage the full capabilities of SQL provided by PostgreSQL, we can do joins, and in this case we join the span view with itself to identify parent and child spans that are related to each other; we're using "k" here for kid.
The other condition, the one at the bottom, is actually very important, because it ensures that we only look at parent-child relationships across services, that is, an operation in one service calling an operation in another service. And so we remove intra-service relationships, that is, an operation in a service calling another operation in the same service, because we don't want to show those in this map, where we're interested in cross-service dependencies.
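The shape of that self-join, as a sketch (the alias "k" for kid comes from the talk; the column names are my assumption), is something like:

    SELECT DISTINCT
        p.service_name AS source,
        k.service_name AS target,
        k.span_name
    FROM ps_trace.span p
    JOIN ps_trace.span k
      ON k.parent_span_id = p.span_id       -- k is the child ("kid") of p
     AND k.trace_id = p.trace_id
    WHERE p.service_name != k.service_name; -- keep only cross-service calls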
So this is a table panel, Grafana's table panel, and the query uses pretty much the same join. It's very similar, but we're showing a different set of stats: we're grouping by source, target, and span name, that's the grouping we're using, and then we're showing how many calls are happening from the source service to the target service and operation, and the total execution time that was spent.
So we just sum spans: we compute how much time has been spent in this specific operation across all spans within the selected time window, and then the average execution time of that span. And here very quickly we can see that most of the time is actually spent in the generator calling the lower service, the lower service calling the digit service, and the generator calling the digit service. So it seems that the problem is actually in the digit service.
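Extending the same join with those aggregates, a sketch of the table panel's query (again with assumed column names) might be:

    SELECT
        p.service_name AS source,
        k.service_name AS target,
        k.span_name,
        count(*) AS calls,                  -- how often source calls target/operation
        sum(k.duration_ms) AS total_exec_ms,
        avg(k.duration_ms) AS avg_exec_ms
    FROM ps_trace.span p
    JOIN ps_trace.span k
      ON k.parent_span_id = p.span_id AND k.trace_id = p.trace_id
    WHERE p.service_name != k.service_name
      AND $__timeFilter(k.start_time)
    GROUP BY 1, 2, 3
    ORDER BY total_exec_ms DESC;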
That's the service that is very slow, and I think we already saw that when we looked at that specific trace: we saw that a lot of time was spent in the digit service. So this is just reinforcing that, and showing that most likely this is not just an individual occurrence, but something happening consistently over time and across multiple requests.
Imagine then that one of your services is unexpectedly going through a high, increasing load. Understanding where that load is coming from in a microservices environment is not easy, because you would need to check all the different upstream services that end up calling the service under pressure.
So let's select a different service here; let's go, for example, for the digit service. If we look at the digit service and the slash operation, which is the entry point operation, we see in this tree that it is being called by the generator, through an HTTP GET request to the service, but it's also called by the lower service, and we see that
there is a digit operation in the lower service that ends up calling digit which, as we already saw in the service map, is probably wrong. But the interesting thing as well is that there is quite a bit of load going to that service through this path: close to half of the load is generated via this path, and the other half of the load is generated by this path, which is the correct one.
So we see that this digit service is probably under pressure; we're doubling the amount of work it needs to do, because there's something wrong in our code in this case. And again, we could have a lot of other hops in the tree of spans or operations until we hit this service, and we could use this visualization to quickly spot where most of the requests are coming from.
The first thing to note is that doing this kind of thing, going up the chain of calls, is something that would be very tedious if you had to do it without a powerful query language, because basically we need to recursively traverse the tree of spans upwards, across all traces that involve our problematic service. Luckily, we can leverage the power of SQL again, and in this case what we use is a recursive query; we use this construct,
a WITH RECURSIVE query. The way it works is that there is an initial query that gets executed, which is this one, where the service and operation are the ones you selected from the dropdowns in Grafana. It runs this query, which retrieves some data for basically all spans that match this specific service and operation, and then it runs on the results; so x here is the set of results from this initial query.
It runs them through this other query, and basically what this is doing is a join: it takes the results from the original query, reads the parent_span_id, and then checks against the new table that we're joining, which is again the same span view, comparing to ensure that we retrieve the parents. So basically s in this case will represent the parent of x.
So we're going up one level and projecting all these different values from the parent span, and because this is recursive, it will do the same thing again: it will take the results that we just got and run this query against them again. So it will repeat the recursive step; it will check
the values that it has, inject them into x, and again look for the parents of each of the spans that were returned, retrieving the parent spans. And we do that again and again and again until there are no results returned.
So this is how the recursion works, and once it has built that table (you see this UNION ALL is just appending all those results, the results from the first query and all the subsequent queries, that is, from navigating upstream through the spans), it runs this query on those results, which uses the service name and the span name (the operation) of a span to generate an id. So we're generating one node for each service name and span name.
Something important to notice here is that we're not excluding intra-service operations, because we're actually interested in seeing them, in case the increase in calls was coming from an internal operation within the service and not generated from something outside; it could be, for example, a new deployment that we made that caused the problem. So we're not excluding them; we're actually including intra-service operations as well. And then we add the service name as a subtitle.
I don't think that's necessarily needed, but just in case, we use this so that we remove any potential duplicates and have an accurate count. And then what we do is group those results by service name and span name, so basically grouping by node.
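Putting the pieces together, a sketch of this upstream traversal and the nodes query (with $service and $operation standing in for the Grafana variables, and md5 used for node ids as described; all names here are illustrative, not the exact dashboard SQL) could look like:

    WITH RECURSIVE x AS (
        -- initial query: spans for the selected service and operation
        SELECT trace_id, span_id, parent_span_id, service_name, span_name
        FROM ps_trace.span
        WHERE service_name = '$service'
          AND span_name = '$operation'
        UNION ALL
        -- recursive step: walk up to the parent of each span found so far
        SELECT s.trace_id, s.span_id, s.parent_span_id, s.service_name, s.span_name
        FROM x
        JOIN ps_trace.span s
          ON s.span_id = x.parent_span_id
         AND s.trace_id = x.trace_id
    )
    SELECT
        md5(service_name || span_name) AS id,  -- one node per service/operation
        span_name AS title,
        service_name AS subtitle,
        count(*) AS calls
    FROM x
    GROUP BY service_name, span_name;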
So this is the query for the nodes; the edges use a very similar query. Again you see this join here, which is traversing up: from the current set of results, get the parents and project them. But it also adds a bunch of additional information, because here we're interested in the edges, so we're projecting the id for the relationship, which goes from the service name and span name of the source to the service name and span name of the child.
That is, essentially, the relationship between two nodes in the graph that we're displaying. And then we also compute the target and the source: we're doing an MD5 on the service name and span name, again to compute ids for those, and then we just project the id, the target, and the source, so the node graph panel can actually connect the dots between the services.
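For the edges, reusing the same recursive traversal, the projection might be sketched as follows (again, the exact dashboard SQL may differ):

    WITH RECURSIVE x AS (
        SELECT trace_id, span_id, parent_span_id, service_name, span_name
        FROM ps_trace.span
        WHERE service_name = '$service'
          AND span_name = '$operation'
        UNION ALL
        SELECT s.trace_id, s.span_id, s.parent_span_id, s.service_name, s.span_name
        FROM x
        JOIN ps_trace.span s
          ON s.span_id = x.parent_span_id
         AND s.trace_id = x.trace_id
    )
    -- edge projection: one row per parent/child pair of nodes
    SELECT DISTINCT
        md5(s.service_name || s.span_name || x.service_name || x.span_name) AS id,
        md5(s.service_name || s.span_name) AS source,  -- parent node id
        md5(x.service_name || x.span_name) AS target   -- child node id
    FROM x
    JOIN ps_trace.span s
      ON s.span_id = x.parent_span_id
     AND s.trace_id = x.trace_id;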
Okay, so we've seen how we can troubleshoot scenarios where we have a service that is having some issues: we can navigate up through the sequence of spans, across all the different traces, to understand how this service is being called and what the impact is of things happening upstream on the service we're looking at.
We can do something similar, but in this case using downstream spans. So let me make this bigger. Here I have selected generator and HTTP GET; let's actually select generator and the generate operation, because that is the entry point, and so this is showing an entire map of all the requests.
It's pretty much the same as the upstream dependencies dashboard; the only difference is that in this case the join is the other way around. Before, we had x.parent_span_id equals s.span_id; here it's the other way around, we're looking for x.span_id being the same as the parent span id, so we're just going downstream. Then we project the children, and then again we do the same operation here.
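In other words, only the recursive join flips; a sketch of the downstream version, under the same assumptions as before:

    WITH RECURSIVE x AS (
        SELECT trace_id, span_id, service_name, span_name
        FROM ps_trace.span
        WHERE service_name = '$service'
          AND span_name = '$operation'
        UNION ALL
        SELECT s.trace_id, s.span_id, s.service_name, s.span_name
        FROM x
        JOIN ps_trace.span s
          ON s.parent_span_id = x.span_id  -- flipped: s is the child, so we walk down
         AND s.trace_id = x.trace_id
    )
    SELECT * FROM x;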
So it's a very, very similar thing, and I will not explain it in detail, but just to show you: you can navigate upstream, but you can also navigate downstream, and this gives you a very interesting map of all the different calls that happen in the service, in this case over the last 50 minutes, for all the requests to the generator service.
Another interesting thing that I'll explain here in this dashboard is this panel, which is looking at the total execution time per operation. But it's not doing this by just blindly adding up the duration of all spans for that specific operation; it's actually looking at the time actually spent in the code of that operation, that is, it subtracts the time spent in child spans.
It's actually looking at how much time is spent within the code of that span itself, so that you can identify where the bottleneck is, because otherwise the operation at the top of the hierarchy would always show up as the slowest. But here that's not the case.
If you look at this query, the slowest one, where most of the time is spent (and we've already seen this over the course of this presentation), is the digit service's random digit method or function: 88 percent of the time is spent there. So definitely this is the first place we should go to optimize the performance of our service.
If we didn't subtract the time from child spans, the one at the top would have been the generate password operation from the generator service, because that's the top-level one, and all the time adds up into the duration of that span. So how do we do that? This is actually really important: it's this idea of looking at where specifically, in which code, the time is spent. By taking the parent span duration and subtracting the time spent in the children,
I get the actual execution time within that specific code; it's really helpful for understanding bottlenecks. And the way we do it is, again, the same thing: we use a recursive query to traverse all spans and assign the actual time spent to the different spans. And what we do is this thing that you see here. This is the key part, and what it's doing is subtracting from the parent span duration the sum of the durations of all the children.
So it's looking for the spans where this span here, the span id, is the parent, and it's subtracting all that time. And what coalesce is doing is: if this returns null, so no data, it just says it's zero, so there is no number and we don't need to subtract any time from the duration. That only happens for leaf spans, which don't have any children.
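The talk implements this with a recursive query over all spans; as a simpler illustration of the same self-time idea, a correlated-subquery sketch (assumed column names as before, not the dashboard's exact SQL) would be:

    SELECT
        p.service_name,
        p.span_name,
        sum(
            p.duration_ms
            - coalesce(                      -- 0 for leaf spans with no children
                (SELECT sum(c.duration_ms)
                 FROM ps_trace.span c
                 WHERE c.parent_span_id = p.span_id
                   AND c.trace_id = p.trace_id), 0)
        ) AS self_time_ms                    -- time spent in the span's own code
    FROM ps_trace.span p
    GROUP BY 1, 2
    ORDER BY self_time_ms DESC;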
I hope that you enjoyed it. We showed that with OpenTelemetry, Promscale, and Grafana, you can get insights you probably didn't think were possible, thanks to the power of full SQL.
I encourage all of you to download the OpenTelemetry demo today. All the software we've shown here is available on GitHub and is free to use, and if you have questions about Promscale or the demo environment, we're available in the Promscale channel in our Slack community, which you see here. I just wanted to take the time to thank you for watching this webinar, and I hope to see you in our Slack community soon.