From YouTube: 016 Observability in Istio and Gloo Mesh
Description
Collecting telemetry and logging in a service mesh across multiple clusters can be complex. In this session, we look at common multi-cluster observability patterns and how to solve problems like large-scale collection, querying, and aggregation.
Scott: Thank you, everybody, and welcome to our talk, Observability in Istio and Gloo Mesh. I'm Scott Weiss, architect at Solo.io.

Today we're going to talk to you about observability and what that means. Just to start off with a rudimentary, general definition: observability can be understood as the ability to understand the internal states of a system based on knowledge of its external outputs. Digging into detail, what we're really interested in here is approaching this from a microservices point of view.

How do we gain observability into our microservices stack? Just to explain the problem a little bit: we're aware that there's a transformation going on, a shift from monolithic applications to microservice architectures, and that shift brings some challenges with it. Just to give a sense of the scale, at large companies you have hundreds of microservices that are all interdependent and communicating with one another.

This leads to a situation captured by a tweet we love at Solo: "We replaced our monolith with microservices so that every outage could be more like a murder mystery." Basically, all of this scale and separation, this modularization of components, makes things more difficult to track and understand.
So that's where observability comes in, in this context. In the microservices world we talk about three different types of data, three different types of telemetry: tracing, logging, and metrics. We'll go into each one of them and explain how they work and how they help us understand the microservices stack.
First up is tracing: how does traffic flow through our system? When we have a system of microservices and an error occurs somewhere in that system, we want to understand the context for that error. For example, the bufio scanner error that we see here occurs somewhere in our system. That's great, but we want to understand the context for that error; with metrics and logs alone, we only see the picture from each individual instance.
We want to have an understanding of the context in which these things are occurring. Otherwise it's kind of like debugging without a stack trace, so we need to actually record the traces that happen within our system in order to understand who invoked what. Just to give an explanation of how tracing works: the edge service that initiates a flow of requests creates a unique identifier and initializes a context, and that context gets passed down in headers to each backend service that gets called in the chain.
This allows us to capture timing, put arbitrary metadata on a context, and then reassemble the call tree that we collect in a UI. Just to give an example, you can see here that we can construct these spans out of the individual traces of each step in the request chain.
It allows us to propagate the context between applications, so the applications themselves can be aware of it, and it allows us to find latency bottlenecks and the sources of errors inside our request flows. To give a little bit of a clearer idea of what's going on: in the old world, before service mesh, our application would have to instrument all of this on its own. It would have to initialize the context, and it would have to propagate the context.
What service mesh has added to the equation is that having sidecars for our applications allows us to automatically generate the trace context and propagate it to a storage implementation, for example an OpenTelemetry implementation. The only limitation is that the application still needs to connect the context between different requests: when it receives a request from one service, it needs to know that a subsequent outbound request is sharing that context.
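As a concrete illustration (not shown in the talk), these are the request headers an application would typically copy from an inbound request onto its outbound requests so the sidecars can stitch the individual spans into one trace; this is the common B3 set documented by Istio, and the exact list depends on the tracer configured:

```yaml
# Headers to forward from inbound to outbound requests (B3 propagation,
# Istio's default tracing setup; adjust for your tracer of choice):
tracing_headers_to_forward:
  - x-request-id
  - x-b3-traceid
  - x-b3-spanid
  - x-b3-parentspanid
  - x-b3-sampled
  - x-b3-flags
```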
Harvey: Next up is logging. Logs record discrete events in our applications, such as database operations and other critical computations. Logs provide us with detailed information with an expansive context, and by virtue of this they are often used for debugging specific problems. When paired with a log management system, it's common to impose a well-defined structure on the logs, which enables searchability; this is also known as structured logging.
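To make that concrete, a structured log entry might look something like the sketch below; the field names are illustrative and not prescribed by Istio or Gloo Mesh:

```yaml
# A structured log entry expressed as fields rather than free-form text,
# so a log management system can index and search on them (illustrative):
level: error
time: "2021-06-01T12:00:00Z"
service: reviews-v1
cluster: cluster-2
trace_id: 4bf92f3577b34da6a3ce929d0e0e4736
message: "bufio.Scanner: token too long"
```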
The third type of telemetry is metrics. More concretely, metrics are quantitative data that describe the internal state of the system, and again, the exact metrics depend on how an application is instrumented. Some commonly seen metrics include measures of hardware usage, such as CPU and RAM consumption, as well as measures of network activity, such as requests per second and response latency. Metrics provide us with high-level insights over time that summarize the overall performance of the system, and for this reason they're often the first signal to check when evaluating overall system health.
So, on to Istio's metrics instrumentation. Istio provides instrumentation for what are called the golden metrics. Golden metrics provide a bird's-eye view of overall system health, and there are three typical categories: latency, traffic, and failures. Latency metrics provide us with a measure of how slow or fast a service is; it's the time taken to serve requests, and it's typically measured in percentiles.
So a 99th-percentile latency of 100 milliseconds means that 99% of requests are served in 100 milliseconds or less. With Istio, you get the istio_request_duration_milliseconds metric, which is a distribution that allows for subsequent computation of percentiles. Traffic metrics provide us with a measure of how in-demand a service is, and this is usually measured as the number of requests per second. Failure metrics are similar: they provide a measure of the number of requests that have failed, and when combined with traffic metrics we can generate a success rate or failure rate.
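For example, with the standard Istio metrics scraped into Prometheus, the p99 latency and a success rate could be derived with recording rules along these lines (a sketch; the label names and time windows may need adjusting for a given setup):

```yaml
groups:
- name: istio-golden-metrics
  rules:
  # 99th-percentile request duration per destination workload, computed from
  # the istio_request_duration_milliseconds histogram buckets.
  - record: workload:istio_request_duration_milliseconds:p99
    expr: |
      histogram_quantile(0.99,
        sum(rate(istio_request_duration_milliseconds_bucket[5m]))
        by (le, destination_workload))
  # Fraction of requests that did not return a 5xx, per destination workload.
  - record: workload:istio_requests:success_rate
    expr: |
      sum(rate(istio_requests_total{response_code!~"5.."}[5m])) by (destination_workload)
      /
      sum(rate(istio_requests_total[5m])) by (destination_workload)
```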
Scott: Thank you, Harvey. So when we look at real-world use cases, at how customers today are actually running their mesh or running their clusters, they are running multiple clusters. Sometimes those clusters are replicated, with clones of services, and they're running, let's say, in different regions or on different clouds.
There are really a number of complex situations that we get into, and this increases the burden for observability. Like Harvey mentioned, when it comes to Istio running and managing a single cluster, the responsibility for collecting the metrics is left to Prometheus, and Prometheus can quickly become challenging to scale when you're dealing with multiple clusters, because you have to set up a Prometheus federation. The burden is really left on the user, and this is what we've seen time and again working with customers.
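For context (not shown in the talk), federating per-cluster Prometheus instances into a central one typically means scrape configuration like the sketch below, where the target addresses and the match selector are placeholders:

```yaml
# Central Prometheus pulling selected series from each cluster's Prometheus
# via the /federate endpoint (placeholder targets and match selector):
scrape_configs:
- job_name: federate
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
    - '{__name__=~"istio_.*"}'
  static_configs:
  - targets:
    - prometheus.cluster-1.example.com:9090
    - prometheus.cluster-2.example.com:9090
```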
So what we've done is build out solutions for them in our product, Gloo Mesh, to help address and tackle some of these: how to handle multiple clusters, how to handle multiple meshes, how to handle meshes and clusters that are being shared by multiple tenants, how to handle meshes and clusters that are running on multiple clouds, as well as the different permissions and personas that users have in order to interact with these.
To summarize some of the challenges we've come across: metrics attribution and log attribution can often be missing the context of which mesh they're a part of, because different meshes are not aware of each other. You may have different meshes; for example, you may be running App Mesh if you're on Amazon, and then you have Kubernetes clusters that are living on GKE or running in VMs.
Those clusters may also have their own meshes installed, and you may have cross-cluster traffic going on, and we want to add the context of which cluster and which mesh these requests are a part of. Another problem is the aggregation of the data. This is an ongoing issue, and it can be quite challenging, in particular at the scale of the number of metrics that wind up being stored; there are a number of best practices out there for solving that, or for reducing the cardinality of the metrics that are being collected.
But the aggregation question still becomes more challenging when dealing with large environments. And finally, another question is how to actually integrate with third-party tooling: once we've figured out how to aggregate the metrics, how do we then provide those to tooling like Kiali and Grafana to get more visibility into our system?
So we've solved this for our users through the use of an aggregation layer, which funnels metrics, logs, and, in the future, traces as well into a single pane of glass that is then accessible to a user or a third-party application, and which can take data from disparate sources.
Another feature of this system that we've developed is the pluggability of data sources, depending on how users want to collect their metrics. Certain users already have a solution with Thanos, a Prometheus-based solution that they're using to scrape across clusters; some are using Datadog to manage their clusters and collect their telemetry data. We allow the pluggability of a data source, which plugs into our server and can then be propagated to our UI.
The agent connects to a server running on another cluster over a secure mTLS connection, which uses client certificates in order to verify the identity of each agent. The agents connect to the server and then begin to push data: the Envoy proxies push their metrics and logs to the agent via a gRPC service.
Harvey: Okay, so let's take a look at Gloo Mesh's observability features in action. But before we get started, let's first review the architecture. As we can see in this diagram, on each managed cluster the Envoy proxies are configured to send their access logs and metrics to the local Gloo Mesh agent, which then forwards those access logs and metrics to the central Gloo Mesh server. This acts as our central repository for all of our observability data, from which we can access it.
In our scenario we have two clusters. Both of them are managed, and one of them is also hosting the management plane. So, as you can see here, we see both the Gloo Mesh server, indicated by the enterprise-networking pod, as well as the agent; in the other cluster we also have an agent, and we've deployed the Bookinfo app.
So you might wonder how we have configured the Envoy proxies to send their metrics and access logs to the Gloo Mesh agent. To do this, we leverage Istio: its mesh config has an option where you can declare an access log service as well as a metrics service, and for both of these we've set the enterprise agent as the sink for the access logs and metrics. Istio then does the work of configuring this across its fleet of proxies.
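In Istio's mesh config, that looks roughly like the sketch below; the agent address and port are assumptions based on a default Gloo Mesh install rather than values shown in the talk:

```yaml
meshConfig:
  enableEnvoyAccessLogService: true
  defaultConfig:
    # Stream Envoy access logs to the local Gloo Mesh (enterprise) agent.
    envoyAccessLogService:
      address: enterprise-agent.gloo-mesh:9977
    # Stream Envoy metrics to the same agent.
    envoyMetricsService:
      address: enterprise-agent.gloo-mesh:9977
```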
So at this point it'd be a good idea to check out the metrics and get a higher-level view of what's going on, and this matches what we're seeing: details is fine, reviews-v3 is fine, but requests to reviews-v1 seem to all be failing, as indicated here. So now let's drill down and look at a few example requests; maybe that'll give us a clue as to what's happening. For this we want to use access logs, so the first thing we need to do is configure the collection of access logs by creating this access log record.
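The resource being applied here isn't shown in full in the transcript; below is a sketch of what a Gloo Mesh Enterprise access log record selecting the Bookinfo reviews workloads might look like, with the API group and field names recalled from the docs of that era and the namespace used as a placeholder, so they should be verified against the version in use:

```yaml
apiVersion: observability.enterprise.mesh.gloo.solo.io/v1
kind: AccessLogRecord
metadata:
  name: access-log-reviews
  namespace: gloo-mesh
spec:
  filters:
  - workloadSelectors:
    - kubeWorkloadMatcher:
        namespaces:
        - bookinfo            # placeholder namespace for the Bookinfo app
        labels:
          app: reviews        # collect access logs for the reviews workloads
```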
So we've created this access log record. It might take a few seconds to take effect, and now we're going to connect to the Gloo Mesh server: it has an endpoint from which you can stream the received access logs. So let's connect here, and we're looking at access logs coming either from reviews or from the product page, since those are the pertinent workloads.
It looks like this authorization policy is saying: restrict reviews to requests that contain the header foo with the value solo.io. So that explains it.
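An Istio authorization policy along those lines would look something like the following reconstruction; the policy and namespace names are assumptions, but the header condition matches what's described:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: reviews-require-foo-header
  namespace: bookinfo        # assumed namespace for the Bookinfo app
spec:
  selector:
    matchLabels:
      app: reviews
  action: ALLOW
  rules:
  # Only allow requests that carry the header foo: solo.io;
  # everything else to the reviews workloads is denied.
  - when:
    - key: request.headers[foo]
      values: ["solo.io"]
```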
When we look back at our access log and search for foo, you see that the request header is not even present, which would explain the request failures.