From YouTube: Hoot Live Episode: Understanding Istio Metrics
Description
Join Scott Weiss, Architect at the Office of the CTO, on our next Live Hoot Episode on April 6. Scott will dive into Istio Metrics.
About us https://www.solo.io
Questions? https://slack.solo.io
Code Samples: https://github.com/solo-io/hoot
Suggest a topic to cover here: https://github.com/solo-io/hoot/issues/new?title=episode+suggestion:
We've gone from the monolithic world to the microservice world, which has made things a lot more complex in terms of understanding our systems and identifying performance issues. Just to give you a sense of the scale these things reach: these diagrams were actually taken a couple of years ago, but they were published by the companies shown here, showing their microservice graphs.
With communication going on between hundreds of microservices, it can become a pretty challenging problem to figure out: where is the bottleneck? Where is the source of an error? Do we even have an error? Is something happening that's a sign of a service failure or outage? There's a tweet that we love:
"We replaced our monolith with microservices so that every outage could be more like a murder mystery."
This basically frames the problem, which is how to identify performance: how are things performing in our system?
So that's the high level; now let's get into some of the details. What metrics do we actually get with a service mesh, other than those produced by the application itself? How are we... hold on, I'm just going to pause one second and check that the stream is good.
Great. Just throw up any questions in the chat if there are any. Okay, so what metrics does Envoy give us directly? Before we get into the Istio piece, let's just look at what happens when you stick an Envoy proxy in front of some traffic and use it to proxy that traffic. Envoy will give you a bunch of data out of the box.
That data is really helpful for understanding connection issues within a single workload. This could be an invalid configuration for the proxy, connection failures when we can't reach the upstream we're trying to connect to for some reason, or TLS handshake errors. These are all things that correlate with metrics that Envoy emits. Here's an example of some of the metrics that Envoy provides us; these are fairly low level.
They're often useful for debugging the mesh itself, as well as, let's say, TLS errors, or when a service is offline and refusing connections, and so on. We're not going to dig into the Envoy side too heavily here, but just know that this is something you'll get, quote unquote, for free.
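As a rough illustration, you can pull these low-level stats straight from any sidecar's Envoy admin endpoint. A minimal sketch, assuming a Bookinfo-style deployment (the namespace and workload names are assumptions):

```sh
# Dump raw Envoy stats via the sidecar's admin endpoint (port 15000 by
# default in Istio sidecars) and filter for connection/TLS failure counters.
kubectl -n bookinfo exec deploy/productpage-v1 -c istio-proxy -- \
  curl -s localhost:15000/stats | grep -E 'upstream_cx_connect_fail|ssl'
```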
Now, when Istio comes into the mix, it gives a bit more of a high-level set of metrics, which are really very useful for understanding service health in our system. So say we have hundreds of microservices: Istio maintains a number of these istio_-prefixed metrics, which are used to track things at a higher level, and they're aggregated across all the services in the mesh. They give us an understanding of, for example, request latency, which Istio measures through Envoy via a custom filter that Istio has added to its installation of Envoy.
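To see those metrics being emitted, you can ask a sidecar for its stats in Prometheus text format. A sketch, with the same assumed workload as before:

```sh
# pilot-agent proxies a request to the sidecar's stats endpoint; the
# istio_* series are merged into Envoy's own stats output.
kubectl -n bookinfo exec deploy/productpage-v1 -c istio-proxy -- \
  pilot-agent request GET stats/prometheus | grep '^istio_'
```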
It's measuring the latency of requests, so we get a global metric for the distributions of the latencies of all the requests going into our services. Those are distributions, so they're calculated for different percentiles.
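In PromQL terms, computing a percentile from that histogram looks roughly like this; the workload selector and the local port-forward to Prometheus are assumptions:

```sh
# p99 request latency for one destination workload over a 5m window,
# queried through the Prometheus HTTP API.
curl -sG http://localhost:9090/api/v1/query --data-urlencode \
  'query=histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{destination_workload="reviews"}[5m])) by (le))'
```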
Then Istio also keeps a counter of every single request that happens in the mesh: this is istio_requests_total. Each request that happens is labeled, so we know what the source was, what the destination was, what the namespaces are, what the clusters are. We have all that information on istio_requests_total, so we can split it up by service, or look at it by errors. Just using those two metrics, we can actually get three of the four golden metrics.
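For instance, that one counter can give you both traffic volume and error rate. A sketch of the kinds of queries involved (the service name is an assumption):

```sh
# Requests per second into a service, split by calling workload:
curl -sG http://localhost:9090/api/v1/query --data-urlencode \
  'query=sum(rate(istio_requests_total{destination_service="productpage.bookinfo.svc.cluster.local"}[5m])) by (source_workload)'

# Share of those requests that returned a 5xx response:
curl -sG http://localhost:9090/api/v1/query --data-urlencode \
  'query=sum(rate(istio_requests_total{destination_service="productpage.bookinfo.svc.cluster.local",response_code=~"5.."}[5m])) / sum(rate(istio_requests_total{destination_service="productpage.bookinfo.svc.cluster.local"}[5m]))'
```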
If you're familiar with the golden metrics, they include latency, meaning how fast or slow our services are; traffic, meaning how much demand there is for our services, which we get just by looking at the actual request quantity; and failure rate, which will be the percentage of requests that fail. The last one, which is not in here, is saturation. Saturation has to do with the actual capacity of the services, and that requires a bit more.
It's not something Istio can give us on its own. We can take the traffic data that's here and understand what the capacity of our services is through various measures, for example by looking at CPU and memory usage, to get that full-picture view.
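A minimal sketch of filling that saturation gap from outside the mesh, assuming metrics-server is installed (for kubectl top) and cAdvisor metrics are in Prometheus:

```sh
# Quick per-pod CPU/memory snapshot (requires metrics-server):
kubectl -n bookinfo top pods

# Or CPU usage per pod from cAdvisor counters in Prometheus:
curl -sG http://localhost:9090/api/v1/query --data-urlencode \
  'query=sum(rate(container_cpu_usage_seconds_total{namespace="bookinfo"}[5m])) by (pod)'
```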
You can see how, for IT ops, for any kind of operations, this is super valuable. Just having Istio installed, just putting it down as a transparent mesh without applying any configuration, and collecting all this data out of the box, is pretty powerful. But there is one limitation to this design, which is when you have multiple Istio control planes.
These metrics don't get correlated with one another, and Istio is unable to aggregate them as it should, because each Istio control plane represents a logical boundary for the set of all this data.
So, just to summarize some of the gaps in the current ecosystem today, where customers are moving and where we're looking to build the solutions customers need: these are situations where you have multiple clusters, and you may have multiple meshes. They may be running in the same cluster or across clusters; there's really a many-to-many possible relationship there.
I won't get into all of these in this hoot; it's going to be a relatively short one today. But we will look at the multi-cluster, multi-mesh setup and understand how we've been able to attack that problem.
So, to summarize some of the limitations here: cluster boundaries can lead to improper metrics attribution. When a request crosses a trust boundary, meaning you have a multi-cluster setup where traffic leaves through an egress gateway that terminates the mTLS inside an individual mesh, and then goes to a remote ingress that initiates a new TLS session, the metadata that Istio uses to generate metrics, the information that tells it the sender and the recipient of the traffic, doesn't survive the hop.
The context is lost. Essentially, Istio will only know that the egress was reached; it's not going to know that a service in cluster A is actually talking to a service in cluster B. So you need some kind of orchestration or tooling on top to reconcile that. Another problem is the aggregation itself.
We've seen various approaches in this space. It can be done with Prometheus federation, where you have a Prometheus instance set to scrape the Envoys in each cluster, and then a centralized Prometheus that goes and scrapes each of those, federating across the multiple Prometheus instances. But doing this is fairly complicated, and it's a large operational overhead managing so many instances of Prometheus. You'll have to focus on things like synchronizing storage between them, and the cardinality of the metrics can become pretty intense.
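For reference, the central side of that federation setup looks roughly like this; a sketch, with the endpoints and the match filter as assumptions:

```sh
# The central Prometheus scrapes each per-cluster Prometheus's /federate
# endpoint, pulling only the istio_* series.
cat <<'EOF' >> prometheus.yml
scrape_configs:
  - job_name: 'federate-istio'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"istio_.*"}'
    static_configs:
      - targets:
          - 'prometheus.cluster-a.example.com:9090'
          - 'prometheus.cluster-b.example.com:9090'
EOF
```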
It can just be difficult to manage all of that metrics aggregation. And then, once you have all of those metrics integrated or aggregated,
we have the question of how to actually leverage them in their raw form. Some of the software out there today, like Kiali and Grafana, does a nice job of integrating with Prometheus or some other metric store, but adding those higher-level insights still requires either modifying the metrics themselves so that they contain the context of, for example, which cluster boundaries are being crossed, or making the third-party software aware of these differences, aware that it's observing a multi-mesh environment. So what we've done, and what we're working on with customers, is to provide a single source of truth for observability: a single pane of glass that can be used to aggregate the metrics from different clusters, unify the formatting of the data, clear up the differences between our meshes, and correlate the metrics that need to be correlated across different mesh boundaries.
So, the tooling that we're building out is kind of like a platform for all things mesh. Right now I want to demo for you how Gloo Mesh works today, which will
aggregate metrics across multiple clusters. I'm going to switch over to my demo here, and as I'm doing that, if anyone has any questions, feel free to drop them in the chat; I'd be curious to see them.
I have this kind of setup running here already.
I want to generate some traffic, and we're going to use that to generate some metrics. So I'm going to port-forward to the product page of the Bookinfo application, which some of you may be familiar with if you've played around with Istio at all, and I'm going to use hey to generate some traffic against it.
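Roughly what that looks like, assuming Bookinfo sits in its usual namespace and hey (https://github.com/rakyll/hey) is installed locally:

```sh
# Expose productpage locally, then drive steady load against it:
kubectl -n bookinfo port-forward deploy/productpage-v1 9080:9080 &
hey -z 2m -c 2 -q 10 http://localhost:9080/productpage
```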
The next thing I'm going to do... actually, what I'd like to do first is explain what's going on a little bit. So let's get all the pods here and take a look at what's actually running.
Oh, thank you! I just got a notification that I'm not sharing my screen. Whoopsies, good call! Let me turn that on; all right, sorry about this. Okay, so let me restart these pieces of the demo and I'll just show you that. No crashing, all right, geez.
On the remote cluster, you see I have the Bookinfo pods, and I have this agent that runs: this is the Gloo Mesh agent, and it's collecting what's actually happening under the hood from the Envoy sidecars. So you see the ratings pod here and the reviews pod; they're running two containers.
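If you want to confirm the sidecar injection yourself, a quick check (namespace assumed):

```sh
# List each pod with its container names; injected pods show the app
# container plus the istio-proxy sidecar.
kubectl -n bookinfo get pods \
  -o custom-columns='POD:.metadata.name,CONTAINERS:.spec.containers[*].name'
```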
One of those containers is the Envoy sidecar, and that sidecar is configured to push metrics directly to the agent. The agent then sends them over a secure channel, across the cluster boundary, into the management cluster. This is one of those concerns we talked about before: when you send metrics, you want to make sure it's over a secure channel; you obviously don't want them traveling in plain text. This is one of those problems customers are coming up against, so we've integrated it into our solution.
If we look at the management cluster, we also have Bookinfo running there. So this cluster is doubling as both a management plane, a control plane, as well as a data plane, because we have our pods and our Envoys running there. And you'll see we have more pods running; we also have an agent running there.
This agent will connect to the management pod and stream to it. Both agents are streaming their metrics up to the management pod, and the management pod is aggregating them. So let's generate some metrics here.
The stream is going, and now we're going to look at the metrics endpoint, where we're aggregating the metrics. Please work... and it's not working. Oh. Why are we not working?
Hold on, let me share this.
There's a certain setup step that you have to do for Istio, and then everything...
It's empty. All right, I'm sorry, everybody. I think there's something wrong with my setup, and unfortunately I don't think it's going to be fruitful to debug it here. There are videos I can point to where we already have demos of this, and like I said, this is something you can test at home, so please try it out and let us know if you run into any issues.
I'm pretty sure I skipped a step in my local setup, and somewhere along the way it stopped working. But essentially what you'll wind up with is a single Prometheus where you can actually query metrics, and you'll be able to see the references to each workload. They'll be cross-cluster, so you'll be able to see the cross-cluster metadata.
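A hypothetical example of what querying that aggregated Prometheus could look like; the exact cluster label names depend on the setup and are assumptions here:

```sh
# Cross-cluster request rates, split by originating and receiving cluster.
curl -sG http://localhost:9090/api/v1/query --data-urlencode \
  'query=sum(rate(istio_requests_total[5m])) by (source_cluster, destination_cluster)'
```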
And there are options if Prometheus is not your preferred metrics backend. So again, I'm really sorry about that demo; I would have to go back and recreate my environment from scratch, which I don't want everyone to sit through. So why don't I just jump into some questions?
Are there any questions? I don't see any.
...to a single pane as well, which we can see. And we will be working on traces as well, providing similar functionality there in an upcoming release. So anyway, thank you, and again, I'm so sorry about that failed demo.