From YouTube: Opstrace Tracing demo e2e
Description
Demo of the current state of the new ClickHouse-backed tracing in Opstrace
Hi, this is Nick Parker on the Monitor Observability team, and I'm going to do a quick, relatively unscripted demo of the current state of tracing in Opstrace. Tracing itself is a relatively recently added feature and it's still in the proof-of-concept stage, but I figured it's at a point now where we can do a quick demo and show you what works and what's left to do. I figured I'd start off with this diagram showing how things flow.
This looks way more complicated than it is. The main thing to get from it is that you can have multiple tenants in Opstrace (that's already a thing), and against each of those tenants you can send traces in; the traces then get stored, and you can get them back by going to a UI. So the idea here is that some end user would send traces, probably using the OpenTelemetry Collector, which you can think of as a Swiss Army knife agent for managing traces. It would send data to an authenticated endpoint using the OTLP span format. The first thing that endpoint does is obviously check the authentication header, but assuming that passes, it takes the data and internally converts it to a Jaeger format so that an internal Jaeger pod will accept it.
The traces are streamed in through the system into ClickHouse, and then when an end user wants to look at something, they go to the Jaeger UI, which is literally just a stock Jaeger interface, and view it for the tenant they're interested in.
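For reference, here is a minimal sketch of what sending spans to one of those per-tenant endpoints can look like with the OpenTelemetry Python SDK. The endpoint URL, header name, and token value are placeholder assumptions for illustration, not the exact values an Opstrace tenant uses.

```python
# Minimal sketch: emit one span over OTLP/HTTP to a tenant's tracing endpoint.
# The endpoint URL, header name, and token below are illustrative assumptions.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="https://default.tenant.example.opstrace.net/tracing/v1/traces",  # assumed URL
    headers={"Authorization": "Bearer <tenant-auth-token>"},                    # assumed header/token
)

provider = TracerProvider(resource=Resource.create({"service.name": "demo-client"}))
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

with trace.get_tracer(__name__).start_as_current_span("demo-span"):
    pass  # the span ends here and is queued for export

provider.shutdown()  # flush the batch processor before exiting
```

Those same two pieces of configuration, the endpoint and the auth token, are exactly what the collector gets pointed at later in the demo.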
So this is sort of the state of the system at this point: you can get data in and you can see the data coming out. We now have regular CI runs against all merge requests, as well as periodically against the main branch, that check that you can get data stored into the system and then query it back out via the Jaeger API. We've basically been running it long enough that it looks like things in their current MVP state are relatively stable.
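Roughly the same store-then-query check that CI performs can be done by hand against the Jaeger query API. The sketch below assumes the `requests` library and a made-up cluster domain; the `/api/...` paths are the internal ones the stock Jaeger UI itself calls, and authentication against the Opstrace instance is elided here.

```python
# Minimal sketch: query traces back out through the HTTP API used by the stock
# Jaeger UI. The host name is an assumption; auth against the instance is elided.
import requests

base = "https://system.example.opstrace.net/jaeger"  # assumed: <tenant>.<cluster-domain>/jaeger

# List the services that have reported spans for this tenant.
services = requests.get(f"{base}/api/services", timeout=10).json()["data"]
print(services)  # e.g. ["cortex", "jaeger-operator", "jaeger-query"]

# Fetch recent traces for one of those services.
traces = requests.get(
    f"{base}/api/traces",
    params={"service": "jaeger-operator", "limit": 20},
    timeout=10,
).json()["data"]
print(f"got {len(traces)} traces")
```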
If we look at the system tenant on this example Opstrace instance that I've got up and running (so I'm at the system tenant, then the cluster domain, then /jaeger), we can see we've just got a stock Jaeger UI with a little bit of customization over here, with some links related to Opstrace itself. Otherwise, anyone who's used Jaeger has probably seen this before.
In this instance, the system instance, we can see we've already got some internal Opstrace components sending traces into this. At the moment it's just Cortex, the metrics management system that Opstrace uses internally for metrics, as well as the Jaeger Operator, which is deploying all the per-tenant Jaeger instances, plus Jaeger itself reporting on itself.
The jaeger-query one is basically on by default, so we're getting that for free. But, for example, with the Cortex traces, I can hit Find and then I see a bunch of Cortex spans, or traces, for storing data into the system. You can see distributors are accepting data and then picking an ingester, and then the ingester is actually storing it, I guess locally.
And then, if we go into the Jaeger Operator and query for spans against there, we can see it technically has some errors. If we go and look at one of those, we can see it's complaining about "no matches for kind Ingress in version networking.k8s.io/v1". That's okay, that's kind of expected; it's a side effect of deprecated Ingress versions between different versions of Kubernetes.
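As a side note, checking which versions of that API group a cluster actually serves is straightforward; here is a minimal sketch, assuming the official `kubernetes` Python client and a reachable kubeconfig.

```python
# Minimal sketch: list which versions of the networking.k8s.io API group the
# cluster serves, which is what the "no matches for kind Ingress" error is about.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

for group in client.ApisApi().get_api_versions().groups:
    if group.name == "networking.k8s.io":
        print([v.version for v in group.versions])  # e.g. ['v1'] or ['v1beta1', 'v1']
```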
So that's expected, for the record, but you can see that we can very quickly diagnose problems in the Opstrace instance itself. For the other half of this demo, we've looked at some examples of the system tenant getting data basically from the system itself, in a snake-eating-its-own-tail form, but let's try actually sending in some data from the outside.
So I have a Kubernetes cluster that's just running in my house here, and I've configured the API server on the Raspberry Pi to send metrics, or sorry, send traces I should say, to an OpenTelemetry Collector. This is the definition of what that collector is running.
To be clear, the OpenTelemetry Collector is just a third-party stock agent that you can run, and we're just configuring it with (a) the auth token and (b) the endpoint where the tenant tracing endpoint is located, i.e. where the traces should actually be sent. Once we've got those two things, we can check up on our pod here and see that, yeah, it's basically sending some spans every five seconds or so. And then, if we go... whoops, I didn't mean to click on that.
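As a sketch of what that home collector is effectively doing, the snippet below reuses the same OpenTelemetry Python SDK setup as the earlier example, adds the resource tags that make the source identifiable in the UI (the Raspberry Pi host and the cluster name that show up a bit later in the demo), and emits a span on the five-second cadence described here. All concrete values are placeholders, not the demo's actual configuration.

```python
# Minimal sketch of what the home collector is standing in for: tag spans with
# the host and cluster (these surface as process tags in the Jaeger UI) and emit
# them on a loop. Endpoint, header, token, and tag values are all placeholders.
import time

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

resource = Resource.create({
    "service.name": "apiserver-demo",
    "host.name": "raspberrypi",          # placeholder host tag
    "k8s.cluster.name": "home-cluster",  # placeholder cluster tag
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(
    endpoint="https://default.tenant.example.opstrace.net/tracing/v1/traces",  # assumed
    headers={"Authorization": "Bearer <tenant-auth-token>"},                    # assumed
)))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("home-cluster-demo")

while True:
    with tracer.start_as_current_span("heartbeat"):
        pass
    time.sleep(5)  # roughly the cadence visible in the collector pod's output
```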
Wherever that's going, let's go back. If we go into the default tenant here (I can just go back to the root), we can see we've got apiserver, whose traces are coming from my OpenTelemetry Collector pod here in the house and getting sent to this remote Opstrace instance, and then we've also got jaeger-query, which again is just Jaeger reporting on itself by default.
So if I do a quick Find Traces, it comes back very quickly and we can see there are some events going on. I don't know anything about the internals of the Kubernetes API server, but you can see, okay, there's my Raspberry Pi, there's the Kubernetes cluster name. So we've got traces coming in from an arbitrary source and getting into the Opstrace instance against this default tenant.
So anyway, I guess that's kind of it. Those are just some examples of sending data into the system and being able to see it in the UI; again, we're just running a stock Jaeger UI for now. As far as things that are left to do, the obvious ones are, for example, that the ClickHouse instance right now is running in unreplicated mode, which means there's basically one pod that all of this data is being stored against. Joe Shaw is currently working on setting up replication for that ClickHouse instance, so that it's a bit less prone to single-point-of-failure issues.
I am working on setting up some quotas and limits. A lot of these individual components have limits around, for example, the total throughput of data coming in, the total number of spans that you can store, that sort of thing, so that tenants are not taking up too much of the system. Mostly it's just a matter of exposing those options that already exist, but anyway, it's, it's...