A
All right, everyone. Thanks for joining us for today's CNCF live webinar, "Uncovering Hidden OTel Traces in a Standardized Manner." I'm Libby Schultz and I'll be moderating today's webinar. I'm going to read our code of conduct and then hand things over to Steve Waterworth from Asserts and Taylor Dolezal, head of ecosystem at CNCF. A few housekeeping items before we get started.
A
During the webinar, you are not able to speak as an attendee, but there is a chat box on the right sidebar where you can say hello, tell us where you're watching from, and leave all of your questions; we'll get to as many as we can at the end. This is an official webinar of CNCF and as such is subject to the CNCF code of conduct. Please do not add anything to the chat or questions that would be in violation of that code of conduct, and please be respectful of all of your fellow participants and our presenters.
A
Please also note that the recording and slides will be posted later today to the CNCF online programs page at community.cncf.io, under Online Programs. They're also available via the registration link you used to join today, and will be on our Online Programs YouTube playlist. With that, I'll hand things over to Steve and Taylor to kick off today's presentation. Take it away.
B
Awesome. Well, howdy, howdy, and welcome, everyone. Like Libby said, if you have any questions at all during the session, please feel free to throw them into chat, and we'd love to surface those and get to as many as we possibly can. I'm really excited to be joined by Steve from Asserts, and we're going to be covering a lot of things as it pertains to open source, and would really just love to dive in and get started.
B
One project that I've heard quite a bit about from many folks has been OpenTelemetry, also referred to as OTel for short. So, Steve, OTel me more; I'd love to hear about it.
C
OpenTelemetry is a vendor-agnostic, truly open observability toolset. Trivia fact for you: it is the second most popular project in the CNCF as far as source code activity goes, commits and the like, second only to Kubernetes. So there's a lot happening with it. Where we're predominantly interested in it is the tracing side of things because, as we know, with cloud native applications, microservices, and the like, being able to do distributed tracing can be a very good diagnostic tool when things aren't going quite as well as you'd hoped they might.
B
I find it funny that the second most popular project in the CNCF portfolio right now is one focused on telemetry and observability. It's great to see that it's doing so well in that space, and that folks just want to know what's going on. One other project that I've seen paired with it quite often, kind of like chocolate and peanut butter, has been Prometheus. Is that one that you can talk a little bit more about?
C
Chocolate and peanut butter? I'm not sure that's a good pairing, but there we go, each to their own. Yeah, so for me, Prometheus is sort of the flip side: that's all about time series metrics, and it's probably one of the more mature projects in the CNCF. If I just flick over to my crib sheet, just to be absolutely certain... I think it graduated in 2018, something like that. Yes, it did; it graduated in 2018.
B
It's been interesting to see what folks are using Prometheus to measure, and it seems kind of like a dark art for some when it comes to figuring out the right way to craft their perfect dashboard or their single pane of glass. I know that can be a little bit difficult for some folks, but I feel like once you have that set up, you have a really good view into the things you actually want to see. I know we use that at the CNCF for things like DevStats.
B
Looking at the various projects, and at folks, companies, and other types of observability as it comes to our projects: who's contributing to them, what the project health might look like in some metrics there. So I know that we're, yes, biased, but we are very happy to have the project, because it helps provide a lot of insights on that.
C
Yeah, for me this is great, because it's very easy to get it set up and running, particularly if you're running it as a container; and it's not a difficult install if you're running it natively on the operating system. It's very easy to get Prometheus up and running, and in a Kubernetes environment, using the Prometheus operator is an absolute no-brainer. That's like the easiest way.
C
It's a simple Helm install and you're there. And getting those metrics in: there's a whole bunch of exporters for various other software components, and a lot of other CNCF projects expose a Prometheus metrics endpoint by default anyway. So it's very quick and easy to get Prometheus running and fill it up with a few million metrics on absolutely everything.
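As a rough illustration of the Helm path Steve describes, a minimal sketch of standing up the operator stack (the release name and namespace here are arbitrary placeholders):

```sh
# Add the community chart repo and install kube-prometheus-stack,
# which bundles the Prometheus operator, Prometheus, Alertmanager and Grafana.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace
```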
B
I've really liked as well that within each of those components, when you take a look at Prometheus, you can actually inspect the data in many cases, and even download it in CSV or other formats and continue to work with it, if you don't like your view or you're locked to a specific view within your organization. I'm not sure if you have any Easter eggs or tips or tricks when it comes to that.
C
It's great when I've got everything installed and all my configs set up right. But again, I've now got to start building health rules, because I want to get alerted on things, and I want to be able to see the data, so I'm now going to start building a whole bunch of dashboards, and I want to be able to link from one dashboard to another.
B
It at least gives you a framework to iterate on, and what looks good or is relevant to you right now might not always be the case, so giving you that ability to change things and make it modular is quite helpful. And, like you said, I've seen people use it for all kinds of wild things, from tracking all the times you fed the dog or went on a walk, to measuring personal fitness and anything else in that category.
B
There was one group just outside the Bay Area, in Napa Valley, and I had a friend send me some images: this winery actually uses it to measure soil wetness and humidity and all these other things too, and they have all of these Prometheus graphs projected onto their little IMAX-esque viewing deck. So really cool, a really cool and wild way to see people using it.
C
Yeah. And then, you know, this time we're talking about observability, and there's more to observability than just metrics. Metrics are certainly your starting point, but as we've said, with OTel we've got the traces, and logging is as old as the hills; most organizations will already have some logging solution, be it open source or proprietary.
C
So you alert on something when maybe a metric isn't at the value we hope it should be, like increased latency or excessive resource consumption, and you want to dive into that, and then you want to be able to go and look at the traces for that transaction, or the logs from the container. How do you...
C
How do you pull all that together? That's where it gets really difficult, doing that manual correlation, particularly with distributed systems, when a single request could well hit, you know, a dozen different microservices, possibly in different Kubernetes clusters, maybe even in different data centers. So how do you pull all that together? That's sort of the work that Asserts has been doing: adding that layer of intelligence and automation on top of these great open source tools, to help you pull it all together.
B
We've talked a little bit about how folks are using OpenTelemetry and Prometheus, and I think you've covered this a little bit, but I'd like to dive deeper into why people are using those solutions for their problems.
C
I think the key area there is that you're avoiding vendor lock-in. The open source tooling is so good now that there's no requirement to pay for proprietary agents. Being able to collect observability data and store it is commoditized; you can do that very easily and at minimal cost. Obviously it doesn't run on thin air, so there's a bit of a compute cost somewhere, but yeah.
C
You certainly don't have to be paying licenses to an organization in order to collect observability data. In fact, in many cases the free tools are better than the proprietary tools, in that there are no limits on the custom metrics that you can have, and the maturity and range of the collectors that are out there is often surpassing what is available commercially.
B
And I think that, while it might in some cases be a little bit more difficult to set up the things that you care about initially, you're going to have that much longer-term satisfaction, especially around cost. It's just a bit of a wall to go through the first time, but afterwards it's fairly smooth sailing, keeping up with the nautical terminology. Yeah, when people are using OpenTelemetry and Prometheus, what kinds of challenges have you seen folks run into?
C
There's a bunch. You've broken free from that license cost, you've embraced open source, you've got all this fantastic data. We've already touched on it: turning that data into information is a challenge; there's a lot of work there. And then there's the correlation aspect of it, being able to pull data from different places and have it all related, maybe, to the issue you're working on. And then one of the other challenges...
C
One is also data volume. Because it is so easy to go collect all this data, you end up paying a penalty on storage costs, particularly around tracing. Metrics are tiny: I think Prometheus is about one and a half bytes per sample, something like that, if you have a look at their documentation. But tracing is much more the worst offender there, with a span being about 2 KB by the time you've got all the baggage in it, and a particular transaction may be like 12 spans.
C
So you can soon be into a few kilobytes per trace, and then you've got millions, if not billions, of traces if you're a busy site. That's very quickly a lot of data, and yeah, a lot of storage cost and processing power as well.
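(As a rough back-of-envelope using Steve's figures: 2 KB per span times 12 spans is about 24 KB per trace, so even 10 million traces a day is on the order of 240 GB of raw span data per day, versus roughly 1.5 bytes per Prometheus sample.)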
C
What most people come across is: oh, this is a lot of data, we can't possibly trace everything; what we need to do is some sampling. And OTel offers various sampling strategies, but they're all a bit of a blunt instrument. You can say, well, I'll just take 10%, which is great, but then Murphy's Law comes to the fore and says: when there's a problem, the traces I need were the ones that weren't sampled, so I'm still blind.
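For reference, that blunt take-10% approach looks roughly like this in an OpenTelemetry Collector config (a minimal sketch; the backend endpoint is a placeholder):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
processors:
  probabilistic_sampler:
    sampling_percentage: 10   # keep ~10% of traces, chosen at random
  batch:
exporters:
  otlp:
    endpoint: backend:4317    # placeholder trace backend
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, batch]
      exporters: [otlp]
```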
B
So, we've heard a lot about OpenTelemetry as well, and I remember the early days of folks focusing on tracing and being told: you can set up tracing, but it's something that you have to instrument your application for. Have you seen that change when it comes to OpenTelemetry, and the amount of effort needed to get started looking at function calls and delving a little bit deeper when adopting OpenTelemetry?
C
Yeah, OpenTelemetry has done a lot of work on the tracing side. As I said, it is the second most active project, and there's a lot of automation in there now. It tends to be very language specific: some languages make themselves easier to instrument automatically, and some of the more compiled languages, like Go, are a little more difficult to do. But certainly something like Java, which has had a standard for the Java agent since about Java 1.5...
C
...if you can remember that far back. So there's a standard API for it, and it's very easy to automatically instrument your Java application, say. For some of the others the automation is maybe not quite as advanced, but certainly for things like Go there's a whole bunch of middlewares, so if you're using Gorilla or Gin to do your request routing, there's wrappers for that.
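As a sketch of what those Go middleware wrappers look like in practice (the service name is a placeholder, and span export still requires an OTel SDK exporter to be configured elsewhere; for Java, by contrast, auto-instrumentation is typically just `java -javaagent:opentelemetry-javaagent.jar -jar app.jar`):

```go
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
	"go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin"
)

func main() {
	r := gin.Default()
	// One line of middleware wraps every route in a server span.
	r.Use(otelgin.Middleware("user-service"))
	r.GET("/login", func(c *gin.Context) {
		c.JSON(http.StatusOK, gin.H{"ok": true})
	})
	r.Run(":8080")
}
```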
B
I think that's nice. I know we've also gotten a little bit further along with things like transparent proxies for service meshes and those kinds of concerns as well, so it's great to know that it doesn't take as much time or effort to get those things instrumented, and that we're starting to see more capability right out of the box.
C
That does add another layer of complexity, though. It's like all things in engineering, swings and roundabouts: you gain in that you haven't got to go and reconfigure each service manually, but then you're adding another layer of complexity with the service mesh. But then a service mesh can do lots of funky things for you as well.
B
I saw a question come in asking whether OpenTracing is now within OpenTelemetry, and yes...
C
Now you're going back into the dim mists of time, in computing terms anyway; probably like a year ago in real time. So, OpenTracing was probably the first open standard for doing distributed tracing, and a lot of the commercial products are actually built on top of those OpenTracing standards. And then OpenTelemetry came along, and it has a broader remit than just distributed tracing.
C
It does also include metrics and logs, although the support for metrics and logs isn't as mature as it is for tracing. If you actually go and look at the various project statuses, most of them are pretty much there with mainline, generally-available releases on the tracing side; you look at metrics and logs, and there are still a lot of alphas and betas with don't-deploy-this-in-production type caveats on them. So yeah.
B
Amazing. It's wild to look back and see which projects within the CNCF have gotten merged or archived, and things of that nature. I remember reading about OpenCensus and OpenTracing, also back in the dim mists of time, when I was working at Walt Disney Studios, and getting really excited seeing those things; but it was so many projects to kind of put together. So now, seeing those culminate together as one, as OpenTelemetry, I think was really helpful.
C
I also like the concept they have with the OpenTelemetry Collector. It sort of acts like a patch board, that's the best way to describe it, I suppose. Your various services, or your service mesh, send the data to the Collector; you can set up various receivers in the Collector, so it'll receive the metrics and trace spans, and then, optionally, you can configure processors, so it can actually massage the data before passing it on.
C
So we can do that, and we'll get on to what Asserts is doing with that in a little while. And then it can dispatch that data to one or more backends. So if you've got Zipkin and Jaeger, there you go: you don't have to choose, you can actually have it go to both, or off to one of the cloud providers; you can use Google Cloud Trace or AWS X-Ray.
C
You could use one of those as your trace store. Or, of course, Jaeger, which is probably the most popular one in the open source world.
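A minimal sketch of that patch-board idea in Collector YAML (the endpoints are placeholders; Jaeger here is fed over OTLP, which it accepts natively in recent versions):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
exporters:
  zipkin:
    endpoint: http://zipkin:9411/api/v2/spans
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [zipkin, otlp/jaeger]   # fan out to both backends
```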
B
When I was working in a previous role, one of my colleagues was talking a little bit about annotations. They were implementing some service mesh workloads, well, I'll say that five times fast, and they were talking about annotating that and losing that annotation about halfway through.
B
So they didn't get to see that full traceability until they had the aha moment and said: oh no, this actually needs to be annotated each step of the way. As you're passing along this thread, or this call, that's something that you have to be mindful of.
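For context, the trace context that has to survive each hop is, in the W3C standard OpenTelemetry uses, a single `traceparent` HTTP header; drop it at any hop, as in this story, and the trace splits in two. An example value (the IDs below are the W3C spec's sample values):

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
# format: version - trace-id (32 hex) - parent span-id (16 hex) - flags (01 = sampled)
```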
B
And with that, I'd like to transition into where it is that you work, talking a little bit about Asserts and what you're doing with OTel and Prometheus. But really, I'd first love to hear: what are you doing at Asserts? What's the company about?
C
Yeah, so, as I said, we're working on providing a layer of intelligence and automation on top of these great open source tools. Both the founder and I previously spent time at AppDynamics, so we've got this background in APM, monitoring, observability, call it what you will, I suppose. We sort of had this epiphany that there's all this great open source tooling out here now to collect your observability data.
C
So, you know, don't reinvent the wheel; why would you do that? Just use the great open source stuff that's there. And we realized the problem, then, is turning all that data into useful information and doing the correlations, and we thought: well, how can we help people do that?
C
So, as we've said, we've built this layer of intelligence and automation on top of these great open source tools that provides that correlation information and also helps manage the data, so you're not drowning in data; we distill the data down. So let's talk about the metric side of things first. You've really only got two use cases for the metrics. In the short term, you want as much as possible for troubleshooting.
C
In case anything goes wrong, you want fine-grained metrics on absolutely everything; but it's expensive to keep that long term. And the other use case, of course, is long-term analysis and reporting. You know, we've got a CI/CD pipeline, we're throwing out release after release after release; it'd be useful to know whether we're making things better or worse. Are the services getting faster and less error prone, or are they getting slower and more error prone?
C
Storing everything forever is expensive, so we've essentially automated that. We take your existing Prometheus, which typically has 15 days of retention; if you're still troubleshooting after 15 days, you've got other problems. So we essentially run queries on that data, run it through a set of rules, and then store low-cardinality data long term. That knocks the data volume down to about 10 percent of what it was, so it's really easy, or relatively easy, to store that long term.
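The plain-Prometheus analogue of that query-through-rules step is a recording rule; a minimal sketch (the metric and label names here are hypothetical):

```yaml
groups:
  - name: long_term_rollups
    interval: 1m
    rules:
      # Collapse high-cardinality per-pod series into one low-cardinality
      # per-service rate that is cheap to retain long term.
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
      - record: service:http_request_errors:ratio5m
        expr: |
          sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
            /
          sum by (service) (rate(http_requests_total[5m]))
```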
C
So then you can still do your trend analysis: hey, have we made this service better? More errors, fewer errors? Is it going faster, is it going slower? And also for customer metrics: are people buying more, are they buying less, is customer engagement getting better as performance improves?
B
I like that, and I like what you said around just storing the right data and actually being able to act on it. You know, if I fill my garage with all of these things or packages, if I just keep pushing things into there because they're important, okay, that's great, but then I've filled up my garage and I can't park my car there. You can use the same analogy with a closet or any kind of room.
B
But you know, if I just pulled in all of the mail, that would include my junk mail too. So I like that you're taking the time to focus on making this data actionable, and really being able to focus on that.
C
The other thing we do is help people get started. Like I said earlier, it's really easy to stand up Prometheus and set up a bunch of collectors, exporters, and scrape configs, and if you're using the Prometheus operator on Kubernetes, it's even easier. So yeah, you've got this data, but you've got no real way of understanding and visualizing it. So the Asserts product ships with a curated library of pre-built dashboards and health rules for all the common technologies, so from day one you can be effective.
C
You can actually be productive and start using the data you're collecting without having to spend weeks or months building dashboards and writing health rules. Of course, it's not going to be one size fits all; there's always going to be some uniqueness to each environment, so of course you can still write your own. And the dashboarding we're building, again, just leverages the open source: we embed Grafana in the product.
B
That's really helpful, and I think folks would be overjoyed to hear: hey, we can save you a couple of weeks or months of time. Even for folks that have implemented OpenTelemetry and Prometheus and Grafana and these other tools, do you help them make their stacks better? I won't name who, but I have definitely heard folks say: hey, we set this up four years ago, and we really haven't touched these rules since. Is that another kind of problem case that you help solve for?
C
Yeah, so since it's a curated library, with each new release there may be updates to the rules as things change; you get newer releases of the software components you're running, and they may behave slightly differently, so those rules are constantly tweaked and massaged to be most effective. Like I say, you always have the ability to override, tweak, tune, or disable one of our health rules.
C
If it's nagging you, and you think, oh, actually, I don't need this, this is fine in my environment, I don't care about that, you can squelch it down and turn it off. And say you've got something unique: maybe in your environment there's a particularly important message queue, and if the queue depth is greater than five, oh dear, we're in trouble. That's a rule very unique to you; okay, yeah, you can just add that in there and you'll get notified about it.
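That queue-depth rule is a one-liner as a standard Prometheus alerting rule; a hypothetical sketch (the metric name is made up):

```yaml
groups:
  - name: custom_health_rules
    rules:
      - alert: CriticalQueueBacklog
        # Hypothetical gauge exported by the message broker.
        expr: important_queue_depth > 5
        for: 2m          # require the condition to persist before firing
        labels:
          severity: critical
        annotations:
          summary: "Queue depth {{ $value }} exceeds 5 on the critical queue"
```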
C
Yeah, well, the way we handle that is: as you know, in any large system there's always something running a little hot or going a little slow, so you get this constant chatter of alert notifications, and the vast majority of them probably aren't actually that important. There may be something that could be tuned later, but you certainly don't want to be woken up at three o'clock in the morning to be told: hey, CPU consumption on this container was a little hot for a minute. Oh, who cares?
C
So we anchor alerting on SLOs. If it doesn't impact that overall SLO, then we're not going to alert you. We still record that those things happened, but you're not going to get that emergency page or Slack message or whatever at four o'clock in the morning telling you to panic; only if the SLO is in danger of breaching, or has actually breached. So we monitor the SLO burn-down, and if we see a rapid acceleration in burn rate, we try to get ahead of it rather than wait for it to smash through.
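That burn-rate idea can also be expressed in plain Prometheus; a sketch of the common multiwindow pattern for a 99% SLO (1% error budget), reusing the hypothetical error-ratio recording rule from above plus an assumed 1h-window variant:

```yaml
groups:
  - name: slo_burn
    rules:
      - alert: HighErrorBudgetBurn
        # Fires when errors burn the 30-day budget ~14x faster than allowed,
        # over both a long and a short window so a brief blip does not page.
        expr: |
          service:http_request_errors:ratio1h > (14.4 * 0.01)
            and
          service:http_request_errors:ratio5m > (14.4 * 0.01)
        labels:
          severity: page
```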
B
I've seen folks implement some alerting, implement SLOs, and have some key metrics or key uptime, deliverability, and reliability factors that they're trying to aim for, yet they will set monitors and alerts on objectively the wrong thing. And, like you said: this container was running hot for two minutes, but Kubernetes is going to reap that and bring it back anyway.
B
So it's not that much of an issue; or the autoscaler just hasn't kicked in yet to adjust for this influx of traffic that we're looking at. I think there have been hours of those stories, many of them fun in retrospect, not in the moment at all. But for folks looking at implementing monitoring and alerting, making sure it's the right kind as well, that's also really helpful, and it sounds like you have that.
C
There's another layer on top of that. One of the really clever things, and I don't understand how they do it, it's some very clever engineers that wrote it all, but one of the very clever things we do is analyze all the metric labels. A Prometheus metric obviously has its value, but it also has a whole bunch of labels describing what the metric is about, so we analyze those metric labels. And similarly, traces have tags, which are the metadata about the trace.
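For illustration, this is the kind of metadata being mined (the label values are hypothetical; the span attribute names are from the OpenTelemetry semantic conventions):

```
# A Prometheus series: one numeric value, everything else is labels.
http_requests_total{service="user", pod="user-7d9f", namespace="prod", node="ip-10-0-1-12"}  1027

# A trace span carries analogous key/value tags (attributes), e.g.
#   service.name="user", db.system="postgresql", net.peer.name="users-db"
```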
C
It's not just the timing, so we analyze the trace tags too. From that we can build up a graph database of how everything is interconnected. It's not just service to service, which is what tracing gives you; it's also the stack that it's running on. So I like to think of it as a four-dimensional graph of your application topology: it's service to service, which is your X to Y; the stack, which is your Z, the depth; and then we record it all over time.
C
So say a request came back slower than we wanted; we didn't want that, we definitely wanted it at 500 milliseconds or less. So what went wrong? Without that graph, you're relying on maybe your own knowledge to know that, hey, this user service uses this database and this cache, and piecing it together that way, or having to ask a colleague. But hey, Asserts has done this for you: when that incident is generated, it automatically traverses that graph database and collates everything together onto one dashboard, so you have all the information you need to troubleshoot.
B
Everybody loves a good scavenger hunt, especially with their metrics, trying to figure things out. I think that's a great point for when SLOs go bad, or even if you didn't break an SLO but you had a really impactful event and your team was still kind of scrambling to meet that SLO.
C
Like I said, when that incident happens, it's always like: okay, well, that login service is now taking a lot longer than the target of our SLO, so that generates an incident, and you'll get notified. We just use the standard Prometheus Alertmanager, so that can go to whatever...
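The fan-out Steve alludes to is standard Alertmanager routing; a minimal sketch (the keys and channel are placeholders):

```yaml
route:
  receiver: pagerduty-oncall
  routes:
    - matchers: ['severity="warning"']
      receiver: slack-alerts
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <your-pagerduty-integration-key>
  - name: slack-alerts
    slack_configs:
      - api_url: <your-slack-webhook-url>
        channel: '#alerts'
```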
C
...all the usual candidates there, PagerDuty and the like, so you'll get notified. Then you can go into the dashboard, and as I say, it's all on that one dashboard. So the SLO was against an endpoint on the user service, but that user service has dependencies. It obviously runs somewhere: if we take Kubernetes, it's a service, so there's a pod, and that pod will be running on a node within a cluster. But that service may use a cache, it may use a database, it may call a whole load of other things; there could easily be a dozen microservices involved in that. So which one is causing the problem, or which more than one? You want to be able to go and investigate and check everything out, and as I said, the graph database we've built understands all those relationships, all those services, and the stack.
C
So what it does is traverse that database and pull in everything that's immediately connected around it, and put all of that onto one dynamic dashboard for you, so you're not having to fish around trying to find out what's going on where. And if any of those dependent services have had any issues, they're highlighted as well, so you can see that maybe it's not actually our user service itself: it's reliant on this database, and this database was running a little slow. So then you can go and investigate that.
B
I love it when it's just a capped-resource kind of problem. It's much less fun when it's like: oh, this null type is undefined, what is going on?
B
It's great to have that telemetry to dive deeper. Amazing. For folks that have any questions, I'd love to urge you to throw those into chat and we can get to those. I've got a couple more to go, but we'd love to hear from all of you and get some more questions answered.
C
Yeah, in a similar vein, on that theme of troubleshooting: as I was saying, you've got all your metrics and all the dashboards immediately accessible from that dynamically created dashboard, but equally, from there you can jump out into logs. So you've got your existing logging solution, maybe you've got an ELK stack; well, we'll jump out to ELK, and you'll arrive in ELK with a deep link.
C
The time range and the search query are already filled in for you, so you're straight away looking at the appropriate container logs or subsystem logs, whatever it happens to be that's linked across. And then the same thing with the tracing; and we do a really quite clever thing with the tracing. As I said, we have our own OTel Collector module, and we're using that for two purposes, really.
C
First of all, we're analyzing all the trace tags; the OTel Collector module we've got essentially creates a bunch of Prometheus metrics from all the spans that it sees, and they get scraped in, so that helps us build our graph. But also, we're looking at the timings, so we're building baselines, multi-period baselines, for each endpoint.
C
So, therefore, we know whether a particular call to an endpoint is normal or not: is it slower than normal, or was it normal? Because, if you think about it, ideally most of your requests will be handled in a prompt and error-free manner. They won't be interesting; they'll be perfectly normal requests that came through, not a problem at all, and you're generally not that interested in those. It's only the slow and error ones that you want to go and delve into, and go: oh, well, why did this one go wrong?
C
And if you think about it, if you've got an SLO of 99%, that means in the worst-case scenario you're only expecting one percent of those traces to be interesting, to be slow or erroneous. So heck, why am I trying to collect all of them, or even 10 percent of them? So what our OTel Collector module does then, having sent all its metrics up...
C
...is call back and pull down the baseline information. So, for each endpoint that it sees, it knows when a span comes in whether that's slower than normal or not, and therefore whether it's interesting.
C
And if it's just a regular one, then we just drop it, because, hey, we don't need to fill our storage up with all these perfect traces. So, taking the 99% SLO type argument, that's going to reduce your stored traces down to about one percent of trace volume, which...
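The generic open source analogue of keep-only-the-interesting-traces is the Collector's tail-sampling processor; a minimal sketch that slots into a traces pipeline (Asserts' baseline-driven logic is its own, and the threshold here is made up):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans until the whole trace is seen
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]   # always keep failed traces
      - name: slow
        type: latency
        latency:
          threshold_ms: 500       # keep traces slower than the target
```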
B
Good, I came in under the limit on that front. Yeah, I like that you talked a little bit about the ELK stack as well, and I'd like to focus on that data collection strategy. So does that mean you don't necessarily have to change it with OpenTelemetry and with Prometheus? Can you keep the same measures and measurements that you used to have?
C
Absolutely. The idea is we sit on top: if you've already implemented the open source, great, well done, we love you, and we just sit on top of what you've already done. We're just providing that layer of intelligence and automation to make your life easier, on top of the hard work you've already done, and freeing you from having to manage and maintain hundreds of dashboards. You'll probably be able to knock that down to just a handful of ones very specific to your business; the rest, the commodity dashboards...
C
...we've done for you, and the same thing with those health rules. And, as I'm saying, we solve that real problem of how you correlate across the distributed system. How do you go from one service to another, and from traces to logs and back to metrics? How do you manage all of that? Well, hey, we're doing that for you.
B
I think that's one of the most painful things that I've dealt with in other roles and responsibilities: being told, okay, you have to rip out everything that you have installed, and you install this new operator, and no, you have to use our agents, and everything like that. It's so much nicer to be able to leverage what already exists, and then it makes for an easier adoption path, at least in my experience.
C
Yeah, that's what I think the open source tooling is all about: you can implement these great open source tools and you're not tied in anywhere. You can use that data wherever you want: you can choose to send it off to some licensed cloud software company, or you can use one of the big cloud providers as a service, you know, Prometheus as a service, which the big cloud providers do, or you can run it yourself.
C
The freedom is yours; you can choose to do with it what you want, and that's definitely our philosophy. We're not saying: all that hard work you've done with open source, rip it out and start again, install our agent. No, we just want to sit on top of the data you've already got and allow you to do a lot more with it.
B
I think one thing that helps too is that I took a look at the Asserts site, and one thing I thought was interesting was the intelligent sampling that you have, because cost is a core concern a lot of people focus on, especially right now with workforce reductions and everything else. So, by utilizing things like that with Prometheus and OpenTelemetry, can you talk to some of those cost-cutting concerns that kind of come into play?
C
Yeah. Like I said, by doing the intelligent sampling of those traces, you're going to significantly reduce the amount of storage you need. Traces are big and there's a lot of them. It's a horrible balancing act, because if you collect all the traces, there's a lot of them and it's really expensive: you've got egress charges if you're sending them somewhere, and it'll certainly break through the free tier of that cloud provider.
C
But turn the dial down, and when you go looking for that one trace: oh, we didn't get that. So yeah, doing that intelligent sampling is the best of both worlds. It's going to really compress your data down, so you don't have the cost associated with trying to process and store all those traces; but having compressed it down, you've still got the really interesting ones for when you're trying to do that problem solving: why didn't that work, where did that go, what threw that error? You can go and look at it.
B
My last question, and then I'd love to tie it up and talk a little bit more about any calls to action, or anything that you'd like to point out with Asserts. When it comes to understanding the relationship between your data and automating that correlation, are there any tips, tricks, or things that Asserts offers that help out with that, when it comes to OpenTelemetry and Prometheus?
C
You know, it's that graph database that we build by analyzing the metric tags and the tracing tags to build up that relationship model, so we know what services are talking to what, and where they're running. That makes your life a lot easier; you're not relying on that tribal knowledge. You might be an engineer trying to troubleshoot a payment gateway, but you're dependent on...
C
...some other services. Right, but I don't know how they're deployed, so then you've got to call somebody else in, and then they're going: oh yeah, but that uses this database thing, and I know nothing about that.
B
Amazing, amazing. Yeah, no, I can't agree more with wanting to jump more into the code and focus less on, you know... I like the idea of code, but writing useful code is always more helpful.
B
Some interesting things I've seen within the community are like the OpenTelemetry demo, which I'll link in the chat for folks: many different programming languages, and options to start understanding what's possible with OpenTelemetry. And then you can tie that together with something like Asserts, or craft up a dashboard that's really helpful for you. But I like that the community is focused on providing an actual use case of how to put these things together, so it's not just wishing you well and kind of leaving you.
C
Obviously, if you're going to do a more serious production install, rather than running it on Docker Compose, perhaps the best way of doing it is the Helm chart, so you can deploy it into Kubernetes, and then you have all the benefits that Kubernetes gives you, of scalability and self-healing and the like.
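For reference, both paths Steve mentions come straight from the OpenTelemetry demo's own docs; roughly (the Helm release name is arbitrary):

```sh
# Quick local run with Docker Compose
git clone https://github.com/open-telemetry/opentelemetry-demo.git
cd opentelemetry-demo && docker compose up -d

# Or deploy to Kubernetes via the community Helm chart
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-demo open-telemetry/opentelemetry-demo
```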
B
Awesome. Well, I don't see any more questions rolling in, so with that we'd love to give a final call for them. But otherwise, Steve, do you have any parting thoughts, wisdoms, mantras, or anything else that you'd like to share before we spin down today?
C
B
There's a good podcast I'll link to later, where somebody went into, for like an hour, the deep mechanics of why that actually works, the state machines and everything else; but that's another fun conversation for another time. Awesome, well...