Cloud Native Computing Foundation Online Programs, 24 Jan 2023

From YouTube: Is Kubernetes Monitoring Flawed?

Description

A 3-node Kubernetes cluster with Prometheus will ship around 40k active series by default! Do we really need all that data?? The current state of Kubernetes open source monitoring is in need of improvement. High churn rate of pod metrics, proliferation of metrics with low usage, and configuration complexity are some of the issues that need to be addressed.

I discussed this topic with Aliaksandr Valialkin, CTO at VictoriaMetrics and creator of the open source project. We discussed the common problems, as well as directions and best practices to overcome some of these complexities as individuals and as a community. We also discussed VictoriaMetrics open source project and how it addresses some of these challenges.

Aliaksandr is a Golang engineer, who likes writing simple and performant code and creating easy-to-use programs. Sometimes these hard-to-match requirements work together, like in the VictoriaMetrics case.

The podcast episodes are available for listening on your favorite podcast app and on this YouTube channel.

We live-stream the episodes, and you’re welcome to join the stream here on YouTube Live or at https://www.twitch.tv/openobservability.

Follow us on Twitter @openobserv to get the live stream times and other updates, and to pitch in with your thoughts and comments.

Have you got an interesting topic you'd like to share in an episode? Reach out to us and submit your proposal at https://forms.gle/9LDkYCmegyS5D8Li7

Dotan Horovits
============
Twitter: https://twitter.com/horovits
LinkedIn: https://www.linkedin.com/in/horovits/

Aliaksandr Valialkin
===============
Twitter: https://twitter.com/valyala
LinkedIn: https://www.linkedin.com/in/valyala/
VictoriaMetrics: https://victoriametrics.com/
On GitHub: https://github.com/VictoriaMetrics/VictoriaMetrics
VictoriaMetrics community channels - https://docs.victoriametrics.com/#community-and-contributions

Resources
=========
Why Prometheus cannot query remote storage in an expected way via remote_read protocol - https://github.com/prometheus/prometheus/issues/4456
VictoriaMetrics: scaling to 100 million metrics per second https://www.youtube.com/watch?v=xfed9_Q0_qU

Chapters
========
00:00 show intro
02:07 topic and guest intro
03:13 monitoring microservice system, app and communications
05:43 high churn rate for pod metrics
12:02 Kubernetes produces too many metrics by default, most of which are unused
17:06 recommended listing of metrics
21:50 removing unused metric labels to reduce cardinality
24:16 Prometheus native (exponential buckets) histograms
26:49 Configuration complexity with multiple deployments
33:16 OpenTelemetry and OpenMetrics open specifications
36:11 collecting system metrics and application metrics uniformly
40:20 VictoriaMetrics essentials
48:46 VictoriaMetrics extensions beyond Prometheus
54:06 a full stack monitoring collection, analysis and alerting
56:09 how to join the VictoriaMetrics community
58:05 industry update: 2023 cloud native predictions post by CNCF CTO
59:16 outro

A

Hello everyone, and welcome to OpenObservability Talks. I'm your host, Dotan Horovits, and here at OpenObservability Talks we talk about anything DevOps, observability, and open source, so may the open source be with you, and Happy New Year, everyone. This is the first episode of 2023, and we've got a whole new look for the episode, for those who remember the yellow theme. I hope you like it, and do let me know on Twitter at @openobserv or at @horovits.

A

I also just got the end-of-year podcast statistics from Spotify, and I'm glad to share that we have listeners from over 40 countries. We have more than tripled our followers, a 270 percent increase, and several hundred listeners ranked us in their top 10 podcasts. So thank you so much for joining me on the show throughout the years, for following us, and for giving us a high ranking. The show is available on all the popular podcast apps, not just Spotify, as well as on YouTube, if you want to watch us chatting.

A

So if you haven't followed yet, do go ahead and join us. I'd also like to thank our sponsor, Logz.io, the cloud native observability platform. Logz.io takes the best-of-breed open source projects, such as Prometheus, OpenSearch, and Jaeger, and offers them as a unified observability platform built for scale. For those joining the live stream on YouTube or Twitch, feel free to share questions and comments in the chat. This makes things much more interesting for me and the guests. Let's move on to today's episode. Today, I'd like to talk about Kubernetes monitoring.

A

It's astonishing how much data is emitted by a Kubernetes environment and how complex its monitoring can get, and it's time to talk about the unspoken challenges. So I invited Aliaksandr Valialkin, CTO at VictoriaMetrics and a veteran of the monitoring world. He's also the creator of the VictoriaMetrics open source monitoring solution, which we'll also discuss on this episode. So I'd like to invite Aliaksandr to the stream. Aliaksandr, how's it going?

B

Hello, everybody. Great. Let's talk about challenges in Kubernetes monitoring. I've really thought about this.

A

We actually had an interesting chat a few months ago at the Open Source Monitoring Conference in Germany, and I'm really glad for the opportunity to have this fireside chat with you on the show today. As I presented in the opening, you live and breathe metrics and monitoring. So maybe you can tell us a bit about your experience with monitoring Kubernetes as a practitioner, and the challenges that you've encountered.

B

Yeah. Users of our software, VictoriaMetrics, frequently monitor Kubernetes and Kubernetes applications, and that's why I'm aware of various issues related to this monitoring.

B

Let's start by enumerating and discussing these issues. The first issue is Kubernetes itself. I mean, Kubernetes popularized microservices architecture instead of monolith architecture. When you monitor a monolith,

B

there's only one application, the monolith, and each instance of this application exports some application metrics and system metrics, and the number of such metrics is fixed. If you split the monolith into many microservices, then every microservice needs to export its own set of system metrics, I mean CPU, memory, network usage, and so on, and every microservice also needs to expose its own set of application metrics. Additionally, on top of this, every microservice should expose communication

B

metrics, I mean metrics related to the communication between microservices: latencies, RPS, and so on. So when you switch from a single-node application to a microservice application, the number of metrics explodes many times, 10 times and more. Since Kubernetes popularizes microservice architecture, this means that when you move to Kubernetes from plain old hosting providers and hosting services, you usually start exposing many more metrics.

A

And I think there's also a point maybe worth mentioning: it's also the frequency. It's not just that we break the application into many, many microservices. It's also that each one has its own life cycle and its own releases. Obviously, making it smaller and more agile means more frequent releases, and that's a major contributor as well, right?

B

Correct. The next major contributor to metrics churn and metrics cardinality is, as you said, frequent deployments in Kubernetes. With every deployment in Kubernetes, new instances of pods are created, and each pod in Kubernetes has its own unique name, which doesn't intersect with the previous names of such pods. If this pod name is included as a label in the metrics exposed by the pod,

B

this means that with every new deployment, old metrics stop receiving new samples, and new metrics appear and start receiving new samples. This situation is named churn rate. If redeployment occurs frequently, this leads to a high churn rate, where old metrics, I mean time series, are substituted by new ones at a high rate, and this high rate...

A

It's worth mentioning, because what you're talking about is that usually the pod name is one of the labels that you refer to, right? And then you essentially get a new time series. Just to make sure that the listeners understand the challenge here and what's changing, if you can elaborate.

B

Because every time series in every monitoring system is uniquely identified by its name plus the set of its labels, and if one of these label values changes, a new time series is created.
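To make the point about series identity concrete, here is a minimal Python sketch that is not taken from the episode. It assumes a series is keyed by its metric name plus its full label set; the metric name, label names, and pod names are hypothetical.

```python
def series_key(name, labels):
    """Identify a time series by its metric name plus its full label set."""
    return (name, frozenset(labels.items()))


def count_series(samples):
    """Count distinct time series among (name, labels) samples."""
    return len({series_key(name, labels) for name, labels in samples})


# Hypothetical pod names before and after a redeployment.
before = [("http_requests_total", {"app": "api", "pod": "api-7d9f-abcde"})]
after = [("http_requests_total", {"app": "api", "pod": "api-5c4b-fghij"})]

# The pod label value changed, so the redeployment produced a second series.
print(count_series(before + after))

# Without the volatile pod label, both samples map to the same series.
stable = [
    (name, {k: v for k, v in labels.items() if k != "pod"})
    for name, labels in before + after
]
print(count_series(stable))
```

The first count is 2 and the second is 1: the only thing that changed between deployments was one label value, which is what drives churn.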

A

Yeah, so the physical pod, if you can say "physical" about a pod, disappears, and a new pod is spun up to carry on the workload. But if by design the pod name is used as a label, that creates a new time series. Maybe this is a bit of an anti-pattern, because oftentimes people resort to the pod name rather than looking for something that will be more consistent across the life cycle, which is the actual logical workload, right? What did you find useful in that respect?

B

If we look at Kubernetes deployments, there are replica sets and there are stateful sets, and if you use stateful set deployments, then every pod has consistent naming. The name consists of the original pod name from the deployment plus the number of the pod, counting from zero up to the number of pods deployed in the stateful set. These names stay static and don't change when you redeploy stateful sets.

B

This means that such naming doesn't introduce churn rate during deployment, but...

A

So the pattern that you found is to just use the replica set, or something of that type, as the one that carries on across the...

B

I would say that this can be used as some kind of hack, to abuse stateful sets for reducing churn, but I think that the general solution should be implemented by the Kubernetes developers themselves. They should provide some stable labels for regular deployments, not only for stateful sets, so that we as monitoring solutions could use these stable labels instead of a name which changes with each deployment. This would reduce, or even eliminate, the churn rate in Kubernetes.

A

Sounds interesting, yeah. That's why I was referring to a logical name, because I think a replica set or stateful set is a hack, as you said, but maybe the ability to attach a logical name that would carry on would be something more permanent. That can be interesting. And I think there's an interesting comment here from the audience saying that the UUID is always a label. So that's another suggestion from the audience: the UUID.

B

The UID also changes with each deployment. The UID is a unique ID of any object in Kubernetes, and this ID changes when you redeploy pods, because new pods are obviously different logical objects compared to the previous pods. That's why the UID also changes and also introduces churn rate, so...

A

That's not a solution.

B

The fix is to use some fixed name per pod, which doesn't change when you redeploy the pod.

A

Yeah. I think another point that is worth discussing, and I think you mentioned it briefly before, is the fact that when we have Kubernetes, first of all, Kubernetes itself is, I guess, a new,

A

let's call it, critical system that we now need to monitor. Just like back in the days we needed to monitor the bare metal, or critical systems, and obviously the operating system, now we have Kubernetes itself and lots of components, both on the node level and on the control plane level: kube-proxy, etcd, and whatever. And it also comes with a pack of metrics of its own.

A

One of the things that is astonishing, I think, is the number of metrics that you'd find by just installing the bare minimum, I don't know, a tiny three-node cluster, and checking the default metrics that come out of the box, if you use all the defaults there. I know that we've experienced that with our users: often they quickly start running out of quota just because of that. But it's interesting to hear your experience with that.

B

Yeah, by default Kubernetes exposes many metrics. Let's look at the Prometheus Operator, or the VictoriaMetrics operator, which does almost the same.

B

They come with additional components for monitoring Kubernetes, such as kube-state-metrics and node exporter.

B

They collect metrics from the different components we mentioned, which are included in Kubernetes, and the number of such metrics is usually very big from the start. If you monitor a Kubernetes cluster consisting of three nodes, you usually end up with tens of thousands of metrics in this cluster, and...

A

We do the same, by the way. We also have an agent that is based on OpenTelemetry and some other open source best practices. But, you know, based on cAdvisor, kube-state-metrics, and the node exporter, you get a gazillion metrics, and...

B

The issue is that the majority of such metrics aren't used anywhere. They aren't used in dashboards, and they aren't used in alerting rules or recording rules. According to a Grafana study, only 25 percent of these exposed metrics are actually used, and 75 percent of them are never used anywhere. So it is possible to reduce the load on your monitoring system by at least four times by removing these unused metrics.
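As a rough illustration of that arithmetic, the following Python sketch compares the metric names a cluster exposes against the names referenced by dashboards and rules. It is only a sketch under stated assumptions: the two name sets are made up for the example, and it counts metric names rather than time series, so it approximates the load reduction rather than measuring it.

```python
def usage_report(exposed, referenced):
    """Split exposed metric names into used and unused ones."""
    used = exposed & referenced
    unused = exposed - referenced
    return used, unused


# Illustrative inputs: names scraped from targets, and names found in
# dashboards, alerting rules, and recording rules.
exposed = {
    "node_cpu_seconds_total",
    "kube_pod_info",
    "go_gc_duration_seconds",
    "process_open_fds",
}
referenced = {"node_cpu_seconds_total", "up"}

used, unused = usage_report(exposed, referenced)
used_fraction = len(used) / len(exposed)

print(f"used: {used_fraction:.0%}")
print(f"dropping unused names shrinks the set {1 / used_fraction:.0f}x")
print(sorted(unused))
```

With one of four names in use, the used fraction is 25 percent and dropping the rest shrinks the set fourfold, matching the ratio quoted in the conversation.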

A

That's actually astonishing. You're saying that only 25 percent, only one fourth of the metrics that are exposed by the defaults, as we said, is actually put to use, from what you see.

B

That's correct, and I think that there is a way to improve this situation. I believe the proper way is that the Kubernetes developers should sit at the table and think about which metrics they need to expose, compose the set of essential metrics to expose from Kubernetes components, and create some kind of standard for these metrics. The standard should describe where these metrics should be collected, and these metrics should be exposed by Kubernetes components which are included in every Kubernetes installation.

B

So third-party monitoring solutions should not need to install additional components for monitoring Kubernetes itself, because right now you need to install additional components. As you already mentioned, you should install

B

components such as kube-state-metrics. These components aren't included in Kubernetes itself; they are third-party components. I think that all the metrics from such components should be included in Kubernetes itself and Kubernetes should expose these metrics, and the set of these metrics should be carefully maintained. I mean the list of these metrics, and this list should not contain metrics which aren't used anywhere.

A

That's actually a tricky question, because most of my users, at least the ones that I encounter and talk to, and that we also see in my company, don't really know what they need, and usually they just go for the defaults, because that's the safest bet: just bringing everything in and then seeing. And also, it's not static. It keeps on growing. I think one of the most astonishing stats that I saw, and I think you brought it up

A

also in our earlier discussions, is that since 2018 the amount of metrics exposed by Kubernetes has increased by three and a half times. That's astonishing.

A

That means that it keeps on growing at an immense rate, and expecting the end users to keep up with this growth is very difficult. I think this is where we as a community, and obviously the community leaders, which are the vendors and the open source projects, the Kubernetes SIGs, the special interest groups, and the relevant working groups, can maybe help provide guidance on what is used and what is not.

A

For example, since there's no such standard, we at Logz.io curated a list that is open source. It's part of the Helm charts that we provide, and you have the list both for out-of-the-box Kubernetes and for AKS, EKS, GKE, all the managed versions of Kubernetes, to recommend at least what we find useful among our users. Should you use kube-system this or that, or kube-dns, or others? That's what a curated list gives you.

A

But the question is whether what we see with our users is similar to what you see with your users, and whether we can provide some sort of overarching best practices for the entire community and industry.

B

I think that for system metrics, such as CPU usage, memory usage, and so on, the list of such metrics isn't too big, and it is quite easy for the Kubernetes developers themselves to create a list of such metrics and to provide a single or multiple exporters in every Kubernetes installation which export these system metrics. As for application-level metrics, it's a harder question, because every application has a different set of metrics and, as you say, different sets of users need and use different metrics from an application, so it may be hard to provide a curated list which fits everybody. In this case, I think the way to go is that every application which is used in Kubernetes should provide its own set of curated metrics, or multiple sets for different use cases.

A

That's an interesting question. I'm debating it myself. Again, as I said, we resorted to having public Helm charts on our GitHub repo with the best practices that we found. But then again, you have different components in it, obviously ones that use kube-state-metrics, Prometheus node exporter, and others, and each one brings its own set of metrics. The question is whether these can come with some sort of guidance, at least in my sense of things.

A

But it's a question, and I think it is definitely worthwhile to at least open this channel of discussion between them. That's on the standardization side, and there's also the tooling side, to enable this removal of irrelevant metrics, and maybe adding to the metrics themselves. So there are metrics that might not be used altogether. The other level that I found very useful is removing labels, because that goes to the cardinality challenge.

A

Many of the built-in metrics come with lots of labels, some of which we found, at least, less relevant. The question is: do you need the per-core CPU analysis, or is it fine for you to just get the overall CPU, and things like that? When you remove redundant labels, you dramatically decrease the overall load and volume, because the cardinality is reduced. Have you encountered something like that?
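The per-core example can be sketched in a few lines of Python. This is an illustration only, not any tool's implementation: the label names (`node`, `cpu`, `mode`) and the sample values are assumptions chosen for the example, and summing is just one possible way to merge the series once a label is dropped.

```python
from collections import defaultdict


def drop_label(samples, label):
    """Sum sample values over one label, keyed by the remaining labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = frozenset((k, v) for k, v in labels.items() if k != label)
        totals[key] += value
    return dict(totals)


# Eight per-core series for one node and one CPU mode (made-up values).
samples = [
    ({"node": "n1", "cpu": str(core), "mode": "user"}, 1.5)
    for core in range(8)
]

aggregated = drop_label(samples, "cpu")
print(len(samples), "series before,", len(aggregated), "series after")
```

Here eight per-core series collapse into one per-node series; the saving scales with the number of distinct values the dropped label had.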

B

Yes, another example of such labels is histogram labels, because a single histogram can usually introduce hundreds of new time series because of the bucket label.

A

And maybe just to explain to our audience before you carry on: there are different types of metrics, like a gauge and a counter, and the histogram is the more complex one, because it is not one number being aggregated over time. It's actually several numbers, one per bucket, so for each point in time you get multiple data points being collected and, as you said, this dramatically increases the overall data for that specific metric.

A

It's one metric, but the type of the metric being a histogram means that it collects buckets rather than a single scalar number. So, just to explain the background.
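A small Python sketch shows how one histogram turns into many series. It assumes the classic Prometheus histogram layout, where each label combination yields one `_bucket` series per finite bound, one more for `le="+Inf"`, plus `_sum` and `_count`. The bucket bounds and the number of label combinations are illustrative choices, not figures from the episode.

```python
def classic_histogram_series(bucket_bounds, label_combinations):
    """Series produced by one classic histogram metric.

    Per label combination: one _bucket series per finite bound, one for
    le="+Inf", plus the _sum and _count series.
    """
    per_combination = len(bucket_bounds) + 1 + 2
    return per_combination * label_combinations


# Illustrative latency bounds in seconds.
bounds = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

print(classic_histogram_series(bounds, 1))
print(classic_histogram_series(bounds, 20))
```

With 11 bounds, a single label combination already yields 14 series, and 20 combinations (for example, a few handlers times a few status codes) yield 280.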

B

I think there is no universal solution for this problem, but I think the way to go is that every developer who develops exporters for such metrics should pay more attention to the metrics these exporters expose, should think twice before adding an additional metric to the exporter, and should think more about which metrics or labels can be removed without hurting the observability of the exporter or the application.

A

Sounds good. I also see a question from the audience: what do you guys think about exponential bucket histograms? What's your take on that?

B

You mean the histogram type which was recently added to Prometheus?

A

Actually, I wasn't sure if it's called exponential, but...

B

The new Prometheus histograms are named native histograms.

A

That's how I know them, as native, yeah.

B

They use exponential buckets. Actually, VictoriaMetrics also provides its own kind of histograms, and they are also based on exponential buckets, but the two differ in details.

B

I think that histograms with exponential buckets are a great way to go, much better than other bucket types for histograms, because you don't need to think about the boundaries of the buckets. These boundaries are created automatically, and you automatically cover the whole range of values you measure with the histogram. The main issue with exponential bucket histograms is that the number of buckets can be quite big.

B

It depends on the exponent which is used for these buckets, and if you set too small an exponent, the number of such buckets can explode to thousands or more, which is also not great from the performance perspective, from the perspective of the monitoring system's resource usage.

B

So we need to keep a balance between the accuracy, the precision, of such a histogram and the amount of resources used for it, and that's the most complex task when using such histograms.
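The precision versus resource trade-off can be estimated with a short calculation. The Python sketch below assumes a simplified scheme in which each bucket's upper bound is a fixed factor times the previous one; it does not reproduce the exact bucket layout of Prometheus native histograms or of VictoriaMetrics histograms, and the value range is an arbitrary example.

```python
import math


def exponential_bucket_count(lowest, highest, factor):
    """Buckets needed to cover [lowest, highest] when each bucket's upper
    bound is `factor` times the previous one."""
    if lowest <= 0 or highest <= lowest or factor <= 1:
        raise ValueError("need 0 < lowest < highest and factor > 1")
    return math.ceil(math.log(highest / lowest) / math.log(factor))


# Cover one millisecond to one thousand seconds with three growth factors.
for factor in (2.0, 1.1, 1.01):
    print(factor, exponential_bucket_count(0.001, 1000.0, factor))
```

A factor of 2 needs 20 buckets for this range, while a factor of 1.01, which gives roughly one percent relative error, needs more than a thousand. That is the explosion described above when the exponent is set too small.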

A

Yeah. Another point that I wanted to discuss comes up especially when dealing with multiple deployments. When you have a large environment involving many deployments in parallel, there is the configuration challenge: how do you configure these multiple environments? Can you also talk about how you view this and how you handle it?

B

Yes, usually Kubernetes is used for deploying multiple different applications.

B

This ends up with multiple deployment configurations, and the number of such different deployment configurations can grow to hundreds or even thousands in big Kubernetes clusters. Currently, open source monitoring for Kubernetes says that every such deployment should contain custom configuration for monitoring, which ends up as a scrape config in the Prometheus configuration.

B

Such custom configuration usually contains some kind of relabeling rules, some kind of filtering for selecting the needed pods for this deployment, and some kind of configuration for discovering where to scrape the metrics. If you have thousands of such deployments, then this per-deployment monitoring configuration can end up as a big mess. It's unmanageable, or rather, it's hard to manage such a big number of different monitoring configurations.

B

Another issue is that every such per-deployment configuration ends up as a separate scrape config in Prometheus, and every scrape config in Prometheus generates additional load on the Kubernetes API server, which is used for discovering the targets to scrape for each such deployment.

A

So people end up with separate CRDs for each and every one. It could potentially be hundreds or even more, depending on the number of environments.

B

Right, you need to manage them individually, yes. This increases the load both on the users who operate all this stuff and on the Kubernetes API server, because it needs to answer a hundred times instead of the one time it would with a single scrape configuration. I think the way to go is to devise some standard for service discovery of Kubernetes deployments, on top of the Kubernetes service discovery used in Prometheus.

B

So in most cases, Prometheus or some other monitoring system should automatically discover all the deployments and the pods which need to be scraped to collect metrics, without the need to write custom configuration per deployment. Only in exceptional cases, when you need to customize something, would you write a custom resource definition for scraping. In most cases it should work out of the box, without the need to write custom configurations. I think this is the way to go: it will remove the maintenance burden from the operators of Kubernetes and also reduce the load on the Kubernetes API server.

A

Interesting. So you see that there's something that works out of the box. Which part do you think users should configure, and which part should...

B

Yeah, I think that in the ideal case they shouldn't configure anything. Their applications, which run in pods, should just expose metrics in a standard way, in the Prometheus exposition format for instance, at standard, well-known paths and TCP ports. In this case the monitoring system can just discover all the pods in Kubernetes and scrape metrics from well-known endpoints.

B

This is the way to go. Actually, the VictoriaMetrics operator supports this kind of scraping, which is close to this ideal situation. The scraping configuration can be set via annotations in deployments. You can say, "Please scrape metrics from this pod, from this port and this path," and this configuration is written in annotations in every deployment.

B

This is better than writing custom resource definitions for metrics scraping per deployment, but I think the best solution is to have some kind of standard for metrics exposition in Kubernetes pods. In this case you don't need to write anything metrics-related in your deployment configurations, and the monitoring system should just discover all the new pods and the metrics on these pods, without the need for additional configuration from users.
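Annotation-driven scraping can be pictured with the following Python sketch. It is a simplified model, not the VictoriaMetrics operator's or Prometheus's actual discovery code. The annotation keys follow the widely used `prometheus.io/scrape`, `prometheus.io/port`, and `prometheus.io/path` convention, which the speakers did not name; the pod dictionaries and the default port are hypothetical.

```python
def scrape_target(pod, default_port="8080", default_path="/metrics"):
    """Return a scrape URL for a pod that opts in via annotations, else None.

    The default port is a hypothetical fallback chosen for this example.
    """
    annotations = pod.get("annotations", {})
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    port = annotations.get("prometheus.io/port", default_port)
    path = annotations.get("prometheus.io/path", default_path)
    return f"http://{pod['ip']}:{port}{path}"


# Hypothetical pods as a discovery loop might see them.
pods = [
    {"ip": "10.0.0.7", "annotations": {"prometheus.io/scrape": "true",
                                       "prometheus.io/port": "9100"}},
    {"ip": "10.0.0.8", "annotations": {}},
]

targets = [url for url in map(scrape_target, pods) if url is not None]
print(targets)
```

Only the first pod opts in, so one target is produced. The design point from the discussion is that a single generic rule like this replaces a separate scrape configuration per deployment.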

A

So, actually, there's a question here from the audience: isn't the future in the OTLP protocol, OTLP being the OpenTelemetry protocol? Do you see that as part of the roadmap on this path? Or maybe I will add to that question from the audience: do you see that within the OpenMetrics format, or how do you see that mapping to existing open specifications?

B

I think that, as for metrics, the easiest solution is to use the Prometheus text exposition format, or OpenMetrics, as it is named right now.

B

This metrics exposition format is very easy to implement and very easy to debug, because it's just plain text, and everybody knows how to read it and is familiar with it. As for OpenTelemetry, I think it can be used for collecting traces and probably logs, I don't know, but as for metrics, I see that OpenTelemetry is quite complicated compared to the simple OpenMetrics format.

B

I think that's the main stumbling block for OpenTelemetry adoption for metrics.
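To show how little is needed to produce the plain-text format, here is a minimal Python sketch that renders a single sample line in the style of the Prometheus text exposition format. It is deliberately incomplete: it omits `# HELP` and `# TYPE` lines, label value escaping, and timestamps, and the metric and label names are illustrative.

```python
def render_sample(name, labels, value):
    """Render one sample as a text exposition line: name{labels} value."""
    if not labels:
        return f"{name} {value}"
    # Sort labels so the output is stable; values are not escaped here.
    label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_text}}} {value}"


print(render_sample("http_requests_total", {"method": "get", "code": "200"}, 1027))
print(render_sample("up", {}, 1))
```

The output is one human-readable line per series, which is the debuggability argument made above.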

A

Yeah. By the way, it is in the roadmap of OpenTelemetry: it's not just for logs and tracing, it's definitely aiming for metrics as well, and there's lots of work with the Prometheus project and the working groups to create synergy between the two specifications. So I think it will be interesting to see how the OpenTelemetry specification addresses that, but I guess we will have to wait and see.

A

There was one more point that you mentioned, about the lack of a common schema to enable correlation, which is also related to that aspect.

B

That's actually related to the OpenTelemetry efforts, because OpenTelemetry is popularized as a common standard for telemetry data collection, and one of the main points of such a standard is to ease the correlation between the different kinds of collected data: metrics, traces, logs. So I think that's also a way to improve the situation in the Kubernetes world.

A

Yeah. You also mentioned trying to collect in a uniform manner the two aspects: both the system metrics, which are, I guess, a more closed set of metrics, and the application metrics. Do you want to add anything about what you see lacking in collecting them in a uniform manner?

B

I think that system metrics in Kubernetes, such as CPU usage, memory usage, network usage, and disk I/O usage, can be collected from Kubernetes itself, or from the pods which run in Kubernetes, in a uniform format, and can be standardized by Kubernetes, written into some Kubernetes standard. No additional input is needed from users, because these system metrics are constant across every application

B

and every pod, and there is nothing to invent in this area. So system metrics in Kubernetes should be provided by Kubernetes itself and collected from exporters which are provided by Kubernetes itself. I think that it's an easy task to solve.

B

As for application metrics, every application exposes different sets of metrics, and it is hard to create some kind of standard which can group these metrics from different applications in some form. So I don't know how to solve this issue.

A

Yeah, I think there's also the aspect that goes back to what's needed, because many of the metrics that are exposed are, I find, not actionable, and it really depends on the platform. For example, if you go to managed Kubernetes, when you collect system components, I don't know, Azure Defender for instance, the system metrics you collect will pretty much be useless for you as an end user, because you don't manage the underlying system anyway. So it's not actionable for you.

A

And still, you can find these metrics exposed by the managed Kubernetes nonetheless; I guess they didn't filter them. So what I find is that it goes back to focusing on the relevant metrics, also in the aspect of standardization, per the environment that you work in. Managed Kubernetes environments should also provide their own exposed set that is relevant to their end users, in a way that will focus, instead of defocus, us as the consumers of the service.

B

Yeah, agreed, but I meant that there are system metrics which are related to every pod which you deploy in Kubernetes, and these system metrics are relevant in any kind of Kubernetes environment, because every user should care about the resource usage of their deployments and pods in Kubernetes.

A

No, sounds good, sounds good. I want to switch and talk a bit about your journey with VictoriaMetrics. It's a very interesting open source project in the field and quite widely used these days, so it's interesting to hear how you started. Actually, you started as practitioners working with Prometheus, and you said you actually saw a great improvement, especially over Zabbix back in the previous days. So what drove you to start developing another solution?

B

Yes, at the previous job we started using Prometheus, and we were very happy with Prometheus compared to Zabbix, because we were using Zabbix before. The main feature of Prometheus which we liked is that developers can add as many metrics as they want to their applications without the need to talk to DevOps and say, "Please add this metric to my application." That was the way we were working in the Zabbix era.

B

That was great, but when we started using Prometheus, we ended up with a very big number of metrics. At that time there was no Prometheus 2.0; there was Prometheus 1.something, and this Prometheus couldn't keep up with the increased amounts of metrics which we happily fed it, even on pretty beefy hardware.

B

That's why we started searching for ways to improve Prometheus.

A

It's important to note, for the listeners that are not aware, that Prometheus by design is a single-node installation. It can obviously scale vertically, if you add more memory, more disk, whatever, but it can't scale horizontally, so you can't make it into a clustered solution. That's the limit, I guess, that native Prometheus offered.

B

Okay, and around the same time we started using ClickHouse instead of Postgres.

B

At the previous job we were collecting huge amounts of analytics stats for [inaudible], and these stats were stored in Postgres initially, but Postgres also couldn't keep up with the amounts of data stored in it and the queries over this data. We discovered ClickHouse at the same time, and when we tried ClickHouse we discovered that it can handle a hundred and even a thousand times more load than Postgres on the same hardware. Additionally, ClickHouse can scale to multiple nodes, and this was a very great experience.

B

So I decided that probably we can use the ClickHouse architecture, the ClickHouse technology, for metrics collection and metrics processing.

B

So that's how VictoriaMetrics appeared.

A

Nice. By the way, in a previous episode we also hosted the founder of SigNoz, which is also based on the ClickHouse architecture. ClickHouse is a very fascinating open source project, and we can talk about it more in different episodes. But I'm curious actually to hear: so you found yourselves liking the Prometheus concept, let's say, and many of the features, but you lacked the scalability.

A

In fact, here on the show we covered several open source projects, such as Thanos, Cortex, and M3DB, which aim to tackle the same challenge of a scalable Prometheus. So I'm curious, how is VictoriaMetrics different?

B

Yeah, the main difference between VictoriaMetrics and other systems on the market, such as Thanos, Cortex, and Mimir, is that VictoriaMetrics is written from scratch. It doesn't use any source code from Prometheus. Other systems, such as Cortex and Mimir, are based on Prometheus source code; they reuse it. So that's the main difference between these systems.

B

Another difference is that VictoriaMetrics stores data on ordinary disks, block devices I mean, while Thanos and Cortex started from storing the data in object storage, and both approaches have pros and cons.

B

The main pro of storing data on ordinary disks is that ordinary disks usually have much smaller latencies, much smaller error rates, and much higher bandwidth compared to object storage. This helps improve query performance compared to object storage. The main disadvantage of local disks compared to object storage is that local disks have limited capacity, so you need to think about how to resize these disks when your free space shrinks.

B

VictoriaMetrics can scale to multiple nodes to solve this problem. If you fill a single disk in VictoriaMetrics, you can add more storage nodes to VictoriaMetrics, and in this case you will get more storage space. Additionally, some block devices, such as Google persistent disks or Amazon block storage, can be resized on the fly. You can just add more space to these disks without restarting the application.
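
As a rough illustration of how adding storage nodes adds capacity, the Python sketch below spreads time series across nodes by hashing their labels. This is a generic hash-sharding example under stated assumptions, not the routing algorithm VictoriaMetrics actually uses, which the conversation does not describe. The node names and the `pick_storage_node` helper are hypothetical.

```python
import hashlib


def series_key(labels):
    """Build a canonical string for a series, independent of label order."""
    return ",".join(f"{name}={value}" for name, value in sorted(labels.items()))


def pick_storage_node(labels, nodes):
    """Map a series to one storage node by hashing its canonical key."""
    digest = hashlib.sha256(series_key(labels).encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


# Hypothetical node names; a longer list means each node holds fewer series.
nodes = ["storage-0", "storage-1", "storage-2"]
series = {"__name__": "http_requests_total", "pod": "web-1"}
print(pick_storage_node(series, nodes))
```

Plain modulo hashing like this remaps most series when the node count changes, so real systems tend to use more careful schemes; the sketch only shows why more nodes means less data per node.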

A

Nice. Can you give us some figures to understand the scalability that you're talking about with VictoriaMetrics?

B

Recently we ran a benchmark in which we ingested 100 [million] metrics per second into a VictoriaMetrics cluster for 24 hours, and these metrics were collected from real Prometheus node exporters. These metrics weren't generated randomly, so this benchmark was very close to production data from the collection side. In this benchmark we discovered that VictoriaMetrics can scale to such a huge workload on the ingestion path.

B

Around 100 nodes of different kinds were used in the VictoriaMetrics cluster, and [inaudible] was around 1,000, as I remember.

A

Nice. Yeah, I remember, actually, from OSMC, the Open Source Monitoring Conference in Germany. Those were pretty astonishing numbers that you shared there. One point that I wanted to ask about is that many of these solutions started as long-term storage for Prometheus. I think that you, with VictoriaMetrics, took a different approach, at least at some point. You started, I guess, diverging or extending beyond the Prometheus ecosystem, both in the PromQL query language, with extensions, and with APIs.

A

Do you want to say a word about where you're heading with this?

B

Yes, as you said, we started initially as remote storage for Prometheus, and later we stumbled on some issues in the Prometheus ecosystem. The first issue was not-so-good support for the remote read protocol on the Prometheus side.

B

When we created VictoriaMetrics as remote storage for Prometheus, we were thinking that Prometheus could use VictoriaMetrics as storage: all the queries go to Prometheus, and Prometheus asks VictoriaMetrics for the data, processes it, and returns the query response. But it turned out that this was not working because of different issues. I can post links to these issues in the Prometheus issue tracker later. That's why we had to create our own implementation of the Prometheus query language. We implemented it from scratch, so you can query VictoriaMetrics by itself, without the help of Prometheus.

B

Later this Prometheus query language implementation outgrew PromQL, and we renamed it to MetricsQL. It is a superset on top of PromQL, I would say, and it also has some incompatibilities with PromQL. These incompatibilities are deliberate.

B

We are not going to fix them, because we see that our implementation fits users better than the Prometheus implementation. For instance, look at the increase function. If you execute increase on top of an integer counter in Prometheus, you can be surprised when Prometheus returns a non-integer result. For instance, you want to get the number of requests for the last five minutes, so you write increase over the number of requests for five minutes, and Prometheus could return 1.5, for instance, not...

A

What you'd expect.
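
To make the fractional result concrete, here is a minimal Python sketch. It assumes a simplified version of the extrapolation that Prometheus applies when computing `increase` (the real implementation also handles counter resets and limits extrapolation near zero), and contrasts it with a plain difference against the last sample before the window, which is roughly how the alternative behavior is described in this conversation. The function names and the sample data are hypothetical.

```python
def extrapolated_increase(samples, window_start, window_end):
    """Simplified Prometheus-style increase: scale the observed delta to the full window."""
    inside = [(t, v) for t, v in samples if window_start <= t <= window_end]
    if len(inside) < 2:
        return None  # not enough samples in the window
    (t_first, v_first), (t_last, v_last) = inside[0], inside[-1]
    delta = v_last - v_first  # counter resets are ignored in this sketch
    sampled = t_last - t_first
    avg_interval = sampled / (len(inside) - 1)
    threshold = avg_interval * 1.1
    to_start = t_first - window_start
    to_end = window_end - t_last
    # Extend to the window edge if the gap is small, else by half an interval.
    extend_start = to_start if to_start < threshold else avg_interval / 2
    extend_end = to_end if to_end < threshold else avg_interval / 2
    return delta * (sampled + extend_start + extend_end) / sampled


def lookbehind_increase(samples, window_start, window_end):
    """Last value in the window minus the last value seen before the window."""
    inside = [(t, v) for t, v in samples if window_start <= t <= window_end]
    before = [(t, v) for t, v in samples if t < window_start]
    if not inside or not before:
        return None
    return inside[-1][1] - before[-1][1]


# An integer request counter scraped every 60 seconds (timestamp, value).
samples = [(-50, 100), (10, 100), (70, 100), (130, 101), (190, 101), (250, 103)]
print(extrapolated_increase(samples, 0, 300))  # 3.75
print(lookbehind_increase(samples, 0, 300))    # 3
```

The counter only ever moved by whole requests, yet the extrapolated variant reports 3.75 because it stretches the 240 seconds actually covered by samples to the full 300-second window.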

B

Yes, and...

A

Also the rate function, I think there was also something.

B

Yes, the rate function is tightly connected to the increase function, and VictoriaMetrics also implements it quite differently. For instance, when you use rate over a small time window in Prometheus, one that is smaller than double the scrape interval, you get an empty graph, an empty response, while VictoriaMetrics just returns what most users expect: it returns the rate.
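
The empty result for a short window follows from needing at least two samples inside the window. The Python sketch below shows that effect and one way to avoid it by also looking at the last sample before the window. It is a simplified model under stated assumptions: it omits the extrapolation Prometheus performs, and the look-behind variant is only an illustration of the behavior described here, not the actual MetricsQL code. All names and data are hypothetical.

```python
def strict_rate(samples, start, end):
    """Per-second rate using only samples inside [start, end]."""
    inside = [(t, v) for t, v in samples if start <= t <= end]
    if len(inside) < 2:
        return None  # a single sample cannot produce a rate
    (t0, v0), (t1, v1) = inside[0], inside[-1]
    return (v1 - v0) / (t1 - t0)


def lookbehind_rate(samples, start, end):
    """Per-second rate that may use the last sample before the window as a baseline."""
    inside = [(t, v) for t, v in samples if start <= t <= end]
    before = [(t, v) for t, v in samples if t < start]
    if not inside or (len(inside) < 2 and not before):
        return None
    t0, v0 = before[-1] if before else inside[0]
    t1, v1 = inside[-1]
    return (v1 - v0) / (t1 - t0)


# A counter scraped every 30 seconds (timestamp, value).
samples = [(0, 10), (30, 13), (60, 16), (90, 19)]

# A 35-second window, shorter than two scrape intervals, holds one sample.
print(strict_rate(samples, 65, 100))      # None
print(lookbehind_rate(samples, 65, 100))  # 0.1
```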

A

Yeah, there's also a question here from the audience, if you can refer to M3QL from M3DB. I guess this is the query language by M3. I don't know if you know it, if you can relate to this product. We mentioned Thanos, Cortex, and Mimir a bit, but maybe you can say a word about M3.

B

I'm aware of the M3 database, and I know it is also written from scratch. It probably uses some Prometheus code. I think that they use Prometheus code for query execution, because they declare that they are 100% compatible with PromQL. If you want to achieve such compatibility, the easiest way is to reuse the code for query execution. So probably they use Prometheus code for PromQL, but the storage layer in M3 is written from scratch, as far as I know. Other than this...

A

It's a question from the audience, but it's fair to say if you're not familiar; we're not expecting you to compare with all the products out there. I have a question also, because we talked a lot about the storage, and obviously the time series database is the core, but it's not everything. Even Prometheus is not just a time series database.

A

It's got a full stack, including the scraping, like the agent mode, the Alertmanager, the front-end UI, and more. So does VictoriaMetrics plug into that stack or provide a vertical full stack of its own?

B

It provides its own. We have vmagent. This component is like the Prometheus agent: a lightweight agent which can discover scrape targets and scrape these targets, and it supports the same configuration for scrape targets as Prometheus.

B

So you can switch from Prometheus to vmagent and scrape the same targets in the same way. Additionally, vmagent supports not only scraping metrics; it also supports popular data ingestion protocols. For instance, it accepts data via the InfluxDB, Graphite, and Datadog protocols, and so on.

B

Also, we have vmalert. This component is responsible for alerting and recording rules, and it also supports the Prometheus configuration for alerting and recording rules. We also have other kinds of components, such as vmauth for authorization, multi-user authorization, and the vmctl component, which is used for various operational tasks, for instance transferring data from one system to another system, checking the health of some systems, and so on.
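
One of the ingestion protocols mentioned above is Graphite. As a small illustration of what accepting such data involves, the Python sketch below parses a line in the Graphite plaintext format (`<path> <value> <timestamp>`, optionally with `;tag=value` suffixes on the path) into a labeled sample. The `parse_graphite_line` helper and the choice to store the path under a `__name__` label are assumptions for illustration; they do not describe how vmagent converts the data internally.

```python
def parse_graphite_line(line):
    """Parse '<path>[;tag=value...] <value> <timestamp>' into (labels, value, timestamp)."""
    path, value, timestamp = line.split()
    name, *tag_parts = path.split(";")
    labels = {"__name__": name}  # assumed mapping: Graphite path becomes the metric name
    for part in tag_parts:
        key, _, tag_value = part.partition("=")
        labels[key] = tag_value
    return labels, float(value), int(timestamp)


# Example line with one tag; the metric path and tag are made up.
labels, value, timestamp = parse_graphite_line("servers.web01.cpu;dc=eu 0.75 1674550000")
print(labels, value, timestamp)
```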

A

All right, so it's a full vertical stack. We're about to run out of time, but can you share with the audience how people can join the open source project and the discussion and get involved?

B

Yeah, we are very glad to see new users in our community. We have a Slack chat [inaudible]. We encourage you to join this chat. I will post a link to it, and you can also find it on our VictoriaMetrics GitHub page. We encourage new users to use VictoriaMetrics, to file feature requests, to file bug reports, and to send pull requests. We are very happy when users send even very simple pull requests which fix some typos in the documentation, for instance.

B

That's the starting point for new users in VictoriaMetrics.

A

Do you have other community channels that you would like to highlight, Slack, Discord, or any other that you want to note?

B

[inaudible] Slack chat, so this is the most active community point for VictoriaMetrics. We also have Telegram chats, which you can also join if you use Telegram, and we have [inaudible] channel.

A

They can also reach out to you. We'll also put your own LinkedIn and Twitter in the YouTube description, so people can also reach out to you directly, but I think the community channels are the best way to go if people are interested.

A

I want to thank you for that, and I want to switch over.

A

We have just one minute left, so in the industry updates section I want to mention just one very interesting thing that I saw a couple of days ago: Chris, the CTO of the CNCF, wrote the cloud native predictions for 2023. I found them very interesting, including things such as cloud IDEs becoming normalized, the standardization of FinOps, Wasm (WebAssembly) innovation, and others. So I highly recommend checking this out.

A

I'll put it on the resources page. It's a fascinating list of predictions that is also very aligned with what we see and what we discuss here on the show, and with the topics that we keep on discussing. Very interesting. With that, I want to thank you again, Aliaksandr, for joining me on this show and sharing your insights. Thank you very much.

B

Thank you for inviting me to this talk. It was very interesting to take part.

A

Thank you, and of course thank you to all our listeners for joining us for the episode. As always, you can find all the episodes on your favorite podcast app or on YouTube, if you want to see them in the videocast format. If you are listening to the episode on the podcast, do know that we stream the episodes live on Twitch and YouTube Live.

A

You can find all the details about the upcoming streams on Twitter at @openobserv. Check them out for updates, or you can follow me at @horovits, where we update and share comments, and you can share your own comments, suggestions, news bits, and everything else. And if you have something to contribute to the show, if you have something interesting, if you're a subject matter expert, feel free to reach out and submit a proposal on the website, openobservability.io, where you can find all the details and also the CFP. With that, I would like to thank you. I'm Dotan Horovits.

A

Thank you very much for listening, and we'll see you in next month's episode.