From YouTube: Grafana Community Call 2023-01-26
B
I can say a bit about the streaming changes. Yeah, so we added a new setting, which is still experimental: the streaming series batch size. When you set it to something other than zero, the store-gateway should be more efficient in handling larger requests.
B
This would help with requests or queries that select tens or hundreds of thousands of series. Previously, those might lead to out-of-memory errors.
B
It's still experimental, but we've seen pretty good results internally on reduction of out-of-memory errors, and latency and CPU seem on par with the previous implementation. So do give this a try if you have problems with the store-gateways. I'm not sure if we have anyone here who worked on the mmap changes.
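To make the batch-size idea concrete, here is a minimal sketch in Go; this is an illustration, not Mimir's actual code, and the function and parameter names are invented for the example. With a non-zero batch size, series are loaded and forwarded in bounded chunks, so peak memory depends on the batch size rather than on how many series the query selects.

```go
// Sketch: a zero batch size keeps the old behaviour (materialize
// everything at once); a positive one bounds how many series are
// resident at any moment.
package main

import "fmt"

func streamSeries(ids []int, batchSize int, send func([]int)) {
	if batchSize <= 0 {
		send(ids) // old behaviour: one giant batch, memory scales with the result
		return
	}
	for start := 0; start < len(ids); start += batchSize {
		end := start + batchSize
		if end > len(ids) {
			end = len(ids)
		}
		send(ids[start:end]) // only batchSize series held at a time
	}
}

func main() {
	ids := make([]int, 25)
	streamSeries(ids, 10, func(batch []int) { fmt.Println("batch of", len(batch)) })
}
```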
B
Shows
Maybe
yeah
and
the
other
changes
that
they
can
Charles
did
were
on
removing
a
map
they
use
it
to
thumb
up.
This
would
solve
some
in-free
some
stalls
or
hangs
in
the
store
Gateway
due
to
how
a
map
is
used
in
go
basically,
instead
of
just
a
small
part
of
the
program
stopping
to
read
from
this
a
lot
of
core
routines
stop.
B
So
this
looks
like
the
holster
Gateway
is,
is
told
and
not
making
progress,
so
health
checks
fail,
request,
latencies
increase
you
can
yeah,
Nick
and
Charles
worked
to
to
replace
this
with
the
regular
five
reads:
you
can
turn
these
on
with
this
flag.
Log
storage
bucket
store
index,
header
stream
stream
reader
enabled
I
think
we're
looking
to
make
both
of
these
improvements.
The
default
in
the
future
release.
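For background on why mmap can stall a Go program, a minimal sketch; this is an assumed illustration, not the store-gateway's code. Reading mmap'd memory can page-fault, and a page fault blocks the whole OS thread invisibly to the Go scheduler, so goroutines waiting on locks held by the faulting goroutine stall too; a plain file read is a visible blocking syscall, so only the calling goroutine parks.

```go
// Unix-only sketch contrasting mmap'd reads with plain file reads.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func readViaMmap(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	st, err := f.Stat()
	if err != nil {
		return nil, err
	}
	data, err := syscall.Mmap(int(f.Fd()), 0, int(st.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		return nil, err
	}
	defer syscall.Munmap(data)
	buf := make([]byte, len(data))
	copy(buf, data) // page faults here block the OS thread, unseen by the scheduler
	return buf, nil
}

func readViaFile(path string) ([]byte, error) {
	return os.ReadFile(path) // blocking syscall: only this goroutine waits
}

func main() {
	for _, read := range []func(string) ([]byte, error){readViaMmap, readViaFile} {
		if _, err := read(os.Args[0]); err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
	}
}
```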
C
So before this change, basically, we could only document new stuff in the Helm chart for Mimir and Enterprise Metrics when we had a Mimir release, because the release process was such that we did the Helm chart release together with Mimir, and the documentation had to be in place because it was being copied to the website from a single place. So over the last month or two we finished separating the documentation, which included some rewrites and splitting documents as well, so not just moving files around.
C
But the upshot is that, from the next release, we will be able to simply release Mimir and the Helm chart separately. So we can have more frequent, as-needed releases of the Helm chart, with its own documentation, and you will also find the Helm chart documentation in its own place, so it's not going to be mixed in with the Mimir docs. And this idea was picked up by other projects as well.
C
So, in fact, we will have a Helm charts hub on grafana.com, in the documentation, where we plan to add more documentation for the various Helm charts of different products. At the moment this is very much work in progress, so there can be some issues or missing links, but at least the Mimir Helm chart docs and the link to the Loki Helm chart docs are already there. So hopefully we'll improve this; not hopefully, actually, we'll definitely improve this in the not-so-distant future.
C
Oh yeah, I don't know if you know that Prometheus's native histograms support was merged last year, and we are bringing that into Mimir. This work is in progress: we have a development branch where we have some basic functionality going, but we need much more testing and optimization, and fixing of some missing features, for this to be merged onto main. But it is coming along. So, for those who don't know about native histograms: if you use histograms in Prometheus today, they are stored as a number of separate time series, and native histograms are about storing the histogram information, the buckets, in a single series with a new data type. I put a couple of links to presentations about this into the meeting document just now, because it's quite an interesting topic, I think.
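To illustrate the difference, here is a simplified sketch; the real native histogram model uses sparse bucket spans and is more involved, and the struct below is loosely shaped after the idea rather than Prometheus's actual types. A classic histogram spawns one series per bucket plus _sum and _count, while a native histogram carries all buckets inside a single sample on one series.

```go
// Classic histogram: one series per bucket (via the "le" label) plus
// _sum and _count. Native histogram: one series, each sample carrying
// the whole bucket layout.
package main

import "fmt"

var classic = map[string]float64{
	`http_request_duration_seconds_bucket{le="0.1"}`:  40,
	`http_request_duration_seconds_bucket{le="0.5"}`:  95,
	`http_request_duration_seconds_bucket{le="+Inf"}`: 100,
	`http_request_duration_seconds_sum`:               37.5,
	`http_request_duration_seconds_count`:             100,
}

// Illustrative shape of a native histogram sample.
type nativeHistogramSample struct {
	Schema       int32    // resolution: bucket boundaries grow by a factor of 2^(2^-Schema)
	Count        uint64   // total observations
	Sum          float64  // sum of observations
	BucketCounts []uint64 // sparse per-bucket counts, all in one sample
}

func main() {
	native := nativeHistogramSample{Schema: 3, Count: 100, Sum: 37.5,
		BucketCounts: []uint64{40, 55, 5}}
	fmt.Printf("classic: %d series; native: 1 series, %d buckets per sample\n",
		len(classic), len(native.BucketCounts))
}
```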
D
Krajo, do you mind sharing, you know, an estimate of when the native histogram support could be ready to test (not for production, really, but to test) for the open source community?
C
Right, but our kind of planned deadline is March, so we want to finish it basically next month, and we might have something sooner, but I'd rather not commit to anything sooner.
A
Okay, we don't have many more points, actually, so I've added one about the design I'm working on for the rollout-operator. We want to automate the downscaling of the ingesters, and I would like to make that at least managed automatically by the rollout-operator, instead of having to call the shutdown endpoint on each of them.
A
That's still in progress, but the first step towards it is that we've released a new version, v0.3.0, which has a no-downscale webhook implementation. It basically allows you to label some StatefulSets, the ingesters in your deployment, and that will prevent them from being downscaled accidentally by someone. And yeah, that's all; I will share more news in the future once I have the design for that.
F
Yeah, I can quickly introduce myself. I'm Theo; I've worked for five years at Giant Swarm and I'm now the product owner of the monitoring team. We're currently interested in Mimir, because the solution we have for monitoring right now is not really scalable. So we're trying to evaluate the different solutions out there: we've heard about Thanos, we've heard about Mimir, and we're trying to make up our minds about what's a good fit for us.
F
And so currently we're struggling with scalability, because we have a kind of in-house setup. Roughly, our architecture is that we have multiple clusters that we monitor, and all our Prometheus instances, all those data, are sitting on a single cluster that we call the management cluster. And we just have one Prometheus per monitored cluster; it's really not ideal, but that's what we have.
F
Ideally, we want to replace this with something that scales better, because sometimes the clusters we're monitoring can be quite large, like hundreds of nodes or more, and because there is a single Prometheus in front, the memory just explodes. And for scalability we then need to scale the nodes of the management cluster; it really doesn't fit.
D
Yeah, I have one last question: have you already tried Mimir? Or, yeah, I know, you've just read the documentation so far.
G
I had some hands-on experience with Thanos before, so I was very interested to explore Mimir in action. I did a couple of, you know, documentation readings and so on, and to me Mimir looks much the same as Thanos, but with a couple of differences. Honestly, personally, I think Mimir could help us more in our situation.
G
So what did I do? Not much, honestly; I just tried it on a kind cluster, and there were a few components I didn't understand, like the caching and so on, mainly in the Helm chart. So I need some more time, and, you know, maybe we can get some help from you guys. That would be great.
G
It's not about that eventually; it's a matter of, you know, ourselves choosing the best architecture for our monitoring setup. Mimir is very flexible: you can choose whatever suits you, like whether you need to run multiple queriers, whether you want ingesters in workload clusters, and so on. So it differs, and I think we wanted some expertise here; maybe we wanted to get some help to choose the best architecture with which we could integrate Mimir into our setup.
A
Yeah, yeah, we would be happy to help on the community Slack. There are some options that we are thinking about deprecating, and some things we want to make the default in the future but still haven't made public; that's why some options just have two or more alternatives still. So it's always in progress. Like, I would personally deploy everything multi-zone nowadays, not single-zone, but I'm not sure if we are really moving towards that yet.
D
What you mean is what we call cluster federation, I think. To my understanding (I think you're thinking of, oh, too many things, I guess) you're thinking about installing something like one Mimir cluster per Kubernetes cluster, or per region, and then having a global view over all your Mimir clusters, something like that.
D
Yeah. Okay, so to do this you need a feature which we call cluster federation, but it's not available in Mimir open source; it's available in our Enterprise product, which is called Grafana Enterprise Metrics, which is Mimir plus some extra features, including cluster federation.
D
However, what we typically suggest is to just run one single global Mimir cluster, and then you have the agents or Prometheus running in each of your data centers or Kubernetes clusters or regions, whatever it is, and they all remote write to a centralized place. So basically you run one Mimir cluster. Just to give you an example, that's how we monitor Grafana Labs infrastructure: we run one big Mimir cluster, centralized, and then we use Grafana Agent.
D
Instead of Prometheus, we have Grafana Agent running in each Kubernetes cluster, across multiple regions, and they all remote write to the centralized Mimir cluster.
D
Or maybe we can come back to this later. I'd also like to let Sean introduce himself.
H
Yeah, I'm Sean. I work at Bloomberg on the infrastructure team, and we're actually evaluating Grafana Mimir as a replacement for our fairly large Metrictank installation, since, as you know, that's kind of sort of the way to go.
H
I don't think we really have specific, targeted questions here, but we do have a lot of backstory, which I think some people at Grafana know and some don't, around our data size. It's going to be quite an effort, I think, for us to migrate to Mimir, just because of some of the internal opinions, I would say, that we've already placed in our data schemas, and just because of our volume: we have about a billion series, and right now 13 to 14 million data points per second. So it'll be an interesting project.
D
For the audience: Metrictank is another open source TSDB built by Grafana Labs, and it's specific to Graphite. So my question for Sean is: given you are evaluating the option to migrate to Mimir, do you also plan to move your observability stack from Graphite to Prometheus, I mean PromQL and the Prometheus agent or Grafana Agent, whatever it is?
H
Yes. We don't actually use anything Graphite except the query language, realistically; we don't use the Carbon protocol or anything like that. It's all, you know, sort of in-house-built protocols and things like that, including our own histograms, which is why we're really looking forward to native histogram support in Mimir, because we would definitely want that. That's been on our to-do list for like five years and we never got around to it, so that's really exciting. But yeah, we don't really use Graphite except for the query language. Unfortunately, we have about 10,000 dashboards in Grafana, all written with Graphite, so we probably will need that shim layer, which I think GEM provides, to make Graphite queries accessible.
H
We'll need to support both for at least a reasonable length of time.
D
Nice, well, that's pretty exciting; you're currently running Metrictank at a very big scale, and obviously for us it would be very interesting if you were to switch to Mimir. Sorry to call on you directly, even if you didn't raise your hand; I know we...
E
Well, I'm part of the same team as Theo and Mohammed at Giant Swarm, so working on monitoring, and yeah, I think they already explained what we are looking for at Giant Swarm. Last time, I guess, you said: just try it, replace one of your Prometheuses with Mimir and see what happens. But we haven't had time to do that yet, so nothing new from me.
D
Yeah, I think my main feedback is: if you hit an issue (and actually it's likely you will hit some issue), you know, many times the difficult problem is doing capacity planning. That's one of the typical problems: when you approach Mimir, you don't know how each component should be sized or scaled and so on.
D
My suggestion is to just try to reach out to us, either through Slack or through a GitHub discussion. Nowadays I mostly monitor GitHub discussions; other people monitor Slack. We're a bit distributed internally, but yeah, ask for help and we will do our best to make you successful in this proof of concept.
E
I guess today our problem is that we don't know yet what problems we will encounter with Mimir. We know what problems we have with Prometheus, like single instances, and what problems we have with our way of sharding it. But yeah, I think we should really do a big test installation, like with 100 nodes and our usual monitoring, and just set up Mimir instead of our Prometheuses and see how it behaves. I know we have some specific use cases that are quite a pain for us. For instance, 100 nodes means 60 to 70 gigabytes of RAM for our Prometheus, and as our Prometheuses run on the management cluster, if we could balance that over the nodes of the management cluster, we'd see how it actually works, whether it really changes things: yeah, whether it really can balance the memory. And I'd have a global measure: what is the global memory consumption for our small clusters, for our big clusters?
E
And what happens, like, we've had some strange cases where, if Prometheus crashes, when it comes up again we have some Prometheus agents that just try to catch back up on the old data, and Prometheus uses like three to four times more memory than in normal usage. How does that work with Mimir? These kinds of questions.
D
Yeah, yeah, I think that's why, since we are already running Mimir ourselves, the best way to do it is, again: try to install Mimir and configure remote write from at least a few Prometheus servers to Mimir, and then you can configure more and more remote writes until you eventually remote write all the data, from Prometheus or directly from the agent, to Mimir, whatever works best for you. Just to give you a data point: at Grafana Labs we don't have any pod (sorry, any Mimir pod) with a memory limit above, I think, 25 gigabytes, more or less. Then, obviously, the number of replicas you run and the memory requests and limits really depend on your scale, and we generally measure the scale based on the number of active series you have in your Mimir cluster, which is similar to the sum of the in-memory head series across all your Prometheus servers.
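As a back-of-envelope illustration of that sizing rule, a sketch under stated assumptions: the replication factor of 3 is Mimir's default, but seriesPerIngester below is a placeholder planning number invented for the example, not official guidance.

```go
// Estimate cluster scale from the sum of Prometheus head series.
package main

import "fmt"

func main() {
	// In-memory head series reported by each Prometheus server.
	headSeries := []int{2_500_000, 1_200_000, 800_000}

	activeSeries := 0
	for _, s := range headSeries {
		activeSeries += s
	}

	const replicationFactor = 3         // Mimir's default ingester replication
	const seriesPerIngester = 1_500_000 // placeholder planning number

	// Each series is held by replicationFactor ingesters; round up.
	ingesters := (activeSeries*replicationFactor + seriesPerIngester - 1) / seriesPerIngester
	fmt.Printf("~%d active series -> roughly %d ingester replicas\n", activeSeries, ingesters)
}
```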
E
Okay, but I imagine then: if I keep my Prometheus and let it remote write to Mimir, this Prometheus, like an agent, will still have to manage its WAL, at least its WAL window, so at least two hours of hot data, and it will have to manage this amount of metrics, this cardinality. So my Prometheus entry point will still use its 70 gigs of RAM.
D
Yeah. So my question is: are you already using the Grafana Agent today?
D
Something has to scrape the monitoring targets and write this data to Mimir, and this application could be Prometheus itself, the Prometheus agent, or the Grafana Agent; Grafana Agent and the Prometheus agent are pretty similar.
D
In case you're running Grafana Agent, we have some extra features. Typically, the resources required by the agent, either the Prometheus agent or the Grafana Agent, are way less than Prometheus itself.
D
Yes, there will still be a WAL, but keep in mind the WAL itself doesn't take much memory; it's actually very light in terms of memory. There's a lot of disk usage, but in terms of memory the overhead is not that much. Typically, the memory utilization in Prometheus is mostly driven by keeping all the active time series in memory.
D
Yeah: any query you run, the dashboard queries, the alerting rules, the recording rules, they would all run in Mimir instead of Prometheus.
E
Yeah, we really have to do some tests, and maybe it opens the possibility to shard Prometheus more aggressively, like having the agents on the odd nodes send data to one Prometheus and the even ones to the other Prometheus, because we wouldn't need to reconcile the data on one Prometheus, I guess. Yeah.
D
Also, what I recommend in this setup, where you have an agent and Mimir, is to run the agent in high-availability pairs, which means for each agent shard... well, let's keep it simple. Let's say you don't use any sharding and just have one agent per Kubernetes cluster, for example. Instead of running just one agent replica, you run two agent replicas, and they both remote write to Mimir.
D
Say you do a rollout of a new version of the agent: you don't want gaps in your metrics. Also, for high availability, you configure the agent with node anti-affinity rules, to run the two agent pods on different Kubernetes nodes, so if one node goes down you still have the other agent as a backup.
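A sketch of that anti-affinity, expressed with the Kubernetes Go API types (k8s.io/api); the "app: grafana-agent" label is an assumption for the example, so match whatever labels your agent pods actually carry.

```go
// Require the two agent pods to land on different nodes, so one node
// failure leaves a live replica.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func agentAntiAffinity() *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app": "grafana-agent"}, // assumed label
				},
				// Spread across nodes: no two agent pods on the same hostname.
				TopologyKey: "kubernetes.io/hostname",
			}},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", agentAntiAffinity())
}
```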
D
The
last
one
with
tiny,
solid
it
works
with
any
solution
like
you
can
set
up
two
Prometheus
pairs.
D
You set the same cluster name (say, the name of your Kubernetes cluster) and a different replica value, so that when we receive a request we know which replica it is coming from. Then we elect one of the two senders, one of the two agents, as the primary, and we keep ingesting data from the primary until we stop seeing data from it, and then we switch; we basically fail over.
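A minimal sketch of that election-and-failover logic; this is an illustration, not Mimir's implementation. In Prometheus or agent terms the two values typically come from the external labels cluster and __replica__, which are Mimir's default HA-tracker labels, and the failover timeout is configurable.

```go
// Per-cluster replica election: stick to one sender, fail over when it
// goes silent for longer than the failover timeout.
package main

import (
	"fmt"
	"sync"
	"time"
)

type election struct {
	replica  string
	lastSeen time.Time
}

type deduper struct {
	mu       sync.Mutex
	failover time.Duration
	elected  map[string]*election // keyed by cluster label
}

// accept reports whether a sample from (cluster, replica) should be ingested.
func (d *deduper) accept(cluster, replica string, now time.Time) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	e, ok := d.elected[cluster]
	if !ok || now.Sub(e.lastSeen) > d.failover {
		// No primary yet, or the primary went silent: elect this replica.
		d.elected[cluster] = &election{replica: replica, lastSeen: now}
		return true
	}
	if e.replica == replica {
		e.lastSeen = now
		return true
	}
	return false // sample from the non-elected replica: drop it
}

func main() {
	d := &deduper{failover: 30 * time.Second, elected: map[string]*election{}}
	now := time.Now()
	fmt.Println(d.accept("prod", "replica-a", now))                    // true: elected
	fmt.Println(d.accept("prod", "replica-b", now.Add(time.Second)))   // false: not primary
	fmt.Println(d.accept("prod", "replica-b", now.Add(2*time.Minute))) // true: failover
}
```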
D
Now, you still need an application doing this election on the receiving side, and the main reason is the typical deployment mode: you have one centralized Mimir, and then you write from remote locations, like multiple data centers, where typically there's no direct connectivity from Mimir to the different data centers. The main idea is, I don't know, you set up Mimir behind a public cloud load balancer, with authentication in front of it.
E
How does it work regarding the Ingress? I guess you need some tuning for the Ingress, because, for instance, when our agents send to our Prometheus, even if it's one Prometheus per cluster, our nginx Ingress with the default configuration could not manage it; it was just denying the data, too much data for it. So we had to do a bit of tuning.
D
We don't run with nginx at Grafana Labs. Krajo, maybe you can help regarding nginx in the Helm chart?
E
Does the nginx Ingress come with some specific tuning to manage the flow of data? If you have, like, I've seen a few megabytes per second of data sent over HTTP from the agents to Mimir; the Mimir Ingress should...
C
It shouldn't matter; there's no special setting. It's up to regular service selection on the Kubernetes side to balance it; there's nothing special there.
E
Yeah, yeah, it was not with Mimir; it was with Prometheus that we had this issue, and we had to do some tuning on our Ingress. So maybe it's different with Mimir, if it's provided with some specific settings; I honestly don't know what might change, but we will see when we try it. Okay.
I
Yeah, I mean, they should be introduced gradually.
I
Yeah, and I've finished the work on autoscaling the ruler and the gateway, pretty much; the query-frontend is following, and also the compactors.
C
One thing I just wanted to mention: when we talk about features, there are sometimes two steps before they make it to the Helm chart, because internally we have a different setup, and I think we're always working on that. So once we've tested them out, then they get into the Helm chart. But in fact, I just checked, and the nginx does have HPA capability in the Helm chart already, so you can configure that as well.
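For reference, this is roughly what an HPA for the nginx component amounts to, written with the Kubernetes Go API types; the object names and the 1-to-10 replica and 75% CPU targets are illustrative, not the chart's actual defaults.

```go
// Scale the nginx Deployment on average CPU utilization.
package main

import (
	"fmt"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func nginxHPA() *autoscalingv2.HorizontalPodAutoscaler {
	minReplicas := int32(1)
	targetCPU := int32(75)
	return &autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "mimir-nginx"}, // illustrative name
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       "mimir-nginx",
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 10,
			Metrics: []autoscalingv2.MetricSpec{{
				Type: autoscalingv2.ResourceMetricSourceType,
				Resource: &autoscalingv2.ResourceMetricSource{
					Name: corev1.ResourceCPU,
					Target: autoscalingv2.MetricTarget{
						Type:               autoscalingv2.UtilizationMetricType,
						AverageUtilization: &targetCPU,
					},
				},
			}},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", nginxHPA())
}
```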