From YouTube: Grafana Agent Community Call 2022-06-15
Description
This call covers usage stats, hardware requirements, and an initial peek at the state of Flow.
A: Hey, hello everyone, welcome to today's edition of the Grafana Agent community call. I'm Paschalis, and I will be your host for today. This talk will be available on Grafana's YouTube later on, so if you don't want to be seen or heard, consider this your warning. In today's agenda we have two main topics.
A: First, we're going to present a new way we plan to utilize usage data to understand the use of beta features in the agent and how it can help us plan our next releases, plus some data around the agent's hardware requirements as we scale up. We would also like to give a small update on how Flow is going.
A: If you have no idea what Flow is: it's a new way of configuring the agent that is going to be a beta feature in the next few months. We dedicated a whole episode of the community call to it (I think the last community call), so if you're interested, you can go there and check it out.
A: So without further ado, we move on to the first topic. Mark, do you want to take over?
B: Okay, so I want to show you what we added in the last release of the agent, version 0.25. We basically added the ability to report the usage of the feature flags that agent users can configure.
B: This will help us identify which feature flags are being used more, and prioritize whether we want to promote some of these features from experimental to a more stable state. By default, we decided to make this opt-in... sorry, opt-out, so it's enabled by default. We follow the approach of other projects like Loki, as described in this documentation.
B: The documentation gives a bit of an explanation about this. Basically, we are sending this data, which, by the way, is completely anonymous; there's nothing identifying that we track. We send these statistics to our internal clusters, and if for any reason you don't want to report this data...
B: You can use this flag, and it will disable the reporting. So that's it. We encourage people to keep this reporting enabled, because it's going to help us get better insights into the adoption of features, and maybe in the future we'll use it for other statistics as well. So I think that's it. If anyone has a question?
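For reference, opting out happens at startup via a command-line flag. A minimal sketch; the binary name and flag spelling are what I'd expect for v0.25 but should be checked against the release docs before relying on them:

```shell
# Run the agent with usage reporting disabled.
# -disable-reporting is the opt-out flag discussed above.
grafana-agent -config.file=agent.yaml -disable-reporting
```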
B: Right, so there was an internal discussion about that. The Loki team added this before us, and they initially did it as opt-in, and after the rollout no one enabled the feature. So they decided to make it opt-out, because they didn't have any data after a couple of weeks or months of having the feature deployed. That's why other people in Grafana encouraged us to follow the same approach.
B: Yeah, so this is the dashboard so far, and as we can see, it shows the different versions that are being reported.
A: No, we don't keep any detailed geographical data; these are all ballpark estimates about where the request came from. All we're gathering is just the version of the agent currently running and what kind of flags were passed in at the command line. Sorry for interrupting.
C: I imagine all the main versions, the ones that really stand out, are probably us rather than external users, right?
C: Yeah, I mean, I think it's really interesting how the OS breakdown happens, where we have way more Windows users than we have Mac users.
A: But we could not extrapolate that to how the agent would react when scaling up. So in order to test that, we provisioned some environments where we could deploy a standalone agent and stress it up to 14 to 16 million active series, meaning series that are actively recording data, and then a more real-life scenario of having around 400,000 series. We would never recommend running the agent with the former kind of load; you would be much better served by using one of the clustering methods from our documentation page. But nevertheless, this is some data that we'd like to share with the community.
A: This was heavily inspired by what Prometheus does with prombench, which can, not automatically, but using a GitHub-driven approach, serve data about how a certain PR affects the performance of Prometheus. Maybe in the future we could have something like that for the agent as well: being able to quantify a regression or an improvement right there in the PR.
A: Any other specific questions before we go ahead and show the results?
A: Can you see it now? Yeah? Okay. So this is the internal benchmark that we ran. The first one is the more aggressive one, with 14 to 16 million active series, and the second one is the more realistic scenario. We're measuring things like OS hardware usage levels, such as memory, CPU, disk space usage, network rates and file descriptors, as well as how the WAL size oscillates when it is checkpointing data or clearing out older entries, and how the bytes-per-series metric evolves over time.
That means how much memory the agent consumes per active series, as well as some Go-specific metrics: how many goroutines are running, to verify that there are no leaks, and whether garbage collection takes a certain amount of time and does not explode for some reason.
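As a rough illustration, the bytes-per-series figure can be derived from the agent's own self-scraped metrics. A sketch in PromQL, assuming the agent exposes the standard Go runtime metrics and a WAL active-series gauge (the metric name `agent_wal_storage_active_series` is my assumption here, not something stated in the call):

```promql
# Approximate memory cost per active series
sum(go_memstats_heap_inuse_bytes{job="agent"})
  / sum(agent_wal_storage_active_series{job="agent"})

# Watch for goroutine leaks over the course of the benchmark
go_goroutines{job="agent"}
```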
A
And
after
getting
all
this
data
and
having
an
event
that
solely
scrapes
matrix
and
send
them
over
to
some
remote
right,
there's
no
clustering
going
on
there's
no
filtering
of
other
targets,
metrics
happening
with
reliable
filtering
no
logs
or
traces.
A
We've
come
to
the
conclusion
that
for
a
hundred
thousand
series
incorrect
here,
it
seems
like
we'd
recommend
you
running
with
at
least
0.01
cores
that
might
sound
stupid.
But
if
you
can
configure
that
limit
in
kubernetes
and
it
might
make
sense
about
0.6
gigabytes
of
memory
and
half
a
gigabyte
of
disk
space.
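Expressed as a Kubernetes resource spec, those per-100k-series figures might look like the following. This is a minimal sketch based only on the numbers quoted in the call, not an official sizing guide, and the limit value is my own assumption:

```yaml
# Illustrative resource requests for one agent container
# handling on the order of 100k active series (figures from the call).
resources:
  requests:
    cpu: 10m        # 0.01 cores
    memory: 600Mi   # roughly 0.6 GB
  limits:
    memory: 1Gi     # headroom above the observed usage (assumption)
# Plus about half a gigabyte of disk for the WAL,
# e.g. a 512Mi PersistentVolumeClaim.
```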
A: Yeah, thanks for that. Basically, we did all this to standardize around the virtual CPU units that Google Cloud provides, so we used a standard machine type for the provisioned machines.
A: Yeah, that makes sense. Just to reiterate, this is something of a minimum requirement.
A: The generator machine was generating metrics using avalanche: basically two sets of metrics, one more stable set whose sample values would fluctuate, and one set of metrics that would go away over time and come back with different label sets, to simulate a more realistic scenario, like container labels on pods or something like that.
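For anyone wanting to reproduce a setup like this, avalanche (the Prometheus metrics load generator) exposes flags for exactly this kind of churn. A rough sketch; the flag values below are chosen for illustration and are not the ones from the actual benchmark:

```shell
# Stable set: series kept around all day, sample values change every 30s.
avalanche --metric-count=500 --series-count=100 \
  --value-interval=30 --series-interval=86400 --metric-interval=86400 \
  --port=9001

# Churning set: series are replaced on an interval to simulate
# label churn (e.g. containers coming and going).
avalanche --metric-count=200 --series-count=50 \
  --value-interval=30 --series-interval=300 --metric-interval=3600 \
  --port=9002
```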
A: The runner machine, the second machine, which was just running the agent (a v0.24 instance, if I remember correctly) and nothing else, was scraping five copies of the first endpoint and then one copy of the second endpoint, and it was just dumping the data on a fake remote write endpoint that always returned immediately with a 200 status code. To obtain this data, we also enabled the process and node_exporter integrations.
A: That way we were able to get this data back to Grafana Cloud and load it into a dashboard without carrying extra load and contaminating the results.
A: Let me see if there's anything more specific. We have recorded the commands used to generate the avalanche metrics and the configuration for the agent, so once this document is public, maybe someone could replicate these results if they wanted to. But I think that's pretty much an overview of how this data was obtained.
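The agent side of such a benchmark is a plain static-mode config: scrape the avalanche endpoints and forward everything to a sink. A minimal sketch; the endpoints, job name, and paths are illustrative, not taken from the actual benchmark document:

```yaml
metrics:
  wal_directory: /tmp/agent-wal
  configs:
    - name: benchmark
      scrape_configs:
        - job_name: avalanche
          scrape_interval: 15s
          static_configs:
            - targets:
                # the stable endpoint (scraped several times in the
                # benchmark) and the churning endpoint
                - localhost:9001
                - localhost:9002
      remote_write:
        # A no-op receiver that immediately returns 200, so
        # remote-write overhead is measured without a real backend.
        - url: http://localhost:8080/api/v1/write
```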
A: Okay, to avoid a longer awkward silence here, I would like us to move on to the third part and provide a small update on how Flow is going. Again, Flow is a new experimental feature that we're bringing into the agent, which will allow you to configure the agent in a different way, and allow us to map the agent configuration more onto the mental model of how a user would like it to work, instead of replicating Prometheus-compatible YAML in duplicate over and over again.
A: Okay then, let me give a short update. Right now we have merged into main the logic of how Flow components should work, and we are currently using HCL as the base of the configuration language, plus two more components: one that can read the contents of a file from disk, and a component that can mutate targets based on relabel configs that come from a service discovery component.
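To make that concrete, an HCL-based Flow config wiring those two components together might look roughly like this. The component names, attributes, and reference syntax below are purely illustrative guesses at the prototype's shape, not the actual Flow syntax:

```hcl
// Hypothetical: read a credentials file from disk,
// re-reading it whenever the file changes.
local_file "api_key" {
  filename = "/etc/agent/api-key"
}

// Hypothetical: take targets from a service discovery
// component and mutate them with relabel rules.
targets_mutate "filtered" {
  targets = discovery_kubernetes.pods.targets

  relabel_config {
    source_labels = ["__meta_kubernetes_namespace"]
    regex         = "production"
    action        = "keep"
  }
}
```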
A: So yeah, if that sounds interesting to you, I think there's more good stuff on the way. You can either wait a few more releases to really get to see how it would work for you, or you can check out the main branch and see what we're currently building. There's work to be done, so if you ever want to contribute, it's open source, and we'd love for more people to join in on the fun.
C: Go ahead, Robert. All right, so right now all the Flow code isn't in the normal agent command; there's a separate command we're working with for prototyping. But when we're ready to release it as an MVP, it'll be in the main agent command with a flag to enable it. I don't know what release that'll be, probably like v0.27, probably not v0.26, though within the next few months is the hope.
A: Yeah, and again, as Robert said, this is not something that's going to impact you in any way. It's going to be an experimental way for you to have more control over what the agent does, and allow you to tie all the different components together in new and novel ways.
A: I think there's fun stuff coming up, yeah, and getting the community involved in these early stages would also be super nice.
A: Okay, I think that's the agenda for today. Does anyone else have an extra topic to bring up here, something that popped into your mind at the last minute?
A: Okay, then I'll probably see you in, like, six weeks. Right.