From YouTube: Grafana Agent Community Call 2022-08-17
Description
Discuss the operator's future and continuing flow work.
A
Hello everyone, and welcome to the August agent community call. As always, we're happy for anybody to jump in with any questions or comments; if you don't feel like just jumping in, feel free to add them to the chat. We'll go ahead and jump in on our agenda: we have the state of Flow. Robert?
B
Thanks, Matt. I'm trying to think where you want to start. So: we've been running Grafana Agent Flow in our dev clusters now for 13 days. Does that sound right? 4 minus 17 is negative 13, yeah, okay, so 13 days, and things are looking pretty good.
B
We ran into a few issues at first, where it was using too much memory. In fact, let me actually show everyone; that might be more interesting than me just saying it's happening. We have a few new dashboards for Flow, but I'm going to start with...
B
...the operational one, okay. So if I switch over to the agents that are running Flow.
B
We've been running it so long now that I don't think we have data anymore for a comparison view, but really this is pretty comparable to the normal agents. Bytes per series is around six-ish, which is pretty normal.
B
We
have
eight
agents,
but
with
like
two
replicas
each
and
then
across
those
it's
like
five
million
active
series,
so
I
think
we've
been
really
trying
to
hammer
flow
and
get
as
much
value
as
we
can
out
of
it,
and
it
looks
like
it
really
doesn't:
have
that
much
operational
overhead
compared
to
the
normal
agent,
which
is
which
is
good
and
pretty
promising
for
for
its
results.
We
also
have
this
new.
Let's
just
switch
to
light
mode.
That's
fine!
Let's
change
that!
B
We also have this new dashboard for the Flow controller. You can see that we're running eight agents, and across those agents there are 368 components.
B
Every component is healthy, and those components are being updated roughly—how would you say this—one-and-a-half-ish times a second. Most evaluations, where a component gets updated and the graph is re-evaluated, take about 120 milliseconds, which is pretty quick, but sometimes the really big components take a little bit longer to update. We've been trying to get this number down as much as we can, and I think it's in a pretty good state.
B
For those who weren't aware of the way Flow works: you have different components which interact with each other using a declarative config, and if one component updates while another component is referencing that component's config, the component that's referencing it will update itself. So that's what we mean by graph re-evaluation, where all the components that are referencing other components get updated.
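As a rough sketch of the referencing behavior just described (component and argument names here are illustrative assumptions, not necessarily the exact schema that shipped), a River config wires one component's export into another component's argument:

```river
// local.file watches a file on disk and exports its content.
local.file "api_key" {
  filename = "/var/lib/secrets/api-key"
}

// metrics.remote_write references the file component's export;
// whenever the file's content changes, Flow re-evaluates this
// component with the new value.
metrics.remote_write "default" {
  remote_write {
    url = "https://example.com/api/prom/push"

    basic_auth {
      password = local.file.api_key.content
    }
  }
}
```

The reference `local.file.api_key.content` is the edge in the dependency graph: updating `local.file "api_key"` triggers the re-evaluation of everything that reads from it.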
B
So sometimes it's slow, and I think if you come across this, the recommendation would be to try to restructure how your components are organized—maybe to reuse information as much as possible. What we're running right now is a one-to-one translation from the existing config to Flow, which means we have something like eight service discoveries which are all doing the same thing, because that's how Prometheus would work.
B
We have the same targets being relabeled over and over again, because that's how Prometheus would work, and I think that adds up—sometimes relabeling pods takes quite a long time. So all of that can be improved, but we wanted to test a one-to-one match before we tried optimizing into something you couldn't do in Prometheus. Matt?
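The kind of deduplication being described could look something like this (a hypothetical sketch; the component and argument names are assumptions): a single service discovery component is declared once, and several scrape components reference its targets, instead of each scrape running its own identical discovery and relabeling:

```river
// One Kubernetes pod discovery, declared a single time...
discovery.k8s "pods" {
  role = "pod"
}

// ...and referenced by multiple scrape jobs, so the expensive
// discovery work is shared rather than repeated eight times.
metrics.scrape "app_a" {
  targets    = discovery.k8s.pods.targets
  forward_to = [metrics.remote_write.default.receiver]
}

metrics.scrape "app_b" {
  targets    = discovery.k8s.pods.targets
  forward_to = [metrics.remote_write.default.receiver]
}
```

This is exactly the kind of reuse a one-to-one Prometheus translation can't express, since each Prometheus scrape config carries its own service discovery section.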
A
Yeah, and I believe if we actually wrote it Flow-first, instead of converting the config, it would be something like a third to two-thirds smaller—considerably smaller.
B
Oh yeah. We already kind of knew this: because Flow trades off the hierarchy of the current agent for a flatter config, the new config will be a lot bigger, and it turns out to be something like twice as many lines. Although a lot of that is whitespace that you wouldn't really have in YAML, so that size can be cut down even further to be more equivalent to the YAML size.
B
So, should I move on to the fancy UI we're working on?
B
All right. So this might not make it into main—just warning everyone—but the current workflow today would be... oh sorry, here's, like, I'm running an agent right now. Here's its config file, where we have three components: one gets a file from disk, the other one is going to scrape Brian Brazil's Robust Perception demo and forward it to this third component, which is the write-ahead log plus remote write.
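That three-component pipeline might look roughly like this in River (a sketch reconstructed from the description, not the actual demo file; the names, target format, and arguments are assumptions):

```river
// Component 1: read a label value from a file on disk.
local.file "extra_label" {
  filename = "/tmp/extra-label.txt"
}

// Component 2: scrape the Robust Perception demo target,
// wiring the file's content in as a dynamic label.
metrics.scrape "demo" {
  targets = [{
    "__address__" = "demo.robustperception.io:9090",
    "source"      = local.file.extra_label.content,
  }]
  forward_to = [metrics.remote_write.default.receiver]
}

// Component 3: the write-ahead log plus remote write.
metrics.remote_write "default" {
  remote_write {
    url = "https://example.com/api/prom/push"
  }
}
```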
B
So you can see here we're wiring this dynamic label to the file's content. The file content was changed, and that change got merged into the dynamic label of the target. But anyway—this is pretty big, and here's what we found out for our kind of dev cluster workload.
B
This file was 500 megabytes, and it was so big that you couldn't even load it into a browser. The reason it's so big is all of those repeated service discoveries: we have eight service discoveries, all of them finding 22,000 pods, so that's 22,000 pods being listed in a fancy indented format, eight times. It's a huge file, so we're trying to find ways to work around that, which the UI I'm going to show you in a second will help with. The other kind of debug endpoint is this graph, which uses Graphviz to render out the dependency graph. It's really basic, but taking the lesson that this one giant file is way too big, Matt Durham and I have been working on a UI to explore what a better world might look like.
B
So I'm running the UI right now with mock data—this isn't real data, it'll be replaced with real data soon, but this is just fake data for now. You can see the main page shows a list of components, where I'm running four, and I kind of faked it so they're in all four states—local file is healthy, this one's unhealthy, this one's exited—and aside from this view, we also have the DAG view, which is a similar view of the graph.
B
But now it's laid out with the health of those components, and you can kind of see if something is feeding into an invalid state, or if something might not be healthy because it's depending on something that's unhealthy. Here the arrows mean reference direction: the metric scrape for Kubernetes pods is referencing metrics remote_write "default", and remote_write "default" is referencing the API key to write with. So if the API key was unhealthy, that would kind of mean you might have problems soon with the metrics remote write.
B
Does that make sense to everyone? I should also state that the health of a component is independent from its dependencies. So if local file was unhealthy, that doesn't mean anything referencing it is also unhealthy, because health is calculated local to a component. These are all links, so I could click on a component from either here or here to go to the component page. I'll click on—I forget which one I mocked up.
B
I'm trying to make it look like a real demo. I think it's this one, the metric scrape. So here's the component page. At the top we have the raw River block for what that component is, with fancy syntax highlighting: its targets are coming from Kubernetes service discovery, and it's forwarding metrics to the remote write receiver. You can see that its evaluated arguments are this list of targets, which only has the one fake target in it, and it's forwarding to a River capsule value.
B
We might make that a little bit nicer in the future; right now it just kind of says, "hey, this is a value we don't really know how to represent to you." Inner blocks get shown as indented config, so the job name is being put in a separate section here. Then, at the bottom, we have the components that this component is referencing, and the health of those components, which might be interesting for figuring out...
B
...whether the whole chain is working. On the left we have the navbar. Each of these links—this page will probably get pretty big, so you could click on... you know, right now I'm not scrolling it, but if I resize my window, you could click to go to each section, jump from the nav on the left, which is helpful. Okay. So there are some extra pages we haven't—oh, that's a bug with the z-index—there are some pages we haven't mocked out yet.
B
We're also expecting to have runtime and build information, like Prometheus would show you—right now that's an empty page—and also to show the command-line flags that were used to launch the agent; right now that's an empty page too. Same thing with the config file: this would be the raw, unevaluated, entire config file that the agent currently has loaded successfully into memory.
B
I think all of these things will be generally pretty useful for the agent—really, this would have been nice to have today—but with Flow it's a lot easier, because we can have these generic pages for something like this. And if we wanted a more specific, customized page for what it means for a metric scrape to be shown to a user, then this could be a different rendering.
B
One that maybe expands this targets argument out into its own table or whatever for debugging. But I think, because of the component-based concept, we were able to make this UI more easily than we would have been able to with the agent today, given that the agent today is just a conglomeration of different things.
C
I think my question is more about Flow in general. I'm seeing a lot of things that are not compatible with the agent as it is now. Is Flow going to replace the agent as we know it in the near future, or what are our plans?
B
Near future, no. Okay, so the plan is to try to launch Flow in an agent release next month, before the next community call.
B
We need a way to do clustering with Flow, you know, to kind of replace the scraping service; there's a lot of work that needs to be done. So this path of Flow becoming the default and only path depends on, one, people liking Flow, but then, two, a lot of time. So I think it'll probably be about a year before the existing agent as it is today goes away.
B
Yes—I mean, not necessarily. There will be functionality that, you could say, only makes sense in Flow. Like, I prototyped a component to get keys from Vault and expose those secrets to other components that might need Vault secrets. That's something we really couldn't do within the agent today, because it really depends on this idea that some part of the agent can reach out and ask another part of the agent for a value, and have those things be wired together.
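A hypothetical sketch of what that kind of wiring could look like (the Vault component described was only a prototype, so every name and argument below is an assumption):

```river
// Hypothetical component that fetches a secret from Vault
// and exposes it as an export to the rest of the graph.
remote.vault "prom_creds" {
  server = "https://vault.example.com"
  path   = "secret/data/prometheus"
}

// Another component asks the Vault component for the value;
// Flow wires the two together and re-evaluates this component
// whenever the secret changes.
metrics.remote_write "default" {
  remote_write {
    url = "https://example.com/api/prom/push"

    basic_auth {
      password = remote.vault.prom_creds.data["password"]
    }
  }
}
```

The point is the wiring, not the specific component: one part of the agent exports a value, and any other part can reference it declaratively.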
B
But for most things—the fact that we're working on Flow does not mean we stop working on the current agent, and the way Flow is built, there's a lot of code being shared. So bug fixes for how metrics get scraped will impact both Flow and the agent by default. We should keep doing it like that: we should not be duplicating code right now, we should be trying to share as much as we can.
C
Yeah, I'm going to drop a link here—wow, where is chat? There it is. Basically, a number of months ago we made an RFC titled "let's deprecate the agent operator," and that caused a lot of conversation and, frankly, a lot of concern that we're not committed to the agent operator or that it's going to go away. It turns out people generally like it for what it is, so we just made a document here to express our commitment to the operator and talk about our plans.
C
The primary thing people said they really like about the operator is that it lets them use the Prometheus CRDs—PodMonitors, ServiceMonitors, that kind of thing. It's just a declarative way to monitor cluster resources, and people really like that. Our immediate plan is to bring that into the Grafana agent itself, so that the agent itself would be watching those custom resources, generating the scrape configs internally, and doing the service discovery—and this works really well as a Flow component.
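In Flow terms, the idea being described is a component that watches the Prometheus custom resources and emits targets directly; a purely hypothetical sketch (no such component existed at the time of this call, so everything here is an assumption):

```river
// Hypothetical component watching PodMonitor custom resources
// and turning them into scrape targets inside the agent itself,
// with no external operator generating config.
discovery.podmonitors "all" {
  namespaces = ["default", "monitoring"]
}

metrics.scrape "from_crds" {
  targets    = discovery.podmonitors.all.targets
  forward_to = [metrics.remote_write.default.receiver]
}
```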
C
Maybe it would need to be back-ported into the—I don't know what we're calling the existing agent—but the plan will be: if we implement that—not service discovery, the custom resource discovery—in the agent, so that it's handling all of that, then a large part of the operator, the part that's reloading the agent...
C
...every time you deploy a new pod, is simplified a lot, and the operator then becomes primarily for deploying the agent itself. That's still an important job, and people are using it for that—I think particularly where there's a complicated sharded config, or you have a lot of integrations, or you have multiple agent deployments.
C
The operator can be really useful to declaratively define what Grafana agents you want—there are DaemonSets or StatefulSets or Deployments that can get pretty complicated, so the operator is a good way to do that—but it's also a lot more static config. So at that point we may be able to assess alternatives; we may say, "oh, this Helm chart does this really well." But that's not to say we're ever going to say...
C
..."oh, the operator is dead"—unless it turns out the operator really isn't useful, but I think it will be, and we can continue to make it useful. So we've had a lot of issues with the operator: there are some documentation issues, some usability issues, and some maintenance issues. But I think the bottom line is we can address all of those internally, without making a lot of big external changes, and that's really the crux of it.
C
I don't want people to be saying, "oh, it's going to go away, so we shouldn't use it." If it's useful, it'll stay; we're not going to take anything away. We're committed to these use cases of declaratively defining your monitoring, especially the PodMonitors and ServiceMonitors and such—I think those are really valuable. We're going to keep supporting those, we're going to make the best way possible to deploy the agent, and we don't like breaking stuff.
C
So hopefully we can assuage some of those fears and just make this project really good.
A
Since this question's been asked a few times: the Grafana operator is currently labeled as beta. Do we have any idea...?
C
Yeah, I put a little note about that, because that question has come up as well. It is labeled beta. I think it started as an experiment—we weren't sure if it was going to catch on, and it turns out it did catch on; people like it. So it's not beta as in "we're not committed to it."
C
It's beta as in it still needs work. And since I've outlined some major refactorings—I hope they'll all be internal changes, but I don't want to commit us to a style of "we're not going to change anything at all, ever." Certainly any changes would need to be well justified and have a transition, and once those are done, I assume we'd revisit the beta designation. Go ahead, Robert.
B
If something is beta, that means we're more confident about the use case being supported, but the way that use case gets exposed to users might change over time. So if you use the operator because you want the use case of Prometheus CRDs, that is something we have committed to, but the operator as a delivery mechanism could potentially change whenever, right? That is what's in beta.
C
So, TL;DR: bringing the Prometheus operator CRDs into the agent, I think, is pretty uncontroversial—people seem to really like that idea. It means the operator is not required to take advantage of those features, which I think is universally a good thing, and the operator doesn't really need to change externally to handle that. So that's, I think, a good change we're looking at in the near future, and from there it just lets us re-evaluate...
C
What is the agent operator—okay, yeah—what is the operator for? We have a lot of options, and we're going to reassess that, but kind of constantly. The goal is: make it easier to monitor your stuff, particularly with those CRDs that people like, and make it easy to deploy the agent. We're not going to change anything that makes any of those wrong, and we're not going to rip anything out from under people that are depending on it. So.
A
All right, any questions about the operator before we move on?