From YouTube: 2021-11-02 meeting
A
Okay, since the last meeting I posted a couple of documents there. One was the specification draft that I talked about and had been working on for a while. It's there now; it's a large document, so it may not be very easy to read, but especially if you have any experience with protocol design, it would be great to have you review and comment on it. Of the other two that I posted there, one is kind of an overview talking about how I think the agent management can work.
A
This is primarily a reiteration of what we discussed in the first meeting. And the last one is the result of a discussion with Bogdan, who's the other maintainer of the collector, and that's a possible way that we can implement the management features for the collector, or in the collector: the supervisor model.
A
I had kind of put together some initial thoughts for the supervisor model, but it's fairly wide open in terms of which features should go in the supervisor versus in the collector; there's a spectrum of possible solutions here, and I think there were some very nice comments about stability. So I guess there are concerns about keeping the supervisor slim, anyway.
A
I think it would be great if anybody actually has interest in working on this, on figuring out what's the right content that goes into the supervisor versus being implemented in the collector. It would be great if somebody could take that one, maybe do some research, some prototyping, and figure out what we do with it.
A
And in the meantime, I myself plan to continue working on the protocol. I'm going to do some benchmarking and scale testing, things like that, to gain a bit more confidence about the protocol. That's all I have. And I think the other one, Przemek, you posted, it was about own telemetry, right? Maybe you can talk a bit about that.
C
Yeah, sure. Okay, for whatever reason my screen went off, but I think it should be good now. So one of the first things I identified for real management is understanding what's happening with a remote agent instance, and frankly I think we're not doing that great a job on that with the OpenTelemetry Collector, mainly because we have so many great means for sending data, and we are not using them for the OpenTelemetry Collector's own telemetry.
C
So I have prepared this document focusing on own-telemetry reporting, and also maybe touching on status reporting. However, as we've been discussing with Tigran, status reporting is actually something else than telemetry reporting.
C
It serves a different purpose. Status reporting is used, or might be used, for example for providing the remote configuration depending on the environment in which a collector instance runs, and so on, while own telemetry is more about just providing the signals, so one can understand what's happening with this particular instance, or troubleshoot it, and so on. So I posted this on Slack and I've seen some comments there, which I like, despite the fact that we've been working on this separately with Tigran.
A
So can you maybe add the link to your document to the meeting-notes document as well, so that people can open it and have a look? And I guess in the meantime, well, we should probably pause here and see if anybody has any comments or questions on the documents that I posted.
B
A quick question, thanks, Tigran, on the supervisor model. I had not actually caught up on that, so this is the first time I'm seeing it. There are patterns that suggest that this is a good idea, right. It includes an additional executable that we will need to build and install, and from what I saw it will also act as a proxy for the collector.
B
How firm is it that that needs to be part of it?
A
How firm? I think it's not firm; it's more something that we need to figure out. It's a possible option. We could put everything inside the collector, right: technically the collector could just receive the instructions directly from the management server. The reason that we think the supervisor model is probably useful is: what if the collector has a bug and crashes when it receives some instructions from the server?
A
And now it's gone, right? So you would probably have to rely on some sort of built-in system watchdog to restart it, but even that may not help if the collector crashes on configuration loading or something like that. So it seems like it's not a bad idea to have something that is highly reliable, small, and not complicated.
A
There is the small part that you rely on being up and running, versus the much more complex collector, which has a lot of code written, which may actually contain bugs and may crash; the probability is much higher for it to do something wrong and crash. So that was the idea. I don't think it's firm at the moment, but I think it's worth exploring this idea, and there is a part that is especially unclear.
A
It would be very useful, when we work on this and do some prototyping, to structure the prototype code in a way that is sufficiently decoupled from the rest, so that it allows us to make these decisions down the road, in the future. If we decide that, no, this doesn't really belong in the supervisor, it should move to the collector, we can easily do that. So whoever works on this prototyping, please keep that in mind.
A
Start from a minimal supervisor, completely decoupled from the collector, and then start building on top of that. But even then, the handling of particular features, and of responses as a result of receiving something from the server, should preferably also be very loosely coupled, so that we can later decide to move bits around here and there, right. At least that's how I'm seeing it.
D
Previously, in another conversation, the topic came up around last-known-working config and some sort of state management within the agent, for the situation where an agent gets, you know, a new config from the server that is not valid for whatever reason, or it surfaces a bug and it crashes.
D
My expectation at that point was that the system supervisor, whether that's systemd or something else, would bring it back up, and then there would be a fallback mechanism within the agent itself, to kind of avoid the scenario where we're looking to create yet another supervisor process to manage the agent. That's just my reaction anytime I see another supervisor proposed.
A
Yeah, no, that's totally fair, right. You have some sort of a supervisor built into the system, so why are we not using that? And on Windows as well: if you have a service, then services automatically restart on crashes. Very fair. And again, I think that is a totally valid way of implementing the agent management. I don't know at this point; for me, I'm not sure.
A
What's the right way? To be honest, there are pros and cons. One additional argument in favor of going with the separate supervisor approach is that we can actually iterate on it quicker. With the collector code base, well, I guess the bar is a bit higher if we try to make all these changes there, versus if we say that this is, you know, maybe even a throwaway implementation: we try a supervisor, we iterate on it, and we can move faster with that approach.
A
If it's a separate repository, a separate code base, then maybe later down the road we say: okay, you know what, this actually belongs in the collector; let's move it there. Overall, I think the entire development process will move faster if it is in a separate code base. So that's another argument, after I discussed it with the collector maintainers, because the collector is right now in the phase where we want to stabilize it and make a generally available release.
A
Exactly. The beauty of it is that you can easily try that approach without even doing anything in the collector codebase. You could try it completely independently from the collector, but then, if we want, we could move that code that interacts with the server, receives the configuration, and validates it; we could do that in the collector later.
A
So again, I think at this stage we need to keep our options a bit wide open and do some sort of experimenting before settling on a single one, unless somebody has, I guess, very strong arguments in favor of one approach which clearly tell us that the other approach is completely unreasonable.
E
Tigran, one thing I want to quickly mention here is that the supervisor model might also help in terms of upgrades, and rollbacks or probably downgrades, when we do them in the future. Because if everything is being handled in the collector, trying to upgrade itself is going to be a bit of a pain. But then again, if we go the supervisor route, then you also need to figure out what happens if we want to patch the supervisor, which means it's more like a chicken-and-egg problem.
G
Yeah, I wanted to say something, Tigran. I'd like to second your approach of building it as a separate process, partially because I also agree with Sean that there's a bunch of different ways people schedule and manage processes. Unfortunately, that creates a lot of duplication; there's a lot of variation. But if we build our collector manager as a separate process, that actually helps ensure that other things, like systemd or whatever container scheduler you're using, have the ability to also control the collector.
G
If we build it baked into the collector, it becomes easier to potentially fail to expose some kind of functionality. So I think that separation, besides the development-process benefit you're talking about, of making it easier to iterate, also ensures the collector is locally exposing enough control surface for these other systems to manage it.
G
Longer term... sorry, go ahead, Anthony. No? Sorry, I was going to say: longer term, if this supervisor that we build is actually useful, I don't see a problem with baking it into the collector code base, in the sense that there's then only one image you need to download, and you can spawn two separate processes that do two separate things, or one process that runs them both. Go makes it really easy to kind of have multiple independent processes.
H
Yes, I think that kind of gets to the point that I wanted to make too, which is that I'm kind of torn about this. I see the OpenTelemetry Operator that we have for Kubernetes already as a kind of supervisor like this: it can manage configuration, restarting, and dealing with it. But that only works for Kubernetes, right, and there are existing systems like systemd or Windows services that can handle other deployment types.
A
Right, yeah, that's fair. I think one possible way that we could implement this is by having this nice feature that we wanted to have and still don't have: the ability to watch the configuration file for changes and reload the collector, which is valuable in itself. But once we have that, the supervisor could use it, right, to update the configuration. So that's one possible option. I was, I mean, I think it's just a second or third thought on an additional process, but also I wanted to at least offer: we have an implementation that's very similar to the supervisor that may be interesting, that we could present or potentially show folks, and just give them an idea of how it operates, and what the pros and cons of it are, which is just what we're describing: a very lightweight supervisor of agents, and specifically of the OpenTelemetry Collector.
I
It would be very useful, so that we can learn from your experience, right. And I guess one question then is: it sounds like upgrading the agent, or the collector, is part of the scope of this. Is that, you know, a hard requirement, or is it something that we're still discussing?
A
I think that's a possible feature that we would want to have in the future. I don't think it's something that we need right now, but at least, whatever we do, however we design these capabilities, we need to account for it.
A
I think we would want that to be possible. So if you're using something like Kubernetes, you probably don't need that, right: you use the Kubernetes control plane to update the images and all that stuff. And if you're using some other deployment tool for your VMs, like Ansible or whatever, you probably do the updates using that as well.
A
Well, then, some people don't use either, right. So I guess for those use cases you're maybe still interested in auto-updates. There is a reason lots of tools do auto-updates: it's because handling updates is painful. You have a variety of tools to deal with in mixed infrastructures, so you have to go update your VMs separately, your Kubernetes stuff separately, your physical hosts (if you're still using those) separately. It's painful. I don't know when we will actually have that capability, and maybe the answer is never, because nobody needs it or wants it.
C
Yeah, I just have one more concern. Maybe I missed something, but I'm wondering whether this model we have with the supervisor idea will not cause losing some data when the collector is being restarted to pull the new configuration.
C
If that were in-process, then I can imagine that we could have a more graceful mechanism: let's say one receiver's configuration is at some point swapped to some different configuration. When this is a full process restart, it's a bigger item. Also, I don't think that we will use the remote configuration in a Kubernetes environment, but I can imagine some other cases, let's say processors, or things that require pulling some data from some API.
C
They would need to do the same, so this can have some consequences. Essentially, I think that's my concern.
A
Yeah, I don't think we have to restart the process, actually. If, let's say, the collector implements a watcher, it can watch the config files and reload itself, reload its configuration. Then you don't need to restart the process; you just write the config file from the supervisor, the collector reloads it, and then it's a matter of how you implement this reloading: are you shutting down everything and starting again, or can you calculate the deltas and change only the components that actually changed in the configuration?
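The delta idea can be sketched as a small diff over the component map. This is Python for illustration; `changed_components` and the config shape are hypothetical, not the collector's actual configuration model, which lives in Go.

```python
def changed_components(old_cfg, new_cfg):
    """Return the names of components whose config differs between versions.

    Instead of tearing the whole pipeline down on every reload, the
    collector could restart only these components; added and removed
    components show up too, since the missing side compares as None.
    """
    names = set(old_cfg) | set(new_cfg)
    return {n for n in names if old_cfg.get(n) != new_cfg.get(n)}
```

For example, changing one receiver's endpoint and adding a processor would flag just those two components, leaving the untouched exporter running.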
A
It doesn't necessarily have to be a process termination and restart, and I don't think that is actually dictated by the supervisor model. You could keep the process running, and maybe the only situation when you actually need to terminate the process is when you're updating the executable; you want to restart it then, right. So I think those are probably independent decisions. That's what I'm trying to say.
G
I think, at the end of the day, the approach here is clean separation of concerns, right. Just make sure we're not building a giant ball of config management and restarting and image downloading; these are all separate concerns. It's possible to keep them separate and build a design that works with these various capabilities being handed off to various other systems, because that's just going to be the reality, right. Yeah.
A
Totally. And I think I called it out specifically in the very first document I wrote, where there is a number of principles for designing the agent management, and one of the items is exactly what you said: it needs to be a loosely coupled set of features, not something monolithic where you have to use everything or nothing.
G
Yeah, I like that. I think when it comes to config management, there's definitely this issue that a lot of this configuration is probably going to be pushed out from some kind of, you know, observability back end, as opposed to the deployment system, potentially.
G
So that's a reason for why you want to have something that can have a port open and receive new configuration. But if it's implementing that by doing the same thing that a deployment manager would do, by, you know, overriding the config file, for example, then those are examples of just keeping things simple and not creating multiple ways of implementing this stuff. So I don't know; I think everything you're proposing, Tigran, sounds very clean and doable.
G
I kind of have faith that you've got this stuff in your head, so it's feeling good to me.
A
Okay, to touch a bit on what you were talking about: I think, yes, there are tools for configuration management, there are tools for deployment, so you could use those, right. There are tools for monitoring agents. So why are we not just using those tools? Why are we reinventing something new?
A
I tried to describe the rationale for why a cohesive agent management is important, and the reason is that these tools are uncooperative: they don't know about each other, and cooperation can be very valuable here, when you want to actually tailor, for example, the configuration of an agent based on what the agent is telling you about itself. That's not possible to achieve using separate configuration-management and separate observability tools, right; they are disconnected, they each do their own job.
A
If you want to actually manage, you have to manage some of those using Kubernetes control planes, some others using whatever your deployment tool for the VMs is, etc., right. Instead of that, you could use a single tool to at least see the state of things as they are, and push configuration updates to all of those things. For example, say I have a bug in a particular version of the collector.
A
I want to change one of the settings in the configuration file for all collectors of that particular version, regardless of where they run, and I can do that if I have a centralized tool. But if I'm using a collection of different tools, depending on where the collectors run, I have to repeat this process for every single environment or infrastructure type or whatever it is, right.
G
And that's fine, as long as you've broken it down like you just did, as long as we are problem-oriented instead of solution-oriented. So one issue is, you know, there's a certain amount of information about collectors that needs to be discoverable, right. So the collector needs to provide that information: what version am I, whatever other things it needs to provide. Collectors need to have their configurations be updatable.
G
Okay, collectors need to be able to live-reload the portions of that configuration that they can live-reload.
G
They need to be able to, like, drain and cleanly shut down, to handle configuration changes that they can't live-reload, because I imagine we're going to have both for a while, or we'll probably never be able to implement live reloading for everything. And then you need something that can be hooked up to an observability back end to control things like sampling, for example: stuff that needs to be pushed out quickly, and in a pattern that just doesn't make sense to go through your deployment system.
G
So that's great. And the thing that's talking to that back end, you may want it to have different capabilities and network rules from the collector and other things, so a separate process can help with that too. As long as we just make sure we're keeping everything clean and separate, then there's no problem putting these pieces back together to manage it through some other kind of system.
J
To that point, I just wanted to suggest that we should consider, to the extent possible, modularizing all of these features into existing components, right: not exporters, extensions mostly, although I suppose there may be an exporter. The collector has a nice framework for keeping things separate, and, you know, for developing things in one repo and then moving them over later, and there are all sorts of benefits to this.
J
And then, to the extent necessary, we can have just a common package that is shared amongst those components, if there's, I don't know, API authentication or something like that that we need.
B
I have a question that's going to steer this in a slightly different direction. Is that appropriate at this point, or are there more comments on the supervisor idea?
B
Okay, cool, so I'll just throw my question in. With all of this agent-management business there's, in my mind (and this might not be complete, so help me if I'm missing something), basically status updates, monitoring, right, and configuration, as kind of the aspects of this, right.
B
I would be happy if we could kind of do something quickly, incrementally; that would be cool. And so then that led me to ask: how do people feel about these priorities? Like, I'll put my chip on the table: I think config is the most important one, but that's just my personal opinion, not representing, you know, anything else.
A
Yeah, so definitely a very valid question, and definitely we should prioritize it. There are a couple of other things that are probably part of the overall solution. One is the agent updates, the executable itself, and the other is support for add-ons or plugins, which doesn't even exist in the collector today, but probably is in the overall management picture.
A
So I think those two are very likely to be lower priority: plugins simply because they don't even exist yet in the collector, and the updates, the executable updates, because I think it's a dangerous feature that we need to get right from a security perspective.
A
Among the rest, the three things that you listed, I think we need to discuss. I'm not completely sure what the actual order of implementation and the priorities are, but it's probably important that we have the entire picture of at least those three features designed, not necessarily implemented, right, and then go for implementation.
A
So probably I would spend a bit, a few weeks, a couple of weeks maybe, on fleshing out more of the design, and some prototyping can start in parallel. Actually, I wanted to propose that we create a repository and maybe start putting together some hacks, even; it doesn't have to be production-quality code, just to get a sense of things.
B
Sounds good. Yes, I did actually forget the add-ons in my enumeration there. So, you know, we have Sean here, right, Sean Porter, who has basically been building Sensu for the last 10 years, and one of the cool things that Sensu has is this artifact distribution method, as they call it over there.
B
That sounds a lot like what we have in mind here when we say add-ons. I don't know if it makes sense at some point to kind of, you know, go through his experience of implementing something like that, but if so, you know, he's here.
A
Yeah, definitely makes sense. I think it would be very valuable for you, Sean, to share your experience and tell us more about the learnings; maybe we're missing stuff, right. Experience is a lot more valuable in these kinds of things, especially if it's long experience, because with some of these things you typically don't realize their full impact until you've lived with them.
D
Right, absolutely, yeah. To be honest, I was very much pulled to the part of the management protocol spec that was related to the add-ons, and I've just kind of been putting together my own notes and thoughts on that. Perhaps I can share them this week, just in Slack, and then, if there are particular areas that we should discuss, we can always do an ad hoc discussion, or just do it async in Slack, or have it as a topic for the next working-group discussion.
D
Yeah, there are definitely some traps you want to avoid, but I'm excited to see it already in scope of that protocol.
G
Cool. Yeah, for what it's worth, my background is actually scheduling and init systems, prior to this OpenTelemetry nonsense, back when I actually wrote code. So I'm happy to also try to help point out potential gaping security holes and, you know, some of those other things.
G
I really, really want us to be problem-focused. I found, even when dealing with writing schedulers and all of that, that the question is: what are the capabilities the collector needs to have, and how can the collector expose those capabilities, rather than thinking about what the supervisor needs to be able to do.
G
So, "how do we let the supervisor do stuff to the collector": it's a little subtle, but you end up with something that's much cleaner if you kind of reverse the way you're thinking about it.
A
Just to make it clear: whatever I wrote so far, it's not some sort of fantasized set of features that I came up with in my mind and wrote down. It's actually the result of me doing internal research at Splunk, reading pain points reported by the customers; the particular features that I came up with are the result of that, right. So maybe there is more than that.
G
Yeah, they feel very rooted. I'm happy to see that a lot of this proposal is rooted in, you know, some Splunk technology, just because it's rooted in stuff that's had to actually go out there and work and solve problems, and I think that's just a great starting point. So I have a lot of confidence in what I've seen so far; I don't want to imply that I don't.
K
Tigran, hey, this is Andy. On the OpAMP proposal, I had a small comment about a rename of the error response, but I was thinking more about error handling in general, and it seems like there's an opportunity for the agent to report back a more detailed response that could help the end user with configuration errors, since we'll be sending configuration down to the agent.
K
The feedback loop so far is an error message, and I'm thinking we could probably flesh that out a bit more with type information, possibly a map of other information that might be appropriate: for example, a metric receiver unable to connect to a metric source. Just wondering what your thinking is there in terms of error handling.
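The richer feedback loop Andy is proposing could look roughly like this. A hypothetical sketch in Python: the field names, the `ErrorType` values, and `to_wire` are all illustrative, not part of the actual OpAMP specification.

```python
from dataclasses import dataclass, field
from enum import Enum

class ErrorType(Enum):
    """Illustrative error categories: a type code instead of a bare message."""
    CONFIG_INVALID = "config_invalid"
    CONNECTION_FAILED = "connection_failed"
    INTERNAL = "internal"

@dataclass
class AgentErrorResponse:
    """Error type plus a free-form details map, so the server can show the
    end user which component failed and why, not just an opaque string."""
    type: ErrorType
    message: str
    details: dict = field(default_factory=dict)

    def to_wire(self):
        # Flatten to a plain dict, standing in for the real wire encoding.
        return {"type": self.type.value,
                "message": self.message,
                "details": self.details}
```

The `details` map is where context like the receiver name or the unreachable endpoint would go in the "metric receiver unable to connect" example.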
A
So I had that in the spec: the state of the connections. Part of the spec is actually pushing down the connection credentials to the agent, and then, if as a result of that the connection cannot be established, I had that as part of the reports that the agent could send to the server, so that it could say: okay, I tried this connection, it didn't work, I got this error. I removed that because I was kind of scared by the size of the specification; I tried to simplify it.
A
So maybe it belongs in the spec, I don't know, and I can add it back. It was one of the things that I did remove; if there is a feeling that it is necessary... I don't know.
A
That second example that you had, that probably belongs to what we call own-telemetry reporting, right. The collector, or the agent itself, does things which are its business, its regular business, and things can go wrong while it is doing its business. That should be reported as part of its own telemetry, whether it's logs or metrics of some sort.
A
But there is the other part, which is not the primary purpose of the agent: the management. Something going wrong while we're trying to apply the configuration is not really the collector's primary business; it's part of the management, and I think these specifically need to be somehow part of the specification, especially because you may not even have a working channel to report the telemetry if your configuration is not applied.
A
For the rest, if we assume that there is actually a working telemetry channel for the collector, for the agent, to report about its inner state, then things like "I'm trying to collect this file and I cannot read it" should probably be part of the regular telemetry that the agent emits, regardless of whether you're managing it remotely or not.
G
I don't know what the best way to do that is, like a spreadsheet or something, or just having them listed in the doc in such a way that people can make their comments on it, but that seemed really valuable. So yeah, that's maybe just an action item to get us rolling. The question I had (this actually dives into the weeds a little bit) is that for configuration management in particular, it would be helpful to understand:
G
Given the current state of the collector, or the collector's architecture, what kind of design constraints does that architecture put on configuration management? You know, we have these complex processing pipelines that the end users can design and construct, and I personally am not familiar enough with the collector architecture to have a sense of what it means to try to change those things on the fly, or, you know, anything like that. Yeah.
A
That's part of the design. Now, how easily we will be able to actually implement this is another question, but at least there is nothing preventing that, design-wise, in the collector. One bit that we were missing is the ability to have pluggable sources for configuration, which we probably need for this, for agent management, and now we have that. There is only one source right now, reading from the local config file, but it is now possible to replace it with something else if we wanted to.
A
So I guess these topics were the focus of the discussion with Bogdan that we had a couple of days ago. I think, well, we do not see major roadblocks at the moment that prevent us from having at least some basic management capabilities applied to the collector. It's more a matter of deciding whether we put these features inside the collector code base, or whether they are more things that the external supervisor effects on the collector by some means: putting files, or maybe through some exposed API.
A
There are dangers to that, right: like, the collector is going to bail out and terminate if the configuration is invalid, for example. So what do you do in that case? The supervisor needs to somehow keep the old configuration there and restore it when things go wrong. So there are nuances there. If you describe it in the most simple terms, the most simplistic solutions are easy, but if you want to have a robust solution which works in all these edge cases, that will take a while.
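The keep-the-old-config-around behavior described above can be sketched as a last-known-good fallback. Python for illustration only: `apply_config`, the `collector_start` callable, and the config shapes are hypothetical, and the real supervisor would have to handle crashes mid-start, not just a boolean result.

```python
def apply_config(collector_start, new_cfg, last_good):
    """Try the new config; if the collector fails to come up, fall back.

    collector_start is a callable that attempts to (re)start the collector
    with the given config and returns True on success. The supervisor keeps
    last_good so an invalid remote config does not leave the agent dead.
    """
    if collector_start(new_cfg):
        return new_cfg, "applied"
    if last_good is not None and collector_start(last_good):
        return last_good, "rolled-back"
    return None, "failed"   # neither config works: report to the server
```

The "failed" branch is exactly the case where a status report back to the management server (rather than own telemetry) is the only working channel.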
G
You know, that's fine if the collector does something simple: it shuts down if the configuration's invalid, or it refuses to load the new configuration if it's invalid, something simple like that. I personally think there is going to be some detail around being able to understand what it means to change a complicated pipeline; I suspect there's...
G
Some
amount
of
operational
complexity
just
falls
out
of
that.
I
would
love
to
understand
how
that's
that
part's
expected
to
work.
Yeah.
A
Yeah
totally
you're
right,
you're
right,
absolutely
the
very
simplest
approach
that
is
very
kind
of
easy
to
implement.
I
kind
of
have
a
prototype
that
does
that
you
shut
down
all
the
pipelines
and
shut
down,
means
that
there's
a
cycle
order.
You
start
with
the
receivers
then
drain
the
pipeline,
then
sometimes
the
exporters
that
is
implemented
that
that
works.
We
have
that,
but
that
that
that's,
that
means
downtime
right.
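The shutdown order described above (stop receivers first so no new data enters, drain the pipeline, flush exporters last) can be sketched like this. The `component` type and function names are illustrative, not the collector's real interfaces.

```go
package main

import "fmt"

// component is a stand-in for a pipeline component with a shutdown hook.
type component struct{ name string }

func (c component) Shutdown(order *[]string) { *order = append(*order, c.name) }

// shutdownPipeline stops components in the order the speaker describes:
// receivers first (no new data enters), then processors drain, then
// exporters flush last.
func shutdownPipeline(receivers, processors, exporters []component) []string {
	var order []string
	for _, c := range receivers {
		c.Shutdown(&order)
	}
	for _, c := range processors {
		c.Shutdown(&order)
	}
	for _, c := range exporters {
		c.Shutdown(&order)
	}
	return order
}

func main() {
	order := shutdownPipeline(
		[]component{{"otlp-receiver"}},
		[]component{{"batch-processor"}},
		[]component{{"otlp-exporter"}},
	)
	fmt.Println(order) // [otlp-receiver batch-processor otlp-exporter]
}
```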
A
So,
yes,
it's
kind
of
a
set
of
very
related
features
that
need
to
come
together
for
for
overall
nice
production
ready
capability
to
to
be
there,
but
again
some
some,
some
rudimentary
simplistic
solutions
are
easy
to
produce.
We
should
start
with
those
I
mean
we
should
definitely
start
with
those
and
then
the
more
complicated
ones
where
you
don't
really
shut
down
the
entire
pipeline,
because
there
was
a
single
configuration
change
for
one
receiver
or
one
processor.
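A sketch of that more targeted reload: diff the old and new per-pipeline configuration and restart only the pipelines that changed. The flat map-of-strings representation is a deliberate simplification; real collector configuration is structured.

```go
package main

import "fmt"

// changedPipelines compares per-pipeline configs (pipeline name -> config
// text) and returns only the pipelines whose configuration changed or is
// new, so the untouched pipelines can keep running.
func changedPipelines(oldCfg, newCfg map[string]string) []string {
	var changed []string
	for name, newVal := range newCfg {
		if oldVal, ok := oldCfg[name]; !ok || oldVal != newVal {
			changed = append(changed, name)
		}
	}
	return changed
}

func main() {
	oldCfg := map[string]string{
		"traces":  "otlp -> batch -> otlp",
		"metrics": "prometheus -> batch -> otlp",
	}
	newCfg := map[string]string{
		"traces":  "otlp -> batch -> otlp",
		"metrics": "prometheus -> filter -> otlp",
	}
	// Only the metrics pipeline changed, so only it needs a restart.
	fmt.Println(changedPipelines(oldCfg, newCfg)) // [metrics]
}
```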
G
Yeah, that's great, and I like the approach of always keeping operational simplicity and understandability as the first thing. So there is downtime with flushing out the collector and restarting it, but as an operator it's easy to understand what it's going to do. And likewise, if it's only going to shut down the pipelines that contained nodes that had a configuration update, that's also pretty easy to understand. As long as we take that approach, I'm sure we can get to something efficient.
G
I'm
happy
to
hear
you
you're
doing
that.
I
I
get
a
little
nervous
about
people
wanting
to
have
so
zero
down
time.
That.
L
Oh sorry, one thought that came to mind as we were talking about this: if you had the supervisor model, you could potentially move some of the logic around how the downtime is managed across a fleet of collectors, if the same supervisor was aware of multiple collectors, so that you could roll the update that way, as opposed to having to worry about adding the logic into the collector itself.
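One way a fleet-aware supervisor could roll an update is in small waves rather than pushing to every agent at once. A hypothetical sketch; `rolloutBatches` and the agent IDs are illustrative only.

```go
package main

import "fmt"

// rolloutBatches splits a fleet of agent IDs into batches of at most
// batchSize, so a supervisor can update one batch at a time and keep
// the rest of the fleet serving traffic.
func rolloutBatches(agents []string, batchSize int) [][]string {
	var batches [][]string
	for start := 0; start < len(agents); start += batchSize {
		end := start + batchSize
		if end > len(agents) {
			end = len(agents)
		}
		batches = append(batches, agents[start:end])
	}
	return batches
}

func main() {
	fleet := []string{"agent-1", "agent-2", "agent-3", "agent-4", "agent-5"}
	for i, batch := range rolloutBatches(fleet, 2) {
		// In a real supervisor each wave would wait for its agents to
		// report healthy before the next wave starts.
		fmt.Printf("wave %d: %v\n", i+1, batch)
	}
}
```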
A
So you probably don't want to push the configuration to your fleet of one million agents at the same time, right? Not a good idea. But another thing is that whatever infrastructure you have in place with collectors and all that, it probably should be ready for one agent or one collector to go down temporarily for a while and then come back, right? You're probably using some sort of load balancer.
A
If
you're
using
properly
implemented
receivers
in
the
collector
with
the
right
protocols,
then
you
shouldn't
be
losing
any
data
as
well
right.
The
receiver
shuts
down
the
sender,
whoever
sends
data
should
be
ready
for
that.
They
should
have
some
sort
of
minimal
queuing
there,
while
waiting
waiting
for
the
agent
to
to
come
back.
So
if
you're
using
the
right
set
of
tools,
then
it
shouldn't
be
kind
of
a
very
kind
of
impactful
event
like
restarting
the
collector
but
yeah.
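The minimal sender-side queuing mentioned here could look like the sketch below: a small bounded buffer that holds data while the receiving collector restarts, then flushes once it is back. `sendQueue` is an illustrative toy, not any real exporter's queue.

```go
package main

import "fmt"

// sendQueue is a minimal bounded queue a sender could keep while the
// receiving collector is briefly down; items beyond the capacity are
// rejected so the caller can decide to retry or drop.
type sendQueue struct {
	items []string
	cap   int
}

func (q *sendQueue) Enqueue(item string) bool {
	if len(q.items) >= q.cap {
		return false // queue full
	}
	q.items = append(q.items, item)
	return true
}

// Flush delivers all queued items via send once the receiver is back,
// and reports how many items were delivered.
func (q *sendQueue) Flush(send func(string)) int {
	n := len(q.items)
	for _, it := range q.items {
		send(it)
	}
	q.items = q.items[:0]
	return n
}

func main() {
	q := &sendQueue{cap: 2}
	q.Enqueue("span-1")
	q.Enqueue("span-2")
	ok := q.Enqueue("span-3") // rejected: queue is full
	sent := q.Flush(func(s string) { fmt.Println("sent", s) })
	fmt.Println(ok, sent) // false 2
}
```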
B
So one thing that I first thought Alex was referring to in his comment was to potentially run two collector processes on the same box: basically one that is draining, and you start another one that's already starting to receive.
B
Yeah, so instead of a bounce, basically you send the kill signal to the currently running one. It will then shut down all its receivers and do whatever needs to be done until the queues are flushed, right? But you immediately start a second one that starts receiving immediately.
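The two-process handoff described here can be sketched as a sequence of state changes: the new collector starts receiving before the old one is told to stop, so at no point is nothing receiving. The `proc` type and `handoff` function are an illustrative model, not real process management code.

```go
package main

import "fmt"

// proc models one collector process and whether it is receiving.
type proc struct {
	name      string
	receiving bool
}

// handoff performs the two-process swap: the new collector starts
// receiving before the old one is signaled, so the recorded trace shows
// that at every step at least one process was receiving (no ingestion gap).
func handoff(oldP, newP *proc, trace *[]bool) {
	record := func() { *trace = append(*trace, oldP.receiving || newP.receiving) }
	record()               // before: only the old collector is receiving
	newP.receiving = true  // new collector starts receiving immediately
	record()               // overlap: both are receiving
	oldP.receiving = false // old one gets the signal, stops receiving, drains
	record()               // after: only the new collector is receiving
}

func main() {
	oldP := &proc{name: "collector-old", receiving: true}
	newP := &proc{name: "collector-new"}
	var trace []bool
	handoff(oldP, newP, &trace)
	fmt.Println(trace) // [true true true]
}
```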