Istio Extensions and Telemetry Working Group, 1 Jul 2020

From YouTube: Policies and Telemetry WG Meeting - 2020-07-01

A

Limin, do you have everyone you were hoping to have here for the audit policy discussion, or are we still waiting?

B

I saw David is here, so yeah, we can start.

A

Well, before we do that, let me ask: does anyone else have anything they want to add that's not on the agenda today, or are we okay? If you do, please just add it to the agenda as we go through. And then I'll hand it over to Limin and David. Go for it.

B

So Doug, okay, can you present?

A

I can be comments.

A

Let me know how you want me to move around here, okay?

B

Yeah, okay, thanks Doug. So David and I will present the audit policy. David is an intern, and he has been working on this as his intern project. The proposal is to have an audit policy that allows users to control when a log is created and the content of the log. The motivation is pretty clear: customers who have to log want some control over what they are going to log, to avoid logging everything, which on the one hand

B

costs a lot in storage and, on the other hand, makes it very hard to identify any useful information in the large amount of logs. So we propose to extend the authorization policy with a log action. Today the authorization policy has allow and deny, and the proposal is to add one additional action, called log. Basically, we want to use the matching conditions that already exist in the authorization policy to define under what conditions a log should be generated.

B

We also introduce a log entry, which controls the log content. This second part is actually pretty similar to the previous Mixer policy. Doug, can you move up?

A

Where'd you what yeah.

B

Yeah, okay. So because the audit policy is basically an authorization policy with the log action, it shares all the properties that the authorization policy has. For example, it can be defined at namespace scope or at mesh scope, and the policies are additive, which means a request is logged as long as any log policy matches it. And a policy should take effect only within its target scope. So if there is no audit policy applied to a given workload, the default behavior is basically log everything.

B

There is no effect on the workload. So here are some examples of the log policy. The first example says: log all the write requests to my API application on the /user/profile path.

B

The second example says: audit all requests to my API except the requests coming from a certain service account. In this case the service account is probably a robot service account that is creating a lot of noise, which we want to filter out. Another example is that you probably want to audit only internal user access.

B

In this case, you can look at the domain claim in the JWT token, so that only the requests coming from acme.com are logged. There are also very common use cases: you can write a policy to log everything or to log nothing. This is actually the same as how the authorization policy works today. In order to log everything, you basically create a policy that applies to the given namespace, in this case ns-one, and you define a rule that has no restrictions.

B

So basically you are saying everything will match this rule, which means log everything. For log nothing, you create a policy that applies to the dev namespace but has no rule allowing anything to be logged. So that's the log-nothing policy. And if you set the namespace to the root namespace, you basically apply the policy at mesh scope.
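
The matching semantics described in these examples can be sketched in Python. This is a minimal illustration and not the proposed API: the `Rule`, `LogPolicy` and `should_log` names, the field choices, the `robot-sa` account and the treatment of POST and PUT as "write" requests are all assumptions made for the example. It models only the additive behavior (a request is logged if any rule of any applicable policy matches) and leaves out the default for workloads that have no policy at all.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Request:
    method: str
    path: str
    principal: str = ""  # caller's service account
    claims: Dict[str, str] = field(default_factory=dict)  # JWT claims


@dataclass
class Rule:
    """Hypothetical rule: a field left as None places no restriction."""
    methods: Optional[List[str]] = None
    paths: Optional[List[str]] = None
    not_principals: Optional[List[str]] = None
    claims: Optional[Dict[str, str]] = None

    def matches(self, req: Request) -> bool:
        if self.methods is not None and req.method not in self.methods:
            return False
        if self.paths is not None and req.path not in self.paths:
            return False
        if self.not_principals is not None and req.principal in self.not_principals:
            return False
        if self.claims is not None:
            if any(req.claims.get(k) != v for k, v in self.claims.items()):
                return False
        return True


@dataclass
class LogPolicy:
    rules: List[Rule] = field(default_factory=list)


def should_log(policies: List[LogPolicy], req: Request) -> bool:
    """Additive semantics: log if any rule of any applicable policy matches."""
    return any(rule.matches(req) for policy in policies for rule in policy.rules)


# The examples from the discussion, expressed with the hypothetical types above.
log_profile_writes = LogPolicy([Rule(methods=["POST", "PUT"], paths=["/user/profile"])])
skip_robot = LogPolicy([Rule(not_principals=["robot-sa"])])
internal_only = LogPolicy([Rule(claims={"domain": "acme.com"})])
log_everything = LogPolicy([Rule()])  # one rule with no restrictions
log_nothing = LogPolicy([])           # no rules, so nothing can match
```

Because the policies are additive, combining `log_nothing` with `log_everything` on the same workload still logs every request; a policy can add logging but cannot suppress what another policy selects.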

B

Yeah, the log entry, as I said, is actually very similar to the previous Mixer log entry. We also have a log entry to define the content of the log, so we can have a source section, a destination section, a request section and other context, and the syntax is mostly copied from the previous Mixer policy log entry. Now I'm going to hand over to David. David is going to talk about how this log policy is going to be implemented.

C

Did you want to have the questions about the actual doc later, or is this not the forum for that?

B

You can ask questions.

C

I mean, I'm okay waiting until he finishes; I just wanted to check.

B

Yeah, let David finish.

E

Hi. So to go into the implementation details: the way we are probably going to do it is that currently the authorization policy uses the RBAC engine, or the RBAC filter, in Envoy.

E

Right now we deploy two different filters: an allow RBAC filter and a deny RBAC filter. So pretty much all we do in Istio is deploy a third filter, the log RBAC filter, that corresponds to the log action.

E

The reason we're deploying three filters, and not one or two, is that currently RBAC doesn't really support multiple actions, and in Istio we convert each authorization policy config into a single RBAC config, so there's no easy way to do multiple actions in one RBAC filter. That's why we deploy the log action as a third filter, and this is the simplest change. Again, one action per config is reasonable.

E

The issue is that we need three filters to run in Envoy, but there is potential in the future: Envoy is looking into supporting multiple actions in the RBAC filter. For now this makes sense. The way we're actually transmitting this decision to the telemetry backends is by setting an access log policy key that can be read by each individual backend, based on the matching permission and principal information, and then the telemetry backend can determine whether it wants to log or not.

E

In this doc, it says it's a filter state key. We've changed this to be a dynamic metadata key, simply because that seems to be more convenient for use in Wasm extensions. One of the other solutions that we considered was deploying all actions in one filter and doing allow-log or deny-log, but again that would require a much bigger API change to both RBAC and Istio, and with allow-log and deny-log it's not clear which of the filters we pair the log action with.

E

So that's about it in terms of the implementation. It's pretty simple, especially on the Istio side: the only thing we're doing in Istio is adding another filter. The Envoy side is a little bit more complicated because we're adding a whole new action, but Envoy is generally supportive of this. They are already considering potentially adding multiple actions besides the regular allow and deny actions.
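
The three-filter arrangement and the metadata key can be sketched as follows. This is a simplified model under stated assumptions, not Envoy or Istio code: the key name `access_log_policy`, the function names and the order in which the filters run are all hypothetical, since the discussion only says that a third, log-only filter sets a dynamic metadata key which each telemetry backend reads.

```python
from typing import Callable, Dict, List, Tuple

Request = Dict[str, str]
Matcher = Callable[[Request], bool]

# Hypothetical metadata key; the discussion only calls it an "access log policy key".
LOG_KEY = "access_log_policy"


def run_filters(
    req: Request,
    log_rules: List[Matcher],
    deny_rules: List[Matcher],
    allow_rules: List[Matcher],
) -> Tuple[bool, Dict[str, bool]]:
    """Run three single-action filters in sequence; return (allowed, metadata)."""
    metadata: Dict[str, bool] = {}

    # Log filter: never rejects, only records whether a log rule matched.
    # Running it first is an assumption of this sketch.
    metadata[LOG_KEY] = any(match(req) for match in log_rules)

    # Deny filter: reject on any match.
    if any(match(req) for match in deny_rules):
        return False, metadata

    # Allow filter: if allow rules exist, at least one must match.
    if allow_rules and not any(match(req) for match in allow_rules):
        return False, metadata

    return True, metadata


def backend_wants_log(metadata: Dict[str, bool]) -> bool:
    """A telemetry backend reads the key and makes its own logging decision."""
    return metadata.get(LOG_KEY, False)
```

Keeping one action per filter means the log decision is made independently of the allow and deny outcome, which is why the sketch can tag a request for logging even when a later filter rejects it.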

E

Yeah, any questions?

B

Thanks, David. Yeah, so now we can take questions.

A

So I like the idea of passing in sort of a should-audit flag, or an audit-this-message flag. I'm wondering, and I think I see comments in here from John that are similar to mine: do we need to define log entry as part of this, or is this something that a telemetry API, a general-purpose

A

telemetry API, could define, and that the inside of authorization policy would reference somehow? I'm worried about fragmentation of the places where we define telemetry scopes. I don't know how other people feel, but I'm just curious what your thoughts are on that.

E

Yeah, you can go ahead.

B

Yeah, I was thinking of the alternative of introducing an independent audit policy instead of having it in the authorization policy. But we are actually sharing all the rules, so one thing is that it introduces another CRD in which almost everything is the same except the action.

B

Actually, when we designed the authorization policy, we were considering different actions, and the log action was one we had in mind. I actually sent an email to a couple of folks, including Louie and others, asking: should we make it a more general name than authorization, because we are also going to add audit? The feedback was that, because the main use case is authorization, it is okay to keep this name. And I also mentioned in the comment replying to John

B

that it's actually very common practice that allow, deny and log are in the same policy. One example is the Calico network policy. It has an action field, which has allow, deny,

B

Yeah, so in Calico the action field can be allow, deny, log and pass, where pass means it does not do anything. Similarly, in Google we also have the RTCC current policy. It has a very similar structure to this. It also has log, and I'm not sure it has deny, but log is part of it.

C

iptables also has the same thing, right? You can allow, deny or log.

A

I guess I wasn't quite clear. I think having something that says log makes a lot of sense to me. I was wondering about the log entry CRD part of this. Is that something that we need, or should it be part of a more general-purpose telemetry API which we reference somehow through this API? Or do we want this to be a separate thing? I'm sort of asking the group and you at the same time.

B

If you are planning some telemetry API that covers log entry, then of course this should be part of the telemetry, the log telemetry, configuration. Actually, even in the API version, I just used config.istio.io, which is the previous API version for Mixer, because I'm not sure where this API should belong. I think it's less of a security API.

C

Right, so I think when to log and what to log are somewhat separate concerns, right? We already have, well, not this exactly, but something similar. We have a stateful filter which decides when to log, and then the Stackdriver filter decides what to log. The first filter doesn't decide what exactly to log; it just says:

C

yes, you need to log this log entry. And the actual filter that does the logging decides how it's configured and what the configuration is and all that. You're clearly combining those two things, at least at the API level, and saying that you will decide both when to log and what to log. To play nicely with all the ways that we can generate a log, it's probably best to leave it to the backends, right? So, for example, you can use Stackdriver logging to log like we have today.

C

You can also use just Envoy native logging.

B

Yeah, this makes sense to me. So in that case, maybe we leave what to log out of scope for now, and leave it to the telemetry team to decide where to put the API and how to support it.

F

B

Limin, another question.

F

Related to what Mandar was saying: currently there are a few other things which are logging, right? So we can turn on the Envoy access log. Is this going to replace that, or is this going to work with it, or are they totally orthogonal?

B

I think today you're allowed to control which backend to plug in. What is the control you are talking about, turning on the Envoy access log?

F

So there is a global option where you can say you want the Envoy access log sent to stdout, and you can configure the format of that log and also the fields in that log. Additionally, you can use the EnvoyFilter to configure it on a per-listener basis. I am trying to understand whether this works with it, or whether these are two orthogonal things.

B

I see. So if that filter is the one deciding the condition for when to log, then this filter is going to replace it. I think we are providing a more native integration to allow users to control the log.

F

The issue is, in my opinion at least, that logging is very broad. Access logging is used for lots of purposes, whereas audit logging in the security sense is more for logging whenever you're taking a security-related action, right? So I'm a little confused about combining them. How should I express this? There's the access log, which Envoy will do already. I'm thinking you're not stopping that, or that it's orthogonal to this. But then there is audit logging, which is what is provided in this document.

C

I think they're... okay, Doug, go ahead.

A

I mean, it sounds like, and this is something we discussed before, having separate streams for access versus audit. So, I don't know.

C

So actually, the mechanism for creating the logs is again separate from the mechanism that decides when a log should be created. We already have two broad mechanisms to create logs: one is Envoy's native access log, and we also have Stackdriver logs, and then other logs as well. Those individual mechanisms have more configurability, right? With Envoy's native access log, you can send it to a gRPC service, or you can log it to standard out, or you can do many other things.

C

So the question is: is there specific value in unifying when to log? And I think there is; I think there is value in deciding when to log. But then the second question is: is there also value in unifying what to log, or is that something that we leave to the actual backends?

B

Yes, so actually this comes back to Niraj's point that there's a difference between the security side, which is the audit log, and the normal access log, which is probably only useful for debugging or other purposes, right?

B

Yes, in that sense, if we want to distinguish these two: previously I also had a motivation to try to distinguish them. For the audit log, we want to make it more reliable, so that no message can be lost, and we want to make sure all the fields required for audit purposes are logged. You cannot skip any of the messages in the log information.

B

Yes, so if we want to make this distinction, of course I think we need strict control. Even for the log entry, we need to have very strict access control over who can edit this log entry, right? It has to be mesh wide, a singleton, and managed by the audit admin. It's not like any user can modify it.

F

Exactly, I agree. If you're using it for security purposes, the personas are different, whereas access logs are a debugging tool, in my opinion. It looks like, Limin, you and I agree. I want to make sure Mandar is on the same page, that we should try to separate those two streams.

A

So I'm all for separating streams and having different personas controlling the content of those streams. So then I think it comes down to how we define, and where we define, those streams. I guess that's what I'm curious about.

C

Just to continue that further down: with the separated streams, in order to actually realize or materialize them, we have right now, let's say, two different streams, right? I'm just using the example of Stackdriver because I know it and use it often, and then the standard out stream with Envoy. Even though audit logging and access logging have different uses, it looks like we just have one means of producing both, right? Even in those two different paths, there is just one.

C

So we need to square our desire for separation of streams with the actual mechanics, and with what it really means to have two separate streams.

F

Like, if we are trying to do this separation of streams with personas, we need to make sure those personas can reliably configure what the reliability and durability of each of those streams is, and then where they go, because those will vary. But this is a good start. I mean, authorization feels like a good housing for the log action. I think we have to go a few steps further around where to send it, how to send it,

F

and what the parameters are on which you can actually drop things, if at all, for auditing. And then probably the log entry CRD is the one which is more contentious, I would say, around the "what".

B

Do we actually want to have a separate audit entry versus the normal one for the log content?

C

I think that makes perfect sense, in which case, actually, I'm rethinking what I said earlier. If there is a single persona or a single role that controls all of audit logging, then it makes sense for that authority to configure both when to log and what to log. It's not a separation of concerns then. Okay.

F

So the separation of concerns is in the intent of why you are collecting the log, right? Is it an access log for debugging, or is it an audit log for security purposes?

G

Authorization policy on the actual endpoints is separate from logging.

G

Definitely different concerns, right?

G

Because the structure is so similar, we're saying let's just reuse authorization policy. But everyone's talking about the person managing the audit logs being a different persona. More than likely it's not the someone who's saying, hey, I only want these people to have access to these endpoints on my services or methods, which to me seems like the other persona that we should be talking about. It seems like we're sort of commingling these, you know, auditing versus who can access it.

B

I think, from the auditing point of view, auditing belongs to security. Basically it's the audit admin who does the auditing, and you can also consider them as part of, or a kind of, security admin. They control what content should be audited and when to audit. If you try to apply access control to who can edit this configuration, you normally assign it to the audit admin, which is a part of the security admin.

G

Possibly but the service writer themselves.

G

So on my hard drive, right, I can say these are the permissions on a file. That's different from someone auditing who's accessing the files on my machine, right? Potentially.

F

So I think, Rob, there are two things which are happening here. If you look at other traditional authorization products or regular security products, whether it's networking policies or iptables or even traditional firewalls, that is the place where you configure allow, deny, and log or not. So in that sense, this location feels right.

F

It feels natural. What you are saying is we don't have separation of a developer persona versus a security persona to begin with. I don't think that can be fixed by this particular API. It makes it worse, but I think we have a broader problem: we don't have a developer persona separately defined, because authz policies right now are tied to applications, but they are also tied to security, right? So we have already mingled them.

G

Well, I mean, not necessarily, right? Because I can go in and create policies for my application in a namespace that only expose particular services outside of my namespace. So I can say my application resides in this namespace, and the only people that can access the public endpoints are those I set through, you know, a JWT policy or something like that. I can control who's coming in and what methods they can see, and I can make all the other services internal, which they don't have access to, right?

G

I do that as a developer, right? I don't have to be the mesh admin or anything like that. We talk about personas, and I think what I'm hearing is

G

it would be great if we had something like that, and if we had something separate for configuring auditing and all these other things. But for simplicity we're saying, well, we already have this authorization policy, it has all the same fields, all we're doing is adding an action, and we can wedge it in with this other stuff. To me it seems like we're looking at it from ease of implementation rather than from use cases, personas and that kind of thing, and then we end up just going down the same path.

G

We don't have separate personas.

B

How about we just keep the Envoy filter? If a developer wants to log for debugging, they can use the Envoy filter, but the audit logger is only configured by the security admin.

F

Yeah, but that's the thing, Limin, that Rob is saying: right now developers are configuring authz policies, because

F

they are following namespace isolation.

B

In the namespace you can still separate them. In the namespace you can say which resources they are allowed to edit. You can say only the security admin has access to these Istio authentication or authorization CRDs; they are only allowed to be edited by the security admin. But if you have namespace superuser permission, of course, you can do anything.

C

Wait, okay, I thought Rob was saying something slightly different, right? It's that we were saying that audit is a separate persona, and it's like a cluster-wide superuser admin that decides when to audit and what to audit. But then Rob pointed out that we have actually delegated the CR that you will use to configure this down to the namespace, and now those two are already in conflict, because now in every namespace I can have whatever it is

C

I want. Now, I just kind of throw this out there: there is another way to solve that, and this comes up in other places too, which is that we don't necessarily have to solve everything in the Istio API.

C

But we need to make it such that you can use admission control or other things, in CI/CD, to make sure that you can still have the overall desired effect. So the part that I'm thinking about is that you could always have admission control

C

that says: if your policy doesn't conform to these fields, then even the namespace admin cannot write the CR. And you can do that through CI/CD as well, just through composition, and then it's never an issue; it's correct by construction. So it is possible. I think we generally do need to decide, though, to what extent we want to model everything in the Istio API, and what we want to leave to other tools that are outside.
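
The admission control idea can be sketched as a validation function that a webhook or a CI/CD check could call before a policy is written. This is a hypothetical organization-level rule, not part of any Istio API: the function name, the flat `policy` dictionary shape and the `delegated` set are assumptions for illustration, and `istio-system` is used as the assumed root namespace.

```python
from typing import Dict, List, Set

ROOT_NAMESPACE = "istio-system"  # assumed root namespace for mesh-wide policy


def admission_errors(policy: Dict[str, str], delegated: Set[str]) -> List[str]:
    """Return reasons to reject the policy; an empty list means admit it."""
    errors: List[str] = []
    if policy.get("action") != "LOG":
        return errors  # only log policies are restricted in this sketch

    namespace = policy.get("namespace", "")
    # Log policies may live in the root namespace, or in a namespace that the
    # audit admin has explicitly delegated to its namespace admin.
    if namespace != ROOT_NAMESPACE and namespace not in delegated:
        errors.append(
            f"log policies in namespace {namespace!r} are reserved for the audit admin"
        )
    return errors
```

With a check like this in place, the API itself stays unchanged and the restriction on who may write log policies is enforced outside it.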

F

Mandar, I think I like where you're going. So the only gap with this API then is: as a mesh admin, how will they configure a log policy cluster-wide, so that at the namespace level they are either allowed to override it or not? Currently it looks like this will always be tied to a workload, right? So you can't configure things at a cluster level if I am a security admin. Is that correct, Limin?

B

Or the other ways we only allow user tool so that wouldn't require introduced are separately if you. So basically we are saying audit is only allowed at mesh level, but the mesh admin can define some exceptions if they want to delegate the permission to a namespace admin.

F

Looks like that composition will be the most flexible one.

B

Yeah, if that's the desired behavior, probably we have to have a separate construct, like an audit policy.

C

So Limin, I think what would help in the doc is to add a table that is explicit about the personas, basically the discussion about personas that we had. That way, I think, it will inform the rest of the discussion.

B

Yeah, okay.

C

This, by the way, is a very, very general problem, and in fact part of the extension API and just the requirements, right, of what the goals of an extension API are. This is another extension, and now we're trying to make it first class, but it's the exact same issue: to what extent do we go in modeling these use cases in the Istio API, and to what extent do we say that anything more sophisticated than this needs to be handled outside? So this is a very,

C

very pervasive question.

F

I can make one suggestion here, Limin, which might be a reasonable middle ground: for the mesh-level permissions, or the mesh-level audit log capabilities, maybe we can embed some of this inside mesh config to start with, which gives you the cluster-wide defaults. Then we can still keep the authorization policy, where we can add the log action. And then the idea is, like Mandar was saying, you will have to have either admission controllers or CI/CD where, depending on your policy, the namespace-level ones are allowed or not.

F

It can be a reasonably good starting point if you want to do that instead of creating a fresh CRD.

A

So you're proposing something parallel to the default proxy configuration policy, then?

F

Something like that, but you can still embed it inside the mesh config instead of creating a new CRD, maybe.

B

Inside mesh config, maybe we have a mesh-level audit configuration, the default audit configuration at mesh level, and in it you can specify whether the namespace can override it. And for the namespace API, I think you can just use the current one.

F

Yes. I mean, I'm not opposed to new CRDs for the mesh one, but I know you will get a lot of pushback from, I think, the TOC members. So your easiest path here might be: let's embed it within mesh config and, if needed, then progress it to a CRD. I can help you with that.

B

Mesh config currently already has too much stuff, right? And you are not able to separate the fields for different purposes.

F

Yeah, I agree. Mesh config right now has all the personas mixed into one, which is basically: you are a cluster-wide Istio mesh admin.

B

Yeah, I think even at mesh level we should distinguish whether you are the security admin or the networking admin.

C

But I think this goes back again to the same question. If changing mesh config is not a very frequent thing, which hopefully it is not, then you can actually go to the cluster admin and say, hey, make this change, and it shouldn't be too bad. And then the second question is again about modeling.

C

Do we want to model granular access in the Istio API for each and every thing that's configurable, or do we want to go to some reasonable extent and then make sure that we integrate nicely with GitOps and other technologies, where people have already set up very sophisticated controls and tests and approvals and things like that? So I would hesitate before we go too far down the path of trying to model everything.

B

Yeah, so Mandar, what's your suggestion? You think we should rely on the external CI/CD pipeline for control?

C

Right, so my suggestion is that the mesh config or proxy config defaults, even though it doesn't look the best, are actually okay, because we have been doing that. And then, if there is actually pressure to say, no, take these defaults out of there and put them in some other place, then you can always grow the API. But once we add the API, it's difficult to go back.

B

Okay, yeah, but in this case it's not a single field, right? Basically we are adding a bit of structure to mesh config, and that part is what I'm worried about. You're basically saying the mesh admin can say: this is the default auditing condition, and for this dev namespace do something else, right? And they can say: for that namespace, delegate to the namespace owner, I don't care. That part of the logic is pretty complicated to add to the mesh config field.

C

So, okay, I'm actually suggesting we don't do any of that, right? What I'm suggesting is that there is a default that is in proxy config or mesh config, and, just like today, the API object in a namespace can just decide completely what it wants to do in the namespace.

C

So as far as Istio is concerned, that is still the API. If, as a deployer, you want to add more controls, then you add admission control that says: as a namespace admin, you cannot do this, or you cannot do this under some conditions. And we don't need to actually do that in our API.
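
The override behavior suggested here, a mesh-wide default with the namespace object deciding completely for its own namespace, can be sketched as a small resolution function. The function name and the representation of rules as lists of strings are assumptions for illustration only.

```python
from typing import Dict, List


def effective_log_rules(
    mesh_default: List[str],
    namespace_rules: Dict[str, List[str]],
    namespace: str,
) -> List[str]:
    """A namespace object, if present, fully replaces the mesh default (no merging)."""
    if namespace in namespace_rules:
        return namespace_rules[namespace]
    return mesh_default
```

Note that a namespace with an explicitly empty rule list overrides the default with "log nothing", which is different from a namespace that has no object at all and therefore inherits the mesh default.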

F

Through CI/CD or Kubernetes admission controllers, right, Mandar?

C

Yes.

B

So in the first version, we just go with the current namespace-level CRD, right? We don't need to worry about it; we leave it to the CI/CD. That's the admin's responsibility.

F

So there's one more thing which we can think about. We are talking about mesh config, and for the sake of this argument we should not add it to proxy config, because that will make it a boot-time configuration, and we want it dynamic. So instead of adding it in the mesh config, Limin, can we go down the route of the peer authentication API, where in istio-system there is a default one that we create, which is mesh wide? Is that better than embedding it inside mesh config?

B

Yeah, actually I was thinking of creating a mesh-level security config. Because peer authentication is just one field, previously we were reluctant to add it as a single API, right? So we can probably have a security configuration at mesh level, which includes the peer authentication config, like mTLS for the whole mesh. This is a similar consideration, right? We could also have the mesh audit setting for the whole mesh. That's one possibility. And also things like trust domain I was thinking of putting into the mesh-level security configuration.

F

That makes sense.

B

Yeah, maybe this can be left as the second step: we will have a mesh-level security configuration which can hold all of these. For now, let's start with the namespace level for auditing. Does that make sense?

C

Yeah, I think it does.

A

Yeah I think so too.

F

So Limin, like I was saying before, I am fine with having the log action in authorization policy. I will have to spend some time looking at that log entry CRD, so give me some time and I will update the doc.

B

Okay, thanks. Yeah, thanks Niraj.

A

Okay, should we move on? I know John has some stuff on the agenda, and there's some other stuff too. Is it okay to transition?

B

Yeah, thanks. Thanks for all the feedback.

A

Thanks for bringing that to us. Thank you. Okay, John, did you want to talk about your item?

H

Yeah, so this is not a major concern really, but one thing, I noticed is I've been testing pilot at larger scales, like the order of you know, 10,000 proxies connected, and one thing I noticed was that as we get up to a scale, the census libraries are starting to take up more and more CPU at that scale, I see usually like 12 to 15% of the CPU is spent on recording metrics. What I was wondering is like is this normal?

H

Are there things we're doing that we shouldn't be doing that are causing this, and what can we do to improve it, if anything? Specifically, we see certain metrics; there are two main ones. The first is that for every cluster we record how many endpoints it has, and we do it on every push. So it's kind of like N squared to N cubed, almost, in the number of times you record it. And then we also have one that's recorded on every push.

H

We have a push reason, like why we did a push, and that one also seems to take up a lot. Both of these are, you know, number of proxies times number of pushes, so these get very large as the mesh grows.
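
As a rough illustration of the scaling described here, the sketch below counts recording calls under the assumption that the endpoint metric is recorded once per cluster, per proxy, per push, and the push-reason metric once per proxy per push. The function names and the example numbers are hypothetical and are not taken from pilot's code.

```python
def endpoint_metric_recordings(proxies: int, pushes: int, clusters: int) -> int:
    """Recording calls if every cluster's endpoint count is recorded
    for every proxy on every push (assumed model, not pilot's code)."""
    return proxies * pushes * clusters


def push_reason_recordings(proxies: int, pushes: int) -> int:
    """Recording calls if the push reason is recorded once per proxy per push."""
    return proxies * pushes


if __name__ == "__main__":
    # Illustrative numbers only: 10,000 proxies, 10 pushes, 500 clusters.
    print(endpoint_metric_recordings(10_000, 10, 500))
    print(push_reason_recordings(10_000, 10))
```

Under this model the endpoint metric grows with the product of all three quantities, which is why it dominates as the mesh gets larger.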

C

So, have these proven useful?

H

Yeah, so that was kind of my next point. The EDS one, I don't think it's really useful, so that one would probably be the first to go. The other one, push triggers, I'm actually kind of surprised it uses so much, because we have similar metrics where we record for every push for every proxy.

H

Like, we record latency, convergence time, that sort of thing, so I don't know why this one actually uses more. I feel like there may just be a bug somewhere in that code, or something that's somehow triggering it more than we want it to. That one I think is useful, so I would like to investigate it more. But the proper answer could just be: yeah, let's just remove the EDS one. I don't know how we feel about removing metrics, that sort of thing.

A

Can we have a clearer benchmark, with the metric removed and with it not removed, that shows the performance at scale with that metric and without that metric?

H

Yeah, so I just tried a large-scale cluster with 10,000 proxies, and it does go down like 10%. It's not a huge difference; it's honestly maybe less than 10%. The flame graph shows over 10%, but the actual difference is slightly less. So it is noticeable, but we're not talking about a two-times performance increase or anything like that.

A

Can we maybe make it flag-protected, like only collect this metric when you really want to, or in non-performance-critical scenarios? Is that something we could do: instead of totally removing it, give the option of not collecting it?

H

Yeah, it seems reasonable. Like debug logging, but a debug metrics thing.
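
A minimal sketch of the flag-protected "debug metrics" idea, assuming a hypothetical environment variable `ENABLE_DEBUG_METRICS` and a toy in-memory gauge. Neither name is an existing pilot flag or OpenCensus API; the point is only that a disabled metric returns before doing any recording work.

```python
import os


def debug_metrics_enabled(env=None) -> bool:
    """Read the hypothetical ENABLE_DEBUG_METRICS flag (default: off)."""
    env = os.environ if env is None else env
    return env.get("ENABLE_DEBUG_METRICS", "false").lower() in ("1", "true")


class FlaggedGauge:
    """Toy gauge that skips all recording work when disabled."""

    def __init__(self, name: str, enabled: bool):
        self.name = name
        self.enabled = enabled
        self.values = {}        # label tuple -> last recorded value
        self.record_calls = 0   # how many recordings actually happened

    def record(self, value: float, **labels: str) -> None:
        if not self.enabled:
            return  # cheap early exit on the hot path
        self.record_calls += 1
        self.values[tuple(sorted(labels.items()))] = value


if __name__ == "__main__":
    gauge = FlaggedGauge("endpoints_per_cluster", debug_metrics_enabled())
    gauge.record(5, cluster="reviews")
    print(gauge.record_calls)
```

Reading the flag once at construction keeps the per-record check to a single boolean test, in the same spirit as a debug log level.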

C

I think one more place where we can change things is, for example, clusters and the number of endpoints: that has got nothing to do with pushes. Clusters and number of endpoints is actually measuring inputs to pilot, so it should not be recorded at the output. That way we just remove the times-proxies factor sitting there, and that seems like it should definitely help.
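
The effect of moving the recording from the output side to the input side can be sketched as follows. This assumes a toy recorder and a fixed set of clusters; `record_at_output` and `record_at_input` are hypothetical names, and the once-per-push placement on the input side is an assumption for illustration.

```python
class Recorder:
    """Toy stand-in for a metrics library: counts calls, keeps last values."""

    def __init__(self):
        self.calls = 0
        self.gauge = {}

    def record(self, cluster: str, endpoints: int) -> None:
        self.calls += 1
        self.gauge[cluster] = endpoints


def record_at_output(rec: Recorder, clusters: dict, proxies: int, pushes: int) -> None:
    """Record every cluster for every proxy on every push."""
    for _ in range(pushes):
        for _ in range(proxies):
            for name, endpoints in clusters.items():
                rec.record(name, endpoints)


def record_at_input(rec: Recorder, clusters: dict, pushes: int) -> None:
    """Record every cluster once per push, independent of proxy count."""
    for _ in range(pushes):
        for name, endpoints in clusters.items():
            rec.record(name, endpoints)


if __name__ == "__main__":
    clusters = {"reviews": 3, "ratings": 5}
    out_rec, in_rec = Recorder(), Recorder()
    record_at_output(out_rec, clusters, proxies=100, pushes=4)
    record_at_input(in_rec, clusters, pushes=4)
    print(out_rec.calls, in_rec.calls)
```

Both placements end with the same gauge values in this model; only the number of recording calls differs, by a factor equal to the number of proxies.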

F

Hey, so if you have Sidecar resources defined, do these metrics have labels which say which proxies are receiving which endpoints, or are these not at that resolution?

H

No, they don't, actually, which I think is a good thing, because the

F

Cardinality would.

H

Be massive but.

H

A bad thing, because the data is not actually correct. Exactly.

C

Yeah, so it is still helpful, because we still know at a system level how large a cluster the control plane sees. I agree it doesn't give you a per-proxy view of that, but then you ask the proxy that question; you don't even ask pilot that question.

H

No, Mandar, but I think what will happen is, if you have two proxies with different sidecar scopes, the first one will push and it'll say, OK, we have 5 endpoints, and then the second one will push and say we have 0, and then it will alternate.

I

Back and forth, it's a gauge, so it is.
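
The back-and-forth behavior can be sketched with a toy last-write-wins gauge. This assumes two proxies whose sidecar scopes make them see 5 and 0 endpoints for the same cluster, and that each push overwrites the same unlabeled gauge; the class and variable names are hypothetical.

```python
class LastWriteWinsGauge:
    """Toy gauge: each record overwrites the previous value."""

    def __init__(self):
        self.value = None
        self.history = []  # every value the gauge has held, in order

    def record(self, value: int) -> None:
        self.value = value
        self.history.append(value)


def simulate_pushes(gauge: LastWriteWinsGauge, visible_endpoints: list, rounds: int) -> None:
    """Push to each proxy in turn; each push records what that proxy sees."""
    for _ in range(rounds):
        for endpoints in visible_endpoints:
            gauge.record(endpoints)


if __name__ == "__main__":
    gauge = LastWriteWinsGauge()
    # Proxy 1 sees 5 endpoints, proxy 2 sees 0 because of its sidecar scope.
    simulate_pushes(gauge, visible_endpoints=[5, 0], rounds=2)
    print(gauge.history)
```

What a scrape observes then depends only on which proxy was pushed to last, so in this model the reported value describes neither proxy's view nor the control plane's view reliably.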

C

Wait, wait, wait. No. So, okay, we're saying here again, the endpoints and clusters, what the control plane sees does not vary by who's asking. Well, I mean, it actually does vary, but

H

It does, yeah. It probably didn't when the metric was made, but now it does.

F

So John, the current implementation of this metric is completely broken. I looked at it long back, and when the Sidecar resource was being added, I wasn't sure what was going to happen. Now this explains it: it's not correct. And the second thing is, we can add what Mandar is asking for. I am still not sure how useful that is, Mandar, just because you can get that information very easily from Kubernetes.

C

Well, in a multi-cluster, multi-registry way, right, pilot definitely has a view that no other single thing has.

F

Yeah, that's reasonable. So basically, we can have a view of what the inputs to pilot are. Okay.

H

Yeah, I think that makes sense. I think you guys are familiar with pilot, but there's the push context at the start of every push, which caches everything we have, and at that point we can easily record the number of destination rules, services, virtual services, et cetera. I think we actually do that with virtual services, although that's probably the wrong place; you probably want that in the actual config reading. But yeah, it is what it is.

F

I was just going to say: if you still have this setup up, I would love to take a look at the setup and also the flame graph.

H

There's an issue, but I don't think I posted it here. It has a flame graph.

I

We can do it in a separate setting as well, so we don't monopolize this meeting.

H

Yeah, I'll just add the link to the doc, and you can take a look at it.

C

Right, and I just added, for reference, what percentage we found: Envoy was taking 17%. However, most of that was on the scraping side. If you look at the link that I posted to a flame graph, you'll see that on Envoy, when you enable all metrics, it takes about 17% CPU, and that's why 10% seemed okay to me.

H

So I think actually on the pilot side the scraping is pretty cheap, unless I'm just misinterpreting the graph. I'm pretty sure it either doesn't even show up, or it's so small that I didn't notice it.

H

It's actually just the recording, because we record this rule.

H

Overall it's one metric, so it's very cheap to scrape, because there's just one, I don't know what they call it, one line in the response. It's just the recording, and there's like a bunch of mutexes and stuff like that.

H

Cluster as a dimension, oh yeah, for that one it is big, because cluster is a dimension.

C

For the other one, the push triggers, it should just be like... Oh, right.

C

So the equivalence was also that I had measured these at ingress, where we get all the clusters in the mesh, and if you enable all telemetry, that's where the explosion is. Again, there is some equivalence, but clearly pilot and Envoy are different, and you're not seeing it in the scrapes.

F

Okay, and also, Mandar, for Envoy, if the scrapes are taking some significant portion of the CPU, isn't that on

F

A separate master thread, versus all the worker threads?

C

Yeah, that's even worse, because now it means that you cannot push EDS or other things. Yes, correct. The current data plane doesn't quite suffer, yes.

A

So it sounds like we have some action items out of this, right? We're going to add a new metric, get rid of the one that's not helpful, and look at maybe flag-protecting ones that are problematic, so they could be turned off in performance-critical situations. Does that sound right?

H

Yeah, sounds good, and I'll take a look at that other one, the push triggers reason, because I suspect there's some bug or something. It shouldn't be using that much, I would think.

A

Okay, anything else we should think about here?

C

Oh, one question quickly related to that: if we enable and disable metric collection, then Prometheus would be okay with it, right? I just want to make sure. It would think that the counter hasn't gone up for a long time, and gauges don't matter anyway, because they're gauges. So it should be fine: if we suddenly stop collecting and then restart, the Prometheus data would still make sense?

C

Yeah, I think that's right.

A

It would act sort of like a process restart, and that sounds right.
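
The reasoning above relies on how Prometheus's `rate()` and `increase()` treat counters: any decrease between samples is interpreted as a counter reset. The sketch below is a simplified illustration of that rule over a list of scraped values; it ignores timestamps and the extrapolation Prometheus performs, and `counter_increase` is a hypothetical helper, not a Prometheus API.

```python
def counter_increase(samples: list) -> float:
    """Total increase across scraped counter samples, treating any drop
    as a reset to zero (simplified; no timestamps or extrapolation)."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev  # normal growth, or flat while not collecting
        else:
            total += curr         # reset: count growth since the restart
    return total


if __name__ == "__main__":
    # Collection paused: the counter stays flat, then resumes.
    print(counter_increase([10, 20, 20, 20, 25]))
    # Process restart: the counter drops and starts again from zero.
    print(counter_increase([10, 20, 3, 8]))
```

In this simplified model, a counter that stops moving contributes no increase, and a counter that drops is handled the same way as a restart, which matches the expectation voiced in the discussion.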

A

Okay, well, there's not much time left. There is a community member who was looking at alerting based on Istio metrics and having trouble creating a doc, so I created the doc for them. I just wanted everyone to take a look at this as we try and beef up alerting. I don't know if anyone wants to add comments now or discuss something now, but this is something I think we want to flesh out over time.

A

So I basically cut and pasted this from another Google Doc so that it would be in the drive. I'm an owner of it, but only because I copied it over; I need to go through it as well. I just wanted to bring this up in case anyone has expertise in alerting that they want to share or contribute. I think there is an effort building around getting a good doc on alerting and to follow up with alerts for Istio.

A

So please take a look. If you know of alerting configs out in the wild that are useful, this would be a good place to add references, so that we can develop this over the next week or two into something more substantial.

A

Any questions about that?

A

No? Okay. And then the other thing I wanted to bring up is, we had some discussions about the RFC for the Istio telemetry API, so I took a quick stab at merging all the existing configuration into one sort of bigger proposal. I just wanted to bring that to everyone's attention, to take another look at what that looks like and add comments and questions, so we can keep iterating on this and hopefully get to a spot in which we can build a larger design.

A

So I just want to mention that again. I don't think we need to rehash it all here, but please just take a look and see if this is

E

Sort of the

A

Format that is more acceptable than the tracing config proposal that was originally there, and whether it seems to be on the right track or not. I just want to keep moving that forward as this development cycle is moving on, so I'm just bringing it up to mention that. I'd appreciate any and all comments there. Is there anything else anyone wants to discuss in the last minute?

H

Just to bring up a quick question about the alerting, and maybe this is already answered: once we define the alerts, what will we do with them? Will we put them on the website? Will we put them in the samples? What are our plans?

A

That's a good question. I think at a bare minimum we would add them to the best practices documentation that we have on the website, saying: hey, here's some alerting configuration that we recommend; tailor it, obviously, to your needs, but this is a good starting set. Just to have something we can point people at to say, this is how we think Istio should be monitored. I don't know about including it by default anywhere.

A

That gets a little bit tricky.

C

Yeah, I agree. I think we should include it in the doc and then provide it as a starter artifact, so that people can just download it and install it, or

F

Put it somewhere on GitHub, right?

F

That would really be helpful, yeah. Thanks.

C

And, oh, by the way, we will be using it, right? So, for example, in the release qualification testing and monitoring effort that chignon is working on, he is also working on alerts, and he will probably consume this and add to this. We will be using that right away internally, so that these alerts are actually used in our own clusters for testing.

A

It's a good point. We hadn't done a sync between his PRs and this doc, so I will take the action to do that, to make sure everyone's on the same page. Okay, well, if there's not anything else, we've taken up a large chunk of people's time, so