From YouTube: Policies and Telemetry WG Meeting - 2020-07-29
A
Everyone, thanks for coming. There wasn't a lot on the agenda today, so I thought I'd start by once again asking if there's anything anyone thinks we should discuss, or wants to discuss, or that we should use this meeting for. I can add it to the agenda, or you can just mention it now and I'll update it.
A
Okay! Well, I'm sorry to put you on the spot, but I thought that maybe you could provide some updates on the work you've been doing for extension config and how we might be using that moving forward.
B
That means, if you want to change some configuration in telemetry, for example, the update will be localized to the extension that implements telemetry, and it will not cause any drain or disruption to the networking. It's basically an xDS specifically for extensions. So that's the general idea, and our plan is to use it to ship wasm plugins. We can load modules dynamically using this service, and that means you can also ship code as an option using the service.
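For reference, a minimal sketch of what the extension config discovery wiring can look like on the Envoy side, assuming Envoy's v3 ExtensionConfigSource API; the filter name and type URL here are illustrative, not the exact configuration Istio generates:

    # Hypothetical HTTP filter entry using extension config discovery (ECDS):
    # the filter's configuration is fetched and updated over xDS instead of
    # being inlined, so changing it does not drain or rebuild the listener.
    http_filters:
    - name: istio.stats                  # illustrative filter name
      config_discovery:
        config_source:
          ads: {}                        # deliver the config over the ADS stream
          resource_api_version: V3
        type_urls:
        - type.googleapis.com/envoy.extensions.filters.http.wasm.v3.Wasm
        # If loading the module takes time, the listener warms until the
        # configuration is ready rather than serving without the filter.
        apply_default_config_without_warming: false

Updates are then scoped to the named extension config resource, which is the localized-update behavior described above.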
C
And the other thing to note is that the way to ship the wasm bits is still orthogonal to this particular process. So we can still choose between the three different options that we have, orthogonally to how the extension config is shipped, which gives us a lot of flexibility in deciding what to use under what circumstances.
B
So I think it depends on what kind of module you want to ship. If it's a substantial module, then it probably shouldn't be shipped via this service, but if it's a small script, say 10 lines, it makes a lot of sense to use this service. There are also provisions to deal with warming, so when it takes time to load the module, it will wait until things are ready. That's the benefit for minimizing disruption.
A
Okay, and just in case people were looking to try this out, do we have a reference implementation or framework that we could use for that, or is that still in development?
B
Every time you change any config information, which is not always what you want for HTTP traffic. So it could be used there, but we need a good design, because once you come to shipping binaries, it changes the workload on pilot, and we have to make sure it's done right.
C
And so I guess the related question is that the lookup is by name, correct? The lookup is by a config name, yes, which means that that name differs slightly from the semantic names that we have been using. So, for example, we use istio.stats to mean an abstract notion of stats, but we actually apply different configurations to stats based on whether we deliver it to an ingress or a sidecar. So I think that will also have to be taken into account.
B
I mean, you can always scope it by proxy, right? So it could be a resource with the same name that is not the same across different proxies.
A
So the related question is, I know I saw that config dump was not yet implemented, or is coming. How do we debug extensions and find out which versions of these configs they have when they're running?
B
There's a set of stats implemented. It's just another xDS, so the standard xDS metrics apply, and it's the config_reload stat that allows you to monitor updates. The config dump was never fully reliable anyway; it's never the actual truth, so you should always assume it's some kind of approximation, but we should.
C
It will be sent out shortly, though; within the next couple of days I'll send out the proposal for the wasm filter.
F
Yeah, I have an update on one of the issues. Let me think, it's about exposing more envoy stats. Basically the easiest thing was to track an experiment I did and see whether it is possible to expose more. Currently we are omitting the listener- and cluster-related stats due to the performance issue that we ran into back in 1.0 and 1.1, and this issue is to see how much improvement, performance-wise, envoy has made on stats, and whether we can enable other stats, or more stats, by default.
F
The experiment I did shows that for a gateway kind of proxy, if we install like 200 clusters, the CPU and memory usage is still not looking good; it could be up by like twenty or thirty percent. For a sidecar, which doesn't have that many clusters, the CPU usage looks a bit better than the gateway, less than 10 percent, but the memory is similar; part of that might be because we are sending fewer requests per second.
F
So I think the conclusion I have is that it's still not very appropriate to turn on all the stats. We still need to use a whitelist-based approach to expose the stats that we think would be useful or very important for debugging.
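As a concrete example of the whitelist approach, the mechanism available today is the per-workload stats-inclusion annotation; a sketch, with illustrative prefixes rather than a recommended set:

    # Sketch: opting one workload into additional Envoy stats via Istio's
    # stats-inclusion annotation. Only the listed prefixes are exposed on
    # top of the default whitelist, keeping cardinality bounded.
    apiVersion: v1
    kind: Pod
    metadata:
      name: productpage-v1        # illustrative workload
      annotations:
        sidecar.istio.io/statsInclusionPrefixes: "cluster.outbound,listener"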
F
I'm not sure whether there are some other opinions from this group, or whether we should reach out to the envoy folks and ask whether this result is expected, or whether we need to generate some profiling.
C
Yeah, so we need that, right. I think that clearly envoy has made progress, but from your experiment it's not enough. But if someone wants more metrics across the mesh, we just don't give them an easy way to do it at all today. So I think that would at least enable experimentation and put the control back in the hands of users. In terms of which other metrics we want, I think we should be able to come up with a budget and say we have room for, whatever, five or ten more metrics, and then just get those important ones in.
C
I think per-cluster metrics, but just a few, not all, right? If you turn on cluster, you get all, whatever, 30 or 50, some large number. So if we can just say, hey, in a cluster we think that these two are the top most important, then that could be a good starting point, I think. Yes.
G
Have you also looked at prometheus memory and CPU?
F
Yeah, yeah. If I turn on the stats, prometheus is not looking good either. So, okay, yeah, I mean, if eventually we're going to turn on more stats, then we need to make it very clear that prometheus needs to ignore some of those metrics during scraping.
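One way to do that on the scrape side is a metric relabeling rule that drops the expensive series before ingestion; a sketch, assuming the standard Envoy Prometheus endpoint and illustrative match patterns:

    # Sketch: drop the high-cardinality per-cluster/per-listener Envoy series
    # at scrape time, so enabling them on the proxy for debugging does not
    # translate into Prometheus memory growth.
    scrape_configs:
    - job_name: envoy-stats              # illustrative job name
      metrics_path: /stats/prometheus
      metric_relabel_configs:
      - source_labels: [__name__]
        regex: "envoy_cluster_.*|envoy_listener_.*"
        action: drop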
F
Yeah, I mean at the sidecar we already expose the xDS stats, and that's not expensive; the expensive part is the stats only available at the sidecar, like the cluster stats, etc. Yeah.
F
Yeah, that is the update for this. I can definitely ping folks to get more insight into this, and I'm going to follow up in the next working group meeting. Yeah.
E
Just out of curiosity, and sorry to barge in, but what is the primary driver for cost outside of volume? So yes, there's a certain amount of volume for logs, and assuming people are going to go and log everything they can log, what is the primary factor that drives cost?
C
No, so yeah, the cost on the proxy is kind of the most important part, because it gets multiplied by the size of the mesh, and then the cost that is imposed on the telemetry system, which is prometheus or stackdriver or whatever else, is the second consideration.
H
I mean, there is downstream cost, right. Once you've collected it, there's storage, and if you're aggregating, all of that falls out of it. The more stuff you collect, obviously the more space it's going to take; the more time series you have, the slower queries are going to be. I mean, there's a huge amount of cost, actually, resource-wise, especially if you're collecting things that aren't used in any way.
E
Is there an upper bound to that cost? If I am logging everything that I can log, is there an upper bound for that resource consumption, and do we know what that upper bound is? Is it something like so many millicores and so many gigabytes of RAM, or is that...
D
You can keep on adding workers, and I've seen envoy up to like six gigabytes of RAM, if you really want to throw a lot of stuff at it. So...
A
Okay. Peter, did you, do you want to mention anything else, or do you think we've covered it?
H
The only other comment I'd make is that instead of the upper bound, I think it's the lower bound that's a little more interesting: how many resources are you using just to do a minimal amount of things? I don't know what the answer is there, but you obviously want it to be as low as it can be.
G
I'm just curious. I know linkerd uses a very low amount of memory in the sidecars; do they do that because they have fewer stats, or do they just have a more efficient stats implementation?
C
Okay, so the stats are kind of the secondary part, and the other thing is that the cardinality of envoy's stats is actually somewhat bounded. It is bounded because it doesn't have the configurability that the istio stats have, which can lead to cardinality explosion. So it's quite predictable and well known: once you add a cluster, you get these 20 things and that's it. You just get those 20 things.
I
A quick question here: for the envoy-level stats, I mean, I see the issue filed by peter, I think. Are they actually being asked for by customers, or is it something that istio devs, who are very familiar with the proxy, crave?
C
And some of them have an equivalent in istio stats and some of them don't, but that's the only time that I have turned them on, just for debugging. Now, there are certain downstream stats, and some other stats, which we just don't have an exact equivalent for, and people do turn them on using the annotations.
A
If we're only using it for debugging, it would be interesting if we could have just sort of a debug mode, right, or some way, I mean, an easy way to turn on additional metrics mesh-wide. But it sounds like we don't want to do this on a consistent basis all the time; we want to turn it on for the 30 minutes that we're looking at it, and then be able to easily switch it back off.
I
Yeah, and additionally, I was thinking, can istioctl do something here? I mean, if you have to look at time-series data over a long time, there's limited value that istioctl can provide, but if it is just for interim debugging, for like 30 or 40 minutes... I'm just throwing ideas out there, so that we can avoid the impact globally but still provide these metrics when the time comes, right?
F
So that requires a proxy restart, right? If you want to expose more stats, the stats configuration is in the bootstrap right now, so I think the restart will make it not very useful for providing debugging signals; in some cases it will make us lose the debug signal.
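For context, the matcher that decides which stats are instantiated lives in the Envoy bootstrap, roughly like the sketch below (the patterns are illustrative); because it is bootstrap-only rather than a dynamic xDS resource, changing it means restarting the proxy:

    # Sketch of the bootstrap-level stats matcher. Anything not matched here
    # is never instantiated, and the list can only change with a proxy restart.
    stats_config:
      stats_matcher:
        inclusion_list:
          patterns:
          - prefix: "cluster_manager"
          - prefix: "listener_manager"
          - prefix: "server"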
C
So I think there's the work that peter and someone else are doing to connect the stats system to an alternate stats system like opencensus, right. Once we have that fully baked in, we should be able to go back with some more data and conviction to the envoy community and say, okay, can you add this and not have everything in the bootstrap. That will be the first step.
C
Opencensus is definitely efficient; I don't know about more or less. If you look at it, we use opencensus for stackdriver, and in measurements, the stackdriver path, which directly uses opencensus and sends it out, is basically the same cost as the stats system, which uses envoy's native pipeline.
C
Okay, so there is one thing I'm just looking for. There is an issue that doug has had open for some time about updating the grafana dashboard with some of the newly created wasm stats, and I'm just looking for someone who is able and willing to do that.
A
Okay, I did have some other updates that we should probably just talk about for a second here, the big one being that mixer is now gone from the code base.
A
And there's a shim, the grpc shim, which is sort of a bad name for it, being developed against the 1.7 code base.
A
And I think right now the biggest thing left is developing an integration test around the functionality, so I think it's almost ready to go.
A
So if anyone is still interested in the grpc adapters for mixer, and testing them with envoy ext_authz or the access log service, you should have something in a week or two that will allow doing that with 1.7, and then we'll need to develop the docs for how to use that in 1.8 if you still don't want to transition your out-of-process adapters.
A
So I just wanted to share that update. This relates to the dashboard updates and other stuff: there are still probably small vestiges of mixer left in the 1.8 code base, so if you see anything anywhere that still references it, it'd be a good time to go ahead and just clean that up as you're working through.
A
I did want to mention that, and also, thanks to ed and others who added the deprecation detection tools for 1.7. I think that's going to be super useful as well; it's a big part of being able to remove mixer. So thank you for that work.
A
There have been a number of open issues where metadata exchange has failed for some reason, maybe errors, maybe calls coming from outside the mesh or going out of the mesh, and I didn't know if anyone had done any concrete thinking about that or advanced it. I know niraj and jacob, you guys have looked at that a little bit. I don't know if we have any good thoughts there, or whether our thinking has advanced on how to do metadata discovery.
C
So I think the prioritization, like deciding the priority, is one of the... it's not exactly the same, but I think we have to agree on the priority. What I think happened last time was that, even though we did agree on it, it wasn't really a firm agreement, because we weren't willing to put any effort into it: we, as in we as google, did not, and neither did niraj and his team. So we really need a consensus on priority, yeah.
C
So I'm not commenting on design or technology or anything like that here, I think; I'm just saying.
A
That's a good point. I guess maybe it's just sort of the bias of the issues I've seen, but it does seem like it keeps coming up in more and more scenarios; there are different conditions under which the exchange fails, more than just being out of the mesh. So it seems to me that we're finding more and more edge cases now in which something like this would be useful.
A
Right,
I
mean,
I
think,
that's
fair.
This
is
coming
in
cases
where
things
fail,
but
the
number
of
reports
of
those
seem
to
have
gone
up.
C
If we had a solution that was already ready, would it be a better solution, at which point we can just switch off metadata exchange, right? We can say, okay, fine, we don't need it, because this service is so reliable and so good that we just don't need it.
C
If it does turn out to be strictly better, then I would say we would be able to put more resources behind it, right?
A
Okay, I just wanted to raise this and have it reach sort of the group consciousness as something we should look at; I think for 1.8 we should figure out what we want to do there.
A
Are
there
other
topics,
other
things
that
we
should
discuss
this
week?
I
don't,
I
don't
know
a
lot
of
other
updates
from
just
been
ongoing.