From YouTube: Policies and Telemetry WG Meeting - 2019-01-16
Description
Agenda:
Recent changes in 1.1
- 128-bit trace IDs
- NoopSpans in Mixer when request not sampled for tracing
- Verified IP whitelisting (including subnets)
Update on Perf & Scalability work
Update on outstanding items
- Prometheus init
- Namespace for service entries / VirtualService in telemetry
- Server-side check caching
Pending changes
- Pod annotations for policy enforcement
- Better close handling in adapters
- OOP Adapter Auth (is target still 1.1?)
Beyond 1.1
- Istio Multicluster
So I just created this short little agenda a minute ago, things that are top of my mind, and maybe people can add to it. I just want to go over some of the recent changes that affect policy and telemetry. We switched to using 128 bits for trace IDs; this was a request, so that was pretty easy. In Mixer we were having issues around random number generation for trace spans, so we've opted to use NoopSpans when the request coming through was not selected for trace sampling. That should dramatically reduce contention.
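As a rough illustration of that change (not the actual Mixer code; the Span interface and names below are made up for the sketch), the idea is to hand unsampled requests a span that does nothing, so they skip ID generation and any shared tracer state:

```go
package tracing

import "context"

// Span is a tiny stand-in for whatever span interface the tracer exposes.
type Span interface {
	End()
}

// noopSpan does nothing; it avoids random-ID generation and any shared state.
type noopSpan struct{}

func (noopSpan) End() {}

// realSpan would carry trace/span IDs and record timing in a real tracer.
type realSpan struct{ name string }

func (s *realSpan) End() { /* flush span data to the exporter here */ }

// StartSpan returns a real span only when the request was sampled for tracing;
// unsampled requests get a shared no-op value and skip ID generation entirely.
func StartSpan(ctx context.Context, name string, sampled bool) (context.Context, Span) {
	if !sampled {
		return ctx, noopSpan{}
	}
	return ctx, &realSpan{name: name}
}
```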
I also verified the work that's been done for IP whitelisting, including using subnets, and there's been some discussion about that on discuss as well.
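For reference, subnet-aware whitelisting boils down to CIDR containment checks; this is a small Go sketch of that idea, not the Mixer list adapter's actual code:

```go
package main

import (
	"fmt"
	"net"
)

// allowed reports whether addr falls inside any whitelisted subnet.
// The whitelist is given in CIDR notation, e.g. "10.0.0.0/8".
func allowed(addr string, whitelist []string) bool {
	ip := net.ParseIP(addr)
	if ip == nil {
		return false
	}
	for _, cidr := range whitelist {
		_, subnet, err := net.ParseCIDR(cidr)
		if err != nil {
			continue // skip malformed entries
		}
		if subnet.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	wl := []string{"10.0.0.0/8", "192.168.1.0/24"}
	fmt.Println(allowed("10.1.2.3", wl))    // true
	fmt.Println(allowed("192.168.2.9", wl)) // false
}
```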
Some of the stuff that I've noticed in the last week has mostly been additions to improve end-to-end test stability. So I don't know if there are other things that people want to discuss, other changes to mention.
On the perf and scalability work: yeah, so I think some of it may be repetition, but we've had this issue with telemetry, specifically that under certain circumstances istio-telemetry memory, CPU, and goroutines grow unbounded until it basically consumes everything there is, and then it's killed.
It's not just traffic related, because normally under the same amount of traffic everything works fine. So it is definitely a timing-type issue which causes istio-telemetry to get into that sort of spot under the same amount of load. What we have found in debugging is that there are three or four bottlenecks that those requests can hit because of contention, because there are these shared resources.
One of the first ones that we hit was the OpenCensus internal telemetry collection. Then there was the trace ID: creating a trace ID involved generating a random number, which was another shared resource with lots of contention. And then ultimately the Prometheus adapter itself, which has internal queues, and that was the final bottleneck.
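A hedged sketch of that middle bottleneck: a single lock-guarded random source serializes every ID generation, and one common fix is to give each worker its own source. This is illustrative Go, not Mixer's actual implementation:

```go
package main

import (
	"encoding/binary"
	"math/rand"
	"sync"
	"time"
)

// A single rand.Rand guarded by one mutex serializes every ID generation,
// which is the contention pattern described above.
var (
	mu     sync.Mutex
	shared = rand.New(rand.NewSource(time.Now().UnixNano()))
)

func sharedTraceID() [16]byte {
	mu.Lock()
	defer mu.Unlock()
	var id [16]byte
	binary.LittleEndian.PutUint64(id[:8], shared.Uint64())
	binary.LittleEndian.PutUint64(id[8:], shared.Uint64())
	return id
}

// One way to remove the bottleneck: give each worker its own source via a pool,
// so concurrent requests stop fighting over a single lock.
var pool = sync.Pool{
	New: func() interface{} {
		return rand.New(rand.NewSource(time.Now().UnixNano()))
	},
}

func pooledTraceID() [16]byte {
	r := pool.Get().(*rand.Rand)
	defer pool.Put(r)
	var id [16]byte
	binary.LittleEndian.PutUint64(id[:8], r.Uint64())
	binary.LittleEndian.PutUint64(id[8:], r.Uint64())
	return id
}

func main() {
	_ = sharedTraceID()
	_ = pooledTraceID()
}
```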
But one of the things that came out of all this is that large istio-telemetry instances, which are vertically scaled by using like 20-30 CPUs and lots of memory, are more likely to get into this sort of scenario. So, in addition to getting to the root of this and fixing all these things, we have just started testing with a much larger number of smaller istio-telemetry instances.
And the expectation is that when you have smaller telemetry instances, they're less likely to have this sort of huge contention of thousands of connections being concentrated in one and then causing downstream effects. So we are actually testing and measuring that right now, and that could be done in the next several days. In addition to that, we have already added overload protection, which went in just a few months ago.
We want to deploy these with reasonable overload protection as a default, which means that even if it does get into that sort of situation, it won't just keep growing unbounded and consuming everything. Overload protection will kick in and kick out some of the traffic. So that's how that works.
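As a loose illustration of "kick out some of the traffic instead of growing unbounded" (the names and mechanism here are illustrative, not Mixer's actual overload-protection code), a concurrency-capped shedder looks something like this:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrOverloaded is returned when the shedder refuses work instead of queueing it.
var ErrOverloaded = errors.New("server overloaded, request shed")

// shedder caps in-flight work; anything beyond the cap is rejected immediately
// rather than being buffered without bound.
type shedder struct {
	slots chan struct{}
}

func newShedder(maxInFlight int) *shedder {
	return &shedder{slots: make(chan struct{}, maxInFlight)}
}

func (s *shedder) Do(work func() error) error {
	select {
	case s.slots <- struct{}{}:
		defer func() { <-s.slots }()
		return work()
	default:
		return ErrOverloaded // kick out traffic instead of growing unbounded
	}
}

func main() {
	s := newShedder(2)
	err := s.Do(func() error { return nil })
	fmt.Println(err) // <nil>
}
```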
From the gRPC team, on applying back pressure: at the network listener, just don't pick up the requests if you can't deal with them. Yeah, just having that ability, that's really what you want, right? So they just need a little callback somewhere that calls back into us, and then we can tell them stop, sit there for a while, and then you'll get the request.
Oh, you know what, I think the gRPC server takes a very similar abstraction for the thing that it uses, right? Normally we just pass in the regular listener abstraction, which is net.Listener, so we could actually introduce something there that says: I'm just not going to read from the socket now until there is more room. I'm doing something similar for Pilot, also by hooking in there and setting in an abstraction, but I think that could actually be something generic. Oh yeah, we could try that.
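A sketch of that listener idea, under the assumption that the server is started with grpc.Server.Serve on a net.Listener: wrap the listener so Accept blocks while the process is out of capacity, letting the kernel backlog and clients absorb the wait. The token channel and release helper here are hypothetical placeholders for whatever actually tracks load:

```go
package main

import (
	"net"

	"google.golang.org/grpc"
)

// tokens is a simple capacity gate: Accept takes a token, and whatever tracks
// request completion puts tokens back via release() (hypothetical plumbing).
var tokens = make(chan struct{}, 64)

func init() {
	for i := 0; i < cap(tokens); i++ {
		tokens <- struct{}{}
	}
}

// release would be called by the request-handling layer when load drops again.
func release() { tokens <- struct{}{} }

// throttledListener refuses to pull new connections off the socket while the
// process has no spare capacity, pushing back pressure toward the callers.
type throttledListener struct {
	net.Listener
}

func (l *throttledListener) Accept() (net.Conn, error) {
	<-tokens // block until there is capacity for one more connection
	return l.Listener.Accept()
}

func main() {
	inner, err := net.Listen("tcp", ":9091")
	if err != nil {
		panic(err)
	}
	srv := grpc.NewServer()
	// gRPC only sees the net.Listener interface, so the wrapper drops in cleanly.
	_ = srv.Serve(&throttledListener{Listener: inner})
}
```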
I think the key difference is who's responsible for managing the cohesiveness of these annotations. If it's on the pod, it's up to the user to make sure that all the pods of their workloads have that annotation. If we do it in a CRD, that just says this workload has this feature, and we take over that responsibility.
I think maybe the annotation approach is the terminal case. The thing can be global, you can have it nested at the namespace level, and then you can target an individual pod and say: you behave this way. So I think what really matters here is cohesiveness with the rest of the platform, and there's a sense of where more than half of it needs to go. Otherwise it's saying: look, we're going to converge everything else, but meanwhile we're going to go over here and do something totally different. So the annotations seem like a reasonable thing, but it needs to be done in the context of Istio as a whole: if it makes sense for Mixer things, it probably makes sense for other things as well. Oh, we have done some annotations already, yeah, but we're not tracking them; they're not systematically documented.
I think it would be on a case-by-case basis. It's like, whatever description you have, you put the metadata there and it's stuffed inside whatever the vehicle is. So in theory it's not incompatible, right? It's just a map and you stick it somewhere, yeah, as long as you make it available to the injector and then eventually to Pilot.
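A minimal sketch of that "it's just a map" point: copy matching pod annotations into a metadata map that the injector could pass along to the proxy and, eventually, Pilot. The policy.istio.io/ prefix and keys here are hypothetical, not real Istio annotations:

```go
package main

import (
	"fmt"
	"strings"
)

// proxyMetadataFromAnnotations copies policy-scoped pod annotations into a flat
// metadata map; everything else is ignored. The prefix is purely illustrative.
func proxyMetadataFromAnnotations(annotations map[string]string) map[string]string {
	meta := make(map[string]string)
	for k, v := range annotations {
		if strings.HasPrefix(k, "policy.istio.io/") {
			meta[strings.TrimPrefix(k, "policy.istio.io/")] = v
		}
	}
	return meta
}

func main() {
	podAnnotations := map[string]string{
		"policy.istio.io/check": "enabled", // hypothetical annotation
		"unrelated":             "ignored",
	}
	fmt.Println(proxyMetadataFromAnnotations(podAnnotations))
}
```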
I'd say beyond 1.1, there have been some people raising questions about multicluster telemetry and policy and having issues, and I don't know that we've spent much time at all looking at that. So I was going to start looking at that over the next couple of days to see what happens. I think the people who added the original code may no longer be participating in the project, so anyway, that's something else to be aware of.