From YouTube: Kubernetes SIG Instrumentation 20190418
Description
Meeting notes: https://docs.google.com/document/d/17emKiwJeqfrCsv0NZ2FtyDbenXGtTNCsDEiLbPa7x7Y/edit#
B: So basically, the idea is to introduce watchability to the metrics APIs, I guess all three of them: the resource metrics API served by metrics-server, and the custom and external metrics APIs as well. The latter two were suggested by Marcos, who is not here, but I think it makes sense for all of them to be consistent. And the reason why we want to do this is basically the HPA, so how the HPA does scaling.
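To make the proposal concrete, here is a minimal sketch of what a client-side subscription could look like if watch were added to the custom metrics API. This is purely hypothetical: none of the metrics APIs support watch today, and the endpoint path and the watch parameter are assumptions, not an agreed design.

```go
// Hypothetical sketch: issues a watch-style GET against the custom metrics
// API. None of the metrics APIs support the watch verb today, so this
// request would currently be rejected; ?watch=true is the proposed addition.
package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Subscribe to updates for a pod metric instead of polling for it.
	stream, err := cs.CoreV1().RESTClient().Get().
		AbsPath("/apis/custom.metrics.k8s.io/v1beta2/namespaces/default/pods/*/http_requests").
		Param("watch", "true").
		Stream(context.Background())
	if err != nil {
		panic(err)
	}
	defer stream.Close()
	fmt.Println("subscribed; the server would push metric updates on this stream")
}
```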
B: So let me go through them. One is that the existing backends for custom and external metrics do not support streaming of metrics, so it would have to be emulated anyway on whatever is serving the API. And I think this wouldn't be the case for resource metrics, because in metrics-server we can just notify everyone, all subscribers, as soon as the scraping is over.
B: We can just notify everyone, so that wouldn't be a problem for resource metrics, but it would be a problem for custom and external metrics. And for those, I think we could just move the polling loop from the client to the server, so that the client could just subscribe and get notifications, and the server would do the polling of whatever backends it is supporting.
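A minimal sketch of that inversion, with hypothetical names throughout (Backend, MetricValue, Poller): the server owns the single polling loop and fans each fresh value out to subscribed clients, instead of every client polling on its own.

```go
// Sketch of moving the polling loop server-side: one goroutine polls the
// backend, and every subscriber is notified with the fresh value. The
// Backend interface and MetricValue type are hypothetical stand-ins.
package metricsfanout

import (
	"context"
	"sync"
	"time"
)

type MetricValue struct {
	Name  string
	Value float64
}

type Backend interface {
	Fetch(ctx context.Context, name string) (MetricValue, error)
}

type Poller struct {
	mu   sync.Mutex
	subs map[chan MetricValue]struct{}
}

func NewPoller() *Poller {
	return &Poller{subs: make(map[chan MetricValue]struct{})}
}

// Subscribe registers a client; the returned cancel func removes it.
func (p *Poller) Subscribe() (<-chan MetricValue, func()) {
	ch := make(chan MetricValue, 16)
	p.mu.Lock()
	p.subs[ch] = struct{}{}
	p.mu.Unlock()
	return ch, func() {
		p.mu.Lock()
		delete(p.subs, ch)
		p.mu.Unlock()
	}
}

// Run polls the backend at the given interval and fans results out.
func (p *Poller) Run(ctx context.Context, b Backend, name string, every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			v, err := b.Fetch(ctx, name)
			if err != nil {
				continue // a real server would surface this to watchers
			}
			p.mu.Lock()
			for ch := range p.subs {
				select {
				case ch <- v:
				default: // drop updates for slow subscribers
				}
			}
			p.mu.Unlock()
		}
	}
}
```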
B: So whenever there is more than one client, it would be simplified for multiple metrics. With that, there was another problem mentioned on the instrumentation list: whenever there are different metrics that require different polling intervals, we probably want to poll them individually, so a single watch on the metrics could result in multiple watches being started by the server. But I think this could be mitigated by grouping the metrics and polling them in groups. So, for example, all the metrics that are refreshed every minute would be polled together, everything that is refreshed every 30 seconds would be polled together, and so on. And the third, that's the lack-of-implementation problem, but I think one more argument for introducing this API, even though no implementation today supports it, would be that it would enable implementations.
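A small sketch of the grouping idea, with illustrative metric names and intervals: metrics are bucketed by refresh interval, so the server runs one poll loop per distinct interval rather than one per watched metric.

```go
// Sketch: bucket metrics by refresh interval so that everything refreshed
// every minute shares one poll loop, everything refreshed every 30 seconds
// shares another, and so on. All names here are illustrative.
package main

import (
	"context"
	"fmt"
	"time"
)

// groupByInterval buckets metric names by their polling interval.
func groupByInterval(intervals map[string]time.Duration) map[time.Duration][]string {
	groups := make(map[time.Duration][]string)
	for name, every := range intervals {
		groups[every] = append(groups[every], name)
	}
	return groups
}

func main() {
	intervals := map[string]time.Duration{
		"http_requests":  30 * time.Second,
		"queue_depth":    30 * time.Second,
		"billing_events": time.Minute,
	}
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// One ticker per distinct interval: many watched metrics cost one loop
	// per group, not one loop per metric.
	for every, names := range groupByInterval(intervals) {
		go func(every time.Duration, names []string) {
			t := time.NewTicker(every)
			defer t.Stop()
			for {
				select {
				case <-ctx.Done():
					return
				case <-t.C:
					fmt.Printf("polling %v group: %v\n", every, names)
				}
			}
		}(every, names)
	}
	<-ctx.Done()
}
```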
B: So that was one problem. Another one was the migration that would happen, because there are a number of implementations that exist today, and they would be essentially broken if we just added that. But since this is an API change, it would require changing the API version as well, so all those implementations, in order to migrate to the new API version, would have to implement the watch.
A: One more question I had: have you done an analysis of what this would mean for the HPA? Would there need to be modifications to the HPA other than supporting the watch API? I'm not a hundred percent sure anymore: the API version is not specifiable in the HPA, right? So currently they're actually bound to this API. That would probably mean that the HPA would need to be bumped to v3, right?
F: If you introduce watch at the front layer, then it becomes possible to facilitate a different mode of communication between, you know, the metrics server and the thing that it is getting metrics from, so that maybe one day you have a watch-based thing all the way through. Certainly this would be a necessary requirement of that. I mean, maybe it doesn't get you there in one step, but it seems like, if we ever want to get to a state like that, then this would have to be done. Yeah.
B: I think this is what I meant: earlier I said it would enable implementations, and by this I mean further improvements would be enabled. So sub-second latency would eventually be enabled if we do this, along with other changes that would follow in the implementations of those metrics.
A: ...speak for the driver, yeah. Anyways, I think it'd be at least nice to have an analysis of what the watch would also mean in terms of capacity planning, because for metrics-server we initially did some very specific capacity planning, and I'd like to see what kind of impact this would have on metrics-server in terms of scalability.
A: Right, there was a proposal to change metrics-server to consume the Prometheus format anyway, so yeah. Yeah, that's the one I was referring to. Okay, yes, absolutely. And then I think, but this is probably further out, we still should discuss, we probably don't have time for this today, but we should discuss further how to get rid of cAdvisor in the long run. But yeah, I don't know if there are any other opinions out there.
F: So the reason why it would be good to have a folder that is, say, instrumentation-owned is because, once this framework is in place and we have static analysis, we can basically do the same thing that conformance tests do. And the thing that conformance tests do is: there is a folder which SIG Architecture owns, and, you know, in a pre-commit stage, static analysis is run against things which are tested for conformance, and then basically a file is diffed which belongs to this folder.
F: So what this would do is, instead of the instrumentation being, you know, scattered throughout the various parts of the Kubernetes codebase, it would actually give a centralized place of ownership and also allow us to automatically be tagged on PRs. I'm sure everybody wants this. I'm sure the metrics stuff has probably been driving everyone nuts, because it's not audited, so yeah.
F: It's perfect. So then, basically, for now I can do that: I will add an OWNERS file. So, util/metrics I guess; we can try to figure out if there's a more proper place we can put it. But does anyone have any objections to the proposal? It's been like now several weeks for the proposal. Does anyone have any blocking concerns?
A: I just haven't been able to look at the more recent changes, but what I reviewed, when I reviewed it, looked pretty good overall. But yeah, I don't have to necessarily block us: if there are enough people agreeing with this, I'm happy to vote to move forward. Otherwise, I'm also happy to review once more, yeah.
F: So another thing is, this will likely occur in a number of stages, so obviously we're not going to migrate the entire codebase right away. This thing will be isolated to our newfound metrics directory, and so unit tested and whatnot, and then we can get various components in place, like the static analysis thing, and then we can start migrating the binaries' metrics endpoints one at a time, and this will mitigate risk.
C: Just from an implementability standpoint, the KEP doesn't have any reviewers listed or approvers listed, and there's a bunch of TBD stuff basically still in here. I'd say until all those TBDs get filled out, it can't possibly be implementable, so that would be the big blocker, in my opinion.
F: Sure. So, would you not say that the graduation criteria... it almost doesn't have any graduation criteria, because it will not actually be graduated into the codebase, right? It will live in an orphan state until we have a migration KEP, until we have a static analysis KEP. But I mean, if you say so, okay, yeah, I can write that, I can write that, I'm sure.
F: ...is a bit coupled. I don't know that they are necessarily. At least the person I talked to said, for their Java client library, that the code was quite coupled, and he suspects it's the same way for the Go client library. So it would take some wrangling to be able to use one without the other, and they are working right now on breaking it apart, because in our case it may or may not make sense to bring in the metrics things, given the fact that we are using the Prometheus client. So anyway, yeah, I just started a line of communication with them. If anybody has any questions or anything, I'm happy to... they sit literally like 400 feet away from me, so I can just pop by and ask them questions.
A: I was gonna say, last time we talked about introducing OpenCensus, a difficulty with that is, at least for tracing, and I guess for some of the OpenTracing implementations this is the same case, that there is some agent that the thing that actually produces the spans needs to talk to, right? And that's kind of a difficult situation for the kubelet, for example. Like, can we... we would significantly complicate the setup of Kubernetes if we force everyone to have this agent there. I think... I'm not sure that's a thing that we can do. I know, or I've heard at some point, that possibly these functionalities could be embedded into the application itself. I don't know how feasible a thing that actually is, but then again, that would influence what Han mentioned about binary size.
A: I think there are various things that we may need to think through a little bit more. I'm generally super... I think I'm, like, a huge fan of especially having the correlation of multiple signals, like being able to make use of exemplars in metrics to be able to get to a trace sample from a histogram bucket or something. That's absolutely amazing, I totally want that. Yeah, it's mostly an organizational problem, I think, introducing it into Kubernetes and, of course, not breaking everything we already have.
E: Think in terms of, like, where you would actually... yes, there is sort of... you inject something that needs an agent process, but it can be run inside the process itself. I know, like, OpenCensus has zPages for this, so you don't actually... you don't have to run anything else, they're just available locally.
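For context, a minimal sketch of that in-process option using the OpenCensus Go library's zpages package; the mux and port are arbitrary choices for the example.

```go
// Minimal OpenCensus zPages sketch: diagnostic pages (tracez, rpcz) are
// served from inside the process itself, so no external agent is required.
package main

import (
	"log"
	"net/http"

	"go.opencensus.io/zpages"
)

func main() {
	mux := http.NewServeMux()
	// Mounts /debug/tracez and /debug/rpcz on the mux.
	zpages.Handle(mux, "/debug")
	log.Fatal(http.ListenAndServe("127.0.0.1:8081", mux))
}
```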
E: At some point, I think you would be able to... it seems to me that there would be a way to start adding this through just the tracing side, not actually bundling anything else, and then provide a way to configure where the traces get sent to, and then you have sort of the best of both worlds. People that wanted to opt in could plug in, you know, Jaeger or Stackdriver or whatever, and see traces; and for people that didn't want that, well, if you don't have anything there, it just sits there. Yeah.
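A sketch of that opt-in wiring using the OpenCensus Go library with its Jaeger exporter; the collector endpoint and service name are placeholders. A binary that registers no exporter would simply drop its spans.

```go
// Sketch: opting in to trace export by registering a Jaeger exporter with
// OpenCensus. Without a registered exporter, spans are simply not sent anywhere.
package main

import (
	"context"
	"log"

	"contrib.go.opencensus.io/exporter/jaeger"
	"go.opencensus.io/trace"
)

func main() {
	exp, err := jaeger.NewExporter(jaeger.Options{
		CollectorEndpoint: "http://localhost:14268/api/traces", // placeholder
		Process:           jaeger.Process{ServiceName: "example-component"},
	})
	if err != nil {
		log.Fatal(err)
	}
	trace.RegisterExporter(exp)
	trace.ApplyConfig(trace.Config{DefaultSampler: trace.AlwaysSample()})

	_, span := trace.StartSpan(context.Background(), "example/op")
	// ... traced work would happen here ...
	span.End()
}
```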
A: I think, like, the reason for the agent is so that all this logic of supporting Jaeger and Zipkin and all those other projects doesn't have to live in your application, right? But then pulling that in reverses that, and it's kind of a difficult call to make. I mean, there's some... essentially, OpenCensus does standardize this in some way, by standardizing the communication to this agent, right?
E: Yeah, and it's... yeah, the fact is you would run the agent in the master pod or whatever; you would configure all the tracers to report to that pod, you would make it available at, you know, whatever name. And then I think the only thing that would maybe be a little weird is, like, how would you... there's a coupling question: the agents can point to other agents, so you could have that agent, and then you could have a user actually say, like, "Well, I want to direct my other traffic to this agent, where I've actually put in my exporters." But still, keep in mind, some of these terms might change, and, like, right now we're guaranteeing support of the current OpenCensus stuff for two years. There'll be a bridge between that and the new project, so I would expect that there will be some changes happening to the agent and the OpenCensus stuff as well. Okay.
E: ...project supported, yeah, sure. But my point is, if you said, "Okay, we'll do OpenCensus today," then, as of like the end of this year... I think we're planning to sunset OpenTracing and OpenCensus in November, and there will be two years from that date where, like, we will support the bridges into the new project. So if this isn't something that you all need to do, like, immediately, then it might be good to give it a couple of months, because I believe we're planning work...
A: Sorry everyone, we're already five minutes over. I think this is a very interesting discussion, and I think we should continue next time. Let's put this one, just like we did with the watch API last time, let's put this one at the top for next time, and then we can continue that discussion. Maybe we'll even have some new information in this area already. Okay, thanks everyone for attending, see you in two weeks, and happy holidays in your local time. All right, bye.