From YouTube: API Priority and Fairness in kcp. Discussion 2 of ?.
Description
Discussion of approaches to including the API Priority and Fairness feature in https://github.com/kcp-dev/kcp .
A
I hope... all right. Anyway, so, you know, basically this relationship. Basically, you know, I think the most serious problem...
A
Well, I mean, we have concerns, right? We're forced to trade off between either having a priority level be specific to a workspace, in which case we potentially have lots of priority levels to deal with, or having a priority level span workspaces, which destroys some of the insulation between workspaces.
A
You know, that's one big trade-off, and another one is: if we want configuration specific to a workspace, that requires objects that are specific to a workspace. If we're willing to have the behavior be specific to a workspace, we can use the approach of having a default set of objects that are not stored, and then having additions and overrides stored. That kind of mitigates it. So, anyway.
B
Does it make sense to do option one, which is priority levels per workspace, with the PriorityLevelConfiguration and FlowSchema objects per workspace? If that makes sense, and I'm going to defer to you in particular, Mike: if that makes sense, I think we can find a way to make it happen.
A
So it makes sense; it's coherent, in at least the sense that it's something you can define. It has that concern that, you know, we're talking about: the problem.
A
But that's not the concurrency limit that's enforced. There's a dynamically adjusted concurrency limit that gets enforced, and the way it works is: every 10 seconds...
A
There's an adjustment that takes into account the recent... I tend to say "pressure," but I think in the KEP doc it's called seat demand. It's really looking at the number of seats. Seat demand, specifically, is defined to be the number of seats that are occupied, plus the number of seats of the queued requests. And there's a subtlety in APF, in that a request has two different numbers, two different widths: there's an initial width and a final width, width being the number of seats. And so for this seat demand...
A
We use the larger of the two, because that's actually how dispatching works. Dispatching also uses the larger of the two because, unfortunately, when there's a difference, the larger width comes second, but you can't interrupt a request once you've dispatched it, so you have to dispatch conservatively.
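The seat accounting described above can be sketched in a few lines. This is an illustrative approximation of the described behavior, not the actual Kubernetes apiserver code, and the function names are invented:

```python
# Illustrative sketch of APF seat accounting (invented names, not the real
# apiserver code). A request has an initial and a final width; both seat
# demand and dispatching use the larger of the two, because the wider phase
# may come second and a dispatched request cannot be interrupted.
def request_seats(initial_width, final_width):
    return max(initial_width, final_width)

def seat_demand(executing, queued):
    # Seat demand = seats occupied by executing requests plus the seats
    # of all queued requests. Each request is an (initial, final) pair.
    occupied = sum(request_seats(i, f) for i, f in executing)
    waiting = sum(request_seats(i, f) for i, f in queued)
    return occupied + waiting
```

For example, one executing request with widths (1, 10) and one queued request with widths (1, 1) give a seat demand of 11.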
A
This technique keeps track of the average and standard deviation and the high-water mark of the seat demand. And then at each... oh, and also, each priority level is configured with a limit on how many seats it can lend and how many seats it can borrow, and these are expressed as fractions of its nominal concurrency. But anyway, what the adjustment does is... it's got a little bit of logic that's not quite trivial; you can see it in the KEP, but basically, what it says:
A
The first thing it says is: any priority level that has lent seats, for which demand has shown up in the last adjustment period, will immediately reclaim those seats. And then beyond that, there's kind of a target that's defined as... well, it just gets smoothed. There's exponential smoothing, and the input to the exponential smoothing, each period, is the average plus the standard deviation; that goes into the exponential smoothing.
A
But then, if the high-water mark ever jumps up, that immediately jumps the estimate up. So anyway, the net result is that a level that has lent seats out of its nominal allocation will immediately reclaim them at the next adjustment, and beyond that, there's kind of an attempt to share the pain, or the wealth, relative to the demand that's been seen. I don't know if that's at all clear or understandable.
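The adjustment just described could be sketched roughly as below. This is a simplification under the stated description (exponential smoothing of mean plus standard deviation per period, with the high-water mark jumping the estimate up immediately); the smoothing factor and names are invented, and the real logic in the Kubernetes code has more to it:

```python
import statistics

def update_demand_estimate(prev_estimate, samples, alpha=0.2):
    # samples: seat-demand observations from the last adjustment period.
    # The smoothing input each period is the average plus the standard
    # deviation of the observed seat demand.
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)
    smoothed = alpha * (mean + stdev) + (1 - alpha) * prev_estimate
    # If the high-water mark exceeds the smoothed value, the estimate
    # jumps up to it immediately.
    return max(smoothed, max(samples))
```

With a previous estimate of 10 and steady samples of 4, the estimate decays gradually (to 8.8 with alpha = 0.2) rather than dropping straight to 4, which is why it never reaches zero; a spike in the samples raises it at once.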
A
So, to be clear, regarding fractions: the way it worked, even before borrowing, is that we take the server's concurrency limit, and each priority level has a number of shares, right, which gives a fraction of the total shares. So you take that fraction of the total concurrency limit, which is in general not an integer, and round it up, and that gives the concurrency limit that gets enforced for that priority level.
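That share arithmetic can be sketched as follows; this is an approximation of the described behavior with invented names, not the actual code:

```python
import math

def concurrency_limits(server_limit, shares_by_level):
    # Each priority level gets its fraction of the total shares times the
    # server's concurrency limit, rounded up, so a level with nonzero
    # shares never ends up with zero seats.
    total_shares = sum(shares_by_level.values())
    return {
        name: math.ceil(server_limit * shares / total_shares)
        for name, shares in shares_by_level.items()
    }
```

For instance, with a server limit of 600 and shares of 30/40/50, the enforced limits are 150, 200, and 250; with a limit of 10 split three ways, each level still rounds up to 4.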
A
So if we were concerned with a large number of workspaces and did this same kind of thing for workspaces... yeah, you wouldn't get to zero unless the fraction was actually zero, which you would basically never do, because the exponential smoothing would never go to zero.
C
I feel like that would not be good, because there are kind of two audiences for P&F in kcp: the first being the kcp admin, the second being an individual in a cluster. And I'm not sure that letting users configure their own priority classes, and, like, lock people out of sharing, would suffice for the admin case, right?
C
And I guess, to maybe put some context there as well: when you're using etcd as the backing store, you have some really rough concurrent-mutation-in-flight-limit-type things before stuff goes to really unpleasant places. And so one of the thoughts very early on was: we kind of know the scaling envelope that will make this reasonable, and it does look like quite a lot fewer mutating requests as compared to a normal kube cluster.
A
You being an owner of a workspace... sorry. So if we were to do borrowing between workspaces, then clearly it's a dynamic value. And let me just add another FYI. I'm not quite sure what my final opinion is, I'm still processing it, but just as an FYI: upstream, we do...
A
I was just told earlier this week about someone running into trouble with APF, and the scenario is: he didn't really explain why, but there was one client that was sending requests that always time out, and these were, like, big list requests. I think he was saying that the client is slow to read the response, so each one of these requests occupies 10 seats, because that's the maximum width (that's a limit in the code), and it takes a minute.
A
You know, it holds those 10 seats for a minute, and in the priority level where this was happening, the hand size was six, so that means this client could really occupy 60 seats at once. It was more or less ready to occupy 60 seats all the time, because on each of the queues that this client is being dealt onto, every request holds onto 10 seats for a minute, and as soon as that timeout happens, there's another request waiting to take its place.
A
Other requests could get in occasionally, but the concurrency limit applied to that priority level was less than 60; or maybe it was 60, but it wasn't greater than 60. So the other clients would have to wait, basically, for six minutes, I guess. Well, actually, he was saying that the concurrency limit was 10 or less, so really the other clients would have to wait for 10 requests from this, you know, bad client before another client would get served.
A
So, you know, we talked about kind of guidance for configuration, and really the problem condition here is when the concurrency limit is less than or equal to the hand size times the maximum width. Then one bad client can cause a lot of latency for other clients.
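As a sketch of that rule of thumb (the function name is invented): a misbehaving client dealt onto hand-size queues, each of whose requests pins the maximum width in seats, can hold hand_size times max_width seats, so a level is exposed when its concurrency limit does not exceed that product.

```python
def exposed_to_one_bad_client(hand_size, max_width, concurrency_limit):
    # One bad client can pin up to hand_size * max_width seats; if the
    # level's concurrency limit does not exceed that, the client can
    # starve the other clients at that priority level.
    return concurrency_limit <= hand_size * max_width
```

In the incident described, a hand size of 6 and a width of 10 give 60 pinned seats, so any limit of 60 or less at that level is exposed.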
A
Unless and until, you know, I can think of something better. And potentially we can, because this is kind of a special case that could maybe be recognized and dealt with in other ways. But if we take that as guidance, I think it really shoots down the approach that says we're going to have really low concurrency limits for a workspace.
C
So, say, with the max-requests-in-flight limit and a thousand workspaces, each with five priority classes: each priority class in each workspace gets two-fifths of a seat, right? Does that trigger the behavior you're talking about?
A
Well, I guess, as I was starting to explain earlier, there's a ceiling that gets applied. The logic for deciding on the concurrency limits has a ceiling in it, so you never get a fraction of a seat; you always get one, or two, or three. Unless, you know, your shares, or the total limit, were actually zero, and then it could be zero.
A
But my point here is that one isn't enough, okay? Even if you only had a hand size of one, the maximum width is fixed in the code at 10, so you would really need 10 seats in order to not have this problem, right?
A
However, you know, I do want to say, still in defense of this borrowing idea: we could potentially have a different kind of borrowing between workspaces. As I was saying, for borrowing between priority levels we use this exponentially smoothed target, or demand figure, and since it's exponential smoothing, it'll never go to zero. For workspaces, we could tweak that so it does go to zero, because, obviously, we must be expecting...
A
If you've got a thousand workspaces, you must be expecting that most of them are completely idle. So we could adjust the borrowing so that it gives a zero allocation to those workspaces, and what that would mean is that a request that arrives while the allocation is zero sits in a queue until the next adjustment period, at which point that workspace would get some multiple of ten seats, so the request can actually succeed and get dispatched.
A
Now, I guess I should also point out: it's also possible, again, since today, with APF turned off, we're just betting, hoping, that there isn't a bunch of requests that actually show up at the same time. We could continue that, and just say, yeah, okay, the minimum allocation per workspace is not going to be zero, and it will add up to a lot, but again, we continue to expect that most of it will be unused. I'm sorry.
C
I guess the other thought is... so, just back to the straw man of invisible über-APF and visible internal APF. In that duality, if it looks to the user, in their logical cluster, like they are virtually being given all of the mutating, or all of the max-in-flight, requests, and the number of seats is calculated as if they were going to get all of it, none of the math is broken.
C
Two entirely disjoint systems of APF: one at the shard level, and then, once a request goes through... I guess this would probably be a pretty invasive server change, but once the request goes through the first layer at the shard level, it can then be handled by APF in the logical cluster. That allows users to, you know, give relative priority to their own workloads, but never exceed what the system gives their workspace.
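The two-layer idea could be sketched as two nested admission gates. This is purely illustrative: the classes and names are invented, and real APF does queuing and fairness rather than simple counting, but it shows why the inner layer can reorder a workspace's traffic without exceeding the outer grant:

```python
class Gate:
    # A trivial stand-in for one APF layer's concurrency control.
    def __init__(self, limit):
        self.limit = limit
        self.in_flight = 0

    def try_admit(self):
        if self.in_flight < self.limit:
            self.in_flight += 1
            return True
        return False

    def release(self):
        self.in_flight -= 1

def admit(shard_gate, workspace_gate):
    # A request must pass the shard-level gate first, then the
    # per-logical-cluster gate, so configuration inside a workspace can
    # never admit more than what the shard grants that workspace.
    if not shard_gate.try_admit():
        return False
    if not workspace_gate.try_admit():
        shard_gate.release()
        return False
    return True
```

However the workspace configures its inner gate, once the shard gate is saturated, no further requests from that workspace are admitted.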
C
I mean, yeah. I guess I'm trying to think about... like, you know, one ancillary design goal here is minimal changes to APF, right?
A
What it's saying is that, you know, where Jamie's got this delegator/delegatee thing, right? That would still work for the second, inner layer.
A
Right, and for the outer layer, again, you would have one shared delegatee, because it's really the controller, right, that consumes the config and manages requests. So you could, because of the existing structure... the thing that handles requests, the controller logic, is really modular, right? It's just got enough hooks to be able to delay or reject a request and then release it for processing, so it could just as well release it to the second layer of APF as to a following handler, yeah.
A
So, actually, right, this concern about users being able to destroy the protections is a new one. I need to add that into this document. And let me pursue another thing; you know, I just wanted to understand.
A
You know, the concern in the other direction, right, is that, without giving users control, they kind of don't have a lot of ability to customize, right? The administrator of the kcp server really has to configure it in a way that's going to work for all the workspaces.
C
It seems like you'd basically end up treating every workspace, every user workspace, the same, and, because workspaces are cheap, it's easy enough to stamp them out. But I mean, from my understanding of Jamie's work, it's well positioned to solve both problems, and I don't know that solving the global one first and then the user one second is, like, another year of effort, right? It seems to me like that should...
A
I think you're totally right; he's already got the second layer. I mean, it's working now. The only thing that's not there is the config-producing controller, right. And, you know, if we go with the two layers... let me think about this now. Yeah, I'm not sure I've really thought through the two-layered approach.
A
Well, yeah, so I'm still trying to think through the two-layer approach. In the two-layer approach, in the outer APF: remember, APF itself has kind of two layers, right? There's this isolation into a collection of priority levels, and then fairness within each priority level. In the outer APF, is there any meaning to a collection of priority levels, or would we only be doing fairness between workspaces?
C
If we can detect, like, loopback traffic from the virtual API server, stuff like that, I think that could be a privileged priority class.
A
So, something we haven't talked about is that, in APF, it starts with configurable classification, and actually some priority levels don't queue or reject; they just pass things through without any further control.
B
Yeah, so that's basically what we want for all the system-level things. I mean, not necessarily all of them, but the majority of our controllers, the loopback clients, and whatnot. If we need to divide them into categories, we can, but those need to take priority, generally, over any other client, yeah.
A
And in the default configuration for APF, the admin clients get, you know, classified... their requests get classified to a priority level that passes things through without control; it's called "exempt".
C
There's also a future, I guess, where, like, in a production deployment of kcp, you provide users, you know, different payment tiers or whatever, and that allows their workspaces to have more or less priority.
A
Okay, but remember, "priority" is actually only an aspirational word. We don't actually have priority; we have these levels, but it really is not priority. Okay, there was originally some thought that we could have a sense of priority, but we don't, and even with borrowing, that's not priority either.
C
I guess I'm saying yes, because, first, you know, I could consider, like, two canonical users of kcp that I think we've had in mind. One is, like, the service provider, right: so I'm exporting certificates and I'm running cert-manager in my workspace. That workspace is going to have a lot of traffic, and that traffic is critical to the functioning of, like, their service, and that could be a higher priority class than some random user installing something, right.
A
Except, again, we don't have "higher priority". We just have different pools, which can be given more or less concurrency. Okay.
A
You added an affirmative there. There is a sense of a kind of use case for grouping workspaces into different concurrency pools, and then, within each concurrency pool, we would just do fairness amongst the workspaces.