From YouTube: 2022-01-20 Kubernetes SIG Scalability Meeting
Description
Agenda and meeting notes - https://docs.google.com/document/d/1hEpf25qifVWztaeZPFmjNiJvPo-5JX1z0LSvvVY5G2g/edit?ts=5d1e2a5b
A: So, this is the SIG Scalability meeting, 20th January 2022. I think today we have one topic to discuss, because the first one was added by Mike and he's not here. So, maybe let's start with the first point.
B: Yeah, so this is a client-go PR that we have been working on. Basically, retries in client-go are not exponentially backed off, and this PR adds that support. I just wanted to bring it up here, and also I want to run some scale tests, maybe the 5000-node run, to see if it has an impact, beyond the CI jobs we already have for the PR.
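Note: to make the discussion concrete, here is a minimal sketch of what exponentially backed-off retries look like using existing k8s.io/apimachinery and k8s.io/client-go helpers. The PR itself changes client-go's internal request retry logic, which is not reproduced here; the backoff parameters and the retriable-error check below are purely illustrative.

```go
package main

import (
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

func main() {
	// Illustrative backoff: 100ms base, doubling each attempt, with jitter.
	backoff := wait.Backoff{
		Duration: 100 * time.Millisecond,
		Factor:   2.0,
		Jitter:   0.1,
		Steps:    5,
	}

	// Retry only errors that are plausibly transient, e.g. server timeouts
	// or "too many requests" from API Priority and Fairness.
	retriable := func(err error) bool {
		return apierrors.IsServerTimeout(err) || apierrors.IsTooManyRequests(err)
	}

	attempt := 0
	err := retry.OnError(backoff, retriable, func() error {
		attempt++
		// Placeholder for the actual request, e.g. a GET issued via client-go.
		return fmt.Errorf("simulated transient failure on attempt %d", attempt)
	})
	fmt.Println("final result:", err)
}
```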
C: I'm afraid that running the current scale job will not really benefit us, because I think we are not really exercising the retry path. There are probably a couple of retries over the whole test, but that is negligible, I would say.
C: I'm pretty much sure that the test will pass and we will not see any meaningful difference, but that doesn't tell us whether it really helps, whether it breaks anything, or that it won't break anything.
C: I'm personally okay with this change. I would like some tests to be run, but I don't know what exactly we should be testing here. I'm just saying that we can run the scalability jobs, but I don't think we will get anything useful from that.
B: Yeah, last time I checked I didn't see any, so I will try to open a PR with that change, and maybe then we can use that to see, in regular CI jobs or in scale testing, at what rate we are actually seeing retries happen.
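Note: a sketch of the kind of retry metric being discussed, using the Prometheus Go client. The metric name, labels, and registration point are hypothetical; client-go's actual instrumentation hooks are not shown.

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
)

// requestRetries is a hypothetical counter of client-side request retries,
// labeled by HTTP verb and host; the name and labels are illustrative only.
var requestRetries = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "rest_client_request_retries_total",
		Help: "Number of request retries, partitioned by verb and host.",
	},
	[]string{"verb", "host"},
)

func init() {
	prometheus.MustRegister(requestRetries)
}

// ObserveRetry would be called from the client's retry path.
func ObserveRetry(verb, host string) {
	requestRetries.WithLabelValues(verb, host).Inc()
}
```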
A: Actually, I have one question: what kind of errors will it retry, all of them? Because we could probably play a bit with Priority and Fairness, throttle some requests and make them time out, for example. I think we...
B: Apart from APF, the only time we retry, that I'm aware of, is when it's a read request and you are running into some retriable error; only then is the request retried, yeah.
B: Yeah, right, exactly. Okay, so I'll take it as an action item to add something in client-go to measure the retries. Also, for this PR, do you think we should involve other folks, like Daniel or Liggitt, to get their input? Because ideally we want to get it in and give it some soak time for 1.24, so that if an issue arises we can fix it in time.
B: Okay, so I'm holding the PR right now, then. I'm going to open a new PR with the metrics change and merge that first, and maybe then we can merge this PR.
C: Regarding letting people like Joel or Daniel know, I think it's maybe just useful to post the link on the SIG API Machinery Slack and let people know that something has happened, and if someone is interested they will look at it.
A: Okay. So, I don't know, do we have anything else to discuss today?
C: There is something they were proposing, so... let me just find it.
C: Let me paste it here, if you can copy it over to the doc.
C: In practice this will also be like a regular watch, yeah. And what will be different? What will be different is that the initialization part of the watch will be much longer, and we already have support for watch initialization. So watch initializations already count, they consume the tokens, or however we call them, the APF or in-flight tokens.
C: I think the only thing we really need is that we probably may want to adjust the width of those requests, potentially based on the estimated number of objects that we will be returning. Code-wise it should be a trivial change, literally a couple of lines of code.
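Note: a rough illustration of the kind of width adjustment being described, scaling an APF "width" (seats) estimate with the expected number of objects a watch would send during initialization. The function, constants, and cap below are assumptions for illustration, not the actual apiserver work estimator.

```go
package main

import "fmt"

// estimateSeats sketches how a request's APF width could grow with the
// estimated number of objects it returns. Purely illustrative values.
func estimateSeats(estimatedObjects int) int {
	const objectsPerSeat = 100 // assumed cost bucket, illustrative only
	seats := 1 + estimatedObjects/objectsPerSeat
	if seats > 10 { // assumed per-request cap, illustrative only
		seats = 10
	}
	return seats
}

func main() {
	for _, n := range []int{0, 50, 500, 5000} {
		fmt.Printf("~%d objects -> %d seats\n", n, estimateSeats(n))
	}
}
```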
C: No, it's basically... okay, so maybe to put it another way: what they are proposing is not to touch lists at all. List will remain the same API as it is. We are basically going to utilize the more or less existing functionality of list... or sorry, of watch, right.
C: If you pass resource version equal to 0 to a watch, then at the beginning it will serve, basically, Added events for every single object that currently exists. There are some tweaks that we need to do there and so on, but conceptually it's just utilizing this existing functionality and changing reflectors, informers and so on; underneath, everything is using reflectors.
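Note: the existing behavior described here can be observed with a plain client-go watch today. A minimal sketch, assuming an in-cluster config and core v1 Pods; the new parameterization proposed in the KEP is not shown.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes running inside a cluster
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// ResourceVersion "0" lets the watch start from the apiserver's watch
	// cache; the current state is delivered up front as synthetic ADDED
	// events, followed by subsequent changes.
	w, err := client.CoreV1().Pods("").Watch(context.TODO(), metav1.ListOptions{
		ResourceVersion: "0",
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		fmt.Printf("%s: %T\n", ev.Type, ev.Object)
	}
}
```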
C: So I would say it's more of a client-side change, plus... well, that's a bit of a simplification, because there are server-side changes too. There will be a new parameter to request that, but most of the change, conceptually...
C: We are tweaking the watch API a little bit, making it possible to parameterize it slightly differently, and all the other changes are client-side.
C: Yeah, so this KEP is not yet approved or anything like that. I don't think it was even fully discussed with SIG API Machinery.
C: I did two passes on it and had a bunch of comments, so there are still some gaps that we need to fill in there. But at the high level I think I'm more or less supportive; I just don't think it's in a state of being anywhere close to approved yet.
D: On this, I actually remember, Wojtek, if I'm not wrong, there used to be an old watch API where, when you start the watch, instead of doing the initial list you could start the watch itself directly, and all the initial objects are sent one by one as events, or something like that. Is this something similar to that?
C: This API kind of still exists; it's basically exactly what happens when you send resource version equals zero. So it's more about tweaking that code and making it more explicit and so on, rather than inventing something new.
D: Cool, so I had one idea to discuss, and I only have about 15 minutes left. If no one has anything else, I can bring it up. I actually wanted to discuss it in the last SIG meeting. I only recently started looking into it, so I may not be very thorough in how I see this, but this is essentially about etcd.
D: One of the problems we see is with writes. When the API server makes writes to etcd, and etcd writes these operations to its write-ahead log, it does this sequentially, one write at a time, because of this globally unique, increasing counter. So what this means is that writes are actually happening sequentially, so pretty much all the I/O operations are sequential.
D: That means no matter how much throughput or IOPS you provision to the disk, you're still latency-bound, which is not very good. And I believe one thing we did to work around this was splitting events into a different etcd, you know, in our open-source scale tests. So that was one way to get two parallel streams instead of one stream, kind of thing.
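Note: the events split mentioned here is typically configured with the kube-apiserver --etcd-servers-overrides flag, which routes a given group/resource to a dedicated etcd. The endpoints below are placeholders.

```
kube-apiserver \
  --etcd-servers=https://etcd-main.example:2379 \
  --etcd-servers-overrides=/events#https://etcd-events.example:2379
```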
D: So what I was thinking is: what if, within etcd itself, we could somehow parallelize different prefix ranges? Let's say the prefix for pods, if you think of that as an independent...
D: ...an independent stream which has its own unique resource version, right. I won't talk about how we would migrate from the existing thing to that; I have some ideas, but like I said, I don't want to go too far into those details, because I don't know if it will work without looking at the code yet. But let's say we parallelized, at some level, the MVCC for each of these objects.
D: So instead of creating different etcd clusters for different objects, let's say we had the same etcd process, but within it we could shard the resources. Because from what I understand, nowhere in Kubernetes does the API server ever make a call to etcd that tries to get more than one object type at once. It always asks for only pods, only nodes, or something like that. There are some calls like "kubectl get all" and such, but internally they still translate into individual, separate calls.
D: So if that is the case, and also there is no such ordering requirement... one benefit of having one globally unique counter across all the objects together is that you can make some complex checks that depend on multiple object types. Let's say a pod is created with a node already assigned, and you wanted to check whether, at the time the pod was created, that node existed. For that it'll...
C: Actually, we are officially saying that you shouldn't depend on that. We are officially saying that you should never compare the resource version across resource types, because any operator is free to shard etcd, or switch etcd, or whatever.
C: On a per-resource-type basis. So actually, we are even officially saying that you can rely on the monotonic increase of resource version within a resource type, but you shouldn't assume anything across resource types.
D: Oh, that's great, okay, that's perfect. That felt intuitive to me even otherwise, because usually a lot of these controllers have independent reflectors for each of these object types, so you cannot guarantee, at the point when you receive a pod event, what the state of your node cache is, and stuff like that. So anyway, I think that's the highlight. I don't know if there have been similar ideas floated in the past or if we've discussed anything like this, but what I'm trying to do is within etcd itself.
D: Basically, at the prefix level, we split out the MVCC and maybe the write-ahead log, or maybe not even the MVCC, just the write-ahead log. So yeah.
D: If you think at a high level this makes sense, I'll consider exploring this a bit further and maybe writing a KEP. But this is a scalability issue we are seeing within EKS; I've seen it with a bunch of our customers' clusters.
D: I don't know how much others see this. Splitting events helps, but even within a particular object type, once you go beyond a certain scale I think you might hit this. And it also has cascading impacts on reads, because reads are also consistent, so they are ordered behind the writes. So if writes get piled up, then reads can get slow, and stuff like that, I think.
C: Yep. So, I don't know etcd well enough to say something smart; at the high level it makes sense to me. I see one problem that you will encounter for sure, which is review bandwidth on the etcd project. The etcd project is really, really suffering from a lack of maintainers and a lack of people who are able to review stuff and so on. And this will effectively be a change in etcd, right, and they are prioritizing...
C: ...bug fixes and purely reliability-related work, and a little bit of tech debt cleanup and so on, to make the project a bit safer, over any feature-related work. And I would call what you are describing a bit of a feature, so you will probably face this problem. It's just a heads-up for you.
D: Okay, so things will probably move slower than I'd expect. Cool, okay, all right. Okay, if you think it makes sense, I'll see if I can dig a bit further into this. But is this a problem you've seen, with writes? Because it also depends a lot on your setup, like what kind of disk you're using and all that stuff.
C: So yeah, I think I remember a case of that, or maybe two cases, but it's not something that has seemed to be an urgent problem for us. I think we more often see a throughput problem hurting us on the API server side, the watch cache and so on, rather than on the etcd side per se.
C: I can easily imagine a write-heavy cluster with many CRDs and not that many writes per type, but a lot of different types. If you have, say, 100 different CRDs, then even 100 writes per second per CRD is not that much, so things like the watch cache and so on will easily handle that, but it already gives you, what, 10,000 QPS, which etcd can handle, depending on the setup, but it's not that far from its limits.
D: All right, so if there is nothing else, if no one else has anything to discuss, I want to bring up one last thing. This is regarding the issue I had cut earlier about this behavior with reflectors and the watch cache.
D: So this ticket, I think... oh, I think you did take a look at this one already. I still owe a comment as well; I'll respond to some of the things you've asked. But Marcel, can you open that issue real quick?
D: Yeah, so I think what I was mainly wondering about is that fix which was made. Okay, so a quick recap of the issue. If you can scroll down, there's an event log, actually, yeah, in this comment. What happens is there's a client which is trying to do a watch, and it's trying to do a relist.
D: It's trying to do this initial list with a given resource version, and let's say for some reason the watch moves from one API server instance to another API server instance, and on that API server instance, for that resource type, the resource version is behind, it's in the past, just because there weren't any events for that object type since that resource version.
D: It will get a 500 error, and then it will keep retrying, like 10 times, because the client retries, and every single time it fails. Eventually it does a list from etcd, and then that list takes it to an even further resource version in the future, while the watch cache is still stuck in the past.
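Note: a rough sketch of the client-side loop being described, using only public client-go and apimachinery calls. The real reflector logic in client-go is more involved, and treating the "too large resource version" case as a generic timeout-style error is a simplification.

```go
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listAndWatchOnce sketches one round of the loop described above: list pods
// at the last known resource version, fall back to a consistent list served
// from etcd when that version is rejected, then start a watch from wherever
// the list left off. Simplified for illustration; not the real reflector.
func listAndWatchOnce(ctx context.Context, client kubernetes.Interface, lastRV string) (string, error) {
	list, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{ResourceVersion: lastRV})
	if apierrors.IsResourceExpired(err) || apierrors.IsTimeout(err) {
		// Resource version too old (expired), or not yet reached by this
		// apiserver's watch cache (surfaced as a timeout-style error):
		// fall back to a quorum list (ResourceVersion unset), which jumps
		// to the latest resource version in etcd.
		list, err = client.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
	}
	if err != nil {
		return lastRV, err
	}

	rv := list.ResourceVersion
	w, err := client.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{ResourceVersion: rv})
	if err != nil {
		return rv, err
	}
	defer w.Stop()
	for range w.ResultChan() {
		// Handle events and track the latest resource version here (omitted).
	}
	return rv, nil
}
```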
D: So I think you made a fix for this, I guess, to get this behavior, which is: if you get a "too large resource version" error, then actually make a call to etcd. But what I'm trying to claim here is: is that even useful, is that even helping? Because if there are no changes to objects of that resource type, there is probably no point in making forward progress by just relisting and causing all this churn of 5xx failures.
C: No, I don't think we should be listing from resource version 0, because the fact that your API server doesn't have it doesn't mean it's not there; your API server may be slow with processing that and lagging behind other stuff. So it was a very conscious decision to not relist from resource version equal to zero, to avoid going back in time, to ensure that...
C: All right, I think I agree it's far from perfect behavior, what you are saying, what we are showing here. I think the question is whether we should be doing anything about that, because with the efficient watch resumption feature, or however we call it, the reuse of progress notify from etcd, which is basically what this feature does, the watch cache will actually be making progress, and it's enabled by default in 1.21, right?
C: I think it was 1.21 when we enabled it. Let me...
C: It doesn't require that fresh a version of etcd. I can't remember, but I think it's maybe even 3.2.
C: Maybe not, maybe 3.3. I think the trick was that in the older versions it wasn't exposing the knob to configure it, and it was only sending this progress notify at some hard-coded interval, I think every minute.
C: Every minute or something. So we exposed that knob in some patch release of etcd 3.4.something. So yes, it requires some reasonably recent version of etcd, but not a super fresh version of etcd.
D: Okay, yeah. And what I was trying to say, though, is that anyone who's running Kubernetes 1.21 does not automatically get this, right? They also have to explicitly, in the way they set up etcd, enable that flag, whatever it is that sends this progress notify.
C: That flag is enabled by default on the Kubernetes side. So actually, okay, yes: in 1.20 it's already there, in 1.21 it's enabled by default, so in 1.21 they will already have this behavior. It's just the progress notify: if they have the default etcd configuration, this wouldn't be super helpful, because those progress notifies will be coming every minute or five minutes, or something like that, way too rarely to be useful.
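Note: if I recall correctly, the knob in question is etcd's --experimental-watch-progress-notify-interval flag (e.g. setting it to a few seconds, such as 5s, rather than the default, which is on the order of minutes), and the Kubernetes-side feature being discussed is the EfficientWatchResumption feature gate; treat both names as my recollection rather than something confirmed in this meeting.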
D: Not sure, okay. On that note, actually, one last thing. As I think about it, I'm not fully sure if progress notify will solve this problem 100% either. Because let's say this happens and the progress notify keeps updating the RV, say every few seconds it updates it by 100 or something, but this retry keeps happening on the reflector, where it keeps making these list calls that fail, and then it relists, and when it relists it will get the latest RV from etcd.
D: So that may still be ahead of the progress notify RVs that are coming to the watch cache, right? So could this happen, could it happen that we are perennially stuck in this?
D: I think it's a matter of timeouts, right, because it's...
D: From what I saw on the client, it's trying to relist every one minute, and once it lists from etcd it starts a watch.
C: So, if we are not retrying too frequently, and we also shouldn't be, because of how we do retries, then it should work.
D: We'll probably just know this when it happens, but yeah, at least this cannot make things worse; it should only help, if it can. Cool, okay. So then we are fine. I think the call we are making is that we'll just say, okay, etcd progress notify fixes this, so let's not worry about optimizing this whole behavior. Okay, cool. I think we're over time. Sorry for taking more time.