From YouTube: 2017-02-09 Kubernetes SIG Scaling - Weekly Meeting
B
Found a super nasty bug, and the reason it's super nasty is that it escaped all radars until we deployed a cluster. It was an actual cluster which had separate monitoring, so we found it indirectly through Zabbix; even though our profiles showed it, we weren't really paying attention to it. So here's the summary of the issue that we found: with 1.4 and 1.5, we have a crazy high number of IOPS going through etcd.
B
If you have multi-master etcd, this will cause leader election failures on large clusters and bring down your whole cluster. So it failed on two fronts for us, and we found the root cause of it. The root cause is basically that in 1.4 a bunch of quorum-read additions occurred, and a quorum read will force a write.
B
So if you have a highly available cluster, this basically increases the writes that you had. Right now there's a ticket, or an issue, that I'll link to in the docs, and we're going to have to sort of triage it, go through the different locations that are doing quorum reads, and try to remove them where possible; some things may need to be rejected because of this.
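For context on the mechanism: in the etcd v2 client, a quorum read is just a flag on the Get, but it routes the read through raft rather than serving it from the local member's store, which is where the extra disk traffic comes from. A minimal sketch in Go, assuming the etcd v2 client (github.com/coreos/etcd/client) and a placeholder endpoint and key:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/client"
)

func main() {
	cfg := client.Config{
		Endpoints:               []string{"http://127.0.0.1:2379"}, // placeholder endpoint
		HeaderTimeoutPerRequest: 5 * time.Second,
	}
	c, err := client.New(cfg)
	if err != nil {
		log.Fatal(err)
	}
	kapi := client.NewKeysAPI(c)

	// Quorum: true forces the read through raft, giving linearizable
	// results but also turning every read into raft traffic; that is
	// the write amplification discussed above.
	resp, err := kapi.Get(context.Background(), "/registry/pods", &client.GetOptions{Quorum: true})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.Node.Value)
}
```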
B
The reason it didn't show up before is that we don't have any upstream IOPS tracking, and that needs to be done. Marek and I were talking about it: we need to add a gate, because right now in a lot of performance tests we monitor CPU and memory profiles, but we need to add a gate for IOPS profiles on etcd too. That's one thing that we can do proactively.
B
The second thing is that we weren't paying as much attention on our side, even though we actually had the data, so it escaped us. In many cases we actually had machines and environments that were fast enough, right, but then all of a sudden somebody tried to deploy on Amazon with a limited-IOPS mount, and it blew up right away, right.
D
Right, so I can talk a little bit to the chain of events. Quorum reads were off at the storage level, and there were a bunch of discussions from various people who were starting to run etcd in this mode, and they were saying: we're seeing slightly weird things happen at big scale, where ordering wasn't quite what we expected when one of the members starts to lag; you'd see things like a member getting kicked out and then put back in a couple of seconds later. And so the change was made in core Kube.
D
Kube turned quorum reads on for everything by default. In OpenShift we had never had quorum reads on, from the very beginning, and there are flakes, small flakes that occur when one of the members leaves, but it was just never... almost all of the time it would at most cause a brief flutter on something like a service, which I think would impact...
D
...you know, people when they're running at extremes, but just not in the practical cases. And in OpenShift we're only using quorum reads, and not even really quorum reads, it's something even more sophisticated, for authorization. When we push down an authorization token to etcd, we would wait until all the members confirm they had it, to prevent the classic case where you grant access and then someone immediately uses that token and gets denied. So it just didn't affect services except in extreme cases, not the normal cases.
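This is not OpenShift's actual implementation, but the shape of that wait is easy to sketch against the etcd v2 client: poll each member's local store directly until every one of them can serve the token. The endpoints and key path here are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/client"
)

// waitAllMembers polls each etcd member directly until every one of them
// can serve the key locally, approximating "all members confirm they have it".
func waitAllMembers(key string, endpoints []string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for _, ep := range endpoints {
		c, err := client.New(client.Config{Endpoints: []string{ep}})
		if err != nil {
			return err
		}
		kapi := client.NewKeysAPI(c)
		for {
			// Quorum: false serves from this member's local state only.
			_, err := kapi.Get(context.Background(), key, &client.GetOptions{Quorum: false})
			if err == nil {
				break // this member has the key
			}
			if time.Now().After(deadline) {
				return fmt.Errorf("member %s never confirmed %s: %v", ep, key, err)
			}
			time.Sleep(50 * time.Millisecond)
		}
	}
	return nil
}

func main() {
	endpoints := []string{ // placeholder member URLs
		"http://etcd-0:2379", "http://etcd-1:2379", "http://etcd-2:2379",
	}
	if err := waitAllMembers("/auth/tokens/abc123", endpoints, 5*time.Second); err != nil {
		fmt.Println("token not yet replicated:", err)
	}
}
```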
D
That's something I worry about too, because I think we can relax it in some cases, but we're not instrumented on the client side, on the server side, or on the etcd side to do causal reads. The vast majority of our controllers are single-process, causal-read kinds of things, but we've not done the infrastructure work to do even the simplest, you know, session-style vector clocks or anything up and down the stack, and that's a significant investment too.
D
The simplest possible fix for someone, for the vast majority of clients, is a precondition on the etcd resource version: the list-watcher knows what resource version it's got for a particular resource, and it passes that down to the controller; the controller just automatically carries it through and passes it down to the API server, and the API server says, the resource version this request gave me is stale, so I'm just going to wait until I hit that resource version again. We've talked about plumbing it, but it's a pretty big chunk of work.
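A hedged sketch of the idea with client-go: the first list captures the resourceVersion the list-watcher observed, and a later request passes it back as a freshness floor. The kubeconfig path is a placeholder, and this only illustrates the precondition concept, not the actual plumbing being discussed:

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// First list: capture the resourceVersion the list-watcher observed.
	pods, err := cs.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	rv := pods.ResourceVersion

	// Later request: pass the observed resourceVersion back, asking the
	// server for results at least this fresh rather than arbitrarily stale.
	fresh, err := cs.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{ResourceVersion: rv})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("observed %d pods at resourceVersion %s\n", len(fresh.Items), fresh.ResourceVersion)
}
```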
D
That's one option, but it's not going to stop it if the masters are talking to a lagging API server, and that's the same thing, so it really does have to go all the way up the stack and back down. And threading: we're not using threading, so there is no scenario today under which someone who's sharing a client across multiple controllers is being impacted, because there's just no state that's shared. This would be, like, a high watermark, and the clients would probably get a...
D
We'd have to go through a couple of rounds of tests to be sure that that's nice too. So I don't know if anybody else has any other easy fixes. I mean, Tim, I think, like when we had the auth thing on the OpenShift side: doing it for create, and then making sure that at least the members are up to date, is a fairly easy thing to do.
B
I mean, on HAProxy we specifically specified session affinity, because we knew that when you hit the non-session one you could potentially run into these issues, right. That was a hard requirement, and Daniel and I even chatted about this, about, you know, starting to have session affinity for all clients, but I wouldn't go that far on that one. Well...
D
And we're not doing load distribution on the API servers, so there is the case that, today, the masters are spraying more or less... I think my understanding right now is that the masters are spraying to the underlying etcds fairly evenly. If we went and turned off quorum reads and were focusing on one particular master, or one particular etcd master, then to get the benefit we would also need to ensure that we didn't get hot-spotting; I would be a little worried about, you know...
D
...turning on session affinity and rigorously enforcing it from API server to etcd, and putting in scenarios like restart-hammering one of the etcds, because all three or all four of the masters would start talking to it. If the API servers all started talking to one particular etcd, that's no worse than it is today, where we're all talking to the leader. So...
C
One other point here is that it's worth talking to the Borg folks. I mean, Borg essentially is single-master at a time, where, you know, the backups are just there for backups, and they just scale the machine size of the Borgmaster.
B
That matches what they talked about, yes. I mean, we could get higher... we could get much higher throughput on a single API server, you know, if you really wanted to. We could also do things like windowed writes, you know, shorter write cycles to etcd, because right now it's a straight write-through, right. It's...
D
A discussion that we had around the same time this happened: we discussed in OpenShift whether we wanted to patch it out, and one of the challenges would be that we're actually giving up availability like this, because the ability to do stale reads, even though it causes some weird consistency issues in some edge cases, means we can still lose the master and satisfy at least reads; you just can't satisfy writes. And Kube does not do well when you can't satisfy reads, but it's fairly resilient in many cases to temporarily not being able to satisfy a write. So in the long run, even though I agree that it is easier just to turn on quorum reads or bump up IOPS, I do think having causal consistency at the client, as Joe is saying, with agreed resource versions... I would prefer to have causal consistency in the client and fix it that way.
B
I mean, the numbers that we have on the API server today... we haven't done it in a while, but the numbers we have today on the API server, in the load-balanced case, are actually really good. By comparison, we should probably do a single API server with backup standbys, just to see what the performance characteristics are, you know, do an A/B test for that model, because that's the traditional model. Honestly, it's only in Kubernetes that I've ever seen this model where you're...
D
I don't know. I mean, the masters have to proxy exec and port-forward, and so when we went through the first round of designs for this, we said, rather than creating two different scale-out layers, we're just going to create one. And, I mean, we can solve the causal consistency thing; we've already started to do some of it. So I don't know, I mean, I agree...
D
Yes, it is simpler, but we're going to be pushing a large amount of traffic through the masters, and I'd rather have one scale-out master layer and have the controllers be master-elected than have three different tiers, controllers, API server, and a scale-out proxy layer, because, I mean, effectively the masters are supposed to be a proxy.
D
And we have that. At least we're in a scenario where, as long as you're not using two clients... as far as I know, every single component in Kube uses a single client: all the controllers use a single client, the kubelet uses a single client, kube-proxy uses a single client. You have to work to use two clients at the same time, so I think that's fairly feasible to make happen. So I don't know; I lean a little bit toward not changing the API server model.
A
I think the bigger issue is just understanding, and I think this was Joe's point from earlier, that this is this sort of problem. So, to make sure I understand the implication: if you're running on Amazon, the recommendation would be, if you're running really big clusters, run your etcd instances on separate instances with high-IOPS EBS, and put a monitor on the IOPS, so that if you're ever hitting your IOPS limits you're ringing big alarm bells that you need to go...
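No specific tooling is named here; as one hedged illustration, such a monitor could be as simple as sampling /proc/diskstats on the etcd host and alerting when the ops/sec delta approaches the provisioned limit. A minimal Go sketch (the device name and IOPS limit are placeholders):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// readOps returns completed reads+writes for a block device from /proc/diskstats.
func readOps(device string) (uint64, error) {
	f, err := os.Open("/proc/diskstats")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// fields: major minor name reads-completed ... writes-completed ...
		if len(fields) >= 8 && fields[2] == device {
			reads, _ := strconv.ParseUint(fields[3], 10, 64)
			writes, _ := strconv.ParseUint(fields[7], 10, 64)
			return reads + writes, nil
		}
	}
	return 0, fmt.Errorf("device %s not found", device)
}

func main() {
	const device = "xvda" // placeholder: the volume backing etcd's data dir
	const limit = 3000.0  // placeholder: provisioned IOPS
	prev, _ := readOps(device)
	for range time.Tick(10 * time.Second) {
		cur, err := readOps(device)
		if err != nil {
			continue
		}
		iops := float64(cur-prev) / 10.0
		prev = cur
		if iops > 0.8*limit { // ring the alarm well before the ceiling
			fmt.Printf("WARNING: %.0f IOPS, near the %.0f provisioned limit\n", iops, limit)
		}
	}
}
```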
D
...this definitely crops up. So we started seeing this in small clusters on standard two-CPU instances, because they were running out of... sorry, we are seeing high IOPS; it is not blowing up those clusters, but some of those clusters are very IOPS-constrained. Two-CPU instances are reasonable for small clusters, but I would expect bigger clusters to have much bigger IOPS provisioning.
D
So, I mean, I think we should bring this up with API Machinery. It's certainly been on Dan's and my radar for a little while, but we just haven't formally put it into anything. If in 1.6 we left quorum reads on and bumped IOPS for those things, for most people that's okay; for the specialized cases, for OpenShift, I do think we can probably survive it, but we'll probably make some different recommendations. And then, I mean, again, my preference is to just fix the causal consistency from the client.
B
I think that would probably work well, but I'm still curious about the numbers on a single API server. I know that it's an exit strategy that gets you out of this pretty fast, and it allows you to still fix the other ends in the long haul, right. Yes, I mean, we're not seeing terrible numbers on a single API server like we used to right now. Well...
B
If you wanted to, you could be even smarter, where you make sure you don't colocate your controller... you keep the way you have your HA set up, you just don't colocate the controller with the API server, and if you do that you're not going to hit the same... you won't commingle the profiles, right. That would be a different partitioning. Yeah.
D
And the controllers: right now the controller manager is fairly easy to chunk into smaller subgroups if somebody wanted that, even as we get the number of shared caches down. We've still got something like a 3x to 4x cache-duplication layer; most of Andy's work for that will go in in 1.6/1.7, and we should be down to, you know, a replication factor of one or two for the various informers and list-watchers. Once that's in place, splitting out the controllers causes more traffic to the API server, but less traffic...
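For reference, the shared-cache direction mentioned here is what client-go's shared informers provide: many controllers reading from one watch-backed cache instead of each keeping its own list-watcher. A hedged sketch with client-go (the kubeconfig path is a placeholder):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// One factory means one set of list-watchers and caches, shared by every
	// controller that asks for the same resource; this is the "replication
	// factor" reduction being discussed.
	factory := informers.NewSharedInformerFactory(cs, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) { fmt.Println("pod added") },
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // block forever; real controllers would do their work here
}
```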
D
It feels like right now... I mean, Wojtek, I don't know what your opinion is on this, with the update-status changes, but say we get this performance improvement in updates on node condition status tracking: what's the next update? Is pod status update lurking behind node status update as a significant cause of cluster load in Kube?
E
So it depends on the load. We have reasonably, or maybe unreasonably, low churn in our tests; I think we are generating roughly ten to twenty, like, both deletions and/or creations or changes. The changes coming from node status, all the changes coming from node objects, are an order of magnitude more than that. Oh, I'll...
D
I feel like... I guess my gut feeling is that I still would like to preserve API server scale-out, just because it's the easy escape hatch if node updates, or pod status updates, are crushing the cluster and we want to go from a 500-node cluster to a 1000-node cluster, or if we want to go from a 500-node cluster with, you know, 10-pod density to 25-pod density. In both of those cases the number of updates from the nodes is going to roughly double, and being able to throw more API servers at...
C
I was playing around with ngrok. Have you looked at ngrok? It's essentially a forwarding thing that, you know, hooks off of a cloud server for getting access to stuff. And something like that is just incrementally distributed, right, where you're just running another job that exposes this in a different way through a different path. Well...
D
You'd get root access to all those processes everywhere on the cluster too... I'm confused. Well, I mean, we don't... so this is a pretty common thing we have, but we don't run anything that has node-level privilege on anything except a very restricted set of nodes, just because any kind of proxy running on the cluster...
C
I guess what I'm trying to say is that you don't have to run these things as a shared service. It can actually be per-namespace, per-job: when you want to port-forward something, launch a specific job for that port-forwarding thing, and then have it use some other resources that are just scalable network, you know, bastion-type stuff, right. So it's a different architecture. It's a red herring right now; let's not worry about it. Yeah.
D
That was kind of the original design decision. I think we actually talked about bringing up bastions before NodePorts even existed, and before we had the changes in CRI to offload exec and port-forward to something that's even resource-constrained, versus just the Docker daemon. When we went through that kind of stuff, it was: how many different complex pieces do we want to run as the edge layer? Trying to get that number down to, like, two or three things was the overriding concern, to keep the complexity of the topology low. And so I agree, like...
C
I think when you say that the API server is really a bastion, the thing that worries me about that is that you generally don't give your bastion permissions. In the AWS case, we're giving it essentially root on AWS via the IAM role, so it can do the cloud-provider calls. So getting the cloud provider separated out...
D
...that would be a good thing to bring up and talk about, yeah. We should definitely tell people that if they weren't looking at IOPS on their clusters, they should do that calculation: hey, are you okay with tripling or quadrupling the amount of IOPS that you push through your etcd instances, yes or no? I think...
B
Yes, but it's more than that too: if you're not careful, you can totally hose your cluster. So, on a side point, I do think that we should have gating upstream. We need to start adding more tests so that we can monitor this, because it blindsided us, and I think we should have some more tests that validate any drift in performance outside of, you know, just the standard CPU and memory, so that we can track it over time.
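As a hedged sketch of what such a gate might look like, alongside the existing CPU and memory gates: the threshold and the measurement helper below are hypothetical, and a real gate would hook into the existing performance-test machinery rather than a standalone test.

```go
package perf

import "testing"

// measureEtcdIOPS is a hypothetical helper: in a real gate it would sample
// disk counters on the etcd members over the course of a load test.
func measureEtcdIOPS(t *testing.T) float64 {
	t.Helper()
	// ... run the workload, sample /proc/diskstats deltas, return ops/sec ...
	return 0 // placeholder
}

// TestEtcdIOPSGate fails the run if etcd IOPS drift past an agreed budget,
// the same way CPU and memory profiles are gated today.
func TestEtcdIOPSGate(t *testing.T) {
	const iopsBudget = 1000.0 // placeholder budget per etcd member
	got := measureEtcdIOPS(t)
	if got > iopsBudget {
		t.Fatalf("etcd IOPS regression: measured %.0f ops/sec, budget %.0f", got, iopsBudget)
	}
}
```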