From YouTube: 2016-08-18 Kubernetes SIG Scaling - Weekly Meeting
B: We could, if we collect agenda items. There's a bunch of stuff to talk about: there's the etcd3 stuff, there's the future, the assorted nest of issues that the garbage collector controller exposed, and other performance issues that we're kind of running into with both the watches and some other stuff. So the laundry list that has grown over these last couple of days has been pretty large.
B: I'll give a quick update, and Wojtek has notes too. I think we're all party to the current [inaudible]. So I have the fixes for the client stuff to enable for the testing; I was able to get that in this morning, but then I'm in some kind of testing purgatory now. I don't know if a lot of other people are seeing this, but I'm getting random test failures that have nothing to do with my PRs. It's happening with increasing frequency; I think it's like an endless cycle.
B: There are some issues here. So the PRs are released, obviously, and I saw a bunch of output in the TLS one, but I think it still needs a unit test. Is there a list? I know you created that list on the features repo.
A: Tim, I was looking to see if Aaron happened to join us this morning. I know from his last commentary to me, after the SIG Testing meeting, that there seemed to be some pretty big code changes getting put in that, I think, he certainly felt were a little bit late in the process. So it doesn't surprise me to hear you saying you're seeing things maybe get a bit flaky here at the end.
F: So I have a question, Wojtek, about rolling back etcd from v3 data to v2, as you requested. I'm just wondering: can we assume the v2 data is empty? Because that's much easier; we can just write a simpler v2 data file instead. If there is existing v2 data, we'd have to somehow fetch it and merge it.
E
E
Yeah,
so
there
was
at
least
one
rice
which
is
hopefully
fixed,
but
like
I'm,
not
one
hundred
percent,
it's
fake
took
sure
that
it's
fake,
but
hopefully
it
is
like.
So
that
was
one
issue.
It
was
like
yeah,
it
was
reflector
issue.
There
is,
and
there's
also
like
a
huge
increase
in
Murray
consumption
in
both
the
controller
manager,
which
is
kind
of
expected,
but
also
an
API
server,
and
the
reason
for
that
is
basically
that
garbage
collector
is
using
dynamic
client,
which
is
using
Jason's
and/or.
E
All
other
clients
are
using
proto
box
now
and
it's
like
a
known
issue,
but
like
basically,
we
need
to
make
a
decision
if
we
want
to
enable
it
without
product
with
Jason's
or
not.
It's
not
it's.
In
my
opinion,
it's
not
the
blocker,
but
it
needs
to
be
a
conscious
decision
that
our
components
will
be
using
like
API
server
will
be
using
two
types,
more
memory,
so
yeah.
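For context on the JSON-versus-protobuf point: typed clients can negotiate protobuf on the wire, while the dynamic client the garbage collector uses works on unstructured objects and therefore round-trips full JSON. A minimal sketch of the distinction using present-day client-go (package paths and names here are illustrative; the code under discussion in 2016 lived under different paths):

    package main

    import (
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
        if err != nil {
            panic(err)
        }

        // Typed clients can ask the API server for protobuf, which is far
        // cheaper to serialize and deserialize than JSON on both ends.
        protoCfg := rest.CopyConfig(cfg)
        protoCfg.ContentType = "application/vnd.kubernetes.protobuf"
        typed, err := kubernetes.NewForConfig(protoCfg)
        if err != nil {
            panic(err)
        }

        // The dynamic client deals in unstructured objects, so it speaks
        // JSON; every list and watch is a full JSON round trip, which is
        // the memory cost described above.
        dyn, err := dynamic.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }
        _, _ = typed, dyn
    }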
G: The first is naive clients being able to get cleanup without having to implement a client-side reaper: things like web consoles. I think the dashboard was blocked on this, and a couple of other clients. People have reported that once garbage collection goes in, naive clients can delete controllers and not have pods left around. I kind of had assumed that this would be something that people would want to enable, or choose to enable despite the impacts. But we do want people to enable it.
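The cleanup being described hinges on owner references: the controller stamps each object it creates with a reference back to itself, and the garbage collector deletes dependents once the owner is gone, so a naive client only has to issue one delete. A rough sketch of the field in question (the names and UID below are invented for illustration):

    package main

    import (
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
    )

    func main() {
        isController := true
        // A pod created by a ReplicaSet carries something like this in
        // metadata.ownerReferences. The garbage collector builds its graph
        // from these references; delete the ReplicaSet and the pod goes too.
        ref := metav1.OwnerReference{
            APIVersion: "apps/v1",
            Kind:       "ReplicaSet",
            Name:       "frontend-abc12",
            // Owners are tracked by UID, not name, so a recreated object
            // with the same name is not mistaken for the old owner.
            UID:        types.UID("d9607e19-f88f-11e6-a518-42010a800195"),
            Controller: &isController, // "I am the managing controller"
        }
        fmt.Printf("%+v\n", ref)
    }

The Controller field is the "yep, I don't touch you" marker that comes up a bit later in this discussion of dueling controllers.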
G: So we can fix the rest of the bugs that probably still exist in it. Even though we're in pretty good shape now, better than we were in 1.3, not enabling it would probably just punt this problem, and we would have the same problem in 1.5 that we would have to go fix. But it seems like people believe that the functional part of it is correct now.
G: I don't think it's going to be possible. So we got closure on how we would do it: there's a proposal that we're going to write up in API Machinery for how we're going to enable partial object retrieval for protobuf, so that a naive client can say "I want to get..." So the dynamic client: the reason it's doing that is just to get at the ObjectMeta, to get the controller references. And we have a rough high-level agreement in the API SIG that we will introduce a mechanism we can use to say to the generic client, "I just want to get the ObjectMeta out of this protobuf object and get it back in protobuf," which should fix the API server side. The rough timing for that is: there's a lot of motivation to do it for 1.5, but I don't think we can guarantee we'll get it in 1.5.
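For what it's worth, the mechanism sketched here did eventually land upstream, well after this meeting, as the metadata client: the server returns only ObjectMeta (as PartialObjectMetadata) in whatever encoding the client asks for. A sketch with modern client-go, purely to show the shape of the proposal; none of this existed at the time of the recording:

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/metadata"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
        if err != nil {
            panic(err)
        }
        // The metadata client negotiates PartialObjectMetadata with the
        // server, so only ObjectMeta (including ownerReferences) crosses
        // the wire, which is all the garbage collector needs.
        mc, err := metadata.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }
        pods := schema.GroupVersionResource{Version: "v1", Resource: "pods"}
        list, err := mc.Resource(pods).Namespace("default").List(context.TODO(), metav1.ListOptions{})
        if err != nil {
            panic(err)
        }
        for _, item := range list.Items {
            fmt.Println(item.GetName(), item.GetOwnerReferences())
        }
    }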
G: In practice, all of the controllers ended up needing to post a reference back to themselves, because of dueling controllers. While the label selector sounds great, in practice the downsides of dueling controllers were very bad, and the cost to fix dueling controllers is basically that you put something on the thing you created so that other people can be like, "Yep, I don't touch you." To your point, though, about the graph: most of that is because we don't have tombstones, any way to tombstone in etcd today, for the deletion characteristic.
G
So
given
it's
a
graph
and
that
we
have
things
coming
and
going
that
are
disconnected,
and
we
can't
build
a
single
transactional
tree.
We
have
to
have
either
tombstones
or
a
long
wait
period,
and
the
memory
is
basically
dealing
with
that.
I
think
there
was
a
proposal,
maybe
further
down
the
road
that
we
would
track
tombstones
at
some
point
by
you
it
and
once
we
can
track
tombstones
by
you
it
then
the
memory
implications
go
away,
but
it's
kind
of
that
short-term.
C
I,
honestly,
I
think
you
know
you
know
it
back.
References
are
totally
saying
the
semantics
are
on
the
back.
References
in
terms
of
spreading
and
stuff,
like
that,
I
think
should
be
more
explicit.
I
just
want
to
make
sure
that
we're
not
continue
like
we're,
not
making
another
sort,
of
instance
of
that
ugliness.
That's
in
the
scheduler
right
now.
So
it
sounds
like
that's
not
the
case,
so
I
believe
it
is
not
the
case,
but
we
should
double.
C
G
G
G: So, given that we think there may be a fix in 1.5 for some of the worst performance aspects of it, but there's no guarantee: do we want to hold it up, do we want to make it opt-in for everyone, make it opt-out, or have a recommendation? Like, is opting out of a new feature really that bad for end users? Do we have the control to turn it off, etc.?
C
Mean
ideally,
we'd
have
some
sort
of
global
setting
that
you
could
set
here
for,
like
an
experiment.
I
mean
I'm
thinking
about,
like
the
been
during
experiment
that
the
goal
line
guys
went
through
and
they
were
able
to
age
that
in
over
a
couple
of
versions
so
that
there
was
some
leeway
right.
So
initially
it
was
off
by
default,
but
you
could
turn
it
on
and
then
it
was
on
by
default.
But
you
could
turn
it
off
and
then
it's
a
done
deal.
G: The only thing is that the memory impact is solely caused by running the garbage collection process. If you do not run the garbage collection process, you still get controller refs, but the controller refs are done at creation time, so it's basically free. I think if the controller is off, there is zero impact to the system for garbage collection, and it is exactly the same as a 1.3 system. Yeah.
G: In the controller manager today there is a flag, enable-garbage-collector, and enable-garbage-collector defaults to false. So today, if you have a one-master cluster that you turn on, coming from master or 1.3 to 1.4, garbage collection is not on by default: it has no memory impact, and the references are being set. Setting the flag to true will increase the memory usage, and when we go delete a deployment or a daemon set, all the pods get cleaned up by default.
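For the record, the flags in question around the 1.4 timeframe looked roughly like the following; both later went away once garbage collection became unconditional, so treat the exact spelling as an approximation:

    # alpha in 1.4, off by default; the two settings had to agree
    kube-apiserver --enable-garbage-collector=true
    kube-controller-manager --enable-garbage-collector=true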
H: Okay. Just as a cluster operator, from a capacity planning perspective: if I'm being told memory usage could potentially double, I might need some additional time to give myself room for that. So maybe not enabling it by default would be the least surprising thing here, but I totally understand your position that the longer we don't enable it by default, the longer this continues to be a problem. I'm just thinking about the element of least surprise, and really loud release notes or something.
A: I'm sitting here trying to think of the degree to which that matters. Just as a general principle, I think a lot of folks still look to GKE as sort of the canonical running system, and defaults that kind of match what GKE does; you know, there's some principle of least surprise there again, to echo what Aaron is saying, I mean.
G
I
can
set
it
from
an
open
chef
perspective.
We
are
unlikely
to
enable
this
an
open
shift
34
for
our
customers,
because
we
still
have
some
lingering
security
issues
with
regards
to
multi-tenancy
to
sort
out
so
in
a
multi-tenant
namespace
context
where
you
might
have
users
who
are
only
editors,
an
editor
can
use,
can
abuse
references
like
deletion,
cleanup
references
to
trick
the
system
into
deleting
a
resource
that
you
don't
have
authority
to
delete
only
to
edit
and
from
a
cube
perspective.
G
This
doesn't
matter
it
only
really
matters
when
you
have
fine-grained
roles
and
namespaces,
and
so
that's
our
primary
concern
and
we
have
the
selfish
desire
that
we
would
like
to
see
all
the
bugs
sorted
out
as
well
before
we
turn
about
introduction
environments,
so
somebody's
got
to
jump
off
the
ledge
first
and
we
get
that
we
have
the
excuse
of
security,
but
I
do
think.
We
really
need
to
move
forward
on
this.
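To make the multi-tenancy concern concrete: the garbage collector acts with its own authority, not the authority of whoever set the reference, so a user with only edit rights on an object can get it deleted on their behalf. A hypothetical sketch of the abuse (all names invented):

    package main

    import (
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
    )

    func main() {
        // An edit-only user updates an object they cannot delete, claiming
        // it is owned by something that does not exist. When the garbage
        // collector later observes the dangling owner, it deletes the
        // object itself, using its elevated permissions, not the user's.
        dangling := metav1.OwnerReference{
            APIVersion: "apps/v1",
            Kind:       "ReplicaSet",
            Name:       "never-existed",
            UID:        types.UID("00000000-0000-0000-0000-000000000000"),
        }
        _ = dangling
    }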
E: So, with respect to the issue that I opened: there are some blocks from the API server. It seems like we are going out of the history of etcd, but I'm not sure why exactly; either the API server is not able to process fast enough, or etcd is not able to send fast enough. Or could this be an issue with compaction, possibly?
A: You know, with regard to reliability and failures: the failure rates that we get are pretty high, and it seems to be Docker issues, that sort of thing. So we're just trying to get to a point where we have a cluster setup that's sufficiently reliable. I mean, I think we see the community tests pass far more reliably than anything we can manage to assemble at this point, and I'm still pretty concerned.
B: Against upstream, we are still in the process of our vetting for our 1.3 rebase; I'm still finding issues, and I believe David Eads just picked some stuff this morning, Clayton. So no, we are not reliably running at that scale, because we have a bunch of other stuff on our side: OpenShift adds a bunch of extra resources and objects and controllers, to the point where we don't run the numbers that upstream runs; we run denser clusters with fewer nodes.
E: Currently we have some reliability issues, but we basically are running 2000-node clusters continuously. We are mostly running 2000-node Kubemark, which is slightly different, but we are also running real clusters from time to time. And currently we are running them from head, basically all the time.
D: Okay, so maybe we can take this offline; I'll contact each of you and see if I can find out what your stats look like, in particular yours, Wojtek. We're consistently running into problems with clusters far smaller than 1000 nodes; I mean, even hundred-node clusters we've seen problems with, so, anyway.
B: There are a ridiculous number of hard-coded timing parameters inside of the tests, and that can be a weird source of issues and anomalies; for a long time we've struggled with that. Now that we have Jay back, I've asked him to start tweaking, or digging into, that space again, because we logged the issue a long time ago. So if you're seeing errors, it would be nice to at least have what those errors are reported upstream, and I wouldn't be surprised if a large number of them are timing artifacts. Okay.
C: I just want to mention, and we can dig up the issue number if you're interested, one long-standing sort of open issue that we want: some way to point yourself at a cluster and say "tell me about yourself," down to the level of all the parameters. This isn't setting parameters, which is a whole other effort; this is just collecting meta-information about a cluster so it's easily shareable. That's kind of been a want for a while, and so, you know, help wanted.
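Part of this exists today as a kubectl subcommand, though it dumps runtime state rather than the flag-level configuration being asked for, and how much of it was available at the time of this meeting is unclear:

    # dumps nodes, events, and workload state for sharing or bug reports;
    # it does not capture component flags or tuning parameters
    kubectl cluster-info dump --output-directory=/tmp/cluster-state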