From YouTube: 2017-08-17 Kubernetes SIG Scaling - Weekly Meeting
A
B
A
A
A
The first bullet I think is non-controversial, but I'm going to represent that the tone has been shifting from, sort of, building bigger and bigger clusters to stability on 5,000-node clusters, and I'm going to ask the question about whether there's an urgent need to stretch beyond 5,000.
C
A
B
Okay, so my problem is that, you know, we want to have a slightly larger footprint across the board for all these things. So, for example, our operational zones have close to 20K machines. So, you know, ideally we would like some feature so that all of these 20K machines can be managed in a unified way, but that might mean using Federation, so I'm not sure.
C
There's well-established documentation that we have that having a single large cluster, in the tens of thousands of nodes, sort of violates principles of failure domains. Federation is a potential escape hatch, but it comes with its own complexities; sometimes people just do a multi-cluster approach, for example for staged rollouts. There are many patterns that exist that give you an escape hatch versus having a single large environment.
B
A
C
A
C
I think, you know, my personal take on Federation is that you're grouping failure domains. When we wrote up our blog post on it, you know, we talked internally for a long time about it, and there's actually a three-part blog post that Joe and Craig wrote up regarding it, but it conflates failure domains.
C
A
C
A
A
A
Aaron actually typed up some very helpful comments here. It's a fairly short update; I really wasn't planning on covering all this in detail, but just to give the wider community some exposure to where the docs are and what they do. I do want to do a quick call-out for folks that are interested in provider coverage beyond the ones that we have. This seems like a good call to make sure they know that we'd love their input.
A
So I'll cover this briefly. I'm going to mention event refactoring, which Marek would cover were he here. The only piece of this I'm really concerned about whether I'm representing properly is the fourth bullet, which is that it's at least two releases of work, with kind of a baseline in 1.8 and not really helping substantially until 1.9; I'd just call it a work in progress.
A
A
A
A
Short of getting some additional feedback from those guys, this is the paging API update, just to let people know this is going on for these kinds of large scaling parameters. That's a pointer to that issue there. I got the 1.8, 1.9, 1.10 bits straight from the features repo. I pinged Clayton just a few minutes ago, just to make sure he knew I was going to say this.
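For context, the paging work referenced here is API list chunking, where clients request large collections in pages using limit and continue parameters rather than one huge list call. A minimal sketch of how a client might consume such pages, assuming an API server reachable through kubectl proxy and the requests library; the endpoint and page size are illustrative, not taken from the meeting:

```python
import requests

API = "http://127.0.0.1:8001/api/v1/pods"  # assumed access via `kubectl proxy`

def list_pods_in_pages(page_size=500):
    """Fetch all pods in chunks using the list API's limit/continue parameters."""
    params = {"limit": page_size}
    while True:
        resp = requests.get(API, params=params)
        resp.raise_for_status()
        body = resp.json()
        for item in body.get("items", []):
            yield item
        token = body.get("metadata", {}).get("continue")
        if not token:
            break  # no more pages to fetch
        params = {"limit": page_size, "continue": token}

if __name__ == "__main__":
    print(sum(1 for _ in list_pods_in_pages()))
```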
A
So I think that one's probably okay. Cluster results sharing: I looked at the last update I did, and this was there too, so I think this is fine. This is probably mostly for you, to make sure you're comfortable with this update in terms of our interest, in the SIG, in using Sonobuoy in some form to capture and share results. Looks good to me. All right, and then what I'm planning to do is...
D
Hey, so this isn't something I can take a ton of credit for; I just happen to be aware of it because I'm on sig-testing, and I think Shawn's been pushing this pretty hard. One of the big things we noticed during the 1.7 release was that the scalability testing wasn't really passing, and we didn't notice this until late in the release cycle. So we put together a proposal, or Shawn did, and I think Marek to some extent.
D
Friday runs are performance oriented: you know, run the density test and make sure it meets those two big, broad SLOs that we've defined thus far. We'd like to get to the point where it passes sort of more refined SLOs, but this is the state of today. The correctness runs are to make sure that all of the other tests still pass at the 5,000-node level, and then Saturday and Sunday we...
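For reference, the two broad SLOs alluded to here are commonly stated as API responsiveness (99th percentile latency of API calls under 1 second) and pod startup time (99th percentile under 5 seconds). A minimal sketch of the kind of threshold check a performance run applies; the metric names and sample values are assumptions for illustration, not taken from the meeting:

```python
# Hypothetical illustration: verify measured 99th-percentile latencies against
# the two broad scalability SLOs (thresholds in seconds, assumed values).
SLOS = {
    "api_call_latency_p99": 1.0,    # API responsiveness
    "pod_startup_latency_p99": 5.0, # pod startup time
}

def check_slos(measurements):
    """Return (metric, value, threshold) tuples for any SLO that is violated."""
    violations = []
    for metric, threshold in SLOS.items():
        value = measurements.get(metric)
        if value is None or value > threshold:
            violations.append((metric, value, threshold))
    return violations

if __name__ == "__main__":
    sample = {"api_call_latency_p99": 0.62, "pod_startup_latency_p99": 3.9}
    for metric, value, threshold in check_slos(sample):
        print(f"SLO violated: {metric}={value}s (threshold {threshold}s)")
```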
D
D
So, the way this is implemented today: you may have heard me talk about things like Prow before. Jenkins is still involved in running tests for some things today, and this includes the scalability tests, so we use a Jenkins cron trigger to make sure they're triggered according to the schedule I've just laid out, and then we're using a script called bootstrap from the test-infra repo. It's basically the script that, if you ran it locally, would do all of the exact same setup and configuration and layering of stuff
D
that would actually happen the same way we run tests for the project. So if you look at any of those env files there, you'll see the exact configuration, down to the size of disks, the size of nodes, and the QPS tunables that have had to be adjusted in order for clusters of those sizes to meet the SLOs. Next slide, and we can maybe click through some of these links, I don't know. Basically, I just want to call out that sig-scalability, being a good SIG, has our own dashboard today.
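A minimal sketch of the scheduled-job flow described above, where a cron-style trigger runs a bootstrap-style wrapper with the job's environment overrides and then uploads artifacts to GCS. The job name, environment variables, driver script, and bucket below are hypothetical placeholders rather than the actual test-infra configuration:

```python
import os
import subprocess

# Hypothetical placeholders: the real job names, env files, and buckets live in
# the kubernetes/test-infra repo and are not reproduced here.
JOB_NAME = "gce-scale-performance-example"
JOB_ENV = {
    "NUM_NODES": "5000",       # cluster size under test (illustrative)
    "NODE_DISK_SIZE": "50GB",  # illustrative tunable
    "KUBE_API_QPS": "100",     # illustrative tunable
}
RESULTS_BUCKET = "gs://example-scalability-results"  # assumed bucket

def run_scalability_job():
    """Bring up the cluster, run the e2e suite, and upload artifacts to GCS."""
    env = {**os.environ, **JOB_ENV}
    # Placeholder for the bootstrap/e2e driver invocation used by the real jobs.
    subprocess.run(["./run-e2e-scale-tests.sh"], env=env, check=True)
    # gsutil is the standard GCS CLI; -m parallelizes the copy.
    subprocess.run(
        ["gsutil", "-m", "cp", "-r", "_artifacts", f"{RESULTS_BUCKET}/{JOB_NAME}/"],
        check=True,
    )

if __name__ == "__main__":
    run_scalability_job()
```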
D
It's called sig-scalability. It's one of these fancy tab-group things where we actually have just the density tests out of the GCE tests, and then, yeah, you can just open it, cool, so you can see there are sort of all the tests that sig-scalability cares about on that second row there. And then the top row calls out google-gce-scale and google-gke-scale; those are the larger scale tests, with the two-thousand and five-thousand nodes that I was just talking about.
D
So you can see if there are problems with the tests that this SIG cares about, and we can see that. And the other thing that I wanted to call out: again, I mentioned that the GCE tests are blocking. If you go back to the slide and look at the release-master-blocking dashboard there, you can see I have two links, to the gce-scale-correctness and gce-scale-performance jobs. Those are the five-thousand-node jobs.
D
Those are part of the list of jobs that are considered release blocking, so the tool that's in charge of cutting builds for Kubernetes actually looks at all of the jobs in that release-master-blocking dashboard. So the same configuration that drives Testgrid is the same configuration that drives that aggregation.
B
D
I always forget how to pronounce this tool, for those of you who are familiar with it. So if this is red, which it is, we can't cut an alpha, beta, or release.
D
Okay, next slide. Something cool that we sort of discovered along the way is that collecting all the logs from all the nodes, when you're at 2,000 or 5,000 nodes, takes a long time at that scale. The way it used to be done was basically to SSH in through Jenkins to copy the logs off from each node, I think either in serial or in parallel, but basically Jenkins was the one going out and getting the logs. We've now shifted to using this tool called logexporter, and it's linked in the slides here.
D
You don't have to follow it now, but if there are people who want to look through the slides afterwards, it's a kubernetes project, and basically the nodes are now responsible for pushing their own logs up to GCS. And then all we do at the Jenkins level is watch to make sure that all those logs land, and then the logs are already there in GCS. So what this means at the 5,000-node level is that instead of spending over four hours to collect logs,
D
we now just spend less than 20 minutes, and at the 100-node level it's improved as well, taking us down from ten minutes to two minutes. This is now rolled out across all the scalability jobs, and I think it works so well that it might be worth pushing to all of the testing jobs, period. In the failure mode where a node fails to come up, or doesn't come up completely enough to push its logs to GCS, we can still fall back to SSH into the node to collect what is there for forensic purposes.
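A minimal sketch of the node-side upload approach being described, where each node pushes its own logs to GCS so the CI job only has to wait for the uploads to land. The log paths and bucket are assumptions for illustration, and this is not the actual logexporter implementation:

```python
import glob
import socket
import subprocess

# Assumptions for illustration: which logs to ship and where to put them.
LOG_GLOBS = ["/var/log/kubelet.log", "/var/log/kube-proxy.log", "/var/log/docker.log"]
BUCKET = "gs://example-test-logs/run-1234"  # hypothetical destination

def export_node_logs():
    """Push this node's logs to GCS instead of waiting for a central SSH copy."""
    node = socket.gethostname()
    files = [f for pattern in LOG_GLOBS for f in glob.glob(pattern)]
    if not files:
        return
    # gsutil -m parallelizes the copy; each node writes under its own prefix.
    subprocess.run(["gsutil", "-m", "cp", *files, f"{BUCKET}/{node}/"], check=True)

if __name__ == "__main__":
    export_node_logs()
```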
C
It seems weird, like, what's the biggest... I mean, the time frame seems weird. What's the biggest lag? Isn't everybody writing to the GCS buckets? Because, you know, if you're writing to a single node, who is actually writing to a spinning disk before you tar it up to push it off to some bucket?
C
D
D
I think I personally was more concerned about the correctness angle: okay, if we're leaning on GCS and we're leaning on the nodes to do it themselves, what if the nodes aren't functional, right? If a node didn't get far enough to actually be running the kubelet process and then schedule a container that's responsible for shipping its logs up to GCS, we're not gonna have its logs, and that's really impossible to debug from a forensics perspective. For tests that fail because the cluster failed to come up, I want logs.
A
D
A
C
I'm not actually... well, I will go there, but this isn't even yours. This is, like, a Google-ism, right, for Google testing infra that they developed. You know, I'm gonna go on a little rant here because I've got the time, okay, but there is now something that's very useful, but only useful in the context of the testing infrastructure, along with using Google stuff, right. This is a general-purpose problem that the community has, and we've developed something that lives outside of it.
C
That does something similar, but unifying these tooling and toolchain types of problems so that they are generally useful to other people is a thing that we should be solving, versus, like, "we need it for test-infra or the test toolchain, so we'll build it here," and, you know, now it's so entrenched in test-infra that disentangling it and getting this general composable piece out, to be generally useful to the broader ecosystem, is an untenable thing, right. All right, go on.
D
Well, so, you know, I bring it back to the funded-mandate example. So, Tim, you might be right that this is a common problem that everybody has, but I'm not sure that anybody actually dove in and fixed it in a way that works for everybody. I'd have to go look at the commit logs, right, but I think this was just a tool that was needed to reduce the amount of time it took to solve this problem at scale.
A
D
And the same thing from the AWS perspective: like, if we want AWS tests to be blocking, somebody who can stand up a 5,000-node cluster in AWS should probably be dedicating the resources to make that work on a continuous, reliable basis. Google has dedicated one person, Shawn, to making it work on a 2K and 5K node basis, probably with some assistance from others, right. So I don't know if that would be somebody from SIG AWS or if that would be the people behind kops.
D
C
A
I'm certainly inclined to take another round of, I'll say, evangelism with both the Azure folks and AWS, to see if we can't get some support from them on this. So I'll go talk to Gabe and Brendan and say, hey, what does it really take for you guys to help us do these 5K node tests on Azure, and I can do the same thing with the folks in SIG AWS.
D
Yeah, totally. So, like I said, just to hear your point: if any of this is coming across or being presented in any way where it looks like we would not welcome the effort, then we have to work on that, but I believe any and all are very welcome to contribute. And I don't just mean this as "pull requests welcome"; I mean, seriously, what about the on-ramp is preventing people from getting onto the ramp? How can we help? Yeah.
C
I think we need folks who would help own it for the different providers and validate these pieces, to abstract it away, right. Like, I don't know of an owner other than reaching out to the separate cloud SIGs, but there isn't a representative from the separate clouds here. I think this might be a place for a call-out, like, it'd be nice if those who are contributors for the different providers in general, I guess, you know, would also do similar types of validation or help support this effort. Definitely a call-to-arms type of thing. Yeah.
A
A
A
Well, we're out of time. I think this is mission accomplished for today. I know, Tim, you wanted to have some other discussion, but I don't think we have quorum for it. So I don't really have another discussion or anything else for you. Okay, anyone else? Anyone? Anyone? Okay, well, we'll see everyone on the next call.