From YouTube: 2023-03-30 Kubernetes SIG Scalability Meeting
Description
Agenda and meeting notes - https://docs.google.com/document/d/1hEpf25qifVWztaeZPFmjNiJvPo-5JX1z0LSvvVY5G2g/edit?usp=sharing
A
Okay, so this is the SIG Scalability meeting, 30th March 2023, and, okay, I see that someone is already starting with some topics to discuss.
B
Sure, yeah, I was just typing. So yeah, I think we need to fill in the annual report for SIG Scalability for, like, 2022.
B
I started doing that, but I didn't do anything creative at this point; I mostly filled in sections that were kind of obvious. So if anyone wants to help contribute, either by commenting or by forking it or whatever, that would be great.
B
Yeah, sure, I'm planning to work on that like early next week or something like that. I definitely won't have time tomorrow or today. So yeah, let's sync offline; like, Shyam, if you want to add anything, or anyone else, if someone wants to add to the annual report that we need to create in the next couple of days, that would be great. I linked the report and the work-in-progress PR in the notes.
A
Okay, do we have any topics?
C
There's a bunch of SIG Testing folks joining the call today, so welcome everyone. I guess this is based on the discussion we had at the last SIG call; essentially, AWS is now kind of ready to invest long term in scale testing, and this is part of that.
C
There's an issue I cut to kick off some of the discussions, and Ben and Justin here on the call have been pretty active with it and helping out with kicking things off, like setting up a test job. I think in this meeting we wanted to discuss some questions we had on how to configure these tests, and where we actually want to get to.
E
I don't know, I can also chime in; you know, K8s Infra will be pretty involved in terms of the account management and keeping an eye on the bills and that sort of thing.
C
Okay, so I guess, status: Justin, do you want to give the status of the job that you've added, and we can discuss where to take it next?
F
Absolutely. Sure. So we also have Ciprian here, by the way, from kOps. The job is very, very basic; we're able to spin up jobs in the existing Prow cluster on AWS. Basically, we created a job that just brings up a cluster and runs the conformance tests with 100 nodes.
F
I know a hundred is a long way from 5000, but it is somewhere to start. That one basically runs without a problem, although there is an asterisk on that, in that we have to use Calico, and that's sort of interesting; currently we haven't tried the VPC CNI. I think it would be great to get that job running in the billing account, at least, that we want it to be running in, and start cranking the hundred higher and higher.
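A minimal sketch of what such a job does, assuming kOps is installed and AWS credentials are configured; the cluster name, state-store bucket, and zone below are placeholders, not the real job's values:

```python
# Hypothetical bring-up of a 100-node Calico cluster with kOps, mirroring
# the described test job: create, validate, run tests, tear down.
import os
import subprocess

ENV = dict(os.environ,
           KOPS_STATE_STORE="s3://example-kops-state-store",  # placeholder bucket
           KOPS_CLUSTER_NAME="scale-test.example.k8s.local")  # placeholder name

def kops(*args: str) -> None:
    subprocess.run(("kops",) + args, check=True, env=ENV)

kops("create", "cluster",
     "--zones", "us-east-2a",
     "--node-count", "100",
     "--networking", "calico",
     "--yes")

# Wait until the cluster reports healthy before running conformance tests.
kops("validate", "cluster", "--wait", "30m")

# ... the conformance/scalability test suite would run here ...

kops("delete", "cluster", "--yes")
```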
C
We should probably do that before we bump up the scale. So yeah, I guess the CNI is one question, it seems like. What do the other kOps jobs on AWS use today, Justin, do you know? Is it Calico, or...?
F
kOps? Just, everything, would be the short answer. We don't actually test everything, but we try to test every combination. The ones which are best supported on AWS are Calico, Cilium, whatever we call kubenet today, and it supports the AWS VPC CNI. Is that right, Ciprian? I can't think of any others that are used on AWS.
D
No, these are pretty much the best-supported CNIs. One other thing that I could think of, before starting to crank things up, is to increase the limits on the account.
F
There's also one more option as well, which is IPv6, which is no CNI, but, right, I don't know, we might want to do that. So that's maybe another option, but I think today we've only tested it with Calico as our CNI; in theory we might not need a CNI at all if we use IPv6 mode.
E
For some reasons we'll want a different set of accounts, because of tracking. So, for example, on GCP it's the same thing: it's not just about increasing the quota. We don't want to increase the quota on all of the accounts, and we know that this is a particularly large bill, so it's much easier to keep track of how much goes to scale specifically by having a different project pool. And then also...
E
If we wind up needing a lot more accounts for other kinds of tests, we don't want to have to increase the quota on all of those. Because it sounds like it'll be the same here: there'll be some manual effort to bump the quota on the accounts.
E
So if we can keep that targeted to just however many accounts we think we need for the Boskos pool for scale testing, that's a much easier problem than doing it for, like, all AWS e2e accounts.
C
Okay, so just to confirm, these accounts are already...?
E
We need to create some; the 100-node one is just running in one of the accounts we've been using for kOps and CAPA CI today.
H
We have full access to the entire organization, where we can create this account and bump the quota, I think; that's not really an issue. I think the main blocker is trying to tie up everything between AWS and GCP, because it's a completely new process for us, like being able to integrate all those accounts into the existing Prow deployment. That's the bigger problem, but creating the account itself and setting up a pool of accounts is not really an issue.
E
I think the other complication is, currently, if we want to go quickly we'll just put them in a pool in the existing CI clusters. But long term it makes more sense to... we're also in the process, in SIG Testing, of setting up an EKS cluster to actually execute test workloads from, and we'll probably want to pivot to executing scale tests on AWS from the cluster that's in AWS, so that we don't have to, like...
E
We can do things like avoiding egress between them to upload the logs, and things like that, at some point down the line. So we'll probably have to pivot the accounts, but I think today we can just continue on with where things are at, make another account pool that's dedicated to this purpose, add it to the automation, and sort that out later.
F
I mean, I'm fine to run tests with the VPC CNI, and I think we should also run them with whatever we choose, Calico, because then when we see an error we get more data, right? We see a regression and we know, oh, it's only with the VPC CNI, or it's only with Calico, or it's with both, right?
F
It points us in a certain direction, and we found this very useful with what we call the grid, where we run permutations of various things, like all the CNIs and OSes and Kubernetes versions and things like that, with kOps. And so we're able to tell, to an extent, where the problem lies by looking at, you know, sort of which tests are failing and which ones aren't, or which ones are regressing and which ones aren't regressing.
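As a toy illustration of the grid idea (the dimension values here are examples, not the actual grid's contents), each point in the cross-product becomes its own job:

```python
# Hypothetical enumeration of a kOps-style test grid: one periodic job per
# combination of CNI, OS image, and Kubernetes version.
import itertools

cnis = ["calico", "cilium", "kubenet", "amazonvpc"]
oses = ["ubuntu-22.04", "amazonlinux-2"]
versions = ["1.26", "1.27"]

for cni, image, version in itertools.product(cnis, oses, versions):
    print(f"e2e-kops-aws-{cni}-{image}-k8s-{version}")
```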
E
So, SIG Testing / K8s Infra hat: I would really like to get us to a place where we're able to say, you can just take this job and switch a flag or two, and it's GCP or Amazon, and we can move between them. That's part of the reason I've been discussing using kOps; I think it's the most mature cluster tool...
E
...that we have available to the project that can target both, and it's already integrated with other test tooling and has pretty good coverage; whereas we have most of our jobs on kube-up today. So if we're going to do the lift to move away from kube-up, moving towards having relatively identical config but switching out the providers puts us in a place where, you know...
E
...if down the line we find out that K8s Infra is blowing the budget downloading things from Amazon, then we can, you know, shift some jobs back over to GCP or whatever. So I don't think we have to strictly do everything identically, but the closer we can get to matching the setup between them, and then using the same tooling and stuff, the more flexible a place that puts us in, where we're not going, oh, we need to turn down scale tests because we're in a budget crunch.
C
Got it, okay. I mean, I see value in both, at least. So yeah, there's value in the tests being the same, so that we can swap one out for the other on demand; there's also value in trying different configurations, so we cover a different surface of issues. Some issues may not be caught by the GCE tests today because of the way things are configured there.
E
Absolutely, that's something else we brought up. I think Dims said something about only using kubeadm, and when the testing leads talked, we were kind of like, we don't necessarily want to be that uniform in how we configure clusters, so that we get better coverage.
B
I think one potential concern I have is that I would like to have one configuration, whatever that will be, running as soon as possible, and then add more, because if we divide our attention across multiple things, it may take us more time to actually have the first one running, which is probably our most immediate goal.
F
That might work in our favor, though, right? If we bring up two of them and we're like, oh look, Calico isn't as scalable as the VPC CNI, it's like, hey, team Calico, do you want to look at this? And then Calico moves a little bit ahead of the VPC CNI, and it's like, hey, team VPC CNI, Calico has overtaken you and you're now in second place, and they're going to push that a little bit harder, right? So that might be a fun dynamic.
E
At this juncture we're a bit limited in what we're running on GCP. Assuming the credit situation is better next year, we should have more room. We started out early this year way over budget, and we've done pretty drastic moves to bring the run rate down, but we still have to get through this year. So, you know, our scale signal is going to be reduced on GCE; in the future that might not, hopefully shouldn't, be the case.
E
We've brought down the run rate a lot and should have room, and scale testing is something that I personally would prioritize. I think it's something that the project can't get with, like, free off-the-shelf resources somewhere else, and it's a great use of our credits. We've only pulled the lever to reduce scale this year because it's just one of the few things that you can actually reach and pull quickly. But I don't think we'll have super great signal trying to compare a matrix of things right now.
E
We'll want to be able to load-balance in the future as the funding situation evolves. My understanding is that a small difference is that, with the GCP credits, there's three million dollars deposited in the account every year, and there are public posts with that commitment. So we have a certain level of: we're getting exactly this much, and we know what we're doing.
E
The AWS credits are a little bit more complex; they're deposited in tranches, and the exact amount may be adjusted depending on usage and things like that. And we're still figuring out things like: how much is it going to cost to run the general CI things when we move more of that over? We don't have as good an idea of what the run rate for other reasonable things to run is going to look like, so I think we want to maintain a fair bit of flexibility on where we run this.
C
All right. So I think we have the next steps with setting up accounts and getting limits increased. And I think, given our decision to kind of have these jobs fungible, do we want to use the same CNI?
B
We have some suites with Calico, but the scalability tests aren't using Calico underneath.
C
Okay, so in that case, is the plan that we move the GCP jobs also onto Calico, or kind of find an equivalent of the IP-alias CNI on AWS? Because that component won't work as-is on AWS, since things are a bit different, VPCs and stuff, so we'll have to use something else.
B
I would personally start with what you are proposing, the VPC CNI. Given also what you said, that it's maintained by AWS and so on, so we know that it should be working at least, it should be easier to set up; and then, once we have that running, create the Calico-related job as a follow-up.
A
Yeah, so actually, for Calico we already have that for GCP. I think one interesting option would probably also be Cilium, but yeah, we don't pay much attention to the Calico one, and I think we also know that at larger scale it will struggle, basically, yeah.
E
So if we can get a state working on AWS that we should also be able to run on GCP, that's probably a pretty close follow-up step. SIG Testing hat again: what I'm hoping to get out of all of this is to move us out of a state where everything is kube-up, and into a state where it's a tool that can target multiple clouds.
E
So while probably the most pressing thing is getting something running on AWS, and it's totally reasonable to just use whatever is convenient, I hope we're kind of fast-following towards a state where we can take this config, replicate it back to GCP, and rotate over to using the same tooling.
C
Okay, yeah, I think that makes sense. Cool. Okay, and the last question I had for you, Justin: there's a bunch of flags that these tests are tweaking today through these bash scripts for kube-up. Is that possible through kOps as well?
F
In general it's possible. If they are flags, it can be relatively straightforward, because they're probably already mapped in the configuration, and if not, we can just add a mapping for them. Environment variables, I think, are a little bit trickier. I think someone brought up that we might need an environment variable, so we can add the support for it.
F
It's particularly tricky for things like kube-apiserver, kube-scheduler, kube-controller-manager, the sort of system components. For things like the VPC CNI, we can easily make it so that you can just override the whole manifest and choose your own manifest, so we'd probably use that route there. But in general the answer is, they're not necessarily all going to be there, but we can easily add them on the kOps side.
B
So I think the bottom line is that we are configuring, in a custom way, a bunch of API server flags in particular, so we will probably need to do that, or plan that mapping, at least. I can send you the link to the things we are changing on the API server for GCP, so I'm assuming that we probably want to make it similar on AWS.
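A rough sketch of what that mapping could look like on the kOps side, assuming the relevant flags are exposed as fields of the cluster spec's kubeAPIServer component config; the field names and values below are illustrative, not taken from the real jobs, and would need to be checked against the kOps API:

```python
# Hypothetical edit of a kOps cluster spec to set kube-apiserver options,
# the rough equivalent of exporting env vars before running kube-up.
import subprocess
import yaml  # PyYAML

CLUSTER = "scale-test.example.k8s.local"  # placeholder name

out = subprocess.run(["kops", "get", "cluster", CLUSTER, "-o", "yaml"],
                     check=True, capture_output=True, text=True).stdout
spec = yaml.safe_load(out)

# Illustrative overrides; whether each field exists must be verified per flag.
spec["spec"].setdefault("kubeAPIServer", {}).update({
    "maxRequestsInflight": 800,
    "maxMutatingRequestsInflight": 400,
})

with open("cluster.yaml", "w") as f:
    yaml.safe_dump(spec, f)

subprocess.run(["kops", "replace", "-f", "cluster.yaml"], check=True)
subprocess.run(["kops", "update", "cluster", CLUSTER, "--yes"], check=True)
```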
B
Good question. I think we don't, but I'm not 100 percent sure.
E
I think we have a few small ones, but I think all of those can be passed through in the kOps config already, and this sort of spelunking is why we want to switch to something like this. Right now, probably the biggest lift for porting any of these jobs is looking at the environment variables that we set for kube-up, figuring out what they actually do, and then converting that to some other tool.
E
So hopefully we can do that once, into the kOps cluster spec, and then that will be portable. I don't think anybody actually knows all of these; I'm one of the current maintainers of the kube-up scripts, if you can even call it that, and they're a nightmare. The environment variables tend to get string-interpolated into bash generating YAML, possibly multiple layers deep, and we're going to have to go look and figure out what they're doing and what flags are actually set.
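The conversion being described might, per variable, boil down to something like the following sketch; the environment-variable name and the spec field it maps to are illustrative assumptions, not a real mapping table:

```python
# Hypothetical kube-up env var -> kOps cluster-spec field translation.
import os

ENV_TO_SPEC = {
    # env var (assumed name)              (component,      field)
    "KUBE_APISERVER_REQUEST_TIMEOUT_SEC": ("kubeAPIServer", "requestTimeout"),
}

overrides: dict = {}
for var, (component, field) in ENV_TO_SPEC.items():
    value = os.environ.get(var)
    if value is not None:
        overrides.setdefault(component, {})[field] = value

print(overrides)  # would then be merged into the kOps cluster spec
```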
F
I don't know if I'm allowed to lobby, but I will plug Ciprian and myself doing a KubeCon talk about the direction we're trying to go in, which is that we're trying to have more manifests plumbed through. The idea being that, ideally, if we come up with the configuration for the AWS VPC CNI in particular, that will be YAML, and it will be reusable whatever tool you choose to use.
F
Obviously that's harder for the API server, where there's a lot more interpolation or mixing of values that has to happen, but maybe we can do patches or something there. That isn't there yet, but that's something that we hope to talk to many people in this SIG about at KubeCon.
E
I think the other thing we're going to have to port is that the scale tests have a custom log dumper that dumps a lot of things in a performant way, and...
E
At some point we'll probably want to see if we can get all the infra tooling happy with dumping that to S3. That's probably one of the interesting upcoming problems for getting all this running smoothly.
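As a minimal sketch of what "dumping that to S3" might amount to: mirroring a local artifacts directory into a bucket with boto3. The bucket name and key layout are placeholders, not the real jobs' configuration:

```python
# Hypothetical upload of dumped logs to S3 using boto3.
import pathlib
import boto3

BUCKET = "example-k8s-scale-logs"  # placeholder bucket
s3 = boto3.client("s3")

def upload_artifacts(local_dir: str, run_id: str) -> None:
    """Copy every file under local_dir to s3://BUCKET/<run_id>/..."""
    root = pathlib.Path(local_dir)
    for path in root.rglob("*"):
        if path.is_file():
            s3.upload_file(str(path), BUCKET, f"{run_id}/{path.relative_to(root)}")

upload_artifacts("_artifacts", "scale-100-node-run-0001")
```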
C
So Perfdash should be possible to run based off of metrics in S3 too, you know, because we actually started using it internally too. Okay, and I don't know, maybe we'd have to make some patches, but I think that should be easy to get working. It may have to be a different Perfdash dashboard, though; I don't know if it will work with both at the same time. But yeah, I think both the callouts are good ones.
E
I don't think we're super worried about jumping over to S3 right away. I mean, the log storage and stuff is not cheap, but it's not really top of the bill, and I'm not overly concerned about the egress between them. But it'll just make sense eventually to actually execute the Amazon jobs completely in Amazon and store the results in Amazon, and just...
E
There are places that we're going to have to decouple, though, or at least couple to kOps or whatever; that is, couple it to the tools, at least a multi-cloud tool, instead of to specifically how kube-up works. And right now the log dump script that we use in scale is different from the other ones. I think the trickiest part is that it SSHes to the nodes and dumps directly with credentials, to keep it manageable, versus the CI pulling everything in and then uploading.
C
Makes sense, okay, okay, yeah, I guess, yes. I think the bigger, the expensive part is with transfers and stuff; SSHing into the control-plane nodes and having them push their logs, for now I think that might be okay.
E
And more that it has to be compatible with whatever cluster tooling you're using. Right now it assumes things like the log paths that kube-up clusters have, and the SSH access that's available to all of the GCE VMs from CI, and that sort of thing. We'll have to pull on that thread a little bit to make sure that the scale log dumper works on the kOps jobs on AWS.
C
Okay, yeah, I think that makes sense. All right, I think we're at time. A lot of interesting topics this time; I think we covered a lot of ground on what we need to do after this meeting. I can take some of this and just update that issue with the next steps we need to take, and I might have to ask for some help with respect to the setup of accounts.
E
I'm going to point you to Arnaud for that, probably, though he may delegate it somewhere else.
H
I would take care of that. That's not really an issue; I'm working on that currently, it's just that I've been busy with other stuff.
C
Awesome, thanks, Arnaud. Once you provision them, if you can let us know and share the account IDs, then we can get some of the limits increased for them.
H
I will. What I'm trying to do is automate account creation and also the Service Quotas requests at the same time, and basically leave that, so we allow for the three days for requests, because I cannot upgrade the support plan to get the faster responses from support. So what I'm doing is: create the account, create the quota increase request, and wait until we get it done by support; if it's too slow I would ping them, so we'll see if we can just do that internally.
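The automation being described could look roughly like this sketch, using the AWS Service Quotas API; the quota code shown is the one commonly used for on-demand standard-instance vCPUs, but it and the target value are assumptions to verify:

```python
# Hypothetical quota-increase request via the Service Quotas API.
import boto3

sq = boto3.client("service-quotas", region_name="us-east-2")

resp = sq.request_service_quota_increase(
    ServiceCode="ec2",
    QuotaCode="L-1216C47A",  # assumed: Running On-Demand Standard instances (vCPUs)
    DesiredValue=10000.0,    # illustrative target for large scale tests
)
print(resp["RequestedQuota"]["Status"])  # stays PENDING until support approves
```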
C
Okay, thank you. Cool, thanks, folks. We should wind up; we're already over time. Marcel, are you able to stop the recording, or...?