From YouTube: SIG - Performance and scale 2021-06-24
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.pbmjpoqv1fc8
A
Okay, all right. Everyone, welcome to SIG Scale. Please send your name to the attendee list. The link to the document is in chat. So today we're going to talk about some test framework ideas.
A
So we've had some progress on a few things. I have this little tracking area right here, where I just added the things we're working on as issues, so we can have an idea. We had one issue, or one PR, that was merged. David also did some good work on adding the VMI phase transition times, which merged, so that's awesome.
A
So now we can start looking at consuming this data, and even look at different ways that we can pull it into CI and start measuring.
A
So that's what I wanted to talk about today, and we can talk about this from two perspectives: performance and scaling. I figured we'd start with performance. So there's already some work that's begun around this. I think Marcelo's here; Marcelo wrote a PR looking to build a job around performance testing, so we can measure against each PR. There's a bunch of things that are in this PR, and so, as I told them...
A
I
went
and
created
a
an
issue
around
this
topic
for
performance
test
framework
that
covers
a
bunch
of
things,
but
so
before
we
get
into
that,
though,
I
kind
of
want
to
back
up
even
a
little
bit
more
and
talk
about
this
idea
in
general
and
get
some
thoughts
and
then
and
then
I
can
always
update
that
issue
afterwards.
So
take
some
notes,
so
at
a
high
level
like
the
goal
we
want
to
accomplish,
is
we
want
to
have
in
every
pr
that's
run.
A
We
want
to
have
a
way
to
measure
with
their
if
we're
above
some
sort
of
performance
threshold
with
this
pr
or
if
we
increase
performance
or
decrease
performance
whatever
it
is.
We
want
to
be
able
to
measure
it,
and
we
want
to
have
this
some
tool
to
measure
it.
We
want
to
run
it
in
every
server
pr
and
we
also
want
to
have
developers
running
against
their
local
codes.
You
want
to
consumable
by
those
two
different
personas.
A
So
with
that
in
mind,
you
know
what
what
are
people's
thoughts
on
like
how
we
can
do
this.
I
just
wrote
a
few
points
here,
but
I
think
this
can
be
expanded
quite
a
bit.
What
are
what
do
folks
think.
B
Well,
I
think
this
takes
a
very
well
the
overview
of
the
pr
data
and
then
the
document
that
I
sent
last
time
in
the
amazon
list
and
the
idea
is
exactly
that,
so
to
be
able
to
track
the
performance
regression
or
improvement
for
prs
and
also
for
release
the
I
also
the
idea
is
to
have
like
three
types
of
size
of
test:
some,
what
I'm
saying
small
scale
that
we
can
run
for
each
pr
it's
for
for
hpr.
B
You
know
daily
what's
happening
also
for
the
prs
and
but
with
a
larger
set
of
tests,
and
then
a
large
scale
test
that
I
actually
I
don't
know
if
we
can
run
so
big
right
now,
but
I
would
say,
like
maybe
you
know
well,
ideally,
is
to
have
like
something
very
big
and
then,
if
we
run
for
each
release,
I
don't
know
what's
what's
big
in
the
biggest
thing
that
we
can
reach
right
now.
B
But
1000
looks
good
now,
but
we
can
maybe
go
even
further
if
it's
needed
and
then
this
large-scale
test.
The
idea
is
to
run
for
each
release
before
you
release
or
for
each
release,
something
like
that
and
and
then
we
can
track
that
I've
been
discussing
with
the
red
hat
folks
about
that
for
a
while,
we
are
getting
access
to
some
resource
to
run
it
for
the
upstream.
B
So
I
actually
got
access
to
some
small
that
we
can
start
to
include
for
the
prs
now
performance
tests.
I
also
want
want
to
make
you
know
there
are
a
couple
of
things
here
for
performance
and
the
way
that
I'm
going
to
configure
the
this
the
infrastructure.
B
It's run inside of a VM, and then it creates VMs inside, yeah. So we don't want to use the KubeVirt CI in that way; we want to have the Kubernetes cluster running directly on the bare metal node.
C
Yeah, yeah, yeah. I would expect that, for an initial phase, we should really not boot VMs fully.
A
So what about... well, if we don't boot the VM, that means that when we reach Running on the VM, the transition time... that means that's when we've handed off to the handler, but it doesn't necessarily mean we have to boot; we just kind of stop at some point. It means that the QEMU process... well, what does it mean when it...
C
...reaches Running? Yeah, it means that we got the report from libvirt that QEMU started the booting process.
E
So that means we're not measuring the guest; we're just measuring the control plane. Maybe said differently: initially we're just measuring the control plane's ability to scale.
A
Yeah,
okay!
Well
so
we
could
say
I
so
I
think
so
that's
important,
and
I
think
I
the
way
I
view
this
is
kind
of
like
stressing
the
control.
Point
of
view
is
more
like
the
scale
side
of
this,
which
I
do
want
to
talk
about,
but
like
for
pure
performance
testing,
though
we
do,
we
want
to
boot
the
vms.
Would
that
make
sense
that
sounds
like
we
would.
B
We
have
some-
I
had
actually
some
discussion
that
with
ramon.
Well,
I
think
both
scenarios
are
interesting.
So
right
now
we
decide
to
go
to
not
boot
the
vms
and
try
to
put
as
much
as
possible
pressure
to
the
control
plane.
Booting
the
vms
can
bring
some
benefit
like
it.
We
can
make
sure
that,
for
example,
the
network,
it's
operational
everything
it's
working,
everything
you
know
gets
exactly
allocated
to
the
vm.
That
should
be,
if
we
don't
boot
that
with
some
some
of
these
things,
we
cannot
check.
B
But
it's
fine
for
now,
because
what
I'm
saying
the
plan
now
is
to
put
pressure
as
much
as
possible
to
the
contour
plane
and
just
to
answer
that
in
the
pr
that
I
implemented
was
running
means
the
vm
got
the
state
running,
and
it
just
mean
that
they,
you
know
start
they
start
to
command,
was
sent
to
the
cooper
at
the
library
and
but
doesn't
mean
that
the
vm
will
actually
put
so.
Okay.
E
What if we use one of those kernel images, like just a really, really bare-bones one? Then it wouldn't necessarily have to crash, and booting would be practically non-existent. I mean, it would boot, but it wouldn't really do anything. It would just exist.
D
We wouldn't get traffic from things like guest agent updates, and all the updates to the status as the OS comes up and later halts. Yeah, but...
F
You said the word crash earlier, so I might have missed...
A
So that sounds like a really lightweight way we could just get a lot of VMs going. Okay, so then, the way I'll characterize this, and correct me if you disagree: I say we start with VMs that won't boot. There are other areas we can look at here if we do want to expand this, because this does limit the scope; like you said, we likely can't do any stuff with the running VM.
A
If
we
wanted
to
do
like
attached
devices,
we
wanted
to
touch
a
network.
We
wanted
to
do
like
time
to
ip
address
or
something
like
that.
We
can't
do
that
and
now
that
could
be
something
we
could
consider
doing
measuring
in
the
future
if
we
wanted
to
so
that
could
be
an
extension.
But
maybe
this
is
like
the
first
step
that
we
can
look
at
most
achievable
goal
in
front
of
us,
so
we
so
we
could
do.
C
Yeah,
I
would
say
it's
the
least
controversial
in
general,
because
you
don't
have
to
care
about
the
operating
system
at
all
and
some
considerations
there
which,
like,
for
instance,
what
happens
with
your
data
if
you
up
after
at
some
point,
update
serious
if
you're,
using
it
or
fedora
version
or
whatever,
so
it's
a
very
reliable
source
to
not
boot
the
van.
That's
what
I
would
say.
Everything
else
depends
a
lot
on
a
lot
of
other
factors.
A
Okay,
okay,
so
it
sounds
like
with
this
idea,
so
we
could
reach
a
lot
of
a
lot
of
vms.
Is
there
like?
I
guess
we
wouldn't
really
know
what
the
limit
is,
so
it
will
eventually
just
reach
something
or
we'll
be
able
to
like.
Oh
actually,
let
me
phrase
this
differently.
So
like
these
vms
they're,
not
bootable,
we're
gonna,
do
it
in
essenvert
and
what's
like
I
don't
know
what
would
be
like
roughly
the
number
of
the
ends
like
we
could
do
like.
B
Well,
so,
if
we
give
like,
we
can
do
some
math,
I
would
say
like
how
many
how
many
resources
we
can
allocate
if
we
get
we
are
going
to
get
like
one
0.1
from
one
cpu.
Then
we
can
have
like
you
know,
10
vms
per
cpu,
something
like
that.
B
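To make that arithmetic concrete: at a 0.1 CPU (100m) request per VMI, a node with 8 allocatable CPUs holds on the order of 80 VMIs, so hundreds are plausible on a small bare-metal cluster. A minimal sketch of such a non-booting VMI follows; the name, memory value, and no-disk trick are illustrative assumptions, not the manifest from Marcelo's PR:

```yaml
# Hypothetical "guestless" VMI for density testing: with no disks attached,
# QEMU starts but nothing boots, yet the VMI still reaches the Running phase
# once virt-handler hands off to libvirt (per the discussion above).
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: density-vmi-001
spec:
  domain:
    devices: {}          # no disks: nothing for the guest to boot
    resources:
      requests:
        cpu: 100m        # the 0.1 CPU from the math above, ~10 VMIs per CPU
        memory: 32Mi     # illustrative; QEMU overhead sets the real floor
```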
A
All right, I'll just say there's another question then. Okay, that sounds pretty reasonable; it sounds like something we can start with. Okay, so we don't boot these VMs. The things we want to do: we want to capture the metrics per PR, performance-centered. So with David's change we have all the phases; we should capture all of them, and then we have some sort of report, and we need to...
C
This is something which we will just find out. I would expect that we start with running the tests three times, with 20 VMs, 50 VMs, and 100 VMs, on the same KubeVirt configuration in the same cluster. Initially it's just a matter of comparing them and seeing; then you can start defining what baseline we reach with these bulks, and where the startup time is moving. This is the base data, and once we have it and can visualize it or compare it, we can start defining baselines.
A
Yeah, okay, that makes sense to me. Okay, and then, do we want to talk about... so this would be our first step. Do we want to talk about this other stuff, like defining types of tests? So we get a baseline metric, and I think we answer this question too: we figure out how many VMs. So I'm going to note this down here.
A
Okay, so the first step will be... I like the idea of: let's do some information gathering here. Let's figure out the answer to this; let's attempt the test, let's define standards, and let's figure out how many VMs we can reasonably do this with.
C
On the numbers above, I don't know what we can run right now; Marcelo, maybe you know how many machines you got, or how much you can get, from that perspective. But I think it's even very reasonable to not start with too big numbers, and we would still get reasonable input: really, twenty, fifty, a hundred, maybe two hundred. It's not like...
B
Okay,
yeah,
I
know
yeah,
I'm
just
saying
for
the
amount
of
vm
that
I'm
thinking
it's
to
start
with
100.
You
know
it's
for
the
the
small
scale,
pr
that
we
are
we
are
thinking
about
and,
of
course
we
can
push
to
see
how
much
we
can
get
with
the
cluster
that
we
we
got,
but
it
shouldn't
be
too
big.
Now
I
mean.
A
Yeah, okay. So, Marcelo, I'm going to assign this to you, since you've already been looking at this. This will be: we want to try to answer these two questions. We want to know the performance baseline and how many VMs we can get to. Okay, great. So then that'll be what we can start with. Then there's the other thing, the other aspect of this; let's take some time.
A
We can even talk about different ways that we can expand this, maybe some different types of tests. Like, I wrote this issue after I saw Marcelo's PR, and I tried to define a few things in here: what are the different types of ways we can generate load, the different types of tests we can do. You talked about density, you know, burst testing; that's one.
A
We
do
stress,
testing
soap,
spike
testing.
These
are
just
some
of
the
ones
I've
read
about.
Does
that
make
sense
to
people
like?
What's
like?
What
do
you
think?
Is
there
could
there
be
more
here?
What
do
folks
think
of
these.
B
Yeah,
so
I
don't
know
if
it's
covered
here,
but
one
of
the
tests
that
actually
so
and
the
plan
that
I
sent
before
I
was
thinking
about
three
different
kind
of
tests,
one
it's.
You
know
this
of
a
shock
task
that
we
create
a
bunch
of
vms
that
it's
the
density
test
or
we
can.
We
can
do
like
different
ways-
stress,
as
I
mentioned
here
and
another
one.
B
It
should
be
like
you
know,
to
measure
the
steady
state
of
the
constant
load
that
we
generate
so
especially
to
measure
you
know
the
scheduling
we
we
can
configure,
for
example,
10
vms
per
second,
and
we
should
define
a
maximum.
You
know
population
in
the
in
the
cluster
and
delete
the
vms
and
keep
the
load
constant
for
for
a
you,
know,
constant
period
and
see
how
the
system,
maybe
is
the
stress
test
here
that
it
wrote
in
it
so
how
the
system
keeps
with
the
constant
load.
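To pin down the parameters of that steady-state test, here is one way the profile could be written out; every key name and value below is hypothetical, just restating what was said:

```yaml
# Hypothetical steady-state load profile: create VMIs at a fixed rate, cap
# the total population, and delete old ones so the churn stays constant.
steadyStateTest:
  creationRate: 10         # VMIs created per second, as mentioned above
  maxPopulation: 1000      # ceiling at which deletions keep the count level
  holdDuration: 30m        # how long the constant load is sustained
  deleteToHoldConstant: true
```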
B
It
might
break
the
system
so
and
especially
if
the
depression
is
very
high
and
another
thing
is,
I
don't
know
if
we
want
to
cover
that,
but
it's
I'm
just
thinking
about
this
test.
Kubernetes
is
also
doing
them.
So
that's
why
I'm
trying
to
include
here
also
just
three
different
tasks
that
I
mentioned
the
other
one.
It's
like
chaos
test,
so
we
just
suddenly
remove
an
old
and
things
should
come
back.
You
know
and
we
need
to
measure.
How
long
does
it
take?
Should
the
system
recover.
A
Yeah,
do
we
well,
I
don't
know
if
we
want
to
do
that
here,
though,
like
I
understand
the
need
for
that
definitely
like
we
could
kill
some
of
the
control
plane
or
something
and
see
what
happens,
but
I
don't
know
I
mean
doing
that,
while
we're
measuring
performance,
I
don't
know
if
it's
going
to
get
us
a
lot
of
data
that,
like
that,
like
we
could
get
a
lot
of
variation
in
the
data
just
based
on
things.
Just
all
the
things
happening
in
the
cluster.
A
Yeah,
so
the
well,
the
other
thing,
so
I
I
was
just
thinking
of
this,
so
I
I
have
another
thing
here:
we
have,
we
have
generate
types
of
load.
This
is
another
question.
How
are
we
going
to
generate
load?
I
I
just
lumped
it
in
here
as
part
of
this
tool,
but
it's
an
open
question.
I
know
david
you've
talked
about
this
like
previously
like
what
do
people
think
like?
How
can
we
generate
load?
Is
that
that's
another
thing
we
need
to
figure
out.
I
think.
E
Everything's
on
the
table
there,
I
think
that
the
ci
or
the
functional
tests
that
we're
generating
load
that
marcelo
is
already
starting
on
that.
That
makes
sense.
It's
not
very
configurable
necessarily,
but
I
think,
as
a
standard
way
of
just
repeating
the
same
test
over
and
over
again
yeah.
That
is
probably
fine.
There's
some
other
tools.
I
looked
at
like
cube
burner,
which
it
needs
a
little
like.
E
I
need
to
submit
a
little
patch
just
to
make
it
wait
on
virtual
machines
to
until
they're
ready,
but
it
allows
you
to
create
a
repeatable
config,
so
you
would
define
a
config
with
some
vm
templates
kind
of
and
it
would
start
having
ever
made
iterations
of
that
exact
same
virtual
machine.
E
You
want
wait
until
they
all
come
online
then
go
to
the
next
iteration
start,
like
you
know,
100
more
and
like
you
can
represent
things
like
that,
and
even
like
the
deletion
of
them
afterwards,
so
you
can
in
one
config
to
define
how
you
want
to
kind
of
stepwise,
add
load.
So
we
could
use
a
tool
like
that
and
I'm
sure
there's
others
as
well.
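As a reference for what such a config looks like, here is a rough kube-burner-style job; the exact schema should be checked against the kube-burner documentation, the template path is an assumption, and note the patch mentioned above was still needed for it to wait on VirtualMachineInstances rather than pods:

```yaml
# Sketch of a kube-burner job: each iteration stamps out 100 copies of the
# same VMI template and waits before the next iteration adds 100 more,
# giving the stepwise load described above.
jobs:
  - name: vmi-density
    jobIterations: 5            # five steps of 100 VMIs each
    qps: 20                     # API request rate while creating objects
    burst: 20
    namespace: vmi-density
    waitWhenFinished: true      # wait for readiness before the next step
    objects:
      - objectTemplate: templates/vmi.yaml   # assumed path to a VMI template
        replicas: 100
```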
A
I guess we can maybe leave this to some investigation. So, okay: what would be the easiest to get started? Marcelo, it seems like you already have something, so maybe we can just start with that and see how it goes; and then, as we maybe start to look at expanding it because we have other use cases, we can look at kube-burner or other tools if we need something more configurable.
B
Yeah,
so
just
just
to
give
like
very,
very
high
level
background
why
I
start
to
write
it
in
as
a
functional
test,
because,
first
of
all,
we
want
to
add
you
know
in
the
convert
ci
the
the
jobs
running
so
that
the
load
generator
should
be
well.
B
It's
not
not
necessary
should
be
in
cooper
repository,
but
that's
why
we
I
discussed
with
roman
and
he
suggested
to
to
keep
there
and
it
was
I.
I
thought
it
was
more
natural
to
add
it
as
the
functional
test.
As
everything
was
in
this
folder,
you
know
convert
test
and,
and
also
it
can
be-
you
know,
use
it
like
that
and
I've.
I
I
received
some
comments
in
the
pr
I
think
was
david.
I
don't
remember
now
so
yeah
that.
B
You, yeah. So he suggested to actually separate the part where I'm collecting the metrics and make it like a framework; so, as Ryan suggested also, you know, to have this performance framework and actually create, like, a tool. I saw kube-burner is doing something very similar, isn't it? But the tool can be something more or less...
B
...like what kube-burner is doing for collecting the results, and then anything can generate the load, like we were discussing: the functional tests, or any scripts, or even kube-burner just generating the load. But I think what would help, to keep on track now, is to have, you know, these functional tests to create the VMs, and then we can just verify some thresholds in the test, and the test will actually fail. It's this kind of structure that helps, you know, to have the per-PR performance tests in the KubeVirt CI.
B
So that's why it's good also to have it as a functional test. Well...
A
Is there a way to get the numbers without... because I do like the... to me, like I said in the comment, one of the goals is we want to have a framework, and we want it to be usable in CI or by users. That's why I suggested we could discuss that specifically in its own PR, and then have... so that we can kind of separate the idea of generating load out, because it can be anything, like we've said here. But is there a way, like, you can...
A
We
could
take
what
you
have
and
we
can
generate
that
base
like
answer
these
questions
like
without
without
merging
it
like,
so
we
can
just
get
the
data,
and
then
we
can
look
at
the
the
different
components
of
it.
Yeah
comment
on
them.
B
We
can
do
that
right,
so
without
merging,
so
I
actually
the
you
know
the
functional
test
that
I
created.
I
don't
know
if
it's
the
best
practice
that,
but
anyway,
I'm
actually
reading
a
configuration
file
and
in
the
configuration
file,
it's
possible
to
define
the
number
of
vms
and
the
functional
test
will
actually
be
dynamically
configured.
So,
and
you
know
it's
not
hard
to
call
that
that
the
the
test
configuration
itself
is
not
hardcoded.
It's
in
the
configuration
of
ml,
so.
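The configuration file itself wasn't shown in the meeting, but conceptually it has a shape like the following; all field names here are invented for illustration and will differ from the actual file in the PR:

```yaml
# Hypothetical density-test configuration: the functional test reads this
# and sizes itself dynamically instead of hardcoding the VMI count.
density:
  vmiCount: 100           # number of VMIs to create in the burst
  cpuRequest: 100m
  memoryRequest: 32Mi
  timeoutSeconds: 600     # how long to wait for all VMIs to reach Running
```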
C
Yeah, mostly about... I mean, I think we kind of agreed now on one first basic kind of test, which is really just putting up a very, very small VM, as small as possible, creating them in bulk, and collecting the metrics for it. And I guess just a simple end-to-end test would be sufficient for that, without any framework or anything; we collect the Prometheus metrics. We can...
C
...in the CI environment, so that's where the data would be collected from.
B
So the PR is doing that right now... I don't know, yeah.
E
It's just that it's big, that's all. I mean, what are we talking about as far as the data collection? Are we talking about manually collecting it ourselves, or are we talking about the data collection that is currently in the PR? I just want to make sure what it is...
E
...we're talking about. So right now the PR is creating a report of some sort, or that's what we're talking about; I'm not sure how much has been implemented there. Are we talking about just the density tests, where we would be looking at the results ourselves, so not having the functional test actually prepare a report for us? Is that accurate, or are you wanting the functional test to prepare a report as well, Roman?
E
Well, I guess I would be in favor of getting a density test in, initially a really small density test, just the bare minimum that we need to begin to start thinking about this stuff, immediately, and let that be the thing that kind of starts the ball moving here; and then for us to work on metrics collecting and things like that, and generating a report, independently of that.
B
Okay,
I
think
I
got
it,
I
can,
you
know,
simplify
and
remove
the
configuration
and
the
report
part
from
the
pr,
and
it
will
be.
B
Another thing is the resource usage, for the memory and CPU. For example, something that I'm analyzing: I give the VM, the VMI, a 0.1 CPU request, okay, unlimited, and then I get the resource usage of the pod, which of course has many containers running inside, and in my very simple test it was using double the CPU that was allocated for the VMI. So something else, virt-handler or maybe something else that's there in the pod, is consuming...
B
You
know
some
overhead
of
cpu
and
this
kind
of
things
that
I
thought
was
important.
That's
why
I'm
collecting
this
and
and
then
do
I
used
to
collect
that
to
test.
Or
do
you
guys
think
that
this
pr
should
have
only
the
latency
and
not
the
resource.
E
I
think
that
the
pr
should
just
have
the
density
test
and
not
the
anything
with
metrics
collecting
yeah.
I
agree
to
that
and
I
think
that
when
we
look
at
the
metrics
collecting
part
like
if
we
look
at
external
tool,
or
whatever
like
I
mentioned,
maybe
in
the
tools
directory
we'd
start
with
just
a
single
something
really
really
simple
like
just
the
ver.
E
Maybe
we
just
start
with
transition
times
to
begin
with
and
create
a
report
that
just
shows
that
and
then
we
keep
expanding
that
and
introduce
memory
and
cpu,
perhaps
or
other
things
we're
interested.
I'm
just
saying,
let's
start
small
and
add
on
as
we
start
getting
more
data
and.
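A report that starts with transition times only could be as small as this sketch; the phase names are the real VMI phases (Pending, Scheduling, Scheduled, Running), while the key names and numbers are made up for illustration:

```yaml
# Hypothetical first-cut report: per-phase transition latencies only, with
# memory/CPU and API-call counts to be added later as discussed.
report:
  vmiCount: 100
  transitionSeconds:
    pendingToScheduling:   { p50: 0.4, p95: 1.2 }
    schedulingToScheduled: { p50: 2.8, p95: 6.5 }
    scheduledToRunning:    { p50: 3.9, p95: 9.1 }
```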
D
Right,
I'd
add
to
that
number
of
api
calls
made
the
vmi
or
something
like
that.
A
Yeah
yeah,
I
agree,
there's
like
a
bunch
of
them,
so
yeah
I
covered
this
gavin
in,
like
the
like.
I
have
I've
listed
three
here
and
like
like.
One
of
the
comments
like
I
mentioned,
is
that
basically
everything
that
in
this
pr
could
be
a
threshold
test,
every
single
thing
here
so
yeah
like,
but
but
like
like
we're,
saying
like
these,
are
we'll
get
there
so
I'd
like
to.
I
like
the
idea
with
starting.
B
So, I think, for the KubeVirt CI, we don't need to have, like, too many threshold metrics, just some key ones.
B
I
I
don't
know
if
you
guys
saw
I
I
started
some
time
ago
to
do
like
just
a
document,
for
you
know
defining
this.
What
should
be
our
slos
for
the
convert,
so
you
know,
for
example,
vm
creation
time
it's
one
of
the
metrics
that
we
should
pay
attention
and
I
don't
think
we
should
have
dozen
of
them.
It's
too
much
things
you
know
and
get
complicated
to
to
analyze
and
see,
even
though
some
of
them
represent
the
same
thing.
So.
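Following that point about keeping only a few key thresholds, the CI check could consume a short list along these lines; the metric names and limits are placeholders, not values from Marcelo's SLO document:

```yaml
# Hypothetical threshold file: the functional test fails the PR when a
# measured value exceeds its limit.
thresholds:
  - metric: vmiCreationToRunningSeconds   # the "VM creation time" SLO
    p95LimitSeconds: 45
  - metric: vmiDeletionSeconds
    p95LimitSeconds: 30
```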
A
Yeah
marcelo,
we
have
so
I'd,
say
marcel.
These
are
the
three
things
that
that
have
for
action
items
that
you
can
take
here
so
like
we
do
that.
So
we
look
for
the
baseline,
we're
going
to
figure
out
how
many
vms
we
can
handle
in
ci,
and
then
we
take
your
pr.
We
break
it
down
to
a
simple
density
test
and
then
we
we
then
come
back
and
we
look
at
separating
this
out
into
a
framework
which
could
then
include
all
of
these
things.
So
that'll
be
part
of
the
discussion.
A
Talk
about
it
when
we
get
there,
I
think,
like
the,
I
think,
when
I
think
that's
something
that'll
probably
be
good.
Just
got
a
good
discussion
topic,
maybe
for
next
meeting.
If
we
after
we
get
this
data,
I
think
that's
something.
Maybe
after
like
two
weeks
go
by,
we
get.
We
got
that
information,
and
then
we
start
seeing
okay,
here's
how
we
can
expand
on
this.
Then
I
think
it'll
be
a
little
clearer
what
we
should
do,
so
we
can
take
that
for
a
next
step.
For
next
time,.
E
We
have
a
directory
for
just
kind
of
one-off
tools,
so
we
can
start
if
it's
just
for
convenience.
We
are
going
to
build
system
and
everything
in
cube.
We
can
start
in
the
key
vert
repo
and
just
as
the
path
of
least
resistance,
and
if
we
want
to
separate
that
out
of
the
keyboard
repo
later
on,
I
don't.
I
don't
see
that
being
a
problem.
A
Okay, I think this sounds pretty good. I think we have a plan to go with this performance test framework and the density tests and get things kicked off, so that makes sense to me. Okay, so I'll do some of the tracking here; I can just kind of use issues and tag things. Oh, but this was another thing: Marcelo, I saw you labeled the sig-scale. Do we...?
G
It would, yeah. I think it's a Prow feature, but I'm not sure; I only know that... the only question is when there's...
A
How do I do it? Like, /sig scale?
A
Okay, all right, something we can look at next time. Okay, great! So then, our second point here is scale testing. Kind of looking to answer questions like: how can we get to thousands of VMs to stress the control plane? To test in every PR... maybe not every PR, but something we can set as our goal.
A
Daily,
whatever,
if
it's
something
for
release
whatever
it
is,
but
we
want,
we
want
a
ton
we
want.
We
want,
I
think,
north
of
a
thousand
to
really
cause
stress
so
kind
of
open
floor
on
this
like
like
so
you
already
mentioned
vms,
that
don't
boot.
That
sounds
like
that.
Could
we
don't
know,
but
it
sounds
like
it
could
get
us
somewhere
that
that's
to
me
sounds
like
an
option.
This
is
there's
a
there's.
A
Another
idea
that
I
thought
of,
because
I
mentioned
q
mark
originally
when
we
started
this
sig
scale,
and
I
talked
to
david
about
this
a
while
ago
and
along
a
very
similar
line.
This
was
something
I
had
in
mind
that
we
could
do.
I
wrote
a
very
like
one
pagey
kind
of
design
document
around
this.
That's
linked
there
and
it's
pretty
simple.
The
idea.
A
All
this
is
is
or
the
concept
behind
this
is
that,
because
every
resource
in
kubernetes
is
a
or
can
be
considered
an
api
extension,
what
we
could
do
is
we
could
look
at
essentially
faking
our
components
so
that
what
we
could
do
is
we
can
lie
about
the
idea
of
a
vm.
We
basically
remove
the
the
compute
aspect
out
of
it,
so
we
have
no
pods.
A
We
have
no
vms,
we
just
pass
around
gamble
everywhere
and
we
just
mock
the
whole
thing
and
so
that
we
have
no
run
time.
We
don't
launch
any
pods.
We
could
do
this
with
a
real
api
server
that
controller
runtime
offers
and
we
just
take
tons
of
ammo.
We
just
throw
it
at
the
api
and
the
controller
and
just
see
what
happens-
and
this
is
one
idea
I
thought
could
be
like
a
low,
a
very
like
something:
that's
achievable
that
doesn't
require
a
lot
of
resources
to
do.
C
I
guess
you
can
remove
quite
some
load
by
implementing
invert
handle
or
something
like
a
more
client,
a
grpc
client
which
sits
on
the
underside
of
her
tender,
which
does
basically
nothing
except
reporting.
The
game
running
at.
E
That's interesting. So it's also possible that we could create a code path, or a way, where we would ignore the pod creation and just immediately hand off to virt-handler, and then pretend like we're doing all the commands. So we can use a real cluster and just tell the KubeVirt control plane to fake the pod creation: don't actually do it, pretend like it happened, and also, I guess, fake the gRPC calls.
A
Yeah, the idea would be that, when I look at what's happening at this level right here in the middle, we're basically moving YAML around, right? And so the idea is that it doesn't matter, you know, if these things actually exist; as long as there's YAML, these things are stressed out. And so, if we can create as much YAML as possible... you know, what's the way we create as much YAML as possible?
A
Yeah, I mean, I agree, we could run into so many issues, for instance... I mean, lots of different things. One thing I was thinking of...
A
Is
that
that
recent
issue
that
we
had
and
then
I
pushed
the
mainland
with
the
list,
calls
from
vert
handler
like
if
we
had
if
we
had
if
we
were
fake,
like
we
had
thousands
of
word
handlers
and
we
had
all
that
yaml
floating
around,
so
we
had
prometheus
enabled
we
would
have.
We
would
have
seen
that
that
massive
spike
in
latency
would
have
exploded
in
our
faces,
and
so
that's
like
where
we
want
we'll
see
that
stuff,
but
we
won't
get
like
you
know
like.
A
If
we're
not
running
these
resources,
we're
not
actually
running,
we
don't
have
the
compute
resources
running.
You
know
we
don't
always
see
like
okay.
What's
the
interaction
with
when
you
have
one
handler
and
there's
thousands
and
thousands
of
launchers
or
something
or
I
don't
know
what.
However,
many
you
can
fit
in
a
node
and
things
like
that
and
how
it
holds
up
with
the
the
larger
cluster
so
yeah
we
we
can
get
something
but
yeah
we
can't.
I
agree,
we
can't.
We
can't
get
everything.
E
And
one
of
those
list
regressions
that
we
had
was
actually
caused
by
it,
wasn't
caused
by
prometheus.
It
was
in
the
prometheus
scraping
of
our
components,
so
we'd
have
to
like
that's
something
that
wouldn't
necessarily
get
represented
in
caper,
either
prometheus
hitting
our
scrape
in
points
over
and
over.
Maybe
we
can
make
it
do
that,
but.
A
Yeah,
I
know
I
I'm
using
that
example,
because
it
happened
to
involve
handler
and,
like
I
said,
if
you
enable,
if
we
had
enabled
it
or
something,
then
we
would
have
seen
it
but
like
that
idea
of
like
of
doing
lots
of
of
lists
or
something
could
be
hidden
in
somewhere
in
here
like
say,
we're
doing
a
lot
of
different
api
calls
and
we
just
haven't
reached
the
scale,
because
we
don't
have
the
physical
capacity
to
do
it.
A
We
could
find
some
things,
so
I
guess
like
the
way
to
characterize
is:
if
we,
if
we
were
to
say
like
create
5000
fake
vms,
it
doesn't
necessarily
mean
we
can
guarantee
a
scale
of
5
000.,
but
we
have
at
least
some
confidence
that
the
yaml
will
hold
up
in
our
components.
We'll
have.
We
won't
have
massive
amounts
of
latency.
In
some
cases
we
will
have
found
some
paths
where
it'll
be
functional.
B
It relies on the libvirt store, like, at runtime. So then, I think, instead of faking, you know, the controller and virt-handler and virt-launcher, maybe, if we need to go in that direction, okay, if you want, just fake libvirt itself creating the VM, and not exclude, you know, some key components that it relies on.
A
It sounds like the no-boot option is the easiest one to start with, so we should start with that; and if we hit some limits, this is something we can consider as just a way that we can get tons of YAML and see how our control plane handles it. So that is an additional option, something we can keep in mind.
D
Yeah,
if
I
think
about
our
setup,
I
think
that
that
setup
is
quite
likely
to
show
up
pain
points
in
things
like
the
mutators
we
have,
and
so
on,
some
of
which
look
potentially
single
threaded
in
places
and
so
on,
and
I
think
it'll
be
valuable
for
that.
You
know
even
beyond
the
core
components,
all
the
additional
stuff
where
we're
heading
and
there's
likely
to
be
added
in
a
production
environment.
I
think
we'll
hire
problems
on
those
quite
quickly.
A
Yeah,
the
other
thing
I
was
thinking
like
we
don't
like.
I
wonder
how
many
like
fan
did
that
work.
They
posted
on
my
list.
We
did
like
the
reconcile
changes
like
you
wonder,
like
you
know
how
many
different
requests
we're
making,
how
many
api
requests
are
making
kubernetes,
and
that
would
be
interesting
to
me
to
see
like
when
we
really
explode
the
amount
of
launchers
and
our
handlers
and
when
we
have
tons
of
vmis
what
what
ends
up
happening.
A
You
know
like
what
like,
how
costly
are
these
requests,
so
things
like
that,
like
we
can
get,
we
get
some
numbers
roughly
like
that.
That
stuff
would
still
hold
like
the
number
of
api
requests
like
we
could
find
that
this
way,
there's
something
we
could
know
that
that's
a
problem,
so
there's
some
there's
a
bunch
of
things
that
we
can.
I
think
in
here
that
we
can
learn
but
yeah.
So
we'll
start
with.
Let's
take
a
circle
back,
though
I
think
I
think
it
makes
sense.
We
start
with
this.
B
I have one comment. So in Red Hat we have a meeting, you know, bi-weekly; it's also, like, a meeting for scaling and performance, but it's more for OpenShift anyway. So I normally join this meeting, and I think some guys here also do that.
B
I
don't
know
if
it's
possible,
so
it's
bi-weekly.
So
if
it's
maybe
yeah
and
normally
it's
happens-
that
you
know
the
same
day
as
this
meeting
is
happening.
I
don't
know
if
it's
possible.
If,
for
you
guys
for
this
meeting
here
because
the
other
meeting,
we
cannot
change
it's
a
lot
of
of
people
that
doesn't
want
to
change.
B
But
if
it's
possible
here
just
change
the
you
know
which
which
meeting
at
which
week
this
meeting
it's
happening.
B
A
Well, what do people... I mean, we've had some growing, we've had a lot of things, and more things are picking up. Right now we're twice monthly. I don't know, do people think there'd be more or less or the same value if we were to have this weekly? We could do that, and then we could get...
A
What I'm saying is that a possible solution would be that we could have it... well, maybe I took that the wrong way, but if we were to go to weekly on Thursday, what I'm saying is, it wouldn't conflict every other week, right? Yeah, so that's what I was saying; I figured I'd just throw that out there.
A
If,
if
people
find
this
meeting
valuable
and
we've
been
having
a
lot
of
content
in
it,
we
could
look
to
go
to
weekly
and
then
you
know,
folks
that
have
that
conflict.
You
know
you
can
take
the
internal
meaning
and
then
join
on
the
other
aspect.
We
can
have
it
weekly
as
long
as
if
people
find
it
valuable,
then
I,
if
it's
you
know
if
that's
worth
the
time
to
have
a
weekly,
and
I
think
it
makes
sense
too,
but
I
don't
know
what
do
people
think
is
this?
A
Do
you
think
we
have
enough
content
with
the
things
I
mean?
We
seem
to
be
picking
a
lot
of
things
up,
so
maybe
we
could
go
to
weekly.
A
Okay, all right, I guess then... so why don't we go to weekly, then? And then it kind of makes it easier anyway for scheduling; no one has to remember whether it's a week and a half or whatever to the next one. So we'll just do weekly on Thursday, and yeah, I think it fits better; then it'll fit the schedule so you don't have that conflict, okay?
A
All
right
I'll
do
the
I'll
do
I'll,
handle
logistics
with
with
folks
and
we'll
get
that
sorted,
so
we'll
do
weekly.
So
the
next
meeting
will
be
next
thursday.
Okay.