From YouTube: SIG - Performance and scale 2021-11-04
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.xo1a3u7axxkr
A
Okay, welcome everyone to SIG Scale. It's November 4th; we're going to paste the link to the documents in the chat shortly. Okay, add yourself as an attendee, please, and feel free to add agenda items while we're talking about them. Okay, let's start with number one, the periodic job threshold. We have some results here; I want to go over them and get some thoughts on what we can do with this. So the change that was made is that now we're...
A
We actually have the audit tool being built in the periodic job. It looks like it runs every day, and here's actually a link to this job that's running, so let's take a look at some of the results together here.
A
Okay, so what we'll want to see is the results at the bottom. So this is the audit tool running, and this is the density test that you added, Marcelo, and these are the results that are coming back. I think, in terms of thresholds, here are the three areas that we care about: the p50, the p95 and the p99. There are some other values here as well.
A
So
here
I'll
look
at
one
more
of
these,
so
I
did
11
one.
Let's
look
at
11
2.,
so
we
get
it
so
another
picture
of
this
roughly
what
it
comes
out
to
25,
29
and
29,
and
so
by
comparison,
we've
got
23,
29
35,
so
you
know
roughly
I'm
fairly
close
on
some
of
these
I
mean,
I
guess
the
p99
is
a
little
bit
again.
It's
kind
of
what
we
expect
is
a
little
has
a
little
more
variation.
A
So I guess the question, what I want to discuss, is sort of how we want to look at this, how we want to take thresholds, because we sort of need to. We have a few data points here and we want to decide.
A
You know, what we want to use and how we want to measure, and come up with a way that we think is reasonable, that we can report on or gate on. So what do people think? The p50 seems like it will have a smaller standard deviation, the p95 will have a little bit larger, and the p99, I think, is going to show a lot more variation. So what do people think, I guess?
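Note: the percentiles being discussed are just order statistics over the per-run latency samples, so with roughly 100 VMIs per run the p99 rests on a single sample, which is why it swings the most. A minimal sketch in Go (not the audit tool's actual code; the sample values are illustrative):

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the nearest-rank p-th percentile of the samples.
func percentile(samples []time.Duration, p float64) time.Duration {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(float64(len(sorted)-1) * p / 100.0)
	return sorted[idx]
}

func main() {
	// Illustrative creation-to-running latencies from one run.
	samples := []time.Duration{23 * time.Second, 25 * time.Second, 29 * time.Second, 35 * time.Second}
	for _, p := range []float64{50, 95, 99} {
		fmt.Printf("p%.0f = %v\n", p, percentile(samples, p))
	}
}
```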
C
I don't think I would measure the p99. I couldn't... I don't know, yeah. Maybe that would be good, yeah; that makes immediate sense to me.
E
Which could indicate a problem. If... why, like, why?
A
Yeah, so that was another thing. These values do vary, as you can see. So here, first, I don't know why they're decimals; like create pods count, right, that seems like it should be a whole number. But there's quite a bit of variation; here's create pods count p50, we're seeing 17.
G
It looks like they were created and the test didn't fail, right? So maybe something else was being created as well. Like, you know, are the pod counts just the number of all pods in the cluster, or only the VMI pods?
G
However, I wouldn't expect, like, failing pod creates, you know, pod creations in it, yeah, so.
F
Could we have an error rate in the audit tool as well, where I could see just how many API failures are occurring, like non-200 return codes?
A
Yeah, so we have some of this. Like, we can tell the 404s; that's already a metric. I think I remember you, Marcelo, showed a bunch of metrics on this, so we can get this, right? So the failure counts would be, like, the 404s and whatever the 400s.
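Note: for reference, one way a tool could count API failures like the 404s mentioned here, assuming the cluster Prometheus is reachable and scrapes the standard apiserver_request_total metric; the address, the resource filter and the 15m window below are assumptions for illustration:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed Prometheus address; in CI this would come from the cluster.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)
	// 4xx responses against VMI endpoints over an illustrative 15m test window.
	query := `sum(increase(apiserver_request_total{code=~"4..",resource="virtualmachineinstances"}[15m]))`
	val, warnings, err := promAPI.Query(context.Background(), query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println("4xx VMI API requests in window:", val)
}
```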
A
Anything that we see as a count there, that's a failure. And then what else? So we have delete pod counts. So can we get... I mean, we get the events here, right? So we could do, like, a transition to a failure state, or a failed-or-succeeded state.
G
Is it a total or an average? Because, like, delete pod counts: only five pods were deleted? Maybe it's not a total; that's why we see a big variation here. Is it like an average? Because if the system is very slow now, then we'd expect less creation per minute, you know, something like that.
F
We're not going to be deleting the pods here; what happens is we delete the VMIs.
F
Yeah, how are the pods, how are the VMIs getting cleaned up in this test, then? Are we running it to... They might not be; that's kind of interesting. We're just letting them all get to a running state and then exiting the test, and they just keep running, and then the cluster gets torn down.
A
Okay, that's another thing to look at. So we have... I don't know, I just see a bunch of different things. This one has a delete pod count; obviously these don't. I don't know what this means, why we would be seeing them here.
B
Okay, yeah, I mean, I don't know when the test stops, but it could be anything, and David said that some pods are failing and getting recreated.
A
Okay, all right, so let me add: we can do this, right? We can count the number of VMIs in a specific state. We can get that, because we'll have, maybe, the phase transition times; we can do a count, yeah. We have a count, I think, in one of the metrics. We can see how many are in failed or succeeded at this point.
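Note: a rough sketch of the "count VMIs per phase" idea, assuming cluster access via a local kubeconfig; it lists VirtualMachineInstances with the client-go dynamic client and tallies status.phase:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	vmiGVR := schema.GroupVersionResource{Group: "kubevirt.io", Version: "v1", Resource: "virtualmachineinstances"}
	// List VMIs across all namespaces and tally their phases.
	list, err := dyn.Resource(vmiGVR).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	phases := map[string]int{}
	for _, item := range list.Items {
		phase, _, _ := unstructured.NestedString(item.Object, "status", "phase")
		phases[phase]++ // e.g. Running, Succeeded, Failed
	}
	fmt.Println(phases)
}
```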
A
So, get endpoint counts. Okay, so these are, like, a third of the API requests. Patch, okay: patch virtual machine instances count, 99. That seems pretty close to one-to-one, right? Because we're supposed to have... this is supposed to be 100. Interesting.
F
I think I did some optimization on one of the controllers to update less. Maybe I didn't get to it; maybe I didn't do that at the node level, or maybe I didn't do it at the cluster level. I need to introduce a similar change, so that we have an expectation that we're not going to try to update a virtual machine until we have seen that the previous update has occurred.
F
Yeah, but I don't know how big of a deal that really was, but it's possible that we're seeing something similar.
A
Yeah, I think one of the big mysteries here is what shows up here and what's leading to some of these numbers being a little bit slower than the others. Maybe it's that we're having some failures with the VMIs, yeah. I think, to me, like, yeah.
A
Probably the best next step here is: let's get some more data on how many failed VMIs there are, how many are in the failed state, and that might give us a little more insight into what's going on here. And then maybe, if we could start... I mean, I think, to me, like, if we're...
A
If we get this much variation... I mean, this was a lot. We've only run this eight or nine times or something like that; I'd expect a little bit, but I wouldn't expect to see this much variation that quickly. So maybe we need to fix that first: let's find out what's going on there, and then we can settle in on some of these numbers. Yeah, and the other thing, so, to get back...
A
To the other point, I think maybe we'll just have this discussion real quick: p50, p95, p99. To me it seems like the p99 is going to give us a lot of variation; maybe we just kind of...
A
Yeah, what I was thinking, I mean, can we do something like: if the p95 is off, we send a message about it and say, hey, you might want to rerun this test or something, but if the p50 is slow, then we fail? Is there a way to do that?
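Note: a sketch of the two-tier gating idea just described; nothing like this exists in the job yet, and the threshold values are placeholders. Warn on a p95 breach, fail on a p50 breach:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// gate warns on a p95 breach and fails the job on a p50 breach.
func gate(p50, p95 time.Duration) {
	const (
		p50Max = 30 * time.Second // placeholder threshold
		p95Max = 45 * time.Second // placeholder threshold
	)
	if p95 > p95Max {
		fmt.Printf("WARNING: p95 %v exceeds %v; consider rerunning the test\n", p95, p95Max)
	}
	if p50 > p50Max {
		fmt.Printf("FAIL: p50 %v exceeds %v\n", p50, p50Max)
		os.Exit(1)
	}
}

func main() {
	gate(25*time.Second, 47*time.Second) // warns, but passes
}
```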
G
You know, set an alert. Because right now, since the cluster is shared, it's expected to have big variations. For now, I would say, for visualization it's very nice to have the test, and then we check and see; and then, when we have the test that is isolated, we can set alerts. Otherwise it will be maybe too much, you know, it will start...
F
Let's say creates are failing because other things are causing rate limiting and other problems. It's hard to say, to choose.
A
But yeah, it's not clear. That was another thing: it would be nice to have the breakdown of, like we talked about, the failed ones. I mean, it would be nice to know, like, okay, this was expected; you know, we have the 100 density test, but yeah, I don't know what's happening up here.
F
Maybe the test isn't validating all the VMIs that come online before exiting, or something. Where can we see this test? Let's say the test is the density one.
G
It does not delete, okay; so it leaves that for the cleaner. That's why we don't see the deletes sometimes.
G
It's fine for now, isn't it? We are not checking.
B
Just for performance, you have to be careful, because there are big variations on the delete times based on kubelet timings. So in end-to-end tests we are normally not waiting until they are really all gone; they get deleted and they will disappear within a few seconds, for instance.
F
We would expect to see the deleted VMIs in the API, yeah.
B
It's really more a matter of what the test is supposed to do right now. If it's just about the creates, we can ignore the deletes right now. But, I mean, yeah, it's still interesting here: if you don't give the audit tool a good timestamp, it may catch some deletes of the cleanup, but also maybe not. So it could really be that the cluster had another issue and five pods got caught in it.
F
Actually, a minute after the test. Because we get the start time, we run the performance test... the functional tests, then the performance test. I can see the after clause waits 30 seconds, but then the script that we're actually running also waits 30 seconds, yeah.
B
Maybe we are packing the pods tighter now than before.
A
How are we, how are you calculating this look-back here? So we're running it a minute later, so we've let the metrics reach Prometheus, and then we're looking...
F
Back at that command that I'm running, perfscale-audit, and maybe we can, yeah.
F
I can make the time window... I can budget a little bit and make it go back a little bit further.
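Note: for illustration, the look-back fix being discussed amounts to padding the metrics query window on both sides so late-arriving samples (and cleanup activity just outside the test) are included or excluded deliberately; the one-minute pad is a placeholder:

```go
package main

import (
	"fmt"
	"time"
)

// padWindow widens [start, end] by pad on both sides so late-arriving
// samples still land inside the queried range.
func padWindow(start, end time.Time, pad time.Duration) (time.Time, time.Time) {
	return start.Add(-pad), end.Add(pad)
}

func main() {
	end := time.Now()
	start := end.Add(-10 * time.Minute) // illustrative test duration
	s, e := padWindow(start, end, time.Minute)
	fmt.Println("query window:", s.Format(time.RFC3339), "->", e.Format(time.RFC3339))
}
```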
J
Let's see if that makes a difference with the look-back.
G
We shouldn't have a problem here, should we, if that's what's happening? It shouldn't, yeah.
F
We need a stable output before we create thresholds, and if we are getting this, that's really curious; it indicates that we need to investigate what's going on. Like, we don't even have a stable enough output to set thresholds, and maybe our controllers are messed up.
A
While you work on that dedicated environment, silo, let's see what we can find. Okay, all right, let's move to the next topic: tracing. So I created a pull request for this.
A
It creates a timestamp, and then every time you want to record some sort of event, you can add a step in there, and it takes an additional timestamp. Then, finally, you can stop it, and it takes a stop timestamp, and then you can log at that point if the tracked duration is longer than a set amount of time. So what I did is: this pull request adds a trace that will output to the logs if the work queue takes longer than a second.
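Note: the pull request itself isn't reproduced in the notes, but the pattern described matches the k8s.io/utils/trace package the Kubernetes API server uses; a minimal sketch:

```go
package main

import (
	"time"

	utiltrace "k8s.io/utils/trace"
)

func processWorkItem(key string) {
	t := utiltrace.New("processing work queue item", utiltrace.Field{Key: "key", Value: key})
	// Only written to the logs when the whole item took longer than 1s;
	// under the threshold the trace is simply discarded.
	defer t.LogIfLong(time.Second)

	t.Step("sync done")          // records a timestamp at this point
	t.Step("update status done") // and another one here
}

func main() {
	processWorkItem("default/testvmi")
}
```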
A
So
all
I
did
here
is,
I
just
had
to
sleep
for
one
second,
so
you
can
actually
kind
of
see
roughly
how
long
the
work
you
takes
for
a
lot
of
these,
it's
fairly
short,
it's
milliseconds,
and
so
I
kind
of
wanted
to
get
some
more
feedback
on
this
and
dave.
I
know
you
want
to
comment
so
I
wanted
to
talk
about
that
as
well.
It's
like
do
you
want
to
talk
about
kind
of
where
your
stance
is
with
this
and
what
kind
of
thoughts
about
like
your
concerns
about
it.
F
Sure. So I had two primary comments. The first one, before I get into the whole performance and log verbosity thing, was about what we're actually measuring: it looked like in the code you were measuring when we were rate limited as well, considering that as part of the same span. Did we address that?
F
All right, so we can move on from that. The thing I'm concerned about is enabling this tracing by default, just always on. A lot of that concern is simply an unknown for me: I don't know the performance impact of tracing, especially at scale, like how much more memory and CPU it might require of our components, and I also don't know how much logging spam this could potentially cause if certain issues arise.
F
So,
if
we're
just
hitting
this
all
the
time
in
an
unexpected
way,
like
a
certain
condition,
gets
hit
and
all
of
a
sudden
we're
just
spamming
the
logs
with
traces
every
time
a
a
key
is
queued,
then
that's
not
great
either.
So
I
was
trying
to
come
up
with
an
idea
of
how
maybe
we
could
enable
tracing
dynamically
as
like
a
debug
tool,
and
the
idea
that
came
to
mind
was
maybe
tying
it
to
log
verbosity.
F
Like, you've made an abstraction around tracing; we could just create a structure, an abstraction around that trace object, and make it a no-op until log verbosity hits a certain level, so we just don't do anything.
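Note: a hedged sketch of that suggestion, not anything that exists in the tree: a wrapper that leaves the trace object nil (a no-op) unless klog verbosity is at or above some level; the level 4 here is arbitrary:

```go
package main

import (
	"time"

	"k8s.io/klog/v2"
	utiltrace "k8s.io/utils/trace"
)

type gatedTrace struct {
	t *utiltrace.Trace // nil when tracing is disabled
}

func newGatedTrace(name string) *gatedTrace {
	if !klog.V(4).Enabled() {
		return &gatedTrace{} // no-op: verbosity too low
	}
	return &gatedTrace{t: utiltrace.New(name)}
}

func (g *gatedTrace) Step(msg string) {
	if g.t != nil {
		g.t.Step(msg)
	}
}

func (g *gatedTrace) LogIfLong(threshold time.Duration) {
	if g.t != nil {
		g.t.LogIfLong(threshold)
	}
}

func main() {
	tr := newGatedTrace("sync")
	defer tr.LogIfLong(time.Second)
	tr.Step("fetched object")
}
```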
G
Avoid tracing too many things for now; if we are interested in something, try to keep it as minimal as possible, also so we don't increase the code too much, you know. Otherwise it starts to be too big and hard to control.
A
Yeah, so I hear the concern. So, to give some more context on where this came from: I was looking at something in the Kubernetes API server, and the API server has this on, and I can see it live; it passes by all the time. And so, in terms of performance, looking at the library, all...
A
It's
doing,
is
it's
it's
it's
taking
this
the
time
stamps,
in
which
case
we
have.
We
have
our
start
time
stamp.
So
we
have
one
and
then
I
think
I
have
two
steps
in
there.
So
two
steps
and
then
the
stop
and
then
so
we
just
do
a
math
operation
to
subtract
the
time
steps
from
that
from
that
time
period
and
if
we
don't
go
over
the
threshold,
which
I
said
at
one
second,
we
just
we
throw
it
out.
So
it's
so
by
throwing
out.
Basically
we
actually
can
log
it
it's
just
we
don't.
A
We
don't
actually
increase
the
the
basically,
the
logging
level
gets
increased
when
you
go
over
the
thresholds.
So
we
so
you
could
log
it,
but
obviously
we
wouldn't
want
to,
because
when
we
when
we
have
expected
behavior
when
we're
under
the
threshold,
so
the
basic
idea
is
that
it
does
that
it
records
those
time
stamps
and
it
just
does
a
math
operation
so
and
we're
kind
of
we're
doing
these
sequentially
right.
These
keys
are
one
at
a
time.
It's
not
it's.
A
So in terms of performance it's not massive, and we have three steps; I mean, I wouldn't expect to see anything at all for performance on this. And then logging-wise, this is maybe where we might differ: from my perspective, if we're running into a problem with performance, like, I have set it at one second; if our work queue is taking longer than one second, I really want to know that. And with this enabled, I would see that pop up. But in every other case, which is probably 99% of the time, I'm not going to see anything at all, because my work queue times are milliseconds.
A
Like, I mean, this is 18 milliseconds, this is 24, and these are tiny numbers that will never make their way out. So, I guess, my perspective on this is that I think we could enable it, and I don't think it would be a drag on performance, and I don't think it would be a drag on logs either. And in the case that it's bad, I think it's a good thing to log.
A
It's almost like an error: if we're slow, we want to know. That's kind of how I'm looking at it, instead of as, you know, spamming the logs. It's really like: okay, we've got a problem here, we need to do something.
F
I don't disagree; I'm still nervous about the unknown. So do we have a precedent? This is used in Kubernetes, is my understanding. Maybe we can gain an understanding of whether it's always on in Kubernetes and what's measured, and let that kind of guide what we feel comfortable with. So we could see how it's utilized in other production environments, and if it's on, then okay, it's proven that at least it doesn't cause any problems.
A
Yeah, so we have... well, it's kind of not really clear to me, to be honest. Because we see that we had the slow work queue times, and those have a ton of variation, and yeah, I mean, we do have some. But it's really not clear, you know, what's slow, and this is actually kind of what I'd want to know; like, I have no idea which of these work...
A
Cues
were
slow
and
if
we
do
hit
one
you
know
like
in
the
case
of
what
we
saw
in
prometheus,
like
we
see
a
few
of
them,
I
mean
it
would
be
interesting
to
know
about
them
and
have
them
in
the
log,
but
I
mean
it
doesn't
happen.
Often
from
from
my
testing.
I
don't
see
this
happen,
but,
and
we
can
even
see
I
mean
you
can
see
it
in
the
metrics
too,
like
we
see
a
few
very
slow
ones,
but
it
doesn't
happen.
A
F
I think both, for sure. The metric is helpful to gain insights, I think, for a high-level view of what's happening, but I think the logs are really helpful to understand: at what point did this occur, why did it occur, what exact key was it, and where was it in, like, the flow.
A
Yeah, and what's handy is this step function: it will tell you the time between steps. And even though this says it changes a lot of files, there's really not a lot of code to this. Basically, I'm just adding into the key parts of this code, which is, I think... I have it around here, that's a little farther down.
A
I
think
it's
around
the
updates
or
the
the
sync
and
the
update
functions,
I
think,
are
the
only
two
update
status,
yeah
update
status
and
then
the
other
one
I
think
is
in
sync
and
that's
it,
and
so
we
would
tell
we'd
be
able
to
know
like
when
what
was
slow
like
we
could
tell
okay
update
status
was,
though,
which
could
be
like
well.
Maybe
it
was
an
api
call
in
here
like
took
longer
than
a
second
or
something
or
whatever
it
was.
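Note: to make the step placement concrete, a self-contained sketch; the function names are illustrative, not the exact KubeVirt controller code:

```go
package main

import (
	"time"

	utiltrace "k8s.io/utils/trace"
)

type controller struct{}

func (c *controller) sync(key string) error         { return nil } // stub
func (c *controller) updateStatus(key string) error { return nil } // stub

// execute shows where the two steps described above would sit; a slow
// "updateStatus done" step would point at, say, a slow API call in between.
func (c *controller) execute(key string) error {
	t := utiltrace.New("execute", utiltrace.Field{Key: "key", Value: key})
	defer t.LogIfLong(time.Second)

	if err := c.sync(key); err != nil {
		return err
	}
	t.Step("sync done")

	if err := c.updateStatus(key); err != nil {
		return err
	}
	t.Step("updateStatus done")
	return nil
}

func main() { _ = (&controller{}).execute("default/testvmi") }
```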
A
You can put anything in: basically, each step will take timestamps, the difference between each of the steps, and so you can get information that you can pass during this time period, like, you know, what happened, if you want to get specific. I've only added two steps here, just around the general functions, but you can get more specific with them.
F
If it ends up being something we want to tie into some sort of dynamic, like verbosity or whatever, I can help. That might sound kind of intimidating, but it's actually really simple. I can show you how to do it, and it's going to be an easy thing to tap into: we have a lot of APIs and ways of getting that information that make it not a burden at all.
A
Sure,
okay,
yeah
sounds
good.
All
right,
we'll
follow
up
with
this
one
and
get
out
then
and
yeah
and
yeah.
So
this
is
just
one.
This
is
just
for
controller.
So
what
I'll,
once
whatever
we
decide
to
get
this
and
I'll
do
each
of
the
the
work
cues
and
whatever
else
after
okay
sounds
good
all
right,
we
don't
have
any
more
topics
david.
Did
you
want
to
talk
about
the
the
virtual
machine
pools,
or
do
you
want
to
save
that
from
a
different
time.
F
I posted it, so there's a PR now for virtual machine pools, and what I did was: I just implemented all of the default behavior. So I looked at all the different tunings we had, and if somebody created a virtual machine pool with our design and set no tunings other than how many replicas they want, that's essentially what I've implemented. Then we can go in and begin adding all the more advanced tunings in the future; I just didn't want to overwhelm this one PR with too much. So that's how I broke it up.
F
I
think
it's
pretty
close
to
what
we
want.
Ryan.
F
You
had
a
good
point
about
not
really
having
a
use
case
for
attach,
so
we
we
want
to
be
able
to
detach
virtual
machines
from
this
pool,
but
attaching
them
is
I
I
can't
think
of
a
reason
why
and
I
think,
might
actually
cause
some
problems
so
I'll
probably
either
disable
that
behavior
or
make
it
something
that
users
wouldn't
do,
but
there
might
be
a
case
where
somebody
orphan
deletes
a
virtual
machine
pool
and
then
recreates
it
where
those
previous
one
virtual
machines
would
get
adopted
again.
A
Yeah, so my perspective on this was that I had some concerns with attach, because I was just kind of looking at deployments and the behavior they have: when you detach and reattach a pod from a deployment, it kicks out another one when you reattach, and that's not the behavior that I think you have here, but...
F
Don't
explicitly
do
that,
but
it
would
be
kicked
out
due
to
the
replica
count,
not
being
it
would
be,
plus
one
because
you
attach
something
so
something's
going
to
get
removed.
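Note: a small sketch of why attaching without bookkeeping evicts something: the reconcile loop only compares the VMs matching the selector against spec.replicas, so an extra attached VM leaves one too many. The selection shown is arbitrary, as in the default behavior described above:

```go
package main

import "fmt"

// reconcile returns the VMs to delete when more VMs match the pool's
// selector than spec.replicas allows; selection here is arbitrary.
func reconcile(replicas int, matching []string) []string {
	if excess := len(matching) - replicas; excess > 0 {
		return matching[:excess]
	}
	return nil
}

func main() {
	// Attaching "vm-9" without bumping replicas leaves one too many.
	fmt.Println(reconcile(3, []string{"vm-0", "vm-1", "vm-2", "vm-9"}))
}
```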
A
Right, well, that's what I was going to say: how are we going to solve that? Because we're in this weird state. What made me think of this is: okay, let's say we did this; we have these stateful objects, you know. Now how do we kick one out, like, you know...
A
What
do
we
do
like
it's
almost
concerning
like
and
there's
also-
and
I
can
also
think
of
a
lot
of
mistakes
could
happen
this
way
and
and
like
we're,
we're
attaching
and
we
could
we
could
like.
Well,
I
don't
know
it
just
seems
risky
like
to
to
want
to
do
this
as
a
use
case.
I
don't
know
like
that.
That's
kind
of
where,
where.
F
It
was,
I
agree
and
that's
a
problem.
For
example,
aws
has
their
auto
scaling
groups,
which
is
similar
to
a
virtual
machine
pool,
even
though
it's
got
the
word
auto
scale
in
it
and
when
they
attach
something
to
an
autoscope
group,
they
actually
increment
the
replica
account
so
they're
they're
doing
bookkeeping
on
the
behalf
of
the
user
and
it's
possible
even
with
detach
they're,
doing
something
similar
where
they
decrement
the
replica
account
which
I'm
not
doing
for
detach.
F
If
you
detach
something,
then
it's
going
to
a
new
one's
going
to
get
spun
up
somewhere,
but
you
know
one
the
virtual
machine
detached
is.
You
know
yours
to
mess
around
with,
so
that's
debatable
as
well,
whether
we
it's
tough
because
we're
trying
to
run
this
line
between
what
the
kubernetes
ecosystem
is
doing,
which
is
what
I've
aligned
with
and
what
the
virtual
machine
ecosystem
is
doing,
which
is
a
little
different
and
less
standardized.
F
I
think
when
in
doubt,
probably
do
the
thing
that
the
kubernetes
ecosystem
is
doing
but
ensure
that
we
allow
the
flexibility
to
achieve
the
kinds
of
patterns
that
would
be
expected
in
the
virtual
machine
world
so,
for
example,
attach
and
detach
where
we
aren't
decrementing
or
incrementing.
The
replica
account
it's
possible.
Somebody
could
pause
their
virtual
machine
pool
attach
something
then
increment.
The
replica
account
themselves
and
everything
would
stay
stable,
but.
A
Yeah, I think maybe that's something... I think it makes sense what we said: maybe we start with the Kubernetes ecosystem's definition of this. To me, detach makes sense: detach, and then we replace per the replica count. And then in the virtualization world it's like we're going to detach almost with the intention of possibly bringing it back, which I think is maybe a little bit different behavior.
A
So
I
mean,
I
think,
and
then
which
could
be
we
could
enable.
I
mean
that's
that
could
be
enabled,
but
I
think
it
would
be
different.
I
think,
would
be
different
than
kind
of
this
different
approach
than
this,
which
is,
I
think,
just
attaches
and
detaches
and
like
we're
going
to
do
something
with
it
and
we're
going
to
just
replace.
F
I
agree
so
for
now
my
take
is
I'm
going
to
allow
vms
to
be
detached.
They
will
get
replaced,
but
you
know
you'll
still
hold
on
to
your
the
virtual
machine
that
you
detached,
and
I
will
not
implement,
attach
we'll
just
I'll
think
through
that
a
little
bit,
it's
possible
that
I
would
allow
adoption
of
previous
virtual
machines
if
something
like
a
virtual
machine
pool
got
deleted
and
recreated
like
orphan
deleted.
A
Okay,
one
of
the
thought
I
had
about
this
because
I
thought
this
was
really
a
neat
idea
to
use
the
label
selector.
One
of
the.
Let
me
see
if
I
can
go
open
this
up.
A
One
of
the
thoughts
I
had
was
so
the
label
selector
is
what
would
control
effectively
detach
so,
in
other
words
like
patch
here
this
permission
will
allow
us
to
detach
something
that
will
give
us
the
the
ability
to
do
this.
Well.
A
Yeah, oh okay, so you'd patch the VM, okay. But the point still stands, just maybe not with this object: if you have permissions to patch the VM object, you'd be able to detach; that's sort of our way in, okay. And so my thought was: okay, I mean, is that the way we'd want to go? Like, should detach be its own API resource? We're talking...
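Note: for illustration, "detach = patch the VM" could look like the following with client-go, removing the label the pool's selector matches; the label key and the VM name here are made up, and with RBAC whoever may patch VMs can therefore detach them:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	vmGVR := schema.GroupVersionResource{Group: "kubevirt.io", Version: "v1", Resource: "virtualmachines"}
	// JSON Patch removing the (hypothetical) pool-selector label; "/" in the
	// label key is escaped as "~1" per the JSON Pointer spec.
	patch := []byte(`[{"op": "remove", "path": "/metadata/labels/pool.kubevirt.io~1name"}]`)
	_, err = dyn.Resource(vmGVR).Namespace("default").
		Patch(context.Background(), "my-vm-3", types.JSONPatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
}
```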
B
Yeah, both are possible; the underlying mechanism would be the same. One of the main reasons for this, and that's why the core components have it, is that you can just do these operations with kubectl on any type of resource. You don't need extra sub-resources or different types of objects; it's very easy.
B
Yeah, remove the label. One question regarding the detach, and you probably thought about this already: would I then, in practice, basically be skipping that index and creating another index?
F
So
what
happens
is
when
we
are
creating
virtual
machines,
I'm
looking
at
all
the
virtual
machines
in
that
namespace
and
if
a
virtual
machine
with
a
certain
name
like
I'm
indexing,
just
incrementing
if
it
exists,
then
I
skip
that
index.
So
the
same
thing.
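Note: a tiny sketch of that naming scheme (the "base-N" name format is assumed): walk indexes from zero and skip any name that already exists in the namespace:

```go
package main

import "fmt"

// nextName walks indexes from zero and skips any name that already exists,
// mirroring the create path described above.
func nextName(base string, existing map[string]bool) string {
	for i := 0; ; i++ {
		name := fmt.Sprintf("%s-%d", base, i)
		if !existing[name] {
			return name
		}
	}
}

func main() {
	existing := map[string]bool{"pool-0": true, "pool-1": true, "pool-3": true}
	fmt.Println(nextName("pool", existing)) // prints "pool-2"
}
```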
F
Let's say you had them at zero through nine, and you want five replicas now. Who knows which ones get picked; right now it's random, that was part of the default behavior. So you might have nine and zero existing in your pool with 3, 4 and 6, or something; there's just no correlation to the index.
F
Okay, it's possible this would make sense for our density test at some point, manipulating pools, yeah, but we'll see. I do want to begin, at some point in the density tests, testing the virtual machine object rather than just the VMI object; we're not there yet, and I don't want to derail what we're doing quite yet. But the idea of including persistent storage in this flow, I think, will be important to us at some point, and that's where the VM and the VM pool might be important for us.
F
Excellent, yeah. So it might even be a different density test entirely, one that has persistent, like, network storage attached to it, and then we begin wanting to know, for example, what happens when we try to start 100 virtual machines in this environment and we need to smart-clone 100 PVCs from the root disk. So we're measuring something outside of just our KubeVirt control plane and trying to understand the impact of storage on all these start times and stuff as well. But we're not there yet; that's kind of a future topic.
A
Cool, all right, we have two minutes left. Do we have any more topics, any final closing thoughts, or any more topics to bring up before we finish?
F
One
thought
I
just
had
when
we
look
at
one
reason:
the
p99
and
even
maybe
the
p95
creation
to
running
might
not
be
super
accurate.
Is
there
going
to
be
an
initial
pull
of
the
container
disk
and
that's
going
to
mean
that
one
virtual
machine
instance
takes
longer
than
all
the
subsequent
ones
on
that
node?
I
wonder
if
we
should
consider
that
somehow.
F
Maybe
roman,
do
you
know?
Well,
we
pre-cache
those
images.
Don't
we
on
keyboard?
Sorry,
I
I
got
distracted
for
a
moment.
That's
a
terrific!
So
for
container
disks
are
we
pre-caching
the
container
disks
on
we
are
on
key
for
ci.
I'm
pretty
sure
we
are
would
make
cluster
yeah
they're.
B
Let me see, with density... But if you use a dedicated cluster which is not using kubevirtci, you will have to do the pre-caching yourself; this only applies to kubevirtci clusters. So, but the dedicated...
F
Really simple: just make sure that before the test runs, if we're not using cluster-sync with one of our standard kubevirtci clusters, we pre-populate that image, pre-pull it, on every node. That's it; the one that you're using for the container disk.
F
Launcher
we'll
we
get
it,
don't
worry
about
that
one,
because
it's
a
sorry,
it's
a
init
container
for
vert
handler.
So
it
has
to
be
there
all
right.