KubeVirt SIG Performance and Scale, 11 Nov 2021

From YouTube: SIG - Performance and scale 2021-11-11

Description

Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.yg3v8z8nkdcg

A

All right, welcome to SIG Scale, November 11th. Please add yourselves to the attendee list. Okay, we probably have a short agenda today. The first thing is just an announcement: I'm going to be away next Thursday. I don't know if you want to do what happened last time. David, I can leave it up to you whether you want to host a meeting if we have agenda items. What do you think?

B

I think we could skip it. Okay, let's see, yeah, we probably won't meet again until December.

A

Yeah, let's see. Right, because of Thanksgiving, yeah.

B

I don't think that's a problem unless we have something we need to follow up on. I think we're doing all right.

A

Yeah, I mean, the only thing is the thresholds, the periodic job results. We can decide based on what we go through here. So I got the...

A

Where is it, this pull request. So this pull request merged from last time. I wanted to add a count of the number of VMIs in each phase, which you can see here. This merged this morning, but I'm not sure it's been run yet. Let me just check. Looks like there's a new job. Let's see.

B

What's the goal of that? Because I thought the functional test would only complete once everything reached the running phase.

A

Yeah, so what was not clear to me, and what I want to get some clarity on, is: what about VMs that went to failed? Because last week we saw some irregularities with some of these numbers, not only the performance numbers but also some of the API calls. There was just a lot of variation, so I wanted to see if it had anything to do with the phases that the VMIs were in.

A

You know, because maybe if they're all running, then okay, that kind of rules that out. But to me it just seemed like there was something weird going on, so I wanted to see if that affected this at all.

B

I think we won't see any VMIs, is the problem.

B

Because this is running after the test tears down, and our exit condition tears down that namespace and everything for the functional test. I mean, I could be wrong; it doesn't hurt to add it. I'm just not sure it's going to give you what you're hoping to see. I think, more likely, we need to have the functional test fail if VMIs unexpectedly crash or don't reach a running phase.
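
The two ideas in this exchange, counting VMIs per phase and failing the test when any VMI is not running, can be sketched in Go roughly as follows. This is a minimal illustration only: the `Phase` type, its constants, and both function names are hypothetical stand-ins, not the KubeVirt API or the code in the merged pull request.

```go
package main

import "fmt"

// Phase is a hypothetical stand-in for a VMI phase value.
type Phase string

const (
	Running Phase = "Running"
	Failed  Phase = "Failed"
	Pending Phase = "Pending"
)

// countByPhase tallies how many VMIs are in each phase.
func countByPhase(phases []Phase) map[Phase]int {
	counts := make(map[Phase]int)
	for _, p := range phases {
		counts[p]++
	}
	return counts
}

// checkAllRunning returns an error if any VMI is not in the Running phase,
// which a functional test could use as its failure condition.
func checkAllRunning(phases []Phase) error {
	counts := countByPhase(phases)
	if notRunning := len(phases) - counts[Running]; notRunning > 0 {
		return fmt.Errorf("%d of %d VMIs not running: %v", notRunning, len(phases), counts)
	}
	return nil
}

func main() {
	phases := []Phase{Running, Running, Failed}
	fmt.Println(countByPhase(phases))
	fmt.Println(checkAllRunning(phases))
}
```

As noted in the discussion, a check like this is only useful if it runs before the test tears down its namespace; run afterwards, the list of phases would simply be empty.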

A

Yeah, okay, I see, because we're only going to get the moment in time that this audit is run. Okay, all right, yeah. We need to make a change to that. Okay.

B

That's fine, no.

A

It's okay. I still think this is valuable regardless; it's good to know anyway, so I don't think it's completely...

B

It doesn't hurt, it just gets information that may or may not be present.

A

Yeah, okay, then I'm going to go through the test again and make some changes, because...

A

I wish there was a way I could... So Marcelo, I think, ran this locally when he did this. Maybe that's what I need to do. I just need a way to run this locally, to see a little bit more of what happens.

B

You can run.

B

locally, using just make cluster-up and make cluster-sync, and then running that functional test. I would just suggest going into the functional test; I'm sure there's some number or something for how many VMIs get created. I think it's probably 100 or whatever. Just make it like five and you'll be able to

B

get a good idea of how it would execute on your laptop, with similar results, just scaled down.

A

Okay, that's fine. I think maybe that needs to be the next step here, because I think you're right: if this is run well after, then we're not going to get the count that I'm looking for. Okay, I think that kind of settles that. So I guess really...

A

this is a matter of whether I can make any progress on that before next Thursday. Then we might at least have some more numbers, or at least something we can learn from this and the results. Otherwise, on this action item, we're just going to wait until we have a little more information before we can at least start defining some thresholds on things.

B

Did you want to talk about your tracing PR?

B

It looks fine. There's some sort of build error. Yeah.

A

I'm having a problem with this, because to me this makes no sense. When I attempt a Bazel build, it tells me to run go mod vendor and update. So I do.

A

And then I get a bunch of stuff from it, and then I need to run make generate again, which gets a bunch of stuff. It doesn't make any sense to me, because this seems wrong. What I'm replacing is actually...

B

There's actually a lot that's wrong. I'm looking at the PR.

A

Weird, and it's still failing. It's building for me locally, it's just not working, so...

B

You have some custom stuff in your environment, some custom environment variables. If you do a printenv, which I wouldn't do in front of me or on a recording, because it might have something like a credential in there, you probably have some custom KubeVirt-related settings. I can see there's Docker registry related stuff, so it's picking that up. I would go from a clean environment, yeah.

B

So here's what I would suggest, just to make all this go away: start fresh. You have like two code changes, in vmi.go and vm.go. Make those two changes in a fresh branch or something, then run make generate and see what happens. You might have to run make deps-update as well, or at least make deps-sync, I think, might do it.

B

But okay, I would do it all from a clean slate, because this is a mess. Yeah, 456 files changed, and a lot of these just shouldn't be changed. You're right, they should not be changed at all. This is very strange.

A

Yeah, okay, I'll mess around with it. I figured that's what it was going to end up being. It's not even building here; something is just totally off. So okay, I'll mess around with it, probably starting fresh and just bringing over the code, and see what happens. Okay.

B

Looking at your... let me look at your...

B

So I have one... I just noticed this.

B

Let me see if this will work.

B

You have a global object for this tracer. I just realized that this execute function is multi-threaded. Let me double check this is going to work before I...

B

Do you remember the name of the file?

A

Yeah, I'm trying to find it in here somewhere. It was probably utils trace.

B

No, the package, util trace.

A

Yes, this one and then watch that go. I think we only changed something.

B

This won't work the way you think it works, because this function is multi-threaded.

B

So where you have the trace utils new trace...

B

That virt-controller trace VM variable will get stomped on, because this code can be executed multiple times in parallel. That's what I'm trying to say. So in vmi.go or vm.go...

B

Yeah, where you're creating the new trace, that is done in a goroutine, and there are at least three threads. So when we're assigning...

B

So let me think about how you would get what you're looking for here, because you want that virt-controller trace VM object to follow through the execution.

A

I didn't realize there were three, or multiple, threads here. I see, oh, this is this threadiness variable. This affects this.

B

You see exactly, where we start all the different goroutines. It's up at the top of the...

A

Okay, so I was under the impression that the work queue is executed one item at a time.

B

I guess, no, kind of. So here are the guarantees you're given with the work queue: you're guaranteed that a single VMI will only be processed by a single worker, in series, so one VMI can't be processed multiple times in parallel.

B

What you aren't guaranteed is that two separate VMIs won't be processed in parallel. So that's where the threadiness comes into play: you can have multiple keys being acted on at the exact same moment, but a single key will only be acted on once at any given moment. You can't have it being acted on multiple times at once.

B

Give that some thought. It kind of sucks.

B

What you want is for this virt-controller trace object to be able to add steps in execute, because you want to know where things are happening. If you remove that step for now, because you don't actually have one in vm.go...

B

Again, you do an update status.

A

Yeah, it should be an update status, yeah.

B

I guess if you can remove

B

that step, you can't get the fine granularity that you're looking for, but you can at least know which key it belongs to. It's not very helpful, really, but you'd know what happened. It would all be local: that trace would be local to the execute function, and you wouldn't have to worry about it being a global variable.
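
A minimal Go sketch of the "local to the execute function" option follows. The `stepTrace` type, its methods, and the step names are hypothetical and exist only for illustration; they are not the trace utility used in the pull request. The point is only that a trace created inside the handler is never shared between goroutines.

```go
package main

import "fmt"

// stepTrace is a hypothetical, minimal trace: a key plus the named steps
// recorded while that key was handled.
type stepTrace struct {
	key   string
	steps []string
}

func newStepTrace(key string) *stepTrace {
	return &stepTrace{key: key}
}

// step records that a named stage of the handler was reached.
func (t *stepTrace) step(name string) {
	t.steps = append(t.steps, name)
}

func (t *stepTrace) summary() string {
	return fmt.Sprintf("key=%s steps=%v", t.key, t.steps)
}

// execute stands in for a controller's per-key handler. The trace is a local
// variable, so concurrent calls for different keys never share it.
func execute(key string) string {
	t := newStepTrace(key)
	t.step("sync")
	t.step("updateStatus")
	return t.summary()
}

func main() {
	fmt.Println(execute("default/vmi-a"))
}
```

The trade-off is the one mentioned above: steps can only be recorded where the local trace is in scope, so finer-grained steps deeper in the call chain would require passing the trace down explicitly.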

A

Okay, yeah, I'll think about it, because part of this is that I expect, or at least I'd hope, that eventually, if there were any changes in here, people would want to add these steps.

A

So part of this is, I want to make sure there's sort of a framework to do that, some sort of example that I'm leaving for people to do that. Yeah.

B

So your options here are: you can create a library that's thread safe, and it would be keyed. You would add a step and pass a key with it, and this library would understand how to look up a map with the key to find the right...

B

Do something like that, but you'd have to add a mutex and all that stuff to it.

A

But that's rabbit.

A

Okay, yes, that's what I had on my other PR, but I was doing it for a different reason. I think I could just reuse some of what's there. Okay, so...

B

You can have a global tracer object, and you just pass in keys to start the trace and end traces, and internally this object has some sort of map that keeps up with what's occurring and when it's done, yeah.

A

Yeah, that works. And I guess that also leaves the door open for...

A

Yeah, okay, that makes sense. I mean, one of the things I was doing on the other work-in-progress

A

piece was, I was taking the tracing all over the place. I basically took the trace and imported it everywhere, and I created a map with the key, just like you're saying, and kind of moved it around. I think I moved it even outside of this function; that's why I used it, or something. Okay, I'll play around with this.

A

I think I understand what's going on here and how to get where I want to go. Yeah, okay.

B

Sorry I didn't catch that earlier. I should have caught that immediately. Oh well.

A

Yeah, I'll make it work. Okay, cool, thanks. That makes sense to me.

A

Okay, that'll give me something for that. And then, do you want to talk about... I added some more comments on the VM pools. I don't know if you want to talk about any of that, or if you haven't...

B

Sure. Yeah, I've been really behind on my GitHub, so I probably haven't seen those comments yet, but I have made some progress on the VM pool thing. Let me pull that up.

B

All right, okay, two days ago: can a VM be owned by multiple pools? No, it cannot.

A

Okay, I read this as, you know, "figure out what pools"... I don't know. Maybe it's just the comment.

B

Let me see if this makes sense: "when a VM is updated, figure out what pools manage it and"...

B

That comment was from when we allowed this detach and attach, so it was possible, for example, to take a VM, detach it from one pool, and attach it to another. It's not possible anymore.

B

So the comment isn't completely...

B

And I will update that comment.

B

Let's see your next comment: is orphan adoption and attach the same thing?

A

Did you update this for the removal of attach?

B

I did, but let me check the terminology in my branch.

A

Yeah, I just don't understand what an orphan is and why, when...

A

And why would we track it if we don't have attach?

A

Let me see the context of this.

B

We don't need to track that anymore. That was, again, part of the attach, so I can remove that. It doesn't do anything today. What would happen is, if we detached a VM, this logic that you see right now would just enqueue the pool. The pool wouldn't adopt it or anything.

B

So it's kind of useless. I can remove it.

A

Okay, yeah. I just wanted to make sure, because I saw orphans in a few places, so I guess maybe they just go away.

B

"Don't want to delete the deleting VMs, for example, VMI is running"...

A

Yeah, I think if you do a search for orphan, you'll see there's a bunch of stuff that might have just been...

A

just been left there from back then.

B

Okay, so here's...

B

This is actually okay, this area.

B

So what I'm doing right here is filtering deleted VMs when I'm scaling in. I'm trying to determine how far to scale in, and when I do this random selection of VMs to scale in, I want to pick VMs that aren't already in the process of being deleted.

B

I want to make sure that I'm actually picking VMs that can be removed, or scaled in. Otherwise it would just be less efficient. Scaling in would eventually occur; I might just do a few iterations of scaling in rather than efficiently targeting only the ones that are eligible to be scaled in. The end result would be the same, though.

A

Are you talking about this comment or the...

B

I'm talking about line 496. Oh.

A

Okay, all right, so what was.

B

Your explanation.

A

Yes, could you repeat it? Sorry.

B

So let me first address the orphan adoption. Orphan adoption and attach, they are the same thing. I need to remove that logic; it doesn't do anything today. It won't attach it. It would just enqueue the pool so attachment could occur, but attachment doesn't occur anymore, so it's just useless logic. All right, line 496.

B

"Are there any scenarios where we don't want to delete the deleting VMs, for example"...

B

I'm not sure I've completely followed that.

A

Yeah, so when I imagine it, let's say we're scaling in some VMs, and the deletion succeeded, in that the deletion timestamp ended up on the VMI, so we're actually attempting to delete them. But what if they never get deleted for some reason? They're just not removed, and then we do...

A

let's say we do another scale in, and we now have these ones with deletion timestamps. Do we just keep deleting them? The state I'm wondering about is: can we get to a place where we just keep deleting the same VMs and they never get deleted?

B

No, it would be considered a pending delete.

B

I'd have to look directly at the logic to be certain, but we're not going to get in a state where... let's come up with a hypothetical scenario. We have a replica count of 10, we move it down to nine, and the VM that's getting deleted is just stuck with the deletion timestamp; it's not going away, for whatever reason. If you then scale down to eight, a different VM is going to be picked to be deleted. We aren't going to get stuck on that first one.

B

It's going to continue to try to scale in. We aren't going to force delete anything. If a VM won't delete, then that's just kind of a...

B

That's just the way it is. Okay.

A

A little bit more complex.

B

Topic, I can tell you why it would be that way so.

B

That's okay! That's.

A

Kind of what I was looking for is.

A

Okay, so I wanted to make sure, because if we keep deleting the same VMs, that's the case I think we want to avoid.

A

We just don't want to keep trying to delete the same one over and over again, saying "hey, we're scaling in" when we're actually not scaling in, because we keep filtering for these deleted VMs when they're just not being deleted, for whatever reason. We let the user clean them up. Okay, it sounds like that...

A

My concern is, you know, when you filter for deleted VMs, you're not going to get in return ones that already have deletion timestamps on them.

B

As the selection mechanism for which VMs we can scale in, we are filtering out the ones that are already deleted, because those are already in the process of being scaled in. It's just to ensure that we only perform the action of deleting a VM on VMs that haven't already been deleted.

A

Okay, that makes sense to me then. I just wanted to make sure we avoid that case. Okay.

B

Cool, so here's the other thing. If you look at the line right under that, line 498: when I'm coming up with the count of how many VMs we need to remove when scaling in, I'm counting the ones that are already being deleted in that count, so we're always going to be accurately scaling.

B

If a VM takes a really long time to scale in, it's not going to cause anything wonky. It just means that we know we're attempting to scale in, but it's taking some time for that to happen. If we scale in further, then it's just going to be in addition to that. It's not going to cause us to overshoot scale in or overshoot scale out or anything like that. We always understand what the declared state is, understand

B

what's currently in action, whether a VM is being created or is in the process of being deleted, and take that into account when we decide what actions need to occur next. So in my hypothetical scenario, where we had a VM pool of size 10, we went down to size nine. So we have one VM that's in deletion and it's just not going away, and then we go down to ten... I'm sorry.

B

We go down to eight.

B

What we have is one that's already being processed. We know about it, and we're filtering that one out because it already exists. We're accounting for that in the count that we need to understand how many to remove, and that means there's only one left to remove to get down to eight.
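
The arithmetic in this hypothetical scenario can be written out as a small Go sketch. The function name and parameters are assumptions made for illustration, not the code at the lines being reviewed: VMs already marked for deletion are subtracted before deciding how many more to delete.

```go
package main

import "fmt"

// scaleInCount returns how many additional VMs must be deleted to reach the
// desired replica count, given the total number of existing VMs and how many
// of them already have a deletion in progress. It never returns a negative
// number.
func scaleInCount(total, deleting, replicas int) int {
	n := total - deleting - replicas
	if n < 0 {
		return 0
	}
	return n
}

func main() {
	// Pool of 10 scaled to 9: one VM is stuck in deletion but still exists.
	fmt.Println(scaleInCount(10, 1, 9))
	// Scaling further to 8: only one more VM needs to be picked.
	fmt.Println(scaleInCount(10, 1, 8))
}
```

Under these assumptions the first call yields 0 and the second yields 1, matching the "only one left to remove" conclusion above: the stuck VM is counted once and never causes overshoot.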

B

That was a lot of talk. I lost myself in all that, but...

A

I follow you. No, I get it.

A

Makes sense, yeah. No, I get it. That's what I was hoping for. Okay, that makes sense to me, because that, to me, follows the principles that we would expect. Okay, yeah.

B

Or undershooting in any way, that's the biggest thing.

A

Okay, cool, so we won't get into this situation. That's very good. Okay, all right. I can resolve this... oh, I can't, but okay, you can ignore that. I think that's good. Okay, yeah, that's all I had for now. I'm still making my way through this file. I think I stopped at about line 496, and I guess...

B

Yeah, it's really tedious. And if you're wanting to just understand the general behavior of how things work, the functional test exercises some of these edge cases. So the file in the tests directory is a good resource, I guess, just to see how this is exercised.

A

Okay, yeah I'll, keep reviewing it a little.

B

Bit of time, yeah thanks this is going to take a.

B

I think it's going to be super useful, and as soon as we get this PR merged, we can start layering in all the more interesting options that we came up with, like the secret generation, yeah. Cool.

A

Yeah, awesome. Okay, cool. I don't think I have anything else, then. All right, guys.

A

No, I think we're good. All right, and then, David, I'll leave it to you to decide. I'll mark it here, and if you decide there's nothing, you send out an email. And then one thing I'll do: I'm going to make sure that the meeting time is correct, since...

A

I was told that our time might be a little off, so I'll get with Chris and get that sorted. But like I said, David, send an email once you decide whether there's a case for a meeting or not. Okay, that's good. All right, thanks, guys. Talk to you later. Have a good day. Bye.