From YouTube: SIG - Storage 2023-02-27
Description
Meeting Notes:
https://docs.google.com/document/d/1mqJMjzT1biCpImEvi76DCMZxv-DwxGYLiPRLcR6CWpE/edit#
A
Hello everybody, my name is Chandler Wilkerson. I'll be guest hosting today for Adam, and I just wanted to get a couple of questions out. Do you usually wait until about three minutes after the hour to start the meeting, and is this agenda good?
B
Yeah, we usually wait a bit to give people some time to put in topics, maybe.
E
Usually, there's more stuff in the agenda.
E
That might make sense to talk about. I've been working on this for the last few weeks, and it's not CDI-related, it's KubeVirt-related. Essentially, KubeVirt has some supporting containers that have a very high request-to-limit ratio, the biggest one being the hotplug attachment pod. Or actually, it's a container in the pod, but essentially the container does nothing.
E
And
if
you
have
a
you
know:
lower
ratio,
then
the
attachment
power
won't
start
because
the
ratio
is
too
high
and
I
I'll
probably
bring
this
whole
thing
up
in
the
actual
Cube
vert,
meaning
on
Wednesday
I.
Just
wanted
to
see.
If
anybody
had
any
thoughts
on
this,
the
same
issue
happens
for
container
disks,
because
the
container
disks
have
a
you
know
in
a
container,
with
a
very
high
request
to
limit
ratio.
E
virtiofs also creates a container to get virtiofs into the virtual machine, and then there are any sidecars.
E
So if people created a sidecar to modify the domain XML, again, it does almost nothing except during startup. All of those have a very high request-to-limit ratio, and they all fail when I put a LimitRange with a ratio in my namespace. Having sort of explained the issue: I have a PR out right now that, specifically for the hotplug pod, sets the request to be the limit on both CPU and memory, and that will fix the ratio check in the namespace.
E
It's just a little bit wasteful, because I am essentially reserving CPU and memory for a container that doesn't do anything. But it only happens if you actually hotplug a disk; if you don't hotplug, it's not going to affect you.
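The interaction being described (a LimitRange with a maximum limit-to-request ratio rejecting pods whose ratio is too high) can be sketched as a small check; the numbers and the maximum ratio below are hypothetical, not taken from the PR:

```python
def violates_ratio(request: float, limit: float, max_ratio: float) -> bool:
    """Mimic a LimitRange maxLimitRequestRatio check: the pod is rejected
    when limit / request exceeds the configured maximum."""
    return limit / request > max_ratio

# A do-nothing helper container: tiny request, much larger limit.
assert violates_ratio(request=10, limit=100, max_ratio=2)      # ratio 10, rejected
# Setting request == limit makes the ratio exactly 1, so it always passes.
assert not violates_ratio(request=100, limit=100, max_ratio=2)
```

Setting the request equal to the limit drives the ratio to exactly 1, which is what the PR under discussion does for the hotplug pod.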
E
So my question is: how much of an issue would that be for people? What I am actually reserving now is essentially only about 80 megabytes of memory; it's not a huge amount, but is it an issue for people if we do that more? Am I going to get pushback on my particular PR?
B
So
I
think
I
think
this
way
of
tackling.
It
is
not
really
much
different
from
from
what
we
do
to
VMS
today,
which
is
just
add
at
some
overhead
to
the
user
resource
request,
but
so
I
I
think
it's
not
the
end
of
the
world.
It's
it's
you're,
basically
doing
the
same
thing,
but
for
or
a
sidecar
pod
and
regarding
the
way
forward
in
general,
I.
Think
the
kind
of
API
that
we
have
in
CDI
is
basically
the
way
to
go.
B
I can't see another scenario where something can take the place of a full-blown API that gives you an option to just put in your defaults for these kinds of pods and be happy with it.
E
And just so everybody else knows: in CDI, in the CDI CR, you can set a configuration where you tell it to use these requests and limits on the worker pods, and you can use that to get the correct ratio you want.
E
If
you
have
a
limit
range,
but
a
ratio
in
it
or
you
know,
if,
if
you
have
a
a
particular
image
that,
for
whatever
reason
uses
more
memory
than
the
default,
you
can
increase
the
the
amount
of
memory
that's
available
for
the
product,
so
I
I
I,
don't
know
how
many
people
we
have
here
that
that
are
running
this
on
like
large
clusters
or
anything
like
that.
It
doesn't
look
like
we
have
too
many
people
that
would
do
that.
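For reference, the CDI CR knob being described looks roughly like the fragment below, expressed as a Python dict for illustration; the `podResourceRequirements` field name and the quantities are from memory and should be verified against the CDI API reference:

```python
# Fragment of the CDI custom resource, as a dict for illustration only.
cdi_spec = {
    "spec": {
        "config": {
            "podResourceRequirements": {
                "requests": {"cpu": "100m", "memory": "60M"},
                "limits": {"cpu": "750m", "memory": "600M"},
            }
        }
    }
}

def implied_ratio(reqs: dict, lims: dict, key: str) -> float:
    """Limit-to-request ratio implied by the configured quantities."""
    scale = {"m": 0.001, "M": 1.0}  # milli-cores and megabytes only, for the sketch
    parse = lambda v: float(v[:-1]) * scale[v[-1]]
    return parse(lims[key]) / parse(reqs[key])

cfg = cdi_spec["spec"]["config"]["podResourceRequirements"]
# Raising the requests (or lowering the limits) lowers this ratio, which
# is how you satisfy a LimitRange ratio constraint in the namespace.
assert implied_ratio(cfg["requests"], cfg["limits"], "memory") == 10.0
```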
E
Like I said, I'll bring this up on Wednesday in the KubeVirt meeting itself. Hopefully we'll have some people there that run this on large clusters, because I suspect that for people with small clusters it's not going to matter that much. But if you have large clusters, all the little reserved pieces of memory and CPU might add up.
B
And I think we can just pick up issues from the 36 we have on the issues page; 1203 is the last one we stopped on, yeah.
E
So essentially, what happens to get progress updates on data volumes is that the controller actually directly connects to the pod, to one of the metrics endpoints that has the progress update. For the percentage here, we're actually just using qemu-img with -p, where it's printing the percentage, if we're doing a qcow conversion. And if we're directly writing, you know, if we're actually downloading it through a standard HTTP connection...
E
Most of the time we can't actually calculate the percentage, because we don't know the total, and that's when we show N/A. This particular issue seems to be about not being able to connect because of some sort of network policy. We're actually directly connecting to the pod, and if there's a network policy that prevents that, then we can't get the information and we can't compute the percentage.
E
I don't know exactly what they would like us to do about this... oh, actually, they tell us in the first line: they want a service associated with the pod, and then we connect to the service, so that we're not connecting to the IP address of the pod.
E
That actually shouldn't be terribly hard to do; we just need to create services on the fly for each pod.
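A service-per-pod approach could be sketched like this; the label key, name suffix, and port below are made up for illustration and are not CDI's actual conventions:

```python
def service_for_pod(pod_name: str, label_value: str, port: int) -> dict:
    """Build a Service manifest that selects one worker pod by label, so
    the controller talks to a stable service name instead of the pod IP.
    The label key, name suffix, and port are illustrative only."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": f"{pod_name}-progress"},
        "spec": {
            "selector": {"cdi.example/progress-pod": label_value},
            "ports": [{"port": port, "targetPort": port}],
        },
    }

svc = service_for_pod("importer-dv1", "dv1", 8443)
assert svc["metadata"]["name"] == "importer-dv1-progress"
assert svc["spec"]["selector"]["cdi.example/progress-pod"] == "dv1"
```

The pod would carry the matching label, so the service's endpoints always resolve to that single pod.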
E
And I might actually make some of the logic in the controller simpler, because in the controller we're looking at the IP address and then, what is it, the endpoints on the pod. If we have a service, we just connect to the service; it might actually be simpler from a controller perspective.
G
I just wanted to interject. I don't really have any comments about this particular bug, but we do have a similar problem elsewhere, where we want to sort of communicate progress of pods that are doing things that take a long time, and I don't know of a good way to do it, to be honest, so I'm really interested.
E
So, what we did for the progress: we have some way in our application to figure out the progress, and essentially what we did is create a Prometheus endpoint, and creating one is really easy.
E
You just call the library function and set up the type of gauge you want; essentially it's just a zero-to-one-hundred gauge. Then our controller directly connects to the pod at that endpoint, and it basically just gets the Prometheus output, finds the correct field, and displays that.
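A minimal sketch of such a progress gauge, written directly in the Prometheus text exposition format rather than with an actual Prometheus client library; the metric name here is made up, not CDI's real metric:

```python
progress = {"percent": 0.0}  # updated by the application as work proceeds

def render_metrics() -> str:
    """Render the gauge in the Prometheus text exposition format. The
    controller scrapes this output on the pod, finds the gauge line,
    and reports the value as the data volume's progress."""
    return (
        "# HELP import_progress Import progress in percent (0-100).\n"
        "# TYPE import_progress gauge\n"
        f"import_progress {progress['percent']}\n"
    )

progress["percent"] = 42.5
gauge_line = next(l for l in render_metrics().splitlines()
                  if l.startswith("import_progress"))
assert float(gauge_line.split()[1]) == 42.5
```

Serving this string over a plain HTTP handler is all the "endpoint" amounts to, which is why the controller can also scrape it without Prometheus itself being involved.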
E
There was an issue about the update not being right because it was trying to connect to the wrong IP address; I think maybe doing a service might actually solve that problem too.
B
I think I had an issue about... yeah, you're right that there was something similar, but I think the issue is that we just don't play nice when the networking is bad in the cluster. I'll have to dig it out.
E
So, do we think this issue should actually just be fixed? I don't think it's going to be very hard to fix, because creating a service that points to the pod is relatively straightforward, right? We just put a label on it and bind the service to the label, and we haven't done anything with this. I actually didn't know this existed, otherwise I probably would have already fixed it some time ago.
G
I mean, thanks for that explanation, Alex. I know you say it's quite easy, but to my mind it sounds pretty complicated: you've got a service, and you've got something in the pod which is answering requests. Compare that to log files, which pods will just collect.
G
What I'm saying here is that when I look at this, it just seems to me that there should be a way, in the same way that logs are collected, and this has nothing to do with KubeVirt, this is entirely a Kubernetes thing: there should just be a way to signal some small amount of data like that from inside the pod to the metadata.
E
We can essentially just write some code where the pod itself updates the resource, and whoever wants the information can just read it from there. The thing is, we don't really want to let the pod, or the application that is running in the pod, know that it is running in Kubernetes, because then it's linked to Kubernetes.
E
So
if
you
don't
want
to
do
that
now,
then
you
need
to
somehow
get
this
information
from
the
pop,
and
you
know
the
Prometheus
endpoint
every
day
is
one
way
you
could
do
an
HTTP
endpoint,
where
you
just
connect
with
the
HTTP
I'm
going
to
get
some
information.
That's
another
way,
that's
essentially
what
the
Prometheus
endpoint
is.
It's
just
an
ATP
endpoint,
but
Prometheus
has
a
bunch
of
libraries
that
you
can
use
to
to
get
this
information
relatively
easily.
E
So instead of pushing the information from the pod itself, now you're pulling it, right? You're just connecting to it: has something changed, has your progress updated, etc.? It's just sort of a philosophy thing. If you don't care that your application knows it's running in Kubernetes, it's probably simpler to pass it a kubeconfig that it can use to update the resource itself, and let the application do that. But if you do care, then you have to go through some gymnastics to get the information.
E
It's
it's
actually
not
that
bad,
because
if
you
start
a
pod
as
a.
E
Account
there's
a
secret,
that's
automatically
injected.
That
is
the
cubeconfig
for
that
account
and
it's
it's
in
a
stable
place.
So
you
can
just
read
the
queue
config
and
build
your
client
from
there
and
then
use
the
client
to
connect
to
the
kubernetes
cluster
and
do
the
updates.
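A sketch of reading those injected credentials; strictly speaking, what Kubernetes mounts at the well-known path is a token and CA bundle rather than a literal kubeconfig file, and the helper below is illustrative, not a real client API:

```python
import os

# Well-known in-cluster mount point; Kubernetes injects these files into
# every pod that runs under a service account.
SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"

def in_cluster_credentials(sa_dir: str = SA_DIR) -> dict:
    """Read the injected service account credentials. Sketch only; real
    client libraries (client-go, kubernetes-python) do this for you."""
    with open(os.path.join(sa_dir, "token")) as f:
        token = f.read().strip()
    return {
        "host": "https://kubernetes.default.svc",  # in-cluster API server name
        "token": token,
        "ca_cert": os.path.join(sa_dir, "ca.crt"),
    }
```

A client built from these values can then PATCH status fields on the resource, which is the "let the application update it" option described above.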
E
Well, essentially, if you mount a secret in a pod, it shows up as a file somewhere. So if your user can connect to the pod and snoop on files, yes, they can read it; but if a user can connect to the pod, then they can essentially do anything already.
E
But
yeah
there's
a
couple
of
secrets
that
automatically
injected
into
every
pod
and
one
of
them
is
is,
is
the
Q
config
of
the
service
account
that's
running
the
also.
E
Well,
let's,
let's
I
I
think
we
should
let's
get
back
to
this.
This
issue.
I
think
we
should
actually
do
should
fix
it.
E
And
now,
if,
if
they
respond
at
least
I'll
get
the
email
here,.
E
We never manage to get through all the different ones before we stop, so if you could just put a note saying that this is the one we stopped at...
E
You know, we just started these meetings.
E
We're still trying to figure out what a good format is because, as you can see, we started off with quite a few things to discuss in the beginning, but we're sort of running out. So it'll probably end up being mostly a bug-triage thing.
B
You know, like maybe those parameterized data volume tests, the ones that do import, clone, and upload in the same describe block.
E
Or doesn't that require that we upgrade to Ginkgo 2? Yeah.
E
Ginkgo 2 knows how to put an actual label on a test that you can then query on. I've seen those in KubeVirt, so I know KubeVirt is on 2.0 already.
B
Yeah, but you could do it in Ginkgo 1 as well. We do it for the destructive tests.
E
Well, we put a label on there and then pass a regex to find a particular label, but it's not really a separate label field; we put a particular magic string in our test name and then use a regex to find it. It'll probably work, but I think an actual label, which is a separate field, would be nicer.
B
For this one... probably not just this issue on our repository. Okay, so there's a link there to a CSI PR; maybe that has more information, maybe they already ended up implementing this somehow. If you scroll down... yeah, it got mentioned in the CSI workload tests.
B
Okay,
now
it's
just
basically
the
same
description
as
the
CIA
issue.
Right.
E
I added this a long time ago, and the main issue was for manually created NFS persistent volumes.
E
In the persistent volume, you have to say both ReadWriteMany and ReadWriteOnce, and then when you create a PVC, you can either specify ReadWriteOnce or ReadWriteMany and it will bind; a PV will also allow both ReadWriteOnce and ReadWriteMany in the PVC spec, and it will still bind. Right now, though, a data volume will reject that.
E
You
know
the
PVC
part
of
the
data
volume
if
you
put
in
more
than
one
access
mode,
it's
rejected,
even
though
it's
an
array
and
you
can
specify
more
than
ones,
and
it
should
accept
it.
So
this
was
just
me
saying:
hey.
We
need
to
fix
this
where
we
allow.
You
know
both
it's
I
I
since
then,
I,
don't
think
I've
actually
seen
anybody
create
a
or
or
create
a
PVC
with
both,
but
it's
technically
possible.
So
I
don't
think
we
should
reject
it.
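The fix being proposed amounts to validating the access modes as a non-empty set of valid values rather than insisting on exactly one entry; a minimal sketch:

```python
# Valid PVC access modes (ReadWriteOncePod omitted for brevity).
VALID_MODES = {"ReadWriteOnce", "ReadWriteMany", "ReadOnlyMany"}

def access_modes_ok(modes: list) -> bool:
    """Accept any non-empty list of valid modes instead of exactly one,
    matching what the PVC API itself allows."""
    return len(modes) >= 1 and all(m in VALID_MODES for m in modes)

assert access_modes_ok(["ReadWriteOnce"])
assert access_modes_ok(["ReadWriteOnce", "ReadWriteMany"])  # should bind, not be rejected
assert not access_modes_ok([])
```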
E
Actually, we can probably close this, since I've never actually seen anybody do that. It's just one of those things that's theoretically possible, so we should allow it, but nobody's ever actually done it.
A
I have seen a use case where you create a DV and you don't necessarily... like, if you don't specify one of those modes, does it just kind of accept whatever the default storage class provides?
E
No. Well, since then we've added a storage specification. As part of the data volume, you can either provide the PVC, which I'm going to call a template, the PVC template, and that's the template that the controller will use to create the PVC. But then we've added a storage section, and in the storage section you can omit certain fields that are required in the PVC section, like access mode and volume mode.
E
What happens is, we also created something called a storage profile. So if you omit those in the storage section, it will go and look in the storage profile and say: oh okay, we should use Block and ReadWriteMany. Then, when it creates the actual PVC, it basically fills in the blanks from the storage profile and creates a PVC that, in theory, should be optimal for the storage you're using.
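The fill-in-the-blanks behavior can be sketched as a simple merge, with the storage profile supplying defaults for anything the user omitted; the field names here are simplified for illustration:

```python
def fill_pvc_blanks(storage_section: dict, profile_defaults: dict) -> dict:
    """Anything omitted (None or absent) in the DV's storage section is
    filled in from the storage profile before the PVC is created."""
    pvc = dict(profile_defaults)
    pvc.update({k: v for k, v in storage_section.items() if v is not None})
    return pvc

profile = {"accessModes": ["ReadWriteMany"], "volumeMode": "Block"}
user = {"resources": {"requests": {"storage": "10Gi"}}}  # modes omitted
pvc = fill_pvc_blanks(user, profile)
assert pvc["volumeMode"] == "Block"               # from the profile
assert pvc["accessModes"] == ["ReadWriteMany"]    # from the profile
assert pvc["resources"]["requests"]["storage"] == "10Gi"
```

Anything the user does set takes precedence over the profile, so the profile only ever fills gaps.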
E
This particular thing is very much an edge case, and actually I am just going to close it, because I don't think it's very interesting; it's more me being a little pedantic that we should allow it.
B
Yeah, it is pretty interesting: we've gone back and forth on changing the default value for this. I think the first year of CDI it had unsafe, which was the default, or is the default.
B
Then
at
some
point
we
decided
to
pull
closer
to
rev
and
we
made
the
cash
option,
be
none
and
then
recently,
with
some
help.
B
We
concluded
that
to
write
back
was
the
way
to
go
for
us,
so
we're
currently
sitting
at
setup
on
right,
back
cache
mode,
but
I
think
it
makes
sense
to
make
this
configurable
it's
just.
If
you
scroll
down,
it's
there's
a
whole
Matrix
that
we
need
to
implement
for
this,
and
we
should
decide
on
it.
B
The
thing
is
Okay,
so,
okay,
Alexander
summarizes
it
pretty
well
on
this
comment
here:
it's
a
global
level,
there's
a
storage
class
level
and
a
per
data
volume
level,
so
I
think
that's
pretty
much
it.
We
have
to
give
people
the
Global
knob,
so
they
could
just
always
go
with
the
certain
cash
mode
and
then
a
per
storage
one
and
then
sometimes
somebody
wants
to
use
none
on
their
data
volumes
and
sometimes
they
want
other
things
so
just
give
them
a
data
volume
or
not.
B
I
think
that's
pretty
much
the
best
way
to
go
what
what
I
am
missing
is
that
for
some
reason
nobody
was
pushing
on
this
too
much.
This
is
a
pretty
old
issue,
but
it
totally
makes
sense.
G
Ah, there's an interesting problem that you might run into here. We ran into this in virt-builder, where you mix O_DIRECT writes with reads that come from the page cache, and you can actually get stale data in the page cache that doesn't reflect what's actually being written to disk. To be very specific about this: it occurs when you run qemu-img convert and then you run QEMU very quickly afterwards on that disk image; you can have QEMU seeing stale data. I can't quite remember exactly which combinations cause problems and which don't; you're probably best to ask Kevin about this.
G
So be a little bit careful here with this. You may run into problems like that, and it may even be a kind of security issue as well.
E
That's the reason why we went with cache=none at some point. In particular, we saw this with Gluster or Ceph, where we had the pod that was writing the image on node A, and then, immediately once it was done, on node B we started a VM that was trying to use it, and that one didn't get all the data yet, due to some caching on node A.
B
Yeah, it's just that for this specific use case, I think we were just kind of ruining this person's flow with cache=none; it was just making things slower. I think it's NFS-related, NFS 4.1. I'm not sure exactly what happens, but from the discussion in this issue it seems that cache=none didn't make sense, so they would have benefited a lot from making this configurable. But I don't know what they would go for instead of none; I don't think the issue has that information.
E
To me, the question is more: at what level do we want to implement this, right? I gave three levels: a cluster level, a storage class level, and a data volume level. I don't know; each one has pluses and minuses, right? If you set it at a global level, you set it once and it applies to everything, but if you have two different storage classes that want different modes, then you can't express that.
E
If you do it at a storage class level, you have to set it on multiple storage classes, so there's more configuration. And if you set it at a data volume level, then every time you create a data volume you have to set it, if you need a different value than the default.
B
For now, and I see somebody already did this, just ping the person that opened it and check whether the performance issue resolved itself, because we did change it: we went to writeback instead of cache=none.
A
So I think the last communication on this issue was the original poster saying: I still want it user-configurable. Okay.
D
In the original problem that inspired this, it was a real URL, but instead of an image it gave you some HTML and a success page, so it looked like a successful import, except it was definitely not an image you could boot.
E
We seem to be heading toward just downloading the file to scratch space and then doing the conversion, because doing it inline seems to have problems, especially if we have a gzip or xz type of compression in there.
E
If
we
do
that,
then
the
check
sum
should
be
relatively
straightforward
to
compute.
By
once
we
have
the
data
in
the
scratch
space
during
the
checksum
is
simple.
So.
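Once the file sits in scratch space, the checksum really is simple; a sketch that streams the file in chunks so large images never need to fit in memory:

```python
import hashlib

def checksum_file(path: str, algo: str = "sha256") -> str:
    """Stream the scratch-space file in 1 MiB chunks and return the
    hex digest, so even very large images stay memory-friendly."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

The digest can then be compared against a user-supplied expected checksum before the conversion step runs.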
G
The file you provide... I was thinking about that, actually. So there isn't a checksum, but it might be possible to add one. The fundamental problem is that you have to read the whole file, which is what we were sort of trying to avoid; but obviously, if you want to compute a checksum, you've got to read the whole file, you can't get around that.
G
Yeah, yeah, I know, obviously. But you can skip the holes if you have an image with holes, right?
G
That's the immediate benefit.
E
So I think we should just leave this one; that's the last one we looked at. Next time we'll look at it again and see what we can get through.
D
I guess one other thing we could do regarding the original scenario is possibly detect very suspicious-looking images. Like, it gave out HTML, so that's detectable, and that's going to be a common scenario; maybe we can make that into an alert. Or maybe we look for no partitioning, but that's... I don't know, maybe someone really wants to import an image that isn't partitioned in any way. There's actually no reason it can't be that way; it just needs to have a bootloader.
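The HTML case is cheap to detect from the first bytes of the download; a sketch (the set of prefixes checked is illustrative, not exhaustive):

```python
def looks_like_html(head: bytes) -> bool:
    """Cheap sanity check on the first bytes of a downloaded 'image':
    a captive portal or error page typically starts with an HTML tag,
    while real disk images start with format magic or a boot sector."""
    stripped = head.lstrip().lower()
    return stripped.startswith((b"<!doctype html", b"<html", b"<head", b"<body"))

assert looks_like_html(b"<!DOCTYPE html><html><body>Success!</body></html>")
assert looks_like_html(b"  <html lang='en'>")
assert not looks_like_html(b"QFI\xfb")  # qcow2 magic bytes
```

A check like this could feed the alert mentioned above without ever rejecting unusual but legitimate images outright.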