Description
Meeting of Kubernetes Storage Special-Interest-Group (SIG) Object Bucket API Review - 17 September 2020
Meeting Notes/Agenda: -
Find out more about the Storage SIG here: https://github.com/kubernetes/community/tree/master/sig-storage
A
Okay, can I share my screen?
B
Okay, so we've started implementing, and as we implement, some new questions have come up. Before I jump into that, I want to quickly recap what we've been doing so far.
B
As of about two weeks ago, we agreed upon the overall design of the system, and we've updated the KEP to reflect the latest designs that we've all agreed on.
B
I think some of you have already reviewed the KEP. For those of you who haven't had a chance yet, I humbly request that you take some time to review the document and leave your feedback. As of last week or so, we started discussing the implementation of the project. Since we have an overall understanding of the design, we can now move forward in terms of implementation.
B
As we started writing code, we ran into a few decision points that still need some clarification. That's what I want to bring up today, starting with the API group. First, I wanted to quickly review what the API group for CSI volumes looks like right now.
B
In the case of CSI volumes, or CSI-specific objects, the API group is storage.k8s.io.
B
This group does not have any implementation-specific details in the group name itself. It doesn't say csi.k8s.io, and it's pretty neutral in terms of what it does. It's just "storage," rather than some specific implementation of it like CSI or Flex or any of the other things we've built before. What we came up with initially for COSI was cosi.sigs.k8s.io.
B
Ideally, when we create a new group like this, it sounds like we're almost creating a new standard that's different from CSI, almost competing with CSI. Ideally, though, we want to be aligned with it and not compete with it.
B
In the sense that it raises the question of whether COSI is something entirely separate. The best way for me to put it is that, the way I see it, object storage is just another form of storage. I really don't know if it should be a separate standard that users of Kubernetes will have to adopt.
B
So I'm thinking, and I'm open to suggestions too: COSI should ideally, in the long run, go into storage.k8s.io, or whatever that API group ends up being, and I can imagine this being a bottleneck for us to move forward with getting the API in. However, if the process of getting the API in is going to be a huge bottleneck, there is another option for the short run.
B
While we're waiting for the API approval to happen, we could use objectstorage.k8s.io as the API group for all the code that we write now. I want to make this decision now rather than later, because all the code we write imports by this package name, and I wanted to get your feedback on how to proceed with this.
B
Okay. Now, while we're implementing, we will be dependent on getting code into upstream.
C
Right, and the nice thing is, the API groups are versioned. You have v1alpha1, and so on.
C
If you introduce it in v1alpha1, I think API reviewers understand that this is not final, and you could potentially deprecate it without concern in the future if a mistake was made. So the bar is a little bit lower.
C
Yeah, I mean, we can run it by the API reviewers. I wouldn't bring it up as a major concern unless they do. I think from the SIG's perspective, I'm okay with storage.k8s.io.
B
Okay, sounds good. So now that that is resolved, there's another concern that we've been dealing with, and I don't have slides for it; I just started creating them. It is deletion.
B
A delete bucket call returns an error if there's data in the bucket, and so a deletion operation will constitute listing the objects in the bucket and then calling delete on either a group of objects or one object at a time, depending on the implementation of the backing store.
B
Now, there's a problem that comes up here, which is: how do we design the deadline for the gRPC call? That is, what if I have an across-the-board deadline for an operation, say something like 30 seconds, or a minute, for any given operation?
D
So I have a suggestion here. I mean, you're correct that some backends will definitely have a very long-running version of this, because they have to do a lot of work to delete a bucket. We should acknowledge that some will be able to do this efficiently and instantaneously, because of some garbage-collection mechanism that they use to actually clean it up after the fact. I think the best way to proceed is to have a return code that can be returned quickly: if the implementation knows that it will probably take a long time, it can just say, "you know what, I got the request, I'm working on it, it's not done," rather than sitting around blocking on some thread. I mean, we have this with snapshots, where you can take a snapshot and it'll return success, but say "I'm still working."
D
The successful response will include a flag that says, "I'm still working; you need to poll me to find out when it's really done." You could do it either through a successful return code plus a flag in the response, or through an error code that is just defined to mean "try again later, I'm working on it."
B
Okay, okay. And the behavior should then be: if you get that "try again later, I'm working on it," should we just put the item back into the queue and then just let it retry?
D
Yeah. I mean, what's the alternative? You wouldn't want to put it back at the front of the queue.
D
I think you have to transition into a "deleting" state, and as long as it's in the deleting state, you periodically try to delete it. As long as it says, "yes, we're still working on that," you just stay in the deleting state and remember to keep polling, maybe with some exponential backoff up to some maximum interval.
D
Yeah, and you could just re-queue it with backoff in the work queue and let the work queue do the backing off.
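The re-queue-with-backoff idea can be sketched in a few lines. In a real controller, client-go's rate-limited workqueue would compute these delays; this standalone version uses illustrative names and constants, not values anyone agreed on:

```python
def requeue_delay(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Seconds to wait before the next retry of an in-progress deletion.

    Doubles on every attempt (exponential backoff) and is clamped to `cap`
    so polling never falls completely silent, matching the "some maximum
    time" mentioned above.
    """
    return min(base * (2 ** attempt), cap)

print([requeue_delay(n) for n in range(5)])  # [2.0, 4.0, 8.0, 16.0, 32.0]
print(requeue_delay(20))                     # 300.0 (capped)
```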
D
You could also have a mode where you explicitly don't retry until a certain amount of time has passed, if you get back a "still working on it" response. Because if the plugin says, "this is going to take a while," you don't want to try again after two seconds, and after four seconds, and after eight seconds; you probably just want to immediately go to some long wait before asking again.
D
That is, before asking how it's going. But this pushes all of the decision making about whether to block or whether to return quickly down into the plugin, and the plugin might guess wrong. It might return "this is going to take a while" when in fact it's only going to take five seconds, and then it might have been better to just block.
D
But
I
don't
know
you
need
a
heuristic,
I
guess
and
and
the
the
plug-in
implementer
will
have
way
more
information
at
their
level
about
what
the
right
thing
to
do.
B
Understood. Now, do we want to have a deadline on operations at the plugin level? In the sense that we expect every gRPC call to always finish within a minute, or something like that?
D
Well, I don't view that as a deadline. I mean, CSI has a timeout, but it's not a requirement that everything completes within the timeout. It's just how long the caller will block, and if you take longer than that, we're going to call you again later. That's how it's interpreted. And it should probably be similar here; I think it's 30 seconds in CSI. Does anyone know? I don't actually remember.
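That interpretation, where the timeout bounds a single attempt rather than the whole operation, can be sketched like this. The helper name and the 30-second default are placeholders echoing the figure mentioned above, not a confirmed constant:

```python
import concurrent.futures
import time

CALL_TIMEOUT_SECONDS = 30.0  # illustrative; the CSI value was not confirmed

def call_with_timeout(rpc, timeout=CALL_TIMEOUT_SECONDS):
    """Invoke `rpc`, blocking at most `timeout` seconds.

    Returns (result, done). done=False does NOT mean failure: the operation
    may still be running server-side, and the caller simply re-queues the
    item and makes the same idempotent call again later.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(rpc)
    try:
        return future.result(timeout=timeout), True
    except concurrent.futures.TimeoutError:
        return None, False  # re-queue and retry; don't treat as an error
    finally:
        pool.shutdown(wait=False)

result, done = call_with_timeout(lambda: "deleted", timeout=1.0)
print(result, done)  # deleted True

slow = lambda: time.sleep(0.2) or "deleted"
_, done = call_with_timeout(slow, timeout=0.05)
print(done)  # False: operation outlived this attempt, call again later
```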
G
Right, so to segue: I'm just curious, does CSI expect the drivers, or plugins as you're calling them, to be not only idempotent but also multi-threaded?
D
Okay, yeah. The idea is that you can call multiple RPCs in parallel, and if that causes problems on the plugin side, they can use locking to block out the other calls. They can figure out how to handle locking internally if they don't want to get multiple calls in parallel.
C
There was something about that in the spec. I forget which way we put it: whether it was on the CO side to not do that, or whether we put it on the driver's side to make them more robust. Read the spec.
C
It might have been something like best effort on both sides: make sure your driver is tolerant of cases that would cause it to behave badly, and on the CO side, try to behave appropriately and not make the same call on two different threads.
H
All right. It looks like, for S3, we should expect our driver to be reentrant, right? Because while it's working on a deletion, we may ask it to delete again.
B
Yeah, that's the expectation. Well, the driver has to implement that. The expectation is that how they implement it is left completely to the driver. As was just said, some drivers could be very efficient at it and might have support for it in the actual provider itself.
D
But
like
in
particular,
you
know
if
there's
state,
that's
used
to
ensure
item
potency
inside
the
driver,
you
probably
won't
need
locks
around
that
in
case
you
get
the
multiple
threads
coming
in
but,
like
you,
shouldn't,
be
holding
the
lock
for
the
whole
delete
operation,
because
if
it's
going
to
take
like
an
hour
to
delete
everything
and
you
hold
a
lock,
so
you
can't
do
any
other
work.
That's
that's
also
very
bad,
so
yeah.
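The locking guidance can be sketched as follows: guard the shared idempotency state with a short critical section, but never hold that lock across the long-running delete itself. Everything here (function names, the status strings) is hypothetical:

```python
import threading
import time

_state_lock = threading.Lock()
_deleting = set()  # bucket ids with a delete already in flight

def delete_bucket(bucket_id, do_delete):
    """Idempotent delete: a second concurrent call just reports status."""
    with _state_lock:                # short critical section only
        if bucket_id in _deleting:
            return "IN_PROGRESS"     # someone else is already deleting it
        _deleting.add(bucket_id)
    try:
        do_delete(bucket_id)         # potentially takes hours: no lock held
    finally:
        with _state_lock:
            _deleting.discard(bucket_id)
    return "DONE"

print(delete_bucket("b1", lambda b: None))  # DONE
```

A re-entrant call while the first delete is still running sees `IN_PROGRESS` immediately instead of blocking behind the hour-long operation.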
D
The next worker might make that create call again while the original one is still hanging, because the socket maybe didn't go away. So the driver is still processing that create call, and then another create call comes in with the exact same parameters while you're still processing the first one, just because of the sidecar. You can't avoid situations like that, so best effort is probably right.
H
...the object storage infrastructure, because they also need to handle that situation, right? So we can just rely on that. With deletion, it's different, because, as I understand it, our S3 driver is going to go through all the objects and delete them one by one, or something like that. In that case, we probably cannot rely on the underlying object storage, and we have to have some mechanism of our own.
D
In ours, to ensure idempotency, you're going to need some kind of locking, because you have a name that comes in and you return an ID, and that ID should always be the same for a given set of inputs. You have to be careful about generating that ID, making sure that two different calls coming in with the same parameters at the same time couldn't generate two different IDs, because then you don't know which one is which.
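The race being described can be closed by serializing ID generation on the request name, as in this sketch. The helper is purely illustrative, not part of any agreed COSI interface:

```python
import threading
import uuid

_lock = threading.Lock()
_ids = {}  # bucket name -> generated id

def bucket_id_for(name):
    """Return the id for `name`, generating it at most once.

    The lock ensures two concurrent calls with the same name can never
    each mint a fresh id; the second caller always sees the first's.
    """
    with _lock:
        if name not in _ids:
            _ids[name] = str(uuid.uuid4())
        return _ids[name]

print(bucket_id_for("photos") == bucket_id_for("photos"))  # True
```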
H
Yeah, right. Correct me if I'm wrong. I can talk about GCS, for example: in GCS, when you create a bucket, the bucket name is actually an idempotency token. So if you try to create the same bucket with the same name, the backend will just properly return you some code, an error code or whatever. So you can actually rely on the backend there.
H
Yep. For S3, I'm not sure, but I believe in Amazon you can use something like an idempotency token parameter, and we can pass, for example, some sort of UID, and this UID will be the same each time. It could be taken from, I don't know, the UID of our object bucket resource, and we can pass it every time, so it will correspond to a specific underlying bucket.
B
Yeah, we're definitely going to pass that. Now, the question becomes... well, for creation, it's pretty straightforward.
B
Yeah, mind you, we don't do any sort of update bucket right now. Okay, I think... I can't find a flaw in this. I think this sounds good.
H
I think deletion is the tricky part, but it's a good question you raised, because deletion might be implemented in different ways by different object storage providers. And if some of these storage providers require all the objects to be deleted before the actual bucket is deleted, then we just need to architect our gRPC API, which is going to talk to these plugins, so that if you ask it to delete, sometimes the response will be something like "in progress," and you need to try again later.
B
Yeah, okay. So this clears up a lot; we know how we want to proceed with the design of this. Are there any other questions anyone wanted to bring up?
B
Yeah, I just read them before this meeting. I think that makes total sense. I'll quickly open it up, the kubernetes/enhancements repository, yeah. So the comment was that an AlreadyExists error is returned when you have different parameters sent for the same resource.
J
This way, you get true idempotency, because basically, if you call CreateBucket with exactly the same set of parameters, you always get the same result. So if it's a new CreateBucket request, after the bucket is created you get OK. But then if you call it again, it's more like a query, and that also returns OK, not AlreadyExists, because otherwise you're returning two different codes for the same input.
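The convention being described can be sketched like this: an identical repeat of CreateBucket acts as a query and returns OK, while the same name with different parameters gets AlreadyExists. The names and response shape are illustrative only:

```python
_buckets = {}  # bucket name -> parameters it was created with

def create_bucket(name, params):
    """Idempotent create: same inputs always produce the same output."""
    existing = _buckets.get(name)
    if existing is None:
        _buckets[name] = dict(params)
        return "OK", dict(params)       # newly created
    if existing == params:
        return "OK", dict(existing)     # identical repeat: acts like a query
    return "ALREADY_EXISTS", None       # same name, conflicting parameters

print(create_bucket("b", {"region": "us-east-1"})[0])  # OK
print(create_bucket("b", {"region": "us-east-1"})[0])  # OK
print(create_bucket("b", {"region": "eu-west-1"})[0])  # ALREADY_EXISTS
```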
J
The error code used by CSI is exactly the same, AlreadyExists. It's a little...
B
Doesn't Conflict also apply? We have that one as well.
B
Yeah, Conflict in Kubernetes generally means the resource version doesn't match when you're updating, yeah.
J
Now, that's the pure gRPC error code, but if you look at the CSI spec, the meaning is not exactly the same. If you go look at the CSI spec, it basically just says...
C
In CSI, the idea is that every call has to be idempotent, and idempotent means that the same response should be received for the same request, which is exactly what Sheng is saying. So if you have an existing object, then just return success and give us the fields of that object; that's it. This error code is reserved for specific cases where we want to point out that we can't complete an operation because it already exists.
C
Yeah, yeah. You depend on the storage vendor to tell you that; the storage vendor either relies on its own backend...
B
Nope, nobody knows. Okay, we can figure it out. Okay, that's it, that's it from my side today. I wanted to go over these questions that were on our minds, and I think we cleared them up for the most part.
H
Something like CreateSnapshot is considered non-blocking, right? And having CreateBucket be like that...
J
CreateSnapshot is blocking, actually. It is blocking, yeah. Well, it blocks until the snapshot is cut, but then you call it again. So it's blocking until the snapshot is cut, and then the driver can still be doing work in the backend.
J
I'm saying we actually call CreateSnapshot multiple times, so CreateSnapshot is also like a GetSnapshot, actually. Basically, it's supposed to block until the snapshot is cut, and then there's a second step, which is the upload for some cloud providers. In that case, basically, the central controller will call it again, effectively as a status query, until it is ready, when the upload is finished.
C
True, yeah. Basically, it comes down to this: we look at it on an operation-by-operation basis. If we believe that an operation is going to take a long time to complete, and CreateSnapshot was an example of that, and delete bucket is an example of that, then we follow this model. Otherwise, we just do blocking.
J
Delete bucket could be a two-step as well. Like I said, you have several steps; you could return in the middle. How do you decide when to?
C
I think it's up to the driver to decide what it wants to do. It can make it one step or two steps. If it's going to complete fairly quickly, it can complete it in one step: block, then return the success code and say, "I'm done, everything is successful." Or, if it knows it's a long-running operation, it can convert itself into a two-step process by returning immediately and saying, "hey, I've started this thing, but it's not complete."
C
"Please follow up and verify that I've completed it; this will be a long-running process." And then it is up to the caller, the sidecars, to continuously poll and get that status until it's finally deleted.
C
Yeah, so from that perspective it is slightly different, completely agreed, but I think we can follow a very similar model.