Meeting of Kubernetes Storage Special-Interest-Group (SIG) Volume Snapshot Workgroup - 17 September 2018
Find out more about the Storage SIG here: https://github.com/kubernetes/community/tree/master/sig-storage
Moderator: Jing Xu (Google)
A: So right now Xing, Saad, and I are working on a blog post for the snapshot feature. This feature will be available in version 1.12, and we continue to add all kinds of documentation, in Kubernetes and in CSI. So I'll put up the list later, I think. Last time we discussed some documentation for users, more focused on the user side, and some documentation that will be more for developers.
A: So we'll address all of those, and in today's meeting I want to focus more on a list of next steps, the features we want to have for snapshots after we have all those basic building blocks. I have a list, and everyone is welcome to give us feedback and add new things. The first one, which I think we discussed a little bit in the last meeting, is related to using finalizers for protection. There are a few scenarios where we might need some kind of protection.
A: The first is, for example: you want to delete a volume, but it is currently being snapshotted, or you want to delete a volume when it already has snapshots taken from it, because some storage plugins will not allow you to delete a volume if it has snapshots associated with it. Also, some volume plugins might have incremental snapshots, so you cannot delete a snapshot if there are other snapshots sourced from it. Last time we kind of agreed on this scenario.
B: The alternative that I had in mind was: you could require that the backend report success even if it didn't actually delete it, and then it's incumbent on the backend to figure out how to delete it when it can be deleted. So it's no longer Kubernetes's problem to worry about; as far as Kubernetes is concerned, it really is gone.
B: If we had conformance tests, those kinds of implementations could never pass them, because the conformance test is going to set up a known configuration, try a known sequence of steps, and expect everything to go as expected. If a backend can't delete a snapshot or volume, for whatever reason, it would fail the conformance test; even if retrying would eventually succeed, that's still not good enough.
B: So, speaking for NetApp: we at NetApp have had to deal with these kinds of cases. In fact, we have exactly the situation where we cannot delete an actual volume that has a snapshot, if the snapshot needs to stay around, because our implementation has the snapshot being a member of the volume. But it's not unrealistic for us to lie and say, yeah, we deleted that volume, and for us to basically keep track somewhere that when the last snapshot is gone, we need to delete the volume; we'll leave ourselves a breadcrumb.
B: Yeah, I think it sounds like the decision hasn't been made, and I'm not saying it has to go this particular way; I'm just pointing out that there's a choice to be made here. We can either surface the ugliness up to the end user, that sometimes snapshots can't be deleted or sometimes volumes can't be deleted, or we can say no, we want to provide a consistent experience and force the developers to deal with the ugliness on their side. Both options have issues. Okay.
C: So, Ben, right now we don't really have any restrictions, right? So basically it's really left to the backend; I think it's hard for Kubernetes to do.
A: Snapshots might already exist, and we only test for volumes currently, so we don't really have any restriction or anything; I'd say it just depends on the backend, whatever response it gives. For volumes, it's also possible that, outside of Kubernetes, snapshots were already created for one. And the current behavior for volumes, at least, is that it may return some error from the backend.
D: I have two comments. I agree with Ben that we shouldn't expose our ugliness to users; users shouldn't have to be aware of how this backend or that backend works. I think the better alternative to what Ben suggested, which was lying about successful deletions and things like that, would be for those backends that have a coupling between snapshots and their corresponding volumes.
D: They can decouple their lifecycles. For example, the NetApp backend that Ben mentioned can promote a snapshot to a volume, and that way the snapshot has a lifecycle independent of the volume it corresponds to; later on, when you want to create a volume from the snapshot, you're effectively cloning that volume. So yes, you would lose some storage efficiency, because now you're using more space and so on.
E: Do we have to enforce this? Let's imagine there is a backend that won't allow you to delete a volume unless all the snapshots are deleted. If that's the case and you go and delete the PVC, that process should basically fail, saying: I am unable to delete this volume because the snapshots need to be deleted first. The status has an error; the user comes along and reads it, and then they're responsible for doing the cleanup.
D: It's just like whenever we create a pod: the pod won't start unless all the dependencies for the pod, that is, the volumes, the PVCs, have corresponding PVs, and only then does the pod start. This is a similar type of dependency: you wouldn't delete a volume until you delete all the corresponding snapshots. So it's just a dependency users have to be aware of, so that they can address it. But...
B: It should be possible for this sequence of actions to succeed on your plugin; then you design the conformance tests to run those actions, and it's up to the person running them to ensure that the prerequisites are met, and then you do it and it passes, and you say: great, you conform. But we're setting up a situation here where it's just not possible to write that test.
D: So yeah, my point was that backends that want to support the way Kubernetes treats snapshots and volumes can emulate that by promoting a snapshot to a full volume and managing the snapshot lifecycle the way they manage a volume; that way, they're decoupled. So it's up to the plugin implementers to figure out the best way they can make this happen in Kubernetes.
E: That's an option that a storage vendor gets to make, right? They can do extra work to make a nicer user experience, or they can let the error bubble up to the user and let the user handle it. I think I agree with Ben's original point, which is that by trying to work around this at the Kubernetes layer, we introduce inconsistencies in the behavior of the Kubernetes API, and that's pretty bad. So I would prefer to leave this to the storage vendor to decide; they can choose.
A: And there are some trade-offs, right? If we hide the errors, if we say a deletion succeeded at a moment when it actually hasn't happened yet, it will cause some issues, definitely if the system is never able to remediate it and the user just has no knowledge about that. So I think the plugin can choose to do some smart things: try something, eventually delete the snapshots, and then return success. Yeah.
E: Or it just surfaces the error and says: sorry, in order to delete this volume, you first need to delete the snapshots. Which means the users are responsible for the cleanup, and it's not the end of the world, because you're not putting the user in a position that they cannot recover from; they would just need to go in and delete the snapshots first. But, you know, that's a decision you leave up to the storage vendor.
A: So I think for this we kind of agree, at least for now, right? We depend on the plugins to give us whatever response when you try to delete something; the delete operation returns whatever CSI returns, so it depends on the CSI response. Another scenario is more like when you, for example, try to delete a snapshot, but the snapshot is currently in use, meaning someone is trying to create a volume from the snapshot. I know some backends will not allow you to delete the snapshot; they will return an error, something like "resource in use", but I'm not sure this is consistent across all the backends. So it's possible, in this case, to protect at the Kubernetes layer with a finalizer, because it is more like a usage that we can monitor, and then we prevent it.
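As an illustration, here is a minimal Go sketch of the finalizer bookkeeping such protection could use. The finalizer name is hypothetical; a real controller would read and write it on the VolumeSnapshot's metadata.finalizers through the API server, which keeps a deleted object around until its finalizer list is empty.

```go
package main

import "fmt"

// Hypothetical finalizer name; a real controller may well pick another.
const snapshotProtectionFinalizer = "snapshot.storage.kubernetes.io/in-use-protection"

// addFinalizer returns the finalizer list with the protection finalizer
// added, if it is not already present. While any finalizer is present,
// the API server keeps a deleted object around instead of removing it.
func addFinalizer(finalizers []string) []string {
	for _, f := range finalizers {
		if f == snapshotProtectionFinalizer {
			return finalizers
		}
	}
	return append(finalizers, snapshotProtectionFinalizer)
}

// removeFinalizer drops the protection finalizer once nothing is using
// the snapshot anymore, letting the deletion actually proceed.
func removeFinalizer(finalizers []string) []string {
	var out []string
	for _, f := range finalizers {
		if f != snapshotProtectionFinalizer {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	finalizers := addFinalizer(nil)
	fmt.Println(finalizers)                  // snapshot in use: deletion is held back
	fmt.Println(removeFinalizer(finalizers)) // empty list: deletion proceeds
}
```

This is the same mechanism PVC protection already uses for volumes that are in use by pods.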
A: ...and never delete something that is in use. So in order to do this, we need to make sure we can see the other corner cases and prevent that from happening. So yes, if we add this nice feature, it prevents something bad from happening, but we might also cause something undesired. So, if we all agree, that's a nice feature. Sorry.
B: It occurs to me that this is exactly the same situation as what we were discussing in the first part of the meeting, which is: you could just let the plugin fail, and then it's up to the plugin to figure out whether it can do something smart or not, and not track any of that state in Kubernetes.
B: An attached volume is something Kubernetes has to know about, because it manages the pod and has to manage the attachment; but the relationship between a volume and a snapshot is not something Kubernetes manages, so you could make it out of scope and say: hey, do the right thing, otherwise users will suffer. Which, it sounds like, is what we're saying about the first case, the ability to delete a snapshot, or a volume that has snapshots: we're just pushing back on the plugin and saying, do the right thing.
E: The difference that I see between the first case and this case is that the workaround on the Kubernetes side in the first case would result in weird and inconsistent behavior, whereas the proposal here actually just gives us added protection without any weird behavior. So over here we can add protection, it'll make for a nicer experience, and we don't really have any negative side effects that I can see.
D: So the main difference between this problem and the first problem is that this one would benefit all plugins, whereas the first one would only benefit the plugins that had the limitation of coupling between snapshots and volumes. So if it makes sense, yeah, if it makes sense for all plugins, it makes sense to have it in Kubernetes.
A: It should apply to all plugins, yes. Yeah, I agree, it's not like a must-have, right? It's a nice feature to have, and we can see when we should do it. It definitely has some cost in the system. So, to kick it off: we can't make it mandatory, and we can work on it whenever we think it's a good time, yeah. So, can we move on to the next one? The second one on the list is the snapshot creation retry policy.
A: So right now, you know, snapshots are kind of unique, because we say we only create a snapshot once, and in case of failure we just report the error and the controller won't retry, because snapshots are kind of time-sensitive. If users say, okay, create it now, they might not want a retry at a much later time; they don't want the controller to keep retrying. Yeah.
B: I like the existing behavior; it makes sense to me. I actually hit this during my testing, because I created a PVC and then created a snapshot before the PV was even created, and it failed, for obvious reasons, and I was like: oh, what happened? Then I realized what happened and thought: oh, that makes perfect sense, you can't take a snapshot before the data exists. I mean, you wouldn't want it to retry after the data exists, because that would be weird.
G: I'm not sure it's a reasonable feature, because, you know, when you take a snapshot, you want it now, and there are other, higher-level snapshot policies and scheduling that will help you do snapshots at a specific time. So I'm not sure; it could mean a snapshot is taken at a time that is not the right time, not what the user meant.
E: I think it's an important point that snapshots are very much time-bound: when you expect to take a snapshot, you expect to take it within a certain time frame, which is why not having a retry makes perfect sense, like Ben said. But if there is some room for retry, we should enable that scenario. So what I was imagining was, instead of saying "oh, please retry for one or two minutes", what you say is: please retry up to a given timestamp.
E: So, you know, we quiesce a workload, for example, and we know that we have 30 seconds to take that snapshot, and so what we can do is pass that information to say: hey, for the next 29 seconds, please try to take the snapshot. So if it attempts a snapshot within five seconds and fails, it knows that before that timestamp it can continue to retry, but after that timestamp it has to give up. I think that would be potentially useful, yeah.
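As a rough illustration, here is a minimal Go sketch of that retry-up-to-a-timestamp idea. The takeSnapshot function is a hypothetical stand-in for the CSI CreateSnapshot call, the backoff policy is arbitrary, and the clock-skew and unspecified-deadline questions raised next are deliberately ignored.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// takeSnapshot stands in for the real CSI CreateSnapshot request; here
// it always fails so that the retry loop is exercised. (Hypothetical.)
func takeSnapshot() error { return errors.New("backend busy") }

// createWithDeadline retries snapshot creation until the supplied
// deadline passes, then gives up, as proposed in the discussion.
func createWithDeadline(deadline time.Time, backoff time.Duration) error {
	for {
		err := takeSnapshot()
		if err == nil {
			return nil
		}
		// Give up if the next attempt would land past the deadline.
		if time.Now().Add(backoff).After(deadline) {
			return fmt.Errorf("giving up at deadline: %v", err)
		}
		time.Sleep(backoff)
	}
}

func main() {
	// E.g. the workload is quiesced for a short window, so retry only
	// inside that window and give up afterwards.
	err := createWithDeadline(time.Now().Add(3*time.Second), time.Second)
	fmt.Println(err)
}
```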
B: Hey, if you're talking about something being slow or something being unreachable, the retry needs to bubble all the way up to whatever invoked the snapshot, typically, because usually there's something else going on: you're talking to the application that's managing the data, it's in a quiesced state while you're taking the snapshot, and if you can't get the snapshot taken by some timeout, you're going to unquiesce the application, because it needs to keep doing its work.
B: I would support the idea of a deadline, but you're going to have to solve the problems I mentioned: how do you specify it, what time zone is it in, how do you deal with clock skew, time drift, and all the other problems you can have? Plus, we'd have to define, if the deadline is not specified, does that mean retry forever, or does that mean retry zero times?
E: I think this idea is worth exploring. I would put it down as a design project for beta. We can think through what the design looks like, whether it makes sense with skew and time zones and all that kind of fun stuff, and then also, at the same time, see if there's actually a concrete need for this: are we seeing failures that, had they been retried, would have resulted in a successful creation of a snapshot?
A: Today we only support on-demand snapshots, and some users might want to have automatic snapshotting, like periodic snapshots, but I think this could also be done by a higher-level controller; even today, you could create a script and just create them periodically.
B: So you can do that, but I'll raise the issue that I raised a while ago, because this is an important issue to me. The main problem I have with doing those kinds of things through a higher-level controller is that Kubernetes will eventually know about a ridiculously large number of snapshots in that scenario, where, if you say: I'm going to take snapshots every hour, and I'm going to retain them for, you know, a week, and then I'm going to have another retention policy past a week...
B: You can end up with hundreds of snapshots for every one of your volumes, and most of them you're never going to care about, but Kubernetes will have to track all of them, and you'll end up accumulating thousands or tens of thousands of objects in the Kubernetes database that nobody ever wants to see, except for that one percent case where you're like: oh, I need to do a restore. And that's a pretty big tax to put on the system.
A: So one thing that might help, you know, is if we give the functionality so that a user can easily list all the snapshots taken for a volume, for a PVC. Then a user can query: okay, for this volume, what snapshots does it have? And also list, for all the snapshots, their volume. If we have this kind of functionality, then a higher-level controller can, I think, easily have some policies or ways to, say, delete the old ones and only keep a certain number, and then...
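For example, here is a minimal Go sketch of the kind of retention policy such a higher-level controller could apply once it can list snapshots by source PVC; the snapshotInfo type is a hypothetical, simplified view of a VolumeSnapshot object, not a real API type.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// snapshotInfo is a hypothetical, simplified view of a VolumeSnapshot
// object: its name, the PVC it was taken from, and when it was created.
type snapshotInfo struct {
	Name      string
	SourcePVC string
	Created   time.Time
}

// expired returns the snapshots of one PVC that a higher-level
// controller could delete so that only the `keep` newest remain.
func expired(snaps []snapshotInfo, pvc string, keep int) []snapshotInfo {
	var mine []snapshotInfo
	for _, s := range snaps {
		if s.SourcePVC == pvc {
			mine = append(mine, s)
		}
	}
	// Sort newest first; everything past `keep` is eligible for deletion.
	sort.Slice(mine, func(i, j int) bool { return mine[i].Created.After(mine[j].Created) })
	if len(mine) <= keep {
		return nil
	}
	return mine[keep:]
}

func main() {
	now := time.Now()
	snaps := []snapshotInfo{
		{"hourly-1", "my-pvc", now.Add(-3 * time.Hour)},
		{"hourly-2", "my-pvc", now.Add(-2 * time.Hour)},
		{"hourly-3", "my-pvc", now.Add(-1 * time.Hour)},
	}
	fmt.Println(expired(snaps, "my-pvc", 2)) // hourly-1 falls outside the policy
}
```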
B: I like the idea of having snapshots that exist outside of Kubernetes's knowledge up until the point when they're needed, at which point Kubernetes becomes aware of them and you can start using them; that unburdens Kubernetes from having to track, you know, a potentially huge number of objects. Yeah, you're right, some people will do it through Kubernetes: they'll create these snapshot objects and you'll end up with thousands of them, and if that's what they want...
B: That's okay, but it doesn't seem like a good design to me. So have an alternative, which is: something outside the system is automatically taking snapshots, automatically aging them out, and retaining them on some policy, and only when you decide, oh, I need to do a restore now, there's got to be a way to take that snapshot that exists outside the system and pull it in and say: okay, now I want to use this snapshot and go create a volume from it.
B: No, no, no, we already have the ListSnapshots RPC in CSI, and that is defined to return all the snapshots, whether they were created by Kubernetes or not. So there is a mechanism to get the information out of the plugin; it's just: what do you do once you have that list? If the object isn't already there in Kubernetes, do we have a way to take it and put it in?
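For context, ListSnapshots lets the caller filter by source volume. Here is a simplified Go sketch of that filtering; the types and the canned data are illustrative stand-ins, not the real generated gRPC bindings (those live in the container-storage-interface/spec repository, and a real call would also page through results).

```go
package main

import "fmt"

// Snapshot models (not reproduces) the snapshot entry returned by the
// CSI ListSnapshots RPC discussed above.
type Snapshot struct {
	SnapshotID     string
	SourceVolumeID string
}

// listSnapshots stands in for a gRPC call to the plugin's controller
// service, filtered by source volume as the CSI spec allows.
func listSnapshots(sourceVolumeID string) []Snapshot {
	// Hypothetical canned backend state: includes snapshots that were
	// never created through Kubernetes.
	all := []Snapshot{
		{"snap-1", "vol-a"},
		{"snap-2", "vol-b"},
		{"snap-3", "vol-a"},
	}
	var out []Snapshot
	for _, s := range all {
		if s.SourceVolumeID == sourceVolumeID {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// Everything the backend knows about vol-a, whether or not
	// Kubernetes created it.
	fmt.Println(listSnapshots("vol-a"))
}
```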
D: That's the easy part. The harder part is when the lifecycle of that snapshot is managed by the backend, and then you have to reconcile the Kubernetes state with the backend state; these are the pieces where the problem comes in. Importing it is easy, you can just import it from the backend; but then, once you create a corresponding snapshot object inside Kubernetes, let's say the storage backend garbage-collects all the snapshots that are more than a day old...
D: Then you have to somehow reconcile the Kubernetes state with the backend state, and this is where the complications come in. But I agree with Ben that this is an important use case that we should figure out. It's also somewhat related to the previous discussion we had regarding the deadlines, because the way Kubernetes works is, you know, the declarative model: eventually it's going to take some action, and there are no time guarantees for the eventuality of your action.
D: So if somebody wants a snapshot now, there is no guarantee that it would happen within, you know, five seconds. I'm not talking about the backend part of it; I'm just talking about all the activities that happen on the Kubernetes side: the controller may, you know, pick up some event with some delay.
A: I mean, about the list here: if you say, list all snapshots taken from this volume, from this PVC, right, what Kubernetes can do is just go through all the snapshot objects, check which ones relate to this volume, to this PVC, and then return the list. Another way is to go to the backend, list all the snapshots taken from this volume, and it can tell you the whole list. These two lists could be different.
B: As far as the Kubernetes API is concerned, yeah, you should only care about the ones Kubernetes knows about, because anything else is kind of ridiculous; but there has to be a way to get Kubernetes to know about these other snapshots. I think that's what Ardalan is getting at: some sort of an import, and then a way of...
D: Yeah, the way many of these storage systems that support scheduled snapshots work is that, for example, you specify policies to keep so many daily snapshots, so many hourly snapshots, so many weekly snapshots, so the storage system takes care of the snapshot lifecycle for you, right? So say we define a policy that keeps around, let's say, ten daily snapshots; then, as soon as you get to the eleventh day, your first snapshot gets deleted, and now, if Kubernetes has an object...
A: I think right now we do have that in the controller. It might not have this all the time, but the controller will periodically actually check the backend state. We have two objects, the snapshot and the snapshot content, and the snapshot content has the snapshot ID, and it can check the existence of that snapshot ID through the plugin, the backend. Yeah. You don't want to...
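To make that two-object pairing concrete, here is a heavily simplified Go sketch of the relationship being described; the real VolumeSnapshot and VolumeSnapshotContent types live in the Kubernetes snapshot controller code and carry many more fields, so the names here are abbreviated and illustrative.

```go
package main

import "fmt"

// VolumeSnapshot is the user-facing object, bound to a content object.
type VolumeSnapshot struct {
	Name                string
	SourcePVC           string // the PVC the snapshot was taken from
	SnapshotContentName string // binding to a VolumeSnapshotContent
}

// VolumeSnapshotContent carries the backend snapshot handle that the
// controller can periodically re-check against the plugin.
type VolumeSnapshotContent struct {
	Name           string
	SnapshotHandle string // backend snapshot ID
	SnapshotRef    string // back-pointer to the VolumeSnapshot
}

func main() {
	content := VolumeSnapshotContent{"content-1", "backend-snap-42", "snap-1"}
	snap := VolumeSnapshot{"snap-1", "my-pvc", content.Name}
	// The controller can verify that SnapshotHandle still exists on the
	// backend, which is the periodic check described above.
	fmt.Println(snap, content)
}
```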