VMware Velero Office Hours, 8 Oct 2020

From YouTube: Velero Office Hours - Oct 8, 2020

Description

See agenda and notes at https://hackmd.io/I3u1x0u9T46KhuYZN4LX-A

A

So mainly what I wanted to ask about is this. Say, for example, you go with an operator-based approach, or you provide your service on top of the Kubernetes platform, so mainly the CRD-based approach. The life cycle of all those resources sticks with the same cluster unless you have a solution like Velero. Velero basically helps you take a backup of all the Kubernetes resources and, let's say, because of a disaster,

A

if you wanted to migrate from one cluster to another cluster, Velero basically helps you restore to a totally new cluster, and then all the resources are brought back and everything works well. But the main problem is the restoration time. Basically, it depends on the size of the cluster and the size of the resources running on the cluster. Based on the size, it may take one hour.

A

It may even take two hours or more, again based on the size. To avoid that, could Velero support something like an active-passive method? What I mean is, for example, you can have two clusters in parallel: at any point in time one cluster is the active one and the other is the passive one.

A

So in case of a disaster, you can basically bring the passive one up as the active one, because Velero in between should help to keep the systems in sync. Velero is anyway helping to take the backups and all those things, so somehow we can make it possible.

A

We have Velero, so that, basically, you can avoid the restoration time, let's say two to three hours of downtime based on the size. So what do you think about this kind of use case? That's what I mainly wanted to ask.

B

So I'll jump in. You're asking specifically about the case where you want to have two clusters running and keep them in sync: one that is running backups, and another one that's continually restoring from the main cluster that's running?

A

Yeah, exactly.

B

So as far as I'm aware, at the moment we don't have any support within Velero to do that. Correct me if I'm wrong here, but I imagine you could write something around it, so that another cluster is able to access the backups from cluster A, for example. And if cluster B could access those backups, you could have some kind of other...

B

I imagine you'd have to write your own system around that to be able to continually run those restores. I think I did see your issue, and I did see Nolan commented on it that it would be a great feature to have, but it's not on the roadmap at the moment. So I think it's something that we can definitely raise with our product team and have a discussion, to try and get a sense of the priorities, because obviously we have to balance other things,

B

other features that we're currently working on. But I...

I think it would be a great feature to have, but I have to raise it with the rest of the team.
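A minimal sketch of the kind of wrapper described here, under stated assumptions: a small job on cluster B periodically looks at the backups from cluster A, picks the newest completed one, and builds a restore command for it. The dictionary shape (`name`, `phase`, `completed`) and both helper names are hypothetical and chosen for illustration; only the `velero restore create --from-backup` command line is taken from the Velero CLI. This is not a built-in Velero feature, and it is not true replication: as far as I know, a Velero restore skips resources that already exist in the target cluster, so repeated restores would not pick up changes to existing objects.

```python
from datetime import datetime


def pick_latest_completed(backups):
    """Return the name of the newest backup whose phase is "Completed".

    `backups` is a list of dicts in a hypothetical shape:
    {"name": str, "phase": str, "completed": ISO-8601 timestamp string}.
    Returns None when no backup has completed yet.
    """
    done = [b for b in backups if b["phase"] == "Completed"]
    if not done:
        return None
    newest = max(done, key=lambda b: datetime.fromisoformat(b["completed"]))
    return newest["name"]


def restore_command(backup_name):
    """Build the CLI invocation that restores from the given backup."""
    return ["velero", "restore", "create", "--from-backup", backup_name]
```

A scheduler on cluster B (a cron job, for example) could call `pick_latest_completed` on whatever backup listing it has access to and pass the result to `restore_command`, then run that command with `subprocess.run`.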

A

Yeah, definitely, because in general anyone would require this kind of feature. Of course, there are many things available outside as well, and you can also build your own tool to take the backup and restore. But even if you are doing the restore there, you will run into some downtime. If your cluster is very big, the downtime during the restoration will be longer.

A

So somehow we need to address the problem of how you can minimize the downtime of a system after a disaster. That's the main point I wanted to address. I was looking for a tool which can help in this aspect so that we can make use of it, and this would also be beneficial for anyone outside in general.

A

Everybody is looking for a tool. Of course they require backup and restore, and even while it is doing the restoration, they want to minimize the downtime in general. So that's where we need to bring up some kind of... I mean, this is one concept that I can think of, but there will be other concepts as well.

A

As the experts, you might know of other possibilities for how you can design the system. So I just wanted to put this across and ask what Velero basically thinks about this use case.

B

So, I will admit I'm pretty new to the Velero team, so I'm not actually familiar with the standard times that restores usually take. So my question is: is two to three hours a typical amount of time that's usually required for doing a restore on the cluster?

B

I'm just wondering if there's some interaction here with maybe the particular provider that's being used, or whether you're using restic or something like that, and whether that might cause the restores to take longer. I'm not sure.

C

I'm reading through this; I was just able to open this in Slack to understand what this is. So, I guess what you're suggesting is that we have a way to restore from a cluster without the step of packaging the cluster and unpackaging it, so just go directly

from one cluster to another, yes. And the reason why is so the restores are faster.

C

Is that why? What problem are you trying to solve? Is it the speed of the restore?

A

No, it's not the speed of the restore. My main target is to bring down the downtime. In the typical use case, the way everybody runs the system is that there will be one cluster, which is their productive cluster. They take backups continuously, so in case of a disaster they basically spin up a new cluster and restore it from the old cluster. Then everything is back up.

A

So then your system is back without any data loss, and everything will be up and running. But during the restoration, the solution will be in downtime. It may take half an hour; it may take one hour. It depends on the size of your cluster and on how many applications and services are running. It completely depends on that.

A

So my main target is to bring down the downtime during the restoration. I don't want to have a longer downtime. Let's say I'm running a single cluster and then restoring to a totally new cluster, and that takes, for example, one hour. I want to reduce that one hour to, for example, one minute, or five minutes maximum. How can I do that? That's the problem I'm trying to solve.

C

Okay, I understand now. To answer the question I would need some numbers, but even if I had the numbers, I don't know the answer off the top of my head. Like you already said, how long a restore takes depends on the size of the cluster that's being backed up.

I think the size limits that you have set for your clusters would also make a difference. Of course, the less memory you have, the slower any operation will be, so maybe increasing the memory would make a difference. I'm not sure; I'm just thinking through this. As far as the proposed solution for making restores faster, it's very much in our interest to go in that direction.

C

We would have to investigate, because if a restore is taking one hour, let's say, but the actual downloading from the storage and unpacking is not taking up most of that time, if, let's say, it's just a minute, then the solution is addressing the wrong problem, and the problem might be somewhere else. Does what I'm saying make sense?

A

Yeah, absolutely.

C

So maybe what we're talking about here with this issue is removing the middleman, removing the storage, so that instead of fetching from storage you fetch directly from the cluster. But if fetching from the storage is really not taking up most of the total time for the restore, I'm not sure that would be the solution. And if your restore is taking a ridiculously long time regardless of having to fetch it from the S3 storage, maybe there is something else there that needs to be looked at.

B

Yes, one thing I was just wondering is: whenever you take a backup and perform the restore, are you backing up the entire cluster? Maybe there are only subsets of the applications on there that you really need to have restored. So if you do have clusters running in parallel, cluster A and cluster B, and you want to fail over to cluster B, maybe you could already have certain services running in there, and then you only need to back up a subset of cluster A to get it back and functional.

A

So basically, I wanted to keep both clusters in sync. At any point in time only one cluster is active. Say, for example, I have cluster A and cluster B. Cluster A is my productive one, and I always keep cluster B in sync with cluster A, but B stays in passive mode; it is not the active one. Cluster A is the active one. Now let's say that, due to some situation, cluster A is down. That means I don't need to do a restore.

A

I don't need to do the restoration, because cluster B is already in sync with cluster A. I just need to switch cluster B to active, so my downtime is reduced. Cluster B is then active, and the application works as it did before, without much downtime. So this is the approach I was looking into with respect to Velero, but there may be other possibilities as well, like what Carlisia was mentioning. So maybe there are many possibilities, so we need...

A

Maybe if you can suggest the best one to us, that would be really great.

C

So for right now the only thing I can... So, I suppose you're using a schedule to do continuous backups and restores?

Right, and what I would look at is memory.

Tweaking the memory limits in your cluster for Velero to use would make a difference. If we were to have a sync feature, we would have to sit down and design it and see how it would fit in. Like Nolan said, it's not on the roadmap.

Something that would be implemented... we don't even know if it'll be implemented, but if it is, it might be a while.

D

What you're talking about is more replication than backup; they're two different concepts. Velero is focused on backup and on making sure that you have disaster recovery in case anything else is not working for you. Depending on your setup, you can have replication going to one place and backups going to a third place, so that way you're absolutely confident that you can survive a disaster. But yeah, what you describe is more replication than backup. Continuously replicating things, especially: that's replication.

A

Yeah, that's correct.

D

But yeah, we of course want to get the downtime down as much as possible, but restoring from backup will always take time. Restoring from replication will take less time.

A

Yeah, absolutely, that's the thing I wanted to address.

B

Thanks for raising the issue. Like I said, we can discuss this and some of the other points that have come up today, add those to the issue, and, like I said, raise it with our product team, maybe to investigate some of the other things to do with general performance and reducing the restore time.

A

Also, I heard from one of my teammates that they are actually using Velero in a productive environment. Their system is a very complex one, and they mentioned some issues with Velero handling secrets. Let's say you talk about distributed systems like Kyma. I'm not sure if you know about Kyma; it's an extension platform on top of, let's say, Knative.

A

So it's a serverless system, and they mentioned that they have issues with Velero while taking a backup and then during the restoration, because the system is pretty distributed.

A

It's a microservices architecture; it's pretty distributed. So when they restore it, they're kind of missing the secrets, and then the connections were not working as expected after the restoration. I don't have many details right now, so maybe I can get the details and then open an issue in the Velero GitHub. I just wanted to understand from you: have you also run into, or do you already know of, some sort of issues

A

that Velero is having right now, and how are you going to address those problems, mainly for complex distributed systems, for all the different sorts of Kubernetes resources? How effective is it after the restoration in very complex systems?

B

So I don't know specifically. Sorry, please clarify: were they saying that they were specifically having issues with restoring secrets, or was it the app that they were using for managing the secrets within their cluster? You mentioned Kyma; sorry, I missed the name of the thing that they were deploying. I'm not aware of any particular issues with large-scale restores.

But I think if you can speak to your teammates and maybe gather some more information, then, like you said, feel free to open an issue and we'll take a look at it.

A

Yeah, okay. I don't have the exact details right now, but I got to know about a few things. The one thing they talked about is mainly the secrets. For the other things I need to get more details from them. So mainly I was asking: does Velero currently have any known limitations for these complex distributed applications, so that there are issues after the restoration?

A

What we expect is that after the restoration everything should work as well as it was working before. That's what I mainly wanted to ask, but I can get the details.

C

So, with secrets being a Kubernetes object, they would be backed up by Velero. One recommendation that I have is for you to open the backup itself and investigate whether the secrets have been backed up, before even checking whether they were restored.

C

If they are in the actual backup, that's a good sign, and then we just have to focus on why they were not being restored. But if they were not even backed up, then you have to see how the backup is being created such that it's excluding the secrets, because there are ways to exclude things, and there are ways to include everything that you want. So that's one step that I'd suggest.
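The check suggested here, opening the backup and looking for the secrets, can be sketched as follows. This is a minimal sketch under assumptions: it takes the bytes of a backup tarball (for example, one fetched with `velero backup download`) and assumes entries are laid out as `resources/secrets/namespaces/<namespace>/<name>.json`, optionally with a version directory after `secrets`. The function name is hypothetical, and the layout should be verified against a real backup before relying on it.

```python
import io
import tarfile


def list_backed_up_secrets(tarball_bytes):
    """Return sorted (namespace, name) pairs for Secrets found in a backup.

    Assumed layout (verify against your own backup):
    resources/secrets/[<version>/]namespaces/<namespace>/<name>.json
    """
    found = set()
    with tarfile.open(fileobj=io.BytesIO(tarball_bytes), mode="r:gz") as archive:
        for member in archive.getmembers():
            # Drop empty and "." components so "./resources/..." also matches.
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if len(parts) < 5 or parts[0] != "resources" or parts[1] != "secrets":
                continue
            if parts[-3] != "namespaces" or not parts[-1].endswith(".json"):
                continue
            found.add((parts[-2], parts[-1][: -len(".json")]))
    return sorted(found)
```

If the secrets you expect are missing from the result, the next place to look is how the backup was created (include and exclude filters); if they are present, the problem is more likely on the restore side.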

A

Yeah, sure. So maybe I'll collect more details and then I will also open an issue.

I think that's all from my side, what I wanted to ask.

B

Well, thanks for joining and bringing us your questions. We'll definitely follow up on the issue, and we'll look out for the second issue that you mentioned once you get more information from your teammates.

A

Okay, then I think I may leave the call, because I'm connected from the Indian time zone; it's 11:30.

B

Yeah, you should go and get some rest, but thanks for joining us, and we'll chat on GitHub.

A

Thank you.
