Kubernetes Data Protection Working Group, 16 Jun 2021

From YouTube: Kubernetes Data Protection WG Bi-Weekly Meeting 20210616

Description

Kubernetes Data Protection WG Bi-Weekly Meeting - 16 June 2021

Meeting Notes/Agenda: -

Find out more about the Data Protection WG here: https://github.com/kubernetes/community/tree/master/wg-data-protection

Moderator: Xing Yang (VMware)

A

Hello, everyone. Today is June 16, 2021. This is the Kubernetes Data Protection Working Group meeting.

A

So today we have a presentation by Tom on how to back up data services outside the cluster, and then we will do an update on the data protection white paper.

A

If anyone has anything else you want to talk about, please feel free to add it to the agenda doc.

A

So, Tom, I think I made you a co-host. Let me stop sharing so you can share.

B

Yeah great thanks.

B

So, can everyone see my screen? Okay, yes, great. So we got a few questions on how to do data protection for managed services, and so I put together a few slides just to spark a discussion here.

B

This is the approach that we kind of take, but I think it's possible to have different approaches as well.

B

So I first wanted to compare this to how you can do volume backups. I think Ben's talked about this a little bit, but it's possible, even with just CSI, to take backups of volumes, because you can do a CSI snapshot and a CSI restore and extract data from that restored volume. So this works even in cases where the snapshots on your storage system are not that persistent and you can't rely on them for a backup.
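
A minimal sketch of the two Kubernetes objects involved in the CSI snapshot and CSI restore steps described above. It only builds the manifests as Python dictionaries; the object names, namespace, storage class, and size are placeholder assumptions, and extracting the data from the restored volume is left out.

```python
import json


def volume_snapshot(name, namespace, pvc_name, snapshot_class):
    """Build a VolumeSnapshot manifest that snapshots an existing PVC."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


def restored_pvc(name, namespace, snapshot_name, storage_class, size):
    """Build a PVC manifest whose contents come from a VolumeSnapshot."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "storageClassName": storage_class,
            "dataSource": {
                "name": snapshot_name,
                "kind": "VolumeSnapshot",
                "apiGroup": "snapshot.storage.k8s.io",
            },
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # All names below are illustrative placeholders.
    snap = volume_snapshot("db-snap", "prod", "db-data", "csi-snapclass")
    pvc = restored_pvc("db-data-restore", "prod", "db-snap", "standard", "10Gi")
    print(json.dumps([snap, pvc], indent=2))
```

A backup tool would then mount the restored claim in a pod and copy the data to its own repository.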

B

Alternatively, you might be able to extract data directly through APIs from the provider. There are data path APIs in some providers that let you extract data directly to your backup repositories.

B

And finally, the last approach, which I think is a good default, is to consider snapshots durable. With a lot of cloud providers, AWS EBS for example, a storage snapshot, a block snapshot, is actually pretty durable. It's stored in S3, and if you want to change your failure domains, you can even copy that snapshot between different providers.

B

Does this make sense? Because I think I'm going to draw a parallel from this to data services as well.

C

So to clarify approach number three: you mean don't change the existing API, just implement it differently, in such a way that when you take a snapshot, you get something more durable than a local copy?

B

Yeah, so if you use the AWS EBS snapshot API, the snapshots are more durable than in other types of systems. They end

D

B

up getting exported to S3 automatically under the covers, and maybe you don't go through CSI or any type of Kubernetes API to take those snapshots. And if you do that, you can actually control the snapshot completely through the APIs.

A

I was going to say, you can actually take a CSI snapshot and then that will be durable, yeah.

E

F

As well right so what.

A

F

You don't have.

A

To go that part.

C

That's something we've prototyped, where you actually use the Kubernetes snapshot API, but you establish variables in the snapshot class that say this snapshot should go to a safe place, such that it could be used as a backup and not just a local copy. And then you can leverage all the existing API stuff and get something that resembles a backup. You just don't have a lot of leverage to control it other than the snapshot class.
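
As a rough illustration of the snapshot class idea described above, the sketch below builds a VolumeSnapshotClass manifest. The `parameters` map is opaque to Kubernetes and is interpreted only by the CSI driver, so the driver name and the parameter keys here are hypothetical placeholders, not options of any real driver.

```python
import json


def durable_snapshot_class(name, driver, parameters):
    """Build a VolumeSnapshotClass manifest.

    `parameters` is passed through to the CSI driver unchanged; a driver
    could use it to decide that snapshots should be copied somewhere safe.
    """
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshotClass",
        "metadata": {"name": name},
        "driver": driver,
        "deletionPolicy": "Retain",
        "parameters": dict(parameters),
    }


if __name__ == "__main__":
    # Driver name and parameter keys are invented for illustration only.
    snapshot_class = durable_snapshot_class(
        "durable-snapshots",
        "csi.example.com",
        {"uploadToObjectStore": "true", "targetBucket": "backup-bucket"},
    )
    print(json.dumps(snapshot_class, indent=2))
```

The limitation raised in the discussion still applies: everything the user can control has to fit into this one class-level map.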

B

That's a great point. I think you're absolutely right. The implementation that we use here, if you go provider-specific, we kind of do provider for everything.

A

Oh, so you are actually using the provider, so you're basically going to the vendor directly rather than using CSI. Okay, I see.

B

If, if, but actually,.

A

That's not... yeah, I mean, it's actually not necessary. I'm just saying that you can go the CSI route and achieve the same result for the durable case, if they support that.

E

I don't think the issue is the API, you know, whether you use CSI or a provider or some cloud-specific API. What really matters is the implementation.

A

E

The storage backend.

G

What Tom's trying to point out here is that there's basically a fundamental disconnect between on-prem vendors and the cloud. The cloud vendors went ahead and implemented durable snapshots under the covers behind their snapshot APIs. So, for example, Velero came out of the cloud originally, so it expects that these snapshots are durable, and Xing and I went through all kinds of backflips in order to make this work on vSphere.

G

So I think the thing to take away is that the cloud gives you this already, and we're starting to see this come into some of the on-prem stuff.

G

But those durable snapshots are really a viable way to do things.

C

But I feel like there's almost a fourth case that's a blend between one and three here, one that is different enough from three that it is worth spelling out: it is a CSI snapshot, it is a CSI restore, but instead of extracting the data from the volume, the snapshot itself is just durable.

A

Right, I thought.

C

That's exactly.

A

What I thought number three meant. Okay, yeah, maybe you need to add a 3a or 3b or something, or even a four. So you still keep your original provider snapshot that does not involve CSI, and then the second case is basically that the CSI driver just handles everything: when you take a CSI snapshot, it automatically does the upload.

A

I mean, you don't care which layer does the upload, right? As long as I use this volume snapshot API, I get a durable snapshot.

E

I guess the point Ben and I want to make is that we have two different types of snapshots. Also, the term durable is a little vague. I think we have two types of snapshots: one where it's in a different fault domain than the volume, versus not, right?

H

A

E

So you know for.

A

Example, the snapshot is actually a backup. Basically, you take a snapshot and it becomes a backup already.

C

Yeah, but without defining what snapshot and backup mean that that distinction is is meaningless. So I think arta-long's trying to.

A

Or durable snapshot. Well, I think most people on this call know this anyway, because if you know the CSI spec, we all know that it actually supports two types of snapshot.

C

But the problem is the poor Kubernetes user who's trying to consume this. He doesn't know the difference, right? Because Kubernetes hides it from him.

E

The way we're talking about it looks as if CSI snapshots are inferior to the durable snapshots of approach three. The way this is phrased right now, it looks like one is better than

A

the other one, which is not the case. We should not be doing that.

I

So this is also a use case thing. You take regular snapshots, which are stored in the storage system, and you use them if you have to revert, say you corrupted your database somewhere. And a snapshot that gets copied out to object storage somewhere, which is how the cloud providers do it, you use when you have to restore from failure, because your local storage system is down. So they have two different sets of use cases.

I

So we shouldn't just kind of say: you know durable and non-durable right, yeah.

A

I think, Tom, you now see the confusion people are having. I think most people know the differences, but when you're comparing these, it looks like the cloud provider snapshots are better. That's not what we want to say.

A

I don't think that is the intention of the CSI spec. When we introduced it, when we added that, list snapshots, it was to make sure that when you're taking a snapshot that actually gets uploaded, you have to wait for that to be uploaded. So approach one should not just be CSI, right? CSI can be either a local snapshot or a remote snapshot. So it is weird when we name it like this.

B

Yeah, I think that's a good point, so um yeah there's tons of points here. It's.

A

It's not a different approach. You're basically talking about different storage systems now; it's not like you're comparing different approaches, right? The approach is always that either you use the Kubernetes volume snapshot API or you use the vendor's, whatever you call it, provider API. So under CSI, I think approach one, CSI snapshot, should have like two cases. One is in case you're doing local.

A

Then you actually have three ways or something like that, right? If it's local, then you have to do this two-step, but you can also do this inside your CSI driver. You can do the upload or something like that.

A

So I think approach one should include two or three sub-cases, something like that, and then number two is provider only. Instead of so many approaches, maybe just two big categories, CSI or provider, and then under each you have sub-cases.

A

So, whether you support local, I mean.

G

Well, I think this is interesting, but I really want to see where tom's going. Maybe we could yeah.

A

Let's not beat up Tom's slide.

G

F

A

Actually, this is not the goal of this, because I would assume most people know about all of this already. Maybe not everybody, but most people should know the difference already. So, Tom, maybe you can update this later. We can add comments to your slide deck afterwards.

A

How to make this clearer. But then, yeah, why don't you move forward? Because we want to know.

G

Yeah, it sounds like a good topic for a future meeting is snapshot durability and how we start referring to them, because we've been calling them durable. I don't know where that started. I know we started inside VMware, and if there are alternate terms, we can certainly discuss those.

A

Okay, yeah, maybe okay, maybe we should yeah. Maybe we could do that yeah, maybe social. Let's talk about this actually.

J

It looks good yeah.

A

B

E

J

It's Tom sharing their experiences, so it's nothing about what.

A

J

Maybe let's actually.

A

Move forward with the core of what you're trying to explain, rather than comparing. I know, I get it. I think it's almost like we're talking about whose snapshot is better. That's not what we want to focus on.

B

Yeah, and I think for nomenclature, when I say backups, I'm talking about what you do with local snapshots, and not local snapshots only, although there are obviously very valid use cases for local snapshots only as well, as people pointed out. But yeah, I love this conversation. I think everyone had some really good points. But maybe I'll move on. We can beat up my next slides as well.

J

Tom, I have a quick question. Under what cases would you rather call provider-provided snapshot interfaces instead of using CSI snapshots? Is that because CSI snapshots do not have sufficient flexibility for you to be able to create remote snapshots in cloud systems?

B

Yes. So the most common requirement... I know Google Persistent Disk actually doesn't have this problem, because most of their disks are not regional, they're global. But in AWS the snapshots are actually tied to a region, because they're in S3, which is an S3 bucket in a region, although I think they might have released some global snapshots as well. So a common use case for us is actually taking a snapshot of an EBS volume and then copying that to a different region for extra durability.

A

Oh, even even for that, okay, actually.

J

Is the CSI snapshot not capable of doing that? I thought you could achieve that by passing opaque parameters into the driver.

J

We, I don't have experience with.

B

E

I think we're going back to the original.

B

Conversation again.

E

B

Know like basically the csi.

K

E

You know, you can also back it up to, let's say, multi-regional buckets, and then you get all the benefits of what Tom is talking about here. So again, the way this is phrased is like one is inferior to the other, which

J

Is not the case.

E

There's no limitation of the CSI API. There's no limitation of, you know.

J

We want to understand that, because I do know that at least CSI does have some limitations: it doesn't cover the rich functionality in cloud-provided interfaces, right? So.

G

Right. We started a lot of this stuff before CSI snapshots were available, so we have cloud-provider-specific plugins that we're looking to move towards CSI. And then we have the on-prem cases where we need to be able to take a local snapshot, but then we need to be able to move the data someplace else, and that's not really available via CSI yet. So I've been kind of dragging my feet.

J

On moving us completely to CSI. That's exactly my point, right? What I'm saying here is: do we need to enrich our CSI interface? So back.

E

To the comment that.

J

E

You know like we have already enabled this through the existing csi apis.

A

So you could do that through their storage of a snapshot.

E

L

A

But should we actually add something to make this a first-class parameter there? Is that a good idea? I feel like we are actually going down this other route. Maybe we need a different... Let's.

C

Let's schedule a future meeting.

A

C

Talk about enhancements to the snapshot.

A

C

And and the table all of that stuff for that.

A

Yeah, let's do that. It looks like there is a lot of interest in that, so maybe we should have a separate discussion on it. And Xiangqian, we should sync up on that before we talk about it.

J

A

All right, let's uh yeah, we'll, do the next slide.

B

No, I definitely love that, because I think we're in the same boat as Velero, where we have some legacy implementations that maybe could be updated to use CSI, but I'm not sure all providers support what we need. So we have to dig to the second level of detail there.

A

B

So for data services, you actually have a pretty similar problem, and so I want to frame the primitives that we have. Obviously there is the ability to take snapshots through APIs in the provider, which could be similar to the provider snapshots that I talked about for volumes. These APIs exist across all the different cloud providers. You may or may not have them if you have managed services on-prem.

B

So if you have a team running these things, you may or may not have snapshots, and they may have different levels of consistency as well. But something that most of these data services have in common is logical snapshots: you can always take a logical dump of a database. And the model that we use, which isn't the only model, is that you have a reference to the data service inside your cluster.

B

So when I go back up, for example, a namespace, maybe I have ConfigMaps and Secrets that tell me how to use that data service and how to connect to it. And this is actually a pattern you see not just for data protection; you see it even for normal application development. You can imagine I want to consume just a database and that's all, and the way I can do that is by writing a

B

ConfigMap and Secret that have the connection information that I need to connect to it. And the thing you need for data protection looks very similar, right? You essentially just need a connection, and you can do logical dumps through that connection layer.
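
To make the pattern above concrete, here is a small sketch that turns a ConfigMap and a Secret holding connection information into a `pg_dump` invocation. The ConfigMap and Secret key names (`host`, `port`, `username`, `database`, `password`) are assumptions chosen for illustration; the sketch only builds the command and environment and does not run anything.

```python
import base64


def pg_dump_invocation(config_map, secret, output_path):
    """Return (argv, env) for a logical dump using in-cluster connection info.

    Secret values are base64-encoded under `data`, as in a Kubernetes Secret.
    """
    cfg = config_map["data"]
    password = base64.b64decode(secret["data"]["password"]).decode("utf-8")
    argv = [
        "pg_dump",
        "--host", cfg["host"],
        "--port", cfg["port"],
        "--username", cfg["username"],
        "--dbname", cfg["database"],
        "--format=custom",
        "--file", output_path,
    ]
    # pg_dump reads the password from the PGPASSWORD environment variable.
    env = {"PGPASSWORD": password}
    return argv, env


if __name__ == "__main__":
    # Example objects with made-up values.
    config_map = {
        "kind": "ConfigMap",
        "data": {
            "host": "orders.example.internal",
            "port": "5432",
            "username": "backup",
            "database": "orders",
        },
    }
    secret = {
        "kind": "Secret",
        "data": {"password": base64.b64encode(b"not-a-real-password").decode()},
    }
    argv, env = pg_dump_invocation(config_map, secret, "/backups/orders.dump")
    print(" ".join(argv))
```

The same two objects are what an application would read to connect, which is why backing up the namespace also captures what the backup tool needs.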

B

Does this make sense?

E

So basically, what you mean here by data service is something outside of a volume, outside of

B

CSI, right? Yeah, so something outside the cluster. This is pretty common in cloud providers, where you have things like managed databases, but we're actually seeing this with on-prem customers too, right? So if.

E

I guess any on-prem databases would be similar, right? Like if you have an on-prem MySQL or Redis or whatever, if you want to take a snapshot, you probably need a similar mechanism.

B

Exactly, yeah. This is just an example with Postgres, but you can apply it to many different things, right? We mostly see customers using this in the cloud, but we have had a request for on-prem as well.

F

For the in-cloud case, have you played with or looked around at what Amazon's doing with ACK? Is that something you guys think is interesting, or is it too much of a one-off for them? What's your perspective there?

B

I haven't played with that. Is that related to their backup manager?

F

No, no, this is sort of their attempt to connect things like RDS into the cluster, so that theoretically, inside the cluster, you could allocate external resources and whatnot. It's still super early, but they've been talking about it for the last couple of months.

B

Oh, it's cool. I haven't seen that. That is very interesting, because really all you need is the connection information from the cluster, and then, if you back up that namespace or whatever, the idea is that you can use that connection information to grab the backup of the thing.

F

And I'm not sure they've gotten APIs yet that allow the backup or the snapshot stuff to happen. They've been trying to do more provisioning and whatnot. It's an interesting path, but I think, at least the way they act, we're the only ones that have been talking to them about why backup isn't in there. So I was maybe hoping to get some other people to make that feature request and see if we can get them to expand it.

B

Yeah, that's great. I'll take a note to reach out, because that is very interesting, certainly.

A

Steve, can you also add a note in the agenda doc, whatever you were yeah.

K

A

B

And so, combining these two things, you can actually get a similar set of approaches to what we talked about earlier. Maybe it's only three instead of four, or maybe I'm missing a few here as well. But for any managed database, you can imagine just taking a logical dump. With connection information, you can always go and do essentially a mysqldump, a Postgres dump, etc.

B

The downsides, of course: there are a few big ones. Performance: usually these dump tools use, for example, consistent reads, which actually have a big impact on the database, right? The database itself is tracking all the changes for long-running transactions.

B

It can have a really big impact on performance, and so something that we've gotten several requests for is to integrate both logical dumps and provider snapshots, because many of these, RDS for example, have direct snapshots you can take, and to combine the two approaches. So you can imagine doing something similar to what I talked about for volumes, which is: you take a provider snapshot, which in this case will be an RDS snapshot.

B

You restore that database, so you have essentially a temporary copy of it, and then you can do a logical dump from that connection. The benefit of making it logical is that not only can you restore it in RDS or the same service, but it should actually be kind of portable.
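
The snapshot, restore to a temporary copy, then logical dump sequence described above can be sketched as a small orchestration function. The provider interface here (`create_snapshot`, `restore_to_temporary_instance`, `delete_instance`) is hypothetical and is backed by an in-memory stand-in rather than any real cloud API.

```python
class InMemoryProvider:
    """Hypothetical stand-in for a managed database provider API."""

    def __init__(self):
        self.calls = []              # records the order of operations
        self.live_instances = set()  # temporary instances still running

    def create_snapshot(self, instance_id):
        self.calls.append(("snapshot", instance_id))
        return f"snap-of-{instance_id}"

    def restore_to_temporary_instance(self, snapshot_id):
        temp_id = f"temp-from-{snapshot_id}"
        self.calls.append(("restore", snapshot_id))
        self.live_instances.add(temp_id)
        return temp_id

    def delete_instance(self, instance_id):
        self.calls.append(("delete", instance_id))
        self.live_instances.discard(instance_id)


def backup_via_temporary_copy(provider, instance_id, logical_dump):
    """Dump from a restored copy so the production database is not loaded."""
    snapshot_id = provider.create_snapshot(instance_id)
    temp_id = provider.restore_to_temporary_instance(snapshot_id)
    try:
        return logical_dump(temp_id)
    finally:
        # Always clean up the temporary copy, even if the dump fails.
        provider.delete_instance(temp_id)


if __name__ == "__main__":
    provider = InMemoryProvider()
    result = backup_via_temporary_copy(
        provider, "orders-db", lambda target: f"dump:{target}"
    )
    print(result)
    print(provider.calls)
```

The design choice is that the expensive consistent read happens against the temporary copy, while the production instance only pays for the provider snapshot.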

B

You should be able to instantiate a new instance of MySQL or Postgres in your cluster, or in a different provider or a different cluster, and restore that logical dump into that database. And again, just like with volumes... Actually, I should probably change this, maybe like this cloud. You can do something similar to volumes, where you take just snapshots and then use APIs inside the provider to manage them, so you can copy them between different regions. You can kind of change

B

the failure domains with that. And those are really the main ways that we integrate with these managed services. I wouldn't say it's a huge percentage of what we do; most of the things we back up are usually in Kubernetes, but we definitely do see some use cases for managed services here.

B

Do these approaches make sense?

M

A comment on the service, the application backup in general: one of the things that we seem to lack is the relationship between the components of a service. For example, when we have an application that uses multiple parts, at backup time I think it might be okay, but when we get to restore time, if one component is not up, the other one will not even be functional.

M

So when we restore, we have to restore them in a specific order for the whole service to be functional. And I haven't seen this captured in this document or in many of the restore services that we saw here. Because without those functioning, for example, one component can come up and

M

it doesn't see the other components up, and it will time out and shut down that component. Of course there are many ways to get around it, but that just shows the importance of the dependencies between the components of a service, of an application, when we do the restore. This is my two cents on this.

B

I think you're absolutely right. I would say that's a general problem, though, not specific to managed services. The easy answer is what we do: essentially, we treat this connection configuration as the object that we want to back up, because that's what we have the reference to in Kubernetes. And so, if you have an ordering constraint, then you have to include that constraint when you restore these.

B

So the workflow is essentially that I go and restore objects in Kubernetes in some order, with dependencies between them. In this case it would be like I'm restoring these things, but really what I'm restoring is both the configuration and whatever hooks I need to go and restore the managed services.
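
One simple way to express the ordering constraint discussed above is a dependency graph that is topologically sorted before restore. This is a sketch using Python's standard `graphlib` module (Python 3.9+); the component names are made up for illustration.

```python
from graphlib import TopologicalSorter


def restore_order(dependencies):
    """Return a restore order in which every item follows its dependencies.

    `dependencies` maps each item to the set of items that must be
    restored before it.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Hypothetical components of one application.
    dependencies = {
        "managed-db-restore-hook": set(),
        "db-connection-config": {"managed-db-restore-hook"},
        "api-server": {"db-connection-config"},
        "web-frontend": {"api-server"},
    }
    for step, item in enumerate(restore_order(dependencies), start=1):
        print(step, item)
```

A cycle in the dependencies raises `graphlib.CycleError`, which is a useful early signal that the ordering constraints cannot be satisfied.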

G

Yeah, I think there's also kind of the in-between case, which is the operator-driven applications running in the cluster, where the operator is responsible for standing things up. Like the Zalando Postgres operator I've been messing with: they've got a backup in there that's based on, I think, WAL, the write-ahead log, shipping, among other things.

B

Yeah, that's more complex even than this, because you first have to restore the operator. Well, this is one approach: you first restore the operator, then you restore the instances. But in that case you're also having to deal with replication with WAL, so you have to go and figure out how you want to apply them, which means you probably need some kind of custom scripting and hooks to do that.

G

Well, actually, the operator has a backup and restore command. (Oh, great.) So you can tell it: restore snapshot X. So it's kind of in between the two, right? But it's using in-cluster resources, so there are persistent volumes in there that you probably shouldn't snapshot and back up.

B

Yeah, that's great.

B

That is a good example use case, actually.

B

Okay, cool this slide was a lot less controversial than this one, so uh you know feel free to the.

E

The part that is not clear to me is: what is the goal? Is it to come up with a generic mechanism for these managed services, or is it to just formalize what needs to be done? What's the goal here with this?

J

The goal here is for Tom to share the experience of doing this, to see how externally hosted applications like Cloud SQL, or RDS in AWS, this kind of application, are protected within the Kubernetes context.

J

G

Really, to talk about.

J

Yeah yeah, so this is basically.

N

E

Kasten does it today, and we want to know whether we can generalize and formalize this process or not. We haven't gone that

A

Far yet. We're just.

E

Learning what Kasten has done so far.

A

Yeah, so we basically got a question during our session at KubeCon. Xiangqian actually got this question. I'm not sure if the person who brought this up is on this call. (Yes, I'm here.) Oh, hi. Maybe you want to talk about what your original ask was, so that everybody is clear.

O

Thanks. The question was about making a consistent backup of an application while using services from cloud providers, such as managed databases and other things. So this is exactly the problem we're having.

A

Okay, does this presentation help?

E

So the part that is a little vague to me is, you know, whether we're talking about different cloud services or different databases, each one is done differently. There are different steps involved, there is a different ordering of things, and, you know, whether we want to formalize

E

what needs to be done. Is that the goal? I guess, because I imagine.

I

E

Like, for example, Amazon may do things one way, Google may do things a different way, or if you have MySQL on-prem versus Postgres, things are done differently. And, you know, whether it's our job to provide a common layer so things are done uniformly across different

E

services or across different databases, or whether it's really the job of the operator for those applications or those services to do the right thing based on what needs to be involved. I guess that's the part that is unclear to me: whether we want to come up with a unified API or whether we want to just leave it to the operators to do the right thing.

G

So the trick that I've been seeing is that when you have one service, it's not a problem. You go into RDS, you say back this up every day, and that's the only service that your cluster uses. Fine, not a problem. But you start proliferating services, and you start proliferating clusters using services, using instances, and now you've got this big management problem of: is my stuff being backed up? Is it being scheduled properly? I stood up a new cluster:

G

did I go to RDS and set up a schedule to back up the database there? So that's something where, from the Velero side, and I think the Kasten side as well, we're really looking at being the orchestrator of a lot of stuff. And sometimes in these cases, even CSI snapshots where EBS is really doing the data movement, or these RDS snapshots, which are very analogous, we're not really a backup utility, because we don't actually copy the data, but we are orchestrating the overall backup.

G

So the discovery of the services that are being used is kind of the trick here, and I'm kind of curious, Tom: these ConfigMaps are semi-custom, right? So is it up to the user to tell Kasten or Kanister about these, or do you have a mechanism for figuring these out?

B

We don't have a discovery mechanism for these. We do ask the customer to do it. You could actually script it, though, but I think the issue is that you end up with way more services than you want to back up with a specific cluster. You know, I can

B

go and query AWS for all the RDS instances that I can see in some security group, and then I can generate the ConfigMap and connection information, but the downside is that most of those services probably aren't being used from inside this cluster.

D

B

Our hope is that, in fact, if they're using applications that connect to those databases, they would actually use Kubernetes objects to represent that configuration anyway. But yeah, it's all customer driven. We don't have a discovery mechanism.
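
The scripting idea mentioned above could look roughly like the sketch below. It takes instance descriptions shaped like the entries returned by the RDS DescribeDBInstances API (`DBInstanceIdentifier`, `Engine`, `Endpoint.Address`, `Endpoint.Port`) and generates one ConfigMap per instance. The `in_use` filter, the ConfigMap naming scheme, and the data keys are assumptions for illustration; the filter reflects the point that most discovered instances are not used by this cluster. No AWS call is made here.

```python
def config_maps_for_instances(instances, namespace, in_use):
    """Generate connection ConfigMaps for the instances this cluster uses."""
    config_maps = []
    for instance in instances:
        identifier = instance["DBInstanceIdentifier"]
        if identifier not in in_use:
            # Skip databases that this cluster does not consume.
            continue
        config_maps.append({
            "apiVersion": "v1",
            "kind": "ConfigMap",
            "metadata": {
                "name": f"{identifier}-connection",  # hypothetical naming
                "namespace": namespace,
            },
            # ConfigMap values must be strings, so the port is converted.
            "data": {
                "engine": instance["Engine"],
                "host": instance["Endpoint"]["Address"],
                "port": str(instance["Endpoint"]["Port"]),
            },
        })
    return config_maps


if __name__ == "__main__":
    # Made-up instance descriptions standing in for a provider query result.
    discovered = [
        {"DBInstanceIdentifier": "orders", "Engine": "postgres",
         "Endpoint": {"Address": "orders.example.internal", "Port": 5432}},
        {"DBInstanceIdentifier": "analytics", "Engine": "mysql",
         "Endpoint": {"Address": "analytics.example.internal", "Port": 3306}},
    ]
    for cm in config_maps_for_instances(discovered, "shop", in_use={"orders"}):
        print(cm["metadata"]["name"], cm["data"])
```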

G

Yeah, I think from our side, we've got Tanzu, which also has Tanzu Data Services, and that covers, you know, we have Greenplum, which is distributed Postgres, we have a Postgres operator, and a few other things. We're looking at how to get those services, when they're in use via Kubernetes, to be exposed so we can handle them with data protection. So we're kind of in the early stages of some of that work.

B

Yeah, that's great. There's a project from AppsCode that was actually pretty cool, that put all this stuff in a CRD, and the idea was that if that was standardized, you could build tools on top of it. I didn't see it gain too much traction, but I really liked the idea.

B

If we're going to propose a standard, this might actually be a good thing to get this group on board around. But for now, if you can use ConfigMaps and Secrets, that's kind of our prescribed approach.

G

Yeah, certainly it's a viable approach. Personally, I really want to push towards operator-driven stuff for this. You can easily see an operator living in front of RDS, even, where you can create a resource like an RDS database. The operator goes and talks to RDS and sets it up, and that way a lot of these details of how do I back up and restore this object get pushed inside the object itself, into the operator, if you will.

B

Have you seen Crossplane? Have you dug into that at all? (Sounds familiar, but I don't remember it.) It's from Upbound. It's some of the people who worked on Rook, actually; they started a new company, Upbound, to work on it. I think they're trying to do that, essentially. They kind of integrate with, like, Terraform, and you can essentially provision resources based on CRs that you create.

G

Yeah, so if we can get them to, for example, um put in paths where we can trigger their snapshot mechanisms, then we don't have to run around behind them and figure out how to do it.

B

Yeah, definitely I mean.

E

Yeah, I think that's ideal. Basically, for these backup orchestrators, Kasten or Velero, the way you'd want it to be is that the API for them is the operator, and we just offload all the complexities, all the application-specific logic, to the operator itself.

E

That would be the best way to have something close to a standard or generic API that can handle multiple

J

services. One of the things here is that there are just too many application types, right? Defining one API type to cover them all is very challenging.

J

Definitely. In this case it is a Postgres database, what Tom is showing right now. For example, I don't know, other cache cases, Redis cases, or Cassandra cases might need a completely different set of inputs, and generating those inputs is not straightforward.

E

Sure, but the nice thing about what we're discussing now is that all of that gets handled by the operator; you're not exposed to it. As far as Velero is concerned, or Kasten is concerned, you just define a schedule, you just say "take a snapshot", and then whatever information they need, whether it's a ConfigMap, a CR, or a Secret, that's taken care of by the operator.

G

Yes, and so that we're not trying to figure out how to take a snapshot. The operator should already have all this. It's got a resource that told it what to set up. It should have any Secrets, ConfigMaps, etc. in there, for example for talking to the service, or even which pods are involved with this, which persistent disks are involved with this. We really push that complexity behind it, because I think everything, well, almost everything, can be boiled down to a snapshot-type thing: a point in time. We want to restore to this point in time, and pretty much if you can't do that, you probably can't back it up.

J

If it's in the operator layer, this sounds like a great idea. What is the mechanism of notifying the operator? Because this is.

G

The working plan I've been doing is to basically have snapshot CRDs that are analogous to the volume snapshot ones, the CSI ones, but probably a little slimmed down. So you can find a resource; for example, the Postgres operator exposes Postgres CRs, and you should be able to write a snapshot resource that references the CR, and then the operator picks it up and does the internals.
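A minimal sketch of the resource shape described in this turn, under stated assumptions: the API group `snapshot.example.io`, the `<Kind>Snapshot` naming, and the `sourceRef` field are hypothetical names chosen for illustration, not an existing Kubernetes API. It only shows a slimmed-down snapshot object that points at the operator's own CR.

```python
"""Sketch of a snapshot resource that references an operator-managed CR.

All group, kind, and field names here are hypothetical.
"""


def make_snapshot_manifest(name, namespace, source_kind, source_name):
    """Build a slimmed-down snapshot object that points at an existing CR."""
    return {
        "apiVersion": "snapshot.example.io/v1alpha1",  # hypothetical API group
        "kind": f"{source_kind}Snapshot",  # e.g. PostgresSnapshot
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Reference to the CR the operator already manages. The operator
            # is assumed to know which pods, volumes, and secrets sit behind it.
            "sourceRef": {"kind": source_kind, "name": source_name},
        },
    }


manifest = make_snapshot_manifest("orders-db-snap-1", "shop", "Postgres", "orders-db")
```

The backup tool only writes this object; everything application-specific stays with whichever operator watches that kind.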

J

Is this a pattern right now, using operators to manage cloud resources like this, or externally managed resources like this? Or, more often, do people just use RDS directly?

G

Well, they're probably doing this directly, but that really puts you into the position of, like, hand-building each cluster.

N

Right! That's what I'm afraid of, because, well...

N

You know it's like stop doing that.

J

In order to support those, that needs the existing workloads to basically change the way of managing things a lot.

G

Well, like Tom and Kasten, they've got a solution right now where they say: hey, if you use this external service, go ahead and write a script, read the ConfigMap that you wrote, and use it to trigger that external stuff. So essentially, if you step back a little bit, that's really very much this snapshot pattern, except that it was built by the customer as a one-off for each installation, because each installation is unique.

B

Yeah, there's some commonality, obviously. For RDS, taking a snapshot actually has relatively few parameters. A lot of the parameters that we need are actually for if you want to also take a logical dump; for example, you need the connection information for the specific database.

B

So my hope is that you don't have to do the cross product of everything: you can implement the logical part, and then implement the interaction with just RDS, or Cloud SQL, or Azure SQL Server, and all that.

P

Is the idea here that this snapshot-like CR looks something like an Ingress resource or a Gateway resource, where different people can implement their own things, so you can have more than one of these on the cluster? Is that what you're proposing, a shared API?

G

A common API but different implementations, so that we should be able to find a resource that says "snapshot type for resource type X". So Postgres database resources have an associated snapshot CR, and the implementation of the snapshot CR is not done in a common place necessarily, but the Postgres operator should be exposing that API, or responding to those resources.

G

But in the way that we tell it, so that we don't have to rewrite for each individual operator. Like right now, the Zalando Postgres operator has got APIs for backing up and restoring, but they're unique to that operator. So I'd have to write a plug-in for Velero that knows how to talk to Zalando, and then the next one, and the next one, and the next one.

G

The other option is to get all the vendors to write a plug-in for Velero, but then that leaves Kasten out, because, you know, you can't use our plug-ins. So I want to try and drive us towards a standard, or at least a concept, where there's a relatively simple volume snapshot CR.

G

There's a discovery mechanism, and then we can share that with Kasten and everybody else to drive these operator-driven and external-service-type things without having to write N plug-ins.

G

Either the backup companies write N plug-ins or the database developers write N plug-ins, and that seems like madness.
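One way to picture the "common API, different implementations, plus discovery" idea is a lookup table from resource kind to snapshot kind. This is only a sketch: the registry, the function names, and the kinds are hypothetical, and a real mechanism would live in the Kubernetes API rather than in a Python dict.

```python
"""Sketch: one orchestrator-facing snapshot request, many operator implementations.

The registry, function names, and kinds are hypothetical.
"""

SNAPSHOT_KIND_FOR = {}  # discovery table: source kind -> snapshot kind


def register(source_kind, snapshot_kind):
    """Operator side: advertise which snapshot kind this operator responds to."""
    SNAPSHOT_KIND_FOR[source_kind] = snapshot_kind


def request_snapshot(source_kind, source_name):
    """Backup-tool side: discover the snapshot kind instead of loading a plug-in."""
    snapshot_kind = SNAPSHOT_KIND_FOR.get(source_kind)
    if snapshot_kind is None:
        raise LookupError(f"no snapshot kind registered for {source_kind}")
    return {
        "kind": snapshot_kind,
        "spec": {"sourceRef": {"kind": source_kind, "name": source_name}},
    }


# Two operators register once; any backup tool can then use the same call.
register("Postgres", "PostgresSnapshot")
register("MySQL", "MySQLSnapshot")
request = request_snapshot("MySQL", "inventory-db")
```

With this shape, adding a database means one registration by its operator rather than one plug-in per backup product.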

A

Hey Dave, I think one challenge with this approach is: how do you define this common snapshot CRD? What is in the spec? What is in the status? Because we are backing up different things, right?

N

Yeah, but I don't know that they're really that different, are they?

A

I mean, normally, let's say you look at a volume snapshot, right? You have spec and you have status. There are objects that you want to observe, like what's the status of those. But if it's just a general one for all, what will be returned? What are you watching? How do you know that it's a success or not a success?

G

We should be able to define that. I mean, the differences are not that big. Backing up a database, it's going to take some time to actually move the data and make a consistent snapshot, and then there may be an upload phase. I mean, from my perspective, looking back, everything wound up on tape. So if you can't reduce it to a string of bits, you probably can't back it up.
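As a rough illustration of what a shared status could report, here is a sketch of a small phase model with an optional upload step. The phase names and the `advance` helper are assumptions made for this example; nothing in the discussion fixes them.

```python
"""Sketch of a generic snapshot status with an optional upload phase.

Phase names and helpers are hypothetical.
"""
from enum import Enum


class Phase(Enum):
    PENDING = "Pending"
    SNAPSHOTTING = "Snapshotting"
    UPLOADING = "Uploading"
    READY = "Ready"
    FAILED = "Failed"


# Allowed phase changes; SNAPSHOTTING may skip straight to READY when
# there is nothing to upload.
ALLOWED = {
    Phase.PENDING: {Phase.SNAPSHOTTING, Phase.FAILED},
    Phase.SNAPSHOTTING: {Phase.UPLOADING, Phase.READY, Phase.FAILED},
    Phase.UPLOADING: {Phase.READY, Phase.FAILED},
    Phase.READY: set(),
    Phase.FAILED: set(),
}


def advance(status, new_phase, message=""):
    """Return a new status dict, rejecting transitions that are not allowed."""
    current = Phase(status["phase"])
    if new_phase not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {new_phase.value}")
    return {"phase": new_phase.value, "message": message}


def is_finished(status):
    """True once the snapshot has reached a terminal phase."""
    return Phase(status["phase"]) in (Phase.READY, Phase.FAILED)


def succeeded(status):
    """True only for the successful terminal phase."""
    return Phase(status["phase"]) is Phase.READY


status = {"phase": "Pending", "message": ""}
for step in (Phase.SNAPSHOTTING, Phase.UPLOADING, Phase.READY):
    status = advance(status, step)
```

A backup tool watching any snapshot kind would then only need `is_finished` and `succeeded`, regardless of which database sits behind it.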

A

I was just wondering: for example, you could have different resources there. Or are you saying just kind of save them?

E

Just to be more specific, Dave: obviously, for different databases we're going to have different CRDs. For example, let's say one for a Postgres snapshot, one for, let's say, MySQL, so these are different CRDs. So there's no way we can come up with the same CRD for both, unless we just call it something generic, which kind of has its problems.

G

I would expect that we would have a Postgres volume snapshot CRD, and there would be a MySQL snapshot CRD.

G

That, basically.

E

The goal of unification is for all the MySQL operators, or for all the Postgres operators, to adopt this CRD.

G

And also that the CRD, basically... So we have the volume snapshot CRD right now, and the Postgres snapshot CRD should look a lot like the volume snapshot CRD. The MySQL CRD should pretty much be identical to the Postgres one. Maybe there's room for extra parameters, but we should be able to get it.

C

Where do you get information about where to back up to? If I take one of these Postgres snapshots, it has to shove the bits into some data store. How do you specify where that is?

G

So we can do things like a volume snapshot class, like a snapshot class with freeform stuff in there, right? Because, obviously, at some point, this winds up being fairly...

G

I don't know, at some point there's definitely application-specific stuff. The model I have defined right now, we're using this inside the vSphere plug-in. We can snapshot something, and then it exposes out a data stream with the snapshot, and we can copy that someplace else. So we can leave it sitting there.

C

But the magic of volume snapshots is that it's up to the CSI plug-in to figure out where to shove the bits, and then when you come and ask to get them back, it's up to the plug-in to get them back to you.

G

Partial solution. So, if you're looking at this from the point of view of "hey, I build an array and I want to provide this capability to my customer", that's great. If you're looking at it...

D

No, I don't know, let me...

G

Let me finish, please. If you're looking at this from the point of view of a backup provider, even if your snapshots are durable, customers are going to say things like: hey, I want you to actually copy that over to the Veritas vault and put it into the disk array. And so even with CSI volume snapshots, I need this data extraction API to get stuff.

G

You know people want to back up into multiple locations and that's not necessarily all configured in the storage provider.

C

Right, so where I was going with this is that CSI sort of hides this problem from you, but it needs to be solved. At the end of the day, people want control over where their data goes, and CSI basically doesn't give you a choice. You're saying you do want the choice, but that has to be specified in an object somewhere, and I would hate to see it specified differently for every single kind of object.

C

You kind of want a common way to do this, where you say: this is the place where all my backups are going to go. And if you have three different apps, they can use the same information to decide to shove the three different backups into that place, so they don't have to specify it three times.
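A small sketch of that "specify it once" idea, assuming a hypothetical named backup target that every snapshot refers to by name. The `targetRef` field, the target name, and the bucket are all made up for illustration.

```python
"""Sketch: three application snapshots sharing one backup destination.

The targetRef field, target name, and bucket are hypothetical.
"""

# Defined once per cluster (or namespace), not once per application.
BACKUP_TARGETS = {
    "default-backups": {"type": "object-store", "bucket": "cluster-backups"},
}


def make_snapshot(kind, source_name, target_ref="default-backups"):
    """Build an application snapshot that names its destination by reference."""
    return {
        "kind": f"{kind}Snapshot",
        "metadata": {"name": f"{source_name}-snap"},
        "spec": {"sourceRef": source_name, "targetRef": target_ref},
    }


def resolve_target(snapshot, targets=BACKUP_TARGETS):
    """Look up the shared destination that a snapshot refers to."""
    ref = snapshot["spec"]["targetRef"]
    if ref not in targets:
        raise LookupError(f"unknown backup target {ref!r}")
    return targets[ref]


snapshots = [
    make_snapshot("Postgres", "orders-db"),
    make_snapshot("MySQL", "inventory-db"),
    make_snapshot("Cassandra", "events"),
]
destinations = [resolve_target(s)["bucket"] for s in snapshots]
```

Changing where backups go then means editing one target definition, and each operator reads the same reference.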

E

That's not the.

E

Here, as Dave mentioned, we can have a similar construct, just like a snapshot class, for these application-level databases, to capture that information. And we are doing the same thing. NetApp is doing the same thing.

E

We can talk about it offline, but we are doing the same thing outside of the CSI mechanism, where you can specify how to back up a volume to an object store without even invoking CSI APIs, just by dealing with CRs. So we're already doing this today.

C

Yeah, but we're doing it in a proprietary way. I think the right thing to develop is a more open way to say, for someone who knows how to generate backups: I need a place to shove them that I can, in principle, expect to exist anywhere. That's the beauty of this design: it doesn't depend on any particular underlying technology. It just says: I know it's a Postgres database, I want to back it up.

C

I need to be able to say where to back it up. And then my follow-on is, if there's a better way, you'd like to expose that better way in addition to sort of the default. But having a default that just works everywhere is phenomenal.

G

And we'll see that some things may not expose it so that we can get access to the data. Take a look at EBS snapshots: prior to, I think, last year, there was no way to access the data in an EBS snapshot except by cloning the volume and attaching it to something. And that's true for RDS snapshots; you can't get into them. So, like, with RDS Postgres, you can create a snapshot.

G

You can create a new database from a snapshot, but you can't reach inside and suck the data out of the snapshot directly.

P

Yeah, I kind of wanted to jump in, but I didn't want to talk over anybody. The only question that I wanted to bring up was: it seems like there are two problems. There's the problem of where the data goes, and this stuff from CSI, and we want to give options for that. But then there's the second problem, which seems to be that there are some application-specific things that need to happen for a given backup.

P

Are we talking about two problems and conflating them into one right now, or am I misunderstanding?

J

I think you're right. You know, Xing and I, we went through this discussion multiple times.

J

Unlike what Dave was mentioning, using the operator way, the thought was using application constructs, and that was actually kind of in the original thoughts on defining applications. Because if you really think about it, whether it's externally managed databases or a stateful workload deployed directly in the cluster, if there is a way we can group the resources, and the different ways of taking a snapshot of those applications, within the application construct, then all these problems should be solved.

J

You just need specific controllers to understand those application CRs and be able to take action on them.

G

That's exactly the option.

E

Those are the operators, exactly. The question is whether we want to standardize the CRs that are used by the operators.

J

Sure. The reason why we didn't go too much into that in the beginning is that the variety of the applications makes the application CR definition pretty hard, right? That's one. With regard to the destination, the backup target, I think there's also a discussion, right? One is COSI or something like that. My thought was to define a target within the Kubernetes construct. This target could well be file storage,

J

could well be object storage or something like that, or even, I don't know, maybe a multi-region bucket in the cloud. In this case, when a backup orchestration system comes in, it can refer to that destination as a single point of truth. Everything just goes there in an ideal world, but in reality it's really challenging, especially with this operating model.

B

Yeah are you thinking of having data.

B

I'm sorry. Oh, go ahead. John, are you thinking of having data path APIs in that model?

J

Well, it depends on which, right? If it's a cloud-hosted service, you will likely not be on the data path, because the snapshot will be taken care of by the cloud provider or the external stateful workload provider, whatever it is. But if it is a construct sitting within Kubernetes, then yes, you might be on the data path, though not necessarily the orchestrator; it could well be the application operator, as Dave was mentioning.

B

Yeah, so the APIs themselves would not expose data path primitives, but the implementations themselves would handle the data path, is what you're saying.

J

Sure, but what we need from them is the ability to accept a location, to tell them: hey, I don't want to be on the data path, but you go ahead and dump data there. This kind of thing, right?

G

Yeah, so I want both. No, seriously, because there should be a fallback position where it's the same as with the existing CSI design: the fallback position to get the data out of the snapshot is clone, attach, extract. It's slow, potentially, but it should always work. So that's your fallback position.
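The "both" position can be sketched as a try-the-fast-path-first routine that falls back to clone, attach, extract. The function and exception names are hypothetical, and the storage operations are passed in as plain callables so the example stays self-contained.

```python
"""Sketch: prefer a direct data path, fall back to clone, attach, extract.

All names are hypothetical; the storage operations are injected callables.
"""


class DirectReadUnsupported(Exception):
    """Raised when a provider has no API for reading snapshot data directly."""


def extract_snapshot(snapshot_id, read_direct, clone, attach, read_volume):
    """Return (data, path_used), trying the direct path before the fallback."""
    try:
        return read_direct(snapshot_id), "direct"
    except DirectReadUnsupported:
        # Slow but general: materialize the snapshot as a volume and read it.
        volume_id = clone(snapshot_id)
        mount_path = attach(volume_id)
        return read_volume(mount_path), "clone-attach-extract"


def no_direct_api(snapshot_id):
    """Stand-in for a provider that cannot expose snapshot contents."""
    raise DirectReadUnsupported(snapshot_id)


data, path_used = extract_snapshot(
    "snap-1",
    read_direct=no_direct_api,
    clone=lambda snap: f"vol-from-{snap}",
    attach=lambda vol: f"/mnt/{vol}",
    read_volume=lambda path: f"bytes read from {path}".encode(),
)
```

A provider that does expose snapshot contents would simply return data from `read_direct`, and the fallback branch would never run.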

A

Hey, I think we only have a couple of minutes left. I think this is a great discussion. I want to get to the remainder of the agenda. Maybe we should...

J

Yeah, before we get to the remaining items, should we do some follow-up? I think there are two things we want to follow up on, right? One is the snapshot API interface. The other is the conversation over here.

A

Yeah, sure. We should actually continue that. So let's sync up offline, and maybe we could do a follow-up in the next meeting.

B

For that discussion, Dave, could you present what you have right now for the CRs that you're thinking of? I'd love to dig into those.

G

Yeah, I really want to get to the point where I can show you that stuff. It's almost there, so, soon. Great, thank you.

A

Okay, so let me just share the... Okay, so we basically just want to look at the white paper and see how we are doing. Oh, there's such a whole issue here. So, the white paper. I see Phuong says he hasn't got time to update that yet. Tom, do you have any update about this, or are you waiting for people to give feedback?

B

No update. I think the last thing we talked about was that Steven actually put together a lot of the things that would go in this section. So I think the next step here is probably organizing things a bit and figuring out what we want to pull from his section and put in here, or if we want to extend this at all. I can write some text for this as well.

A

Oh, so he added it in this doc? I can take a look if it's already added there.

B

No, Steven, maybe a month and a half ago, presented a section that had... I mean, there's a lot of detail there.

A

B

Use cases, right. It was use cases, but I think there was a lot of discussion of motivation that included hook information.

A

Oh, okay. I don't know.

F

Do you want me to take a pass through it and see if I can rationalize?

B

Yeah, and I was kind of thinking we could even shuffle some of your section to this one, if it made sense as well.

F

Yeah, that's what I meant by rationalize, so that it's not duplicative. And yeah, because you're right, I sort of wrote mine totally not understanding what else was around it, so it might make sense. That'd be great.

Q

Thank you. No worries. I'll add the action item for you again. Thank you.

A

All right, so Phuong, you're not waiting for anyone else, right? You just need to update it yourself, right? I think last time we...

K

Yeah, I just need to look in a little bit more detail to see if we can extract an example flow out of the netpass snap it, or if I can review the example that we currently have, to illustrate that it can also be used similarly in snap. Another one that I haven't had a chance to look at is how we can do something like, you know, elastics, to take advantage of this snapshot.

K

I haven't looked at this map that I rested yet, so I think okay.

A

All right, thank you, uh okay. I think this is.

Q

Oh, okay, so I think this is... Steve, you added this one, right?

F

You told me to make a note somewhere. I wasn't sure where to put it, so I stuck it there. Feel free to put it anywhere.

A

Oh okay, yeah, but.

F

Yeah, this was mostly just that it may tie in with the conversation we were having. Again, it's one mechanism that AWS is pushing, so I was just hoping Tom or Dave or someone could also take a look and see what they think of where that's headed.

A

Okay, thank you. All right, I think we are at the top of the hour. Great discussion today. We will have follow-up discussions.

A

Thank you, everyone. Thank you.

D

Everybody thank you.

E

Thank you. Bye thanks.

E