Description
Kubernetes Data Protection WG - Bi-Weekly Meeting - 11 January 2023
Meeting Notes/Agenda: -
Find out more about the DP WG here: https://github.com/kubernetes/community/tree/master/wg-data-protection
Moderator: Xing Yang (VMware)
A
Caps, so stop sharing... it's all yours now.

A
You are co-host; you... you cannot share. You should be able to share. That's...

C
Very strange. At this moment there's some Mac-level access issue, I...

C
I need to close and reopen the Zoom app and I'll be right back.

A
Yeah, we can do that, because I only have a quick update. So while we're waiting, we can just... okay, I'll just go from the last one. This is a really quick one. So this is the content notifier. It's been there for a long time. I think there are still some comments that we have not had time to address, and then someone was asking whether we want to change the status from implementable to provisional so that it could move forward, so the team has these questions.

A
I said: okay, let's try that. Basically that's all I did. I didn't make any other change other than just changing this to provisional, and I'm still not sure how this would work if we change it; I don't know if there's a better chance of getting this merged or not. But anyway, I changed it, so let's see how it goes.
C
Yeah, so I would like to give an update on the thing we have been working on. Basically we wanted to test the CBT API and see if it affects kube-apiserver performance as we increase the load, and how the API server basically performs. So I would like to summarize the tests I have been doing and the observations. I ran a couple of tests, so I would like to walk you through the progress I have made so far. It was a single-host, single-node cluster with 4 vCPUs and 8 gigs of RAM; I had this setup. For observability I've been using Prometheus and Grafana to monitor the resource consumption of the kube-apiserver as well as the CBT API server.

C
The prototype I have been using is the aggregated API server that I have hosted on the GitHub repo. And yeah, talking about the environment: the first time, I tried this on a single-node cluster, and...
A
Excuse me... so does this one use the aggregated API server, or...?

C
Yeah, yeah, the original one, the one we proposed in the KEP. Okay, yeah, that's right. So I had to basically mock... yeah, I had to mock the data in the responses, to send 512 blocks each time: whenever we query for changed blocks, the CBT API server would return the same number of blocks every time. And I am using the k6 tool to basically run a number of requests in parallel.
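As a rough illustration only, here is a minimal Go sketch of that kind of mocked handler, returning the same 512 change-block entries (each on the order of 200 bytes once serialized, per the numbers discussed below) on every request. All names and the endpoint path are hypothetical, not the actual prototype code:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"strings"
)

// ChangedBlock is a stand-in for the changed-block metadata entry; the real
// KEP/prototype types differ. The Context field pads each entry to roughly
// 200 bytes once JSON-encoded.
type ChangedBlock struct {
	Offset  uint64 `json:"offset"`
	Size    uint64 `json:"size"`
	Context string `json:"context"`
}

// mockChangedBlocks returns the same 512 entries on every call, mimicking the
// mocked CSI driver response used in the load test.
func mockChangedBlocks() []ChangedBlock {
	blocks := make([]ChangedBlock, 512)
	for i := range blocks {
		blocks[i] = ChangedBlock{
			Offset:  uint64(i) * 4096,
			Size:    4096,
			Context: strings.Repeat("x", 160),
		}
	}
	return blocks
}

func main() {
	http.HandleFunc("/changedblocks", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(mockChangedBlocks())
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The k6 side of the test is then just a script that spawns the configured number of virtual users and hits an endpoint like this in parallel, which is how the 200-VU numbers below were produced.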
C
So basically, you know, mocking the clients, or mocking the users. Yeah, I mean, I won't go through all the runs; I'll just summarize the findings. So with hundreds of virtual users requesting in parallel with the k6 tool, this is the result we got. The only thing of concern was the average response time we were getting.

So basically, as per my observation, I had to do some tweaks in the prototype, or in the configuration, to improve the performance. The first observation was that the CBT API server needed more memory, because of the data we have been mocking. And the next fix I had to do: with 200 virtual users requesting the CBT change block data in parallel, the average response time we were getting was around 30 seconds, and after debugging I found that it was due to rate limiting against the API server. The rate limiting was happening because, in the aggregated API server, for each request we were trying to get the driver discovery resource; because of that, rate limiting was being applied, and that's why the responses were getting delayed. After caching the driver discovery response, we could improve it: after implementing the caching, the average response time we got was around six seconds, which is definitely better than 27 seconds, but yeah, I think this is still not acceptable considering we have a very low load, around 200 virtual users querying in parallel. So I'm still trying to figure out...
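To make that fix concrete: the change described here is to look the driver discovery object up once and reuse it, instead of issuing a GET against the kube-apiserver on every changed-block request. A minimal sketch of such a cache in Go, assuming a hypothetical fetchDriverDiscovery helper that wraps the real client call (the prototype's actual code will differ), could look like this:

```go
package discovery

import (
	"context"
	"sync"
	"time"
)

// DriverDiscovery stands in for the driver discovery custom resource; the
// real type lives in the prototype repo.
type DriverDiscovery struct {
	DriverName string
	Endpoint   string
}

// fetchDriverDiscovery is a hypothetical helper that performs the actual GET
// against the kube-apiserver (for example via a dynamic client).
var fetchDriverDiscovery func(ctx context.Context, driver string) (*DriverDiscovery, error)

type cacheEntry struct {
	obj     *DriverDiscovery
	fetched time.Time
}

// Cache avoids hitting the kube-apiserver on every CBT request, which is what
// triggered the client-side rate limiting observed in the load test.
type Cache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]cacheEntry
}

func NewCache(ttl time.Duration) *Cache {
	return &Cache{ttl: ttl, entries: map[string]cacheEntry{}}
}

func (c *Cache) Get(ctx context.Context, driver string) (*DriverDiscovery, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[driver]; ok && time.Since(e.fetched) < c.ttl {
		return e.obj, nil // served from cache, no extra round trip to the API server
	}
	obj, err := fetchDriverDiscovery(ctx, driver)
	if err != nil {
		return nil, err
	}
	c.entries[driver] = cacheEntry{obj: obj, fetched: time.Now()}
	return obj, nil
}
```

An informer-based setup could equally watch the discovery objects instead of fetching them, which removes the per-request GET entirely.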
C
As per the Kubernetes SLOs, right... yeah, I believe it should be around one second, but that is just for core resources, not for CRDs, for custom resources. There are no guidelines as such for those; the only thing we need to make sure of is that this doesn't affect the kube-apiserver SLIs and SLOs for the core APIs.

A
Okay, but when you were doing the test, did you actually check the core APIs? I think you were only checking that for the CBT APIs, right? Yeah.

A
Oh, just... did you check whether it affected the core APIs?

C
How much memory and CPU it's using: we don't see much of a spike in that. But yeah, once we kind of find out the issue causing the response delays, the next step would be to, excuse me, run these in parallel with kube-apiserver benchmarking, while the load is applied on the CBT API server. I think that would be the ideal scenario to benchmark.

Both servers, I would say: the CBT API server as well as the kube-apiserver, benchmarked under the load of the CBT APIs. But before that there are a few things we need to find out, like why the CBT change block APIs are consuming that much time, or I would say why there is a delay in the responses.
E
So... yeah, it's been a while since I wrote that code on the aggregated API server, but as far back as I remember, we don't call the Kubernetes core API, because there's nothing there we need. We only need the VolumeSnapshot and VolumeSnapshotContent, as well as the newly proposed driver discovery CRD, to find out whether the driver supports this... so yeah.

E
The other observation is also, I think, if you want to measure latency, it would be interesting to measure the p99, not the average.
C
Yeah, so I don't know about the P95; I think we are not getting it, it's basically hitting the threshold. But yeah, another point to note is that since we are mocking the driver responses, each time we are just returning the mocked change block data. So we are assuming that the CBT calculation, or the response from the CSI driver, comes back in constant time, since we are mocking it.

That's another thing we'll have to consider while doing the actual end-to-end benchmarking. Right now we are getting the change block calculation in constant time, but that would also vary; as you said, we might add the overhead of getting the volume snapshots and then calling the CSI driver to get the actual changed data.

The only thing we are... so in this setup, the only thing we are really simulating is the traffic flowing through the API server, and checking whether a large amount of traffic flowing through the API server affects it.
E
Got it, got it. So, sorry, what's the response payload again? Is it 512 bytes times 200?

C
Yeah, so each request would return 512 change blocks, and each block is around 200 bytes. And with k6 we can configure how many parallel users we want to simulate.

So as of now I have tried with 200 virtual users. So if you look at the traffic, the data sent and received, it was around...

well, it is 241 MB. But again, with this small amount of data we are kind of... yeah, we are kind of... excuse me. This is the latest chart, yeah. Even with that much data, I think it's not fast enough, I would say. So yeah, I'm trying to figure out what could be the root cause, whether there are some limitations on the kube-apiserver side.

E
Another question: so with the 200 CBT... so we are simulating the scenario where the CBT service is telling the user "hey, 200 blocks have changed", but we're just returning the 200 metadata entries.

E
Got it, got it. What's the average block size you assume?

C
200 bytes, and with each request we are returning 512 blocks... sorry, 512 blocks, so 512 times 200 would be the response size for each request. And yeah, we can configure how many parallel users we want to spawn during the test run.
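For reference, the per-request response size implied by those numbers works out to roughly

\[
512~\text{blocks} \times 200~\text{bytes/block} = 102{,}400~\text{bytes} \approx 100~\text{KiB per response.}
\]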
E
Sorry, what I meant was: is the payload using the proposed API structure, you know, how we talk about it in the KEP? Like, yeah...

E
We're talking about block size and all that other metadata. I just want to get a sense of: okay, so 200 blocks of change, but how much data are we actually simulating in terms of the changed blocks, if that makes sense?

E
Yeah, I'm not doing a good job of describing what I'm trying to say at this point, but it's okay, maybe I can ask you those later.

D
No, no, you want basically the raw data, so you can figure out API throughput, I think.

C
Wouldn't that depend on the block size? Since we are sending the metadata saying that 512 blocks have changed, it would depend on the block size to calculate how much data has actually changed, right?
E
Yeah, I think that's fine; it's maybe largely irrelevant at this stage, I'm just curious. Anyway, the other thing that we can try... because two things stood out to me, right. The CPU utilization of the aggregated API server is really low, which means it's not doing a lot of heavy computation. I wonder if there's anything...

C
Again, since we are mocking the response, we are kind of keeping the volume snapshot GET calls as well as the actual CSI driver gRPC call constant, I...

C
We are calling the CSI driver API, but it's returning a constant response as of now, since we are mocking, yeah.

E
I know... maybe, from my perspective, I think if anything we would be memory bound, not CPU bound, because we are not doing any computation, right? We're just mocking, yeah.

C
It's like a proxy; the driver is actually doing the calculation and returning the response.

Yeah, more of... yeah, maybe the API Priority and Fairness (APF) and other configuration, if we need any, could be causing the throughput issues or something like that. Yeah.
E
Yeah, okay, cool. And then the second thing that I want to bring up, I think we discussed this previously too: you know how there's... I think I saw somewhere in this setup that you use kube-state-metrics to capture some of the Kubernetes metrics. So there's that, and, you know, if you remember, the SLO alert that I showed you, the one

the Kubernetes API server would usually be deployed with. I don't have the link; maybe I'll share it with you. I think it'd be interesting to see, if we continue to load test, whether that SLO alert would fire. Because, at least in OpenShift, if that alert fired, then it's kind of an "all hell breaks loose" type of thing, right? And then

the priority and fairness thing would probably kick in, users would be getting alerts on their dashboards and stuff like that. In a nutshell, the alert is just saying that if this load continues, my API server is going to die over the next, like, you know, however long the...
C
I think with the Prometheus operator, if you enable Grafana, there is a default dashboard we get for the API server.

E
So this, yeah, so the read and write... okay, so the SLI is great, I think it's a good start, but it doesn't have an alert, right? It doesn't fire an alert; we'd need to make one.

Right now, looking at the SLI, I assume this is read and write latency. It just tells us how long requests take; it doesn't tell us whether that is good or bad, right?

But anyway, yeah, if needed I can help set it up too, but otherwise it should be fairly straightforward, other than that you need to know how to deploy jsonnet, which is, yeah, another thing to figure out. Does anyone else have questions? Thanks... sorry, this is great.
C
So, as of now it's in my Notion; I can maybe create a Google Doc and share the link. Okay.

E
I feel like... sorry, Mark, go ahead. Oh.

D
And presumably the load test is... I'm new to k6, so presumably that's just a JavaScript file somewhere. Is that also something we can put in a repo somewhere, or maybe wherever the KEP's sample code is? That's one detail I haven't dug into, so apologies if I'm asking a newbie question there.

E
Yeah, you can just push it to the prototype GitHub repo that we have.
C
Yeah
I
mean
right
now:
200
vpus
is
not
that
much
load.
It's
not
like.
We
are
testing
against
the
real
world
scenario.
I
would
say
so
once
we
fix
the
performance
issues
and
once
we
run
you
know,
maybe
around
yeah
I'm,
not
sure
how
much
the
real
world
request
to
look
like,
but
r200
isn't
isn't
that
much
I
would
say
so.
We
are
still
need
to
figure
out
the
issues
where
we
are
seeing
in
the
performance
and
then
we
would
run
the
real
benchmarking
against
queue.
C
E
The
the
idea
about
like
putting
your
test
script
into
the
GitHub
repo
is
also
just
like
you
know.
If
folks
want
to
try
it
out,
then
we
know
where
to
find
other
clients,
and
we
don't
have
to
rewrite
it
and
stuff
like
that.
It
doesn't
have
to
be
your.
You
know,
fully
configure
it
to
be
like
the
perfect
real
world
scenario,
because
whatever
you
have
right
now,
okay.
C
E
That's
why
it's
a
prototype
people
right.
Everyone
knows.
E
I'm just trying to think where we go from here. Because ultimately, right, I think this is great; I mean, now at least we have some numbers to look at, right? Previously, before the holiday, it had all been, you know, we're guessing, we're making assumptions, we're pulling numbers out of thin air. But at one point there was still kind of a time where we go, okay:

realistically, what is acceptable? Because as long as this whole thing is unbounded... you know, say you try to back up, or run CBT against, a 10-terabyte volume, which we all agree today is not uncommon anymore, those kinds of data sizes. And then even if it is just five gigs of changes there,

you know, it's still a lot better than scanning the 10-terabyte volume and trying to back up the whole thing. But if it is five gigs, or one gig, of CBT data, you know, at one point there still maybe needs to be an upper bound somewhere about what we can realistically push down.
D
Don't
know
if
your
test
case
can
accommodate
exactly
those
parameters,
but
that
would
be
a
good
summary
and
not
only
that
as
Ivan
is
saying
or
kind
of
have
now
less
hand,
wavy
a
benchmark
that
we
can
compare
to
any
other
implementation
or
any
other
approach
right,
because
a
lot
of
people
will
just
say:
oh
do
it
this
way
or
do
it
that
way
and
and
again
we
may
be
able
to
disqualify
them
by
actually
proving
that
they're,
not
performant
or
whatever,
compared
to
where
we
are
with
this.
Perhaps
if
that's
the
best
way
forward.
C
Yeah, I would also like to discuss how we scope this. I mean, if we have to present this to, let's say, the SIG Architecture group,

should we wait until we can benchmark the kube-apiserver and the CBT API server at the same time and have those results?

E
I think... my current thought is, again, it falls back to what we talked about before the holiday, and then Ben and some of us chatted a bit about this too: at one point we just have to say something like, you know, if the CBT metadata is greater than X amount of bytes, then there's an error; you have to back up the whole thing, because we can't push this down the network. What the number is, what X is, is kind of hard to quantify at this point, right? Say, for example... realistically we can push, like... I mean, not rigorously, so hypothetically, and again we have to make assumptions about capacities and stuff like that: if we say, okay, X is five gigs, then if you have more than five gigs of CBT metadata, we're not going to return it.
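To make the shape of that idea concrete, a hypothetical guard in Go could look like the following. The threshold and all names here are made up purely for illustration; no actual value of X has been decided:

```go
package cbt

import "fmt"

// maxMetadataBytes is the hypothetical upper bound "X" discussed above; the
// real value, if such a limit is adopted at all, is still an open question.
const maxMetadataBytes = 5 << 30 // 5 GiB, purely illustrative

// ErrMetadataTooLarge tells the caller to fall back to a full backup instead
// of pulling the changed-block metadata over the network.
var ErrMetadataTooLarge = fmt.Errorf(
	"changed-block metadata exceeds %d bytes; fall back to a full backup", maxMetadataBytes)

// checkMetadataSize sketches the proposed behaviour: refuse to return CBT
// metadata whose total size exceeds the bound.
func checkMetadataSize(totalBytes int64) error {
	if totalBytes > maxMetadataBytes {
		return ErrMetadataTooLarge
	}
	return nil
}
```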
E
Oh, this was about almost a year ago, where we just do that two-hop thing, and then we just say, you know, we're not going to go through the aggregated API server. Because one of the things that we brought up before the holiday was: we said CBT is great when the backup delta, or whatever, is not big. If it is huge, you might as well back up the whole volume. But still, right, from a user perspective...

E
It's like... I know Tom and I have chatted about this on various occasions.

E
If you have a 10-terabyte volume and the changes are, like, five gigs, CBT is still valuable in that case, compared to having to scan... You know, say we decide we don't support more than five gigs of CBT; now the fallback for the user is that they need to scan a 10-terabyte volume. So that number is really subjective and dependent, right? But anyway, I feel like at the end the approach might just end up being going back to that very proposal that we had, the earlier proposal, where we send them back a REST endpoint.
A
We just need to... we need to convince those reviewers, yeah.

E
I think the discussion about bringing this to SIG Architecture is more about bringing these challenges to a forum where we can get advice, not so much about convincing the entire architecture group. More like: hey, we tried four, if not five, different approaches now; how should we solve this kind of problem?
E
I want to quickly give an update on the volume populator. I think the last time we talked about it, the main challenge... oh, two main challenges there, I think. For one, folks have been concerned about the provisioning of all these extra volumes to store CBT entries.

You know, I think someone gave the example: if you have one namespace and there are, say, even 50 to 80 pods there, each pod has at least one volume, sometimes more. Usually, when users use this, they want to back up the entire namespace, right, not one workload

at a time. Then you're talking about, in a short amount of time, the backup software having to manage, you know, X number of volumes within that one namespace. So that was one of, if not the, major primary concerns there. And then the second one, which was more of an implementation detail that Chang-Chan pointed out, was: as we get back all the CBT entries, there isn't an efficient way to get these CBT entries, which are in memory, into that pod that the volume populator would spin up. Right: our CBT service would get all the CBT entries, and you'd have to dump them into some sort of ephemeral storage in order for the... I think it's called the prime PVC and the prime pod... to pick it up, read it, and then dump it into that final volume. And that in itself is not great.

Yeah, in my mind that's exactly the dilemma. I feel like these are the challenges, or why we may not want to go with the volume populator. In my mind, though, I think the volume populator at least offers more stability and reliability over the network; we trade off some performance and resources for stability and reliability. So...
F
...what is essentially a message on a disk, or, you know, something that pretends to be a disk, just to try to lighten the network load. Because at the end of the day you're still doing network traffic to write it to the disk and network traffic to read it back off the disk. You haven't really saved anything; you've just basically hidden the I/O cost in a place where the networking people aren't going to look so closely. It seems like a gimmick to me.

I mean, if you have a huge volume and a huge amount of deltas, and you actually want to know what those deltas are, you're going to have to transfer that data over the network. And it should be pointed out that, assuming you follow up with an actual backup of the blocks the deltas refer to, that backup is going to be an order of magnitude, actually several orders of magnitude, more network transfer than the transmission of the deltas, right?

So it's always a win to transmit the deltas over the network and then use that information to transfer the volume itself, unless more than 99% of the volume has changed, in which case you're just wasting effort; but then you're going to end up doing a full anyway, and you'll just do a more expensive version of a full, because you'll transfer all the deltas that represent the full first. I can't imagine a situation where storing the delta information on a volume actually alleviates any problem. It just hides it.
E
Well, at least that part of the networking, the writing to the volume, is, my understanding is, not going through the Kubernetes networking, right? That part, the volume traffic, is between the storage and the volume populator; it's not going through the Kubernetes API. So at least the control plane...

E
Yeah, yeah, and that is exactly the reason why the KEP is being held up, right? In a sense, you know, we got some feedback from Clayton and David, and, you know, going through the Kubernetes API server is probably going to be a no-go, no matter how we try to slice and dice it.
F
Okay, but then the fallback position is: okay, then we'll just use the REST API outside of Kubernetes. It'll still be a network API, and not some disk-based representation of the same information. And if anyone objects to that, then the answer is: well, after you get these deltas, no matter how big they are, the actual backup is going to be 100 times as big. So, you know, that's going across the network, period.

E
Yeah, yeah, and hence, exactly, it will be some sort of REST endpoint outside of the Kubernetes control plane, kind of a path, right. Which goes back to, Ben, as you recall, the very early proposal that we put together: we sent an endpoint to the caller, and the caller then calls that to stream the CBT entries directly.
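As a very rough sketch of that shape, a sidecar exposing a plain REST endpoint that streams changed-block entries back to the caller could look like the following; the types, the channel-based producer, and the handler are all hypothetical and not the proposal's actual API:

```go
package sidecar

import (
	"encoding/json"
	"net/http"
)

// ChangedBlockEntry is a placeholder for the CBT metadata entry type; the
// real schema would come from the KEP.
type ChangedBlockEntry struct {
	Offset uint64 `json:"offset"`
	Size   uint64 `json:"size"`
}

// streamChangedBlocks writes entries to the response as they are produced, so
// the caller can consume the delta list without the Kubernetes control plane
// in the data path. In a real sidecar the entries would come from the CSI
// driver over gRPC rather than from a channel.
func streamChangedBlocks(w http.ResponseWriter, entries <-chan ChangedBlockEntry) {
	w.Header().Set("Content-Type", "application/json")
	enc := json.NewEncoder(w)
	flusher, _ := w.(http.Flusher)
	for e := range entries {
		if err := enc.Encode(e); err != nil {
			return // client went away
		}
		if flusher != nil {
			flusher.Flush()
		}
	}
}
```

How callers of such an endpoint get authenticated is exactly the question picked up later in this discussion.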
F
...how to basically ensure that everything can talk to everything and that the right thing can always happen, right. It's very easy to imagine one CSI driver and one backup software that are specifically coded to know about each other, but if we try to define an open standard where any CSI driver can talk to any backup software, and multiple of them can coexist at the same time, then you need something in the middle

that's sort of doing the translation, or the routing, from place to place, so that each thing can get the appropriate bit of information from the right place, and they're all speaking the same language. And that's where I thought something like an aggregated API started to seem like it would make more sense.
E
One
I
as
I
recall,
like
the
the
multiple
CSI
driver
kind,
existing
of
the
different
kinds
are
not
the
main,
but
not
the
main
challenges,
because,
because
the
Cs
all
the
existing
CSI
inside
cars,
they
already
have
a
way
to
know
right
like
whether
this
is
something
that
this
is
like,
a
something
that
I
need
to
worry.
This
is
a
resource.
I
need
to
worry
about
or
not
based
on
the
the
driver.
E
The
CSI
driver
name
is
like
the
API
or
something
right
like
we
send
it
a
a
CPT
request
and,
like
maybe
inside,
there's
a
CSI
driver
named,
and
then
it's
like
how
we'll
say:
okay
yeah,
this
is
me-
or
this
is
not
me,
I-
think.
The
reason
why
we
moved
to
the
educated,
API
server
was
at
least
like
the
main
rationale.
Back
then
was
like
we
thought,
like
I,
will
pay
low
or
not
be
worse
than
you
know
the
kubernetes
logs.
E
It
will
not
be
worse
than
the
Matrix
server
which
we
presented
in
the
cab,
but
that
got
shut
down
because,
like
the
justification
we
were
was
given
by
so
the
architecture
was.
You
know
this
was
logs
and
metrics.
They
were
like
I
guess
like
they
can
see
that
as
a
system
calls
or
system
payloads.
F
Okay, I guess... I guess let's go back, then, to the idea that you just have a REST endpoint sitting in a sidecar somewhere, that sidecar is talking directly to the CSI driver through an RPC, and the actual API mechanism is just that the URL of that REST

endpoint gets dropped into an object. So you could have different CSI drivers, each with their own sidecar, each with their own snapshots, each with their own CBT implementations, creating these objects and the different REST endpoints, and then the backup software just talks to it somehow. I mean, I would want to see a POC with, like, two of each: at least two of each, two different CSI drivers and two different backup softwares, all coexisting in a cluster, talking to the appropriate things at the appropriate times, and everybody's happy.

If you can prove that's feasible, and that there's no security... maybe it was the security that was throwing us off, because, yeah, you want to make sure that, just because you have this REST endpoint out there, not just anyone can go talk to it. You want to make sure that the clients that talk to it can authenticate by presenting a Kubernetes... what do you call it... yeah, token, so that you can reject everyone else.
B
I thought about it. I thought your router maybe was based on the provisioner name and would pick the driver that way, and it would at least get a token or something back, used for security, that would be passed back from the router. And then from that point on, the backup software would have to use the token, or some secret that was passed back from the router, to establish a connection. Does that make sense?
E
It does. And then, in one of our prototypes from last year, we used something called the kube-rbac-proxy. For one, we don't want the CBT service itself to become a secret-management component with RBAC bolted onto it. So what it was: we used this kube-rbac-proxy, where, hey, whoever sends a request, pass me your service account token, and then I'm going to ask the Kubernetes API server whether I can trust you.

That was the first pass at authenticating and authorizing the request, okay.

I think... I can't remember... I think Prasad, yeah, he did do some prototyping of that; I think that part worked.

Then, you know, we decided that, okay, if we need a tighter authentication mechanism, at least we can figure that out in a future version. But at least we know, at least I think I know, that the RBAC proxy will work; we can trust the Kubernetes API server to...
F
Delegate
as
long
as
like
the
backup
software
can
use
the
same
credentials
to
authenticate
to
this
thing
that
it
uses
to
talk
to
kubernetes,
and
we
can
do
the
authentication
to
some
sort
of
proxy
so
that
the
actual
sidecar
that
is
syncing,
these
rest
requests
doesn't
need.
Any
any
you
know
doesn't
need
to
store
any
secure
information.
It
can
just
Outsource
all
of
the
actual
authentication
to
some
proxy.
That
seems
ideal.
E
F
Well,
it
may
not
have
a
service
account,
so
it
can
be.
It
will
need
some
credential
because
it's
talking
to
kubernetes,
to
create
the
snapshots
and
it's
talking
to
kubernetes,
to
create
the
the
PVCs
and
it's
talking
to
kubernetes,
to
create
these
guys
backups.
So
it
gets
it
needs
some.
It
needs
to
be
able
to
talk
to
the
kubernetes
API
and
whatever
it's
using.
It
should
be
able
to
use
the
same
thing
to
talk
to
this
rest.
Endpoint.
E
Yeah, and of course the obvious... sorry, the RBAC proxy would need to run as a sidecar. There was again a very small discussion around: okay, now do we want yet another sidecar inside the CSI driver? But that was a very brief one.

At least on the client side, right, there's a client sidecar that needs to go and ask the Kubernetes API whether it can trust your service account.

It's an existing project that we found from the observability group. I mean, okay, there's nothing stopping us from just, you know...

C
I think we had implemented this on the controller prototype, if you remember, yeah.

E
It's one of the three or four prototype repositories that...
E
Yeah, we just need to invoke the TokenReview API, and of course the Kubernetes API server will not talk to us unless we have a TLS cert that it recognizes. So, in a nutshell, the idea is to...
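For reference, the TokenReview call mentioned here is a standard client-go request; a minimal sketch, with clientset construction assumed and error handling trimmed, looks roughly like this:

```go
package auth

import (
	"context"

	authv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reviewToken asks the kube-apiserver whether the bearer token presented by a
// caller (for example, the backup software) maps to an authenticated identity.
// This is the check that kube-rbac-proxy performs on our behalf; it is shown
// here only to illustrate the flow.
func reviewToken(ctx context.Context, cs kubernetes.Interface, token string) (bool, string, error) {
	tr := &authv1.TokenReview{
		Spec: authv1.TokenReviewSpec{Token: token},
	}
	result, err := cs.AuthenticationV1().TokenReviews().Create(ctx, tr, metav1.CreateOptions{})
	if err != nil {
		return false, "", err
	}
	return result.Status.Authenticated, result.Status.User.Username, nil
}
```

Authorization, that is, whether the identity is actually allowed to call the CBT endpoint, would then be a SubjectAccessReview, which is the other half of what the RBAC proxy handles.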
E
So, sorry, before we go to that part: you're saying, let the vendor handle the requests, the CBT requests, and some of the authentication and authorization? No?

F
No, I mean, I'm saying: like all of the other sidecars, right? We have a provisioner sidecar, we have an attacher sidecar, we have a resizer sidecar, we have a snapshotter sidecar. Each one is in theory optional, right; the CSI vendor gets to choose which of the sidecars they package up and install with their driver, and this would be yet another one of those, where we would write the code.

E
I think... yeah, I think that's the... yeah, okay, yeah. That's the model that we proposed earlier last year, when we first put it together. Okay, yeah, okay, cool.
C
So one piece of feedback I think we got on this data-path API was that it's not the Kubernetes way of consuming APIs, right? So yeah, one...

F
It should be, but the response to that is: this API is going to be transmitting so much data that we don't want to do it the Kubernetes way, right? It's a data-heavy API, and therefore it's been kicked out of the normal process for APIs, and that's why it's special. Yeah... I think we didn't know that last year, when we were having those discussions, that it was the amount of data we were proposing to push through.
E
Yeah, I think we're at a stage where I feel like we can go back and say: hey, you know, we did our due diligence. Actually, the aggregated API server was first suggested... it came out of the production readiness review. It was actually suggested to us, not mandated, but suggested; people said, hey, why don't you guys check out the aggregated API server? And I think we're at a point where we can say, hey...

A
So, since Prasad has been doing this POC, should he continue with it? Because he has spent time testing and he's going to chase down the performance problem. So I'm thinking he should actually just continue with that, and then, Ivan, if you want to go back and look at our original, you know, previous POCs, we can look at both together. Yeah.
E
Yeah, I think what Prasad is doing still has a lot of value, because even with the independent REST endpoint we still need to publish some sort of benchmark, right? Because people are going to come back to us and say: hey, yeah, the aggregated API server and the control plane are out of the equation, but how much, realistically, as a user, can I expect when I use this service? So yeah, okay, we can do that.
A
There's just one thing I want to quickly go over. That's...

A
So now we have a new group controller service, and we no longer have volume-group-specific messages; we only have the volume group snapshot CSI RPCs. So it's simplified based on the comments. Basically I just want to get some reviews. Let me see... I think, after my update, I haven't got additional reviews. As you can see, we do not have the separate volume group construct anymore. Is there any issue with this? Is it fine?

A
We need to get this one merged before we can do the KEP, right?

A
Oh, okay, great. Okay, I will put this in here. Yeah, I will remind you to review this. Okay, I think we are running out of time.