
A

Okay, I think we oh.

B

Cool, I see.

A

A recording.

B

Hey.

A

Sorry about that, everyone. We'll get this, we'll get better. Every other week, uh, welcome. This is, uh, July 23rd, so the second, uh, CSI Secrets Store call. Thanks for joining, and I think we got some new people who've joined, which is awesome. Um, want to kick off, uh, a round of intros.

B

Sounds like a good idea.

A

Yeah, and if you do have the link to the doc, just please make sure that you've added yourself, um, as an attendee, so we know you showed up.

B

Tommy, do you want to kick it off yeah.

C

We'll go down the list here. Uh, I'm Tommy Murphy, I'm from Google, uh, working on the, uh, Google Secret Manager CSI plugin for this driver. Um... uh, Colin?

D

Uh, hi everybody, I'm Colin Mann. I work with Tommy at Google. I'm just getting ramped up here, so, uh, I'm just attending this meeting to catch up. Welcome.

E

Hey, hi, I'm Anish, I'm from Microsoft. I work on the CSI driver and the Azure plugin for the CSI driver.

B

Hey, I'm Rita from Microsoft. I work on the Secrets Store CSI Driver and also the Azure Key Vault plug-in.

F

Hey, I'm Tom. I just joined the Vault team at HashiCorp here recently.

B

Welcome I'm.

F

Jim.

G

Um, I am now the engineering manager of the ecosystem team on Vault at HashiCorp. Um, we've done some work with Rita and team on the CSI provider.

A

I'm Gibson, one of the senior PMs here at Microsoft, uh, managing our secure upstream projects, this being one of them. Perfect, I think we got everyone, right? I will moderate and I will attempt to take notes again.

A

Unless anyone else wants to.

B

I'm happy to help and tag-team also.

A

Yeah, if you don't mind, yeah that'll work.

B

Out good.

A

Awesome. Okay, let's kick off the agenda.

A

Let's see... typing away there.

B

Uh, yeah. Maybe Anish, if you want to kick us off with the latest release for the CSI driver.

E

Yeah, um, so last week we released 0.0.12 for the CSI driver. Um, there were a few notable improvements. We added, basically, a separate, uh, reconciler controller that creates the secrets, instead of having the secret creation as part of NodePublishVolume. And one of the major bug fixes we had was basically honoring the context when invoking the provider.

E

Kubelet has a context timeout of about 2 minutes 3 seconds when trying to mount the volume, and when invoking the provider we weren't setting the context. So we added that, um, so that if the provider doesn't respond within that stipulated time, then the mount would actually fail and it would be retried, uh, the next time.

E

Uh, also with 0.0.12, we deprecated, uh, an old option of configuring the volume attributes as part of the pod spec, and this has been added in the release notes. And, uh, I think Tommy added it for the Google provider plugin, and they're already using the SecretProviderClass, but for HashiCorp, maybe we also need a doc update, just to remove references to providing the values in the pod spec.

B

Phil, do you mind clicking on that link, just so we can see the release notes?

B

And actually, last night I saw someone reported that it broke their stuff for the Key Vault provider because of this deprecation. Yeah.

B

I think what we need to do in the future is make these deprecations a bit more obvious, at least a few releases ahead of when we actually deprecate it. We need to start making it more obvious, maybe as part of a release note saying that we're going to deprecate something, uh, like, a few releases from now. Yeah.

B

This is going to happen again, I'm sure, right?

E

Yeah. Also with this release, we switched to using the GCR repo for hosting the images. Previously it was on Docker Hub, but now we have moved to GCR. And the way it works is: there's a docker directory in the driver repo where we update the image version, and once the image version is updated...

E

There's a prow job that automatically builds the image and pushes it to the staging repo, and then there's a manual process to promote the image to the prod registry. And, uh, I'm working on a doc for the release guidance, so that everyone knows how it's done.

A

Correct.

B

Does anyone have any questions or concerns about anything that went into the latest release?

B

Oh okay, cool.

B

By the way, if anyone wants to be part of the release train or release team, please raise your hand. Um, we always need more help there, and more eyes are always welcome.

A

All right, so we got the next milestone. Uh, looks like we're looking for more contributors. So here's the milestone. Um, Rita, are these all the issues that need to be burned down for this milestone? Is that what I'm understanding?

B

Yeah. So, uh, initially I went through all the issues in the backlog, um, just to save some time from this call this round. Um, but next time, when we do another, um, backlog grooming, we should do it as part of this call. We did it ahead of time just to save everybody the time, um, this round. But yeah.

B

As you can see, there's a lot of good first issues, I guess. Um, and some of these already have PRs in progress.

A

Does anyone have any questions about some of the issues on the board here?

E

Now, if you want to pick up any of the issues, just feel free to assign it to yourself and just go from there. And if you have any questions around what the PR needs to do, please feel free to ping on Slack or on GitHub.

A

Yeah. And then what I'll probably do, I'm just looking here... so yeah, we don't have a project board, so, uh, I'll probably just put a simple kanban board on this, so we can track this too, as well, just...

A

Visually. All right.

C

Okay, sorry, I have a question there. Uh, that milestone is just gonna be, like, another point release, not, like, for v1? Like, it's just the next...

B

One. Yeah, this is for the next cut: we should include these things if possible, unless there's, like, a major bug that somebody's waiting on; then we can do, like, an emergency release or something.

C

It's just, I guess this is the first time a milestone has been used like that.

A

Now, that's a good point. Uh, Tommy... maybe, um, does it make sense to kind of have a semantic type of versioning here?

B

Um, are you thinking, like, 0.0.13?

A

Yeah, something like.

B

That yeah.

A

So we can just track. You know when there's gonna be a next major cut.

B

Sure.

A

Okay,.

A

Awesome: okay,.

A

Let's see. Next we have: ability to configure the provider call context using a flag.

E

Yes, okay, I added that, right. So, um, in 0.0.12, like I said previously, we fixed, um, the issue by setting the context for the provider calls. Uh, let me see if I can find it here. Um, so what I was thinking was: maybe we can also try and propagate the context timeout to the provider, so that the provider can just automatically use a context with that particular deadline when it makes calls to the external secret store. Um, right now for the Azure provider...

E

What I've done is I've added a default context timeout in the provider, but there's no way to configure that. If we can do that through the driver, we basically need to add another flag, apart from the four flags that we already support, that would just be context-timeout equals, and then the driver would determine what the timeout should be and just invoke the provider with that.

E

So I added that, so we can discuss if that's something that we want to support. If we do, then it would require a change on all the providers as part of the next release.

B

It might be helpful if maybe you can explain why this was added, and what the behavior was, the bug users reported.

E

Let me just post the PR in the chat.

E

Phil, I posted a pr in the zoom chat. If you can open that.

H

Okay,.

E

Right. So, I mean, there was an issue, uh, a user had posted on Slack. Uh, what they were seeing was: the provider call was basically timing out, um, and it turned out that the issue was they had a network policy configured on the cluster, uh, because of which there was no egress traffic. Um, and there was no context in the exec command used to call the provider. So what happens?

E

The provider doesn't respond within the two minutes three seconds, which is the context timeout, and because of that, the driver context times out. The provider process is still running, but the context times out, and the driver does not unmount the path, because the context timed out without an error.

E

So then, the next time, when kubelet tries to call the driver for NodePublish, uh, it checks if the volume is already mounted, and if it is already mounted, it does not remount. So then it just responds back saying it's already mounted, and because of that, the pod starts up without the file in the path. And if they are using the sync-secret feature, then the secret does not get created, because NodePublishVolume did not create it, in the old release.

E

So what we did as part of this PR is: where we had exec.Command, we just switched to using exec.CommandContext, so that we include the kubelet context in the provider call. And if the context times out, then the process is killed and there is a context-timed-out error. So in that case we successfully unmount, and then the next time the driver invokes the provider, we remount the whole thing.

E

So this context is configured for the driver using the kubelet context, and now what we want to do is propagate whatever the timeout is to the provider as well. And for the Azure provider today, the default that I've set is one minute 50 seconds, um, just so that it can finish all the operations of fetching from the external secret store and also writing the contents to the files.

E

But we want this value to be configurable, and the only way values can be configured on the provider today is by the driver passing those values to the provider.

E

So this would require the addition of a new flag containing a context timeout to the provider binary, so that when the driver invokes the provider, it would also add this additional flag and set a default timeout value, and the provider can honor that context timeout and respond within that stipulated...

E

Time.

G

So, I get the concept, but I was trying to think, like, let's say on the Vault provider: what would I do with that number, other than potentially time out myself?

G

If I can't get the secret in that amount of time.

E

Is that pretty much the only action? Right, it is. So, I mean, the provider usually makes, like, a lot of external calls, right, and it's always easier to embed the context with the timeout in those, so that it responds back within the time. Uh, otherwise, most of the time it's just context.Background(), and the call just hangs on and the process is killed.

G

Yeah, that seems straightforward now, right.

E

And I think one big advantage is: if we don't have a context in the provider, and just the driver cancels the context and kills the process, the user actually does not see what the error was in the provider.

E

Uh, they do not know why the provider didn't respond. But if we add a context there, then it would say that the connection timed out or was refused, or at least...

B

The user would.

E

Get the error back so they know if they've misconfigured, something on the cluster.

C

Yeah, this seems, uh, straightforward, and it would provide a little bit more context, like, in our logs, about deadlines being too short for requests to finish, things like that.

E

Yeah. So I'll open an issue for this, I'll add what we discussed there, and then we can go from there.

A

All right, that's the context one. Uh, next: replacing the end-to-end test suite with Ginkgo.

E

So I added that on behalf of another contributor, who created the issue and also has a PR open. Uh, today the end-to-end suite is using Bats, and, uh, what we want to do is, uh, move from that framework to, uh, the Ginkgo framework, and/or also do something similar to what Cluster API does.

E

That is, using the Kubernetes client rather than just running Bats commands. So the PR in this issue switches to using Ginkgo, but it still continues to use kubectl apply, kubectl create, and similar commands; basically, it's just doing an exec.Command. And I just want to bring this up, so that we can have a discussion on what we think would be the right way to move the e2e test suite to a different framework.

E

Um, I think if we move it to Ginkgo using the Kubernetes client, similar to how Kubernetes does it, uh, then it'll be easier for new users to add e2e tests, and that way we can also have complete coverage for all the new features being added to the driver.

B

Yeah. So, in his, um, work-in-progress PR, um... it definitely is, I think, an enhancement to what we have now. Um, but I would like to see a consistent way. As you said, using kubectl apply, it seems like it's trying to do it in different ways.

E

Right.

B

And hopefully, adding this with Ginkgo will help, um, make it easier for users to add e2es.

B

Can you can you click on the work in progress pr in the previous issue.

C

I'll just say I didn't look as much at the e2e tests or...

B

At the bottom, "replace e2e"...

E

Sorry, tommy, you were saying something.

C

Uh, I was just gonna say I don't have much context yet on the testing frameworks here, so I have nothing to add to...

C

This.

B

Um, I mean, so maybe the ask is, like, maybe, uh, folks on the call could take a look at this PR and chime in. Um, kind of do a comparison between what we have with the Bats e2es and this PR, and see which one you would like to work with. Um, it's actually a good way to see, you know, uh, if this actually enhances the contributor experience.

B

Um, so yeah, so maybe, um, you guys can take a look at this PR.

C

Is that, uh, your main hesitation, I guess, on this right now? Just making sure that there's more consensus that it's...

B

Yeah, okay. I mean, this is kind of a big change, in that it's changing the way we test and changing the contributor experience, right? So we want to make sure, if we do it, we do it right.

C

Sounds good. We'll look at that and chime in on the PR, I guess.

B

Thanks.

A

All right. So next we have: the "get service account token for CSI driver" KEP has been merged.

E

Yeah, I think so. Tommy had posted the link, uh, last week for this, and then I saw that the KEP has been merged.

E

Um, I got a chance to review the KEP, and I think it's a very good addition. It's also something that we can reuse for secret rotation, because there's going to be a new remount... I mean, the flag is still not determined, but there's going to be a new flag which will make the kubelet invoke NodePublishVolume, uh, at every periodic interval, and that is something that we could piggyback on for the secret rotation feature as well.

E

That reduces the overhead on the driver to do the periodic reconcile, right? The call comes in, and NodePublishVolume would just need to go and rewrite the contents of the file, rather than unmounting and remounting, so the mount would still remain the same.

E

One caveat is this is going to be implemented as an alpha feature in 1.20, which would mean, if we go this route for rotation, then users can only use it from 1.20.

B

Yeah yeah, it's a long time for adoption, like probably two years.

C

Yeah, I think... I'm still reviewing the rotation, uh, which is, like, the next thing. That's, uh... yeah. I think I just want to keep this in mind, because it will, uh, save our plug-in... our plug-in has to get the service account for the pod, um, and that takes a number of round trips and a number of, like, permissions and stuff that this would save. So just making sure that they can kind of converge in the future...

C

Is, I think, kind of my priority there.

B

Okay, um, we should definitely add this to that, uh, rotation design doc, if not already, as something to keep track of in the future.

A

Okay. So, follow-ups from last sync: um, secret rotation feature design review.

E

Uh, yeah, I think the doc is being reviewed right now. Uh, Brian couldn't make it, because he had another commitment.

C

Okay,.

E

Uh, but I believe, uh, he has a branch, uh, which has the changes. Um, so maybe next time. I mean, I think we probably have to go down that route, because the KEP, like I said, is probably only going to come in 1.20. But also, if possible, we need to design it in such a way that when 1.20 comes, from 1.21 onwards, we could, uh, just have users move to using that, instead of having the overhead on the driver for the rotation.

E

We basically just have to think out a way where there's a strategy to move from the logic for the periodic reconcile being in the driver to piggybacking on what the kubelet could do automatically.

A

Right. Um, Vault provider: update the Vault provider to use the latest release.

G

Probably good for me to give a, uh, quick status, um, generally on the Vault provider. Um, so yeah, for a few months now, we've been really, um, really strapped for bandwidth, um, so stuff is backed up a bit, and it kind of came to a head when we were doing the webinar, um, because we were going through some tutorials with 0.0.5, and there's actually issues in what had been brought in at that revision, and, um...

G

Jason and I were talking about it, and it needs a little bit of a revamp, because some of the stuff that was put in to handle KV v1 versus KV v2, like, all that can be handled cleanly if we just use the Vault API package. This is all being done manually, with, like, individual gets and stuff, and, um, we kind of want to get to the next level.

G

On this thing: clean that up, which will kind of clean up the latest release, get it stable and good, and we can have that be our latest. Um, that's sort of our nearest goal, and then kind of burn down...

G

Some of the open issues, and, you know, see what's hanging out there. Um, to that end, um, so I think y'all met Tom; Tom just started, and he'll be helping out with that, along with Jason. Um, so we've kind of stabilized the team a little bit, and are carving out some actual dedicated cycles for CSI. Um, and I really can't speak to, like, uh... well, using the driver's latest release is easy enough, but, like, uh, Helm chart or Windows support, I would need to discuss that with the team, as far as the, um, timelines and feasibility. But, um, yeah, so that's kind of the status of, um, where we're at.

G

I noticed there was, like, an issue a day or two ago, like, "hey, what's the status of this project?" And so, um, yeah, I did want to make clear that, uh, um... we kind of want to get it to... I don't know, longer term, what the steady load of, uh, you know, um, resources we will apply, like, how tracking will be. But we definitely want whatever is there to be stable and production-usable, like, within the capabilities we're offering, they're solid. And so that's our immediate-term goal.

B

Thanks, yeah, that makes sense. I recall, um, you were talking about cleaning up the implementation a bit, uh, you know, a while back, so yeah, this is probably a good time to do that. Um, if there's anything you need, uh, you know, in terms of integrating with the latest CSI release, and any integration issues you see, please definitely reach out to the team here. We'll definitely want to address any of the, uh, integration issues you see.

G

Yeah, I think, as I recall, most of it was fairly, like, literally cleaning things up and, like, stabilizing. There was one open, almost design-level decision that was hanging out there. I think you and I had both commented on it; I can't remember the detail. It had to do with where a secret was rendered, because you have the secret name and then the value within the secret, and there's, like, multiple ways to denote what you're fetching and where it's going.

G

So, I don't know, that may be up for review again. That's probably the biggest, like, kind of UX-y aspect of what we would need to fix.

B

Right, right. Um, I think how you want to, uh, present the metadata, or, you know, the parameters that users can provide in the SecretProviderClass, is up to the, um, provider, right? Um, as long as, you know... obviously, we will need to update the e2es, make sure we're testing the right, uh, objects and the right behavior, the updated behavior, um, and notify users of the changing behavior, um, in the next release.

B

I think it would be helpful to, um... you know, similarly with the driver release planning, it might be helpful to also do that for each of the providers, so that, um, at least people know, like, hey, this is the scheduled release, and here's what's coming. That way, it's more visible to users.

C

One thing I have noticed is I've been reviewing some of the newer features like the uh like the kubernetes sync and some of the rotation stuff.

B

Because it does.

C

Seem like the, like, provider-specific notations of, like, which objects and, like, where they go, and the path... uh, that knowledge is used by, like, the sync thing. So, where, uh, right now, our driver has, like, a different way of denoting the secrets and the files...

C

And we'll either need to update it to more closely match the stuff that, like, the Kubernetes sync feature uses, or, uh, address some of the, like, schema differences somehow. Um, but I haven't fully formed the thoughts yet on that. It's just something that I have...

C

Noticed in, like, the last week.

B

Right. So you're talking about the sync object, uh, property, right, on the SecretProviderClass?

C

No, the actual... like, it seems like that also relies on the... under, like, parameters, and then you have objects, and then you have an array, and then the actual, like, secret resource and the file name.

B

Okay, let me.

C

It seems like that format needed to be kept in order for it to pick up, like, which files to write to which secrets, but maybe I'm wrong.

B

Yeah, yeah. I mean, if we have time, we should definitely address this. Um, but just so we're on the same page, I just want to make sure we're talking about the same thing. Um, so, um, I guess, uh, can I share a screen? Is that cool?

A

Yeah yeah I'll go ahead and stop and all.

B

Right.

B

Oh... ah, I think you have to enable sharing. Uh, so, I think, at the bottom, if you see "permission"... so, Phil, you will have to do it, because you're the host.

A

Yeah, I just, uh, bumped you up to co-host, so...

B

Oh some power here, thank you. Zoom security, yay.

B

All right, let me know when you see my screen.

C

Okay, you are broadcasting yeah.

B

Okay cool, so, as you can see so I think you're talking about this one right.

F

um

B

Or a different one? No, no.

C

Like uh in the secret provider class.

E

I think you probably want to pull up the one that has the whole thing, with secretObjects and...

B

Okay, let's see test we'll use test.

B

uh

B

Okay, uh wow, it is taking a long time to load.

C

I put a link in the chat.

B

Oh okay,.

B

Maybe you should have shared yours.

A

Yeah, if you want, if you want to shoot me the link, I can fire it back up all.

B

Right, okay, you see it now.

E

Yeah yeah.

B

Okay, okay, so everything under parameters is provider specific right.

C

Right, but it seemed like, uh, the reference, where you have objectName and then the secret alias...

B

Yeah.

C

That depends on the structure of parameters: objects, array, objectAlias.

B

Is that actually... it's not. So this objectName is essentially the file name that it's mounted as.

B

So this is the name of the file once it's mounted to the file system, and what this is doing is it's actually looking for that file on the file system.

C

Okay, so objectName is just the file name. It is not, like, picking up the...

B

No, no. But it happens to... I see why you're confused, because it happens to be, like... the objectName and objectAlias, this determines the name of the file that is created. Um, so I think better documentation...

B

Obviously. Um, and this really should just say "the name of the file". Um, and, however... so in, like, say, a different provider, this could be "fileName" or whatever, right, as long as it matches the, um, logic within the plug-in, to make sure it's the same. Yeah, great, cool. Hope that was helpful.

C

Yeah, I just saw some of those references uh in other.

B

Places I wasn't.

C

Quite sure, if it was the same or different.

B

That's a, um, very good point, though.

G

Yeah, yeah. Tommy, is the Google... is what you're working on, um, public yet, even if work in progress? Yes? Yeah, if you could drop a link, I'd be curious too.

B

Yeah. Um, speaking of the Google provider, um, I think what I would love to eventually do is have, like, a table here in the readme that has all the providers, and then, like, a check mark next to all the capabilities as well. Um, so people can see, like, okay, which feature is available in which provider, um, and then a link to, like, say, an example, uh, or, uh, a readme, or something, right? Like, that would be super helpful for users.

C

Yeah uh like where our status is kind of right now, it works um hasn't been tested too much or no integration.

B

Okay, yeah.

C

We're working on doing the, um, like, a published Docker image, where right now, with the repo, you have to build it yourself, which...

B

Okay, yeah, that makes sense. Um, okay, cool. Um, so yeah, so I guess once you've done that, and, um, we do, like, a code review, right, and then, um, speaking of e2es, you'll have to go through those. So, um, so yeah, once that all happens, then we can get the provider added to the readme.

C

And then I think we had maybe some of the same issues as the Vault provider, about, like, bootstrapping off of, like, workload identity or the... um, so there may be some code on the GCP one that the Vault one can, like, piggyback on.

B

Okay, yeah. Anything you can, um, share, like, on Slack, and, you know, tag Jim and folks, right, and that'll be super helpful.

C

Cool.

A

All right.

B

Great.

A

Um, that is all the agenda items. We got about 10 minutes left, uh, for the call. Any other announcements, or anything else anyone wants to chat about while we have this...

H

Time? Yeah, I want to, uh, bring one...

E

Issue up really quick. Um, in the last community meeting, we discussed, um, optionally installing the ability to sync Kubernetes secrets, right, like, in Helm. So, uh, one thought around that was: if we do do that, do we enable it by default and just provide a knob for users to disable it, or do we disable it by default, and if they need it, they explicitly need to enable it? Because, uh, one concern with disabling it by default is, when users install it with Helm...

B

They.

E

Don't go through all the flags that need to be set, so it's possible that they install it and out of the box.

E

Sync secret is not enabled, and then, when they try it out, it's going to fail, and then they'll probably open an issue, and then, after we tell them that you need to set this flag to enable that feature, then they might try that. So, what do we think is the best way to do that?

B

um Oh sorry, go ahead.

C

Uh, I was just starting work, uh, earlier this week on, um, like, plucking that out. Um, and, uh, so I... I think installing it by default, but allowing it to be disabled, is okay, because I think customers would run into confusion if they tried to use it...

C

Yeah right.

E

Yeah, I think it also supports backward compatibility right like they have an old chart and then they upgrade to the new chart. Then they still want to continue using that it should still continue to work.

C

Yeah, I've actually never used Helm, so I'm digging into that now too. But, uh, yeah, I think, uh, HashiCorp wasn't on the last, uh, past meeting, um, so just to recap...

C

My concern was that, uh, this feature allows you to copy secrets to Kubernetes secrets, um, but that could be, like, an escalation of privilege. Or, like, you might not want your secrets copied into, like, durable storage that isn't your, like, Vault or Key Vault or GCP. And it gives the driver permissions over your secrets, to, like, read and write all secrets in the cluster, which you might not want to grant to this, for some reason. Um, so that was just kind of why I filed the issue and started looking into it.

A

Yeah, I haven't seen the spec on that. Uh, after the sync, do the secrets persist in Kubernetes, like, as Secrets?

B

Yes,.

A

Okay, right.

B

Uh, this was requested by a lot of users, because they didn't want to change their application code to read from a file. A lot of applications, you know, the 12-factor pattern, right? Um, they actually just use, like, environment variables, and they're used to, like, reading from Kubernetes secrets, or whatever, right?

B

Um, so I definitely agree that we should not expose this feature by default. I feel like it should be an opt-in feature, given that we want the default to be most secure, and users should opt in. Having said that, I also think the initial concern about backward compatibility is a big one.

B

So maybe what we could do is make this configurable in Helm, expose it as a, you know, parameter in values.yaml, and then we can keep it enabled for a couple of releases, and make sure we have that deprecation, like, the change message, in the release notes, so that people will know: hey, in the next few releases, we're going to disable it by default.

B

That way, we give people sort of, like, some time to adjust. What do you think about that?
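The opt-in being discussed could gate the feature behind a single driver flag surfaced through the chart's values.yaml. The sketch below is an assumption, not a decided design: the flag name `--enable-secret-sync` and its enabled-by-default setting are invented to illustrate the backward-compatible rollout proposed here (ship on by default, announce the flip in release notes, later default to off).

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// syncEnabled parses a hypothetical --enable-secret-sync flag. Defaulting to
// true preserves backward compatibility for existing Helm users; a release
// note would announce the planned flip to false a few releases ahead.
func syncEnabled(args []string) bool {
	fs := flag.NewFlagSet("secrets-store-csi-driver", flag.ContinueOnError)
	enable := fs.Bool("enable-secret-sync", true,
		"sync SecretProviderClass content to Kubernetes Secrets (default will flip to false in a future release)")
	_ = fs.Parse(args)
	return *enable
}

func main() {
	fmt.Println("secret sync enabled:", syncEnabled(os.Args[1:]))
}
```

With a gate like this, the chart can also drop the elevated RBAC (read/write on Secrets) from the driver's role whenever the flag is off, which addresses the privilege-escalation concern raised earlier.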

G

If I could, a question. I'm just, um, I don't know, kind of thinking about this just now. Um, yeah, that sort of rounding down to off-by-default seems like intrinsically the right way. I'm a little... I'm not quite following the backwards compat. Like, we don't have secret syncing now, right? So what's the backwards-compat issue?

B

So, um, so currently the Key Vault provider is the only one that supports the syncing aspect; however, the feature is on the driver, right? So right now, we're basically trying to turn it on and have it configurable, uh, in the driver, so that there is a way to opt in and opt out. Um, and the current behavior is, we give elevated privilege to the driver, assuming that users want that feature on, right, which is problematic. So this is why Tommy is looking at making it configurable...

G

On.

B

The driver.

G

Yep, okay, that makes sense.

B

Okay, so what do folks think about.

B

Making it enabled for a few releases, then disabling it in a few releases, with the message in the release notes for the next few releases?

E

That make.

C

Sense.

B

Awesome. Thanks, Tommy, for raising the issue and working on the PR.

C

We'll see if I create the pr this week, hopefully.

B

Let us know if we can help.

A

All right, just a couple of minutes left, but I think we got through everything. This was a good meeting again. I did some real-time note-taking; uh, if I missed anything, feel free to update or add to what I put out there.

A

And let's see... so our next meeting is in two weeks, so that'll be on August the 6th.

A

All right, thanks, everyone, for showing up, and we will see you on the next call. Have a great day.

B

Good to see everyone.

B

Bye.
From YouTube: Secrets Store CSI Driver SIG Auth Subproject Meeting 20200723
