From YouTube: Kubernetes SIG Auth 2022-01-11 KMS #3
Description
Kubernetes Auth Special-Interest-Group (SIG) Meeting 2022-01-11 KMS #3
Meeting Notes/Agenda: https://docs.google.com/document/d/1woLGRoONE3EBVx-wTb4pvp4CI7tmLZ6lS26VTbosLKM/preview
Find out more about SIG Auth here: https://github.com/kubernetes/community/tree/master/sig-auth
A
All right, so hey everyone. This is KMS meeting number three, on January 11th, 2022. For this meeting we wanted to go over what the tracks of work are for KMS and what we plan on doing.
A
So yesterday we talked briefly after the issue triage meeting, and some of the things we discussed there were, well, we talked about these three distinct things. We talked about the reference KMS library.
A
Yeah, so that covers the three things. Do you mind maybe going over them for everyone?
D
Yeah, maybe just before that: could you give me an idea of the current state of the KEP? We kind of wanted to split it into different topics, like observability, performance, and things like that, and I know we talked a bit about performance at first and a lot about observability, but not yet about recovery. Could you give me a heads-up on where we are today?
B
Yeah. In the first meeting we talked about the different categories: recoverability, observability, and performance. And if you looked at the initial proposal draft, we had said we'd focus on observability and recoverability first, because those are things we could get in without any breaking changes, and also based on some of the discussions we've already had. So we decided to go with observability first: open a KEP for it and then see what changes we can include in the upcoming release.
B
The KEP deadline is, I think, Jan 28, so we want to get the KEP in before that and then try to make changes in this release. What we thought was, maybe we could do the audit ID part, where we can actually string the whole audit ID from the user creating the request all the way to the KMS plugin. In the last meeting we decided that's something we could do, but there were a few outstanding questions we had for the SIG Auth community.
B
One was that the audit ID is user-configurable, so the user can actually override it with whatever audit ID they want. We needed to understand why that is required, and we also wanted to see if we could get a sign-off from everyone on SIG Auth. The outcome from the call was: yes, the audit ID is configurable, and we need to understand more about that behavior.
B
So that's why there's a track item right now, so we can understand why it's configurable and what we need to do. The second thing is that not every encrypt/decrypt request reaches the KMS plugin, because the API server has caching. Basically it caches the DEK and the encrypted version of the DEK in memory, and because of that there is no guarantee that every request that originates from the user will actually hit the KMS plugin.
B
So we decided there's no way we can provide a direct audit trail from the user request all the way to the KMS plugin. Instead, we wanted to focus on something where we can correlate requests from the API server to the KMS plugin; at least that makes debugging easier than it is today. Today, if you need to correlate, the only way you can do that is through timestamps: you have to look at a timestamp in the KMS plugin and then go back and look at the timestamp in the API server.
B
Instead of that, we went through a couple of options, and one thing we decided to do now is that the API server will generate a unique ID for every request it sends to the KMS plugin. This will be part of the gRPC request being sent to the KMS plugin, and then both the API server and the KMS plugin can log using that UID, and they can also use it for auditing to start off.
B
Then in track two we're following up with SIG Auth to understand why the audit ID is in the end user's control. If we can change that to make it less configurable, then maybe we can also hook up that audit ID with the UID that we are generating. So that's where we are now, and the next steps are: write a KEP for this part of it, try to get that in, and then start the work in this release.
D
Yes, thanks for the summary, I think it covers most of it. What I was wondering: in the last meeting I went to, we had this discussion of how we can propagate the audit information, and I think there were two ideas behind that. The first was to have one ID that would be generated specifically for KMS, like some specific field just for us, or to use the audit ID. So I guess the approach you're trying to take now is to start with a new ID and chain it with the audit ID later.
B
Yes. Right now we're starting with the UID, but once we understand track two for the audit ID and why it's user-configurable, and once we figure out and fix that part, then what we have can just be chained to the audit ID, so we can actually also show that to the users.
D
Okay, and so the goal today is to answer the questions and concerns that the SIG Auth people had, right?
B
At least for the start of this call: we have these three track items. One is the reference KMS library, then the observability KEP, and then also figuring out the audit ID. We had a small discussion yesterday after our triage call just to understand these things and who will pick them up.
A
I wanted to give an update. I did speak with Tim yesterday in regards to the audit ID thing, and what I got from him was that there appear to have been two use cases thought of for the audit ID being under user control. One of them was related to kubectl following redirects: maintaining the audit ID that it received originally from the server and continuing to pass it through.
A
I think that use case fell through, because they decided kubectl should never follow a redirect; it's just a bad idea. The server doesn't really redirect you in any situation; I think there are technically redirects in some of the exec and proxy things, but they led to problems and other issues. So I don't think that's a case we need to worry about. The other case is aggregated API servers, which we are also not concerned about, because we have an mTLS connection to them.
A
We can use that as our trust anchor. The gist I got from Tim is that we could consider ignoring an end-user-provided audit ID when it's untrusted; we could make such a change. So what I thought we could do as a next step is reach out on the k-dev, security, and SIG Auth mailing lists and ask: does anyone rely on this being under your control? If so, why? What are you trying to do?
A
Then, with that in mind, once we get a little bit of feedback, we could try to write a KEP to describe the change we want to make. The end state would be that only trusted actors can set this: aggregated API servers can, or front proxies in front of API servers can set this, totally fine. They're root on the cluster; that's totally fine.
A
You could even imagine that an authenticated user can set this header if they have some permission, say if they pass some subject access review, for example; that could totally be a thing too. There's a variety of ways to go about this. I don't want to speculate too much on that, because as far as I know no one uses this for anything, but this is Kubernetes and everyone comes up with horrible things they want to do.
A
So that was my thought there: we continue to reach out to our community, see if anyone would be impacted by a change, and based on that feedback we'd know the scope of the work for the KEP.
A
But we don't want to break anyone if it is valuable to them, because at a bare minimum any change would need to support aggregated API servers. Aggregated API servers should be able to know that the real API server is telling them, here's the audit ID for this request, keep flowing this through your stack. From the user's perspective it is one request; it doesn't matter that it's passing through seven layers of proxy or whatever. So that's the general gist there.
E
Oh sorry, one question. What would happen when we find, because the probability is quite high, that someone in the world uses the audit ID? What would the triage be? Do we say we should not touch it because it breaks too much infrastructure in the world, or do we say, okay, it's fine, you can deal with it? And if not, would there be an option to maybe add a new header to the request, like a trusted audit ID?
A
So say this is widely depended on and it's really problematic to change. We could add a new flag to the API server.
A
So it knows if it can trust the audit ID. Because what we had wanted to do is basically go ahead and start with a nice UID, work on the KMS case, and get in the ability to tie a KMS failure on the API server to actual KMS operations. We want to get that in, and in parallel we work on this audit ID effort, and if we are successful in getting that into a better state, we can just go ahead and optimistically change it.
A
Sorry, the UID within that KMS call would become the audit ID. That way, in the future, without breaking any API, we just give you more information: instead of it just being a randomly generated UID, it actually is the Kubernetes audit ID. So in those cases you can actually tie it back to the Kubernetes audit event.
A
We're not saying that any of these UIDs have any schema or necessarily any meaning today; we're just saying that you can use them for tracing. And if we get to a state where we can give you more, then if you happen to have Kubernetes audit events enabled and this particular request did generate an audit event,
A
you can use the UID to tie them together. That would be nice; that was sort of the final end state we had hoped for. But we don't know if we'll be able to get the audit ID into a state where it can be trusted, and if we can't, that's okay, we still want to do the UID work.
E
Would it be possible to say, okay, no matter what, we just proceed with the flag? We would leave it up to the admins: if they have some special, weird setup where they think they can trust the client, fine, and the other people who are not trusting their clients could just proceed with the flag. We'd have an implementation for this no matter what, and really push the decision back to the users.
A
I would like to push hard on the ability for Kubernetes to improve without end-user interaction, because what that would end up being is that no one would be able to set this to true without possibly breaking their users or customers. So they'd almost certainly keep it at false and then maybe add some configuration in their infrastructure; I'm imagining GKE or AKS adding more knobs in their UI where you say, no, I definitely do want this.
E
And if you would do it the inverted way, where you need to set a flag in order to make clients able to set the audit ID, then, if there are not many people or companies using this from the client side, you could maybe say: okay, you just need to add this flag to keep the previous functionality.
A
Yeah, and such an approach is fine. It's more that with that approach we would have to have a long staged rollout, where the flag basically starts off not changing the behavior; then, with some amount of buffer, we say, cool, the behavior is about to change, you need to set it to whatever you want; and then it changes, and so forth. That is a choice; it's not a bad approach.
A
My personal desire would be to see if we can make it so that we just force the ecosystem into: no, you have to make this ID trustable, you cannot just arbitrarily allow it. And we'd only change course if we get strong pushback in the other direction, saying, no, here's my absurd use case for why I want to set this ID from actors that are completely untrusted and not attached to authentication or anything.
D
If I can chime in and bring an idea to the discussion: what we are trying to achieve is basically to have this trusted, and the example you had with the aggregated API and the API server means that if we want that side to be trusted, we'll need some kind of authorization to say that changing the ID between an aggregated API and the API server is valid and should be allowed; it's something that is trusted.
D
And if we introduce this kind of trust chain, this kind of authorization, why not make it even more generic? Instead of having just that piece of code in the API server, make it really generic and allow users to authorize their own applications to change the audit ID as needed, for example via an RBAC rule. They would allow the application to change this particular value as if it were basically an aggregated API talking to the API server.
A
Right, whenever you have to enable something, you broke somebody, so it can't be in that form without a giant rollout phase of some sort. The one thing I'm trying to avoid here is too much change to aggregated API servers. The front proxy mechanism they use isn't really an authorization mechanism; it's an authentication mechanism, which says specifically that this actor is the front proxy and can set any user information on the incoming request.
A
Because it can set any user information, it is by definition root, because it can just ask to be root if it wants to; you can always do that. So that path is kind of well defined already, but it sits outside of both authentication and authorization; it's basically, this actor is trusted. So that mechanism kind of exists and is fine. If we wanted to make it generic, it's a little bit harder, because that tunnel isn't really authenticated.
A
It's fully trusted, and the thing doing that work is the front proxy, but from the Kubernetes API perspective it's the original authenticated user that flows through that tunnel, through those headers. So it gets a little murky: it's not that user you're authorizing, it's the fact that it was the front proxy that told you, that you're authorizing. So it gets a little weird. I mean, we could totally make all this work against the proxy itself, but it doesn't have an identity.
A
But we don't necessarily need to get too far into this one particular thing. What I wanted to make sure of: I had volunteered for the audit ID thing, I reached out to Tim, and I can continue doing more of that follow-up, but...
C
Sorry, good question to remember: I think there was a concern about how, during bootstrap time, the API server wouldn't be generating an audit ID, because it's not being written to audit logs. So in those cases we can't use the audit ID at all.
A
Yeah. I think no matter what, the use of an audit ID, even if we pretend the audit ID was always trusted, is an optimistic use; it has to be, both for internal calls and for the API server as it's filling its cache.
A
Those don't really have audit IDs associated with them, I think.
C
I guess I'm all for improving who gets to generate the audit ID. I just want to make sure it's a separate effort from making KMS observability better, because I feel like the scope is increasing quite a bit here. For one, I think having the UID generated already serves the purpose we came here for, and also the KMS improvement doc has a bunch of other capabilities that I think a lot of people really want to see, specifically around the performance aspect.
C
So I'm questioning: are we using our time correctly here? If the UID serves the purpose of telling us, hey, you requested the encryption/decryption; hey, you actually did the encryption/decryption; I can correlate the requests, then I'm good. So again, I'm wondering if our scope is increasing a bit beyond KMS at this point.
A
Yeah, that's totally fair! I guess the way I had framed it in my head, the initial KEP that we're working on writing at this current stage does not need to say anything about the audit ID; it just doesn't need to care about it. It just says, I'm going to do the UID, and that's what I'm doing.
A
Thinking through this, I don't know of other tracks of work related to KMS that can be done in parallel with the observability stuff and the reference implementation. If there are, then we should talk about those and prioritize them. I think you're just asking a priority question, right?
C
Yeah, I'm not saying we should drop anything. But also, January 20th, right? We're on the 11th. So how do we get the KEP actually open and reviewed? And for the performance one, there are a lot of opinions there. I want to make sure we spend our time on the things that really make the biggest impact today.
A
Yeah, within the next week the most critical thing is the KEP. Everything else can keep moving, and it's totally fine if it waits; it's not really a big deal.
C
Yeah, and the UID works, and it's not something that causes backward-compatibility issues. So I don't feel like we're going to get a lot of pushback to get that KEP merged.
A
The only real pushback I have on the UID, and it's not even pushback, is that it's unclear to me exactly how we guarantee that the UID always gets logged, so that when there are failures it is guaranteed to be there in some way. That's literally my only question. I'm sure this is totally technically possible; we could make sure that this always occurs.
A
Otherwise I don't have any concerns; it seems totally fine, it seems to serve the purpose, and it seems to be a very small, iterative change.
C
Okay, so let's make sure that's addressed in the KEP: how do we make sure the UID is written to the audit log?
A
Yeah, that's the thing we can focus on. I don't know when you plan on trying to have the KEP open; certainly we can just prioritize it. Whenever you have the KEP open, reach out, and myself and whoever else cares to review it can review it and get it into a good state. And I think Wednesday is the SIG Auth meeting, right?
A
Was it last week? Sorry, I couldn't remember when. Yeah, Wednesday, so we don't have another SIG Auth meeting before KEP freeze; the KEP freezes before the next SIG Auth meeting. Great.
A
That'd be right, yes.
A
All right, I can go ahead and put this on there, as in: we will talk about it, make sure we use the beginning of the meeting to look at any open questions and comments, get those addressed, and basically say, all right, cool, this is implementable within 1.24, or whatever we're saying.
C
I would love to talk about the performance part of the doc, and we have a little bit of time left on this call. I also want to make sure: is there anything we're adding as part of this observability KEP that relates to the performance stuff, anything we need for performance that could be impacted by this?
A
Right, so we haven't really talked about performance at all. We have what, 20 minutes? So we can just switch and talk about performance stuff, and that's okay. Even if we're not going to do any of it in 1.24, it'd be good to have the discussion started.
B
Also, last time we talked about recovery; maybe we can close on that one too. In this proposal we said we could have an option where the user deletes it, and we also talked about the cluster admin going and deleting the key in etcd. So maybe we can also close on that one and say what we intend to do. Do we want to leave it the way it is today, where list calls will fail, the user knows, and then they can go and delete it?
A
Well, it's been months now, but the last time we talked about this in SIG Auth, the feeling I got from folks on the call was that it is not the API server's responsibility to handle storage errors; it's just not in a position to do anything good about that, just as you cannot work around your database being broken within your app.
A
Is the API server supposed to do anything about that? I think the answer is, well, no, you're supposed to go fix etcd. Is this different? I don't know. What I dislike about the KMS API and all the encryption APIs is that they have put us in a place where we don't control the world for our storage.
A
We rely on an external service that can arbitrarily fail, for an operation that previously could effectively never fail, minus etcd being on fire. And this exact concern is the same one I had brought up for the external service account signer KEP, which is basically like this: it will turn what is currently an API that can never fail, because the private key is directly provided to the API server and it uses it for signing, into one that can.
D
Okay, one question from my side: what was the issue? Were they against the fact that we want to automate the deletion of the corrupted data, the data that cannot be decrypted anymore, or was it the addition of --force-delete to kubectl?
A
Those deletes and force deletes are all basically additions that say, hey, it's okay to violate the Kubernetes API. But the Kubernetes API is authoritative here, and if you wish to violate those things, you basically need to use the back door of directly hitting the database and performing the violation there. We should not give you a front-door approach to bypassing finalizers just because you have basically broken your cluster.
D
That's an introduction to one idea I had, which was to handle recovery via observability. With observability we should be able to know whether any data can no longer be decrypted, straight from the provider itself. So we can bring this information back to the users and say, hey, this secret or whatever cannot be decrypted because you don't have access to the key anymore, or we just add an alert.
D
The idea would be that, instead of using kubectl to do that, they would use, for example, etcdctl to directly delete the object in the database.
A
But let's step back and say that this is an EKS cluster, an AKS cluster, or a GKE cluster, where you only have access to the Kubernetes API. Are we okay with that, basically? In these environments you're working with a provider that is managing the infrastructure for you and handing you API server credentials. In that case there are only two options: you make a support ticket with your provider and ask them to go press buttons, or every provider implements something.
A
That's not free; that's a lot of work. And at the end of the day, anyone writing that code is going to be scared as hell, because they're giving you a button that is guaranteed to break something somewhere at some point in time. It is effectively like a support ticket.
D
They have to do a manual action, and this is a bit safer but more tedious. The idea would be that we still have a safe way to recover, even though it's tedious; at least we have something, because currently we have nothing. I'm not sure if we want to keep it that way.
A
What I'm curious about, and Rita, I think this was something you had brought up a while back: yes, bad things will happen and we want to be able to recover from them, but is there a way within the KMS API itself where we can make it very difficult for this to ever occur? The particular case I'm thinking of is, and I think the word you used, Rita, was a lease; I'm naive to how KMS APIs work in the cloud KMSes.
A
If we consider an expansion of that interface, can we specifically hold a lease against the encryption key we are using, and basically make it incredibly hard for you to get into this case? You'd have to explicitly go out of your way to break something, because today it's not within that contract. Basically, I feel like the KMS API should be able to say, hey, I'm using this.
A
But again, I am naive to the cloud KMS APIs and I don't know what that ends up looking like, because whatever we build has to be part of our gRPC API.
C
It could be part of the doc: we can write this as a "here's what you should be doing": if you're using this feature, you should turn on the lease-lock feature in your KMS, and maybe point them to the reference implementation for how to actually use leases.
A
Right, but what I'm saying is: how does that manifest within our gRPC API, and then how do we manifest it within our storage layer to allow for a high-fidelity connection? That way, even without a live cluster, you can still validate: if the cluster is completely turned off and you have the etcd data, you should still be able to validate how that data was encrypted. Basically, that data set alone should tell you what leases you're currently holding.
B
I think that is where we thought about passing the KEK ID back, so that it's actually stored in etcd and that metadata is available, so the KMS plugin knows: these are the KEKs currently used in etcd, so I cannot delete them, no action should be taken on those. But we could also land in this particular problem
B
in one of three scenarios. One is that the key, which lives completely outside in the KMS, gets deleted for some reason; if it's Azure Key Vault, let's say someone just deletes the key vault by mistake, so we are in a state where we cannot recover. The second one is also related to performance, like we hit the rate limit for some reason.
B
We have guidance for those, but again that can also lead to errors. So we can get into this error state in one of those three scenarios, and the lease probably covers one of them. That is something we can tell users: be really careful that you don't delete this key; and for each particular external KMS store we can tell them, hey, if this option is available, this is how you enable it and make sure the key doesn't get accidentally deleted.
A
I'm curious if we have ideas there. The procedure you described is totally technically correct, but it's also relatively complex to orchestrate in, say, an HA environment, especially a kubeadm self-hosted or an OpenShift self-hosted setup. It's a lot of work; it's easier when you're not self-hosted, but it's not easy, it never gets easy.
A
Can we do something within the gRPC API for it to do rotation itself? Because if we are tracking within the Kubernetes API server what keys we're using, I could imagine a way for the gRPC API to say, hey, I would like a new key now, on some interval. It gets a new key.
A
It then initiates the rotation on its own, based on however it's going to orchestrate it on its end, and then you don't have to worry as much. Basically, this is hard code to write; it'd be cool if we wrote it ourselves once, tested it really well, and just told people, don't write your own, we already did it.
D
Yeah, I think that's a good idea, but the issue I see is that the core idea behind this KMS feature is to have the key handled somewhere else, by a trusted entity such as a KMS in the cloud provider, for example, and every operation related to this key, such as generation and all that, should be handled by the KMS.
D
For example, if a user said they want their key to be rotated every 30 days in the KMS, I don't think the cluster itself should have a say and tell the KMS, hey, I want a new key now; that's not the policy that was defined in the KMS itself. So I don't think we should force rotation ourselves from the inside. Yeah, go ahead.
A
To be specific, what I'm saying we do is: there's a set of configuration within the encryption configuration option, and I want to keep it as-is or smaller; I don't want to add any knobs there. What I would like is for the gRPC API to allow us to ask the plugin:
A
How often do you want to do rotation? In this case it would be like, hey, tell me what your rotation policy is; if it returns zero, you're saying no rotation, but if you return 30 days, that would help initiate it. Once that's there, we can orchestrate it for you. That's what I want: I want the difficulty of orchestration and the specifics of how the rotation needs to be done to be handled by us, once, correctly. But yeah, you're right.
D
Just a proposition, because I was thinking about that for our downstream use case too, and I think I mentioned it in the first meeting. We discussed it a bit, and you told me that the KMS itself tells the API server which key it has used to encrypt the data, which one of the KEKs. So if we keep in the API server, for example, a record of the last KEK, then we will know if a new one has been created by the KMS.
A
That is problematic today, because we don't have any mechanism that looks across keys; we only ever act on a single key. So if that action happened when all the API servers were not running and they came back up, they would be unaware of any change, because they would just start with the new key. It's also problematic when you have many API servers, because they might not all see that event; only one of them would, so how would the others know that it occurred?
A
I'm not saying this is technically insurmountable; I just think we don't have a storage format that allows for that yet. But yeah, if we had a better storage format that tracked the key encryption key ID somewhere in there, and we also built in a mechanism that was basically an observability controller, one that looks at everything being encrypted and asks, well, how is it being encrypted, then it sees the change.
A
If there's a leader or whatever, once I've observed that the key encryption key ID being returned by the KMS is different from the one that was previously in use, I can automate a rotation. Totally, that could work.
D
That would totally work, yeah. If I'm not wrong, this could be even simpler, because if you create an object currently via the KMS, the DEK will be encrypted with the latest key encryption key. So at the very least, if your first creation gives you a specific key ID, you take that one as the latest key encryption key that exists, and then it's pretty easy to notice.
D
On any update, you just have to compare the key used for the last encryption with the one you have in your storage, the one you cached previously, and if the ID is different, then you can say, hey, a new key has been created, so you need to start iterating over the data to rotate it.
C
Hey, we have two minutes left. I thought it was kind of funny that we went from observability to recovery to rotation, but we didn't really talk about performance.
C
It's okay, and this is all really helpful discussion nonetheless, but I do feel like we're probably at the point where, if specific people have ideas, and it sounds like there are lots of really great ideas, it might be good to start writing KEPs against these different topics, or within the existing docs, so that we can continuously iterate asynchronously and then use this time to discuss what we're proposing. And then, of course, personally,
C
I also want to understand, among the few of us, which one we should tackle next other than the observability one, because there are a bunch of topics, as everybody mentioned: recovery, rotation, performance, and then there's also hot reload of the encryption config, and also different encryption protocols.
C
I mean, these are all really great, though, and we are documenting the discussion as we go.
A
I think the most likely failure mode in production today would be performance.
C
I think long term we want a reference implementation that has every one of these topics in there, but initially, as you said, once we get the KEP approved for observability, we can start there first. Whatever gets approved, we can add incrementally. Okay.
E
Christoph, did you have something? So what are the exact next steps? Because we need to be more or less done by next Tuesday, right, so that we can finish everything up first.
C
Once the PR for the KEP is pushed, let's all go review it and then ping the SIG Auth approvers, and we will also present it at next Wednesday's SIG call just to close any open items. In the meantime, if you have time, think about all the questions and edge cases that we want to discuss, asynchronously or synchronously on this call, for performance.