Description
doc - https://docs.google.com/document/d/1jh_hRvhaYBx9ENPmmnZ4e34QOc9-3oHKvA5aog22aio (internal)
issue - https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15447
B
I guess it just does. Then I'll go over the agenda. I'm not sure if any manager is joining, but there is an ask from Steve for people to give examples of alert storms, where we get paged a lot of times for the same problem. He's trying to look into ways to reduce the amount of alerts when something goes wrong, and there are some cool diagrams and info in the issue that are worth looking at.

B
Vault is currently deployed in the pre environment, and this is the external URL that you can use to access it today. Well, not to log in yet, because we are still waiting on an access request for permissions to Google Groups so that we can integrate policies for everyone. And this is the internal URL that most services will be connecting to. I'll go over the cluster setup quickly.

B
There are five pods deployed in the StatefulSet. It is multi-zone, so the preference is two pods per zone, and they can shift depending on whether there is a zonal failure or not. The disks are regional, so we are not specifically bound to a zone because of the data.

C
I think that's okay, because I'm pretty sure from what I've seen it looks pretty standard, nothing unusual. Okay.

B
Yeah, just a quick highlight of how Vault is set up: it uses the official Helm chart from HashiCorp for high availability. We have a StatefulSet with five pods. It runs two per zone by default, well, except that one zone always gets only one, and this is for quorum reasons, based on how Raft elects the leader. We use regional disks, so we are not bound to a zone for storage, and, as I said before, it uses Raft for storage and quorum, so everything is built into one.
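For illustration, a minimal sketch of the high-availability settings in the official HashiCorp Helm chart that this kind of setup corresponds to; the replica count matches what was described above, but the values file, labels and anti-affinity rule are assumptions, not the actual production configuration:

```yaml
# values for the official hashicorp/vault Helm chart (illustrative sketch only)
server:
  ha:
    enabled: true
    replicas: 5            # five pods in the StatefulSet
    raft:
      enabled: true        # integrated Raft storage, no external storage backend
  affinity: |
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: vault
            topologyKey: topology.kubernetes.io/zone
```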
B
We have snapshot backups that we push to a bucket in GCS, and we've tested the recovery from them and it all works well. In terms of security, it uses KMS from GCP for auto-unsealing.

B
So this means that even if data stored on disk gets exposed somewhere publicly, it doesn't matter, because the data is always encrypted in a combined way: we need KMS from GCP plus other tokens to actually unseal the data so that we can access it. There are some details in this link here where you can see how Shamir keys work.
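As a rough sketch of the auto-unseal piece being described, the GCP KMS seal is configured in Vault's server configuration, shown here as it might appear embedded in the Helm chart's config block; the project, key ring and key names are placeholders, not the real ones:

```yaml
# fragment of the chart's server.ha.raft.config value (illustrative; names are placeholders)
server:
  ha:
    raft:
      config: |
        storage "raft" {
          path = "/vault/data"
        }
        seal "gcpckms" {
          project    = "example-gcp-project"
          region     = "global"
          key_ring   = "vault-unseal"
          crypto_key = "vault-unseal-key"
        }
```

With an auto-unseal like this, the Shamir shares become recovery keys rather than unseal keys, which is what the recovery-key workflow described next refers to.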
B
We don't have a root token, so it's not possible to have an admin token for everything; we removed it after bootstrapping. We do have recovery keys, stored in 1Password, that we can use to recover from snapshots. In case there is a full region outage, we can easily take the data from GCS and bootstrap the service somewhere else, but...
A
The recovery keys are mostly there so we can get a root token again if we need one. That's the scenario: you use the recovery keys to generate a root token, and you can use that token to... there is no snapshot involved if you don't want one. But restoring from a snapshot is just initializing again, so you get a new root token from that, and then you restore the snapshot into the storage, which will put your data back, and you are good to go.
B
Yeah, thanks. So the general way the cluster is set up: it is deployed by Helm in Kubernetes, but all the configuration comes from Terraform, so everything from roles to policies is in an environment in config management, and there is a link to the folder. In terms of secret structure, we only have key-value backends for now, to mimic what we currently have with GKMS and the other stuff that we are using in different tools.
C
Just quickly on this: often we have the same secret... I'm not sure, I'm assuming this is done like this. As you said, it's more about role-based access permissions, but are we going to have this situation where you have two secrets that are the same thing but have to go in different paths? To me that seems more complicated, if you know what I mean. Is this the only way we can kind of... yeah, okay.

C
This can be changed, obviously, and we can figure it out. All I'm thinking about is that we want to try to avoid a situation where it's like, here's a password, but you actually have to update it three times in Vault because different things are using what is the same password. But yeah, I appreciate that namespacing it that way means you can say this CI job can access that, or this Kubernetes cluster can.

C
So, I don't know, play around with that and think about whether you can get it to the point where, even if it's just "here are the production secrets, here's the secret", we can do roles so that we can have CI jobs and Kubernetes clusters accessing the one secret, because I think a big part of the work on this project is a single source for secrets, right?

C
So we want to try to push that as much as possible, where we can, with the opportunities that are made available to us. It could also be... I think we probably need to document and investigate the variables we use in general. There are some that are just used for CI, and there are some that... see, it's tricky, because it's almost like you want to define... sorry.
B
We do have shared folders in these that can be accessed, sure. For example, the repo level can access a shared folder that all environments can access, and stuff like that, and it's easy to manipulate this in any way that we want, yeah.

B
Okay, cool. We just didn't want to go full on into trying to solve all the problems we have by scoping specific secrets for now, and instead tried to mirror what's already in place, so that we also don't put any limitations on integrations and stuff like that. Sure, yeah. For Kubernetes it's the same thing: cluster, namespace, app. The app will be bound to a service account, that service account is then bound to a role, and that role can only access this path.
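To make the scoping concrete, a purely hypothetical example of how the key-value paths and the bindings could line up; none of these names are the real ones:

```yaml
# purely hypothetical path layout; every name here is invented for illustration
examples:
  - shared/registry-pull-secret                      # a shared folder several roles can read
  - ci/gitlab-com/deploy-token                       # scoped to one CI project's role
  - k8s/pre-cluster/gitlab/webservice/db-password    # scoped to one app's service account
```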
B
At the moment we are leveraging Google for integration with GCP. It is OIDC, but it also uses JWT, depending on what kind of things you are doing.

B
For Kubernetes we have authentication with the cluster based on service accounts and native Kubernetes authentication; with GitLab, using JWT, which is how we integrate with CI jobs; and then AppRole, which is kind of token based, and we only use it for provisioning at the moment, we don't use it for anything else. Next I'll go over some examples and merge requests. For example, this is not a merge request, but this is how it generally works in CI with the native integration using the secrets feature.

B
So basically you just provide the variable, and then you specify which path in Vault it uses and which field to grab, and this will basically inject the secret into a CI variable, as if you were using variables in the CI job. This is pretty straightforward and is the simplest one, and it currently works for gitlab.com and ops, which we have integrated with Vault. Now I'll go a bit over how it works, but not in too much detail, and then Terraform for all of these methods.
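A minimal sketch of the native GitLab CI `secrets:` integration described above; the job name, server URL, role, mount and path here are invented for illustration:

```yaml
# .gitlab-ci.yml fragment (illustrative; names, paths and the mount are invented)
deploy:
  stage: deploy
  variables:
    VAULT_SERVER_URL: https://vault.example.internal:8200
    VAULT_AUTH_ROLE: example-deploy          # Vault role the job tries to assume
  secrets:
    DATABASE_PASSWORD:
      vault: example-app/db/password@ci      # <path>/<field>@<kv mount>
  script:
    - ./deploy.sh   # by default DATABASE_PASSWORD holds the path to a file containing the value
```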
B
They will use, for example for CI, the GitLab authentication backend; Terraform will also use it, it just depends on the way that we are authenticating. If we go here quickly to the providers...

B
Going over it... it's going to be hard to explain here.

C
And then in Terraform they just become, sorry, local variables, or... yeah, or... Vault is a provider, right, and then we just reference the provider and...

B
The variables here, yeah. So this is templated with the JSON and the stuff that we have, and here, for example, the auth role will be named after the repo, and then read-only and read-write tokens and so on, and this is the CI JWT token. So we are using the native GitLab CI token to authenticate with Vault. And just to show how it looks on the other side: environment, Vault role, GitLab.

B
This is how Vault grants the role to the job or not, based on bound claims. So here we have a project ID, which we configured before, and there's a bunch of other parameters that we can set. For example, we only allow it to assume this role if it's coming from this project ID and if the reference is protected, in this case if it's running on the master or main branch, and then we give it these policies, and the policies are what currently scope what kind of paths it can access in the key-value backend.
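For reference, the kind of role definition on the Vault side that this bound-claims check corresponds to, shown as the role's parameters; the project ID, claim values and policy name are placeholders:

```yaml
# parameters of a role on the GitLab JWT auth backend (illustrative placeholders)
role_type: jwt
user_claim: user_email
bound_claims:
  project_id: "12345"          # only jobs from this project may assume the role
  ref_protected: "true"        # only protected refs, e.g. the default branch
token_policies:
  - example-app-read-only      # policy that scopes which KV paths the token can read
```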
B
And once the provider is authenticated, then we can just grab secrets from there. In this case we are grabbing secrets to configure Vault itself, so Vault uses Vault to configure itself.

B
And then moving on to Helm or Tanka, anything Kubernetes. This is a bit generic in the way that it works for anything; here we're just explaining how it would work for Helm or Tanka. We're just installing Vault and then generating the token.

B
In the same way that Terraform was obtaining its credentials, it is using the Vault authentication path it's going to use, which is the GitLab authentication backend, which role, and then the CI job, and this allows us to execute anything with Vault that that role allows. And for helmfiles, here we are using the native integration, which is just a reference to Vault and then the path of the secret.

B
So instead of GKMS-encoded files, now the helmfile library will execute Vault API calls and get the secrets for us automatically.
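A hedged sketch of what that helmfile-side reference can look like; the release name and secret path are hypothetical, and the `ref+vault://` form is the vals-style reference that helmfile resolves by calling the Vault API:

```yaml
# helmfile.yaml fragment (illustrative only)
releases:
  - name: example-app
    chart: ./charts/example-app
    values:
      - adminPassword: ref+vault://ci/data/example-app/admin#/password   # resolved at render time via the Vault API
```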
A
One thing about this one is that it's not documented in the helmfile documentation. I had to find it in a pull request, and it's been implemented since 2019 but it never made it into the documentation, so I don't know if it's maintained or not.

C
No, it is, it is maintained. I did look at this back in the day and you're right, it's poorly documented. It wraps around... so this is functionality in helmfile, not in Helm, so helmfile is pulling the secret and passing it as a plain-text value to Helm itself. It uses sops, Mozilla's sops project, which is the format you've got to follow. So if people are looking for how the format works, going to the Mozilla sops project is usually the best, but yeah.

C
It has had work done to it, like they've expanded the secrets backends, so I don't think it's going away. But yeah, I agree; the documentation for helmfile in general is actually quite poor, though I think they're going to fix it up now that they've moved the project into its own GitHub organization.
B
It completely removes secret handling from CI. In this case we are using External Secrets, which is a Kubernetes operator that has a bunch of CRDs, custom resources, in Kubernetes, where it directly connects to Vault and injects the secrets into Kubernetes for us. So with this method we no longer need CI to handle secrets, and there are many different ways of connecting to Vault and grabbing the secrets, which is what makes it a good fit, because it's very flexible in how we can connect to it.

B
And then we can manipulate the secrets and so on. In this case, for Tanka, we just added...

B
We are generating a SecretStore, which is basically telling External Secrets which provider to look for secrets in, and in this case it's Vault. We are giving it Vault's address, which path it's going to use, and which version of key-value backend it uses; you know, v1 doesn't have versioning and v2 has versioning.

B
That's basically the only difference. Then we just tell it how to authenticate, and we will be using the native service account for the namespace, with Kubernetes, where it's going to authenticate, and the role and the service account it uses. And the ExternalSecret is what actually tells the external secrets provider where to look for things, so basically we are telling it to use that secret store.
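Roughly, the SecretStore object being described looks like the following; the API version, names, paths and role are placeholders and may differ from the real manifests:

```yaml
# SecretStore for the External Secrets operator (illustrative placeholders)
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault
  namespace: example-app
spec:
  provider:
    vault:
      server: https://vault.example.internal:8200
      path: k8s                                     # KV mount to read from
      version: v2                                   # KV v2, i.e. versioned
      auth:
        kubernetes:
          mountPath: kubernetes/example-cluster     # Vault auth mount for this cluster
          role: example-app                         # Vault role bound to the service account
          serviceAccountRef:
            name: example-app
```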
B
It depends how we want to scope it. We could use a cluster-wide one, and then everybody has access to those secrets, right, but we don't want certain applications to potentially have access to the secrets of other applications, so here we are creating one per application.

B
The way that you then use secrets is you just create the ExternalSecret.

C
Yeah, well, that's what I'm trying to understand. Can we move the SecretStore objects into, like, the helmfile for Vault, so when you deploy Vault... oh wait, no, that's not going to work. How are you doing it at the moment in, say, staging, where you've got like four Kubernetes clusters? To all of the clusters?

B
So when you deploy the external secrets operator, it does deploy the Vault connection and sets everything up, so everything is in place. That's the only thing that you need to deploy in the cluster, right.

C
So, sorry, so external secrets will be in all four clusters, and that's deployed with a GKE service account, which will be the actual process that's authenticating to Vault. And so then do we still need at least one SecretStore CRD in every cluster, right? Yeah.

C
Okay, so what I'm trying to split out here is developers. Our long-term goal in Delivery for this year is to get developers owning deployments, you know, that's one of the things we're moving toward. So probably what we want is... we don't want...
C
Correct me if I'm wrong, but we probably don't want SecretStore objects to live alongside the apps, like, here's their deployment, here's their service; we don't want them specifying their own SecretStore, right? Because it sounds like, no matter what, they could just say: I'm going to define a SecretStore that gives me access to everything, right? Well... that's still not going to work.

C
It's like you...

C
Okay, that is the part I'm missing, in which case that's fine. I'm just trying to understand: how do we say this app has access to these secrets, like, do we define that somewhere, versus a developer just being able to define whatever they want, kind of thing? Is there a split there we can...?

B
Yeah, with the roles; that's where we define what has access to what, right, and here we set the roles...

C
Yeah, okay, okay! So basically we're saying that the service account for gitlab-web can access these secrets, but the actual configuration of the SecretStore object has no real... like, that's just telling you where to find the secrets and who to identify as, but it's not actually controlling the permissions of that. Yeah.

C
I think that makes sense, yeah, okay. So yeah, I think the difference is, and maybe this was when I looked at external secrets, and this was years ago, back when it was still under GoDaddy before it was its own project, the model was... I'm not even sure there was a SecretStore CRD; the external secrets controller ran as a service account, and it just had access to kind of whatever you wanted.

C
So it was like you had to be careful about what people were setting up as secrets, if that makes sense. But it sounds like they may have fixed that, or maybe not fixed it but re-architected it, so that now the controller is still making requests, but on behalf of the actual user that we expect it to. So there's no way for a user to create CRD objects to gain access to something they're not allowed to access.
B
Yeah, yeah, I think it'll be easier if I show it here. So, for example, we have this one.

B
So this is creating a SecretStore and an ExternalSecret in the vault namespace, and we can see here that it has been created, and if we go to the SecretStore we can see it here, and this is all in the vault namespace, right? Yep.

B
And if we look at the definition, we can see it's called vault, and it's bound to the Kubernetes authentication method. Yeah.

B
At this path, and yeah, this is basically whatever prefix we give it, we called it kubernetes, then the cluster name, and the role it's going to try to use. And if we go back...

C
Yeah, so I'm pretty familiar with this, so yeah. So it's saying, and that's on the SecretStore object, right, the SecretStore is saying: log me, the service account in this Kubernetes cluster, my web app service account, log me into Vault at this authentication path, and then you have to specify the role that you're trying to log in as, and then it'll give you access or deny based on whether you actually have access to that role.

B
Yeah, and the namespace, and we can give extra policies or not; here we are not. And then we just specify which service account it's going to use. We are using the pre-created one; we could also create a separate one.

B
Yeah, in terms of the SecretStore, maybe per app doesn't make sense everywhere, right, so we can also change that. Yeah.

C
Sure, no, as I said, I was confused. What I thought a SecretStore object was, was basically defining whether the whole cluster could access secrets. Now that I understand the request is based on the service account itself, that makes more sense. I think for apps, and this is just my opinion, I'm happy to change it, there are two kinds of layers, right. For the apps upwards, you can consider the GitLab web app...
C
You can consider, like, design.gitlab.com, all of the bits and pieces. I think we probably want a SecretStore per app, because their developer groups are going to be different people and what they have access to is different. Below that, so like Consul, Elasticsearch, fluentd, all of those, you could do one per app as well, or you could just have like a global infrastructure one if you really want. It's up to you guys; I could see it just to try and reduce complexity.

B
Yeah, it really ties into how we want to structure it, in terms of, like, avoiding duplicate secrets, so that we don't have to update them in multiple places and so on, because we can make it as simple as we want or as complicated as we want, because it's really flexible with the policies. And this is the ExternalSecret that it's creating, so it's just pointing to the SecretStore. Yeah.

C
That makes sense, yeah, yeah. We will do this, yeah, we will do this for gitlab.com. So what I'll probably do is, obviously when this is all ready to roll, we will just use it in our Helm release; well, we're changing the gitlab-com repo over the coming months, but we will essentially just create pure secret objects, sorry, pure ExternalSecret objects, and just rely on the controller to sync that across. And we won't... because we can't... we don't want... I mean, we can all agree.
C
Also permissions, right, because for us now, for everyone on this call who's an SRE, we get permissions to do whatever we want, it doesn't matter. But you know, we're trying to move... actual GitLab developers want to be able to work in the repos a lot more and work in CI pipelines and actually see the CI pipelines, which means we have to get all the security-sensitive permissions out of them.

B
I was going to get to that, and this is what is so great about it: we don't care if it's down. The only blocker... you can see Vault being down as the same as CI being down, because if Vault is down then CI probably can't grab the secrets, right. But in this case we don't really care, because the operator syncs the secrets from Vault into native Kubernetes secrets.

B
So if Vault is down, nothing happens, right; you just won't be able to update secrets, but you can always go and change them directly in Kubernetes until you fix Vault, same as before, right, like if GKMS is down, what can you do? Because we are leveraging native Kubernetes secrets, this is just a way to bring them from Vault and apply them in Kubernetes.
B
All right, so I'll just quickly demo. For example, here we have my-key with the value test2, but if we go to version one, we can see it's just test. So if we go to the ExternalSecret... I just reduced the refresh interval, but it doesn't really matter. This can be useful in cases where we want to pull stuff dynamically from Vault and have it updated at whatever interval; we can also just disable it, and then it will only update when we change the spec of the secret.
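The ExternalSecret shown in the demo would look roughly like this; the names, path and interval are illustrative, not the real manifest:

```yaml
# ExternalSecret consumed by the operator (illustrative placeholders)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: vault-example-secret
  namespace: example-app
spec:
  refreshInterval: 1m            # how often the operator re-reads Vault; 0 syncs only once
  secretStoreRef:
    name: vault
    kind: SecretStore
  target:
    name: vault-example-secret   # the native Kubernetes Secret the operator creates
  data:
    - secretKey: my-key
      remoteRef:
        key: example-app/demo    # path inside the KV mount
        property: my-key         # field to pull
```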
B
And if we go to the Secret, we will see the reference that we added there, vault-example-secret. This was created by the operator, and we can see there is my-key with four bytes, and there it is, our secret. And if we go back to the ExternalSecret, we can update it again to version two.

C
Can we actually enforce that you can't update a secret in place, and instead get a new secret name for a new secret version? At least for us, we would really like to basically force that you have to create a new secret object, with the new secret version in the secret name, like we do now, simply because, in the case of the GitLab chart at least, if you change the secret out underneath it, the pods aren't reloaded and everything, and we like the more deliberate workflow where you have to do a merge request saying the GitLab pods...

C
You know what, that's fine, we can do that. We're going to start doing static analysis tests on manifests anyway, so that's something we can pick up there, and then we can just fail the test saying we don't want you to do this, if that's... because...

C
Yeah, I think what we want to do is just static analysis at CI time, right. That way you can give feedback to the user: this has failed, and this is why. Or, you know, define a library, even if it's just something standalone running on top or underneath, to validate that. That makes sense.
B
Yeah, I haven't looked in detail at what it can do in that aspect, but there are policies at the operator level, so maybe it can do that; it's certainly something that we can check. And just to wrap up with other use cases, for example, what we will do for Chef.

B
There is an official Vault Ruby library, so we can do the same thing as we currently do with KMS, where we just source the secrets from an external library. Same thing for Ansible: there is a community module where we can use the same kinds of authentication methods we use here, like native GCP and so on, so that we don't have to rely on static service account keys. For example, this would rely on IAM, so that we don't have to have manual tokens; we don't want to have secrets to access secrets.
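As an example of the Ansible side, the community collection exposes a lookup plugin; the secret path and URL here are made up, and the authentication options would need to match whichever auth method gets enabled (token auth is the plugin's default):

```yaml
# illustrative Ansible task using the community.hashi_vault collection (path and URL invented)
- name: Read a value from Vault
  ansible.builtin.debug:
    msg: "{{ lookup('community.hashi_vault.hashi_vault', 'secret=ci/data/example-app/db:password url=https://vault.example.internal:8200') }}"
```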
C
Have you guys looked at what Devin did? Devin, you almost got the whole way with the Chef Vault stuff, didn't you?

D
The code's still there. I don't remember exactly where it is, it's been a while, but yeah, it's all still uploaded.

B
Yeah, thanks. We haven't looked much in detail into Chef or anything else yet, just because we know that it's possible to do and there are many ways to do it, but the path of least resistance would probably just be to do what we currently do with GKMS, just using an external library or something like that.

B
And then lastly, group policies so that we can give people access: that's in progress, we're waiting on an access request to give us some credentials so that we can authenticate with G Suite to get access to group membership.

D
One of the things that I found when I was looking at the Chef stuff is that there are cases where we don't want to wait for the next Chef run to update secrets; Consul and databases and stuff like that were the main ones, but I think there were some others.
B
Yeah, definitely possible. It's also nice because the Vault agent actually caches certain things, so even if Vault is down there is still some caching there. Well...

B
Yeah, we are deliberately trying not to use it as, I'd say, a runtime dependency. That's why we have this approach with external secrets, because you can also use the Vault agent in Kubernetes and have the application read directly from the Vault agent, but we don't want to have a hard dependency on Vault as a tool. If Vault goes down, it should not affect any of our production services, right.

B
Yeah, we haven't looked into using the agent for Chef or anything on VMs yet, but yeah, it would be nice to have. For now we are just looking at... pretty much.

D
I think I have an existing recipe in Chef to install the Vault agent and get it writing to an arbitrary file, but yeah, I remember that same thing. I didn't want to use the Vault agent for authentication, but to use it to do its templating thing, like the Consul templating, where you just say, when this value changes in Vault, change it in this file, and then let the application use the files. That way you don't have the dependency that Vault has to be running all the time.

B
Yeah, it does, yeah, it integrates with Consul also, so we can use both. Definitely lots of possibilities. At the moment we're just trying the basics, so that we can at least start using a single source of truth for the secrets.
B
Yep, and that was all that I had to show. There are lots more things that I didn't get into because it would get too long, but the document links to source code and more documentation and so on, if you're interested in exploring.

C
Yeah, cool. I'd suggest doing that sooner rather than later, because the big thing that jumps out at me, and once again things may have changed, I'm not sure, is the difference in deployment model versus when we were looking at this last time. Last time we did an isolated GKE cluster that was completely internal and not accessible from the internet, because it is our secrets.

C
Looking now at, like, vault.pre.gitlab.net, what are you using for ingress, or what is fronting that to the internet? Is it just an external load balancer, or...?

B
Both. The external one uses GCP with IAP, the Identity-Aware Proxy stuff, so that we can access the web UI outside of the internal network. Right, okay. That one doesn't work for anything CLI or API directly, so we have the internal endpoint, which is just a Service.
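For the internal endpoint, a sketch of how an internal-only Service is typically expressed on GKE; the annotation is the standard GKE one, but the Service name, namespace and ports are placeholders:

```yaml
# internal-only LoadBalancer Service for the Vault API on GKE (illustrative)
apiVersion: v1
kind: Service
metadata:
  name: vault-internal
  namespace: vault
  annotations:
    networking.gke.io/load-balancer-type: "Internal"   # keep the address on the internal network
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: vault
  ports:
    - name: https
      port: 8200
      targetPort: 8200
```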
C
Load balancer type internal, yeah, okay, cool. So hang on, if I go to vault.pre.gitlab.net, is that through IAP?

C
It's not restricted to our org, is it? Because I just tried logging in with my personal account and the screen came up... or unless... no, it must be. Okay, I must have my browser configured wrong. That's fine, that's perfect! That's what I just wanted to check, because for anything we expose to the internet, obviously we don't want to just trust Vault itself not having, you know, a security bug or a permission issue, before we expose it to the internet.

C
IAP is perfect. So that means we're planning on running a Vault cluster per environment, so one in pre, one in staging, one in production, instead of, like, an isolated one with namespaces, or...?

C
Yeah, so I think it's probably fine, because the gitlab-com Terraform repo, or whatever it's called in our config management, is not public. But once again, security will let you know if they want it even further locked down. All I'm saying is you may end up having to pull that out into its own git repo, but I'm hoping not, as long as it's not public. Oh, and you know what, the thing with security is, it depends who you get.
C
The team themselves may have changed their mind on what they want to do with that. And I think that was pretty much all I had. Oh, so yeah, the other thing: the external secrets. There are two interfaces that we in Delivery, and probably the developer users eventually, are going to be using, right. For Kubernetes I'm going to try to push them toward just using the external secrets controller, because it is a simple model, because they can just generate their own manifests, and it sounds like the permissions we can handle on the Reliability side, and there's just no way they can really screw it up or make CI pipelines fail. I don't mind if people want to use the helmfile integration, but I think just pushing for generic secret objects and external secrets will be much less painful for everyone. The other big one is going to be secrets in GitLab pipelines.

C
That's where, if Vault is down, the CI pipelines won't work, and that can be a huge risk for us in Delivery, especially if we're close to the 22nd of the month. Even Vault being down for, you know, a few hours or a day can cost us a lot of time when we're trying to meet deadlines.
C
So the only thing I wanted, and I think it'd be good at some point in the project to consider this: I know you've tested backups, but can we do automatic backup testing? Like, you know how we do it for the database: we have a CI job that spins up a new cluster and restores the backup every day, so we're constantly testing the backups. We should try and make sure we're doing that. With the open source version, do we get streaming replication?

C
Maybe that's not a bad idea. Or the other thing is, it would be great, and maybe this is too hard, a runbook or a script where... so GitLab CI's integration just populates environment variables, right. At the moment, in a lot of the important pipelines in the GitLab application, we just go in... sorry.

C
It would be good to have an emergency runbook, and/or tooling or a script, where we could say: get all the secrets for a specific CI job, and actually use the GitLab API to manually populate them back in as GitLab CI variables. So it's kind of like turning off the Vault integration: if Vault is down or having problems and we need to run pipelines now, there's an easy runbook.

C
The ones we need as environment variables on, like, ops.gitlab.net and the gitlab project might be the big ones. We can identify a list and be like, we can just get them running manually somehow until things calm down. Hopefully we'll never need it, but I think that would be really good, just to get that confidence in the CI/CD integration, because at the moment, you know, it comes from ops.gitlab.net.

C
It's just stored on ops.gitlab.net: if ops is running, the CI pipelines are running, it's all fine, it's a closed system. So if we're introducing an external dependency, then once again, having a runbook or a process where we can kind of flick it back temporarily, if needed, would be good.
C
So when we were looking at this last time, we talked to some other groups that were using Vault, especially the open source version. They had a bug where not only did their Vault instances go down with corruption, but their backups for the week since they had upgraded Vault were also corrupted.

C
So I'm not trying to cover every scenario, because that was obviously a very specific and rare scenario, but I think, once again, just in case the Vault ecosystem has something wrong, what would the options be there, considering these are just environment variables we can populate? Yeah.

C
I do think, just trying to keep... I know we don't get streaming replication, but even just keeping a standby up to date with a job that's kind of manually syncing it, because most of the time secrets won't change, so it's not a big deal. It's more about availability; some kind of hack we can do to try and keep some secondary availability could be nice, I think. That might be good enough.

C
You could consider doing something like an export to plain text or something, and then GPG-encrypting that as well and putting it in a GCS bucket. Maybe that's overkill, I don't know. It's just thinking about different ways to really build that confidence, so that in an absolute failure scenario we have a clear understanding of how we get those things, so we can keep moving.

C
I think that's the key takeaway. I'd be just fine with, like, a secondary one running in us-west maybe, and then, yeah, once an hour or half a day, we just snapshot and restore, and obviously page the on-call if that doesn't work, and things like that. That's probably just good enough, and then making sure, even in Delivery, we do...
C
Yeah, that's right, just something like that, I think, really just to build the confidence so that, in the case of Vault being down when we actually really do need to run CI pipelines, it's not something that needs discussion. There's just a runbook: this is exactly what you need to do, boom, change it over to this and away you go, so that Delivery and everyone else can just easily follow it. I think that's the key, yeah.

B
For cases like CI variables and stuff that we already have in place, I was thinking of just leaving both running. Sure, yeah. Oh yeah, yeah.

C
I think the thing that probably needs to be done as well, and this isn't really Reliability's job, is an audit of some of them. I'm not sure if we need all of them; there are probably a lot that... because the ones we add in the UI are not in git, we have no git history, we don't know who added them, when, or whatever.

C
So, at the risk of making this project take longer: at some point, when we move them, someone is probably going to need to audit them. It could be good to get security involved, they've got infrastructure security now, and say, hey, look, we're moving all these over, and get them to document where they are, where they're coming from, who's using them, and stuff like that. It could be some joint work you could do there, but yeah, that makes sense.

A
One thing we did at the beginning is that Reliability had, like, fully accessible secrets and stuff, you could view, create, delete and everything, and the developers had access to that group and they could create secrets.

B
Yeah, at the least it will give us way more possibilities than we currently have with GKMS and stuff, and then we want to go forward in restricting access and all of that, yeah. It's nice. We are a bit over time, but thanks everybody for joining. I'm glad that there was a good discussion. Thank you.