Description
Part of https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/143
D: I didn't explain it; it was just reconciling the config in the pod versus the database [unclear].
D: [unclear] The biggest gap right now is the replicas for application load balancing. This is something that I don't think we anticipated: we use Consul, which does a DNS lookup to get the list of replicas for application load balancing, and we don't have that in Kubernetes.
D: We don't have a good story yet for how to do this. In our Sidekiq pod, maybe we can run Consul; we have a couple of options, and I tried to lay them out in a new issue: do we run a sidecar, do we just run a Consul agent in another pod and then make Consul available to the Sidekiq pod? We have different options that we really need to figure out, and whether this goes into the official chart or not, I don't know either.
D: In general, Patroni is the one registering itself as the Consul service that has all the replicas in the list. It's kind of custom to GitLab.com; other people could do this, but they don't necessarily do it. We do ship with Consul, so I don't know, but I think this pattern of using Consul DNS for load balancing is probably unique to us, or very few people are doing it.
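(As a rough, illustrative sketch of the lookup described above, not something from the meeting: a Sidekiq pod would need to resolve the replica list against a local Consul agent's DNS interface. The service name, nameserver address, and port below are assumptions.)

```python
# Sketch only: resolve database replicas the way the Rails load balancer
# would, by asking a local Consul agent's DNS interface. The service name,
# nameserver, and port are assumptions, not values from the meeting.
import dns.resolver  # dnspython

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["127.0.0.1"]  # Consul agent reachable from the pod
resolver.port = 8600                  # Consul's default DNS port

# Patroni registers the replicas under a Consul service; hypothetical name here.
answer = resolver.resolve("db-replica.service.consul", "A")
replicas = [record.address for record in answer]
print(replicas)
```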
A: I mean, I am more inclined to say we want to understand the impact of this one queue in Kubernetes, so my vote would go for something that is not permanent; we don't have to make a decision immediately and solve every problem now. If we go down the route of thinking about all the use cases that users could have, that's going to drag us out in a completely different direction.
D: Could be. I mean, if we really wanted to, we could also just go without load balancing for this queue, because this queue is probably just not making a whole lot of database calls. At least that's what I'm discovering during my load testing, and from what we're doing in production, we're just not doing exports at a high rate at all; it's very low, except when someone abuses us. But I think we'll just work that out; we're working it out in the issue.
D: Do you know what we're currently seeing in production for exports? In general, I just looked over a 24-hour period. We peak at 10 exports a second, but that's fairly unusual; it's more typically around 5 exports every minute, so it's really slow, and you don't see a lot of these. Of course, when we do get abused, it goes up quite a bit. I didn't realize that we already have rate limiting for exports specifically in the application, and this is not Rack Attack or anything.
D: It's actually just a rate limit of, I think, one export every five minutes per project, and I ran into it during load testing, because of course the first thing I wanted to try was to export a single project, say, a hundred times, and that didn't work, so I had to hack the code to make it work. But I think the first thing we're going to try here is just doing an export across ten different projects.
D: The project example that I'm using is Ansible, because I just looked for a medium-sized project and this one's around four hundred megabytes. I think it's a good example, maybe kind of on the larger side; using these very small projects is not a good reference. Unfortunately, on pre-prod, which is where I'm doing my testing, I couldn't use a very large project like gitlabhq or the Linux kernel.
D: I was having trouble, and I think it has to do with pre-prod [unclear]; it was just taking very long. So I think when we go to staging we're going to be able to do some better load testing with very large projects. Ansible itself takes about 20 seconds or so to export, so I think it's a good example.
D: So here on the lower right I'm just watching the pods. You can see that there's one export pod running; I'm tailing the logs and pulling out some interesting things from Sidekiq. This is also going into Stackdriver, but it's a bit nicer to see it in the terminal, so I have a for loop. What I'm doing here is I have ten different projects, all of them just imports of Ansible, and we're going to hit the API and export all of them right after each other.
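(A minimal stand-in for the loop described above, assuming the public project export API; the host, token, and project IDs are placeholders, not the values used in the demo.)

```python
# Sketch of the demo's loop: schedule an export for ten projects back to back.
# Host, token, and project IDs are placeholders.
import requests

GITLAB_URL = "https://pre.gitlab.example.com"
HEADERS = {"PRIVATE-TOKEN": "glpat-REDACTED"}
PROJECT_IDS = range(1, 11)  # the ten Ansible imports in the demo

for project_id in PROJECT_IDS:
    # POST /projects/:id/export queues an export job for Sidekiq to pick up.
    resp = requests.post(
        f"{GITLAB_URL}/api/v4/projects/{project_id}/export",
        headers=HEADERS,
        timeout=30,
    )
    # 202 means the job was accepted; the per-project rate limit mentioned
    # above shows up as an error response instead.
    print(project_id, resp.status_code)
```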
D: So you can see successful responses for all of them. Down here on the lower left, just to explain, this is the queue latency: the time from when the job goes into Redis to the time that Sidekiq picks it up. You can see it's very fast, and then the export completed in about 15 seconds. Now you see that the next one started, and now the queue latency is a bit higher because it was sitting in Redis for longer. We look over here and we can see that...
D: So you can see the CPU is going up, and the HPA is probably starting to think about scaling; you can see it's starting to bring up another pod now. This is one thing I think we're kind of aware of, but maybe without realizing how bad it is: how long it takes for these pods to boot. It's a minute and 40 seconds, which is terrible. The dependency init container alone takes 40 seconds.
D: So if we just make that a really simple check to make sure that the database schema is correct, then that shaves off 40 seconds, so now we're down to a minute. That still seems too long to me, but at least it's a little bit better. So now we have some more exports again; we just had two of them, but it'll probably start to spin up a third one, and you see the second one is still...
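(A hypothetical sketch of the "really simple check" on the database schema mentioned a couple of paragraphs up, in place of the full dependencies init container; the connection URL and minimum migration version are placeholders, and this is not the actual chart logic.)

```python
# Hypothetical lightweight init-container check: is the database reachable
# and migrated at least up to some known version? Placeholders throughout.
import os
import sys

import psycopg2

EXPECTED_MIN_VERSION = "20200101000000"  # placeholder migration timestamp

conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur:
    cur.execute("SELECT max(version) FROM schema_migrations")
    (latest,) = cur.fetchone()

if latest is None or latest < EXPECTED_MIN_VERSION:
    print(f"database schema too old: {latest}", file=sys.stderr)
    sys.exit(1)
print(f"database schema ok: {latest}")
```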
D: If you notice here on the lower left, you can see that nothing is being done in parallel right now; we're just doing one at a time, so these latencies are creeping up, because they are the amount of time the job sits in the Redis queue. The execution time itself, of course, stays the same; it doesn't really change, but you still have to wait a long time just for the job to be picked up.
D: Exactly. In fact, I saw this when I tried to export gitlabhq: I imported that project from GitHub and then tried to do an export of it, and it just never finished; it was just blocking everything. In that case, with only one pod, the CPU doesn't climb, so we're stuck at one pod and we're blocking everything else. So that's no good.
A: So what if... I mean, we're starting here fairly low, with one pod, which is not even close to comparable to what we have in prod. We don't have to start with something low here; we can start with something relatively high, let's say 10 pods. That means ten exports can be handled at the same time, right?
E: Just to paint a bit of a picture around the usage pattern that we see for project export: I would say 90 percent, and I don't know the exact numbers, but in my head it's like 90% of all project exports happen between four o'clock in the morning UTC and ten minutes past four, right?
E: [That] is the total execution time for the priority Sidekiq queue, the estimated p95 queue time. Maybe it's not 90%, but every morning we get a really long queue, a backlog of export jobs, at 4 o'clock, and the reason for that is that those jobs are themselves kicked off by GitLab CI jobs that are daily and all run at 4 a.m., because that's when our daily jobs run. Yeah.
D: Okay, well, I think the biggest challenge we have right now is to figure out how we're going to scale there, or how many pods we need at a minimum to be able to service these exports. I was a bit surprised that when I did this very large export, it hung around for a pretty long time. Do you know, shouldn't there be a timeout for how long these jobs run? Is there no timeout?
D
Is
this
was
at
least
30
minutes,
if
not
an
hour,
and
the
reason
why
it's
because
it's
pre
proud
and
we
have
NFS
and
I-
think
it's
just
like
something
was
taking
a
very
long
time.
I
didn't
dig
into
it,
but
to
me
that
sounds
like
we
need.
We
need
to
timeout
right.
We
can't
wait
a
lot.
Wait
that
long.
It's.
E: It's one of those things where it's kind of a sunk cost, right, and people probably restart it over and over and over. Yeah, but I think more importantly, to the point, which is a totally different conversation: obviously project exports are a real mess at the moment. Today I found out about group exports, which is even more fun. Yeah.
E: We just match whatever we've got in production; that's fine. And I think the number one thing is we just have to be setting expectations that there will be queuing. There always has been queuing, and sometimes jobs will take like 15 minutes to schedule, and that's okay.
D: One last thing is that I tried to play with concurrency and read up a little bit about how Sidekiq handles concurrency. Supposedly you can have jobs on multiple threads, but it doesn't seem to matter for export: even if I set a high concurrency, we only do one export at a time. Does that make sense to everyone, or would you expect to have more than one export happening simultaneously if I set a higher concurrency for Sidekiq?
D: Okay, well, cool. I think that's about it for the demo. I don't know exactly what to take away from this; I think we need to test the large project export a bit more, and I'm going to wait until we do that on staging. I'd also like to do some more apples-to-apples comparison between the VMs and Kubernetes; we could enable Kubernetes for A/B tests of project exports.
A: There are quite a lot of challenges around this. What I'm considering is that there are a lot of challenges around running this in production the way it is right now; can we have one place where we have those challenges written down? We talked about the long boot time of new pods, we mentioned 15-gig disks...
A
We
mentioned
concurrency
and
couple
of
other
things
so
that
we
can
evaluate
what
kind
of
risk
we
are
willing
to
take,
because
if
we
go
down
the
route
of
resolving
every
single
problem
that
we've
just
seen,
knowing
that
this
already
exists
in
production,
but
by
pure
luck
of
us,
not
caring
about
limits,
it
works.
I,
don't
think
we'll
ever
finish.
This
yeah.
D: So I'm using the epic for that. I had to make a separate one for the chart, since we can't do cross-group epic linking or cross-group issues, I think, so I'm just using this. Skarbek, I think you're aware of this as well, so you should just add them here if they're not there; otherwise we just associate them to the epic. But I'm using this as the source-of-truth epic.
A: We don't want to be held up, yeah. Cool, awesome demo, thanks for doing that; it exposed some really great things here. So my next question for you would be: is there a way we can... can you add this script somewhere in CI and fire it off, even if it means just running it in a pipeline?
A: So what I'm trying to figure out here is how we can actually help the monitoring team with their dogfooding epic, because I believe this is their top priority for the quarter. How can we add this into our work right now, so that it doesn't block us fully but also provides continuous feedback while we are working on this, which we could fold into our process? I'm not really sure what kind of things we would want to depend on when it comes to our monitoring stack, especially with the Sidekiq queues.
D: I put this in the agenda here, but I talked to Dov on Wednesday, and I think, as Mary just said, they want to know what to prioritize, because they've been given the direction that our monitoring needs to be more like Grafana, but they can't just copy everything. So they want to know what specific things we want.
D
We
talked
a
bit
about
variable
templating,
which
I
think
would
make
it
so
that
we
could
have
a
little
bit
closer
functionality
and
some
of
the
you
know
the
kubernetes
monitoring
dashboards,
and
we
talked
a
little
bit
about
that
before
monitoring,
but
I
think
like
reaching
out
to
the
p.m.
he's
very
eager
to
talk
to
us
and
to
kind
of
get
us
to
open
up
issues
and
then
bring
it
to
his
attention.
So
he
knows
what
we
need
now.
A: On what you said just now, Jeff, about opening up issues: I'm linking the epic that I promoted some time ago, where we already work with them. Just opening up issues we did already, and this is why I'm kind of pulling this back to us. So if some of what you're mentioning right now is still not there, we should be exposing that either in that epic or in the other one that I linked, dogfooding a single metric.
A
So
far
back
if
there
are
more
items
like
this,
please
those
and
the
point
here
is
not
blocking
as
the
point
here
isn't
blocking
another
team.
While
we
are
already
working
on
this
right,
we
are
already
building
out
all
these
dashboards
right
now,
so
might
as
well
see
how
what
we
can
do
to
help
out
putting
this
in
in
a
product.
So.
A: No, but you're talking, Andrew, about the end result of a fully completed product. What we're talking about here is that we have very basic things here and they don't even know where to start, right? That's the big problem: if you tell them to just copy this, they don't know where to start, so we kind of need to work with them and guide them. We don't have the experience of using this properly, I would think.
A: Just to make it clear, it's not up to us, when we have our own major epic, to try someone else's smaller epic, however big it is; it doesn't really matter, it's not up to us. They did come to us to ask for advice, so now it's the time for us to provide help, right? So if this default graph is actually in the way, we should provide that feedback to them and then have them come back to us with the things they adapted, so that we can also adapt.
A: Guys, guys, please don't mix things up. They are not trying to beat Grafana; they're trying to get to a point where this is usable. You're talking about a product that is rounded and completed versus a part of the product that is just starting properly. What we are trying to do here is show them how to get to the point where this is going to be a plausible replacement, not saying that we need all of that built right now.
E: What I would say, if I were to give some positive encouragement, is: take a look at the Explore dashboard in Grafana. Let us mess with that graph, right? Maybe it's just PromQL and we can just type things in and change it on the fly, because the problem at the moment is that it's really inaccessible: it's just a graph, and if we want to dig further, we don't have the controls to do it.
C: On the single source of truth for Chef and secrets: Graham has put together an excellent proof of concept using the Helmfile project. He's got a working POC: he created a small GitLab secrets Helm chart that is able to grab all of our secrets out of GKMS appropriately and create the Secret objects inside of our clusters, and similarly he's got the necessary templating done in the same POC repo, where it grabs our configurations.
C: It's not clear to me either; I haven't fully looked into that epic at all. I was given it very late yesterday, so I didn't actually look into it, but this was the epic where we had a conversation about what we were trying to look for, because I heard through Graham that they were trying to use Helmfile as part of their cluster management.
C
Their
cluster
application
management
feature
they're
trying
to
work
on
will
include
helm
file
to
some
capacity,
but
they've
got
a
couple
of
other
hurdles
that
they
want
to
work
through
before
they
finally
get
to
some
like
this
I.
Imagine
the
part.
That's
gonna
kind
of
hold
us
up
immediately
off
the
bat
is
that
currently
they
don't
have
a
method
to
manage
multiple
clusters
within
one
project,
with
what
they're
trying
to
provide.
C
So
far,
it's
only
limited
to
one
repo
is
equal
to
one
Korea's
cluster,
for
some
reason,
so
I
think
that's
it's
still
a
work
in
progress
from
them,
but
I
feel
like
if
we
went
the
helm
file
route
today.
That
gives
us
a
leg
up
where
we
could
easily
integrate
with
something
that
they
finally
get
rolled
out,
that
we
could
probably
potentially
utilize
in
the
future
and.
A: I'll upload this and ask Daniel to take a look at the recording and provide us some feedback. The thing I would like to avoid is making a decision to go with one approach, the product going completely in another direction, and us tying ourselves again into yet another workaround, which we keep doing. So I think Graham can continue working on the POC; that's not the issue. What we could do is theoretically focus on one item that we could remove, something that would help unblock us, right? We don't have to port all the [unclear] stuff.
D: I mean, I think I put my thoughts in the issue, and I'm just worried that this could be a distraction from the problem we set out to solve, which is having a single source of truth for secrets.
D: [unclear], so I think without solving that problem we can't really have a single source of truth, because what is it going to look like? An engineer is going to update a secret for Chef on their workstation and then go to a pipeline and write it there too? That doesn't seem like it's going to sustain us; we need something better than that.
D: I think it remains to be seen what the workflow will look like, and we can take a look at that afterwards, after we finish, or after we decide what we're going to do for the Kubernetes stuff. With this proposal I see two possibilities: you update a secret in Chef and then you do a deploy, or you wait until the next deploy, see the difference, and apply it; or you have a pipeline running continuously...
D
That
constantly
is
looking
at
the
chef's
secrets
and
updating
our
kubernetes
cluster
and
notifying
us
or
notifying
us
when
there's
a
difference
which
I
think
yeah.
That
was
one
of
the
early
proposals
and
I
I,
don't
know
I
think.
Ideally,
we
would
just
have
a
single
pipeline
that
updates
secrets
in
both
locations
and
then
triggers
the
kubernetes
pipeline
afterwards,
but
I'm
just
really
worried
about
getting
out
of
sync
and
come.
You
know
being
surprised
when
we.
A: So this RCA that I'm supposed to be creating right now could expose exactly what you're saying, Jeff, and could actually support a bigger effort going forward. The problem I have with that is that it puts another huge delay on the team of two that are currently doing this migration, right? And if I cannot source more help from the rest of the department, it's going to put a big dent in what we can do.
C: I think the use of Helmfile, using the POC coming from Graham, will only solve half of this problem. It at least gets us to the point where Kubernetes is using the same configurations that our Chef configurations are using; how we make those changes is still up for debate, and I think that's a completely different conversation that we need to have.
A: All right, how about this: let's do a couple of things in parallel. Skarbek, sync up with Graham and explain the one thing that we are trying to resolve with Helmfile, and have that as the focus, not of the POC, but of how we could be doing this with the current infrastructure that we have. So: resolve one problem, don't replace anything else. In the meantime, I'll also sync with Daniel and ask for feedback.