Description
Discussing a way forward in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16115
A
So, today is August 18th, and we are discussing what a necessary stable counterpart for the Gitaly team looks like, and what kind of projects we can work on together, both from the Gitaly side and from ours.
A
I want to provide some context around this. In the past, the Scalability team has been the de facto stable counterpart for Gitaly, as Igor noted: when the Gitaly team needed to make configuration changes they weren't sure about, or when they needed to roll out something in production.
A
The Scalability team has been helping out there, which ended up resulting in the Reliability team having little knowledge of Gitaly. We don't have any in-house knowledge about Gitaly on the Reliability team, and sometimes we are on call for that service, or sometimes we need to roll out a change for that team because the Scalability team is too busy or something. We don't really have a go-to person for that, and the managers always end up rolling the dice on it.
A
So it's right there in the name of the company, and I think that's one of the big drivers for the stable counterpart. We've had stable counterparts in the past: for the Runner team, and Igor was one of those last year, and we also had a stable counterpart for the Fulfillment team for the customers migration.
A
Any questions so far on that? Awesome. So I wanted to set up this meeting because we had some async discussions on the issue about proposals for what we can work on, and things like that. But before going through that, I wanted to ask: are we creating a solution for a problem that does not exist? Do we actually need a stable counterpart? In the sense: does the Gitaly team feel like there are problems that the infrastructure team is not addressing, or has little bandwidth to address? Maybe you're the expert here on this part, Andres.
B
I think the main issue we could solve, which is sort of a lingering time bomb, is the lack of specific expertise in Gitaly on the infrastructure team. This affects incident response in particular, but over time I think it also affects the infrastructure design changes you make, which might turn out not to be a good fit for Gitaly simply because we couldn't articulate it well and you didn't have the necessary insight to take it into account.
A
This would be an iteration of having stable counterparts. We've tried this before, but with a lot fewer people on the team and a lot less focus time as well; we only recently got to about 30 people on the Reliability team, and now we kind of have the bandwidth to do this. We didn't have that capacity before, and I think this is one of the first iterations toward it. So yeah, agreed there. Excellent.
B
Yes, we have operational things like production access and things like that, but I see even more value in design work, in co-designing things, especially with what we're doing now, changing the whole way the application works. Yeah.
B
At which point, whoever has expertise has a better chance of jumping into this. So if you had already built that, and some of you have it, we would get more value out of this. Yes.
A
Agreed, yeah. And we can also have a voice in the infrastructure design, because I saw a comment on a Praefect merge request the other day, a thread between Matt and Jakob. Matt provided a point of view from the infrastructure department, and Jakob said: oh, that makes complete sense, but I didn't have visibility into that. I think having that is actually what can solve a lot of problems there. So I agree with that.
A
So, you mentioned the re-architecture of Praefect, for example using the Raft protocol instead of Postgres. What's the progress there? Where are we at at that stage? Is it still in the RFC process? How is that going?
B
We had a long internal discussion about the details of it, and now it seems feasible; we don't think there will be major blockers. We also found that it actually solves a lot of problems that we have now. Sami is writing up what was a fairly fragmented discussion on an issue into a design doc, so that there can be a second round of review without having to piece the information together yourself.
B
And this is our Q3 goal. From then on, I'm hoping that this will be blessed by everybody who needs to bless it, and then we can go and implement. The plan includes the "how do we iterate towards results" and "how do we actually get there", because I guess that's as hard as the "how will it work".
A
Yeah, and will the RFC have a rollout proposal as well, like how we will deprecate the PostgreSQL database and things like that, or is that still out of scope?
B
I think we need both, yeah. We have two OKRs: one is to write the design doc, the other is to write a project plan. And yes, we will eventually need a detailed rollout plan.
B
The current hope is that it will mostly happen by osmosis. As we upgrade the software, the plan is to start using Raft: keep Postgres, keep everything running as it is, and just start using Raft in the background to also replicate the information. Once that works, we figure out how to switch over, and then remove all the Postgres parts. This would effectively turn all the single nodes into a single-node cluster situation, which aligns very well with our goal that there shouldn't be a distinction between cluster and non-cluster.
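The "migration by osmosis" described above (keep Postgres as the source of truth, shadow-replicate every write into a Raft-backed store, and cut over only once the copies agree) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Gitaly's actual code; all names here are hypothetical, and the backends are stand-in in-memory stores.

```python
class MemStore:
    """Stand-in for either backend; real code would talk to Postgres or Raft."""
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


class Coordinator:
    """Routes metadata writes during the migration window."""
    def __init__(self, shadow_raft=False):
        self.postgres = MemStore()   # current source of truth
        self.raft = MemStore()       # future source of truth
        self.shadow_raft = shadow_raft  # feature flag for background replication

    def write(self, key, value):
        self.postgres.put(key, value)
        if self.shadow_raft:
            # Shadow write; a failure here must never fail the client request.
            self.raft.put(key, value)

    def read(self, key):
        # Reads stay on Postgres until the switchover.
        return self.postgres.get(key)

    def shadow_complete(self):
        # Precondition for flipping reads over to Raft and removing Postgres.
        return self.raft.data == self.postgres.data


coord = Coordinator(shadow_raft=True)
coord.write("repo/1/generation", "42")
assert coord.read("repo/1/generation") == "42"
assert coord.shadow_complete()
```

The design choice mirrored here is that the flag only adds a second write path; nothing about the existing Postgres path changes until the shadow copy is verified complete.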
A
Okay, okay, yeah. And how can the... sorry, go ahead.
C
Sorry, I wanted to go back to the previous point quickly, and on a higher level, maybe give a little bit of input from the Scalability side. I think the collaboration between Gitaly and Scalability has been working pretty well; I'm pretty happy with that.
C
Where I see the bigger gap, though, is when it comes to scope. Scalability is really focused on, well, scalability and performance, right? So we've been heavily engaged in optimizing: we designed some caching-related logic, and we've been putting profiling tooling into place.
C
It's very much within that realm, but that leaves a lot of stuff out of scope. Where I see the biggest gaps is operational tooling; it's already hinted at a bit later in the agenda, so we don't have to go into the details right this second. But how do we manage our Gitaly fleet? How do we provision and rebalance the shards?
C
Those are questions that aren't really addressed by the Scalability team, and they're kind of outside the scope of the team's mission. So that's where I see a potential opportunity.
C
Yeah, I mean, I think the Scalability managers are a little more protective of the scope, so we're kind of punting more of that responsibility over to Reliability, which just pushes the problem elsewhere, but I think we need to address it somehow.
A
Yeah, I think that makes sense, and I agree. So, jumping a bit to the proposals for projects that the SRE team can work on: for example, the cgroups work. From what you just described, that really falls under the Scalability team, because that is more about constraining CPU usage and things like that.
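For context on what "constraining CPU usage" means here: under cgroup v2, a CPU limit is expressed as a quota/period pair written to the `cpu.max` file. The helper below is purely illustrative (it is not the rollout tooling being discussed) and shows how that value is computed:

```python
# Illustrative only: a cgroup v2 CPU limit is the string "<quota> <period>"
# in microseconds. Allowing a process group N CPUs of runtime with the
# default 100 ms period means quota = N * 100000.

def cpu_max(cpus: float, period_us: int = 100_000) -> str:
    """Return the cgroup v2 cpu.max value granting `cpus` CPUs of runtime."""
    quota = int(cpus * period_us)
    return f"{quota} {period_us}"


# Two full CPUs, and half a CPU:
assert cpu_max(2) == "200000 100000"
assert cpu_max(0.5) == "50000 100000"
# Provisioning tooling would write this string to a path like
# /sys/fs/cgroup/<group>/cpu.max (path shown for illustration only).
```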
A
Awesome. Okay, let's look at the time; okay, 15 minutes left. So I was going to jump to how the SRE team, the Reliability team specifically, can get more involved in the design doc for Gitaly Cluster. Is it a matter of just waiting for Sami to write everything down, and then we review that merge request, or is there anything else that we can do there?
A
Okay. And would it be completely okay for us to start looking at it, poking at it, and leaving some feedback if we find something?
A
Okay, so looking at the current projects that the SRE team can work on, for this quarter and the next one: it's pretty clear that support for Gitaly Cluster, the architecture direction, is the number one priority, since that's Gitaly's OKR, so you can kind of tie those two together. There's also the cgroups rollout that's ongoing, and I'm curious whether the Reliability team should pick it up.
A
I
know
rachel
raised
some
concerns
around
that
because
of
ramping
up
time
and
like
context,
switching
and
things
like
that,
I
I'm
curious
to
hear
andrew's
opinion
on
this.
Like
is
there
a
lot
of
work
still
left?
Is
there
like
a
big
amount
of
context
together
around
it?
Or
can
we
like
kind
of
pick
it
up
in
two
weeks
and
like
continue
the
work
with
the
guitar
team
to
get
that
ready.
B
It's been going in waves, sort of. I need to talk to John, who's been doing the actual work.
B
If we can improve on it, that would be awesome. Awesome.
A
Okay, so let's write down some action items in the agenda. The first action item, and this one is for me, is to write down the OKRs, or whatever the stable counterpart is going to be doing this quarter and next, for the Reliability team.
B
Before you do that, I have a meta question about the stable counterparts. I want to make sure that this is not in addition to all their regular responsibilities, because then they'd still have the full scope; at least some of that time needs to be dedicated to this for them to be effective.
A
Yeah, no. From what I understand, both me and Kalliope are going to be, quote unquote, full-time stable counterparts, in the sense that any issues that come up, like, for example, I don't know, cgroups and the Raft architecture, that's what we're going to start working on next week. That's the kind of commitment that we're thinking about, and also, even when we start discussing, I don't know, SSH access for Gitaly, yeah.
A
Yeah, I think we're going to get started with that next week, and that sounds like...
A
The number one priority for this quarter is to be Gitaly's stable counterpart, and if that means giving you SSH access, limited or in a safe way, then that would be our P1, along with helping out with the Raft design and the cgroups rollout, since the cgroups work has been happening in waves, kind of thing.
A
I've talked to Jarv about this this week, actually, and he wants to drive it, and that's completely fine, and yes, we can help out, even if it's just us being aware that costs are coming. And I know you folks talked about volume management, like why don't we have volume management, kind of thing?
A
And
things
like
that,
so
I
think
we're
aware
of
that.
We're
just
leaving
it
up
to
jar,
because
for
us
that
seems
more
of
a
cost
savings
measured
on
a
reliability
measure
that
makes
sense
yeah.
So
it.
A
That's a good point. Yes, especially if we're going to end up having downtime on each node, since that seems like the simplest way. So yes, that's a good callout. We can keep an eye on that as well and see if we can help out in any way, or at least be aware that things are happening there.
B
I don't know how much you know about Gitaly internals and how this works, or how much you want to know. What would you want to learn? Would you want to watch our new-team-member videos, or would...
A
I would go as far as to say: if we can join your weekly meetings, that would be perfect, because then we're at least aware of what's going on, even if it's just for five minutes. Okay, perfect, so we can join there, please.
B
Come along, I would be delighted to have you. I also hang out in the Gitaly lounge channel, yep. I can add you to a bunch of things. Yes.
A
Do
that
okay
and
I'll
write
down
and
I'll
pick
with
a
proposal
of
the
things
that
we're
gonna
work
on
for
qt
q3,
which
would
be
rough
designs,
c
groups
and
potentially
ssh
access?
Unless
we
can
talk
about
this
async
unless
we
feel
like
rebalancing,
is
more
of
a
problem
than
ssh
access.
But
we
can
talk
to
that.
A
Yes, if someone else says otherwise, then it's my fault, but from all the information I have, it's me and Kalliope who are going to work on it. So yeah, let's go ahead with that and roll with it.
A
I also want to open up a merge request in the handbook today as well, to make it official, so it becomes "this is what the handbook says" kind of thing. So yep, thanks.
C
So, Steve, when it comes to knowledge transfer, in particular on cgroups: I think there are a few folks on the Scalability team that do have expertise with this design, and I'd be happy to help facilitate some of that knowledge transfer.
C
Maybe you could connect with Rachel to see what the time allocation there is, and whether it makes sense to share that with the SRE team. I think it does make sense.
A
Yeah, I think it does, because at the end of the day, if we have to modify some cgroups, we'd have to know what and where to modify. So yeah, I agree with that. I'll ping Rachel on the epic to discuss an official handover, or whatever we need to be doing. So yeah, sounds good.
A
Awesome. So thank you, Igor and Kalliope, for keeping the notes in the agenda up to date; I appreciate that, because I'm terrible at doing that. So yeah, I think we have a clear action plan for now: just continue discussing async, especially on priorities. Is there anything else that's not clear for anyone?