From YouTube: 2022-01-26 GitLab.com k8s migration EMEA/AMER
B
Okay, last few days of work this week, and I want to see today's demo.
C
The problem is, I'm using Jaeger tracing to visualize the traces, and it doesn't really offer any way to analyze a bunch of traces and deduce anything from them. Should I start?
B
A
C
So I tried tracing a few pipelines, for example. This is, I think, two days, yeah, two days, so you can see patterns within one pipeline.
C
Normal things, like jobs that are not blocking, stuff like that, can be easily seen, and you can drill down further, so the jobs of the pipeline that is triggered in dev. But I couldn't figure out how to do any sort of aggregate analysis on multiple traces and see which ones are commonly taking a lot of time, stuff like that. I don't think Jaeger provides any such functionality; other tools like Lightstep and Honeycomb do provide it.
A
I think in Stackdriver we can also do this kind of aggregation of traces, right? I think we are even using this for some of the Go projects, so maybe, instead of sending traces to Jaeger, we could also configure them to be sent to Stackdriver, if we have better tooling for analyzing traces there. I haven't looked into Jaeger in a long time, and not too deep into Stackdriver either, but the idea of OpenTelemetry is that it's more or less vendor agnostic, right?
C
A
Yeah, but I really like this approach, because I always wondered: do we have built-in metrics in GitLab to just see when different pipeline jobs start and end, to get exactly that out of our pipelines? And I guess collecting these metrics would not really work, because they would have a very high cardinality, I think, depending on which pipeline you run them on and so on. But exactly for these cases, maybe tracing is the right solution.
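The cardinality concern can be made concrete with some back-of-the-envelope arithmetic; all numbers below are invented for illustration, not taken from the meeting:

```python
# Rough illustration (hypothetical numbers) of why per-pipeline metric
# labels blow up: in a system like Prometheus, each distinct label
# combination becomes its own time series.
distinct_job_names = 200    # assumed number of job names across projects
pipelines_per_day = 5_000   # assumed pipeline volume
statuses = 4                # e.g. success / failed / canceled / skipped

# Labeling by job name and status alone stays manageable...
series_without_pipeline_label = distinct_job_names * statuses
# ...but adding a pipeline-id label multiplies that by pipeline volume.
series_with_pipeline_label = series_without_pipeline_label * pipelines_per_day

print(series_without_pipeline_label)  # 800
print(series_with_pipeline_label)     # 4000000
```

A trace, by contrast, keeps the per-pipeline detail in individual spans rather than in long-lived metric series, which is why it fits this case better.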
C
There is the problem that we sometimes rename our jobs. So, for example, recently we removed the auto-deploy prefix, so it's like every job had an auto-deploy prefix. I don't know if a job from October might show it, yeah, see, auto-deploy. So I don't know if this kind of renaming of jobs would affect the aggregate analysis, but we can try.
A
C
No, unfortunately, I've not seen anything that is actionable, sort of.
F
D
This would fit with the sort of stuff Andrew mentioned on the platform direction, right? Perhaps this is a kind of theme we want to pull out, which is proper visibility or observability into deployment pipelines.
C
F
Yeah, I guess we'd probably first need to talk to product teams about building that kind of stuff into the product.
A
What I think starts to get interesting are also points like in our runners: when they start pulling images, how long it takes to pull an image, when it gets started up and so on, because this is somewhere we can't really look into, since we can't script it in any way. This is happening in the application, and I don't know if we have that, but we could log these numbers to see if there is a trend of increasing pull times for images, and things like that.
B
So when you look at the job, the CI output of the job, you have foldable sections, right? Some of them are generated by the runner itself, and those cover things like download time and preparation time, in terms of whether the image is ready, and things like that.
B
Yeah, sure, I'm just saying that if the data is there, we can try to get it out, even if it's just a one-off query, or if we are going to build an API for getting this out.
C
So the thing is, the way it's output into the job logs, you can parse it and extract the timestamps, so I had done that for one of them, but I think I've not...
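Parsing those runner-generated section markers out of a raw job log is straightforward; a minimal sketch in Python, where the sample log text is fabricated for illustration (the `section_start:<unix-ts>:<name>` marker format is the one GitLab's collapsible log sections use):

```python
import re

# GitLab CI collapsible-section markers look like
#   section_start:1643184000:prepare_executor\r\e[0K
#   section_end:1643184012:prepare_executor\r\e[0K
SECTION_RE = re.compile(r"section_(start|end):(\d+):([\w.-]+)")

def section_durations(raw_log: str) -> dict:
    """Map section name -> duration in seconds, from start/end markers."""
    starts, durations = {}, {}
    for kind, ts, name in SECTION_RE.findall(raw_log):
        if kind == "start":
            starts[name] = int(ts)
        elif name in starts:
            durations[name] = int(ts) - starts.pop(name)
    return durations

# Fabricated log snippet for demonstration:
log = (
    "section_start:1643184000:prepare_executor\r\x1b[0K...\n"
    "section_end:1643184012:prepare_executor\r\x1b[0K\n"
    "section_start:1643184012:step_script\r\x1b[0K...\n"
    "section_end:1643184100:step_script\r\x1b[0K\n"
)
print(section_durations(log))  # {'prepare_executor': 12, 'step_script': 88}
```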
C
So you can do this for any pipeline. I've done it for a pipeline, and drilled down into the packager pipeline, so, like.
A
D
Interesting, because yeah, that might be an interesting kind of addition: if we had had that on packaging for the last six months, we would possibly have seen it coming, right, yeah.
C
The packager pipelines, I think, don't change much, so those might be easier to analyze.
D
True, they tend to add jobs, not rename them and stuff, yeah. What do you want to do as the next step, Ruben? Do you think you have enough to write an issue about this approach, or is it something which, sort of, I guess, like... we will certainly be discussing platform metrics in the next few weeks.
D
C
Yeah, there are multiple approaches to do this. For example, I think Robert had done something with pandas recently, this month.
A
C
Sort of getting mean durations of the packager pipeline jobs, so I don't know if that will be more effective, and I think the Verify team is also working on, or thinking about, doing this kind of thing natively in the product.
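Robert's pandas analysis itself isn't shown in the meeting, but that kind of aggregation might look roughly like this; the job names and durations below are made up for illustration, with records shaped like what the GitLab jobs API returns:

```python
import pandas as pd

# Hypothetical job records (name, duration in seconds); real data would
# come from the jobs API rather than being hard-coded.
jobs = pd.DataFrame([
    {"name": "build-package", "duration": 620.0},
    {"name": "build-package", "duration": 700.0},
    {"name": "run-qa",        "duration": 1800.0},
    {"name": "run-qa",        "duration": 2000.0},
])

# Mean duration per job name, slowest first.
mean_durations = (
    jobs.groupby("name")["duration"].mean().sort_values(ascending=False)
)
print(mean_durations)
```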
D
C
Yeah, I can record it in an issue: yeah, yeah, awesome, okay.
C
I think pipelines have increased by like 20 minutes in the...
C
E
And this is going to drastically change when we complete the next stage of the staging-canary pipeline work that Graham and Meyer are working on too. But.
D
C
Jaeger is simply a UI to see your traces, so it depends on you collecting traces and sending them to Jaeger. If you're doing it in real time, it'll be visible in real time, but to do it in real time you'd probably have to integrate with the product, I guess, because what I'm doing now is simply using the jobs and pipelines API to extract the start and end times and then pushing that to Jaeger, yeah.
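The jobs-and-pipelines-API approach described here can be sketched roughly as follows; the base URL, project and pipeline IDs, and token are placeholders, and the field names are assumptions based on the public GitLab jobs API:

```python
import requests

def fetch_pipeline_jobs(base_url, project_id, pipeline_id, token):
    """Fetch raw job records for one pipeline from the GitLab API."""
    resp = requests.get(
        f"{base_url}/api/v4/projects/{project_id}"
        f"/pipelines/{pipeline_id}/jobs",
        headers={"PRIVATE-TOKEN": token},
        params={"per_page": 100},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def job_timings(jobs):
    """Reduce raw job records to (name, start, end) tuples that could be
    turned into spans and pushed to a tracing backend such as Jaeger."""
    return [
        {"name": j["name"], "start": j["started_at"], "end": j["finished_at"]}
        for j in jobs
        if j.get("started_at") and j.get("finished_at")
    ]
```

Skipping jobs without both timestamps mirrors the partial-trace problem mentioned below: still-running jobs have no finish time yet.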
F
One of the challenges that you run into there is that, because you have a hierarchy of these spans, the root span has started but it hasn't finished yet, right, and so you basically don't have that root span yet. So you can sort of find these sub-spans, potentially, but the visualization is kind of partial, and yeah, when I've tried to do that kind of thing in the past, it wasn't super usable for long-running, still-ongoing jobs, for this reason.
F
Yeah, I snuck one in there: I figured out one of the issues that we were seeing with Redis replicas, where we've got a VM, a set of VMs, and we want to add some pods as replicas to those VMs.
F
So I already have some VMs running, and if I ask this vm0 what its role is, it is the primary, and it has vm-1 as a replica. And so I now want to add... in this case it's only got one replica instead of two; that doesn't really matter much for the purposes of this demo. So I'm going to use Helm to set up a Redis cluster now. This we've already seen in the past.
F
I think Skavik already demoed this, maybe a couple of weeks ago, two weeks ago. So this is going to go ahead and install the Helm release. Okay, so we've got our first pod coming up, and this is still a little bit messy, in that it goes through some weird kind of reboot cycle.
F
So it's kind of waiting for DNS to become up to date, and only once DNS is up to date will the Sentinel process be able to actually discover this Redis process. So that's kind of why we have this failure here on the Sentinel container within this pod, and this is also why, well, I guess, we see ready one of two. So it takes a while for this to reattempt and reconnect, and I think it now managed to properly do it.
F
So, yes, we've got ready two out of two, and so it goes on to the next pod. So what we should already be able to do... and yeah, you can see the first pod is going through a crash loop, so it's sort of a similar issue there. But now that we already have one of these pods, I'm going to kind of fast-forward and set this node 0 to be a replica of our VM cluster.
F
And so I'll go ahead and do that, and if I now go ahead and ask the VM what its role is: it's still the primary, but now it's got two replicas, right? So we've got the VM, and now we've got node 0 as the other replica, and it's a hostname, not an IP address. So this is the novelty; this is the thing that we fixed, which had been plaguing us for like two weeks.
E
One of the items I had a question on is... I couldn't tell; I remember doing the research, but I forget what the result was. The ID of the replica changes any time a new pod comes back online. During one of our testing scenarios, we were adding one pod at a time to our infrastructure, and because of that configuration change, older pods get recycled, so they get regenerated with a new replica ID, and because of that, Sentinel thought the pod was down when it was in fact up and participating, at least within Redis.
E
F
I don't know yet; I haven't seen that resurface, but I think it's a scenario that I should try and recreate as well. Yeah.
E
Okay, I guess the other question I had is: we had spun up two issues. One was to investigate creating a deployment where it kind of joins the cluster automatically, and for this we were able to successfully get it rolling. I don't have a quick way to demo that, so I can't really showcase it, but I did at least prove inside the issue that it's possible.
E
Okay, Akbad, you've been working on sshd in between your release management. Do you want to talk a little about the blocker?
G
There is a blocker, I think, as reported by Sean. It's something related to ssh, sorry, the RSA keys are not readable, I think, by sshd, if I remember correctly, and this is blocking us from proceeding to production. And one thing from our side is also the observability: I think the MR is now more or less ready. I will just work on the comment from Scarbeck, and we can merge it later, and I think it would be ready for us, but it's blocked from the other side, from the Source Code team.
G
E
G
E
Okay, that's fine! I just want to make sure, you know, before we push this into production, that that readiness review has been completed and been reviewed by other members of our infrastructure team, and I know performance testing was on that. So we're blocked from moving to production, but stuff like that shouldn't be blocked at this moment in time.
E
Okay, the last thing on the agenda: Alessio, or rather, Amy tasked Alessio with trying to figure out what's going to be blocked when the dev instance gets migrated over to GCP during that maintenance procedure.
E
So if anyone else has any thoughts, I just kind of want to bring this thread to everyone else's attention, just in case. Amy, I see you've got a question that you may want to verbalize.
A
I think it's still hard to tell. I mean, we were just able today to fix the most urgent Chef issues to get a Chef run through, but now the devil is in the details, right? We still need to get everything working as we need it, and it's really hard to say a concrete date. I mean, I would hope that, with the help of Baloo, we should now be able to work on this node.
A
Also, getting the configuration working for GitLab itself, seeing what you need to fix and add via Chef, and then testing the data migration: this all still needs some time, and I'm not sure if we would be able to manage it next week. I mean, we could try to, and I hope to, but it's really hard to tell.
D
Let's try to, like... let's try and get this done as soon as we can, because it's causing so many pain points, definitely. If that means we need to ask for some help, like if we need Baloo to help with things, or whatever that is, then let's ask. I would definitely love to have this completed, like the migration completely, and do the clean-up later, but I would love to have this migrated ASAP, like this week or early next week.
A
Yeah, so I will spin up a few issues for what's still missing that became apparent today, and I think we can in parallel also start already with testing the data move, because we have a working machine to copy data to right now, and I'll see that we get GitLab up and running as it needs to be usable.
B
I was going to add something to this, which is: for any blockers that we have, in terms of things that we can't do during a migration like this, I think we should spin up a new issue and fix that issue, because it's really hard to believe that, basically, if dev goes down, we can shut down the business. I mean, I can understand that we can't deploy new stuff, fine, but I can't really understand not being able to roll back, not being able to scale anything.
B
This is not acceptable, in my view, so this is something that we have to fix sooner or later, and I say sooner, because it's really unbelievable.
D
Yeah, I think we can figure that stuff out following the migration, for sure, because what I want to make sure we've got a few days' notice of is scheduling this in, so that release managers can plan around it. Like, you know, we will certainly not want to be trying to do everything on the same day; there will be some downtime, so let's get an estimate in, but yeah, Henry, to your point, like.
D
A
We can start the data copy testing, and that will also give us a time estimate for how long the downtime needs to be, and if we confirm that, then we also know how big the impact of the migration will be, right? And then it's easier to schedule a point where we want to do this: if it's very long, then we would maybe consider doing it on a weekend or at low-traffic times, but if it's just an hour, maybe we could do it at just about any time.