Description
Enabling the project export queue in production https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1736
B
Posted the previous week's goals in today's agenda, so we could start prioritizing goals. On the disk utilization metric fix: we pinged Ben, but I think we're going to take a look at this ourselves, which is perfectly fine; I just haven't had a chance to do that this week. So hopefully, if we are able to deploy successfully today, maybe that's something you can look at.
D
I don't really have much to cover here. It looks like we're in pretty good shape with logging. There's one outstanding issue, or one known outstanding issue, that I did uncover today while doing some additional failure testing: it looks like some log messages aren't being properly indexed by Elasticsearch. I'm doing a chart update to log 400 error reasons, so I'm working on that now to dig into what the problem is, but I added the issue to the epic, and that's the only outstanding issue we have right now, so otherwise logging is good.
D
B
So the last item on this list was determining why rollback did not occur after we had timed out a deployment. Turns out we had just missed an option in our helmfile. Configuration had been migrated from our Make-controlled set of helm command-line options into helmfile, which uses a configuration item for this. So this was actually a relatively easy fix, and it was done relatively quickly.
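A minimal sketch of the kind of helmfile setting that can get lost in such a migration; the exact option and file layout aren't named in the discussion, so the keys below are assumptions:

```yaml
# helmfile.yaml -- hypothetical sketch; the real repository layout may differ.
helmDefaults:
  wait: true      # block until resources report Ready
  timeout: 600    # seconds to wait before the release is considered failed
  atomic: true    # on failure or timeout, roll the release back automatically
```

With the flag-based invocation this behavior came from options like `helm upgrade --atomic --timeout`, which are easy to drop when moving to helmfile's configuration-driven setup.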
A
B
So I will share my screen. I'm gonna share the pipeline; I'll bump the font size for everyone's viewing pleasure. So I ran into an issue yesterday, where I also found two other secrets that were not populated in our GKMS vault, but these are already being populated inside of our infrastructure. So I ran a dry run earlier today with the corrections in place.
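For reference, a dry run of this kind of change could look roughly like the following; the environment name, release name, and chart path are placeholders, not the actual pipeline commands:

```sh
# Render and diff the release against the cluster without applying it
# (helmfile's diff subcommand requires the helm-diff plugin).
helmfile -e gstg diff

# Or, with plain helm, render the upgrade without committing anything:
helm upgrade --install --dry-run sidekiq ./chart -f values.yaml
```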
B
C
B
A
E
B
D
Yeah, I mean, I think what I was looking for was the reason for the failure on the Elasticsearch side, but Igor mentioned that that might be difficult to track down. So maybe time would be better spent increasing our logging on the client side, and you can either set the log level to debug, but that's really noisy, or the plug-in has a special option.
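The special option here is plausibly the fluentd Elasticsearch output plugin's 400-reason flag, which would line up with the chart update mentioned earlier; a minimal sketch, assuming fluent-plugin-elasticsearch, with the match pattern and host as placeholders:

```
<match gitlab.**>
  @type elasticsearch
  host elastic.example.internal   # placeholder
  port 9200
  # Logs Elasticsearch's stated reason for 400 Bad Request responses
  # without raising the whole client to the very noisy debug level.
  log_es_400_reason true
</match>
```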
D
B
B
Yeah, there's one remaining item that I recall, related to Bootsnap; that's going to improve the boot of the container. Okay, we're now Ready on this pod. The one thing I'm noticing, and I didn't really notice this until trying to push this into production, is that it takes a really long time for the Sidekiq pod to switch between Running and Ready. I think after this call I will spin up an issue to investigate why that is occurring.
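A sketch of how that investigation might start; the namespace and pod name are hypothetical:

```sh
# Inspect the readiness probe configuration and recent events for the pod.
kubectl -n gitlab describe pod <sidekiq-pod>

# Recent events, oldest first, filtered to readiness probe activity.
kubectl -n gitlab get events --sort-by=.lastTimestamp | grep -i readiness
```

Comparing the probe's initialDelaySeconds and periodSeconds against the observed boot time shows whether the gap is a slow application boot or an overly conservative probe.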
B
B
So we have our deployment; it's running version 12.8.5. We have our one pod running, and we're getting our memory usage; CPU usage, quota, and stuff will probably start populating after a little bit more time has passed. So we now have metrics, which is good. So let's look for our logs. Who's pinging me?
B
E
B
D
E
B
B
But it's looking for a queue called null; it's gonna look for work from a queue called null, which does not exist. Okay, this is simply here to make sure, one, the deployment worked and the configurations are sane, since the pods started up. So we know the configurations are sane, and we got past the issue that we found yesterday. So I think, really...
B
B
F
B
B
B
Let me make sure that image exists...
B
313835 this morning; six, seven, two, eight, one... seven, three, eight. Okay, so that image does exist in our registry. So this merge request, Jarv, it's number 164. This will do two things: it updates our staging image, that way it's just a sanity check that the image works in staging, and it will also update the image in production while simultaneously changing to the project export queue, allowing us to pull from project export.
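In the chart's values, that queue switch would look roughly like this; a sketch assuming the GitLab chart's Sidekiq pod definitions, not the actual contents of the merge request:

```yaml
gitlab:
  sidekiq:
    pods:
      - name: export
        # Previously pointed at a placeholder queue ("null") used only to
        # prove the deployment and configuration were sane; now pulls real work.
        queues: project_export
```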
D
B
B
D
A
Okay, just to be clear here: while what I do want to see is moving forward, I don't want us to move forward if neither of you is feeling confident that we should be, right? Like, if you need time to take a look at everything, go for it. I'm just wondering what kind of expectations we have with this double checking, right?
A
B
Double checking is going only as far as making sure the configurations look sane inside of the pod. Like, I want to be able to compare the configuration built inside of the pod to a Sidekiq node running in production. Beyond that, I'm confident in moving forward with the plan of action as is. I just want a bit of time to compare configuration files to make sure there's not something egregious that needs to be looked at or investigated.
B
B
E
C
A
B
Jarv, while I'm looking at configuration files: I updated that merge request with the ops pipeline. Do you want to just run through that with the people that are on this call, to show off the expected changes that we expect to see between the staging and production environments, and nothing in pre and canary?
D
Here's the ops pipeline. If we go to the pipeline, what we're expecting to see here are some changes for production and...
D
D
B
F
D
D
D
D
D
B
A
A
D
C
D
B
F
B
F
B
A
B
D
It'll be a brand new pod; we don't have any pods in production, so we didn't have to wait to cycle through all of them, since we have, like, no tests. So this is the first, and probably the only, deployment that's going to be this nice for production, because we're probably going to scale up. I don't know, we may not scale up that much, though. We'll see. We, we...
B
We have an issue to investigate what the deploys look like when you have more than one pod, because we may need to tune the timeout that helm waits for before it decides to roll back. These pods are taking a long time; like, it's been three minutes and the pod is yet to enter the Ready state. And if it's going to be like that for one pod, we're gonna run into a large issue when it comes to deploying in general.
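Tuning that window is the same `timeout` knob sketched in the helmfile snippet earlier, sized roughly as the number of pods times the observed startup time, plus margin. As a plain-helm sketch with placeholder names and value:

```sh
# Raise helm's wait window so a slow-starting fleet isn't rolled back
# mid-deploy. (helm 3 takes a duration string; helm 2 took plain seconds.)
helm upgrade sidekiq ./chart --wait --timeout 20m
```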
B
A
So, yeah, that means, in order of priority: we will keep this running, even if it's only with one pod, to collect data on how we are behaving with real-world workloads; and then, at the same time, increase the pressure on Distribution to actually work on speeding this up; and at the same time, for us, ensure that we don't deploy manually but with auto-deploy as well. There we go. If...
A
B
Create an issue; I just want to make sure it gets noted down, since I don't have the agenda on my screen. So the job succeeded, so we should now start being able to pull work. So I'm gonna pull up the logs, and because I can't auto-refresh, I'll just hit the button manually over and over and over again.
C
D
D
D
D
A
B
This is an alert that we have to ensure that we're not... wait, for clarity: did you get an alert or a page? A page? A page. So we probably have it as a page because, if we take, say, the GitLab registry, which is taking all the traffic: if we scaled up to its maximum pod count, something is probably wrong. So the fact that it exists is a valid alert/page, but we don't need to be paging for this, because we're in an experimental phase at this moment in time.
B
So we should silence that for quite a while, and we should probably revisit how those rules operate in situations such as this, because I think we were considering, at one point in time, maxing out the HPA to start with, because the time it takes for this pod to start was so lengthy at one point in time. So...
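Silencing could be done from the Alertmanager UI or with amtool; a sketch, with the alert name, matcher, and duration all placeholders:

```sh
# Silence the "HPA at max replicas" page while this service is experimental.
amtool silence add alertname="HPAMaxedOut" env="gprd" \
  --comment="Experimental Sidekiq k8s deployment; HPA max is expected" \
  --duration="4w"
```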
B
D
D
All of the project exports... and this will, in turn... there's a grace period of, like, 20 seconds or so to wait for the exports that are in progress to finish, but I don't really see much in the logs anyway, so I think it's going to be fine, and then the next job should hopefully be picked up by the pod. I think it's...