From YouTube: 2020-05-08 Sidekiq elasticsearch shard migration
Description
https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/237 , part of https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/112
A
All right, so the first thing we're going to discuss here is the Elasticsearch migration. I just wanted to bring up the conversation about the dedicated node pools. Sounds like we've agreed to put it on the default node pool for now, given that there isn't a better choice. I guess the issue for that is linked to the epic, if you want to comment on it. Does anyone have any additional thoughts on this, or not?
A
It's super easy. You know, it's a bit of a pain to decommission a pool, but as far as migrating to a new pool, that's super simple. And I think what we'll eventually do is move registry to a dedicated pool and then have this default pool as just the generic one we use for things that don't have a better place.
A
But that's cool. Next up is the Consul agent. So after Elasticsearch we're probably gonna look at GKE for WebSockets. The first thing we need to do is get the Consul agent installed in the cluster. I think we're just going to use the upstream chart for this; we're going to install it as a daemon set.

A
Unfortunately, they don't have an official, well, they have a chart, but it isn't published, so the way that they recommend installing it is to clone it and then install it locally. So we have some options here. One is that we mirror it to GitLab and then publish it on charts like David did. Or we could mirror it and add a sub-module to the project that's installing it, or we could just add the GitHub repo as a sub-module to the project that's installing it.
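For reference, a minimal sketch of the clone-and-install-locally flow described above, assuming the upstream hashicorp/consul-helm repository on GitHub and illustrative release and namespace names (none of these are confirmed in the discussion):

    # Clone the unpublished upstream chart
    git clone https://github.com/hashicorp/consul-helm.git
    # Install from the local checkout; the client agents run as a DaemonSet,
    # with the server component disabled because the Consul servers live elsewhere
    helm install consul ./consul-helm --namespace consul \
      --set server.enabled=false --set client.enabled=true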
C
Just, I mean, while we're suggesting those ideas, what about a pod that actually runs in Kubernetes that has that as the agent, and they make the call against that? The other thing is, it's not configured as, like, the system resolver, right? Like, the application has an IP address and a port in it, which means...
A
Why not? Yes, well, no, this is exactly what we're gonna do. We're gonna have the Consul agents running in a single pod as a daemon set, right, and then the problem is that we just need to install the chart, and the chart is managed upstream by HashiCorp Consul, and they don't publish it. They just have, like, a git repository, so we either need to mirror it or publish it ourselves. I just don't know what to do.
C
The question is whether this is, like, a one-off. If it's a one-off, it doesn't even really matter; it's not worth the discussion. But if this is a pattern, then it's worth considering, right? So how many other things do you think are there that are like this, that are going to follow the same pattern? Maybe, if you ask me, not that many, but, like, a handful, three or four, maybe.
A
I don't know. So I think doing it the sub-module way is the easiest, obviously, right? And having a sub-module that references GitHub is actually even the easiest thing, but I don't know if we want to have a dependency on GitHub for our stuff. So if we don't do that, that means we need to create a pull mirror and reference that.
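To make the two sub-module options concrete, a sketch with hypothetical paths (the mirror location and the vendor directory are illustrative, not decided here):

    # Option 1: sub-module referencing GitHub directly
    git submodule add https://github.com/hashicorp/consul-helm.git vendor/consul-helm
    # Option 2: sub-module referencing a GitLab pull mirror of the same repo
    git submodule add https://gitlab.com/our-group/consul-helm.git vendor/consul-helm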
B
I think we need the pull mirror in any case, you know. I mean, we remove a dependency, right, so one fewer thing that can go wrong. Let's say we need ops to deploy, right? So I would love to have the mirror on ops, because I don't want to be in a situation where everything is on fire, GitHub is down, and I can't deploy.
D
The only thing I've got is, I know we've been looking at the Elastic migration, or at least investigating it. I don't know how close we are to executing it, but I would like to stress that we really need to remove the allow-failure on the Kubernetes deployments in the deployer. We're still getting failures quite often; like, we had one today and I don't know why it failed, but it at least went through the entire pipeline before it got to the point where the job itself inside of Kubernetes failed.
D
See, I don't know how much you've been following along, but since we've had project export inside of Kubernetes, we've shifted over to the work that Scalability has been doing, and we have migrated over to shards now. So the memory-bound shard contains project export and some other random queue called new design version, something along those lines.
A
We have a bug in that sense. The GKE logs go to Elasticsearch, but we've seen a lot of errors because of the number of fields. And honestly, with Elasticsearch, you have to be very careful when you send logs to it, because if you have too many fields, it's not happy; if the data types change, it's not happy.
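The field-count problem being described maps to Elasticsearch's per-index mapping limit, which defaults to 1000 fields and can be raised per index. A minimal sketch, with a hypothetical index name and value:

    # Raise the total-fields cap on one index (name and limit are illustrative)
    curl -XPUT 'http://localhost:9200/gke-logs/_settings' \
      -H 'Content-Type: application/json' \
      -d '{"index.mapping.total_fields.limit": 2000}'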
B
There's no such thing as a post-deployment migration in Ruby on Rails; it's just something that we build on top of that, and so, schema-wise, either you run the migration or you don't. So when you do things locally and you have to prepare your merge request, you also have to run the post-deployment migrations, and this updates it, because...
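In GitLab's codebase this is controlled with an environment variable: post-deployment migrations run by default and are held back only when the variable is set. A sketch of the local workflow being described:

    # Run schema migrations while holding back post-deployment migrations
    SKIP_POST_DEPLOYMENT_MIGRATIONS=true bundle exec rake db:migrate
    # Then run everything, post-deployment migrations included
    bundle exec rake db:migrate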
B
It's just in the database; it's a schema version.
D
Lately I've been seeing a lot of SSL failures inside of our ops pipelines. I don't know where in our process that SSL failure is actually occurring, like whether it's reaching out to GitLab or whether it's reaching out to GitHub or whatever. I don't know yet, because I haven't delved deeply into it. But this is just a new one.
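One quick way to narrow this down would be to probe the TLS handshake against each endpoint the pipeline reaches out to; a sketch (the endpoints here are assumptions, not taken from the discussion):

    # Check the TLS handshake to each upstream the jobs talk to
    openssl s_client -connect gitlab.com:443 -servername gitlab.com </dev/null
    openssl s_client -connect github.com:443 -servername github.com </dev/null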
D
I've been capturing a list; so far I've got three items in issue 806, which is tied to the epic. I think you and I need to work together to figure out how to prioritize this particular issue, or epic, against all the other work that we've got going on. I just haven't had the time to pick up this topic at all.
A
Yeah, so normally, like 90 percent of the time, that happens because we roll stages, but if there's an omnibus change without a rails change, we need to tag CNG as well. So that's the fix for that. The fix for this I'm not really sure about; other than disabling the check, I'm not sure what else we can do there.
D
Can I suggest that, for this particular issue, we open up an issue with the charts for the distribution? I think proposing an option to skip this check would be wise, and I think it'll be met with good criticism, because you don't want this shipped to customers, but I think the ability for us to do this will be important, unless there's some other way.
D
Consider that there is a way inside of Kubernetes: you can run a kubectl command that will output something that gets stored in the event log when a failure occurs. The problem, though, is that this upgrade happens inside of Helm, so I don't know how to integrate a kubectl command that will look at a failed deploy and look back at the correct deployment to get that information, and I also don't know if we store the correct information in those logs.
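The event log mentioned here is queryable with kubectl; a sketch, assuming the release runs in a namespace named gitlab:

    # Recent warning events in the namespace, newest last
    kubectl get events --namespace gitlab \
      --field-selector type=Warning --sort-by='.lastTimestamp'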
D
So this would be something we would have to investigate, to see what we could do and whether we're storing the right data to get that information, so that we don't have to look at logs. But I'll create an issue to address that, because I'm curious about that as well. I'll add that to this exact topic; actually, that's the perfect place to put it. Okay.