2021-05-06 GitLab.com k8s migration EMEA
B
Hello, so let us start. Is there anything that we would like to demo this week?
E
We've played with a couple of incantations which allowed us to change the behavior of how service discovery communicates with the endpoint that's necessary. So we went from using the default method of just specifying "hey, go reach this service", where Kubernetes routes you appropriately and you get sent to a pod. We then changed that to a headless service, where the attempt was allowing Kubernetes to decide which pods you go to, depending on their state.
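A minimal sketch of the shape of that change, assuming the service in question is Consul's DNS endpoint; the names, namespace, and selector here are illustrative, not taken from the meeting. Setting clusterIP to None makes the Service headless, so DNS resolution returns the pod IPs directly instead of a single load-balanced virtual IP:

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: consul-dns        # hypothetical name
  namespace: consul
spec:
  clusterIP: None         # headless: no kube-proxy virtual IP
  selector:
    app: consul
  ports:
    - name: dns-udp
      port: 8600
      protocol: UDP
    - name: dns-tcp       # the same port on two protocols is the pattern
      port: 8600          # that tripped up the older Kubernetes version
      protocol: TCP       # mentioned later in this meeting
EOF
```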
E
The first time we rolled this out, we ended up barfing a lot of errors in Kubernetes, because we are using an older version of Kubernetes and there's just an incompatibility with this version. Which is fine, and we fixed that, but later yesterday Graham discovered that this was not actually doing what we intended. So these error messages we see here are the same stuff we've been seeing, that we've been trying to eliminate; in this case it's just pointing to the IP address of that node and going to that port.
E
So now we are talking to port 8600 on that host, versus going out to a Kubernetes service to make this traffic exchange happen. As you can see from the logs, the last time this happened was at 3:10 AM UTC, so this change has been rolled out for quite a while, and we haven't seen this in staging. We haven't seen the error since, so this is fantastic.
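For concreteness: port 8600 is Consul's conventional DNS port, so "talking to port 8600 on that host" amounts to a direct DNS query against the node. A hedged sketch (the node IP and record name are placeholders, not values from the meeting):

```shell
# Query Consul DNS directly on the node, bypassing the Kubernetes Service.
dig +short @10.224.0.15 -p 8600 db-replica.service.consul
```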
E
If you hear construction noise behind me, my apologies. But you can see that there is a drop-off when Graham rolled this change out, and right now would be an opportune time where we should be seeing a lot of these failures come through. So I think the work that Graham has resolved here is promising. Related to our discussion items, I think what I'm going to work on today is creating the necessary change request to roll this out into production.
B
Yeah, I think we really found a solution to a very long-standing issue with that. I mean, that was really great and hard to research. I think we learned a lot by this, right? I mean, we learned about this bug in our old Kubernetes version, where specifying this port for two different protocols isn't working, and then we learned a lot about this way that we can use Helm to patch our Kubernetes installations if we don't have support for that in the charts. Which is not very nice, but still a workaround to get us forward in a lot of cases, right? And also learning a lot about how node ports, host ports and all of this works. So I think that's cool.
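The exact patching mechanism isn't specified in the meeting, so this is an assumption: one common way to adjust a Helm-managed installation when the chart exposes no knob for a setting is Helm 3's --post-renderer flag, which pipes the fully rendered manifests through a local program (often kustomize) before they reach the cluster:

```shell
# Hypothetical post-renderer wrapper: read rendered manifests on stdin,
# apply local kustomize patches, write the patched manifests to stdout.
# Assumes a kustomization.yaml in the current directory that uses
# base/all.yaml as its resource.
cat > post-render.sh <<'EOF'
#!/usr/bin/env bash
cat > base/all.yaml        # capture the chart output as the kustomize base
exec kustomize build .     # emit the patched result
EOF
chmod +x post-render.sh

# Upgrade as usual, with the workaround applied on the way through.
helm upgrade gitlab gitlab/gitlab -f values.yaml --post-renderer ./post-render.sh
```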
E
Yeah, I think there's a documentation situation here, because I could have sworn that this latest change was going to be the solution, but then it didn't change the behavior at all. So I was concerned about that, and then discovering that we used the wrong port number confused me, though it makes sense as you read it. It's just mind-boggling what we had to go through to get to this point. So yesterday I opened a bunch of issues.
C
So it would still be interesting, I think, to know... well, we won't know, but it might be interesting to know how problematic it is. Once we do get this fixed, I'll open a retro issue, because yeah, I think we learned a lot, but also I think there were some things that we maybe want to do differently next time, or learn from, or avoid, things like that.
C
But overall it's an interesting one, because hopefully this solution is the fix. And I think one of the reasons Graham was pushing on this was partly because he hoped it was a fix, but also because it was just a simpler architecture, so it was kind of a win, and hopefully a win-win. But that's a good one for thinking about in the future: how do we get logging, and how do we simplify things when we see odd stuff?
C
I'm
wondering
today
henry
how
much
time
have
you
got?
Maybe
you
can
actually
help
could
be
great
to
get
this
stuff
into
production
today,
if
we
can,
because
graham
will
be
coming
online
tonight,
well
daytime
for
him,
but
it's
his
friday,
so
we're
rapidly
getting
to
the
end
of
the
week
so
be
great.
If
we
could
get
this
stuff
through
to
production
today,
and
then
that
gives
graham
kind
of
all
of
his
data
keep
an
eye
on
it
as
well.
B
Yeah, I just reviewed that and approved it for this meeting, so I can just go and also work on the CR, Skarbek, to get this into production. I mean, it looks like we can just do it, right? This is a one-line change and it should be deployed without interference to anything, so that should be fine to do.
F
So maybe it will take less than five hours, because the one that failed, the one that took five hours, was two indexes on the same table, and this one is only one. So we don't know; maybe I need three hours, yeah. Yeah, I'm completely blind on this; we have no idea what is happening, and wow.
F
I have a question on those things. I didn't follow the problem closely, but we were trying to troubleshoot this last week, and I don't understand why we were able to get the DNS information from Consul when we were testing it in the shell, in the Rails console. So where was the problem? Because we were able to reach Consul and get the information back.
E
Recreating
this
issue
is
horribly
difficult.
Grain
ended
up
spinning
up
a
a
simple
test
using
a
while
loop
and
just
sat
there
doing
the
same
request
over
and
over
again
hoping
to
find
the
issue,
and
eventually
he
did
so.
It
was
a
very
sporadic
problem,
which
is
what
made
this
troubleshooting
so
horribly
difficult.
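A minimal sketch of that kind of reproduction loop; the target record and Consul DNS address are assumptions, not values from the meeting:

```shell
# Repeat the same lookup until it fails, and log when it does.
while true; do
  if ! dig +short +time=1 +tries=1 @consul-dns.example.internal -p 8600 \
       db-replica.service.consul > /dev/null; then
    echo "$(date -u +%FT%TZ) lookup failed"
  fi
  sleep 0.2
done
```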
C
Did we test with the pods coming in and out? Because that was the thing Graham was talking to me about yesterday. The problem is that the pods start talking to Consul, and due to scaling, Consul gets scaled away onto a different pod, basically, or somewhere else, so it loses touch with it. So it's just the disconnect; by keeping them together, they scale together.
E
The effort that we were troubleshooting in last week's demo was simply trying to recreate the problem in general. We weren't trying to test targeting a specific pod in a specific scale event, because testing that would involve having to put everything inside of another while loop and just hoping that it would happen, because during a scale event you don't know which pods are going to get removed, and you don't know which nodes are going to get removed. So we may be on the wrong pod.
E
Yeah, and Graham mentioned in this issue somewhere that Sidekiq sees this a lot more often than our other services, either the API or the Git services. That's partially due to load, but we also have more Sidekiq pods running than we do Git service pods, so there's also just the volume of requests that are occurring that impacts this as well.
C
Nice. Well, hopefully... I'm really hoping this is the fix. It's a good sign that staging looks stable, because this would probably be visible during deployments, given it's to do with scaling. So that's hopefully a good sign.
C
How would you feel, Skarbek, about putting a small amount of traffic into canary?
E
Yeah, so I'll be creating the necessary change requests to do that work, and going through my merge request just to make sure it's up to snuff.
C
Do
you
want
to
push
on
with
that
and
assume
like,
so
I'm
I'm
kind
of
at
the
stage
with
this
service
discovery
that
this
fix
looks
pretty
good
on
staging
I'm
really
hoping
this
is
the
fix
for
production,
but
if
it's
not,
I
think
we're
going
to
have
to
push
force
that
logging
and
then
really
work
out
like
how
big
an
issue
this
is
and
whether
it's
actually
like
really
has
to
block
canary
or
not
because
it
sounded
when
I
spoke
to.
A
There are two things I want to highlight. The first one would be: it would be great if we can actually put some traffic now on the API on canary, because that will give us the before. It would be great to see the before on canary, because once Henry is able to then roll out the fix, we'll know the after, whether this was actually the thing that was fixing it, right? Because if we are going in right now on canary and rolling out the change Graham did...
A
Well, what's the amount of traffic? Okay, can we confirm that? As in, I don't want to cause an outage, but I want to put some traffic on this so we can see how much worse it makes it. Because if this has been a blocker... sorry, if this has been a problem, then I get to ask the question of why this was a blocker for the API migration, right? And what if the answer is, well, because the volume might actually push it over the line?
E
That sounds good. I'll create a change request. Well, I'll finish up my change request, and I'll add some extra details to capture the before and after of these charts and such. Awesome.
A
And then the second thing is: even if we enable the API today and tomorrow and everything is fine, can we make sure that on Friday afternoon we pull this back? Because I do not want a single thing possibly derailing the PG upgrade on Saturday; not a single thing from our side, at least, right? Cool, thanks.
C
Cool. And did the readiness review get approved? Was that last week, Scott?
C
One other thing that we can wrap up next week: Jarv mentioned as well that we should just test that the hot patcher is still working for the API. He is going to do the fire drill next Wednesday for hot patch anyway, so we can just wrap it up into that, I think, and we can keep an eye on that.
C
Nice. Andrew, is there anything you want to demo or discuss?
G
So I started taking a look at the ingress stuff yesterday, and then, just since yesterday to now, it's been CI problems, so I haven't looked into that as much as I'd hoped. But I posted a problem yesterday, before all the troubles began, and I haven't even really reviewed the answer yet. I just wanted to understand whether there's some difference between the way we monitor ingress in staging and in production, or if it's just the lack of traffic that's causing that difference.
G
I don't know. Apologies, Skarbek, for not actually reviewing your response, but it's still on the back burner from incidents.
G
Okay, yeah. So basically what we were seeing there is... I think last week I said that nginx has got really horrible metrics, but, I don't know why, the nginx ingress controller for Kubernetes on staging has got really nice metrics. So when we have nginx running in a VM, there's an endpoint called status, and it's got very, very terse, pretty useless metrics really, and I assumed that that's all we would get from the nginx ingress in production.
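For contrast, a sketch of the two kinds of metrics being compared; the paths and port are the conventional defaults for nginx's stub_status module and the ingress-nginx controller, not values confirmed in the meeting:

```shell
# The VM-style status (stub_status) endpoint: a handful of bare counters.
curl -s http://localhost/nginx_status
# Active connections: 291
# server accepts handled requests
#  16630948 16630948 31070465
# Reading: 6 Writing: 179 Waiting: 106

# The ingress-nginx controller's Prometheus endpoint: request counts,
# durations, and status codes per ingress.
curl -s http://localhost:10254/metrics | grep -E \
  'nginx_ingress_controller_(requests|request_duration_seconds)'
```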
G
It's kind of like that on gprod, but in staging we've got durations, we've got counts, we've got status codes, the whole lot. And if we can get that going in production, then we'll have some really nice at-the-edge monitoring there, which is exciting, but we'll need some more labels, so I will be sending out requests for more labels. Now, that's all I really have to update this week, unfortunately.
E
We need to figure out how to add those additional labels, because we forked the nginx ingress controller. We...
G
So
what
we
did,
I
I
showed
you,
the
the
saturation
for
the
node
pools
right.
So
I
think
we
did
that
last
week
or
may
maybe
it
was
after
last
week,
but
with
that
we
actually
worked
around.
So
what
I
realized
was
the
labels
that
we
would
need
on
the
node
pool
there.
They're
very
easy
for
us
to
add,
but
we'd
need
to
rebuild
all
the
pools,
and
so
I
just
felt
like.
G
Let's
rather
not
do
that,
and
so
we
we
had
a
little
workaround,
where
we
maintain
our
own
recording
rules
for
the
labels
for
those
node
pools,
and
when
we
rebuild
those
node
pools,
we
can
we
can
put
them
on
like
as
and
when
we
rebuild
them
and
get
rid
of
it
and
it
just
unblocked
us
without
having
to
you
know,
re,
rebuild
and
taint
and
rebuild
all
the
node
pools
that
we
have
already,
because
it
just
felt
like
that
was
too
much
work.
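A hedged sketch of such a recording rule; the pool name, node-name pattern, and rule name are assumptions, but the shape (re-deriving a node_pool label with label_replace instead of rebuilding the pools with real Kubernetes labels) matches the workaround described:

```shell
cat <<'EOF' > node-pool-labels.rules.yml
groups:
  - name: node-pool-labels
    rules:
      # Derive a node_pool label from the node name, so dashboards can
      # aggregate by pool without the pools being rebuilt.
      - record: node:node_pool:info
        expr: |
          label_replace(
            kube_node_info{node=~"gke-gstg-api-.*"},
            "node_pool", "api", "node", ".*"
          )
EOF
```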
G
So we could do something like that as well if we were blocked. Now that I'm working on this stuff on a daily basis, if we found that it was going to be, like, a two-month turnaround to get something into Cloud Native GitLab or whatever, then I would probably just do the same thing there and then leave it as technical debt for later.
G
Cool. But yeah, the ingress is cool. I don't know whether it's really worth it, but I noticed that a whole bunch of the Thanos stuff has got persistent volume claims, and it would be quite nice...
G
We
do
have,
we
do
have
some,
oh
I
I
know
something
that
I
can
show
as
well
sorry
this
this
this
isn't
very
exciting,
but
it
is
oh
I'm
having
a
terrible
time
sharing
on
my
computer,
but
there's
a
cube
that
we
have
a
cube
service.
Maybe
I'll
just
send
a
link.
We
have
a
cube
service
dashboard
in.
G
In this dashboard (I'm not going to share my screen because it just doesn't work anymore) you'll see, if you open up the saturation section, there's this kube persistent volume claim disk space, and the same for inodes, and those are pretty much the only things in there, as far as I can tell, that are Thanos. So we do have the alerting, and it should probably be fine if that alert goes off.
G
It'll
say
it's
the
cube
service,
we'll
go
anyone
see
it's
actually
thanos,
but
if
we
had
lots
of
different
persistent
volume
claims,
then
I
would
say
it
would
be
better
for
to
say
you
know
the
get
lee
service
or
the
redis
service,
rdb
or
whatever
has
got
a
problem.
But
if
it's
just
the
one,
then
you
know
it's
probably
not
worth
the
effort.
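A hedged sketch of the kind of saturation alert behind that panel; kubelet_volume_stats_* are the standard kubelet PVC metrics, while the rule name, threshold, and label are assumptions:

```shell
cat <<'EOF' > pvc-saturation.rules.yml
groups:
  - name: pvc-saturation
    rules:
      - alert: PersistentVolumeClaimSpaceLow
        expr: |
          kubelet_volume_stats_available_bytes
            / kubelet_volume_stats_capacity_bytes < 0.10
        for: 15m
        labels:
          type: kube   # routed to the kube service, as discussed above
EOF
```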
A
I would not assume that we'll have just one; maybe more will have it. But yeah...
G
If Google have an outage on their GKE clusters in some way, it might prove to be quite difficult for us to guess that that's happening at first, and so this will be quite nice, especially once we start putting SLOs on it, because we'll actually get an alert that, you know, the kube service API server is acting strangely. So that's quite a nice thing as well; that's on there too.
G
That's only for the logs. So if you scroll down to the API service and then you click on "Kubernetes cluster warning logs", it's actually quite a useful thing for people. So, what I was trying to do there...
G
I was looking for everything that's not debugging information, so that's where I'm searching for critical, error, and warning, and that's how I spotted that problem with the node port. It's gone away, which is quite nice, because when I looked at it there was lots of stuff in there, and now there's a small amount of stuff. Although, you know, there are still some strange things: invalid metrics from the GitLab webservice, websockets, like, what's that?
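A hedged sketch of that kind of severity filter; the index pattern and field name are assumptions about the logging setup, not taken from the meeting:

```shell
# Pull recent cluster log entries at critical/error/warning severity.
curl -s "$ELASTICSEARCH_URL/kubernetes-logs-*/_search" \
  -H 'Content-Type: application/json' \
  -d '{
        "size": 50,
        "query": {
          "terms": { "severity": ["critical", "error", "warning"] }
        }
      }'
```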
G
For the HPA? Oh, it's just... yeah, okay. It's quite nice having it; it's quite a useful link, I think. Especially if we have an incident, you can dive in, go through that, and maybe see something useful there. So there's that.
A
Quick question: why is that link that you shared, the kube overview, in the kube folder and not in a Kubernetes folder?
G
We can rename it. Everything we've had already, up to now: the folder is type equals kube, and that has been set for several months. We can change it to kubernetes, but then it's got to match the type label in the saturation metrics and everything like that. So it's just that kube is what's being used in the saturation metrics and in several places.
A
Just to ask a question, because if I do this, I have, whatever, right? All of this is great, we have per-service vision, but then which one do...
G
Can we not just put all the... So we'll break all of the metrics that we've got, and any of the history that we've got in there, if we change it to kubernetes. So what about if we do it the other way around? Because then we've got... The other reason is that kubernetes, being as long as it is, and Grafana having a 40-character limit on dashboard names, there's not much left after that.
A
Yeah, cool. Skarbek, I would say for now let's definitely leave those, because in an incident situation you might still want those dashboards. I don't find them particularly helpful, because, you know, they're so difficult to get to the right workloads with, but there could be a situation where it's important.
E
Yeah, I think, along with whatever that rename is going to be, can we add a label to the chart, or to the dashboard, with just the word "kubernetes"? That way, if someone types the word kubernetes, they can quickly find it, because once they start typing the entire word, "kube" will disappear from the results.
E
I have another question. I have a desire to increase the observability of nginx, especially with the controllers, in terms of logging. Historically, we've never ingested our nginx logs into Elasticsearch, because they're so voluminous; same reason for HAProxy.
E
Looking at just the API service Rails log, we generate 327 million events in one day. What's the harm in having an additional 327 million events for the nginx service in Elasticsearch? I don't know how to evaluate whether that's going to blow up.
B
It's about cost, and I think we would need to discuss that with finance and Andrew and so on.
G
I also think, just for a little bit of extra history, that we did actually have those nginx logs at one stage in ELK. And we found, with the cost, and with the fact that nobody ever used them for anything, because they've only got a limited number of fields that you can't really configure that well (at least in Omnibus it's kind of difficult to configure them), there was never much in there that we used, so we stopped using it. What do you want to get out of it?
G
For me specifically: as opposed to the access logs (there are two separate streams in nginx), the error logs. The only thing that's a bit rubbish about the error logs, the last time I looked at them, is that they don't have a URL or anything; they're just kind of a stream of consciousness, like "didn't buffer this request", and then you've got to figure out which request it's talking about.
E
Okay. We've got a few issues related to logging. So, aside from trying to grab just the error stream, what if (because currently we don't do this) we send the data to BigQuery?
G
What about if, at a first level, we just put in the errors? Because we should be doing something with the error logs, right? I mean, it's actually a bit bad that we are just sending them to /dev/null at the moment, I agree. So what about if we focus on the error logs and not the access logs, nginx_error_log, and then put those in ELK? I don't think they're structured in any way, but, you know, that's fine.
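A minimal sketch of shipping just the error stream; the paths, tag, and the choice of fluentd are assumptions about the pipeline, but it shows the idea of tailing the error log while leaving the voluminous access log untouched:

```shell
cat <<'EOF' > nginx-error-log.conf
<source>
  @type tail
  path /var/log/nginx/error.log          # error stream only
  pos_file /var/lib/fluentd/nginx-error.pos
  tag nginx.error
  <parse>
    @type none                           # unstructured lines, ship as-is
  </parse>
</source>
EOF
```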
E
It's a good start. Okay, I'll find the issue and modify the acceptance criteria to account for just that, because that would be pretty good, I think.
E
I guess... my wife has been dealing with that primarily, and it's a little interesting. So I don't know the names of these cats yet; I think Robin is still trying to figure that out, but this is one of the Siamese-looking ones. You can tell it's got a little crusty-eye thing going on, but it's cute. It's adorable; it fits in your hand.