From YouTube: 2021-04-15 GitLab.com k8s migration EMEA
B: I'm going to assume there's not a demo, so talk us through how the API service is going.
D: It's great: it's deployed in canary, but it is not currently taking any traffic. We have a configuration item that is missing. This was an object that was producing a lot of errors in staging, so we had to prevent staging from taking traffic until we had some fixes in place. Unfortunately, the fix that I created was kind of misguided in how it took into account the use of Geo, so this created a problem in production when we tried to roll it out.
D: It created a situation last week where this was, at first, making a lot of requests to the API service and they were getting blocked by Rack Attack. So due to that, we rolled the change back.
D: We were also saturating our Cloud NAT pool for just the registry nodes, because we were making way too many connections back to our API fleet. This is because the registry was making connections to the API for notifications for the Geo service, which we don't use on GitLab.com. So I worked with Jason a little bit at the very end of yesterday on a fix. That's the issue, and there's a merge request associated with it.
D: It's currently ready for review. Once that gets into place, it would remove the workaround that we are using to get this configuration in place, and then we could start taking traffic in canary.
D: So this isn't something that customers would run into; it's just down to the way that we decided to implement this particular feature of getting metric data into Snowplow. So we're almost there. In the meantime, I finished up the readiness review yesterday and sent it off to three groups of people: a couple of SREs, Stan from development, and a couple of members of security, to get an overview of the readiness review and see what holes we have in our documentation and such. So I'll be following up on that information.
D: As the reviews come in from people. I've already got some information from Stan, as well as the security team so far, so that's helpful.
B: And did you add a priority label onto this review? You have? Great.
D: It's immediate, but I'll be waiting until this gets merged and our Helm chart gets upgraded and pushed out through canary. The only thing I'm going to be checking for is the configuration item existing inside of the gitlab.yml file. That's how I'll know that we're good to go, and as soon as that's done I'll start working on the necessary items to start sending traffic, which in this case is just modifying the weight values of the canary environment in HAProxy.
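(As a rough illustration of that check, not the team's actual procedure: one way to confirm the configuration item has landed is to look for the key in the rendered gitlab.yml inside a pod. The file path and key path below are placeholders, not the real setting being waited on.)

```python
# Rough illustration only: check that a configuration key is present in the
# rendered gitlab.yml. The file path and key path are placeholders.
import yaml


def config_item_present(path, key_path):
    """Return True if the nested key exists in the YAML file at `path`."""
    with open(path) as f:
        node = yaml.safe_load(f)
    for key in key_path:
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True


# Hypothetical example: has a registry notifications block been rendered?
print(config_item_present("/srv/gitlab/config/gitlab.yml",
                          ["production", "registry", "notifications"]))
```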
B: Super, great stuff, nice. Then in terms of canary: I checked in with Graham yesterday and he is still working through the "service discovery sometimes fails inside Kubernetes" blocker at the top; I should move it down.
B: So he's hoping to have some experiments that we can run to try and progress this one further before we become blocked; that's basically the goal, to get this done ahead of us wanting to move ahead with canary. But at the moment we are still a bit in the dark about what's going on with this one.
D: About this: I would like to try to figure out if there's a way to determine how many requests we are making versus how many requests are failing, for whatever reason. But I don't know how to create the necessary charts to do that, so I was wondering, Andrew, if you might be able to help me out with this, because we know how service discovery works: we ping Consul every 60 seconds.
D: Every pod is going to do that every 60 seconds. I did leave some open questions that should relate to this, but I would like to be able to figure out how many requests we are making, because we don't log the successes, we only log the times that we fail.
D: So if we can figure out how often we are successfully making those requests, we could derive the ratio of how many times we fail to make the service discovery request, and then we can determine how often this is an actual problem for us. But I also have some open questions related to how service discovery works, because I don't know much about it. I don't know.
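(A back-of-the-envelope sketch of the ratio idea described above, assuming the once-per-pod-every-60-seconds model and that only failures are logged; the numbers are illustrative, not measurements.)

```python
# Back-of-the-envelope sketch of the failure-ratio idea. Assumptions, not
# confirmed in the meeting: each pod attempts service discovery once every
# 60 seconds, and only failures are logged, so total attempts are estimated
# from the pod count and the polling interval.


def service_discovery_failure_ratio(pod_count, logged_failures,
                                    window_seconds=3600, interval_seconds=60):
    """Estimate the fraction of service discovery attempts that failed."""
    estimated_attempts = pod_count * (window_seconds / interval_seconds)
    if estimated_attempts == 0:
        return 0.0
    return logged_failures / estimated_attempts


# Illustrative numbers: 100 pods over one hour with 12 logged failures
# -> 12 / 6000, so about 0.2% of attempts failing.
print(service_discovery_failure_ratio(pod_count=100, logged_failures=12))
```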
E: So, I'd need to confirm this, but I'm pretty sure that if it doesn't, in fact, I know that if it can't discover, it goes to the primary, because there's that other bug about Sidekiq starting off with traffic only going to the primary and then sort of failing over to using the replicas. So that's definitely how it works, but the way that service discovery works is not by pinging. It's by... yeah.
E: Yeah, I mean, I haven't done a lot of this, but my initial reaction is that that would be better, because just because that node, that agent, can get to that secondary doesn't mean that you can. So you're going somewhere else on the cluster and asking, you know, what can you see as available right now, and then you're using that, where you should rather be doing that locally, I would imagine. And it's less surprising as well.
D: Well, that was the problem. We've been trying to figure out whether it's due to a pod being rotated, or maybe a node being taken out of rotation at just the right moment when a request is coming in, but so far we've been unable to determine that. One of the things that Graham did find is that the Calico network service might be misbehaving on some of our nodes, and that would impact the ability to send traffic to the appropriate pod.
B: Okay, so number four. Andrew, lucky you, and thank you for helping us out with the observability stuff. I thought it might be useful to start off by just getting some thoughts around where we are with this and what we'd ideally like to be seeing. We had some sort of initial work on the epic, but those pieces were very much tied, I think, to where we were in the migration, rather than being about overall Kubernetes observability.
B: So you're completely welcome to scrap this and put in a completely fresh load of stuff. But whilst we have everyone on the call: what do we want to do with observability?
E: So I haven't had a chance to look at the issue, or the epic, yet, but in my mind the thing that I was thinking is really important is having first-class monitoring for node pools as their own thing. They're going to be a little bit different, because everything else that we have in our world is in the service hierarchy.
E: So, you know, we have services and stages and shards, and everything rolls up to a service, and that was true in the VM world as well: each service was on a different type of node. Now the node pools are mixed a little bit, so some services have two node pools, some node pools are shared, and we have the bug where some random jobs are just going anywhere in the Kubernetes cluster.
E: It's a many-to-many between services and node pools, so I think we should just monitor those as their own thing, with their own set of dashboards: how is the health of my node pools, so there's a kind of high-level overview, and then you can go down into a single node pool and see the health of that. So that's the node pools, and then, building on the work that we've already done for monitoring:
E: I think we should just make sure that it's working properly everywhere. Like the other day we got plant email fixed; we should make sure that that's there. We need to get the autoscalers into our monitoring so that we can see autoscaler saturation, and then, sorry, back on node pools, we also have node pool saturation, and then all of the graphs that we've got on those Kubernetes detail dashboards at the moment.
E: There's a lot of places where we're just seeing error, error, error, CrashLoopBackOff; a lot of pods fail for strange reasons and we're not reporting on that at all at the moment. So we need much better alerting around that.
E: I don't know what you think about this, but I was thinking that we almost treat it like an SLO: the number of pods that fail divided by the total number of pods that were created, or containers or whatever, because you don't want to alert on a fixed threshold, or on, you know, oh, a single pod failed.
D: Right, you don't want to report that a single pod has failed, because there could be any number of reasons a pod fails. It could be running out of memory, or we could have killed it because we configured our project exporter, or importer, no, it is the exporter, to make sure it's not using up too much disk; if it picks up a job and tries to write 30 gigs of data, we're going to kill that pod. So it would be unreasonable for us to fire an alert or page on something like that.
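(A minimal sketch of the SLO-style ratio being proposed, failed pods divided by pods created over a window. The counts would come from a metrics source such as kube-state-metrics, and the 5% threshold is an arbitrary example, not an agreed alerting rule.)

```python
# Illustrative sketch of the proposed check: alert on the ratio of failed
# pods to pods created over a window, not on any single failure.


def pod_failure_ratio(pods_failed, pods_created):
    """Fraction of pods created in the window that ended up failing."""
    if pods_created == 0:
        return 0.0
    return pods_failed / pods_created


def should_alert(pods_failed, pods_created, slo=0.05):
    """Fire only when the failure ratio exceeds the example SLO, so a single
    OOM-killed or deliberately killed pod does not page anyone."""
    return pod_failure_ratio(pods_failed, pods_created) > slo


# Example: 3 failures out of 400 pods created -> 0.75%, under a 5% threshold.
print(should_alert(pods_failed=3, pods_created=400))
```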
E: And at the node pool level, I've seen quite a few OOMs and kind of weird stuff that we're also not tracking at the moment. On those node pools we should probably make sure that we are, firstly, monitoring load, and that issue that I linked there was the CPU scheduling one, which is outrageously high and we still don't understand why, so that stuff as well. But what about CrashLoopBackOff?
B: What do you need for this, Andrew? What's your kind of plan around taking this epic and getting it to, I guess, amazing monitoring?
E: So I'd appreciate reviewers and feedback, and I will start on grooming that epic and going through it. Mostly reviewers, but then also, yeah, not just reviews, but feedback on how we're doing it, and then we can get that going, and get the alerts going properly for that as well. And then also, I think, it'd probably be quicker.
E: Yeah, okay, cool. So then mostly, I guess, it's just figuring out if there are more labels that we need on anything and getting all that done, and then trying to figure out a plan for, I guess, oh yeah, a plan for: are we just going to set up this Kubernetes service that we've spoken about a few times? Is that going to be what we attribute, say, a node pool being saturated to, because we try to do everything at the level of a service?
E: So when an alert comes in, nearly all of our alerts say which service it is, and then, you know, technically you could say, well, it's the data stores team's alert. But because the node pool doesn't have a service, we either need to create a new service called something like kubernetes, and then I guess the ownership of that would be the delivery team, or we do some sort of inference where we say, well...
A: ...this per service, where we are in danger of reaching some limit, we could still maybe generate something so that we can include it in the service dashboard.
E: Yeah, so for the HPAs, for example, those will definitely be attributed to a service. So if an HPA is maxed out, that won't be paging the Kubernetes service, that will be paging, say, Git, and saying, you know, we've got no more Git pods, or Sidekiq, or whatever. For a node pool it will probably be different: if we kind of maxed out the node pool, then the alert will go to kubernetes.
E: I think, from my point of view, I would imagine over time that we're going to get fewer and fewer node pools. Instead, the node pools will just be kind of different types of machines, and as we go towards that, having this very rigid node pool per service is going to become more difficult.
D: We're probably going to try to figure out how to do a cost analysis and resource analysis to figure out how we could lower that bill at some point in time in the future, so that makes sense to me.
E: There's a related thing that I've noticed: the reason we had to go away from the six-hour burn rate on Thanos was not because of the number of pods, but because of the number of pods that started and stopped through the day. And we don't know, I don't have any data on this yet, but between, like, 9am in Europe and, say, 5pm in Europe, I don't imagine that you're going to need to recycle your entire fleet, because the traffic is just kind of getting higher and higher.
D: At the moment we definitely churn very heavily, and I think there are two ways we could go about adjusting this. One is modifying the way that we scale, just using a different metric, so that we slow the rate at which we change. The problem that introduces is that if, for whatever reason, we suffer, say, an outage, or there's just lower traffic for whatever reason, we will be slower to scale upward during high demand.
D: So I think it would be great if we had the ability to scale on a custom metric. That way we leverage the ability to scale up and down based on some metric that we define, so that there's always pod availability and we're not resource starved, and vice versa, we're not over-utilizing anything, and that would reduce our churn rate. Because then we'd have a smooth curve of pods starting and stopping throughout the day, versus this jaggedness of starting and stopping pods constantly every three minutes.
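(A minimal sketch of what scaling on a custom metric could look like, written here as a plain autoscaling/v2beta2 HPA manifest built as a Python dict. The deployment name, metric name, target value, and replica bounds are placeholders, not anything the team has decided on.)

```python
# Illustrative only: an HPA that scales on a custom per-pod metric rather
# than raw CPU, one way to smooth the pod churn described above.
import yaml

hpa_manifest = {
    "apiVersion": "autoscaling/v2beta2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "gitlab-webservice", "namespace": "gitlab"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "gitlab-webservice",
        },
        "minReplicas": 10,
        "maxReplicas": 80,
        "metrics": [
            {
                "type": "Pods",
                "pods": {
                    "metric": {"name": "requests_per_second"},
                    # Aim for roughly 100 requests per second per pod.
                    "target": {"type": "AverageValue", "averageValue": "100"},
                },
            }
        ],
    },
}

print(yaml.safe_dump(hpa_manifest, sort_keys=False))
```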
D: Amy, back to your original question about this particular epic, something that I find missing, maybe it's somewhere else, but I know you had a conversation with Graeme about our observability into Helm and deployments.
B: Upgrades, yes. Let me just find this. So yeah, he's thinking about this, and at the moment I think it's quite likely that once we've done the API service migration, in the weeks of tidy-up after that, he'll take a look and see if this is a big job. So, okay, that is the issue that it'll all go on to, so yeah, he is thinking about it.
B: The starting point was the Helm 3 logs, the fact that we didn't have the logs; now we have Helm 3. But actually, I think he's thinking about whether this is slightly bigger, and whether the order of, or the way we do things in, the pipelines related to this issue, 721, is actually the right approach. So I don't know how big I think that is.
B: I don't quite know what changes he wants to make in there, but yeah, certainly we need to get back that observability that we lost with Helm 3.
B: Perfect, cool. So, Andrew, feel free to just give us a shout, either in Slack or on the issues or in these demos, if you need input or need help with any of this stuff.
B: So the final thing, I was just curious about this one. It can be a quick one, because I know we have it written down as well, but I'm just really curious about the pre-deployment issue we had following the registry changes.
B: I thought it might be of interest and useful for other people to hear, maybe not specifically why this one was a problem, but more generally, what sort of problems did we see here, and do we need to do anything special to make sure we avoid these sorts of deployment problems?
D: But I didn't update our checking mechanism to say it's okay that there's a change to the API deployment.
A: Yeah, I mean, the ugly thing here is that for our DB migrations and for registry we spin up a Job instead of spinning up an init container, and the Job will always be a new object, so it's always seen in the diff. With each deploy we will see a new diff just for creating a new Job, which is expected, but it's not nice, because, you know, it's nicer to just see a diff with no change.
A: If you just, you know, run the usual DB migrations. So if it were implemented as an init container in the application instead of a Job, I guess we would not see any diff, and then you wouldn't have this issue, but I'm not sure of the complexities for other use cases, for self-managed customers and things.
D: The problem with doing it in an init container is that it's going to get run for every single pod. So if we're running 80 pods in production, for example, and I have no clue what we do run today, you're going to run that migration script 80 times, which is not reasonable. So it makes sense that it's configured as a Job, and that's exactly how our migrations would work if we were using them inside of our Helm chart for our Rails application, but we have that disabled.
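(To illustrate the distinction being described: a one-off migrations Job versus an init container that runs in every replica. Object names, images, and the migration command are placeholders, not the actual chart templates.)

```python
# Illustrative contrast between the two approaches discussed above; names,
# images, and commands are placeholders, not the real chart objects.
import yaml

# A Job is a separate object created once per deploy. Helm shows it as a new
# object in the diff every time, but the migration itself runs only once.
migrations_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "registry-migrations-1"},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "migrations",
                    "image": "registry:placeholder",
                    "command": ["/bin/registry-migrate"],
                }],
            }
        }
    },
}

# An init container runs before the main container in every pod replica, so
# with 80 replicas the same migration command would execute 80 times.
deployment_with_init_container = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "registry"},
    "spec": {
        "replicas": 80,
        "selector": {"matchLabels": {"app": "registry"}},
        "template": {
            "metadata": {"labels": {"app": "registry"}},
            "spec": {
                "initContainers": [{
                    "name": "migrations",
                    "image": "registry:placeholder",
                    "command": ["/bin/registry-migrate"],
                }],
                "containers": [{
                    "name": "registry",
                    "image": "registry:placeholder",
                }],
            },
        },
    },
}

print(yaml.safe_dump(migrations_job, sort_keys=False))
print(yaml.safe_dump(deployment_with_init_container, sort_keys=False))
```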
B: I think it does, yes. Is there anything we need to do in the future? Are there any, I guess, things we need to watch for, anything we can make easier in the future, so that we can avoid these things?
A: I think we need to remember, when we add new services to Kubernetes, that we also need to adjust this filter that we have. So each time we add something new, like if we add canary to the API, or API in gprd later, we need to adjust this filter to include that stuff to be filtered too. We shouldn't forget about this, so this is still manual, yeah.
B: Do we already have a checklist or something that reminds us that when we add a new service to Kubernetes we need to do these various things?
A: Yeah, I mean, it's some kind of tribal knowledge, I think. I know that Scarborough explained this to me at the beginning, and I forgot about it again. Then I wondered where the second one was coming from and why it was complaining, and then Jarv reminded me again: oh, we have this filter, and right, there was this one thing. I mean, we don't do this very often, right, adding new services, so for several months nothing will happen, and then you do this and then it pops up again.
A: So it's not hard to fix, but it's still something that you need to think about here.
B: Yeah, okay, well, maybe we should... I'll open an issue. I think we should probably start thinking about what the things are that, when we add a new service to Kubernetes, we want to make sure we've always updated or checked, so that we don't have to try and remember all of these individual pieces.
B: Awesome, thanks for going through that. Is there anything else that anyone would like to talk about today?
B: No? Okay. You said this would be super quick and it wasn't, but it was excellent, it was a great discussion. So thanks everyone for coming along today, and good luck. Let's get on with the next step of canary.