From YouTube: 2021-09-01 GitLab.com k8s migration EMEA
A: So I guess I'll go ahead and get started with the agenda. I wanted to showcase... actually, let me back up before I start showcasing anything. Kubernetes, in a recent version, decided to get rid of the Docker runtime; they've switched to running containerd.
A: If that shell was not enough, say because of a requirement to install a tool that needed root access to certain portions of the system, there was another method using Docker: you could exec into a new container, attach it to the process namespace and the network namespace and such of the container you wanted to interrogate, and you'd have all the necessary access available to you. containerd does not have this same capability, but runc does; you just have to know how to leverage it. So I thought I would take our demo time and show how we would do that.
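(For reference, the Docker-era approach looked roughly like the following sketch; the target container ID abc123 and the nicolaka/netshoot troubleshooting image are stand-ins, not values from this demo.)

    # Start a throwaway container that joins the target container's PID and
    # network namespaces, so its debugging tools can see the target's
    # processes and sockets.
    docker run -it --rm \
      --pid=container:abc123 \
      --net=container:abc123 \
      --cap-add SYS_PTRACE \
      nicolaka/netshoot bash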
A: But let's say, for example, we want to troubleshoot a container. The easiest method for me: I'm going to leverage our canary stage in production, because not all node pools are running containerd versus the Docker runtime. This is just due to the upgrade progression and things happening through incidents and the mitigations that we've done over the course of time. Production has suffered more incidents; therefore production has more containerd-running node pools, so it's easier for us to use canary as a testbed.
A: Okay, so let's pick a pod. This is staging, so I'm not worried about breaking things. I should have started with this anyway instead of production, but so be it. I'm just going to pick a websockets pod, just because. So, k get pods, and what we want out of this is where it's running. We can get that information by just passing in the wide flag, which gives us which node the container is running on. So I know that we are now running on this node here.
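(Roughly the command being run here; k stands for kubectl, and the namespace is a placeholder.)

    # List pods along with the node each one is scheduled on;
    # the NODE column tells us where to go digging next.
    k get pods -n gitlab -o wide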
A: The other piece of information that I need to know... because this is containerd, it's a little more complicated. Normally what you would do is a docker ps and then, poof, you have all your information. But because Docker is not the runtime, you don't have that. crictl is the containerd equivalent that provides us this information, and, if I remember correctly, the command is not "list containers", it is ps.
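(A sketch of the equivalent, run on the node itself; the pod name is a placeholder.)

    # crictl speaks CRI to containerd, standing in for "docker ps".
    crictl ps
    # It can also be narrowed to the containers of a single pod:
    crictl ps --pod "$(crictl pods --name gitlab-websockets-abc123 -q)"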
A: I only see one of those, so I picked a poor example where this particular node is only running one pod of this type. If I were to pick a node, say, in production where there's four pods running on it, you're going to see four gitlab-workhorses, but you don't know which pod or container that resolves to. So I can't tell you, via the crictl command alone, that this pod that I want to interrogate is this container that is running here.
A: That's kind of the downside. So in order to get that information, what you would do is... let's see... jq, .status.containerStatuses... there, fine! Oh, why didn't anyone tell me I did that wrong?
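(The working form of that query looks roughly like this; the pod name is a placeholder.)

    # Resolve a pod to its container IDs as known to the container runtime.
    kubectl get pod gitlab-websockets-abc123 -o json \
      | jq -r '.status.containerStatuses[] | .name + " " + .containerID'
    # Output is of the form: webservice containerd://<64-hex-char ID>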
A: So now that we have all that necessary information, we have the ability to do fun things. Well, firstly, I guess I should go back a little bit. Previously we would do a docker exec and just pass in the container ID. crictl does allow us to do that. So if I do an exec... I think I still need the tty... and let's see if this works... then bash.
A: You get a shell, you can run id, you see that you're the git user, but you're still limited: you can't switch to root, because that requires authentication which we don't have. So your tooling capabilities are still limited. You can still do a curl on localhost and do fun things, assuming curl is available, so stuff like that is open to you. But what if you want to do something more invasive, like run a ptrace or an strace, or you want to do a TCP capture, stuff like that?
A: You cannot do that inside of the container as-is. You have to become root in some way, shape or form; you have to do other things in order to install the tooling that you may require. So having this really lengthy container ID is very important, because we need to interface with runc; crictl doesn't provide us the native capabilities. If I do exec --help, you'll notice I can't specify the user, and I can't specify various kernel capabilities that I want to be able to execute with. Previously, with Docker, we could specify the user that we wanted to log in as, etc. So that's what we need, but there's a catch with this.
A: If I do runc... and I forget the exact commands, because they're all slightly different... ps inside of runc gives you the list of processes running inside the container, so it's more like the process-list command. list is the listing of the containers that are running under the runc container runtime. So if I do a list, you don't get any information, which is highly unfortunate.
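(A sketch of the trick, assuming containerd's usual runc state directory for the Kubernetes namespace; the path may differ per installation.)

    # runc's default --root appears empty here because containerd keeps
    # its runc state under its own directory, in the k8s.io namespace.
    runc --root /run/containerd/runc/k8s.io list
    # With the full container ID, we can now exec as root and even add
    # capabilities, which crictl exec cannot do:
    runc --root /run/containerd/runc/k8s.io exec -t -u 0 \
      --cap CAP_SYS_PTRACE <container-id> bash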
A: We don't want it to be removed from the cluster, but we want to separate it out so we can continue interrogating it. That way we're not blocking auto-deploys, but we could still interrogate this container; or we could segregate this container so it's no longer receiving customer traffic, and we could still look at it in some way, shape or form. I want to figure that out next, if that's possible.
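(One possible approach, assuming the Service and ReplicaSet select pods on an "app" label; the label key and pod name are assumptions, not the chart's actual values.)

    # Overwrite the selector label: the ReplicaSet no longer counts this
    # pod (so deploys proceed and a replacement is spawned) and the
    # Service stops routing customer traffic to it, but the pod keeps
    # running for interrogation.
    kubectl label pod gitlab-websockets-abc123 app=quarantine --overwrite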
Discussion items

A: I wanted to talk about the two ongoing migrations that we're accomplishing. We all know that the web migration is now completed at this point; what's left is cleanup and documentation updates. Grain did get started on this: we've removed the virtual machines in our pre and staging environments, and we removed the deployer and patch mechanisms, so we're no longer deploying to the nodes that still exist in production.
A: And then, after that... I think we still have tuning left to accomplish. Cluster B is still in our testing phase, so it's slightly different than the other two clusters. Grain is currently managing that, and I haven't looked into the state of that issue yet. So I think what's left is to either push that change out to the rest of our clusters or continue tuning.
A: So I need to follow up on that one. And Henry, regarding your question about Apdex drops during deployments, I need to create an issue for that.
D: We run into this all the time also, since we moved the API, because we see the same thing there, right? But somehow we never really came to a good conclusion. Like, we couldn't really prove that it's related to pods being slow when they are new, or on new nodes, right? We couldn't really prove that, and you had the suspicion that maybe it's just the readiness checks, the health probes, being accounted wrongly against the Apdex when we terminate pods.
D: So that's really strange, because I'm not sure if we really have an issue: the error rate is not going up, just the Apdex dropping down, yeah. And I'm not sure if we really have an issue there or not. The thing is just, every time we deploy and we see the Apdex drop, I think, oh, something's going wrong, yeah, okay. But I also have no good idea of what it could be. It's just very...
A: Okay, so the Pages migration. This is something that just recently got kicked off. I'll share my screen really quickly, because I do have some things I could at least showcase a little bit. I've pre-populated our epic with a variety of issues I'm currently working on, trying to get Pages working on Kubernetes.
A: I did notice that there's a lot of missing stuff in our Helm chart preventing me from getting that started. I expressed this in my last meeting, but just to reiterate: my goal is to start with pre-prod as a method of learning how Pages needs to operate inside of Kubernetes.
A: It's a new service to me entirely, so I need to learn about it in the first place and then get it working in pre-prod, and then leverage staging as a way to test how we're going to perform the migration, because I know we're going to have to make some fun changes to our HAProxy cookbook in order to enable a smooth migration and avoid downtime and such. So staging's primary focus is just going to be: let's figure out how we do this in a non-outage-driven methodology.
A: At some point we need to sit down, and this issue right here is to determine what the configuration needs to look like for resources, as well as tuning it so that it runs well and our HPA responds accordingly. Because we do have the occasional customer who will perform a release, and then Pages suffers, because everyone clicks on the link in their blog post, ends up landing on Pages, and kind of drives traffic up the wall.
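(For that tuning work, inspecting the autoscaler's live view is one reasonable starting point; the resource names here are placeholders.)

    # Show current vs. target utilization and replica counts for the HPA.
    kubectl get hpa -n gitlab
    kubectl describe hpa gitlab-pages -n gitlab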
B: Maybe we need it in the beginning, if we're going to do a mixed deployment with VMs.
B
I'm
thinking
it
can
be
just
one
entry
for
the
kubernetes
cluster
and
then
let
kubernetes
deal
with
low.
I
mean,
then
you
have
false
load
balancing
there,
because
then
from
hi
proxy
perspective,
yeah
with
just
one
end
point,
you
know
what
is
the
name
in
the
back
end?
I
bet
one
backhand
in
proxy
and
then
actually
there
are
more
it's
more
capable
than
the
others,
but
yeah.
We
should
definitely
figure
out
if
we
can
remove
it
from
from.
A: I know already that we are losing logs from the Pages service when people attempt to make a connection and there's something wrong with SSL, whether it be our fault, the client's fault, what have you. It doesn't matter to us as operators of the service, but to the people that are using the service it will matter. So, since I know Pages doesn't log every single connection that comes into it, HAProxy at that point might be beneficial; and then there's ACLs.
A: So I already know that there are reasons to keep it in place, but I do think it's worth exploring, because from a technical standpoint it's just going to do the TLS termination. In the future there would be caching in front of it; right now there's not. HAProxy is just kind of a dumb proxy at this point, not really doing much. So I'll have to, we'll have to, look at any existing rule sets and such to figure out what we want to do in the future. But I do think that is a post-migration task.
A: Let's see, as far as migration blockers, I've identified at least three items that we're just missing inside of our Helm chart that would enable certain options. This one: it looks like gitlab-logger is potentially not starting properly, and it's preventing us from capturing the first few seconds in which GitLab Pages starts up, which is kind of crucial, because when GitLab Pages starts up we get some clear information as to whether or not the service is healthy. So knowing that information when the pod starts is quite crucial.
A: So in this proposal I ended up taking all three of these issues, because, you know, Distribution is busy with things. My proposal is just to remove gitlab-logger, because Pages already outputs in a structured log format. We're also completely missing the NetworkPolicy, so we lack a bit of security that we need to maintain.
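(A minimal sketch of the kind of NetworkPolicy that's missing; the selector and port are hypothetical, not the chart's actual values.)

    kubectl apply -n gitlab -f - <<'EOF'
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: gitlab-pages
    spec:
      podSelector:
        matchLabels:
          app: gitlab-pages
      ingress:
        - ports:
            - port: 8090    # hypothetical Pages listener port
    EOF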
A: So I'm trying to get that pushed into place. And then there are various options that are missing inside of Pages. Items I can think of quickly... let's see, I had two merge requests for this. One was adding proxy v2 support: we currently leverage this in production, and it currently exists in Pages, but does not exist inside of our Helm chart, so it's simply something we cannot yet activate.
A: So I need this in place before I can push this into pre-prod... and, ah, this has been merged. Okay, so we've got at least one merged item for this, so that's kind of cool.
A: And this last one, I marked it as a blocker just so that I could discuss it very quickly.
A: So in a different thread I'm like, hey, does it actually work? Because it is configured correctly as far as I can tell, and I know that, inside of our service, it's configured to talk to Sentry.
A: So Jamie was nice enough to spin up an issue to figure out whether we simply aren't throwing any errors, which I know is kind of false, because I've seen errors fly through our logs and I've seen errors fly through our metrics.
C: I wanted to hijack this and ask some questions. So Jarek was asking the other day what happened to a background-migration Sidekiq job he had, and we know that long-running Sidekiq jobs aren't a great fit for Kubernetes, because, well, your container can go away anytime, basically, and yeah. So we know that's a risk, but here I was just trying to figure out why the container went away. We can see that we get the context-deadline-exceeded for the health... the readiness or the...
C: I think it's the readiness one. And in some cases Sidekiq seems to be shut down gracefully, and other times it just gets killed. Jarv pointed out that... actually, let me just share my screen.
C: How do I go further here? Like, you know, I can see... so there's one here that was from yesterday that just got killed; this one from the other day was...
C: Oh, I think this time range will be wrong, but I think this one got terminated more gracefully, so it got to do its own shutdown steps. But how do I do anything useful here, basically? Sorry, let me just... oh wait, this is the wrong one.
B: That has a timeout for receiving an answer, and this timeout is not met. So this means that Sidekiq is not giving back the readiness probe answer in a timely manner.
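(To check how those probes are actually configured, something like the following works; the deployment and namespace names are placeholders.)

    # Dump the readiness and liveness probe config for each container.
    kubectl get deployment gitlab-sidekiq -n gitlab -o json \
      | jq '.spec.template.spec.containers[]
            | {name, readinessProbe, livenessProbe}'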
B: Yeah, what I'm thinking here is just random ideas: that something got in between, maybe memory limits or something like that. So it received a signal, and maybe the handler for the signal, as it is implemented in Prometheus, is that it stops accepting new connections. So while it was terminating the job, even though it didn't have enough time to terminate it, it was reported as not ready, because, obviously, while terminating it takes no new incoming connections. That's why you see this.
C: It could be, if it's in the same process... is the metrics web server in the same process as the Sidekiq threads? I can't remember.
C: Yeah, so if it is... I mean, it's Ruby, right? So if it is in the same process and one of the Sidekiq threads is doing something CPU-heavy, then the readiness probe will fail, because the request can't be handled; it won't obtain the global VM lock. But if that was the case, I'd expect this to happen a lot on the CPU-bound pods. I don't actually know if it doesn't happen on the CPU-bound ones, to be fair.
A: I guess... thank you, yeah. So I've got two thoughts here. One: the fact that we don't see the liveness probe failing is a little bizarre, so either the liveness probe is not configured, or it might be configured a little differently.
C: Okay, but it doesn't matter... one step back: it's not...
A: ...serving traffic, so it's not going to be inside of a Service per se. You know, Sidekiq itself is what pulls work; nothing is distributing work to Sidekiq in this particular case. So I think there are two avenues we need to explore. One: let's look at the liveness probe and see how it works, because I've forgotten at this point how it's configured; let's determine if the liveness probe is causing these failures, and whether that is leading to the pod being restarted.
A: If I recall, Jarv had started a Slack thread and discovered that these pods were not showing a restart count being incremented. Oh...
A: I think I might be making this up, because I dream a lot. The other avenue that might be worth exploring: because I did see the message that the process wrapper was killed, that would lead me to believe that it's not Kubernetes that's interfering here, but instead the node's OOM killer might be being invoked.
C: Yeah, he said it's expected that the pod will be terminated after reaching the limit, yeah. And he said that we should be able to tell whether it's hitting the limit from the logs, but I don't know if he means the node logs there or the pod logs.
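(Both places are checkable; a sketch, with placeholder names.)

    # Pod side: an OOM-killed container shows up in its last state.
    kubectl describe pod gitlab-sidekiq-catchall-abc123 | grep -A3 'Last State'
    # Node side: the kernel log records the OOM kill itself.
    dmesg -T | grep -i -e oom -e 'killed process'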
C: The database-throttled one is the one we're interested in. Oh, sorry, which node, or which chart?
D: Yeah, look for the nodes that you're running on, like the catch-all.
A: I wonder if it's default. Sean, you might still have the tab open that tells us which node pool.
D: You mean the dashboard?
C: Did you say yes?
A: The events index; that should have this information.
A: Okay, thanks. And if you need additional help, let's try to prioritize this so that we're not blocking Jarek's work here.
C: Yeah, that might be the issue here. A gig seems quite...
C: No, it's the same as what we're using for memory-bound. I was just trying to think in terms of VMs, like how much memory they had. I guess they were doing more things, right? It's not just the memory for the Sidekiq process, so it's difficult to compare. Yeah, but yeah, okay, yeah.
A: All right, so it sounds like we've got enough information to move forward. Is there anything else we want to talk about on that front?
A: Cool. So, our last item: our goals. I think our goal should be to knock out the remainder of this web migration, which we're currently doing, and I'm going to create a new issue to slow it down. So that's perfect. Cool, all right. That's all I have to discuss; anyone have any further questions? Otherwise we could... well, I'm going to go to lunch. Cool. Well, thank you all, have a lovely day, see you later.