From YouTube: [US] Container Registry Interactive Demonstration
Description
Skarbek works with other SREs to demonstrate how to deploy, view metrics, and find logs related to our recent service migration of the Container Registry from VMs into Kubernetes.
A: Perfect, you're only missing an optional dependency. We don't care about that because we're not doing anything related to it, so you don't need to worry about that at this moment in time. Your workstation is supposedly ready to go.
So if you want to run the... I'll type it in here. Okay: helm list against the staging (gstg) context. This will provide you a listing of a wide variety of things. In this case, it will tell us that we've got a few things installed inside of the staging cluster, so we should see.
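For reference, the command is roughly the following. This is a sketch; the exact invocation was hard to hear, and the context name gstg is an assumption:

    # List the Helm releases installed in the staging cluster
    helm --kube-context gstg list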
A: Perfect. So, as you can see, we've got a few things to show here. The first item I want to point out is the fact that we ran the helm list command, which is just asking the metadata inside of Kubernetes: what do you have installed? So we have three things installed. GitLab, which contains the GitLab application, which in this case is restricted down to the container registry; we're using our own Helm chart for this, but we disable everything except for the registry.
The next item is gitlab-monitoring, probably not the greatest naming choice, but this is just the use of the stable Prometheus Operator Helm chart provided by the community. This provides us our mechanism for monitoring all of our clusters, so you'll see this installed in every single one of our clusters that we run today that is relatively new. Our old GKE runner clusters are not running this stuff. And then there's what I want to call "Platinum L", but it's the PlantUML service. This is just a fancy integration that GitLab offers that Jarba has been trying to push into production.

And then the last listing you see, we did a kubectl get secrets. The registry requires a few specific secrets. If you watched the presentation on Wednesday, there are three of them that are required for the Docker registry to communicate cleanly with the GitLab API, as well as cleanly with our clients. So those secrets are manually created and dropped in here as necessary. So at this point you can now run a list, which is fantastic. So that's great.
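The secrets check, sketched as a command; the namespace name is an assumption:

    # Show the secrets present in the registry's namespace
    kubectl --context gstg --namespace gitlab get secrets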
So the next thing I want to look at is the pipelines. So what I'm going to do is merge the merge request that you put in, if I can find that tab, because I've lost it already. I'm going to approve it and I'm going to merge it. So what I'd like for you to note next is that GitLab does not run any of the actual pipelines; it's only the source of truth for our repos. That way it's publicly available. Our pipelines are going to be in the ops instance; I'll paste it.
You open up the dry run. So the dry... so our pipeline is relatively straightforward. The first thing we do is a basic check to make sure the shell scripts have all the necessary stuff ready to go, kind of needless, but we also do a set of dry runs. So you're looking at the staging dry run. So if you scroll up a little bit, we're going to see a few things inside of this particular job.
The first thing you're going to see, if you scroll about halfway up the screen, is a diff. Yeah, so we use the plugin helm-diff. This provides us a view into what is proposed to change inside the configuration. So somewhere in this general vicinity we should see the change that we want to make... there, the change, exactly right. There we're changing the version of the image registry, doing a downgrade for this demo. The second half, if you scroll back down again: Helm is doing a dry run of the upgrade command.
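Both halves of the job, sketched as commands; the release name, chart path, and values-file names here are assumptions:

    # Show what would change in the rendered configuration (helm-diff plugin)
    helm diff upgrade gitlab ./gitlab -f values.yaml -f gstg.yaml
    # Exercise the same upgrade without applying anything
    helm upgrade gitlab ./gitlab -f values.yaml -f gstg.yaml --dry-run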
The real good part is going to be the next job after this, the upgrade. Yeah. So this is going to run the exact same commands, but it peels off the dry-run flag. So you'll see right now it's attempting to perform the deploy as we speak. So that's good news; we're right where we want to be.
So while it's deploying, let's go look at our metrics. I'll put a new link into the Zoom, if you want to click on that one.
A: So there's a few things I want to point out on this view. The top right shows us the active replica sets. If you're unaware: anytime you create a Deployment inside of Kubernetes, it creates a ReplicaSet. That ReplicaSet is what maintains the pods that are responsible for that Deployment. So you can see the yellow line goes back to infinity; it's currently running three pods. That's our old replica set, so that's the old version of the registry, probably running registry version 2.7.1. Now, as you've got your mouse highlighted over it,
we see that small snippet of what's trying to start up right now. There's two pods attempting to run the new version of the GitLab registry, and as you can see, they're kind of struggling, I guess, because below that on the far right there's unavailable replicas: 40% of all of our pods are not running properly. So three out of five pods are running, so two of them are having problems. So at this point you know something's wrong, but we have ways to look into this information.
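The same picture is visible from the CLI as well as the dashboard; a sketch, with the label selector and deployment name assumed:

    # Old and new replica sets, with desired/ready counts
    kubectl --context gstg get replicasets -l app=registry
    # Watch the rollout itself; this blocks until it succeeds or times out
    kubectl --context gstg rollout status deployment/registry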
A
So,
let's
hop
over
to
our
Long's
and
see
if
we
can
find
out
our
failure,
so
I
dropped.
Another
link
in
the
zoom
chat
for
you
and
just
to
make
this
a
lot
easier.
So
we're
not
pecking
I've
got
a
copy
and
paste
of
the
appropriate
filter.
So
when
this
loads
switch
the
index
to
the
gke
staging
index
and
then
use
that
as
your
filter,
so
we
could
look
for
the
logs
associated
with
the
container
register
in
that
environment.
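As an alternative to the Kibana filter shown in the demo, the same pod logs can also be pulled straight from the cluster; names are assumed:

    # Tail recent logs from the registry deployment's pods
    kubectl --context gstg logs deployment/registry --all-containers --tail=100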
A: It's the icon that looks like two panes; if you hover over it, it says "Toggle column in table". Scroll down to the expanded view of... okay, you can do it from there. Now just scroll on the listing of the log that you've expanded; scroll down to the area where the message block is, and there's a set of icons to the left of message. Yeah, that.
A: Perfect. Okay, now we can see these are the logs that are coming from the container registry. Relatively quiet, because this is the staging environment, so we don't see a couple million logs from people pushing and pulling images or logging in. But based on these messages here, we can clearly see that there's something wrong with the pods that are attempting to come back up: the storage driver is not registered in this particular version of the Docker registry. We know something is wrong with it for the purpose of this demonstration.
This was easy to utilize as a way to find logs and troubleshoot, because we know exactly what's wrong. If we were troubleshooting a different issue, this is kind of, you know, the direction to go, but in this case we know precisely what is wrong: this particular image of the Docker registry was missing something when it was compiled, hence 2.7.1 was released immediately after it.
A: It has... so the job has failed, as to be expected. Without scrolling, if you look near the top of the job output, you'll see that it noted that the upgrade failed, and the very next message was that it timed out. So we've got a configuration inside of Helm that says: hey, wait five minutes; if you're not successful, perform a rollback. And as you can see, the very next message is it performing that exact item.
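That wait-then-roll-back behavior maps onto Helm's upgrade flags; a sketch assuming Helm 3 syntax, not necessarily the job's exact mechanism:

    # Roll back automatically if the release isn't healthy within the timeout
    helm upgrade gitlab ./gitlab -f values.yaml -f gstg.yaml \
        --atomic --timeout 5m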
So if you go back to our metrics, for example, and refresh it: in the right pane you'll see that the gitlab-registry-555vb-whatever replica set started to spin up a few pods but immediately shut down, and the existing registry replica set remained intact. So during this point in time we never had an outage.
A: I forget who brought it up, maybe it was Stan, but everything in GKE is going to that one index. So this includes all the audit logs associated with GKE, as well as the cluster operations. So anytime the API is doing something, those logs are all going to the same place, and then every single pod that logs, all those logs are going to the same place. I made it easy here because I gave you a filter that's relatively quick to, you know, limit it down to the registry pods.
A: It'll be a new subsection... yeah, here we are. So in this block we're defining a few things. Most of this is just modifications to the defaults. And where is the HPA configuration? Scroll down a little bit more. I don't see what I'm looking for... there it is! No, that's not it. Oh, that's right, the HPA is not in this file. Go back, go back to the root of the repo and then look for, say, the prd YAML.
No, not this one. Okay, go to... either production, maybe. Yeah, go to the production one, gprd.yaml. Somewhere in here... perfect, okay. So what the YAML is doing, excuse me, what Helm is doing, is combining all these files. So we take the values.yaml file and we take whatever environment we're operating on; we just operated on the staging environment, so we would have looked at the gstg.yaml file, and we're mashing all these values together. So in this case we're telling the registry for production we want to run image version
2.7.1. We're going to tell our HPA that we want to utilize at most one hundred and fifty pods; by default that max value is set to ten, and the minimum for that is set to two. So when a new deployment comes up, it'll automatically come up with two pods from the get-go. When a deployment occurs, because we're using autoscaling, and this is autoscaling based off of CPU utilization, the deployment is going to read how many pods are currently running and try to scale to that number.
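A sketch of the kind of override being described; the exact keys in the chart's values files are assumptions:

    # gprd.yaml: production overrides, merged on top of values.yaml
    registry:
      image:
        tag: v2.7.1
      hpa:
        minReplicas: 2     # default minimum
        maxReplicas: 150   # production override; the default max is 10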
So if you go back to the metrics, we saw that we first tried to bring up one pod, but for whatever reason the Kubernetes cluster decided to scale up, so I guess soon after that it tried to bring up two pods during the deployment process. But both of those pods were failing the entire time, so they were eventually shut down when we reached our time limit. Right now we're still running at four pods; we'll probably see that scale back down, like, shortly after this meeting. I think it's every three minutes.
A: Okay, I don't know where this is, but I can explain it to you without having to show you anything. Open up our metrics real quick. So the container registry is going to scale based on CPU usage. Kubernetes maintains the concept of a millicore: every one core is a thousand millicores of CPU.
We request 0.05 of a core for the container registry. When a pod gets spun up, if Kubernetes finds that a node does not have that 0.05 of a core available, it won't schedule that pod, or it might go to a different node. The CPU usage is measured against that request, so we might only be using a tiny fraction of it; in this case we're using twenty-two percent of that CPU request.
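In Kubernetes resource notation, 0.05 of a core is written as 50m (50 millicores); a sketch of the request being described, as it would appear in a pod spec:

    resources:
      requests:
        cpu: 50m   # 0.05 of a core; the scheduler reserves this per pod,
                   # and HPA utilization percentages are computed against it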
These values are much higher in production, because 0.05 of a core is very minute; that's probably equal to doing math on a computer very quickly. So the HPA is going to take the CPU usage and average it across all running pods, and if it's over 75 percent it'll scale up additional pods as it deems necessary. There's this really cool formula on the Kubernetes website as to how it determines how many pods it should scale up.
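That formula, from the Kubernetes HPA documentation, scales the current replica count by the ratio of observed to target utilization:

    desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)

For example, five pods averaging 90% CPU against a 75% target gives ceil(5 * 90 / 75) = 6 pods.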
B: I have a quick question. (A: Sure.) So I put this in the notes, and this might be too much of a Kubernetes-specific question, but, you know, whenever we were looking at the part where we had the MR, we had a bad build. It was trying to spin up the pod and it failed. What kind of capabilities exist there, as far as, like, it knowing that there is really a failure, per se?
A: For example, we had a situation where a pod came up and it may have been responding, but for whatever reason it's got an issue where it just sinks the CPU, or it pegs the CPU usage really high. We are implementing limits on most CPU and RAM use, so if pods exceed those limits, Kubernetes will deschedule the pod; it will terminate it for us. That way we prevent the cluster from running out of resources and being starved from operating other workloads.
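A sketch of limits alongside the request above; the numbers here are illustrative, not the real production values:

    resources:
      requests:
        cpu: 50m
      limits:
        cpu: "1"      # CPU usage above the limit is throttled
        memory: 1Gi   # memory above the limit gets the container OOM-killed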
B: And I can also assume that, in theory, with a proper pipeline, the Docker image itself, you know, whatever core changes inside between 2.7 and 2.7.1, it's hypothetical that there's some work that could be done there, to maybe do proof of concept, or, you know, things outside of this structure. Maybe we had it in staging, it's been running for two days, we haven't seen a problem (A: Yep), promote to production now, and that can get some programmatic problems worked out. (A: Yeah.)
A: So, theoretically, specific to the container registry, there's an open issue for the quality team to generate the necessary checks to run QA against the container registry, and we would love to push that into our CI pipeline. That way it's full CD. Right now production is gated, but we would like to not have to gate that by a human; instead we want to run QA, and if it passes, go ahead, promote it on as you see fit.
We do that today with auto-deploy; it'd be nice to trickle that over into our Kubernetes work as well. Along that note, as we migrate more services, we're going to be pushing QA harder to get that kind of capability. Anyways, I think the next thing we want to move is Sidekiq, and I don't know what kind of QA capabilities we'll have on that front, but it'd be nice if we could have a pipeline where we don't have to touch it; it's just: make your change and watch it go through the pipeline.