Description
The Distribution team demos Orchestrator deploying a complete 2k reference architecture, and discusses some of the challenges resolved during the final steps of this functionality, documentation additions to be made, and features that should be on by default going forward.
A
All right, so some things have changed. Thanks to Dustin, we now definitely install licenses. One of the things with GitLab licenses: typically you put your license in at the beginning, and then if you want to upgrade or do more later, you have to go into the web UI and update it, if you're trying to do just regular unattended installations. A lot of folks seem to have the pattern of doing an update whenever we do a release.
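For illustration, here is a minimal gitlab.rb sketch of installing a license unattended at first reconfigure; the setting is only read on a fresh installation, which matches the "license at the beginning" pattern described above, and the file path is a placeholder:

```ruby
# Minimal sketch, assuming an Omnibus GitLab node; the license path is a
# placeholder. This setting only applies on the very first reconfigure,
# which is why later license changes go through the web UI instead.
gitlab_rails['initial_license_file'] = '/etc/gitlab/company.gitlab-license'
```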
A
The other thing we have done is make some discoveries along the way. For one, we figured out the max WAL senders issue: there have been random issues that pop up every now and again when running pipelines.
A
Let
me
stop
sharing
my
screen
for
a
second,
so
they're
in
random
issues
in
a
pipeline
where
you
run
a
pipeline
and
you
do
it
the
first
time
and
you
stand
up,
say
geo,
secondaries
and
you'll
get
this
message
saying
hey.
I
can't
do
that
and
the
number
of
wall
senders
is.
Actually
you
know
you
increase
the
number
of
wall
senders
and
you
keep
increasing
it
until
it
seems
to
work,
but
the
documentation
all
says
have
your
number
of
nodes
plus
one
well,
the
answer
is
it's
not
quite
right
and
what
goes
on?
A
Is
it
when
you
do
that
initial
replication,
there's
a
pg
backup
upgrade
and
that
doesn't
take
one
slot.
It
takes
two
so
for
as
many
geosites
as
you
have
an
additional
to
your
primary
site,
you
need
plus
two
extra
slots
to
handle
that
first
initial
application,
because
when
they
all
start
in
parallel,
you'll
have
your
number
of
replicas
in
the
primary
site,
plus
the
number
of
geo
sites
you
have
by
two
plus
one
slot
required
which
is
not
transparent
in
either
the
postgres
documentation.
When
you
go
through
it,
the
first
time
or
in
our
documentation.
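As a sketch, the arithmetic above might look like this in gitlab.rb on the primary site's database node (the node counts are hypothetical; adjust them to your topology):

```ruby
# Minimal sketch, assuming a primary site with two streaming replicas and
# one Geo secondary site.
replicas_in_primary = 2   # streaming replicas inside the primary site
geo_sites           = 1   # Geo secondary sites

# Each Geo secondary's initial pg_basebackup temporarily holds two
# replication slots, not one, plus one spare slot on top.
wal_senders = replicas_in_primary + (geo_sites * 2) + 1

postgresql['max_wal_senders']       = wal_senders
postgresql['max_replication_slots'] = wal_senders
```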
A
The other thing that we have fixed in the current release: we discovered that when you spin up application nodes quickly, there are settings written before you do database migrations. They get encrypted into the application settings of GitLab, and with those encrypted items, if two application nodes come up at once, the first one wins. So if it's not application node one, which is the one that's going to run all of the database migrations, everything fails from then on, because the encryption key to the database has changed: it puts the encryption key for the wrong node in before it tries to run the actual database migrations. So this is a thing that we've learned. Stan is going to be looking into that more, and we're going to see if we can get rid of that whole settings-application step.
A
Back to the screen now. As you can see, we're doing the load balancer, and I wanted to point out an interesting thing. When you do a load balancer, the machine comes up and the default port is port 22 for SSH, but when you put a load balancer in front of application nodes, one of the things you have to do is reroute SSH traffic, so that when you do a git clone or a push, that traffic actually hits one of the application servers behind the proxy.
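A sketch of what that SSH rerouting could look like in haproxy.cfg, assuming two hypothetical application nodes (addresses and names are invented; the load balancer's own sshd would need to move off port 22 first):

```
# Hypothetical haproxy.cfg fragment: TCP passthrough for Git over SSH.
frontend ssh_in
    bind *:22
    mode tcp
    option tcplog
    default_backend gitlab_app_ssh

backend gitlab_app_ssh
    mode tcp
    balance roundrobin
    server app1 10.0.0.11:22 check
    server app2 10.0.0.12:22 check
```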
A
So
that's
just
kind
of
a
nice
baked
in
features
that
you
don't
have
to
worry
about.
You
know
which
one
is
it
when
I
set
it
up
and
also
keeps
it
identical,
so
that
job
can
just
run
over
and
over
and
over
again
from
the
first
time,
all
the
way
to
like
the
30th
time
so,
and
that
is
one
of
the
things
has
been
added.
So
since
the
last
time
we
showed
this
last
time
we
did
ticket
architecture.
We
could
just
really
provision
the
nodes.
A
So
we've
added
a
couple:
we've
added
everything,
except
for
buckets,
which
means
that
we
can
now
set
up
giddily
behind
the
system.
Both
applications
nodes
set
up
instead
of
successfully
the
database
sets
up
correctly.
The
redis
sets
up
correctly
and
starts
using
what
is
caching
and
the
load.
Balancer
properly
stands
up
in
front
of
the
two
application
nodes.
A
So
we'll
actually
be
able
to
connect
to
this
architecture
the
correct
way,
which
is
through
the
load,
balancer
and
then
also
there's
some
interesting
features
of
load
balancer
that
I
want
to
share
as
soon
as
it
gets
towards
it's
running.
A
All right, so one of the other cool things that I found out with HAProxy, thanks to Gerard, is that you have this interesting interface at port 1936. If you expose it, it shows you what's going on for HAProxy on the load balancer. I found this invaluable because it tells us what all the listeners are.
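For reference, a minimal haproxy.cfg sketch of exposing that built-in stats page on port 1936 (credentials are placeholders; you would restrict access to this in a real deployment):

```
# Hypothetical haproxy.cfg fragment: the built-in statistics page.
listen stats
    bind *:1936
    mode http
    stats enable
    stats uri /
    stats refresh 10s
    stats auth admin:change-me   # placeholder credentials
```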
A
Oh, we also have the monitoring node down there, so when we get this up we'll be able to look at monitoring for the cluster as well. But we were able to see that, right now, it can actually get to the endpoints on the application nodes.
A
Now
it's
not
done
configuring,
but
we
can
see
that
I'm
sorry
the
ssh
on
the
endnote,
so
we
can
see
that
it
can
now
connect
on
ssh,
which
is
good
and
we're
waiting
on
the
actual
rails,
app,
which
is
going
to
be
more
configuration,
but
this
gives
us
a
great
overview
to
know
if
our
cluster
is
healthy
or
not.
So
I
just
wanted
to
share
that
because
I
thought
that'd
be
a
really
awesome
tool
again.
Thank
you
to
gerard
for
for
sharing
that,
because
I
hadn't
seen
that
so.
A
Yep, I am working on that as documentation MRs as we go through. Let me pull this up.
A
I go back to the reference architectures and the Geo pages and add to them. For example, one of the things that wasn't clear here is that the 2k architecture doesn't talk much about why you do NFS. It says NFS is optional, but what it doesn't tell you is that if you don't do NFS, you need to set up fast SSH key lookup, because NFS is how authorized keys are shared across nodes when you add your SSH key through the UI.
A
That's how all the different application nodes know how to authorize the connection, and so you'll get a random "doesn't work or does work," because the load balancer will either send you to the node where your key went or to an application node that doesn't have it. So you have to set up fast SSH key lookup if you don't set up NFS, and since we're deprecating NFS, fast SSH key lookup is becoming the default. So this is an MR that adds that information.
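For context, fast SSH key lookup is enabled on each application node's sshd, which asks GitLab for the key instead of reading a shared authorized_keys file. A sketch of the sshd_config side (paths follow the usual Omnibus layout; verify against the current docs):

```
# Hypothetical /etc/ssh/sshd_config fragment on each application node.
Match User git    # apply only to the GitLab shell user
  AuthorizedKeysCommand /opt/gitlab/embedded/service/gitlab-shell/bin/gitlab-shell-authorized-keys-check git %u %k
  AuthorizedKeysCommandUser git
```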
A
So that when you go through the reference architectures, you're not left scratching your head going, "why did this not work? Why does it work intermittently?" So this is one of them. The other one that's going on is on the Geo side of things.
A
So,
if
you
run
a
license
check
on
a
geo,
secondary
you'll,
say
geo
won't
work
because
there's
no
license
but
geo
doesn't
need
a
license
on
the
secondary,
only
needs
to
be
on
the
primary
site.
So
this
I
have
updated
the
rate
task
at
this
point
and
it's
still
going
to
review
but
updated
the
rate
test
to
actually
check
that
correctly.
C
Right, and for reference for other people: we've been turning on the fast SSH key lookup by default as part of the cloud native charts pretty much since inception, for this particular reason. And in reality there's pretty much no reason not to do that anyway. Early on we were concerned about, you know, performance differences and things like that, but there's just no point: the fast key lookup is so performant that it's not impactful, and at scale it's actually more performant than NFS itself.
A
Yeah, so there have been a couple of things learned. A lot of the benefit here for Orchestrator is that we're learning about these things.
A
You could get an error state afterward, because the service could be there but not listening. There's this gray zone between when the PgBouncer service starts and when it's actually properly listening, and the check was asking "are you started?", which did not necessarily imply "and you're listening." So because of that, the PgBouncer running check now does the check in such a way that it checks for both. We never let this run where it's not actually listening, because that's frustrating.
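A minimal sketch (not the Omnibus implementation) of the difference between the two checks, "is the process up?" versus "is it actually accepting connections?", with PgBouncer's conventional port assumed:

```ruby
#!/usr/bin/env ruby
# Sketch of a combined liveness-and-readiness check for PgBouncer.
require 'socket'

def process_running?(name)
  # Liveness: is a pgbouncer process present at all?
  system("pgrep -x #{name} > /dev/null")
end

def port_listening?(host, port, timeout: 2)
  # Readiness: can we actually open a TCP connection?
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue Errno::ECONNREFUSED, Errno::ETIMEDOUT, SocketError
  false
end

running   = process_running?('pgbouncer')
listening = port_listening?('127.0.0.1', 6432) # PgBouncer's default port

puts "running=#{running} listening=#{listening}"
exit(running && listening ? 0 : 1)
```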
C
Right, so for context for others watching: that's effectively implementing the readiness probes that you would find in Kubernetes or the various container orchestration platforms. We were doing a liveness probe by just asking "is it running?", but we weren't actually doing a readiness probe by making sure that it's listening and actually able to respond.
A
Yep
and
that
work
went
into
omnibus
so
folks
that
are
have
already
written
their
orchestration.
They
know
some.
Some
some
large
sites
have
already
done
some
automation
of
their
own.
A
This
will
help
them
with
their
automation
as
well,
so
and
a
lot
of
this
we're
trying
to
fold
things
back
so
that
this
is
for
folks
that
have
already
done
all
the
work
to
automate
we're
going
to
put
this
goodness
in
there,
so
they
can
take
so
those
folks
can
take
advantage
of
it,
because
if
you
already
have
a
solution
in
place,
orchestrator
adoption
you
might,
but
it's
a
long
project.
A
To
that
point,
while
that
still
installs,
the
other
thing
that
we
are
working
on
right
now
is
the
concept
of
putting
in
rate
tasks
to
accomplish
some
other
things.
So
this
is
a
small
proof
of
concept
that
I
worked
on.
It
got
done
last
night,
but
the
idea
is
that
when
you
migrate
currently
the
orchestrator
when
it
runs
on
a
multi
database
when
it
or
in
any
database,
you
can
run
it
once
and
it'll
configure,
but
then
the
second
time
it'll
break.
So
it's
not
idempotent
with
this.
A
This
is
a
rate
test
that
will
detect.
Have
I
actually
run
or
not?
Should
I
load
the
database?
Should
I
run
migrations
and
just
give
us
back
a
simple
change
or
false
still
going
through
the
the
process
of
talking
it
through
with
the
database
for
team
reviewers,
but
this
is
kind
of
where
we're
heading
and
this
rank
task
will
be
available
to
anyone
and
we're
going
to
take
what
we
do
with
licenses
that
dustin
did
for
licenses
and
we
have
a
similar
asking
charts.
So
we're
going
to
take
that
logic.
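A hypothetical sketch of the kind of Rake task described; all task and helper names are invented for illustration, and the real task is still in review:

```ruby
# Hypothetical Rake task sketch: report whether database setup still needs
# to happen, so an orchestrator can call it repeatedly and stay idempotent.
namespace :gitlab do
  namespace :db do
    desc 'Print true if the schema still needs to be loaded or migrated'
    task needs_setup: :environment do
      connection = ActiveRecord::Base.connection

      # If the schema_migrations table is missing, the schema was never loaded.
      needs_schema_load = !connection.table_exists?('schema_migrations')

      # Otherwise, check whether any migrations are still pending
      # (migration_context availability varies by Rails version).
      needs_migration =
        needs_schema_load ||
        connection.migration_context.needs_migration?

      puts needs_migration
    end
  end
end
```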
A
We
have
for
licenses
turn
it
into
a
rank
task
and
be
able
to
share
that.
That's
just
a
couple
of
things
that
are
kind
of
things
that
are
popping
out
in
orchestrator
and
they're,
going
to
keep
getting
folded
back
because
really
orchestrator
and
charts
have
a
lot
in
common.
As
far
as
some
of
those
the
problems
we're
solving,
so
we're
going
to
try
and
double
that
double
up
wherever
we
can
so
that
we
everybody
benefits
here.
A
Okay,
now
here's
another
thing
to
talk
about.
We
still
have
to
do
this
thing
where
we
copy
the
get
lab
secrets
json
file
around,
and
that
is
because
there
are
certain
secrets
to
get
made
in
that
file,
but
they
have
to
be
on
every
node.
It
would
be
better
to
do
that
declaratively
and
there's
a
just
a
couple
of
things
that
if
you
do
it
declaratively,
then
those
secrets
live
in
the
gitlab
rv.
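A minimal sketch of what declaring shared secrets in gitlab.rb can look like; the keys shown are commonly cited examples, the values are placeholders, and the exact set of secrets to share depends on your version and the docs:

```ruby
# Hypothetical gitlab.rb fragment: the same generated values must be set
# identically on every node, replacing the copied gitlab-secrets.json.
gitlab_rails['secret_key_base'] = 'REPLACE_WITH_GENERATED_VALUE'
gitlab_rails['db_key_base']     = 'REPLACE_WITH_GENERATED_VALUE'
gitlab_rails['otp_key_base']    = 'REPLACE_WITH_GENERATED_VALUE'
gitlab_shell['secret_token']    = 'REPLACE_WITH_GENERATED_VALUE'
```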
A
Now,
there's
still
plain
text
in
the
gitlab
json
file,
the
gitlab
secrets
json
file,
but
that
is
the
thing
we're
trying
to
identify
and
there
we're
done
so.
Everything
is
running.
So
let
us
come
back
to
our
aj
proxy
and
I'm
just
going
to
take
that
port
off.
C
If you go to Jobs in the navigation menu, just above Gitaly Servers, that actually takes you... I wonder if they moved it; maybe they did. At one point (and it's probably still floating around somewhere) you actually had the ability to get to the Sidekiq administration status page. I think that's under Monitoring now. Yeah, so go to Monitoring and then Background Jobs.
C
There
you
go
just
this
is
the
actual
like
we're
doing
an
authorized
pass
through
directly
to
sidekick
through
the
rails
console,
I
should
say
directly
through
rails
versus
through
workhorse,
and
we're
actually
checking
the
status
of
sidekick
and
its
integration
with
redis.
So
we
can
look
at
it
how
busy
they
are,
how
many
clients
there
are,
how
many
things
have
failed
or
things
like
that.
It's
one
of
those
points
that
a
lot
of
people
don't
realize
is
actually
a
way
to
see
how
that
is
behaving.
A
Are there any other questions, comments, or concerns?