Description
John Jarvis presents the steps taken while building the pre-production environment
Slides: https://docs.google.com/presentation/d/15nWPLNRYvSjIdLr4NKI1OLHJedYAKDhbeV7tpnSWJ6w/edit
A
So this presentation is about creating the pre-prod environment. It also covers a little bit of material about what an environment is versus what an internal deployment is, and it goes into a little bit of why we need this environment to begin with. I'm starting off with this summary of what pre-production is, just to kind of give you a general overview. We need to build this environment because this year we're planning on trying to improve deployments in general, and these are the deployments that go to GitLab.com.
A
We are planning to make them more frequent and more reliable, and in order to do this we discovered that we really need a new environment other than staging, just because there's so much contention for staging when we're validating fixes. So initially this new environment, this new deployment of GitLab, is going to be used for testing fixes that are made after we deploy to production: perhaps security fixes, maybe fixes for regressions. But it kind of remains to be seen exactly how we're going to use it.
A
But in a way this is just going to be something that runs alongside staging that we can use for validation before we go to production with code. There are a few differences from both staging and production. Initially this is just going to have a fresh database; it doesn't have any customer data on it. It's completely separate from production and doesn't share any infrastructure. This means it's in its own GCP project, and we're managing access with Google OAuth. Also, initially, for this first iteration,
A
it doesn't have any HA, but we are building it with HA in mind, to be added later. To start off, I'm going to go through the different environments, starting with just internal deployments, and I wanted to begin this slide by talking about single installations, because these are quite easy. One of the things I really like about GitLab is that we make it extremely easy to install on a single server, and that's thanks to the Omnibus package.
A
Despite the high number of moving parts and the complexity, really all you need to do is an apt install or an RPM install, and you have a running instance. This, of course, has some shortcomings. One is that it's a single instance, so it doesn't have any high availability.
A
Of course there are other things, like the logs are on the instance itself, so you need SSH access. There is monitoring, but it's just for the web application; it doesn't monitor everything that you might need to monitor. All of the assets are stored locally on disk, so if you want to use object storage, there's some additional configuration that you'll need for that. So this is where we introduce environments. The first tier of environments introduces these sorts of additions, and these are services provided by the infrastructure team.
A
You have, you know, Chef-managed servers, you have centralized logging and monitoring, runners that are connected and managed by the infrastructure team, object storage, and central authentication. I put these two as the first examples because they're the simplest: dev and ops, which I'm sure everyone, or at least most people, have used.
A
In the next tier I'm putting this new environment, pre-prod, pre.gitlab.com, and it introduces a bit more: it has a bastion, it has some HA, and you can also do ChatOps deployments. And then the last tier of environments is production and staging. Of course, production and staging are extremely similar in topology, although they do have some differences, like the number of instances is much smaller for staging. In some cases the instance sizes are different, but we typically keep them the same. They have separate databases.
A
Staging, of course, has production data in it, so that we can test against it before we go to production. And production has this DR environment attached to it, which is a Geo deployment that is currently being built out, and both of them have Canary. I'm not going to go into more detail here on what the Canary deployment is, or what the Canary stage or the Geo deployment are, but this is sort of the layout of environments that we have internal to GitLab.
A
So let's just summarize what an environment is. These are all services that the infrastructure team provides: you have Chef and Terraform, centralized logging, monitoring and alerting, authentication, access through a bastion. We have to set up network peering to the operations infrastructure so that we can do deployments. There's Rails and DB console access for developers, which we control. (Is this someone asking a question? No, okay.) And also, we can configure these environments with a topology very similar to GitLab.com.
A
So I'm going to talk here about the minimal viable product for pre-prod, and this was purposely done so that we can create this thing quickly and deploy to it as soon as possible. So it's very simple, or as simple as it gets when it comes to environments: we have a GitLab server, we have a runner, we have a bastion, and we have monitoring infrastructure, which includes Prometheus, the Alertmanager, and some exporters.
A
What this doesn't include are some of the things that we may or may not add later, and we probably will, but those are in the gray box: load balancing, having a dedicated HA Patroni database and Redis clusters, and separate Gitaly servers. I guess the thing to keep in mind here is that we've built out kind of the basic environment, and we can add to this later as we need it.
A
That's pretty much it. So now I'm going to talk about what we need to do to create this very simple environment, broken down into a few bullets. We're going to start off with Terraform: it's about 600 lines of Terraform config. That sounds like a lot, but it actually isn't that much, because we use shared modules, and this is just a copy and paste from other environments.
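As a rough illustration of what "shared modules plus copy and paste" means in practice, here is a minimal sketch of instantiating a shared fleet module for the new environment. The module path, variable names, and values are hypothetical, not the actual GitLab config:

```hcl
# Hypothetical sketch: a new environment reuses a shared module and only
# overrides the handful of values that are unique to it.
module "web" {
  source = "../modules/gitlab-fleet"   # shared module, usage copied from staging

  environment    = "preprod"
  project        = "gitlab-preprod"    # the environment's own GCP project
  machine_type   = "n1-standard-4"     # instance size, unique per environment
  instance_count = 1                   # no HA in the first iteration
  subnetwork     = google_compute_subnetwork.web.self_link
  chef_role      = "preprod-base-web"  # OS-level config is handled by Chef
}
```

Most of the 600 lines would be blocks like this, one per fleet, which is why copying from another environment works.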
A
There are some things that are unique to the environment, obviously, like instance sizes, instance counts, subnet allocation, and which fleets we're deploying to, but more or less it's just a subset of staging or a subset of production. Another thing is that we have to do some Chef configuration, and this usually involves creating new Chef roles for the OS-level configuration that is specific to that environment. This includes things like endpoints.
A
It includes any kind of labels and things that we need for Prometheus, and I broke these down into the monitoring infrastructure, the bastion, and the GitLab configuration, which fills out the gitlab.rb file. And then I just put a note here that we don't use Chef for the runner, because it's deployed in GKE, which makes it a bit easier.
A
The other things we need to do: we need to create a new ChatOps command for deploying to it. This could be a little bit more generic, but right now we have a separate option for each environment, so we needed to do that. We needed to create a deployment pipeline, and this involves just editing the .gitlab-ci.yml for the deployer. I'm not going to go into too much detail there.
A
But basically you make the changes to the .gitlab-ci.yml, and that allows us to have these stages for deploying to the environment. And then you have some other things like documentation and monitoring. Of course, we need runbooks, we need a dashboard, which is a manual thing for right now, and notes on where to find logs and things like that.
A
So what I did here is I broke this down into days, to kind of give you an idea of how long it took and what needed to be done day by day. It took six days to do, and my manager asked me what year it is. I said: oh, it's 2019, because he thought it was like 2015 or 2005 or something. But it doesn't matter which year it is, because it's just ridiculous that in 2019 it takes so long to build this environment.
A
I'm going to try to explain here why it took this long, and then at the very end I'm going to go through the improvements I think we could make in the short to medium term to make this a little bit easier for the next person. Day one: of course, you create the project in GCP. You have to enable all of your APIs and quotas. This is all very manual, and it's not easy to automate. And then you create your initial MRs for the Terraform config.
A
This just creates all of the base infrastructure. Day two: I created all the Chef roles, and then the first big milestone is provisioning one server in Terraform and ensuring that it can boot and register properly with Chef. There was an issue here that we had to work through, because there was a problem with td-agent that took a little bit of time to debug.
A
Day three is where you have most of the infrastructure up: ensure that the application is configured, make sure you can log in, and then finally you have a full run of Terraform without any errors. Everything is up and running, and now you're just down to the last bits of configuration. On day four I decided just to destroy everything and reprovision it, because I wanted to make sure that it could come up from scratch.
A
And then there are some manual things, like I had to create OAuth credentials for logging into pre.gitlab.com, basically the OAuth credentials that allow anyone with a given ID, like an email address, to log in, and for the Prometheus servers as well, since we have OAuth in front of those. Day five: ensure that the runners are configured and connected to the GitLab instance. This is pretty easy because it's all in GKE. I committed a runbook that shows you how to do it.
A
This basically follows our instructions, but makes it even a little bit easier for our specific use case. I do this a lot, and it's very simple to just create a new runner cluster on GKE. We had to configure the peering between the ops environment and pre-prod, which we talked about earlier. Of course, I realized after this that there were overlapping subnets, and I had to resolve that.
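The peering itself is only a couple of Terraform resources; a hedged sketch (the network names are invented), with the caveat just mentioned that it only works when the private ranges on the two sides don't overlap:

```hcl
# Illustrative VPC peering between the ops network and pre-prod. Peering is
# symmetric, so it is declared once from each side, and it will not come up
# if the peered networks have overlapping subnets.
resource "google_compute_network_peering" "ops_to_preprod" {
  name         = "ops-to-preprod"
  network      = google_compute_network.ops.self_link
  peer_network = google_compute_network.preprod.self_link
}

resource "google_compute_network_peering" "preprod_to_ops" {
  name         = "preprod-to-ops"
  network      = google_compute_network.preprod.self_link
  peer_network = google_compute_network.ops.self_link
}
```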
A
We also have to generate SSL certificates, because we're connecting to pre.gitlab.com and, of course, we don't have a wildcard certificate, so we have to generate a certificate, update it in the secrets, and reconfigure. The registry also needs its own SSL cert, and then you just need to go through everything and confirm that all the configuration is correct. So, you know, we have a lot of files that are stored in object storage, and they each have different buckets.
A
You have to make sure that they're all working properly. Day six: this is granting access so that anyone can log in, creating the deployer pipeline, creating the ChatOps command, creating a new Slack channel for that ChatOps command, creating a dashboard for the pre-prod environment just to have some overview, and then testing the deployment, which involves using GitLab ChatOps, so you can see the command there.
A
We run the deployment, it creates this nice little CI pipeline, and then, if everything works, everything is green and you can move on. And then on the seventh day, I guess, you can rest, or you can think about how you can make this whole process better, or maybe question your life choices, because it is a lot of work to get it working.
A
I did include a detailed log of everything I did, including all of the issues I ran up against, for the next person. It's way more detailed than most people are probably interested in, but for the SREs, if you want to see it, feel free to access it at that link. So I came up with a list of things I would like to automate to make this completely self-serve, and it's quite a list.
A
I mean, there are a lot of things here, and I think we can maybe rethink how we do these environments in general to make this list shorter. But there are a few things that I think I would address right away, and I put them here as the top three. In GCP, having a GCP project that is dedicated to an environment is not always the right choice.
A
I mean, GCP comes with a lot of overhead for creating projects: it requires you to enable APIs, it requires you to adjust quotas, and projects can't be disposable. So we would probably just want one project for all of our environments if we were going to create lots of them.
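Part of that per-project toil could at least be captured in code; a sketch of pre-enabling a project's APIs with Terraform (the service list and project ID are illustrative), although quota increases would still have to be requested by hand:

```hcl
# Illustrative: enable the APIs a new environment project needs, so the
# click-through step is codified. Quota adjustments remain a manual request.
variable "enabled_services" {
  default = [
    "compute.googleapis.com",
    "container.googleapis.com",
    "storage-api.googleapis.com",
  ]
}

resource "google_project_service" "enabled" {
  for_each = toset(var.enabled_services)
  project  = "gitlab-preprod"   # hypothetical project ID
  service  = each.value
}
```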
A
The second bullet: using Chef to maintain the hosts inventory just doesn't scale, and I think most of the SREs have realized this; it's something that we've been talking about a lot. If we switched to using a central Consul cluster for managing our hosts inventory, with some reasonable, or pretty strict, conventions on how we name things, then I think a lot of the configuration where we either have to use raw IP addresses (because we don't have internal name resolution across projects) or where we have to use hostnames
A
is just going to be easier, because we can do DNS lookups against Consul. Having a more consistent bootstrap definitely would help, too. We do a lot when we bring up a new server, which includes starting from the base OS, so there's a lot of opportunity for things to go wrong during provisioning. Using images would just be a massive improvement, and this is something that is currently being looked at in that epic.
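A sketch of what image-based provisioning could look like (the image name, and a pre-baked build pipeline producing it, are assumptions, not an existing setup):

```hcl
# Illustrative: boot from a pre-baked image instead of configuring a stock
# OS from scratch, so far fewer steps can fail during provisioning.
resource "google_compute_instance" "web" {
  name         = "web-01-preprod"
  machine_type = "n1-standard-4"
  zone         = "us-east1-c"

  boot_disk {
    initialize_params {
      # Hypothetical image produced ahead of time by an image build pipeline.
      image = "gitlab-base-ubuntu-1804-v20190601"
    }
  }

  network_interface {
    subnetwork = google_compute_subnetwork.web.self_link
  }
}
```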
A
I also came up with some other improvements. These are lower-priority things, but: we do manual private subnet allocation, because we have to ensure that subnets don't overlap for peering, and this is something that we could probably manage a bit better. For logging, everything that is not staging or production shares the staging indices in Elasticsearch, so maybe we could come up with something better there. And the ChatOps deploy could be a bit more automated and generic.
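One way the subnet allocation could be made less manual, sketched here with made-up ranges, is to carve every environment's range deterministically out of a single supernet, so two peered environments can never collide:

```hcl
# Illustrative: derive one /16 per environment from a shared 10.0.0.0/8,
# so ranges are allocated by index and can never be hand-picked to overlap.
locals {
  environments = ["gprd", "gstg", "ops", "pre"]

  subnet_by_env = {
    for i, env in local.environments :
    env => cidrsubnet("10.0.0.0/8", 8, i)   # 10.0.0.0/16, 10.1.0.0/16, ...
  }
}
```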
A
And then for the database, I would say, maybe if we had a lot of environments and we were bringing up a lot of them, having a shared database cluster and Redis clusters with multiple databases would help, because for the HA configuration, I really just don't see us automating setting up a Patroni cluster and tearing it down quickly. It's just something that's a little too high-touch. So I would say having HA infrastructure that we can share across multiple environments would make that a lot easier.
B
Just... I was...
A
So currently for pre-prod the registry is running on the instance locally, just like it does on dev and ops, which means your registry images will go to object storage, but the registry service itself is on the same instance that the Rails app is running on. To connect to the registry for pre-prod, you just use registry.pre.gitlab.com. Pages is actually not enabled for pre-prod, but it's something that we can add; we don't have a separate fleet for the Pages service.
C
John, it's Mal from compliance. I had a question, maybe about slide 4. It looked like there would be some potential connections between these pre-prod instances and the ops network or the logging infrastructure. Are those, quote, production instances, those environments? (Yes.) Perfect, that's the slide. Yes.
A
Yeah, so we allow connections for that. It's fairly selective; we have specific subnets. The way it works is that you first create a peering, and then you use firewall rules to restrict which subnets can access which boxes on either side. So we allow incoming connections from the ops infrastructure for monitoring, because dashboards.gitlab.net needs to use the Prometheus server there as a data source.
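To make the shape of those controls concrete, here is a hedged sketch of that kind of ingress rule (the names, tags, and ranges are invented): the peering opens the route, and rules like this pin down exactly who can reach what:

```hcl
# Illustrative ingress rule: only the ops-side monitoring subnet may scrape
# the pre-prod Prometheus servers; nothing else is admitted by this rule.
resource "google_compute_firewall" "ops_dashboards_to_prometheus" {
  name    = "allow-ops-dashboards-to-prometheus"
  network = google_compute_network.preprod.self_link

  direction     = "INGRESS"
  source_ranges = ["10.250.0.0/24"]   # hypothetical ops monitoring range
  target_tags   = ["prometheus"]      # tag carried by the monitoring fleet

  allow {
    protocol = "tcp"
    ports    = ["9090"]               # Prometheus HTTP port
  }
}
```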
A
So on the Prometheus subnet we allow connections from dashboards.gitlab.net. For the release runner, and for operations like maintenance, we have SSH from the release runner, so we allow the release runner to have pretty much all access to all of the internal IPs on all the subnets, but it's limited to the release runner. I'm not sure if I'm answering your question, though.
C
So my initial concern was that generally, when you have these types of pre-production environments, seeing those connections with production infrastructure, there are potentially some issues. If you're saying that it's based on firewall rules, then there are certainly some kinds of controls that we could potentially validate (and Jeff, let me know if I'm capturing this correctly) to make sure that there is no bleed or leakage. Yeah, I think the directionality, the direction of those connections, yeah.
C
Yeah, okay, perfect. I just wanted to make sure that I was seeing that correctly. That's just kind of a to-do for you and your group and the compliance team, to actually fully vet that out so that we have a full understanding, so that we can either suggest additional controls or say: nope, this is the way it is, and then speak to it in terms of an audit. So thanks.

A
Cool, I appreciate it. Any other questions?