From YouTube: GitLab 13.0: Gitaly and Praefect
Description
Simon Mansfield (Solutions Architect) and Christiaan Conover (Manager, Technical Account Management, East Enterprise) covering Gitaly and Praefect.
A
Cool, so this is a pretty highly anticipated capability we've been talking about for, you know, probably over a year for many of us with customers, because when we talk about setting up GitLab in an HA configuration, the one caveat we've always had to put on it is "except Gitaly, which isn't actually HA-capable yet, and you have to use NFS." It becomes a little bit of an annoying asterisk. But with GitLab 13, we have finally released our first iteration of what we're calling Gitaly Cluster, which is our HA solution for that.
A
Essentially, what this is is a high-availability solution for Gitaly. It allows you to build out scalable and redundant repository storage, and it negates the need for NFS. With this, we have now gotten to the point where you can deploy GitLab without any NFS behind it at all.
A
Right, so fair point: yes, there is that asterisk, but yes, for an HA environment. Now, technically you don't need NFS really at all, depending on how you set it up, but especially not for your Git repositories, which is the key thing here. So this basically addresses the last remaining single-point-of-failure element of GitLab. This is a big thing for customers who are running in cloud providers, on-prem, what have you.
A
So here's the architecture from our documentation that describes what it looks like when you set up a high-availability environment. You'll note that the bottom half of this outlines what it looks like, and there are a few key elements here that we're going to talk through in a moment, but you'll see that essentially the ingress point for the Gitaly Cluster is a load balancer in this architecture.
A
The reason for that is that you can actually have multiple Praefect nodes in a cluster, so you can even remove the single point of failure of the Praefect component in your cluster. Each Praefect node can communicate, in any combination, with the PostgreSQL database that is set up for Praefect, act as a load balancer itself, and handle any request.
A
So it's not a requirement to have a load-balanced Praefect environment. You can have just a direct connection from your GitLab instance to a single Praefect node, up to you, but it is architected to allow you to do it in a way where even Praefect is redundant. And then you'll notice here that the actual Gitaly nodes are set up in cluster groups, and you can also shard across multiple Gitaly instances behind Praefect if you needed to do so, for example if your storage volumes can't support a horizontally scalable Gitaly environment for IOPS purposes, and you need to have multiple different physical storage volumes behind it. There are use cases there; it gets complicated, but it can be done. The idea here is that this provides a variety of permutations for setting up an HA environment for your Gitaly storage.
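For concreteness, here is a minimal sketch of what wiring one virtual storage to several Gitaly nodes might look like in a Praefect node's /etc/gitlab/gitlab.rb, loosely following the 13.0-era documentation. The hostnames, ports, and tokens are placeholders, so verify against the current docs before copying anything:

```ruby
# /etc/gitlab/gitlab.rb on a Praefect node (illustrative sketch only;
# hostnames, ports, and tokens are placeholder values).
praefect['enable'] = true
praefect['listen_addr'] = '0.0.0.0:2305'
praefect['auth_token'] = 'PRAEFECT_EXTERNAL_TOKEN'

# One virtual storage ('default') fronting three Gitaly nodes.
# Praefect replicates writes across them and fails over as needed.
praefect['virtual_storages'] = {
  'default' => {
    'gitaly-1' => {
      'address' => 'tcp://gitaly-1.internal:8075',
      'token'   => 'PRAEFECT_INTERNAL_TOKEN',
      'primary' => true
    },
    'gitaly-2' => {
      'address' => 'tcp://gitaly-2.internal:8075',
      'token'   => 'PRAEFECT_INTERNAL_TOKEN'
    },
    'gitaly-3' => {
      'address' => 'tcp://gitaly-3.internal:8075',
      'token'   => 'PRAEFECT_INTERNAL_TOKEN'
    }
  }
}
```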
A
So, Praefect specifically. We've mentioned this a few times, and Simon, I know I keep talking here, so I'll let you carry this one if you want to discuss it.
B
Yeah, I mean, most of it's already been said. Praefect is this thing that sits in front of Gitaly. It's sort of transparent to GitLab; GitLab doesn't know that it's talking to Praefect. It's the thing that's responsible for syncing all the Gitaly nodes, and it's the thing that's responsible for really implementing the replication. As Christiaan said, you can have multiple Praefect nodes so that the Praefect node itself isn't a single point of failure. And it also does some load balancing as well; the Praefect nodes have this element of balancing traffic between the various Gitaly nodes. So yeah, that's really what Praefect is. You will hear it mentioned a lot if you're talking about Gitaly Cluster.
A
You're probably going to hear this referred to quite a bit. So Simon and I on Monday actually went through the process of setting up a Gitaly Cluster, and we started from an existing environment that I created, with a single GitLab app node and a Gitaly node separately configured. We took this approach figuring that it matches what the majority of our customers currently looking to do this would be starting from.
A
So I wanted to replicate roughly what you'd likely encounter when you're dealing with an existing environment with Gitaly, from the perspective of a prospective HA configuration. And so we went through the process of building out the Gitaly Cluster, integrating it into GitLab, and then figuring out how you would migrate data from your existing Gitaly storage to the new cluster. We won't take you through the multi-hour process in depth. If you really are interested in doing that, you can watch the video that we put on the GitLab Unfiltered YouTube channel, which I pared down in a few spots where we fumbled through GCP for our own lack of knowledge there and eventually figured out the solutions. But you can see the progression from start to finish of how we got that done and the pitfalls we encountered. Here are some of the takeaways from going through that process. Simon, if you want to go through some of these, I'll let you.
B
Yes, something to be aware of at the moment is that the Praefect leader election currently favors availability over consistency, so at the moment there is a possibility of data loss. It's very unlikely; it's when a node has failed and the leader election takes place. There is an issue in flight at the moment to change that, to default to using the PostgreSQL database that Praefect uses to do leader election, and that's going to be made the default in 13.1. It's actually there now; it's just not the default, I think because it was still untested when it was launched. That will give you consistency over availability and kind of solve those issues.
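If you want to opt in before 13.1 flips the default, it should be a one-line change on each Praefect node. Treat the exact key name here as an assumption based on the 13.0-era Omnibus settings and verify it against the docs:

```ruby
# /etc/gitlab/gitlab.rb on each Praefect node (assumed 13.0-era setting;
# confirm the key name in the documentation before relying on it).
# 'sql' selects the PostgreSQL-backed elector described above instead
# of the in-memory one, trading availability for consistency.
praefect['failover_election_strategy'] = 'sql'
```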
B
The next point I added, so I'll talk about it: at the moment, when you install Omnibus, you kind of get everything all in one, and you get the PostgreSQL database there for all the data as well. But Omnibus does not include a PostgreSQL database for Praefect, and Praefect needs its own database. It can actually go inside your GitLab database, and this comes up again in a later point, but it's not supported in that configuration when you're using Geo. So for that reason, myself and Christiaan would both recommend, I think, that you separate it out straight away, because if you ever want to go to Geo, you then have to untangle your database. So there's no Praefect database going to be installed in Omnibus by default at the moment. Do you want to take the next one, Christiaan?
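As a reference for that separation, a rough sketch of pointing Praefect at its own PostgreSQL instance in gitlab.rb might look like the following. The host, credentials, and database name are placeholders, and the exact keys should be checked against the documentation:

```ruby
# /etc/gitlab/gitlab.rb on a Praefect node: use a dedicated PostgreSQL
# database rather than sharing the GitLab application database.
# Host, credentials, and database name are placeholder values.
praefect['database_host']     = 'praefect-postgres.internal'
praefect['database_port']     = 5432
praefect['database_user']     = 'praefect'
praefect['database_password'] = 'PRAEFECT_SQL_PASSWORD'
praefect['database_dbname']   = 'praefect_production'
```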
A
Sure. So, as I mentioned earlier, one of the things that we went through in our setup process was the effort to migrate data from the existing Gitaly instance to the Gitaly Cluster. When you create the cluster, you then obviously have to go into the gitlab.rb file and add it as a data directory that GitLab can use to store information, but all storage volumes are created equal in GitLab's eyes.
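That gitlab.rb step might look roughly like this on the GitLab application node, following the git_data_dirs pattern from the docs. The storage name, hosts, and token are placeholders:

```ruby
# /etc/gitlab/gitlab.rb on the GitLab application node (sketch).
# Registers the cluster as an additional repository storage; the
# address points at Praefect (or its load balancer), never at a
# Gitaly node directly. Names, hosts, and the token are placeholders.
git_data_dirs({
  'default' => {
    # The pre-existing standalone Gitaly node.
    'gitaly_address' => 'tcp://gitaly-old.internal:8075'
  },
  'cluster' => {
    # The new Gitaly Cluster, reached through Praefect.
    'gitaly_address' => 'tcp://praefect-lb.internal:2305',
    'gitaly_token'   => 'PRAEFECT_EXTERNAL_TOKEN'
  }
})
```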
A
So
you
have
to
tell
git
lab
where
you
want
projects
to
be
stored
and
to
migrate
from
an
existing
one
to
a
to
a
new
one.
Requires
api
calls
right
now.
I
don't
know
of
any
utility
that
has
been
created
to
help
support
the
at
the
migration
of
data
in
in
a
batch
process
from
one
location
to
another.
A
I've actually started poking at it myself, building a proof of concept that might be able to do that as a very simple CLI, just as a personal project, to maybe facilitate that more easily. We'll see what comes out of that and if it's usable. But it is something that, right now, you do have to script via API calls to migrate from one to the other.
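To make the "script it via API calls" point concrete, here is a hypothetical sketch of such a batch move using the Projects API, where an admin updates a project's repository_storage attribute. The URL, token, storage names, and the omission of pagination and error handling are all assumptions for illustration; test anything like this in staging first:

```ruby
#!/usr/bin/env ruby
# Hypothetical batch-migration sketch: move every project on one
# repository storage to another by scripting the GitLab API, which is
# roughly what has to be done by hand today.
require 'net/http'
require 'json'
require 'uri'

GITLAB_URL = 'https://gitlab.example.com'    # placeholder instance URL
TOKEN      = ENV.fetch('GITLAB_ADMIN_TOKEN') # admin personal access token
FROM       = 'default' # existing standalone Gitaly storage
TO         = 'cluster' # virtual storage served by Praefect

def api(method, path, body = nil)
  uri = URI("#{GITLAB_URL}/api/v4#{path}")
  req = method == :put ? Net::HTTP::Put.new(uri) : Net::HTTP::Get.new(uri)
  req['PRIVATE-TOKEN'] = TOKEN
  if body
    req['Content-Type'] = 'application/json'
    req.body = body.to_json
  end
  res = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(req)
  end
  JSON.parse(res.body)
end

# List projects (repository_storage is visible to admins) and move any
# that still live on the old storage. Pagination is omitted for brevity.
api(:get, '/projects?per_page=100').each do |project|
  next unless project['repository_storage'] == FROM
  puts "Moving ##{project['id']} #{project['path_with_namespace']} -> #{TO}"
  api(:put, "/projects/#{project['id']}", 'repository_storage' => TO)
end
```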
B
Yeah, the next one I've really already said, which is: separate your databases out. And then the final point is that, at the moment, there's very limited scope for admins to actually be able to monitor the cluster. There's an epic around this; it's something that the Gitaly team is working on, and it's something that they're really pushing for.
A
One final point I want to make on that first bullet here, if you're not familiar with clustered environments and data-consistency architectures and that sort of thing. The main reason there's a potential for data loss with the way it's currently set up is that if your primary Gitaly node fails, what Praefect does right now is just pick another one. It doesn't worry about how up to date it is. It just picks one, uses that as the leader, and goes forward from there. The database that gets created for Praefect is also used to track the changes that occurred over time and which nodes have which data sets, but Praefect is not currently utilizing any of that information to actually pick the leader. So there is a possibility, if you get out of sync, that it will say "we don't know which nodes have what," and it'll put you in a read-only state for any repositories that don't have up-to-date synchronization. So that's the reasoning behind that point. They're working on that in 13.1; hopefully we'll see a resolution there, and it won't even be a concern by the time our customers are deploying this in production.
B
Okay, yeah. So we went into this process without having read up on anything, and without having spoken to the product team about it specifically, so we did it as if a customer were running through the process. We made notes throughout, and we followed the documentation, and actually the documentation is really solid. There was pretty much only one point where we truly got blocked, and that was really due to the cloud provider's documentation rather than our own. Everything we've got feedback on, we're going to feed back to the Gitaly team and try to get added to the docs. One point of note is that the reference architectures, so the 5k, 10k, 25k, and 50k architectures, do not yet include Gitaly Cluster. That is something that people are aware of, and they're adding that support.
A
I made a note here at the bottom about the GitLab orchestrator project. This is something that I was actually made aware of by Jason Plum when he graciously joined our session on Monday to try to help us work through some GCP issues. I will update the deck here with a link to that project if you're curious. Jason gave us all sorts of disclaimers about it: this is not production-ready, this is not productized yet, your mileage may vary. Customers can use it, but it's up to them how they do it, all that kind of stuff. But it could be a very useful resource for any customers you're working with who are interested in having some sort of infrastructure-as-code configuration and are looking to us to provide guidance or best practices on how to do so, because it looks pretty well rounded right now as to all the components that it covers.
A
So it's something worth looking at as an aside, if you're interested. The main takeaways from going through this exercise that Simon and I did, and from reading through the docs: Gitaly Cluster is really well architected. It's clearly thought out to be scalable and redundant and to address all the concerns somebody would have in building an HA solution, especially when you're dealing with the kinds of demands that Git transactions place on storage. So I think it's going to be a great solution. Obviously, there are a few limitations right now that I would say make it a little premature to implement in a production environment. I've been giving my customers guidance, before this was even released, that they should probably hold off until 13.1 or 13.2 before they look at doing this in their actual production environments.
A
Just anticipating that, since this was the first GA release, there were naturally going to be bugs surfaced by it being out in the wild, and that aligns with what we've been seeing from some of the known caveats that they're expecting to resolve in the next couple of releases. So I've encouraged customers to set this up in staging environments, to test it out and understand the process. But I would say it's probably not production-ready for most of them until later this summer. And, as you know, Simon and I have also agreed that we're going to help the Gitaly team develop the docs to be even more useful than they currently are from a customer-facing perspective, so that, ideally, customers can walk through this step by step with little to no assistance from us.
A
We may be using it in production on GitLab.com, but let's not forget: we also run beta releases of the product itself on GitLab.com. So I think we're more risk-tolerant, with our dedicated infrastructure team, to do so, and we probably have it built out in such a way that we're limiting the risk of data loss because of the scale that we're running it at. That would be my assumption, but yes.
B
And one of the biggest things is the leader election, which, as I said, is not switched on by default but is available. So I imagine that our infrastructure team has probably switched that on. That would be the main thing.
A
But from the perspective of, you know, customers who have more limited resources for this type of thing than we do, I've generally recommended to them: just give it a couple of months so this thing stabilizes and gets to fully production-ready.
A
Well, all right, so here's why. In a traditional NFS model, you probably have some combination of a compute instance and a storage volume to support your NFS, right? And that compute instance may or may not even be present: if you're not needing to manage and run a compute instance yourself, it might be provided by your cloud provider, or you may have a SAN or something in your infrastructure. But in general, you're probably going to have one Gitaly node that is connected to that storage volume for NFS, and even if you have multiple Gitaly nodes talking to it, it's still just those nodes.
In a Gitaly Cluster environment, you're necessarily adding, in addition to your N Gitaly nodes, at least one Praefect node, at least one PostgreSQL database, which is probably on its own node as well, and possibly a load balancer in front of all of that, between your GitLab environment and your Gitaly environment. So your compute resources alone are likely to be higher.
You may see some savings if you're not having to use a managed NFS solution, but it's not necessarily going to be enough to offset the additional cost of setting up this infrastructure. That being said, our reference architectures don't even start talking about HA until you're at around three or five thousand users, so it may be a negligible difference from the perspective of the service provider.
C
But one would also expect that one would gain some amount of additional performance out of this, because NFS is, you know, pretty much a pig with regard to locking and everything else, and has its own overhead.
B
Absolutely. So one benefit of this is that the actual implementation of your Gitaly nodes becomes much more important, because if Gitaly is talking to NFS, it's more about the NFS storage and how performant that is than about the Gitaly node itself. Now you've got local SSDs attached to your Gitaly nodes, right, and that is where the data is coming from. So potentially there are performance benefits there.
A
I would argue that maybe holistically, when you factor in not just the infrastructure costs but also the increased productivity from faster performance, as well as, hopefully, the lower amount of infrastructure management that would have to take place for fine-tuning your Gitaly instances to talk to an NFS solution, you might see an overall total cost reduction from that perspective. But purely in terms of the bill that you pay your cloud provider, it's probably going to be a little higher, yeah.
B
Performance-wise, I think it's too early to say at the moment. We were really looking at this from an implementation perspective, not a performance one, but I can get in touch with the Gitaly team, and we can ask them if they've got any data on that.
A
All right, do that, yeah. I'll just say I know we're done here with ours, and we're probably over time, so we'll hand it over to Chloe next. I think she's next, right?
B
Yeah, effectively, I think what happens with LFS is that there's a link to the LFS object stored in Gitaly, in terms of, like, "this file exists in your Git repo, but it's not actually here." Yeah.
C
Right, I mean, that's how LFS works, because we want to keep that stuff out of the repo. But the question is what we do with that, and Pages, and other data that sort of exists outside of the normal Git infrastructure, stuff that lives on the file system on its own. We still have the same problem with that that we did before, I guess.
A
Well, for the majority of that other type of data, we already have architectural support for using things like object storage, to make it so that you don't rely on single-point-of-failure storage solutions and you are on a scalable option. Obviously that doesn't work for people who are entirely on-prem and don't have an object storage layer that they can use, but our application architecture does support the use of things like S3 for all those other components.
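As one example of that architectural support, moving LFS objects to object storage is a gitlab.rb change along these lines. This is a sketch based on the S3-style settings in the docs, with the bucket, region, and credentials as placeholders:

```ruby
# /etc/gitlab/gitlab.rb: store LFS objects in S3-style object storage
# instead of on the local file system or NFS. Bucket, region, and
# credentials are placeholder values; other components such as CI
# artifacts and uploads have analogous settings.
gitlab_rails['lfs_object_store_enabled'] = true
gitlab_rails['lfs_object_store_remote_directory'] = 'gitlab-lfs-objects'
gitlab_rails['lfs_object_store_connection'] = {
  'provider'              => 'AWS',
  'region'                => 'us-east-1',
  'aws_access_key_id'     => 'AWS_ACCESS_KEY_ID',
  'aws_secret_access_key' => 'AWS_SECRET_ACCESS_KEY'
}
```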