Description
This is a brief overview of the architecture, design, and project setup for the External License DB.
Chapters:
0:00 - Architecture Overview
04:10 - Namespace Layout
04:30 - DB Schema
06:17 - Feeder
09:30 - Interfacer
11:00 - Processor
13:20 - Exporter
14:53 - Deployment
16:43 - Terraform Environments
18:58 - Deployment Jobs
20:00 - Documentation & Wrap Up
Hello everyone, I'm going to be walking through the architecture and some of the information necessary to understand how the external license database works and how it's designed and set up.
What we're looking at here is the architecture diagram of all the various components. Down below you will see GitLab, above you'll see Cloud Platform, and to the left are either instances or the package registries. These are the three major components that we are communicating with.
Everything pretty much starts within GitLab, where we have a number of repositories. These repositories are either containers that get pushed into a container registry (the GitLab one as well as the Google one), or they're binary releases that are run from our CI/CD jobs inside of the deployment repository.
So we have schema, which is the schema of the database that lives over here in Cloud SQL. The schema includes migration scripts, which I'll walk through. The processor is for communicating directly with the database; it's one of the only components that actually talks to the database, the other one being the exporter.
The interfacer is what listens for the package names coming from the feeders (which I'll explain in a bit). Those messages tell it how to communicate with the package registry: how to get the version information, pull out the licenses for those versions, and then push that over to the processor.
The exporter does what its name suggests: it communicates with the database and then exports all of the data in a structured format that the SCA team is aware of. The feeder itself is what initially kicks off the whole process of communicating with the package registry and getting a list of packages. Sometimes it gets versions, but usually it just gets the packages, and then it pushes them over a Pub/Sub topic, which is just a message queue. It pushes these package messages, and they go to the interfacer.

The interfacer then calls out to the package registries, gets the information for the licenses, pushes that over the licenses Pub/Sub topic, and then calls into the processor. The processor batches all of those up and inserts them into the database every minute or so. That way we don't destroy the database with thousands of connections; it batches them up nicely, so we keep the utilization pretty low. And that is roughly how this is designed.
One thing about the migration: it's actually deployed as a service, only because Terraform, which we use for deployment automation, doesn't support Cloud Run jobs. So it's actually running as a service, but once the deployment process kicks off, it will deploy a new version of the schema if there are any changes; otherwise it just does nothing.
The major components here are the deployment project, our GitLab container registry, the GitLab repositories, and the Artifact Registry for Google. We're actually pushing these containers from the GitLab registry over to the Artifact Registry inside of Google. That way, Google can pull them in for the interfacer, processor, migrator, and so forth.
A couple of things to note about the Pub/Sub: there is a nice little interface between Pub/Sub and Cloud Run. Cloud Run is just running a container as a service, but it also runs it as an HTTP server, and Pub/Sub communicates over HTTP. It also has some additional tricks it can do, such as deciding, depending on the number of Pub/Sub messages, how much to scale out the containers. So it'll tell each container how many messages it can accept at once.
Each container can accept n number of messages, as well as scale out to n number of instances. So when we talk about scaling out horizontally, we're talking about scaling out the number of instances; when we talk about scaling vertically, we're talking about the number of concurrent messages that each container can handle. I hope that makes sense. That is the design of our architecture.
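Those two scaling knobs can be sketched in Terraform roughly like this; the service name, region, image, and values are illustrative assumptions, not the real deployment's settings:

```hcl
# Sketch only: horizontal scale = instance count (maxScale),
# vertical scale = concurrent Pub/Sub pushes per instance.
resource "google_cloud_run_service" "interfacer" {
  name     = "license-interfacer"
  location = "us-central1"

  template {
    metadata {
      annotations = {
        "autoscaling.knative.dev/maxScale" = "20" # horizontal limit
      }
    }
    spec {
      container_concurrency = 10 # messages accepted at once per instance
      containers {
        image = "us-docker.pkg.dev/example-project/registry/interfacer:latest"
      }
    }
  }
}
```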
You should already have access to all of these. I'm going to briefly walk through each one of these repositories so you have a better understanding of how they're laid out and how they're structured. We'll start off with schema, because that's where everything starts from. Here we have our migrations. In each one of our projects, the main binary entry point is under the command directory, followed by the name of the project. So here we just have our project.
We are using the standard urfave/cli library. This one has a number of command line arguments that you will need to know. These also get set up in the Cloud Run Terraform deployment, which I'll cover in another segment or another video. So that's how our commands are set up. The scripts are all for the release and deployment process: the lib/scripts directory is how the deployment and the release get set up.
We are gating each release, so any time you want to actually release a version, you will need to go through and click the manual release button after you've committed and merged to main.
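In GitLab CI terms, a gated release job typically looks something like this; the job name and script path are illustrative, not the project's actual configuration:

```yaml
release:
  stage: release
  script:
    - ./scripts/release.sh   # illustrative script path
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
      when: manual           # the "click to release" gate
```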
Migrations for schema are contained in here. We're using Goose, which is a nice migration system for Go, so you can have either SQL files or Go files. Sometimes you'll need to do more complex, dynamic work: communicate with the database, get information, and then do migrations that way. In those cases, you'll want to create a Goose Go migration script. In the standard cases, you'll just have your up statements and then, if you need to roll back, your down statements.
So that's how that is structured, and that's it for schema. Next, working forward, we're going to look at the license feeder. The license feeder is what kicks off the entire process. It's run as a CI/CD job, so keep that in mind.
If you're wondering how this whole process kicks off, it's a scheduled job within the deployment project. For building, we have the feeder here. Again, command/feeder is the main entry point, and it has a number of environment variables. These are usually set in the GitLab CI YAML, so you don't need to worry about them too much; we just need to register them, and then it pulls them off.

One thing that is kind of interesting about our design is that we try not to have passwords anywhere, so we use impersonation. The deployment project has a deployer key that we have set up to allow it to create tokens to impersonate other services, so we're basically dropping our privileges to do this type of work. In this case, we're just going to impersonate whatever user is provided to it, which I think is a feeder user. Then it goes through and, depending on the type of registry, it'll start feeding out the packages.
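The impersonation setup can be sketched in Terraform as an IAM binding that lets the deployer mint short-lived tokens for another service account; the account names here are assumptions:

```hcl
# Allow the deployer service account to create short-lived tokens
# for the feeder service account (i.e. impersonate it).
resource "google_service_account_iam_member" "feeder_impersonation" {
  service_account_id = google_service_account.feeder.name
  role               = "roles/iam.serviceAccountTokenCreator"
  member             = "serviceAccount:${google_service_account.deployer.email}"
}
```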
If we look at the structure here, the main interesting parts are going to be the actual registry feeders, the ones that talk to the registry. We have Golang feeders, and we have most of these in by now; I think we're only missing one at this point, which is the RubyGems one.
It's all structured so that everything follows the same interface, a very basic interface of just feed and registry name. Then, depending on how they communicate with the registry, they're obviously going to work differently. There are also some helpers, like the publishers, which handle Pub/Sub. I can create videos on all of these.
If you want to go into more detail on the architecture and design of each individual project, I can do that, but I'm just going to go through it lightly right now. Again, most of these things have interfaces that allow you to test them. We also have the concept of dry runs, so you can see how it would work before you actually blast out millions of messages over Pub/Sub. Again, lib/scripts handles the release; in this case the feeder is a binary release, so it's pushing into the package registry.
So here you have each feeder version getting released. That's the feeder. The other thing is the bucket: this handles storing state for the feeder. The feeder will sometimes save cursors or timestamps so it can continue where it left off, and that uses Google's storage buckets.
Oh, and one other thing I should probably mention real quick: all messages are just JSON. So inside of the data you'll see a package message; this is just the Pub/Sub message that's basically encoding or decoding the data. Right now it's very simple: it's just a package registry, package name, version, and then any sort of metadata. Sometimes some packages will require additional information, so that's available there.
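As a sketch, a package message might look like this; the exact field names are assumptions based on the description above, not the real schema:

```json
{
  "registry": "pypi",
  "name": "requests",
  "version": "2.31.0",
  "metadata": {}
}
```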
So that is the message. Once it leaves the feeder, it's going to go over Pub/Sub into the license interfacer. The license interfacer again has a bucket and the same information inside of command. Again, we have the interfacer; the message senders are just for testing. Inside of this we determine or configure various things, like where to keep errors. If we fail to look up a package, we'll save it in GCP, and whether you want to do that or not is controlled by feature flags we enable for it.
So again, the interfacer is an HTTP server. Inside of the dispatcher you will see your usual HTTP server stuff, as well as handling of the incoming messages and determining whether a message is a dead letter, meaning that Pub/Sub tried it 10 times and it failed. In that case it's like, okay, we need to give up: it'll dead-letter the message and store the information that would otherwise have been lost into a GCP bucket, and then it just goes through.
Again, I don't want to go into too much detail here, but depending on the package registry type, it will then call into the interfacers, and once again we have it split up by package registry. So if it's PyPI, we're going to handle it this way, where we have this handle method that goes through, processes each one of the messages, and then looks it up.
It does its business, and then each one of the interfacers will return from the handle method back to the dispatcher. I believe it will say "interfacer handled message", and then, provided it's legitimate, it will push the message off and publish it, and we're done. So that is the interfacer. Next up is the processor.
The processor is what's communicating with the database, so this one does a little bit more setup. It's a little bit more involved because it has to communicate with Cloud SQL. Again, it is an HTTP server, so in this case we're creating a new server here and initializing the database before that, making sure we have all the information we need. That configures everything, and then inside of this server we are again handling the incoming HTTP message, or Pub/Sub message; it's the same thing.
Otherwise, we do something a little bit different. Since we need to batch up these messages, we also need to keep track of the requests that came in, because the way that Pub/Sub works is that when a message comes in, it's going to have a timeout of, I think, 600 seconds. So we need to track and close out that connection. If we just accepted the message and closed the connection, Pub/Sub would think it's done and it wouldn't be able to retry.
So we had to get a little bit fancy and create a data structure that has a channel inside of it, saying: okay, accept this message and put it into a queue, but leave the connection open, because we don't want to return until the batched insert has actually completed on the database side. That way we can track request failures. So we create this little data structure that queues it up, and once the batch insert happens, we close that channel.
Then it's able to return from the HTTP request, and Pub/Sub knows that that message has been handled and can continue to send out a new one. So it's a little bit funky there, but it's the best approach we could find that avoids creating thousands of connections to the database or having thousands of insert statements. It uses a different style of batching things up: copying things into temporary tables and then inserting them. Again, I can do another video on that one as well.
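The temporary-table pattern looks roughly like this in SQL; the table and column names are made up for illustration, not the real schema:

```sql
-- Stage the batch in a temp table, then do one set-based insert
-- instead of thousands of individual INSERT statements.
CREATE TEMPORARY TABLE tmp_licenses (LIKE licenses INCLUDING ALL) ON COMMIT DROP;

-- The batch rows are bulk-loaded here (e.g. via COPY), then merged:
INSERT INTO licenses
SELECT * FROM tmp_licenses
ON CONFLICT DO NOTHING;
```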
So that's the processor. Next we have the exporter. At this point all the data should be in the database. The exporter is going to run as a CI/CD job, again from the deployment project, which I'll cover in a second. It has the main exporter command here, which again uses the same structure of reading environment variables, configuring the connection to the database, and then how you want to export the data: from what start period until what end period. So that's the exporter main.
There's a little bit of complexity around how we're monitoring how much we've written to the object, because we're rotating it much like an Apache log rotator: after, say, 10 megabytes, it'll rotate to the next file. So there's a little bit of complexity in tracking all that.
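Size-based rotation can be sketched with a small byte-counting writer; this is a simplified illustration, not the exporter's actual implementation:

```go
package main

import "fmt"

// rotatingWriter is a sketch of size-based rotation: it counts bytes
// and advances to a new "part" once the limit would be exceeded,
// similar to how the exporter rotates its output objects.
type rotatingWriter struct {
	limit   int64 // max bytes per part (the walkthrough mentions ~10 MB)
	written int64
	part    int
}

func (w *rotatingWriter) Write(p []byte) (int, error) {
	if w.written+int64(len(p)) > w.limit {
		// In the real exporter this would close the current GCS
		// object and open the next one.
		w.part++
		w.written = 0
	}
	w.written += int64(len(p))
	return len(p), nil
}

func main() {
	w := &rotatingWriter{limit: 10}
	for i := 0; i < 4; i++ {
		w.Write([]byte("abcdef")) // 6 bytes each write
	}
	fmt.Println("parts used:", w.part+1)
}
```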
But again, we can make another video to cover that so everyone's aware of how it works. Then it's the usual: connecting to the database, creating a cursor to iterate through and pull out the components from the database, and then storing them as CSV files in a public GCP bucket. That is probably the major difference with this one.
On to deployment. Deployment is the heart of this whole project. It's made up of Terraform, and we're actually using all of the GitLab features: we're using Terraform with GitLab's HTTP backend, so we're storing all the state inside of the deployment project. That way, it's able to remember exactly what has already been deployed. As for how the pipeline works, let's look at Pipelines in the deployment repository.
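The GitLab-backed state setup looks roughly like this; the address is a placeholder, and in practice the values are usually passed via `-backend-config` in CI rather than hard-coded:

```hcl
terraform {
  backend "http" {
    # GitLab's Terraform state API for this project; <project-id> and
    # the state name ("dev") are placeholders.
    address        = "https://gitlab.com/api/v4/projects/<project-id>/terraform/state/dev"
    lock_address   = "https://gitlab.com/api/v4/projects/<project-id>/terraform/state/dev/lock"
    unlock_address = "https://gitlab.com/api/v4/projects/<project-id>/terraform/state/dev/lock"
  }
}
```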
The pipeline is made up of a number of steps. Once you merge some change, say you modify a Terraform file, it goes through its validation and builds out the container information. It's actually pulling containers from GitLab and pushing them into Google's cloud container registry, called Artifact Registry. We're doing the same thing for production, so we're keeping them separate. Then it goes through a stage where you have to manually release to the development environment. So what happens in these stages?
It's going to prepare the plan: Terraform will create a plan of how it's going to change the infrastructure, and then to actually apply that plan, you have to click this play button. Once you've confirmed your changes work in dev, you will then click the play button for prod. Again, you can change this if you want, but I figured this would make the most sense for people who are coming into this project.
The way that deployment is set up, we opted to use these different environments. Local is for your own environment: if you're testing locally, I created some helper scripts to, for example, push images from GitLab's container registry to your personal GCP project's container registry, as well as automate the whole Terraform plan-and-apply for you. Each one of these environments has a main.tf, which defines the infrastructure you're going to create, and it references modules.
The variables file determines the type of variables that this deployment requires, and then to actually provide the values for them you have a tfvars file. That's where this is all set, so you're going to have to modify it for your particular instance, for example your email address and whatnot. The environments, again, are helpers for applying stuff to your environments, so definitely take a look at those; we have guides on how to set all this up in the project.
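The variables-plus-tfvars split can be sketched like this; the variable names and values are made up for illustration:

```hcl
# variables.tf: declares what the deployment requires
variable "project_id" {
  type        = string
  description = "GCP project to deploy into"
}

variable "alert_email" {
  type        = string
  description = "Where to send notifications"
}
```

```hcl
# dev.tfvars: supplies the actual values
project_id  = "example-dev-project"
alert_email = "you@example.com"
```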
Prod right now is not set up; we're leaving that as an exercise for SCA, or we're going to work with you to get it built out. Honestly, it's just copying what's in dev, putting it into prod, and changing some names; it's not that much work. The difference here is that we have the backend.tf. This is where it's actually storing the state information for dev. Again, we keep the environments separate, so this one has the backend information for that.
That way all your state is stored in GitLab and not on the developer's desktop. Once again, we have the dev tfvars; this is all the same information, broken into each module. And that's how the deployment works. Now, for the actual process of that deployment, you can look into this file. We're using GitLab's new secure files.
We have a single key JSON, or two actually: one for dev and one for prod. That's the secure file, which is the service account JSON file. There are also the dev and prod environments, and then the registered versions. So if you want to, say, change the processor, you will go through the processor project, release it, and then you'll have to bump these versions manually, create an MR in this deployment project, release it, and then click play to deploy to dev.
The pipeline has a number of steps: validation, which I showed earlier, and then we have these feeder and exporter job stages, which are only run using rules that say "only if this flag is set", which comes from the CI/CD schedules. Right now we have these set up so that if you want to kick off, say, an npm run, you just click play. Eventually these will be scheduled, but this gives you an idea of how it's currently set up.
If you are feeling overwhelmed and you just don't understand, and you need to change something or get some tests in, we did create a significant amount of documentation. It covers pretty much everything you need to know, from preparing your environment, to how to communicate with the database, to developing schema changes. Each project has its own documentation that you can go through.
For example: if I need to create a new feature for the processor, do all these things. It'll walk you through deploying to your personal environment to test, deploying on your local development setup, and then deploying it properly using that local environment I showed, with the apply and push-local-images shell scripts. Once you've confirmed it's working, you can move on to changing the dev environment and the prod environment to reflect any changes, if you add a new environment variable or whatnot.
All of this information is here again for each individual project, so hopefully that helps you, because again, this is kind of an overwhelming system. We also have the architecture in here, as well as the security guide. We've already supplied this information to the security team, so they are aware of it. And that pretty much wraps up the introduction to the external license database. I hope that was helpful, and I will do further videos if required. Thank you.