From YouTube: Applied ML - Building MLOps pipeline in GitLab for Suggested Reviewer - "The first MLOps template"
Description
This video walks through the steps of building an end-to-end ML pipeline, automated from data extraction to bot service.
A: Okay, hi everyone. Today is really exciting for us: we will be showcasing how we built the first MLOps pipeline for the Applied ML team.
We've had a lot of questions on how we've used GitLab CI in building machine learning models, and here is our full pipeline that goes from DataOps, to MLOps, to connecting to the frontend. We'll go into a lot more detail on it. So to begin with, I'll start by sharing my screen: this is just a little bit of the basics of how the reviewer recommender process actually works.
In the background, the first part is really data. We are using the merge request API to extract data. So the first phase is the GitLab CI service triggering that process: pre-extraction, setting up the right environment, extracting, ingesting, and processing. Then it goes from that DataOps part to the MLOps part, where we trigger the training of the model, the tuning, selection, and serialization of the model. All of that is done in Google Cloud Storage, and that is then connected to our final step, where we serve the model and send the output to the bot.
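As a rough sketch, that overall flow could be expressed as CI stages like the following (the stage names here are illustrative, not the team's actual configuration):

```yaml
# Illustrative stage layout for the end-to-end pipeline
stages:
  - extract     # DataOps: pull merge request data via the API
  - transform   # DataOps: prepare training and test data sets
  - train       # MLOps: tune, select, train, serialize
  - publish     # push the serialized model to Google Cloud Storage
```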
That's the bot that our internal customers see, suggesting the reviewer for a certain MR. The frontend can also change later, with a lot more detail: we would add the model monitoring and observability part, and also change the way we serve the experience, from a bot to an actual frontend UI.
So now I'm going to let Andreas and Alexander go into a lot more detail on the pipeline.
B: And show you what these pipelines actually look like. For each project, we create a scheduled pipeline which triggers every three days.
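In GitLab CI, jobs can be gated so they only run when the pipeline is started by a schedule; a minimal sketch (the job name and script are placeholders):

```yaml
extract:
  rules:
    # Run this job only for scheduled pipelines
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  script:
    - python extract.py
```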
The reason why we decided to combine them into this one single YAML file: well, there are two reasons. First, we wanted to keep the sequential flow, making sure the extraction comes before the transformation and the transformation comes before the training, and this seemed like a very convenient way to do that. But also, each of these jobs has its own repository, with its own pipeline for running unit tests and doing dependency scanning, and this way we can keep that separate from the actual model training pipeline.
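That combination can be done with GitLab's remote includes; a sketch, assuming hypothetical project paths and file names:

```yaml
# One pipeline definition assembled from per-repository CI files
include:
  - project: applied-ml/extractor      # hypothetical project paths
    file: /ci/extract.yml
  - project: applied-ml/transformer
    file: /ci/transform.yml
  - project: applied-ml/trainer
    file: /ci/train.yml

stages:
  - extract
  - transform
  - train
```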
B
Otherwise
we
just
get
the
data
since
the
last
job,
then
we
run
the
actual
extraction
job.
As we do here for the transform job, we need to get the actual main.py that we run to transform the file. We need to fetch it, because when we include a remote YAML we only include the instructions, not the actual repository, so these files we need to clone as well. We do the same for the training job. Again, we pass this along with artifacts, and once the transformation and the training are done, we use some small bash scripts to actually persist this into our database.
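A minimal sketch of that pattern, assuming a hypothetical repository URL and file names: the job clones the code that the remote include refers to, runs it, and hands its output to the next stage as an artifact.

```yaml
transform:
  stage: transform
  before_script:
    # The remote include only brings in CI instructions, not code,
    # so clone the repository that actually contains main.py
    - git clone "https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/applied-ml/transformer.git"
  script:
    - python transformer/main.py --input extracted.json --output transformed.json
  artifacts:
    paths:
      - transformed.json   # passed on to the training job
```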
A: That's great. Actually, going back to the jobs, Alexander, could you explain all the different stages, from pre-extract to extract to pre-transform, all the way to the post-pipeline stage?
C: Let me first share the same pipeline. So, for instance, this is the MLOps pipeline for our internal handbook.
So just to sum up what Andreas said: the pipeline consists of many jobs, most of them just for housekeeping, but mainly there are three main stages. The first is to extract data, so this one right here. Then there is the stage to transform that data: we use Dataflow jobs to transform the data and also to move it. Underneath we also use Pub/Sub between the extract stage and the transform stage, and we have some Dataflow jobs that move the data from Pub/Sub to Google Cloud Storage. Then we also have the transform job.
This is also a Dataflow job, used to transform the data into prepared training and test data sets. And finally there is a very important stage: the training stage. First we tune hyperparameters, we select the best model for a given project, and then finally we train the final model, which will be published to Google Cloud Storage and served later on each request.
So now let me focus on each of these three stages. We can also check the MLOps CI file that we have right now. As Andreas said, these three stages are located in individual projects. So this is the extract stage; right here we have the transform stage; and here we have all the jobs that relate to the training stage. So let's go.
If you go to our extractor repo, to the ci folder, you will find the YAML file that is included in each MLOps pipeline. If we check this YAML file, we'll see that it has only one extract-merge-requests CI job, which mainly extracts all merge requests from one date to another date. We take these dates from the Postgres database that we use underneath our MLOps pipeline.
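A sketch of what such a job might look like (the script name and variables are placeholders; the dates would be resolved from the Postgres bookkeeping database):

```yaml
extract_merge_requests:
  stage: extract
  script:
    # FROM_DATE and TO_DATE come from the Postgres database
    # that tracks what has already been extracted
    - python extract.py --project-id "$PROJECT_ID" --from "$FROM_DATE" --to "$TO_DATE"
```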
So this is just one command to extract, as I said, merge requests with approvers and also with the diffs, because right now the model works based on changed files. That's why, for each merge request, we also need to extract the changed files. That's all for the first extract stage. One more thing: we use batches of about 50 requests.
Okay, if we go to this ci folder, we will find almost the same file as in the extractor repo. We have only one job here as well, just to transform our extracted merge requests and prepare the training and test data sets. This is a Python project: we use the Python SDK to write the Dataflow job, and using this command we create the Dataflow job. So this is the runner.
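A sketch of launching an Apache Beam pipeline (written with the Python SDK) on Dataflow from CI; the project, region, and bucket names are placeholders:

```yaml
transform:
  stage: transform
  script:
    # Submit the Beam pipeline to Google Cloud Dataflow
    - >
      python main.py
      --runner DataflowRunner
      --project my-gcp-project
      --region us-central1
      --temp_location gs://my-bucket/tmp
```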
Okay, and maybe this is the most difficult stage in terms of the number of jobs that we have. First we have the preprocess-dataset job, which just downloads everything that we have for a given project from Google Cloud Storage and zips all these files for the next job; that is the goal of this step. Then the next job is to tune the hyperparameters.
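A sketch of that preprocessing job (the bucket path is hypothetical):

```yaml
preprocess_dataset:
  stage: train
  script:
    # Download everything for the given project from GCS and zip it up
    - gsutil -m cp -r "gs://my-bucket/datasets/$PROJECT_ID" ./dataset
    - zip -r dataset.zip dataset
  artifacts:
    paths:
      - dataset.zip   # consumed by the tuning job
```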
So we need to find the best hyperparameters in order to find the model that gives us the best results. We take the zipped data set from the previous job, and we use this data set here just to tune the hyperparameters. And finally, when we find the best hyperparameters, we train the final model and store it. This is the job that relates to this step.
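The tuning job can pick up the zipped data set as an artifact from the previous job; a sketch with placeholder script names:

```yaml
tune_hyperparameters:
  stage: train
  needs: ["preprocess_dataset"]   # receives dataset.zip as an artifact
  script:
    - unzip dataset.zip
    - python tune.py --data ./dataset --output best_params.yml
  artifacts:
    paths:
      - best_params.yml
```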
Here we extract the best hyperparameters from the file, then we put them into a special YAML file used to train the model, and then finally we train our model. And yes, as the last step, we also need to publish this model. Right now we push everything to Google Cloud Storage: first we serialize the model, then we push it to Google Cloud Storage, and later our backend part will take these models, deserialize them, and provide recommendations.
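A sketch of training and publishing, again with placeholder names:

```yaml
train_final_model:
  stage: train
  needs: ["tune_hyperparameters"]
  script:
    # Train with the selected hyperparameters and serialize the model
    - python train.py --params best_params.yml --output model.pkl
    # Publish to GCS; the backend later deserializes it to serve recommendations
    - gsutil cp model.pkl "gs://my-bucket/models/$PROJECT_ID/model.pkl"
```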
This is what we are trying to find: the best number of factors, the best regularization, the best number of iterations. So that's just a config file used by the model to select the hyperparameters, in order to select the best model again.
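Such a config file might look like this (the value ranges are purely illustrative):

```yaml
# Hypothetical hyperparameter search space
factors: [16, 32, 64]
regularization: [0.01, 0.1, 1.0]
iterations: [10, 25, 50]
```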
A: Yeah, thanks, Alexander. I would also like to point out, which is definitely in our templates, how we also include security scanning in this process, which is something quite rare for machine learning engineers: including SAST and DAST testing as part of their CI/CD template. And then, I think, the last part.
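In GitLab, this scanning can be added by including the built-in security templates; for example:

```yaml
include:
  # GitLab's maintained security scanning templates
  - template: Security/SAST.gitlab-ci.yml
  - template: Security/Dependency-Scanning.gitlab-ci.yml
```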
Yeah, so for this one, first, just quickly, the architecture: we have another video that goes into a lot more depth on all the different parts and what we use. But if you think about the CI file that we've built, the full MLOps template that we are calling, it actually starts all the way from that extractor, connecting to Pub/Sub and Dataflow, into Google Cloud, where the ML model training is done, and then deserializing into the backend, into the projects. So that's the full workflow of it.
C: Yeah, maybe one thing we forgot to say: we create the scheduled pipeline on project registration, so this is done automatically. Once we include these CI templates, the template will register a scheduled pipeline for the given project, and then this MLOps pipeline will run every three days and automatically update the model and the dataset.
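One way to automate that registration is to call the pipeline-schedules API from a job when the template is first included; a sketch, assuming an API token is available as a CI variable (a real setup would also need to guard against creating duplicate schedules):

```yaml
register_schedule:
  stage: .pre
  script:
    # Create a schedule that reruns the MLOps pipeline every three days
    - >
      curl --request POST
      --header "PRIVATE-TOKEN: $API_TOKEN"
      --data "description=MLOps retraining"
      --data "ref=main"
      --data "cron=0 0 */3 * *"
      "$CI_API_V4_URL/projects/$CI_PROJECT_ID/pipeline_schedules"
```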
A: Yeah, I think that's a wrap.
Well, I hope this was very informative. If anyone is keen on understanding how to build an MLOps pipeline using GitLab CI, or has any questions on the pipeline for the reviewer recommender, please do drop a note for us in our Applied ML Slack channel, or reach out to me (Juan), Alexander, or Andreas. We are really happy to help with any part of this journey of building MLOps.