Description
This time, we explore how to create a very simple pipeline to perform Hyperparameter Optimization, using SKLearn's GridSearchCV and posting the results to the MR.
Codebase: https://gitlab.com/gitlab-org/incubation-engineering/mlops/hyperparameter-tuning-exploration
Exploring Hyperparameter Optimization with GitLab: https://gitlab.com/groups/gitlab-org/incubation-engineering/mlops/-/epics/6
Hello and welcome, everyone, to another session on exploring GitLab for machine learning and data science use cases. Today we're continuing our path of exploring GitLab pipelines for hyperparameter optimization, and this time we're going to build the simplest hyperparameter optimization pipeline possible. Just going back a bit: why are we doing this? Hyperparameter optimization is a very tedious, long process within a machine learning workflow.
So it's the perfect candidate for a CI pipeline. It's also the first step towards AutoML: if you consider that the algorithm you choose is itself a hyperparameter, you can think of AutoML as an extension of hyperparameter tuning. And with this in mind, is GitLab CI a good tool for this use case? It makes sense within the GitLab ecosystem, but is it the tool that we currently have ready for this?
So if you want to follow along, this is just part one. I have a summary of everything on the hyperparameter exploration epic, and I recently published part 0 of this series. If you don't know what hyperparameter optimization is, you might want to check that out; it's an explanation of the concepts behind hyperparameter optimization.
So what are we going to do today? We're going to create kind of the "hello world" of this, laying the foundations of what we're going to work on in the next few parts. It's the most basic hyperparameter optimization pipeline we can think of: we're going to use synthetic data that is small, a very simple model, and the simplest hyperparameter optimization method we can think of. It's not going to be parallelized.
It's just going to run serialized, one run after the other. We're going to have only a small number of parameters, and the results are going to be posted directly into the MR. So no external tools, no model registry for comparison, no Hyperopt, no Bayesian approach, no parallelization, nothing like that. All of that will come soon, but this gives us something nice to build upon. So without further ado,
let's see it in action. Suppose I have this repository, which I'm using for this code. I have a simple model that runs; I'll go through the code later, but what is important at this point is that I have these hyperparameters over here, and I have created a GitLab pipeline that will run the optimization whenever a new merge request is created.
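A pipeline like that could look roughly as follows. This is just a minimal sketch: the job name, image, and script file names are my assumptions, not the repository's exact configuration.

```yaml
# Hypothetical .gitlab-ci.yml sketch: run the optimization on every merge request
optimize:
  image: python:3.10
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    - pip install -r requirements.txt
    - python generate_data.py      # create the small synthetic dataset
    - python optimize_sklearn.py   # run GridSearchCV over the hyperparameter file
    - python format_results.py     # turn the CSV results into a markdown table
    - python publish_to_mr.py      # post the markdown as a comment on the MR
```

The `rules` clause restricts the job to merge request pipelines, which matches the behavior described: nothing runs on plain branch pushes, only when an MR exists.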
So let's say I want to change the hyperparameters. I'm going to come over here and edit this file: okay, I want to test an additional value for min_samples_split, say 30. The exact value doesn't really matter. I can create a new branch with the new parameters, which will start a new merge request. I could write a nice message, but I won't do that right now. So if I look over here, the pipeline is checking, checking, checking.
You can see here that it has already installed all the packages that we need, everything that was defined in requirements.txt.
Okay, so now we can see that it ran all the trainings: five models trained for each combination that we have, each one taking about a second, so it trained the model about 90 different times. It then formatted the results and published them to the MR.
So now we can go back to the MR that I just created, and we can see that it actually posts a comment on the MR with the best accuracy and a table of the parameters and results. You can see that the difference between the worst and the best cases is about 1.5 percent.
That is actually quite big. Here it doesn't matter, but if you're running a company with multi-million dollar revenue and you can increase your revenue by 1.5 percent, that's quite a lot. So this is the simplest thing we can do, and now I can go a bit over the code itself. It's a very simple pipeline.
I start by generating fake data. Instead of using real data or one of the datasets out there, I generate seven different variables randomly and then create a target y that is true or false depending on an equation that I, honestly, typed more or less at random. The point is that y is either true or false.
So this is a classification problem, and y is completely determined by those variables. Then I remove three of those variables, so instead of having seven variables for prediction, I only have four. This means I don't have the entire information available to me, which makes it interesting for machine learning: we can use the data to try to recover the actual y.
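The data-generation step can be sketched like this. It's my own minimal reconstruction: the exact equation, column names, and which three columns get dropped are assumptions, not the project's actual code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Seven random feature columns.
X = pd.DataFrame(rng.random((1000, 7)), columns=[f"x{i}" for i in range(7)])

# y is fully determined by the features via an arbitrary equation,
# so the label is a deterministic True/False classification target.
y = (
    X["x0"] + 2 * X["x1"] * X["x2"] - X["x3"] ** 2
    + X["x4"] - X["x5"] + X["x6"]
) > 1.5

# Drop three of the seven columns: the model only sees partial
# information, which makes recovering y a non-trivial learning problem.
X_partial = X.drop(columns=["x4", "x5", "x6"])
```

With only four of the seven inputs, the model can't reach perfect accuracy, so the grid search has something meaningful to trade off.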
Then I run the optimize-sklearn script, which is also very simple: it trains the model, a random forest classifier with a fixed random state, and uses GridSearchCV from scikit-learn to optimize the hyperparameters. It's a very simple algorithm: it just tries all available combinations. So if you have three parameters, one with three different values and the other two with two values each, that means we have twelve combinations (3 × 2 × 2).
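As a sketch of what that grid search looks like: the parameter names match scikit-learn's random forest, but the specific values and the toy data here are illustrative assumptions, not the repository's actual grid.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = rng.integers(0, 2, 200)

# Illustrative grid: 3 * 2 * 2 = 12 parameter combinations.
param_grid = {
    "n_estimators": [10, 50, 100],
    "max_depth": [3, 5],
    "min_samples_split": [2, 30],
}

# Fixed random_state for reproducibility; cv=5 means each of the
# 12 combinations is trained and scored 5 times (60 fits in total).
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(len(search.cv_results_["params"]))  # 12 combinations tried
```

`cv_results_` is also what you'd dump to CSV for the formatting step, since it holds the parameters and mean test score for every combination.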
Then it tries each one of these combinations five times. It loads the hyperparameters from the hyperparameter file that I showed before, so you can configure them very easily and quickly, and it passes them directly into the optimizer. Then a very simple script does the formatting: it picks up the CSV results and transforms them into markdown. And then a final one just publishes this to the MR. That last step is not related to this project in particular.
It can be used in any project you have: it just takes the message passed down to it and posts it as a comment on the MR.
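Those last two steps could be sketched like this: convert the CSV results into a GitLab-flavored markdown table, then post it through the merge request notes API. The function names and the `GITLAB_TOKEN` variable are my assumptions; the CI variables and the notes endpoint are standard GitLab, but the project's actual scripts may differ.

```python
import csv
import io
import os
import urllib.parse
import urllib.request


def csv_to_markdown(csv_text: str) -> str:
    """Turn CSV text into a markdown table GitLab can render in a comment."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def post_mr_comment(markdown: str) -> None:
    """Post the table as a comment on the current MR.

    GitLab CI provides CI_API_V4_URL, CI_PROJECT_ID, and
    CI_MERGE_REQUEST_IID automatically; the API token would need to be
    supplied as a CI/CD variable (GITLAB_TOKEN here is an assumed name).
    """
    url = (
        f"{os.environ['CI_API_V4_URL']}/projects/{os.environ['CI_PROJECT_ID']}"
        f"/merge_requests/{os.environ['CI_MERGE_REQUEST_IID']}/notes"
    )
    data = urllib.parse.urlencode({"body": markdown}).encode()
    req = urllib.request.Request(
        url, data=data, headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}
    )
    urllib.request.urlopen(req)
```

Because the publishing half only depends on CI variables and the notes API, it is indeed project-agnostic, which is why the same script could post any message to any MR.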
So this was a very simple example. It's the start of our exploration, but we can already see some really important points here, and I think the biggest pain point for me is the iteration speed: changing the code, committing, testing on the GitLab UI, waiting for the GitLab pipeline to run, checking, changing again. This whole cycle takes a really long time, and it could be a lot faster if it ran locally, but it doesn't.
We do have some tools, like the pipeline editor and the GitLab validation extension on the VS Code marketplace, which help a lot. But the problem is that you need to test the thing actually running, and it only runs on GitLab; it doesn't run locally, and there's no solution for running pipelines locally. So that's very unfortunate. Up next, part two of this series is about trying to make it parallel.
As you saw before, it just runs a single pipeline; it doesn't parallelize the runs, which is not optimal. In this case it's not a problem, because each run takes one second to finish, but imagine that it takes five or six hours to train a model, which is not unreasonable. It's common to have applications where training takes hours, days, sometimes even weeks, so making it parallel is quite important, and that's what we're going to do in the next session. Then in part three, going a little bit beyond, instead of using these predetermined approaches, where every single combination is already computed from the get-go, we'll use somewhat more iterative algorithms that update the possible values on every iteration.