From YouTube: SEG MLOps Update - 12th November 2021
Description
Feedback and Ideas for Jupyter Support: https://gitlab.com/gitlab-org/gitlab/-/issues/343024
Feedback and Ideas for Pipeline Experimentation: https://gitlab.com/groups/gitlab-org/incubation-engineering/mlops/-/epics/5
All Updates: https://gitlab.com/gitlab-org/incubation-engineering/mlops/meta/-/issues/16
Hello everyone, and welcome to another MLOps Single-Engineer Group update. It's time for November 12, 2021. Welcome back, and we are going to talk about some very exciting things this week.
Just a reminder of what we're doing here. My vision for this SEG is simply to make GitLab a tool data scientists and machine learning engineers love to use. The thing is, especially with data scientists:
They don't use GitLab because they want to; they use it because they have to. GitLab, Git and CI are, a lot of the time, not really compatible with the data scientist workflow, and we are trying to bridge this gap a little bit and see what tooling is missing. What are the components? How can we make GitLab a better companion for data scientists?
So, on that note: first of all, yay, Jupyter support is live. We finished most of the first iteration of it. You can see it over here. It is not perfect, it is not ideal, and that was never our goal, but it is better than before. A lot better. Now you can clearly see what changed within a Jupyter notebook.
You can clearly see the differences. There are still a lot of things we want to do over here, but the first iteration is live and users can already see it. For self-managed customers it will ship in 14.5; for GitLab.com users it is already there.
Okay, I'm very happy with the results. We shared this on Twitter, and the reception by our customers was really, really great. They were very happy and very excited about it, both GitLab users and, well, our competitors were looking at this as well. I'm very happy with it, and we'll still be working on it.
There are still a lot of things we want to do, like I said, and if you have any asks, feature requests or any feedback, I will link the issue in the video description, so you can just come in and give your suggestion. I will take a look and then we can discuss, and I can see what I'm going to do next. Very exciting, very happy. It took two months, but it's finally out there. Okay.
Next one: with that done and out of scope now, the other thing we're working on. We are looking at feature implementation, which is the Jupyter notebook diff, but we also want to explore how data scientists could use the things we already have in GitLab.
One of the things data scientists and machine learning engineers do as part of the machine learning process is optimize the hyperparameters of the model. Hyperparameters are the parameters passed to the model itself, not learned in training. When you create a model, you can tune a lot of configurations of the algorithm; these are called hyperparameters, and this configuration can make a real impact on model performance.
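To make that distinction concrete (this sketch is my own illustration, not code from the video, and assumes scikit-learn is installed): hyperparameters are the constructor arguments you choose up front, as opposed to the parameters the algorithm learns from data.

```python
# Hyperparameters are chosen by you before training; they are not learned
# from the data. Here max_depth and min_samples_split configure the tree.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=3, min_samples_split=4)

# The learned parameters (the tree structure itself) only exist after fit():
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]
clf.fit(X, y)

# The hyperparameters we set are still retrievable on the fitted model.
print(clf.get_params()["max_depth"])
```

Changing those two constructor arguments can noticeably change how well the fitted model generalizes, which is why they are worth searching over.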
So hyperparameter optimization is an additional step in the training pipeline, where you not only train with the data but also choose the best hyperparameters for your use case. This is a very long, tedious process: you run the model many, many times, and dedicated platforms are usually the way to go for that, either Kubeflow or AWS SageMaker. Here we want to explore a bit how we can use GitLab for that.
A
This
is
not
the
common
use
case
for
gitlab
pipelines.
We
understand
that
and
it's
more
about
exploring.
How
could
this
be
make
better,
and
can
we
use
this
as
it
is
right
now
so
to
be
able
to
do
that?
I,
the
code
is
like
this
is
the
issue,
but
over
here,
okay,
So what I did, instead of picking up a dataset that's out there: I created a simple model that has seven random variables and a function that, based on those seven variables, returns true or false. Then I removed three of these variables, which are called features. That means seven features are necessary to create my prediction target, but I only have four available, so I can use machine learning to try to estimate the value of y based on a, b, c and g. It's a very simple way of creating a model and a dataset that I can control.
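A sketch of that idea in plain Python (the variable names and the exact rule are my own, since the video doesn't show the generator code): derive a boolean target from seven random variables, then expose only four of them as features.

```python
import random

random.seed(0)

def make_row():
    # seven random variables, a..g
    a, b, c, d, e, f, g = (random.random() for _ in range(7))
    # the target depends on all seven variables...
    y = (a + b * c - d + e * f + g) > 1.5
    # ...but we only keep four of them as features, so a model has to
    # estimate y from partial information
    return {"a": a, "b": b, "c": c, "g": g, "y": y}

dataset = [make_row() for _ in range(1000)]
print(sum(row["y"] for row in dataset), "positive examples out of", len(dataset))
```

The appeal of a controlled generator like this is that you know exactly how much signal the four kept features carry, so any model's score is easy to sanity-check.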
Our focus here is not really the dataset itself, it's how to explore the pipeline, so this was okay for now. So what I did over here: I created a pipeline. Okay, let me show it.
A
Enter
the
visualize
okay,
so
what
I
did
to
create
a
very
simple
pipeline
that
first
of
all
fetch
the
data,
which
is
the
step
that
I
shared
just
to
create
a
data
set
and
I
created
an
artifact,
then
it
optimizes
the
hyper
parameters.
I
use
sqlearn
scalar
for
this
and
I
use
a
very
simple
algorithm
for
the
helper
parameter.
Optimization
hyper
of
hyper.
Hyperparameter optimization is basically a search problem. You have to run a search, and one of the ways is to test out all possibilities and choose whatever is best. This is what this algorithm does. It's very simple, not optimized.
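The exhaustive approach described here maps onto scikit-learn's GridSearchCV; the video names scikit-learn but not the exact class, so treat this as a plausible sketch rather than the project's actual code. The model, data and grid values below are my assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# a small synthetic classification problem standing in for the real dataset
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# try every combination in the grid and keep the best-scoring one
param_grid = {
    "max_depth": [2, 4, 8],
    "n_estimators": [10, 20],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
```

With 3 × 2 = 6 candidates and 5-fold cross-validation, this single call already performs 30 model fits, which is exactly the combinatorial growth discussed below.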
You end up doing a lot of training with this, so you cannot go crazy with your hyperparameters, but it is a first skeleton, so that's okay as well. Then the next step publishes the results to the merge request.
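The pipeline shape walked through here (fetch data, optimize, publish to the merge request) could look roughly like this in `.gitlab-ci.yml`; the job names, script names and image are my assumptions, not taken from the actual project.

```yaml
stages:
  - fetch
  - optimize
  - publish

fetch_data:
  stage: fetch
  image: python:3.9          # assumed image
  script:
    - python fetch_data.py   # generates the synthetic dataset
  artifacts:
    paths:
      - data.csv

optimize:
  stage: optimize
  image: python:3.9
  script:
    - python optimize.py     # grid search over the hyperparameter grid
  artifacts:
    paths:
      - results.json

publish_results:
  stage: publish
  image: python:3.9
  script:
    - python publish.py      # posts the results back to the merge request
```

Passing `data.csv` and `results.json` between stages as artifacts is what lets each step run on a fresh runner.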
So if we come here, this is the visualization, but I can go to the merge requests and see a very simple merge request that I added. So, hello.
A
It
doesn't
really
matter,
but
if
you
come
over
here,
it
reads
this
file
of
where
is
it
reads
from
this
file
the
hyperparameter.ammo?
A
So
there
are
many
parameters
that
I
can
configure
on
this
model,
and
then
I
run
this
pipeline
over
here.
So we can look at the optimize step, this scikit-learn job, and you can see over here that it really runs many, many tries. It tries all possibilities within the specific scope that I gave, and not only that, it tries each one several times, because it uses cross-validation to score the models. For every configuration I have, it trains five times with a different training set.
This is to make sure that the score is representative, that it's not just one case where it went really well; within that hyperparameter combination, those really are the best results. So, for example, for this one it trains five times with different training data, for max features 2, max samples 500 and the given samples split, and it does that for every combination.
So if you look here, I had 12 candidates. I only tested three values for one hyperparameter and two each for the other two, which means only 12 combinations, but in the end there were 60 model fits.
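The arithmetic here (candidates times cross-validation folds) can be checked with a few lines of standard-library Python; the parameter names are my guesses at what the grid contained.

```python
from itertools import product

# assumed grid: three values for one parameter, two for each of the others
param_grid = {
    "max_depth": [2, 4, 8],
    "max_samples": [500, 1000],
    "min_samples_split": [2, 4],
}

# every combination of the three lists is one candidate configuration
candidates = list(product(*param_grid.values()))
cv_folds = 5

print(len(candidates))             # 12 combinations
print(len(candidates) * cv_folds)  # 60 model fits in total
```

This is the same 12-candidates, 60-fits summary that GridSearchCV prints in the job log.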
So it takes a lot of time to compute this. Not in this case, which is a very simple model with very little data, and I didn't care so much about that, so it's very fast; but this can explode very quickly, and that is why it's important to have it as a pipeline.
A
The-
and
this
is
a
limitation
of
this
one
with
this-
is
run
sequentially.
Ideally,
this
should
run
in
parallel
because
they're
independent
of
each
other.
So
I
can
just
go
crazy
and
fire
five,
six,
seven
concurrent
pipelines
so
that
I
can
get
results
faster
after
the
the
pipeline.
This
optimization
pipeline
runs.
Then
it's
the
next
step,
which
is
publish
results.
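One way to get the concurrency asked for here with what GitLab CI already ships is `parallel: matrix`, which fans a job out into one concurrent job per variable combination. This is a sketch under the assumption that the hypothetical `optimize.py` can take its slice of the grid from a CLI flag:

```yaml
optimize:
  stage: optimize
  image: python:3.9
  parallel:
    matrix:
      - MAX_DEPTH: ["2", "4", "8"]   # one concurrent job per value
  script:
    - python optimize.py --max-depth "$MAX_DEPTH"
```

Each job would then only search the remaining parameters for its assigned value, and a later stage could collect the per-job artifacts and pick the overall winner.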
Think about it: if you are running this on a recommendation page and you get two percent more conversion just because of this, that's insane. So, for now, all the publish step does is create this report, and that's it; it doesn't do much more than this.
But that doesn't mean that's where we want to stop. So if we come back over here, this one: this is the skeleton, the starting point.
We want to start getting more and more complex. First of all, there are many things we can do. We can use real data instead of the fake data we are using. We can implement an iterative pipeline: there are optimization algorithms that don't try everything, but work more like "let's try these five values, check the results, then come up with five new values", and so on and so forth; it's really a search problem.
We can use this for stress tests: suppose you are running TensorFlow on really, really large datasets. How would GitLab behave? How would the runners behave with this? Let's try that out as well, including the GPU runners we have available. And one other thing is the building experience.
It's really not nice to build these pipelines. It's not horrible, but it's a lot of trial and error. Is it possible to make it easier?
A
For
example,
can
you
transform
a
jupiter
notebook
into
a
pipeline
user,
pushes
a
jupiter
notebook
and
that
becomes
a
pipeline?
We
can
try
that.
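That idea is left open in the video, but the first half of it (turning a notebook into something a CI job can run) is mostly mechanical, because `.ipynb` files are plain JSON. Here is a standard-library sketch of extracting the code cells; how those cells would then be split into pipeline stages is the real open question.

```python
import json

def notebook_to_script(ipynb_json: str) -> str:
    """Concatenate the code cells of a Jupyter notebook into one script."""
    notebook = json.loads(ipynb_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            # cell sources are stored as lists of source lines
            chunks.append("".join(cell["source"]))
    return "\n\n".join(chunks)

# tiny hand-made notebook for demonstration
demo = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Fetch data\n"]},
        {"cell_type": "code", "source": ["x = 1\n", "y = x + 1\n"]},
        {"cell_type": "code", "source": ["print(y)\n"]},
    ]
})

script = notebook_to_script(demo)
print(script)
```

A CI job could run this conversion on push and execute the resulting script, with markdown cells dropped or kept as comments.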
So if you have any idea, anything you want us to try out, add a comment on the epic (linked in the description) and share it with us. We will try to look into it, and we will give priority to external suggestions. And yeah, that's it. Thank you very much, that's it for today. Bye.