From YouTube: MLOps Demo - November 5th 2021
Description
All demos: https://gitlab.com/gitlab-org/incubation-engineering/mlops/meta/-/issues/16
Hello everyone, and welcome to another MLOps demo, this time for the week of November 15th, 2021. This week we have very exciting updates regarding the Jupyter experience, and also some updates on exploration for pipeline runners — pipeline runners for machine learning.
So first of all, the diffs have been integrated into the code base, so the feature is already in production. It's still behind a feature flag and it's only available for some specific projects in the GitLab organization, but it can already be seen in the wild.
So, for example, now we have this. This is already in production. This is public; anyone can take a look at it. This is how a notebook diff is being rendered for new commits. You can see that we also added code highlighting for Markdown, so it's a lot easier to parse. There are still things we can improve, a lot of them, but we already have this. So, for comparison:
This is what it looked like before. You can't really make out a lot of what is going on in this notebook. However, this is the new one over here — a lot easier to parse, a lot easier to comment on and discuss.
But before we publish this to all projects within the GitLab org, or to GitLab itself, we still need to fix one bug.
It is quite problematic: since we are rendering the notebook, not its raw diff, if you try to add a suggestion over here — say you just change anything to, I don't know, "blah" over here and you add the comment — this will completely break the notebook. If you apply the suggestion, it will become an invalid notebook, and that's not really a good experience for the user. So what we will do is disable suggestions for now for Jupyter notebooks.
We have ideas on how to fix this eventually, but for this iteration we will not do that; we will just completely disable code suggestions for Jupyter notebooks.
So what's next for this? Of course, after fixing up this bug, we will also do a cleanup of the code base — I had to add a lot of logging to figure out some additional bugs. But we also want to start working on richer diffs. Right now we have cleaner diffs, but we want to build a rich diff experience for notebooks.
For example, we had — this is not a really good image, but if this was a valid image — we want to display it along with the diff. Perhaps not linking, but highlighting: this part over here is not really useful, so it could be a little bit less visible, or grayish, for example. So: add a better rich experience for Jupyter notebooks within GitLab, now that the diff is already cleaner. And that's it for Jupyter notebooks — very exciting. It is live.
It is in the code base, still behind a feature flag, but it will soon be available for everyone. The second update is that we've begun an initial exploration beyond the Jupyter work.
There are many things we want to explore within the DevOps space, and one of them is how GitLab pipelines can help machine learning users. One of the ideas that we had was to try an experiment on hyperparameter tuning. Hyperparameter tuning is a process where you take a model that you already have, and this model usually has a lot of parameters for configuration.
So, for example, if I use a random forest classifier, I have the number of estimators that I can configure, the criterion, max depth, min samples, and a lot more, and these affect the outcome of the model. What we want to do is — the same way that you can optimize a machine learning model, you can optimize the algorithm configuration itself. But this process is very slow, because you have to iterate many times.
You find that it's a search problem. You can either do random search, or grid search — just try out all possibilities — or you can do a little bit smarter search with Bayesian approaches, for example. But the point is that it's a search problem: it takes a long time and it's very repetitive, so it's very useful to have this as a pipeline.
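The distinction above can be sketched in plain Python: grid search enumerates every combination, while random search samples from the same space. The parameter names and values here are illustrative, not taken from the demo.

```python
import itertools
import random

# Hypothetical hyperparameter space for a random forest.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [5, 10],
    "min_samples_split": [2, 10],
}

def grid_candidates(grid):
    """Grid search: enumerate every combination of parameter values."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def random_candidates(grid, n, seed=0):
    """Random search: sample n combinations from the same space."""
    rng = random.Random(seed)
    keys = list(grid)
    for _ in range(n):
        yield {k: rng.choice(grid[k]) for k in keys}

# 3 * 2 * 2 = 12 candidate configurations for the grid above.
candidates = list(grid_candidates(param_grid))
```

Each candidate is an independent training run, which is what makes the problem so repetitive — and so amenable to a pipeline.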
For now, I found out that it's impossible to run a loop in GitLab pipelines, so you cannot do the iterations as you normally would. You could do — not random search, but grid search, which is trying out everything. So you can use parent-child pipelines to try out every single possibility and get the results in the end, and this is what we're going to try first. So here is what I did so far.
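One way to realize the parent-child idea is GitLab's dynamic child pipeline pattern: a parent job generates a YAML file with one job per grid point, publishes it as an artifact, and a trigger job runs it as a child pipeline. The sketch below is an assumption about how that could look, not the demo's actual code; `train.py`, the parameter names, and the artifact paths are hypothetical.

```python
import itertools

# Hypothetical grid; the parent job would run this script and hand the
# generated YAML to a trigger job, which launches it as a child pipeline.
PARAM_GRID = {
    "n_estimators": [100, 300, 500],
    "max_depth": [5, 10],
}

def generate_child_pipeline(grid):
    """Emit GitLab CI YAML with one training job per grid point."""
    keys = list(grid)
    jobs = []
    for i, values in enumerate(itertools.product(*(grid[k] for k in keys))):
        params = dict(zip(keys, values))
        args = " ".join(f"--{k} {v}" for k, v in params.items())
        jobs.append(
            f"train_{i}:\n"
            f"  script:\n"
            f"    - python train.py {args}\n"
            f"  artifacts:\n"
            f"    paths: [scores/]\n"
        )
    return "\n".join(jobs)

if __name__ == "__main__":
    print(generate_child_pipeline(PARAM_GRID))
```

A final job in the parent pipeline could then collect the `scores/` artifacts from all child jobs and pick the best candidate.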
I created a sample dataset that has seven features, with an equation that depends on all seven. And then what I do: the feature dataset that I feed to this machine learning model has only four of those features, which means it doesn't capture the whole equation. So I can try to build a machine learning model that tries to predict the final value, and I did this: it's a random forest classifier.
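A minimal sketch of that kind of toy setup: seven random features, a label derived from an equation over all seven, and a training set that keeps only four of them, so the model can never see the full relationship. The equation and threshold here are stand-ins; the demo's actual dataset is not shown.

```python
import random

random.seed(42)

N = 1000
# Seven features, uniform in [0, 1).
rows = [[random.random() for _ in range(7)] for _ in range(N)]

def label(x):
    # Illustrative equation over all seven features (not the demo's).
    score = x[0] + 2 * x[1] - x[2] + 0.5 * x[3] - x[4] + x[5] * x[6]
    return 1 if score > 1.25 else 0

y = [label(x) for x in rows]
# Keep only the first four features: the model gets an incomplete picture.
X = [x[:4] for x in rows]
```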
With the default parameters it achieves an F1 score of 0.73. F1 score is a measure that balances how often the model makes right predictions and how often it forgets to make a right prediction.
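Concretely, F1 is the harmonic mean of precision (how often positive predictions are right) and recall (how many actual positives were found). A small self-contained version, for illustration — scikit-learn's `f1_score` computes the same thing for the binary case:

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)   # right among predicted positives
    recall = tp / (tp + fn)      # found among actual positives
    return 2 * precision * recall / (precision + recall)
```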
So here we start the parameter tuning. We create a grid search, and grid search means we're going to try out all the possibilities. So here I would have three times two times two combinations, which is 12 candidates. For each of the 12 candidates I split the data three times, into three folds, and do three trainings. So even with this small parameter space I have 36 trainings to be done — you can imagine how this explodes very easily.
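With scikit-learn, the setup described could look roughly like this — a 3 × 2 × 2 grid (12 candidates) with 3-fold cross-validation, giving 36 fits. The data and the exact grid values here are placeholders, not the demo's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data: four features, like the toy dataset above.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# 3 * 2 * 2 = 12 candidate configurations.
param_grid = {
    "n_estimators": [10, 20, 50],
    "max_depth": [5, 10],
    "min_samples_split": [2, 10],
}

# cv=3: each candidate is trained on three folds -> 12 * 3 = 36 fits.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="f1",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

`search.cv_results_` holds the per-candidate scores and timings that the plots below are built from.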
So we do the parameter training over here and we have the results, which are not so readable in this raw form, but we have some plots. For example, this plot over here shows the test score versus one of the parameters, max depth. Max depth of the tree is how many nodes — how many decisions — each of the trees can make, and you can see that the higher results, although there are some outliers, happen with a larger max depth.
So you would usually choose a larger max depth. Similar with max samples: you see that the scores here are on average larger than the scores over here. With min samples split, though, it doesn't seem to change a lot; it kind of looks okay-ish. And then you have this, which is a HiPlot — an interactive tool that makes it a little bit easier to explore the parameter space.
Here, each line is an experiment with different parameters. So you see, for this one over here — let me try to select this one — the best run found at that point, with a score of 0.738, has min samples split of 10, 500 estimators, max samples at the higher end, and a large max depth. You can see where each of the parameters falls within the space, and the combinations within the parameters themselves. So it's a pretty cool visualization to have.
Okay, so now I have everything — here I have all of the different runs, even the times, the scores and everything. So this is a toy project that we can use now for implementing the pipeline. The next steps that we have over here: now we have this somewhat decent code — very simple, very stupid, not a real problem, just a toy one — but it is already enough to test out what we want to do.
So the next step is to find out how to implement the search problem within this framework. And the cool thing about testing out this hyperparameter tuning is that it paves the way for AutoML.
You can think of even the choice of algorithm itself that you use for the machine learning model as a hyperparameter. So this hyperparameter tool could go one step further — one step up in the search — and that would be the one that identifies the best algorithm for the data you have. So this is what hyperparameter tuning is about, and it's a good step towards finding out whether you can do AutoML within GitLab or not, with GitLab pipelines. So that is very exciting.
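As a sketch of that idea in scikit-learn: the estimator in a pipeline step can itself be a grid dimension, so the same grid search chooses between algorithms. This is a generic illustration, not something shown in the demo.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# The 'clf' step is itself searched over: each grid entry swaps in a
# different algorithm with its own sub-grid of hyperparameters.
pipe = Pipeline([("clf", RandomForestClassifier())])
param_grid = [
    {"clf": [RandomForestClassifier(random_state=0)],
     "clf__n_estimators": [10, 50]},
    {"clf": [LogisticRegression(max_iter=1000)],
     "clf__C": [0.1, 1.0]},
]

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(type(search.best_params_["clf"]).__name__)
```

In the pipeline setting, each of these candidates would simply become one more child job, which is why hyperparameter tuning is a natural stepping stone to AutoML.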
So, next steps over here: transform this into pipelines, see if we can torture the offering — the product — enough that we can find something nice, and figure out what needs to be improved specifically for machine learning applications. I think that's all I had for today.