From YouTube: Pipelines for MLOps Use Cases: why does it become CREATE? [IncEng MLOps - March 30th 2022]
Description
This week we talk more about pipelines, what we have worked on so far and why they are so important for MLOps.
Pipelines with Stubbed Jobs: https://gitlab.com/groups/gitlab-org/-/epics/7681
Citer: https://gitlab.com/gitlab-org/incubation-engineering/mlops/poc/citer
Glyter: https://gitlab.com/gitlab-org/incubation-engineering/mlops/glyter/-/tree/poc/glyter
All updates: https://gitlab.com/gitlab-org/incubation-engineering/mlops/meta/-/issues/16
This Update: https://gitlab.com/gitlab-org/incubation-engineering/mlops/meta/-/issues/48
Hello everyone, and welcome to another update for Incubation Engineering at GitLab for MLOps. My name is Eduardo, and today we're going to talk about pipelines as part of the MLOps Create stage. I've been touching on this a little bit in my past few updates, on some of the efforts that I have already done or some that I want to do, but I have never taken the time to talk about it from a higher, bird's-eye view.
So today I want to talk about the vision that I have for this. Where do we want to get? Why is this important for MLOps? How do data scientists create, and how does the pipelines feature fit into this Create process? This is going to be a longer update than usual, but hopefully it's going to be an interesting one.
So, for those that don't know: at GitLab we split DevOps into a lot of different stages, so you have Create, Plan, Verify, and so on. If you want to learn more about the stages, you can follow the link on the previous slide, but today I'm going to talk specifically about how data scientists create.
First and foremost, these are the tools of the trade: data scientists mostly use Python, R, and Jupyter to create their things, along with whatever IDE they use for Python and R scripts, like RStudio, VS Code, and whatnot. But Python, R, and Jupyter are almost the de facto tools of the trade. And something very interesting about how data scientists create in Jupyter is that it's very iterative by default. You have one cell that you edit, then you run the next cell, then you go back to the previous cell, and you go forward and backward and forward again. From this you can already see how pipelines are taking shape. Think about it: each of these steps, each of these cells, is a job.
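To make that mapping concrete, here is a toy notebook session written as plain Python, with a comment marking the pipeline job each cell would naturally become; the computations themselves are just placeholders:

```python
# Illustration only: three notebook "cells", each a natural pipeline job.

# cell 1: prepare the data (would become a "prepare" job)
data = [x / 10 for x in range(100)]

# cell 2: fit a toy model (would become a "train" job; in practice the slow one)
mean = sum(data) / len(data)

# cell 3: evaluate (would become an "evaluate" job); while iterating you hop
# back and forth between these cells, re-running them out of order
error = sum((x - mean) ** 2 for x in data) / len(data)
print(error)
```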
So you can already see that this is shaping its way towards a pipeline. But not only that: even when we don't use Jupyter, it's very common that we need remote code execution. So, for example, I need to run something long; I need to train a model, but training a model can take eight hours, or a lot of time, and I want to run multiple trainings at the same time.
So I can delegate this to a cluster or something like that, and this is part of Create. I'm not talking here about CI/CD, I'm talking about creation. This sometimes happens when I don't even have the code on Git yet; it's still living on my computer, in my local development setup. Take Ray, for example: Ray is a distributed solution, a company that implements distributed computing solutions for data science. You can see they already have, for example, ray.remote, which says that a function should execute on a specific box or a remote instance.
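For context, this is roughly what that looks like with Ray's actual API; the training body and the parameter grid here are placeholders:

```python
import ray

ray.init()  # connect to a local or remote Ray cluster

@ray.remote
def train(params):
    # the long-running training work happens on a Ray worker, not locally
    return {"params": params, "score": sum(params.values())}

# dispatch several trainings in parallel, then gather the results
param_grid = [{"lr": lr} for lr in (0.1, 0.01, 0.001)]
futures = [train.remote(p) for p in param_grid]
results = ray.get(futures)
print(results)
```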
Same thing with Kubeflow Pipelines, which is also commonly used by data scientists to implement these pipelines: you build a DAG to run your data science workflows. And again, this is on the Create side; we're not talking about Verify's CI/CD.
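As a sketch of what "building a DAG" means in practice, here is a minimal Kubeflow Pipelines definition using the current KFP SDK; the component bodies are placeholders, and the exact decorator syntax has shifted across KFP versions:

```python
from kfp import dsl

@dsl.component
def fetch_data() -> str:
    # placeholder: would pull the real training data
    return "s3://bucket/dataset"

@dsl.component
def train(data: str) -> str:
    # placeholder: would train and persist a model
    return f"model trained on {data}"

@dsl.pipeline(name="toy-training-pipeline")
def pipeline():
    # the DAG: train depends on fetch_data through its output
    data_task = fetch_data()
    train(data=data_task.output)
```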
So you can see how often we have to train a model right from the Create stage, and training a model is not cheap. It's not fast. It's often not able to run on your own machine; you have to offload it to a more powerful machine that has a better GPU or whatnot. The same goes for model debugging. For example, if I want to debug some bias/variance issues with my model, I can plot graphs like this, but note that each point on this graph is a trained model in itself.
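As an illustration of that last point (not from the talk): a scikit-learn validation curve, a standard way to debug bias/variance, trains one model per parameter value per cross-validation fold:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
param_range = np.logspace(-3, 2, 6)

# 6 parameter values x 5 CV folds = 30 trained models behind one plot
train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="C", param_range=param_range, cv=5
)
print(train_scores.shape, test_scores.shape)  # (6, 5) (6, 5)
```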
And just to show a little bit more how striking this is: these are all the offerings from Ray, from their website. They have all of these great libraries, all of these great products, and they live right on Create; all the distributed libraries that they have are already about Create. So yeah, it's just essential for MLOps that pipelines be part of Create, that they be seen from the Create perspective, not only later down the line. And in this space there are very big user pain points that have not been addressed. First of all, it's a fragmented market.
You have a lot of open source tools, where each one does something small, or doesn't do it very well, or whatever, but they all require you to learn that specific tool, they all require you to maintain that specific tool, and they don't always have the best UI. It's just really complicated for data scientists to come in and choose or bring in a tool in this area, because they are also not part of the infrastructure team. Often you have the infrastructure team and you have the data science team, and communication is not always the best between them at some companies, so it's really hard for the data scientists to get the tooling that they want. I think we can fill in this space. I think we have a lot of opportunity here for GitLab to grow into this area: we are already part of the stack at many of these companies, so there's no additional cost of maintaining yet another tool in the stack, and we have great documentation.
The UI and the use cases were created with pipelines as part of CI/CD, and what we need to do is expand these pipelines to be part of both Verify and the Create stage. When we take off the glasses that only look at this from Verify and start looking from the Create stage, then we can start seeing what other features we can add and how we can improve this tooling for this use case specifically. So, for example, one of the things we can build on here: we have the whole infrastructure of GitLab Runners already implemented.
The DevOps team at the company, at the users' company, already knows how to create a box or a runner or anything. We can use these GitLab Runners as a backend for remote code execution.
It also means that, while I'm iterating, and imagine that some jobs can take six, seven, eight hours, I'm not going to be committing every change that I make just so that it gets published or starts a pipeline. Wait for a git commit before anything runs? No, it needs to be faster than that.
So I have already created a pipeline before; I already have my steps, I already have my stages over there. Why can't I just port that pipeline to Verify, to the CI/CD part?
So when I talk about runners, I talk about GitLab as a backend: when you say, for example, ray.remote, that call goes to a GitLab Runner. Or, in a similar model, suppose that we had a gitlab.remote, for example, that calls a runner that runs the job and returns the answer to the local machine, so that it collects all the answers together. Or, if I'm working on a Jupyter notebook, it could be that each cell runs remotely on a runner.
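To be clear, gitlab.remote is purely hypothetical; nothing like it exists today. Here is a toy sketch of the ergonomics being described, with a local thread pool standing in for the runner fleet so the snippet actually runs:

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor()  # stand-in for a fleet of GitLab Runners

def remote(fn):
    """Hypothetical gitlab.remote-style decorator."""
    class RemoteFn:
        def remote(self, *args, **kwargs):
            # The real version would package fn and its arguments, ship
            # them to a GitLab Runner, and hand back a job handle.
            return _pool.submit(fn, *args, **kwargs)
    return RemoteFn()

@remote
def train(epochs):
    return f"model trained for {epochs} epochs"

future = train.remote(8)   # dispatched; the local session stays responsive
print(future.result())     # the answer comes back to the local machine
```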
Even though the notebook document itself is served locally, the code is executed on a different runner, one that is a lot more powerful, that has a GPU, that has however much memory you need. That is what I mean. But the problem here is that pipeline execution right now is heavily coupled to CI and to the Git flow.
You can't really do this kind of process without going through the Git flow, without going through git commit, git push, start a pipeline, because you need to push code changes as well. So you'd be running pipelines where the code is not on the server; it's not in a Git repository yet, it's still on your local machine, it still only exists there. It's not in the repository yet. So the current setup of GitLab doesn't work for this; it's not prepared for this.
It was never coded with this use case in mind. What I did so far is Citer. Citer is a POC that I created a few weeks ago: you have your local code, I created a pre-configured project on GitLab, and I used a lot of creative engineering to send the code to it through a triggered dynamic pipeline that runs on a runner. My local process keeps watching the runner, and once the runner is done, it returns the results.
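A minimal sketch of that watch-and-return loop using the python-gitlab client; the project ID and token are placeholders, and Citer's actual mechanics for shipping the local code up are more involved than this:

```python
import time
import gitlab

gl = gitlab.Gitlab("https://gitlab.com", private_token="YOUR_TOKEN")
project = gl.projects.get("YOUR_PROJECT_ID")

# Kick off a pipeline on the pre-configured project; in Citer the local
# code has already been sent up by this point.
pipeline = project.pipelines.create({"ref": "main"})

# The local process keeps watching the runner until it finishes.
while pipeline.status in ("created", "pending", "running"):
    time.sleep(10)
    pipeline.refresh()

print("pipeline finished with status:", pipeline.status)
```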
So, from the perspective of the developer who is working on the code, everything happens on their machine. They don't need to open GitLab for anything; everything stays local. It's still very limited, very hacky, but it shows a little bit of how we can do this. Citer could easily be evolved into, for example, a VS Code plugin, so that we can run pipelines directly against the repository, or on GitLab.
The second problem: imagine that I have this pipeline with five steps. I'm testing them out, or not even testing, I'm actually working on them, and job D fails. Now suppose that A is preparing the environment, B is fetching the data, C is training the model, and D is uploading the model, and suppose that the first three steps take 24 hours to finish, which is not uncommon at all.
Rerunning entire pipelines for MLOps is really, really costly, and one way to avoid this is with the concept of stubbed jobs, which is: OK, if I'm going to rerun from D, let's find the last successful run of C and use the output of that pipeline, so that I can run the new D on top of it.
I have an epic for this over there, where I'm gathering not support but feedback: how we could approach this problem, how we could work towards minimum viable products for this. This is really game-changing for MLOps, and for many other use cases too, but for MLOps it is very important.
A
It
allows
teams
to
focus
on
iterating
on
only
one
step
of
the
time
that
one
step
at
a
time
and
reuse.
What
was
done
before
so
cube
flow
implements.
This
with
cache
plumber
implements
this
as
well
a
lot
of
the
the
the
pipeline
tooling.
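For intuition, here is a minimal sketch of the stubbed-jobs/caching idea, not GitLab's or Kubeflow's implementation: fingerprint a job by its name, inputs, and code, and reuse the output of the last successful identical run instead of re-executing it:

```python
import hashlib
import json

def job_fingerprint(job_name, inputs, script):
    """Identity of a job run: its name, its inputs, and its code."""
    payload = json.dumps(
        {"job": job_name, "inputs": inputs, "script": script}, sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def run_or_stub(job_name, inputs, script, cache, execute):
    key = job_fingerprint(job_name, inputs, script)
    if key in cache:              # an identical run already succeeded
        return cache[key]         # stub the job: reuse its output
    output = execute(inputs)      # otherwise actually run it
    cache[key] = output
    return output

# usage: rerunning "train" with unchanged inputs/code skips the 24 hours
cache = {}
run_or_stub("train", {"data": "v1"}, "train.py", cache, lambda i: "model-1")
run_or_stub("train", {"data": "v1"}, "train.py", cache, lambda i: "model-1")
```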
Why couldn't I just have something that automatically takes the pipeline I have already created, translates it into the YAML that GitLab is going to read, and runs that for me? No need to transform anything by hand; it works by default and speaks the language of the user. That is what Glyter is. It's another POC that I have; it currently transforms a Jupyter notebook into a GitLab pipeline. It's also very limited, but it showcases a little bit of the experience that we can offer users on the Create side.
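A minimal sketch of that translation, not the actual Glyter code: split a notebook into one script per code cell, then emit a .gitlab-ci.yml that runs each cell as one job, in cell order:

```python
import json
import yaml  # PyYAML

def notebook_to_gitlab_ci(notebook_path):
    with open(notebook_path) as f:
        nb = json.load(f)

    ci = {"stages": []}
    cells = [c for c in nb["cells"] if c["cell_type"] == "code"]
    for i, cell in enumerate(cells):
        script_name = f"cell_{i}.py"
        with open(script_name, "w") as f:
            f.write("".join(cell["source"]))   # one script per code cell
        stage = f"cell-{i}"
        ci["stages"].append(stage)
        ci[f"run-cell-{i}"] = {                # one CI job per cell
            "stage": stage,                    # stages preserve cell order
            "image": "python:3.10",
            "script": [f"python {script_name}"],
        }

    with open(".gitlab-ci.yml", "w") as f:
        yaml.safe_dump(ci, f, sort_keys=False)
```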
You cannot talk about MLOps and not talk about pipelines in the Create stage. It's just not possible; you need pipelines in the Create stage. It's just so important from the beginning of working on an ML model, of working on a data science use case, starting from DataOps; it's already in there. So we need to step back a little bit and put on these new goggles for our pipelines, looking at them from the Create stage: what are the pain points when creating a pipeline?
Because when you use it for Verify, you configure the pipeline once and then it runs; it keeps running for all the commits or whatnot. It's a lot more reusable than in the Create stage, where you're always creating a new pipeline: you start a new piece of code, you start a new pipeline.
A
So
just
to
summarize
what
I've
been
working
on
and
what
I
want
to
work
on.
Citer
is
about
decoupling
running
a
pipeline
from
ci.
It's
about
the
coupling
running
up
or
the
the
the
running
the
pipeline
from
git
flow.
It's
about
running
a
pipeline
that
is
not
there
where
the
code
is
not
even
not
even
the
configuration
or
the
code
or
the
necessary
files
are
even
on
the
repository
yet
so
I
need
to
upload
these
files
or
it
downloads
from
somewhere
else
running
pipelines
with
stub
jobs
is
about
making
it.
It's
about.
Running pipelines with stubbed jobs is about reusing; it's about optimizing time and resource usage for our users, making sure a job only runs when it needs to run while iterating. And Glyter is about making it that simple to transform a pipeline that you already had in the Create stage into a CI configuration that you can run in the Verify stage. And that's what I had for today. Thanks for sticking with me, have a good one. Bye.