From YouTube: GitLab Experiment Tracking - January 2023 Overview
Description
Epic: https://gitlab.com/groups/gitlab-org/-/epics/9341
Feedback Issue: https://gitlab.com/gitlab-org/gitlab/-/issues/381660
All Updates: https://gitlab.com/gitlab-org/incubation-engineering/mlops/meta/-/issues/16
Hello, my name is Eduardo. I'm an incubation engineer for MLOps here at GitLab, and I've been working for a little bit on machine learning experiment tracking, incubating this feature in GitLab. Today I want to talk a little bit, not really a demo, but an overview of this project: what is the reason, what is it that I'm doing, why am I doing it, how am I doing it, and what are the next steps. To get started with machine learning experiment tracking, I'm going with the assumption that we don't know what it is.
To create a machine learning model, you often need three pieces of information. One is the code that generated the machine learning model. Second is the data that was passed through the code. And three, the hyperparameters, so the configuration of the code, the data, and the environment. With these three components, we have a machine learning model.
Now, when a data scientist is working, something that happens often is that we don't know what the best hyperparameters are for the specific code and data. So what we do is something called hyperparameter tuning, where we try a bunch of different hyperparameter sets on the same code and the same data. Each hyperparameter set will generate a specific model candidate, and then we compare which model candidate performs best and go on with it.
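The tuning loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not GitLab or MLflow code: the hyperparameter grid, the `train_and_score` function, and its scoring formula are all made up for the example.

```python
from itertools import product

# Hypothetical hyperparameter grid: every combination is one candidate.
grid = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [2, 4],
}

def train_and_score(params):
    """Stand-in for training: same code, same data, different params.

    Returns a fake validation score so the loop is runnable.
    """
    return 1.0 - params["learning_rate"] * params["max_depth"]

# Each hyperparameter set produces one model candidate.
candidates = []
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    candidates.append({"params": params, "score": train_and_score(params)})

# Compare the candidates and keep the best-performing one.
best = max(candidates, key=lambda c: c["score"])
print(best["params"])  # → {'learning_rate': 0.01, 'max_depth': 2}
```

With real training code, each iteration of that loop is exactly one "model candidate" in the sense above, and the collection of all of them is one experiment.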
A
An
experiment
in
this
case
is
a
collection
of
very
different
model
candidates
that
are
measured
according
to
that
they
are
comparable.
So,
for
example,
they
are
comparable
based
on
the
model
metrics
that
they
perform.
There
are
collection
of
model
of
candidates
trained
on
different
code
on
different
data,
on
different
hyper
parameter
sets,
but
that
are
comparable
in
some
way
or
another
and
experiment
tracking
is
a
registry
where
you
save
all
of
these
experiments
right.
So
this
is
what
we're
trying
to
build
over
here.
How are we implementing experiment tracking in GitLab? The largest player over here, the largest open source player, is MLflow. MLflow has a really good client, the part of the code that the data scientist writes to save information into MLflow. It has a lot of features, it has a large user base, and it is open source.
It has a large number of integrations, other tooling that integrates with it. But it doesn't really provide support for authentication or any of the corporate expectations of a tool to be used: authentication, user management, all this kind of stuff. In its setup, in its deployment, it's another tool that you're deploying internally.
So you need a platform engineer who will work on publishing this, either on Kubernetes or whatever, and setting up a way to store the artifacts and everything else. Another issue is that it's siloed knowledge. Sure, you can access the information that is in MLflow through an API, but that's extra effort you need to put in. So the knowledge, the information, stays a little bit siloed in it.
We have a few North Stars here that we want to work toward. One: we want to create an experience where there's zero setup needed; if you have GitLab, it will work for anyone that has GitLab. Two: for the data scientist that already has MLflow, or wants to use MLflow, minimal to zero code changes. Their code should just work, either on GitLab or MLflow; it doesn't really matter.
Third: leverage the GitLab platform. We can build a feature, but we want to go beyond that. We want to use this feature to inform the other stages of the DevOps lifecycle, and to use information from the rest of the DevOps lifecycle to improve the feature itself. So it's not something apart; it's part of the GitLab experience, it's part of the DevOps experience for the data scientists. And now, more on the technical side.
So how are we approaching this? How are we making sure that we're following these North Stars? MLflow is composed of two components, for this case. One, it has a client, which is the code that goes into the data scientist's code base, where they write: okay, I want to run this experiment and log this data. And there's the backend, which is where the information will go.
What we're doing here is replacing the backend with GitLab. By just switching the URI on the MLflow client, in the code that they are using or in the environment variable, so that instead of pointing to MLflow it points to GitLab, it just works. This is where we want to be.
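That switch, in practice, is just a configuration change. A minimal sketch follows; the instance URL, project ID, and token are placeholders, and the exact endpoint path and token variable are assumptions based on GitLab's experiment-tracking documentation, so verify them against the docs for your GitLab version.

```python
import os

# Hypothetical values for illustration: your GitLab instance, the numeric
# ID of the project that should hold the experiments, and an access token.
gitlab_url = "https://gitlab.example.com"
project_id = "12345"

# Point the stock MLflow client at GitLab instead of an MLflow server.
os.environ["MLFLOW_TRACKING_URI"] = (
    f"{gitlab_url}/api/v4/projects/{project_id}/ml/mlflow"
)
os.environ["MLFLOW_TRACKING_TOKEN"] = "<your-access-token>"

# The data scientist's existing training code stays unchanged: the MLflow
# client reads these variables, so its logging calls now write to GitLab.
print(os.environ["MLFLOW_TRACKING_URI"])
```

Because only environment variables change, the same training script runs against either backend, which is exactly the "minimal to zero code changes" North Star.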
We are creating a drop-in replacement for the MLflow backend within GitLab. The positive side is that, one, if the user has GitLab in their organization, or if they use GitLab.com or whatever, they automatically have experiment tracking. They don't need to set up a different service; they don't need to set up anything. It just works. Second, it's easier to integrate across the platform on our side, so I don't need to be calling an API to fetch information.
There is zero additional setup necessary to deploy; like I said, it just works. There's no platform engineer's help needed, no setting anything up. We have authentication by default here: by leveraging project and group user management, we already give data scientists authentication and user management automatically.
Where are we on this? I have a video over here that shows a little bit of an example. On the right side, I have the training code; on the left side, I have GitLab working. I'm not going to go through this video; I have the link over there, and I'll link it below as well. So at this point, we can already use GitLab as a drop-in for the MLflow backend. This is already possible.
We have already implemented the necessary endpoints and the minimum UI necessary for this to work. It already saves the artifacts in GitLab itself: if you log artifacts with MLflow, they will be saved on GitLab using the package registry.
We are currently dogfooding the MVP, so our data scientists are testing this out, pointing out the failures, requesting features or improvements, and I'm working to iterate on those. And what are