GitLab MLOps, 5 Jan 2022

From YouTube: SEG MLOps Update Jan 5th 2022 - Looking Forward

Description

Happy new year! In the first update of the year, we are taking the time to talk not about what we did, but about the things we are working on and where we want to get with them.

Update Issue: https://gitlab.com/gitlab-org/incubation-engineering/mlops/meta/-/issues/42

User Personas: https://gitlab.com/gitlab-org/incubation-engineering/mlops/meta/-/issues/40

Rendered Notebook Diffs: https://gitlab.com/groups/gitlab-org/-/epics/7194

Glyter: https://gitlab.com/gitlab-org/incubation-engineering/mlops/glyter/-/tree/poc

Knowledge Repo: https://github.com/airbnb/knowledge-repo

A

Hello, everyone, and happy new year. This is the first SEG update of the year, and since it's a new year and time for New Year's resolutions, we're going to do something different. We're not going to talk about what we did before or what we did last week, because we were all on holidays. Instead, we're going to talk about what we intend to do over the next few months.

A

So far we've been working on Jupyter and on pipelines, but I want to share a little bit about where we want to get with each of those things.

A

So, just a recap: the vision of the SEG MLOps is to make GitLab a tool that machine learning engineers and data scientists love to use. Not something they have to use or are forced to use, but something they advocate for within the company, something they like to be part of. Within that, the mission is to explore and collaborate with different teams to deliver features that improve the user experience and, along the way, also increase awareness within the company about this user base. That link over there goes to the MLOps handbook page.

A

If you want to check it out. Most of this will end up there at some point. To achieve this vision, right now there are four areas on my mind: first, user personas; then rendered code review for Jupyter notebooks; then Glyter; and then the analytics repository. I have already spoken a bit here about most of them, except the analytics repository, but I'll go over everything.

A

So, first of all, user personas.

A

It's no secret that data scientists and machine learning engineers were not considered target customers or target users for GitLab. We knew they used GitLab, but nothing was done for them, and that is reflected in the lack of a user persona and of user research in the area.

A

The impact of this is that when you go and talk to teams about specific use cases for data scientists, you always have to explain the data science use case itself: how it is different from a regular developer's, and how things need to be special here and there.

A

The skill sets that are generally used, the outcomes, and everything else. This becomes repetitive because you have to explain it over and over, and you are the one coming up with the ideas. If they know about the use case from the get-go, the other teams can come up with the ideas and the improvements themselves, not only me.

A

So I believe one of the main things we can do to increase impact in the long term for data scientists and machine learning engineers is to create the user personas. Not by ourselves, of course, but with the UX and user research departments: work with them, drive this creation, and kickstart the user research in this area, so that it's not just my word. Right now it's my word, it's me talking, and I want the users to talk.

A

I know what the users want. Well, some of it, because I was a user. I was a data scientist, I was there. I felt the pain points that I'm trying to solve, but the rest of the teams did not. So this is what we need to solve.

A

The second point that we want to work on is rendered code reviews for Jupyter notebooks. And then you ask me: wait, didn't you already deliver something for 14.5? And yes, we did.

A

We did deliver the cleaner code review experience: cleaner diffs, where we transform the notebook into Markdown, remove a lot of unnecessary things, and display that. It is already a big step in the right direction, but consider this scenario. Suppose that I have this notebook here. Nothing special about it: a couple of images with, I don't know, just random data.

A

This is what the diff looks like now. This is already much better than the earlier diff based on the JSON, but look at how much space this image takes. It becomes hard to find what can be commented on. So what do we want to do?

A

And then I come with a "what if" here. We can do better. What if, instead of displaying this Markdown, we actually had code reviews over the notebook itself, or rendered the blocks of the notebook within the diff? This is what I'm talking about. Now mind you, this is a very poor image edit that I made in Google Slides itself, but the idea here is that we render each block. We do diffs, but with a different diff algorithm.

A

It's a different diff, pardon the pun; data scientists might understand that. It is a diff first at the level of the cells, and then you diff what's in each cell, and then you can create the display.

A

But the point is that we are now rendering the blocks of what is in there. So when I'm reviewing the code, I'm not only reviewing the code, I'm reviewing the image itself. I can see the image over there.

A

I can discuss whether that's the best representation or not. I can discuss the code. I can discuss the questions. I can discuss the Markdown. I look at the notebook, not at the source file. Now, we know from users that they still want the raw diffs. There are still use cases where the raw diff is important, and we don't want to hide that, so we are also going to add the ability to toggle between the raw and the rendered diff.

A

But this is an idea of where we want to go: we want to be able to have the code review, to have the diff with rendered blocks, with Markdown rendered, with images rendered, with code cells rendered, because this is what is necessary to have a good code review on a Jupyter notebook. You cannot rely only on the code; you need the images themselves, because the images are the output, and they depend on each other.

A

So yeah, I'm very excited about this, and I'm very confident we can do it. It might take a while, but it's just something we need to do. To get there, these are the six items that I want to deliver in the next few months for the rendered Jupyter notebook diff, which is not a Markdown diff.

A

So: a diff algorithm specific to notebooks, render the images, render the Markdown blocks, render LaTeX, render the code, and the ability to toggle between the raw and the rendered diff. We're already working on a lot of this: we're already working on part of the toggling between raw and rendered diff, we're already working on the notebook diff, and then rendering becomes the next step.
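
As an illustration of the cell-first diff described above, here is a minimal sketch in Python. It is not the algorithm GitLab ships; it only assumes the standard notebook JSON layout (a `cells` list whose entries have `cell_type` and `source`), and the function names `cell_source` and `diff_notebooks` are hypothetical. It first aligns whole cells and only then produces a line diff inside the cells that changed.

```python
import difflib


def cell_source(cell):
    """Return a cell's source as one string (the JSON stores a str or a list of str)."""
    src = cell.get("source", "")
    return "".join(src) if isinstance(src, list) else src


def diff_notebooks(old_nb, new_nb):
    """Diff two notebooks: align cells first, then diff lines inside changed cells."""
    old_cells = [(c["cell_type"], cell_source(c)) for c in old_nb["cells"]]
    new_cells = [(c["cell_type"], cell_source(c)) for c in new_nb["cells"]]
    matcher = difflib.SequenceMatcher(a=old_cells, b=new_cells, autojunk=False)

    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # Same number of cells on both sides: pair them up and diff inside each.
            pairs = zip(old_cells[i1:i2], new_cells[j1:j2])
            for (_, old_src), (new_type, new_src) in pairs:
                lines = list(difflib.unified_diff(
                    old_src.splitlines(), new_src.splitlines(), lineterm="", n=0))
                changes.append({"op": "modified", "cell_type": new_type, "lines": lines})
        else:
            # Otherwise report whole cells as removed and added.
            for cell_type, src in old_cells[i1:i2]:
                changes.append({"op": "removed", "cell_type": cell_type, "source": src})
            for cell_type, src in new_cells[j1:j2]:
                changes.append({"op": "added", "cell_type": cell_type, "source": src})
    return changes
```

A rendered view could then draw each changed cell as a block (Markdown, code, or image output) instead of showing raw JSON lines, while the `lines` entries remain available for a raw view.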

A

So that was rendered notebook diffs, and now we go into the third item, which is Glyter. Glyter is a library we created, a very simple one for now, more of a proof of concept. Let me open it up here. It picks up a Jupyter notebook, for example this notebook over here.

A

It has a couple of steps, and Glyter runs this notebook as a pipeline, so each one of the notebook's steps becomes a pipeline step. If I go over to the CI/CD pipelines, I can see on this one over here that there was one step for each one of the steps in the original notebook.
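
To make the idea of "one pipeline step per notebook step" concrete, here is a minimal sketch in Python. It is not Glyter's actual implementation, which the talk does not show. It assumes that every code cell counts as one step, the names `notebook_to_pipeline` and `step_N` are hypothetical, and passing state between steps is deliberately left out. It uses only the GitLab CI keywords `image`, `stages`, `stage`, and `script`.

```python
import json


def notebook_to_pipeline(notebook, image="python:3.10"):
    """Turn each code cell into one CI job, each in its own stage, in notebook order.

    Returns the pipeline configuration as a dict, plus a mapping from script
    file name to cell source that the jobs would run.
    """
    code_cells = [c for c in notebook["cells"] if c["cell_type"] == "code"]
    config = {"image": image, "stages": []}
    scripts = {}
    for n, cell in enumerate(code_cells, start=1):
        name = f"step_{n}"  # hypothetical naming scheme
        src = cell["source"]
        scripts[f"{name}.py"] = "".join(src) if isinstance(src, list) else src
        config["stages"].append(name)
        config[name] = {"stage": name, "script": [f"python {name}.py"]}
    return config, scripts


def render_ci_config(config):
    """Serialize the configuration as JSON, which is also valid YAML."""
    return json.dumps(config, indent=2)
```

Because stages run in the order they are listed, the jobs would execute in the same order as the cells. A real tool would also have to carry variables and files from one step to the next, for example through job artifacts.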

A

And we can do better, of course. This was a proof of concept; if we couldn't do better, it would be a failed proof of concept. But how do we want to make it better?

A

The key thing that we need to keep in mind is that training models takes a lot of time and a lot of resources, sometimes resources that the data scientists don't have on their machine, or data they cannot access from their machine, for many reasons. Pipelines for MLOps are not only part of the Verify stage: they are that as well, but they are also part of the Create stage. And at the Create stage you cannot expect a data scientist to keep creating a commit every time they want to change the code.

A

They are prototyping. So imagine that you are prototyping your model and every time you change something you have to commit so that it goes up and triggers a pipeline. That doesn't make a lot of sense. But at the same time, right now all pipelines run with files that are in the repository. Even for my Glyter pipeline to run, the notebook has to be in the repository. But what if we could run an arbitrary notebook on a pre-configured repository? Think about it.

A

You have one repository shared across the company. Its only goal is to have pre-configured runners and a way for anybody, or for the data scientists, to run their pipelines on that repository. So this is what I'm talking about.

A

I just pass the repository where this needs to be run and a notebook that is on my machine. It's not in a repository. It's not committed; it's probably just staged. I just changed a line or something, I don't know. But take the thing that we're building with Glyter right now, transforming a notebook into GitLab CI, and do it with an arbitrary notebook. I think we can do that without any changes to the GitLab codebase, just a lot of creative engineering.

A

So my strategy for tackling this, for trying to make an MVP, is to rely on the GitLab API to upload a notebook into the repository and then trigger a pipeline that is already configured, passing it the file path. The parent pipeline downloads that Jupyter file, transforms it into a GitLab CI configuration, and runs it as a child pipeline.
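
The two API calls in this strategy could be sketched as follows, in Python with only the standard library. This is an assumption-laden illustration, not the MVP itself: it uses GitLab's REST endpoints for creating a repository file and for triggering a pipeline with a trigger token, while the host name, the project id, the `incoming/` path, and the `NOTEBOOK_PATH` variable are all hypothetical. The functions only build the requests, so nothing is sent until they are passed to `urllib.request.urlopen`.

```python
import json
import urllib.parse
import urllib.request


def build_upload_request(base_url, project_id, token, remote_path, notebook_text, branch="main"):
    """Build the request that creates a new file in the shared repository."""
    encoded_path = urllib.parse.quote(remote_path, safe="")  # the file path must be URL-encoded
    url = f"{base_url}/api/v4/projects/{project_id}/repository/files/{encoded_path}"
    body = json.dumps({
        "branch": branch,
        "content": notebook_text,
        "commit_message": f"Upload {remote_path}",
    }).encode("utf-8")
    headers = {"PRIVATE-TOKEN": token, "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, method="POST", headers=headers)


def build_trigger_request(base_url, project_id, trigger_token, remote_path, ref="main"):
    """Build the request that triggers the pre-configured parent pipeline."""
    url = f"{base_url}/api/v4/projects/{project_id}/trigger/pipeline"
    form = urllib.parse.urlencode({
        "token": trigger_token,
        "ref": ref,
        # NOTEBOOK_PATH is a hypothetical variable the parent pipeline would read.
        "variables[NOTEBOOK_PATH]": remote_path,
    }).encode("ascii")
    return urllib.request.Request(url, data=form, method="POST")
```

The parent pipeline would then read the path from the variable, convert the notebook into a CI configuration, and start that configuration as a child pipeline.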

A

It's not pretty. Our goal is not to be pretty. Our goal is to make it work, so that later we can think about how to make it right. So, first step, make it run; second step, make it right.

A

So that is a lot of creative engineering, which is a lot of fun. I'm very excited about trying this out, and I think it will work. If it doesn't, I will just add more creative engineering to it. But the key problem right now is how to make it run with an arbitrary notebook, not relying on a file that's in a commit. So yeah, let's try it out.

A

Another thing on this that I haven't mentioned, one thing that is on my mind for pipelines, is the concept of checkpoints, especially for machine learning engineers. Suppose that you are training a model: downloading the data runs fine, training runs fine, and something fails when uploading to the model registry.

A

Or it fails, I don't know, when testing or computing the metrics, and I would have to run everything again. It would be really interesting if I could run a new pipeline based on the output state of a previous pipeline. So that's something that is on my mind. I don't think I'll be exploring it soon, but it has a lot of benefits, not only for data scientists but for our entire CI user base, and it's just there all the time.

A

Last, and a little bit least, the analytics repository. When a company grows, so does the number of data scientists, and when the number of data scientists grows, they usually split into teams. They create a lot of knowledge, and it's impossible to consume all the knowledge that is created. We usually talk here about data scientists doing machine learning, but I would say that's not 50% of the job of data scientists. More often than not, they're not doing machine learning.

A

They are not creating machine learning products. They are using data to create intelligence, to help the business make better decisions: creating metrics, testing metrics, creating cohorts for studies, and things like that. Those analyses are usually developed in a Jupyter notebook or in R Markdown, and then a report is created in Google Docs or something, which they share with the stakeholders.

A

But the point is that when that Jupyter notebook is completed and the report is created, the Jupyter notebook or the R Markdown file is pushed to a GitLab repo or something, and then it's completely forgotten forever.

A

When I was a data scientist at a previous company, we were about 200 data scientists, and we joked that we would never run out of work, because every two years we would just do the same work again: everybody forgot that the work, or the analysis, was done in the first place.

A

I tackled this within that organization by deploying an analytics repository, which is basically a wiki, but for data scientists, where you would push Jupyter notebooks or R Markdown files into the tool. It would make them pretty and add the ability to comment, search, display, and share. It was really well received by my colleagues, and it really helped with this idea of: okay, it is somewhere, and I can go search for it every time I start a ticket or a new analysis.

A

The first thing I'm going to do is just go into this repository and search what's in there, the same way you search for libraries when you are building new software or creating new infrastructure. You're not going to implement everything on your own; the first thing you do is see what's out there. And this solved that problem. It was heavily based on... well.

A

It was built on top of Airbnb's Knowledge Repo, which is a repository that has been a bit dead for a while, but it does implement most of the necessary things. What we want to do here is not to deploy a Knowledge Repo instance. What I want to do is test something out.

A

I already did a POC of this a couple of years ago: use GitLab Pages to read all the notebooks that are within a repository, create a SQLite database or something, index all of them, and create a page that makes them pretty, with an index and some of this type of functionality to search and discover, with tags and things like that.
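
The indexing step of that idea could look roughly like the following Python sketch. It is not the POC mentioned in the talk. It assumes notebooks are `.ipynb` JSON files, that the title is the first Markdown heading, and that tags live in a notebook-level `metadata.tags` list; that tag location, the table layout, and the function names are hypothetical. Search uses a plain `LIKE` query to keep the example dependency-free.

```python
import json
import sqlite3
from pathlib import Path


def _cell_text(cell):
    """Return a cell's source as one string (the JSON stores a str or a list of str)."""
    src = cell.get("source", "")
    return "".join(src) if isinstance(src, list) else src


def _title(cells, fallback):
    """Use the first Markdown heading as the title, else fall back to the file name."""
    for cell in cells:
        if cell.get("cell_type") == "markdown":
            for line in _cell_text(cell).splitlines():
                if line.startswith("#"):
                    return line.lstrip("#").strip()
    return fallback


def index_notebooks(root, db_path=":memory:"):
    """Read every notebook under root and store path, title, tags, and text in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notebooks "
        "(path TEXT PRIMARY KEY, title TEXT, tags TEXT, body TEXT)")
    for nb_path in sorted(Path(root).rglob("*.ipynb")):
        nb = json.loads(nb_path.read_text(encoding="utf-8"))
        cells = nb.get("cells", [])
        tags = nb.get("metadata", {}).get("tags", [])  # assumed location for tags
        conn.execute(
            "INSERT OR REPLACE INTO notebooks VALUES (?, ?, ?, ?)",
            (nb_path.relative_to(root).as_posix(),
             _title(cells, nb_path.stem),
             ",".join(tags),
             "\n".join(_cell_text(c) for c in cells)))
    conn.commit()
    return conn


def search(conn, term):
    """Return (path, title) for notebooks whose title, tags, or text contain the term."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT path, title FROM notebooks "
        "WHERE title LIKE ? OR tags LIKE ? OR body LIKE ? ORDER BY path",
        (pattern, pattern, pattern))
    return rows.fetchall()
```

A CI job could run `index_notebooks` over the repository and publish the resulting database, together with rendered pages, as a static site.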

A

So, out of the four items, this is the most...

A

How can I say? Uncertain of them. This is the one that I have planned the least so far, at least within the GitLab ecosystem, and I'm not sure if this is going to work, or how it is going to work, or if it's just going to be part of Glyter, something with which we can make life easier for others. But, well, it's something I want to work on in the near future.

A

So that was it. The four things I want to work on, or will work on, are user personas for the data science use case, and rendered diffs for Jupyter notebooks. Not cleaner diffs: right now we have cleaner, and we want to go to rendered diffs for notebooks.

A

We want to build a way for users to run arbitrary Jupyter notebooks as pipelines. And fourth, the analytics repository. I hope this excites you.

A

It makes me really excited about what's to come. If you have feedback, or if you want to see something else explored, we have this issue over here with our weekly updates. Feel free to drop by over there and leave a comment or something, and I will answer. That's it for today. Thanks, all, for sticking with me. Bye.