From YouTube: Applied ML weekly team meeting - Sep 23 2021
A
All right, so let's see where to start in the Applied ML weekly team meeting. The first thing: what are the open questions on our new approach for milestone two? For me, this is about which decisions are one-way doors (hard to undo, hard to go back on) and where we may not have buy-in from the stakeholders yet. I've got the issue here, and we've had a really lively debate in the issue. What are folks' thoughts? Alexander, looks like you have the first question.
B
So that's the main question, because I think the other things are consistent with the old version of milestone 2, except maybe for the Reviewer Roulette replacement, because we decided not to replace it.
A
Yeah, so that's definitely an easy one to answer for me. Milestone two is selected customers, including ourselves. It is definitely not all customers; even if we had the ability to do that today, I don't think we would. We'd want to try it out for a couple, learn from it, get feedback, etc.
A
One of the big questions (and I know, Alexander, you and I discussed this earlier this week; I think it's in an issue as well) is, depending on the approach, when we enable on a per-customer basis: is it a separate environment that's created to do the ML workloads for that customer, or is it the same shared environment with the data partitioned by customer? And how much manual effort is it to set up the environments if it's a separate one per customer?
B
So it means that we have a predefined CI template to make recommendations and generate the artifacts JSON file. Then we can parse this file, so that's fine, right? We can use either the UI integration or bot integration. But things like data extraction, model tuning, and training, I think they should go outside of customer CI jobs, so we can orchestrate them somehow, for instance using Airflow or some similar tool, because we already have this pipeline, like these stages. We only need to run them.
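As a rough illustration of the orchestration being described here, a minimal Airflow DAG sketch; the task names and bodies are placeholders, not the team's actual pipeline:

```python
# Hypothetical Airflow DAG sketching the "extract -> tune/train" stages
# run outside customer CI jobs. All task bodies are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_data(project_id: str, **_):
    print(f"extracting merge request history for project {project_id}")


def tune_and_train(project_id: str, **_):
    print(f"tuning and training the reviewer model for project {project_id}")


with DAG(
    dag_id="reviewer_recommender",
    start_date=datetime(2021, 9, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract",
        python_callable=extract_data,
        op_kwargs={"project_id": "example-group/example-project"},
    )
    train = PythonOperator(
        task_id="tune_and_train",
        python_callable=tune_and_train,
        op_kwargs={"project_id": "example-group/example-project"},
    )
    extract >> train  # train only after extraction succeeds
```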
B
I think the main question is who should maintain this infrastructure. We wouldn't be introducing something completely new for the company, for GitLab, because I see that the data team already uses Airflow and some other ETL tools. But in our case, for instance, if we launch all these things on our side, who should maintain them? One solution is to do almost the same thing as with Dataflow, right, because it's only a backend for Apache Beam. So we can use, for instance, Google Cloud Composer.
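For context, a toy Apache Beam pipeline in Python: the point is only that the same code runs on the local runner or on a managed backend such as Dataflow, which is the property being discussed; the pipeline content is invented:

```python
# Toy Apache Beam pipeline: identical code runs locally (DirectRunner)
# or on a managed backend such as Google Cloud Dataflow, which is why a
# managed Airflow (Cloud Composer) is the analogous orchestration choice.
import apache_beam as beam

with beam.Pipeline() as pipeline:  # DirectRunner by default
    (
        pipeline
        | "ReadMergeRequests" >> beam.Create(["mr-1", "mr-2", "mr-3"])
        | "ExtractFeatures" >> beam.Map(lambda mr: (mr, len(mr)))
        | "Print" >> beam.Map(print)
    )
```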
A
I started out poor at it and I still am not great at it, so take this from there. Should the infrastructure team maintain the infrastructure? I think definitely once it's in full production, maybe not in milestone two. But unless we have a good reason not to, I would still do that in milestone two.
A
How about this: we implement via CI job integration for one, then two, of our own projects, so maybe for Gideon, for GitLab itself, see what we learn from it, and then decide, rather than trying to predict the future perfectly, which is very, very hard to do. Do it more iteratively. What do you think?
C
Ultimately, if we have an applied ML feature, a reviewer feature, which we can dogfood, we can use it in our own project; on gitlab.com we can enable it for customers; and on self-managed, they can extract their own data and even set it up on their own. That's the ultimate target. Now, every component we use matters for that, like:
C
If you go and use a certain cloud provider, or a tool which is not an open-source component that we can bundle, then we'll have trouble. But those are very long-term architectural concerns. Ultimately, let me give you a silly example: Postgres. It's an open-source project, it's available everywhere.
D
Overall, I like this strategy. I just want to say: if we decide to, we don't need to go the CI template route. That was just an idea I put out there, but I could see how that could end up working long term, even for self-managed too. So I think this makes sense. This gives us a very easy pattern for how we can do feature flagging, as well as selective customer enablement.
A
Absolutely, yeah. We definitely don't want to finalize our short-term plan until Mon has a chance to weigh in, and even after that we want to try to make decisions two-way doors as appropriate. So let's say we go with the CI job integration, which in some ways I really like, because we do that for other things, and in some ways I don't: the user experience is not great. I know we do that.
A
At least, we did do that for the secure scanners, where you'd have to edit your CI YAML file, and if you got it wrong, things wouldn't work correctly. It'd be better for it to be a button or even a feature flag, but these are very early days, so I'm totally fine with it for now.
A
With the CI approach, we tell the customer how to integrate, or to try out, the CI job we've created. That will initiate the data extraction, the running of the models, the tuning of the models, and then the output: putting the comment in the issue, sorry, in the MR, with the recommended reviewers. So we'd kick it off with the CI job. Then where do we extract the data to and run the models? Is it a shared environment?
A
But the nice thing is it's one infrastructure that needs to be set up once and maintained in one place. If we build the infrastructure on the fly, we could partition customer data by...
B
Yeah, but the best thing is that, okay, I started to work on this issue today, and I found how to do it, because we can reuse GitLab tokens to authorize this, to authorize the CI template for making recommendations, and then...
B
So yeah, it depends. The main thing, as I understood it, is that we cannot use CI job tokens generated by the CI job. We cannot use these tokens to extract data via the API, because, first, they live only while the job is running, and they have a limited scope. So we cannot...
B
We cannot extract data with them. That's why I think we can generate a project token for each customer, or a customer can generate it, and then we can reuse these tokens, for instance to authorize UnReview, I mean to authorize it when we recommend reviewers, and to extract data. But there is one thing: if we decide to move the entire pipeline to CI jobs, they are limited in time. As I understood, Alper said there is only one hour, right?
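A hedged sketch of that project-token idea: Python pulling merge request participants through the GitLab GraphQL API with a placeholder token; the query and field selection are simplified:

```python
# Sketch: use a long-lived project access token (placeholder value)
# instead of a short-lived CI job token to extract MR data via GraphQL.
import requests

GITLAB_GRAPHQL = "https://gitlab.com/api/graphql"
PROJECT_TOKEN = "glpat-..."  # placeholder: generated per project/customer

QUERY = """
query($fullPath: ID!) {
  project(fullPath: $fullPath) {
    mergeRequests(first: 50) {
      nodes {
        iid
        participants { nodes { username } }
      }
    }
  }
}
"""

resp = requests.post(
    GITLAB_GRAPHQL,
    json={"query": QUERY,
          "variables": {"fullPath": "example-group/example-project"}},
    headers={"Authorization": f"Bearer {PROJECT_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
for mr in resp.json()["data"]["project"]["mergeRequests"]["nodes"]:
    users = [p["username"] for p in mr["participants"]["nodes"]]
    print(mr["iid"], users)
```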
C
That's the default, and a lot of people are not aware of it. I've done a lot of stuff inside CI. GitLab shared runners also have a limit, which is three hours. That's something I think we can never change on gitlab.com, but a customer on their self-managed instance can go and set any timeout they want. Defaults are three hours for shared runners and one hour for projects at this moment.
A
Could we do something like this: the CI job kicks off for UnReview, initiates the data extraction, etc., but it's not the actual job that runs everything. We put something in a message queue to say we need to do the extraction for this customer, then run the models, et cetera, et cetera. So the CI job just kicks it off and finishes very quickly.
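A sketch of that hand-off, assuming Google Cloud Pub/Sub (which comes up later in the call) as the queue; the topic and attribute names are invented:

```python
# Sketch: the CI job only publishes a "please extract/train" message and
# exits; a separate worker consumes the queue and does the heavy lifting.
# Topic name and message attributes are invented for illustration.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("example-gcp-project", "reviewer-training")

future = publisher.publish(
    topic_path,
    data=b"train-request",
    gitlab_project="example-group/example-project",
    requested_by="ci-job",
)
print("queued:", future.result())  # message ID once the publish succeeds
```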
B
I just have one thing from our previous conversation: we have the CI job to recommend, or to train. I mean, initially, with the first run, we need to start extracting data, but we can do that in the background somehow. So we just start this process and then we say: okay, you need to wait a bit while the model is trained.
C
A question for Alexander. Ultimately, let's say we have 10 million projects with repositories, and we go crazy and we want to train UnReview in each repository, for every project on GitLab, and make it available, let's say, not inside CI but as an infrastructure effort. How long will it take, or how big an effort is it?
B
It could take around one hour right now to train the model; it's quite fast. But I'm afraid that when we start to improve the model, for instance adding some NLP layers, I mean when we start to process merge request descriptions, we'll need much more time, much more, and I'm afraid of that. Because, for instance, customers paid for some CI/CD minutes and we will start to spend them.
B
Of course, there are so many things that we need to address, and we cannot change them all at the same time. For instance, as I understood, we can't automatically restart a CI job right now. Is that right?
B
I don't know how it works right now, but it takes a lot of time to extract even 25 or 50 participants using the GraphQL API, so sometimes the jobs fail because of that. It's easy to recover: we restart the job and then everything works fine.
C
The question: so the data you are expecting, is it available?
A
Sorry, no.
C
So, in short...
B
Yeah, right. I mean, I'm afraid that we can't automatically restart some of these CI jobs. For instance, if we need to extract participants, this job can easily be broken, because of that.
C
That's a good concern. In short: CI is made for testing, so jobs fail, your test fails, and you're happy you caught something. It's not made for this. If you need retry logic, I can't respond to that immediately; I'm sure there might be a feature there, but in short, if you need retry logic, you should implement that yourself in your own CI script. Yes, yes. However, let's investigate, and I think Fabio is more knowledgeable.
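A minimal sketch of the hand-rolled retry logic a CI script could carry, in Python; the flaky call stands in for the GraphQL extraction mentioned above:

```python
# Sketch: retry wrapper a CI script could use around a flaky extraction
# call (stand-in function), since CI itself won't retry this for you.
import time

import requests


def fetch_participants(url: str) -> dict:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json()


def with_retries(func, *args, attempts: int = 5, base_delay: float = 2.0):
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except requests.RequestException as exc:
            if attempt == attempts:
                raise  # out of retries: let the CI job fail for real
            delay = base_delay * 2 ** (attempt - 1)  # exponential backoff
            print(f"attempt {attempt} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)


# data = with_retries(fetch_participants, "https://gitlab.example/api/...")
```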
B
Yeah, that's what I'm trying to say: we have a lot of things we need to address. So if we choose this way, we need to understand that it may take some time, because we'd need to collaborate with some other teams to adapt, for instance, the pipeline for this strategy, for this integration.
C
Say you have additional repository data every day. Okay, I have two questions. Do you do incremental training, or do you need to dump the data from scratch regularly? Or do you store the data somewhere and train incrementally?
B
Okay, so right now it's a full retraining, but a better solution, for instance, when we have new data, we can...
C
The second thing: the data you are getting from the GraphQL API, does it come from the database, or from the repository git history, or both? I'm trying to figure that out. Because if it's both, and you said it's too slow, the only other alternative is that we get some data directly from the database.
C
Somehow, because we have ways to connect to a replica; the data team has a direct connection, you know, that's what they do. Or, for the repository, we can do a non-shallow, full-history clone, and if we can work with that, at least during milestone two, we can check out a repository, then incrementally check out more and keep that somewhere in some cache.
B
There are some fields that are not stored in the database, like merged_at or something like that; I don't remember, Mihail knows better. And, for instance, we need to extract the changed files, and right now we cannot do that using the pure GraphQL API, so we need to work with the local repository at the same time.
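A sketch of that local-repository work in Python: deepening a shallow clone incrementally and listing changed files per commit with git; the repository path is a placeholder:

```python
# Sketch: work with a local clone when the API can't provide the data.
# Deepen a shallow clone incrementally, then list changed files per commit.
import subprocess

REPO = "/tmp/example-repo"  # placeholder path to an existing clone


def git(*args: str) -> str:
    return subprocess.run(
        ["git", "-C", REPO, *args], check=True, capture_output=True, text=True
    ).stdout


# Fetch 500 more commits of history instead of re-cloning from scratch.
git("fetch", "--deepen=500")

# Commit hash plus the files it touched, for feature extraction.
log = git("log", "--name-only", "--pretty=format:commit %H", "-n", "20")
print(log)
```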
C
Yeah, you mean merged_at, you said, just...
B
Yes, I think it's this field; there is an issue I can send you, that's for...
A
Alexander and Alper, this is a great discussion; we can definitely go over on time if you both can. I think that'd be great. I do want to jump to agenda item number five: Taylor's got an announcement. Then we can come back to this.
D
There we go; sorry, I couldn't find the unmute button. Sorry, I'm driving someone to the airport. So basically, I interviewed for the Applied ML PM role. I am thrilled to say that I have accepted it, technically as of 10/1. I'm off at the end of next week, so I'll officially be your product manager starting 10/4. I'm around, as you all know, so I'll be able to give this more dedication.
A
Great, good, awesome. You're officially part of the team now; even unofficially it was fine too for a while, but that's awesome news, Taylor.
C
In a bit, I'm interviewing a candidate in just a few minutes, so I think I'll drop off, and I'll add it to next week's agenda.
A
Do you have more time to discuss what I interrupted, or do you not have time? If you don't, that's fine; we can continue some other time.
A
Okay, so sorry about interrupting you both; it's a great discussion. Why don't we get back to it? Where did we leave off?
C
If that's too detailed, Alexander, I can sync with you separately, because there are little unknowns from my side on how UnReview works, and from your side on how GitLab works. Because I was on the product intelligence team, I was able to see how different parts of the system work, since I was doing instrumentation there. So that might help us answer some questions.
C
No, I mean anything in milestone two, any concerns, you know... oh yeah.
B
Okay, so I'm trying to find the best way, or maybe not the best, the simplest way at this time, to automate this full pipeline that we have right now. I see that in general there are two ways. The first way: we use something like Airflow. The other way: we use GitLab CI jobs, CI pipelines. Both, I think, are fine, but it looks like pipelines are cool, yet maybe not adapted right now for the kind of task that we have, right?
B
You said that if we start to adapt them, it will be cool, because it will be cool for the company and for the customers, because we can help improve the tool. But I'm afraid that it can take a lot of time.
C
Your concern is totally right, but in turn, Taylor said he is also responsible for ModelOps, and Eduardo Bona is also quite interested in these discussions. Long term, if we really want to have applied ML in GitLab and in customer projects, we should solve this issue without a third-party cloud service. That's the question: how can we do it?
C
I don't know how Airflow is licensed, how it can be bundled. Apache, okay. In short, those are all questions, you know, and introducing a new component to bundle, or even onto gitlab.com, is hard. But doing a single instance where you run it on Confluent or GCP for yourself is possible now as part of milestone two, right? Because what you want to do is learn now.
C
So that's why, yeah, let's be concerned, and you're right: GitLab CI is not made for this exactly. Depending on how long the training takes and how long the repository extraction takes, you might hit the timeout, and also the data storage size, if you want to extract a huge git repository with a lot of storage and history.
A
What about the CI job being extremely simple, in terms of kicking off or queuing up a job elsewhere, something completely unrelated to CI, that is in UnReview, to say, you know, running the model has been requested for this project, for this customer. Then that happens completely separately: the CI job finishes basically almost right away, but all the UnReview stuff happens separately and in parallel, not related to the CI; the CI job just kicks it off.
C
Vane, I totally agree there. So one thing I envision now is: let's say I go to my project, I enable UnReview, I click a button that starts some background jobs, and it says: hey, your UnReview will be available in some hours and we'll email you. Then we really run a background job, we gather the data which Alexander needs in a certain format, do the extraction in the back end, and put it somewhere. Now, there are still questions here.
C
I don't know, but once the data is there, in the CI pipeline the training and, finally, the running of the model is kind of cheap, and the training and so on is not subject to a lot of delays from extracting or transforming that data. So I think we can do what you describe by just, maybe, clicking a button in some project, you know, initially behind a feature flag.
C
Something becomes available, and then I have UnReview, and then I can have a CI job or whatever. So that's one option.
A
With the CI job, the customer, the user, would have to add it specifically to their CI job. You know, we tell them in the documentation: if you want to turn on UnReview, add this text to your CI job, which is great; that would kick it off. Where does the feature flag play in? I think a feature...
C
I think so, and why? Because, you know, CI, as you said, is a manual process; developers love it, but it's not easy. Okay, you include a template, but it's finally something we as GitLab engineers love and do, and DevOps engineers do. And coming back, the feature flag could only be cosmetic on top of that.
A
What it means is, the feature flag would just be another way to kick off UnReview: when CI jobs run, they'd also kick off UnReview. So one way to do it is via the CI job configuration; the other would be the feature flag, which basically would insert itself when CI jobs are run for that project, to say, kick off UnReview. So it's two different ways to enable the same thing, it sounds like, if I'm understanding correctly.
C
Yeah, yeah, totally. But in short, let's go back to what we have to do. We have to train; we have the exact-data hard problem. If you do that in the back end somewhere and make the data ready for UnReview, that's going to be behind that button which I'm imagining behind the feature flag.
C
You click on that, it does something and prepares all the data for this project. But running the model, then making actual recommendations of reviewers, could be done in a CI job easily for the time being. Long term, though, anything is better in the UI. I mean: you go to your merge request and there you have... I don't know, you'd need a product designer. Let's say I go to my project and I enable UnReview.
C
Then I go to my MR and I now see the reviewers for my MR coming from machine learning. That's the ultimate experience. I think, for GitLab, Sidekiq is just a background job framework which we have available; that's why we like to dogfood it and we want to use it. But overall, long term, anything is better in the UI, I mean, or in the background job.
B
We have a situation where, for new projects, we can collect data step by step. For instance, let's say that UnReview is integrated into the core; once a user creates a project, we can collect the required data step by step with every new merge request and new commits. But we will also have some old projects, and for these projects we need to collect the past history, and that's a long-running task sometimes, right, for huge projects? Yeah, totally.
E
So one thing: do you know the people in charge of the GPU runner? Because, based on the things I'm hearing you say, something Alper said resonated: that CI pipelines are made to fail, to catch something in testing, and that's it. So this is not a machine learning workflow, and this tool, for that purpose, probably is not the best. But I know that there is this team working on it.
E
The GPU-enabled runners. I didn't know that, for example, no more than three hours would be possible for using a runner, but I would say it's standard that training a machine learning model can take more than three hours. Do we know, or would it be good to ask them, what their approach is?
E
How are they thinking about solving this problem? Because the value proposition of the GPU runner is to train machine learning models, but if there is a hard stop after three hours, it wouldn't make any sense. So I would be curious to know how they are planning to approach this, because if they really are marketing it for training machine learning models, it should be longer than three hours.
C
What I mentioned is that the current shared runners have these limits on gitlab.com. However, GPU runners are probably not those shared runners, and they could probably set any limit they want.
B
It's a very simple way, maybe not very reasonable for the customers, but anyway, it would look like an ML, an MLOps platform, yeah. We definitely can't...
A
...depend on what the ModelOps team is building for milestone two, because then milestone two won't happen until much further out. Although we do want to keep an eye on it: we don't want to be blind to what they're doing either, and do things that we'd need to rewrite without at least planning for that.
C
Yeah, totally, and your concern is right, Alexander. But overall, in short, that's the start of the journey, and I think you specifically should focus on the smaller problem. Everything raised here as problems are the problems of the industry, and if GitLab is going to be finally useful for customers to do their own ML broadly, which is not the concern of Applied ML for the time being, it's going to be really cool to solve all these problems, you know, which we...
C
I totally agree, but I am not aware of the long-term plans there. Let's say we can make UnReview work; we could keep exactly the same infrastructure you were using before GitLab, because you could give us an API and we could make it work.
C
That could make it work, but then it might not be easy to convert into a feature. So I think, on one extreme, we make UnReview work exactly as it is, which is easy, but then we don't end up with a GitLab feature. On the other extreme, we convert GitLab into a full MLOps platform with reviewer capacity, with machine learning.
C
So yeah, I'm not trying to suggest one thing, but at the end of the day, we are GitLab; we are providing the tool for other people to use, and if we don't make it reusable... Let me give you a silly example (I'm not well-versed): let's say the whole of UnReview relies on a server, and all customers who want to use it on gitlab.com or elsewhere need to go and open an account in Confluent or, let's say, GCP. That's going to be not very good long term.
B
Yeah, I see what you mean; I agree. But there are two cases. Look, the first one: for instance, we choose Airflow and Pub/Sub. Then it means that, okay, with the predefined CI template we recommend, plus Airflow and Pub/Sub on Google and Google Dataflow, we extract and prepare everything, right? So it means that when we start to work on the third milestone, we need to move data from self-hosted customers somehow to our side.
B
For instance, we can do that through Pub/Sub, right? If we choose GitLab pipelines, maybe we can train even on the customer side, even on the self-managed customer side, because in that case we don't need to move data outside of their instance.
C
Even on GitLab, you know: when we enable UnReview, going into the repository history of a customer and doing some training without the customer's knowledge, without them initiating it first themselves, is something they won't like. You might want to think about the implications of that.
B
But maybe in this case we don't need to make this model too complicated, to make it trainable on the customer's side. That's what I mean: maybe, for instance, if we choose GitLab pipelines and we do all of these things on the self-managed customer's side, that would be one model; but we could have another, more complicated model on our side, where we can spend more time to train it.
C
I did a lot of side analysis there, but what you say is, I think, true. In short, for gitlab.com we can do something different, and ideally, of course, we want to have exactly the same thing on gitlab.com and on self-managed long term. But gitlab.com can always be the forerunner, the pioneer of new things, and in the meantime self-managed evolves and people can do more there, because ModelOps or other teams do great things there, or GPU runners...
C
Then that's going to be viable for us, and we can do it that way in the future.
C
I mean, huge repos aside, I think let's go the way we described, naively, and let's see if we hit all the challenges, you know, on the data extraction. And we can, really; finally, we are GitLab the company, and we can go and change anything. Like William suggested, we can go and check the GPU runners, we can check the runner timeouts, have more runners, have a special runner with a different timeout, have Applied ML's own runners which no one else uses, stuff like that, which could be easier.
B
Yes. So do we have time to test, to try, for instance, this solution of using CI pipelines? But probably we need to improve something, I mean, something in the usual way these pipelines are used right now. What do you think?
B
No, it's like Alper suggested: if we use GitLab pipelines, it means that we can improve them. Is that right? So we...
A
We may not be able to improve them in the short term and still achieve our goals in Applied ML, so we can't count on improving them. Sure, everybody can contribute, but that doesn't mean that all changes are accepted, or that all teams agree with other teams' changes, etc. They might love it, or we might give them pause; I don't know what changes we would make.
C
What I meant is just increasing the timeout settings and having shared runners, which is not changing code, which could be easier to do.
A
If it's enabled by a feature flag, maybe, I don't know, would it be kicking off a Sidekiq job, perhaps? The Sidekiq job then goes and does the extraction, and stores the data wherever it needs to be. Then, when we actually want to run the model later and output the recommendation, we put that in the CI job, because running the model and updating the MRs with comments saying who's recommended is going to be much faster, nowhere near an hour.
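For that fast final step, a hedged Python sketch posting the recommendation as an MR comment through the GitLab REST notes API; the IDs and token are placeholders:

```python
# Sketch: post recommended reviewers as a comment (note) on an MR via
# the GitLab REST API. Project ID, MR IID, and token are placeholders.
import requests

GITLAB = "https://gitlab.com/api/v4"
TOKEN = "glpat-..."      # placeholder project access token
PROJECT_ID = 12345       # placeholder numeric project ID
MR_IID = 42              # placeholder merge request IID

reviewers = ["alice", "bob"]  # output of the (already trained) model
body = "Suggested reviewers: " + ", ".join(f"@{u}" for u in reviewers)

resp = requests.post(
    f"{GITLAB}/projects/{PROJECT_ID}/merge_requests/{MR_IID}/notes",
    headers={"PRIVATE-TOKEN": TOKEN},
    json={"body": body},
    timeout=30,
)
resp.raise_for_status()
print("comment created:", resp.json()["id"])
```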
A
You need to run the model, so how do you get the data into the model? I think we get the data into the model via a feature flag; it's Ruby code controlled by a feature flag, not the CI configuration. And customers would have to use both, right? You can't do one without the other, and if they do one without the other, it won't work. Of course, that's not a good user experience, but that's still okay for an MVC, I think.
B
So we need to do this somewhere outside of these jobs. Then, for instance, a user can add a CI template, and this CI template will generate recommendations that can be parsed later. So, I mean, the recommendation process works in CI jobs.
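A tiny sketch of the parsing half, assuming the CI job writes its recommendations to a JSON artifact; the file name and schema are invented:

```python
# Sketch: parse a recommendations artifact written by the CI template.
# File name and schema are invented for illustration.
import json
from pathlib import Path

artifact = Path("recommendations.json")
# Example of what the CI job might have written:
# {"merge_request_iid": 42, "reviewers": ["alice", "bob"]}
data = json.loads(artifact.read_text())

print(f"MR !{data['merge_request_iid']}: "
      f"suggest {', '.join(data['reviewers'])}")
```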
A
I think, if the data extraction is not done yet or the model is not ready yet, it's fine that when the CI job runs, it just says it's not running it; it just outputs that in the logs. Most MRs don't succeed on the first try anyway, you know, because automated tests don't succeed, etc.
A
So they'll most likely run it again and get a recommendation later. And actually, even if the code is mergeable on the first run of all the pipelines, that's okay: they can rerun. They can kick off the pipelines again once the model's ready; they can click the run button again and it'll kick off later, manually. I think those cases are fine.
A
It's sort of like the security scanners: if you want to rerun your security scans on unchanged code, where it ran the first time, you run it again to see if there are any new vulnerabilities based on new things the scanners find since you last ran them. You just hit the run button again to rerun your pipelines, and it'll pick up the latest configuration.
B
Oh, sorry, there is also another strategy. For instance, we have this predefined CI template where we recommend, and once this template is run, we start the extraction process. I mean, we trigger somehow, for instance, another CI pipeline, maybe, or something else, where we extract data. So there is another case like that.
C
Yeah, I think that can also work fine; I mean, we have several alternatives. In short: you go there, you check if there's model data; if no, you extract it; if yes, then you launch. The pipeline architecture is quite rich, finally: you can launch a new job, or do all of this inside one job. And then, finally, if you don't have enough data, you can say, as Vane mentioned, that at the moment the training is continuing.
C
One trouble I have to mention, while defending the back-end approach: Sidekiq jobs in GitLab have a five-minute recommended run time, so that means you have to divide the job into very small pieces. When I was calculating other stuff there, I would never expect anything to run on the back end for more than three hours.
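A hedged Python sketch of that divide-into-small-pieces pattern: process one batch, then re-enqueue a continuation rather than holding one long-running job; the queue is an in-memory stand-in for Sidekiq (which itself is Ruby):

```python
# Sketch of the "small idempotent batches" pattern for a tight per-job
# budget: process one slice, then re-enqueue a continuation. The queue
# is an in-memory stand-in for Sidekiq / Pub/Sub / etc.
from collections import deque

BATCH_SIZE = 100
queue: deque = deque()


def fetch_merge_requests(project_id: int, offset: int, limit: int) -> list:
    return []  # stand-in so the sketch runs


def store(item) -> None:
    pass  # stand-in: persist extracted features


def extract_batch(project_id: int, offset: int) -> None:
    items = fetch_merge_requests(project_id, offset, BATCH_SIZE)
    for item in items:
        store(item)
    if len(items) == BATCH_SIZE:  # more to do: continue in a fresh job
        queue.append((project_id, offset + BATCH_SIZE))


queue.append((1234, 0))  # placeholder project ID, start at offset 0
while queue:  # worker loop standing in for the job runner
    extract_batch(*queue.popleft())
```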
B
So we have three hours. But yes, that's still also a good question for me, because I would like to have some kind of A/B test, as Alper said in one of the issues, to understand: do we need to improve the current model? How can we improve the current model? Maybe we need to change direction completely. For instance, right now I thought that maybe we need to introduce some text description features to the dataset, but maybe we don't need those things at all.
B
Maybe customers, users, want something else, so it would be good to have this kind of testing. But at the same time, we don't have, let's say, a platform where we can test all these things, I mean this MLOps platform. So we can't write a model and run it in five minutes in production or in staging.
B
Yes, any machine learning model; we don't have this ability. We have two problems at the same time, and this is why we need to find some kind of trade-off, right?
C
Yeah, totally, I mean, a lot of...
C
Because, finally, we are the pioneer here in DevOps; we are trying to do applied ML, and that's why all the challenges we face are true challenges which everyone would face, and the solutions to them are going to need to be innovative. That's what I believe, I mean. So we can try multiple alternatives, like you said, too.
B
Yes, but yeah, of course, there are a lot of open-source projects that can help us, like MLflow or some others, but we still need to maintain them, we need to install them, we need to do a lot of things, so we cannot take them all and introduce them at the same time, right?
A
Yeah, lots of hard, challenging, but fun decisions to make. So yeah, thank you, thank you. I know this went a lot longer than we scheduled, but that's just fine. I know, Alexander, you've been wanting to bounce a bunch of the ideas you were thinking about off somebody who knows the GitLab product in detail, and thank you, Alper, for being that person today, so you could give us all your great advice.
C
Yeah, and sorry for not giving exact answers. As you see, there are trade-offs in every choice, and we are in a challenge; that's the fact. Absolutely.