GitLab Applied ML group, 9 Jun 2021

From YouTube: Infra and Applied ML discussion

Description

https://gitlab.com/groups/gitlab-org/-/epics/5794#note_592129663

Discuss the Applied ML reviewer/maintainer assignment architectural plan and PoC with the infrastructure team

A

So we're going to talk about infrastructure and applied machine learning and milestone one. And again, welcome to GitLab, Alexander. Glad you and UnReview are part of the company now.

A

So we had some pre-read in the document. I don't know if everybody had a chance to go through it. Is it worth doing a quick overview of the components of UnReview and milestone one, or should we just skip to the discussion of the questions? Happy to do an overview if anybody would like.

B

I mean, I was able to read the slide deck and I'm also really familiar with the epic, but I think an overview would be fantastic.

A

Okay, awesome. So if you look at the epic, it is linked in the doc, epic 5794, along with that picture in the comment I linked directly. I'll turn it over to Alexander and others to say more. Milestone one is a proof of concept. It is not integrated into the GitLab product. It's a proof of concept to use UnReview on the GitLab product itself, so we can dogfood it ourselves. And when doing that, there's a number of components that are part of UnReview today that may or may not be part of things when we integrate it with the product.

A

So the idea here for milestone one is to create perhaps a separate space in GitLab's GCP to implement UnReview pretty much as is, as much as we can, and change it where we can't. Then we export from the GitLab product the data we need, probably through GraphQL, and run it in this separate instance so we can do recommendations of maintainers and reviewers via applied machine learning. Then we pass that back to the product, either displayed in UnReview itself or perhaps in the product as comments or something like that, so that GitLab team members can use it and dogfood it ourselves as we develop GitLab. Not customer-used yet. That is the initial idea. Alexander, Taylor, and others, what part of that did I get right, and what parts of that would you change?

C

Understood. So I think during the first milestone we just need to find some equivalents on the Google Cloud Platform, because I used Azure to build some components. So we need to find some other tools to replace them. As long as UnReview uses the GitLab API, I think we just need to find how to replicate, for instance, Kafka or Azure Data Factory, and that's all for the first milestone, as I see it.

A

Yeah. So it seems like there's a number of open source components in UnReview: there's Kafka, there's Hive, there's MongoDB, and they're all general open source things that we'd want to implement in GCP just for this proof of concept. ADF, the, you know, Azure data... You know what ADF stands for. What is it called again? Functions?

C

Azure Data Factory.

A

Azure Data Factory. So that's not something, of course, we'd want to use in GCP; it doesn't apply to GCP. So I think that's the primary one we need to find a replacement for. Is that accurate, or would there be others?

C

Yeah, that's true, because I use ADF to pre-process data, and then I store the pre-processed data in MongoDB.

A

So I think the only thing we're going to have to find a replacement for, for the proof of concept, is ADF. The others, Hive and Kafka, we should be able to implement ourselves, and by "we" I mean the development team should be able to implement them in GCP, if infrastructure gives us the space to do that, separate from the rest of the product.

A

So Steve, it looks like you have a comment on the milestone. You want to verbalize?

D

Yeah, so a question there, and I'll take the opportunity to say welcome, Alexander. It's great to have you, and great to have this as a future part of the product. So that's awesome. So Wayne, you were saying milestone one is a PoC. I just want to understand, and I think I got it from something you were mentioning there. There's probably many differences, but is one difference in thinking about this as a PoC versus, like, the initial MVC of a feature the intent that it's not going to be available to customers for use, that it'll only be used by GitLab? Is that part of it, or am I thinking about that wrong?

A

Only used by team members.

D

By team members. Okay, right, that's a good distinction there. And do you see that as a key characteristic of the point at which this would transition from a PoC into something else: that we'd have the findings of the PoC before it would be open for wider use? Later on Brent's got a question about readiness review and everything, and I think those would be some of the things that we would do prior to customer use. Is that correct?

A

Yes. Milestone two, as I remember, is the beginning of customer use. Actually, let me make sure that's accurate. I have the milestones linked to each other. So milestone two, which is two months after milestone one, is customer-facing MVC integration. Milestone two, a second epic linked to that first epic, is customer facing, for .com customers only, not for self-hosted, and it's an MVC, a minimum viable change, controlled by a feature flag.

A

You know, we don't intend to turn it on for all customers right away, etc. And it also involves integration with the GitLab product itself. So instead of, or in addition to, showing the recommended maintainers and reviewers that the UnReview components come up with in a comment, it would actually be part of the product itself, where it would show them when users are changing or selecting maintainers and reviewers. Taylor has already thrown some ideas in there on what that UI might look like. That's when we'd actually integrate it with the product, but the scope with milestone two is .com customers only, and probably a relatively small number to start: ourselves dogfooding it, but also... we haven't talked through this, but definitely not, you know, roll it to one customer on day one and then 100 on day 10. I think it'd be a slow rollout. We haven't worked those details out yet, but it wouldn't be a fast rollout, because a machine learning workload also takes up a lot of CPU power, of course.

A

So we want to make sure we plan that out as well.
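
A slow, feature-flag-controlled rollout like the one described here is commonly done by hashing each customer into a stable bucket and enabling the feature only for buckets below a target percentage. The following is a minimal sketch under that assumption, not GitLab's actual feature flag implementation; the function `in_rollout`, the flag name `unreview_recommendations`, and the customer identifiers are all hypothetical.

```python
import hashlib


def in_rollout(flag: str, customer_id: str, percentage: int) -> bool:
    """Return True if this customer falls inside the enabled percentage."""
    # Hash the flag and customer together so each flag gets its own ordering.
    digest = hashlib.sha256(f"{flag}:{customer_id}".encode("utf-8")).digest()
    # Map the hash to a stable bucket in the range [0, 100).
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percentage


if __name__ == "__main__":
    # Hypothetical customer identifiers, for illustration only.
    customers = [f"customer-{i}" for i in range(1000)]
    for pct in (1, 10, 50):
        enabled = sum(
            in_rollout("unreview_recommendations", c, pct) for c in customers
        )
        print(f"{pct}% target: {enabled} of {len(customers)} customers enabled")
```

Because the bucket depends only on the flag and the customer, raising the percentage keeps every already-enabled customer enabled, which makes the extra machine learning load grow in predictable steps.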

D

Okay, great, thank you for that.

E

I'll just add that this is intended at the moment to be an Ultimate-only feature. So thinking through the cost profile, our margins, as well as just how many customers actually use that, restricting it to Ultimate cuts off a lot of concerns.

E

I'll also mention that if you think about how we use reviewer roulette today, I think that's a good way to think through what this initial PoC will end up looking like. Reviewer roulette: you trigger it, it posts a comment. That, I think, will end up being what this PoC ends up looking like. And to me success is: do we get a PoC capable of producing logically and just very clearly better results than what reviewer roulette can provide today? That is, to me, the gate that says all right, we're ready to now move on to trying to expose this to customers.

C

So can we ask a few technical questions?

A

Please. And as you do that, if you have time, either just before or after, type your thoughts in the doc too, if you would. But actually, since Brent already has some thoughts in there, maybe Brent, you want to verbalize yours, and then we'll go over to Alexander's.

B

Yeah, I just wanted to recommend an approach that we take for milestone one that's similar to what we've done for other teams, where we've set them up with an isolated GCP project where they have full access. We're doing this for sharding and a couple of other efforts. So my suggestion here for milestone one is we do the same, but then also figure out a time when it makes sense to begin a readiness review. And just from a high level, and, Alexander, welcome, this is for Alexander's benefit here: a readiness review allows us to start having an asynchronous/synchronous discussion about architecture thoughts and things like that, so that we can get prepared as we get into progressive milestones and start to incorporate more and more things, and align. One of the big things that we need to do from the infrastructure side, from Delivery and also within Reliability, is get ready for how we deliver this and get it to production. So having the readiness review sooner rather than later, rather than at the end during a significant milestone, would be important here.

A

Great ideas, Brent, and I'll definitely add the readiness review milestone too. Our scope right now is milestone one, but when we start closing in on starting milestone two, one of the key things will be this: we have these various components in UnReview, like MongoDB, like Kafka, like Hive, and those things are not part of the GitLab architecture today, either .com or self-hosted. We do want, potentially in the future, a portion of this functionality to be available for self-hosted, and that poses all sorts of challenges. But even for .com, do we want to make these components part of our architecture or the product? Having them gives us a lot of benefits and flexibility and features, but also requires us to be able to maintain it, scale it, secure it, etc.

A

So one of the tasks in milestone one is to do that evaluation. We have these various components, like MongoDB, which is a great document store. Do we want to make that part of the GitLab architecture, or do we want to use something that exists today, like the Postgres ability to store documents? But is that good enough? Maybe it is, maybe not. So there's going to be a lot of discussions there, and we'd definitely like your or somebody from your team's input on those as we work to try to make those decisions with you. I will tag you, Brent, in that specific issue, if I haven't already, to make sure you've seen it, and then definitely get it to your team.

A

Alexander, what were your technical questions? And if you could type them in the doc as well, that'd be great. The doc link...

C

I'm just trying to understand. Okay, during the first milestone we will use the public API, like it does right now. But during the next milestone, for example, how are we going to integrate: through the public API as now, or will we find another solution? Because there is also another issue. We need to retrain the model, for example each week or each day, so we have to find this interval.

A

It's a great question. I don't think we can answer it today, but it's definitely a question we need to answer. We often use the GraphQL API inside the product itself, along with the various APIs, to do things. We could go directly to the database, though, if we chose to. Maybe the GraphQL API doesn't have the features we need to do it at scale. Maybe it does. But that's one of the things we need to discuss: what are the pros and cons of the various approaches? Without knowing many of the details, I would nudge towards using GraphQL if it'll meet our needs, versus going directly to the database, but I don't know if it will meet our needs. But Michał, with the kinds of things that you've done on other teams and in the architecture, what would you guess would be a good approach for this? And again, just a guess at this point.

F

My guess is that using the database directly would probably require the least work if we want more data than the GraphQL API exposes. I mean, if GraphQL exposes everything we want, we can use it. On the other hand, it has limits. We're going to be issuing hundreds of queries against it, because there is a limit on how many results the GraphQL API can return. So we'll be issuing several bigger queries or hundreds of smaller GraphQL queries. It also depends on the data we want from the API. If we want something that the API doesn't currently expose, then some merge requests will be in order to fix that.
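
The "hundreds of smaller queries" pattern mentioned here is cursor pagination: each response carries a cursor, and the client keeps requesting the next page until none is left. Below is a minimal sketch of that loop. The query text is illustrative only and its field names are assumptions modelled on the Relay-style connection pattern (`first`/`after`, `pageInfo`, `nodes`); `iter_nodes` and `make_fake_fetcher` are hypothetical helpers, and the fake fetcher stands in for a real HTTP call.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

# Illustrative query text; the exact fields available are an assumption.
MERGE_REQUESTS_QUERY = """
query($path: ID!, $first: Int!, $after: String) {
  project(fullPath: $path) {
    mergeRequests(first: $first, after: $after) {
      pageInfo { hasNextPage endCursor }
      nodes { iid }
    }
  }
}
"""

Connection = Dict[str, Any]


def iter_nodes(
    fetch_page: Callable[[Optional[str]], Connection],
    max_pages: int = 1000,
) -> Iterator[Dict[str, Any]]:
    """Yield every node, following cursors until the last page."""
    cursor: Optional[str] = None
    for _ in range(max_pages):  # hard cap so a bad cursor cannot loop forever
        connection = fetch_page(cursor)
        yield from connection["nodes"]
        page_info = connection["pageInfo"]
        if not page_info["hasNextPage"]:
            return
        cursor = page_info["endCursor"]


def make_fake_fetcher(
    items: List[Dict[str, Any]], page_size: int
) -> Callable[[Optional[str]], Connection]:
    """Stand-in for a real API call: serves `items` in fixed-size pages."""
    def fetch_page(cursor: Optional[str]) -> Connection:
        start = int(cursor) if cursor is not None else 0
        end = start + page_size
        return {
            "nodes": items[start:end],
            "pageInfo": {"hasNextPage": end < len(items), "endCursor": str(end)},
        }
    return fetch_page


if __name__ == "__main__":
    fake_items = [{"iid": i} for i in range(250)]
    total = sum(1 for _ in iter_nodes(make_fake_fetcher(fake_items, 100)))
    print(f"fetched {total} nodes in pages of 100")
```

With a page-size cap, the number of requests grows linearly with the amount of history exported, which is the cost being weighed against reading the database directly.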

A

Great stuff. So again, we're not trying to decide today, but this is a great discussion to start having today. So Steve, you have... oh yeah, you put in there "private", so...

D

Actually, are you going to post this as public or private? Public.

A

We definitely need to keep track of not only the functionality but also the performance and capacity impacts of whatever decisions we make. Yeah, we can talk about my comment.

A

Yeah, machine learning does put a high load on things, not only from pulling the data but in general from running machine learning workloads, so we need to plan for that as well. Yeah.

D

And I think you're already familiar with the thing that I'm referring to, so yeah.

A

Good stuff. Cool. Alexander, I cut you off a little bit before, I think. Any other technical questions or other thoughts on all of this?

C

Yeah, going back to GraphQL, for instance, I didn't find a way to identify whether a user is a bot or not, and we have to do that if we make recommendations. Right now, if we check, sometimes in the top 10 recommendations we see bots, and I didn't find a way to remove them. There is no field for that right now, so...

A

Great point. I don't actually know. Michał, do you know, or Taylor, or anybody else?

F

I don't recall if there is a column in the database saying whether something is a bot or not. I think most people, when they create a user for an external integration, create a normal GitLab user. So that's a tough question.

A

We might need a configuration to say: is this a human or not?

F

We might just add a boolean flag when you're creating a user account, like, hey, is this a bot? So...

D

Want to come in? Go ahead.

B

Oh, I don't want to solution here, but I did want to mention that, based on my previous experience in the admin this weekend, there is no differentiation, at least in what the admin surfaces there, between a bot or not. And I don't want to solution here, but my recommendation would be that maybe we suggest a feature where bot users don't have a password; they are merely created and they have API keys. So anyway, that's another mechanism to delineate. Sorry, Steve.

D

No reason to be sorry. I was actually only going to make a humorous comment: if we don't have that, we should have, you know, "bot or not" somewhere, but then it probably doesn't have the best etymology. So maybe Wayne's suggestion of "human or not" is probably...

C

I think there is also another way. For instance, if we recommend a bot, a person can easily say, okay, I'm not satisfied with that recommendation, and we can exclude those users from the recommendations.

A

I think the system will learn that over time, like you're saying, but it'd be great to be able to give it some advice in advance on human or not, so it doesn't make those recommendations. Well, I guess bots will rarely be asked to be a maintainer or reviewer, so I think the history will show that they're not. But whether they've done one before is not the only criterion for determining if somebody should be a maintainer or reviewer of a change. So yeah, good things to consider. I'm sure we're going to have another 20 questions like that: what do we do in this situation, what do we do in that situation, etc.
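
The two mechanisms discussed here, an up-front bot marker and exclusion based on user feedback, can be combined as a post-filter on the ranked recommendations. This is a minimal sketch assuming both signals are available as sets of usernames; as noted in the discussion, no bot flag was found in the API, so `known_bots`, `rejected`, and `filter_recommendations` are hypothetical names rather than anything UnReview or GitLab provides.

```python
from typing import Iterable, List, Set


def filter_recommendations(
    ranked: Iterable[str],
    known_bots: Set[str],
    rejected: Set[str],
    top_n: int = 10,
) -> List[str]:
    """Keep the model's ranking order, skipping bots and rejected users."""
    excluded = known_bots | rejected
    result: List[str] = []
    for username in ranked:
        if len(result) >= top_n:
            break
        if username not in excluded:
            result.append(username)
    return result


if __name__ == "__main__":
    # Hypothetical data: model output ordered from best to worst match.
    ranked = ["alice", "ci-bot", "bob", "carol", "dave"]
    known_bots = {"ci-bot"}   # e.g. from a configured "human or not" marker
    rejected = {"carol"}      # e.g. from "not satisfied" feedback
    print(filter_recommendations(ranked, known_bots, rejected, top_n=3))
```

Filtering after ranking keeps the model unchanged; the feedback set can later be fed back into training so the model learns the same exclusions over time.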

C

Yeah, right now, for instance, we have to clone the repository to extract some more information that we cannot extract using the public API. If I remember, this relates to files that are changed in commits.

F

You mean precise commit data, like which files were modified at which positions? I don't think...

C

Right now UnReview doesn't use position, but I think later it will, yeah.

F

We will use it. I don't think that's discoverable using the public API, so we have to fix that, or we have to investigate if we can access that from the database directly. I don't think that's the case. I think Gitaly would do that, but I'm not exactly...
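
For reference, extracting per-commit changed files from a local clone can be done by parsing `git log --name-only`. The sketch below assumes a clone already exists on disk and shows only file-level data, not the line positions mentioned above; the marker string and the function names `parse_name_only_log` and `changed_files` are hypothetical, and this is not UnReview's actual extraction code.

```python
import subprocess
from typing import Dict, List, Optional

# Hypothetical marker that makes commit header lines easy to tell from paths.
COMMIT_MARKER = "@@commit "


def parse_name_only_log(text: str) -> Dict[str, List[str]]:
    """Map each commit SHA to the list of file paths it changed."""
    files_by_commit: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        if line.startswith(COMMIT_MARKER):
            current = line[len(COMMIT_MARKER):].strip()
            files_by_commit[current] = []
        elif line.strip() and current is not None:
            files_by_commit[current].append(line.strip())
    return files_by_commit


def changed_files(repo_path: str) -> Dict[str, List[str]]:
    """Run git in an existing clone and parse its output."""
    completed = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only",
         f"--pretty=format:{COMMIT_MARKER}%H"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_name_only_log(completed.stdout)


if __name__ == "__main__":
    # Hand-written sample in the shape the command above is expected to print.
    sample = (
        "@@commit 1111111\n"
        "app/models/user.rb\n"
        "spec/models/user_spec.rb\n"
        "\n"
        "@@commit 2222222\n"
        "README.md\n"
    )
    for sha, paths in parse_name_only_log(sample).items():
        print(sha, paths)
```

Getting positions as well would mean parsing patch output (`git log -p`) and its hunk headers, which is a larger job than this sketch covers.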

C

It would be amazing to have a document where we can write down all these issues, but I'm not sure, maybe...

A

I think actually it might be good to capture these in an issue linked off of milestone one or two. It'd be great to create that issue and just start putting in questions we want to answer. It can be very open-ended, and then we can keep that issue public as well, so we get feedback not only from this group and other team members, but from the public. Yeah, if you could create that issue, these are great things to put in it.

C

Okay, I was just trying to finish all the trainings in my onboarding today, yeah.

A

No rush, absolutely. When you get...

A

Our focus today was on infrastructure, although we discussed many other things too. Do you have what you need from us at this point? Do you think we are on a good track here? Is there any other feedback?

B

No, I mean, from my end, I'm satisfied with a deeper understanding of what milestone one versus milestone two represents, and I see a collaborative path forward on how we can enable, in a similar fashion to what we did for the sharding PoC and other teams, an area to start developing milestone one in.

A

And then, I have not put in requests for things like that before. Michał maybe has, or maybe Taylor has, but I guess we'll probably be asking someone how we put in those requests to ask for those resources in GCP, yeah.

B

We don't have a proper workflow for that, but I will point you towards what we did for sharding, and we can work async on that. Sounds great. Thanks.

D

Thanks for hosting the conversation and everything. And Alexander, if you finished all of the onboarding stuff today, you'd be the first person to do that, so give yourself some slack. There's a lot to do there.

A

There's a lot there, that's for sure. All right, well, thank you everybody, great discussion. Alexander, Taylor, and I will be having our first actual group meeting tomorrow, but it's great to get a head start on that and talk with the infrastructure team. So have a great day, everybody. Cheers, bye.