From YouTube: Applied ML weekly team meeting June 17, 2021
B
Listen, for the people who haven't attended previously and came today, maybe a super quick introduction from Kai and Sean would be great.
A
Sure. So I'm Sean Carroll, the engineering manager for Source Code, and I'm very interested in this project and what's going on here. Just quickly about myself: I'm not by any means an expert in machine learning, but I did work in a lab at a Swiss university for nearly four years supporting machine learning projects, mostly from an infrastructure perspective.
C
Cool, and I'm Kai, the product manager for Code Review. I'm here because we're going to be your first implementation point at some point in the future, so we thought we'd come say hi. We probably won't come back for at least a little while, a couple of milestones maybe.
B
Yeah, sure, thanks guys. We usually don't screen share in meetings, but I think it's warranted here. Item number one is just to go through the issue board. Issue boards in GitLab are not obvious until you've worked with them, at least in our usage of them and our specific labels, so I'll share my screen for maybe two or three minutes.
B
In our product development flow, the labels we're currently using include things like Scheduling. You can open and close these lists, so I can open and close them. Scheduling means we haven't started work on it, but we will in the future; it needs to be scheduled.
B
Blocked means it's blocked on something: it could be blocked on another issue or something else, or, like this one, blocked on time. For example, I'm not officially transitioning to this team for another week or so, so I'm just blocked based on time. Then there's Ready for Development, which means we are ready to start developing on it but haven't, and In Development, which means we're actively developing. There are others like In Review, In Production, etc.
B
So the ones that Alexander and I want to be working on now, with others joining in as interested, are: determining the plan for the open source components that don't exist in GitLab infrastructure, which we've been collaborating on async in the issue, which is great; and publishing the plan in a handbook page based on that, which I'll do based on everyone's feedback. I'll create that initial handbook page, tag anybody interested to review it as we merge it, and then iteratively improve it. And then creating the infrastructure request.
B
So we can get that in for the infrastructure team for the proof of concept. Going back a bit, the big change to the codebase is replacing ADF with Google Dataflow or similar for the proof of concept. Alexander, once you start working on that, just change the label from Ready for Development to In Dev and it'll show up here; that will be kind of the next task to go from there. I think we're closing in on the In Dev ones as we finalize, for now,
B
at least for this iteration, the architectural plan: document it and then create the infrastructure request. Alexander and I, and now the others, have been collaborating on that. So does the issue board make sense? Any questions about issue boards in general, or about these plans here?
B
Great, I'll stop sharing here. Alex, you had two comments on the current issues. Do you want to verbalize them?
D
Second comment: just a note that I have updated the descriptions of how UnReview uses the different components, so you can find it there. I have attached the link.
B
I
review
that
that
looks
great
I'd,
encourage
others
as
well
as
we
update
the
integration
plan.
D
So
I
have
some
other
questions.
So
can
we
start
working
on
the
handbook
page?
So
we
can
open
a
merge
request.
We
can
collaborate
there
because
I
found
that
we
can
start
working
because
we
have
the
first.
We
know
the
picture
before
we
know
the
picture.
We
know
the
future
picture,
so
we
can
start
describing
these
things.
B
Absolutely
I'll
work
on
the
first
draft,
I'm
going
to
do
handbook
pages
because
then
you
can
spend
your
time
out
on
handing
pages
as
much
and
on
actually
changing
the
review
code
base
is.
D
That,
okay
to
you,
it's
like
I'm
waiting
for
my
laptop.
So
still
I
have
time
to
update
the
this
page.
B
Yeah, when is the laptop supposedly going to arrive?
D
B
Okay, good job. You also...?
D
Number
three
yep,
so
we
found
how
to
replace
hive
and
azure
data
factory,
so
that's
cool,
but
we
also,
you
know
we
have
to
decide
how
to
deploy
kafka.
D
I
have
attached
the
comment,
so
I
found
three
options:
how
we
can
do
that,
but
still
I'm
thinking
that,
like
we
focus
right
now,
we
focus
more
on
on
the
book
right.
So
we
don't
need
to
create
the
high
availability
cluster
or
something
like
that.
B
Sounds
great
and
then
anyone
interested
can
subscribe
can
watch
the
issue
and
comment
on
it.
We
can
continue
working
on
it.
That's
great.
It
helps.
D
Number
four
right:
I
have
a
lot
of
questions,
so
unreview
requires
a
scheduler
to
automatically
start
the
extract
stage
on
the
training
stage,
because
I.
B
D
B
F
Yeah, wouldn't once a week be too much? I think we generate a lot of data in a week, so I imagine training the model daily, or at least ingesting the data daily, would put a lot less pressure on the components involved, especially the database. A database outage can come even from a read; maybe not for a long time, but we can get a statement timeout, so we need retries. We need to put...
A
B
D
But, you know, maybe that's not necessary for the first milestone, because we're trying to produce the PoC, right? So maybe we can move that out; for example, if we finish everything for the first milestone in one month and we have one more month, maybe we can do it then.
F
Is
there
a
significant
difference
in
how
often
you
inject
the
data
with
regards
to
training
the
model?
What
do
you
mean?
I
mean
if
you
want
to
like,
if
you
want
to
run
the
data
ingestion
once
a
week,
is
there
going
to
be
a
different?
It's
going?
Is
there
going
to
be
a
big
difference
between
ingesting
it
daily
and
once
per
week
once
per
week?
F
If you want to fetch data from a week before, then you're going to have one huge query with a ton of results; if you fetch data per day, then you're going to have lots of smaller queries. I imagine the database people would be far happier with small queries rather than one huge query. That's basically my main point here.
D
I mean, for instance, we can extract data from two years ago, then from one year ago, and then for the current year, like that.
F
B
F
And one week ago? I still think that's going to be too much and you're going to get a statement timeout. So we're looking at preferably getting one day of data per query, or maybe per hour; the actual interval between queries needs to be adjusted. But if we try to get a week's worth of data in one query, then we're going to get a database timeout, and someone's not going to be happy if we do that far too often.
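[The chunked-extraction idea being discussed could be sketched roughly as below. This is only an illustration, not the actual pipeline: `daily_windows`, `extract_range`, and the `fetch_window` callback are hypothetical names standing in for whatever query the real extraction job runs.]

```python
from datetime import date, timedelta

def daily_windows(start, end):
    """Split [start, end) into one-day windows so each extraction
    query stays small, instead of one huge week-long query."""
    day = start
    while day < end:
        yield day, day + timedelta(days=1)
        day += timedelta(days=1)

def extract_range(start, end, fetch_window, retries=3):
    """Run fetch_window(window_start, window_end) for each day,
    retrying a few times on a statement timeout instead of
    failing the whole range."""
    results = []
    for w_start, w_end in daily_windows(start, end):
        for attempt in range(retries):
            try:
                results.extend(fetch_window(w_start, w_end))
                break
            except TimeoutError:
                if attempt == retries - 1:
                    raise
    return results
```

[A week-long range then becomes seven small queries rather than one large one, which is the pressure argument being made here; the window size, per day or per hour, would still need tuning as discussed.]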
F
B
The proof of concept is definitely a proof of concept; it doesn't need to be production-ready. We definitely need to not cause too high a load on the database, nor have it fail because the queries time out. However, I think automated scheduling to pull the data and process it is probably not needed for the proof of concept; we can babysit it manually.
B
Maybe if the data pull doesn't work, for example, or things like that. But we don't need to decide that now; I'm okay with manual scheduling for the proof of concept.
F
B
D
So can we start a new issue to discuss thoughts on how UnReview might be used by self-hosted customers? I feel that we need to discuss these things, because every time we propose something, other team members say something like: okay, we cannot use this or that technology, because we need to support self-hosted customers. So maybe we need to start discussing these things to understand.
D
Maybe we need to introduce several strategies: for example, one strategy for gitlab.com and another for self-hosted customers, because we need to understand the pressure we can produce while training models and extracting data.
B
You know, it's a great question, Alex. The concern is that we have too many things going on at once and we need to make progress on some of them. However, some of these decisions we make may not be two-way-door decisions, meaning once we go through the door it's hard to go back through it. So the current philosophy is that we believe most self-hosted customers...
B
This can change, but they wouldn't want to run this themselves, due to not having the processing power to do so, or not wanting to dedicate it. So instead what we do is leverage something that doesn't exist yet: an ability for self-hosted customers, if they choose to, where we grab the data within the customer's instance, send the data to our cloud, process it, compute, run the models, and then send the resulting recommendations back. So we can depend on cloud-based resources.
B
C
Yeah, I think so. There's also still a thought that this could potentially be tied to usage ping, to incentivize people to turn that on, and because we already have a mechanism there for sending data back to GitLab. I'm not convinced that's the right choice here; I think talking to some customers would help decide that. So I think we can start having some of these exploratory conversations.
C
B
Can we make the assumption that we can address that later? We may need to do some refactoring if we decide to support self-hosted in a different way, but not making that assumption makes it harder for us to progress on the PoC and the gitlab.com, the GitLab-hosted one. So I'd...
A
B
C
The other thing I would say is that, as we run this on our own codebase, we'll get a better sense of what the actual compute requirements are. If it's not crazy, there is very much a possibility that we may just offer this as a specialized runner or something. So I think we still just need a little bit more information about how this is going to run, and what the performance and cost profiles look like, just running within our own data set.
D
Yeah, thanks. Right now the performance is not, I mean, crazy; it doesn't require a lot of resources, because the model is not so difficult. But as soon as we start to improve the model, we'll produce much more pressure.
D
Sorry, right now we can use a GPU to train the model, because it supports GPU.
D
F
E
Sure. Wayne, I just mentioned this; I don't know if exactly the same strategy would apply here, but it might be worth checking out how we did it for Sourcegraph. It's different, because it's a third-party software application, but we do allow customers that have air-gapped instances to set up their own instance of Sourcegraph, and GitLab then connects to their local instance of Sourcegraph instead of relying on sourcegraph.com.
E
C
To mention, there are three other models for running additional infrastructure: Sourcegraph, Gitpod, and Elasticsearch. All three of them require separate infrastructure on self-managed if you want to run them, and have different configurations. So there are three places, or groups, you can talk to. I worked on all of them, but I'm happy to let you go talk to the PMs who work on them.
B
Still, I think we actually happened to cover your comment already, but did we cover it fully, or not yet?
A
Yeah, sorry, just as I was typing it, I think it was actually answered on both parts: do we need a GPU, and what do we do with self-hosted. It sounds like we have that answered.
E
Sure. So, yeah, hi everyone, just wanted to say hi from the Code Review group. Excited to see the work of this group, excited for the news. I understand that right now we're at an implementation stage, setting up the groundwork and everything, so it might be a little bit too soon, but I just wanted to make ourselves available if you need to know anything about Code Review, anything down the line. I'm the frontend engineering manager for the group.
E
B
I appreciate it. Actually, your and Kai's and Sean's groups have already given us some great ideas. We're not planning to integrate with the GitLab product at all until milestone two, which on our current schedule wouldn't be going live for four months, and we're not going to start working on that for two months.
B
We're working on the proof of concept now, which doesn't change any code in the GitLab product itself, and then milestone one. But glad to see you here; of course you're always welcome, and we'd love to have your feedback on things. We won't be asking you to do anything other than give us feedback on our plans for probably about two months.
E
That's fine, I'll be hanging out on the weekly whenever I can, but if you ever need anything, feel free to ping me asynchronously and I will help. Cool, thanks.
B
Okay, we're about out of scheduled time. Anything else we should discuss?