From YouTube: Package Think Big
Description
We discuss focusing the team on the dependency proxy, pipelines for packages, and metadata.
A
Hello, and welcome to the Package monthly Think Big, where we talk about issues that are a little further out on the horizon and some ideas that will help us get there. We have a bunch of things on the agenda today; I'm going to pull up the first one.
A
I have the first one. So Dan and I were talking yesterday a bit about the idea of wanting to focus the team on maybe one initiative for a milestone and seeing how that works. And it's feeling like — we have Hugo joining, who's not yet on this call; maybe he might not have been invited. If someone could ping him and just send him the invite? "Yeah, I'll take care of it."
A
Oh, thanks. And David has been just crushing those cleanup policies — making awesome progress on the scalability and performance and rolling them out for historical projects — and we've been making a bunch of core improvements to the dependency proxy.
A
It also came up a bunch during the issue about how to grow the team: where does the dependency proxy live? Does it make sense to split the product into a container image proxy and a package proxy? So maybe it makes sense to de-risk some of that work in general and focus on that for a given milestone or beyond. So I thought it would be helpful to—
A
Okay, so for the container — well, I didn't think that there's any way we're going to get around the registry development and deployment work. I know, Haley and Juan, that's going to be taking up our time through the end of next year, but I thought we might be able to make time for moving the dependency proxy to Workhorse, which requires some more Go work and also some Rails work.
A
We also, during our most recent — actually, Steve uncovered that the dependency proxy is throwing some controller errors, which is contributing to our error budget. So this might be another good one to tackle, because currently almost 90 percent of our error budget is the dependency proxy. It would also be very helpful if we had user-level data and we were counting this towards our monthly active users: the dependency proxy is pulling over a million images per week and we're not getting any credit for that in our user data at GitLab.
A
So I think that would be very helpful as we look to expand the product. And when you consider that we're downloading a million images per week, I think it makes sense to start to add in this job that clears the cache automatically, before we get into a similar situation.
A
As with the container registry, where we have petabytes of data. And then — something that, you know, we've kind of punted on for a while, but I think it would be great to start considering the user interface for this. So creating a GraphQL query for the dependency proxy that could return a list of images and manifest information would be really helpful, and we could start to present that information to users in the user interface.
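A minimal sketch of what such a GraphQL query might look like — the field and argument names below (`dependencyProxyImages`, `manifestDigest`, and so on) are illustrative assumptions, not GitLab's actual schema:

```graphql
# Hypothetical query shape — all field names here are assumptions,
# not part of GitLab's real GraphQL schema.
query {
  group(fullPath: "my-group") {
    dependencyProxyImages(first: 20) {
      nodes {
        imageName        # e.g. "library/alpine"
        manifestDigest
        size
        lastPulledAt
      }
    }
  }
}
```

A query along these lines would give the frontend the image list and manifest information it needs to render in the user interface.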
A
But I suppose we could do the same thing with images pulled from Docker Hub, and we could add a simple rule that says: don't ever pull images that have "red hat" in the name, for instance, or where the author name has changed in a given time. This issue was actually opened by a customer recently, and it suggests that we basically just have a list of allow and deny patterns — and I thought that this would also be a good idea.
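The allow/deny-pattern idea from that customer issue can be sketched roughly like this — a hypothetical illustration, not GitLab's implementation; the regex pattern syntax and the deny-wins precedence are assumptions:

```python
import re

# Hypothetical sketch of the allow/deny lists discussed above: before the
# dependency proxy pulls an image from Docker Hub, it checks the requested
# image name against deny patterns, then allow patterns.
# Precedence rules and pattern syntax here are assumptions, not GitLab's.

def is_pull_allowed(image, allow_patterns, deny_patterns):
    """Deny wins over allow; an empty allow list means 'allow everything'."""
    if any(re.search(p, image) for p in deny_patterns):
        return False
    if not allow_patterns:
        return True
    return any(re.search(p, image) for p in allow_patterns)

# e.g. the rule from the call: never pull images with "redhat" in the name
deny = [r"redhat"]
allow = []

print(is_pull_allowed("library/alpine", allow, deny))  # True
print(is_pull_allowed("redhat/ubi8", allow, deny))     # False
```

The deny-first ordering is one reasonable design choice; the actual feature would need to settle precedence and pattern syntax during solution validation.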
A
Yes — so she says that we're starting the first layer of dependency proxy tests in this milestone, but she does imagine that this work will continue into future milestones. And then, David, I think you had the next comment.
C
Yeah, so first: I'm totally fine continuing with cleanup policies.
C
Yeah, my question was — seeing the milestone review issue, the items were all about the dependency proxy for container images. Should we extend the scope of this to the dependency proxy for packages? I think we already have some ideas, like changing the current implementation into a real proxy so that we can easily add caching.
A
I would definitely support that, and I think we can probably — yeah, I could add those issues as—
D
I would want to maybe consider, though, with the migration to Workhorse — in my view, I feel like maybe that should happen before we start to add more and more. Because — or at least, thinking about it — it seems like the kind of thing where, if we add more now, with different packages, and start refactoring, then whenever we do the change to Workhorse, that might end up being more work on top of what we already have.
A
What do you think, Juan and Haley? Do you think that you would have time to maybe help out with some of the Workhorse implementation for getting the proxies switched?
E
I think we're a bit busy. We could probably help with the Go stuff, but when it comes to Workhorse, it's a completely different application with its own logic.
A
Makes sense. And Steve or David — have either of you worked on Workhorse before, or is this something where we may need to try and recruit help from outside of the Package group? Because the move to Workhorse is an infradev issue as well, so we might be able to bubble it up and get some help from outside our team.
C
The modifications we did for the file uploads — part of them were in Workhorse. So I guess we are familiar with the ping-pongs between Workhorse and Rails.
C
The thing is that, for this part, we need to create custom logic on top of that, so I might need a bit of time to investigate, or perhaps external help — but it's not like we don't know a thing about Workhorse.
D
Yeah, I think it depends on, you know, how big of a factor time is on this. We can probably become more familiar and take the time to figure out how to implement that, but it might take longer than if we got one of the Workhorse maintainers to jump on it for us. So it kind of depends on that time factor — or on whether we want to gain the experience of becoming more familiar with Workhorse by doing it ourselves.
C
Yeah — basically, what Workhorse does for uploads, we need to apply in a GET request. And on top of that, we need custom logic that will tell Workhorse: "hey, you need to download data from this source and upload it there" — and I don't think that exists in Workhorse today.
A
Okay, well — let me follow up on that and see. Also about pulling the npm — sorry, heliburgena.
F
Yeah, I don't want to volunteer this person, but I think Jaime might be someone to ask about this. He might have some good development time that he could spend on this.
D
And then the other aspect is: for the most part, the Workhorse work needs to be done in order for the Rails work to be done, so it might make more sense to schedule those in separate milestones.
E
I would just like to see some cache eviction and expiration policies done before adding any more features, because with any additional features comes additional data, and we might soon find ourselves in the same situation we're in right now with the container registry — and we still have time to avoid that. So the sooner we do that, the better.
A
Yeah — so we can still — I'll just pull up the screen again — so we could still prioritize... We can't do the — well, maybe we could do the Workhorse portion; I'll follow up on that. The Rails portion will need to move back a milestone from wherever this work is done. But we can investigate the controller errors, we can add the user-level data, we can prioritize removing artifacts that are older than 90 days, and we can also potentially add a GraphQL query — or would that be—?
G
Okay — sorry, do we have any performance concerns around pulling that info out of the dependency proxy? It's true that the data is in the models, but if they're heavily linked, we may need to consider performance.
A
Okay — well, I like it. I'll follow up on the Workhorse portion, and I'll try to keep this milestone focused on the dependency proxy and see what we can fit in. I like the idea of having one central theme for the P1 issues for a milestone and seeing how that goes.
B
Okay, yeah — one of the things there: I chatted with Sam about it, and Sam was mentioning that he's worked in other places where that swarming thing can be really good — the whole team working together on one thing. And that's kind of what was in my head as well: let's all focus on the same thing together, if possible. I'm not dismissing your points — you're right on—
B
You know, the work that you and Haley have going on with the container registry is super high priority as well, but being able to all work together is kind of a cool thing. So if we can make that happen in a way that makes sense — that's kind of my main motivation: just to break up the way we're working and get everyone together on the same goal. That's the intent there.
B
So if there are other options that you might want to suggest, I'm happy to look at those too, guys. I'm sorry.
A
I agree — cool. The next thing: I was hoping, David, that we could just talk quickly about pipelines for packages. Quick background for everyone, if you haven't seen the demo and the issue discussion: we talked about this last month in our Think Big.
A
The idea is to use package updates as a trigger for pipelines, because right now pipelines are focused only on the repository.
A
So if you want to run a package-specific pipeline, you have to go through and install the package, and there are a lot of extra lines you have to add to your pipeline. Since then, we've done some competitive research and seen that GitHub Actions supports package updates as a trigger, as does AWS's pipeline product — whatever they call it, I forget. And so we invited dev to hear about this and talk about next steps.
A
Originally, we were thinking that problem validation would make sense — we would go and talk to customers and ask: is this a problem? We've kind of heard from customers that this is needed, and seeing the competitive research, it's maybe a little more clear. So I was thinking that it makes sense to move more to solution validation — to be able to show customers a specific idea and say: this is what we have.
A
That's a lot of product-manager words — I apologize for all of those at once. Anyway, David, I know you were talking to folks on the Pipeline Authoring team, and I was wondering if you had any follow-up items or anything you wanted to add to that.
A
Okay. I think what we really want to get to is: what could an MVC of this be that we could deliver? Because I think that will be the most valuable thing we could shop around to customers.
A
In my customer conversations I've been just kind of gently probing about it, and it seems like people instantly just say yes. I had a Twitter poll — with my huge Twitter audience of, you know, 100 users — and 100 percent were saying that this functionality would be useful. So I think we're zooming in on something, and having an MVC proposal would be really valuable for taking the next step.
C
Yeah, I'm thinking of something super basic where, when you push a package, you get a pipeline, and the jobs in this pipeline simply pull the package archive; then, in the job configuration, you do whatever you want with that file. Although we might want to start with specific package types first, because not all the package types are archives or files — for example, Composer is a git tag, not an archive.
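To make the MVC concrete, a pipeline-for-packages configuration might look something like the sketch below — note that no such trigger exists in GitLab CI today, so every keyword, pipeline source, and variable here is hypothetical:

```yaml
# Hypothetical syntax — a package-publish pipeline source does not exist
# in GitLab CI today; all keywords and variables below are illustrative.
scan-published-package:
  rules:
    - if: '$CI_PIPELINE_SOURCE == "package_publish"'   # hypothetical source
  script:
    # the runner would fetch the package archive instead of cloning the repo
    - scan-tool "$PACKAGE_FILE"
```

Here the runner would skip the git clone and instead download the published package file, matching the "pull the package archive" behavior described in the turn above.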
A
I would vote for starting with npm or Maven, whichever one is easier from a technical perspective. From a data point of view, npm is the most popular in terms of pulls, but Maven sees the most pushes. So if we were going to focus on one event — when a package is published — we'd probably see the most events for Maven, even though npm is—
B
I mean, I think the other thing to consider there is: we have some pretty large customers who use both npm and Maven, who we could consider reaching out to, to discuss how they might want to use it and whether it's valuable for them. I know some of those customers — not mentioning any names, because we're online — most of those customers have been really keen to work with us on future development efforts, so it might be a cool opportunity to partner with a large customer.
I
Maybe it's too early to ask this question, but have you thought about how the user needs to define or configure the pipeline to be triggered? I mean: is it the same user that defined the pipeline — do you need that user to define the pipeline if it will be triggered because of a package change, or are those different users? There are two questions: how users define it, and whether those are the same users that are defining pipelines.
A
I can answer that. I think it depends on the organization. At a small or midsize customer, it's probably the same person setting it up — probably a developer or a DevOps engineer. But at a large enterprise, it's much more likely to be the center-of-excellence team that's maintaining this list of packages and the package registry for a set of groups underneath them. It's still going to be a DevOps engineer, though — still going to be whoever's used to writing pipelines.
I
Cool. David, can you maybe elaborate a bit more on why you think those are separate locations where you need to configure it?
C
Yeah — so the jobs I used to demonstrate this feature were totally different from the usual jobs we have currently. Even for the runner, I had to make some modifications, because when the runner picks up the job for the package pipeline, it doesn't have to pull a git repository; it has to pull a package file. So it's really different — the starting conditions for the job are different in the pipeline for packages.
A
There's functionality that people would want: they want to know when a package is updated; they want to scan basically their entire package registry — or container registry — when there's an update; or, when a new version is published, they want to delete any old versions, for example. So there's a lot of stuff where they may not want to run their entire development pipeline or pull a repository — they may just want to run something against their package registry, for instance.
C
Yeah, exactly — the main event is a package being pushed into the GitLab package registry. But it's not necessarily the case that we have the source code or the build info for that package, because the package could come from a project within GitLab, or from an external tool, or even manually: you can just pull up a terminal or console, build your package, and push it to the GitLab package registry, and that's the start of the pipeline.
A
It seems like David has a meeting tomorrow with a developer on the pipeline-authoring side, and they're going to talk through a potential MVC.
A
Once we have that MVC, that'll be something we can validate with customers. As I mentioned, we have several customers that have expressed interest in this — so going to them, reviewing the proposal, seeing if it makes sense and fits their needs, and then from there scheduling an issue and seeing if we could build something.
A
So one thing that we've seen a lot of issues for: basically, you know, we've had one data model for the package registry for a while. Let me just share my screen again — I've collected a bunch of issues in one epic that are all related to users not having the data that they want.
A
So I just wanted to bring this up and say that this will probably be a major focus for us in the coming months. We have an issue scheduled for 14.2, for example, to bring in more npm metadata, and we'll need to do the same thing for each of the package-manager formats. So a lot of the work is then making sure that that shows up in the API — and, you know, there's a bunch of other things like that. Not really the best discussion item, I realize, as I'm talking through it.
G
Yeah, I wanted to ask if, in that epic, we're including the — so I think there's a model refactor that is due, which is kind of big, but maybe David has more info about that. I know because I discussed it with him.
C
Currently we do have a number of models in the Rails back end, and those models get updated — we throw more code inside of them for each additional package type support we add. This is slowly becoming what I call giant bags of everything, and that's not great. So we have this idea of splitting this up and organizing the code more cleanly. And yeah, I think they're really two different things: splitting the models, and having a proper backup for the package features.
C
Yeah, I think that's the expected thing: we would need to back up all the packages.
A
That's something I hear probably two to three times a month from customers, I would say. Talking to large enterprise customers, they always ask: what about backing up the registry? We can't do that now. So — not every day, not every week — but it does come up often, and we haven't really scheduled anything like that.
C
My only concern here is that it's not only data that we have in the database; it's also data that we already have in object storage.
A
Yeah — I'll have to follow up on that; I'll probably schedule it a bit further out, but it's something to follow up on. And with regard to the metadata that comes along with the packages: we'll try and schedule that bit by bit as we move forward, starting with npm. We also have issues scheduled for Helm — extracting the metadata and presenting it — so we'll work through each format, one at a time.