GitLab Package Group, 9 Sep 2021

From YouTube: Discussion about caching the list of tags for container repositories

Description

Related issue: https://gitlab.com/gitlab-org/gitlab/-/issues/327311

A

Okay, so hello, everyone. This is the Package team. We are going to investigate an issue that we are having with the group-level API endpoint that lists the container repositories under a given group.

A

Let me share my screen.

A

Okay, so this is the issue that we have. It's been around for a while, five months already, and this is the API endpoint. The problem is that whenever someone invokes this endpoint and sets this parameter to true, Rails will pull the list of tags for each one of the repositories under the group and return that in a single response. So if a group has 10 repositories, Rails will first query the database to get the list of those repositories.

A

So that's fast, but then, if tags is set to true, Rails will issue one request per repository against the container registry to get the tag list, and that's the problem, because it takes a long time if there are a lot of repositories.

A

So I created this issue a while ago, but there are still some open questions that we need to answer, and I was mainly concerned with the approach to invalidating the cache.

A

So, let's see. The first thing that I listed here is that we need to double-check that whenever we pull from a repository or push to a repository, we always request a token from Rails, regardless of whether the repository is public or not.

A

So I think so. I looked at this not too long ago for the registry migration work, and that was the conclusion, but for demonstration purposes, let's make sure that it is the case. By the way, can you read this? Probably it's easier this way. Okay! So what I have here on the top side: basically, I'm putting a man-in-the-middle proxy between the Docker client and the GitLab registry, so that we can inspect the network requests that go from my machine to the GitLab container registry.

A

So I have a repository.

A

So let's try, yeah, it can be this one. So first I'm going to take an image and push it to a repository of my own. This is a private repository, so if we do this...

A

Access forbidden, so I have to docker login.

A

Okay, let's try again. Let me just resize that and push it down.

A

And let's push again.

A

And can you give me just one minute so I can open the door?

A

Okay, and I'm back. So I just pushed an image to the registry. Now, let's inspect the list of requests that happened here. The first one, the top one, always happens. It's the client checking that the registry supports the v2 protocol, so there is a request without any authentication, and then the registry will complain and say you need to authenticate.

A

If you want to talk to me, this is the API endpoint that you need to use to request the token. Then the Docker client will do so: it will send my credentials to the server, the GitLab API endpoint, /jwt/auth, and it will request a token to access the repository that I'm trying to push to. The API will return the token, and from there onwards, whenever we need to communicate with the registry, we will send an Authorization header.

A

The Docker client will fill the Authorization header with the token to communicate with the GitLab registry. I'm showing the token here, but there is no problem, because when this video goes live the token will have expired already.
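The handshake described above (unauthenticated request, challenge, token request, bearer token) can be sketched as follows. This is a minimal illustration, not the Docker client's actual code: the host `gitlab.example.com`, the service name, and the repository path are assumptions, and the header format follows the registry token authentication scheme.

```python
import re
from urllib.parse import parse_qs, urlencode, urlsplit


def parse_bearer_challenge(header_value):
    """Parse a 'Bearer key="value",...' WWW-Authenticate value into a dict."""
    scheme, _, params = header_value.partition(" ")
    if scheme.lower() != "bearer":
        raise ValueError("unsupported auth scheme: " + scheme)
    return dict(re.findall(r'(\w+)="([^"]*)"', params))


def build_token_url(challenge, scope):
    """Build the URL the client calls (with credentials) to obtain a token."""
    query = urlencode({"service": challenge["service"], "scope": scope})
    return challenge["realm"] + "?" + query


# Hypothetical header returned with the 401 from the registry's /v2/ endpoint.
header = 'Bearer realm="https://gitlab.example.com/jwt/auth",service="container_registry"'
challenge = parse_bearer_challenge(header)

# Hypothetical repository path; the scope lists the requested actions.
url = build_token_url(challenge, "repository:my-group/my-project:pull,push")
print(url)
```

The token returned from that URL is then sent as `Authorization: Bearer <token>` on every subsequent registry request.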

A

So that is fine. We know that pushing always requests a token. If we try to pull that image once again, the same thing should happen. So, same thing.

A

We had to request the token from the GitLab API. So that's fine. Now, let's try to do the same, but from a public container repository, just to make sure. It can be... Can you try something?

B

Can you push a second tag with a different name? I want to know if the token is reused, somehow.

A

No, that is... yeah. Let's try that.

B

Let's reset, let's push.

A

So it always requests a token before any of the operations. Even if it's the exact same operation multiple times, it will request the token again and again and again. So that should be fine. So the next thing is trying to pull an image from a public repository.

A

It can be this one, and...

A

Yeah, so same thing.

A

It will fetch the token for that. Even though it is a public repository, it will request a token regardless.

A

And it will download the image, which is pretty large, so let's stop that.

A

Yeah, I think we can close this one.

A

We always get a token from the Rails API before pulling from or pushing to a repository, regardless of whether it is public or not. And then the second point that I have here is related to third-party registries, which may not or will not authenticate with the GitLab API, and that means that we can't rely on it for this feature. Which means that this feature, caching the tag lists, can only be done for GitLab.com and for self-managed instances using the GitLab container registry.

A

I think that's fine, because right now the main problem that we have is with GitLab.com. We could ship this for self-managed as well, but there, determining the version of the registry is not that accurate. I think we still do have something in place to detect the version and vendor, but it may change, and the priority is really GitLab.com, so that should be fine regardless. I think we should put this behind a feature flag and then make sure that it only works for GitLab.com at first, and that's it.

A

I think we already have something like that for cleanup policies, right? We do a check if we are on GitLab.com or not. Yeah, we do. So it would be something similar: checking if it is GitLab.com and then checking if the feature flag is enabled. And the next one: Redis or database? Yeah, I think this is not a question. Redis should be the place to go. I guess...

A

Some of the work that you already did for the cleanup policies caching should be easy to apply to this as well, or at least the knowledge from it, which is good. And in the end, the only thing that we need to store is really the container repository ID as the key, and then the value is the list of tags, and that's it.

A

There might be some concerns about size, because for some repositories there are thousands and thousands of tags, and those tags can have long names, like if they are a commit SHA.

A

They are like 40 chars in length. If you have 5,000 tags, that's a lot of text, but still, I think that's not a problem, although we can validate with the Redis experts. But I think we should be fine.
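As a rough worked estimate of the worst case mentioned here (5,000 tags of 40 characters each), the sketch below computes the raw size and the size of a compact JSON array. The synthetic tag names and the choice of JSON as the encoding are assumptions for illustration; the actual serialization was still an open question in the meeting.

```python
import hashlib
import json

TAG_COUNT = 5000   # worst-case tag count mentioned in the discussion
TAG_LENGTH = 40    # length of a full Git commit SHA in hex characters

# Synthetic 40-character tag names (SHA-1 hex digests of a counter).
tags = [hashlib.sha1(str(i).encode()).hexdigest() for i in range(TAG_COUNT)]

# Raw characters only, with no quoting or separators.
raw_bytes = TAG_COUNT * TAG_LENGTH

# Compact JSON array: two quotes per tag, commas between tags, two brackets.
json_bytes = len(json.dumps(tags, separators=(",", ":")).encode("utf-8"))

print(raw_bytes)   # 200000
print(json_bytes)  # 215001
```

So a single worst-case repository comes to roughly 200 KB of tag names before any compression.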

A

And then making sure that we can first fill that cache and then empty the cache whenever a token is requested. So basically, if someone tries to write stuff to or delete stuff from a repository, once we generate a token we need to reset the tag list cache and assume that the repository is going to change, and so on the next query to that endpoint we need to retrieve the tag list once again, I think.

A

Yeah, so once, or before, we hand over the token in the authentication service, we have to reset the cache here if the authorized actions include push or delete. So that should be easy to do, and because it's Redis, there is no problem of requiring a write database connection here, which would be problematic.
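The invalidation rule described here can be sketched as below. This is a hypothetical illustration, not the GitLab authentication service: a plain dict stands in for Redis, and `issue_token`, `cache_key`, and the key format are made-up names.

```python
# Actions that may change a repository's tag list.
WRITE_ACTIONS = {"push", "delete"}


def cache_key(repository_id):
    """Hypothetical key format: one entry per container repository."""
    return "container_repository:%d:tags" % repository_id


def issue_token(cache, repository_id, authorized_actions):
    """Drop the cached tag list before handing over a token that allows writes."""
    if WRITE_ACTIONS.intersection(authorized_actions):
        cache.pop(cache_key(repository_id), None)
    # Stand-in for the real signed token.
    return {"repository_id": repository_id, "actions": sorted(authorized_actions)}


cache = {cache_key(1): ["latest"], cache_key(2): ["v1"]}

issue_token(cache, 1, ["pull"])          # read-only: cache entry is kept
after_pull = dict(cache)

issue_token(cache, 1, ["pull", "push"])  # write access: cache entry is dropped
```

The entry is dropped pessimistically: the token only says a write may happen, so the next read of the tag list goes back to the registry.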

A

So I think that's the easy part. Probably the most difficult part is that, ideally, we would only use the cache for the group endpoint.

A

So instead of doing it in the container registry client, we would do it somewhere higher up, so that it only applies to that group-level endpoint of the API.

A

Let's see if I can.

B

It's group container repositories.

A

Yeah, so it's this one. It gets the repositories list and then it does the pagination using that entity, and this is the parameter that may be true. If it is true, it's going to expose tags, and I guess tags comes from the...

B

Container... actually, it's both parameters. If tags or tags_count is true, they will trigger a call to the container registry.

A

Yeah, so it should be tags from the container repository model, and then this one will look at the manifest, and if the manifest is not filled, it goes to the client, and then the client will hit the registry API.

A

So if we did the caching here, that would mean that it would apply to every single functionality that retrieves the tag list from the container registry, and I think that's a bad idea, because a problem with it could impact everything: the UI, normal API requests, even cleanup policies as well. So, ideally, we would do it as close to the group-level endpoint as possible.

A

So do you think we can do it, David?

B

I think, yeah, it's a good idea to start with the API, and if it works well, we can always think about moving the caching deeper, but the API is a good starting point. When do you cache things? On the first call to this API?

A

Yeah, okay, I think that works. So basically, whenever it is called, before going to the registry, we check if the entry exists in Redis. If it does, we read from there; if not, we let it go to the registry and then fill the cache. So basically, that check happens for every single request: check if the repository cache entry is filled, and if not, fill it from the registry.
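The read path described here is a read-through cache. A minimal sketch, assuming a dict in place of Redis and a hypothetical `fetch_from_registry` callable standing in for the container registry client:

```python
def tags_for(repository_id, cache, fetch_from_registry):
    """Return the tag list from the cache, filling it from the registry on a miss."""
    key = "container_repository:%d:tags" % repository_id  # hypothetical key format
    cached = cache.get(key)
    if cached is not None:
        return cached
    tags = fetch_from_registry(repository_id)
    cache[key] = tags
    return tags


registry_calls = []


def fake_registry(repository_id):
    """Stand-in for the registry request; records each call."""
    registry_calls.append(repository_id)
    return ["latest", "v1.0.0"]


cache = {}
first = tags_for(42, cache, fake_registry)   # miss: goes to the registry
second = tags_for(42, cache, fake_registry)  # hit: served from the cache
```

Only the first call reaches the registry; the second is answered from the cache until the entry is reset.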

A

I think it's better to do it just for the API, and we can probably go even further and wrap this with a project-level feature flag so that we can enable it just for... there are only, I think, two or four projects which are causing performance problems in production, and they all belong to the same top-level namespace.

A

So we can even do that at the API level and make it even more specific by only targeting that namespace or those four projects where this is currently a problem, and if it works great, then we can extend it with a percentage-based rollout to others.

B

That could be a bit complex to do, to have a feature flag scoped to a project level, because we directly get the container repositories from the finder, which uses the group. In short, you don't pull projects, you just pull the container repositories out of a single group, and...

B

We don't have the projects.

A

But, for example, here you could still get it, right? These must be instances of container repository, right? And from there you could check... it's not listed here, but I remember you can check the top-level namespace, or you can even call .project and it will give you the project, so we could perhaps use that before serving. For project, maybe it's not a good idea, but we could likely do it for group, right? Because the API is at a group level, so it will always be the same group for every container repository.

A

So you just need to get the group you're getting here, basically. That would be easier, yeah. So we can do it by group. I think we can have it here.

A

And we would limit that to that specific top-level group that is causing problems right now on GitLab.com, and if it works, we could do a percentage-based rollout for everyone else.

A

Just for the API, and then if it works great, we can extend it beyond the API. But invalidating caches and all of that is always tricky, so it's easy to get into trouble. The smaller we start, the better, I think.

B

Yeah, we need both. We need the application setting that says that we are using a GitLab registry, and we need to be on GitLab.com.
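The combined gate discussed so far (GitLab.com only, GitLab registry only, feature flag enabled for a specific top-level group) could look roughly like this. All names here are hypothetical stand-ins, not GitLab's actual settings or feature-flag API.

```python
from dataclasses import dataclass, field


@dataclass
class InstanceSettings:
    """Hypothetical stand-in for instance settings and feature-flag state."""
    on_gitlab_com: bool
    uses_gitlab_registry: bool
    flag_enabled_group_ids: set = field(default_factory=set)


def tag_list_cache_enabled(settings, root_group_id):
    """Use the cache only when all three conditions hold."""
    return (
        settings.on_gitlab_com
        and settings.uses_gitlab_registry
        and root_group_id in settings.flag_enabled_group_ids
    )


saas = InstanceSettings(True, True, {100})
self_managed = InstanceSettings(False, True, {100})

enabled_for_target_group = tag_list_cache_enabled(saas, 100)
enabled_for_other_group = tag_list_cache_enabled(saas, 200)
enabled_on_self_managed = tag_list_cache_enabled(self_managed, 100)
```

Checking the flag against the top-level group keeps the rollout limited to the namespace that is causing problems, and disabling the flag turns the cache off entirely.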

A

Yeah, so it's this one. Okay, I think that makes sense.

A

I think probably the main question is where we need to intercept these requests to fill the cache or read from the cache, but probably here at the API level. And then, what's the best data structure to save these in Redis, taking into account that the list of values can grow large. I think, on average, from the analysis that we did of the inventory...

A

I think the average tag count per repository is like 10 or 20 tags, right?

C

I think it's way less. I think the average repository has two tags, maybe.

A

Okay, so on average it's not a problem, but if we take into account the ones that are likely to cause problems, those should have thousands of them. So it's better to err on the safe side and come up with an estimation for the size of those values in Redis.

A

Based on that, assume that all tags will be as large as possible in the name, like 40 chars for the Git commit SHA, and assume that repos will have like 5,000 tags or something, and then pick a structure that will fit that.

B

We don't need an expiration on the keys in Redis, right?

B

We will expire them manually. So if we don't need that, we can use the Rails write_multi command, which writes to Redis in a single command. And actually the nice thing with that is that it will take the value and serialize it, and then there is an option to compress the value, because the serialization output is a string, and then you have an option to compress that string.

A

Yeah, that makes sense, because compressing that will certainly save a lot of space.
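The serialize-then-compress idea can be illustrated with a small sketch. It uses Python's `json` and `zlib` as stand-ins, which is an assumption for illustration and not what the Rails cache store does internally; the helper names and the synthetic tags are hypothetical.

```python
import hashlib
import json
import zlib


def dump_tags(tags):
    """Serialize the tag list to a compact JSON string, then compress it."""
    serialized = json.dumps(tags, separators=(",", ":")).encode("utf-8")
    return zlib.compress(serialized)


def load_tags(blob):
    """Reverse of dump_tags: decompress, then parse the JSON string."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Synthetic worst case: 5,000 tags named after 40-character hex digests.
tags = [hashlib.sha1(str(i).encode()).hexdigest() for i in range(5000)]

uncompressed_size = len(json.dumps(tags, separators=(",", ":")).encode("utf-8"))
blob = dump_tags(tags)
compressed_size = len(blob)
restored = load_tags(blob)
```

Hex digests are close to random data, so they are a pessimistic case for compression; tag names that share prefixes or version patterns should shrink further.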

C

Yeah, so I misspoke about the tag counts. The average is 20, but the 75 percent lowest repositories have an average of two, and the 25 percent highest have an average of 82. So, a hockey stick.

A

Yeah, I think that's okay, and we will have a limited set of repositories cached, at least for now. So we can always change the structure if it starts to cause problems, but I think that should be fine.

B

Do we need an API somehow to reset the cache? Could that be useful?

A

Don't we have helpers for that already?

B

Yeah, I mean having a REST API to call that will reset this cache of the list, because...

B

We are using non-expiring keys and, if you run into bugs, you might want to reset the cache or something like that.

A

I think that, well, for GitLab.com maybe that's not a good idea. If this goes to self-managed, yeah, maybe. But given that we will start small and just enable this for a couple of namespaces...

A

If there is a problem, we can expire it manually, our support can expire it manually, or we can even disable the feature flag. So I guess we should be fine without it for a first iteration. Yeah, it was more...

B

For follow-up iterations.

A

Yeah, I mean, if it turns out to work great, then we would like to think about how to make it scale, like applying it to everything. But in the end, I don't think we will want to apply it to everything, like when you are on the UI.

A

Maybe it's not a good idea to use that cache. But let's see. I think starting small makes our life easier as well.

A

So I can't think of any problem. The main question for me was whether we would be able to intercept that at a higher level, because doing it at the lower level would be too impactful and risky. But other than that, I think we're good.

B

My only concern is really the size of the values in Redis, but since we are starting with a feature flag, we can control that growth more or less, so yes.

A

Okay, I guess we could use a list for that, but we could also just use a string with concatenated values.

A

I don't know how the serialization works with Redis, but probably it will use a list and not a string if you pass it an array, I think.

B

I don't recall the code of Active Support, but it does some serialization for specific types, and for all the other types it will use a JSON serializer, I think. So you get a string output and then you can...

B

You can just compress that output for the...

A

Okay, I think that's the main question. I think we can make a real calculation, like, in the worst-case scenario, we expect the list to have this size.

A

And then probably request some feedback from the Redis experts, maybe Andrew or Bob, and they should be able to help us find the optimal solution. But again, because we are starting with just a couple of groups...

A

It won't be a problem straight off the bat.

A

Okay, so I think I'm going to write down the conclusions here, and then we can open a thread to discuss the format in Redis, and that's it. I think we have an implementation plan.

A

Do you have any other concerns?

A

Cool, so let me just stop sharing and stop recording. Thanks, everyone.