From YouTube: 2021-11-09 Object Storage WG - APAC
Description
Working Group Page: https://about.gitlab.com/company/team/structure/working-groups/object-storage/
B
Yeah, so as I was watching the intro video I was just curious: moving to a single catch-all default bucket for everything — should we be aware of any underlying issues or gotchas? Because I was wondering why we were using different buckets per feature in the first place.
A
I think what happened here is that we ended up building one feature that required one bucket — I'm not sure if the first one was artifacts or large file storage — and very likely, at that point in time, it just looked like: let's create an artifacts configuration. Then several milestones later we said: okay, now we want to do large file storage, we definitely need a configuration for storing that too. Each one maybe just started as a folder or a directory and eventually became a bucket. Probably we just grew with this.
A
It's worth trying to figure out if there is a strong requirement here. One step back: when I refer to a single catch-all bucket, I'm not thinking that we should force everyone to use just one bucket. It should just be the default.
A
It's like what we have right now with the consolidated configuration: in that case the buckets are still different, but basically you have a shared configuration in terms of provider credentials and things like that, and you can still override every single one. Now, I do believe that most installations can do just fine with one bucket, but I was talking with Andrew about this, and there is at least one case where it's good to have more than one.
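For reference, here is a minimal sketch of what that consolidated form looks like, roughly in the shape of Omnibus `gitlab.rb` settings — shared provider credentials defined once, with per-feature buckets that can still be overridden. Bucket names and credentials below are placeholders, not a recommended setup.

```ruby
# Consolidated object storage: one shared connection, per-feature buckets.
# Values are illustrative placeholders only.
gitlab_rails['object_store']['enabled'] = true
gitlab_rails['object_store']['connection'] = {
  'provider' => 'AWS',
  'region' => 'us-east-1',
  'aws_access_key_id' => '<ACCESS_KEY>',
  'aws_secret_access_key' => '<SECRET_KEY>'
}
# Each feature can still point at its own bucket (or they could all share one).
gitlab_rails['object_store']['objects']['artifacts']['bucket'] = 'gitlab-artifacts'
gitlab_rails['object_store']['objects']['lfs']['bucket']       = 'gitlab-lfs'
gitlab_rails['object_store']['objects']['uploads']['bucket']   = 'gitlab-uploads'
```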
A
One point where we said it is good to have multiple buckets is billing. In a big installation like ours it's really hard to enumerate what's inside a bucket and figure out, say, how much we are spending on artifacts compared to registry images — even though registry images are outside the scope of this working group. But yeah, you've got a point: when everything is in one bucket, all you can say is that this is the bucket, and the total amount of data stored in it is this size.
A
The reason I'm insisting on this single catch-all bucket is that right now it's really hard to ship a new type of upload, because there's no place to store it unless you configure one. That's why we need to have one, but it doesn't have to be the only solution available.
C
Yeah, my thoughts on this were — and take this with a grain of salt, because I have actually not worked with object storage that much at GitLab — that at my previous company we did separate data into different buckets for security reasons as well. You actually had a separate key pair for every single bucket, depending on what kind of data resides in it. I think you basically already said it.
C
This might not be necessary for every type of customer, but I could imagine that for ford.com, for instance... I would love to hear the opinion of a security expert on this, but I would feel a bit iffy about having Git blobs reside next to image uploads in the same GCS or S3 bucket, because if those keys are compromised, then all of the data would be at risk.
B
So a follow-up question on that: given we're only going to suggest a single catch-all as the default, but still support different buckets if a user decides to use them — and maybe this is just my lack of knowledge of object storage — would that add more complexity, given we need to support both a single catch-all and different buckets? Or not really?
A
Basically, what happens right now is that when a request comes in, it goes through a route where Workhorse knows there can be an upload. This route then goes through a pre-authorization stage: at the beginning of the request, we send some information about the request — not the request itself — to Rails, appending /authorize to the end of the route.
A
That gives us credentials for uploading something. What's happening here is that, because each feature has its own bucket, this authorize method basically gives you a pre-signed URL inside that bucket.
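As an aside, here is a rough sketch of the kind of thing such a pre-flight could hand back: a pre-signed URL scoped to a single object key in the feature's bucket. The field names, bucket, and key layout are illustrative only, not GitLab's actual schema — in GitLab the response is built by Rails and consumed by Workhorse.

```ruby
require 'securerandom'
require 'aws-sdk-s3' # illustration only; GitLab itself goes through Fog/CarrierWave

# Hypothetical /authorize-style pre-flight: return a pre-signed PUT URL so the
# component in front of Rails can stream the request body straight to object storage.
def authorize_upload(bucket:, key:)
  presigner = Aws::S3::Presigner.new

  {
    'RemoteObject' => {
      'ID'       => SecureRandom.hex, # opaque id for the pending upload
      'StoreURL' => presigner.presigned_url(:put_object,
                                            bucket: bucket,
                                            key: key,
                                            expires_in: 3600), # signature is bound to this one key
      'Timeout'  => 3600
    }
  }
end

# authorize_upload(bucket: 'gitlab-artifacts', key: "tmp/uploads/#{SecureRandom.uuid}")
```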
A
Now, can we move away from this? The current approach has some limits. For instance, when you upload artifacts you also need to generate a metadata file, so one file comes in and two should be uploaded. That is not supported by the current implementation, so what happens is that you get pre-authorization for the artifact, but then the metadata file is generated on disk and uploaded to object storage by the Rails controller, which is suboptimal.
A
What I'm thinking is that if we move away from this "proxy the whole request" type of approach and go to a simpler internal API, then when we scan the request and find a multipart upload, at that point in time we have more information about it — the file name, for instance. At that point we just tell Rails: I need to upload something, and we can pass along what we know about it in that API call.
A
So we can give it the URL, the credentials from the user... we still have to figure out exactly what we want to do here. But the point is that at that moment we can ask Rails to provide the same type of information — a pre-signed URL or object storage credentials, depending on the level of implementation we choose — and we can make a more informed decision.
A
Rails could answer: this is the URL. Or it could flag an exception: I'm an artifact, so I know I should do special stuff, and this is how you should respond to that API call. Otherwise it just gives back a URL. And this is also where we could detect that we are in a special bucket.
D
Hi, I joined a little bit late, apologies for that. A quick question on what you were just saying about the prefix.
D
Will there be any authorization on who can actually call it, or any access permissions on it? Because I guess if you don't restrict who can actually use those prefixes, you can end up in a situation where someone can guess a prefix and access a part of the storage that they should not be accessing.
A
This is exactly the same problem that we have right now. Basically we have one credential per... let's say each bucket has its own, but I'm not sure that's how we're running it. The point is that at the Workhorse level we are trustless: we don't know what is happening. We just ask Rails, which owns all the information and can validate access to single elements in object storage.
A
Someone is asking for this object: is it allowed or not? Then in Rails we can do user inspection and namespace inspection and figure out whether we can give you a pre-signed URL. All the objects in those buckets are private, so there's no way you can change the URL once it's generated, because the signature is for that specific object.
A
So this is already in place; it's really just about how we want to organize things in object storage. And back to the point made earlier: if we eventually want to have separate credentials for different types of buckets, we should still think about that, at least on gitlab.com.
C
But this is a concentrated risk, right? Because if we have 10 buckets where data is isolated — say Git objects are held separately from something less crucial like user avatars —
C
I would argue that if the user avatar keys were to leak, then sure, those avatars might be at risk, but at least it wouldn't compromise our core data. Whereas if we had a single storage bucket and those keys leaked, that would be a disaster, right?
A
But I think we're using the Google equivalent of IAM credentials, so actually each pod can access everything, regardless of whether it's one single IAM role or multiple. We should check this, but basically if you get into the pod you can get to everything in object storage, and that's true regardless of using one or more credentials: if you have access to the pod and can run whatever you want on it, you can dump the configuration file, and all the credentials are there anyway.
C
All right, another thought I had — and maybe this is getting into the weeds too much already — it sounded like the main problem we're trying to address is basically the complexity for a developer to add a new kind of upload, because they have to deal with all these different layers and they specifically need to configure a new bucket for it.
C
So I'm wondering if maybe there is a way to have both, where we do potentially support multiple buckets, but it's abstracted away: when you add a new kind of upload, you would configure it against something like a conceptual bucket, which the application can resolve into either a single global actual bucket at the object storage provider or into different buckets — because you could also map it to a folder structure inside a single global GCS bucket, for instance, versus separate ones.
C
So I'm actually wondering if these are two problems that we should solve separately: one being the actual storage at the object storage provider, whether that is a single bucket or multiple buckets, and the other being the developer experience, where currently they have to think about all these low-level implementation details.
C
Maybe we put an abstraction in place so they don't have to do that: they just specify the kind of upload, and then we have some kind of machinery or configuration in place — which might differ between .com and a self-managed customer — that decides whether that kind is mapped to multiple buckets underneath or not.
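To make that proposal concrete, here is a purely hypothetical sketch of such an abstraction: an upload kind resolves to either a dedicated bucket or a prefix inside one shared catch-all bucket, driven only by configuration. The class and configuration keys are invented for illustration and are not existing GitLab code.

```ruby
# Hypothetical "conceptual bucket": developers declare only the kind of upload;
# configuration decides whether that maps to its own bucket or a prefix in a
# shared one. Names are illustrative.
class ConceptualBucket
  Location = Struct.new(:bucket, :prefix, keyword_init: true)

  # config shape (example):
  #   { default_bucket: 'gitlab-objects',
  #     overrides: { artifacts: { bucket: 'gitlab-artifacts' } } }
  def initialize(config)
    @config = config
  end

  def resolve(upload_kind)
    override = @config.dig(:overrides, upload_kind)
    if override
      Location.new(bucket: override[:bucket], prefix: override.fetch(:prefix, ''))
    else
      Location.new(bucket: @config.fetch(:default_bucket), prefix: "#{upload_kind}/")
    end
  end
end

# ConceptualBucket.new(default_bucket: 'gitlab-objects',
#                      overrides: { artifacts: { bucket: 'gitlab-artifacts' } })
#                 .resolve(:lfs)
# => #<struct bucket="gitlab-objects", prefix="lfs/">
```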
E
That question is already answered below — the question was whether Active Storage is already out of scope, but it isn't, as mentioned below. Another thing I wanted to comment on from the video was the testing gap on object storage.
We have been observing that in the Package team, and as a solution we have been adding tests to the QA suite, and we have added the ability for the QA suite to be executed against different object storage configurations, which means different providers.
E
It's a lot of work, but it's working: we caught some bugs there before they hit production.
E
Yeah, if you open one of the specs — let's say we open the Maven one — you will see there is a metadata tag called object_storage and, well, I don't know the details, but this will run the suite against the different providers. I think right now we have local storage, GCS, and S3.
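For illustration, a spec tagged that way might look roughly like the snippet below; the exact DSL and tag handling in the GitLab QA framework may differ, so treat this as a hedged sketch rather than the real spec.

```ruby
# Sketch of an end-to-end QA spec gated on an :object_storage metadata tag.
# The orchestration (which backend gets configured: local, GCS, or S3) is
# assumed to live outside the spec, keyed off this tag.
RSpec.describe 'Package Registry', :object_storage do
  it 'publishes a Maven package and stores the file in object storage' do
    # Drive the full instance (Rails + Workhorse + bucket) end to end here,
    # then assert the package file landed in the configured storage backend.
  end
end
```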
A
I'm not really familiar with this type of test. So this type of test changes the configuration file of the GitLab instance itself in order to run against a specific provider?
E
Yeah, these tests run... it's actually a spec that runs against a GitLab instance, not against a controller or something, so we have the whole architecture running: Rails, Workhorse, object storage, everything. And because these tests can configure the GitLab instance, we can say things like: I want object storage running, but I want S3, or I want Google Cloud Storage.
A
Yeah, I'm not sure... I mean, this is great — regardless of the level of coverage, this is what we need. I'm just thinking about how we run tests: the Delivery team has the release managers' rotation, and when I'm release manager I know that not every test can run on staging, production, and canary, precisely because of this. You can run QA on a synthetic instance that just gets generated for you, and there you can run whatever you want.
A
Then you can run admin-level tests on staging, where you can change application configuration, but usually only things that are stored in the database, so application settings. And on production you can't touch anything, so you can only run things that do not require admin-level access to the instance. But this one changes the configuration itself, so to me it's a new type of test that I'd never heard of before. That's why I was asking — but this is great.
F
We actually articulated this as: add test cases for all object storage use cases.
E
For us, we do have feature tests on uploads, but currently they run with inline Workhorse, so that part is good. The only thing is that they run with the local storage configuration, so we don't test against object storage, and that was the main pain point — that's what motivated the move to the QA suite, since there we have the whole architecture. It has worked out great for us.
A
I think our expert on Azure isn't here — maybe he'll join this afternoon's call. I think he contributed basically fixing most of the Azure support, and he also forked the gems that we are using for handling Azure, so he would probably be the best one to answer this question. I have a question for you, though: are the customers that are experiencing problems running with direct upload enabled or not?
F
Yeah, it's about... when I read the initial proposal about having the internal upload API, it seems it's still suggesting that we keep the same flow: it goes through Workhorse, Workhorse calls Rails, but instead of having specific endpoints per resource there would be a generic API, right? So it seems it's still suggesting that we go through Workhorse, and Workhorse would still be responsible for uploading to object storage. So the question is: are we really strict on that requirement, or is there flexibility?
A
If Workhorse was not there and we just went through Rails, Rails would write the one-gigabyte file to disk and then replace part of the request with an abstraction.
A
It's a file reader, basically — a file reader that points to the beginning of that file on disk — and at that point, in the controllers, within your regular controller timeout, you would have to read that file and upload it to object storage in Rails, with the global interpreter lock and all the problems that we know are there. That's the reason why Workhorse is doing direct upload.
A
That being said, if we are willing to suggest an alternative that relies on external components, we can evaluate it; nothing is set in stone here. We can try to come up with a couple of proposals, validate some of them, and then build a long-term plan for this, which is basically point two in the agenda. So we can't do this in Rails — unless someone has a magic solution for it, I don't think we can. We were also mentioning Active Storage; it's also below in the agenda.
A
Active Storage has its own solution for this, also called direct upload, but it only works via JavaScript, and our runners can't really run JavaScript to do the direct upload acceleration. So we would still need something in between that can do the direct upload for us. The question Patrick is linking leads into my next point, which is reviewing the exit criteria.
C
I had a super quick follow-up question before we go to the exit criteria. When we say Rails writes the upload to disk first, is it actually Rails the framework, or where is that differentiation?
A
You're welcome. So: exit criteria.
A
There is a lot at stake here. I think you all understood or grasped this: we can't really get to the end of this by the end of — what is it, January? And there is also another problem, which is that this is a working group, so it isn't strictly binding in terms of your own time and what you work on each milestone.
A
So the right tool for this is the engineering allocation. The idea here is that we will develop PoCs and we will evaluate potential solutions to the problem, but the end goal is more about writing a mid- to long-term plan — in terms of epics or blueprints; it really depends on what we decide is the right tool for writing down this plan.
A
Then, with Marin, our executive sponsor, we will go through the engineering allocation and make sure that the proper teams schedule this work. So — I want to make this really clear — we will not only be doing coding in this working group, and we cannot fix this entirely by the end of the working group's time. I hope this is clear to everyone involved, because otherwise we would get frustrated thinking that we will just fix this, and then we can't. That being said... sorry, I lost my train of thought. Patrick?
A
This doesn't mean that we are going to follow the solution that I mentioned in the introduction; that is just my thinking on the problem. I started working on a blueprint for this basically last year, and then, due to priority shifts, it was never concluded. So this is just a starting point — what I've thought about this over the course of time — and we will evaluate proposals.
A
Some of the links also point to creating an external component that could handle object storage for us, so that whether it's Rails or Workhorse, if we need access to our own files we just ask this component. So this is an option as well.
A
So that's the thing. Basically, Sun is not here, but his point is really solid: we need to have a clear idea of what we are trying to solve and what our requirements are, and then we can evaluate how every proposal matches those requirements. He wrote something here; I'm going to read his points so that we can try to figure out if they make sense.
D
Just one thing: should we extend the meeting to one hour? Because the epic is quite big and, yeah.
A
Okay, let's do this: today I've got another call in the afternoon, another 30 minutes, so I suggest we don't go through the status points here, as we will cover them in this afternoon's recording and all of you can watch it. I just want to pinpoint points four and five very quickly. Point four: we need to make sure that everyone who knows something about object storage is involved in this on some level.
A
Maybe we can go to one hour but meet every other week, so we do one week APAC-friendly and one week Americas-friendly. And then, if we're working on PoCs, maybe we can make sure we are paired across different time zones, so that if we're working on something, someone can still talk about what happened. One hour is probably better, but having one hour in the morning and one hour in the afternoon is not really something we can do.
A
Is this okay for all of you? Do you have other ideas or a counter-proposal? Okay, okay — so thank you. I'm sorry we had to rush at the end; we will try to figure these things out. Please watch the recording of the afternoon session, and I will adjust the meeting schedule so that we have the time-zone-friendly meetings every other week.