From YouTube: 2020-11-17: Experiments with Git object offloading to object storage and partial clone
A: That's got it. Well, there we go. Welcome to the partial clone demo. The script is provided by Chris, and we've iterated on this a couple of times.
A: The idea is to use partial clone to basically scrub repositories on the server side of some data, and we're mainly thinking about large blobs right now. So let's say someone pushes a blob of 100 megabytes. We store it on disk right now, but that is expensive. If we could upload that to object storage, that would be great. Cool, let's get started.
A: Like this, detach, and go.
A: Okay, I was going to copy this slide.
A: Those config options are set by default and, funnily enough, I just noticed that for HTTP they're set in our code base, and for SSH they're set in Rails. Not great, but it works, I guess.

A: All right, and now we're gonna clone from the just-cloned repository, and we're using --no-local, so it doesn't hard link on disk — it actually copies the data and pretends to be a real remote.
A: Cool. Yeah, so now we might not have all the data. Now we're gonna show how many blobs we're missing. It should be 23, and it's kind of cool, this one-liner.
A: So it is git rev-list, which is super nice as a tool. It has 25 flags, which are all pretty cool, especially if you combine them. Now, we just list all the objects and, if they're missing, we print them, and then we pipe it to perl just for the regex match: if a line starts with a question mark, we know the object is missing. Then we pipe it to wc -l, so 23 lines were output, so we're missing 23 objects, which then implies...
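The one-liner can be sketched end to end on a throwaway repository. The names below (server.git, client1) are made up for the sketch — the real demo cloned a full gitlab.com repository — and the filter here is blob:none rather than the demo's size limit, just to guarantee a missing blob:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"

# Throwaway "server" repository with a single committed file.
git init -q server.git
cd server.git
git config user.name demo
git config user.email demo@example.com
git config uploadpack.allowfilter true   # let clients pass --filter
echo hello > file.txt
git add file.txt
git commit -qm 'add file'
cd ..

# Partial clone that omits every blob; --no-checkout so git doesn't
# immediately fetch the blob back for the working tree.
git clone -q --no-local --no-checkout --filter=blob:none server.git client1
cd client1

# The demo's one-liner: list all objects, print missing ones with a
# leading '?', and count them.
git rev-list --objects --all --missing=print \
  | perl -ne 'print if /^[?]/' \
  | wc -l
```

Here the count is 1 (the single blob behind file.txt); in the demo the same pipeline reported 23.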
A: If we need one of those objects, we need to go to the server to get it — for example if we do a checkout where we are missing those objects, those blobs. Now we're going to validate that the server has them all. The server here is a remote repo — a local repository on disk, which we cloned from with --no-local, but it was a full clone from the gitlab.com repositories.
A: For demo purposes. So the server replied first, and then 204, and if we curl this endpoint it's going to show what blobs it has. It should have one blob, and the hash should be equal to the one we just posted — there we go — and this is the size and, as you can see, it's over the 800 kilobytes.
A
So
it
is
considered
a
big
blob
for
the
purpose
of
this
demo.
We're
gonna
hijack
my
path.
Well,
it's
not
really
hijacking
but
we're
gonna
prepend,
my
path
to
use
utility
functions
in
the
test
directory.
A: If I'm going too slowly for you both, you can just tell me, but I'm trying to make it clear for anyone.
A: Great question — it's not. So here we were... wait, where's my mouse... all right, yeah, here. So we clone into client1, right? So what happens first is: when you're reading objects, we're receiving, yada yada — and then git does exactly the same loop again with two objects, and these two objects are in the current checkout.
A: So first it tells the server: give me everything except all the blobs larger than 800 kilobytes. And then git actually needs two of those objects, and goes to the server again and says: hey, you know what, those two SHAs — I want those. So, on demand, it's actually doing a git fetch with two SHAs. This is the second fetch right here. So it went exactly as you thought it went, but then it was clever enough to just fetch those objects it was missing.
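That two-fetch behavior can be reproduced on a scratch repository: one fetch with the size filter, then a second, transparent fetch for the big blob the checkout needs. All names below are invented for the sketch, and the two uploadpack settings stand in for whatever the real server had configured:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"

# Scratch "server" with one blob over the 800 KB limit and one under it.
git init -q server.git
cd server.git
git config user.name demo
git config user.email demo@example.com
git config uploadpack.allowfilter true          # allow --filter
git config uploadpack.allowanysha1inwant true   # allow on-demand blob fetches
head -c 1000000 /dev/urandom > big.bin          # ~1 MB, over the limit
echo small > small.txt
git add .
git commit -qm 'big and small'
cd ..

# Fetch 1: "give me everything except blobs larger than 800 KB".
# The checkout then needs big.bin, so git runs a second fetch,
# on demand, for exactly that blob.
git clone -q --no-local --filter=blob:limit=800k server.git client1

# The working tree is complete anyway: the blob arrived transparently.
cmp client1/big.bin server.git/big.bin && echo 'big.bin fetched on demand'
```

The filtered clone also marks origin as a promisor remote automatically, which is what makes the second fetch legal.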
A: Yeah, so... wait. Oh, the...
A: Yeah, the second one is done because git wants to do the checkout, notices that it's missing blobs — and then doesn't do it... no: notices the missing blobs, and then does do it. So what it had to do is enumerate the objects; it wants two additional objects, and then it's going to fetch those. But what we could do is...
A: I'm just gonna press enter, and maybe I'm wrong and then we can all laugh about it, but I think this is how it works.
A: Yeah, okay, so everyone's back on board, like, this is okay. Cool. Then let's continue on — but it was a good question indeed, that was.
A: There you go. Cool. And this is also what I really like about partial clone: how transparent it is for the end user. So if we were to use this in the GDK, for example, where a program is cloning for you, or in CI.
A: And then we're gonna set this one remote as a promisor remote, and "promisor" in this context means: if I'm missing objects, I can iterate through all the promisor remotes, and any one of those I can just ask, hey, do you have this SHA? And this is used to differentiate remotes from one another. So let's say you have the same repository on multiple servers, and some of them do act as a promisor remote and the others don't — then this is still compatible throughout all those remotes.
A
All
right
so
yeah
a
cat
file,
which
is
a
git
plumbing
feature.
It
just
shows
you
whatever
is
in
the
hash
and
the
dash
b
means
print,
and
this
is
one
of
those
big
files.
The
hash,
it's
the
big.
The
hash
is
actually
this
value
because
we
already
uploaded
this.
We
must
be
missing
it
and
we
pipe
it
through
less
because
it's
a
big
file
and
then
it's
gonna
take
a
while.
A
Yeah
all
right,
I
don't
know
how
to
debug
this,
but
we
can
do
this
async
or
do
you
wanna?
You
have
ideas
now,
chris.
A: Okay, but the point of this cat-file was just to require the object, like the checkout we just discussed: if git needs something and it doesn't have it, it goes to one of the promisors and tries to obtain the object that is missing. So it just did that — it went to one of the promisors, and the promisor returned the object, so the missing object count has decreased by one.
A: But the origin is in the pack... oh, it is. Okay — sorry, blob... sorry, Toon. Wow, this went very wrong in my head. Excuse me. Okay, but...
A: Okay — oh, you wanted to add more color, Chris? Did I miss something?
A: For now, we'll continue with the update hook. So this command I'm just going to execute copies a shell script in place as the update hook. The update hook runs for every reference update — so if you update a branch from one SHA to another. This happens on a push... well, this happens on a push, period, and then it runs the update hook per one of those refs. So if you push multiple branches, then it runs the update hook for each of those branches.
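The interesting part of such a hook is spotting the oversized blobs among the pushed objects. A minimal sketch of that detection, walking HEAD of a scratch repository for simplicity — a real update hook would walk `<new-sha> --not --all` and then upload each match, and the repository contents here are invented:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
git init -q repo
cd repo
git config user.name demo
git config user.email demo@example.com
head -c 1000000 /dev/urandom > big.bin
echo small > small.txt
git add .
git commit -qm 'big and small'

# List every reachable blob over 800 KB (819200 bytes): rev-list emits
# "<sha> <path>", cat-file annotates each with type and size, awk filters.
git rev-list --objects HEAD \
  | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
  | awk '$1 == "blob" && $2 > 819200 {print "large blob:", $3}'
# prints: large blob: big.bin
```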
A
This
hook
well,
first
create
a
big
blob
and
then
I'll
explain
later
so
what
this
does?
It
creates
one
mega
one
megabytes
I
think
yeah
so
had
one
meg
with
count
of
one
megabyte
from
defu
random
and
redirect
standards
out
to
git
client,
one
new
random
data,
which
just
means
create
a
new
blob
in
git
client,
one
new,
random
data.
A
And
we
paste
that,
and
now
we
can
push
that
to
the
origin.
Remote
there
we
go
and
what's
interesting
now
is
that
there's
the
hook
is
actually
being
run,
so
large
push
attempted
with
a
file
named
and
the
size
is
over
the
one.
No,
this
is
like
the
one
megabyte
yeah,
never
mind,
and
this
is
the
upload
blob
output.
A: All right, so now let's clone it again. The host where we clone from is the full repository; again, we apply the same limit.
A: So there we go! Oh, and now we fetch three objects, because in the checkout there's the new-random-data file as well. So that's what the counter means in the "writing objects" line.
A: Okay, so...
A: It doesn't make sense why it triggers... so, it's fine. I think I could reason about it within a few seconds, and if I could do that in a few seconds, everyone can. Git client3 — and I guess the point here is: new-random-data is no longer in the checkout, so we won't fetch it again. So the additional fetch will fetch two objects, not three.
A
A
A
A
So
then
we
remove
a
whole
section
remote
at
origin.
This
is
a
quick
little
trick
that
you
do
it
from
through
config
and
not
through
remote,
and
then
some
zip
command
fun.
Little
story,
remote
is
actually
a
very
thin
wrapper.
The
remote
subcommand
is
a
very
thin
wrap
around
the
config
and
then
the
second
command
just
checks.
If
there's
anything
in
the
config,
which
is
has
the
text
origin
and
the
master
branch
still
points
to
the
origin
as
a
remote.
A: Okay... oh, so that's why: if I interrupt and then just rerun, the repack will not be triggered, so it's much faster, but in the end it just...
B
It's
the
fact
that,
when,
when
it's
fresh
fetch
from
the
http
remote
as
the
fast
import
protocol
is,
is
used
by
this
fetch
inside
it,
it
takes
a
lot
of
time.
But
I
haven't
tried
to
optimize
this.
No.
A: Cool, so...
A
So
these
are
all
the
blobs.
If
I
were
to
upload
blob
and
then
this
sha,
would
it
just
reject
it?
Well,
no.
A
A
So
repack
dash
all
is
dash
a
is
all
I
think,
and
the
dash
d
is
recalculated
deltas.
A: New keyboard — I'm blaming that. So that would be...
B: When you request something from a server — and here we use it to repack, not to send stuff anywhere. So usually, when you repack, you don't use a filter. That's why — I guess that's why they disabled the filter option when you don't choose stdout, but it's an artificial...
B: Yeah, we have to do some kind of the stuff that git repack does.
B
Yeah
by
default,
when
d3
pack
uses
some
options
that
we
should
remove,
because
otherwise
it
it
doesn't
work
with
the
dash
filter.
So
that's
why
I
I
we
have
to
do
a
kind
of
manual
repack
by
using
pack
objects
or
self
and
then
by
moving
the
the
pack,
they
generated
the
pack
files
so
that
they
replaced
the
the
old
ones.
A: It's pretty clever, especially this part where you create a new pack file.
A: But yeah, you can disagree, but I really like how exposed git is — like, no other database allows you to just move some files around. Well, maybe some do; maybe I'm just unaware of them.
A: I just wanted to call out for a second that we touched a promisor pack as well. So if, like, the pack didn't change — so all the objects are in this pack, and there's an index on this pack — and if it's missing, then there's an empty pack, but it has the .promisor extension, which then means that it's a promisor pack — its objects are promised — and yeah, that's why you might see that file.
B: Yeah, I'm not sure why, but I also got 26 sometimes, and sometimes I got 25, so I don't know what's going on there, but yeah — it seems to have worked.
A: Okay, now without piping — I just want to wait — if I pipe this through sort...
A: Oh, that's not sorted... this is sorted. And we just count how many start with six-three: one, two, three. Okay, this is rudimentary checksumming, but I'm just gonna claim that they're equal. So it might be 25 or 26, but the main point is: we've got it on the server if we're missing it in the local repository. And just to show that we can fetch one of those...
A: A lot to digest, especially this part — that was wrapping my brain a little when I saw it — but I also like git just because you can do this. But, any questions from Toon or Pablo?
A: You might hit the 75...
C: Not into the last part as well yet, but yeah — interesting stuff.
A: Yeah, the last part is really scrubbing data from a repository, which is then only present on the HTTP server, because in the previous sections we didn't do any of the weird repack thingies, so the blobs were all still present. And now we had one repository which didn't have any.
B
So
so
the
the
the
goal
of
this
back
object
stuff
is
to
just
to
remove
them
to
remove
these
large
blocks
from
the
from
the
from
the
git
that
database
on
the
server
something
it's
something
we
could
do,
for
example,
every
night
or
or
every
I
don't
know.
A
Yeah
and
the
advantage
of
that
would
be
that
we
have
them
local
if
there
is
a
load
coming
so
only
the
first
time,
it's
missing
and
then
the
rest
of
the
day.
We
have
it.
Let's
say
a
new
repository
is
very
popular
and
then
at
night
we
scrub
it,
and
then
we
try
to
go
without
having
that
foul
locally
again.
A
Because
wait
like
what
I'm
trying
to
prove
is
that
this
gitlab
git
is
it's
missing
objects.
We
just
shown
that
the
26
we
fetched
one.
So
it's
25
now,
but
this
repository
should
be
gitlab.com
like
in
a
few
iterations
of
backend
server
work,
but
for
the
customer
we
want
to
do
a
git
clone
and
then
dash.
No
local,
so
don't
reuse
any
objects
from
gitlab
git.
Oh
I
yeah.
B
Yeah
we
can.
We
can
show
that
yes,.
A
It
might
not
work
well,
we
answer
questions
still,
but
it
would
be
so
cool
if
this
completes
in
in
a
few
seconds,
because
that
would
prove
it's
fully.
A
Then
we
also
see
d,
and
we
just
run
this
one
liner
to
to
counter
misses
missing
objects,
because
then
it
would
be
fully
transparent
to
the
end
user
and
like
in
in
air
quotes.
The
only
thing
we
still
got
to
do
is
make
sure
that
hit
biff
and
all
those
server
actions
do
not
eagerly
fetch
all
the
big
blobs.
A: Oh, this is so cool, okay. This makes me super happy. Very cool, all right.
A
There's
probably
a
lot
of
work
in
the
till
end,
but
the
first
steps
look
super
promising.
A
Cool
very,
very
cool
chris.
I
I
am
like
a
charles
on
christmas
morning,
so
further
questions
from
tone
above
hello.
B: We can pass a URL, and we can work with this URL.
B
The
I
can
send
you
the
url
of
the
of
the
script,
so
that
you
can
see
how
it.
B: It's because this script is named git-remote-test-http. And git, when it's passed a URL that starts with some scheme, tries to see if there is a git-remote-&lt;scheme&gt; script in the PATH, and it uses that as a helper to access the remote. That's what's going on when it tries to fetch from the promisor remote.
C: Great, cool, nice, yeah. I'm not sure if this is something we want to answer now, or have thought about at this point. Is it — because now we have, like, one web server that lists a bunch of blobs — is it the intention to group those blobs into subdirectories for each repository, or is it becoming considered, like, one big object storage which has blobs for everyone who likes to store blobs on my HTTP server?
B: ...you know, all kinds of archive stuff, and also because Git LFS, I think, uses some kind of HTTP server by default, or... well, I don't know, but...
B
So
so,
and
so
I
don't,
I
don't
think
the
way
we
organize
stuff
on
the
on
the
on
the
http
server
on
the
promiser
remote
server
is
is
really
relevant,
but
I
I
don't-
and
I
don't
know
what
we
would
like
to
do
in.
A: There's one giant bucket on some object storage, and the advantage there is that if you fork a repository — and we already have this one giant blob in the bucket — then we don't upload it twice. The disadvantage is that if I know the SHA of a blob I'm interested in, then I can pretend that I have this SHA in my repository, push it, and pull it down, because I now have access to it — and, given SHA-1 is SHAttered...
C
Yeah,
I
think
that
that
answer
that
it's
something
we
we
need
to
decide
on
how
we
want
to
do
that,
depending
on,
what's
what's
the
most
beneficial
according
to
yeah,
the
features
like
like
we
have
that
elevates
like
the
ups
and
down
sides
there.
There
are
like
arguments
for
both
both
ways
to
implement.