Description
In this talk, Carlos Martin Nieto describes the technologies GitHub has developed to handle its most challenging repositories and use cases, from heuristics to replication and quotas, as well as what it takes to back up this data.
About GitMerge
Git Merge is the pre-eminent Git-focused conference: a full day of technical content and user case studies, plus a day of workshops for Git users of all levels. Git Merge is dedicated to amplifying new voices in the Git community and to showcasing the most thought-provoking projects from contributors, maintainers and community managers around the world. Find out more at git-merge.com
Alright, how's everyone doing? So, as they said, I'm Carlos and I work at GitHub, and today we're going to go on a tour of some of the edge cases and nuisance situations that happen when you provide Git hosting to the world at large.

You probably know GitHub as one of the largest Git hosts in the world. We have over 50 million repositories and 19 million users, and with these kinds of numbers we run into situations that we just didn't predict were going to happen. Most of the users involved aren't even doing anything that we would consider inappropriate; we just didn't plan for them to be doing what they were doing. There are a couple of ways of resolving these issues. Sometimes the solution is technical, on our side or on theirs. Sometimes it's just teaching the user about some of the limitations that Git has and how to work around them.
This is what we call a thundering herd problem, and we can use the fact that all of these machines are asking for the same data at the same time: they start from the last deployed commit, obviously, and then they all want to fetch up to the same commits, so they all need the same data to be sent. We solved this with caching: when we detect that multiple identical requests are coming in from multiple hosts, we cache that response, and for a few minutes anybody who asks for exactly the same thing gets served the cached response.
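Here is a minimal sketch of that idea, assuming a simplified serving layer; the function names, the five-minute TTL and the compute_pack callback are all illustrative, not GitHub's actual code. The request is fingerprinted by repository plus its exact want/have sets, and identical requests are served from a short-lived cache.

    import hashlib
    import time

    # Illustrative only: cache identical fetch responses for a few minutes.
    CACHE_TTL_SECONDS = 300
    _cache = {}  # request fingerprint -> (expiry time, packfile bytes)

    def fingerprint(repo, wants, haves):
        """Stable key for a fetch: the repository plus the exact object sets."""
        h = hashlib.sha256(repo.encode())
        for oid in sorted(wants) + ["/"] + sorted(haves):
            h.update(oid.encode())
        return h.hexdigest()

    def serve_fetch(repo, wants, haves, compute_pack):
        """Return a packfile, reusing the cached response for identical requests."""
        key = fingerprint(repo, wants, haves)
        now = time.time()
        hit = _cache.get(key)
        if hit and hit[0] > now:
            return hit[1]                        # another host just asked for this
        pack = compute_pack(repo, wants, haves)  # the expensive part: pack generation
        _cache[key] = (now + CACHE_TTL_SECONDS, pack)
        return pack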
So, we have the Kubernetes repo, k8s; nobody seems to be sure how to pronounce the name. They make extensive use of pull requests, and this is a good thing, a very good thing, but it does put some extra load on our machines. When you have a pull request page open and the base branch updates, you might have noticed that the merge button goes gray for a bit and then, hopefully, comes back green. What's happening there is that the web page is asking the server:
"Hey, can I make the button green?" This involves creating a merge and a rebase on our servers and then storing the results. To store the results we need to update references, because that's how Git knows where things are. Unfortunately, we're a bit limited there: we can only update references a couple of times a second, which is where the bottleneck comes in.
When 50 people have a hundred pull requests open at the same time and they all say "hey, can you merge this?", we essentially time out on a lot of them. So we started mitigating this by grouping up all of these reference updates: we batch the pull request updates into a single update that performs multiple reference updates at once. And the bigger lesson here is that it's okay to be slow and time out in these cases, because it's a machine making the request, and when we time out and say "sorry, I couldn't do this in time", the machine will simply ask again.
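As a hedged sketch of that batching idea, not GitHub's internal code: Git itself can apply many reference updates in one atomic pass via git update-ref --stdin. The repository path and the refs/pull/*/merge names below are only an example of what such a batch might contain.

    import subprocess

    def update_refs(repo_path, updates):
        """Apply a list of (refname, new_oid) pairs as one atomic batch."""
        stdin = "".join(f"update {ref} {oid}\n" for ref, oid in updates)
        # One git process, one transaction, instead of one process per reference.
        subprocess.run(
            ["git", "-C", repo_path, "update-ref", "--stdin"],
            input=stdin, text=True, check=True,
        )

    # For example, storing the test-merge results of several pull requests at once:
    # update_refs("/data/example.git",
    #             [("refs/pull/101/merge", merge_oid_101),
    #              ("refs/pull/102/merge", merge_oid_102)])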
This next repository is a different, interesting one: commit wars. A user wrote in to ask whether it would be okay to run a raffle at their hackathon. The idea they had was to have the 300-plus people at their hackathon push to a single repository at the same time, and the winner would be the last person who pushed before a particular time.
We asked them not to do it, and we told them why we didn't think it would work: you can't really rely on the timestamps, we're not going to be able to handle 300 people pushing to the same repository at once, and you're going to run into quota usage. They went ahead and did it anyway. They changed that one tiny aspect, but it was basically the same thing: they tried to perform 7,000 pushes in the space of an hour or so, to the same reference
A
In
the
same
repository,
the
reporters
just
couldn't
keep
up
with
the
load.
So
this
is
our
error
graph
that
that
red
bar
right
at
the
bottom,
which
you
barely
can
see.
That's
the
that's
the
threshold
that
makes
us
get
alerted
because
that
already
that's
above
the
the
usual
baseline
of
errors,
because
we
are
not
computers
or
you
know
not.
Everything
goes
well
yeah.
That was a fun night, but we took solace in the fact that their attempt failed in the ways that we predicted, and it didn't affect any of the other repositories. This gives us a lot of confidence in our automated systems and in our understanding of them. It feels very good to say "yeah, it's going to fail this way" and then watch it fail exactly that way; it lets us sleep at night, knowing that these things are going to get handled. So, the next one is not, you know, so unreasonable.
It's open data about New York; I'm not entirely sure what they're doing with it. The thing here is that they have 800,000, almost a million, files in the repository, and that's just for the tip commit. Thankfully it's well sharded: they have lots of nested directories, so for maintenance, for the things we tend to care about, it's basically fine. But then they wrote in to us and said: hey, we're trying to do an update of the readme file via the API.
We're changing one line in a file at the top level and the API call is timing out; what's happening? And this one was entirely on us. They were doing something that usually works, and on our side we did the equivalent of what you would do locally: read in all of the file state, update the one file with the new content, then write everything back out and create the new commit from that. This usually works.
When you see it with these numbers, it does seem a bit silly. We noticed there are only six files and two directories at the top level, and really that's all we would need to read; that's all that we need to change, right? The top level is a single change, about 20 bytes of it that need to change. So we realized, hey,
maybe more people are doing this kind of thing. So we made the code a bit smarter, so that it would avoid reading in any directories it didn't actually need to read. That's what happened; that's the graph: when I deployed the change it first went out to a few repositories, and then it drops a bit further when it was activated for everyone.
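A minimal sketch of that smarter path, using plain Git plumbing rather than GitHub's internal code: to change one top-level file, only the root tree has to be rewritten, and the large subdirectories are reused by hash instead of being read in. The helper and function names here are assumptions.

    import subprocess

    def git(repo, *args, **kwargs):
        return subprocess.run(["git", "-C", repo, *args], check=True,
                              capture_output=True, text=True, **kwargs).stdout

    def update_top_level_file(repo, branch, filename, new_content):
        # 1. Store the new blob.
        blob = git(repo, "hash-object", "-w", "--stdin", input=new_content).strip()
        # 2. Read only the root tree; subdirectories stay as opaque hashes.
        entries = []
        for line in git(repo, "ls-tree", branch).splitlines():
            meta, name = line.split("\t", 1)
            mode, objtype, oid = meta.split()
            if name == filename:
                oid = blob                      # swap in the new blob
            entries.append(f"{mode} {objtype} {oid}\t{name}")
        # 3. Write the new root tree and a commit pointing at it.
        tree = git(repo, "mktree", input="\n".join(entries) + "\n").strip()
        parent = git(repo, "rev-parse", branch).strip()
        commit = git(repo, "commit-tree", tree, "-p", parent,
                     "-m", f"Update {filename}").strip()
        git(repo, "update-ref", f"refs/heads/{branch}", commit, parent)
        return commit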
This next one I found as the largest repository on one of our drives, and I wanted to know what was happening there: how did they use up all of this space? It's a website that gives people, let's call them, funny images and videos, and at one point, as far as I can tell, they replaced their RSS feed with an app for your phone, and that's where we showed up.
This next one you might have heard of: CocoaPods/Specs. They managed to hit a few different limits that made hosting them really hard. A bit of background: CocoaPods is a package manager for the Objective-C and Swift programming languages, and this repository is its manifest; this is how it knows what packages exist. It has a lot of files and a lot of commits; I believe they push about a thousand commits per week automatically, and until very recently they had very large directories.
We kept seeing issues with this on the website. We show you things like "this is the last time someone touched this directory", or "here's the last-modified time of each of these files", and that kept timing out, because we just cannot get far enough back in the history within the ten seconds we give ourselves to actually provide any information there.
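To make the cost concrete, here is a rough sketch of that kind of query; it is an assumption about the shape of the work, not GitHub's implementation. Finding the last commit that touched each entry is one history walk per path, which is exactly what blows through the time budget in a huge, fast-moving directory.

    import subprocess

    def last_modified_times(repo, rev, directory):
        """Map each entry in `directory` to the date of the last commit touching it."""
        entries = subprocess.run(
            ["git", "-C", repo, "ls-tree", "--name-only", rev, f"{directory}/"],
            check=True, capture_output=True, text=True).stdout.splitlines()
        times = {}
        for path in entries:
            # One full history walk per entry: cheap for a handful of files,
            # hopeless for tens of thousands in a repository with constant pushes.
            out = subprocess.run(
                ["git", "-C", repo, "log", "-1", "--format=%cI", rev, "--", path],
                check=True, capture_output=True, text=True).stdout.strip()
            times[path] = out
        return times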
To add to that, every CocoaPods user was updating maybe multiple times a day, right, every time someone wondered "is there an update for this library?", so everyone was hitting us. To be fair, they were doing it in a way that was very efficient for us, and we can cache a lot of the responses. But then there was an issue that some user opened on the main repository: fetching is really slow, what's happening? This is the issue, and you can see our response saying, well,
this is to protect the other users of the machine, because if everyone is fetching CocoaPods and your repository lives on the same machine, you wouldn't want us to delay your repository just because they have a lot of users. But this was interfering with their users enough that we decided to spread the load. We generally serve a given repository from one particular machine, but we have the ability to spread it across up to three machines right now, so we did that.
This also spreads out the quotas across those machines. We also applied stricter limits on the history we will show on the UI: instead of just timing out after the ten seconds we allot, we give it much lower timeouts, because we expect that most of the time we're not going to find the data we need in time, and this lets us spend more time serving requests that are going to be successful.
So we asked them to do their updates more efficiently, so it was easier on our machines, and to use more nested directories, which makes the individual Git tree objects smaller. This is then quicker for everyone: we have less load on our machines, and their own clients update faster.
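For illustration only (the exact sharding scheme below is an assumption, not necessarily the one CocoaPods adopted): spreading entries across small nested directories, for example by a prefix of a hash of the name, keeps every tree object small, so a commit that touches one package rewrites a few tiny trees instead of one enormous one.

    import hashlib
    from pathlib import PurePosixPath

    def sharded_path(package_name, version):
        """Place a spec under nested directories derived from a hash prefix."""
        digest = hashlib.md5(package_name.encode()).hexdigest()
        return PurePosixPath("Specs", digest[0], digest[1], digest[2],
                             package_name, version,
                             f"{package_name}.podspec.json")

    # sharded_path("SomePod", "1.0.0")
    # -> Specs/<h1>/<h2>/<h3>/SomePod/1.0.0/SomePod.podspec.json,
    #    where <h1><h2><h3> are the first characters of the hash, so each
    #    directory level holds at most 16 entries.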
Right, so let's move to this one. This is a particular Jekyll theme. It lets you very easily get started with a Jekyll site hosted on GitHub Pages: you fork the repository, give it the name of your username, and then GitHub will host the pages there. This does, however, mean that every fork of this repository is someone else's website, so it's completely different content, right? There's the common history up to the point where you forked, and everything after that is different.
But then there was a time when some group of servers said: hey, I know, I'll clone every single fork of this repository, all at the same time. As I mentioned, we serve them all out of a single machine, so you can see the load: that's a load average of 400 on that machine. This was never going to succeed; it was just going to be annoying, and the load had to move away from that machine. The system does that automatically, but it had to happen.
So, this one is not in my list here, but there are some people who essentially use us as a backup: something happened on one of my machines, I'm going to push an update to GitHub, and that's my copy of the data. That gets inefficient very quickly. What happened here is that we were alerted about this repository getting some errors, so we looked at the data and said, hey, this is multiple pushes a minute for the last few days now; what's happening here?
We can handle this kind of load for a while, but eventually we need to run maintenance on the repository, because every push is a new packfile, and a new file is a new place to look for things, so with every push it becomes slower and slower and slower. We calculate how many pushes there have been since the last time we ran maintenance, and past a certain point that tells us it's time to repack.
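A toy sketch of that bookkeeping, with an assumed threshold and repack invocation rather than GitHub's real maintenance system: count pushes since the last repack and consolidate the small per-push packfiles once the count crosses a limit.

    import subprocess

    PUSHES_BEFORE_REPACK = 50        # illustrative threshold
    _pushes_since_repack = {}        # repository path -> pushes since last repack

    def on_push(repo_path):
        n = _pushes_since_repack.get(repo_path, 0) + 1
        if n >= PUSHES_BEFORE_REPACK:
            # Each push left behind a small packfile; fold them into one big pack
            # so lookups stop having to consult an ever-growing list of packs.
            subprocess.run(["git", "-C", repo_path, "repack", "-a", "-d", "--quiet"],
                           check=True)
            n = 0
        _pushes_since_repack[repo_path] = n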
This was an academic IP range, so we figured some student was being careless, or had a script running for a class that they hadn't realized was still going. So we asked the owner of the repository: hey, do you know what's happening? And they said, oh yeah, we had this experiment running, but we turned it off six days ago; if we do it again we'll make sure to be nicer. But no, no, right now you have been pushing for those six days straight.
So, you know, can you double-check, just to make sure nothing fishy is going on? Sure enough: two thousand processes. What they were doing was essentially pushing and then forking a new process, which was also pushing and also creating a new process, which was pushing, and it just sped up to the point where, after six days, we couldn't keep up with the load anymore.
This gets into a bit of the bookkeeping that we need to run internally to make forks cheaper than they otherwise would be. For example, Torvalds' Linux: it's about a gigabyte if you clone it, and there are about 16,000 forks. Now, we don't spend 16 terabytes on this, because we would run out of disk space very quickly if we did this for every single repository.
What we do is keep all of the objects together in the same shared repository, and then each fork mostly just contains information about what its references are, plus a link that says "my data actually lives over here in this other place". This means that instead of 16 terabytes, the whole fork network is 42 gigabytes. This is still sometimes an issue with maintenance, where the automated maintenance tries to run and then times out, and then we have to step in and decide, hey, this one fails, what do we do about it.
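Git has a built-in mechanism that supports exactly this kind of layout: a repository can borrow objects from another object store through its objects/info/alternates file. The sketch below illustrates that mechanism with made-up paths and helper names; it is not GitHub's actual fork-creation code.

    import subprocess
    from pathlib import Path

    def create_fork(fork_path, shared_objects_dir, source_repo):
        # A bare repository for the fork: it holds refs, but almost no objects.
        subprocess.run(["git", "init", "--bare", fork_path], check=True)
        # Point the fork at the shared object store instead of duplicating data.
        Path(fork_path, "objects", "info", "alternates").write_text(
            str(shared_objects_dir) + "\n")
        # Copy the refs; the objects they point at already live in the shared store.
        refs = subprocess.run(
            ["git", "-C", source_repo, "for-each-ref",
             "--format=%(objectname) %(refname)"],
            check=True, capture_output=True, text=True).stdout
        for line in refs.splitlines():
            oid, refname = line.split(" ", 1)
            subprocess.run(["git", "-C", fork_path, "update-ref", refname, oid],
                           check=True)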
The thing about the next one, IntelliJ, is that it's a bit interesting: the clone size is bigger, but it has an order of magnitude fewer forks, yet it's bigger on disk. The thing here is that they really like tags, which is good, right? It's good to know "this is the state at this point." But it does mean that when we fork, we copy over all of the tags, and we also keep a global list of all of the references in all of the forks of a repository; for IntelliJ, that reference list gets very large,
and it grows again every time someone forks. We keep making this a bit more efficient, but it's never done. The interesting one here is Spoon-Knife, which is 160 kilobytes if you clone it. This is the one we use for training and workshops, where we tell people: here's how you create a fork, here's how you create a pull request, here's how you collaborate with other people on GitHub. Therefore it has a lot of forks; this is nearly nine years of workshops and trainings.
This is a challenging problem, keeping all of this efficient, because if you only optimize for the combined repository being efficient, then you're making each individual clone less efficient. When we originally implemented some of these techniques that should have made things more efficient, it actually turned out to be less efficient: we spent more time trying to recalculate what data to send you. Where it is now, we can start sending data immediately, and in general, unless you're cloning everything in a single fork network, it's fine.
However, this app made it so that everyone pushed into the same repository all the time, as these particular users, so that the activity would show up in their GitHub contribution graphs. And, I mean, some of the numbers aren't actually outrageous: a hundred thousand commits by itself, as a log, isn't all that much. But what really gave us pause is that we noticed there are six-thousand-odd contributors to this repository.
The webpage doesn't even show you that: if you look at the contributors page, we only show you up to one hundred, and we don't even try to count more, because it's barely worth it. And this was a case where maintenance was still failing. They were pushing so often, and they had huge trees, similar to the CocoaPods issue: for every user they had, the layout was the repository, then a username, then a timestamp, because it was logging, and every username sat next to every other one in the same directory.
So this gives us again the problem where there are twenty-six thousand directories in the tree, which is a huge tree that gets rewritten over and over and over, and that makes life very hard for Git. We figured this wasn't a reasonable thing to do, so again we asked: please don't do this. They said, okay, fine, we won't. Then they wrote a blog post about how they broke GitHub, or something like that, which, I mean, is fine; they broke their own repository.
Sometimes it is just nice to reach out to users and say: hey, why are you doing this? Could you be doing it a bit differently, or, in the more extreme cases, could you just not be doing it at all? There's always something new; every day is different and you're always learning stuff. When you let everyone give you data, they'll give you all the data that they have.