From YouTube: Advanced patterns for GitHub's GraphQL API
Description
Presented by Rea Loretta, CEO at Toast
The GitHub API is a key part of accelerating workflows at scale. This session will leave you with tactical tips on how to paginate effectively, create and plan queries, use tech-preview features, and manage costs, learned from years of practice and iteration at Toast and beyond.
About GitHub Universe:
GitHub Universe is a two-day conference dedicated to the creativity and curiosity of the largest software community in the world. Sessions cover topics from team culture to open source software across industries and technologies.
For more information on GitHub Universe, check the website:
https://githubuniverse.com
Hi everyone, welcome to Advanced Patterns for the GitHub GraphQL API. I'm Ria, co-founder and CEO of Toast, and I'm super thrilled to be here in front of all of you, and that all of you want to take your GraphQL to the next level. I really hope that you'll enjoy this talk as much as I enjoyed making it, and definitely tweet me if you have any questions; I'll make sure to get back to you.
Okay, so I wanted to start off with some pre-talk hype. I know it's mid-afternoon and it's probably tiring; you're probably craving that nap pod. So work with me, I need some energy in the room. Here are some pictures of what happens to people who listen to this talk: you'll become a master of GraphQL. You will manipulate code with just your mind. You will put out production fires with a snap of your fingers, and you will be a 10x engineer. But you probably still won't understand monads. That's okay, that's another talk.
So my goal with this talk is to enable you to build your own data fetcher for GitHub using the GraphQL API, and then you can go crazy, retrieve massive amounts of data, and, you know, it's all good. We've also open sourced our code into a reusable starter kit, so everyone can play. I believe data should be freely accessible to everyone who is curious to learn, so this is really, really important to me. Disclaimer, though: don't use it for evil. It's not a real license, don't look it up.
So, like other talks today have mentioned, data is pretty amazing and does cool things: you can use it to identify bottlenecks and unblock your team. This is good. You should be doing this. But data can also be taken out of context and used to make up arbitrary metrics to track engineer performance. This is bad, very bad, and the consequence for misusing the content of this talk and the open source library is that I will personally be very disappointed in you.
Let that sink in. Okay, so now that we've squared that bit away, I want to provide some context about myself and how I got interested in GraphQL. For starters, Toast integrates with GitHub and Slack and notifies engineers when to unblock teammates. As you probably all know, GitHub has a lot of activity, and the bigger your team is, the more activity, so it's very important for us to not simply pass everything through, because that would just create distractions for everyone.
So Toast filters through this noise and delivers relevant notifications to the person needed for unblocking the team. We started out as a zero-setup notification bot for individual contributors, and since then we've evolved to empower the entire review process by allowing teammates to respond directly in Slack. This past year we've learned so much, and we're continuously growing to meet demands at the team level; I'm sure you've all been on healthy teams and unhealthy teams.
To do this, we learn from a variety of orgs with different workflows and habits, and we've needed to build out a robust analytics pipeline for it. So in this talk I'll share some of the technical challenges that we faced along the way. Personally, I have a hard time following super abstract talks, so I thought it'd be nice, and fun, to learn through building something meaningful.
So let's define the problem and the scope. What data should we pull first? Arguably, one of the most interesting aspects of GitHub is code collaboration, where all the learnings and drama happen. So here's a familiar story: we write some code, we send it off for review, it lands in the reviewer's inbox, and after some time they get to it. If we're lucky, they give us meaningful feedback, plus bonus nits, and we get the review back. Oh boy.
We're excited to make all these changes and grow as engineers, and maybe we go through a few more rounds of commits, re-reviews, and more changes, and eventually it ends up like that, right? But in all seriousness, pull request history tells a fascinating story, and we can learn a lot from the timeline events, the people involved, and how quickly concerns were raised and resolved.
This leads to fascinating insights. Now, I'm not saying that you're going to be able to infer all of this magically, but you can at least start looking through some data, and then you can detect potential triggers that lead to a high-stress working environment, or even get deeper insights, like what the best time of day to ask for a review is, or learn that large PRs are three times less likely to be approved within the first 24 hours. With this you can encourage best practices and healthy habits for your team.
So let's look at our data shape. At the root there's our organization and some repos, and each repository has pull requests; y'all know this. Each pull request has lists of associated entities, such as timeline events, reviews, commits, etc. These are the ones we thought were interesting.
So this is a good match for GraphQL, right? It's in the shape of a graph. We're at a talk about GraphQL, so naturally you're all here to learn about it, but I still want to take some time and point out the benefits of using GraphQL. For one, we can pull much more data with fewer round trips, so the entire process is more efficient.
On top of this, the GitHub GraphQL API is one of the better GraphQL APIs out there, and they've really tried, in my opinion, to provide a good developer experience. That in itself is a good reason to try it out. But if you're not really into any of these benefits, then this talk is probably not very useful for you. I don't know, maybe you're here for the doodles; I'll be okay with that.
Okay, so we've got our data shape. Let's look at the schema next. The schema describes our graph nodes and their relationships to each other, and the good news is GitHub's GraphQL schema is structured exactly like we want. So can we just fetch all the data? It's not that simple. According to our schema, if we want to pull all the pull requests for the org, we first want to pull all the repos and then all the associated PRs.
So a very simple query would look like this.
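The slides aren't captured in the transcript, but a query consistent with this description would be shaped roughly like the sketch below, with "my-org" standing in for the organization login:

    query {
      organization(login: "my-org") {
        repositories(first: 100) {
          nodes {
            name
            pullRequests(first: 100) {
              nodes {
                number
                title
              }
            }
          }
        }
      }
    }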
Notice we specify the organization by name, then the repositories with the argument first: 100, and this is our page size upper limit, which specifies fetching only the first 100 repos; that's fine for some orgs. We also request the first 100 PRs for each repo, and the repos that have more than 100 PRs are out of luck for now.
So, coming back to our query, there's an obvious problem with this: we're not fetching all the data. We can get the first 100, but the volume of data that we want may be much larger. How large? Well, it's not uncommon to have thousands of PRs in one repo, or more, and for analytics we need to fetch everything to get that complete picture. How do we tackle this?
You probably all know that APIs and databases have solved this by allowing us to paginate, and there are various types of pagination supported by GraphQL, but the most flexible one is cursor-based pagination. So here's a quick refresher. Let's represent the pull requests we want to fetch with these squares and number them for convenience. Now let's line them up.
As I mentioned previously, the max page size is a hundred, but for the purposes of illustration let's just pretend we're going to pull three at a time, arbitrarily choosing a page size of three. So these will be the first three pull requests that we fetch. Taking a look at our updated query, we just update that number to three, and also notice that we pull our specific repo by name to simplify the query.
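The updated query would presumably look something like this; "my-org" and "my-repo" are placeholders:

    query {
      repository(owner: "my-org", name: "my-repo") {
        pullRequests(first: 3) {
          nodes {
            number
            title
          }
        }
      }
    }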
Our return would look like this, but we still want a cursor. To get the cursor as part of our return, we need to update our query to pull an object called pageInfo.
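Likely along these lines, requesting endCursor inside pageInfo:

    query {
      repository(owner: "my-org", name: "my-repo") {
        pullRequests(first: 3) {
          nodes {
            number
            title
          }
          pageInfo {
            endCursor
          }
        }
      }
    }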
That's it right there. Now we get a cursor along with the pull requests in our result set. Nice. For context, the results would look like this.
Our cursor is just an opaque pointer. It's called endCursor because it points to the last item in the data set we just fetched, so that's what it looks like now: it's pointing at PR number three in this case. For the second page we just repeat. Simple enough: the page size is still three, we provide the cursor that we got from page one, and we use the after parameter, passing it the cursor.
Here's how our query looks now.
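A sketch with the cursor as a query variable; the actual cursor string from page one gets passed in the variables payload alongside the query:

    query ($cursor: String) {
      repository(owner: "my-org", name: "my-repo") {
        pullRequests(first: 3, after: $cursor) {
          nodes {
            number
            title
          }
          pageInfo {
            endCursor
          }
        }
      }
    }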
Note the highlighted dollar sign: it denotes the cursor variable in the query, which has to be passed into the query separately. How you do that will depend entirely on your language and library of choice, but that's what it looks like for us. As a result, this query would fetch PRs four, five, and six, plus the new cursor, which now points to PR six.
Finally, we can fetch our last two PRs by repeating this process, like this. Notice that we're using our cursor from step two, right there. So it's pretty straightforward, this whole path, and now our cursor points at PR eight. This is our result. But how do we know we're at the end? Because we don't want to just keep fetching forever; how do we know we need to stop? So in our return, the list of PRs is contained within the nodes object.
And the cursor is actually living in pageInfo, like we already know, under the full name endCursor. pageInfo also has another property called hasNextPage, which behaves exactly as it sounds: it's true when there are more pages and we can keep fetching, and false when we've fetched everything. Straightforward. The updated final query would look like this.
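A sketch of the full pagination query; the idea is to loop, feeding each endCursor back in as $cursor until hasNextPage comes back false:

    query ($cursor: String) {
      repository(owner: "my-org", name: "my-repo") {
        pullRequests(first: 3, after: $cursor) {
          nodes {
            number
            title
          }
          pageInfo {
            endCursor
            hasNextPage
          }
        }
      }
    }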
Right, now we can paginate with cursors, yay! So I know what you're thinking. You're probably like, "Ria..."
So hold on, yes, it sounds absurdly simple so far. But what makes it hard? Why is this the basis of a talk at all? Well, for one, pagination alone is not enough to pull all the data we need. Pagination is a big part, but there are pitfalls with nested queries, and GraphQL is all about nested queries. We can't just paginate on any query we want. Why? Well, meet this little guy from earlier. You may have noticed him terrorizing my engineers in the opening slides; he's a gotchasaur.
This little monster lives in every codebase, architecture, and complex system, in the nooks and crannies, ready to jump out and delight you with all the gotchas he can find. That's not terrible; he's actually kind of cute to me. But what can surprise you is that these gotchas can turn into much bigger blockers which can't be resolved without some serious consideration. So let's take a look at the gotchas that are bigger than they seem. One of the biggest strengths of GraphQL also presents one of the biggest technical challenges.
That strength is nesting, and the first gotcha is the node limit. Regardless of how many nodes actually exist on each layer, the node limit looks at the projected total, so a hundred repos times 100 PRs gives us 10,000 total nodes. What if we decide to fetch comments with our pull requests? My calculation yields 1 million total nodes: 100 repos times 100 PRs times 100 comments.
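For reference, a query of that shape might look like the sketch below; the projected total is what the node limit checks, regardless of how much data actually comes back:

    query {
      organization(login: "my-org") {
        repositories(first: 100) {          # 100 nodes
          nodes {
            pullRequests(first: 100) {      # 100 x 100 = 10,000 nodes
              nodes {
                comments(first: 100) {      # 100 x 100 x 100 = 1,000,000 nodes
                  nodes {
                    body
                  }
                }
              }
            }
          }
        }
      }
    }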
That's a lot. GitHub's node limit is 500,000, which is already very generous, and if we run our query we'll see the following error. Okay, and the good part is that GitHub is very explicit about what went wrong.
So how do we keep an eye on cost? There's a special type that we can add at the top level, called rateLimit.
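A sketch of attaching it to a query:

    query {
      rateLimit {
        cost        # what this query costs
        remaining   # how much of the hourly allowance is left
      }
      repository(owner: "my-org", name: "my-repo") {
        pullRequests(first: 100) {
          nodes {
            number
          }
        }
      }
    }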
Here we ask for two properties, cost and remaining, and remaining is how much of our query allowance is left. Quick note: this is the primary way that GitHub imposes rate limits on GraphQL requests. Where the REST API has a limit of 5,000 requests per hour, the GraphQL API has an hourly cost limit of 5,000.
The simplest queries, in my experience, usually cost one, for comparison. So notice that I've commented out comments, while we're still pulling 10,000 nodes at max right now; this is still a lot. And how much does this cost? It's one! That's a bit unexpected; we're fetching a lot of data, right? So let's add comments back in, and go ahead and make a guess in your minds: how does the cost change? Yeah, that's a huge cost difference for one query.
At this cost you won't be able to exceed 50 requests per hour, so be wary. There are many things that can influence cost; the biggest one, in my experience, is the projected number of nested entities. In this case it's almost 500k, so use this as an indicator of potential cost, but do experiment; it will vary from case to case. We will discuss strategies for minimizing cost, but for now let's just look at the next gotcha. So, now that we satisfy the node limit and cost constraints, the final test is the actual runtime performance of the query.
It's not something that even GitHub can predict ahead of time, and sometimes a perfectly fine query that doesn't look like much will consistently time out. And the timeout is fairly aggressive: if a query takes more than 10 seconds, it will fail, unfortunately. So this query only costs 6 points, but 9 times out of 10 it would time out with this error.
In our experience, this happens when a query actually returns a lot of data. Here it's attempting to return about 30,000 lines of JSON, which is not that much, only about 12 kilobytes gzipped, but it's a lot for GitHub to compute in a single API call that's not pre-cached. Note also that cost is just an estimate of how heavy the query is for GitHub's servers, and even low-cost queries can time out. This makes designing our strategy complex, because we don't know upfront exactly which queries will time out.
So these gotchas are weak opponents on their own, but in a team fight they can be overwhelming and present quite the challenge. Let's take a look at the whole request lifecycle in context. We write our GraphQL query that pulls nested data and send it to GitHub's servers. They check that our query is below the node limit; okay, that passes. Then they compute the cost, add it to our current running cost, and check that we're still within the hourly allowance.
Now that I've convinced you this is a sufficiently challenging problem, how do we solve it? To answer this question, let's recall our data model. The entities on PRs are three levels deep; let's try to query for comments in this case. Notice we picked 40 as the upper limit for PRs just to satisfy the node limit (it could have been 100) while we limit comments to 40; it's fine, it doesn't really matter much for this example.
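A sketch of that three-level query; with these page sizes the projected total stays under the 500,000-node limit:

    query {
      organization(login: "my-org") {
        repositories(first: 100) {
          nodes {
            pullRequests(first: 40) {
              nodes {
                comments(first: 40) {   # projected: 100 x 40 x 40 = 160,000 nodes
                  nodes {
                    body
                  }
                }
              }
            }
          }
        }
      }
    }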
A
What
does
matter
is
that
this
query
Unisa
paginate
on
an
s,
identity
PRS
in
this
case,
and
it's
very
common
to
have
more
than
40
PRS
in
a
repo,
and
if
your
company
has
more
than
100
repos.
That
complicates
things,
but
even
if
we
can
handle
that
some
PRS
are
more
than
100
comments,
that's
even
more
nesting,
so
we
need
a
better
strategy.
So this query would fetch the IDs and names of the first hundred repos.
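Something like this first-pass query:

    query {
      organization(login: "my-org") {
        repositories(first: 100) {
          nodes {
            id
            name
          }
        }
      }
    }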
Now, with IDs and names, we can create a new query where we provide one ID at a time and paginate on pull requests. Simple enough, but how do we do this? Typically, GraphQL APIs have a rigid schema; for this to work, we need to find a way to construct a query that fetches repositories by ID, or some similar alternative.
This is from the API schema documentation: we can either paginate over repos and apply various filters, or we can just pull one repo at a time. The latter seems handy, so let's use that. Now we can run one query for each repository and then paginate on the nested entity, pull requests, which we learned how to do previously. This approach works, for sure, but it's kind of slow, and it doesn't fully leverage the power of the GitHub GraphQL API. Let's think about how we can optimize this further.
If we look closely at a typical organization, we notice that the number of PRs across repos is very non-uniform in distribution: it's common to have repos with thousands of PRs and repos with zero to ten. We can plan our queries better by taking into account the total number of PRs in repos, and if we can get this number without fetching all the PRs, that's great. It turns out we can, so let's modify our query. This query will fetch three repos and the total count of PRs each one has.
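Likely using the totalCount field that GraphQL connections expose, along these lines:

    query {
      organization(login: "my-org") {
        repositories(first: 3) {
          nodes {
            id
            name
            pullRequests {
              totalCount   # how many PRs exist, without fetching them
            }
          }
        }
      }
    }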
Okay, so first we run a query to get all the repo IDs and how many PRs they contain. To illustrate the result, let's use an adorable puppy. We separate out the repos that have fewer than 100 PRs and save them for batch fetching, so we don't have to fetch each one separately; this reduces the total number of requests. And if a repo has more than 100 PRs, it goes on the right, and we fetch one repo at a time, paginating on pull requests.
Then, when we're done, we can merge the results of those two strategies and end up with a lot of puppies. Straightforward enough. Let's see it with some data. Here are our repos with their counts; I've arranged them in increasing count so we can quickly sort them into two groups, and we're batching these ones into groups where the total is still less than a hundred. It doesn't really matter how we group them, as long as we keep below the max of 100.
We merge this together, and we have all of our PRs that we set out to fetch. Nice. There's still one piece to flesh out, so let's deep dive into this part: the repos that don't have a lot of PRs. We need to construct a query that pulls a subset of repos and their PRs. How do we do this, specifically? How do we batch?
We need to find a way to construct a query that fetches multiple repositories by ID. We've seen this documentation before; it does not allow us to pull data by multiple IDs or by multiple names. What if we just combine multiple repository types into one query? Okay, let's be less confusing with some examples. One approach would be to use GraphQL aliases to fetch several repositories by name.
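For example (the repo names here are made up for illustration):

    query {
      repoBackend: repository(owner: "my-org", name: "backend") {
        pullRequests(first: 100) {
          nodes {
            number
          }
        }
      }
      repoWeb: repository(owner: "my-org", name: "web") {
        pullRequests(first: 100) {
          nodes {
            number
          }
        }
      }
    }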
Notice these repoBackend and repoWeb labels: they're aliases, which are just a way to name your data to avoid conflicts in the resulting data set.
This would work, but it's a lot more hands-on. We would have to figure out alias names and how to stitch this together, and it will either result in a heavier query or require us to use partials so we're not duplicating our subqueries for each repo. Most importantly, this approach only works because there happens to be a type that allows us to fetch a repo by name; what about all the other entities we might need to pull? Like this.
Enter the top-level nodes field. This allows us to query any item by ID; it's like a global key-value store. So, for example, consider this query.
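A sketch; the IDs come in through a variable:

    query ($ids: [ID!]!) {
      nodes(ids: $ids) {
        id
        __typename
      }
    }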
Notice the top-level field that accepts an array of IDs and returns the given entities as the generic Node type. But because it's a generic type, we can't access any of the Repository type's properties; we can only get the id and __typename fields. So in order to solve this, we need to use inline fragments, which are like a type cast for GraphQL, and we can now rewrite our query accordingly. That's what the syntax looks like.
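A sketch of the rewritten query with an inline fragment:

    query ($ids: [ID!]!) {
      nodes(ids: $ids) {
        id
        __typename
        ... on Repository {   # "cast" each node to Repository
          name
          pullRequests(first: 100) {
            nodes {
              number
              title
            }
          }
        }
      }
    }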
This lets us access properties on a specific type, and it's a very useful technique; we can use it to fetch pretty much anything, and we can just keep adding IDs to the list. Simple and easy. So when we batch, we'll just feed all the IDs from each batch into this parameter and fetch all those entities at once. Now you might be thinking: hey, this nodes thing looks like cheating.
Can I just provide a million IDs and fetch everything in one query? You are right: normal pagination rules don't apply to this type, and it's not bound by the upper limit of a hundred items. But in our experience it's easy to construct queries this way that lead to timeouts, and it really depends on how heavy the objects are. In our experience, a reasonable limit is between 30 and 80 IDs at once; for pull requests with a lot of nested properties, fetching 50 is a good middle ground.
That should provide some context for your experimentation. All right, so this all sounds good, but let's make sure we've covered and accounted for all the gotchas that we discussed before, these guys. The nested data problem is solved by the double pass, by definition, although depending on what you're hoping to do you might need a triple pass, or a quadruple pass, more passes. The node limit is simple: we just make sure we don't exceed it during query composition. We also discussed optimizing for cost reduction, so let's make sure we handle that.
Let's consider this query to fetch reviews and comments for every pull request. If we run it as is, the cost is going to be the following, which is a lot. We can cheat a little, though. What do we know about this data if we think about the query? Well, it's very rare for a pull request to have more than 100 reviews, or more than 40 comments per review; intuitively, that just seems abnormal. So I would say it's safe to decrease those page sizes.
Let's assume there are no more than 20 reviews, for starters, run our updated query, and look at that: the cost went down by a factor of five. Optimizations like this require knowing your data, of course, plus intuition and experiments, but they can make a drastic difference, so definitely make sure you play around with this.
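The reduced query might look roughly like this; the exact page sizes on the slides aren't captured in the transcript, so treat these numbers as illustrative:

    query {
      repository(owner: "my-org", name: "my-repo") {
        pullRequests(first: 50) {
          nodes {
            reviews(first: 20) {        # was 100; cutting this drove the cost down
              nodes {
                comments(first: 40) {
                  nodes {
                    body
                  }
                }
              }
            }
          }
        }
      }
    }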
We at Toast analyzed data from a bunch of companies and picked sensible defaults, so if you're planning to use our library, that's all covered there. Lastly, we have timeouts. This is purely trial and error from running queries in production, and what we learned is that the easiest strategy is to reduce page sizes, at the cost of making more requests.
We can reduce page sizes from, say, a hundred to 50 and double the number of requests, but in general 50 seems like a good default to set as a pagination limit for heavy queries. And the last bit I want to mention is schema previews. The GitHub GraphQL API is constantly evolving, and new features become available often. For example, things like "is my pull request a draft?" and "are my CI checks passing?" are still in preview mode, so to access these you need to execute your queries while sending HTTP Accept headers.
It would look something like this.
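The slide itself isn't in the transcript; assuming GitHub's published preview names from around the time of this talk, the draft pull request preview was enabled with an Accept header like the following:

    Accept: application/vnd.github.shadow-cat-preview+json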
This is the header I use for draft pull requests. And cool, that's it! That was a lot of stuff, a lot of learnings that we just went through, so give yourself a pat on the back if you're still with me. High five, and go build your own GitHub data fetcher. Here's the link to the open source library; it's a work in progress and currently contains starter code.
You can use it as a one-off script to pull data for your org and get it as a JSON file. We'll be adding more to it as we build more, so, you know, feel free to send in some pull requests. And if you're like, "Ria, where's the advanced stuff?" and this talk is apparently still way too baby-level for you, please, please, please, PLEASE say hi to me afterwards; I'd love to get your expertise on stuff that we haven't solved yet. Thanks, and you can send me homework reading on Twitter. Yeah.