Ancient City Ruby 2019

Getting Down To Business with GraphQL

00:00:11.849 Good morning! Thanks for being here on the morning of the second day. My name is Robert Mosolgo, and this talk is for you if you've ever had a hack week and created a GraphQL proof of concept, only for your CTO to say, 'This is great, but what about that?' So, we're going to discuss a few of those things.
00:00:17.039 As I mentioned, I work for a startup called GitHub, or, I guess I'm a Microsoft employee now. I didn’t run that by the brand guidelines, but I work on the API team at GitHub.
00:00:23.789 We maintain a lot of the plumbing for the REST API and the GraphQL API to help developers get their APIs out the door in a fast and secure manner. I've been there for about two and a half years.
00:00:31.170 Besides that, I maintain an open-source project called GraphQL Ruby, which was introduced a little bit yesterday. You can see that I starred my own repository to try to make it look cool, so you can do that too if you want.
00:00:38.160 Additionally, I live in Charlottesville, Virginia, a place that was in the news a couple of years ago for a Nazi rally. If you were in town for that, it was not a very warm welcome to visitors.
00:00:43.379 But this is a picture of the mountains near my house, which is more of what I'm into. My wife and I are raising three daughters—one of them is in that picture twice, and you can figure it out.
00:00:51.360 This was just when the street cleaner came through our neighborhood, which was a big event! Besides software, I also make cheese at home, which is a big mess.
00:00:58.320 I was saying before that sometimes computer people need a bit of danger and excitement in their lives. And the great thing about cheese is that it is like a tour de force of all the best parts of traditional food preservation.
00:01:04.649 This is actually an action shot of me sprinkling a bit of bacteria in there. They grow and digest lactose, the milk sugar, producing lactic acid that prevents other bacteria from growing.
00:01:09.720 That’s a good start if you're trying to put milk on a shelf for months. The next step is to introduce enzymes that break down the proteins, which cause them to undissolve and precipitate from the milk, collapsing down on each other.
00:01:16.860 The book calls it a protein matrix, but I like to call it a protein gel that traps the big fat globules inside while allowing the water to flow out freely.
00:01:22.950 I read a funny description of cheese making that simply states: you remove the water from the milk. That’s how you do it.
00:01:30.689 So, that’s my process—kind of janky, squeezing the water out.
00:01:34.500 The last part is aging. This is my mini-fridge cheese cave, where you intentionally grow mold on things.
00:01:40.320 There’s some brie that’s coming along, though I handled it a little too roughly. You can see it has some spots.
00:01:47.400 This is the fireworks show of cheese making, where everything goes in looking like a white blob, but depending on slight differences in salt, moisture, fat content, and acidity, they all become really different.
00:01:52.440 Quick shoutout to Shopify: a lot of my favorite home cheese-making supply retailers are apparently based on Shopify, so I'd recognize that checkout flow anywhere.
00:01:58.350 I have two favorite home cheese-making supply retailers, and they both use Shopify! Enough about cheese, though.
00:02:02.310 I am also here to talk about GraphQL, but if you don't like GraphQL, you can talk to me about cheese later. If you don't like cheese, I'd be interested to hear more about that too!
00:02:08.820 Now, a quick introduction to GraphQL. We had one yesterday, but if it’s new to you, or to ensure we’re all on the same page, I’ll introduce the concept at a high level.
00:02:15.300 GraphQL's website calls it a query language for your API. There are really two key takeaways from that: one, it's an API.
00:02:22.170 If you have a web application and want it to communicate with the outside world, you need some kind of interface—maybe it's a user interface or an interface for other software systems.
00:02:29.370 Secondly, it’s a query language. The things that come and go are written in a query language, similar to SQL.
00:02:34.500 For example, we have an HTTP API with a path and a structured query string.
00:02:40.320 As Damon and Morgan mentioned yesterday, one cool thing about GraphQL is that you can ask for whatever you want, as long as the API supports it.
00:02:47.400 In this case, we’re asking for a certain repository by name and then some related objects, like five open issues and information about each issue.
00:02:52.440 What the endpoint will return is a JSON response that’s structured in the same way the query was.
00:02:58.350 So you’ll find the repository and the issues, but it turns out there are only four open issues.
00:03:04.230 Then each issue has the attributes we requested.
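The slides aren't part of this transcript, so here is a rough sketch of the exchange described above. The field names and issue titles are illustrative assumptions, not the talk's actual slide. The query is kept in a Ruby string, and the response is shown as the Ruby hash you might get after parsing the JSON.

```ruby
# Illustrative only: field names and issue titles are assumptions.
query = <<~GRAPHQL
  query {
    repository(owner: "rails", name: "rails") {
      name
      issues(first: 5, states: OPEN) {
        nodes {
          number
          title
        }
      }
    }
  }
GRAPHQL

# What the parsed JSON response could look like. The nesting mirrors the query,
# and only four issues come back because only four are open.
response = {
  "data" => {
    "repository" => {
      "name" => "rails",
      "issues" => {
        "nodes" => [
          { "number" => 101, "title" => "First example issue" },
          { "number" => 102, "title" => "Second example issue" },
          { "number" => 103, "title" => "Third example issue" },
          { "number" => 104, "title" => "Fourth example issue" }
        ]
      }
    }
  }
}

# Walking the response follows the same path as the query's nesting.
titles = response.dig("data", "repository", "issues", "nodes").map { |issue| issue["title"] }
```

Each issue hash carries exactly the two attributes the query named, nothing more.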
00:03:09.410 On the Ruby side, we explored a little example of this yesterday. There’s a schema, which is a bunch of classes connected to each other, and one of them might look like this.
00:03:14.850 This represents a repository, and sometimes these objects map to your Active Record models.
00:03:20.820 However, if you work on a 12-year-old app and don't want to share your database structure with the world, you can model this according to your application's concepts.
00:03:26.489 That's one of the things that appeals to me about it.
00:03:31.170 You can see that for these relationships from one kind of object to another, they can be implemented with methods.
00:03:37.229 This is very cool, but there are many cool things in programming. Why would you bother doing this?
00:03:44.220 Here are a few reasons we use GraphQL at GitHub. One is that we have a lot of clients who want to be empowered to build things.
00:03:50.880 They know what they want to build, so they want flexible and easy-to-use access to that data.
00:03:57.120 An example would be a CI integrator who needs to run the same type of transactions frequently. They can either do that through five or six REST endpoints or in one large GraphQL query.
00:04:04.130 Another advantage of GraphQL for us is the schema on the server with a set of rules regarding how that schema can and cannot change over time.
00:04:10.110 This provides a clear framework for us to understand how we can change our APIs, giving our clients insight into what we won’t be doing.
00:04:17.100 Finally, an interesting advantage from our perspective is that many of our REST APIs return a representation of an object that we decided upon many years ago.
00:04:23.340 Sometimes, there’s a field on a repository that’s expensive to calculate.
00:04:29.760 We might need to call out to the Git filesystem, merge it with some database information, and do some calculations.
00:04:35.760 However, we don’t know if anyone is really using it, and these unused fields can waste a lot of time on our server.
00:04:41.390 They waste our clients' time, and in many cases, they don't want them.
00:04:47.730 With GraphQL, a client can request just what they need without having to wait for anything else.
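As a rough illustration of that last point, here is a plain-Ruby sketch, not GraphQL Ruby itself, in which each field is a method and only the requested fields are ever called. The class, the field names, and the "expensive" language breakdown are all hypothetical.

```ruby
# Plain-Ruby sketch; RepositoryFields and its field names are hypothetical.
class RepositoryFields
  attr_reader :calls

  def initialize
    @calls = Hash.new(0) # how many times each field was computed
  end

  def name
    @calls[:name] += 1
    "rails"
  end

  # Stands in for a field that has to call out to Git and merge in database data.
  def language_breakdown
    @calls[:language_breakdown] += 1
    { "Ruby" => 0.98, "JavaScript" => 0.02 }
  end

  # Compute only the fields the client asked for.
  def resolve(requested_fields)
    requested_fields.each_with_object({}) do |field, result|
      result[field.to_s] = public_send(field)
    end
  end
end

repository = RepositoryFields.new
response = repository.resolve([:name])
```

Because the client asked only for `name`, the expensive method is never invoked, so neither the server nor the client pays for it.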
00:04:53.069 Now, here are some reasons not to use GraphQL as you consider it. One is that it introduces some overhead.
00:05:02.159 It's a complicated system because it has to handle dynamic queries. If you and your clients know exactly who needs what, it may not be a good fit.
00:05:07.919 This is because you would incur a tax for flexibility that you don’t need.
00:05:13.000 Another consideration: when you render a Rails view, you might run all the database queries at the top of the controller and then render the view.
00:05:18.460 In this case, you know exactly what queries you have to run because you know what the view code is.
00:05:25.330 However, with GraphQL queries, the incoming queries could be anything.
00:05:30.730 Therefore, it’s much harder to implement in an efficient way. We will talk more about that.
00:05:38.020 The other point is that not everybody knows how to use GraphQL. The tooling is not as polished as 15-year-old REST tooling.
00:05:44.700 Depending on your industry and who your clients are, people might not want to learn it.
00:05:50.290 With that in mind, I want to share a few things our team has been working on at GitHub over the last year or so regarding GraphQL.
00:05:57.000 The first one is authorization. Previously, I mentioned comparing the GraphQL paradigm to REST endpoints or Rails controllers and views.
00:06:02.210 One great thing about a REST endpoint is that you know exactly what you're going to return.
00:06:07.820 So authorizing an endpoint is pretty straightforward. Who is the current user? What kind of data are we expecting to return?
00:06:13.070 With a GraphQL query, it’s much more complicated because you don’t always know what things you're going to return.
00:06:18.830 For example, let's say I want to know what DHH has been up to lately.
00:06:23.990 When that comes to the server, how do I know whether I should run it or not?
00:06:30.140 If I do run it, how can I ensure I haven’t returned something I shouldn’t have?
00:06:35.060 For instance, the first ten issues might be all Ruby on Rails issues, which is public data.
00:06:42.130 Alternatively, maybe they are all Basecamp issues, which are company secrets I'm not allowed to know.
00:06:48.590 Here’s how we approach that problem. I showed earlier the concept of a GraphQL object.
00:06:54.390 There are really two parts of the GraphQL runtime: objects and scalars.
00:06:59.610 Scalars are like leaf fields, such as a title which is a string—it doesn’t have any properties you can ask about; it's a property of something else.
00:07:05.260 Objects, however, have properties. For example, a user is an object, and you can ask for the fields of a user.
00:07:11.740 When I started at GitHub, they had already discovered a sound pattern for authorizing GraphQL.
00:07:17.690 They decided that anytime a query runs and encounters a new object, we’ll run an authorization check on that object.
00:07:23.370 So, at runtime, the way this looks is the first thing we do is load a user from the database and call some kind of authorization hook.
00:07:30.290 This checks if the current viewer can see the data.
00:07:36.910 If they can, we continue down the list—for example, loading the list of issues.
00:07:42.860 For the first issue on the list, can the viewer see it? Yes.
00:07:48.890 For the next one, here's another repository—can the viewer see this? Yes.
00:07:54.500 But for the next issue on the list—can the viewer see this? No.
00:08:01.170 That means the database filtering in the API was incorrect, and an object that should have been kept back was almost returned to the user.
00:08:06.419 In that case, we have a couple of options. The first option is to replace the object that isn’t allowed with nil.
00:08:14.300 However, we generally don’t do that because leaving a nil in the response shows that something is there.
00:08:20.800 If you’ve ever tried to visit a top-secret project on GitHub that your teammate hasn’t given you access to yet, it says 404.
00:08:26.150 We don’t want to disclose what our customers are working on.
00:08:32.940 Instead, if something slips through, we crash the query entirely. This puts something on our bug tracker.
00:08:40.570 We look through the stack trace to understand where in the API we returned something that failed the check.
00:08:46.270 The hope is that it’s someone else's job to fix that implementation.
00:08:53.370 Implementation-wise, it looks something like this. Here's another GraphQL object class.
00:08:59.970 Instead of the field definitions, you see a class method called `authorized?`, which receives the application object.
00:09:05.850 In this case, it would be an instance of the issue model, along with the GraphQL context.
00:09:12.810 Each time we resolve fields on these objects, we check this hook.
00:09:18.170 The advantage here is that there are lots of different ways to get an issue from the API.
00:09:25.110 You could load it by ID or look for the issues that DHH has authored recently.
00:09:30.920 No matter how you get there, it will always go through this authorization hook.
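To make the "every path goes through the hook" idea concrete, here is a small plain-Ruby sketch. It is not GraphQL Ruby's runtime: `resolve_object`, the two lookup helpers, the `Issue` struct, and the `:can_see_confidential` context key are hypothetical stand-ins. Only the shape of the class-level hook, which takes the application object and the query context, follows the talk.

```ruby
class UnauthorizedObjectError < StandardError; end

# Hypothetical application object.
Issue = Struct.new(:id, :author, :title, :confidential)

ISSUES = [
  Issue.new(1, "dhh", "Public Rails issue", false),
  Issue.new(2, "dhh", "Secret Basecamp issue", true)
].freeze

class IssueType
  # Object-level hook: receives the application object and the query context.
  def self.authorized?(issue, context)
    !issue.confidential || context[:can_see_confidential] == true
  end
end

# Stand-in for the runtime: every object is checked before its fields resolve.
# A failed check raises, which crashes the whole query instead of leaving a nil.
def resolve_object(type, object, context)
  unless type.authorized?(object, context)
    raise UnauthorizedObjectError, "#{type.name} returned an object that failed its check"
  end
  { "title" => object.title }
end

# Two different ways of reaching an issue; both go through the same hook.
def issue_by_id(id, context)
  resolve_object(IssueType, ISSUES.find { |issue| issue.id == id }, context)
end

def issues_by_author(author, context)
  ISSUES.select { |issue| issue.author == author }
        .map { |issue| resolve_object(IssueType, issue, context) }
end
```

Loading issue 1 by ID succeeds for any viewer, while listing DHH's issues without the right context raises, because the unfiltered list contains an object that fails the check.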
00:09:38.380 So that's a significant advantage to this object-level authorization method that’s built into GraphQL Ruby.
00:09:45.279 You can find documentation on it if you want to download the gem and start writing these class methods.
00:09:52.380 At GitHub, we carry out two things in this method.
00:09:57.760 First, we check OAuth scopes.
00:10:02.050 When we receive an API request, it has been authorized to look at certain things and not authorized to look at others.
00:10:08.200 Some of that checking we can do ahead of time because the query is structured according to the schema.
00:10:14.490 However, some we have to check at runtime. For example, we have both public repo scope and private repo scope.
00:10:20.380 From a static query, we can't know whether the first five repositories will be public or private.
00:10:27.300 We also have GitHub apps, which offer granular permissions, and there’s an explicit table of granted permissions we check.
00:10:32.200 Finally, there’s application logic that determines whether a user should see something.
00:10:39.179 For example, if I invite you to collaborate on one of my top-secret projects, and you accept my invitation, you gain granted permission.
00:10:45.259 If you haven’t accepted yet, there’s an open invitation but you don’t have the permission.
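The layers described above could be sketched roughly like this. Everything here is hypothetical (the structs, the scope symbols, and the helper names); it only illustrates that a scope check and an application-level permission check both have to pass, and that an open invitation is not yet a permission.

```ruby
# Hypothetical models; the real checks at GitHub are more involved.
Viewer = Struct.new(:login, :scopes)
Repository = Struct.new(:name, :visibility, :collaborators, :pending_invitations)

# Layer 1: does the request's scope cover this kind of repository?
# This can only be answered at runtime, once the repository is loaded.
def scope_allows?(viewer, repository)
  needed = repository.visibility == :private ? :private_repo : :public_repo
  viewer.scopes.include?(needed)
end

# Layer 2: application logic. Only accepted collaborators can see private repositories.
def permission_granted?(viewer, repository)
  repository.visibility == :public || repository.collaborators.include?(viewer.login)
end

def authorized?(viewer, repository)
  scope_allows?(viewer, repository) && permission_granted?(viewer, repository)
end

secret = Repository.new("top-secret", :private, ["alice"], ["bob"])
alice = Viewer.new("alice", [:public_repo, :private_repo])
bob = Viewer.new("bob", [:public_repo, :private_repo])

alice_can_see = authorized?(alice, secret) # accepted collaborator with the right scope
bob_can_see = authorized?(bob, secret)     # invitation still open, so no permission yet
```

Once the invitation is accepted and `bob` moves from `pending_invitations` to `collaborators`, the same call returns true.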
00:10:50.979 This can become complicated, but you can accomplish a lot inside that method.
00:10:56.460 If you’re interested, you’ll find documentation online with substantive how-to guides on this subject.
00:11:03.062 Another area we’ve been working on is how to run effective or efficient database queries within this GraphQL paradigm.
00:11:09.829 In a Rails view, you might look at the ERB to determine what types of objects you need.
00:11:15.200 From there, you can use methods like `includes` and `joins`.
00:11:21.650 With GraphQL queries, it's challenging to know ahead of time what kind of data you will need.
00:11:27.937 So, the technique we use here is called batching.
00:11:34.679 Imagine batches of pancakes. To set the scene, let's say that you've heard of companies that claim to have some AI service, but it's just employees in an office answering questions.
00:11:40.969 Let’s say we implement our GraphQL API by putting the query on your desk and asking you to write up the response by Friday.
00:11:47.950 For instance, we’re going to select a user named DHH and ask for some information about that user.
00:11:54.200 And we want some information about another user called 'skmetz'.
00:12:01.510 The easiest implementation here is querying for the first one, resolving the fields, querying for the second, and so on.
00:12:07.800 This ultimately results in three round trips to the database, which incurs significant transaction costs.
00:12:13.670 If you're in a hurry, say because I told you this needed to be done yesterday, that's not a good GraphQL implementation.
00:12:20.070 Instead, a more efficient approach is to batch the requests.
00:12:27.090 This allows for a single round-trip to the database, saving much of the transaction overhead.
00:12:35.129 The example becomes trickier when you want to see what's up on a repository, like the Ruby compiler.
00:12:43.590 I could load the first ten pull requests and the author's name for each.
00:12:50.050 If I set that query on your desk and asked which authors to load, the only way to identify them is to run the query partially.
00:12:56.330 This is where batching comes in as a dynamic technique for loading data.
00:13:01.210 Here’s an example object class with an author field. Imagine an implementation that just calls the database to load a single object by ID.
00:13:07.120 That practice leads to an N+1 situation because each time you encounter the author, it'll load an object.
00:13:13.610 Instead, we grab the foreign key and return a promise using something called a loader—more on that later.
00:13:19.900 We tell GraphQL that we'll need a user with this user ID, and we return that pending request in place of the value until the data is available.
00:13:26.060 From the GraphQL standpoint, the workflow starts by resolving fields eagerly, gathering up these data requests.
00:13:31.000 When it reaches the end of everything it can resolve eagerly, it collects the pending requests.
00:13:37.679 For example, user ID 1, user ID 10, user ID 21—then we run all those as a batch.
00:13:43.510 A typical implementation of a loader resembles this.
00:13:49.470 It defines a batching unit, like in our earlier example—the user class—and gathers up the values it needs to load.
00:13:56.640 Finally, it dispatches a request for all the different IDs.
00:14:03.710 Instead of running N+1 queries, it performs a single query.
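Here is a minimal plain-Ruby sketch of that flow. It is not the GraphQL Batch API: `FakeUserTable`, `UserLoader`, and `PendingUser` are hypothetical names, and the fake table simply counts how many queries it receives. Resolving a field calls `load`, which only records the ID and hands back a pending request; the batched query runs the first time any pending value is actually needed.

```ruby
# Hypothetical stand-in for a database table that counts round trips.
class FakeUserTable
  attr_reader :query_count

  def initialize(rows)
    @rows = rows
    @query_count = 0
  end

  # One round trip, like SELECT * FROM users WHERE id IN (...)
  def where_id_in(ids)
    @query_count += 1
    @rows.select { |row| ids.include?(row[:id]) }
  end
end

# A pending request for one user; the value is looked up only when asked for.
PendingUser = Struct.new(:loader, :id) do
  def value
    loader.result_for(id)
  end
end

class UserLoader
  def initialize(table)
    @table = table
    @pending_ids = []
    @results = {}
  end

  # Called while resolving a field: remember the ID, return a promise-like object.
  def load(id)
    @pending_ids << id
    PendingUser.new(self, id)
  end

  # Called once eager resolution is finished: run the batch if it hasn't run yet.
  def result_for(id)
    perform unless @results.key?(id)
    @results[id]
  end

  private

  def perform
    ids = @pending_ids
    @pending_ids = []
    ids.each { |pending_id| @results[pending_id] = nil } # default for missing rows
    @table.where_id_in(ids).each { |row| @results[row[:id]] = row }
  end
end

table = FakeUserTable.new([
  { id: 1, name: "Ada" }, { id: 10, name: "Grace" }, { id: 21, name: "Matz" }
])
loader = UserLoader.new(table)

# Eager phase: each pull request's author field asks for a user by foreign key.
pending = [1, 10, 21].map { |author_id| loader.load(author_id) }
# Batch phase: the first call to value triggers one query for all three IDs.
authors = pending.map { |request| request.value[:name] }
```

A naive `find`-per-author version would have hit the table three times; here `table.query_count` ends at one.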
00:14:09.520 We use the GraphQL Batch library from Shopify, which is modeled on a JavaScript implementation that Facebook offered when it announced GraphQL.
00:14:16.110 The upside is you can use loaders for different data sources, such as Active Record and Git RPC.
00:14:21.810 Let’s say a lot of fields need the same user. If the first 10 pull requests are all authored by the same person, we can deduplicate those keys before dispatching the batch.
00:14:28.200 GraphQL Batch keeps an identity cache—for instance, if we request user ID 10 and have already asked for it, we return the same object right away.
00:14:35.080 At GitHub scale, this is beneficial because even when a repeated lookup is served from Rails' SQL cache rather than the database, Rails still reinitializes a user object for the same ID.
00:14:43.909 We have a few objects in the source code that take a long time to initialize, so it's practical to reuse instances when possible.
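A rough sketch of those two ideas, key deduplication and an identity cache, in plain Ruby. The `IdentityCache` class and `User` struct are hypothetical and are not how GraphQL Batch is implemented internally; the point is only that repeated requests for one ID yield the very same object instead of a freshly initialized one.

```ruby
# Hypothetical application object.
User = Struct.new(:id)

class IdentityCache
  attr_reader :initializations

  def initialize
    @objects = {}
    @initializations = 0
  end

  # Return the cached instance for each ID, building objects only for unseen IDs.
  def fetch_all(ids)
    ids.uniq.each do |id|   # deduplicate keys before doing any work
      next if @objects.key?(id)
      @initializations += 1 # stands in for a slow object initialization
      @objects[id] = User.new(id)
    end
    ids.map { |id| @objects[id] }
  end
end

cache = IdentityCache.new
first_batch = cache.fetch_all([10, 10, 10, 21]) # three fields want user 10
second_batch = cache.fetch_all([10])            # asked again later in the query
```

Four requests in the first batch and one more in the second cost only two initializations, and every reference to user 10 points at the same instance.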
00:14:49.780 We've been using GraphQL Batch for a while and have realized that if we could integrate it with the GraphQL runtime, we could do even better.
00:14:56.820 Here are a few improvements we're looking to implement. The first is speed: reducing latency for clients.
00:15:03.680 One way to achieve that is by introducing parallelism. There’s a proof of concept from someone at Strava.
00:15:09.739 The basic concept is this: you might have heard that Ruby can’t run in parallel, which is accurate, as it has a global interpreter lock.
00:15:17.820 It permits only one line of Ruby code to run at a time, even with threads.
00:15:25.450 The exception is that when the Ruby runtime is waiting for an external service, such as a query to a database, it can wait on multiple things at once.
00:15:33.530 So if we have three different kinds of batch loads for various objects, the current implementation runs one after the other, waiting before each one.
00:15:39.970 We could improve this by starting the first IO call and simply forgetting about it, moving on to the second, and so forth.
00:15:47.550 Finally, we wait for all of them to finish, which would speed things up without adding considerable complexity.
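A small standard-library sketch of that idea, using plain threads rather than the Strava proof of concept or any GraphQL Ruby API. `fake_io_load` is a hypothetical stand-in in which `sleep` plays the role of waiting on a database or RPC call; while one thread is waiting, the others can start their own waits.

```ruby
# Hypothetical batch load: sleep stands in for waiting on an external service.
def fake_io_load(name, seconds)
  sleep(seconds)
  "#{name} loaded"
end

batches = { "users" => 0.05, "repositories" => 0.05, "commits" => 0.05 }

# Start every IO call without waiting for the previous one to finish...
threads = batches.map do |name, seconds|
  Thread.new { fake_io_load(name, seconds) }
end

# ...then wait for all of them at the end. Thread#value joins and returns the result.
results = threads.map(&:value)
```

Run one after another, the three waits would add up; started together, the total wait is closer to the longest single one.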
00:15:54.430 In practice, here's a loader for a specific repository where we'll load commit information according to the commit hashes.
00:16:02.320 If we run that in parallel, we could leverage the concurrent-ruby library, which has shipped as a Rails dependency since version 5.
00:16:09.390 This library maintains an internal thread pool that can queue and distribute work among several threads.
00:16:16.180 However, the catch here is that your application must be thread-safe, as work will be running on different threads.
00:16:22.420 Another area for improvement would be to simplify problems developers run into repeatedly, for instance when we're asking for posts and their authors.
00:16:29.700 Today the advice is that if you're careful, you can make sure you don't perform an N+1 query.
00:16:35.970 Anyone running GraphQL Ruby in production probably recognizes that challenge.
00:16:42.220 I'd love to build in batching to make this process easier and develop robust solutions for this.
00:16:49.110 The folks from Shopify are interested in similar objectives, and they've shared numerous shortcuts and techniques.
00:16:56.490 I'm optimistic about working on something solid, so if you're interested, please check out the pull request.
00:17:04.510 Lastly, I wanted to address our team’s recent efforts regarding our API's performance.
00:17:10.180 This is straightforward in most cases; you record how long endpoints take.
00:17:16.600 You can push them to Datadog to monitor endpoint performance.
00:17:24.020 However, with GraphQL, this becomes tricky because there’s only one endpoint: /graphql.
00:17:30.710 There is also tremendous variation in the size of the queries.
00:17:36.000 For example, we do have a Datadog dashboard for GraphQL that looks like this: the 50th percentile is this blue line; the 75th percentile is the purple line.
00:17:43.520 Here you can see the 95th percentile. If this were any other scenario, you'd know that something must be wrong.
00:17:49.840 With GraphQL, however, I don’t believe anything is amiss. The reality is that some GraphQL queries are easy for the server to fulfill.
00:17:56.220 Conversely, others that are legitimate queries are complicated and ask for a lot of data, taking considerable time to respond.
00:18:03.340 With this much variation, how do we tell whether things are running smoothly or we've introduced a regression?
00:18:09.230 I’ve used a funny technique for a couple of years that has inspired one potential solution.
00:18:16.810 Something you may not know is that part of the way GitHub adopted GraphQL was alongside the Rails view layer.
00:18:23.410 Several views in GitHub run a GraphQL query to fetch their data before rendering.
00:18:30.840 One of those views is the pull request show. Has anyone here ever looked at a pull request on GitHub?
00:18:37.950 It's relatively straightforward, so I use that view as a canary whenever changes are made to how GraphQL works.
00:18:44.780 Since it’s a pretty important view, we can determine if things are operating smoothly.
00:18:51.870 The benefit is that we know the query remains stable. Even with that, there's still a significant difference between the 50th and 95th percentiles.
00:18:58.650 Yet the gap is substantially better than 20 times. We can use this as a benchmark to ensure everything is running smoothly.
00:19:05.320 However, the problem is we don't know what our API clients are doing; we only know the needs of the pull request view.
00:19:12.180 We have a clear way to record that query because we document the page views, but what about our clients?
00:19:19.690 So, I thought: maybe we can figure out what our clients are doing.
00:19:26.160 If you’ve ever worked with a GraphQL client, you'll know many query strings are hard-coded in application code.
00:19:32.830 For instance, if you’re working on the deployment integration with Heroku, you might use a hard-coded query string.
00:19:40.020 No matter who is deploying or whose repository needs updating, the query string remains constant; only some of the values change.
00:19:47.069 For us, this looks like clients sending us a GraphQL query, receiving the response, and in the background, we enqueue a job.
00:19:54.080 Afterward, we parse that query and conduct processing. We also keep track of who uses which GraphQL fields.
00:20:00.660 In the case of deprecating something, we can reach out to the appropriate parties.
00:20:07.300 We have a workflow for processing these queries that has led to a data warehouse.
00:20:14.110 It may not be big data, but in this warehouse, we log the query strings run by clients, with numerous values scrubbed.
00:20:20.390 We track how long it took to execute, who ran it, and what kind of integration or app it corresponds to.
00:20:27.760 This helps us develop an effective signature for the transactions being executed.
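One way to picture such a signature: scrub the literal values out of the query text and hash what remains, so the same hard-coded query sent with different values lands in the same bucket. This is a regex-based sketch with hypothetical helper names, not GitHub's pipeline; the talk describes parsing the query, which would be more robust than regular expressions.

```ruby
require "digest"

# Replace literal values with placeholders and normalize whitespace.
def scrub_query(query)
  query
    .gsub(/"(?:[^"\\]|\\.)*"/, '"?"')   # string literals
    .gsub(/\b\d+(?:\.\d+)?\b/, "?")     # numeric literals
    .gsub(/\s+/, " ")
    .strip
end

# A short, stable identifier for "the same query with different values".
def query_signature(query)
  Digest::SHA256.hexdigest(scrub_query(query))[0, 12]
end

rails_query = '{ repository(owner: "rails", name: "rails") { issues(first: 5) { nodes { title } } } }'
ruby_query = "{ repository(owner: \"ruby\", name: \"ruby\") {\n  issues(first: 10) { nodes { title } } } }"

scrubbed = scrub_query(rails_query)
same_signature = query_signature(rails_query) == query_signature(ruby_query)
```

Timing data keyed by that signature can then be charted over time, regardless of which repository or page size a particular request used.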
00:20:34.060 Therefore, we can monitor performance over time, such as for a substantial company whose compliance workflow runs over GraphQL.
00:20:41.500 This allows us to follow how that query performs throughout.
00:20:47.940 I've started analyzing that, and here's what the charts look like: a query that ran about 150 times a day for the past two weeks.
00:20:54.430 We see different components of the timing—CPU time refers to how long we ran the Ruby code.
00:21:01.870 The two largest external services impacting performance are the database and Git filesystem.
00:21:09.390 After monitoring for two weeks, things were stable until a couple of days ago when we noticed a sudden increase in Git RPC response times.
00:21:15.840 The MySQL time remained relatively stable; however, we see spikes, potentially during deployments or from some resource contention.
00:21:24.980 This change in the last couple of days signals a potential problem. You can add alarm fatigue to the mix.
00:21:32.750 Now, whether this interpretation holds true depends on the queries we receive.
00:21:39.260 For example, earlier we queried for a repository's first ten issues and related info.
00:21:46.070 In practice, you can ask for the first 100 repositories, and for each repository, the first 100 issues, and then for each issue, the first 100 labels.
00:21:53.210 The time it takes to run that query depends on the number of repositories. If there’s one, it will run quickly.
00:22:00.250 However, if there are a hundred repositories, each with a hundred issues, and each issue has 100 labels, it will take far longer to complete.
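The worst case is easy to work out by hand, and the arithmetic shows why response size matters so much. The helper below is a hypothetical sketch that multiplies the page sizes level by level and adds up the objects at each level.

```ruby
# Upper bound on the number of objects a nested, paginated query can return.
# page_sizes lists the "first: N" value at each level, outermost first.
def max_objects(page_sizes)
  objects_at_level = 1
  page_sizes.sum do |page_size|
    objects_at_level *= page_size
  end
end

# 100 repositories, 100 issues each, 100 labels each:
# 100 + 10_000 + 1_000_000 objects in total.
worst_case = max_objects([100, 100, 100])

# The same query against an account with a single repository:
# 1 + 100 + 10_000 objects at most.
single_repository = max_objects([1, 100, 100])
```

The query text is identical in both cases, yet the upper bound on the work differs by roughly a factor of one hundred.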
00:22:07.100 One major gap in this analysis needs attention next week.
00:22:14.000 I need to account for the response size when measuring how query time varies.
00:22:20.820 I still have to dig into that, but I’m sharing this in hopes of garnering insights from this group.
00:22:27.170 If you’ve figured out ways to assess similar problems using GraphQL, I’d love to hear your thoughts after the discussion.
00:22:34.860 That essentially wraps it up—thank you for listening. I hope you enjoyed it and that your gears are turning!