00:00:11.849
Good morning! Thanks for being here on the morning of the second day. My name is Robert Mosolgo, and this talk is for you if you've ever had a hack week and created a GraphQL proof of concept, only for your CTO to say, 'This is great, but what about that?' So, we're going to discuss a few of those things.
00:00:17.039
As I mentioned, I work for a startup called GitHub, or, I guess I'm a Microsoft employee now. I didn’t run that by the brand guidelines, but I work on the API team at GitHub.
00:00:23.789
We maintain a lot of the plumbing for the REST API and the GraphQL API to help developers get their APIs out the door in a fast and secure manner. I've been there for about two and a half years.
00:00:31.170
Besides that, I maintain an open-source project called GraphQL Ruby, which was introduced a little bit yesterday. You can see that I starred my own repository to try to make it look cool, so you can do that too if you want.
00:00:38.160
Additionally, I live in Charlottesville, Virginia, a place that was in the news a couple of years ago for a Nazi rally. If you were in town for that, it was not a very warm welcome to visitors.
00:00:43.379
But this is a picture of the mountains near my house, which is more of what I'm into. My wife and I are raising three daughters—one of them is in that picture twice, and you can figure it out.
00:00:51.360
This was just when the street cleaner came through our neighborhood, which was a big event! Besides software, I also make cheese at home, which is a big mess.
00:00:58.320
I was saying before that sometimes computer people need a bit of danger and excitement in their lives. And the great thing about cheese is that it is like a tour de force of all the best parts of traditional food preservation.
00:01:04.649
This is actually an action shot of me sprinkling a bit of bacteria in there. They grow and digest lactose, the milk sugar, producing lactic acid that prevents other bacteria from growing.
00:01:09.720
That’s a good start if you're trying to put milk on a shelf for months. The next step is to introduce enzymes that break down the proteins, which makes them come out of solution and precipitate from the milk, collapsing down on each other.
00:01:16.860
The book calls it a protein matrix, but I like to call it a protein gel that traps the big fat globules inside while allowing the water to flow out freely.
00:01:22.950
I read a funny description of cheese making that simply states: you remove the water from the milk. That’s how you do it.
00:01:30.689
So, that’s my process—kind of janky, squeezing the water out.
00:01:34.500
The last part is aging. This is my mini-fridge cheese cave, where you intentionally grow mold on things.
00:01:40.320
There’s some brie that’s coming along, though I handled it a little too roughly. You can see it has some spots.
00:01:47.400
This is the fireworks show of cheese making, where everything goes in looking like a white blob, but depending on slight differences in salt, moisture, fat content, and acidity, they all become really different.
00:01:52.440
Quick shoutout to Shopify: a lot of my favorite home cheese-making supply retailers are apparently based on Shopify, so I'd recognize that checkout flow anywhere.
00:01:58.350
I have two favorite home cheese-making supply retailers, and they both use Shopify! Enough about cheese, though.
00:02:02.310
I am also here to talk about GraphQL, but if you don't like GraphQL, you can talk to me about cheese later. If you don't like cheese, I'd be interested to hear more about that too!
00:02:08.820
Now, a quick introduction to GraphQL. We had one yesterday, but if it’s new to you, or to ensure we’re all on the same page, I’ll introduce the concept at a high level.
00:02:15.300
GraphQL's website calls it a query language for your API. There are really two key takeaways from that: one, it's an API.
00:02:22.170
If you have a web application and want it to communicate with the outside world, you need some kind of interface—maybe it's a user interface or an interface for other software systems.
00:02:29.370
Secondly, it’s a query language. The things that come and go are written in a query language, similar to SQL.
00:02:34.500
For example, we have an HTTP API with a path and a structured query string.
00:02:40.320
As Damon and Morgan mentioned yesterday, one cool thing about GraphQL is that you can ask for whatever you want, as long as the API supports it.
00:02:47.400
In this case, we’re asking for a certain repository by name and then some related objects, like five open issues and information about each issue.
00:02:52.440
What the endpoint will return is a JSON response that’s structured in the same way the query was.
00:02:58.350
So you’ll find the repository and the issues, but it turns out there are only four open issues.
00:03:04.230
Then each issue has the attributes we requested.
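To make that concrete, here is a rough sketch of that kind of exchange; the repository, the field names, and the response values are illustrative, not the exact query from the slides.

```ruby
# A hypothetical GraphQL query string, roughly matching the example above.
query_string = <<~GRAPHQL
  query {
    repository(owner: "rails", name: "rails") {
      issues(first: 5, states: OPEN) {
        nodes {
          title
          number
        }
      }
    }
  }
GRAPHQL

# The JSON response mirrors the shape of the query, something like:
# {
#   "data" => {
#     "repository" => {
#       "issues" => {
#         "nodes" => [
#           { "title" => "...", "number" => 123 },
#           # ...only as many issues as actually exist
#         ]
#       }
#     }
#   }
# }
```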
00:03:09.410
On the Ruby side, we explored a little example of this yesterday. There’s a schema, which is a bunch of classes connected to each other, and one of them might look like this.
00:03:14.850
This represents a repository, and sometimes these objects map to your Active Record models.
00:03:20.820
However, if you work on a 12-year-old app and don't want to share your database structure with the world, you can model this according to your application's concepts.
00:03:26.489
That's one of the things that appeals to me about it.
00:03:31.170
You can see that for these relationships from one kind of object to another, they can be implemented with methods.
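As a sketch of what one of those classes might look like (the field names and the query inside the method are illustrative, not GitHub's actual schema):

```ruby
# A hypothetical repository type in GraphQL Ruby's class-based API.
class Types::Repository < Types::BaseObject
  field :name, String, null: false
  field :issues, [Types::Issue], null: false do
    argument :first, Integer, required: false
  end

  # Relationships to other objects can be implemented as plain methods;
  # `object` is the underlying application object (often an Active Record model).
  def issues(first: 10)
    object.issues.where(state: "open").limit(first) # hypothetical association and column
  end
end
```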
00:03:37.229
This is very cool, but there are many cool things in programming. Why would you bother doing this?
00:03:44.220
Here are a few reasons we use GraphQL at GitHub. One is that we have a lot of clients who want to be empowered to build things.
00:03:50.880
They know what they want to build, so they want flexible and easy-to-use access to that data.
00:03:57.120
An example would be a CI integrator who needs to run the same type of transactions frequently. They can either do that through five or six REST endpoints or in one large GraphQL query.
00:04:04.130
Another advantage of GraphQL for us is the schema on the server with a set of rules regarding how that schema can and cannot change over time.
00:04:10.110
This provides a clear framework for us to understand how we can change our APIs, giving our clients insight into what we won’t be doing.
00:04:17.100
Finally, an interesting advantage from our perspective is that many of our REST APIs return a representation of an object that we decided upon many years ago.
00:04:23.340
Sometimes, there’s a field on a repository that’s expensive to calculate.
00:04:29.760
We might need to call out to the Git filesystem, merge it with some database information, and do some calculations.
00:04:35.760
However, we don’t know if anyone is really using it, and these unused fields can waste a lot of time on our server.
00:04:41.390
They waste our clients' time, and in many cases, they don't want them.
00:04:47.730
With GraphQL, a client can request just what they need without having to wait for anything else.
00:04:53.069
Now, here are some reasons not to use GraphQL as you consider it. One is that it introduces some overhead.
00:05:02.159
It's a complicated system for making dynamic queries, so if you and your clients know exactly who needs what, it may not be a good fit.
00:05:07.919
This is because you would incur a tax for flexibility that you don’t need.
00:05:13.000
Another consideration: compare it to rendering a Rails view, where you might run all the database queries at the top of the controller action and then render the view.
00:05:18.460
In this case, you know exactly what queries you have to run because you know what the view code is.
00:05:25.330
However, with GraphQL queries, the incoming queries could be anything.
00:05:30.730
Therefore, it’s much harder to implement in an efficient way. We will talk more about that.
00:05:38.020
The other point is that not everybody knows how to use GraphQL. The tooling is not as polished as 15-year-old REST tooling.
00:05:44.700
Depending on your industry and who your clients are, people might not want to learn it.
00:05:50.290
With that in mind, I want to share a few things our team has been working on at GitHub over the last year or so regarding GraphQL.
00:05:57.000
The first one is authorization. Previously, I mentioned comparing the GraphQL paradigm to REST endpoints or Rails controllers and views.
00:06:02.210
One great thing about a REST endpoint is that you know exactly what you're going to return.
00:06:07.820
So authorizing an endpoint is pretty straightforward. Who is the current user? What kind of data are we expecting to return?
00:06:13.070
With a GraphQL query, it’s much more complicated because you don’t always know what things you're going to return.
00:06:18.830
For example, let's say I want to know what DHH has been up to lately.
00:06:23.990
When that comes to the server, how do I know whether I should run it or not?
00:06:30.140
If I do run it, how can I ensure I haven’t returned something I shouldn’t have?
00:06:35.060
For instance, the first ten issues might be all Ruby on Rails issues, which is public data.
00:06:42.130
Alternatively, maybe they are all Basecamp issues, which are company secrets I'm not allowed to know.
00:06:48.590
Here’s how we approach that problem. I showed earlier the concept of a GraphQL object.
00:06:54.390
There are really two parts of the GraphQL runtime: objects and scalars.
00:06:59.610
Scalars are like leaf fields, such as a title which is a string—it doesn’t have any properties you can ask about; it's a property of something else.
00:07:05.260
Objects, however, have properties. For example, a user is an object, and you can ask the fields of a user.
00:07:11.740
When I started at GitHub, they had already discovered a sound pattern for authorizing GraphQL.
00:07:17.690
They decided that anytime a query runs and encounters a new object, we’ll run an authorization check on that object.
00:07:23.370
So, at runtime, the way this looks is the first thing we do is load a user from the database and call some kind of authorization hook.
00:07:30.290
This checks if the current viewer can see the data.
00:07:36.910
If they can, we continue down the list—for example, loading the list of issues.
00:07:42.860
For the first issue on the list, can the viewer see it? Yes.
00:07:48.890
For the next one, here's another repository—can the viewer see this? Yes.
00:07:54.500
But for the next issue on the list—can the viewer see this? No.
00:08:01.170
That means the database filtering in the API was incorrect, and an object that should have been kept back was almost returned to the user.
00:08:06.419
In that case, we have a couple of options. The first option is to replace the object that isn’t allowed with nil.
00:08:14.300
However, we generally don’t do that because leaving a nil in the response shows that something is there.
00:08:20.800
If you’ve ever tried to visit a top-secret project on GitHub that your teammate hasn’t given you access to yet, it says 404.
00:08:26.150
We don’t want to disclose what our customers are working on.
00:08:32.940
Instead, if something slips through, we crash the query entirely. This puts something on our bug tracker.
00:08:40.570
We look through the stack trace to understand where in the API we returned something that failed the check.
00:08:46.270
The hope is that it’s someone else's job to fix that implementation.
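In GraphQL Ruby terms, one way to get that "crash instead of leaking a nil" behavior is the schema-level unauthorized_object hook; this is just a sketch, and the bug-tracker call is a stand-in, not our actual reporting code.

```ruby
class MySchema < GraphQL::Schema
  # Called whenever an object fails its type's authorization check.
  # Instead of quietly nil-ing out the field, report it and crash the query.
  def self.unauthorized_object(error)
    # `error.object` is the leaked application object, `error.type` is its GraphQL type.
    BugTracker.report("Unauthorized #{error.type.graphql_name} reached the API") # stand-in reporter
    raise GraphQL::ExecutionError, "This query could not be completed"
  end
end
```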
00:08:53.370
Implementation-wise, it looks something like this. Here's another GraphQL object class.
00:08:59.970
Instead of the field definitions, you see a class method called `authorized`, which receives the application object.
00:09:05.850
In this case, it would be an instance of the issue model, along with the GraphQL context.
00:09:12.810
Each time we resolve fields on these objects, we check this hook.
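In GraphQL Ruby's class-based API, that hook is spelled `authorized?`; here is a minimal sketch, where `readable_by?` is a hypothetical model method standing in for whatever visibility check your app has.

```ruby
class Types::Issue < Types::BaseObject
  # `object` is the application object (an Issue model instance);
  # `context` is the GraphQL context, which includes the current viewer.
  def self.authorized?(object, context)
    super && object.readable_by?(context[:viewer]) # hypothetical visibility check
  end

  # ...field definitions as usual...
end
```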
00:09:18.170
The advantage here is that there are lots of different ways to get an issue from the API.
00:09:25.110
You could load it by ID or look for the issues that DHH has authored recently.
00:09:30.920
No matter how you get there, it will always go through this authorization hook.
00:09:38.380
So that's a significant advantage to this object-level authorization method that’s built into GraphQL Ruby.
00:09:45.279
There's documentation on it; when you install the library, you get these class methods to work with.
00:09:52.380
At GitHub, we do a few things in this method.
00:09:57.760
First, we check OAuth scopes.
00:10:02.050
When we receive an API request, the token has been authorized to look at certain things and not at others.
00:10:08.200
Some of that checking we can do ahead of time because the query is structured according to the schema.
00:10:14.490
However, some we have to check at runtime. For example, we have both public repo scope and private repo scope.
00:10:20.380
From a static query, we can't know whether the first five repositories will be public or private.
00:10:27.300
We also have GitHub apps, which offer granular permissions, and there’s an explicit table of granted permissions we check.
00:10:32.200
Finally, there’s application logic that determines whether a user should see something.
00:10:39.179
For example, if I invite you to collaborate on one of my top-secret projects and you accept my invitation, then you have that permission.
00:10:45.259
If you haven’t accepted yet, there’s an open invitation but you don’t have the permission.
00:10:50.979
This can become complicated, but you can accomplish a lot inside that method.
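As a purely hypothetical sketch of the kinds of checks that can live in that hook (none of these helper methods are GitHub's real implementation; they just show scope checks, explicit permission grants, and application logic sitting side by side):

```ruby
class Types::Repository < Types::BaseObject
  def self.authorized?(repo, context)
    token  = context[:access_token]
    viewer = context[:viewer]

    super &&
      token.includes_scope?(repo.private? ? "repo" : "public_repo") && # OAuth scope check at runtime
      context[:granted_permissions].allows?(:read, repo) &&            # explicit table of granted permissions
      repo.readable_by?(viewer)                                        # application logic: collaborators, invitations, etc.
  end
end
```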
00:10:56.460
If you’re interested, you’ll find documentation online with substantive how-to guides on this subject.
00:11:03.062
Another area we’ve been working on is how to run effective or efficient database queries within this GraphQL paradigm.
00:11:09.829
In a Rails view, you might look at the ERB to determine what types of objects you need.
00:11:15.200
From there, you can use methods like `includes` and `joins`.
00:11:21.650
With GraphQL queries, it's challenging to know ahead of time what kind of data you will need.
00:11:27.937
So, the technique we use here is called batching.
00:11:34.679
Think of batches like pancakes: you don't make one pancake at a time. Now, let's say you've heard of those companies that claim to have some AI service, but it's really just employees in an office answering questions.
00:11:40.969
Let’s say we implement our GraphQL API by putting the query on your desk and asking you to write up the response by Friday.
00:11:47.950
For instance, we’re going to select a user named DHH and ask for some information about that user.
00:11:54.200
And we want some information about another user, skmetz.
00:12:01.510
The easiest implementation here is querying for the first one, resolving the fields, querying for the second, and so on.
00:12:07.800
This ultimately results in three round trips to the database, which incurs significant transaction costs.
00:12:13.670
If you're in a hurry (say I told you this needed to be done yesterday), that's not a good way to implement GraphQL.
00:12:20.070
Instead, a more efficient approach is to batch the requests.
00:12:27.090
This allows for a single round-trip to the database, saving much of the transaction overhead.
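To make that concrete, here is a sketch of the naive approach next to the batched one, with some illustrative logins and Active Record assumed as the data source.

```ruby
logins = ["dhh", "matz"]

# Naive: one round trip to the database per user.
users = logins.map { |login| User.find_by(login: login) }

# Batched: gather the keys first, then make a single round trip.
users_by_login = User.where(login: logins).index_by(&:login)
users = logins.map { |login| users_by_login[login] }
```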
00:12:35.129
The example becomes trickier when you want to see what's up on a repository, like the Ruby compiler.
00:12:43.590
I could load the first ten pull requests and the author's name for each.
00:12:50.050
If I set that query on your desk and asked which authors to load, the only way to identify them is to partially run the query.
00:12:56.330
This is where batching comes in as a dynamic technique for loading data.
00:13:01.210
Here’s an example object class with an author field. Imagine an implementation that just calls the database to load a single object by ID.
00:13:07.120
That practice leads to an N+1 situation because each time you encounter the author, it'll load an object.
00:13:13.610
Instead, we grab the foreign key and return a promise using something called a loader—more on that later.
00:13:19.900
We tell GraphQL that we'll eventually need the user with this ID, and it holds on to that request until the data is available.
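A sketch of what that field might look like, assuming a graphql-batch style RecordLoader like the one sketched a little further down; the association and foreign key names are illustrative.

```ruby
class Types::PullRequest < Types::BaseObject
  field :author, Types::User, null: true

  # Don't load the user here. Hand the foreign key to a loader and return
  # a promise; the actual query happens later, as part of a batch.
  def author
    RecordLoader.for(User).load(object.author_id)
  end
end
```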
00:13:26.060
From the GraphQL standpoint, the workflow starts by resolving fields eagerly, gathering up these data requests.
00:13:31.000
When it reaches the end of everything it can resolve eagerly, it collects the pending requests.
00:13:37.679
For example, user ID 1, user ID 10, user ID 21—then we run all those as a batch.
00:13:43.510
A typical implementation of a loader resembles this.
00:13:49.470
It defines a batching unit, like in our earlier example—the user class—and gathers up the values it needs to load.
00:13:56.640
Finally, it dispatches a request for all the different IDs.
00:14:03.710
Instead of N+1 queries, it performs a single query.
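A minimal version of that loader, modeled on the example RecordLoader from Shopify's graphql-batch rather than GitHub's actual code:

```ruby
# Batches ID lookups for one model class into a single query.
class RecordLoader < GraphQL::Batch::Loader
  def initialize(model)
    @model = model # the batching unit: one loader instance per model class
  end

  # Called once with every ID gathered during eager resolution.
  def perform(ids)
    @model.where(id: ids).each { |record| fulfill(record.id, record) }
    ids.each { |id| fulfill(id, nil) unless fulfilled?(id) } # missing records resolve to nil
  end
end
```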
00:14:09.520
We use the GraphQL Batch library from Shopify; Facebook offered a JavaScript implementation of this idea when it announced GraphQL, and GraphQL Batch brings it to Ruby.
00:14:16.110
The upside is you can use loaders for different data sources, such as Active Record and Git RPC.
00:14:21.810
Let’s say a lot of fields need the same user. If the first 10 pull requests are all authored by the same person, we can deduplicate those keys before dispatching the batch.
00:14:28.200
GraphQL Batch keeps an identity cache—for instance, if we request user ID 10 and have already asked for it, we return the same object right away.
00:14:35.080
At GitHub's scale this is beneficial: rather than hitting the database, or hitting Rails' SQL cache and reinitializing a user object with the same ID, we just reuse the same instance.
00:14:43.909
We have a few objects in the source code that take a long time to initialize, so it's practical to reuse instances when possible.
00:14:49.780
We've been using GraphQL Batch for a while, and we've realized that if we could integrate it more closely with the GraphQL runtime, we could do even better.
00:14:56.820
Here are a few improvements we're looking to make. One is speed: reducing latency for clients.
00:15:03.680
One way to achieve that is by introducing parallelism. There’s a proof of concept from someone at Strava.
00:15:09.739
The basic concept is this: you might have heard that Ruby can’t run in parallel, which is accurate, as it has a global interpreter lock.
00:15:17.820
It permits only one line of Ruby code to run at a time, even with threads.
00:15:25.450
The exception is that when the Ruby runtime is waiting for an external service, such as a query to a database, it can wait on multiple things at once.
00:15:33.530
So if we have three different kinds of batch loads for various objects, the current implementation runs one after the other, waiting on each one.
00:15:39.970
We could improve this by starting the first IO call and simply forgetting about it, moving on to the second, and so forth.
00:15:47.550
Finally, we wait for all of them to finish, which would speed things up without adding considerable complexity.
00:15:54.430
In practice, here's a loader for a specific repository where we'll load commit information according to the commit hashes.
00:16:02.320
If we run that in parallel, we could leverage the concurrent Ruby library that’s included with Rails since version 5.
00:16:09.390
This library maintains an internal thread pool that can queue and distribute work among several threads.
00:16:16.180
However, the catch here is that your application must be thread-safe, as the work will be running on different threads.
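Here is a rough sketch of that idea with concurrent-ruby; this isn't the Strava proof of concept, just the shape of it, and the Git call is a hypothetical stand-in.

```ruby
require "concurrent"

# Pretend these keys were gathered up by three different batch loaders.
commit_oids = ["abc123", "def456"]
user_ids    = [1, 10, 21]
label_ids   = [7, 8]

# Start every IO-bound batch without waiting on it...
futures = [
  Concurrent::Promises.future { GitBackend.read_commits(commit_oids) }, # hypothetical Git service call
  Concurrent::Promises.future { User.where(id: user_ids).to_a },        # Active Record batch
  Concurrent::Promises.future { Label.where(id: label_ids).to_a },
]

# ...then block once, when all of the results are actually needed.
commits, users, labels = Concurrent::Promises.zip(*futures).value!
```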
00:16:22.420
Another area for improvement would be smoothing out problems developers hit over and over. The classic example is asking for posts and their authors.
00:16:29.700
The usual advice is that if you're careful, you can avoid the N+1 query.
00:16:35.970
Anyone running GraphQL Ruby in production has probably run into that challenge.
00:16:42.220
I'd love to build in batching to make this process easier and develop robust solutions for this.
00:16:49.110
The folks from Shopify are interested in similar objectives, and they've shared numerous shortcuts and techniques.
00:16:56.490
I'm optimistic about working on something solid, so if you're interested, please check out the pull request.
00:17:04.510
Lastly, I wanted to address our team’s recent efforts regarding our API's performance.
00:17:10.180
This is straightforward in most cases; you record how long endpoints take.
00:17:16.600
You can push them to Datadog to monitor endpoint performance.
00:17:24.020
However, with GraphQL, this becomes tricky because there’s only one endpoint: /graphql.
00:17:30.710
There is also tremendous variation in the size of the queries.
00:17:36.000
For example, we do have a Datadog dashboard for GraphQL that looks like this: the 50th percentile is this blue line; the 75th percentile is the purple line.
00:17:43.520
Here you can see the 95th percentile. If this were any other scenario, you'd know that something must be wrong.
00:17:49.840
With GraphQL, however, I don’t believe anything is amiss. The reality is that some GraphQL queries are easy for the server to fulfill.
00:17:56.220
Conversely, others that are legitimate queries are complicated and ask for a lot of data, taking considerable time to respond.
00:18:03.340
How do we identify through this significant variation whether things are running smoothly or if we've introduced a regression?
00:18:09.230
I’ve used a funny technique for a couple of years that has inspired one potential solution.
00:18:16.810
Something you may not know is that part of the way GitHub adopted GraphQL was alongside the Rails view layer.
00:18:23.410
Several views in GitHub run a GraphQL query to fetch their data before rendering.
00:18:30.840
One of those views is the pull request show. Has anyone here ever looked at a pull request on GitHub?
00:18:37.950
It's relatively straightforward, so I use that view as a canary whenever changes are made to how GraphQL works.
00:18:44.780
Since it’s a pretty important view, we can determine if things are operating smoothly.
00:18:51.870
The benefit is that we know the query stays the same. Even so, there's still a significant difference between the 50th and 95th percentiles.
00:18:58.650
But it's substantially better than a 20x spread, and we can use this as a benchmark to check that everything is running smoothly.
00:19:05.320
However, the problem is we don’t know what our API clients are doing; we know the needs of the pull request view.
00:19:12.180
We have a clear way to record that query because we instrument those page views ourselves, but what about our clients?
00:19:19.690
So, I thought: maybe we can figure out what our clients are doing.
00:19:26.160
If you’ve ever worked with a GraphQL client, you'll know many query strings are hard-coded in application code.
00:19:32.830
For instance, if you’re working on the deployment integration with Heroku, you might use a hard-coded query string.
00:19:40.020
Depending on who's deploying and whose repository needs updating, the query string stays the same; only some of the variable values change.
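A sketch of what that kind of client code might look like; the query, field names, and helper are illustrative, not Heroku's actual integration.

```ruby
require "net/http"
require "json"
require "uri"

# The query string is hard-coded in the integration's source.
DEPLOYMENTS_QUERY = <<~GRAPHQL
  query($owner: String!, $name: String!) {
    repository(owner: $owner, name: $name) {
      deployments(last: 5) { nodes { state } }
    }
  }
GRAPHQL

def fetch_deployments(owner:, name:, token:)
  response = Net::HTTP.post(
    URI("https://api.github.com/graphql"),
    { query: DEPLOYMENTS_QUERY, variables: { owner: owner, name: name } }.to_json,
    "Authorization" => "bearer #{token}",
    "Content-Type"  => "application/json"
  )
  JSON.parse(response.body)
end
```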
00:19:47.069
For us, this looks like clients sending us a GraphQL query, receiving the response, and in the background, we enqueue a job.
00:19:54.080
Afterward, we parse that query and do some processing on it. We also keep track of who uses which GraphQL fields.
00:20:00.660
In the case of deprecating something, we can reach out to the appropriate parties.
00:20:07.300
We have a workflow for processing these queries that feeds into a data warehouse.
00:20:14.110
It may not be big data, but in this warehouse, we log the query strings run by clients, with numerous values scrubbed.
00:20:20.390
We track how long it took to execute, who ran it, and what kind of integration or app it corresponds to.
00:20:27.760
This helps us develop an effective signature for the transactions being executed.
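As a hedged sketch of what that background processing might look like (the job, the warehouse writer, and the scrubbing step are all stand-ins, not GitHub's pipeline):

```ruby
require "digest"

# Hypothetical job enqueued after each API response has been sent.
class GraphqlQueryAnalysisJob < ApplicationJob
  def perform(query_string:, duration_ms:, app_id:)
    document = GraphQL.parse(query_string) # re-parse off the request path

    # Fingerprint the normalized query text; a real version would also
    # scrub literal argument values before hashing.
    signature = Digest::SHA256.hexdigest(GraphQL::Language::Printer.new.print(document))

    QueryWarehouse.insert(                 # stand-in for the data warehouse writer
      signature:   signature,
      operation:   document.definitions.first&.name,
      duration_ms: duration_ms,
      app_id:      app_id
    )
  end
end
```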
00:20:34.060
Therefore, we can monitor performance over time, such as for a substantial company whose compliance workflow runs over GraphQL.
00:20:41.500
This allows us to follow how that query performs throughout.
00:20:47.940
I've started analyzing that, and here's what the charts look like: a query that ran about 150 times a day for the past two weeks.
00:20:54.430
We see different components of the timing—CPU time refers to how long we ran the Ruby code.
00:21:01.870
The two largest external services impacting performance are the database and Git filesystem.
00:21:09.390
After monitoring for two weeks, things were stable until a couple of days ago when we noticed a sudden increase in Git RPC response times.
00:21:15.840
The MySQL time remained relatively stable; however, we see occasional spikes, perhaps during deployments or times of resource contention.
00:21:24.980
This change in the last couple of days signals a potential problem worth looking into, without adding alarm fatigue to the mix.
00:21:32.750
Now, whether this interpretation holds true depends on the queries we receive.
00:21:39.260
For example, earlier we queried for a repository's first ten issues and related info.
00:21:46.070
In practice, you can ask for the first 100 repositories, and for each repository, the first 100 issues, and then for each issue, the first 100 labels.
00:21:53.210
The time it takes to run that query depends on the number of repositories. If there’s one, it will run quickly.
00:22:00.250
However, if there are a hundred repositories, each with a hundred issues, and each issue has 100 labels, it will take far longer to complete.
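Just to put numbers on it, assuming the client asks for 100 at every level, the worst case looks roughly like this:

```ruby
repositories = 100
issues       = repositories * 100              # up to 10,000 issues
labels       = issues * 100                    # up to 1,000,000 labels
total_nodes  = repositories + issues + labels  # 1,010,100 objects in one query
```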
00:22:07.100
One major gap in this analysis needs attention next week.
00:22:14.000
I need to account for the response size when measuring how query time varies.
00:22:20.820
I still have to dig into that, but I’m sharing this in hopes of garnering insights from this group.
00:22:27.170
If you’ve figured out ways to assess similar problems using GraphQL, I’d love to hear your thoughts after the discussion.
00:22:34.860
That essentially wraps it up—thank you for listening. I hope you enjoyed it and that your gears are turning!