00:00:00.000
Ready for takeoff.
00:00:16.920
Hello everyone! I hope you had a nice lunch.
00:00:19.140
Thank you for coming to my talk.
00:00:20.939
Today I'm excited! I'm Varun, and I work as a software engineer at Sourcegraph.
00:00:24.420
Today, I'm thrilled to tell you about one of the projects I've been working on recently, which is scip-ruby.
00:00:27.900
It's a precise indexer for Ruby source code, and I will break down what exactly that means.
00:00:31.820
I've come all the way from Taipei, so I'm glad to meet so many wonderful people here.
00:00:34.920
Let's get started! To provide a bit of background, at Sourcegraph, our core product is a SaaS developer tool.
00:00:38.340
It helps you understand and change code at scale.
00:00:40.620
Part of that includes searching, navigating code, making large scale changes, and setting up dashboards for migrations and alerts for code changes.
00:00:44.520
All the fun things! Today, I will focus only on one sub-part: how we get language-specific information into Sourcegraph.
00:00:48.239
This is increasingly becoming foundational for everything else because when you want to make code changes or search code, you need to be aware of the language semantics.
00:00:52.920
Before I talk more, let me show you some pictures of what we are trying to achieve with scip-ruby.
00:00:56.579
Here’s a Sourcegraph screenshot involving one of Shopify's open-source repositories.
00:01:01.420
This repository has been indexed by scip-ruby before I took the screenshot.
00:01:05.159
I hovered over the cursor or ‘sub’ and clicked on 'Find References', similar to an editor.
00:01:08.000
Below, you'll see a reference panel with different results.
00:01:10.140
There are two kinds of results: search-based and precise.
00:01:13.200
Search-based results can sometimes return false positives.
00:01:15.120
For instance, the first search-based result is for the string replacement method ‘sub’, which is not exactly what we were looking for.
00:01:18.300
We were actually looking for the property ‘sub’, and the precise result returns the correct reference.
00:01:20.040
Therefore, we need to do something more clever than mere string searching.
00:01:23.460
We need to be aware of the language semantics, including classes, methods, and properties.
00:01:26.280
We want something like this, where the precise result works, and that's powered by scip-ruby.
00:01:28.040
Here's another example of scip-ruby in action.
00:01:31.260
I'm looking at a code for a gem called Zeitwerk, and I'm trying to find references to the module defined in that gem.
00:01:34.200
The search results show many different repositories, including GitHub repositories.
00:01:38.640
This means you can search across gems, GitHub repositories, GitLab, Perforce, and different sources.
00:01:41.160
Even though the name ‘Zeitwerk’ is unique enough that maybe a string search would suffice, generally you want to capture the semantics of dependencies between gems.
00:01:43.200
You need to understand what is connected to what, especially when considering vulnerabilities and how they affect your code.
00:01:48.300
Essentially, we are building IDE-level navigation in Sourcegraph.
00:01:50.880
This brings us back to the title of my talk, which I'll break down bit by bit.
00:01:53.760
You can think of the indexer part as the component that aggregates all this information across source files.
00:01:56.760
It presents the information in a way that allows you to query it in different ways, including references and definitions.
00:02:00.000
The navigation needs to be fast; if it's slow, users won’t utilize it.
00:02:03.000
At a high level, this is where indexing fits into the pipeline.
00:02:05.880
An indexer will take all the source files and configuration and emit a single file in the script data format.
00:02:09.180
This is an open-source Protocol Buffers schema, and the index gets uploaded into a database.
00:02:13.560
Usually, this happens in a CI pipeline, running with each commit.
00:02:15.120
It can also run as a repeating job inside Sourcegraph itself, so there's no need to set up your CI.
00:02:18.240
When a client, for example, is navigating in the web browser and performing actions like 'Find References' or 'Go to Definition,' they query the database directly.
00:02:20.520
There’s no need for a running language server process in the background.
00:02:24.180
This separation of the analysis phase from the query phase is possible because the code in Sourcegraph is read-only.
00:02:28.500
You are not editing as you go, so we perform the analysis once.
00:02:32.220
Now, the query can be optimized like a typical database query.
00:02:34.920
The other benefit of this is that the database, backend, and client are mostly language-agnostic.
00:02:37.800
If you were using a language server, then the query part also needs to be language-aware.
00:02:41.280
The language-aware element is restricted to the analysis happening during indexing.
00:02:44.520
This gives us a clearer separation of concerns.
00:02:46.520
You can think of it as an ahead-of-time language server instead of something running just-in-time while navigating the code.
00:02:48.000
Additionally, it allows us to avoid running a language server for every client.
00:02:51.090
Clients may be browsing different kinds of code, leading to millions of lines of code.
00:02:53.660
So, what's up with the survey bit?
00:02:55.539
If you're unfamiliar with Sorbet, it's a type checker for Ruby developed by Stripe.
00:02:58.680
It adds support for gradual typing and optional type signatures.
00:03:02.520
This is similar to TypeScript; however, Sorbet is much faster than other production type checkers that I know of.
00:03:04.860
Scip-ruby is built on an open-source version of Sorbet, which is significant because I'll reference some Sorbet internals later in this talk.
00:03:08.760
Performance is crucial for us because customers can have very large codebases.
00:03:12.539
Ideally, we would index every commit so you can navigate without thinking about the branch you’re on.
00:03:15.660
Building scip-ruby on top of Sorbet is beneficial because Sorbet already understands the semantics of Ruby code at a deep level.
00:03:17.700
This allows the index to be precise, similar to how an editor would work.
00:03:20.040
Sorbet also has an LSP, which contributes to our desire for precision.
00:03:22.740
We do not want to merely rely on syntax, heuristics, or machine learning.
00:03:25.320
We don’t compromise on the quality of the final result; it needs to be as good as possible.
00:03:28.920
The cross-repo bit is something I previously mentioned.
00:03:31.680
We want to navigate between gems, GitHub repos, GitLab repos, and Perforce, without being limited to a single repository.
00:03:34.680
This flexibility is crucial because many customers use different codehosts.
00:03:37.680
For every Ruby developer, Sourcegraph is free to use for open source maintainers.
00:03:41.010
You can upload indexes for your open-source repositories, allowing anyone to navigate that code.
00:03:43.140
Otherwise, you can get a license for a single-tenant managed instance or a self-hosted instance.
00:03:46.140
Now, let’s dive into the details.
00:03:48.960
Earlier, I showed you this abstract representation of indexing.
00:03:50.700
Source files go in, and the index comes out.
00:03:52.680
The tricky part is determining what goes in the middle.
00:03:54.840
Since we decided to build on top of Sorbet, we need to identify where exactly the index fits inside Sorbet.
00:03:58.440
Sorbet is a fairly large codebase, so that’s our first step.
00:04:01.260
This is a simplified view of Sorbet's internal pipeline.
00:04:05.220
We start with the source code at the top left and end up with a control flow graph at the bottom right.
00:04:08.400
There are various steps in between, like tree representations.
00:04:10.920
If you've attended the talks by Carla or by winning stock on the first day, you've seen what the price tree or abstract syntax tree looks like.
00:04:14.160
In our case, those representations aren't as important.
00:04:17.760
The main focus is having access to type information to display hover documentation.
00:04:21.360
You'll notice that type checking and inference occurs on the control flow graph.
00:04:24.060
We want to emit the index after type checking has completed.
00:04:26.520
That sounds reasonable, right?
00:04:29.640
So the next question is: What does the control flow graph look like?
00:04:31.800
For example, here's a small Ruby code snippet.
00:04:33.480
It's a fragment of a fizzbuzz function where you perform a modulus operation called plus equals if that operation succeeds.
00:04:37.020
Let’s examine what the control flow graph looks like for this code snippet.
00:04:39.840
First, you'll notice that this flat piece of code has been structured into a graph.
00:04:42.240
There are different blocks, known as basic blocks in compiler jargon.
00:04:45.120
The arrows denote control flow. For instance, the first block has an if statement at the very end.
00:04:47.520
The edges indicate what happens if the if condition is true or false.
00:04:50.760
Basic blocks operate like small functions within a function, with control flow being external to the basic block.
00:04:54.120
Within each line inside a basic block, which in Sorbet is an instruction, they largely resemble Ruby code.
00:04:57.360
It can be more verbose, but if you understand Ruby, you can comprehend the control flow graph.
00:05:01.800
The index needs to describe the source code accurately.
00:05:03.720
The control flow graph is an implementation detail, but we must understand the relationship between the source code and the control flow graph.
00:05:06.900
To emit the index correctly, let’s look at the expression inside the if statement.
00:05:08.940
That expression, ‘i % 3 == 0’, will be translated into four different instructions.
00:05:12.000
The literal assignments become temporary variables.
00:05:14.520
The modulus operation and the equals operator become method calls.
00:05:16.200
If you've implemented operator overloading, you've seen that these methods are represented explicitly in the control flow graph.
00:05:19.200
Furthermore, the overall logic must remain the same.
00:05:21.180
In compiler jargon, you could even say the translation is semantics-preserving.
00:05:23.400
The result of the modulus operation, represented as ‘temp1’, is used as the receiver for the equals operator.
00:05:27.120
This reality must be true in both the source code and the control flow graph.
00:05:30.180
Using temporary values simplifies processing later.
00:05:32.520
Temporary values ensure that every method receiver and argument is a variable.
00:05:34.680
We won’t need to traverse a tree structure, perform recursion, or use a visitor pattern.
00:05:37.020
This flattening makes later processing much easier.
00:05:39.360
Additionally, control flow is made explicit in the edges between basic blocks.
00:05:42.240
The edges point from one basic block to another based on execution.
00:05:45.420
Regardless of whether the if statement executes, the remaining portions of the function must still run.
00:05:47.220
Hence, the fall-through edge from basic block one to basic block two ensures basic block two is executed.
00:05:49.680
There are no edges entering the middle of a basic block; control flow is defined by the edges.
00:05:52.140
We also need to consider the body of the if statement.
00:05:55.860
The example used is the ‘out += x’ statement.
00:05:57.780
The literal is assigned to a temporary variable, which calls the ‘plus equals’ method.
00:06:00.360
From a practical perspective, it resembles calling the ‘plus equals’ operator on ‘out’.
00:06:02.460
Thus, the same patterns hold where the control flow graph simplifies the control flow structure.
00:06:06.300
The control flow graph consists of around 17 different instruction types, while a parse tree might include around 90 nodes.
00:06:09.120
Despite Ruby’s flexible syntax, the control flow graph is simple.
00:06:12.060
Now, how do we emit an index for this structure?
00:06:15.300
We can break this down into two main aspects: handling the nodes in the graph and the edges.
00:06:17.700
For the edges, we first traverse the graph in topological order.
00:06:20.640
This means we explore the graph from top to bottom.
00:06:22.140
Definitions must come before usages.
00:06:24.240
Next, let's consider the handling of nodes.
00:06:26.640
Let’s examine the first basic block for simplicity.
00:06:29.520
In this basic block, we have two parameters: ‘$i’ and ‘$out’.
00:06:33.480
We record these as parameters for the function.
00:06:36.420
Next, we may either maintain arrays for definitions and references or use hashes.
00:06:39.540
We also need to record source locations for references to the original source code.
00:06:42.420
This is how we keep track of where everything came from.
00:06:44.760
In this case, ‘$i’ and ‘$out’ are named variables, not temporaries. So, we need to record them.
00:06:48.090
We then iterate over each instruction and check whether it's a named variable or a temporary.
00:06:51.060
For example, ‘temp0 = 3’ makes no emission because it’s just a temporary.
00:06:53.640
Next, we look for the receiver ‘$i’, which is a named variable, hence we add it to the references.
00:06:57.060
The modulus operator is an operator, so we also generate a reference for that.
00:07:00.420
Essentially, we repeat this process while checking for method calls, looking at receivers, arguments, and the method.
00:07:03.720
The left-hand side of an assignment is usually comprised of temporaries.
00:07:06.900
We also handle cases where we encounter an equals sign, which prompts us to emit another reference.
00:07:09.720
Overall, even though Ruby code can seem complicated, the navigation tool simplifies it down to straightforward loops and functions.
00:07:13.560
A key takeaway for everyone, especially those new to compilers, is that compiler tasks often revolve around finding the right abstraction.
00:07:16.920
In my experience, the core indexer I worked on over the past few months is about 2,000 lines of code and 1,500 lines of test inputs.
00:07:20.520
Conceptually, the implementation is mainly a pure function storing data into a hash table.
00:07:23.520
Once you grasp the core abstractions, it’s manageable.
00:07:25.380
That’s primarily what the indexer does; we've effectively traversed every piece of Ruby code.
00:07:28.560
Almost.
00:07:30.360
We need some clever hacks to ensure it functions as intended.
00:07:33.420
Let me describe one of those hacks.
00:07:35.160
Recall the Sorbet pipeline we discussed previously, which included various representations.
00:07:38.280
Ultimately, we chose to work with control flow graphs where type checking and inference occur.
00:07:41.700
In practice, however, some files don't undergo type checking.
00:07:45.060
Sorbet has a concept of strictness levels.
00:07:48.000
If you don’t include a magic comment at the top of a file that says ‘typed: true’, it won't even attempt to type check the file.
00:07:51.420
The challenge is that some customers aren’t using Sorbet at all, while others only use it partially.
00:07:54.120
However, we want code navigation to work for everyone.
00:07:57.060
How do we make that happen? The solution isn't to create a separate indexer.
00:08:00.960
After considering options, we decided on a simpler approach.
00:08:02.880
This involved adding a single line to the code that could conditionally force Sorbet to type check the file.
00:08:05.520
Even in cases where we specify ‘typed: false’, we still force type checking to happen.
00:08:08.880
This allows us to generate a control flow graph, which works surprisingly well.
00:08:12.000
Sorbet tolerates many errors in code despite generating thousands of them.
00:08:15.000
We can suppress irrelevant errors because navigation is our primary focus.
00:08:17.520
That’s how the indexer works—using standard Sorbet with a few hacks.
00:08:20.520
Now, let me discuss how we test the indexer.
00:08:22.500
Essentially, we rely on what are commonly called expect tests or snapshot tests.
00:08:25.680
We serialize the index into a human-readable format, annotating source code with comments.
00:08:29.520
This annotation shows the source ranges for different definitions and references.
00:08:32.520
This makes it easier to determine if a patch added the correct definition or skipped any essential items.
00:08:36.180
The focused approach to testing enables fast, predictable, and comprehensible outcomes.
00:08:39.780
That's most of what I wanted to share.
00:08:42.300
Scip-ruby is open source and available on GitHub if you’re an open source maintainer.
00:08:45.240
Existing Sourcegraph customers and those interested in potentially using Sourcegraph can find instructions in the scip-ruby README.
00:08:48.480
I’ve tried to make the configuration as simple as possible; it should generally work with just a few minutes of setup.
00:08:53.520
You should be able to upload an index to Sourcegraph.com for open-source code and start navigating your code.
00:08:56.820
There are example repositories on Sourcegraph.com, including a Shopify repository.
00:09:01.560
The Shopify repository is an example of a project that is 100% Sorbet-compliant.
00:09:05.820
While it might have some bugs, it showcases what you can achieve.
00:09:09.300
Another example is the Homebrew repository, which is about 30-40% adopted with Sorbet.
00:09:12.180
You will still be able to navigate code, including files not designated for type checking.
00:09:15.840
This is possible due to the hack I mentioned earlier.
00:09:18.780
There are multiple ways to provide feedback.
00:09:22.140
The QR code there links to our Discord, and there’s a scip-ruby channel.
00:09:24.000
Alternatively, you can provide feedback through GitHub issues.
00:09:27.360
If you have questions, feel free to open a GitHub issue as well.
00:09:29.640
Thank you! If you have questions for me now, we have a bit of time.
00:09:32.640
I can also take them later.
00:09:34.140
Thank you!
00:09:35.640
[Foreign]