Sorbet

scip-ruby - A Ruby indexer built with Sorbet

scip-ruby - A Ruby indexer built with Sorbet

by Varun Gandhi

In this talk at RubyConf 2022, Varun Gandhi introduces scip-ruby, an open-source indexer designed to enhance Ruby code navigation within Sourcegraph. The primary aim of scip-ruby is to provide precise language-specific information that improves code understanding and facilitates large-scale changes. The presentation outlines several key aspects of scip-ruby and its integration with Sorbet, a Ruby typechecker, highlighting the following points:

  • Purpose and Functionality: scip-ruby serves as an indexer that aggregates information across Ruby source files, enabling features like 'Go to definition' and 'Find usages' with a focus on precise results rather than simple string matching. Varun demonstrates its utility using examples from actual code repositories, emphasizing its ability to navigate across various sources such as GitHub and GitLab.

  • Indexer and Query Separation: The architecture of scip-ruby separates the indexing process from the querying process, which allows for efficiency and speed in navigation. The indexer compiles information into a single file format, which is then uploaded to a database, enabling instant queries without a running language server.

  • Integration with Sorbet: scip-ruby builds on Sorbet's capabilities, allowing for in-depth type checking and accurate code analysis. Varun discusses the speed benefits of Sorbet and how it aids in producing a precise index while adhering to strict type-checking principles.

  • Control Flow Graphs: The talk covers the significance of control flow graphs in the indexing process, explaining how Ruby code is translated into a graph structure to preserve semantics during analysis.

  • Challenges and Hacks: Varun addresses challenges in handling files that may not be fully compliant with type checking in Sorbet. He describes implementing a workaround that forces type checks on all files to ensure consistent index generation.

  • Testing Approach: The indexer is tested using snapshot tests that serialize the index in a human-readable format, allowing for effective validation of outputs and easy identification of errors in new patches.

In conclusion, scip-ruby stands as an innovative solution for Ruby developers aiming to navigate code efficiently. By combining the strengths of Sourcegraph and Sorbet, it offers robust support for open-source maintainers and is accessible for usage with minimal setup. Varun invites attendees to explore scip-ruby on GitHub and participate in the development via feedback channels.

00:00:00.000 Ready for takeoff.
00:00:16.920 Hello everyone! I hope you had a nice lunch.
00:00:19.140 Thank you for coming to my talk.
00:00:20.939 Today I'm excited! I'm Varun, and I work as a software engineer at Sourcegraph.
00:00:24.420 Today, I'm thrilled to tell you about one of the projects I've been working on recently, which is scip-ruby.
00:00:27.900 It's a precise indexer for Ruby source code, and I will break down what exactly that means.
00:00:31.820 I've come all the way from Taipei, so I'm glad to meet so many wonderful people here.
00:00:34.920 Let's get started! To provide a bit of background, at Sourcegraph, our core product is a SaaS developer tool.
00:00:38.340 It helps you understand and change code at scale.
00:00:40.620 Part of that includes searching, navigating code, making large scale changes, and setting up dashboards for migrations and alerts for code changes.
00:00:44.520 All the fun things! Today, I will focus only on one sub-part: how we get language-specific information into Sourcegraph.
00:00:48.239 This is increasingly becoming foundational for everything else because when you want to make code changes or search code, you need to be aware of the language semantics.
00:00:52.920 Before I talk more, let me show you some pictures of what we are trying to achieve with scip-ruby.
00:00:56.579 Here’s a Sourcegraph screenshot involving one of Shopify's open-source repositories.
00:01:01.420 This repository has been indexed by scip-ruby before I took the screenshot.
00:01:05.159 I hovered over the cursor or ‘sub’ and clicked on 'Find References', similar to an editor.
00:01:08.000 Below, you'll see a reference panel with different results.
00:01:10.140 There are two kinds of results: search-based and precise.
00:01:13.200 Search-based results can sometimes return false positives.
00:01:15.120 For instance, the first search-based result is for the string replacement method ‘sub’, which is not exactly what we were looking for.
00:01:18.300 We were actually looking for the property ‘sub’, and the precise result returns the correct reference.
00:01:20.040 Therefore, we need to do something more clever than mere string searching.
00:01:23.460 We need to be aware of the language semantics, including classes, methods, and properties.
00:01:26.280 We want something like this, where the precise result works, and that's powered by scip-ruby.
00:01:28.040 Here's another example of scip-ruby in action.
00:01:31.260 I'm looking at a code for a gem called Zeitwerk, and I'm trying to find references to the module defined in that gem.
00:01:34.200 The search results show many different repositories, including GitHub repositories.
00:01:38.640 This means you can search across gems, GitHub repositories, GitLab, Perforce, and different sources.
00:01:41.160 Even though the name ‘Zeitwerk’ is unique enough that maybe a string search would suffice, generally you want to capture the semantics of dependencies between gems.
00:01:43.200 You need to understand what is connected to what, especially when considering vulnerabilities and how they affect your code.
00:01:48.300 Essentially, we are building IDE-level navigation in Sourcegraph.
00:01:50.880 This brings us back to the title of my talk, which I'll break down bit by bit.
00:01:53.760 You can think of the indexer part as the component that aggregates all this information across source files.
00:01:56.760 It presents the information in a way that allows you to query it in different ways, including references and definitions.
00:02:00.000 The navigation needs to be fast; if it's slow, users won’t utilize it.
00:02:03.000 At a high level, this is where indexing fits into the pipeline.
00:02:05.880 An indexer will take all the source files and configuration and emit a single file in the script data format.
00:02:09.180 This is an open-source Protocol Buffers schema, and the index gets uploaded into a database.
00:02:13.560 Usually, this happens in a CI pipeline, running with each commit.
00:02:15.120 It can also run as a repeating job inside Sourcegraph itself, so there's no need to set up your CI.
00:02:18.240 When a client, for example, is navigating in the web browser and performing actions like 'Find References' or 'Go to Definition,' they query the database directly.
00:02:20.520 There’s no need for a running language server process in the background.
00:02:24.180 This separation of the analysis phase from the query phase is possible because the code in Sourcegraph is read-only.
00:02:28.500 You are not editing as you go, so we perform the analysis once.
00:02:32.220 Now, the query can be optimized like a typical database query.
00:02:34.920 The other benefit of this is that the database, backend, and client are mostly language-agnostic.
00:02:37.800 If you were using a language server, then the query part also needs to be language-aware.
00:02:41.280 The language-aware element is restricted to the analysis happening during indexing.
00:02:44.520 This gives us a clearer separation of concerns.
00:02:46.520 You can think of it as an ahead-of-time language server instead of something running just-in-time while navigating the code.
00:02:48.000 Additionally, it allows us to avoid running a language server for every client.
00:02:51.090 Clients may be browsing different kinds of code, leading to millions of lines of code.
00:02:53.660 So, what's up with the survey bit?
00:02:55.539 If you're unfamiliar with Sorbet, it's a type checker for Ruby developed by Stripe.
00:02:58.680 It adds support for gradual typing and optional type signatures.
00:03:02.520 This is similar to TypeScript; however, Sorbet is much faster than other production type checkers that I know of.
00:03:04.860 Scip-ruby is built on an open-source version of Sorbet, which is significant because I'll reference some Sorbet internals later in this talk.
00:03:08.760 Performance is crucial for us because customers can have very large codebases.
00:03:12.539 Ideally, we would index every commit so you can navigate without thinking about the branch you’re on.
00:03:15.660 Building scip-ruby on top of Sorbet is beneficial because Sorbet already understands the semantics of Ruby code at a deep level.
00:03:17.700 This allows the index to be precise, similar to how an editor would work.
00:03:20.040 Sorbet also has an LSP, which contributes to our desire for precision.
00:03:22.740 We do not want to merely rely on syntax, heuristics, or machine learning.
00:03:25.320 We don’t compromise on the quality of the final result; it needs to be as good as possible.
00:03:28.920 The cross-repo bit is something I previously mentioned.
00:03:31.680 We want to navigate between gems, GitHub repos, GitLab repos, and Perforce, without being limited to a single repository.
00:03:34.680 This flexibility is crucial because many customers use different codehosts.
00:03:37.680 For every Ruby developer, Sourcegraph is free to use for open source maintainers.
00:03:41.010 You can upload indexes for your open-source repositories, allowing anyone to navigate that code.
00:03:43.140 Otherwise, you can get a license for a single-tenant managed instance or a self-hosted instance.
00:03:46.140 Now, let’s dive into the details.
00:03:48.960 Earlier, I showed you this abstract representation of indexing.
00:03:50.700 Source files go in, and the index comes out.
00:03:52.680 The tricky part is determining what goes in the middle.
00:03:54.840 Since we decided to build on top of Sorbet, we need to identify where exactly the index fits inside Sorbet.
00:03:58.440 Sorbet is a fairly large codebase, so that’s our first step.
00:04:01.260 This is a simplified view of Sorbet's internal pipeline.
00:04:05.220 We start with the source code at the top left and end up with a control flow graph at the bottom right.
00:04:08.400 There are various steps in between, like tree representations.
00:04:10.920 If you've attended the talks by Carla or by winning stock on the first day, you've seen what the price tree or abstract syntax tree looks like.
00:04:14.160 In our case, those representations aren't as important.
00:04:17.760 The main focus is having access to type information to display hover documentation.
00:04:21.360 You'll notice that type checking and inference occurs on the control flow graph.
00:04:24.060 We want to emit the index after type checking has completed.
00:04:26.520 That sounds reasonable, right?
00:04:29.640 So the next question is: What does the control flow graph look like?
00:04:31.800 For example, here's a small Ruby code snippet.
00:04:33.480 It's a fragment of a fizzbuzz function where you perform a modulus operation called plus equals if that operation succeeds.
00:04:37.020 Let’s examine what the control flow graph looks like for this code snippet.
00:04:39.840 First, you'll notice that this flat piece of code has been structured into a graph.
00:04:42.240 There are different blocks, known as basic blocks in compiler jargon.
00:04:45.120 The arrows denote control flow. For instance, the first block has an if statement at the very end.
00:04:47.520 The edges indicate what happens if the if condition is true or false.
00:04:50.760 Basic blocks operate like small functions within a function, with control flow being external to the basic block.
00:04:54.120 Within each line inside a basic block, which in Sorbet is an instruction, they largely resemble Ruby code.
00:04:57.360 It can be more verbose, but if you understand Ruby, you can comprehend the control flow graph.
00:05:01.800 The index needs to describe the source code accurately.
00:05:03.720 The control flow graph is an implementation detail, but we must understand the relationship between the source code and the control flow graph.
00:05:06.900 To emit the index correctly, let’s look at the expression inside the if statement.
00:05:08.940 That expression, ‘i % 3 == 0’, will be translated into four different instructions.
00:05:12.000 The literal assignments become temporary variables.
00:05:14.520 The modulus operation and the equals operator become method calls.
00:05:16.200 If you've implemented operator overloading, you've seen that these methods are represented explicitly in the control flow graph.
00:05:19.200 Furthermore, the overall logic must remain the same.
00:05:21.180 In compiler jargon, you could even say the translation is semantics-preserving.
00:05:23.400 The result of the modulus operation, represented as ‘temp1’, is used as the receiver for the equals operator.
00:05:27.120 This reality must be true in both the source code and the control flow graph.
00:05:30.180 Using temporary values simplifies processing later.
00:05:32.520 Temporary values ensure that every method receiver and argument is a variable.
00:05:34.680 We won’t need to traverse a tree structure, perform recursion, or use a visitor pattern.
00:05:37.020 This flattening makes later processing much easier.
00:05:39.360 Additionally, control flow is made explicit in the edges between basic blocks.
00:05:42.240 The edges point from one basic block to another based on execution.
00:05:45.420 Regardless of whether the if statement executes, the remaining portions of the function must still run.
00:05:47.220 Hence, the fall-through edge from basic block one to basic block two ensures basic block two is executed.
00:05:49.680 There are no edges entering the middle of a basic block; control flow is defined by the edges.
00:05:52.140 We also need to consider the body of the if statement.
00:05:55.860 The example used is the ‘out += x’ statement.
00:05:57.780 The literal is assigned to a temporary variable, which calls the ‘plus equals’ method.
00:06:00.360 From a practical perspective, it resembles calling the ‘plus equals’ operator on ‘out’.
00:06:02.460 Thus, the same patterns hold where the control flow graph simplifies the control flow structure.
00:06:06.300 The control flow graph consists of around 17 different instruction types, while a parse tree might include around 90 nodes.
00:06:09.120 Despite Ruby’s flexible syntax, the control flow graph is simple.
00:06:12.060 Now, how do we emit an index for this structure?
00:06:15.300 We can break this down into two main aspects: handling the nodes in the graph and the edges.
00:06:17.700 For the edges, we first traverse the graph in topological order.
00:06:20.640 This means we explore the graph from top to bottom.
00:06:22.140 Definitions must come before usages.
00:06:24.240 Next, let's consider the handling of nodes.
00:06:26.640 Let’s examine the first basic block for simplicity.
00:06:29.520 In this basic block, we have two parameters: ‘$i’ and ‘$out’.
00:06:33.480 We record these as parameters for the function.
00:06:36.420 Next, we may either maintain arrays for definitions and references or use hashes.
00:06:39.540 We also need to record source locations for references to the original source code.
00:06:42.420 This is how we keep track of where everything came from.
00:06:44.760 In this case, ‘$i’ and ‘$out’ are named variables, not temporaries. So, we need to record them.
00:06:48.090 We then iterate over each instruction and check whether it's a named variable or a temporary.
00:06:51.060 For example, ‘temp0 = 3’ makes no emission because it’s just a temporary.
00:06:53.640 Next, we look for the receiver ‘$i’, which is a named variable, hence we add it to the references.
00:06:57.060 The modulus operator is an operator, so we also generate a reference for that.
00:07:00.420 Essentially, we repeat this process while checking for method calls, looking at receivers, arguments, and the method.
00:07:03.720 The left-hand side of an assignment is usually comprised of temporaries.
00:07:06.900 We also handle cases where we encounter an equals sign, which prompts us to emit another reference.
00:07:09.720 Overall, even though Ruby code can seem complicated, the navigation tool simplifies it down to straightforward loops and functions.
00:07:13.560 A key takeaway for everyone, especially those new to compilers, is that compiler tasks often revolve around finding the right abstraction.
00:07:16.920 In my experience, the core indexer I worked on over the past few months is about 2,000 lines of code and 1,500 lines of test inputs.
00:07:20.520 Conceptually, the implementation is mainly a pure function storing data into a hash table.
00:07:23.520 Once you grasp the core abstractions, it’s manageable.
00:07:25.380 That’s primarily what the indexer does; we've effectively traversed every piece of Ruby code.
00:07:28.560 Almost.
00:07:30.360 We need some clever hacks to ensure it functions as intended.
00:07:33.420 Let me describe one of those hacks.
00:07:35.160 Recall the Sorbet pipeline we discussed previously, which included various representations.
00:07:38.280 Ultimately, we chose to work with control flow graphs where type checking and inference occur.
00:07:41.700 In practice, however, some files don't undergo type checking.
00:07:45.060 Sorbet has a concept of strictness levels.
00:07:48.000 If you don’t include a magic comment at the top of a file that says ‘typed: true’, it won't even attempt to type check the file.
00:07:51.420 The challenge is that some customers aren’t using Sorbet at all, while others only use it partially.
00:07:54.120 However, we want code navigation to work for everyone.
00:07:57.060 How do we make that happen? The solution isn't to create a separate indexer.
00:08:00.960 After considering options, we decided on a simpler approach.
00:08:02.880 This involved adding a single line to the code that could conditionally force Sorbet to type check the file.
00:08:05.520 Even in cases where we specify ‘typed: false’, we still force type checking to happen.
00:08:08.880 This allows us to generate a control flow graph, which works surprisingly well.
00:08:12.000 Sorbet tolerates many errors in code despite generating thousands of them.
00:08:15.000 We can suppress irrelevant errors because navigation is our primary focus.
00:08:17.520 That’s how the indexer works—using standard Sorbet with a few hacks.
00:08:20.520 Now, let me discuss how we test the indexer.
00:08:22.500 Essentially, we rely on what are commonly called expect tests or snapshot tests.
00:08:25.680 We serialize the index into a human-readable format, annotating source code with comments.
00:08:29.520 This annotation shows the source ranges for different definitions and references.
00:08:32.520 This makes it easier to determine if a patch added the correct definition or skipped any essential items.
00:08:36.180 The focused approach to testing enables fast, predictable, and comprehensible outcomes.
00:08:39.780 That's most of what I wanted to share.
00:08:42.300 Scip-ruby is open source and available on GitHub if you’re an open source maintainer.
00:08:45.240 Existing Sourcegraph customers and those interested in potentially using Sourcegraph can find instructions in the scip-ruby README.
00:08:48.480 I’ve tried to make the configuration as simple as possible; it should generally work with just a few minutes of setup.
00:08:53.520 You should be able to upload an index to Sourcegraph.com for open-source code and start navigating your code.
00:08:56.820 There are example repositories on Sourcegraph.com, including a Shopify repository.
00:09:01.560 The Shopify repository is an example of a project that is 100% Sorbet-compliant.
00:09:05.820 While it might have some bugs, it showcases what you can achieve.
00:09:09.300 Another example is the Homebrew repository, which is about 30-40% adopted with Sorbet.
00:09:12.180 You will still be able to navigate code, including files not designated for type checking.
00:09:15.840 This is possible due to the hack I mentioned earlier.
00:09:18.780 There are multiple ways to provide feedback.
00:09:22.140 The QR code there links to our Discord, and there’s a scip-ruby channel.
00:09:24.000 Alternatively, you can provide feedback through GitHub issues.
00:09:27.360 If you have questions, feel free to open a GitHub issue as well.
00:09:29.640 Thank you! If you have questions for me now, we have a bit of time.
00:09:32.640 I can also take them later.
00:09:34.140 Thank you!
00:09:35.640 [Foreign]