Analyzing an analyzer - A dive into how RuboCop works

by Kyle d'Oliveira

In the talk 'Analyzing an Analyzer: A Dive into How RuboCop Works', Kyle d'Oliveira provides an insightful overview of RuboCop, a widely used Ruby linter designed to analyze code for styling, security, and error-checking. d'Oliveira shares his extensive experience with Ruby and Rails, emphasizing the importance of tools like RuboCop in promoting best practices within the Ruby community.

Key Points Discussed:
- Introduction to RuboCop: d'Oliveira introduces RuboCop, explaining its popularity and functionality as a static code analysis tool, beneficial for linting and enforcing coding standards in Ruby.
- Historical Context: The speaker reflects on his personal journey with RuboCop, describing the initial challenges faced in adopting it for code analysis and style compliance, and how its autocorrect feature alleviated some frustrations.
- Functionality Overview: The mechanics of RuboCop are explored, focusing on the command-line interface, configuration loading via YAML files, and the critical process of analyzing code by creating an abstract syntax tree (AST) of the code that allows RuboCop to systematically identify offenses.
- Example of Code Analysis: One example highlighted is the ArrayJoin cop, which flags use of the star method (*) to join array values. d'Oliveira delves into how regular expressions are inadequate for this task, advocating for the AST approach.
- AST and Traversal: The talk further breaks down how RuboCop traverses the AST, applying specific cops which are rules that identify violations within the code structure. The correlation between different node types and their evaluations is articulated well, reinforcing the systematic approach employed by RuboCop.
- Autocorrect Mechanism: Finally, d'Oliveira explains how errors flagged by RuboCop are corrected using the TreeRewriter class, which facilitates seamless modifications to the source code while preserving its structural integrity.

Conclusions and Takeaways:
- The speaker emphasizes the robustness of RuboCop and its evolving nature within the Ruby landscape. d'Oliveira encourages attendees to leverage their new understanding of RuboCop's internal workings for potential contributions to the tool or enhancements for their projects. The talk sets a foundation for using RuboCop effectively, inspiring developers to automate their styling and linting processes and improve code quality in collaborative settings.
Overall, d'Oliveira's talk serves as a valuable resource for Ruby developers looking to enhance their code analysis practices with RuboCop.

00:00:00.000 Ready for takeoff.

00:00:16.920 Hello everybody and welcome to my talk.

00:00:20.039 Today, I'll be analyzing an analyzer: a dive into how RuboCop works.

00:00:23.039 RuboCop is quite complex, and I don't think we can do an exhaustive study on it, so this will be more of a regular dive, not a deep dive.

00:00:26.640 My name is Kyle d'Oliveira, and I'm based out of Vancouver, Canada.

00:00:30.960 I've been working with Ruby and Rails for well over a decade now.

00:00:33.840 I love the language, I love the community, and I enjoy spaces like this where we can interact with each other and get a sense of what the Ruby community feels like.

00:00:39.600 I'm particularly drawn to tools that can benefit the entire community, and RuboCop is no exception to that. My hope is that after listening to this talk, you'll understand some of the basics about how RuboCop can analyze and correct code.

00:00:48.719 Maybe some of you will feel inspired to contribute custom rules for yourselves, your organization, or even to the open source repository, or to start playing around with new tools that utilize similar ideas or concepts.

00:00:52.860 I've been working at Aha! for the past two years, and it is one of the best workplaces I've been a part of. We are a human-centric company that helps other companies build the products that matter to them with our suite of products.

00:01:01.440 We have an amazing team that is distributed, by design, all over the world. We have one of the best company cultures I have seen, powered by The Responsive Method, which gives us a framework of shared values that we all agree upon and embody.

00:01:05.400 It really helps empower us so we can move quickly and stay aligned. So, if you'd like to be part of that culture, of course, we're hiring.

00:01:08.880 Linters are static code analysis tools that can flag programming errors, suspicious constructs, and stylistic errors. But they can do more, too; they can alert around security issues, and they can be used as a tool for training other engineers.

00:01:15.420 RuboCop is one of the most popular linters for Ruby. I’m just curious: by a quick show of hands, has anyone here not worked with RuboCop before?

00:01:19.200 I don’t think I see a single hand, which is about what I expected.

00:01:23.820 Rails is one of those gems that is very closely tied to Ruby, so much so that if you say you're working with Ruby, people often just assume you're working with Rails.

00:01:30.960 To put things in perspective, Rails has been downloaded about 387 million times from RubyGems and is about the 40th most popular gem.

00:01:37.680 RuboCop has been downloaded about 270 million times and is about the 76th in popularity. Although it is not as popular as Rails, RuboCop is significant enough that when you talk about Ruby, you're also thinking about RuboCop.

00:01:45.360 The original creator of RuboCop gave a talk about this in 2018 at RubyKaigi, so after this talk, if you're really curious about learning more about RuboCop, this is a great resource.

00:01:50.040 I wanted to begin today's talk with a little personal story.

00:01:53.260 I first got into using RuboCop several years ago at a time when we were trying to establish an agreed-upon style for all the code we were writing.

00:01:59.580 However, we often ended up with pull requests full of nitpicky comments.

00:02:01.320 These comments wouldn’t address the content of the pull request but would focus on how the code looked.

00:02:06.180 Sometimes these critiques were valid, but other times, we would get into long, pointless debates over trivial matters.

00:02:11.760 Debates would arise over the use of single quotes versus double quotes, the maximum line length, or whether we needed an extra line at the end of a guard clause.

00:02:16.200 We thought that if RuboCop could handle our linting and styling for us, we could focus all the comments on the actual content of the pull requests, which helped quite a bit.

00:02:21.180 However, it was also incredibly frustrating to work with. We'd write code, push it up to CI, and it would get rejected because RuboCop flagged violations.

00:02:25.860 We would then go back, fix those issues, and do it all over again. I had a project in which I needed to namespace a large number of constant references.

00:02:31.259 As a result, all the lines became longer, and the maximum line length rule became the bane of my existence. I had to hunt down every line that exceeded the length and figure out how to break it up into pieces, which was quite tedious.

00:02:37.020 Many engineers shared a similar experience, and the message of dissatisfaction with how RuboCop was rolled out became mixed with expressions of disdain for RuboCop itself.

00:02:42.180 However, we eventually leaned into RuboCop and learned how to use its autocorrect feature, and much of the frustration began to fade away. We discovered we could have RuboCop fix our code automatically when it detected violations.

00:02:51.420 Most of the time, those changes were spot on, so we continued to explore this further.

00:02:55.380 We started writing our own custom cops to help us transition from bad patterns to good ones. We also developed some to keep deprecations under control as we were upgrading Rails.

00:02:59.880 We utilized RuboCop's error messages as a way to explain concepts related to our internal documentation and to justify various decisions being made.

00:03:06.600 I gave a talk at RailsConf 2020 illustrating how RuboCop can be used to communicate information about bad patterns in code.

00:03:10.320 That is another resource you can look up later if you're interested in learning more.

00:03:14.040 This year, RuboCop turns ten, and in the open source world, this is quite a big deal.

00:03:18.000 Development of RuboCop has remained consistent over the years, and given where we are now, I don't see RuboCop going away anytime soon.

00:03:22.800 However, it is large enough that understanding how RuboCop works can be quite tricky; there are thousands of commits, thousands of closed issues, and hundreds of contributors and releases.

00:03:29.880 This is not going to be a code review of RuboCop.

00:03:31.620 I don’t think I could cover it thoroughly, and it would likely be overwhelming.

00:03:34.440 With the history of RuboCop being somewhat of a marathon, we have just 30 minutes here, which feels like more of a sprint.

00:03:38.520 Instead, this talk will focus on how the basics work. I'll outline the basics and help illustrate how some of the processes function within RuboCop.

00:03:41.880 The goal is to stay as close to how RuboCop works as possible, although I may simplify some points for the sake of clarity.

00:03:44.230 So, let's dig into this! How does RuboCop work? A good way to dive into the details is by looking at the command line interface.

00:03:47.820 Let's see how it operates for a single file and a specific cop.

00:03:52.560 When you run the rubocop command, it first runs an executable file, which loads the RuboCop library and starts it.

00:03:56.699 Then it loads some configuration files. These steps are the easy part; I’ll touch on them for completeness.

00:03:59.340 The real substance comes when RuboCop needs to process the file—this involves taking the existing file and performing actions to make decisions about the code inside it.

00:04:04.560 Once it has processed the code, it will run through a series of cops that will determine whether any offenses exist.

00:04:08.700 If there are any violations, it can write or rewrite the source code and change the file accordingly.

00:04:14.040 It will then loop back to finish making any adjustments and may start the process over again.

00:04:19.200 This loop is crucial, as multiple different cops can adjust the same line of code, sometimes requiring multiple passes.

00:04:22.920 Also, be cautious of infinite loops! We'll break down this process into clearer steps.
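
The inspect-and-correct loop described above can be sketched roughly as follows. This is a minimal illustration, not RuboCop's implementation: the "cops" are plain lambdas that return corrected source, and `MAX_PASSES`, `TRAILING_SPACE`, `DOUBLE_BLANK`, and `autocorrect` are all hypothetical names chosen for this example.

```ruby
# Hypothetical sketch of a correct-and-reinspect loop; none of these names are RuboCop's.
MAX_PASSES = 10 # assumed safety limit for this sketch

# Each stand-in "cop" takes source text and returns (possibly) corrected text.
TRAILING_SPACE = ->(src) { src.gsub(/[ \t]+$/, "") }
DOUBLE_BLANK   = ->(src) { src.gsub(/\n{3,}/, "\n\n") }

def autocorrect(source, cops)
  seen = [source]
  MAX_PASSES.times do
    # One pass: let every cop correct the output of the previous cop.
    corrected = cops.reduce(source) { |src, cop| cop.call(src) }
    return corrected if corrected == source # nothing changed, so we are done

    # If a pass reproduces an earlier state, the cops are undoing each other.
    raise "infinite correction loop detected" if seen.include?(corrected)

    seen << corrected
    source = corrected
  end
  raise "gave up after #{MAX_PASSES} passes"
end

# The second cop's fix creates work for the first one, so a second pass is needed.
puts autocorrect("a\n \n\nb\n", [DOUBLE_BLANK, TRAILING_SPACE]).inspect
```

Remembering earlier states is one simple way to notice two corrections fighting each other; a real tool may detect this differently.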

00:04:28.920 The first element in this entire process is the command line interface.

00:04:31.680 This part is straightforward, so I'll keep it brief, as this isn't primarily focused on command line interfaces.

00:04:35.040 There is an executable file called rubocop, which is set up to run with Ruby. It adds the appropriate libraries to the load path, requires RuboCop, and then performs some processing.

00:04:38.639 Now we transition from the command line to the Ruby realm. This serves as RuboCop’s entry point to perform its intended tasks.

00:04:46.560 The next step is loading configuration, determining which cops are active, and defining what options need to be provided.

00:04:51.420 Generally, this is all done through YAML files. I won’t delve into the options here, as thorough documentation is available online.

00:04:56.760 Basically, there is a large YAML file where options can be set for all cops.

00:05:04.500 For example, you can specify which Ruby version you’re using, as well as file patterns to include or exclude.

00:05:10.260 You can also provide specific configurations for individual cops by nesting the settings according to the cop's name.

00:05:15.060 That’s a quick overview of YAML config.
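
As a rough illustration of that configuration shape, the sketch below parses a small YAML document with Ruby's standard library. The option names (`AllCops`, `TargetRubyVersion`, `Exclude`, `Enabled`, `Max`) are real RuboCop settings, but the values are only examples, and `cop_config` is a hypothetical helper, not how RuboCop itself loads or merges configuration.

```ruby
require "yaml"

# A small .rubocop.yml-style document; the values are illustrative only.
config_text = <<~YAML
  AllCops:
    TargetRubyVersion: 3.1
    Exclude:
      - "vendor/**/*"
  Style/ArrayJoin:
    Enabled: true
  Layout/LineLength:
    Max: 100
YAML

CONFIG = YAML.safe_load(config_text)

# Hypothetical helper: settings for one cop are nested under the cop's name.
def cop_config(config, cop_name)
  config.fetch(cop_name, {})
end

puts cop_config(CONFIG, "Layout/LineLength").inspect
```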

00:05:16.760 Now onto the more interesting part: processing the code.

00:05:21.600 This topic might feel a bit meta because we're discussing code that is designed to understand code.

00:05:29.460 To illustrate how this works, let’s consider a specific example: RuboCop has a style cop called ArrayJoin.

00:05:34.560 Its purpose is to check whether the star method is used to join values of an array.

00:05:39.540 Imagine we have some code, and we want to determine if any bad patterns are present.

00:05:43.920 If it detects a violation, we want it to flag that code accordingly.

00:05:49.560 For instance, if we have an array containing 'foo', 'bar', and 'baz' that is being joined with the star method, it clearly violates the rule.

00:05:54.420 While humans can easily recognize this, we need to write code that identifies it programmatically.

00:05:59.040 One way to do this would be to use regular expressions, starting with a seemingly complex one.

00:06:03.960 It can quite literally search for patterns like an opening bracket, a bunch of characters up until a closing bracket, then the star method, and additional spaces.

00:06:10.800 However, regex becomes very convoluted when accounting for varying code styles.

00:06:15.960 What if we use single quotes, don’t include spaces, or vary the number of arguments passed? In these cases, regular expressions can rapidly become inadequate.

00:06:20.760 There must be a more efficient way to achieve this than endless variations of regex.
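
To make the problem concrete, here is a small sketch of how a naive pattern behaves on a few variations. The regular expression and the sample strings are assumptions made up for this illustration; they are not taken from the talk's slides.

```ruby
# A naive pattern: "[", anything up to "]", spaces, "*", spaces, a double-quoted string.
NAIVE_ARRAY_JOIN = /\[[^\]]*\]\s+\*\s+"[^"]*"/

SAMPLES = [
  '["foo", "bar", "baz"] * ", "', # the form the pattern was written for
  "['foo', 'bar'] * ', '",        # single-quoted separator
  '["foo", "bar"]*","',           # no spaces around the operator
  '%w(foo bar baz) * ","',        # percent-literal array
  'x = [1, 2][0] * "3"'           # the receiver is an element, not an array
].freeze

MATCHES = SAMPLES.map { |code| NAIVE_ARRAY_JOIN.match?(code) }

SAMPLES.zip(MATCHES).each { |code, hit| puts "#{hit ? 'match   ' : 'no match'}  #{code}" }
```

Only the first and last samples match: the pattern misses three genuine array joins and flags one line that is not an array join at all. Each fix to the pattern tends to open up another gap.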

00:06:24.460 The solution is to take Ruby code and convert it into an abstract syntax tree, or AST.

00:06:31.320 The AST serves as a structured representation of the code.

00:06:38.040 For example, consider a simple begin rescue block. This illustration will help visualize its corresponding AST.

00:06:43.620 There is a gem called parser that handles this conversion. It is a separate gem rather than part of Ruby itself, but RuboCop depends on it, so it is installed alongside RuboCop with no additional setup.

00:06:51.240 Taking our example from earlier, imagine we want to see what the AST for our array looks like.

00:06:56.640 The entire construct represents a method call—a send node, with children that represent various elements.

00:07:01.560 The first child indicates the receiver (in this case, our array), while the second identifies the method (in this case, the star method).

00:07:06.780 Additional children will provide the arguments being passed, which in this example contains the string that specifies the joiner.
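
The shape of that send node can be sketched with a tiny hand-rolled node type. To stay self-contained, this does not use the parser gem: `Node`, `s`, and `to_sexp` are stand-ins written for this example, and the tree is built by hand rather than parsed from source.

```ruby
# Minimal stand-in for an AST node: a type plus an ordered list of children.
# (The parser gem has its own node class; this Struct is only for illustration.)
Node = Struct.new(:type, :children) do
  # Render the node as a nested, parenthesised expression.
  def to_sexp
    parts = children.map { |c| c.is_a?(Node) ? c.to_sexp : c.inspect }
    "(#{[type, *parts].join(' ')})"
  end
end

def s(type, *children)
  Node.new(type, children)
end

# Hand-built tree for: %w(foo bar baz) * ","
ARRAY_JOIN_AST =
  s(:send,
    s(:array, s(:str, "foo"), s(:str, "bar"), s(:str, "baz")), # first child: receiver
    :*,                                                         # second child: method name
    s(:str, ","))                                               # remaining children: arguments

puts ARRAY_JOIN_AST.to_sexp
```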

00:07:11.880 Now, if we want to determine whether our code violates any rules, we can apply the parser to obtain this AST.

00:07:14.880 Expressing the AST in a formatted manner will yield an easily digestible output.

00:07:18.380 The beauty of the AST representation is its consistency.

00:07:21.600 If our original code is amended—say, by switching quotes or adjusting spacing—the underlying AST remains unchanged.

00:07:26.520 This stability permits RuboCop to analyze code with confidence, leading to reliable conclusions.

00:07:30.600 Utilizing a utility gem known as RuboCop AST enhances this functionality, facilitating a more user-friendly interaction with our AST.

00:07:36.360 A class called ProcessedSource within this utility can take some code in a string format and the expected Ruby version.

00:07:41.940 From there, we can delve into the AST and query it as per our needs.

00:07:44.520 In our previous example, we can check if it violates the array join rule.

00:07:52.440 By checking if the node type is send and verifying its receiver, we can ask the right questions.

00:07:58.320 These checks allow us to ascertain if a violation exists.
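
Those questions can be written down as a small predicate. This is a sketch over hand-built stand-in nodes (`Node` and `s` are defined here for illustration), and `array_join_offense?` is a hypothetical name; the real cop expresses its check differently.

```ruby
# Stand-in node type and builder, defined here only to keep the example self-contained.
Node = Struct.new(:type, :children)

def s(type, *children)
  Node.new(type, children)
end

# Is this node "an array literal, sent the star method, with one string argument"?
def array_join_offense?(node)
  return false unless node.is_a?(Node) && node.type == :send

  receiver, method_name, *args = node.children
  receiver.is_a?(Node) && receiver.type == :array &&
    method_name == :* &&
    args.size == 1 && args.first.is_a?(Node) && args.first.type == :str
end

joined   = s(:send, s(:array, s(:str, "foo"), s(:str, "bar")), :*, s(:str, ","))
multiply = s(:send, s(:int, 2), :*, s(:int, 3))

puts array_join_offense?(joined)   # array * string
puts array_join_offense?(multiply) # integer * integer
```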

00:08:01.680 Now that we have some understanding of how RuboCop processes the code, let’s explore how it applies specific cops.

00:08:06.780 Let’s introduce a bit of complexity. What if our joining code is wrapped in a method?

00:08:09.840 This changes the approach; we can no longer simply check for send type.

00:08:14.400 Instead, the top-level node is now a method definition (def) node, which has no receiver.

00:08:19.740 However, we can still identify the arguments passed to this definition, allowing us to inspect the inner workings.

00:08:24.960 To navigate through the nodes, we need to traverse the AST and analyze its components.

00:08:29.700 Through this traversal, we can visit every node and gather the necessary data.

00:08:34.260 For instance, let’s create a method called walk that will allow us to visit every node within the AST.

00:08:39.000 For each node type, we will specify a corresponding method that starts with 'on'.

00:08:44.520 For example, we need an on_def method since the def is the top-level node.

00:08:47.880 Next, we determine how to process the information within our definition node.

00:08:52.500 The arguments passed will yield their own children while the body of the definition may contain a varying number of types.

00:08:56.460 By defining an on_send method, we can analyze the operation more directly.

00:09:01.800 The on_send method will help dig further into understanding what is happening at this node.

00:09:06.180 Returning to the on_args method, we can now work through each child of this AST and invoke appropriate methods.

00:09:09.780 With these methodologies in place, we can ensure we’re accessing every necessary part of the AST.
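
A hand-written walker along these lines might look like the sketch below. `Node`, `s`, and the `Walker` class are stand-ins invented for this example, and the tree is built by hand to resemble a method definition wrapping the array join.

```ruby
# Stand-in node type and builder, defined here only to keep the example self-contained.
Node = Struct.new(:type, :children)

def s(type, *children)
  Node.new(type, children)
end

class Walker
  attr_reader :visited

  def initialize
    @visited = []
  end

  # Record the node, then dispatch to on_<type> if this walker defines it.
  def walk(node)
    return unless node.is_a?(Node)

    @visited << node.type
    handler = :"on_#{node.type}"
    send(handler, node) if respond_to?(handler, true)
  end

  private

  # A def node's children are: method name, arguments, body.
  def on_def(node)
    _name, args, body = node.children
    walk(args)
    walk(body)
  end

  def on_args(node)
    node.children.each { |child| walk(child) }
  end

  # A send node's children are: receiver, method name, then the arguments.
  def on_send(node)
    receiver, _method_name, *args = node.children
    walk(receiver)
    args.each { |arg| walk(arg) }
  end
end

# Roughly: def join_it(list); %w(foo bar) * ","; end
TREE = s(:def, :join_it,
         s(:args, s(:arg, :list)),
         s(:send, s(:array, s(:str, "foo"), s(:str, "bar")), :*, s(:str, ",")))

walker = Walker.new
walker.walk(TREE)
puts walker.visited.inspect
```

Because this walker has no handler for array nodes, it never descends into the array's elements. Every node type needs its own method, which is the tedium that a generic traversal removes.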

00:09:14.280 RuboCop AST has abstracted a lot of this away via a module called RuboCop::AST::Traversal.

00:09:19.560 This module allows us to traverse over any arbitrary AST in a depth-first manner.

00:09:23.820 When visiting each node, if any of the active cops define methods that are pertinent to these nodes, they will be handed over for further processing.

00:09:29.040 When defining our own class, we can include this module to navigate any given AST.

00:09:32.520 Each node type will correspond to methods we designate in this class.

00:09:36.600 Thus, if we had our on_send method, we could apply it to each relevant node and perform the necessary assessments.

00:09:40.080 Our aim is to ask questions of the node: is the receiver an array, and is the method name correct? If all conditions are satisfied, we register a violation.

00:09:45.480 This method serves as the backbone of how we determine if a violation exists.
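
The dispatch idea can be sketched without the real traversal module. In this illustration a generic depth-first `traverse` offers each node to every cop that defines a matching `on_<type>` method. `Node`, `s`, `ArrayJoinCop`, and `traverse` are all hypothetical stand-ins; RuboCop's own classes and method signatures differ.

```ruby
# Stand-in node type and builder, defined here only to keep the example self-contained.
Node = Struct.new(:type, :children)

def s(type, *children)
  Node.new(type, children)
end

# Hypothetical cop: it only defines a handler for the node type it cares about.
class ArrayJoinCop
  attr_reader :offenses

  def initialize
    @offenses = []
  end

  def on_send(node)
    receiver, method_name, *args = node.children
    return unless receiver.is_a?(Node) && receiver.type == :array
    return unless method_name == :* && args.size == 1
    return unless args.first.is_a?(Node) && args.first.type == :str

    @offenses << node # "register a violation"
  end
end

# Depth-first walk: offer the node to interested cops, then descend into child nodes.
def traverse(node, cops)
  return unless node.is_a?(Node)

  handler = :"on_#{node.type}"
  cops.each { |cop| cop.public_send(handler, node) if cop.respond_to?(handler) }
  node.children.each { |child| traverse(child, cops) }
end

TREE = s(:begin,
         s(:def, :join_it, s(:args),
           s(:send, s(:array, s(:str, "foo"), s(:str, "bar")), :*, s(:str, ","))),
         s(:send, s(:int, 2), :*, s(:int, 3)))

cop = ArrayJoinCop.new
traverse(TREE, [cop])
puts cop.offenses.size
```

The cop never has to know where in the tree a send node lives; the traversal finds every node and the cop only answers for the type it handles.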

00:09:50.760 For further examples, we could observe another cop—say, a MinMax cop.

00:09:55.620 This cop would define its own on_array method to analyze the specific arrays it encounters.

00:10:01.380 From here, an evaluation is made, such as checking whether the array holds the min and the max of the same object.

00:10:04.500 This structure allows the cop to focus specifically on parts of the AST that it cares about and provide precise evaluations.

00:10:08.220 There are a plethora of different node types possible in ASTs—from variable assignments to class definitions, each wielding its own specific evaluations.

00:10:14.459 The ability to recognize and represent them allows RuboCop to engage meaningfully with the patterns it is looking for.

00:10:19.800 Now, let's see how RuboCop rewrites the source and autocorrects the identified issues.

00:10:24.600 If we revisit the earlier method needing adjustment, we would replace the star call with a call to join, wrapping the argument in parentheses.

00:10:29.040 Luckily for us, the parser gem incorporates a class called TreeRewriter, which streamlines the rewriting process. This class manages various transformations in the proper order.

00:10:35.040 Now, the immediate questions that arise involve determining the range and deciding the new content.

00:10:41.520 Diving into the range, this is defined by a particular class that indicates character spans across Ruby expressions.

00:10:46.560 Every piece of parsed code has an accessible location method, revealing interesting facets of that node.

00:10:50.340 One of these facets is the expression itself, indicating precisely where that piece of the AST exists.

00:10:54.600 Focusing on the send node, identified as the third child of the overall AST, we can utilize location to find its corresponding source code.

00:11:01.680 By checking the location.expression, we discern what part of the source code requires alteration.

00:11:05.640 Once we establish the content to be substituted, we can ask the receiver for its source, revealing how it was originally expressed.

00:11:09.960 Likewise, we can assess the arguments passed in and obtain their original string representations.

00:11:16.500 Bringing these together, we set up the stage for our rewriting procedure, focusing explicitly on that send node and gathering its associated data.

00:11:24.840 Employing this knowledge, we call the rewriter's replace method with that range and the new content to rewrite the source per our requirements.

00:11:29.760 With the tree successfully rewritten, we achieve the intended alterations, seamlessly transforming the source code as needed.
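
The range-and-replace idea can be sketched with plain strings. `MiniRewriter` below is a hypothetical stand-in written for this example, not the parser gem's TreeRewriter; it only borrows the idea of collecting replacements for character ranges and applying them together. The range is found here by string search, whereas in the talk it comes from the send node's location.

```ruby
# Hypothetical stand-in for a source rewriter: collect (range, replacement) pairs,
# then apply them from the end of the source backwards so earlier offsets stay valid.
# Overlapping edits are not handled in this sketch.
class MiniRewriter
  def initialize(source)
    @source = source
    @edits = []
  end

  def replace(range, text)
    @edits << [range, text]
    self
  end

  def process
    result = @source.dup
    @edits.sort_by { |range, _text| -range.begin }.each do |range, text|
      result[range] = text
    end
    result
  end
end

SOURCE = "def join_it\n  %w(foo bar baz) * \",\"\nend\n"

# Assumed inputs: in the talk these come from the send node (its expression range,
# the receiver's source, and the argument's source).
receiver_src = '%w(foo bar baz)'
argument_src = '","'
expression   = "#{receiver_src} * #{argument_src}"
start        = SOURCE.index(expression)
range        = start...(start + expression.length)

REWRITTEN = MiniRewriter.new(SOURCE)
                        .replace(range, "#{receiver_src}.join(#{argument_src})")
                        .process

puts REWRITTEN
```

Applying edits back to front is one simple way to keep ranges valid when several corrections touch the same file; only the characters inside the range change, so the surrounding code and indentation are preserved.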

00:11:35.520 Effectively, this encapsulates how RuboCop operates. Upon detecting violations, the auto-correcting mechanisms within it will enact necessary changes.

00:11:40.680 As corrections are finalized, RuboCop writes the changes back into the file, reprocessing if further alterations are warranted.

00:11:44.160 Now that we've journeyed through this crash course on RuboCop, it's important to reflect on the insights gained.

00:11:49.320 We’ve explored components of the command line interface, seen how files convert into ASTs, and analyzed how RuboCop interacts with that tree.

00:11:53.880 I hope this enhances your understanding of RuboCop's functionalities, inspiring you to think about how to leverage this knowledge.

00:11:58.899 If you are interested in building a tool that needs to analyze source code, the concepts we've discussed about traversing and parsing the abstract syntax tree could be very beneficial.

00:12:05.880 Thank you for your time; I hope you found this engaging. If there are questions, feel free to approach me after the talk.