RailsConf 2019

Bug-Driven Development

Bug-Driven Development

by Ylan Segal

In the talk "Bug-Driven Development" at RailsConf 2019, Ylan Segal discusses how bug reports can drive improvements in software development and testing methodologies. The main theme revolves around the identification, reproduction, and resolution of a complex bug impacting user functionality in a Rails application.

Key points include:

  • Introduction to Bug-Driven Development: Ylan introduces the concept, emphasizing the educational value of tackling difficult bugs.
  • Case Study Presentation: He narrates a real-world scenario involving a bug where user statuses did not correctly reflect changes made in the web interface due to discrepancies between the web and API behavior, particularly under different account settings.
  • Challenges in Debugging: The debugging process was complicated due to restrictions on accessing production data, making it hard to replicate issues that were confirmed only under certain conditions, such as regional account differences.
  • Code Analysis and Bug Identification: Through the exploration of the code, Ylan reveals that the root of the bug was linked to the incorrect assignment of user account characteristics due to inadequate equality checks.
  • Testing as a Solution: He emphasizes the importance of writing regression tests that accurately capture the necessary conditions to replicate the bug, advocating for thorough testing across different scenarios to ensure robustness.
  • Refactoring for Improvement: Post-bug fix, he discusses how the process of understanding the bug led to improvements in the system architecture, such as creating better abstractions and adhering to design patterns, which ultimately facilitates simpler code maintenance.
  • Emphasis on Continuous Learning: The conclusion underscores that every encountered bug is an opportunity for learning and improvement, reinforcing that robust testing practices are essential in guiding effective system design.

Overall, Ylan illustrates how confronting bugs not only leads to immediate fixes but also contributes to long-term enhancements in application structure and development processes.

00:00:21.020 Hello everyone, thank you for coming to my talk.
00:00:24.720 My name is Ylan.
00:00:25.199 I am originally from Mexico City, and I've been a long-time resident of San Diego.
00:00:28.320 I have been writing code in some form or another for the last 20 years.
00:00:30.180 I've been using Rails since about 2009.
00:00:35.040 Well, this talk is about bug-driven development, and I'll tell you a story that I hope you find useful.
00:00:42.980 A word about Procore: I'm a staff engineer there.
00:00:45.269 At Procore, we build the software to help construct the world. That's our mission.
00:00:55.649 We work on code that is used daily by thousands of construction professionals around the country and across the world to improve their jobs and provide the information they need at the right moment, whether it's on the job site, in their app, or on the web.
00:01:06.330 I love working there; I'm surrounded by smart, caring people who are tackling interesting problems.
00:01:15.450 They truly live up to their values, which are ownership, openness, and optimism. Maybe you'll love it too. We are currently hiring at our headquarters in Carpinteria, California, and in Austin, and we're even starting an R&D team in Sydney, Australia.
00:01:26.970 If you are interested, please come to our booth and talk to us. Thank you, Procore, for letting me speak.
00:01:41.280 Alright, so this story starts with a bug report, as many stories do.
00:01:44.880 One of our customers reported: "I’m trying to set a user as inactive, but when I do it on the web interface, it’s not reflected correctly in the API; the change isn’t sticking."
00:01:50.910 Okay, that sounds easy enough. Let’s try to recreate it locally.
00:01:56.099 I go to my development environment to reproduce the issue and make the desired change, and it shows up correctly in the public API.
00:02:00.720 I don’t see any issue.
00:02:01.440 I decided to stop by the customer support team that received this report and see what they thought. They confirmed they were able to replicate this bug, but it only happened with accounts in the European Union.
00:02:20.730 Now, this is starting to look interesting. We have two deployments: one in the U.S. and one in Europe.
00:02:29.310 This wasn’t for Procore; it was for a company I worked with before joining Procore. Both deployments were essentially mirror images of each other; we deployed the same code at the same time to both environments.
00:02:40.280 I started thinking that maybe there was a front-end bug that didn’t deploy correctly to the European Union due to old assets, or perhaps it has to do with caching.
00:02:51.620 Honestly, I had no clue what was going on, so I went down several rabbit holes trying to figure it out.
00:03:00.590 I enlisted the help of QA personnel, asking them to help reproduce this bug. Now, at this point, we do not have any production access.
00:03:17.120 I don't know how many of you have had the ability to open a Rails console in production, but in the company I was working for, this was absolutely not allowed due to security restrictions.
00:03:26.510 We were limited to checking logs and had QA saying they could recreate this issue in a staging environment, but only with an enterprise type of account.
00:03:34.459 This made absolutely no sense because changing a user role should not depend on the type of account; that is unrelated code.
00:03:44.310 So now we have quite an interesting bug to investigate.
00:04:01.300 Now, there’s a disclaimer: the code I’m about to show you has been modified from its original version and formatted to fit this screen.
00:04:10.450 I will also simplify it quite a bit for the demonstration. The original code is more complex, but I want to focus on certain elements that are relevant to our discussion.
00:04:19.970 In our system, we have accounts, users, and user accounts. A user account represents a company or organization that has an account in the system and multiple users associated with it.
00:04:34.090 User records contain their name and other attributes, while the user account record includes the role, which is where our problem lies.
00:04:50.150 Now, my inclination is to take you through all the rabbit holes I explored and illustrate how complex this was to solve, but that might not be particularly entertaining.
00:05:02.470 So instead, let’s talk a little about the controller.
00:05:10.060 Our controller is somewhat simplified but would typically look like this.
00:05:12.520 Let’s consider the class responsible for searching users. This class job is to retrieve users, and then we take those users and apply a decoration process.
00:05:29.750 The decorator here applies a similar concept to the notion of a 'current account' which is akin to 'current user' in many applications.
00:05:41.390 In some parts of the application, there is an application controller that provides access to the current account, enabling us to reflect on the user's access within the context.
00:05:56.480 When a user gets decorated, it also has included data from the user account, such as their role and whether they’re enabled.
00:06:01.150 Here we encounter the crux of our problem. If you follow along, we iterate over all user accounts to find the correct one tied to the current account, but we’re not doing so correctly.
00:06:30.229 The missing equality check reveals this flaw in the logic; we were only assigning the property rather than checking if it was correct.
00:06:46.600 This oversight leads to a data leak between accounts, assigning role and enabled flags incorrectly. The data belonging to one account is inadvertently mixed with another.
00:07:09.000 This was concerning, not only due to privacy and data sensitivity but also due to the implications for the company culture.
00:07:23.000 So why was this error surfacing now? This particular codebase had been in production for years without any issues. It had to do with the unique IDs we were using for accounts, which differ from numeric IDs and produced unusual results.
00:07:54.000 Due to their non-sequential nature, the way Rails loads these accounts can lead to unexpected behaviors when instantiation occurs.
00:08:05.880 QA's inability to reproduce the bug stemmed from the nature of our ID generation and how accounts were loaded into the system.
00:08:27.700 At this point, we understood the bug well enough to proceed. Our next step was to write a failing test case.
00:08:39.290 We needed a user that was associated with multiple accounts, different values for role and enabled flags, and ensure that data was loaded in the correct order.
00:09:05.440 We set up a spec to check that the user decorator received the correct values for role and enabled.
00:09:44.760 In creating this test setup, we used Factory Girl to instantiate users with specific roles across their accounts.
00:09:57.100 However, as we created user instances, we encountered another complexity. It was crucial to ensure that the second account loaded correctly.
00:10:22.200 We would need to ensure that we applied the differing roles and enabled flags to expose the bug fully.
00:10:44.260 Throughout this process, I was prompted to consider the performance implications of finding user accounts in Ruby versus delegating that task to the database.
00:10:56.560 I wrote a PR suggesting that we let the database handle finding accounts.
00:11:00.030 My co-worker then opened a PR to further investigate my changes, prompting deeper introspection about why we originally selected Ruby for that task.
00:11:23.560 It turned out there had been careful consideration in the original commit to guard against performance issues due to N+1 query problems when handling account lookups.
00:11:48.050 While we shifted to more efficient patterns, it brought to light the notion of being mindful of optimizing at the right level.
00:12:28.160 Even as we improved our system’s design, I felt uneasy about the complexity requiring us to take extra steps just to create tests.
00:12:49.200 Eventually, I recognized we needed to replace the user decorator with a user account decorator, providing a clearer distinction and structure.
00:13:15.470 Reflecting on this allowed us to consolidate where user data and account data intersected, enabling us to operate more efficiently.
00:14:05.110 Our newly defined user account search process honed in on needed user associations, sidestepping previous complications.
00:14:34.380 Specialization became a theme, with distinct responsibilities for loading and decorating processes, leading to clearer abstractions and overall design.
00:15:21.100 As I worked through this transition, I documented more around the benefits of such refactoring—both in code and via tests.
00:16:35.210 Testing became central to maintaining and clarifying our design. We emphasized preventing regressions and enhancing the system through every change.
00:17:14.120 So, what I want you to take from this presentation is that bugs present opportunities for improvement.
00:18:11.530 Even when systems are immovable, bugs provide insights and understanding of how your system operates.
00:19:02.060 I also encourage you to invest in your testing skills; they are crucial for designing better code. If you can leverage testing, your systems will become easier to work with.
00:19:50.750 That’s it! Thank you all for coming.