00:00:21.020
Hello everyone, thank you for coming to my talk.
00:00:24.720
My name is Ylan.
00:00:25.199
I am originally from Mexico City, and I've been a long-time resident of San Diego.
00:00:28.320
I have been writing code in some form or another for the last 20 years.
00:00:30.180
I've been using Rails since about 2009.
00:00:35.040
Well, this talk is about bug-driven development, and I'll tell you a story that I hope you find useful.
00:00:42.980
A word about Procore: I'm a staff engineer there.
00:00:45.269
At Procore, we build the software to help construct the world. That's our mission.
00:00:55.649
We work on code that is used daily by thousands of construction professionals around the country and across the world to improve their jobs and provide the information they need at the right moment, whether it's on the job site, in their app, or on the web.
00:01:06.330
I love working there; I'm surrounded by smart, caring people who are tackling interesting problems.
00:01:15.450
They truly live up to their values, which are ownership, openness, and optimism. Maybe you'll love it too. We are currently hiring at our headquarters in Carpinteria, California, and in Austin, and we're even starting an R&D team in Sydney, Australia.
00:01:26.970
If you are interested, please come to our booth and talk to us. Thank you, Procore, for letting me speak.
00:01:41.280
Alright, so this story starts with a bug report, as many stories do.
00:01:44.880
One of our customers reported: "I’m trying to set a user as inactive, but when I do it on the web interface, it’s not reflected correctly in the API; the change isn’t sticking."
00:01:50.910
Okay, that sounds easy enough. Let’s try to recreate it locally.
00:01:56.099
I go to my development environment to reproduce the issue and make the desired change, and it shows up correctly in the public API.
00:02:00.720
I don’t see any issue.
00:02:01.440
I decided to stop by the customer support team that received this report and see what they thought. They confirmed they were able to replicate this bug, but it only happened with accounts in the European Union.
00:02:20.730
Now, this is starting to look interesting. We have two deployments: one in the U.S. and one in Europe.
00:02:29.310
This wasn’t for Procore; it was for a company I worked with before joining Procore. Both deployments were essentially mirror images of each other; we deployed the same code at the same time to both environments.
00:02:40.280
I started thinking that maybe there was a front-end bug that didn’t deploy correctly to the European Union due to old assets, or perhaps it has to do with caching.
00:02:51.620
Honestly, I had no clue what was going on, so I went down several rabbit holes trying to figure it out.
00:03:00.590
I enlisted the help of QA personnel, asking them to help reproduce this bug. Now, at this point, we do not have any production access.
00:03:17.120
I don't know how many of you have had the ability to open a Rails console in production, but in the company I was working for, this was absolutely not allowed due to security restrictions.
00:03:26.510
We were limited to checking logs and had QA saying they could recreate this issue in a staging environment, but only with an enterprise type of account.
00:03:34.459
This made absolutely no sense because changing a user role should not depend on the type of account; that is unrelated code.
00:03:44.310
So now we have quite an interesting bug to investigate.
00:04:01.300
Now, there’s a disclaimer: the code I’m about to show you has been modified from its original version and formatted to fit this screen.
00:04:10.450
I will also simplify it quite a bit for the demonstration. The original code is more complex, but I want to focus on certain elements that are relevant to our discussion.
00:04:19.970
In our system, we have accounts, users, and user accounts. A user account represents a company or organization that has an account in the system and multiple users associated with it.
00:04:34.090
User records contain their name and other attributes, while the user account record includes the role, which is where our problem lies.
00:04:50.150
Now, my inclination is to take you through all the rabbit holes I explored and illustrate how complex this was to solve, but that might not be particularly entertaining.
00:05:02.470
So instead, let’s talk a little about the controller.
00:05:10.060
Our controller is somewhat simplified but would typically look like this.
00:05:12.520
Let’s consider the class responsible for searching users. This class job is to retrieve users, and then we take those users and apply a decoration process.
00:05:29.750
The decorator here applies a similar concept to the notion of a 'current account' which is akin to 'current user' in many applications.
00:05:41.390
In some parts of the application, there is an application controller that provides access to the current account, enabling us to reflect on the user's access within the context.
00:05:56.480
When a user gets decorated, it also has included data from the user account, such as their role and whether they’re enabled.
00:06:01.150
Here we encounter the crux of our problem. If you follow along, we iterate over all user accounts to find the correct one tied to the current account, but we’re not doing so correctly.
00:06:30.229
The missing equality check reveals this flaw in the logic; we were only assigning the property rather than checking if it was correct.
00:06:46.600
This oversight leads to a data leak between accounts, assigning role and enabled flags incorrectly. The data belonging to one account is inadvertently mixed with another.
00:07:09.000
This was concerning, not only due to privacy and data sensitivity but also due to the implications for the company culture.
00:07:23.000
So why was this error surfacing now? This particular codebase had been in production for years without any issues. It had to do with the unique IDs we were using for accounts, which differ from numeric IDs and produced unusual results.
00:07:54.000
Due to their non-sequential nature, the way Rails loads these accounts can lead to unexpected behaviors when instantiation occurs.
00:08:05.880
QA's inability to reproduce the bug stemmed from the nature of our ID generation and how accounts were loaded into the system.
00:08:27.700
At this point, we understood the bug well enough to proceed. Our next step was to write a failing test case.
00:08:39.290
We needed a user that was associated with multiple accounts, different values for role and enabled flags, and ensure that data was loaded in the correct order.
00:09:05.440
We set up a spec to check that the user decorator received the correct values for role and enabled.
00:09:44.760
In creating this test setup, we used Factory Girl to instantiate users with specific roles across their accounts.
00:09:57.100
However, as we created user instances, we encountered another complexity. It was crucial to ensure that the second account loaded correctly.
00:10:22.200
We would need to ensure that we applied the differing roles and enabled flags to expose the bug fully.
00:10:44.260
Throughout this process, I was prompted to consider the performance implications of finding user accounts in Ruby versus delegating that task to the database.
00:10:56.560
I wrote a PR suggesting that we let the database handle finding accounts.
00:11:00.030
My co-worker then opened a PR to further investigate my changes, prompting deeper introspection about why we originally selected Ruby for that task.
00:11:23.560
It turned out there had been careful consideration in the original commit to guard against performance issues due to N+1 query problems when handling account lookups.
00:11:48.050
While we shifted to more efficient patterns, it brought to light the notion of being mindful of optimizing at the right level.
00:12:28.160
Even as we improved our system’s design, I felt uneasy about the complexity requiring us to take extra steps just to create tests.
00:12:49.200
Eventually, I recognized we needed to replace the user decorator with a user account decorator, providing a clearer distinction and structure.
00:13:15.470
Reflecting on this allowed us to consolidate where user data and account data intersected, enabling us to operate more efficiently.
00:14:05.110
Our newly defined user account search process honed in on needed user associations, sidestepping previous complications.
00:14:34.380
Specialization became a theme, with distinct responsibilities for loading and decorating processes, leading to clearer abstractions and overall design.
00:15:21.100
As I worked through this transition, I documented more around the benefits of such refactoring—both in code and via tests.
00:16:35.210
Testing became central to maintaining and clarifying our design. We emphasized preventing regressions and enhancing the system through every change.
00:17:14.120
So, what I want you to take from this presentation is that bugs present opportunities for improvement.
00:18:11.530
Even when systems are immovable, bugs provide insights and understanding of how your system operates.
00:19:02.060
I also encourage you to invest in your testing skills; they are crucial for designing better code. If you can leverage testing, your systems will become easier to work with.
00:19:50.750
That’s it! Thank you all for coming.