Maciej Rząsa

Debug Like a Scientist

wroc_love.rb 2024

00:00:12.679 Hello folks, I'm Maciej. So far, we've been talking about one half of our job, writing features. Now, we'll be discussing the other half, fixing bugs. And specifically, some bugs are just harder than others.
00:00:18.720 Have you ever been there? You sit at work, staring at the code, and nothing is working. You're stuck because you've changed every piece of code twice, and it still doesn't work. Or you debug as a team and have to report for the tenth time that it doesn't work. Nobody cares that it works on your computer.
00:00:32.079 Then, someone with a hero complex arrives, claiming, 'I know,' and disappears for two days, only to return with a thousand lines of changed code. You deploy it to production with high hopes, but it still doesn't work. Then the worst happens—people start venting their frustrations. Everybody's frustrated, and you start looking for a scapegoat, maybe within your team or outside. If you’re in a bigger organization like I was, you think, 'Okay, I can’t fix it, so I’ll reassign it to another team.' A week later, this bug returns to you.
00:01:03.640 At the end of the week, your manager comes asking, 'What have you been doing?' And when you have nothing to show, you realize the question should not be 'Have you ever been there?' but 'How many times have you been there?' I've been there numerous times over my ten-plus years of Ruby development. Currently, I work on an application with terabytes of data in the database, orchestrating a dozen machine learning services. It's an awesome yet complex environment where it's easy to get stuck on weird bugs, and I have been, and not just in the last year.
00:02:00.719 I must say that when you're in these situations, you start questioning your life choices. You might think, 'Maybe I should switch careers and become a journalist instead,' or 'Maybe I can work remotely, like a shepherd in the Polish mountains, not just from my desk.' I see that look in your eyes—you’ve been there. Yet, I haven’t switched jobs; I'm still a developer. I used to be a principal engineer and a team lead, but now I’m just a senior developer. However, my company is terrific, so I stay.
00:02:30.199 I bring you hope because I want to share with you a method that has been highly effective for me—a way out of this debugging hell. I’ve seen it applied successfully by other developers independently, leading me to notice that many of us use this same highly effective method without even realizing it.
00:03:05.239 I want to tell you to start debugging like a scientist. I work for a company called Chattermill, and with my ten years of experience in Ruby development, I want to emphasize why we should approach debugging like scientists—
00:03:39.240 although, admittedly, scientists can seem weird and highly impractical. They often produce abstract theories and write boring papers that nobody wants to read. They waste public money that could be redirected towards public infrastructure. So, why should we adopt their approach? Let's look back to the late 19th century when physics was at a crossroads. Physicists thought they understood the world, but simultaneously faced a crisis concerning the speed of light.
00:04:09.640 While they knew light exhibited wave and particle properties, they insisted that it needed a medium—like sound—to propagate. They proposed the hypothesis of a luminiferous ether, a nonexistent medium. To investigate, Michelson and Morley devised a smart experiment in 1887 to measure the speed difference of two orthogonal light rays. They expected different speeds but found them to be the same, revealing a surprising reality, which in computer science we would call a bug.
00:04:49.740 Faced with this outcome, physicists patched the discrepancies with weak, ad hoc hypotheses. This crisis continued until a clerk from the patent office, Albert Einstein, wrote two somewhat dull yet groundbreaking papers. Instead of arguing with reality, Einstein chose to describe it, starting from two assumptions: the laws of physics are the same in all frames of reference, and the speed of light is independent of the motion of its source.
00:05:10.919 Starting from these assumptions, he deduced length contraction and time dilation, arriving at the Lorentz transformation. The transformation had been known before, but it lacked physical grounding until Einstein derived it; special relativity was later confirmed experimentally and, at least in my time, was standard material in high school physics.
00:05:47.279 So why is the method more valuable than the results? Because of how physicists worked: they observed a surprising aspect of reality and guessed an explanation. They formed hypotheses, made predictions, and devised experiments to confirm or refute those hypotheses. The luminiferous ether hypothesis ultimately had to be rejected, while Einstein's special relativity was accepted.
00:06:27.440 It's a method I believe we can apply. Still, you might wonder how this relates to your work. I have emotional connections to the worst bugs I've encountered; they serve as painful reminders we can learn from. Allow me to share one example: we had inexplicable errors in our integration test suite, written in Cucumber. Who loves Cucumber? I hear you, it's stable, right? It never breaks.
00:06:48.200 We were optimizing a GraphQL API and had incorporated a library to preload SQL data from the database. This library began producing baffling errors: it was calling methods that didn't exist. In front of us stood an angry mob with pitchforks yelling, 'Remove the gem!' Fortunately, I was working remotely, so the mob was virtual, yet their demands were real. We didn't want to remove the gem because it improved our speed, but still, we felt the pressure.
00:07:23.120 As we investigated, we checked the source code of the gem, and everything seemed alright. 'It works on my computer,' right? We noted that a special configuration constant guarded the behavior, yet the bug happened anyway. In disbelief, we began to wonder if we were seeing a race condition, a guess I needed to verify. The stakes were high: I was promised a PhD if I was right, though I'm still waiting for it.
00:07:52.960 It wasn’t a haphazard guess; I was a professional and formulated a hypothesis based on supporting evidence—the before-filter changing the configuration was introduced around the same time the CI problems started. We used Puma for our integration tests, which had threading enabled, hinting that it could be related to threading issues. I conducted an experiment.
00:08:25.919 Initially, my attempt at reproducing it with real threads failed; it wasn't simple enough, so I simplified the experiment by simulating the threads myself. I wrote a script that ran the business logic capable of raising the error, while the before and after filters ran in a different thread. Running the script, I confirmed our initial hypothesis: it was indeed a race condition. This led to the decision to revert the pull request that introduced it, ultimately resolving the issue.
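A minimal sketch of what such an experiment might look like, with purely hypothetical names in place of the actual gem, constant, and filters: one thread plays the before/after filters toggling a shared configuration, the other runs "business logic" that assumes the configuration stays stable.

```ruby
# Hypothetical reconstruction of the race-condition experiment.
Config = Struct.new(:preload_enabled, :preloader)
CONFIG = Config.new(false, nil)

def business_logic
  return unless CONFIG.preload_enabled
  # If the other thread resets the preloader between the check above and this
  # call, we get a NoMethodError on nil: a method "that doesn't exist".
  CONFIG.preloader.call
end

filters = Thread.new do
  10_000.times do
    CONFIG.preload_enabled = true          # "before" filter
    CONFIG.preloader = -> { :preloaded }
    CONFIG.preloader = nil                 # "after" filter
    CONFIG.preload_enabled = false
  end
end

worker = Thread.new do
  10_000.times { business_logic }
end

[filters, worker].each(&:join)
```

Run a few times, the worker thread eventually hits the nil preloader and blows up with a NoMethodError, which is exactly the kind of impossible-looking error we were chasing.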
00:09:00.280 What happened in this case is that we began with observations about the guard variable and came up with a wild hypothesis about a race condition. We verified that hypothesis through a simple experiment, which led us to understand the reality better and propose an effective fix. A bug isn't really fixed when a solution is deployed to production; it's fixed when you understand what was happening.
00:09:37.599 We initially gathered data through observations, similar to a laboratory experiment. We proposed and then verified our hypothesis. I propose calling this approach hypothesis-driven development—a term with marketing potential, if you will. This process worked well, but I acknowledge it was a simple case; I could debug locally. However, sometimes bugs manifest only in CI environments.
00:10:00.720 The next case concerned CI, where we struggled to reproduce the issue locally. We relied on Cucumber running against a distributed environment orchestrated by Docker Compose, and we encountered sporadic failures related to timeouts. This was frustrating, because it suggested that our infrastructure wasn't working well, so I started digging deeper into these issues.
00:10:40.480 My hypothesis was straightforward: maybe we pushed too much data to the GraphQL gateway. I decided to shrink the payload by switching to a simple HEAD request and see what happened. What I saw was a series of retries: the first attempt resulted in a 404 error, but on the second attempt we got read timeouts instead. This piqued my curiosity: what exactly is a read timeout?
00:11:20.200 I researched the differences between timeouts and discovered that a read timeout means the connection was accepted, so the process was up and the port was open, but no response came back to my request. Unsure of why this was happening, I delved deeper. While digging through Docker Compose documentation and issues, I stumbled upon a report of a container freezing due to excessive logging in a specific configuration. This seemed relevant, since we too had a health check that was overly verbose in its logging.
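As a rough illustration, not the actual script we used, here is how the different failure modes can be told apart when probing a service with a HEAD request; the host and path are made up.

```ruby
require "net/http"

uri = URI("http://graphql-gateway:8080/health") # hypothetical service address

begin
  response = Net::HTTP.start(uri.host, uri.port, open_timeout: 2, read_timeout: 5) do |http|
    http.head(uri.path)
  end
  # The server answered; a 404 just means it answered with "not found".
  puts "HTTP #{response.code}"
rescue Errno::ECONNREFUSED
  puts "connection refused: nothing is listening on the port"
rescue Net::OpenTimeout
  puts "open timeout: could not even establish the TCP connection"
rescue Net::ReadTimeout
  puts "read timeout: the connection was accepted, but no response ever came"
end
```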
00:12:00.200 Following this lead, I formulated a hypothesis about logging and decided to disable health check logging to see what happened. To my astonishment, the issue was fixed! More importantly, we had pinpointed the problem to Docker Compose itself, specifically its outdated version. We had successfully navigated this obstacle, documenting everything along the way.
00:12:48.480 The observations we had gathered gave us a clearer understanding of the facets of the bug: everything tied back to the outdated Docker Compose setup that introduced the error, and the notes spurred broader reflection on the bugs we had seen. Now I turn to a more complex case.
00:13:42.440 This case was a particularly embarrassing one: production was failing every thirty minutes with 502 errors. Our clients experienced frequent disruptions, and there was no apparent reason for them. I joined a task force to tackle the problem, which gathered people from different teams, frustrated but determined to identify the underlying issue.
00:14:38.240 A myriad of hypotheses emerged: was it an application-level cron job triggered every thirty minutes? An infrastructure cron job causing widespread issues? Or was a single client executing slow requests? Gathering observations was an uphill battle with limited progress, but a few facts did emerge, particularly about the nature of the errors.
00:15:19.480 We established that the problem wasn't affecting the whole application; it was isolated to a single node, so we theorized it could be related to resource issues. Concurrently, I began examining database load balancing, and I noticed that the Passenger request queue length surged in the minutes leading up to the errors, which suggested a leak of some kind.
00:15:55.160 Tracking these metrics revealed a pattern: the problems started at specific times, with spikes in the following minutes. To make the picture clearer, I augmented our Grafana dashboards with more statistics from Passenger. I noticed that the number of Passenger processes was decreasing, yet spawn events weren't occurring as often as they should. Intrigued by this anomaly, I became curious about what could be causing the decrease, fearing it was rooted in process management.
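A rough sketch of the kind of scraping one could layer on top of such dashboards; the exact output of passenger-status varies between Passenger versions, so treat the regexes below as assumptions rather than a stable contract.

```ruby
require "time"

# Periodically sample `passenger-status` and emit the two numbers we cared
# about: how many application processes exist and how long the request queue is.
loop do
  output = `passenger-status 2>&1`

  queue_length  = output[/Requests in queue:\s*(\d+)/, 1].to_i
  process_count = output.scan(/^\s*\* PID:/).size

  # In the real setup these values went to Grafana; here we just print them.
  puts "#{Time.now.utc.iso8601} passenger_processes=#{process_count} request_queue=#{queue_length}"

  sleep 30
end
```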
00:16:38.959 I mulled over the Passenger process model, full of questions without immediate answers. Shouldn't Passenger spawn new processes when needed? If it wasn't, that likely indicated a more serious fault. I pondered these observations and went to bed, wanting to digest everything, but first I posted my findings on Slack to articulate them better and see if others had insights.
00:17:27.360 The next day, I found responses from a colleague who had picked up on a particular log line revealing that Passenger timed out on startup, which would prevent new processes from spawning. That gave us an avenue to explore. We began gathering more details about this localized issue and learning more about how Passenger's process model operated.
00:18:02.679 We established that when we fired up a new node, the problem didn't occur there. It became increasingly evident that the issue was a massive Bootsnap cache on the affected node: it was hurting startup performance so badly that Passenger timed out while trying to start new processes and never properly launched them. To fix it, we needed to reduce the footprint of the cache.
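To make that concrete, here is a quick way to gauge the cache footprint on a node; the tmp/cache/bootsnap path is the usual Rails default and an assumption here, so verify it for your own setup.

```ruby
require "pathname"

# Measure the on-disk footprint of the Bootsnap cache (path is an assumption).
cache_files = Pathname.glob("tmp/cache/bootsnap*/**/*").select(&:file?)
total_mb    = cache_files.sum(&:size) / 1024.0 / 1024.0

puts format("bootsnap cache: %d files, %.1f MB", cache_files.size, total_mb)
```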
00:18:47.920 To validate this, we simply removed the cache and hoped for the best. Upon conducting this experiment, the problem vanished. Being cautious, we held onto our skepticism, since it could emerge again. But once a day had passed without further incidents, we confirmed that the Bootsnap cache was indeed the root of the problem, which in turn raised questions about startup performance and an earlier upgrade our team had done.
00:19:38.160 In hindsight, this case showed how we could scale our debugging efforts across teams. One person can only follow a single line of research at a time, but with several people collaborating, multiple hypotheses can be investigated in parallel and the insights combined without anyone blocking anyone else. This method allowed us to tackle the problem far more successfully.
00:20:12.160 Our teams worked around the clock, benefiting from differing time zones. Furthermore, everyone shared their knowledge transparently, eliminating the 'hero' syndrome where one person fixes everything. It was a collective effort, and the collaboration built confidence across the team.
00:21:01.920 The same approach is valuable when mentoring junior developers. Instead of resorting to lines like 'you're too inexperienced to fix this,' we can guide them through their hypotheses and the observations they've made, writing down the peculiar behaviors to form a clearer picture. This lets the junior developer grow and validates their insights without making them feel dismissed.
00:21:49.760 Most of us feel overwhelmed during complex debugging at some point, and moments of frustration are typical. However, it is entirely possible to find a path out of this mess by adopting a scientific mindset. We don't have to resort to rapid, instinctual fixes without evidence; instead, we can bring the discipline of hypothesis testing to how we approach problems.
00:22:35.919 The point isn't just finding quick fixes; it's the methodology of scientific inquiry that leads to more effective debugging. It doesn't stop at formulating hypotheses; it continues into good habits: conscientiously collecting notes, stating your assumptions, and communicating effectively. By sharing thoughts and noting observations, we can narrow down the true nature of the issues we face.
00:23:20.960 Ultimately, improving your process matters both for individual growth and for fostering a culture of shared learning that outlasts any single bug we've resolved. Integrating hypothesis-driven methods can provide a solid foundation for how your organization works; simply pushing through problems without examining them doesn't get you far.
00:24:01.520 Systematic evaluation also builds an engaged community, where developers stand on a more secure foundation and don't squander time on errors that could easily be dealt with through collaboration. If everyone takes it upon themselves to reinforce good habits and share knowledge, everyone is better for it.
00:24:45.920 In conclusion, the path to success is indeed complicated. The way we approach debugging should embody the scientists' spirit of inquiry and exploration—observing carefully, hypothesizing deliberately, and taking experimental action when the opportunity arises. Remember, our goal is to solve bugs through understanding and collaboration, ensuring we have each other’s backs in this process.
00:25:29.760 Thank you! I welcome any questions you might have.
00:26:38.040 Is there any documentation we should keep about the kinds of bugs we hit, so it can be referenced later? For example, habits introduced into a team's culture to help avoid such issues in the future. What kinds of processes or rituals have worked well that we could suggest to each other?
00:27:29.840 I believe postmortem evaluations can be beneficial. In my previous company, we implemented rules for conducting them. However, simply performing postmortems doesn't guarantee they're effective or free of blame culture; you can end up producing documents that lack depth.
00:28:13.920 In a smaller team, a simple rule like 'we screwed up, let's document it on Slack' suffices, and it promotes good practices. It's vital not to over-formalize the evaluation, because that can hurt productivity; find reasonable resolutions without convoluted processes. Moving swiftly is sometimes far simpler than getting bogged down too deeply.
00:29:06.040 Would I consider writing a paper about this? I’d love to! Just need guidance on how to proceed with it. If there are no more questions—folks, this concludes our discussion today.
00:29:40.760 Thank you!