
The overnight failure

Sebastian Sogamoso • September 29, 2017 • Budapest, Hungary

The video titled "The Overnight Failure" features speaker Sebastian Sogamoso at EuRuKo 2017. He shares a personal story about a significant incident during his time at a company developing a ride-sharing application. The talk focuses on the inevitability of failures in technology and the essential lessons that come from such experiences.

Key points discussed include:
- Personal Anecdote: Sebastian recalls a stressful Saturday morning referred to as 'Black Saturday,' when payment processing errors led to numerous users being incorrectly charged.
- System Overview: He explains the app's operation, detailing how trips were managed and payments processed, highlighting the complexity of the job queue system in handling transactions.
- Failure Description: A malfunction in the payment processing resulted in multiple duplicate charges, leading to significant user complaints and financial confusion.
- Bugs Identified: He identifies two main bugs: one that multiplied jobs in the queue and another that allowed transactions without checking for previous charges.
- Response and Recovery: Sebastian outlines his immediate steps to mitigate the chaos, including issuing refunds and charge reversals. He emphasizes the importance of learning from failures and understanding errors in order to build better systems.
- Creating a Safe Work Environment: He advocates for a workplace culture that allows for admitting mistakes without fear of judgment, stressing the importance of thorough documentation of failures for future learning.
- Emotional Impact: He reflects on the stress and responsibility of managing such errors, but also encourages openness about failures in the tech community.

In conclusion, Sebastian implores listeners to acknowledge that failures are a part of professional growth and learning. By fostering environments that permit transparency and accountability, teams can improve not only their morale but also their processes.


EuRuKo 2017

00:00:30.630 I came all the way from Colombia, and if you know anything about Colombia, you should definitely come and visit. In fact, I have the perfect excuse for you to go to Colombia: we have a RubyConf in Colombia next year. You're all invited! This is my first time in Hungary, and I'm in love with Budapest—it's such a beautiful city. I've been trying the local cuisine, eating different dishes, and I've been enjoying it quite a lot. I've even been taking some cooking classes to improve my culinary skills, which are a bit lacking. I decided to prepare goulash, a recipe that I've always liked. I researched it, bought all the ingredients, followed all the steps, and the result? It was an epic fail! I definitely have a long way to go.
00:02:03.840 We also don't talk about these kinds of experiences in public. Usually, we see people discussing the cool and good things they've done and their successes. But we rarely hear about failures unless they are really public affairs. For example, consider what happened to GitHub on January 11, when someone made a mistake that caused major downtime for several hours. I know there are people from GitHub here, so I feel for you. Or take the incident with Amazon AWS this year, where someone also made a mistake that affected the service. I don't know if you remember, but what was kind of funny was that S3 was down, and they couldn't update their status page because the icon was hosted on S3. It’s a funny story, but I can only imagine the stress those people went through.
00:03:46.500 Most of us may not work for companies operating at such a colossal scale, but regardless of the scale of our applications, we are all building software that humans rely on. That means that when we create bugs, it can get really stressful. This brings me to the most important reason why I wanted to give this talk: when we deal with big problems, we often learn the most. It's not pleasant, and it's not a good situation to be in, but sometimes it's during these challenging times that we learn the most. I want everyone to think for a moment about the worst thing that could happen to you at work today. Now imagine it actually happening, because that's what happened to me.
00:04:36.329 Let me tell you the story of how it happened. I worked for a company that had a product very similar to a ride-sharing application, which people used to share rides to work. The way it worked was straightforward. We had someone, say Ana, who owned a car and commuted from home to work every day. She could find someone like Alex, who didn’t have a car but had a similar commute. The app allowed them to synchronize their schedules, establish a shared trip, and it took care of everything related to payments. When they started carpooling, the app would keep a record of their trips and even match them with more people, such as John, who had a similar route. The app would manage payments, charging the riders like Alex and John while compensating Ana for the trips she drove.
00:05:46.250 Now, let's talk a little about how that payment system was implemented. Each week, a payment processing task was triggered. It pulled all the trip records from the past week and placed one job in a queue for each passenger-driver pair, which meant a significant number of jobs was created every week. Each job carried a passenger-driver combination and the amount of money the passenger owed the driver for that week's trips.
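To make that shape concrete, here is a minimal Ruby sketch of a weekly enqueue step like the one described. Every name in it (Trip, PaymentJob, WeeklyPaymentRun) is hypothetical; this is not the speaker's actual code, just an illustration of one job per passenger-driver pair carrying the amount owed.

```ruby
# Hypothetical sketch: one payment job per passenger-driver pair for the week.
Trip       = Struct.new(:passenger_id, :driver_id, :fare, keyword_init: true)
PaymentJob = Struct.new(:passenger_id, :driver_id, :amount, keyword_init: true)

class WeeklyPaymentRun
  def initialize(queue)
    @queue = queue # anything that responds to #push, e.g. an Array or a real job queue
  end

  # Group the week's trips by passenger-driver pair and enqueue one job per pair.
  def call(trips)
    trips.group_by { |t| [t.passenger_id, t.driver_id] }.each do |(passenger_id, driver_id), pair_trips|
      @queue.push(PaymentJob.new(passenger_id: passenger_id,
                                 driver_id: driver_id,
                                 amount: pair_trips.sum(&:fare)))
    end
  end
end

# Example: three trips collapse into two jobs, one per pair.
queue = []
trips = [
  Trip.new(passenger_id: "alex", driver_id: "ana", fare: 3.50),
  Trip.new(passenger_id: "alex", driver_id: "ana", fare: 3.50),
  Trip.new(passenger_id: "john", driver_id: "ana", fare: 4.00)
]
WeeklyPaymentRun.new(queue).call(trips)
queue.each { |job| puts job.to_h } # prints the two jobs: alex/ana for 7.0 and john/ana for 4.0
```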
00:06:54.759 As each job in the queue was processed, the payment gateway charged the passenger and paid the driver. The gateway handled all aspects of the payments, so we didn't have to concern ourselves with that. To summarize: users had an app they used to carpool every day, the app tracked their trips, and a payment system ran weekly, charging passengers and compensating drivers. Now that you understand the system, let's move to the part where things went wrong.
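Continuing the same hypothetical sketch, the processing side might look like the worker below. The FakeGateway interface (charge and payout) is invented purely for illustration; the talk does not describe the real payment provider's API.

```ruby
# Hypothetical processing side, continuing the names from the previous sketch.
class FakeGateway
  def charge(customer:, amount:)
    puts "charged #{customer} #{format('%.2f', amount)}"
  end

  def payout(recipient:, amount:)
    puts "paid out #{format('%.2f', amount)} to #{recipient}"
  end
end

class PaymentWorker
  def initialize(gateway)
    @gateway = gateway
  end

  # As described in the talk, the original worker simply trusted whatever was in
  # the queue and sent the transaction straight to the gateway.
  def perform(job)
    @gateway.charge(customer: job.passenger_id, amount: job.amount)
    @gateway.payout(recipient: job.driver_id, amount: job.amount)
  end
end

# Draining the queue built in the previous sketch:
worker = PaymentWorker.new(FakeGateway.new)
queue.each { |job| worker.perform(job) } # one charge and one payout per job
```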
00:07:41.360 Then came a Saturday morning. At 6 a.m. the usual weekly processing started. I like to call that day 'Black Saturday', not because of any special discounts or sales, but because it was an exceedingly stressful day. Early that morning, a user went to the store to buy some eggs and bread for breakfast. When she tried to pay, her card was declined. She thought, 'What the hell? This doesn't make any sense.' After some investigation, she discovered her credit card was maxed out.
00:08:25.200 Upon reviewing her transactions, she found numerous charges from the carpooling company I worked for; this was around 6:25 a.m. Other users soon began to notice similar problems, and the customer care team started receiving a massive influx of complaints. Black Saturday was not shaping up well: by 6:34 a.m., several users were reporting the exact same issue. Meanwhile, I was still in bed asleep when I received a call from my boss.
00:09:28.420 The call was something like this: 'Hey, sorry to call you this early, but we have a problem with payments in production, and a lot of customers are complaining.' I tried not to freak out and told him, 'Yeah, sure, I’ll check it out right away. Let’s talk over chat.' I hung up, feeling nervous and shocked, and soon after called my brother to talk about it. At this point, before even 7:00 a.m., the day was off to a terrible start.
00:09:54.270 When I checked the payment gateway, I saw a multitude of duplicated charges. Looking deeper into the payment system, I discovered the queue was still full of jobs waiting to be processed, and more were being created continuously. I immediately stopped the process to prevent any further charges. Then I compiled a list of all the incorrect charges and issued refunds to every affected user. However, the driver side of those payments was still going through, which meant I also had to execute charge reversals to prevent the company from losing money on those transactions.
00:10:53.020 At that moment, it seemed that I had contained the problem. However, the queue was still filled with jobs. I began to notice that many jobs appeared duplicated in the queue, with the same passenger-driver combinations repeated multiple times. It wasn’t just a couple of times; it was thousands of duplications! The entire system was seriously compromised, and I learned that the problem stemmed from two main bugs.
00:11:44.060 The first bug was in the code responsible for fetching information from the database and queuing it up. Ideally, we should have had n jobs in the queue, one per passenger-driver pair. Due to the bug, we had n squared jobs! It was overwhelming. The second bug was in the payment processing code itself, which sent transactions to the payment gateway without verifying whether a passenger had already been charged that week. This should have been a basic check.
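Purely as a hedged illustration built on the hypothetical worker above: the first bug would be fixed where jobs are built (ensuring the query yields one job per pair rather than duplicating rows), and the second by making the worker idempotent, so a pair that has already been charged is skipped. A plain in-memory Set stands in here for what would realistically be a lookup against a persisted payments record keyed by billing period.

```ruby
require "set"

# Hypothetical fix for the second bug: remember which passenger-driver pairs have
# already been charged and skip duplicate jobs.
class IdempotentPaymentWorker < PaymentWorker
  def initialize(gateway, ledger = Set.new)
    super(gateway)
    @ledger = ledger
  end

  def perform(job)
    key = [job.passenger_id, job.driver_id]
    return if @ledger.include?(key) # this pair was already charged: drop the duplicate
    super
    @ledger << key                  # record the pair only after the charge went through
  end
end

# Even if the queue somehow contains thousands of copies of the same job,
# each pair is charged at most once per run:
worker = IdempotentPaymentWorker.new(FakeGateway.new)
3.times { queue.each { |job| worker.perform(job) } }
```

The design point of the sketch is that the guard lives in the worker itself, so duplicate jobs become harmless no matter how they ended up in the queue.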
00:12:15.790 Fixing those two bugs was relatively easy. By 10:50 a.m., I had the fixes deployed to production. After a long and exhausting day, I felt tired and frustrated. Immediately after deploying the fixes, I started searching for a new job, which may not be entirely true, but that's how I felt at the moment. Now, to quickly summarize the situation: thousands of users were affected by this bug, and they were charged widely varying amounts. In one alarming case, someone was charged $5,000. Many users were charged more than 2,200 times! Thankfully, they weren't charged even more, most likely because we had maxed out their credit cards or drained their accounts.
00:13:18.130 After seeing this unfold, I had some serious thoughts. I genuinely questioned why these systems fail so quickly. It all spiraled fast, and we needed to reach out to every affected user, offering expedited refunds, usually by sending them a physical check that would deliver the funds immediately—or at least the next day. This was our fastest option, though in retrospect, it was a really crappy situation. Although the worst part seemed over, I still had weeks of follow-up work to sort out the mess left by those bugs. I spent a lot of time helping the customer care team manage the barrage of inquiries that flooded in.
00:14:07.570 No one would trust a team responsible for such failures in the payment system, and I was part of that team. The experience was rough. This is why I am sharing this experience today—because embarrassing moments happen to all of us, and at some point, we will create some kind of issue in our careers. You may think you’re safe, that your tests will save you, but even with a QA team in place, that’s not a guaranteed safety net. Mistakes will happen.
00:15:03.360 What we need to do is cultivate environments in our companies where people can admit when they mess up. It must be understood that it’s okay to acknowledge mistakes, regardless of severity. These situations are a common part of our professional journeys, and we will likely face them multiple times in various capacities. Therefore, it’s vital that no one feels judged or unsafe in expressing their problems or sharing the mistakes they made.
00:16:31.550 I urge you, if and when these scenarios arise (and believe me, they probably will): first, understand what happened and how to fix it. Move slowly so you don't inadvertently create bigger problems while trying to resolve the existing ones. It's crucial to take your time and work out a thorough and effective solution for any issue you encounter. In times like this, there's often immense pressure to act quickly because everything seems urgent, but that approach usually ends up exacerbating the problem.
00:17:35.040 It’s vital to document everything related to these events—what caused the problem, how it was fixed, and the valuable lessons learned from the experience. We need a record and a history to prevent the recurrence of similar issues. This documentation should be available for the entire team to understand how to avoid making the same mistakes in the future. Creating a culture of openness is essential, where admitting problems and accepting responsibility for fixing them is welcomed, rather than simply assigning blame.
00:18:58.730 Always remember, your mistakes do not define you; they are temporary. Even when things seem dire, in time you'll look back and perhaps even laugh about it. We all share these experiences and this vulnerability, and as many people say, we are not our mistakes; that applies to our code as well. I invite everyone here to share their own stories of failure. I know it sounds daunting, but tweet them using the conference hashtag; I would love for us to hear each other's stories. You can also come speak to me afterward if you want to chat or commiserate. We can support one another.