The Overnight Failure

Sebastian Sogamoso

@sogamoso

#failure-analysis

#software-development

#error-handling

#team-communication

#payment-systems

by Sebastian Sogamoso

The talk "The Overnight Failure" by Sebastian Sogamoso at wroc_love.rb 2017 focuses on the theme of embracing failure in the software development industry, highlighting the importance of discussing and learning from mistakes. Sogamoso, who works for Cookpad, shares a personal story about a major incident at a previous job that he refers to as Black Saturday, illustrating the challenges and lessons learned from this experience.

Key Points Discussed:
- Introduction and Context: Sogamoso introduces himself and his background, sharing his love for Poland and inviting attendees to an upcoming Ruby conference in Colombia.
- Cultural Observations: He humorously mentions his observations about Poland, including its unique drunk tests and the confusion around building floor designations.
- Story of "The Overnight Failure":
  - The presentation dives into a critical failure in a carpooling app that Sogamoso was working on.
  - On a routine billing day, a system error created duplicate payment jobs and many users were charged multiple times, leading to outrage when their cards were declined because of the excess charges.
- Crisis Management: Sogamoso recounts being woken by a phone call about a flood of complaints and the frantic efforts to contain the financial damage, which involved stopping the charge processing and reversing the erroneous charges.
- Root Causes Identified:
  - The errors were traced to flaws in both the code that retrieved jobs from the database to fill the queue and the payment processing logic, neither of which checked for duplicates.
- Learning from Failure: Despite the devastation caused by the incident, Sogamoso emphasizes the significance of openly discussing failures within the tech community to mitigate feelings of imposter syndrome and foster a culture where mistakes can be addressed constructively.
- Post-Mortem Analysis: He outlines steps taken after the incident, including implementing better testing practices and monitoring systems to avoid similar failures in the future.

Conclusions and Takeaways:
- Cultural Shift: Emphasis on creating an environment where discussing failures is normalized rather than stigmatized, enabling teams to learn collaboratively.
- Importance of Testing: Highlighting the necessity of robust testing and monitoring processes to catch issues before they escalate.
- Community Engagement: Encouragement for attendees to share their own failures as part of a collective learning experience using the hashtag #IBrokeThings and stressing that individuals should not equate failures with their identity.
- Call to Reflection: Sogamoso urges developers to reflect on their own worst-case work scenarios and learn to manage stress and repercussions when failures occur.

00:00:12.139 Good morning, everyone! I know the first talk of the day can be a bit tough, especially when it's raining outside.

00:00:18.230 But I’ll try to make this entertaining, okay? So, let's get started. My name is Sebastian, and I work for a company called Cookpad.

00:00:27.980 That's our logo right there. You can find me as @sogamoso on Twitter, and I really enjoy it when people tweet at me while I'm talking.

00:00:34.730 Feel free to do so! I came all the way from Colombia, where I also have a Ruby conference, similar to this one. It will be held on the 8th and 9th of September, and you should all attend! I’m planning to go, and even better, to speak there.

00:00:54.160 We’re going to open a call for proposals soon, and we would love to have speakers like you from this part of the world.

00:01:06.020 This is my third time in Poland, and I love it here! There are two things that I’ve found really unique about Poland.

00:01:20.780 Firstly, the disposable drunk tests! You can find them conveniently located in many places like bars and restaurants. I think it’s a great idea.

00:01:32.510 Secondly, I’m often confused by the ground floor designation. Many times, I’ve gone out looking for the exit on the first floor like a crazy person, unable to find it.

00:01:44.930 It's been an interesting cultural experience for me. So, last year when I was here, I forgot to do something really important: take a selfie with all of you.

00:02:05.240 So, I’m going to rectify that this year! First, please raise your hands and make some noise! Let's make it a fun selfie!

00:02:25.750 Okay, I’ll tweet it out right after the talk. Thanks for posing!

00:02:32.060 Now, let’s get a little serious. The name of this talk is "The Overnight Failure," and it's a true story about something that happened to me at my previous job.

00:02:38.060 Why did I decide to talk about failure and share a story from my experiences at this conference? Well, there are several reasons.

00:02:59.659 We all have "broken the internet" at some point or had a creative idea that made it to production only to fail. Yet, we don’t often talk about these situations publicly, because it’s easier to discuss cases where we succeeded or appeared competent. We usually confide about our mistakes only to our closest co-workers or friends.

00:03:26.450 This avoidance only nurtures feelings of imposter syndrome, which many of us experience. The good news is that a few companies are making efforts to change this narrative.

00:03:49.760 For example, GoodRx experienced a significant outage but published a public post-mortem explaining what happened. Similarly, AWS suffered downtime, which affected several services we all rely on. I can’t imagine how stressful it must have been for those teams during those outages.

00:04:19.669 Even though most of us may not operate on the same scale as AWS, we still design software that affects people's lives. If our applications go down, it has consequences for users, regardless of the scale.

00:05:01.759 The most important reason I want to talk about this today is that significant problems are often the situations from which we learn the most. By sharing our experiences, we can learn together and avoid making the same mistakes others have already encountered.

00:05:28.729 So, I want everyone to think for a few seconds: what's the worst thing that could happen to you at work? Imagine the worst possible scenario.

00:05:53.080 Now, let me share one of those situations with you, which I call "The Overnight Failure." I'll start by explaining how the product I was working on functioned.

00:06:15.650 It was essentially a carpooling app that allowed users to share rides to work. Let’s say we have a woman named Marcela. She owns a car and drives to work every day. The app I was working on helped Marcela find someone like Claudia, who didn't have a car and had a similar commute, and allowed them to carpool together.

00:06:57.650 The app would establish a cost for each trip and keep a record of how many trips they took. It could also match Marcela and Claudia with others on the same route, letting them all share a ride. Once a week, it would run a billing process where Claudia would be charged for the trips, and Marcela would receive payment.

00:07:50.759 Now, from a technical perspective, the billing process was triggered weekly. It would gather all trips from the previous week, queue jobs for each driver and passenger pair, and process them.

00:08:08.680 The idea was that at any given moment, there should be only one job in the queue for each driver-passenger pair. This approach allowed for a single charge covering all trips taken that week.

00:09:15.310 Each job would process the passenger's charge via a payment gateway, which was a third-party service. Let me recap: we had an app for carpooling with a weekly billing process to manage payments.
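The weekly grouping described above can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the app's actual code: the `Trip` struct, the `build_billing_jobs` method, and all field names are hypothetical, and amounts are kept as integer cents.

```ruby
# Hypothetical trip record; the field names are assumptions for illustration.
Trip = Struct.new(:driver_id, :passenger_id, :cost_cents, keyword_init: true)

# Build one billing job per driver-passenger pair, summing the week's trips
# so that each passenger is charged once per driver.
def build_billing_jobs(trips)
  trips
    .group_by { |trip| [trip.driver_id, trip.passenger_id] }
    .map do |(driver_id, passenger_id), pair_trips|
      {
        driver_id: driver_id,
        passenger_id: passenger_id,
        trip_count: pair_trips.size,
        amount_cents: pair_trips.sum(&:cost_cents)
      }
    end
end

trips = [
  Trip.new(driver_id: "marcela", passenger_id: "claudia", cost_cents: 300),
  Trip.new(driver_id: "marcela", passenger_id: "claudia", cost_cents: 300),
  Trip.new(driver_id: "marcela", passenger_id: "ana", cost_cents: 250)
]

jobs = build_billing_jobs(trips)
```

With this input, `jobs` holds two entries: one for Marcela and Claudia covering both trips, and one for Marcela and the second passenger.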

00:09:37.960 Now, let’s get to what went wrong. It was a Saturday at 6:00 a.m., a billing day like any other, and everything started fine. Little did I know, this day would soon be called Black Saturday.

00:10:18.480 This name isn’t because of any discounts or holidays, but because of the immense stress involved. There’s a story here: a user went to a store to buy bread and eggs for breakfast and found her card was declined.

00:10:57.630 After several failed attempts, she went online to discover her credit card was maxed out due to multiple charges from my company that day. At that point, we had a significant problem.

00:11:38.490 As it turned out, it wasn’t just one user; several others reported similar issues. Our customer care team began receiving a flood of complaints.

00:12:21.060 By 6:34 a.m., the reports escalated, and users were showing outrage. While all of this was unfolding, I was peacefully sleeping in my warm bed, blissfully unaware.

00:12:49.260 At 6:35 a.m., however, a phone call woke me. It was my boss, who said he was sorry to disturb me so early but outlined the issues we were experiencing with payments.

00:13:05.830 I tried to sound calm and told him we should follow up over Slack. I was still in shock as the conversation continued with him trying to remain calm.

00:13:43.310 At 6:43 a.m., I proceeded to investigate. The first thing I did was check our payment gateway, where I discovered a huge number of duplicated charges. This was clearly not normal.

00:14:56.750 I tried stopping the charge processing, but the queue was filling up even more. Therefore, I had to stop the part of the system responsible for filling the queue.

00:15:57.330 After some frantic minutes, I managed to check how many users we charged and began reversing the charges using the payment gateway API.
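One way to decide which charges to reverse in a situation like this is to keep the first charge per driver-passenger pair and mark every later one as a duplicate. The sketch below is an assumption about how that selection could work, not a reconstruction of what was run that morning: the `Charge` struct and `charges_to_reverse` are hypothetical, the input is assumed to come from a single billing run in creation order, and the actual refund call to the gateway is left out because the talk does not name the gateway or its API.

```ruby
# Hypothetical charge record, as it might be listed from a payment gateway.
Charge = Struct.new(:id, :driver_id, :passenger_id, :amount_cents, keyword_init: true)

# Keep the first charge for each driver-passenger pair and return the rest,
# which are the duplicates that should be reversed.
def charges_to_reverse(charges)
  charges
    .group_by { |charge| [charge.driver_id, charge.passenger_id] }
    .flat_map { |_pair, pair_charges| pair_charges.drop(1) }
end

charges = [
  Charge.new(id: "ch_1", driver_id: "marcela", passenger_id: "claudia", amount_cents: 600),
  Charge.new(id: "ch_2", driver_id: "marcela", passenger_id: "claudia", amount_cents: 600),
  Charge.new(id: "ch_3", driver_id: "marcela", passenger_id: "claudia", amount_cents: 600),
  Charge.new(id: "ch_4", driver_id: "marcela", passenger_id: "ana", amount_cents: 250)
]

to_reverse = charges_to_reverse(charges)
refund_total_cents = to_reverse.sum(&:amount_cents)
```

Computing the refund total before issuing anything gives a number that can be checked against the gateway's own report.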

00:16:46.000 I still had to stop payments heading to drivers to avoid emptying the company's bank account.

00:17:07.540 Eventually, by 7:28 a.m., I managed to contain the problem. While it wasn’t fully fixed, people were no longer being charged.

00:17:46.000 More work was required, but thankfully, the situation improved. I started analyzing the queue and noted a pattern; many jobs had duplicate entries.

00:18:38.990 That meant we were putting too many jobs in the queue instead of the intended quantity.

00:19:12.120 It became apparent there were two main problems contributing to this issue. The first was a flaw in the system responsible for retrieving jobs from the database.

00:19:46.060 We should have only been adding valid jobs, yet we were inadvertently adding more than necessary.
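A guard against this first problem is to make the enqueue step itself refuse duplicates. The following is a minimal in-memory sketch, assuming a job is identified by driver, passenger, and billing week. `BillingQueue` and its method names are hypothetical, and a real system would more likely enforce this with a unique database index or a uniqueness feature of its job library.

```ruby
require "set"
require "date"

# In-memory stand-in for a job queue that accepts at most one job per
# driver-passenger pair and billing week.
class BillingQueue
  attr_reader :jobs

  def initialize
    @jobs = []
    @seen = Set.new
  end

  # Returns true if the job was enqueued, false if an identical one exists.
  def enqueue(driver_id:, passenger_id:, week_start:)
    key = [driver_id, passenger_id, week_start]
    return false unless @seen.add?(key) # Set#add? returns nil for a repeat

    @jobs << { driver_id: driver_id, passenger_id: passenger_id, week_start: week_start }
    true
  end
end

queue = BillingQueue.new
week = Date.new(2017, 1, 7) # arbitrary example date
first = queue.enqueue(driver_id: "marcela", passenger_id: "claudia", week_start: week)
second = queue.enqueue(driver_id: "marcela", passenger_id: "claudia", week_start: week)
```

Here the second call is rejected, so the queue still contains a single job for the pair.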

00:20:09.160 The other issue was related to how we used the payment gateway. We weren't verifying that we sent each driver-passenger payment only once a week.
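The second problem can be addressed with a check at charge time, so that a duplicate job that slips into the queue still cannot charge twice. This sketch assumes a key built from driver, passenger, and billing week. `ChargeLedger` and `charge_once` are hypothetical names, the ledger lives in memory only for illustration, and the block stands in for the real gateway request. Some payment gateways accept an idempotency key for this purpose, but the talk does not say which gateway was used.

```ruby
require "date"

# Records which driver-passenger-week combinations were already charged.
class ChargeLedger
  def initialize
    @charged = {}
  end

  # Runs the block (where the gateway call would go) at most once per key.
  # Returns :charged the first time and :skipped for any repeat.
  def charge_once(driver_id:, passenger_id:, week_start:)
    key = "#{driver_id}:#{passenger_id}:#{week_start.iso8601}"
    return :skipped if @charged.key?(key)

    @charged[key] = yield(key)
    :charged
  end
end

ledger = ChargeLedger.new
gateway_calls = 0

attempt = lambda do
  ledger.charge_once(driver_id: "marcela", passenger_id: "claudia",
                     week_start: Date.new(2017, 1, 7)) do |key|
    gateway_calls += 1 # the real gateway request would happen here
    "receipt-#{key}"
  end
end

first = attempt.call
second = attempt.call
```

In production the ledger would need to be durable and safe under concurrent workers, for example a database table with a unique constraint on the key.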

00:20:53.200 After deducing the root causes through investigation, I worked on creating tests to catch similar bugs. Working with customer care to provide clear communications about the issue was paramount.

00:21:47.380 At 10:50 p.m. that same day, I finally managed to deploy a fix in production, but this was not a small victory.

00:22:49.310 The billing process was not going to be executed for another week, so I felt exhausted, frustrated, and worried that I might get fired.

00:23:10.430 I felt terrible knowing that thousands of users were impacted, with one user being charged over five thousand dollars due to our mistake.

00:23:50.830 The stress of the entire situation left me feeling drained. During that chaotic night, I kept wondering how something like this could happen.

00:24:28.780 Then we realized the refunds would take up to five business days and that was unacceptable for many customers who were left without money.

00:25:02.910 Customer care worked overtime to expedite reimbursements where possible, managing distressed users who were understandably upset.

00:25:48.300 To circle back, I want to emphasize the importance of discussing failure at conferences. As developers, we will face embarrassing moments in our careers. It may have happened to you, or it could happen in the future.

00:26:33.580 We need to foster a culture where making mistakes is accepted, creating an environment free from judgement.

00:27:19.300 If such situations arise, it’s crucial to take your time to determine the root cause before deploying fixes.

00:28:02.530 Document what caused the problem and the solution. Conduct post-mortems not to blame but for collective learning.

00:28:54.800 It's important to remember that you are not your failures; as developers, we are human. Mistakes are part of life, and feelings of self-doubt will pass.

00:29:49.480 To end on a positive note, let us celebrate our failures. I encourage you to share your experiences on Twitter using the hashtag #IBrokeThings.

00:30:37.390 I'd also like to thank Cookpad for sponsoring my trip here. Check out our engineering blog, sourcediving.com, for interesting articles.

00:30:50.980 Thank you all for listening!

00:31:05.550 Are there any questions?

00:31:11.370 As part of the post-mortem, what measures did you take to prevent this from happening again?

00:31:31.400 We realized we didn't have a robust test case for the billing process. We implemented tests to validate the normal operation and made sure the QA team verified the results in the payment gateway before running the production process.

00:32:34.310 Having proper monitoring and alerts for abnormal charges is also essential. It's important that if we notice any unusual activity, we address it immediately.
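A check for abnormal charges could look something like the sketch below. It is an illustration under assumptions, not the monitoring the team actually built: `billing_alerts`, the hash keys, and the 20% tolerance are all hypothetical. It flags any driver-passenger pair charged more than once in a run, and a run total well above what the week's trips should add up to.

```ruby
# Returns a list of alert hashes for one billing run.
# charges: array of hashes with :driver_id, :passenger_id, :amount_cents.
# expected_total_cents: what the week's trips should add up to.
def billing_alerts(charges, expected_total_cents:, tolerance: 0.2)
  alerts = []

  # More than one charge for the same pair in a single run is suspicious.
  charges
    .group_by { |charge| [charge[:driver_id], charge[:passenger_id]] }
    .each do |pair, pair_charges|
      next if pair_charges.size <= 1

      alerts << { type: :duplicate_charge, pair: pair, count: pair_charges.size }
    end

  # The run total should stay close to the expected amount.
  total = charges.sum { |charge| charge[:amount_cents] }
  if total > expected_total_cents * (1 + tolerance)
    alerts << { type: :total_too_high, total_cents: total }
  end

  alerts
end

run = Array.new(3) do
  { driver_id: "marcela", passenger_id: "claudia", amount_cents: 600 }
end

alerts = billing_alerts(run, expected_total_cents: 600)
```

For this run both checks fire: the pair was charged three times, and the total is three times the expected amount.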

00:33:58.289 We also learned the importance of rigorous testing to catch these issues before they escalate.

00:34:29.789 Thank you for your questions! If you have more questions or stories to share, I'll be around for the next few days. I'm eager to hear about your experiences!

wroclove.rb 2017