The Overnight Failure

by Sebastian Sogamoso

In "The Overnight Failure," presented at RubyConf 2017, Sebastian Sogamoso shares a personal and relatable story about a significant failure he experienced at work - a programming bug that accidentally charged users hundreds or thousands of dollars instead of the intended small fee. The talk emphasizes the importance of addressing failures openly and learning from them, rather than hiding them away, which can contribute to feelings of imposter syndrome in the tech community.

Key points covered in the talk include:

Embracing Failure: Sogamoso highlights that everyone makes mistakes, and discussing these mess-ups publicly can foster a supportive environment. Many in the industry only share success stories, which can lead others to feel inadequate when things go wrong.
Understanding the Problem: He recounts how a routine process at his company, designed to facilitate carpooling payments, malfunctioned, resulting in multiple duplicate charges for users. The breakdown of this process escalated into what he calls "Black Saturday."
Crisis Response: Sogamoso describes the frantic discovery of errant transactions after he was alerted to the problem, and how attempts to reverse the charges resulted in further complications. He emphasizes the importance of calm, methodical problem-solving rather than rushing to implement quick fixes.
Lessons Learned: After resolving the immediate crisis, he discusses critical takeaways, advocating for documentation of failures, understanding root causes, and creating a no-blame environment that encourages learning from mistakes.
Community Support: The speaker encourages attendees to share their own failures, reinforcing the notion that failures are a shared experience that can strengthen the community.

The main takeaways from the talk stress that everyone will face failure in their careers, but it's how one reacts and learns from these experiences that defines growth. Sogamoso invites the audience to embrace their failures and build a culture of openness and learning in their organizations.

00:00:11.210 Okay, so hi everyone! I hope you're having a good time here. I know this can be a hard time of the day, and I'm the only thing between you and lunch after two days of talks. Last night was fun; we had an amazing time at Ruby karaoke, so I'll try to make this enjoyable for you.

00:00:26.640 This is actually my first time at RubyConf. It's also my first time in New Orleans, and I have really enjoyed my time here. But as Erin pointed out, so far the conference themes have been about death. That's a bit sad. I really don't think Ruby's dead, but just so we are all clear, let's check the internet because that's where you go to find the truth. Okay, yeah, Ruby's not dead yet. Thank you for building the website, Jason.

00:00:46.680 Now that we've got that out of the way, we can continue. Let me introduce myself. My name is Sebastian, and I work at a company called Coop. I like when people tweet at me during my talks, so that's my Twitter handle. I came all the way from Colombia, and if Netflix is everything you know about Colombia, you should definitely go and visit to see the real Colombia. In fact, I have the perfect excuse for you to go there: we have a Ruby conference there. The next one is going to be in September 2018, and we'll be announcing the exact date really soon. You can follow the conference on Twitter; it's going to be fun.

00:01:16.140 Lately, I've been trying to get into things different from programming, so I've been taking some cooking lessons to improve my culinary skills. When I was accepted to give this talk, I decided that I wanted to prepare some jambalaya because it's something I've always wanted to try. I went online, looked for a good recipe, and it looked nice, so I gathered all the ingredients, prepared everything, followed the steps, and this was the final result. As you can see, it was an epic fail! I definitely have a long way to go.

00:01:51.780 Before we continue, I really want to ask everyone a favor. I'm really happy to be here, so I would love to take a selfie with you. If you could all raise your arms, make some noise, I'll do this real quick. I promise! Yeah, thanks, you're great!

00:02:02.780 Awesome! Now let's get a bit more serious. The name of the talk is 'The Overnight Failure,' and this talk is based on a true story about something that happened to me, I'm pretty sure. Why do I want to talk about this? Why do I want to share a time I really messed up? There are a few reasons, and I'll share the most important ones with you.

00:02:23.370 First of all, we all have broken the internet at some point. We've all screwed up really badly and created bugs. If it hasn't happened to you yet, it will happen; believe me, it's just a matter of time. We typically don't talk about these things in public, like at conferences or meetups. We usually hear people talking about what they've learned, how they built something cool, or how they made a project successful. Mostly, we only hear about success stories, and I think people should really talk more about their failures in public.

00:02:54.900 Discussing screw-ups makes it easier for everyone. We usually just talk about this with people who are really close to us—our families, friends, or colleagues, because they're in the same boat as we are. By not speaking about our failures in public, I think we contribute to imposter syndrome, which is something terrible that many of us suffer from. Sometimes, companies do talk about this publicly, as when GitHub went down for several hours due to a mistake that took out a huge chunk of the internet.

00:03:29.250 I can only imagine how stressful the situation was, as I even remember how their status website showed everything was right. They had a red icon showing that they were down because of S3, so it's something that's easy to laugh about. It's fun when we look at other people's experiences, but if you really think about it, you can feel how stressful that situation was for the people dealing with it and how hard it was to maintain customer trust and regain it.

00:04:06.300 Although we don't all work at companies that have such a huge scale as GitHub or Amazon, if we create bugs, they still affect the lives of our users. It doesn't matter if it's millions or hundreds of users; it will always be stressful. That's the most important reason I want to give this talk and to discuss my screw-ups: we learn more from failure than from success.

00:04:35.070 When we fail, when everything goes wrong, or something unexpectedly fails, that's where we learn the most. I also want to share a little bit about what I learned. Before we move on and I tell you my story, I want you to take a few seconds to close your eyes, if you want, and think about the worst thing that could happen to you at work. Imagine what that could be like, the apocalypse for you.

00:05:30.160 So, keep that in mind, and I'll tell you what my overnight failure story was. How did it happen? How did the 'overnight failure' become a reality? I used to work at a company that had a product that allowed people to carpool. It was a mobile app that allowed users who drove to work every day to share rides.

00:06:19.480 For example, let's say there's Mary. Mary had a car that she used to commute every day. The app I helped build allowed Mary to find Anna, who didn't have a car and had a similar commute. The app allowed them to exchange money for rides. When they shared a ride to work, the app kept track of the trips they took together, and it even allowed them to be matched with someone else, like John, who also had a similar commute.

00:06:59.240 The money exchange worked like this: once a week, a procedure ran that charged the passengers and then transferred the money to the driver, in this case, Mary. To explain how this was technically implemented, once a week, a job would trigger, checking the database for every trip taken by users in the previous week. For each driver-passenger combination, a job was put on a queue to process the payments.

00:07:36.610 So, a lot of jobs were created each week. Each job contained the passenger-driver combination and the total amount the passenger should pay the driver. After that, payment was processed, and the passenger was charged using a payment gateway, while the driver was paid through our payment system.
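Editor's sketch: the weekly fan-out described above can be illustrated roughly as follows. The talk does not show the real code, so every name here (`Trip`, `PaymentJob`, `build_weekly_jobs`) is hypothetical, amounts are assumed to be stored in cents, and a plain array stands in for the job queue.

```ruby
# Hypothetical data shapes; the real schema is not shown in the talk.
Trip = Struct.new(:passenger, :driver, :amount_cents)
PaymentJob = Struct.new(:passenger, :driver, :total_cents)

# Build exactly one payment job per passenger-driver pair,
# summing the amounts of all trips that pair shared during the week.
def build_weekly_jobs(trips)
  grouped = trips.group_by { |trip| [trip.passenger, trip.driver] }
  grouped.map do |(passenger, driver), pair_trips|
    PaymentJob.new(passenger, driver, pair_trips.sum(&:amount_cents))
  end
end

# In-memory stand-in for the job queue.
queue = []

last_week_trips = [
  Trip.new("Anna", "Mary", 500),
  Trip.new("Anna", "Mary", 500),
  Trip.new("John", "Mary", 700)
]

build_weekly_jobs(last_week_trips).each { |job| queue << job }

queue.each do |job|
  puts "#{job.passenger} pays #{job.driver} #{job.total_cents} cents"
end
```

In this sketch each pair appears in the queue once, which is the behavior the weekly process was meant to have.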

00:08:27.600 Now that you kind of understand how it worked, I'm going to tell you exactly what happened. It was any given Saturday at 6:00 a.m. when the process triggered, just as it usually did. The process began running and marked the start of a day I like to call 'Black Saturday.' This had nothing to do with big discounts or a shopping spree; it was a really stressful day.

00:09:23.420 Let’s say a user went early to the store to buy some eggs for her family breakfast. When she tried to pay, her card was declined, and she thought, 'Wow, this is weird; this shouldn't happen.' This happened around 6:25 a.m. She went online to check her credit card account and saw a lot of strange charges made that same morning by the carpooling company I worked for.

00:09:50.110 That's step two before Black Saturday. It was bad, but then it got worse: more users were affected, and it wasn't just a few users; it was a lot of them. The customer care team started to receive an influx of complaints. By 6:35, many users had reported problems within a short amount of time.

00:10:24.820 At 6:34, I was still sleeping; I remember that I was always sleeping in on Saturdays at that time. I was happily in bed when my phone rang. It was a call from my boss, and this is how the conversation went: 'Hey, sorry to wake you up this early, but there’s an issue in production. A lot of customers are complaining about it.' I replied, 'Okay, sure, I’ll take a look right away.' I was really stressed but tried to act cool.

00:11:12.410 We ended the call, and then my boss said, 'Hey, I'm still on the line.' At this point, Black Saturday was looking pretty bleak. It wasn't even 7:00 a.m., and things were already looking bad, so I started to investigate what was happening. I checked the payment gateway and indeed saw a lot of duplicate charges.

00:11:42.880 At first, none of it made any sense, so I looked at our billing system. I discovered that the queue was still full of jobs, so I decided to stop processing jobs altogether, which would at least prevent more charges from being created. However, when I looked at the queue again, it had grown, and this was bad.

00:12:44.140 So then, I decided to stop the entire process, ensuring no new jobs went into the queue. My first thought was to refund all the charges that we shouldn’t have made, which sounded like a good idea at the moment. However, many of those charges had already triggered transfers to the drivers.

00:13:39.500 By 7:28, the problem was somewhat contained. But the day was still not looking good; the queue was full of jobs, and there were thousands still waiting to be processed. As I started looking into them, I noticed a pattern: many duplicate jobs, each with the same passenger, the same driver, and the same amount to be charged.
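Editor's sketch: one way to spot the pattern described above is to count how often each identical job appears in the queue. This is only an illustration; `QueuedJob` and `duplicate_counts` are hypothetical names, and the talk does not say how the queue was actually inspected.

```ruby
# Hypothetical job shape; Struct instances compare and hash by value,
# so identical jobs collapse onto the same hash key.
QueuedJob = Struct.new(:passenger, :driver, :amount_cents)

# Return a hash of job => number of copies, keeping only jobs
# that appear more than once.
def duplicate_counts(jobs)
  counts = Hash.new(0)
  jobs.each { |job| counts[job] += 1 }
  counts.select { |_job, count| count > 1 }
end

pending_jobs = [
  QueuedJob.new("Anna", "Mary", 5_000),
  QueuedJob.new("Anna", "Mary", 5_000),
  QueuedJob.new("Anna", "Mary", 5_000),
  QueuedJob.new("John", "Mary", 700)
]

duplicates = duplicate_counts(pending_jobs)
duplicates.each do |job, count|
  puts "#{job.passenger} -> #{job.driver} (#{job.amount_cents} cents): #{count} copies"
end
```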

00:14:19.840 The issue was clear: we were putting every possible passenger-driver pair into the queue thousands of times without checking if we had already charged that combination. After digging into the problem and understanding where things went wrong, I wrote tests, and when they failed, I fixed the code so that the tests would pass. This took a long time, and it was a long day; I felt really dumb and frustrated.
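Editor's sketch: the missing check described above amounts to making the charge idempotent, so that repeated jobs for the same pair and billing week charge only once. The talk does not show the actual fix; `WeeklyBiller`, the week label, and the in-memory `Set` are assumptions for illustration. A real system would more likely rely on something durable, such as a unique database constraint or an idempotency key supported by the payment gateway.

```ruby
require "set"

# Hypothetical biller that refuses to charge the same
# passenger-driver pair twice in the same billing week.
class WeeklyBiller
  attr_reader :charges

  def initialize
    @processed_keys = Set.new # in-memory stand-in for a durable record
    @charges = []
  end

  # Returns true if a charge was created, false if this pair
  # was already charged for the given week.
  def charge(passenger:, driver:, week:, amount_cents:)
    key = [passenger, driver, week]
    # Set#add? returns nil when the key is already present.
    return false unless @processed_keys.add?(key)

    @charges << { passenger: passenger, driver: driver, amount_cents: amount_cents }
    true
  end
end

biller = WeeklyBiller.new

# Simulate the same job being enqueued and processed 500 times.
500.times do
  biller.charge(passenger: "Anna", driver: "Mary", week: "2017-W45", amount_cents: 5_000)
end

total_cents = biller.charges.sum { |charge| charge[:amount_cents] }
puts "charges created: #{biller.charges.size}, total: #{total_cents} cents"
```

With the guard in place, the passenger in this sketch is charged fifty dollars once, no matter how many duplicate jobs reach the biller.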

00:15:17.990 As I was about to deploy the fix, I did the programmer's prayer that we're taught in Rails: 'Please work, please work, please work.' Fortunately, it did work. By 10:55 p.m., I started doing the obvious thing: looking for a new job.

00:16:04.510 So, what had happened was that there were charges that affected seven users. All users were charged a different amount, with the worst-case being one who was charged over five thousand dollars when we should only have charged that user fifty. Some users were charged up to five hundred times, and the worst part was that we maxed out a lot of credit cards and emptied many savings accounts.

00:16:58.850 As I mentioned, I initiated a refund for every extra charge, but the problem was that refunds take a long time, and people didn't have money in their checking accounts anymore. I'll tell you in a few minutes how we dealt with that. Something I was really glad about was that we were not on Ruby 3 at that time, because so many jobs were processed in such a short amount of time. I couldn't imagine what would have happened if we had been using that version; the problem would have been so much worse.

00:18:08.390 Refunds take up to five business days, which became unacceptable for a lot of our users. So what we had to do was reach out to all of them and offer an expedited option for reimbursement, like sending them checks or cash. We did whatever we could to figure out how to handle the refunds. It was really stressful; I spent at least the next week just gathering information about how badly we messed up.

00:19:28.510 So why do I want to tell you all of this? Why is this important? I want to go back to the why. Embarrassing failures will happen; they will eventually happen to all of us. You might think that tests will save you from this, but they won’t. Errors are made by humans, and bad things will occur eventually. We need to create an environment where admitting mistakes is fine, and where no one feels they have to blame others or make excuses for their mistakes.

00:20:38.700 We need to foster trust that we won't be judged. The safer people feel about admitting their mistakes, the more they'll learn about both the mistakes and their consequences, and the more we as a community will learn from it. So, when you're facing a situation like I did, my first recommendation is to ensure you fully understand what happened before doing anything.

00:21:15.280 It's easy to jump to quick conclusions and attempt the easiest fix, but sometimes that leads to making the problem worse, as I did with the refund process. My second recommendation is to move slowly. Don’t rush the problem; the damage has been done, and people have already noticed. Take your time to think about what you're really doing.

00:21:51.860 Document the problem: document what happened, document the root cause, document how you fixed it, and what you will do to prevent this from happening again. Documenting the lessons learned is crucial because that’s the only way we can genuinely prevent similar mistakes in the future. Lastly, don't assign blame. It's not about who wrote the buggy code; that's not important. What's important is that the process failed.

00:22:51.310 You are not your failures. Just because you screw up doesn't make you the worst developer ever. Everyone has messed up at some point; believe me. In the end, no one will care about the bugs you made when you die. Keep in mind that as bad as you might feel in the moment, what you're dealing with will pass, and eventually, you'll be able to look back at it with a laugh.

00:23:37.560 So let's celebrate failures, as they make us who we are. Come and talk to me; share your failures. I've heard a lot, and it's therapeutic! You can tweet using the conference hashtag. It would be great if we could share our war stories because as a community, we can help each other through these stressful moments in our careers. We can face it together. I work for a company called Coop, so if you like cooking and want to share new recipes, check us out! We're hiring, just like everyone else. If you want to work at a place where you can build an application used by millions worldwide, come talk to me. Thanks a lot!