Talks

Mistakes Were Made

Mistakes Were Made

by Jesse Spevack

In the talk "Mistakes Were Made" at RailsConf 2020, Jesse Spevack, a Senior Platform Engineer at Ibotta, reflects on several critical missteps he made while working on a project involving a cashback shopping system. The presentation serves as a public post-mortem to share lessons learned from significant errors intended to improve future development practices. Spevack outlines four major mistakes made during the project: the selection of inappropriate technology, siloing team members’ work, premature optimization, and introducing multiple changes simultaneously.

Key Points Discussed:

- Picking the Wrong Technology: The team's decision to use Kotlin and Akka, despite lacking proficiency, led to plumbing issues that hampered progress. A more familiar technology stack would have suited their needs better.

- Siloing Work: Tasking team members with distinct responsibilities resulted in a lack of shared knowledge about the systems, which became problematic when a key engineer left for another team.

- Premature Optimization: Spevack acknowledged the pitfalls of making optimizations before identifying actual bottlenecks, which created unnecessary complexity and made debugging difficult.

- Too Many Changes at Once: Altering both the input data and the processing algorithm simultaneously complicated the capability to effectively compare the outcomes from the old and new systems.

Throughout the talk, Spevack emphasizes that, despite these errors, the project ultimately delivered significant cost savings, amounting to approximately $1.5 million in annual operational expense savings. He acknowledges that failure can be a vital teacher and encourages others to learn from his mistakes to avoid similar pitfalls in their projects. In conclusion, Jesse Spevack advocates for open discussion about failures and stresses the value of teamwork and clear communication in software development.

Overall, this presentation highlights the importance of careful planning, ongoing communication, and a focus on shared understanding within teams to enhance project outcomes.

00:00:08.640 I want to thank the RailsConf team for selecting my talk for the memorable post-mortems track and for putting together this wonderful couch conference under very difficult circumstances. I can only imagine how much work went into the planning of the conference and even more for the replanning. So, a huge thanks to the Ruby Central team, you all are heroes, and I am very grateful. I also want to thank all of you out there for choosing to watch my talk. I hope you and your loved ones are in good health and safe.
00:00:17.460 My name is Jesse Spevack. I'm a Senior Engineer at Ibotta, a cashback app, and I am also a recovering educator and a father of twins. You can follow me on Twitter at @PlanetEfficacy for a mix of Ruby, political outrage, and fish content on the Internet. I use he/him pronouns.
00:00:42.750 I want to point out that there’s a reasonable degree of safety in my discussing mistakes during this year's RailsConf memorable post-mortems track. I am privileged to enjoy a presumption of competence; I don't have to worry about not being taken seriously after owning up to a mistake. Also, my mistakes do not reflect on others who share my race, gender, religion, and other identity markers. The goal of this presentation is to discuss four major mistakes I made on a project I worked on this year. I will be honest and vulnerable so that you can avoid these missteps. In the best-case scenario, you will walk away with some ideas on how to not get paged six times at 3:00 AM.
00:01:47.490 The four major mistakes I want to discuss are: picking the wrong technology, siloing work between team members, premature optimizations, and making too many changes to a system at once. All of these mistakes stemmed from good intentions, but in the end, they nearly doomed the project altogether. I made these mistakes while working on a project over the course of about six months for Ibotta, a cashback shopping app with millions of users, built in Denver, Colorado.
00:02:11.400 Over the past six years, Ibotta has awarded over $682 million to its users, which we call savers. We have had over 170 million offers redeemed on our platform since RailsConf last year, and currently, we have about 145 developers working with a majestic Rails monolith on Rails 5.1—soon to be updated to 5.2, God help me. In the last two years, we have been transitioning to a service architecture composed mostly of Java, Kotlin, and Node microservices. I have mixed feelings about this transition, but suffice it to say that given the scale and complexity of what we do, I'm incredibly proud to be a member of the Ibotta team.
00:02:54.910 The project I began about a year ago was to take one aspect of the financial domain of our Rails app and move it into a microservice. To give some context, brands pay Ibotta to run coupons or offers in our iPhone and Android apps, which pay cash to users in return for their purchases. I work in a domain that aims to make smart predictions about when we should remove offers from the app, ensuring we don't exceed budgets set in contracts with our brand clients.
00:03:45.050 A year ago, we had a system in place called the Real-Time Expiration Service, which had two main purposes. First, it tracked budget usage for all content in our application using Redis. Second, through a scheduled job, pertinent data was read from Redis and enriched with additional information stored in our database. For example, the job would iterate over each contract, calculate an expected date of budget exhaustion, and adjust expiration dates on our coupons accordingly.
00:04:29.960 We decided to move to a new system, which we dubbed the Real-Time Expiration Service Version 2, from our monolith to improve performance and decrease coupling, allowing us to add functionality more flexibly in the future. We were particularly interested in iterating on smarter and more complicated prediction algorithms. Additionally, we aimed to eliminate a costly dependency we hoped to remove from the monolith entirely, thus decreasing our monthly AWS bill.
00:05:02.059 Now let's delve into when our story began. We had a great opportunity to make a significant impact on Ibotta's bottom line. To accelerate development, we wanted to staff up the team and hired a new engineer with experience in high-volume event tracking. We were excited to bring her on board as quickly as possible.
00:05:52.310 As we began to scope out the project, our team started discussing which technology might be the best fit for the problem space. Our new co-worker insisted this was a perfect use case for Scala, her favorite language, along with Akka, a framework for building concurrent distributed systems. I remember sitting in the planning meeting with my team, our manager, and members of the architecture team, from whom we needed to secure buy-in for introducing new technology in this critical system.
00:06:29.440 One of the senior architects asked a lot of tough questions, and our new teammate did not shy away from her conviction. The message I walked away with was: you're making a bed and you're gonna have to lie in it. We could use Akka, but we had to accept responsibility for this decision, along with the understanding that we would not benefit from significant institutional knowledge or experience. This brings me to the first best intention that did not lead to a successful project execution.
00:07:20.970 We had good intentions in selecting technology we thought could solve our problem space, but instead of fully discussing it, we attempted to compromise with the engineering organization by settling on a language and framework. We went with Kotlin, a JVM language with several existing services in production written in it, including our entire Android app. We also decided to use Akka, which didn’t turn out to be a mistake due to any fundamental flaw, because I actually really like Kotlin and have a good understanding of Akka.
00:08:03.050 So, what went wrong? Picking these technologies was a mistake not because they weren't suitable for solving the problem, but because no one on the team had written production-grade Kotlin code, and only one of us in the entire company had any experience with Akka. The technology was a bad fit for our company and our team, and while it’s acceptable to use new technologies for a proof-of-concept, our system—if we got it wrong—could cost millions of dollars. We ultimately picked the wrong technology and, as a result, faced many plumbing battles that slowed our progress.
00:09:08.800 In retrospect, we should have used a more conventional stack for our company, such as Java, Spring, and Camel. If I could have my way, Ruby and Rails, or maybe Sinatra, would have been ideal. Sometimes it makes sense to gamble with a new technology, and I appreciate the advice that software developers should pick up a new language every year. However, this project was too important for that, and choosing tools that didn't fit well led to my next mistake.
00:09:50.410 In my experience, teams of developers flourish when they swarm on a common problem within a domain. In such cases, work moves quickly, and knowledge is shared effortlessly. Unfortunately, this was not the case for Real-Time Expiration Service Version 2. My second mistake was siloing the work. While none of us were experts in Kotlin initially, my teammate had extensive experience with Akka, and I had considerable experience with our Rails monolith. This was not the only project we were focused on, and to get the second version delivered, we decided to silo the work.
00:10:55.830 My teammate took on most of the Akka and Kotlin stories while I managed the integration with the Rails monolith. Our intentions were to speed up the project, but that turned out to be a significant mistake. While in the short term it did accelerate our development, this lack of shared understanding of both systems ultimately slowed us down over the project's lifetime. When it came time for me to modify code written by my teammate, I was completely lost, and we missed the opportunity for me to demonstrate my domain knowledge regarding our finance system. What started as a manageable risk turned into a serious problem a few months into the project when my teammate joined another team.
00:12:16.010 This is the real issue with siloed knowledge: teams change, and people move on. In my case, my teammate just moved to a new area in the office, but any system dependent on the knowledge of a single software developer is prone to failure. In hindsight, the right move would have been to slow down and pair on the work until we felt comfortable in each other's domains. From there, we could have worked faster over the duration of the project.
00:13:03.840 While siloing work, we also made the additional mistake of prematurely optimizing various components of our system. Premature optimization can be an easy trap to fall into. It feels great to think about the most efficient way to process data or to envision our system at ten or a hundred times its anticipated scale. However, as articulated in The Pragmatic Programmer by Andy Hunt and Dave Thomas, one should only optimize a piece of code when it is clear it has become a bottleneck.
00:14:00.830 Unfortunately, I read this book six months into this project, which means I had already undertaken six months of premature optimization. We began to prematurely optimize across both ends of our system, thinking several steps ahead of where we actually were or needed to be. On the Rails side, I hoped to prevent unnecessary database trips. Having encountered this issue previously, I thought that implementing some simple caching would ensure our database would not be overwhelmed once we activated the service.
00:15:00.880 Oh, how naive I was! By caching before it was genuinely necessary, I made it almost impossible to debug the issues we experienced as we began to roll out the system. This was particularly frustrating because we initially activated the system for only a single set of offers under one contract; at that traffic level, there would be absolutely no threat to the database, so my caching became redundant.
00:15:51.170 Thus, while running with just a fraction of the expected traffic, our cache didn't aid our database, instead complicating the debugging of initial issues on the Akka side. We built a system capable of processing about 10,000 times more volume than necessary. My team now jokingly says we constructed an F1 race car when all we needed was a wheelbarrow. We also ended up planning for the distant future while neglecting to deliver incremental value.
00:17:03.950 We started asking about future features we sought to incorporate and began building those out before establishing a solid foundation. For example, we anticipated handing off prediction algorithm logic to our analytics team through an Amazon Lambda function or machine learning call. Instead of waiting for more clarity or having our service up and running, we started creating the data those anticipated models might require.
00:17:41.200 Our intentions were noble; we wanted to envision the future where volume, performance, and complexity would be significantly greater than what we immediately needed. Although premature optimization can serve as an enjoyable engineering challenge, I learned it can lead to feelings of frustration when I realized I had invested significant time and energy in solving an imaginary problem.
00:18:00.020 After realizing I was busy addressing imaginary problems, I discovered that I created much added complexity that turned into a very real challenge. In hindsight, I learned to prioritize solving immediate problems. Whenever I catch myself straying in focus, I remind myself that some issues can wait for the future.
00:18:42.320 To recap, we picked the wrong technology for our team and organization, siloed work, and prematurely optimized whenever possible. Additionally, we made perhaps our most significant mistake by introducing too many changes at once. This brings me to the final lesson: while it can sometimes seem harmless to change many components, doing so can complicate our verification process.
00:19:20.700 The Real-Time Expiration Service Version 2 depended on new input data and new processing procedures. We had it running in various dry modes, allowing us to compare the results of Version 1 and Version 2. The difficulty, however, lay in changing both the input data and the processing algorithm simultaneously.
00:20:02.750 This approach led to an apples-to-oranges comparison. When transitioning a part of a monolith into a microservice, the key question is how to ensure the microservice is functioning as expected when it doesn't need to replicate the exact functionality it's replacing. Our intention was to create a more reliable and trusted data source while improving our processing accuracy, but the changes being made simultaneously made it challenging to build confidence in our system.
00:20:59.740 One thing I’ve noted when working with highly experienced engineers, like Justin Hart—who was one of the first engineers at Ibotta—is that they tend to make small, incremental changes to their systems. They verify that those changes yield the intended results before proceeding. In contrast, we should have opted to replicate the smallest possible unit of business value, validating it prior to implementing further changes.
00:21:52.750 In a controlled manner, we should have started with input data and compared it to the old data, then switched the old service over to the new data before verifying the function. Only once we were assured of the data's quality should we have moved on to refining the prediction algorithm.
00:22:58.720 So, what’s the conclusion? I learned valuable lessons through making these four big mistakes. I appreciate the concept of memorable post-mortems at RailsConf; failure acts as constructive feedback and an excellent teacher. Thanks to the insights gained during this project, I'm a much better developer today.
00:23:39.850 However, my story doesn’t culminate in a traditional post-mortem meeting or document. Even after making those four major missteps, we still managed to incorporate some practices that enabled us to deliver the project successfully, impacting an estimated $1.5 million in annual operational savings.
00:24:15.820 We did a good job communicating our progress to stakeholders and presented our work with varying degrees of technical specificity to our internal audiences. Importantly, we didn’t postpone discussing our post-mortem or retrospective until the project's delivery; we tackled mistakes as a team and actively worked to mitigate them. I participated in various online courses and organized a baby’s first Kotlin study group within our engineering team.
00:25:14.300 I invited a fellow engineer who had been deep into Kotlin for the past year to perform a code review with me on a feature I was developing. I removed my unnecessarily complex caching system, stopped our tendency to plan for distant futures, confront imaginary problems, and shifted focus to the immediate goal of migrating 100% of the relevant traffic from the monolith to the microservice.
00:26:04.020 In conclusion, that's why I felt it was important to deliver this talk. I aimed to share my lessons learned from making mistakes related to technology selection, work siloing, premature optimization, and implementing too many changes at once. Hopefully, by doing this, I can help you avoid making those same errors.
00:26:30.080 I feel incredibly fortunate to be a part of an outstanding engineering team at Ibotta. My journey into coding as a career change came through attending the Turing School of Software and Design in Denver, Colorado. I am grateful to this Rails community and want to express my sincere thanks to everyone for watching my talk. If you have any questions, feel free to reach out. Stay safe, and I truly appreciate your time. Thank you.