Services, Scale, Backgrounding and WTF is going on here?!??!

by David Copeland

In the talk titled 'Services, Scale, Backgrounding and What is Going on Here?', David Copeland shares insights from his experience transitioning a monolithic Rails application at LivingSocial into a service-oriented architecture. The discussion outlines the complexities encountered during this journey and offers practical solutions to common challenges faced by developers.

Key Points Discussed:
- Transitioning to Service-Oriented Architecture: The shift from a monolithic application to smaller services can be challenging, requiring developers to make reasonable decisions amid unexpected complications.
- The Importance of Background Processing: When users sign up, the synchronous task of sending confirmation emails can cause bottlenecks under high load. By implementing background processing with tools like Resque and Redis, email notifications can be sent asynchronously, improving user experience.
- Database Transactions for Consistency: To prevent inconsistency in user states, the implementation of database transactions ensures that both user account creation and email notifications either succeed or fail together, thus avoiding partial failures.
- Handling Edge Cases and Technical Debt: As new functionalities accumulate, issues such as race conditions can arise. To address these, employing retry mechanisms and ensuring proper job processing can help maintain system stability.
- Growth and Complexity Management: With company acquisitions leading to integration challenges, adapting existing logic to handle bulk user activations while preventing technical debt is crucial.
- Need for Continuous Improvement: After the email service ran into problems that caused duplicate emails, Copeland emphasizes the need for centralized logging and robust testing protocols to improve reliability.
- Best Practices and Future Directions: Concluding with a call to prioritize clean architecture, monitor background processes, and maintain technical agility, he highlights the importance of being proactive in design to mitigate future frustrations.

The key takeaway is the significance of understanding and carefully managing the complexities of service architecture transitions to enhance performance and user experience.

00:00:02.700 Welcome everyone! I'm David Copeland, and it's my pleasure to speak with you today.

00:00:08.580 This is my second year here; last year I gave a fantastic talk about test-driven development of command line applications in Ruby.

00:00:12.900 For those who don't know, I wrote a book on that topic, published with Prag Prog (the Pragmatic Programmers). Currently, I work at LivingSocial, and today I'm going to share insights on service-oriented architecture from my experience there.

00:00:23.520 So, let's dive in!

00:00:25.199 The title of my talk today is 'Services, Scale, Backgrounding and What is Going on Here?' This is essentially a story about reasonable developers making reasonable decisions.

00:00:32.359 They make these reasonable decisions while transitioning from a monolithic Rails application to a more scalable and manageable architecture that accommodates larger teams and more complex functionalities.

00:00:53.700 Despite everyone making these reasonable decisions, things will go wrong and strange occurrences will happen. I'll discuss how to deal with those situations based on my personal experience at LivingSocial.

00:01:10.020 At LivingSocial, I work on an application that we call 'Payments,' which processes every transaction on the site. This means it has to work effectively and be fault-tolerant.

00:01:16.200 Originally, this application was extracted from a monolithic Rails app and had to shift from running synchronously to functioning asynchronously across various processes and services. As a result, numerous unexpected issues surfaced.

00:01:25.020 Let's start with a hypothetical story about our business. Imagine we are trying to get users to sign up and purchase from us. The first requirement is a controller that allows users to sign up.

00:01:35.879 In our hypothetical scenario, we want users to validate their email by clicking on a link sent in an email after signing up. Thus, we have a fairly standard Rails controller that includes a User Mailer.

00:02:10.080 This system was easy to write and test, and once deployed, we found ourselves having a great influx of users signing up. However, over time we realized that problems began to arise.

00:02:22.860 The workflow for this controller involves submitting user information, saving it to the database, and sending the confirmation email. This remains synchronous, which means the user is stuck waiting for the email to send before moving on.

00:02:44.340 This results in a poor user experience, particularly when under high load. For instance, if a tech site like TechCrunch were to feature us, and a significant number of people attempted to sign up, the email confirmations could cause a bottleneck.

00:03:04.380 While it may seem logical to send emails synchronously to guarantee that everything works, they can actually be sent shortly after a user signs up, without affecting their immediate experience.

00:03:17.160 With our current implementation, we cannot allocate resources effectively to ensure an optimal user experience.

00:03:31.680 To solve this, we can utilize background processing for tasks that do not need to happen inline with user actions, like sending an email.

00:03:45.600 We chose to use 'Resque' for this purpose. By making a simple change to our mailer, we can create a job that sends the email in the background, essentially moving our synchronous task to an asynchronous background process using a Redis job queue.

00:04:15.120 This is a quick operation, significantly faster than waiting for the email to send, and allows our application to be focused on delivering a seamless user experience.
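The enqueue-then-return pattern described here can be sketched in plain Ruby. This is a minimal illustration, not the code from the talk: `InMemoryQueue` is a hypothetical stand-in for the Redis-backed Resque queue, and `SendConfirmationEmailJob` and `sign_up` are assumed names. The job is shaped like a Resque job (a class with a class-level `perform`), but nothing here calls the real Resque library.

```ruby
# Hypothetical stand-in for a Redis-backed job queue: a plain in-memory array.
class InMemoryQueue
  def initialize
    @jobs = []
  end

  # Record the job class and its arguments, then return immediately.
  def enqueue(job_class, *args)
    @jobs << [job_class, args]
    true
  end

  def size
    @jobs.size
  end

  # What a separate worker process would do: take each job and run it.
  def work_off
    until @jobs.empty?
      job_class, args = @jobs.shift
      job_class.perform(*args)
    end
  end
end

# Hypothetical job, shaped like a Resque job (class-level perform).
class SendConfirmationEmailJob
  SENT = []

  def self.perform(user_id)
    # A real job would load the user and deliver the email here.
    SENT << user_id
  end
end

QUEUE = InMemoryQueue.new

# The "controller action": enqueue the email and return without sending it.
def sign_up(user_id)
  QUEUE.enqueue(SendConfirmationEmailJob, user_id)
  :signed_up
end

sign_up(42)    # the request finishes here; no email has been sent yet
QUEUE.work_off # later, the worker delivers the email
```

The point of the design is that the request only pays for a queue write; the slow delivery happens in a worker that can be scaled separately from the web processes.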

00:04:45.000 Months down the line, we had configured our application to alert us whenever something went wrong. This is essential, because unexpected issues do arise, such as the Redis timeout that we encountered.

00:05:14.700 In this scenario, a user's account is created in the database, but they never receive the email they need to activate their account, resulting in a frustrating experience.

00:05:36.300 Worse yet, we have a unique constraint on the email field in our database, meaning that if they try to sign up again using the same email, they will receive an error.

00:05:53.640 Despite our good intentions in implementing unique email constraints to prevent duplicate accounts, they've created an unfortunate situation where valid user accounts are rendered useless.

00:06:12.840 To resolve this, we need to improve our approach to handling user activations. Ideally, we’d want both the user account to be created, and the email sent successfully, but if that cannot happen, we prefer neither to occur.

00:06:34.740 So, we turn to database transactions in our application controller. This way, if the account creation fails or if the email queuing fails, the entire operation will be rolled back.

00:06:57.600 In essence, we ensure that our system is not left in an inconsistent state due to failures.
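A minimal sketch of this all-or-nothing behavior, assuming a hypothetical in-memory `FakeDatabase` in place of the real database and a lambda in place of the queue. `QueueTimeout` and `sign_up` are illustrative names, not taken from the talk.

```ruby
# Hypothetical in-memory "database" with all-or-nothing transactions.
class FakeDatabase
  attr_reader :users

  def initialize
    @users = []
  end

  # Run the block; if it raises, restore the previous state and re-raise.
  def transaction
    snapshot = @users.dup
    yield
  rescue StandardError
    @users = snapshot
    raise
  end
end

# Stands in for a Redis timeout while queuing the email job.
class QueueTimeout < StandardError; end

# Hypothetical sign-up: create the user and enqueue the email in one transaction.
def sign_up(db, email, enqueue)
  db.transaction do
    db.users << email
    enqueue.call(email) # may raise, which rolls back the insert above
  end
  true
rescue QueueTimeout
  false
end

db = FakeDatabase.new
healthy_queue = ->(_email) { :queued }
broken_queue  = ->(_email) { raise QueueTimeout, "redis timed out" }

sign_up(db, "a@example.com", healthy_queue) # user saved, email queued
sign_up(db, "b@example.com", broken_queue)  # queuing fails, user rolled back
```

After both calls, only the first address is stored, so the second user can simply sign up again without hitting the unique constraint.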

00:07:36.840 Over time, we see fewer occurrences of timeouts, resulting in a lower probability of bad data states within our system, which helps us maintain reliable user experience.

00:08:01.530 However, unexpected issues do occasionally arise. If a job fails due to a more complex error, we can inspect the failed job queue to evaluate the situation and retry processing.

00:08:32.640 This helps us uncover edge cases that we may not have considered. For instance, we discovered a race condition between our code that creates users and the background worker that handles job processing.

00:09:16.320 In this situation, when the job was fired, it tried to fetch the user from the database, but the user row was not yet visible because the transaction that created it had not committed.

00:09:38.880 Thus, we theorized that the worker could pick up the job before the transaction committed, so it went looking for a user that, from its point of view, did not exist yet.

00:10:05.040 One potential fix could be to implement a retry mechanism, giving the process a moment before trying to access the user record again.

00:10:30.960 After incorporating this retry strategy, we were able to mitigate some of the more complicated issues.
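The retry idea can be sketched as follows. This is an illustration under stated assumptions: `with_retries`, `UserNotFound`, and `find_user` are hypothetical names, and the lookup is simulated so that the user only becomes visible on the second attempt, as if the creating transaction committed in between.

```ruby
class UserNotFound < StandardError; end

# Hypothetical helper: run the block, retrying on UserNotFound up to
# max_attempts times with a short pause between attempts.
def with_retries(max_attempts: 3, delay: 0.01)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue UserNotFound
    raise if attempts >= max_attempts # give up and let the job fail
    sleep(delay)
    retry
  end
end

# Simulated lookup: the row is only visible from the second attempt onward.
def find_user(id, attempt)
  raise UserNotFound, "user #{id} not visible yet" if attempt < 2
  { id: id, email: "user#{id}@example.com" }
end

user = with_retries { |attempt| find_user(42, attempt) }
```

The retry limit matters: a user that genuinely does not exist should still end up in the failed job queue rather than being retried forever.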

00:11:12.840 Fast forwarding a few months, our company's success led us to acquire a competitor, which brought its own complexities.

00:11:32.460 Their system did not require email validation, so now we needed to ensure that all of their users were integrated seamlessly into our system.

00:11:54.240 We decided to use the same business logic that we had for validation, but apply it in bulk to the competitor's user database.

00:12:22.920 This meant creating user records and sending each an activation email correctly without requiring them to sign up anew.

00:12:56.700 Additionally, we utilized the after_create Active Record hook to trigger this functionality after each user creation but unfortunately this introduced its own problems.

00:13:37.440 As our team grew and the functionality developed, we ended up modifying many aspects of the system, which led to an accumulation of technical debt.

00:14:02.460 This meant that our once clean controller now contained significantly more logic, becoming cluttered as we added functionality.

00:14:34.800 Despite these challenges, there were moments of excitement and enthusiasm as we tackled issues one after the other in our thriving organization.

00:14:59.520 We made it a point to educate ourselves on best practices and restructuring as we faced the need to simplify processes and make things better aligned with our operations.

00:15:28.080 We grew as a team, but would later realize that this alone wouldn't be enough to prevent future breakdowns.

00:15:43.920 Eventually, as complexity continued to rise, we recognized that our mailer service needed a complete overhaul.

00:15:59.500 The new mail service was designed to be a drop-in replacement for legacy mailers, but there was a challenge with integrating this new system seamlessly.

00:16:36.420 Despite the ease of introduction, it required extensive testing to ensure that everything functioned correctly and met the needs of each team using it.

00:17:00.120 We soon realized that the implementation needed better features, such as ensuring that the same email was not sent twice and allowing clients to query an email's status.

00:17:35.640 As we faced unforeseen events during implementation, we further scrutinized our approaches, discovering gaps that could lead to inconsistencies.

00:18:10.560 In one case, a partial event failure resulted in the user receiving multiple emails, which caused confusion and frustrated our team.

00:18:48.240 What became clear after this ordeal was the need to reassess how we design our systems to handle mail events seamlessly while preventing duplicates.

00:19:26.910 More focus needed to be placed on clean architecture that promotes better data handling and prevents the possibility of broken states occurring.

00:20:00.060 In hindsight, a centralized logging mechanism, robust testing protocols, and the transition to an asynchronous model should allow for better tracking and resolution of these complexities.

00:20:51.740 To summarize, when integrating services into your system, make actions idempotent, watch for errors so they can be fixed quickly, and keep solid logs for auditing so you can assess what went wrong and how.
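One way to make email sending idempotent is to have the caller pass a key that identifies the logical email, so a retried job cannot produce a second copy. The sketch below is an assumption about how that could look, not the mail service from the talk: `IdempotentMailer`, its `deliver` method, and the key format are all hypothetical, and the set of seen keys lives in memory rather than in a shared store.

```ruby
require "set"

# Hypothetical mail service that ignores repeat requests carrying a key it
# has already seen.
class IdempotentMailer
  attr_reader :delivered

  def initialize
    @seen_keys = Set.new
    @delivered = []
  end

  def deliver(idempotency_key, to:, subject:)
    # Set#add? returns nil when the key was already present.
    return :duplicate unless @seen_keys.add?(idempotency_key)

    @delivered << { to: to, subject: subject }
    :sent
  end
end

mailer = IdempotentMailer.new
key = "activation:user-42"
mailer.deliver(key, to: "a@example.com", subject: "Activate your account") # first attempt
mailer.deliver(key, to: "a@example.com", subject: "Activate your account") # retried job, ignored
```

With this in place a job that fails partway through can be retried safely, because repeating the request has the same effect as making it once.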

00:21:28.680 Looking ahead, we’ve set our sights on implementing these strategies for an overall improvement in performance, user experience, and system reliability.

00:22:12.600 I encourage everyone engaged in this process to carefully consider their architecture when transitioning services, monitor the background processing, ensure better email handling, and prioritize technical agility for future developments.

00:22:57.920 In conclusion, being proactive with your system architecture while understanding potential pitfalls can aid in avoiding frustration down the line.

00:23:35.160 That's all I've got for you today. Thank you for your time, and I'd love to take any questions you may have!

00:24:00.360 Great! We have some time for questions.

00:25:00.000 I had thought that I would use up all my time to avoid questions, but it looks like we've got some time left.

00:25:35.000 For the problem concerning the Resque job that needed retries, did you consider using the after_commit hooks that are available in newer versions of Rails?

00:26:00.000 In some cases, the after_commit strategy would solve those timing issues more effectively; however, since portions of our app run on older Rails versions, transitioning was not straightforward.
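Why an after-commit hook avoids the race can be shown with a small simulation. This is not the Rails API: `TinyStore` and its `after_commit` method are hypothetical stand-ins that only illustrate the ordering, namely that the callback (where the job would be enqueued) runs once the row is already visible to other readers.

```ruby
# Hypothetical in-memory store that runs callbacks only after a commit.
class TinyStore
  attr_reader :committed

  def initialize
    @committed = []
    @pending = nil
    @callbacks = nil
  end

  def transaction
    pending = []
    callbacks = []
    @pending = pending
    @callbacks = callbacks
    yield
    @committed.concat(pending) # commit: rows become visible to other readers
    callbacks.each(&:call)     # only now do the after-commit callbacks run
  ensure
    # On failure nothing was committed and no callback ran.
    @pending = nil
    @callbacks = nil
  end

  def insert(row)
    @pending << row
  end

  def after_commit(&block)
    @callbacks << block
  end
end

store = TinyStore.new
visible_to_worker = []

store.transaction do
  store.insert("user-42")
  # Stands in for enqueuing the email job: record whether a worker would
  # find the user at the moment the job is queued.
  store.after_commit { visible_to_worker << store.committed.include?("user-42") }
end
```

Because the callback is deferred until after the commit, the worker always finds the user, and the retry loop becomes a safety net rather than the main fix.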

00:26:58.740 I appreciate your query, and I'm happy to provide further insights into any other questions you may have.

00:27:30.440 Thank you all once again for participating, and I look forward to connecting with many of you after!