00:00:02.700
Welcome everyone! I'm David Copeland, and it's my pleasure to speak with you today.
00:00:08.580
This is my second year here; last year I gave a fantastic talk about test-driven development of command line applications in Ruby.
00:00:12.900
For those who don't know, I wrote a book on that topic, published with the Pragmatic Programmers. Currently, I'm working at LivingSocial, and today I'm going to share insights on service-oriented architecture.
00:00:23.520
So, let's dive in!
00:00:25.199
The title of my talk today is 'Services, Scale, Backgrounding and What is Going on Here?' This is essentially a story about reasonable developers making reasonable decisions.
00:00:32.359
They make these reasonable decisions while transitioning from a monolithic Rails application to a more scalable and manageable architecture that accommodates larger teams and more complex functionalities.
00:00:53.700
Despite everyone making these reasonable decisions, things will go wrong and strange occurrences will happen. I'll discuss how to deal with those situations based on my personal experience at LivingSocial.
00:01:10.020
At LivingSocial, I work on an application that we call 'Payments,' which processes every transaction on the site. This means it has to work effectively and be fault-tolerant.
00:01:16.200
Originally, this application was extracted from a monolithic Rails app and had to shift from running synchronously to functioning asynchronously across various processes and services. As a result, numerous unexpected issues surfaced.
00:01:25.020
Let's start with a hypothetical story about our business. Imagine we are trying to get users to sign up and purchase from us. The first requirement is a controller that allows users to sign up.
00:01:35.879
In our hypothetical scenario, we want users to validate their email by clicking on a link sent in an email after signing up. Thus, we have a fairly standard Rails controller that includes a User Mailer.
00:02:10.080
This system was easy to write and test, and once deployed, we found ourselves having a great influx of users signing up. However, over time we realized that problems began to arise.
00:02:22.860
The workflow for this controller involves submitting user information, saving it to the database, and sending the confirmation email. All of this happens synchronously, which means the user is stuck waiting for the email to send before moving on.
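The synchronous flow described here can be sketched in plain Ruby. This is only an illustration, not the code from the talk: `UserStore`, `SlowMailer`, and `sign_up` are hypothetical stand-ins for the Rails model, the mailer, and the controller action, and the `sleep` stands in for a blocking mail delivery.

```ruby
# Plain-Ruby stand-ins for the Rails pieces; every name here is hypothetical.
User = Struct.new(:email, :name)

class UserStore
  def initialize
    @users = {}
  end

  # Mimics a save guarded by a unique constraint on email.
  def save(user)
    raise ArgumentError, "email already taken" if @users.key?(user.email)
    @users[user.email] = user
  end

  def count
    @users.size
  end
end

class SlowMailer
  attr_reader :sent

  def initialize(delay: 0.05)
    @delay = delay
    @sent = []
  end

  # Stands in for a blocking mail delivery.
  def deliver_confirmation(user)
    sleep @delay
    @sent << user.email
  end
end

# The "controller action": nothing is returned until the mail has been sent.
def sign_up(store, mailer, email:, name:)
  user = User.new(email, name)
  store.save(user)
  mailer.deliver_confirmation(user)
  :redirect_to_welcome
end

store  = UserStore.new
mailer = SlowMailer.new
started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
result  = sign_up(store, mailer, email: "pat@example.com", name: "Pat")
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
```

The measured `elapsed` time includes the full mail delay, which is the wait the user experiences on every signup.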
00:02:44.340
This results in a poor user experience, particularly when under high load. For instance, if a tech site like TechCrunch were to feature us, and a significant number of people attempted to sign up, the email confirmations could cause a bottleneck.
00:03:04.380
While it may seem logical to send emails synchronously to guarantee that everything works, they can actually be sent shortly after a user signs up, without affecting their immediate experience.
00:03:17.160
With our current implementation, we cannot allocate resources effectively to ensure an optimal user experience.
00:03:31.680
To solve this, we can utilize background processing for tasks that do not need to happen inline with user actions, like sending an email.
00:03:45.600
We chose to use 'Resque' for this purpose. By making a simple change to our mailer, we can create a job that sends the email in the background, essentially moving our synchronous task to an asynchronous background process using a Redis job queue.
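A rough sketch of that change, assuming a job class with a class-level `perform` method in the style Resque uses. The queue below is an in-memory stand-in rather than Resque or Redis, and `FakeQueue`, `ConfirmationEmailJob`, `USERS`, and `sign_up` are hypothetical names chosen for illustration.

```ruby
# In-memory stand-in for a Redis-backed job queue; this is not Resque itself.
class FakeQueue
  def initialize
    @jobs = []
  end

  # Store the job class and plain-data arguments, as a job queue would.
  def enqueue(job_class, *args)
    @jobs << [job_class, args]
    true
  end

  def size
    @jobs.size
  end

  # What a separate worker process would do: take jobs off and run them.
  def work_off
    until @jobs.empty?
      job_class, args = @jobs.shift
      job_class.perform(*args)
    end
  end
end

class ConfirmationEmailJob
  SENT = []

  # The job receives an id rather than an object and looks the user up itself.
  def self.perform(user_id)
    user = USERS.fetch(user_id)
    SENT << user[:email]
  end
end

USERS = {}
QUEUE = FakeQueue.new

def sign_up(email)
  id = USERS.size + 1
  USERS[id] = { email: email }
  QUEUE.enqueue(ConfirmationEmailJob, id) # fast: only writes to the queue
  id
end

sign_up("pat@example.com")
sent_before_worker = ConfirmationEmailJob::SENT.dup
QUEUE.work_off
```

The signup returns as soon as the job is queued; the email itself goes out only when a worker later processes the queue.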
00:04:15.120
Queuing the job is a quick operation, significantly faster than waiting for the email to send, and it lets our application focus on delivering a seamless user experience.
00:04:45.000
Months down the line, we configured our application to send us alerts whenever something went wrong. This is essential, because unexpected issues can arise, such as the Redis timeout that we encountered.
00:05:14.700
In this scenario, a user's account is created in the database, but they never receive the email they need to activate their account, resulting in a frustrating experience.
00:05:36.300
Worse yet, we have a unique constraint on the email field in our database—meaning if they try to sign up again using the same email, they will receive an error.
00:05:53.640
Despite our good intentions in implementing unique email constraints to prevent duplicate accounts, they've created an unfortunate situation where valid user accounts are rendered useless.
00:06:12.840
To resolve this, we need to improve our approach to handling user activations. Ideally, we’d want both the user account to be created, and the email sent successfully, but if that cannot happen, we prefer neither to occur.
00:06:34.740
So, we turn to database transactions in our application controller. This way, if the account creation fails or if the email queuing fails, the entire operation will be rolled back.
00:06:57.600
In essence, we ensure that our system is not left in an inconsistent state due to failures.
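The all-or-nothing behavior can be illustrated with a small in-memory model. In a Rails app the wrapping would typically be done with `ActiveRecord::Base.transaction`; the `UserTable`, `FlakyQueue`, and `QueueTimeout` classes below are hypothetical stand-ins, and the snapshot-based rollback is a simplification of what a real database does.

```ruby
class QueueTimeout < StandardError; end

# Hypothetical in-memory table with snapshot-based transactions.
class UserTable
  class DuplicateEmail < StandardError; end

  def initialize
    @rows = {}
  end

  # Mimics the unique constraint on the email column.
  def insert(email)
    raise DuplicateEmail, email if @rows.key?(email)
    @rows[email] = { email: email }
  end

  def exists?(email)
    @rows.key?(email)
  end

  # Restore the previous rows if the block raises, then re-raise.
  def transaction
    snapshot = @rows.dup
    yield
  rescue StandardError
    @rows = snapshot
    raise
  end
end

# A queue whose first enqueue times out, like the Redis timeout in the story.
class FlakyQueue
  attr_reader :jobs

  def initialize
    @jobs = []
    @calls = 0
  end

  def enqueue(job)
    @calls += 1
    raise QueueTimeout, "queue timed out" if @calls == 1
    @jobs << job
  end
end

def sign_up(table, queue, email)
  table.transaction do
    table.insert(email)
    queue.enqueue([:confirmation_email, email])
  end
  :ok
rescue QueueTimeout
  :try_again
end

table = UserTable.new
queue = FlakyQueue.new
first = sign_up(table, queue, "pat@example.com")
exists_after_failure = table.exists?("pat@example.com")
second = sign_up(table, queue, "pat@example.com")
```

Because the failed attempt leaves no row behind, the second signup with the same email is not blocked by the unique constraint.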
00:07:36.840
Over time, we see fewer occurrences of timeouts, resulting in a lower probability of bad data states within our system, which helps us maintain a reliable user experience.
00:08:01.530
However, unexpected issues do occasionally arise. If a job fails due to a more complex error, we can inspect the failed job queue to evaluate the situation and retry processing.
00:08:32.640
This helps us uncover edge cases that we may not have considered. For instance, we discovered a race condition between our code that creates users and the background worker that handles job processing.
00:09:16.320
In this situation, when the job was fired, it tried to fetch the user from the database, but the user was not technically committed yet due to an uncommitted transaction.
00:09:38.880
Thus, we theorized that the job could execute while the transaction was still open, leading to an inconsistent state where, from the job's point of view, the user appeared to exist before it actually did.
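The timing problem can be reproduced in miniature. The `Database` class below is a hypothetical model in which writes stay invisible to other processes until `commit` is called; the array `queue` stands in for the job queue, and the "worker" is simply a lookup made at the wrong moment.

```ruby
# Hypothetical store: writes are private to the transaction until commit.
class Database
  def initialize
    @committed = {}
    @pending = {}
  end

  def insert(id, row)
    @pending[id] = row
  end

  def commit
    @committed.merge!(@pending)
    @pending.clear
  end

  # What another process (the worker) is able to see.
  def find(id)
    @committed[id]
  end
end

db = Database.new
queue = []

# Inside the transaction: write the user, then enqueue the job.
db.insert(1, { email: "pat@example.com" })
queue << 1

# A fast worker picks up the job before the web process commits.
seen_by_early_worker = db.find(queue.first)

db.commit
seen_after_commit = db.find(queue.first)
```

The early lookup finds nothing, even though the same lookup succeeds a moment later once the transaction has committed.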
00:10:05.040
One potential fix could be to implement a retry mechanism, giving the process a moment before trying to access the user record again.
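One way such a retry might look, as a minimal sketch: `with_retries`, `UserNotFound`, and the `find_user` lambda are hypothetical, and the lambda is rigged to fail twice to imitate a transaction that commits between attempts.

```ruby
class UserNotFound < StandardError; end

# Retry a block a bounded number of times, pausing between attempts.
def with_retries(attempts:, delay:, on: StandardError)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue on
    raise if tries >= attempts # give up and surface the original error
    sleep delay
    retry
  end
end

# Hypothetical lookup that only succeeds on the third call, as if the
# transaction committed between the second and third attempts.
lookups = 0
find_user = lambda do
  lookups += 1
  raise UserNotFound, "user 1 not visible yet" if lookups < 3
  { id: 1, email: "pat@example.com" }
end

user = with_retries(attempts: 5, delay: 0.01, on: UserNotFound) { find_user.call }
```

Bounding the number of attempts matters: if the user genuinely does not exist, the job should eventually fail and land in the failed queue rather than loop forever.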
00:10:30.960
After incorporating this retry strategy, we were able to mitigate some of the more complicated issues.
00:11:12.840
Fast forwarding a few months, our company's success led us to acquire a competitor, which brought its own complexities.
00:11:32.460
Their system did not require email validation, so now we needed to ensure that all of their users were integrated seamlessly into our system.
00:11:54.240
We decided to use the same business logic that we had for validation, but apply it in bulk to the competitor's user database.
00:12:22.920
This meant creating user records and sending each an activation email correctly without requiring them to sign up anew.
00:12:56.700
Additionally, we utilized the after_create Active Record hook to trigger this functionality after each user creation, but unfortunately this introduced its own problems.
00:13:37.440
As our team grew and the functionality developed, we ended up modifying many aspects of the system, which led to an accumulation of technical debt.
00:14:02.460
This meant that our once clean controller now contained significantly more logic, becoming cluttered as we added functionality.
00:14:34.800
Despite these challenges, there were moments of excitement and enthusiasm as we tackled issues one after the other in our thriving organization.
00:14:59.520
We made it a point to educate ourselves on best practices and restructuring as we faced the need to simplify processes and make things better aligned with our operations.
00:15:28.080
We grew as a team, but we would later realize that this alone wouldn't be enough to prevent future breakdowns.
00:15:43.920
Eventually, as complexity continued to rise, we recognized that our mailer service needed a complete overhaul.
00:15:59.500
The new mail service was designed to be a drop-in replacement for legacy mailers, but there was a challenge with integrating this new system seamlessly.
00:16:36.420
Despite the ease of introduction, it required extensive testing to ensure that everything functioned correctly and met the needs of each team using it.
00:17:00.120
We soon realized that we needed better features in the implementation, such as ensuring that the same email was never sent twice and allowing the client to query an email's status.
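A minimal sketch of those two features, assuming the client supplies its own key for each logical email. The `MailService` class, its `send_mail` and `status` methods, and the key format are all hypothetical and are not the interface of the service described in the talk.

```ruby
# Hypothetical mail service: each request carries a client-chosen key.
class MailService
  attr_reader :deliveries

  def initialize
    @status = {}     # key => :sent
    @deliveries = [] # what actually went out
  end

  # Sending with a key that was already used does not deliver again.
  def send_mail(key:, to:, template:)
    return :duplicate if @status[key] == :sent
    @deliveries << { to: to, template: template }
    @status[key] = :sent
  end

  # Lets a client ask what happened to a given request.
  def status(key)
    @status.fetch(key, :unknown)
  end
end

service = MailService.new
key = "user-1-confirmation"
first = service.send_mail(key: key, to: "pat@example.com", template: :confirmation)
# The client never saw the response (say, a timeout) and retries.
second = service.send_mail(key: key, to: "pat@example.com", template: :confirmation)
```

With a key like this, a client that is unsure whether its request succeeded can safely retry, or call `status` first, without the user receiving a second copy.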
00:17:35.640
As we faced unforeseen events during implementation, we further scrutinized our approaches, discovering gaps that could lead to inconsistencies.
00:18:10.560
In one case, a partial event failure resulted in the user receiving multiple emails, which caused confusion and frustrated our team.
00:18:48.240
What became clear after this ordeal was the need to reassess how we design our systems to handle mail events seamlessly while preventing duplicates.
00:19:26.910
More focus needed to be placed on clean architecture that promotes better data handling and prevents the possibility of broken states occurring.
00:20:00.060
In hindsight, a centralized logging mechanism, robust testing protocols, and the transition to an asynchronous model should allow for better tracking and resolution of these complexities.
00:20:51.740
To summarize, when integrating services into your system, you should favor idempotent actions, watch for errors so you can rectify them quickly, and maintain solid logging for audit purposes so you can assess what went wrong and how.
00:21:28.680
Looking ahead, we’ve set our sights on implementing these strategies for an overall improvement in performance, user experience, and system reliability.
00:22:12.600
I encourage everyone engaged in this process to carefully consider their architecture when transitioning services, monitor the background processing, ensure better email handling, and prioritize technical agility for future developments.
00:22:57.920
In conclusion, being proactive with your system architecture while understanding potential pitfalls can aid in avoiding frustration down the line.
00:23:35.160
That's all I've got for you today. Thank you for your time, and I'd love to take any questions you may have!
00:24:00.360
Great! We have some time for questions.
00:25:00.000
I had thought that I would use up all my time to avoid questions, but looks like we've got some time left.
00:25:35.000
For the problem concerning the Resque job that needed retries, did you consider using the 'after commit' hooks that are available in new versions of Rails?
00:26:00.000
In some cases, the 'after commit' strategy would solve those timing issues more effectively; however, since portions of our app run on older Rails versions, transitioning was not straightforward.
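The idea behind the hook the questioner mentions (`after_commit` in Active Record) can be modeled in plain Ruby: the job is queued only once the data is visible to other processes. The `Database` class and its callback list below are a hypothetical imitation of that behavior, not the Rails implementation.

```ruby
# Hypothetical store that runs callbacks only once a transaction commits.
class Database
  def initialize
    @committed = {}
  end

  def find(id)
    @committed[id]
  end

  # If the block raises, nothing is committed and no callback runs.
  def transaction
    pending = {}
    callbacks = []
    yield pending, callbacks
    @committed.merge!(pending) # commit
    callbacks.each(&:call)     # after-commit callbacks
  end
end

db = Database.new
queue = []
queued_during_transaction = nil

db.transaction do |rows, after_commit|
  rows[1] = { email: "pat@example.com" }
  after_commit << -> { queue << 1 }
  # Nothing is queued yet, so no worker can run early.
  queued_during_transaction = queue.dup
end

seen_by_worker = db.find(queue.first)
```

Since the job cannot exist before the commit, the worker always finds the user, and no retry is needed for this particular race.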
00:26:58.740
I appreciate your query, and I'm happy to provide further insights into any other questions you may have.
00:27:30.440
Thank you all once again for participating, and I look forward to connecting with many of you after!