Talks

Safely Decomposing a Highly Available Rails App

by Adam Forsyth

In this presentation, Adam Forsyth, a software engineer and Community Lead at Braintree Payments, discusses the process of decomposing a highly available Rails application, specifically the Braintree Payment Gateway. Initially built as a monolith in 2008 with Rails 2.1, the application became increasingly difficult to manage as it grew over time. Forsyth shares insights on modernizing their architecture by extracting services and making their Rails application more modular.

Key points covered in the talk include:

- Background on Braintree Gateway: The app had grown to encompass 100,000 lines of code with poorly separated features, necessitating a decomposition approach to create more agile, independently deployable services.

- Development Strategy: Forsyth highlights the high risk involved in the project, stressing the importance of safely transitioning to new code without causing downtime or breaking API authentication. They took a deliberate approach, incorporating modern Rails features and methodologies such as Test-Driven Development (TDD) and code review practices.

- Technical Modernization: The new app was built from the ground up on Rails 4, using ActiveModel serializers and Oj for faster JSON serialization, and addressing performance issues such as caching, query efficiency, and database connection overhead.

- Service Transition and Double Writing: Forsyth discusses a critical technique where the old and new apps shared the same database during transition, allowing both to validate state integrity through double-writing practices, which improved confidence in new code's functionality.

- Database Separation: Ultimately, a major challenge was splitting the databases without downtime. Forsyth explains how they achieved this, including using a read-only mode during the database dump to avoid service interruptions and ensuring non-critical endpoints either worked or failed gracefully.

Overall, the session emphasizes the importance of modernization, safe transition practices, and strategic planning in the context of evolving legacy systems into more flexible architectures.

00:00:24.960 I'm Adam Forsyth, and I'm a software engineer and the Community Lead at Braintree Payments. We make it easy for you to accept credit cards, PayPal, and other payment methods online and in mobile apps. This talk is about decomposing a highly available Rails app. While my slides may not have much content on them, my speaker notes will contain all the details and will be available on my GitHub page afterwards.
00:00:48.680 My talk will cover what we did to extract a service from the Braintree Payment Gateway. I have some general tips for developing modern Rails API apps, and I'll delve deeper into three somewhat unique approaches we used during this process that may be useful to you. I will move quickly through the material to allow time for questions at the end.
00:01:20.720 To provide some context, the Braintree Gateway was historically a monolith. The codebase dates back to 2008 and was built upon Rails 2.1, making it one of the older and larger Rails applications out there, consisting of roughly 100,000 lines of code. By 2013, the monolithic architecture was slowing us down, limiting our language and framework choices to Ruby and Rails. There was a long ramp-up time for new developers due to the vast amount of information required to work on a single app. Furthermore, the features in that app had gradually become poorly separated over time.
00:02:02.840 Consequently, we started down the road of decomposition. We began building new features in separate services while refactoring the payment gateway to be more modular within a single app. Additionally, we started extracting existing features into their own new services. However, we encountered problems; externally facing features still had to be integrated into the same app because they needed to share a specific subset of the code related to identity management, authentication, authorization, and configuration. Essentially, this involved understanding who you are and what you can do.
00:02:36.400 In 2014, we decided to extract into a new service the code that was forcing us to build all new features into the monolith. Confusingly, we named it the 'AY' app; the name is ambiguous because the app encompasses authorization and authentication, among other responsibilities. The goal of the project was not the stand-alone service itself but to facilitate further decomposition, enabling us to build more new features as independent services and to extract additional features from the gateway.
00:03:05.760 This project was high-risk, as all of our traffic is authenticated. As such, all traffic depends on that code. Other services rely on our payment service for their operations, meaning we can't afford downtime or risk breaking API authentication. Our strategy was not to move quickly and break things. Instead, we aimed to move deliberately and safely. Our goal was to acquire maximum confidence in our code's correctness, allowing us to roll out changes gradually and minimize the inevitable problems.
00:03:49.560 Moreover, we stayed focused on modernization. The gateway had been updated to Rails 3.2, but a significant amount of the code still dated from Rails 2. The new app was built from the ground up on Rails 4 (Rails 4.2 was ruled out due to a bug). We upgraded from RSpec 2 to RSpec 3, adopting the more modern expect syntax. We moved from 100% custom serialization code to ActiveModel serializers, and from a bespoke proxy and request-queuing layer to Nginx and Unicorn. We use state machines in our code, and we switched from the AASM gem to the state_machine gem because we needed better transition-failure behavior.
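To illustrate the transition-failure behavior in question, here is a minimal hand-rolled state machine sketch in which an invalid transition raises rather than failing silently. This is not the API of the state_machine gem; the states and class names are hypothetical.

```ruby
# Minimal state machine: invalid transitions raise loudly instead of
# silently returning false, which is the failure behavior we wanted.
class TransactionState
  TRANSITIONS = {
    authorized: [:settled, :voided],
    settled:    [:refunded],
  }.freeze

  class InvalidTransition < StandardError; end

  attr_reader :state

  def initialize(state = :authorized)
    @state = state
  end

  def transition_to!(new_state)
    allowed = TRANSITIONS.fetch(@state, [])
    unless allowed.include?(new_state)
      raise InvalidTransition, "cannot go from #{@state} to #{new_state}"
    end
    @state = new_state
  end
end
```

Raising on a bad transition makes bugs surface immediately in tests and monitoring rather than corrupting state downstream.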
00:05:06.960 At the onset of this project we were not using database connection pooling in the old app, but the new app employed PgBouncer from the start. To make testing faster and simpler, we wrote an in-memory version of the entire service, used wherever 100% accuracy wasn't critical, especially for tests not related to authentication or identity code. We anticipated performance concerns from turning an in-app method call into a network call.
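An in-memory test double like the one described might look like the following sketch. The interface (`add_user`, `authenticate`) is hypothetical, not Braintree's actual API; the point is that it answers the same questions as the network-backed service from a local hash.

```ruby
# In-memory stand-in for the auth service, used in tests where exact
# behavior isn't critical, so test suites avoid network calls entirely.
class InMemoryAuthService
  def initialize
    @users = {}
  end

  def add_user(name, api_key)
    @users[api_key] = name
  end

  # Same interface a network-backed client would expose, answered locally.
  def authenticate(api_key)
    @users.key?(api_key)
  end
end
```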
00:06:30.000 This may sound like premature optimization, but the work was genuinely necessary. Caching was our first approach, and fortunately we had high-volume idempotent endpoints that we could safely cache on the client side, bypassing the request entirely. Our cache lifetime was relatively brief, but it greatly benefited our high-volume merchants.
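A client-side cache for idempotent calls can be as simple as a hash with a short TTL. This is a minimal sketch, assuming a generic callable backend; the names are illustrative, not from the actual gateway client.

```ruby
# Client-side caching for idempotent endpoints with a short TTL.
# Only idempotent calls are safe to cache; the TTL bounds staleness.
class CachingClient
  def initialize(backend, ttl_seconds: 30)
    @backend = backend          # anything responding to #call(key)
    @ttl = ttl_seconds
    @cache = {}                 # key => [expires_at, value]
  end

  def get(key)
    expires_at, value = @cache[key]
    return value if expires_at && Time.now < expires_at

    value = @backend.call(key)
    @cache[key] = [Time.now + @ttl, value]
    value
  end
end
```

A cache hit skips the network entirely, which is why even a brief TTL pays off for high-volume merchants making the same idempotent call repeatedly.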
00:06:44.800 We also implemented server-side caching, primarily in Nginx rather than in Rails. Our next consideration was server-side performance, since we knew our average request time in the old app wouldn't suffice in the new one. We switched to Oj for faster JSON serialization and used Rails API from the very beginning for better defaults and significantly faster requests. Most of our requests are processed in single-digit milliseconds.
00:07:21.280 We used Bullet to ensure our ActiveRecord queries were efficient, avoiding issues like N+1 queries. We also wrote custom SQL for several high-volume, uncacheable endpoints where we needed to collapse the work into a single query to eliminate database round-trip time. The third performance issue we addressed was connection overhead. The natural approach is persistent connections, but a naive use of them didn't work well: we faced a trade-off between slow server restarts while waiting for connections to close and dropped requests when connections were closed unexpectedly.
00:08:04.320 If we shortened the connection lifetime so that server restart times wouldn't balloon, we lost most of the benefit. So we settled on using persistent connections solely for idempotent methods, handled by a middleware, while other methods avoided persistent connections. This way, if a connection is severed, we can seamlessly retry, because we only use persistent connections for idempotent endpoints.
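The idempotent-only persistent-connection policy can be sketched as follows. `ConnectionClosed` and the connection callables are stand-ins for real HTTP keep-alive plumbing, not the middleware Braintree actually used.

```ruby
# Persistent connections only for idempotent HTTP methods; a dropped
# persistent connection is simply retried, which is safe because the
# request is idempotent. Non-idempotent calls use short-lived connections.
class ConnectionClosed < StandardError; end

class GatewayClient
  IDEMPOTENT = [:get, :head, :put, :delete].freeze

  def initialize(persistent_conn, fresh_conn_factory)
    @persistent = persistent_conn       # responds to #call(method, path)
    @fresh = fresh_conn_factory         # returns a new connection per call
  end

  def request(method, path)
    if IDEMPOTENT.include?(method)
      begin
        @persistent.call(method, path)
      rescue ConnectionClosed
        # Safe blind retry: re-sending an idempotent request is harmless.
        @persistent.call(method, path)
      end
    else
      # A dropped connection mid-POST is ambiguous, so never blind-retry;
      # use a fresh connection instead of a persistent one.
      @fresh.call.call(method, path)
    end
  end
end
```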
00:08:46.200 The final performance concern was large POST bodies, which performed very poorly. I didn't dig deeply into the root cause, but we experienced networking problems with these large POST requests. The solution was to manually apply gzip compression and decompression to the POST bodies, which resolved the issue.
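Manually gzipping a request body is straightforward with Ruby's standard library. A minimal sketch (real code would also set a `Content-Encoding: gzip` header so the server knows to decompress):

```ruby
require "zlib"
require "stringio"

# Compress a large POST body before sending it over the wire.
def gzip_body(body)
  io = StringIO.new
  gz = Zlib::GzipWriter.new(io)
  gz.write(body)
  gz.close
  io.string
end

# Server side: decompress the body before parsing it.
def gunzip_body(compressed)
  Zlib::GzipReader.new(StringIO.new(compressed)).read
end
```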
00:09:14.559 Now, regarding code correctness: we employed all the standard practices. We are ardent practitioners of Test-Driven Development (TDD) at Braintree. On this project we used a mix of pair programming and pull requests, so there were no solo commits; every change was either pair programmed or went through pull-request review, in addition to the standard code review conducted before deployments.
00:09:56.399 We emphasized shared context: even if code was committed before you had a chance to review it, everyone read every commit, not just those for the new service but also changes to the existing service and any related infrastructure. We created a seam within the application and then extracted the functionality, as is standard practice. The seam provides an explicit contract to build the new service against.
00:10:37.680 Once the rollout began, everyone involved in the project joined the on-call rotation to ensure shared ownership of the system's correctness. We rolled the application out to production long before it was used for anything critical, which let us identify and address operational issues early and evaluate how it would perform. As a side note, we experienced plenty of atelophobia, the fear of imperfection and of never meeting standards, throughout this project.
00:11:55.960 We recognized that although the standard practices I've just mentioned were vital, they weren't sufficient to create full confidence in the correctness of the new service. We were replacing code that was over six years old, dating back to Rails 2.1, to make it more maintainable. We couldn't focus only on adapting existing code to newer Rails versions; we also had to address the quality of the test coverage.
00:12:44.600 Unfortunately, the existing test coverage was not promising, which left us doubtful about the accuracy of the new code even with the old tests ported over line by line. The first step we took to bolster our confidence was to build a proxy layer named 'Quackery.' It operated at the seam we had designed within the application, rather than as a distinct service.
00:13:28.560 Quackery was configured with a tree of all the endpoints present in both the old and new code. It knew which endpoints were idempotent and could invoke both code paths with a single set of arguments. This was crucial because we wanted our tests to represent real-world scenarios. The Quackery code intelligently compared the results returned from both code paths, ignoring auto-generated fields, such as timestamps, that were expected to differ.
00:14:58.960 This methodology effectively amounted to automated testing, except it involved not hand-written tests but actual code processing real requests. Quackery also let us easily toggle which code path was actually serving each request.
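The comparison a proxy layer like Quackery performs might be sketched as below. The structure is a guess from the description in the talk: run both code paths with the same arguments, diff the results while ignoring auto-generated fields, and keep serving the battle-tested path.

```ruby
# Compare old and new code paths on real requests, ignoring fields
# (IDs, timestamps) that are expected to differ between the two.
IGNORED_KEYS = [:id, :created_at, :updated_at].freeze

def results_match?(old_result, new_result)
  strip = ->(h) { h.reject { |k, _| IGNORED_KEYS.include?(k) } }
  strip.call(old_result) == strip.call(new_result)
end

def compare_code_paths(args, old_path:, new_path:)
  old_result = old_path.call(args)
  new_result = new_path.call(args)
  warn "mismatch for #{args.inspect}" unless results_match?(old_result, new_result)
  old_result # keep returning the battle-tested path's result while comparing
end
```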
00:15:52.760 Returning to code accuracy: this was helpful, but still not enough. Several non-idempotent endpoints resemble remote procedure calls more than CRUD operations, and we wanted deeper validation than handwritten assertions could provide. Since we could already invoke idempotent methods multiple times and compare results, we wondered whether we could double-call the writing endpoints as well.
00:16:51.520 It was indeed possible, but I've omitted a crucial detail: both the new service and the existing application were still sharing a database. This was necessary because both apps required strong consistency. The shared database had to go away once we completely transitioned to the new code, but it was necessary throughout the rollout.
00:17:33.000 You may have read that shared databases are an anti-pattern, but the shared database kept double writing simple. Once we had a way to read results from idempotent methods on both paths, we simply added metadata to requests sent to the service marking them as duplicate calls. When the service saw that metadata, it opened a database transaction before handling the request, and rolled the transaction back afterward.
00:18:40.720 This let us get authentic results without affecting database state. We were no longer confined to doubling only idempotent requests; we could double all of our endpoints, allowing exact result comparisons across the entire range of our code.
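The rollback trick can be sketched with an in-memory stand-in for the database; `FakeDB` and the `duplicate_call:` flag are illustrative substitutes for ActiveRecord's transaction support and the real request metadata.

```ruby
# A duplicate call runs the real handler inside a transaction, captures
# the response, then rolls the transaction back so no state change lands.
class RollbackSignal < StandardError; end

class FakeDB
  attr_reader :rows

  def initialize
    @rows = []
  end

  def transaction
    snapshot = @rows.dup
    yield
  rescue RollbackSignal
    @rows = snapshot          # undo every write made inside the block
    nil
  end

  def insert(row)
    @rows << row
  end
end

def process(db, params)
  db.insert(params)           # the real handler, with real writes
  { status: "created", record: params }
end

def handle_request(db, params, duplicate_call: false)
  return process(db, params) unless duplicate_call

  response = nil
  db.transaction do
    response = process(db, params)
    raise RollbackSignal      # authentic response, no lasting state change
  end
  response
end
```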
00:19:06.080 Unfortunately, holding long-running transactions open is ill-advised for multiple reasons. Most requests are exceptionally fast, but some search-type requests can take tens of milliseconds, and holding a connection and transaction open for that long is unacceptable.
00:19:21.440 Although I was eager to try this approach on live traffic to evaluate its efficacy, our DBA opposed it. Even so, double writing greatly expanded our ability to validate state beyond what assertions usually permit, because it allows full response comparisons from actual code.
00:19:49.080 That said, due to serialization constraints, not every piece of state makes it back to the client, so it's a close approximation rather than a complete check. Now, let's revisit the shared database, whose long-term viability poses substantial concerns.
00:20:01.520 The shared database was exceedingly useful during our transition, enabling techniques like the double writing I just described. However, once we fully transitioned to the new service, we had to separate the databases.
00:20:55.760 There are two primary ways to achieve this separation without downtime. The first is to add an extra standby to your cluster, let it catch up, and then promote it to master of a new cluster. The downside is the size of the data involved: although the database being split out is relatively small, the overall cluster is extensive, so this approach requires substantial hardware investment.
00:21:34.000 Additionally, because of those size considerations, we preferred to keep the new database within the original cluster rather than establishing a separate one. The second method is essentially a database dump and reload. There are several ways to do this, not all of which use the traditional dump-and-reload tools, but the concept is the same.
00:22:20.600 A dump and reload can be managed by temporarily halting access during the transition, but it's vital to maintain strong consistency between the old and new datasets.
00:23:09.880 We concluded from the performance work and the double-writing research that most high-volume endpoints in the new application were idempotent, and many were pure: they return results without altering database state. The non-pure endpoints, we found, fell into three categories.
00:23:59.680 The first category is endpoints not involved in transaction processing; because they are used only by users or background jobs, brief unavailability could be arranged without issue. The second category is optional auditing and analytics, which can be paused without serious repercussions and reconciled by later data analysis. The third category is endpoints that are nearly pure: their core functionality doesn't fundamentally depend on altering database state, so they could be halted like the others without affecting authentication or login.
00:24:58.480 To keep traffic flowing during the database split, we added a configuration option enabling a read-only mode. This let us return friendly error messages to clients who attempted to access endpoints that were temporarily disabled.
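A read-only mode switch of this kind might look like the following sketch; the endpoint names and the guard class are hypothetical, and real code would likely live in a Rack middleware or controller filter.

```ruby
# In read-only mode, write endpoints return a friendly error instead of
# failing mid-write; everything else passes through untouched.
class ReadOnlyGuard
  WRITE_ENDPOINTS = [:create_user, :update_settings].freeze

  def initialize(read_only: false)
    @read_only = read_only
  end

  def call(endpoint)
    if @read_only && WRITE_ENDPOINTS.include?(endpoint)
      { status: 503,
        body: "This action is temporarily unavailable. Please retry shortly." }
    else
      { status: 200, body: yield }
    end
  end
end
```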
00:25:49.640 We then modified the non-critical and nearly-pure endpoints so that they sidestepped database state changes, performing whatever actions they could without writing any state. With this adjustment, every endpoint either functioned correctly or failed gracefully in read-only mode, letting us run the DB dump and reload without pausing traffic.
00:26:25.720 The practical implementation of this change proved rather complex and came with various challenges. There were brief periods of degraded performance in our control panel where specific background jobs could not run, but all transaction processing remained unaffected. That was possible because we stayed in read-only mode rather than shutting down, which would have meant immediate downtime during the actual database separation, a situation we could not have tolerated.
00:27:13.000 That's all I have for today. As I mentioned at the beginning, my slides along with the accompanying speaker notes will be accessible on GitHub at github.com/agf. Thank you for your time, and if anyone has further questions, feel free to approach me afterward.