RailsConf 2013

Zero-downtime payment platforms

by Prem Sichanugrist and Ryan Twomey

In the presentation titled "Zero-downtime payment platforms", Prem Sichanugrist and Ryan Twomey address the critical importance of maintaining high availability in payment processing platforms. They emphasize that even minor downtimes can adversely affect customer experience and revenue. Various strategies are discussed to mitigate risks associated with both internal and external downtimes.

Key Points:

  • Definition of Downtime: The speakers define two types of downtime:

    • Internal Downtime: Caused by issues within their own application, such as application errors or infrastructure failures.
    • External Downtime: Resulting from dependencies on third-party services, such as payment gateways or email providers.
  • Handling External Downtime: To counteract scenarios where payment gateways might go down, the team implemented a risk assessment system that allows the acceptance of low-risk orders even when the payment gateway is unavailable. This is managed through:

    • A manual shutdown system initially, which evolved into automated processes for efficiency.
    • A timeout around the gateway call that, when it expires, triggers a risk assessment to decide whether the order can still be accepted.
  • Internal Downtime Solutions: The presenters describe a fallback system, including:

    • Chocolate, a separate Sinatra application that acts as a request replayer when their main Rails application fails. This ensures that customer requests can still be stored and processed later without immediate disruptions.
    • Akamai Dynamic Router: This is utilized to reroute requests, minimizing the impact of application errors by handing off to the backup (Chocolate).
    • The use of a unique request ID to avoid duplicate charges and to keep order processing consistent when switching between applications.

Significant Examples:

  • The talk grounds these ideas in practice, describing high-traffic periods with a large volume of transactions where proactive measures and automated failover processes were crucial to maintaining functionality and customer satisfaction.

Conclusions and Key Takeaways:

  • The presenters stress that all systems are prone to failure; hence, preparations should include having a robust failover strategy.
  • Implementing a replayer mechanism can significantly enhance user experience by ensuring operations continue smoothly during disruptions.
  • It's crucial to constantly evaluate and refine risk assessment models to appropriately manage order acceptance during downtimes.

Ultimately, the speakers convey that while it is impossible to entirely eliminate downtime, thorough planning and intelligent design can have substantial positive impacts on system reliability and user satisfaction.

00:00:16.680 Good afternoon, everyone! I hope you all had a great lunch.
00:00:22.920 We're going to get started here. Today, we're going to talk about zero-downtime payment platforms, or some techniques to ensure your application never appears to be down.
00:00:30.199 My name is Prem Sichanugrist, and I work for a company called thoughtbot, which is located in Boston. I'm Ryan Twomey, and I work for another company called LevelUp, which is also in Boston.
00:00:43.600 Since you're here at RailsConf, I know you want to learn something valuable. thoughtbot has a learning platform called Learn at learn.thoughtbot.com where you can find screencasts, books about Ruby on Rails development, and more. You can also use the promo code 'railsconf' to get a 20% discount on your first month, or on anything else in the store.
00:01:19.720 Let's start out with some background. LevelUp is a mobile payments and advertising platform based in Boston. When you hold up your mobile phone, it displays a QR code that you point at the cashier's scanner to place an order. So, if you want to get a coffee or a sandwich, that's how you would do it.
00:01:34.520 This process interacts with our REST API, which is a Rails 3.2 application. What happens next involves quite a bit of processing, but it ultimately leads to charging the customer's credit card to complete the order. To complete that charge, we go through a payment gateway such as Braintree, Authorize.NET, or another service. The key point here is that we rely on a third-party service to finalize that part of the transaction.
00:02:08.160 Our tech stack consists of a Rails 3.2 application hosted on Heroku, and we use a PostgreSQL 9.1 database. We are currently evaluating PostgreSQL 9.2, which is really exciting if you haven't checked it out yet. Our database has two followers: one is in the same data center but in a separate availability zone, while the other is located on the west coast.
00:02:28.280 A follower, in Heroku terms, refers to a read-only replica of the master database. One important thing to note is that your app, or any app, is always dependent on a multitude of different components, many of which are outside your control. For instance, Heroku is built on top of Amazon Web Services. If there's ever an issue with AWS, it could potentially affect your application.
00:02:53.400 So, always being aware of everything that interacts with your app and impacts its uptime is critical. We're also handling quite a bit of volume; at peak times, we could be processing around $1,000 a minute. That's substantial, and while it's not Amazon money, it still represents a lot of revenue. Consequently, we really don’t want any downtime.
00:03:22.799 Let's discuss the different types of downtime that could affect us. There are primarily two categories: first, we have internal downtime, which means we can’t execute our own code. This could happen if our application crashes, Heroku goes down, or if something catastrophic occurs.
00:03:39.879 The second type is third-party downtime. This could involve something critical that we rely on, like our payment gateway or our email provider, which could become unavailable. I'm going to start discussing some of these external factors, where we have diminished control. As Ryan mentioned, this includes external databases, email providers, caching services, and payment gateways.
00:04:12.599 In our case, since LevelUp is a payment platform, we particularly focus on payment gateways. A crucial question worth considering is: what would happen to our payment platform if the payment gateway goes down? Before we implemented these strategies, a new order that arrived while the payment gateway was down would simply be rejected, turning away customers who were unable to pay for their purchases and leaving them unhappy.
00:04:45.840 We started to consider ways to take some risks that could lead to happier customers. The process was undertaken iteratively. We began with a manual shutdown feature, where on the admin panel we incorporated a big red button that an admin could use to shut down the system and transition it into a fail-safe mode. In this fail-safe mode, we would accept low-risk orders and save the transactions in the database to charge customers later. We assessed risk before saving the orders, although we can’t disclose the specifics of our risk assessment process.
00:05:35.600 One simple criterion we considered was whether the order was below a specific amount, say $100, which we considered low risk. There are various ways to assess risk. For instance, we could check how many times a customer has used our service, how long they have been signed up, and whether they have ever had a failed transaction. This helps ensure that we're likely able to collect payment from them later.
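The talk doesn't reveal the actual risk model, but a minimal sketch of the kind of checks described above might look like this (the class name, threshold, and heuristics here are hypothetical, not LevelUp's real rules):

```ruby
# Hypothetical sketch of a low-risk check along the lines described in the talk.
class OrderRiskAssessment
  LOW_RISK_AMOUNT_CENTS = 100_00 # e.g. treat orders under $100 as low risk

  def initialize(order, user)
    @order = order
    @user  = user
  end

  def low_risk?
    small_amount? && established_customer? && clean_history?
  end

  private

  def small_amount?
    @order.amount_cents < LOW_RISK_AMOUNT_CENTS
  end

  def established_customer?
    @user.orders.count >= 5 && @user.created_at < 30.days.ago
  end

  def clean_history?
    @user.orders.where(status: "charge_failed").empty?
  end
end
```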
00:05:59.600 Implementing this approach allows us to let customers make purchases, even when our payment gateway is down. We avoid turning away customers, but the manual process requires human intervention whenever the payment gateway fails. Unfortunately, humans need sleep; they can't monitor the system 24/7. Therefore, we realized that this approach was not sustainable and needed automation.
00:06:27.520 To automate the process, we devised a few methods based on three main steps. First, we take the part of our system that makes the charge request to the payment gateway and wrap it in a timeout with a preset number of seconds to wait for a response. If the request times out, and we are unable to reach the payment gateway, we evaluate the risk of the transaction. If the risk is too high, we display a failure message. However, if the risk is low, we save the order and return a success message.
00:07:04.800 This success message mimics a normal successful response, meaning that from the cashier’s or customer's perspective, there’s no difference—they won’t know about the underlying issues we faced. We save all the information in our system, and we built a mechanism that runs in the background every so often to find pending orders and retry them, allowing us to reconcile or complete these orders.
00:07:35.400 The timeout code is fairly straightforward; it charges the card via the gateway and is descriptive enough. If the timeout occurs, we catch it in the rescue block and call a method that decides whether we can save the order without charging the card. We first check whether the risk is high or low. If it's deemed high, we return a generic validation message to avoid giving away too much detail. If the risk is low, we save the order with a specially formatted placeholder gateway identifier.
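The actual code isn't reproduced here, but a sketch of that timeout-and-rescue structure might look like the following; the method names, the timeout value, and the OrderRiskAssessment helper are assumptions:

```ruby
require "timeout"
require "securerandom"

GATEWAY_TIMEOUT_SECONDS = 10 # assumed value; the talk doesn't state it here

# Wrap the gateway call in a timeout; on timeout, fall back to a risk check.
def charge_order!(order)
  Timeout.timeout(GATEWAY_TIMEOUT_SECONDS) do
    order.charge_card_via_gateway! # normal path: Braintree, Authorize.NET, etc.
  end
rescue Timeout::Error
  save_order_without_charging(order)
end

def save_order_without_charging(order)
  unless OrderRiskAssessment.new(order, order.user).low_risk?
    # High risk: return a generic validation error without revealing details.
    order.errors.add(:base, "We were unable to process your order.")
    return false
  end

  # Low risk: mark the order with a recognizable placeholder gateway ID so a
  # background task can find it and charge the card later.
  order.gateway_id = "pending-#{SecureRandom.hex(8)}"
  order.save!
  true
end
```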
00:08:41.000 We need a way to uniquely identify orders whose charges never actually ran through our payment gateway, so we set the gateway ID to a random string in a recognizable format. Once that's done, we return true, allowing the order processing to continue. The final step is a cron task that runs periodically, finds all of these orders, and attempts to retry them.
00:09:16.000 Approximately every 10 minutes throughout the day, we run a task that checks for orders that need to be reconciled. We have something called order.reconcilable that finds any orders matching our gateway ID format. If it identifies any, we trigger the order.reconcile method.
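A sketch of that periodic task, assuming the placeholder gateway ID format from the earlier sketch (the scope and task names are illustrative, not the real code):

```ruby
# Hypothetical rake task run roughly every 10 minutes (cron / Heroku Scheduler).
namespace :orders do
  desc "Retry orders that were accepted while the payment gateway was down"
  task reconcile: :environment do
    Order.reconcilable.find_each(&:reconcile)
  end
end

class Order < ActiveRecord::Base
  # Orders saved during a gateway outage carry a placeholder gateway ID.
  scope :reconcilable, -> { where("gateway_id LIKE ?", "pending-%") }
end
```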
00:09:44.000 The function first runs a check using something called a similar order finder. Its primary role is to query our payment gateway to identify any charges processed within a specified window around the order's creation time. If a duplicate charge is identified, we update the Gateway ID to reflect the actual ID that we found, ensuring we don’t accidentally charge our customers multiple times.
00:10:35.760 Conversely, if we don’t find a duplicate, we proceed to charge the customer, still adhering to our existing timeout processes. However, be aware of a critical race condition: without locks, concurrent cron tasks could lead to multiple charges if they aren’t executed sequentially.
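A sketch of that reconcile step, including a row lock to guard against the race condition just mentioned; SimilarOrderFinder's internals, the matching window, and the Gateway.charges_for call are assumptions rather than LevelUp's actual code:

```ruby
class Order < ActiveRecord::Base
  def reconcile
    with_lock do                  # row-level lock so overlapping cron runs
      next unless reconcilable?   # can't both charge the same order

      duplicate = SimilarOrderFinder.new(self).find
      if duplicate
        # The gateway already has a matching charge; record its real ID
        # instead of charging the customer a second time.
        update_attributes!(gateway_id: duplicate.id)
      else
        charge_card_via_gateway_with_timeout! # same timeout logic as before
      end
    end
  end

  def reconcilable?
    gateway_id.to_s.start_with?("pending-")
  end
end

# Asks the gateway for charges on this card within a window around the
# order's creation time.
class SimilarOrderFinder
  WINDOW = 30.minutes # assumed window size

  def initialize(order)
    @order = order
  end

  def find
    Gateway.charges_for(
      card_token: @order.card_token,
      from:       @order.created_at - WINDOW,
      to:         @order.created_at + WINDOW
    ).detect { |charge| charge.amount_cents == @order.amount_cents }
  end
end
```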
00:11:00.960 The advantages of this system are evident: there are no humans involved, which means operations can occur at any time, providing great convenience. On the downside, there were a few rough patches along the way as similar order finders proved to be finicky. Extensive testing across your payment gateway is essential to ensure reliable performance.
00:11:44.320 Now, let’s shift our focus to when our own system might go down, which can happen for many reasons such as application errors, crashing, or connectivity issues with AWS.
00:12:04.079 To illustrate, let me share a story about a specific incident on October 22nd when AWS experienced significant downtime, impacting many users. During this time, I was still at work and became hungry for a burrito. However, I couldn't remember the hours for a specific restaurant and found out that they were also hosted on Heroku.
00:12:26.360 This meant I couldn’t access their website for the business hours. So here is a lesson: always utilize a CDN to serve your static web pages, such as the hours for a restaurant, in case your main site is unavailable.
00:12:51.520 In the event that Heroku or AWS goes down for us, such issues can have serious repercussions. When this occurs, we may notice a significant spike in the number of abandoned transactions. You would naturally assume this is due to poor customer experience.
00:13:26.520 Fortunately, we have strategies in place that can handle such circumstances. For example, we utilize a request replay mechanism which is powered by Akamai. It may not be immediately obvious how this whole system works, so let me explain how we structure our stack.
00:14:08.840 At the top, we have the internet serving as the gateway for our requests, all of which route through the Akamai dynamic router before reaching Heroku. Behind that, we incorporate our request replay technology as well as a CDN to serve static site assets.
00:14:30.800 When customers attempt to initiate a purchase, for instance by scanning their phones, the request gets sent to our Rails application. However, if the application encounters an error or fails to respond within the designated timeout, it will trigger the Akamai router to reroute the request to our replay service.
00:14:51.240 The replay service, which we call Chocolate, is designed to handle requests separately. We decided to build it as a distinct Sinatra application from scratch.
00:15:03.520 We made this decision primarily to avoid potential bugs in our main production codebase. This replay application is responsible for performing risk assessment, storing a raw request in the database, and ensuring that once normal service resumes, we can effectively replay those requests.
00:15:26.240 Despite the name, Chocolate is essentially a simple holding area for raw web requests. It is deployed in a different cloud environment, separate from AWS, to avoid a single point of failure.
00:15:44.320 The idea is that if our main application or AWS goes down, customers remain blissfully unaware of the issue, thereby preserving their positive experience.
00:16:01.200 Despite this safeguard, we still need to conduct risk assessments similar to before. If an order is accepted but we cannot charge later, we still need to have robust support processes to reach out to the customer for follow-up.
00:16:30.319 So, how does Chocolate effectively operate? It has an endpoint that mimics the path of our primary Rails app. The application extracts key information, checks the legitimacy of the order, assesses risks using our established models, and then saves the request details for future reference.
00:17:06.360 As such, when requests pass our initial sniff tests, we save relevant information for debugging later on, and the response returned is identical to the production response. Therefore, cashiers and customers are completely unaware of the interruptions.
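The Chocolate code itself isn't shown in the talk, so this is only a minimal Sinatra-style sketch of an endpoint like the one described; the path, parameters, Sequel datastore, and simplified risk check are all assumptions:

```ruby
require "sinatra"
require "json"
require "sequel"

# Assumes a stored_requests table already exists in Chocolate's own database.
DB = Sequel.connect(ENV.fetch("DATABASE_URL", "sqlite://chocolate.db"))

post "/v1/orders" do
  raw = request.body.read
  payload = JSON.parse(raw) rescue halt(400)

  # Same kind of sniff test and risk check as production, greatly simplified.
  halt 422, { error: "invalid order" }.to_json unless payload["amount_cents"].to_i.between?(1, 100_00)

  # Store the raw request so it can be replayed once production is healthy.
  DB[:stored_requests].insert(
    request_id: request.env["HTTP_X_REQUEST_ID"], # unique ID injected upstream
    path:       request.path,
    body:       raw,
    created_at: Time.now
  )

  content_type :json
  { status: "ok" }.to_json # same shape as the production success response
end
```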
00:17:23.600 Next comes the replay process, which is initiated as soon as the production site is back online. The order model on our Chocolate application has a defined replay method that opens a connection back to the production app to reapply all necessary requests.
00:17:51.840 Our support team manually triggers these replays to monitor which orders successfully go through and maintains visibility over daily replay volumes. If the team identifies any issues with replaying, they take appropriate action to remedy the situation.
00:18:19.120 It’s crucial that we avoid duplicate orders during this process. For this, we utilize a unique request ID injected into every incoming request. This unique ID is stored alongside the order in the Rails production application.
00:18:48.200 If an order fails over and is sent to Chocolate, the system retains the same request ID. Upon replaying the order, we check this ID; if it matches an existing order, we reject it to prevent double charges.
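A sketch of that replay flow and the request-ID check that prevents double charges; the host, header name, and model details are assumptions:

```ruby
require "net/http"
require "uri"

# Chocolate side: replay a stored request against production once it's healthy.
class RequestReplayer
  PRODUCTION = URI("https://api.example.com") # placeholder host, not the real one

  def initialize(stored_request)
    @stored = stored_request # row with :path, :body, :request_id
  end

  def replay
    http = Net::HTTP.new(PRODUCTION.host, PRODUCTION.port)
    http.use_ssl = true

    post = Net::HTTP::Post.new(@stored[:path], "Content-Type" => "application/json")
    post["X-Request-Id"] = @stored[:request_id] # reuse the original unique ID
    post.body = @stored[:body]

    http.request(post) # production rejects it if this ID was already processed
  end
end

# Production side: the request ID is stored on the order, so replaying a
# request that already succeeded fails validation instead of charging twice.
class Order < ActiveRecord::Base
  validates :request_id, uniqueness: true, allow_nil: true
end
```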
00:19:17.560 The decision to fail over is made by the Akamai dynamic router when a request exceeds our defined timeout of 15 seconds. If the request times out while an order is still pending, the router reroutes it to Chocolate.
00:19:41.560 Concerning any charge that did get processed, we synchronize it using our 'dupes' logic in production. This ensures that we consistently manage customer purchases effectively.
00:20:18.080 One of the significant advantages of having multiple layers of failover is the ability to replay requests into an entirely separate application that is not dependent on the original system’s location. But, increased complexity comes with maintaining additional layers.
00:20:57.480 Despite the increased costs and potential complications, the overall benefit of redundancy outweighs the drawbacks, as having these layers in place preserves revenue and customer satisfaction.
00:21:18.920 One unusual observation we made was that there were still several orders ending up in Chocolate despite our systems being operational.
00:21:41.440 This showed that there were instances in which our site had no reported downtime, yet orders unexpectedly failed over. Our analytics captured various spikes in activity that hinted at significant issues.
00:22:08.760 It turned out the reason for these occurrences was random routing within the Heroku platform. Each request is assigned randomly to one of our available dynos, which can cause bottlenecks.
00:22:33.320 Requests that came in simultaneously could cause delays if a particular dyno is processing a previous request. If this happens, and the request sits for too long, it will eventually trigger a timeout leading to the request being rerouted to Chocolate.
00:23:10.680 This behavior not only results in abandoned requests but can also cause backlogs if certain dynos are consistently preoccupied. To remedy the situation, we needed to ensure the responsiveness and performance of each dyno was maximized.
00:23:47.440 We can’t completely eliminate issues with random routing, but we consistently strive to optimize our application and fine-tune our configurations, which can significantly improve performance and reduce the frequency of such abandoned requests.
00:24:42.320 In closing, there are several key takeaways from this talk. Firstly, expect that your site will invariably go down at some point—keep in mind that planning ahead for these situations is crucial.
00:25:29.760 Secondly, consider implementing a replay mechanism or dynamic routing to handle essential web requests, like order submissions. You might even contemplate giving customers some grace periods during downtimes to ease their experience.
00:25:55.960 Moreover, keeping your backend lean and efficient is beneficial. Offload tasks into background jobs whenever possible to free up server resources. And lastly, consider using a CDN to serve static assets, especially during periods of increased web traffic.
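The talk doesn't name a specific queueing library, but the "keep the request path lean" advice usually looks something like the following (shown here with Sidekiq; the worker and mailer names are illustrative):

```ruby
require "sidekiq"

# Anything not needed to answer the request (receipt emails, analytics, etc.)
# moves out of the web process into a background worker.
class ReceiptEmailWorker
  include Sidekiq::Worker

  def perform(order_id)
    order = Order.find(order_id)
    ReceiptMailer.receipt(order).deliver
  end
end

# In the controller, enqueue instead of sending the email inline:
#   ReceiptEmailWorker.perform_async(order.id)
```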
00:26:26.960 Now, does anyone have any questions? I’ll go ahead and repeat the question for everyone to hear. Just so we understand your concern, you’re suggesting building a secondary copy of our application to handle transactions when the main app is unavailable. Did I summarize that correctly?
00:27:05.440 The straightforward answer is that managing state across multiple applications can be extremely challenging. When an order comes in, it triggers numerous functions, impacting state across various components.
00:27:42.920 To illustrate this point, if an order is placed on a secondary application while the primary app is down, synchronization between these applications afterward becomes problematic. We have therefore determined that handling state across multiple applications complicates matters excessively.
00:28:12.400 As we discussed previously, deploying on Heroku makes sense as it allows for a streamlined process that many developers are familiar with, negating the need for complex infrastructure management, which balances the pros and cons of maintaining state.
00:28:51.560 Additionally, if problems arise with our primary system, we can leverage a third-party service to enhance our redundancy and backup.
00:29:25.200 We've thought about incorporating a third failover mechanism, and it’s entirely feasible to add more layers of redundancy with additional replayers. The key concept is to create a smooth way to manage requests to capture any critical data during failovers.
00:30:02.920 To conclude, although moving to a secondary payment gateway might appear to mitigate risk, the complexity introduced by additional dependencies might prove to be counterproductive.
00:30:39.680 In addressing how to manage various payment gateways or additional layers within the application, we tend to favor strategies that focus on reliability and direct control over service dependencies. Our process is continuously adapting and improving over time.
00:30:59.760 To your question about how the risk model has performed: early on, we saw a high percentage of transactions rejected, but as we adapted and tuned our strategies, we brought our acceptance rates into a much more favorable range.
00:31:21.920 Thank you, everyone! If you have further questions, I am open to discussion. Let's keep progressing together.
00:31:41.000 Thank you!
00:31:54.000 Thank you!