Uptime == Money: High Availability at Braintree

by Paul Gross

The video titled "Uptime == Money: High Availability at Braintree" features a talk by Paul Gross, a developer at Braintree, presented at RubyConf AU 2013. Braintree, a payment gateway that processes online payments, emphasizes the critical importance of high availability (HA) due to the substantial revenue losses incurred by both the company and its merchants during downtime. Gross elaborates on Braintree's strategies for maintaining uptime, addressing both planned and unplanned downtimes.

Key points discussed throughout the talk include:

  • Importance of High Availability: With Braintree processing approximately $5 billion in annual transactions, uptime is vital; even a few minutes of downtime can lead to significant financial losses for both Braintree and its merchants.

  • Planned Downtime Management:

    • The transition from MySQL to PostgreSQL has enabled quicker database migrations, drastically reducing planned downtime.
    • Implementing rolling updates allows for minimal disruption during deployments, with servers being updated individually without taking down the entire site.
    • They use transactional DDL in Rails migrations to ensure that failed migrations can roll back without causing significant outages.
    • A mechanism that makes Rails forget specific cached columns lets old columns be dropped without affecting running code.
  • Handling Unplanned Downtime:

    • Braintree employs load balancing across redundant services to optimize uptime during server failures.
    • The company constructs its load balancing system using tools like Linux Virtual Server (IPVS) rather than relying on third-party black box solutions, enhancing understanding and control over their systems.
    • Automatic failover mechanisms are in place to seamlessly route traffic to operational instances in case of server failures, which helps in managing service continuity.
    • The use of tools like BGP for managing inbound traffic ensures redundancy by rerouting through alternate paths during network issues.
  • Robustness of Architecture: The architecture includes a Redis-backed queue component (the Broxy) for request handling. It keeps accepting requests while maintenance is performed, which limits the impact on end users.

In conclusion, Braintree's approach to high availability combines careful planning, modern technologies, and a resilient architecture. Braintree aims for five nines (99.999%) availability and keeps adapting its strategies to minimize service interruptions, protecting both its own revenue and that of its merchants.

00:00:05.960 Hello, my name is Paul Gross. I'm a developer at Braintree, and I've been with the company for about three years. I know a fair amount about how things work there, so if you have questions about Braintree or some of our technology, feel free to ask me afterwards.
00:00:11.559 Braintree is a payment gateway, which means we provide software that allows you to process payments online. We primarily handle credit card transactions, and we are proud of our merchant list, which includes many recognizable names. Essentially, when you go to a merchant's site like GitHub and enter your credit card information, it comes to us, and we take care of all the payment-related tasks such as charging cards and depositing funds.
00:00:23.640 For us, uptime is critical because if we're down, both we and our merchants lose money. Last I checked, we process approximately $5 billion in annual transactions, averaging about $9,500 per minute. Every minute we're down means lost revenue for our merchants as well as for us. The problem is compounded because many users don't realize we are the ones handling their payments behind the scenes.
00:01:00.560 People often discuss uptime in terms of 'nines.' Five nines is the benchmark that's frequently aspired to, and it equates to just over five minutes of downtime a year. In practical terms, if a primary database fails and takes 10 to 15 minutes to recover, we've already missed our uptime goal for the year. At Braintree, we strive for five nines, although we don't always achieve it. There are two main types of downtime I will address during this talk: planned downtime and unplanned downtime, along with the strategies and techniques we use to manage both.
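The arithmetic behind these figures can be checked with a short calculation. This is only a sketch: the method names are made up for illustration, and a 365-day year is assumed.

```ruby
# Back-of-the-envelope check of the numbers quoted in the talk.
MINUTES_PER_YEAR = 365 * 24 * 60 # 525_600, ignoring leap years

# Allowed downtime per year, in minutes, for a given number of nines.
def downtime_budget_minutes(nines)
  unavailability = 10.0**(-nines) # e.g. 5 nines -> 0.00001
  MINUTES_PER_YEAR * unavailability
end

# Average transaction volume per minute for a given annual volume.
def volume_per_minute(annual_volume)
  annual_volume / MINUTES_PER_YEAR.to_f
end

(2..5).each do |n|
  puts format("%d nines: %.2f minutes of downtime per year", n, downtime_budget_minutes(n))
end
puts format("$5 billion per year is roughly $%d per minute", volume_per_minute(5_000_000_000).round)
```

Under these assumptions, five nines leaves a budget of a little over five minutes a year, so a single 10 to 15 minute database recovery uses it up several times over.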
00:01:51.760 To start with planned downtime, there are times when a maintenance window is necessary. It's crucial to reduce this maintenance window as much as possible to minimize impact. Braintree's main gateway codebase is about four years old, originating as a standard Rails app using MySQL with Apache and Passenger. During early maintenance, we would deploy code updates and interrupt services to run database migrations, which often took significant time.
00:02:35.080 To mitigate this, we transitioned from MySQL to PostgreSQL, which allows faster migrations. In MySQL, adding or altering columns typically involves copying the entire table, causing long downtimes, especially with larger datasets. In contrast, PostgreSQL performs these operations in milliseconds, which drastically reduces downtime. Also, PostgreSQL allows us to add indexes without locking tables, which was a major pain point with MySQL.
00:03:42.040 Using transactional DDL with Rails migrations means that if a migration fails, it rolls back completely, which is not something we had with MySQL. This rollback feature allows us to quickly revert to the previous state without a significant outage, which is crucial when deployments don’t go as planned. Today, our deployment process involves minimal downtime by using rolling updates and pre-migrations.
00:04:05.840 We deploy code server by server while the site is still up, taking all necessary pre-downtime actions before we take the system offline. After rolling code updates, we can add indexes and populate tables as needed when the site is operational again. This method enables us to maintain high availability during deployments.
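The rolling update idea can be illustrated with a toy model. This is a sketch under assumptions, not Braintree's deploy tooling: `Server`, `rolling_deploy`, and the server names are hypothetical, and the "update" is just a field assignment.

```ruby
# Toy model of a rolling deploy: one server at a time leaves the pool,
# gets the new code, and rejoins, so the pool is never empty.
Server = Struct.new(:name, :version, :in_pool)

def rolling_deploy(servers, new_version)
  min_in_pool = servers.count(&:in_pool)
  servers.each do |server|
    server.in_pool = false       # stop sending traffic to this server
    min_in_pool = [min_in_pool, servers.count(&:in_pool)].min
    server.version = new_version # stand-in for the real code update
    server.in_pool = true        # rejoin before touching the next server
  end
  min_in_pool                    # fewest servers that were serving at any point
end

servers = %w[app1 app2 app3].map { |name| Server.new(name, "v1", true) }
lowest = rolling_deploy(servers, "v2")
puts "versions: #{servers.map(&:version).uniq.inspect}, lowest capacity: #{lowest}/#{servers.size}"
```

The point of the sketch is that capacity dips by one server at a time rather than dropping to zero, which is why old and new code must be able to run side by side during the deploy.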
00:05:02.120 However, one downside is that Rails caches columns. This means we cannot simply drop a column while code that still knows about it is running: Rails keeps including the cached column in its inserts, which fail once the column no longer exists. To tackle this issue, we've implemented a mechanism that allows Rails to forget about specific columns, which gives us the ability to clean up old, unused columns without downtime.
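The talk does not show how this mechanism is implemented, so the following is only a toy model of the problem and the fix. `ToyTable`, `ToyModel`, and the `ignored:` option are hypothetical names; no Rails code is involved. Later Rails versions expose a comparable idea as `ignored_columns` on Active Record models.

```ruby
# Toy model only: a fake table plus a model that snapshots the column list at boot.
class ToyTable
  attr_reader :columns

  def initialize(columns)
    @columns = columns.dup
  end

  def drop_column(name)
    @columns.delete(name)
  end

  # Reject inserts that mention a column the table no longer has.
  def insert(column_names)
    missing = column_names - @columns
    raise ArgumentError, "unknown column(s): #{missing.join(', ')}" unless missing.empty?
    true
  end
end

class ToyModel
  # `ignored` plays the role of the "forget these columns" list.
  def initialize(table, ignored: [])
    @table = table
    @cached_columns = table.columns - ignored # cached once, never refreshed
  end

  def save
    @table.insert(@cached_columns)
  end
end

table = ToyTable.new(%w[id amount legacy_flag])
old_code = ToyModel.new(table)                           # still knows legacy_flag
new_code = ToyModel.new(table, ignored: %w[legacy_flag]) # deployed before the drop

table.drop_column("legacy_flag")

old_result = begin
  old_code.save
rescue ArgumentError => e
  e.message
end
puts "old code: #{old_result}"
puts "new code: #{new_code.save}"
```

The ordering is what matters: code that ignores the column goes out first, and only then is the column dropped.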
00:05:29.920 Additionally, our architecture uses a component we call the Broxy, which acts as a Redis queue. It receives requests from Apache and sends processed responses back, thus separating the handling of requests from the database interactions. This allows us to pause traffic without going down entirely. When we pause traffic for maintenance or another reason, requests continue to be accepted but are queued until we resume.
00:06:35.640 In essence, we can perform maintenance or even database failovers without impacting the user experience. Once we finish the deployment or maintenance task, we can start the dispatchers back up, and the queued requests are processed without any perceived downtime. The aim is to ensure high availability while maintaining flexibility during deployments.
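The pause-and-resume behaviour described above can be sketched in a few lines. This is a single-process toy, not the Broxy itself: a plain Array stands in for the Redis queue, and `ToyBroxy` and its methods are hypothetical names.

```ruby
# Toy model of accepting requests while dispatch is paused.
class ToyBroxy
  attr_reader :responses

  def initialize(&handler)
    @queue = []        # stand-in for the Redis queue
    @paused = false
    @handler = handler # stand-in for the application handling a request
    @responses = []
  end

  # Requests are always accepted; they are only processed when not paused.
  def accept(request)
    @queue << request
    dispatch unless @paused
  end

  def pause
    @paused = true
  end

  def resume
    @paused = false
    dispatch
  end

  def pending
    @queue.size
  end

  private

  # Drain the queue in arrival order.
  def dispatch
    @responses << @handler.call(@queue.shift) until @queue.empty?
  end
end

broxy = ToyBroxy.new { |request| "processed #{request}" }
broxy.accept("charge-1")
broxy.pause                # maintenance or failover starts
broxy.accept("charge-2")
broxy.accept("charge-3")
puts "queued during maintenance: #{broxy.pending}"
broxy.resume               # maintenance done, queue drains
puts broxy.responses.inspect
```

From the caller's point of view a paused request is simply slower, not rejected, which is the property the talk relies on. A real system would also need a bound on how long requests may wait.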
00:07:33.120 Now, let’s discuss unplanned failures. We understand that servers can fail unexpectedly, and our strategy involves load balancing across multiple redundant services. We prefer to avoid using black box load balancing appliances; instead, we construct our own load balancers using Linux Virtual Server (IPVS). This allows us to fully understand and optimize our components for high availability and security.
00:08:22.000 Our load balancer architecture includes checks to ensure that backends are operational, removing any that fail from the load balancing pool. This way, failed servers won’t receive traffic, and the system continues operating smoothly under expected loads.
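As an illustration of health-checked load balancing, here is a minimal sketch. It is not how IPVS is configured or implemented; `ToyPool`, the backend names, and the health check block are all assumptions made for the example.

```ruby
# Toy round-robin pool that only hands out backends passing a health check.
class ToyPool
  def initialize(backends, &health_check)
    @backends = backends
    @health_check = health_check
    @healthy = backends.dup
    @cursor = 0
  end

  # Re-run the health check and keep only the backends that pass.
  def run_checks
    @healthy = @backends.select { |backend| @health_check.call(backend) }
  end

  # Round-robin over the backends that passed the most recent check.
  def next_backend
    raise "no healthy backends" if @healthy.empty?
    backend = @healthy[@cursor % @healthy.size]
    @cursor += 1
    backend
  end
end

down = ["app2"] # pretend app2 has failed
pool = ToyPool.new(%w[app1 app2 app3]) { |backend| !down.include?(backend) }
pool.run_checks
picks = Array.new(4) { pool.next_backend }
puts picks.inspect
```

Once the failed backend recovers, the next round of checks puts it back into rotation without any manual step.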
00:08:59.440 We also utilize Pacemaker to manage clusters, ensuring that if one instance goes down, traffic can be routed seamlessly to another operational instance. This automatic failover mechanism is critical in maintaining uptime.
00:09:51.720 Network failures are common, so we use BGP to manage inbound traffic across multiple ISPs and data centers. If one route experiences issues, traffic is rerouted through alternate paths, which keeps our service available.
00:11:01.559 When it comes to outbound connections, we have multiple processing networks to connect to during transactions. We've implemented proxies to manage communication with these networks effectively, keeping track of network health and availability to adaptively reroute requests as necessary.
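The talk gives no implementation details for these proxies, so the following is a hedged sketch of one way to track network health and reroute. `ToyOutboundProxy`, the network names, and the consecutive-failure threshold are all hypothetical.

```ruby
# Toy outbound router: counts consecutive failures per processing network
# and skips any network that has failed too many times in a row.
class ToyOutboundProxy
  def initialize(networks, max_failures: 3)
    @failures = networks.map { |network| [network, 0] }.to_h
    @max_failures = max_failures
  end

  # Record the outcome of a request; any success resets the failure count.
  def record(network, success)
    @failures[network] = success ? 0 : @failures[network] + 1
  end

  def available
    @failures.select { |_network, count| count < @max_failures }.keys
  end

  # Prefer networks in their configured order, skipping unhealthy ones.
  def route
    available.first || raise("no processing network available")
  end
end

proxy = ToyOutboundProxy.new(%w[network_a network_b])
first_choice = proxy.route
3.times { proxy.record("network_a", false) } # network_a starts failing
rerouted = proxy.route
proxy.record("network_a", true)              # network_a recovers
recovered = proxy.route
puts [first_choice, rerouted, recovered].inspect
```

A real proxy would also need a way to probe a network that has been marked unhealthy, since here it only recovers when a success is recorded for it.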
00:11:45.560 In conclusion, our strategy focuses on maintaining uptime through precise pre-deployment migrations, utilizing automation to diminish human error, effectively managing both planned and unplanned downtimes, and architecting systems that are inherently resilient. We strive for five nines availability and continuously adapt our processes as necessary to meet this benchmark. Thank you for your attention, and I'm happy to take any questions.