Uptime == Money: High Availability at Braintree

This video features the talk "Uptime == Money: High Availability at Braintree" by Paul Gross, a developer at Braintree, presented at RubyConf AU 2013. Braintree is a payment gateway that processes online payments, so downtime costs both the company and its merchants revenue, which makes high availability (HA) critical. Gross explains Braintree's strategies for staying up, covering both planned maintenance and unplanned failures.

Key points discussed throughout the talk include:

Importance of High Availability: With Braintree processing approximately $5 billion in annual transactions, uptime is vital; even a few minutes of downtime can lead to significant financial losses for both Braintree and its merchants.
Planned Downtime Management:
- The transition from MySQL to PostgreSQL has enabled quicker database migrations, drastically reducing planned downtime.
- Rolling deploys update servers one at a time, so the site as a whole stays up during a deployment.
- Rails migrations run inside transactional DDL, which PostgreSQL supports, so a failed migration rolls back cleanly instead of leaving the schema half-changed and causing an outage.
- Rails caches the column information for each table; the talk introduces a mechanism for managing these caches so that old columns can be removed without breaking running application processes.
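The rolling deploy idea above can be illustrated with a minimal sketch. This is an in-memory model written for this summary, not Braintree's deployment tooling: the `Server` struct, the `rolling_deploy` method, and the `health_check` callable are all hypothetical names, and assigning a version string stands in for the real update step.

```ruby
# Hypothetical in-memory model of a rolling deploy; none of these names
# come from Braintree's tooling.
Server = Struct.new(:name, :version, :in_rotation)

# Update one server at a time so the remaining servers keep serving traffic.
# Returns true if every server was updated, false if the rollout halted.
def rolling_deploy(servers, new_version, health_check)
  servers.each do |server|
    server.in_rotation = false      # drain: stop sending new requests here
    server.version = new_version    # stand-in for the real update step
    unless health_check.call(server)
      return false                  # leave the bad server out and stop
    end
    server.in_rotation = true       # healthy again, put it back
  end
  true
end

servers = %w[app1 app2 app3].map { |n| Server.new(n, "v1", true) }
always_healthy = ->(_server) { true }
rolling_deploy(servers, "v2", always_healthy)
```

Halting on the first failed health check means a bad release reaches at most one server, while the servers that have not yet been touched keep running the old version.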
Handling Unplanned Downtime:
- Braintree load balances across redundant services so that the failure of a single server does not take the site down.
- The company builds its load balancing on tools such as Linux Virtual Server (IPVS) rather than third-party black-box solutions, which gives the team more insight into and control over its systems.
- Automatic failover routes traffic to healthy instances when a server fails, preserving service continuity.
- BGP is used to manage inbound traffic, providing redundancy by rerouting through alternate paths during network issues.
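As a rough illustration of load balancing with automatic failover, the sketch below rotates over a list of backends and skips any that have been marked down. It is an application-level model only: IPVS does this kind of work inside the Linux kernel, and the `Balancer` class, the `with_failover` method, and the choice of `IOError` as the failure signal are assumptions made for the example.

```ruby
# Hypothetical round-robin balancer that skips backends marked down.
class Balancer
  def initialize(backends)
    @backends = backends
    @down = {}
    @next = 0
  end

  def mark_down(backend)
    @down[backend] = true
  end

  def mark_up(backend)
    @down.delete(backend)
  end

  # Return the next healthy backend, or nil if none are up.
  def pick
    @backends.size.times do
      candidate = @backends[@next % @backends.size]
      @next += 1
      return candidate unless @down[candidate]
    end
    nil
  end
end

# Try the request on successive backends; mark a backend down when it fails.
def with_failover(balancer, attempts)
  attempts.times do
    backend = balancer.pick
    raise "no healthy backends" if backend.nil?
    begin
      return yield(backend)
    rescue IOError
      balancer.mark_down(backend)
    end
  end
  raise "request failed after #{attempts} attempts"
end
```

A caller would wrap each outbound request, for example `with_failover(balancer, 3) { |backend| send_request(backend) }`, where `send_request` is whatever hypothetical method performs the call. A real setup would also need a health check that calls `mark_up` once a backend recovers.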
Robustness of Architecture: The architecture includes components such as a Redis queue (Broxy) for request handling, so requests can still be accepted while maintenance is under way, which limits the impact on end users.
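The idea of accepting requests during maintenance can be sketched as a gate that buffers requests while paused and replays them in arrival order on resume. This is not Broxy's implementation: a plain in-memory Array stands in for the Redis queue, and the `PausableGate` class and its method names are hypothetical.

```ruby
# Hypothetical request gate: while paused, requests are buffered instead of
# rejected; on resume they are handed to the handler in arrival order.
class PausableGate
  def initialize(&handler)
    @handler = handler
    @paused = false
    @pending = []     # stand-in for a Redis-backed queue
    @results = {}
  end

  def pause
    @paused = true
  end

  def submit(id, payload)
    if @paused
      @pending << [id, payload]   # accepted, but held until resume
      :queued
    else
      @results[id] = @handler.call(payload)
      :done
    end
  end

  def resume
    @paused = false
    until @pending.empty?
      id, payload = @pending.shift
      @results[id] = @handler.call(payload)
    end
  end

  def result(id)
    @results[id]
  end
end
```

From the client's point of view a paused request is slow rather than failed, which is why the pause has to be short enough to fit inside normal request timeouts.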

In conclusion, Braintree's approach to high availability combines careful planning, modern technologies, and a resilient architecture. With a target of five nines (99.999%) availability, Braintree keeps adapting its strategies to minimize service interruption, protecting both its own revenue and that of its merchants.
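To put these figures in perspective, the short calculation below works out the downtime budget implied by an availability target and the average transaction volume per minute implied by the $5 billion annual figure quoted in the summary. The method names are made up for the example, and the per-minute figure assumes volume is spread evenly across the year, which real traffic is not.

```ruby
# Downtime budget implied by an availability target, using a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60   # 525_600

def downtime_minutes_per_year(availability)
  MINUTES_PER_YEAR * (1.0 - availability)
end

# Average transaction volume per minute, assuming volume is spread evenly
# over the year (a simplification; real traffic is not uniform).
def volume_per_minute(annual_volume)
  annual_volume / MINUTES_PER_YEAR.to_f
end

five_nines = downtime_minutes_per_year(0.99999)   # about 5.26 minutes
per_minute = volume_per_minute(5_000_000_000)     # about 9,513 dollars
```

Under these assumptions, five nines leaves a little over five minutes of total downtime per year, and each of those minutes corresponds to roughly $9,500 in transactions that merchants could not process.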

Paul Gross • February 20, 2013 • Earth

RubyConf AU 2013: http://www.rubyconf.org.au

Braintree is a payment gateway, so downtime directly costs both us and our merchants money. Therefore, high availability is extremely important at Braintree. This talk will cover how we do HA at Braintree on our Ruby on Rails application.
Specific topics will include:

Working around planned downtime and deploys:
- How we pause traffic for short periods of time without failing requests
- How we fit our maintenance into these short pauses
- How we do rolling deploys and schema changes without downtime

Working around unplanned failures:
- How we load balance across redundant services
- How the app is structured to retry requests
