00:00:05.960
Hello, my name is Paul Gross. I'm a developer at Braintree, and I've been with the company for about three years. I've picked up a lot of knowledge in that time, so if you have questions about Braintree or our technology, feel free to ask me afterwards.
00:00:11.559
Braintree is a payment gateway, which means we provide software that allows you to process payments online. We primarily handle credit card transactions, and we are proud of our merchant list, which includes many recognizable names. Essentially, when you go to a merchant's site like GitHub and enter your credit card information, it comes to us, and we take care of all the payment-related tasks such as charging cards and depositing funds.
00:00:23.640
For us, uptime is critical because if we're down, both we and our merchants lose money. Last I checked, we process approximately $5 billion in annual transactions, which averages out to about $9,500 per minute. Every minute we're down therefore costs our merchants real revenue. It's made worse by the fact that many users don't realize we are the ones behind the scenes handling their payments, so to them it looks like the merchant's site is the one that is broken.
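As a quick check of the per-minute figure, here is a minimal sketch of the arithmetic. It assumes the $5 billion is spread evenly over a 365-day year, which real traffic would not be; the names are illustrative only.

```python
# Assumption: annual volume is spread uniformly over a 365-day year.
ANNUAL_VOLUME_USD = 5_000_000_000
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def volume_per_minute(annual_usd: float) -> float:
    """Average dollars processed per minute under the uniform assumption."""
    return annual_usd / MINUTES_PER_YEAR


per_minute = volume_per_minute(ANNUAL_VOLUME_USD)
print(f"${per_minute:,.0f} per minute")  # $9,513 per minute
```

That lands close to the roughly $9,500 quoted above.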
00:01:00.560
People often discuss uptime in terms of 'nines.' For instance, five nines is the benchmark that's frequently aspired to, equating to just over five minutes of downtime a year. This means in practical terms, if a primary database fails and takes 10 to 15 minutes to recover, we've already missed our uptime goal for the year. At Braintree, we strive for five nines, although we don't always achieve it. There are two main types of downtime I will address during this talk: planned downtime and unplanned downtime, along with the strategies and techniques we use to manage both.
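To make the 'nines' concrete, the following sketch computes the yearly downtime budget for a given number of nines. It assumes a 365-day year and is only an illustration of the arithmetic, not a tool we use.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines.

    Five nines means 99.999% available, i.e. 10**-5 of the year may be down.
    """
    return MINUTES_PER_YEAR * 10 ** -nines


for n in (3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes per year")

# A single 10-minute database recovery already exceeds the five-nines budget.
print(10 > allowed_downtime_minutes(5))  # True
```

Five nines works out to about 5.26 minutes a year, which is why one 10 to 15 minute recovery uses up the whole budget.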
00:01:51.760
To start with planned downtime, there are times when a maintenance window is necessary. It's crucial to reduce this maintenance window as much as possible to minimize impact. Braintree's main gateway codebase is about four years old, originating as a standard Rails app using MySQL with Apache and Passenger. During early maintenance, we would deploy code updates and interrupt services to run database migrations, which often took significant time.
00:02:35.080
To mitigate this, we transitioned from MySQL to PostgreSQL, which allows much faster migrations. In MySQL, adding or altering a column typically involves copying the entire table, which means long downtimes, especially on larger tables. In contrast, PostgreSQL can add a nullable column without rewriting the table, so the operation takes milliseconds and downtime drops drastically. PostgreSQL also lets us build indexes concurrently, without locking the table against writes, which was a major pain point with MySQL.
00:03:42.040
PostgreSQL also supports transactional DDL, so when a Rails migration fails partway through, the whole migration rolls back, which is not something we had with MySQL. That lets us quickly return to the previous state without a significant outage, which is crucial when deployments don’t go as planned. Today, we keep deployment downtime to a minimum by using rolling updates and pre-migrations.
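The effect of transactional DDL can be sketched in a few lines. This is not a Rails migration and not PostgreSQL: it uses SQLite, which ships with Python and also supports transactional DDL, purely as a stand-in. The `migrate` helper and the table names are hypothetical.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode so that
# BEGIN / COMMIT / ROLLBACK are issued explicitly by this script.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER)")


def migrate(conn, statements):
    """Run all schema changes in one transaction; undo everything on failure."""
    conn.execute("BEGIN")
    try:
        for sql in statements:
            conn.execute(sql)
        conn.execute("COMMIT")
        return True
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        return False


ok = migrate(conn, [
    "ALTER TABLE payments ADD COLUMN currency TEXT",   # succeeds
    "ALTER TABLE no_such_table ADD COLUMN x TEXT",     # fails: table is missing
])

columns = [row[1] for row in conn.execute("PRAGMA table_info(payments)")]
print(ok, columns)  # False ['id', 'amount']
```

The first `ALTER TABLE` succeeded on its own, but because the second one failed, the rollback removes the `currency` column too and the schema is back where it started.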
00:04:05.840
We deploy code server by server while the site is still up, and we do everything that can be done ahead of time before we take the system offline. Once the code has rolled out and the site is operational again, we can add indexes and populate tables as needed. This method enables us to maintain high availability during deployments.
00:05:02.120
However, one downside is that Rails caches each table's column list, so we cannot simply drop a column once new code is live: running processes still name the dropped column in their INSERT statements, and those inserts fail. To tackle this issue, we've implemented a mechanism that tells Rails to forget about specific columns, which gives us the ability to clean up old, unused columns without downtime.
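The column-caching problem and the workaround can be modelled with a toy example. This is a Python sketch, not Rails code and not our actual mechanism; `CachedModel`, `ignored`, and the column names are all hypothetical.

```python
class CachedModel:
    """Toy model that snapshots a table's columns once, at process start."""

    def __init__(self, table, boot_columns, ignored=()):
        self.table = table
        self.boot_columns = tuple(boot_columns)  # cached at "boot", never refreshed
        self.ignored = frozenset(ignored)        # columns the app agrees to forget

    def insert_columns(self):
        # Columns this process will name in every INSERT statement.
        return [c for c in self.boot_columns if c not in self.ignored]


def stale_columns(model, live_schema):
    """Columns the process would still write to that no longer exist."""
    return [c for c in model.insert_columns() if c not in live_schema]


boot = ("id", "amount", "legacy_flag")
after_drop = {"id", "amount"}  # live schema once legacy_flag has been dropped

naive = CachedModel("transactions", boot)
careful = CachedModel("transactions", boot, ignored={"legacy_flag"})

print(stale_columns(naive, after_drop))    # ['legacy_flag']
print(stale_columns(careful, after_drop))  # []
```

The idea is to ship the code that ignores the column first, and only drop the column once every running process has forgotten about it.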
00:05:29.920
Additionally, our architecture includes a component we call the Broxy, which queues requests in Redis. It receives requests from Apache and sends the processed responses back, which separates accepting requests from the work that touches the database. This lets us pause traffic without going down entirely: while traffic is paused for maintenance or any other reason, requests are still accepted but sit in the queue until we resume.
00:06:35.640
In essence, we can perform maintenance or even database failovers without users seeing errors. Once the deployment or maintenance task is finished, we start the dispatchers, the processes that work through the queue, back up, and the queued requests are processed. A user sees at most a slower response rather than an outage. The aim is to ensure high availability while maintaining flexibility during deployments.
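The pause-and-resume behaviour can be sketched with an in-process queue. This is only a toy model: the real Broxy sits between Apache and Redis, whereas this uses a `collections.deque` as a stand-in, and the `PausableDispatcher` class is hypothetical.

```python
from collections import deque


class PausableDispatcher:
    """Toy model: requests are always accepted; they are processed only when running."""

    def __init__(self, handler):
        self.queue = deque()
        self.handler = handler
        self.paused = False
        self.responses = []

    def accept(self, request):
        # Accepting never fails; while paused the request just waits in the queue.
        self.queue.append(request)
        self.drain()

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False
        self.drain()

    def drain(self):
        # Process queued requests in arrival order unless paused.
        while self.queue and not self.paused:
            self.responses.append(self.handler(self.queue.popleft()))


d = PausableDispatcher(handler=lambda r: f"processed {r}")
d.accept("charge-1")
d.pause()              # e.g. start of a maintenance step
d.accept("charge-2")
d.accept("charge-3")
print(len(d.queue))    # 2
d.resume()
print(d.responses)     # ['processed charge-1', 'processed charge-2', 'processed charge-3']
```

Nothing is rejected during the pause; the two requests that arrived in the meantime are simply answered later, in order.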
00:07:33.120
Now, let’s discuss unplanned failures. We understand that servers can fail unexpectedly, and our strategy involves load balancing across multiple redundant services. We prefer to avoid black-box load balancing appliances; instead, we build our own load balancers using Linux Virtual Server (LVS) and its kernel component, IPVS. This allows us to fully understand and optimize our components for high availability and security.
00:08:22.000
Our load balancer architecture includes checks to ensure that backends are operational, removing any that fail from the load balancing pool. This way, failed servers won’t receive traffic, and the system continues operating smoothly under expected loads.
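The health-check idea can be illustrated with a small round-robin pool. This is a sketch in plain Python, not LVS or IPVS configuration; `BackendPool`, the `check` callable, and the backend names are hypothetical.

```python
class BackendPool:
    """Toy round-robin pool that only hands out backends passing a health check."""

    def __init__(self, backends, check):
        self.backends = list(backends)
        self.check = check                  # callable: backend -> bool
        self.healthy = list(self.backends)  # assume all healthy until checked
        self._next = 0

    def run_health_checks(self):
        # Failed backends leave the rotation but stay in self.backends,
        # so they rejoin automatically once a later check passes.
        self.healthy = [b for b in self.backends if self.check(b)]

    def pick(self):
        if not self.healthy:
            raise RuntimeError("no healthy backends")
        backend = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return backend


status = {"app1": True, "app2": True, "app3": True}
pool = BackendPool(status, check=lambda b: status[b])

status["app2"] = False      # app2 fails
pool.run_health_checks()
picks = [pool.pick() for _ in range(4)]
print(picks)                # ['app1', 'app3', 'app1', 'app3']
```

After the failed check, `app2` receives no traffic, and the remaining backends share the load.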
00:08:59.440
We also utilize Pacemaker to manage clusters, ensuring that if one instance goes down, traffic can be routed seamlessly to another operational instance. This automatic failover mechanism is critical in maintaining uptime.
00:09:51.720
Network failures are common, so we use BGP to manage inbound traffic across diverse ISPs and data centers. If one route experiences issues, traffic is rerouted through alternate paths, and our service stays available.
00:11:01.559
When it comes to outbound connections, we connect to multiple processing networks during transactions. We've implemented proxies to manage communication with these networks. The proxies keep track of each network's health and availability and reroute requests as necessary.
00:11:45.560
In conclusion, our strategy focuses on maintaining uptime by running migrations ahead of deployments, using automation to reduce human error, managing both planned and unplanned downtime, and architecting systems that are inherently resilient. We strive for five nines of availability and keep adapting our processes to meet this benchmark. Thank you for your attention, and I'm happy to take any questions.