Talks
Migrating a Live Site Across The Country Without Downtime

by Drew Blas

In this presentation at the MountainWest RubyConf 2013, Drew Blas from Chargify.com discusses the complexities and strategies involved in migrating their live site across the country without downtime. Chargify is a service that processes payments for businesses, necessitating their commitment to 24/7 uptime without planned maintenance windows.

Blas highlights the critical need for seamless migration due to the impact on customer businesses if Chargify experiences downtime. He outlines their selection of EC2 as their new cloud provider, citing its PCI Level 1 compliance and superior customer support as crucial factors in their decision-making process.

The key points covered include:

- Automation with Chef: Chargify transitioned from manual processes to automated systems using Chef, enabling efficient configuration management and successful replication of their environment on AWS.

- Testing Protocols: Stress testing and process testing were employed to ensure comprehensive validation, focusing on the functionality of systems under load and establishing consistent operational behavior across both data centers.

- Data Synchronization Challenges: Migrating without downtime required overcoming challenges such as data layer synchronization, DNS management, and routing traffic for MySQL and Redis. Strategies included establishing a site-to-site VPN for inter-datacenter communication and using HAProxy for efficient traffic routing.

- Utilizing Tungsten: Chargify leveraged Tungsten to manage MySQL replication and ensure data integrity, facilitating smooth failovers and uninterrupted service during the migration.

- Execution Plan: The migration process was meticulously planned with a six-step procedure that included halting non-essential processes and managing DNS settings to transition smoothly to the new data center.

In conclusion, Blas emphasizes the importance of thorough documentation, rigorous testing, and an automated infrastructure to facilitate future migrations and operational resilience. He encourages relentless testing, which is essential for maintaining high availability and performance in a cloud environment.

This session serves as a comprehensive guide for professionals facing similar challenges in data migration, addressing not only technical execution but also strategic planning and operational continuity.

00:00:09.280 Hello, and thank you for being here.
00:00:20.400 My name is Drew Blas, and I work at Chargify.com.
00:00:25.519 Chargify provides a service for managing recurring subscription billing, including everything related to credit cards.
00:00:31.519 This involves managing expiration dates, renewals, coupons, and all the details that accompany them.
00:00:45.360 Last year, we decided to move data centers. We selected a new provider and needed to migrate all our operations from the old data center to the new one.
00:01:02.719 The thing about Chargify is that we help process money for other businesses. Unlike many startups, businesses fundamentally exist to make a profit.
00:01:13.920 If we're not processing sign-ups and transactions for those who rely on us, they stop making money, which makes them unhappy customers.
00:01:20.560 For that reason, we don't have planned outages or maintenance windows; we aim for 24/7 uptime. Of course, achieving 100% uptime is not feasible, but it's our goal. Anything below that doesn't sit well with us.
00:01:40.560 We needed to find a way to switch seamlessly. I’m skipping ahead a bit to discuss our search for a new provider, as it’s crucial to understanding how we executed our migration.
00:01:55.840 I should start by saying that I despise enterprise sales. Given that we needed to be PCI Level 1 compliant, it was essential to find a hosting provider that could help us meet the same requirements.
00:02:21.599 This presents challenges since you end up with two extremes: old guard data centers clinging to outdated practices and newer cloud providers eager to help.
00:02:35.040 Old-school data centers still operate as if it’s the 1980s; they tend to ask questions about server needs three years into the future and insist on an exhaustive bid process.
00:02:47.280 As for pricing, it can feel random. They might consult an outdated pricing book to provide numbers that can change depending on the day or some arbitrary circumstance.
00:03:00.159 Thus, we decided to go with EC2, the quintessential cloud provider known for spearheading the virtualization revolution, and crucially, they are a PCI Level 1 certified provider.
00:03:14.879 Not many other cloud providers can say that yet, though they're striving to catch up. For any agile business, the flexibility they provide is unmatched.
00:03:35.040 We wanted to break free from reliance on the old model, which forced us to negotiate special deals with sales representatives for additional resources.
00:03:44.240 We recognized that Amazon, despite its imperfections, was the best choice: they had been around longer than any other cloud provider, met our security requirements, and now offered PCI Level 1 compliant hosting.
00:03:53.280 If you’ve used their enterprise-level support, you’ll likely agree it’s exceptional; you speak directly to someone who understands your needs without being transitioned between multiple representatives.
00:04:09.440 Having chosen EC2, we began planning our migration, which started with adding automation. We moved away from the old methods where environment managers executed manual scripts and commands in the console.
00:04:25.119 It was critical that we automated our processes, and while we briefly considered Puppet, we ultimately chose Chef as our configuration management tool since we are a Ruby-centric shop.
00:04:40.560 To align our new systems with the old ones, we first needed to install Chef on our existing servers. Without doing this, we couldn’t ensure our new systems replicated the structure of the old one.
00:04:51.840 Once Chef was implemented on the old systems, we duplicated our stack to AWS, allowing for side-by-side operation in a disaster recovery setup.
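The core idea of converging both environments with one set of cookbooks can be sketched as a Chef recipe. This is an illustrative fragment, not Chargify's actual cookbook; the package names, template path, and node attributes are assumptions, and the platform branch anticipates the Red Hat/Ubuntu split mentioned later in the talk.

```ruby
# Hypothetical Chef recipe: one cookbook converging the same app-server
# role on both the old (Red Hat) and new (Ubuntu) hosts.
# Package names, paths, and attributes are illustrative.

web_package = platform_family?('rhel') ? 'httpd' : 'apache2'

package web_package do
  action :install
end

service web_package do
  action [:enable, :start]
end

template '/etc/app/app.conf' do
  source 'app.conf.erb'
  variables(datacenter: node['app']['datacenter'])
  notifies :restart, "service[#{web_package}]"
end
```

Because the same recipe runs in both data centers, any drift between the environments shows up as a Chef change rather than an undocumented manual tweak.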
00:05:04.000 One day, we would simply failover to the new system as if the old provider had disappeared, aiming to eliminate any downtime during this planned switchover.
00:05:22.080 Our goal was to minimize the failover window, ideally resulting in no downtime at all. We would run tests repeatedly until we got it right.
00:05:38.080 Let's talk briefly about testing. Testing can mean various things. We have unit tests for our application and controller tests.
00:05:56.080 Infrastructure testing can involve keeping your Chef code in a repository while also verifying that your Chef recipes converge correctly through Vagrant or an equivalent tool.
00:06:13.440 But when I refer to testing in this presentation, I'm primarily discussing process testing.
00:06:27.360 This involves chaos monkey-type testing where interconnected systems are evaluated to ensure they achieve the intended outcomes.
00:06:41.920 It's akin to integration testing, but on a much higher level. For instance, is your site accepting signups, sending confirmation emails, and performing all necessary tasks?
00:06:55.760 If you can confirm that your entire stack works as expected in a production-like environment, you can safely replicate that environment elsewhere.
00:07:08.880 We ensured that both data centers operated identically and that all requests were processed correctly as intended. To do so, we could simulate fake traffic.
00:07:23.040 Generating synthetic traffic helps us test resilience. We were able to increase load beyond our typical use and evaluate the systems under stress.
00:07:39.680 This approach is beneficial, regardless of whether you're migrating or just optimizing your infrastructure.
00:07:50.399 Next, logging and alerting became vital in identifying operational issues. We set up scripts to continuously check our site for sign-up functionality.
00:08:06.960 If those scripts failed, we would receive alerts to diagnose the issues promptly.
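A monitoring script of the kind Blas describes can be sketched in a few lines of Ruby. This is a hypothetical reconstruction, not Chargify's actual script: the marker string and URL are assumptions, and the pass/fail logic is split into its own method so it can be exercised without a live site.

```ruby
require 'net/http'
require 'uri'

# Hypothetical sign-up health check: periodically exercise the public
# sign-up path and alert when it stops working. The marker string and
# alerting hook are illustrative assumptions.

SIGNUP_MARKER = 'Thanks for signing up'

# Pure decision logic, separated out so it can be tested without a network.
def signup_healthy?(status, body)
  status == 200 && body.include?(SIGNUP_MARKER)
end

# Fetch the page and evaluate it; a scheduler (cron, a loop, etc.)
# would call this repeatedly and page someone on failure.
def check_signup(url)
  response = Net::HTTP.get_response(URI(url))
  signup_healthy?(response.code.to_i, response.body)
end

# Usage (hypothetical):
#   check_signup('https://app.example.com/signup') or warn 'ALERT: sign-up down'
```

Running the same check against both data centers gives a continuous, apples-to-apples signal that each environment is actually serving sign-ups, not merely responding to pings.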
00:08:21.440 Returning to our initial plan, we needed to automate everything. Automation is crucial for reproducibility, accountability, and reducing human error.
00:08:37.200 Transitioning to a post-manual world was a significant challenge, since our previous environment had always been managed by hand.
00:08:50.079 Integrating Chef wasn’t a walk in the park because cookbooks often have numerous dependencies, complicating the compatibility with existing systems.
00:09:05.760 Untangling those dependencies is arduous; much like achieving 100% test coverage in an untested application, it takes time and effort.
00:09:20.480 As we tackled this, we had some unique challenges. For instance, our old environment was Red Hat-based while our new setup was on Ubuntu.
00:09:36.579 While Chef can handle different distributions effectively, many community cookbooks are not optimized for varying base systems.
00:09:48.560 Ultimately, we achieved Chef automation across all our old systems and created identical environments in the new data center.
00:10:03.279 Now we needed to address the challenges of actually switching everything over, particularly with regards to data synchronization.
00:10:19.200 In an ideal situation, we’d run both centers side-by-side, transitioning from one to the other without noticeable change.
00:10:35.280 However, achieving that level of synchronization requires addressing data layer challenges and other interconnected components.
00:10:50.640 Our first solution was to establish a VPN connection. Our old data center already had a hardware VPN appliance in place for secure remote access.
00:11:04.560 Amazon provided a seamless way to connect; we established a site-to-site VPN tunnel, allowing everything to function as if on the same network.
00:11:20.560 As long as we accounted for latency, setting up this connectivity wasn't too complicated. Routing the traffic correctly was the trickiest part.
00:11:36.960 Amazon's Virtual Private Cloud (VPC) played a pivotal role here. With VPC, systems can be interconnected while offering enhanced routing and security capabilities.
00:11:56.400 VPC allows us to set our subnets, define private IPs, and utilize routing tables to manage traffic more effectively.
00:12:10.160 Using VPC, we could guide traffic within our private network, ensuring optimal communication between our old and new data centers.
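The routing decision a VPC route table encodes can be illustrated with longest-prefix matching. The CIDR ranges below are hypothetical stand-ins for the two data centers; a real VPC route table applies this logic inside AWS, not in application code.

```ruby
require 'ipaddr'

# Illustrative route-table logic (hypothetical CIDRs): traffic bound for
# the old data center's private range rides the VPN tunnel, in-VPC traffic
# stays local, and everything else exits via the internet gateway.
ROUTES = [
  { cidr: IPAddr.new('10.1.0.0/16'), target: :vpn_gateway },      # old DC
  { cidr: IPAddr.new('10.2.0.0/16'), target: :local },            # this VPC
  { cidr: IPAddr.new('0.0.0.0/0'),   target: :internet_gateway }, # default
].freeze

def route_for(ip)
  dest = IPAddr.new(ip)
  # Most-specific (longest-prefix) match wins, as in a real routing table.
  ROUTES.select { |r| r[:cidr].include?(dest) }
        .max_by  { |r| r[:cidr].prefix }
        .fetch(:target)
end
```

Because the default route only wins when no private range matches, hosts in either data center can address each other by private IP while normal internet traffic is unaffected.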
00:12:28.280 However, we faced three key migration challenges: managing DNS changes, synchronizing Redis, and routing MySQL traffic.
00:12:43.440 The DNS migration presented its own challenges. Changing DNS to direct traffic to the new data center was critical but notoriously unpredictable.
00:12:59.440 When DNS changes are made, they don't propagate instantaneously; they're often prone to errors, resulting in potential connection issues.
00:13:16.080 We realized that dropping traffic entirely wasn't a plausible option. Instead, we opted to switch off app servers in the old data center and insert HAProxy.
00:13:27.920 HAProxy would forward traffic over the VPN. Although this increased latency, it was preferable to losing customer requests entirely.
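The HAProxy arrangement described here — the old data center accepting connections but forwarding everything to the new one — might look roughly like the following config fragment. The backend addresses and names are illustrative assumptions, not Chargify's actual configuration.

```
# Hypothetical haproxy.cfg fragment: the old data center no longer serves
# the app itself; every request is forwarded over the site-to-site VPN to
# app servers in the new environment. Names and addresses are illustrative.
frontend old_dc_in
    bind *:80
    default_backend new_dc_over_vpn

backend new_dc_over_vpn
    # New-DC app servers, reached via the VPN tunnel.
    server new-app-1 10.2.1.10:80 check
    server new-app-2 10.2.1.11:80 check
```

Clients still hitting the old IP while DNS propagates pay a latency penalty for the extra VPN hop, but no request is dropped.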
00:13:44.800 We also faced challenges with Redis, particularly concerning our use of it for Resque jobs. Redis failover introduces complexity, particularly with Redis Sentinel.
00:14:05.679 The documentation on Redis Sentinel is extensive, detailing various states and rules, complicating our understanding of its functionality.
00:14:23.920 As we explored alternatives, we found that orchestrating synchronous traffic transfer between Redis instances was problematic.
00:14:39.440 To solve this issue, we decided to configure multiple queues within different data centers without needing to synchronize them.
00:14:57.520 Jobs could flow into separate Resque queues across data centers, with the older queue remaining active until it was fully processed.
00:15:15.439 Thus, we created a more manageable solution rather than a perfect one.
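The dual-queue cutover pattern described above can be sketched with in-memory queues. Plain arrays stand in here for Redis-backed Resque queues, and the class name is hypothetical; the point is that no synchronization is needed because no queue is ever written from both sides of the switch.

```ruby
# In-memory sketch of the dual-queue cutover described for Resque: after
# the switch, producers enqueue only to the new data center's queue, while
# workers in the old data center keep draining theirs until it is empty.
class QueueCutover
  attr_reader :old_queue, :new_queue, :processed

  def initialize(old_jobs = [])
    @old_queue = old_jobs.dup
    @new_queue = []
    @processed = []
    @switched  = false
  end

  # Flip producers over to the new data center's queue.
  def switch!
    @switched = true
  end

  def enqueue(job)
    (@switched ? @new_queue : @old_queue) << job
  end

  # Old-DC workers run until their queue drains, then can be shut down.
  def drain_old!
    @processed << @old_queue.shift until @old_queue.empty?
  end
end
```

Once `drain_old!` completes, the old data center's workers have nothing left to do and can be terminated without losing a job.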
00:15:27.760 The third challenge was managing MySQL traffic. Rerouting this traffic while maintaining data integrity presented significant difficulties.
00:15:45.920 We could have gone with a master-master configuration, but the overhead and reliability issues outweighed the benefits.
00:16:06.080 Instead, we opted for asynchronous replication to switch traffic smoothly. This meant we needed a robust solution in case of a failover.
00:16:21.760 That solution came through investing in Tungsten, a product from Continuent. Tungsten is partially open source, with many advanced features offered under a commercial license.
00:16:36.720 Tungsten manages MySQL replication efficiently, allowing more straightforward failovers while ensuring no queries are lost during the transition.
00:16:54.960 It also understands SQL commands, allowing it to queue commands during a failover, ensuring consistent execution once the process completes.
00:17:11.360 Additionally, Tungsten is data center-aware, managing replication across different geographical locations efficiently.
00:17:30.080 This permits failover to local replicas while reducing downtime. The structure they created resolves many of the complex problems we faced during migration.
00:17:51.280 With an effective failover strategy, we can swap back and forth between master and slave servers without losing any queries.
00:18:09.280 We can take servers offline for updates without service interruption, and those schema changes happen significantly faster than expected.
00:18:30.560 Eventually, we reached the big day of migration. We tested tirelessly and planned our preparations carefully.
00:18:48.320 We lowered our DNS TTL and ran simulated traffic and rollbacks to ensure we caught every possible error.
00:19:05.120 Despite simulating countless scenarios, issues arose—the data would desynchronize, or we would receive error reports.
00:19:18.160 However, once we successfully switched between the environments without causing any errors, we felt confident in moving forward.
00:19:33.839 We devised a six-step plan to execute the migration smoothly. Each step was essentially a single command that acted as a rollback point.
00:19:51.920 We halted all non-essential processes to minimize potential errors during the transition.
00:20:10.080 Next came the switch itself. First, we moved the master database to the new data center, with all queries initially traveling across the VPN.
00:20:27.200 The initial response times were slower, but we planned to streamline traffic through HAProxy to reduce latency.
00:20:41.920 After carrying that out, we switched the DNS settings, directing most users to the new environment.
00:20:54.080 Following our timeline, once we confirmed no outstanding requests in the old environment, we would terminate the non-essential processes and clear the old queue.
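The six-step plan, with each step acting as a single command and a rollback point, can be sketched as a runbook structure. The step names below paraphrase the transcript; the execution harness, commands, and rollbacks are hypothetical placeholders, not Chargify's actual tooling.

```ruby
# Sketch of a runbook in which each step is one command and doubles as a
# rollback point, mirroring the six-step plan from the talk. Step names
# paraphrase the transcript; commands and rollbacks are illustrative.
Step = Struct.new(:name, :run, :rollback)

RUNBOOK = [
  Step.new('halt non-essential processes',                -> { :ok }, -> { :ok }),
  Step.new('switch MySQL master to the new DC',           -> { :ok }, -> { :ok }),
  Step.new('route app traffic through HAProxy over VPN',  -> { :ok }, -> { :ok }),
  Step.new('switch DNS to the new environment',           -> { :ok }, -> { :ok }),
  Step.new('drain remaining requests in the old DC',      -> { :ok }, -> { :ok }),
  Step.new('stop old-DC processes and clear the queue',   -> { :ok }, -> { :ok }),
].freeze

# Execute steps in order; on the first failure, undo the completed steps in
# reverse so the site stays on whichever side last worked.
def execute(runbook)
  done = []
  runbook.each do |step|
    if step.run.call == :ok
      done << step
    else
      done.reverse_each { |s| s.rollback.call }
      return [:rolled_back, step.name]
    end
  end
  [:migrated, done.map(&:name)]
end
```

Structuring the cutover this way means that at any moment during the migration there is exactly one command to run to move forward and a known path back if it fails.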
00:21:09.120 We invested considerable effort into laying a strong foundation for our infrastructure. Laboring for this migration has proven beneficial.
00:21:27.920 We are now confident in switching between data centers and can conduct controlled failovers regularly.
00:21:42.559 We recognize not every migration is seamless; the groundwork we've built now allows for future improvements.
00:21:59.040 Continued testing is necessary, and tools like ZooKeeper or Doozer will help us manage configurations seamlessly.
00:22:12.559 As we wrap up today, remember that thinking through, documenting, and testing configurations are key.
00:22:28.399 Thank you for your attention.