00:00:09.280
Hello, and thank you for being here.
00:00:20.400
My name is Drew Blas, and I work at Chargify.com.
00:00:25.519
Chargify provides a service for managing recurring subscription billing, including everything related to credit cards.
00:00:31.519
This involves managing expiration dates, renewals, coupons, and all the details that accompany them.
00:00:45.360
Last year, we decided to move data centers. We selected a new provider and needed to migrate all our operations from the old data center to the new one.
00:01:02.719
The thing about Chargify is that we help process money for other businesses. Startups sometimes forget this, but businesses fundamentally exist to generate profit.
00:01:13.920
If we're not processing sign-ups and transactions for those who rely on us, they stop making money, which makes them unhappy customers.
00:01:20.560
For that reason, we don't have planned outages or maintenance windows; we aim for 24/7 uptime. Of course, achieving 100% uptime is not feasible, but it's our goal. Anything below that doesn't sit well with us.
00:01:40.560
We needed to find a way to switch seamlessly. I’m skipping ahead a bit to discuss our search for a new provider, as it’s crucial to understanding how we executed our migration.
00:01:55.840
I should start by saying that I despise enterprise sales. Given that we needed to be PCI Level 1 compliant, it was essential to find a hosting provider that could help us meet the same requirements.
00:02:21.599
This presents challenges since you end up with two extremes: old guard data centers clinging to outdated practices and newer cloud providers eager to help.
00:02:35.040
Old-school data centers still operate as if it’s the 1980s; they tend to ask questions about server needs three years into the future and insist on an exhaustive bid process.
00:02:47.280
As for pricing, it can feel random. They might consult an outdated pricing book to provide numbers that can change depending on the day or some arbitrary circumstance.
00:03:00.159
Thus, we decided to go with EC2, the quintessential cloud provider known for spearheading the virtualization revolution, and crucially, they are a PCI Level 1 certified provider.
00:03:14.879
Not many other cloud providers can say that yet, though they're striving to catch up. For any agile business, the flexibility they provide is unmatched.
00:03:35.040
We wanted to break free from reliance on the old model, which forced us to negotiate special deals with sales representatives for additional resources.
00:03:44.240
We recognized that Amazon, despite its imperfections, was the best choice. They had been around longer than any other cloud provider, fulfilling our security requirements and now offering level 1 PCI compliant support.
00:03:53.280
If you’ve used their enterprise-level support, you’ll likely agree it’s exceptional; you speak directly to someone who understands your needs without being transitioned between multiple representatives.
00:04:09.440
Having chosen EC2, we began planning our migration, which started with adding automation. We moved away from the old methods where environment managers executed manual scripts and commands in the console.
00:04:25.119
It was critical that we automated our processes, and while we briefly considered Puppet, we ultimately chose Chef as our configuration management tool since we are a Ruby-centric shop.
00:04:40.560
To align our new systems with the old ones, we first needed to install Chef on our existing servers. Without doing this, we couldn’t ensure our new systems replicated the structure of the old one.
00:04:51.840
Once Chef was implemented on the old systems, we duplicated our stack to AWS, allowing for side-by-side operation in a disaster recovery setup.
00:05:04.000
One day, we would simply failover to the new system as if the old provider had disappeared, aiming to eliminate any downtime during this planned switchover.
00:05:22.080
Our goal was to minimize the failover window, ideally resulting in no downtime at all. We would run tests repeatedly until we got it right.
00:05:38.080
Let's talk briefly about testing. Testing can mean various things. We have unit tests for our application and controller tests.
00:05:56.080
Infrastructure testing can involve maintaining your Chef code in a repository while also verifying that your Chef recipes converge correctly through Vagrant or an equivalent tool.
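As a small illustration of that kind of Chef-plus-Vagrant workflow (the box name, cookbook name, and attributes here are hypothetical, not our actual setup), a minimal Vagrantfile can converge a cookbook in a throwaway VM before it ever touches a real server:

```ruby
# Vagrantfile -- a minimal sketch for exercising a Chef cookbook in a
# disposable VM. Box name, cookbook, and attributes are hypothetical.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"

  config.vm.provision :chef_solo do |chef|
    chef.cookbooks_path = "cookbooks"
    chef.add_recipe "app_server"              # hypothetical cookbook
    chef.json = { "app" => { "env" => "test" } }
  end
end
```

`vagrant up` then converges the recipe in isolation, so a broken cookbook fails in a VM instead of on a production box.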
00:06:13.440
But when I refer to testing in this presentation, I'm primarily discussing process testing.
00:06:27.360
This involves chaos monkey-type testing where interconnected systems are evaluated to ensure they achieve the intended outcomes.
00:06:41.920
It's akin to integration testing, but on a much higher level. For instance, is your site accepting signups, sending confirmation emails, and performing all necessary tasks?
00:06:55.760
If you can confirm that your entire stack works as expected in a production-like environment, you can safely replicate that environment elsewhere.
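A process-level check like the one just described can be sketched as a short Ruby method. The client interface here is hypothetical; in practice it would drive real HTTP requests and poll a mail sink:

```ruby
# A sketch of a process-level smoke test: exercise the signup flow end
# to end and report pass/fail. The `client` object is a hypothetical
# stand-in for whatever drives real requests against the stack.
def stack_healthy?(client)
  account = client.signup(email: "probe@example.com")
  return false unless account[:status] == :created

  # Check the side effects too, not just the HTTP response: did the
  # confirmation email actually go out?
  email = client.last_email_to("probe@example.com")
  !email.nil? && email[:subject].include?("Welcome")
end
```

Run repeatedly against a production-like environment, a check like this exercises the whole stack -- app servers, background workers, mail delivery -- rather than any single endpoint.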
00:07:08.880
We ensured that both data centers operated identically and that all requests were processed correctly as intended. To verify this, we generated synthetic traffic.
00:07:23.040
Generating synthetic traffic helps us test resilience. We were able to increase load beyond our typical use and evaluate the systems under stress.
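A synthetic-traffic driver can be as simple as a handful of threads hammering the stack and counting successes. This is a sketch, not our actual tooling; `request` stands in for whatever performs one real HTTP request:

```ruby
# A sketch of a synthetic-load driver: fire requests from several
# threads and count how many succeed. `request` is any callable that
# performs one unit of work (a real version would issue HTTP requests).
def generate_load(request, threads: 10, requests_per_thread: 100)
  successes = Queue.new
  workers = Array.new(threads) do
    Thread.new do
      requests_per_thread.times do
        successes << 1 if request.call
      rescue StandardError
        # A failed request under load is data, not a crash.
      end
    end
  end
  workers.each(&:join)
  successes.size
end
```

Pushing the thread count well past normal production levels shows where the stack degrades first, which is exactly the kind of stress evaluation described above.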
00:07:39.680
This approach is beneficial, regardless of whether you're migrating or just optimizing your infrastructure.
00:07:50.399
Next, logging and alerting became vital in identifying operational issues. We set up scripts to continuously check our site for sign-up functionality.
00:08:06.960
If those scripts failed, we would receive alerts to diagnose the issues promptly.
00:08:21.440
Returning to our initial plan, we needed to automate everything. Automation is crucial for reproducibility, accountability, and reducing human error.
00:08:37.200
Transitioning out of a fully manual world was a significant challenge, especially since our previous environment had always been managed by hand.
00:08:50.079
Integrating Chef wasn’t a walk in the park because cookbooks often have numerous dependencies, complicating the compatibility with existing systems.
00:09:05.760
Establishing those dependencies is arduous, much like achieving 100% test coverage in a previously untested application: it takes time and effort.
00:09:20.480
As we tackled this, we had some unique challenges. For instance, our old environment was Red Hat-based while our new setup was on Ubuntu.
00:09:36.579
While Chef can handle different distributions effectively, many community cookbooks are not optimized for varying base systems.
00:09:48.560
Ultimately, we achieved Chef automation across all our old systems and created identical environments in the new data center.
00:10:03.279
Now we needed to address the challenges of actually switching everything over, particularly with regards to data synchronization.
00:10:19.200
In an ideal situation, we’d run both centers side-by-side, transitioning from one to the other without noticeable change.
00:10:35.280
However, achieving that level of synchronization requires addressing data layer challenges and other interconnected components.
00:10:50.640
Our first solution was to establish a VPN connection. Our old data center already had a hardware VPN appliance in place for secure remote access.
00:11:04.560
Amazon provided a seamless way to connect; we established a site-to-site VPN tunnel, allowing everything to function as if on the same network.
00:11:20.560
As long as we accounted for latency, setting up this connectivity wasn't too complicated. Routing the traffic correctly was the trickiest part.
00:11:36.960
Amazon's Virtual Private Cloud (VPC) played a pivotal role here. With VPC, systems can be interconnected while offering enhanced routing and security capabilities.
00:11:56.400
VPC allows us to set our subnets, define private IPs, and utilize routing tables to manage traffic more effectively.
00:12:10.160
Using VPC, we could guide traffic within our private network, ensuring optimal communication between our old and new data centers.
00:12:28.280
However, we faced three key migration challenges: managing DNS changes, synchronizing Redis, and routing MySQL traffic.
00:12:43.440
The DNS migration presented its own challenges. Changing DNS to direct traffic to the new data center was critical but notoriously unpredictable.
00:12:59.440
When DNS changes are made, they don't propagate instantaneously; they're often prone to errors, resulting in potential connection issues.
00:13:16.080
We realized that dropping traffic entirely wasn't a viable option. Instead, we opted to switch off the app servers in the old data center and insert HAProxy in their place.
00:13:27.920
HAProxy would forward traffic over the VPN. Although this increased latency, it was preferable to losing customer requests entirely.
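A minimal sketch of that bridge, with hypothetical addresses: the HAProxy left behind in the old data center simply forwards everything it still receives to the new app servers over the VPN tunnel.

```
# haproxy.cfg sketch -- addresses and names are hypothetical
frontend old_dc_http
    bind *:80
    default_backend new_dc_app

backend new_dc_app
    # New data center app servers, reachable over the site-to-site VPN
    server app1 10.1.0.10:80 check
    server app2 10.1.0.11:80 check
```

Requests from clients still holding stale DNS take the slower path, but nothing is dropped while the records propagate.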
00:13:44.800
We also faced challenges with Redis, particularly concerning our use of it for Resque jobs. Redis failover introduces complexity, particularly with Redis Sentinel.
00:14:05.679
The documentation on Redis Sentinel is extensive, detailing various states and rules, complicating our understanding of its functionality.
00:14:23.920
As we explored alternatives, we found that orchestrating synchronous traffic transfer between Redis instances was problematic.
00:14:39.440
To solve this issue, we decided to configure multiple queues within different data centers without needing to synchronize them.
00:14:57.520
Jobs could flow into separate Resque queues across data centers, with the older queue remaining active until it was fully processed.
00:15:15.439
Thus, we created a more manageable solution rather than a perfect one.
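The two-queue idea can be modeled without Resque at all: new jobs enqueue only in the new data center, while the old data center's workers keep draining their own queue until it is empty. A minimal sketch, with plain arrays standing in for the Redis-backed Resque queues:

```ruby
# A sketch of the two-queue cutover. Plain arrays stand in for the
# Redis-backed Resque queues; no synchronization between data centers
# is needed because each side only drains its own queue.
class QueueCutover
  attr_reader :old_queue, :new_queue, :processed

  def initialize(old_jobs)
    @old_queue = old_jobs.dup   # jobs enqueued before the switch
    @new_queue = []
    @processed = []
  end

  # After the switch, every new job lands in the new data center.
  def enqueue(job)
    @new_queue << job
  end

  # Old-DC workers stay up just long enough to finish what they started.
  def drain_old!
    @processed << @old_queue.shift until @old_queue.empty?
  end

  def drain_new!
    @processed << @new_queue.shift until @new_queue.empty?
  end
end
```

Once `drain_old!` finishes, the old data center's workers can be shut down with nothing lost.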
00:15:27.760
The third challenge was managing MySQL traffic. Rerouting this traffic while maintaining data integrity presented significant difficulties.
00:15:45.920
We could have gone with a master-master configuration, but the overhead and reliability issues outweighed the benefits.
00:16:06.080
Instead, we opted for asynchronous replication to switch traffic smoothly. This meant we needed a robust solution in case of a failover.
00:16:21.760
That solution came through investing in Tungsten, a product from Continuent. Tungsten is partially open source but offers many advanced features under a commercial licensing model.
00:16:36.720
Tungsten manages MySQL replication efficiently, allowing more straightforward failovers while ensuring no queries are lost during the transition.
00:16:54.960
It also understands SQL commands, allowing it to queue commands during a failover, ensuring consistent execution once the process completes.
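That queue-during-failover behavior, which Tungsten implements for real MySQL traffic, can be sketched as a tiny proxy: while a failover is in progress, statements are held; once the new master is up, they are replayed in order. This is only a model of the semantics, and `execute` is a hypothetical stand-in for sending SQL to the current master:

```ruby
# A sketch of queue-during-failover semantics: hold statements while
# the master is switching, then replay them in their original order
# against the new master. Not Tungsten's implementation, just a model.
class FailoverProxy
  def initialize(&execute)
    @execute = execute      # hypothetical: sends SQL to the current master
    @pending = []
    @failing_over = false
  end

  def begin_failover!
    @failing_over = true
  end

  # Flush held statements in order once the new master is live.
  def finish_failover!
    @failing_over = false
    @execute.call(@pending.shift) until @pending.empty?
  end

  # During failover, statements queue instead of erroring out.
  def run(sql)
    @failing_over ? @pending << sql : @execute.call(sql)
  end
end
```

From the application's point of view, queries issued mid-failover are simply slow rather than lost.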
00:17:11.360
Additionally, Tungsten is data center-aware, managing replication across different geographical locations efficiently.
00:17:30.080
This permits failover to local replicas while reducing downtime. The structure they created resolves many of the complex problems we faced during migration.
00:17:51.280
With an effective failover strategy, we can swap back and forth between master and slave servers without losing any queries.
00:18:09.280
We can take servers offline for updates without service interruption, and those schema changes happen significantly faster than expected.
00:18:30.560
Eventually, we reached the big day of migration. We had tested tirelessly and prepared carefully.
00:18:48.320
We lowered our DNS TTL and ran simulated traffic with rollbacks to ensure we caught every possible error.
00:19:05.120
Despite simulating countless scenarios, issues arose—the data would desynchronize, or we would receive error reports.
00:19:18.160
However, once we successfully switched between the environments without causing any errors, we felt confident in moving forward.
00:19:33.839
We devised a six-step plan to execute the migration smoothly. Each step was essentially a single command that acted as a rollback point.
00:19:51.920
We halted all non-essential processes to minimize potential errors during the transition.
00:20:10.080
Then came the switch itself. First, we moved the master database to the new data center, with all queries flowing across the VPN.
00:20:27.200
The initial response times were slower, but we planned to streamline traffic through HAProxy to reduce latency.
00:20:41.920
After carrying that out, we switched the DNS settings, directing most users to the new environment.
00:20:54.080
Following our timeline, once we confirmed no outstanding requests in the old environment, we would terminate the non-essential processes and clear the old queue.
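The shape of that plan -- each step a single command, each a rollback point -- can be sketched as a runner that executes steps in order and unwinds the completed ones if anything fails. The step names below are illustrative, not the exact commands we ran:

```ruby
# A sketch of a step runner where every step is a rollback point:
# run steps in order; on failure, unwind the ones that completed.
Step = Struct.new(:name, :run, :undo)

def migrate!(steps, log: [])
  done = []
  steps.each do |step|
    log << "run: #{step.name}"
    step.run.call
    done << step
  end
  :migrated
rescue StandardError => e
  log << "failed: #{e.message}"
  done.reverse_each do |step|
    log << "rollback: #{step.name}"
    step.undo.call
  end
  :rolled_back
end
```

Because each step knows how to undo itself, a failure at any point leaves a clear path back to the old data center instead of a half-migrated state.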
00:21:09.120
We invested considerable effort into laying a strong foundation for our infrastructure. Laboring for this migration has proven beneficial.
00:21:27.920
We are now confident in switching between data centers and can conduct controlled failovers regularly.
00:21:42.559
We recognize not every migration is seamless; the groundwork we've built now allows for future improvements.
00:21:59.040
Continued testing is necessary, and tools like ZooKeeper or Doozer will help us manage configuration seamlessly.
00:22:12.559
As we wrap up today, remember that thinking through, documenting, and testing configurations are key.
00:22:28.399
Thank you for your attention.