Scaling with Friends
Summarized using AI

by Geoffrey Dagley

In the presentation titled 'Scaling with Friends' by Geoffrey Dagley at the Big Ruby 2013 event, the speaker shares insights from the development and scaling journey of the 'With Friends' games at Zynga, which began with two brothers in a library. The key points discussed include:

  • Growth Journey: Dagley recounts how the 'With Friends' game family, including popular titles like 'Words with Friends' and 'Chess with Friends,' evolved from humble beginnings to their current status, serving millions of players daily.
  • Technical Challenges: The challenges of working with a single service supporting multiple games and the necessity of backward compatibility due to varying client versions are highlighted. Dagley notes the constraint of not disrupting older versions when updating the system.
  • Scalability Issues: He discusses scaling problems faced, including traffic surges following celebrity mentions and the need to implement limits on game starts, ensuring server stability.
  • Database Management: Key technical details regarding their database strategies are shared, including initial reliance on XML, transitioning to JSON, and the implementation of caching and database sharding to handle growing traffic.
  • Real-World Anecdotes: Dagley provides anecdotes such as managing incidents like the 'Movepocalypse,' ensuring the game's integrity during rapid growth, and humorous moments in their evolution, such as renaming data tables.
  • Implementation of Metrics and Testing: Emphasizing the importance of real-time analytics, Dagley talks about instrumenting and reporting on everything to guide their development decisions. He concludes with the necessity of robust testing practices to catch potential bugs early and maintain code integrity.

Overall, Dagley's talk highlights the importance of adaptability, careful planning, and continuous improvement in technology choices, essential for managing and scaling a successful gaming platform.

00:00:19.439 All right, so yeah, I'm Jeff Dagley. I work at Zynga, formerly known as Newtoy. How many of you have played a 'With Friends' game? I like that you keep me in a job! So, we've got the 'With Friends' family. The back end is a Rails service. On the front end, we have Chess with Friends and various other games.
00:00:39.600 So, how many play chess? There are a few of you. Words with Friends is our most popular game. Scramble with Friends has a few players as well. That’s actually the older icon of Hanging with Friends. Is anyone still playing Hanging with Friends? What about Matching with Friends? Yes, there are a few of you. And lastly, Gems with Friends is the one I currently work on.
00:01:05.360 All these games came out of McKinney, so if you're wondering where Words with Friends and Chess with Friends started, it was in the Downtown Public Library of McKinney. The two brothers, Paul and David Bettner, left their jobs when the iPhone came out and said, 'We're going to do something with this; it seems great!' So they went up to the library, which was quiet, and started building 'Chess with Friends'.
00:01:22.880 For those of you who don't know, 'Chess with Friends' was actually the first game. They released it, and while it was doing okay, they went on to build 'Words with Friends,' which became even more successful. I joined in October 2009, initially as a contractor for Newtoy.
00:01:39.840 Today marks my three-year anniversary with Zynga, so I'm excited about that! It's fun stuff—three years at a company these days is quite an accomplishment. What are we going to talk about today? We'll discuss some mistakes I made, and I've been listening to talks about solving big problems. I'm going to share how we encountered those problems.
00:02:02.960 We didn't start off trying to solve those problems; we began as two guys in a library building a game on iOS without knowing the server back end. We built as we went along. We'll discuss the mistakes we made and how we've corrected them over time, as well as where we're going forward.
00:02:43.040 To your right is the day the iPhone 4 launched; we were the number one app in the iTunes App Store! That was an exciting place to be and shows where we came from. Here are the With Friends games we've developed. This illustrates the ecosystem of what we're currently supporting.
00:03:20.000 We have one back-end system supporting all these different games, with various clients and versions out in the wild. For example, 'Chess with Friends' has both a free and paid version; so does 'Hanging with Friends.' 'Scramble' and 'Gems with Friends' also have free and paid versions. 'Words with Friends,' our big one, has iPhone, iPad, and Android versions, available on Google Play and the Kindle.
00:03:36.160 Talking about supporting multiple clients and platforms, we have a fun problem in that not everyone updates to the latest version right away. Many players are still using older game versions. All this means we must ensure we don’t break older versions when releasing updates.
00:04:00.640 In case you haven't heard, a few people like our games, including Alec Baldwin, who got kicked off a plane for playing 'Words with Friends.' Various celebrities have tweeted or mentioned us, which is interesting. Fred Durst, the lead singer of Limp Bizkit, tweeted his username inviting people to play with him.
00:04:29.280 This led to our servers going down as everyone flooded in to start games with him. We had to implement limits, as the server could not handle the overwhelming number of requests. Now, if you try to start a game with somebody who has already reached their 20-game limit, the system alerts you.
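The limit described above can be sketched in a few lines of Ruby. This is only an illustration: the constant, the error class, and the in-memory hash of active game counts are assumptions, not the service's actual code.

```ruby
# Hypothetical names throughout; the talk only states that a 20-game limit exists.
MAX_ACTIVE_GAMES = 20

class GameLimitExceeded < StandardError; end

# active_counts maps a user name to the number of games currently in progress.
def start_game!(active_counts, challenger, opponent)
  # Reject the request before touching any counts if either player is at the limit.
  [challenger, opponent].each do |user|
    if active_counts.fetch(user, 0) >= MAX_ACTIVE_GAMES
      raise GameLimitExceeded, "#{user} has reached the #{MAX_ACTIVE_GAMES}-game limit"
    end
  end

  # Both players are under the limit, so record the new game for each of them.
  [challenger, opponent].each do |user|
    active_counts[user] = active_counts.fetch(user, 0) + 1
  end
  active_counts
end
```

Checking both players before incrementing keeps the counts unchanged when a request is rejected, which is the cheap path a server wants during a traffic surge.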
00:05:11.759 We all know that Rails is often said to have scaling issues, and this is a joke among us. When I joined Paul and David, they shared their experience of building the back end with Rails, claiming it was 'sexy.' This kind of line does help with decision-making about technology, but it made me wonder if they thought I was sexy too.
00:06:04.639 What are the constraints we're dealing with? One major one is backwards compatibility. The current iOS version is 6.1.2, while Android is 4.2.2. The sheer number of Android versions and devices complicates matters. While we can force client upgrades when introducing major server changes, we want to avoid disrupting users whenever possible.
00:07:43.440 So we’ve kept parts of our code that check client versions. In one instance, we had to explicitly handle a negative ID that a client sent because of a data type conversion problem. Handling such cases on the server lets us react more quickly than waiting for a client update.
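A server-side workaround of this kind might look like the sketch below. It assumes the negative ID came from a signed 32-bit overflow on the client and that clients from some version onward are fixed; the cutoff version and method names are hypothetical.

```ruby
require "rubygems" # provides Gem::Version; loaded by default in normal Ruby runs

# Hypothetical cutoff: clients at or above this version are assumed to send correct IDs.
MIN_FIXED_VERSION = Gem::Version.new("4.2.0")

# Reinterpret a negative signed 32-bit value as the unsigned ID the client meant.
# Assumption: the bad value is a 32-bit wraparound, so adding 2**32 recovers it.
def normalize_id(raw_id)
  raw_id.negative? ? raw_id + 2**32 : raw_id
end

# Apply the workaround only to client versions older than the cutoff.
def resolve_game_id(raw_id, client_version)
  return raw_id if Gem::Version.new(client_version) >= MIN_FIXED_VERSION

  normalize_id(raw_id)
end
```

Comparing with `Gem::Version` rather than plain strings matters because "4.10.0" sorts before "4.9.0" as a string but after it as a version.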
00:08:59.760 When the clients were initially built, they used Active Record’s XML functionality rather than JSON. Many early clients are still requesting XML responses. We are transitioning new features to JSON, but we still have a substantial amount of XML traffic.
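Serving both formats from one code path can be sketched as follows. This is a plain-Ruby illustration, not the Active Record serialization the service used; the `render_move` method and the move fields are made up for the example.

```ruby
require "json"

# Minimal escaping for XML text nodes.
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

# Render the same move hash as JSON for new clients or XML for older ones.
def render_move(move, format)
  case format
  when :json
    JSON.generate({ "move" => move })
  when :xml
    fields = move.map do |name, value|
      "<#{name}>#{value.to_s.gsub(/[&<>]/, XML_ESCAPES)}</#{name}>"
    end.join
    "<move>#{fields}</move>"
  else
    raise ArgumentError, "unsupported format: #{format}"
  end
end
```

Keeping the data in one hash and choosing the serializer at the edge means a new field only has to be added once, even while both formats are in use.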
00:10:36.159 We support multiple game types within our one service, and we originally had a lot of conditionals designed to handle the specific logic for each game.
00:11:23.919 We have since refactored into smaller classes to better manage this logic. However, any change in shared logic affecting purchasing across all games must be carefully tested to avoid breaking other versions. Despite our best efforts, it's still easy to miss issues when updating one game that may inadvertently affect others.
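One common way to replace per-game conditionals with smaller classes is a lookup table of rule objects. The sketch below is an assumption about the general shape of such a refactoring; the class names and validation rules are invented for illustration.

```ruby
# Shared interface; each game supplies its own move validation.
class BaseRules
  def valid_move?(_move)
    raise NotImplementedError, "#{self.class} must implement valid_move?"
  end
end

class WordsRules < BaseRules
  # Illustrative rule: a word of 2 to 15 letters.
  def valid_move?(move)
    move[:word].to_s.match?(/\A[a-z]{2,15}\z/i)
  end
end

class ChessRules < BaseRules
  SQUARE = /\A[a-h][1-8]\z/

  # Illustrative rule: both squares must be on the board.
  def valid_move?(move)
    move[:from].to_s.match?(SQUARE) && move[:to].to_s.match?(SQUARE)
  end
end

# The only place that knows which class handles which game type.
RULES = { "words" => WordsRules, "chess" => ChessRules }.freeze

def rules_for(game_type)
  RULES.fetch(game_type) { raise ArgumentError, "unknown game type: #{game_type}" }.new
end
```

With this shape, a change to one game's rules touches one class, while shared logic such as purchasing still needs tests that run against every entry in the table.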
00:12:54.000 The rapid growth of our user base did not come with a dedicated Ruby conference or in-depth planning. The Bettner brothers focused on developing 'Chess with Friends,' and all our growth has been organic. This organic growth brought many challenges, evidenced by our patchwork codebase filled with fixes.
00:13:59.600 So, how many of you are familiar with YAGNI? It stands for 'You Aren't Gonna Need It.' As our database grew, we discussed whether to shard it, but ultimately decided it wasn’t necessary yet—again, focusing on the simplest solution possible.
00:14:51.440 We began our journey on Slicehost, similar to how you might start with Heroku today. Slicehost was easy to set up; we eventually switched to Rails Machine, which provided us with dedicated hardware. Later, we migrated to SoftLayer, which supplied us with bare metal machines. Finally, after being acquired by Zynga, we moved into their data center.
00:16:32.480 We now operate on over 400 app servers, running around 35 Unicorn workers on each server. We have over 40 worker machines processing background queues, 70-plus Memcached servers, and over 125 MySQL shards for more complex needs.
00:17:14.960 MySQL-related issues plagued us early on. One notable incident occurred when we received a surge of new players after John Mayer tweeted about 'Words with Friends.' This influx strained our system and made late-night gameplay painfully slow.
00:18:57.480 We faced challenges like the 'Movepocalypse,' where we ran out of IDs for moves because Rails defaults to integer ID columns. This led us to coordinate client updates to ensure they could support a larger ID data type.
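The arithmetic behind this kind of exhaustion is simple to sketch. The example assumes the ID column was a signed 32-bit MySQL INT, which tops out at 2,147,483,647; the current ID and insert rate in the usage note are made-up numbers.

```ruby
INT32_MAX = 2**31 - 1 # 2_147_483_647, the ceiling of a signed 32-bit INT column
INT64_MAX = 2**63 - 1 # the ceiling of a signed 64-bit BIGINT column

# Whole days left before an auto-increment column runs out at a steady insert rate.
def days_until_exhaustion(current_id, inserts_per_day, max_id = INT32_MAX)
  ((max_id - current_id) / inserts_per_day.to_f).floor
end
```

For example, with a hypothetical current ID of 2,000,000,000 and 10 million new rows per day, `days_until_exhaustion(2_000_000_000, 10_000_000)` gives 14 days of headroom. Widening the column to 64 bits removes the practical limit, but clients that parse the ID into a 32-bit type must be updated first.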
00:19:55.679 Concurrently, the 'Chatpocalypse' arose, leading to significant limitations. While we focused on managing the volume of moves, we guaranteed the integrity of chat functionality.
00:20:54.640 We worked through some interesting fixes, such as using an ID generator that accidentally collided with the MySQL auto-increment numbers, highlighting the need for careful planning.
00:21:43.679 We also experienced issues with MySQL's simultaneous connections limit. As we scaled our application, over 512 concurrent connections became routine, necessitating the implementation of a connection pooling strategy.
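The idea of a connection pool can be sketched with Ruby's thread-safe `Queue`. This is a generic illustration, not the pooling layer the team deployed; `SimplePool` and the factory block are assumptions made for the example.

```ruby
# A fixed-size pool: callers borrow a connection and always hand it back.
class SimplePool
  def initialize(size, &factory)
    @available = Queue.new
    size.times { @available << factory.call }
  end

  # Blocks while every connection is checked out, which caps the number of
  # connections the database ever sees at the pool size.
  def with_connection
    conn = @available.pop
    yield conn
  ensure
    @available << conn if conn
  end
end
```

A pool of this kind trades a short wait in the application for a hard upper bound on open database connections, so many application workers can share a smaller number of MySQL connections.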
00:23:29.520 To address the challenges of scaling, we’ve kept track of every move made in our games, continuously refining database architectures to better accommodate our growing user base. We adopted partitioning strategies in light of the overhead from our earlier systems.
00:24:15.879 Over the years, we added humor into our development culture. For example, during the 'Movepocalypse,' we transitioned from simple moves tables to the 'Dance Moves' table to manage all our movements efficiently while introducing sharding.
00:25:53.679 The evolution of our game tables shows how we creatively adjusted our database design to keep pace with the growth in users. We explored sharding as a solution to our expanding data requirements, splitting massive tables into manageable sizes.
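Splitting a large table across shards requires a rule that maps each row to a shard. A minimal sketch, assuming modulo routing on the game ID; the shard count and naming scheme are hypothetical, and the talk does not say which routing rule was used.

```ruby
SHARD_COUNT = 4 # hypothetical; the talk mentions over 125 MySQL shards in production

# Map a game ID to a shard index with simple modulo routing.
def shard_for(game_id, shard_count = SHARD_COUNT)
  game_id % shard_count
end

# Hypothetical naming scheme for the database that holds a given game.
def shard_database_name(game_id, shard_count = SHARD_COUNT)
  "games_shard_#{shard_for(game_id, shard_count)}"
end
```

Modulo routing is easy to reason about, but changing the shard count later moves most rows, so systems that expect to grow often route through a lookup table or a fixed number of logical shards instead.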
00:26:40.160 The design of large tables made it challenging to add new columns or data types. We began utilizing flexible JSON columns to store dynamic information without needing to alter table schemas.
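The flexible-column idea can be sketched without a database: one text field holds a JSON document, so new attributes need no schema change. `GameRecord` and its `data_json` field are illustrative stand-ins for a model and its column.

```ruby
require "json"

# data_json stands in for a TEXT column that stores a serialized JSON object.
class GameRecord
  attr_reader :id, :data_json

  def initialize(id, data_json = "{}")
    @id = id
    @data_json = data_json
  end

  # Decode the stored document into a Hash with string keys.
  def data
    JSON.parse(@data_json)
  end

  # Add or overwrite one attribute and re-serialize the whole document.
  def set(key, value)
    @data_json = JSON.generate(data.merge(key.to_s => value))
    self
  end
end
```

The trade-off is that values inside the document cannot be indexed or queried like ordinary columns, so this suits data that is read and written together with its row.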
00:27:30.679 As we shifted to a sharded approach, we tackled issues related to how different databases generated auto-incrementing IDs. We turned to Redis for effective ID generation, allowing for consistent tracking across multiple systems.
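A central counter gives every shard IDs from one sequence. The sketch below uses an in-memory stand-in with the same `incr` call that the redis-rb client exposes, so it runs without a Redis server; `FakeRedis`, `IdGenerator`, and the key format are assumptions for illustration.

```ruby
# In-memory stand-in for a Redis client. Like Redis INCR, incr atomically
# adds one to the counter at key and returns the new value.
class FakeRedis
  def initialize
    @counters = Hash.new(0)
    @lock = Mutex.new
  end

  def incr(key)
    @lock.synchronize { @counters[key] += 1 }
  end
end

# Hands out IDs that are unique per table regardless of which shard stores the row.
class IdGenerator
  def initialize(store, namespace = "ids")
    @store = store
    @namespace = namespace
  end

  def next_id(table)
    @store.incr("#{@namespace}:#{table}")
  end
end
```

Because the increment is atomic, two application servers asking at the same moment still receive different IDs. If such a generator is introduced alongside existing auto-increment data, its counter has to start above the highest ID already in use.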
00:28:50.800 We introduced caching mechanisms to optimize our database interactions, focusing on areas that would yield significant performance improvements. Over time, we scaled our caching solutions to accommodate surging user activity.
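A read-through cache is one common form of this optimization: look in the cache first, and only run the expensive query on a miss. The sketch below uses an in-process hash in place of a Memcached cluster; the class name, TTL handling, and injectable clock are assumptions for illustration.

```ruby
# Read-through cache with a time-to-live, backed by a plain Hash.
class ReadThroughCache
  def initialize(ttl_seconds, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl_seconds
    @clock = clock
    @entries = {}
  end

  # Return the cached value if it is still fresh; otherwise run the block
  # (for example a database query), store its result, and return it.
  def fetch(key)
    entry = @entries[key]
    return entry[:value] if entry && entry[:expires_at] > @clock.call

    value = yield
    @entries[key] = { value: value, expires_at: @clock.call + @ttl }
    value
  end
end
```

The block-based `fetch` keeps the query next to the cache lookup, so a caller cannot forget to populate the cache after a miss.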
00:30:25.920 Our efforts to maintain reliability extended to careful monitoring practices. Implementing tools like New Relic made tracking performance and troubleshooting problems more efficient.
00:31:54.080 We used real-time analytics to inform our decisions during major implementations. The practice of rolling out changes in increments allowed us to manage risk effectively and adjust whenever necessary.
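Rolling a change out in increments is often done by assigning each user to a stable bucket and raising a percentage over time. A minimal sketch, assuming CRC32 bucketing on a feature name plus user ID; the method name and bucketing scheme are illustrative, not the team's actual mechanism.

```ruby
require "zlib"

# Place each user in a stable bucket from 0 to 99 and enable the feature
# for buckets below the current rollout percentage.
def in_rollout?(feature, user_id, percent)
  bucket = Zlib.crc32("#{feature}:#{user_id}") % 100
  bucket < percent
end
```

Because the bucket depends only on the feature name and user ID, a user who sees the change at 10 percent keeps seeing it at 50 percent, and setting the percentage back to 0 turns it off for everyone.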
00:32:43.440 In my experience, it’s imperative to focus on metrics that reveal the health of the application, especially under stress. By tracking these consistently, we can refine our approach and meet growing user expectations.
00:33:39.040 Investing in a robust testing culture helped us avoid shipping without proper quality checks. Thus, we adopted a rule that every pull request needs accompanying tests to maintain code integrity.
00:35:20.000 Deploying code under scrutiny allowed us to stabilize our application over time. While it's inevitable that bugs will arise, a diligent testing strategy positions us to catch potential setbacks sooner.
00:36:05.000 Staying on top of code deployment and observability enabled us to forge a more robust system. For example, being cautious about how we managed cache versions significantly minimized issues related to inconsistent data.
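One way to be cautious with cache versions is to put a version number in every cache key, so code that expects a new data format never reads entries written in the old one. The sketch below is an assumption about the general technique; `VersionedCache` and the key layout are hypothetical.

```ruby
# Prefix every key with a format version so old and new code never share entries.
class VersionedCache
  def initialize(version, store = {})
    @version = version
    @store = store
  end

  def key_for(name)
    "v#{@version}:#{name}"
  end

  def write(name, value)
    @store[key_for(name)] = value
  end

  def read(name)
    @store[key_for(name)]
  end
end
```

During a deploy where old and new code run side by side, each version reads and writes only its own keys, and the old entries simply age out.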
00:37:14.720 We integrated strategies to handle known edge cases, which further strengthened our codebase. This includes remaining aware of how changes could affect fundamental functionality.
00:38:42.920 Overall, our experiences managing the games highlighted the importance of adaptability within a development environment. Continuous efforts to refine processes and technology choices have been paramount to our growth.
00:39:50.640 And of course, I can't forget to mention a few ‘Words with Friends’ tips. Knowing some two- and three-letter words can significantly increase your game performance. I'm more than happy to share those insights if you’re interested.
00:41:15.840 You can find me online as gdagley on most platforms. Thank you all for your attention!