RailsConf 2013

Configuration Management Patterns

by Beau Harrington

In this engaging presentation titled "Configuration Management Patterns," Beau Harrington discusses the complexity of managing configurations in modern Rails applications as they scale. The session begins with an acknowledgment of the common perception of configuration management, often associated with tools like Chef and Puppet, and sets the stage for a broader discussion on patterns for effective management of configurations within growing systems.

Key points covered in the talk include:

  • Defining Configuration: Harrington provides a broad definition of configuration, encompassing all values that might need sharing or changing across applications, including settings like hostnames, user credentials, feature flags, and translations.

  • Case Study - Kingdoms of Camelot: He illustrates the concept using his experience working on the mobile game "Kingdoms of Camelot," highlighting the need to manage over 25,000 configuration values, which cannot be efficiently handled through static YAML files alone.

  • Configuring for Context: The talk emphasizes the need for configuration to adapt based on context, including different environments, server roles, and regions. Harrington introduces composite configurations, which allow for hierarchical settings tailored to specific contexts.

  • Decoupling Configurations from Source Code: Harrington suggests moving configurations out of source control and incorporating them into a build process, effectively treating configuration files like build artifacts which can be versioned and accessed independently.

  • Dynamic Configuration Updates: A significant part of the presentation focuses on techniques for dynamically updating configuration without requiring deployments or service restarts. He discusses using background threads for polling configuration changes.

  • Empowering Non-Engineers: The approach not only streamlines configuration management for engineers but also empowers non-developers, such as game designers, to make configuration changes directly, thereby reducing dependency on engineering resources.

  • Testing and Change Control: The importance of rigorous testing — both technical QA and play-testing QA — is emphasized before deploying configuration changes, as even minor alterations can significantly affect user experience in gaming contexts.

  • Open Source References: The presentation ends with references to useful tools and practices in the industry, including Netflix's architectural patterns and tools for configuration management, indicating that the principles discussed are part of a larger movement towards more robust and flexible configuration management strategies.

The main conclusion of the talk is that proper configuration management is crucial for maintaining control and flexibility as applications grow and evolve. By employing the discussed patterns, organizations can streamline their processes, avoid configuration bottlenecks, and enhance overall efficiency.

00:00:12.259 Thank you! Hello, what's up? Let's see, it's Wednesday afternoon, day three of RailsConf. Everyone's a little sleepy, so I'm going to wake you all up with the most insane, mind-blowing topic I could ever conceive.
00:00:20.640 I mean, when I get neck-deep in XML and my favorite Turing-complete YAML files, I just go crazy. So this talk is 40 minutes just for you. I'm Beau Harrington. Here is my amazing GitHub profile, which I think has one repo on it, and then my Twitter, where I make jokes about things that have nothing to do with software.
00:00:34.260 First things first, I noticed about three minutes after I submitted my talk the magical two words: configuration management. To a lot of people, especially our DevOps crowd, that means Chef and Puppet, but this is not what this is about, so here's your chance to escape. Next, I'm going to steal a phrase I heard from Chris Kelly. This is intended to be just a quick conversation about ideas; it's not a strict prescription of what you have to do. I'm just going over some things that have worked for us, some things we haven't actually put into production on all of our games yet but have still shown promise, or things that have been cool ideas for personal projects.
00:01:06.180 So, I'm the Chief Software Architect at a company that makes mobile and web games, including the number one top-grossing iOS app of 2012, Kingdoms of Camelot: Battle for the North. Of course, we are hiring, and that's all about that. So what is configuration? You can probably start off with some of the contents of the config directory in your Rails app. We've got database.yml here; this looks like configuration, right? We've got hostnames, usernames, passwords, and timeouts (those are my favorite). Then, we've got some kind of newer hotness: feature flags! We want to be able to turn features on and off, with just a simple deploy or otherwise, so we can flip the search box or append the request time to the request very simply and easily. It looks like configuration, right?
00:02:02.640 How about translations? They live in the config too; I don't really know why, but they're there. It's so frequent that the people who will be doing internationalization and translation for us on bigger teams and apps are all software engineers. So yeah, we've got to translate some strings—great, I guess that's configuration. I hate this so much. For some reason, we're putting view logic into our models, but you know what? I'll take it; that's configuration, that'll work. How about if you're playing a free-to-play game or a game for mobile web? We always have quests, right? I want a reward when I finish one of those quests. Let's say 50 coins. Now that's more configuration.
00:03:01.320 Okay, now I'm in my production.rb file, and we've got some assignments going here. We've got a second-level access of variables; we might even see a block or two. Configuration, essentially, and very annoyingly, is everything you want it to be. For our purposes today, we'll say that configuration consists of values you may want to share or to change. I'm intentionally keeping the definition super broad because you've heard the phrase 'code is configuration' or 'configuration is code,' and the line is extremely blurry. So if it doesn't have a precise definition, that's okay. Why do we care about configuration? Or rather, why do I care about configuration so much?
00:04:14.220 I lied; I'm going to go back to my corporate life for a second. Kingdoms of Camelot: Battle for the North is a real-time strategy game with a shared world. You can have up to, I think, 70,000 players in a single world, and they're all fighting over the same resources: they join alliances, they fight each other, they kill each other. It's quite exciting. Part of that involves having a kingdom, and each of your kingdoms has lots of buildings. There are 15 different kinds of buildings, and each of those buildings can be leveled up gradually over time. We have 10+ levels for each building, and each of those levels has eight different requirements, including different kinds of resources, a certain amount of time that needs to pass, and prerequisites for buildings that have already been built.
00:05:04.740 So just for buildings, that's 1200+ values that we have to deal with. That's 1200 integers and a couple strings—mostly integers. And that's just the buildings! Look at all this other stuff too—research, knights, battle tuning, NPCs, items, quests—which gives us a grand total of over 25,000 values that we have to deal with for one game. I'm not particularly looking forward to copying and pasting stuff into a YAML file for somebody from an email, or worse yet, from a Word attachment inside an email. So let’s get serious. When we're going to roll out new features or even configuration changes for our games, we have to do two levels of QA beyond just our normal automated testing.
00:05:53.580 We have technical QA, but we also have playtesting QA. Everything can be working perfectly from a technical standpoint, but from a playtesting standpoint, if we release something that nerfs a valuable item or makes something cheap very overpowered, that's just as bad, if not worse than failing to deploy on a technical basis. Because our games are free-to-play, users can pay zero dollars and can potentially pay more than zero dollars for the items in their game. So if we mess up the configuration values for those, we are basically nuking people's investments, and we don’t want to do that. So we want to have a process around deploying new configurations that's just as rigorous as when we’re deploying code.
00:07:02.759 Now, let’s talk about context. We're going to discuss our environment and data centers, which are relevant for configuring our applications. We also need to think about how we generate those configurations in the first place and how we're going to push them out onto our servers. Finally, what are all these going to do for us? What will this enable us to do, and what are some other software packages we can explore?
00:07:37.380 Let's talk context. Let's actually create our own game right now! It's a really cool spec, and I think it's going to make millions. You're going to click this puppy and get 10 gold coins, and sometimes you will get a prize as well: either a yo-yo or a piece of candy. So let's take our first stab at this. We've got the puppy click class, yes! So we have three beautiful constants there: coins per click, odds of winning a prize, and an array of prizes. I feel like we're doing pretty well so far; we've avoided any kind of magic number, so it's not terrible, right? But we should at least get the variables out of source! If I was happy with this, I wouldn't be yammering on about it for 40 minutes, so let's extract them into a simple config file. I'll call it game.yaml and put it in config.
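The slide itself is not part of this transcript, so the following is only a sketch of what such a first stab might look like. The class name, constant names, and the injectable random source are assumptions made for illustration.

```ruby
# Hypothetical first stab: the tunable values are constants inside the class.
class PuppyClick
  COINS_PER_CLICK = 10                     # coins awarded on every click
  ODDS_OF_PRIZE   = 0.1                    # 10 percent chance of a prize
  PRIZES          = %w[yo-yo candy].freeze # possible prizes

  # The random source is injectable so outcomes can be reproduced.
  def initialize(rng: Random.new)
    @rng = rng
  end

  # Returns the coins earned and the prize won (nil when there is no prize).
  def click
    prize = @rng.rand < ODDS_OF_PRIZE ? PRIZES.sample(random: @rng) : nil
    { coins: COINS_PER_CLICK, prize: prize }
  end
end
```

Changing any of these values means editing source and deploying code, which is the pain point the rest of the talk works to remove.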
00:09:30.840 Let me go ahead and tell you right now that all the examples in here are in YAML for the sole reason that YAML takes up less space on this screen here—no particular affinity to it. So we've got our little game hash here, and we've stored everything as part of that YAML data structure. Awesome, now everything's out of source and it's in a single place in config. That’s really good because it's natural for us to start looking at the config directory first when we’re looking for configuration. Now, what about when we actually need to care about Rails? This seems pretty basic; it’s a pattern where you've got things split by development, test, and production. It’s a bit gross though because we are repeating ourselves.
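As a rough illustration of the environment-split file described here, the sketch below embeds a hypothetical game.yaml as a string and picks one section from it. The key names and the `game_config` helper are assumptions, not the speaker's actual file.

```ruby
require "yaml"

# Hypothetical contents of config/game.yaml, split by Rails environment.
# Note how every shared value has to be repeated in each section.
GAME_YAML = <<~YAML
  development:
    coins_per_click: 10
    odds_of_prize: 1.0
    prizes: [yo-yo, candy]
  test:
    coins_per_click: 10
    odds_of_prize: 0.1
    prizes: [yo-yo, candy]
  production:
    coins_per_click: 10
    odds_of_prize: 0.1
    prizes: [yo-yo, candy]
YAML

# Picks the section for one environment; raises KeyError for an unknown one.
def game_config(env, yaml_text = GAME_YAML)
  YAML.safe_load(yaml_text).fetch(env)
end
```

Calling `game_config("production")` returns a plain hash with string keys, so the class would read `config["coins_per_click"]` instead of a constant.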
00:10:35.880 When I'm in development, I want to make sure that I win a prize every time because I'm super impatient—my time is super valuable! So I've set the odds of winning a prize from 0.1 or 10 percent to 100 percent. And then all the things that are the same I've had to replicate for development, test, and then production as well. If the screen were bigger... Let’s move on a bit to composite configurations. I believe these have seven or eight different names: composite configurations, cascading configurations—it's the same kind of concept in CSS where we’ll have a core set of configurations and then we will apply transformations on top of it based on a weighted set of factors.
00:12:00.900 The first one, and the only one kids usually use, is the Rails environment. So we've got all the core values, and for anything we're overriding, we're just going to state those values in the sections for those particular environments. We've got odds of winning a prize set to one for development and nothing else for test. We want to make sure that we're testing the screen properly, so we change the image from a piece of candy to a screen test. There, we're DRY; we're not repeating ourselves, and that's great!
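A minimal sketch of that idea, assuming a `core` section plus one section per environment: only the overridden keys are stated, and a small recursive merge applies them on top of the core values. The section names, keys, and the hand-written `deep_merge` helper are assumptions (in a Rails app, ActiveSupport's `Hash#deep_merge` would do the same job).

```ruby
require "yaml"

# Recursively merges overrides into base; on conflicts the override wins.
def deep_merge(base, overrides)
  base.merge(overrides) do |_key, old_value, new_value|
    if old_value.is_a?(Hash) && new_value.is_a?(Hash)
      deep_merge(old_value, new_value)
    else
      new_value
    end
  end
end

# Hypothetical composite file: shared values once, overrides per environment.
COMPOSITE_YAML = <<~YAML
  core:
    game:
      coins_per_click: 10
      odds_of_prize: 0.1
      prizes: [yo-yo, candy]
  development:
    game:
      odds_of_prize: 1.0
  test: {}
YAML

# Core values first, then the environment's overrides (if any) on top.
def config_for(env, yaml_text = COMPOSITE_YAML)
  doc = YAML.safe_load(yaml_text)
  deep_merge(doc.fetch("core"), doc[env] || {})
end
```

An environment with no section at all, such as production here, simply gets the core values.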
00:12:56.100 So there's an example of the wonderful goodness that you can have there. But why stop at Rails? You can do a lot more with being able to infer information about the context of your deployment and make bigger changes from there. I know a lot of people are on AWS, and if you’re doing a multi-data center deploy or multi-region deploy, you’ll need to change the server names you're using for each of those data centers. If you have certain machines that are more powerful than others, you might want to change the settings if you're in a heterogeneous environment. How do we do that?
00:14:00.600 Well, we can just pull different YAML files. We’ve got the core, the environment, production, or whatever, and then we have the region. We'll say 'U.S. East,' the hostname of the machine will be taken as well. So we go through and load the contents of each of those YAML files and do a deep merge into what starts as an empty hash. The further down the file list you go, the more likely they are to win—the ones with the highest priority will have the final say. Anything that gets set in a host-specific file will always win!
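The load-and-merge loop described here might look roughly like the sketch below. The file naming scheme (`core.yml`, `<environment>.yml`, `<region>.yml`, `<hostname>.yml`) and the helper names are assumptions; the transcript only says that later files take priority.

```ruby
require "yaml"
require "socket"

# Recursively merges overrides into base; on conflicts the override wins.
def deep_merge(base, overrides)
  base.merge(overrides) do |_key, old_value, new_value|
    if old_value.is_a?(Hash) && new_value.is_a?(Hash)
      deep_merge(old_value, new_value)
    else
      new_value
    end
  end
end

# Lowest priority first, highest priority last.
def layer_names(env:, region:, hostname: Socket.gethostname)
  ["core", env, region, hostname]
end

# Starts from an empty hash and merges each existing file in order,
# so a host-specific file always has the final say.
def load_composite(dir, names)
  names.reduce({}) do |merged, name|
    path = File.join(dir, "#{name}.yml")
    next merged unless File.exist?(path) # missing layers are skipped

    deep_merge(merged, YAML.safe_load(File.read(path)) || {})
  end
end
```

For example, `load_composite("config/layers", layer_names(env: "production", region: "us-east-1"))` would apply a region file and a per-host file on top of the core and environment values, where the directory and region name are placeholders.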
00:15:11.880 We can build on this concept; we can have tags to pass through an environment variable and load up a different configuration. This is useful for transient quick loads of different sets of values if you’re having a crisis or want to pull up specific debugging values. If you are testing some new code, you can also assign role-specific configurations this way. And again, the code is pretty simple: just pull in the list of tags, split them by comma, and then load in each of the relevant YAML files—that's it!
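A small sketch of the tag mechanism, under stated assumptions: the environment variable name `CONFIG_TAGS` is hypothetical, and in-memory hashes stand in for the per-tag YAML files the talk describes.

```ruby
# Recursively merges overrides into base; on conflicts the override wins.
def deep_merge(base, overrides)
  base.merge(overrides) do |_key, old_value, new_value|
    if old_value.is_a?(Hash) && new_value.is_a?(Hash)
      deep_merge(old_value, new_value)
    else
      new_value
    end
  end
end

# Reads a comma-separated tag list, e.g. CONFIG_TAGS="debug,canary".
# Accepts any object that responds to fetch (ENV or a plain Hash).
def config_tags(env = ENV)
  env.fetch("CONFIG_TAGS", "").split(",").map(&:strip).reject(&:empty?)
end

# Applies each tag's overrides in order; unknown tags are ignored.
def apply_tags(base, overrides_by_tag, tags)
  tags.reduce(base) do |merged, tag|
    deep_merge(merged, overrides_by_tag.fetch(tag, {}))
  end
end
```

Because tags are applied after the other layers, a temporary `debug` tag can raise log verbosity on one process without touching the shared files.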
00:16:14.180 There's already a gem that encapsulates a bunch of this stuff: rails_config. I'm sure there are many others. It seems like the killer feature for many of them is having dot notation instead of hash-based access. The rails_config gem also does hot reloads of settings files for you, so if you call Settings.reload!, it will suck in all the new values; it'll reload all the files for you. You can do some interesting things with pushing new configurations without doing a code deploy, and you can also do the composite configurations we talked about. It also has support for setting developer-local settings. It's more of a configuration or a convention than anything else, but with a git-ignored file, it has settings that are specific to you, and you don't have to worry about committing those back up and messing things up.
00:17:24.600 All this is great; we now have a little more sanity, clarity, and predictability as to what configuration settings are going to be applied when. And yet, I can't give this to a non-engineer. I mean, I could, but I don’t want to! Our settings are aware of their context and environment—that's great—but everything is still in source, still tied to a commit, still tied to deploy, and it’s inaccessible to any other apps. If you guys have been paying attention to all the other talks this week, I’m sure by now you all have completely reconfigured and refactored your applications so they’re beautiful, sparkling service-oriented architectures.
00:18:20.760 So if we have configuration stuck inside a single app, it's going to be extremely hard to extract that, and you're going to be doing things like putting configuration sets into gems and passing them around. I don't really think that's the best pattern, and yes, of course, it's still inaccessible to non-engineers. So let's take a look at how we're actually going to bake or build these configurations. We want to decouple the process of generating configuration entirely from the application and the repository itself.
00:19:21.660 Do you guys remember how we used to do software before we had scripting languages all over the place? We would do builds, and the build process would generate an artifact, and we would take that artifact and put it somewhere—that process maps really well to handling complicated configuration changes. We can take whatever data source we have; it can be something entirely different than YAML, which may be more appropriate from an editing point of view. Then we can have a process to turn it into a YAML or JSON file, version it, store it in a central repository, and then have some mechanisms to deploy it in a way that's separated from the overall application itself. This sounds like a good job for a web app to me.
00:20:52.560 In terms of generating, holding, and making individual configuration files accessible to users, we can also hold a whole bunch of logic in that web app that we don’t want to shove into our actual applications: things like access controls, individual auditing, making sure we keep a list of every single change, and any kind of balance or sanity checking.
00:21:22.740 For those of you in larger corporate environments, you can put capabilities in there as necessary! So it's a simple principle, really. You have your dimensions: core configuration, environment, AWS region tag, anything else we've specified, and then the value for that, whether it's core, dev, test, production, or whatever. And yes, it would map nicely to a document store. So why don’t we just pull it directly from the database? Well, the database is precious compared to everything else you have. If you have decent architecture, you can spin up more of anything else except database power.
00:22:56.520 Yes, you can say you'll have horizontally scaling databases that can go multi-data center and many other capabilities, but I wouldn’t trust it compared to being able to do simpler operations by pulling things from files. Plus, if you're managing hundreds or thousands of apps, I’d rather do thousands of requests against a static file going through NGINX or S3 rather than trying to pull all of that information from your precious database. But if you don’t have tons of load, it’ll work fine. So our build process is going to take our configuration values however they’re stored and transform them into a single YAML doc or whatever— that's our artifact!
00:24:32.160 We'll note the builder, set a monotonically increasing build number or timestamp, and enforce a changelog, which is actually handy. We can include a checksum, especially for people who are quasi-technical; they'll love making changes to files and not annotating them, and it's a pain when those files are not under version control. So, with a checksum, that won't happen. For example, let's get double coins; let's set 20 coins every time we click on the puppy. We're just going to change our value in the configuration table, and then we'll run our build process.
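One way such a build step could be sketched: serialize the values to YAML, then wrap them with the metadata listed above. The artifact layout (`meta` and `body` keys), the field names, and the choice of SHA-256 are assumptions for illustration, not the speaker's actual format.

```ruby
require "yaml"
require "digest"
require "time"

# Builds a configuration artifact: the YAML body plus build metadata.
def build_artifact(values:, builder:, build_number:, changelog:,
                   tag: nil, built_at: Time.now.utc)
  # Enforce the changelog: refuse to build without a description.
  raise ArgumentError, "changelog is required" if changelog.to_s.strip.empty?

  body = YAML.dump(values)
  {
    "meta" => {
      "build_number" => build_number, # monotonically increasing
      "builder"      => builder,
      "tag"          => tag,
      "built_at"     => built_at.iso8601,
      "changelog"    => changelog,
      "sha256"       => Digest::SHA256.hexdigest(body)
    },
    "body" => body
  }
end

# True only if the body still matches the checksum recorded at build time,
# which catches hand edits made after the build.
def verify_artifact(artifact)
  Digest::SHA256.hexdigest(artifact["body"]) == artifact["meta"]["sha256"]
end
```

A consumer would call `verify_artifact` before loading the body, so an unannotated manual change to the file is rejected instead of silently applied.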
00:25:48.720 There’s our YAML doc! Awesome, here’s information about our actual artifact. We put in the build number and a tag name, and then a checksum. Plus, we’ve got it on an asset server. If you want to pull this particular version of the configuration, it's there for taking, so you can experiment with it not just in production but in any other environment, as well. So how do we get that artifact into a particular environment? We can do a promotion process where we map a single build artifact, a single configuration artifact, to an environment or region or something else.
00:26:49.860 In the end, we will have directories based on environments, data centers, regions, and tags, with YAML files in them, all on a remote disk or web server somewhere. If you build your tools correctly, any user can run this promotion process; it doesn’t have to be just an engineer. The settings are aware of context and environment—the settings are now outside the repo—they're accessible to other applications. Any app you want can access those URLs, and changes are accessible to non-engineers, but we can't deploy them yet.
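The promotion step could be sketched as a copy from a build directory into a per-dimension directory. The directory layout (`builds/<number>.yml` and `<dimension>/<value>/config.yml`) and the `promote` helper are assumptions; the transcript only says that directories exist per environment, data center, region, and tag.

```ruby
require "fileutils"

# Maps one build artifact to one dimension value, e.g.
# promote(root, 42, "environments", "production").
def promote(artifact_root, build_number, dimension, value)
  source = File.join(artifact_root, "builds", "#{build_number}.yml")
  raise ArgumentError, "unknown build #{build_number}" unless File.exist?(source)

  target_dir = File.join(artifact_root, dimension, value)
  FileUtils.mkdir_p(target_dir)
  target = File.join(target_dir, "config.yml")

  # Copy to a temporary name, then rename, so that a reader never
  # sees a half-written file.
  temporary = "#{target}.tmp"
  FileUtils.cp(source, temporary)
  File.rename(temporary, target)
  target
end
```

Because promotion is just a mapping from a build number to a location, a tool wrapping this call can be handed to a designer or playtester, and rolling back is a promotion of an older build.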
00:27:41.520 So how do we do the final step, which is going to be the deployment process for all this? We have got a title slide for it. A release process will be running standard tests and continuous integration against the configuration. We’ll run both QA and playtest QA against a running instance of the application that has that config. Once everything checks out, we're going to deploy to production. We loop through our Rails environments, AWS region, and hostname, gathering that file from our configuration artifact server. For each of those files, we do that same deep merge so that we have that composite set of configuration.
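The fetch-and-merge loop against the artifact server might look like the sketch below. The URL layout and all names are hypothetical, and the fetcher is injectable so the logic can be exercised without a network.

```ruby
require "net/http"
require "uri"
require "yaml"

# Recursively merges overrides into base; on conflicts the override wins.
def deep_merge(base, overrides)
  base.merge(overrides) do |_key, old_value, new_value|
    if old_value.is_a?(Hash) && new_value.is_a?(Hash)
      deep_merge(old_value, new_value)
    else
      new_value
    end
  end
end

# Default fetcher: returns the response body, or nil for non-2xx responses.
HTTP_FETCH = lambda do |url|
  response = Net::HTTP.get_response(URI(url))
  response.is_a?(Net::HTTPSuccess) ? response.body : nil
end

# layers is an ordered list of [dimension, value] pairs, lowest priority
# first, e.g. [["environments", "production"], ["hosts", "web-01"]].
def composite_from_server(base_url, layers, fetch: HTTP_FETCH)
  layers.reduce({}) do |merged, (dimension, value)|
    body = fetch.call("#{base_url}/#{dimension}/#{value}/config.yml")
    body ? deep_merge(merged, YAML.safe_load(body) || {}) : merged
  end
end
```

A layer that has nothing promoted to it returns nil from the fetcher and is skipped, so only the layers that exist contribute to the composite.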
00:29:11.280 But you'll still have to restart. I really don't like to do restarts! Check out my lightning talk, which will show that I really don't like to restart. What we can do, if we're running a modern interpreter, is spawn a thread in the background to poll for updates to that configuration file or set of files. A really good mechanism for this is Celluloid, which is a library that basically brings actor patterns into Ruby and works great on both the JVM and Rubinius.
00:29:27.720 You can spawn a Celluloid worker that runs on a timer and fetches your configuration values. Every 30 minutes, it will recreate your configuration. If you write it in a thread-safe way, you will be able to have any new artifacts promoted to production automatically brought into your app. This doesn’t work well for applications running on MRI or that have a Global Interpreter Lock; it’s also a bit gross in multi-process environments. That means Unicorn—once they fork, they fork!
00:30:07.380 So, if you’re going to do this, you should be on a modern interpreter and try to be thread-safe because doing in-place updates can get tricky. Here's an example: let’s say you spin up a worker and then fetch all the configurations just like we’ve seen before. You'd have a timer running every 30 seconds or minutes to refresh those values and reset the timer so it keeps occurring. You can stick your configuration access behind a method to ensure that any accesses are thread-safe and that you're getting the latest value.
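The talk uses a Celluloid worker for this; the sketch below deliberately swaps in a plain `Thread` and `Mutex` from the Ruby standard library so it runs without extra gems. The `PolledConfig` class and its method names are assumptions.

```ruby
# Holds configuration values and refreshes them on a timer.
class PolledConfig
  # The block is the loader: it returns a fresh hash of configuration.
  def initialize(interval:, &loader)
    @loader   = loader
    @interval = interval
    @lock     = Mutex.new
    @values   = loader.call # initial load happens synchronously
  end

  # All reads go through this method so they are thread-safe and
  # always see the most recently loaded values.
  def [](key)
    @lock.synchronize { @values[key] }
  end

  # Loads outside the lock, then swaps the whole hash in one step.
  def refresh!
    fresh = @loader.call
    @lock.synchronize { @values = fresh }
  end

  # Background thread: sleep, refresh, repeat.
  def start
    @thread = Thread.new do
      loop do
        sleep @interval
        begin
          refresh!
        rescue StandardError
          # A failed poll keeps the previous values.
        end
      end
    end
    self
  end

  def stop
    @thread&.kill
  end
end
```

Usage might look like `config = PolledConfig.new(interval: 30) { load_from_somewhere }.start`, where `load_from_somewhere` is a placeholder for whatever fetches and merges the files. Replacing the whole hash, instead of mutating it in place, keeps readers from ever seeing a half-updated configuration.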
00:31:03.420 So, when we deploy our change to increase the number of coins per click from 10 to 20, once that configuration file is promoted, all we have to do is wait. That timer will go off and reload all configurations; the new value will be there next time we access it. This approach is good for simple things, but if we want to change more advanced configurations, we can use on-change hooks. When a particular configuration value changes, we can take actions based on that change, such as logging warnings, breaking caches, etc.
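A hook mechanism of this kind can be sketched as a comparison of old and new values on every reload. The `HookedConfig` class, its method names, and the example keys are assumptions, and this version is not thread-safe on its own.

```ruby
# Configuration holder that runs callbacks when a watched key changes.
class HookedConfig
  def initialize(values = {})
    @values = values
    @hooks  = Hash.new { |hash, key| hash[key] = [] }
  end

  def [](key)
    @values[key]
  end

  # Registers a block to run when the value for key changes.
  # The block receives the old value and the new value.
  def on_change(key, &block)
    @hooks[key] << block
    self
  end

  # Replaces the values, then fires hooks only for keys that differ.
  def update(fresh)
    old = @values
    @values = fresh
    (old.keys | fresh.keys).each do |key|
      next if old[key] == fresh[key]

      @hooks.fetch(key, []).each { |hook| hook.call(old[key], fresh[key]) }
    end
    self
  end
end
```

A hook on a hypothetical `"memcache_servers"` key, for instance, could rebuild the cache client with the new server list, while unrelated reloads leave the client alone.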
00:31:53.280 This is especially handy for operational tasks, like adding or taking out Memcache servers from our pool. Anytime the list of Memcache servers changes, when we're pulling configurations, we can reinitialize the Rails cache client to be a new instance of whatever client you are using, along with the new list. Your operations team will be pleased with you since they won't have to find you to remove Memcache servers during maintenance! Why have I been talking about polling instead of pushing this whole time?
00:32:52.560 With polling, you keep the burden of what happens when things go wrong in the same place—same code base and source as the code that's trying to pull and fetch the configuration updates. If we have a failed poll, we can easily rescue that timeout or whatever, and we can preserve the original configuration. We have a lot of flexibility. But if we're trying to push and the push fails—if I can't access a machine from my control server, I can't communicate with it to tell it to do anything else. It will continue to run with its original configuration.
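The failed-poll case can be handled in a few lines, because the error surfaces in the same process that holds the current values. This is a sketch under assumptions: the `safe_refresh` helper is hypothetical, and the block stands in for whatever fetches the configuration.

```ruby
require "logger"
require "timeout"

# Runs the given block to fetch fresh configuration. If the fetch raises
# (timeout, network error, bad YAML, ...), logs a warning and returns the
# configuration that is already in use.
def safe_refresh(current, logger: Logger.new($stderr))
  yield
rescue StandardError => e
  logger.warn("config poll failed, keeping previous values: #{e.class}: #{e.message}")
  current
end
```

Usage would be along the lines of `config = safe_refresh(config) { fetch_config }`, with `fetch_config` as a placeholder. A push-based design has no equivalent spot for this rescue, since the unreachable machine never learns that an update was attempted.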
00:34:02.880 At that point, your responsibility is in the wrong place, and you can't do anything about it. This is why, for the most part, you want to poll instead of push. There is an exception I'll talk about in a minute. So, the configuration is now out of source; that’s awesome! We’ve got non-engineers that can build and deploy, and even better, we’ve got configurations that are shared across our constellation of service-oriented services. We can update configurations live without restarts, and we can test those configurations easily. This doesn’t just work for production; you can set up as many environments or hosts as you want.
00:35:35.880 Then, whoever is responsible for those configuration updates—be it a game designer or playtester—can easily push that configuration to a test server and test it themselves without involving you. I think that's a big plus because I know everyone in this room has been bugged a lot for configuration changes by people who can't do it themselves. If you design your toolset correctly, you can also keep a record of all changes, you can audit everything, and you will know exactly what has been going on. Plus, you can have more than just configuration files; Git will also supply you with some very useful metrics.
00:36:45.780 This structure is great for games, where we've got thousands upon thousands of configuration values, and now we have a framework and peace of mind when managing them. What else can we use this for? It's excellent for pushing out lots of A/B testing; if you want to run many experiments, this is an efficient way to make sure that all your experimental data, in terms of what’s going to be changed, can be easily pushed out to all your machines. Additionally, it’s great for dealing with internationalization and translation management. Some of our games are published in 15 languages and have tens of thousands of strings each.
00:38:30.000 We don't want translators mucking around in config/locales. With this framework, we can allow them to use whatever tools they like and have become accustomed to. We work with them to create a simple tool that will bake those translations into a YAML file the game can then consume. And finally, we utilize feature flags. Here's some bonus non-Ruby content: I think there are some influential and cool tools in the following that aren't strictly Ruby. First of all, how many of you have been keeping up with Netflix's efforts in open-sourcing their internal API? Everyone's hands should go up!
00:39:37.800 Their commitment toward pushing the envelope and open-sourcing how to deal with not just operating in the cloud, but mundane problems like configuration management is impressive. We've taken a bunch of ideas from their configuration library, the composite configuration library called Archaius. It is Java, but you can use it in JRuby, and that's just the start of the Netflix ecosystem of Java plugins and libraries. They've also shown a commitment to running their tools on other JVM languages like JRuby as well. There's also Zookeeper, which has a reputation for being tough to run in production; I won't dispute that, but it does solve an extremely hard problem: providing a strongly consistent, clustered synchronization service.
00:41:07.560 So, if you look at it from a basic primitives perspective—sets, puts, watches—a guaranteed order of operations is essential. The watch primitive on Zookeeper gives us a stronger reliability guarantee for push updates to configuration. This leads to an interesting Python app based on Zookeeper called Jones, which takes many of the concepts talked about today and combines them with Zookeeper to allow instant push of configuration changes to whatever hosts are monitoring those specific keys on Zookeeper.
00:41:51.600 All right, I think I have some extra time. That's it; you get another gratuitous puppy picture. Cool, you are all experts now. Thank you very much!