RailsConf 2017

Observing Change: A Gold Master Test in Practice

by Jake Worth

In the presentation "Observing Change: A Gold Master Test in Practice," delivered by Jake Worth at RailsConf 2017, the speaker discusses a testing technique known as the Gold Master test. The approach is tailored to complex legacy systems where traditional testing methods fall short. Worth begins by contextualizing the challenges of maintaining and testing legacy code, which he describes as old, unwieldy software that is difficult to change and understand because of its complexity and age.

Key points discussed throughout the talk include:

- Challenges of Legacy Code: Worth highlights the common pitfalls of legacy applications, such as outdated technology stacks, unmaintained libraries, and vulnerabilities stemming from infrequent updates. He emphasizes the inevitability of working with legacy code in modern software development.
- Limitations of Standard Testing: The speaker explains the weaknesses of standard testing methods (unit tests, API tests, and UI tests), particularly when applied to large and complex codebases. These tests often struggle to effectively cover nuanced behaviors in legacy systems.
- Introduction of the Gold Master Test: Worth introduces the Gold Master test as a comprehensive regression test designed to monitor overall behavior rather than specific functionality. It aims to maintain consistent behavior of the application across iterations.
- Workflow of Gold Master Testing: A detailed workflow is provided, involving the restoration of a production database, running transformations, capturing outputs, and comparing them against established ‘gold master’ artifacts. The ideal applications for this testing method are characterized by their maturity, complexity, and expected minimal change over time.
- Practical Implementation: Worth outlines a process for implementing Gold Master tests using Ruby and PostgreSQL, stressing the importance of scrubbing sensitive data before creating test databases. He elaborates on the structure of the test code and how failures prompt critical evaluations of whether the changes were valid or if they represent breaking changes to the user contract.
- Case Study and Real-World Application: He shares real-world examples, including a project called "Today I Learned," illustrating how the Gold Master test effectively identified significant behavioral changes during development that conventional testing overlooked.

In conclusion, Worth stresses the value of Gold Master testing as a strategy for legacy systems, recommending it for developers working with mature, complex applications. He argues that adopting such creative testing methods is essential as the landscape of application development evolves. The overarching takeaway is that Gold Master tests can serve as a vital safeguard for ensuring consistent user experience and application integrity over time.

00:00:11.990 Morning everyone, hello and welcome! Thank you for being here.
00:00:17.160 I recently revisited one of the most sophisticated applications that I have ever been a part of. A key feature of this application took a complex input, a huge production database connected to a CMS.
00:00:29.400 It processed all that data through a Ruby function consisting of hundreds of lines of raw Postgres code. This function outputted a CSV file that was between five and ten thousand lines long, representing every possible permutation of the data.
00:00:42.000 This CSV file was far too long to be read or reasonably understood by a human. So, this was a complex feature in a complex application.
00:00:52.920 Now, given that I was one of the original authors of this feature, you might think it was easy for me to just jump right back in and start working on it again. However, that was not my initial experience, and this ties into something known as Eagleson's law: any code of your own that you haven't looked at for six or more months might as well have been written by someone else.
00:01:11.240 My point is that old code is challenging, and it doesn't matter if you wrote it or if someone else did; we call this legacy code, and many of us deal with it every day. For those who have not yet experienced it, let me walk through some of the hallmarks of a legacy codebase.
00:01:37.200 First, you'll see old interfaces. In the application I was dealing with, we were working with aging versions of Ruby and Rails. If you open up the gem set, you’ll find gems that have been forked, gems that are no longer maintained, and gems that have no path to upgrade.
00:01:51.540 Each of these provides its own unique obstacle to future development. Additionally, you’ll encounter vulnerabilities because you are not updating the software; you're missing security patches, and your code becomes more vulnerable. Also, there is often dead code, as poor cleanup over time leads to code that hardly ever gets called.
00:02:20.280 We also have code that relies on abandoned design patterns—things that we now consider to be anti-patterns. So, these are some of the downsides of legacy code, but there are some benefits too, and I would list these as profit and users.
00:02:43.830 Starting with profit, a legacy application is often one that is making money. If you are a developer working within a business with a legacy codebase, the fact that you're there suggests that someone is making money, and you hopefully have users who care. They've invested in your product and have a contract expecting you to deliver certain outcomes time and again. This contract is a very special one.
00:03:19.770 One thing I know for sure is that legacy code is inevitable. This is true in two ways: first, it's inevitable for you, even if you are not currently working on a legacy codebase. If you stay in Ruby on Rails, legacy code will be part of your career. Second, it is also inevitable for the applications themselves.
00:03:49.530 I believe that no application is ever truly feature-complete; we will always want to develop and add features, and progress will continue. Thankfully, when that happens, we hopefully have tests, and in the application I was working on, we did have coverage and a design that still made sense to us a year down the road.
00:04:01.440 However, what happens if we don’t have those tests? Now we're facing something much worse—untested legacy code. If you're going to continue developing a Ruby on Rails app that has no tests, you will need to retroactively find a way to add tests, or you will end up breaking features and negating the contract between the application and the user.
00:04:31.210 Most of us are familiar with three types of tests: unit tests, API tests, and user interface tests. Unit tests test individual functions and classes, API tests test how different parts of the application communicate with each other and with external parties, and user interface tests—also known as integration tests or feature tests—test behavior from a high level.
00:05:03.130 If we needed to start from scratch testing an untested legacy codebase, these three types of tests are usually not enough. Each type has its shortcomings. For unit and API tests, there could be thousands of functions and endpoints, and you need to know which ones are actually being used, or else you will waste your time testing unused code.
00:05:10.180 For user interface tests, we need to know what a user actually does on the site. Figuring out which types of tests to write and in what order is hard and quite subjective. Each type of test has its blind spots.
00:05:27.130 Now, I’d like to introduce a metaphor that I will be using throughout my talk: taking a watermelon and throwing it into a fan. Picture a big fan that can chop up the watermelon and splatter it onto a wall.
00:05:40.660 In this metaphor, we start with a complex input—the watermelon—which represents the production database connected to the CMS. We throw the watermelon as fast as we can into the fan, and the fan here represents our Ruby function. The resultant splatter on the wall is our complicated output—our five thousand-plus line CSV file.
00:06:14.260 There’s an interesting feature of this type of system: changes to the fan are really hard to detect. If I throw a watermelon into the fan today and then throw another one into the fan tomorrow, even with the splatter on the wall, it is very difficult to tell if the fan has changed at all.
00:06:32.710 However, detecting changes to the fan is the only thing stakeholders really care about; they want to know that we can guarantee consistent output time and time again. This leads me to question whether any of the traditional tests—unit tests, API tests, and user interface tests—are really equipped to cover this feature.
00:06:49.890 Well, the closest is the unit test, but the isolation of a test database will never come close to the complexity of the production database. Still, we have to test: we want to keep adding features while also preserving behavior.
00:07:06.589 This is a problem, and so what are we to do? I have a solution, and it's something I’ve built in production called the Gold Master Test. My name is Jake Worth and I'm a developer at Hashrocket in Chicago.
00:07:30.170 This talk will be 38 minutes total and comprise 61 slides. Importantly, this is not a Rails talk; rather, it is a general programming talk. Here’s my agenda: I’ll start off by defining a Gold Master Test, then I’ll talk about writing the test, and finally, I will discuss how to work with the test.
00:08:07.610 So, to define this test, I want to talk about a test that is similar to a Gold Master Test and then use that definition to clarify what the Gold Master Test is. The seeds of this idea come from a book published in 2004 called "Working Effectively with Legacy Code" by Michael C. Feathers.
00:08:51.130 In the preface to this book, Feathers defines legacy code as code without tests, which fits perfectly with our working definition of untested legacy code. He summarizes a key idea in the book with the following quote: "In nearly every legacy system, what the system does is more important than what it's supposed to do." This highlights that the behavior of the legacy system isn’t right or wrong—it simply is what it is.
00:09:20.149 This observed behavior becomes a contract with the user. There's a sub-chapter in the book about a type of test called a characterization test, which is defined as a test that characterizes the actual behavior of the code. It’s not right or wrong; it simply performs according to the behavior established by the legacy code.
00:09:45.019 The process involves using a piece of code in a test harness, writing an assertion that you know will fail, letting the failure tell you what the behavior is, and then changing the test to expect the behavior the code produces. For example, we start by saying "expect something to equal 2", we run the test, and it fails, indicating the output isn't what we expected.
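A minimal RSpec sketch of that loop, with a hypothetical `LegacyCalculator` standing in for whatever legacy code is under test, might look like this:

```ruby
require "spec_helper"

# Characterization test: the expected value starts as a deliberate guess.
RSpec.describe LegacyCalculator do
  it "characterizes the current behavior" do
    result = LegacyCalculator.new.total

    # First run: this fails, and the failure message reveals the actual
    # value. We then replace 2 with that value, locking in the observed
    # behavior as the specification.
    expect(result).to eq(2)
  end
end
```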
00:10:16.110 In another context, this might be considered a lazy way to write a test because it avoids determining what the code does upfront. However, if you accept the notion that what matters is what the legacy code does, then it makes perfect sense.
00:10:34.440 Feathers introduces a heuristic for when to apply such a test, which I’ll summarize: step one—write tests where you will be making changes, step two—look at specific things you will change and attempt to write tests for those, and step three—write additional tests on a case-by-case basis.
00:11:14.780 So that's a characterization test; here's how it differs from a Gold Master Test. A characterization test focuses on areas where you will make changes—it operates on the micro level, concerned with specific modifications. In contrast, a Gold Master Test treats the entire application as a whole, caring only about the macro level and intentionally ignoring what happens inside the black box.
00:12:11.440 Here’s my definition of a Gold Master Test: A Gold Master Test is a regression test for complex, untested systems that asserts consistent macro-level behavior. The image on the right represents the Voyager Golden Record, launched into space in 1977 to showcase the sounds produced on Earth until that point.
00:12:50.640 Now, let’s break down my definition a bit. It’s a regression test; we have features that we like, and we don’t want them to go away, so we’re writing a test to help us prevent that. It's designed for complex, untested systems and asserts consistent macro-level behavior.
00:13:19.470 This means the application works in the broadest sense, and we want it to continue functioning as it has. This is my personal definition and reflects the fact that in software, many concepts emerge simultaneously, making it challenging to pinpoint a canonical definition.
00:14:04.560 Now, let’s look at a sample workflow of a Gold Master Test. During the first run, we restore a production database into our test database—that’s our watermelon. Next, we run the Gold Master Test, which processes all the data, similar to the fan.
00:14:48.650 Then, we capture the output—the splatter on the wall. On the first run, we simply save this output without comparing it to anything; that saved output is the crucial setup for the next run, during which we perform the same steps.
00:15:08.560 We restore the production database into the test database once more, run the Gold Master Test, capture the output, and then compare it to the previous output. If there's a failure, we proceed to a crucial decision-making step: either we change the code or we commit the new Gold Master as the new standard.
00:15:51.330 Failing tests prompt some sort of decision-making process; if you’ve written the test correctly, you shouldn't be able to bypass that failure. The ideal application for a Gold Master Test shares three characteristics: it is mature, complex, and we expect minimal change to the output.
00:16:21.910 A mature application contains important behavior that isn’t sufficiently covered by a test. It is complex—complex enough that a few unit tests and an integration test won’t be enough. Finally, we expect minimal change to the output to maintain the contract established with the user.
00:17:11.420 There are several benefits to adding a Gold Master Test to your codebase. First, it establishes rigorous development standards, creating a high bar for our developers: essentially, we're saying that nothing in the application's output should change in any way without being noticed, and if you're already running tests, this one integrates into that cycle.
00:17:59.700 It exposes surprising behaviors, particularly if your code is non-deterministic or returns different results based on the operating system. A Gold Master Test will catch those inconsistencies quickly. Additionally, it aids in forensic analysis; since the test covers the whole application, if something breaks, we can go back in time using tools like Git bisect and pinpoint exactly when the issue arose.
00:18:34.600 Once again, here's my definition: A Gold Master Test is a regression test for complex untested systems, asserting consistent macro-level behavior. Now that we have a working definition, let’s move on to writing one.
00:19:11.180 We’ll be looking at a bit of code now, written in Ruby, using RSpec and Postgres. As mentioned, we have a feature that processes a large production database through a complex Postgres function, producing a significant CSV file. This makes the application an ideal candidate for a Gold Master Test.
00:19:35.910 When I write a test like this, I like to break it into three phases: preparation, testing, and evaluation.
00:19:48.660 To begin preparation, we acquire a dump from production—importantly, scrubbing the database of sensitive information such as email addresses, IP addresses, session data, and financial information. This step is critical because we will eventually check some or all of this data into version control. We use an empty utility database, running a scrub script against it so the process is repeatable.
00:20:47.070 Once we have scrubbed the data, we dump it out as plaintext SQL. Our team wrote a small Rake task for this export. We name the destination file "gold_master.sql" and utilize the `pg_dump` utility, passing relevant Postgres flags to create the dump.
00:21:32.600 Postgres flags may require some data massaging as the production database and test environment may not be identical, but I'll provide a link to a GitHub branch at the end that shows some useful flags. After dumping the database, we check this plaintext SQL into version control.
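As a sketch (the database name and the flag selection here are assumptions, not the exact task from the talk), such a Rake task might look like:

```ruby
# lib/tasks/gold_master.rake
namespace :gold_master do
  desc "Dump the scrubbed utility database as plaintext SQL"
  task :dump do
    # --no-owner and --no-privileges strip statements that vary between
    # environments; --file sets the plaintext destination.
    sh "pg_dump scrubbed_production --no-owner --no-privileges " \
       "--file=gold_master.sql"
  end
end
```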
00:22:14.540 Next, we enter the testing phase. I start with an empty test file to validate the setup. This empty test file describes a class called `Fan` and a function called `shred` that produces a consistent result.
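That starting point might look like the following skeleton (the `Fan` class and `shred` method follow the talk's metaphor):

```ruby
require "rails_helper"

RSpec.describe Fan do
  describe "#shred" do
    # A pending example: enough to confirm the test harness runs.
    it "produces a consistent result"
  end
end
```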
00:22:41.240 To step into the test, the first task is to take the production database and dump it into the test database. We use `ApplicationRecord.connection.execute`, begin by truncating the schema migrations table to avoid conflicts, and then read "gold_master.sql" into the test database. As a result, we maintain a richer testing environment than most developers usually have.
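Assuming the dump sits at the project root, that restore might be sketched as:

```ruby
before(:all) do
  # Clear schema_migrations so the dump's own migration rows don't
  # conflict with those already in the test database.
  ApplicationRecord.connection.execute("TRUNCATE schema_migrations")

  # Replay the scrubbed production dump into the test database.
  ApplicationRecord.connection.execute(File.read("gold_master.sql"))
end
```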
00:23:30.780 Next, we perform the transformation—the "fan" in our metaphor. We call `fan.shred` and store its CSV output in a variable named `actual`.
00:24:07.660 It's important that the transformation produce meaningful output for our assertions. Our test can generate the Gold Master on the first run, and on subsequent runs, it compares the current result to the Gold Master.
00:24:45.820 I prefer this approach so that developers can run the tests without needing substantial prior knowledge of Gold Master Tests. It also makes it easy to regenerate the Gold Master when it legitimately needs to change.
00:25:29.180 Now that we have the `actual` variable, we begin making assertions. We check for the existence of a file called `gold_master.txt`, which will serve as the location for both present and future Gold Masters. If the file does not exist, we write our `actual` output to that file.
00:26:13.510 This works on the first pass, since it simply generates a file: the first run ends with a passing test and a new Gold Master. On the second pass, we check whether the Gold Master file already exists. If it does, we read the file and compare it to the current `actual` result.
00:26:57.090 If the contents of the Gold Master file do not match `actual`, we overwrite the Gold Master file. If you have checked in that Gold Master, this shows up as an unstaged change in version control, which you'll need to resolve manually.
00:27:31.080 Finally, we assert that `actual` equals the Gold Master; if it does not, the test fails loudly. This simple 19-line test file captures the entire testing process.
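Putting the pieces together, a sketch of that file (under the same assumptions as above, with the database restore omitted for brevity) might read:

```ruby
require "rails_helper"

RSpec.describe Fan do
  describe "#shred" do
    it "produces a consistent result" do
      actual = Fan.new.shred
      gold_master_path = "gold_master.txt"

      if File.exist?(gold_master_path)
        gold_master = File.read(gold_master_path)

        # On a mismatch, overwrite the Gold Master so the diff appears
        # as an unstaged change in version control, then fail loudly.
        File.write(gold_master_path, actual) unless actual == gold_master
        expect(actual).to eq(gold_master)
      else
        # First run: generate the Gold Master and pass.
        File.write(gold_master_path, actual)
      end
    end
  end
end
```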
00:28:10.340 In a stable system, this test should pass consistently. A failing test serves as an alarm indicating that we have broken the contract with our user, much like any good regression test.
00:28:43.770 So what happens if it fails? We start with a test failure and evaluate whether it is a valid change—if so, we check in the new Gold Master and continue. If not, we need to investigate carefully to understand what caused the failure.
00:29:20.240 Now let’s discuss working with a codebase that includes such a Gold Master Test. I’d like to present a sample workflow based on my experiences as a developer. Recently, I contributed to an open-source project called "Today I Learned", which allows colleagues to publish short posts about interesting things they’ve learned, frequently with code samples in markdown.
00:30:26.190 This project has been running for over two years, so it is quite mature. However, it has a complicated underlying technology stack. While the front end might appear simple, the backend functionality is quite intricate.
00:31:07.510 I came to the conclusion that it's reasonable to apply a Gold Master Test. My primary assertion is that the homepage should provide consistent data—a statement echoed by my colleagues, as they expect certain features as regular users.
00:31:36.590 To create the Gold Master Test, we start by preparing: obtaining the production database dump, scrubbing sensitive information, and dumping it as plaintext SQL to check into version control.
00:32:07.830 I wrote a script to sanitize the data, focusing on three tables in the database: developers, sessions, and posts. I adjusted user information to less sensitive values while still keeping them unique.
00:32:35.010 Since sensitive data management is crucial, I made sure to adjust the data responsibly, especially considering that client projects typically contain sensitive information.
00:32:57.990 We limited our posts to a manageable number, ensuring we focus on quality and speed during testing. Once prepared, we follow a similar procedure to earlier: dumping production data into the test database.
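Before that dump happens, the scrub itself might look like this hypothetical sketch; the column names and the post limit are illustrative assumptions, not the script from the talk:

```ruby
# Scrub script sketch: neutralize sensitive values while keeping rows unique.
ApplicationRecord.connection.execute(<<~SQL)
  UPDATE developers
  SET email    = 'developer-' || id || '@example.com',
      username = 'developer-' || id;

  TRUNCATE sessions;

  DELETE FROM posts
  WHERE id NOT IN (SELECT id FROM posts ORDER BY created_at DESC LIMIT 100);
SQL
```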
00:33:38.550 We then use Capybara to visit the root path of the application to simulate a real user experience. The resulting page HTML is captured for both the initial run and subsequent comparisons.
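A minimal sketch of that spec, assuming the standard Capybara DSL and an assumed `gold_master.html` artifact name:

```ruby
require "rails_helper"

RSpec.feature "Homepage gold master" do
  scenario "renders the homepage consistently" do
    visit root_path
    actual = page.html
    path = "gold_master.html"

    # Same write-then-compare pattern as the CSV example.
    if File.exist?(path)
      gold_master = File.read(path)
      File.write(path, actual) unless actual == gold_master
      expect(actual).to eq(gold_master)
    else
      File.write(path, actual)
    end
  end
end
```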
00:34:14.240 In this way, if any HTML on the page changes, the Gold Master Test will catch the discrepancy. It's been rewarding to see how effective a Gold Master Test can be in maintaining project integrity.
00:34:56.930 A diff between two large HTML files can be hard to read, but our goal is preserving the user experience. If changes occur that would notably confuse users, those discrepancies should be flagged.
00:35:37.160 To wrap up, if you have a mature, complex, and stable application, consider implementing Gold Master Testing. This approach can serve as a comprehensive substitute when you lack a robust test suite.
00:36:04.480 My experiences with these tests have continually revealed surprising insights into the code, informing how we evolve our applications. Looking forward, I believe future applications will require creative testing strategies to navigate the balance between legacy systems and new frameworks.