Continuous Visual Integration for Rails

00:00:10.360 This is the Continuous Visual Integration talk. Thank you for being here. We need to go through a lot of material today, so let's get started.

00:00:18.039 A little bit about me to start: My name is Mike Fotinakis. I'm currently the founder of Percy, which is a tool for visual testing. I'm excited to share with you some of the things I've learned over the last year about how to test pixel by pixel.

00:00:24.920 I am also the author of two Ruby gems: JSON API Serializers and Swagger Blocks. If you use either of those, I'd love to talk to you afterward or if you have any questions.

00:00:31.840 Let's dive right in. This talk will come in three parts: the problem, the general solution, and how it works, including architectures, methodologies, and the problems that come with it.

00:00:37.760 So let's start with the problem. Unit testing is mostly a solved problem. We have various strategies, techniques, and technologies for testing the data, behavior, functionality, and integration of our systems.

00:00:43.719 We have tools for end-to-end testing and smoke testing deployments, but how do we test something visual? Consider a scenario where the color of the text becomes the color of the button or the button text loses its visibility due to zero opacity.

00:00:50.719 This can be caused by a simple issue. Another example is a 404 page from an app I used to work on: this is how it should look. It’s simple, straightforward. We launched a feature, and four weeks later we were informed that our 404 page looked like this.

00:01:00.000 I'm sure you've all seen a 404 page that breaks. No one caught this in QA because nobody goes through the 404 page during QA, and this was caused by a simple change—someone moved a CSS file, and everything else worked except for the 404 page.

00:01:12.720 After it was fixed, we found that the Call to Action (CTA) was totally obscured, and it didn't get QA'd for mobile either. I went back to check the 404 page, and it was broken again. In business, this is called a regression, specifically a visual regression.

00:01:24.960 So, how do product teams fix this today? I’ll throw out some ideas. Hire more people? Or have interns just clicking around, exploring behavior?

00:01:30.840 This is often referred to as QA. It can be developer QA or QA engineers doing checks, which helps spot issues before they hit production or ensures customer-reported issues get fixed after deployment.

00:01:37.400 But QA is slow, manual, and complicated, making it impossible to catch every single issue. In a medium-sized app with numerous models, there can be hundreds of flows and thousands of potential page states, especially with constant feature churn.

00:01:49.040 QA is also expensive in terms of manual engineering hours spent fixing visual regressions.

00:01:55.920 Returning to the button issue, my standard fix is to write a regression test for it, as I am a big proponent of Test Driven Development (TDD). However, here’s the issue: the test doesn’t fail if the button works but looks visually incorrect.

00:02:06.400 So, do I assert the CSS computed style of the color, or that it has a certain CSS class applied? None of this truly tests the right thing, and no one wants to write a fragile, inflexible test in a developing product, so I don’t do it.

00:02:18.080 The key problem is that pixels are changing, but we’re often only testing the underlying abstractions that involve those pixels. However, the pixels are what users are actually interacting with, which makes this a significant issue.

00:02:25.720 Despite all current testing strategies, we still lack confidence in our deployments. A million unit tests can cover a variety of data scenarios, but if you change a CSS file or its properties, you still need to manually check it.

00:02:37.680 Now, let's discuss the solution. I don’t want to claim this as the ultimate solution, but rather a new tool in the toolbox. The question is: what if we could see every pixel changed in any UI state in every PR?

00:02:47.080 Imagine testing apps pixel by pixel. To achieve this, let's introduce a new concept: perceptual diffs, also known as P-diffs or visual diffs.

00:02:53.680 This concept has been explored by many, including Brett Slatkin at Google. He once humorously described how they launched a pink dancing pony to production, which led them to develop this style of testing.

00:03:01.240 So, what is a perceptual diff? It's relatively straightforward. Given two images, you compute the difference between them—essentially calculating the delta without context about what the image is depicting.

00:03:11.599 For example, you can highlight all the red pixels that have changed from one image to another. This can be computed for any kind of image.
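
To make the idea concrete, here is a minimal sketch of a pixel-by-pixel diff in plain Ruby. It assumes images are already decoded into arrays of rows of [r, g, b] triples; the `pixel_diff` name and that representation are illustrative assumptions, not something shown in the talk.

```ruby
# Images are modelled as arrays of rows; each pixel is an [r, g, b] triple.
# This is an illustrative stand-in for real decoded image data.

# Returns [mask, changed_count], where mask[y][x] is true when the pixel
# at (x, y) differs between the two images.
def pixel_diff(old_image, new_image)
  same_size = old_image.length == new_image.length &&
              old_image.zip(new_image).all? { |a, b| a.length == b.length }
  raise ArgumentError, "images must have the same dimensions" unless same_size

  changed = 0
  mask = old_image.each_with_index.map do |row, y|
    row.each_with_index.map do |pixel, x|
      differs = pixel != new_image[y][x]
      changed += 1 if differs
      differs
    end
  end
  [mask, changed]
end

white = [255, 255, 255]
red   = [255, 0, 0]
old_image = [[white, white], [white, white]]
new_image = [[white, red],   [white, white]]

mask, changed = pixel_diff(old_image, new_image)
puts "#{changed} pixel(s) changed"
```

The mask is what a diff image visualizes: every `true` cell would be painted in the highlight color.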

00:03:18.920 Now, let’s try another example. I want you to shout out what differences you see between two images side by side before I show you the P-diff results.

00:03:27.360 You’ll notice things like background color changes or missing elements. This is the P-diff result, which clearly indicates the changes in the most immediate way.

00:03:33.680 Let's quickly create a P-diff. I have two images: old and new. By using ImageMagick's compare tool, I can generate the visual difference.

00:03:45.600 I run the command that compares the old and new images, and voilà, we have our first P-diff. You can see all the pixels that have changed, with the images shown one beneath the other.

00:03:55.440 The tool can apply different settings, including a fuzz factor if you don’t mind some pixel changes within a certain range of colors.
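
As a rough sketch of what that invocation might look like when driven from Ruby: the helper below only builds the argument list for ImageMagick's `compare`. The file names and the `compare_command` helper are assumptions for illustration; the exact command typed in the demo is not in the transcript.

```ruby
# Builds the argument list for ImageMagick's `compare` CLI.
# fuzz_percent is optional; when given it is passed as `-fuzz N%`, which
# treats colors within that distance of each other as equal.
def compare_command(old_path, new_path, diff_path, fuzz_percent: nil)
  args = ["compare"]
  args += ["-fuzz", "#{fuzz_percent}%"] if fuzz_percent
  args + [old_path, new_path, diff_path]
end

cmd = compare_command("old.png", "new.png", "diff.png", fuzz_percent: 5)
puts cmd.join(" ")

# To actually run it (requires ImageMagick on PATH and both input files):
#   system(*cmd)
# `compare` exits with status 1 when the images differ, so a non-zero
# status alone does not mean the command failed.
```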

00:04:01.440 Creating P-diffs is relatively easy. Let’s examine some real-life examples. Determining the differences could take a moment, but look closely, and you may find that entire sections have disappeared.

00:04:10.080 One such case is a ‘Do you agree to the terms of use?’ section missing from a page, caused by back-end changes that affected how the front end rendered.

00:04:20.440 A visual diff test would help catch that; without it, you'd never notice the form is now non-functional, presumably due to the missing agreement.

00:04:32.880 Another example might show normal visual changes when a new person was added to the page. Visual diffs can sometimes be noisy, as small changes create numerous updates.

00:04:43.600 For instance, if you see very similar pages with slight differences, it could be due to changes in the footer from a gem that adds scripts when in a broken state.

00:04:51.200 You would need visual testing to catch such issues, as they often fly under the radar even when all other traditional tests pass.

00:05:02.960 On a lighter note, some P-diffs I've encountered end up resulting in visually amusing changes—like an image getting perfectly shifted into a clever pattern.

00:05:10.960 Another strong signal from a P-diff is when no pixels have changed at all. In a pure refactor, this indicates that everything consumers interact with remains untouched.

00:05:19.600 As your application scales, maintaining the capacity to do these refactors with a clear understanding of unchanged elements is crucial for code health.

00:05:31.600 Now, let's get into writing a visual regression testing system quickly.

00:05:38.280 I have an app, Giffindor, which was the demo app from Brandon Hays's talk at RailsConf two years ago. I will write tests for it today.

00:05:44.320 The feature specs I've written execute basic tasks like visiting a page, expecting certain content, or interacting with a dialog.

00:05:50.960 These tests check the app's behavior by performing typical actions, such as submitting a GIF which initiates a jQuery animation.

00:06:01.200 Let’s save a screenshot at the end of the test using the Capybara capabilities, which support most web drivers.
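
A minimal sketch of that step, assuming a Capybara session object that responds to `save_screenshot(path)`. The helper names and the `tmp/screenshots` directory are hypothetical and not taken from the demo.

```ruby
# Turns a human-readable snapshot name into a stable file path, so the
# same UI state always writes to the same file between runs.
def screenshot_path(name, dir: "tmp/screenshots")
  slug = name.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-|-\z/, "")
  File.join(dir, "#{slug}.png")
end

# `page` is expected to respond to save_screenshot(path), as a Capybara
# session does when the driver supports screenshots.
def save_named_screenshot(page, name)
  path = screenshot_path(name)
  page.save_screenshot(path)
  path
end

# In a feature spec this might be the last step:
#   visit "/"
#   save_named_screenshot(page, "Home page")
```

Stable file names matter here because the old and new screenshots of the same state have to be paired up before they can be diffed.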

00:06:08.560 We run the tests, and let’s note any changes. We compare the screenshot to see what has changed in the state of our app during testing.

00:06:17.360 However, it's important to note that this state might not fully reflect some elements, like indeterminate border images.

00:06:25.680 Next, we’ll adjust the background color slightly and rerun our tests to see how that affects the visual output.

00:06:32.920 After making the change, we compare the old and new images, noting that pixel changes reflect the modified background.

00:06:44.640 You might think such changes are trivial, but designers often want to ensure color consistency across the app, sometimes finding even minor changes significant.

00:06:55.920 Let's talk practical uses: obvious benefits include catching visual regressions.

00:07:02.800 Advanced uses include validating style guide updates. Removing CSS can be daunting, as you may not know where it is utilized.

00:07:11.120 If you conduct a visual diff test on your top pages before deletion, you’ll know if styles are necessary.

00:07:24.800 Testing living style guides or conducting safe dependency upgrades provide excellent opportunities to utilize visual testing.

00:07:32.560 Visual regression testing is also applicable for emails and D3 visualizations, which can be tricky because testing D3 is not straightforward.

00:07:43.520 Imagine knowing at a glance how D3 visualizations appear as part of your test suite; that's the kind of confidence we want.

00:07:54.320 Now, we need a visual review process alongside code review. Why is this not commonplace if it’s so effective?

00:08:00.640 This process grows complicated. I categorize problems into three areas: tooling, workflows, and performance.

00:08:11.680 Tooling difficulties exist. Tools like PhantomCSS create confusion by presenting individual visual changes as numerous test failures.

00:08:18.320 Manually storing baseline images adds unnecessary weight to the workflow, which most developers would avoid.

00:08:29.680 Performance is a major issue across all tools. The examples I've shown relate to simple pages, but frequently pages can be as large as 30,000 or 40,000 pixels high.

00:08:42.360 Sometimes it can take up to 15 seconds just to render those images. If you need to run hundreds of tests, which take a total of 30 minutes, it’s not ideal.

00:08:55.920 Lastly, non-deterministic rendering adds complexity. Changes in browsers introduce variability; for instance, pure CSS animations confuse the diffs.

00:09:05.760 In services like Percy, we freeze animations to generate consistent outputs. You may want to refer to my blog post for deeper insights.
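
One common way to get this effect is to inject a stylesheet that switches off animations and transitions before the page is captured. The sketch below illustrates that general idea only; it is an assumption for illustration and not a description of how Percy implements freezing.

```ruby
# CSS that disables animations and transitions for every element.
FREEZE_ANIMATIONS_CSS = <<~CSS
  *, *::before, *::after {
    animation: none !important;
    transition: none !important;
  }
CSS

# Inserts the freeze stylesheet just before </head>. If the document has
# no </head>, the HTML is returned unchanged.
def freeze_animations(html)
  style_tag = "<style>#{FREEZE_ANIMATIONS_CSS}</style>"
  html.sub("</head>") { "#{style_tag}</head>" }
end

html = "<html><head><title>Demo</title></head><body></body></html>"
puts freeze_animations(html)
```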

00:09:13.360 Dynamic data can also create obstacles, such as a date picker displaying varying components. To combat this, you can utilize fixture data for consistency.

00:09:25.440 Older testing browsers introduce differences, and tools like capybara-webkit are outdated and lack modern features, creating challenges.

00:09:38.080 Setting pixel-perfect standards is difficult; floating-point operations don't guarantee the same pixel output across machines. Even subtle rendering differences can arise.

00:09:50.480 This realization emphasizes that perceptual diffs, while useful, are only half the solution.

00:10:01.680 Reiterating the goal: How can we visualize pixel changes across all UI states in every PR? This distinguishes visual regression testing from continuous visual integration.

00:10:13.120 Just as there are numerous processes for continuous integration in code, continuous visual integration requires a framework to verify visual components as changes occur.

00:10:25.120 To pull this off, speed is crucial. Tests must run as fast as your suite; handling complex UI states cannot be an afterthought.

00:10:36.320 Continuous integration is essential; every commit should be integrated efficiently. For me, relying on production or staging means running these checks too late.

00:10:48.200 Now, I will explain how we structured Percy to tackle these challenges efficiently.

00:11:00.480 Here's an overview of how Percy works within an RSpec feature spec. You initiate it like any feature spec by visiting a page and performing actions.

00:11:10.480 Then, you can drop in Percy’s snapshot command telling it to record the page with a descriptive name.

00:11:18.720 Now, when testing sessions are pushed up, we aren’t actually pushing images. Instead, we submit DOM snapshots, which is a more efficient way to capture and store the state.

00:11:30.960 Thus, we upload the DOM alongside fingerprinted asset versions, which means no asset is uploaded twice. Initially, the first upload could take time, but the subsequent ones are efficient.
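
The deduplication idea can be sketched with content fingerprints: hash each asset's bytes and upload only when the hash has not been seen. The `AssetUploader` class, the choice of SHA-256, and the in-memory set standing in for the server are all assumptions made for illustration.

```ruby
require "digest"
require "set"

# Tracks which asset fingerprints the (hypothetical) server already has.
class AssetUploader
  attr_reader :uploaded_count

  def initialize
    @known_fingerprints = Set.new
    @uploaded_count = 0
  end

  # Returns the SHA-256 fingerprint; uploads only if the content is new.
  def ensure_uploaded(content)
    fingerprint = Digest::SHA256.hexdigest(content)
    upload(fingerprint, content) if @known_fingerprints.add?(fingerprint)
    fingerprint
  end

  private

  # Stand-in for a real network call.
  def upload(_fingerprint, _content)
    @uploaded_count += 1
  end
end

uploader = AssetUploader.new
css = "body { color: black; }"
uploader.ensure_uploaded(css)
uploader.ensure_uploaded(css)        # same bytes, skipped
uploader.ensure_uploaded(css + " ")  # different bytes, uploaded
puts uploader.uploaded_count
```

Because the key is derived from the content itself, an unchanged stylesheet or image costs nothing on later builds, which matches the slow-first-upload, fast-afterwards behavior described above.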

00:11:41.760 Underneath it all, we carry out various operational tasks that align with GitHub, setting the commit status, and driving performance with Percy Hub.

00:11:51.760 One crucial performance aspect is that we can parallelize the rendering of these DOM snapshots instead of processing them sequentially.
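
The shape of that fan-out can be sketched with a small worker pool. This is a local illustration using threads, with a block standing in for the renderer; the `render_in_parallel` helper is hypothetical, and a real service would presumably spread the work across processes or machines rather than threads in one interpreter.

```ruby
# Renders snapshots concurrently with a fixed pool of worker threads.
# The block receives one snapshot and returns its rendered result.
# Results are returned in the same order as the input snapshots.
def render_in_parallel(snapshots, workers: 4, &render)
  queue = Queue.new
  snapshots.each_with_index { |snapshot, index| queue << [snapshot, index] }
  queue.close # pop returns nil once the queue is closed and drained

  results = Array.new(snapshots.length)
  threads = Array.new(workers) do
    Thread.new do
      while (job = queue.pop)
        snapshot, index = job
        results[index] = render.call(snapshot)
      end
    end
  end
  threads.each(&:join)
  results
end

rendered = render_in_parallel(%w[home about pricing]) { |name| "#{name}.png" }
puts rendered.inspect
```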

00:12:02.160 The current milestone that I’m proud of is achieving a million visual diffs rendered in Percy just yesterday.

00:12:14.080 Let’s take a look at Percy in action with examples from a few of our customers. For instance, Charity: Water's build shows 162 total snapshots tracked.

00:12:23.600 They recently produced a new footer markup and visual diffs were generated from 96 changes. With an overview mode, it’s easy to assess all modifications quickly.

00:12:33.360 Another example involves updating the press page, where CSS styles were altered—resulting in completely broken UI states, which would halt any release until properly addressed.

00:12:45.120 Throughout this, the importance of visual regression testing is underscored—it creates a lightweight visual review system within the development pipeline.

00:12:55.840 Typically, visual diffs serve as a checkpoint before final approval.

00:13:04.120 Now, let’s touch on the notion of snapshotting the DOM model. This powerful method allows for rapid verification of UI states.

00:13:13.120 Persistent communication occurs through integration into CI, making checks seamless. Visual updates and progress can be evaluated quickly as code adjustments are made.

00:13:23.600 I'm always happy to discuss and expand upon ideas to optimize CI workflows, especially as new challenges arise.

00:13:32.880 In conclusion, visual testing is something we should embrace. It can significantly boost deployment confidence and finally incorporate a manual review step that can be automated to a reasonable degree.

00:13:40.640 One last note: thanks to the DOM snapshotting model, I’m eager to extend this kind of regression testing into the Ember ecosystem.

00:13:49.840 If you're an Ember user interested in testing, reach out via email, and let’s get you beta testing!

00:13:58.640 Thank you!

00:14:06.959 Now, great question. How is the baseline created? Usually, I take the latest version of the master branch, which serves as the baseline, and we offer a mechanism to approve builds manually.

00:14:20.600 If master is always green and deployable, it’s reasonable to test against it. However, we don’t handle cross-browser testing currently, though this could be a future evolution.

00:14:32.640 Indeed, old browsers limit testing capabilities; many don't support full-page screenshots, though major modern browsers assist significantly.

00:14:40.720 The technology stack we've selected comprises a custom-built platform on Google Cloud, and it runs a full Ember front-end via a Rails API.

00:14:51.920 Therefore, automation and integration are key to enhance CI and offer an efficient, robust visual regression testing landscape.

00:15:03.440 Thank you for your engagement, and I hope you found this talk insightful!