00:00:10.360
This is the Continuous Visual Integration talk. Thank you for being here. We need to go through a lot of material today, so let's get started.
00:00:18.039
A little bit about me to start: My name is Mike Fotinakis. I'm currently the founder of Percy, which is a tool for visual testing. I'm excited to share with you some of the things I've learned over the last year about how to test pixel by pixel.
00:00:24.920
I am also the author of two Ruby gems: JSON API Serializers and Swagger Blocks. If you use either of those, or if you have any questions, I'd love to talk to you afterward.
00:00:31.840
Let's dive right in. This talk will come in three parts: the problem, the general solution, and how it works, including architectures, methodologies, and the problems that come with it.
00:00:37.760
So let's start with the problem. Unit testing is mostly a solved problem. We have various strategies, techniques, and technologies for testing the data, behavior, functionality, and integration of our systems.
00:00:43.719
We have tools for end-to-end testing and smoke testing deployments, but how do we test something visual? Consider a scenario where the color of the text becomes the color of the button or the button text loses its visibility due to zero opacity.
00:00:50.719
This can be caused by a simple issue. Another example is a 404 page from an app I used to work on: this is how it should look. It’s simple, straightforward. We launched a feature, and four weeks later we were informed that our 404 page looked like this.
00:01:00.000
I'm sure you've all seen a 404 page that breaks. No one caught this in QA because nobody goes through the 404 page during QA, and this was caused by a simple change—someone moved a CSS file, and everything else worked except for the 404 page.
00:01:12.720
After it was fixed, we found that the Call to Action (CTA) was totally obscured, and it didn't get QA'd for mobile either. I went back to check the 404 page, and it was broken again. In business, this is called a regression, specifically a visual regression.
00:01:24.960
So, how do product teams fix this today? I’ll throw out some ideas. Hire more people? Or have interns just clicking around, exploring behavior?
00:01:30.840
This is often referred to as QA. It can be developer QA or QA engineers doing checks, which helps spot issues before they hit production or ensures customer-reported issues get fixed after deployment.
00:01:37.400
But QA is slow, manual, and complicated, making it impossible to catch every single issue. In a medium-sized app with numerous models, there can be hundreds of flows and thousands of potential page states, especially with constant feature churn.
00:01:49.040
QA is also expensive in terms of manual engineering hours spent fixing visual regressions.
00:01:55.920
Returning to the button issue, my standard fix is to write a regression test for it, as I am a big proponent of Test Driven Development (TDD). However, here’s the issue: the test doesn’t fail if the button works but looks visually incorrect.
00:02:06.400
So, do I assert the CSS computed style of the color, or that it has a certain CSS class applied? None of this truly tests the right thing, and no one wants to write a fragile, inflexible test in a developing product, so I don’t do it.
00:02:18.080
The key problem is that pixels are changing, but we’re often only testing the underlying abstractions that involve those pixels. However, the pixels are what users are actually interacting with, which makes this a significant issue.
00:02:25.720
Despite all current testing strategies, we still lack confidence in our deployments. A million unit tests can cover a variety of data scenarios, but if you change a CSS file or its properties, you still need to manually check it.
00:02:37.680
Now, let's discuss the solution. I don’t want to claim this as the ultimate solution, but rather a new tool in the toolbox. The question is: what if we could see every pixel changed in any UI state in every PR?
00:02:47.080
Imagine testing apps pixel by pixel. To achieve this, let's introduce a new concept: perceptual diffs, also known as P-diffs or visual diffs.
00:02:53.680
This concept has been explored by many, including Brett Slatkin at Google. He once humorously described how they launched a pink dancing pony to production, which led them to develop this style of testing.
00:03:01.240
So, what is a perceptual diff? It's relatively straightforward. Given two images, you compute the difference between them—essentially calculating the delta without context about what the image is depicting.
00:03:11.599
For example, you can highlight all the red pixels that have changed from one image to another. This can be computed for any kind of image.
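A minimal sketch of that computation in Python, assuming both images are already decoded into equal-sized grids of (R, G, B) tuples. The name `pixel_diff` and the red-on-white highlight are illustrative choices, not taken from any tool mentioned in the talk.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]

RED: Pixel = (255, 0, 0)
WHITE: Pixel = (255, 255, 255)


def pixel_diff(old: Image, new: Image) -> Tuple[Image, int]:
    """Return a highlight image and the number of changed pixels."""
    if len(old) != len(new) or any(len(a) != len(b) for a, b in zip(old, new)):
        raise ValueError("images must have the same dimensions")
    changed = 0
    highlight: Image = []
    for old_row, new_row in zip(old, new):
        out_row = []
        for old_px, new_px in zip(old_row, new_row):
            if old_px != new_px:
                # Mark every pixel that is not identical in both images.
                changed += 1
                out_row.append(RED)
            else:
                out_row.append(WHITE)
        highlight.append(out_row)
    return highlight, changed


old = [[(0, 0, 0), (0, 0, 0)], [(10, 20, 30), (0, 0, 0)]]
new = [[(0, 0, 0), (9, 9, 9)], [(10, 20, 30), (0, 0, 0)]]
highlight, changed = pixel_diff(old, new)
print(changed)
```

The comparison knows nothing about what the images depict. It only reports which positions differ, which is why the same approach works for any kind of image.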
00:03:18.920
Now, let’s try another example. I want you to shout out what differences you see between two images side by side before I show you the P-diff results.
00:03:27.360
You’ll notice things like background color changes or missing elements. This is the P-diff result, which clearly indicates the changes in the most immediate way.
00:03:33.680
Let's quickly create a P-diff. I have two images: old and new. By using ImageMagick's compare tool, I can generate the visual difference.
00:03:45.600
I run the command that compares the old and new images, and voilà, we have our first P-diff. You can see all the pixels that have changed, with the images shown beneath each other.
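The exact command is not captured in the transcript. As a hedged sketch, the basic form of ImageMagick's `compare` takes two input images and an output path; the file names and the Python wrapper functions below are assumptions for illustration.

```python
import subprocess
from typing import List


def compare_command(old_path: str, new_path: str, diff_path: str) -> List[str]:
    """Build the argument list for ImageMagick's compare tool."""
    return ["compare", old_path, new_path, diff_path]


def run_compare(old_path: str, new_path: str, diff_path: str) -> bool:
    """Run compare and return True when the images differ.

    Assumes ImageMagick is installed. Per its documentation, compare exits
    with 0 for similar images, 1 for dissimilar images, and 2 on error.
    """
    result = subprocess.run(compare_command(old_path, new_path, diff_path))
    if result.returncode not in (0, 1):
        raise RuntimeError("compare failed with status %d" % result.returncode)
    return result.returncode == 1


# Only print the command here; running it needs real image files.
print(" ".join(compare_command("old.png", "new.png", "diff.png")))
```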
00:03:55.440
The tool can apply different settings, including a fuzz factor if you don’t mind some pixel changes within a certain range of colors.
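To illustrate what a fuzz factor does, here is a simplified sketch that treats two colors as equal when their distance in RGB space is within a percentage of the maximum possible distance. This is an approximation for explanation only; ImageMagick's own `-fuzz` computation may differ in detail, and the function name `differs` is hypothetical.

```python
import math
from typing import Tuple

Pixel = Tuple[int, int, int]

# Largest possible distance between two RGB colors (black to white).
MAX_DISTANCE = math.sqrt(3 * 255 ** 2)


def differs(old: Pixel, new: Pixel, fuzz_percent: float = 0.0) -> bool:
    """True when two colors are further apart than the allowed fuzz."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(old, new)))
    return distance > MAX_DISTANCE * fuzz_percent / 100.0


# With no fuzz, a tiny change in one channel counts as a difference.
print(differs((200, 200, 200), (202, 200, 200)))
# With 5% fuzz, the same change is ignored.
print(differs((200, 200, 200), (202, 200, 200), 5.0))
```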
00:04:01.440
Creating P-diffs is relatively easy. Let’s examine some real-life examples. Determining the differences could take a moment, but look closely, and you may find that entire sections have disappeared.
00:04:10.080
One such case is a ‘Do you agree to the terms of use?’ section missing from a page, caused by back-end changes that affected how the front end rendered.
00:04:20.440
A visual diff test would help catch that; without it, you'd never notice the form is now non-functional, presumably due to the missing agreement.
00:04:32.880
Another example might show normal visual changes when a new person was added to the page. Visual diffs can sometimes be noisy, as small changes create numerous updates.
00:04:43.600
For instance, if you see very similar pages with slight differences, it could be due to changes in the footer from a gem that adds scripts when in a broken state.
00:04:51.200
You would need visual testing to catch such issues, as they often fly under the radar even when all other traditional tests pass.
00:05:02.960
On a lighter note, some P-diffs I've encountered end up resulting in visually amusing changes—like an image getting perfectly shifted into a clever pattern.
00:05:10.960
Another strong signal from a P-diff is when no pixels have changed at all. In a pure refactor, this indicates that everything consumers interact with remains untouched.
00:05:19.600
As your application scales, maintaining the capacity to do these refactors with a clear understanding of unchanged elements is crucial for code health.
00:05:31.600
Now, let's get into writing a visual regression testing system quickly.
00:05:38.280
I have an app, ‘Giffindor,’ which was the demo app from Brandon Hays's talk at RailsConf two years ago. I will write tests for it today.
00:05:44.320
The feature specs I've written execute basic tasks like visiting a page, expecting certain content, or interacting with a dialog.
00:05:50.960
These tests check the app's behavior by performing typical actions, such as submitting a GIF which initiates a jQuery animation.
00:06:01.200
Let’s save a screenshot at the end of the test using Capybara's screenshot support, which works with most browser drivers.
00:06:08.560
We run the tests, and let’s note any changes. We compare the screenshot to see what has changed in the state of our app during testing.
00:06:17.360
However, it's important to note that this state might not fully reflect some elements, like indeterminate border images.
00:06:25.680
Next, we’ll adjust the background color slightly and rerun our tests to see how that affects the visual output.
00:06:32.920
After making the change, we compare the old and new images, noting that pixel changes reflect the modified background.
00:06:44.640
You might think such changes are trivial, but designers often want to ensure color consistency across the app, sometimes finding even minor changes significant.
00:06:55.920
Let's talk practical uses: obvious benefits include catching visual regressions.
00:07:02.800
Advanced uses include validating style guide updates. Removing CSS can be daunting, as you may not know where it is utilized.
00:07:11.120
If you conduct a visual diff test on your top pages before deletion, you’ll know if styles are necessary.
00:07:24.800
Testing living style guides or conducting safe dependency upgrades provide excellent opportunities to utilize visual testing.
00:07:32.560
Visual regression testing is also applicable for emails and D3 visualizations, which can be tricky because testing D3 is not straightforward.
00:07:43.520
Imagine knowing at a glance how D3 visualizations appear as part of your test suite; that's the kind of confidence we want.
00:07:54.320
Now, we need a visual review process alongside code review. Why is this not commonplace if it’s so effective?
00:08:00.640
This process grows complicated. I categorize problems into three areas: tooling, workflows, and performance.
00:08:11.680
Tooling difficulties exist. Tools like PhantomCSS create confusion by presenting individual visual changes as numerous test failures.
00:08:18.320
Manually storing baseline images adds unnecessary weight to the workflow, which most developers would avoid.
00:08:29.680
Performance is a major issue across all tools. The examples I've shown relate to simple pages, but frequently pages can be as large as 30,000 or 40,000 pixels high.
00:08:42.360
Sometimes it can take up to 15 seconds just to render those images. If you need to run hundreds of tests and the whole run takes 30 minutes, that is not ideal.
00:08:55.920
Lastly, non-deterministic rendering adds complexity. Changes in browsers introduce variability; for instance, pure CSS animations confuse the diffs.
00:09:05.760
In services like Percy, we freeze animations to generate consistent outputs. You may want to refer to my blog post for deeper insights.
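One way to freeze animations, sketched here under assumptions, is to inject a CSS rule into the page before rendering so that animations and transitions are disabled. The talk does not describe Percy's actual mechanism; the rule and the `freeze_animations` helper below are illustrative only.

```python
import re

# A style rule that turns off CSS animations and transitions everywhere.
FREEZE_CSS = (
    "<style>*, *::before, *::after { "
    "animation: none !important; transition: none !important; }</style>"
)


def freeze_animations(html: str) -> str:
    """Insert the freeze rule just before </head>, or prepend it if no head exists."""
    match = re.search(r"</head\s*>", html, flags=re.IGNORECASE)
    if match is None:
        return FREEZE_CSS + html
    return html[:match.start()] + FREEZE_CSS + html[match.start():]


page = "<html><head><title>Demo</title></head><body></body></html>"
frozen = freeze_animations(page)
print(frozen)
```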
00:09:13.360
Dynamic data can also create obstacles, such as a date picker displaying varying components. To combat this, you can utilize fixture data for consistency.
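A small sketch of the fixture idea: pass the date in rather than reading the system clock, so the rendered markup is identical on every run. The `render_date_banner` function and the fixture date are hypothetical.

```python
from datetime import date


def render_date_banner(today: date) -> str:
    """Render markup from an explicit date instead of the system clock."""
    return "<p>Today is {}</p>".format(today.isoformat())


# In tests, always render with the same fixture date.
FIXTURE_DATE = date(2016, 5, 4)
first_run = render_date_banner(FIXTURE_DATE)
second_run = render_date_banner(FIXTURE_DATE)
print(first_run == second_run)
```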
00:09:25.440
Older testing browsers introduce differences, and tools like capybara-webkit are outdated and lack modern features, creating challenges.
00:09:38.080
Setting pixel-perfect standards is difficult; floating-point operations don't guarantee the same pixel output across machines. Even subtle rendering differences can arise.
00:09:50.480
This realization emphasizes that perceptual diffs, while useful, are only half the solution.
00:10:01.680
Reiterating the goal: How can we visualize pixel changes across all UI states in every PR? This distinguishes visual regression testing from continuous visual integration.
00:10:13.120
Just as there are numerous processes for continuous integration in code, continuous visual integration requires a framework to verify visual components as changes occur.
00:10:25.120
To pull this off, speed is crucial. Tests must run as fast as your suite; handling complex UI states cannot be an afterthought.
00:10:36.320
Continuous integration is essential; every commit should be integrated efficiently. For me, catching these issues only in staging or production is too late in the process.
00:10:48.200
Now, I will explain how we structured Percy to tackle these challenges efficiently.
00:11:00.480
Here's an overview of how Percy works within an RSpec feature spec. You initiate it like any feature spec by visiting a page and performing actions.
00:11:10.480
Then, you can drop in Percy’s snapshot command telling it to record the page with a descriptive name.
00:11:18.720
Now, when testing sessions are pushed up, we aren’t actually pushing images. Instead, we submit DOM snapshots, which is a more efficient way to capture and store the state.
00:11:30.960
Thus, we upload the DOM alongside fingerprinted asset versions, which means no asset is uploaded twice. Initially, the first upload could take time, but the subsequent ones are efficient.
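The deduplication idea can be sketched as content hashing: fingerprint each asset by its bytes and upload only fingerprints the server has not seen. SHA-256 and the function names here are assumptions; the talk does not say how Percy computes fingerprints.

```python
import hashlib
from typing import Dict, Set


def fingerprint(content: bytes) -> str:
    """Identify an asset by the hash of its content, not by its file name."""
    return hashlib.sha256(content).hexdigest()


def assets_to_upload(assets: Dict[str, bytes], already_stored: Set[str]) -> Dict[str, bytes]:
    """Return fingerprint -> content for assets the server has not seen yet."""
    missing = {}
    for content in assets.values():
        digest = fingerprint(content)
        if digest not in already_stored:
            missing[digest] = content
    return missing


store: Set[str] = set()
build_one = {"app.css": b"body { color: black; }", "logo.png": b"fake image bytes"}
first = assets_to_upload(build_one, store)
store.update(first)  # the first build uploads everything

# The second build changes only the stylesheet.
build_two = dict(build_one)
build_two["app.css"] = b"body { color: navy; }"
second = assets_to_upload(build_two, store)
print(len(first), len(second))
```

In this sketch the first build uploads both assets and the second uploads only the changed stylesheet, which matches the slow-first, fast-afterward behavior described above.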
00:11:41.760
Underneath it all, we carry out various operational tasks that align with GitHub, setting the commit status, and driving performance with Percy Hub.
00:11:51.760
One crucial performance aspect is that we can parallelize the rendering of these DOM snapshots instead of processing them sequentially.
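Because each DOM snapshot can be rendered independently, the work fans out naturally. A hedged sketch with Python's standard thread pool follows; `render` is a placeholder standing in for a real browser render, and none of these names come from Percy.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def render(snapshot: Dict[str, str]) -> str:
    """Placeholder for a real browser render; returns a fake image name."""
    return snapshot["name"] + ".png"


def render_all(snapshots: List[Dict[str, str]], workers: int = 4) -> List[str]:
    """Render snapshots concurrently instead of one after another."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even though work overlaps.
        return list(pool.map(render, snapshots))


snapshots = [
    {"name": "home", "dom": "<html>...</html>"},
    {"name": "not-found", "dom": "<html>...</html>"},
]
images = render_all(snapshots)
print(images)
```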
00:12:02.160
The current milestone that I’m proud of is achieving a million visual diffs rendered in Percy just yesterday.
00:12:14.080
Let’s take a look at Percy in action with examples from a few of our customers. For instance, Charity: Water's build shows 162 total snapshots tracked.
00:12:23.600
They recently produced a new footer markup and visual diffs were generated from 96 changes. With an overview mode, it’s easy to assess all modifications quickly.
00:12:33.360
Another example involves updating the press page, where CSS styles were altered—resulting in completely broken UI states, which would halt any release until properly addressed.
00:12:45.120
Throughout this, the importance of visual regression testing is underscored—it creates a lightweight visual review system within the development pipeline.
00:12:55.840
Typically, visual diffs serve as a checkpoint before final approval.
00:13:04.120
Now, let’s touch on the notion of snapshotting the DOM model. This powerful method allows for rapid verification of UI states.
00:13:13.120
Persistent communication occurs through integration into CI, making checks seamless. Visual updates and progress can be evaluated quickly as code adjustments are made.
00:13:23.600
I'm always happy to discuss and expand upon ideas to optimize CI workflows, especially as new challenges arise.
00:13:32.880
In conclusion, visual testing is something we should embrace. It can significantly boost deployment confidence, and it takes a review step that has always been manual and automates it to a reasonable degree.
00:13:40.640
One last note: because of the DOM snapshotting model, I’m eager to bring visual regression testing further into the Ember ecosystem.
00:13:49.840
If you're an Ember user interested in testing, reach out via email, and let’s get you beta testing!
00:13:58.640
Thank you!
00:14:06.959
Now, great question. How is the baseline created? Usually, I take the latest version of the master branch, which serves as the baseline, and we offer a mechanism to approve builds manually.
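As a rough sketch of that baseline rule, assuming builds are recorded with a number and a branch name, picking the baseline means choosing the most recent build on master. The `Build` type and `pick_baseline` function are hypothetical, and manual approval is left out.

```python
from typing import List, NamedTuple, Optional


class Build(NamedTuple):
    number: int
    branch: str


def pick_baseline(builds: List[Build], base_branch: str = "master") -> Optional[Build]:
    """Return the newest build on the base branch, or None if there is none."""
    candidates = [b for b in builds if b.branch == base_branch]
    if not candidates:
        return None
    return max(candidates, key=lambda b: b.number)


history = [Build(1, "master"), Build(2, "new-footer"), Build(3, "master"), Build(4, "new-footer")]
baseline = pick_baseline(history)
print(baseline)
```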
00:14:20.600
If master is always green and deployable, it’s reasonable to test against it. However, we don’t handle cross-browser testing currently, though this could be a future evolution.
00:14:32.640
Indeed, old browsers limit testing capabilities; many don't support full-page screenshots, though major modern browsers assist significantly.
00:14:40.720
The technology stack we've selected comprises a custom-built platform on Google Cloud, with a full Ember front end backed by a Rails API.
00:14:51.920
Therefore, automation and integration are key to enhance CI and offer an efficient, robust visual regression testing landscape.
00:15:03.440
Thank you for your engagement, and I hope you found this talk insightful!