Building and Testing Resilient Applications

In the talk "Building and Testing Resilient Applications" at GoRuCo 2015, Simon Eskildsen from Shopify shares insights on creating resilient systems in light of the increasing complexity and reliance on various external services. The key focus is on how to maintain system performance and availability despite component failures.

Key Points Discussed:
- Understanding Resiliency: Resiliency is about constructing systems from numerous unreliable components while ensuring overall reliability. Effective management of relationships between services is crucial to prevent outages.
- Experiencing Failures: Shopify has experienced numerous failures during peak traffic events, highlighting the risks associated with dependencies and the need for proactive strategies in system design.
- Loosely Coupled Components: For a resilient architecture, components should be loosely coupled, allowing them to operate independently and minimizing the impact of any single point of failure.
- Implementing Fallbacks: In cases where external services fail, implementing fallback mechanisms helps maintain user experience. For example, instead of failing a page due to a service downtime, showing placeholder data allows continued interaction.
- Testing Resiliency: Traditional mocking strategies for testing can be inadequate. Using tools like Toxiproxy, Shopify stimulates various failure scenarios to identify weaknesses within the application before deployment.
- Building a Resiliency Matrix: Documenting all services and their dependencies aids in understanding potential points of failure, enabling targeted fixes and more robust architecture.
- Best Practices: Concepts such as circuit breakers, bulkheads, and error handling strategies are critical in managing dependencies and improving system response during failures.
- Document and Adapt: Sharing insights and lessons learned from resiliency testing enhances community knowledge and prepares organizations for future challenges.

Conclusions:
- Early adoption of resiliency strategies, like circuit breakers and fault tolerance mechanisms, is encouraged before crises arise. Implementing these tools supports smoother scaling and more robust applications as dependencies grow and system interactions become more complex.

Building and Testing Resilient Applications
Simon Eskildsen • June 20, 2015 • Earth

@Sirupsen
Drives fail, databases crash, fibers get cut and unindexed queries hit production. Do you know how your application reacts to those events? Are they covered by tests? What about the failures you haven't even thought of? To avoid cascading failures applications must adopt general patterns to defend against misbehaving dependencies, including themselves. This talk covers the resiliency techniques Shopify has successfully put into production at scale, and how we write tests to ensure we don't reintroduce single points of failure. You'll walk away from this talk equipped with the tools to make your applications resilient and to better sleep at night.

Talk given at GORUCO 2015: http://goruco.com

GORUCO 2015