Resilient by Design

In the video titled "Resilient by Design," presented by Smit Shah at RailsConf 2015, the focus is on developing resilient distributed systems capable of handling aggressive uptime and performance requirements. This talk builds on the concepts introduced in a prior discussion about microservices and the CAP theorem, emphasizing the critical importance of designing systems to withstand failures and avoid catastrophic outages.

Key Points Discussed:

The Importance of Resilience:
- Businesses increasingly depend on software, with downtime leading to significant revenue losses. For example, Flipkart loses around $2,000 per minute during peak times due to system failures.
On-call Responsibilities:
- Developers often bear the burden of on-call duties, which is a growing concern in maintaining system uptime.
Consideration of External Dependencies:
- Modern systems rely on various external services, so resilience must be factored in from the design stage to avoid issues during peak loads.
Designing for Failure:
- Developers must assume that failures will happen and plan accordingly, implementing strategies such as fallback mechanisms and circuit breakers to handle service unavailability.
Fail Fast Concept:
- Failing fast surfaces problems with external services quickly, instead of letting slow or hanging calls tie up resources, add load to the system, and degrade responsiveness.
Circuit Breaker Pattern:
- This pattern stops sending requests to a service that keeps failing, giving it room to recover instead of overwhelming it further.
Bulkhead Pattern:
- Inspired by maritime practices, this approach prevents system-wide failures by isolating components, ensuring that if one service fails, it does not lead to a total system outage.
Resource Management:
- Setting timeouts properly, monitoring resource usage, and establishing limits are crucial to maintaining system efficiency and preventing problems such as resource exhaustion.
Maintaining a Steady State:
- Automating tasks such as log rotation and implementing archiving strategies are essential to keep operations running smoothly with minimal human intervention.
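
To make the fail-fast and circuit breaker points above more concrete, here is a minimal sketch in Ruby. It is not code from the talk: the class name `CircuitBreaker`, the parameters `failure_threshold` and `reset_timeout`, and the injectable `clock` are all assumptions made for illustration, and the sketch is not thread-safe.

```ruby
# Minimal circuit breaker sketch (illustrative only, not from the talk).
# CircuitBreaker, failure_threshold, reset_timeout and clock are
# hypothetical names. Not thread-safe.
class CircuitBreaker
  class OpenError < StandardError; end

  attr_reader :state

  def initialize(failure_threshold: 3, reset_timeout: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @failure_threshold = failure_threshold
    @reset_timeout = reset_timeout # seconds to wait before a trial call
    @clock = clock
    @failures = 0
    @state = :closed
    @opened_at = nil
  end

  # Runs the block unless the circuit is open. While open, the call fails
  # fast (or returns the fallback) without touching the dependency.
  def call(fallback: nil)
    if @state == :open
      if @clock.call - @opened_at >= @reset_timeout
        @state = :half_open # let one trial request through
      else
        return fallback.call if fallback
        raise OpenError, "circuit open"
      end
    end

    begin
      result = yield
      reset
      result
    rescue StandardError
      record_failure
      return fallback.call if fallback
      raise
    end
  end

  private

  def reset
    @failures = 0
    @state = :closed
  end

  def record_failure
    @failures += 1
    if @state == :half_open || @failures >= @failure_threshold
      @state = :open
      @opened_at = @clock.call
    end
  end
end
```

A caller would wrap each request to an external service in `breaker.call { ... }`, optionally passing `fallback: -> { cached_value }` so users still get a degraded response while the circuit is open. The threshold and timeout values shown are placeholders and would need tuning per dependency.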

Conclusions and Takeaways:

Building resilient systems requires proactive efforts such as failing fast, bounding resources, and implementing strategies like circuit breakers and bulkheads.
Understanding what your system should never do is just as vital as knowing its intended functions, underscoring the complexity involved in resilient system design.
Smit Shah emphasizes integrating these patterns from the design phase to avoid failures that could impact service availability.
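
The idea of bounding resources with a bulkhead can be sketched in a few lines of Ruby. This is an illustrative example under stated assumptions, not the speaker's implementation: `Bulkhead`, `max_concurrent`, and `RejectedError` are hypothetical names, and the sketch rejects excess calls immediately rather than queueing them.

```ruby
# Minimal bulkhead sketch (illustrative only, not from the talk).
# Bulkhead, max_concurrent and RejectedError are hypothetical names.
class Bulkhead
  class RejectedError < StandardError; end

  def initialize(max_concurrent:)
    @max_concurrent = max_concurrent
    @in_flight = 0
    @lock = Mutex.new
  end

  # Runs the block if a slot is free; otherwise rejects immediately, so
  # one slow dependency cannot absorb every worker in the process.
  def call
    acquired = @lock.synchronize do
      if @in_flight < @max_concurrent
        @in_flight += 1
        true
      else
        false
      end
    end
    raise RejectedError, "bulkhead full" unless acquired

    begin
      yield
    ensure
      # Always release the slot, even if the block raised.
      @lock.synchronize { @in_flight -= 1 }
    end
  end

  # Number of calls currently holding a slot.
  def in_flight
    @lock.synchronize { @in_flight }
  end
end
```

Giving each external dependency its own `Bulkhead` instance keeps a failure in one of them from consuming the capacity needed to serve the others, which is the isolation property the pattern is named for.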

This insightful presentation serves as a guide for developers aiming to minimize the risks associated with system outages and improve overall reliability in their applications.

Resilient by Design
Smit Shah • April 21, 2015 • Atlanta, GA

By Smit Shah
Modern distributed systems have aggressive requirements around uptime and performance. They need to face harsh realities such as a sudden rush of visitors, network issues, tangled databases, and other unforeseen bugs.

With so many moving parts involved even in the simplest of services, it becomes mandatory to adopt defensive patterns that guard against some of these problems, and to identify anti-patterns before they trigger cascading failures across systems.

This talk is for all those developers who hate getting an on-call page at 4 AM.

RailsConf 2015