Ruby Video

In the talk "Building For Gracious Failure" at RubyConf 2018, James Thompson, a principal software engineer at Nav, explores strategies for handling failure gracefully in service-based systems. Because failure in software is inevitable, the presentation encourages developers to shift their focus from trying to eliminate failures to managing and responding to them effectively.

Key points include:  
- **Understanding Failure**: Failure is unavoidable in software. Thompson highlights that infrastructure issues, human errors, and deployment mishaps are common. Developers need a proactive approach to handling these failures so that the failures do not negatively affect their work and environment.  
- **Importance of Visibility**: Visibility into how a system is operating is crucial. Metrics, monitoring, and error-tracking tools such as Bugsnag, Airbrake, and Rollbar record failures as they occur and show how applications are actually behaving.  
- **Case Study - Data Sourcing Team**: Thompson shares his experience integrating Bugsnag to improve error visibility on the data sourcing team at Nav. With error tracking in place, he was able to discern patterns in failures that had otherwise gone unnoticed.  
- **Returning Partial Data**: The speaker emphasizes designing systems to return whatever data is available rather than failing completely. For instance, when migrating data for a Business Profile service, he chose to return null for corrupt fields rather than respond with a 500 error, which kept the service functional and preserved the user experience.  
- **Accepting Data Flexibly**: Systems should also accept partial input. Thompson advises building services that handle each part of an update independently and notify users of any parts that fail, preserving backend data integrity without rejecting valid entries.  
- **Trust and Resilience**: Thompson discusses the need to manage dependence on external services, because unchecked reliance lets a single failure cascade through interconnected systems. Services should be designed to stay resilient when a dependency fails unexpectedly.  
- **Proactive Measures**: Chaos engineering practices, such as deliberately injecting failures with a tool like Chaos Monkey, can help teams anticipate and mitigate failures.
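
The points about returning and accepting partial data can be sketched in Ruby. This is a minimal illustration, not code from the talk: the field names, `FIELD_PARSERS`, `read_profile`, and `update_profile` are all hypothetical, and a real service would sit behind an HTTP layer and a database.

```ruby
require "date"

# Hypothetical parsers for a business profile record. The field names are
# illustrative assumptions, not taken from the talk.
FIELD_PARSERS = {
  name:       ->(raw) { String(raw).strip },
  founded_on: ->(raw) { Date.iso8601(raw) },
  employees:  ->(raw) { Integer(raw) }
}.freeze

# Reading: return every field that parses and nil for each field that does
# not, recording what went wrong instead of failing the whole response.
def read_profile(raw_record)
  data = {}
  errors = {}
  FIELD_PARSERS.each do |field, parser|
    begin
      data[field] = parser.call(raw_record[field])
    rescue ArgumentError, TypeError => e
      data[field] = nil
      errors[field] = e.message
    end
  end
  { data: data, errors: errors }
end

# Writing: apply the fields that are valid, skip the ones that are not,
# and tell the caller which fields were rejected and why.
def update_profile(stored, changes)
  accepted = {}
  rejected = {}
  changes.each do |field, raw|
    parser = FIELD_PARSERS[field]
    if parser.nil?
      rejected[field] = "unknown field"
      next
    end
    begin
      accepted[field] = parser.call(raw)
    rescue ArgumentError, TypeError => e
      rejected[field] = e.message
    end
  end
  { profile: stored.merge(accepted), rejected: rejected }
end

# A corrupt date yields nil for that field while the other fields survive.
result = read_profile({ name: " Acme ", founded_on: "not-a-date", employees: "12" })
puts result[:data].inspect
puts result[:errors].keys.inspect
```

In this sketch the caller always gets a usable response plus a list of problem fields, which could also be forwarded to an error tracker so the corrupt data stays visible.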

Ultimately, the talk encourages developers to accept failure as a reality and to prepare to manage its impact on their systems. The key takeaway is that better visibility and resilience, combined with deliberate failure-handling strategies, improve overall system reliability.
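
As one illustration of limiting trust in a dependency, the sketch below wraps a call to another service with a timeout, a fallback value, and a simple circuit breaker. The circuit breaker is a common pattern chosen here for illustration; the summary does not say which technique the talk recommends. The class name `GuardedCall` and its parameters are assumptions.

```ruby
require "timeout"

# Hypothetical guard around a call to another service. After `threshold`
# consecutive failures it stops calling the dependency for `cool_off`
# seconds and serves the fallback instead, so one failing service does not
# drag its callers down with it.
class GuardedCall
  def initialize(threshold: 3, cool_off: 30, timeout: 2,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @threshold = threshold
    @cool_off = cool_off
    @timeout = timeout
    @clock = clock # injectable so the cool-off can be exercised in tests
    @failures = 0
    @opened_at = nil
  end

  # Runs the block and returns its value, or returns `fallback` when the
  # block raises, times out, or the breaker is currently open.
  def call(fallback:)
    return fallback if open?

    result = Timeout.timeout(@timeout) { yield }
    @failures = 0
    result
  rescue StandardError
    @failures += 1
    @opened_at = @clock.call if @failures >= @threshold
    fallback
  end

  private

  # True while the breaker is open; resets once the cool-off has elapsed.
  def open?
    return false if @opened_at.nil?

    if @clock.call - @opened_at >= @cool_off
      @opened_at = nil
      @failures = 0
      return false
    end
    true
  end
end

guard = GuardedCall.new(threshold: 2, cool_off: 10, timeout: 1)
score = guard.call(fallback: nil) { raise IOError, "credit service unreachable" }
puts score.inspect
```

Returning a fallback such as `nil` or a cached value pairs naturally with the partial-data approach: the caller can still respond with the data it does have.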
