Description

By Florian Weingarten 
Your application probably communicates with other services, whether it's a database, a key/value store, or a third-party API: you are performing "external calls". But do you know how your application behaves when one of those external services is behaving unexpectedly, is unreachable, or, sometimes even worse, experiences high latency? This is something that you want to find out during testing, not in production, since it can easily lead to a series of cascading failures that will seriously affect your capacity or can even take your application down. Shopify operates one of the largest Ruby on Rails deployments in the world. We serve about 300k requests per minute on a boring day, integrating with countless external services. Focusing on resiliency has been one of the most impactful improvements that we made in terms of uptime in the last year. In this talk, I will share some lessons learnt and we will walk through a series of ideas and tools that can help you make your application more stable and more resilient to the failure of external services, an issue that is often neglected until it's too late. More importantly, we will talk about how you can write meaningful and realistic integration tests and set up "chaos testing environments" to gain confidence in your application's resiliency.

Summary

In this talk by Florian Weingarten, titled "How to stay alive even when others go down," the focus is on writing and testing resilient applications in Ruby, especially in light of external dependencies like databases, APIs, and other services. The essential theme is to ensure that applications can withstand failures of external systems without leading to significant disruptions or cascading failures. He emphasizes the importance of understanding application behavior during failures early in the development process rather than in production. Key points discussed include:

- **Timeouts**: Setting appropriate timeouts keeps a slow service from holding requests open for long periods, which would otherwise eat into system capacity. With these limits in place, applications fail fast, reducing the risk of cascading failures.  
- **Circuit Breakers**: Implementing circuit breakers helps avoid calling services that are known to be failing, giving them room to recover instead of piling on repeated failed requests. The pattern keeps track of error counts and temporarily halts requests to an external service once a threshold is reached.  
- **Graceful Degradation**: Instead of complete failure, applications can provide users with reasonable fallbacks when certain services are down, thus preserving user experience. For instance, if a search service is down, returning an empty result set can be better than serving an error page.  
- **Back Pressure and Operational Isolation**: Introducing measures to manage shared resource access can prevent overwhelming services and ensure that failures in one area don't trigger widespread outages.  
- **Resiliency Testing**: Developing tests that simulate external service failures can identify weaknesses in application resilience. Manual failure testing provides insight into behavior under stress, while tools like Toxiproxy help create realistic test environments.  
- **Monitoring and Alerts**: It’s vital to have effective monitoring in place to catch issues even when error-trapping mechanisms are suppressing them for user-facing components.
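
The circuit breaker and graceful degradation points above can be illustrated with a minimal Ruby sketch. This is not the code shown in the talk or the API of any Shopify library; `CircuitBreaker`, `CircuitOpenError`, `search_with_fallback`, and the threshold and timeout values are all hypothetical names and numbers chosen for illustration. The breaker counts consecutive errors, fails fast once a threshold is reached, and lets a trial call through after a cool-down period. The search helper falls back to an empty result set instead of surfacing an error, while still logging the failure so monitoring can see it.

```ruby
# Hypothetical sketch: none of these names come from the talk.
class CircuitOpenError < StandardError; end

class CircuitBreaker
  # clock is injectable so the cool-down can be tested without sleeping.
  def initialize(error_threshold: 3, reset_timeout: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @error_threshold = error_threshold
    @reset_timeout = reset_timeout
    @clock = clock
    @error_count = 0
    @opened_at = nil
  end

  # The circuit is open until reset_timeout seconds have passed since it
  # tripped; after that, one trial call is allowed through.
  def open?
    return false if @opened_at.nil?

    @clock.call - @opened_at < @reset_timeout
  end

  def call
    raise CircuitOpenError, "circuit open, failing fast" if open?

    result = yield
    # A successful call closes the circuit again.
    @error_count = 0
    @opened_at = nil
    result
  rescue CircuitOpenError
    raise
  rescue StandardError
    # Count the failure and trip (or re-trip) the circuit at the threshold.
    @error_count += 1
    @opened_at = @clock.call if @error_count >= @error_threshold
    raise
  end
end

# Graceful degradation: an empty result set instead of an error page.
# The failure is still logged so that it is not silently swallowed.
def search_with_fallback(breaker, query)
  breaker.call { yield(query) }
rescue StandardError => e
  warn "search degraded: #{e.class}: #{e.message}"
  []
end
```

A caller would pass the real search call as the block, for example `search_with_fallback(breaker, "shoes") { |q| search_client.query(q) }`, where `search_client` is again a placeholder. In a multi-process deployment the error counts would likely need to be shared between workers, which this sketch does not attempt.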

In conclusion, the main takeaways from Florian's presentation include the necessity of proactively designing for failure, utilizing patterns like timeouts, circuit breakers, and graceful degradation, and rigorously testing systems under various failure scenarios to build resilient applications. The importance of continuous development and iteration towards achieving reliability is emphasized as a vital part of application architecture.
