Resilience Engineering
Knobs, Levers and Buttons: tools for operating your application at scale
Summarized using AI

by Amy Unger

In her presentation at RubyConf AU 2018, Amy Unger discusses strategies for enhancing application resilience through various operational tools, likening them to knobs and switches used by pilots in aircraft. The talk focuses on how developers can better manage their applications during stressful situations and prevent failures from escalating into critical issues.

Key Points Discussed:
- Application Resilience: The importance of being able to adjust application behavior during both minor failures and serious outages.
- Seven Tools for Resilience:
  - Maintenance Mode: Implement a simple environment variable switch to activate a maintenance page during emergencies, ensuring users are informed without needing complex configurations.
  - Read-Only Mode: Offer users access to non-modifiable information even if backend services are down, helping maintain user engagement.
  - Feature Flags: Utilize global feature flags to control access to features based on user groups, which is particularly useful for multi-tenant applications.
  - Rate Limiting: Protect against denial-of-service attacks and manage performance by limiting the number of requests your application accepts during peak times.
  - Stopping Non-Essential Work: Prioritize essential tasks by pausing non-critical jobs when resource limits are hit, thus conserving resources for important functions.
  - Deployment Flags: Manage deployments under uncertain conditions by using flags that allow for quick rollback if issues arise with new code.
  - Circuit Breakers: Implement circuit breakers to stop sending requests to downstream services that are experiencing high error rates, thus preventing overload and maintaining user experience.

Unger emphasizes that successful implementation of these tools requires visibility and control over application status at all times. She encourages developers to test these strategies in various scenarios to ensure they work effectively during critical times. The talk concludes with the notion that having these adjustable controls allows for better adaptability and a more resilient application environment.

Unger invites the audience to think about incorporating these practices into their applications to enhance operational resilience, and she thanks Heroku's Sydney office for its support.

In summary, the key takeaway is that by managing control mechanisms effectively, developers can significantly improve the resilience and reliability of their applications during unexpected failures.

00:00:09.310 All right, well, thank you everyone. My name is Amy. I am an engineer, and I'm here to talk about knobs, buttons, and switches that you can add to your application to change its behavior when things are going wrong. We've all seen applications that fail at a moment’s notice when a single downstream service goes down. Let’s not let that be you. Pilots operate their planes with an array of knobs and buttons to react to changing conditions. Captain Kirk could shout every week about diverting power to the shields, and I want you to have the same level of control in your application when the going gets rough. This talk is about application resilience, but it covers only one portion of it. When we talk about application resilience, we usually refer to different levels of problems, big versus small. On one hand, you've got the small, baby problems that your application probably encounters, and on the other hand, you have the large catastrophic failures.
00:01:05.300 The daily failures, such as the one out of every hundred requests that does not succeed, are not part of this talk. We also won’t delve into massive, catastrophic failures, like when your production database goes down and you discover that you have no backups. We aren’t talking about scenarios where aliens might abduct an entire AWS availability zone. So what are we going to address in this talk? I want to note that, even though these tools may not all be right for you, many of the decisions made under stress have a direct bearing on your product. Your choices on how to serve users and handle failures impact the product you're delivering. I've been fortunate to work in organizations that genuinely care about creating great experiences, even under duress. The tools I'm discussing today are specific to that context.
00:02:01.869 I'm going to discuss a set of seven tools that I have frequently seen used to shed load, fail gracefully, and protect struggling services within your application ecosystem. The first one is maintenance mode. This is your hard “nope”: put a pretty picture there, but please include a link to your status page or help page so users understand the situation. The key to implementing this is to have it be a single switch. We implement this as an environment variable, as close to the running process as possible. If you have to call your database to find out whether maintenance mode is on, you're likely in a bad situation. The beauty of this switch is its simplicity; you don't have to configure fifty different things just to serve that nice error page. This is especially helpful at 2:00 a.m. The next tool is read-only mode. Most of our applications aim to alter some state, usually involving a relational database or another downstream service. Think about what your application could do if the ability to modify that state goes away.
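As an illustration of the single-switch maintenance mode described above, here is a minimal sketch written as Rack-style middleware. It is not the implementation from the talk: the class name, the `MAINTENANCE_MODE` variable name, and the status page URL are all assumptions made for the example.

```ruby
# Minimal Rack-style middleware (hypothetical names throughout): when the
# MAINTENANCE_MODE variable is "true", every request gets a static 503 page
# and the wrapped app is never called.
class MaintenanceMode
  # Placeholder address, not a real status page.
  STATUS_PAGE_URL = "https://status.example.com"

  def initialize(app, env_vars: ENV)
    @app = app
    @env_vars = env_vars
  end

  def call(env)
    return @app.call(env) unless enabled?

    body = "Down for maintenance. See #{STATUS_PAGE_URL} for updates."
    [503, { "content-type" => "text/plain", "retry-after" => "300" }, [body]]
  end

  private

  # The check reads only process-local configuration, never the database.
  def enabled?
    @env_vars["MAINTENANCE_MODE"] == "true"
  end
end

# Usage with a stand-in app and a fake environment hash.
app = ->(_env) { [200, { "content-type" => "text/plain" }, ["ok"]] }
status, _headers, _body =
  MaintenanceMode.new(app, env_vars: { "MAINTENANCE_MODE" => "true" }).call({})
puts status # prints 503
```

Injecting the environment hash keeps the switch testable without touching the real process environment.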
00:03:36.640 In a read-only capacity, can you help users answer certain questions? Do I have a shipment on the way? Did my credit card charge go through? Even if the billing system is down, you might still provide valuable information to users. However, if you have a microservice without stored state, the answer might be nothing. Depending on your scale, some of these tools might be too large a hammer for where you are right now. The next tool is feature flags. Typically, feature flags are used for flagging individual users into new features or gradual rollouts. However, they can also be used globally, such as turning off billing or rolling out new products. You can refine this even further by considering groups of users who might need to be flagged in or out, which is particularly useful for organizations running multi-tenant applications like Heroku.
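The global and per-group flags described above could look roughly like the following. This is a hedged, in-memory sketch: the `FeatureFlags` class, its method names, and the `:billing`, `:eu`, and `:us` labels are invented for illustration and are not Heroku's implementation.

```ruby
# In-memory flag store (hypothetical): a feature can be switched globally
# or for one named group of users.
class FeatureFlags
  def initialize
    @global = {}                                # feature => true/false
    @by_group = Hash.new { |h, k| h[k] = {} }   # feature => { group => true/false }
  end

  def set(feature, enabled, group: nil)
    if group
      @by_group[feature][group] = enabled
    else
      @global[feature] = enabled
    end
  end

  # A group-level setting wins over the global one; unknown flags default to on.
  def enabled?(feature, group: nil)
    group_setting = @by_group[feature][group] if group
    return group_setting unless group_setting.nil?

    @global.fetch(feature, true)
  end
end

flags = FeatureFlags.new
flags.set(:billing, false, group: :eu)
puts flags.enabled?(:billing, group: :eu) # prints false
puts flags.enabled?(:billing, group: :us) # prints true
```

Defaulting unknown flags to "on" is a design choice for this sketch; a store that defaults to "off" would be equally reasonable for features still being rolled out.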
00:04:59.270 In our context at Heroku, we can target specific groups based on where they have apps deployed or what technology they use. This targeting helps ensure users who are likely to experience failures aren't inundated with error pages while those unaffected continue on with their work. We have setups to check whether certain features, such as billing, are enabled for these groups. For instance, if you need to check if billing is enabled for the European Union, you wouldn’t query the database to confirm that; instead, you could just refer to a stored value. In addition, managing rate limits is essential for protecting against malicious traffic. Rate limits serve as your first line of defense against denial-of-service attacks and prevent performance degradation for your customers. They also enable load shedding when necessary. Under heavy load, you may need to implement surge pricing where you drop half of your traffic during peak times.
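To make the rate-limiting and load-shedding idea above concrete, here is a small fixed-window limiter whose limit can be lowered while the process is running. It is a sketch under stated assumptions: the class, the per-client keying, and the fixed-window algorithm are choices made for this example, not details from the talk.

```ruby
# Fixed-window rate limiter (hypothetical) whose limit can be lowered at
# runtime to shed load. Old windows are never pruned here; a real
# implementation would expire them or keep counts in a shared store.
class RateLimiter
  attr_accessor :limit

  def initialize(limit:, window_seconds: 60, clock: -> { Time.now.to_f })
    @limit = limit
    @window_seconds = window_seconds
    @clock = clock
    @counts = Hash.new(0) # [client, window index] => requests seen
  end

  # Returns true if the request may proceed, false if it should be rejected
  # (for example with an HTTP 429 response).
  def allow?(client_id)
    window = (@clock.call / @window_seconds).floor
    key = [client_id, window]
    return false if @counts[key] >= @limit

    @counts[key] += 1
    true
  end
end

limiter = RateLimiter.new(limit: 2, clock: -> { 0.0 })
p 3.times.map { limiter.allow?("client-a") } # prints [true, true, false]

# Shedding load: lower the limit without restarting anything.
limiter.limit = 1
p 2.times.map { limiter.allow?("client-b") } # prints [true, false]
```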
00:06:36.160 Another critical tool is the ability to stop non-essential work in situations where you're hitting limits, such as on your database or maxing out compute resources. Oftentimes, reporting mechanisms or data cleanup tasks don't need to run immediately, and you can set them aside until your application is back to full capacity. For us, each job checks a flag for its job type before it executes. If the flag returns false, the job exits gracefully and retries later. This approach alleviates unnecessary overhead during critical failures.
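The "check a flag before executing" pattern for jobs might be sketched as follows. The `JobSwitches` store, the `ReportJob` class, and the `:retry_later` return value are hypothetical; a real job framework would have its own way to reschedule work.

```ruby
# Per-job-type pause switches (hypothetical in-memory store).
class JobSwitches
  def initialize
    @paused = {}
  end

  def pause(job_type)
    @paused[job_type] = true
  end

  def resume(job_type)
    @paused.delete(job_type)
  end

  def runnable?(job_type)
    !@paused.key?(job_type)
  end
end

# A non-essential job that checks its switch before doing any work.
class ReportJob
  JOB_TYPE = :reporting

  def initialize(switches)
    @switches = switches
  end

  # Returns :done when the work ran, :retry_later when it was skipped.
  def perform
    return :retry_later unless @switches.runnable?(JOB_TYPE)

    # ... build the report here ...
    :done
  end
end

switches = JobSwitches.new
switches.pause(:reporting)
puts ReportJob.new(switches).perform # prints retry_later
switches.resume(:reporting)
puts ReportJob.new(switches).perform # prints done
```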
00:08:10.300 Throughout my career, I’ve noted that everyone must deploy code during uncertain conditions, whether on a major load-test day or under stress from high usage. Sometimes you deploy features with unknown factors when dealing with your largest customers, and such deployments can be nerve-wracking. Therefore, putting flags around uncertain deployments is highly beneficial. For instance, we frequently use the Scientist gem developed by GitHub to monitor and compare the behavior of new code against existing code. This setup allows us to quickly disable faulty features when necessary.
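The idea of running new code alongside existing code behind a flag can be sketched in plain Ruby. To be clear, this is not the Scientist gem's API; the `Comparison` class and its methods are invented here only to show the shape of the pattern.

```ruby
# Runs the existing code path and, while the flag is on, the candidate path
# too. The caller always gets the existing path's result; mismatches and
# candidate errors are only recorded. (Hypothetical class, not Scientist.)
class Comparison
  attr_reader :mismatches

  def initialize(enabled: true)
    @enabled = enabled
    @mismatches = []
  end

  # The "off" switch for a candidate that misbehaves in production.
  def disable!
    @enabled = false
  end

  def run(name, control:, candidate:)
    expected = control.call
    return expected unless @enabled

    begin
      actual = candidate.call
      unless actual == expected
        @mismatches << { name: name, expected: expected, actual: actual }
      end
    rescue StandardError => e
      @mismatches << { name: name, expected: expected, error: e.class.name }
    end
    expected
  end
end

cmp = Comparison.new
total = cmp.run("invoice-total", control: -> { 100 }, candidate: -> { 101 })
puts total               # prints 100
puts cmp.mismatches.size # prints 1
```

Because the candidate's result is never returned to the caller, disabling the comparison changes only what gets recorded, not what users see.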
00:09:43.930 Finally, we have circuit breakers. Circuit breakers protect your downstream services from overload. The first form is a responsive shut-off: the breaker monitors for increasing timeouts or error rates, and if a specific threshold is reached, it stops sending requests to that service, allowing us to maintain a controlled user experience. Different gems can help create and manage these circuit breakers effectively.
00:11:39.500 Circuit breakers should also have manual triggers for situations where you know outright that the service is unreachable. Creating a client class that wraps your service with a dedicated circuit breaker simplifies this process. As you implement these tools, you will find numerous options for where these controls should reside, whether in databases, caches, or environment variables. It's essential to position these controls correctly for immediate access, especially during downtime.
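A circuit breaker with both an automatic threshold and a manual trigger, wrapped in a client class, could be sketched like this. It does not use any particular circuit-breaker gem; the `CircuitBreaker` and `BillingClient` classes, the consecutive-failure threshold, and the `:unavailable` fallback are assumptions made for the example.

```ruby
# Opens after a number of consecutive failures, or when forced open by hand.
# (Hypothetical class, not a specific gem's API.)
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(failure_threshold: 3, cool_off_seconds: 30,
                 clock: -> { Time.now.to_f })
    @failure_threshold = failure_threshold
    @cool_off_seconds = cool_off_seconds
    @clock = clock
    @failures = 0
    @opened_at = nil
    @forced_open = false
  end

  # Manual trigger for when you already know the service is unreachable.
  def force_open!
    @forced_open = true
  end

  def reset!
    @forced_open = false
    @failures = 0
    @opened_at = nil
  end

  def open?
    return true if @forced_open
    return false if @opened_at.nil?

    @clock.call - @opened_at < @cool_off_seconds
  end

  # Runs the block unless the circuit is open. After the cool-off period
  # one call is let through again; a further failure re-opens the circuit.
  def call
    raise OpenError, "circuit open" if open?

    begin
      result = yield
    rescue StandardError
      @failures += 1
      @opened_at = @clock.call if @failures >= @failure_threshold
      raise
    end
    @failures = 0
    @opened_at = nil
    result
  end
end

# Client class that wraps one downstream service with its own breaker.
class BillingClient
  attr_reader :breaker

  def initialize(transport, breaker: CircuitBreaker.new)
    @transport = transport
    @breaker = breaker
  end

  # Falls back to :unavailable on any failure or when the circuit is open.
  def charge_status(id)
    @breaker.call { @transport.call(id) }
  rescue StandardError
    :unavailable
  end
end

failing = ->(_id) { raise IOError, "timeout" }
client = BillingClient.new(failing, breaker: CircuitBreaker.new(failure_threshold: 2))
2.times { client.charge_status(1) }
puts client.breaker.open? # prints true
```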
00:13:20.860 Just remember, visibility is crucial. At any given moment, you should be able to run a single command or check a dashboard to understand the status of your application’s controls. If you're unsure of what's activated or deactivated, it's easy to feel like you're flying blind. Ensuring that the mechanisms in your application actually work is paramount; otherwise, you might find yourself in dire situations without the assurance that these controls will function in production.
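One way to get the "single command" view of every control mentioned above is a small registry that each switch reports into. This is only a sketch; the `ControlPanel` class and the control names are hypothetical.

```ruby
# Registry of operational controls (hypothetical): each control registers a
# block that returns its current state, so one call shows everything.
class ControlPanel
  def initialize
    @controls = {}
  end

  def register(name, &state)
    @controls[name] = state
  end

  # Evaluates every registered block and returns name => current state.
  def status
    @controls.transform_values(&:call)
  end

  def report
    status.map { |name, value| "#{name}: #{value}" }.join("\n")
  end
end

panel = ControlPanel.new
panel.register("maintenance_mode") { ENV["MAINTENANCE_MODE"] == "true" }
panel.register("reporting_jobs_paused") { false }
puts panel.report
```

Evaluating the blocks at report time, rather than caching values, means the output reflects what the running process would actually do at that moment.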
00:14:54.430 By trading knowledge for control, you’re allowing your application to be more adaptable; however, this also reduces the certainty that everything will behave predictably. Testing various scenarios, especially edge cases under load, can be exceedingly difficult due to the sheer number of combinations you’d need to cover. As for ensuring environments stay synced—production, staging, and development—this is a common issue that most organizations struggle to resolve.
00:17:10.930 In conclusion, I would much prefer to have the capability to adjust how my application behaves during these critical times than to be stuck with an unresponsive system. Thank you for your attention, and I hope you've learned something today. I encourage you to think about ways you can implement these strategies in your applications. Additionally, I'd like to thank the Heroku Sydney office for their support. We've run out of swag here, but if you visit bit.ly/rubicon-pillow and take a survey, we'll mail you a t-shirt. I'm excited about the amount of merchandise we’ll be sending out. If you have any questions, feel free to ask me outside during the break.