Site Availability is for Everybody

Stella Cotton • February 10, 2016

Summarized using AI

In "Site Availability is for Everybody," presented at RubyConf AU 2016, Stella Cotton focuses on preparing for site outages and on effective load testing. She addresses the anxiety many developers feel when a website goes down unexpectedly and equips the audience with strategies to handle such crises.

Key points discussed include:
- Understanding Load Testing: Load testing simulates multiple users accessing a site to identify potential bottlenecks and prepare for high-traffic scenarios. This helps engineers create a site availability playbook in advance.
- Preparing for Load Testing: The talk uses Apache Bench as the primary load testing tool, highlighting its ease of setup and use. Key steps include selecting an appropriate endpoint and starting with simple requests before ramping up the load (a minimal command sketch follows this list).
- Identifying Key Metrics: Cotton discusses crucial metrics to monitor during tests, like average latency and response times, and underscores the importance of hypothesis-driven testing to understand performance discrepancies.
- Common Pitfalls: The speaker outlines common pitfalls such as misconfigured load tests, handling error pages, and understanding the effect of caching and session management on test results.
- Understanding User Experience Impact: It’s essential to recognize how increasing loads affect user experience, employing principles from queueing theory to clarify these dynamics. As load intensifies, response times can worsen significantly, leading to potential timeouts and degraded user experience.
- Resource Management: Recommendations are provided for managing server resources effectively during load tests, including monitoring CPU and memory usage, managing open files, and handling TCP/IP port limitations.
- Advanced Tools and Techniques: Beyond Apache Bench, the speaker suggests exploring tools like Siege and Bees with Machine Guns for more complex load testing scenarios. These tools can help simulate real-world environments more effectively.
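
As a compact illustration of the workflow summarized above, a first Apache Bench run might look like the sketch below; the staging hostname and flag values are placeholders rather than figures from the talk:

```
# simulate 10 concurrent users making 1,000 requests in total
ab -c 10 -n 1000 http://staging.example.com/
```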

The key takeaway from the presentation is the importance of proactive measures and thorough testing to ensure site availability, which ultimately leads to building confidence within development teams when facing unexpected challenges. Stella Cotton encourages developers to familiarize themselves with the load testing process and understand how their applications behave under stress to enhance overall site reliability.


RubyConf AU 2016: Your phone rings in the middle of the night and the site is down- do you know what to do? Don’t let a crisis catch you off guard! Practice your skills with load testing.

This talk will cover common pitfalls when simulating load on your application, and key metrics you should look for when your availability is compromised. We’ll also look at how tools like ruby-prof and bullet can help unearth problem code in Ruby apps.


00:00:00.210 I'm going to start with a little bit of a pop quiz to make things interactive. Imagine a scenario where your phone rings in the middle of the night: your website is down, and everyone on your engineering team is out of cell phone range, leaving just you to handle the situation. Raise your hand if you feel confident about what to do or where to start.
00:00:10.559 It looks like quite a few of you feel good about it, which is great. Now, I want everyone to close their eyes and raise your hand again. No one will judge you; just raise it if you feel like you know what to do or where to start. I'm happy to see that most of you are quite honest.
00:00:29.880 By the end of this talk, I hope those who didn't raise their hands will come away with some ideas on how to get more comfortable with site availability. For the rest of you, I hope you'll get a refresher and leave with some ideas to help your team and individuals who may not be as confident.
00:00:54.780 One of the significant challenges I have found with site availability is that it often catches us completely off guard. Day to day, we practice writing code, refactoring, testing—doing all these things—but a site outage can come from a multitude of random reasons and often appears out of nowhere.
00:01:10.830 This randomness can be quite terrifying. To illustrate this, I'm going to share a scary story about my experiences. It was July 2015, and I was working as a Ruby developer for IndieGoGo, a crowdfunding website. We had many successful campaigns and received significant traffic. Notably, we ran an Australian beekeeping campaign that raised around $12 million.
00:01:28.170 During this month, news broke that Greece had become the first developed country to fail to make an IMF loan payment. Weirdly, this event managed to take down our entire website. It was the middle of the night in California, and Europe was waking up to this incredible news. A British gentleman decided to initiate a campaign to bail out Greece by launching a €1.6 billion project.
00:01:46.560 His rationale was simple: if everyone contributed just three euros, their goal would be met, and the crisis would be resolved. Traffic began to build at an alarming rate during the night, and people started contributing these small amounts swiftly. Eventually, this surge of traffic brought down our entire website, and it did not fully recover until we managed to put up a mostly static page to handle the load.
00:02:04.680 This situation was entirely outside of my usual day-to-day responsibilities. Typically, I triage bugs, deploy updates, and manage various tasks, but this was different, and I was unprepared and a bit scared. I began to wonder how I could have adequately prepared for such an incident.
00:02:24.480 That's where load testing comes into play. Load testing allows us to programmatically simulate many users making simultaneous requests to our website, acting as a low-stress simulator for these high-stress situations. It enables you to explore and build a site availability playbook before any disasters strike. Moreover, it helps identify bottlenecks in your application that could pose risks in the future and can measure the performance benefits from any changes made along the way.
00:02:43.860 However, when I started with load testing, I found plenty of high-level guidance on how to run one and plenty of deep technical material on site performance, but bridging those two areas required a lot of trial and error and some frustrating Googling.
00:03:02.230 In this talk, I would like to cover three key objectives: how to get started with load testing, how to increase the intensity of your load tests, and finally, I will discuss a couple of tools for exploring the results that you gather.
00:03:18.800 First, let's focus on getting started. We'll kick things off by preparing the load testing tool. I will primarily discuss Apache Bench because it comes pre-installed on Linux machines and is one of the simplest tools to use. The command begins with "ab" for Apache Bench, which is all you need to initiate your first load test.
00:03:34.360 To set it up correctly, you will want to choose an endpoint to send your simulated traffic to. I suggest using example.com because it is officially designated as an example domain, so you won't accidentally annoy someone by sending traffic to their actual website.
00:03:55.710 Begin with a simple static page that doesn't make any database calls to establish a baseline. Once you're confident that your load testing configuration is correct, choose the page that will likely bear the brunt of your traffic. For us at Indiegogo, that was a campaign page rather than the home page, since campaign pages generated the highest load.
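
For that very first baseline run, the invocation can be as small as the sketch below; example.com is the designated example domain mentioned above, and a single request simply confirms that the tool and endpoint are wired up correctly:

```
# one request, concurrency of one, just to verify ab can reach the endpoint
ab -n 1 http://example.com/
```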
00:04:09.490 Load testing against localhost is an option, but keep in mind that the load test itself consumes resources; if it runs on the same machine as the server you're testing, the two compete for resources and your results will be skewed. Conversely, running a load test against a production website can degrade the experience for real users and potentially take the site down.
00:04:34.520 It is therefore advisable to target a staging server or a production server that does not host any external traffic. Additionally, while technically you can DDoS yourself with your tests, it's advisable to avoid that in practice. Next, let’s talk about configuring the load for your tests.
00:04:59.340 In your command, you will need to specify the number of concurrent requests to execute, using the '-c' flag, along with the total number of requests for the entire test, designated by the '-n' flag. Starting with a concurrency of 1 and a sufficient number of total requests will allow your system to warm up.
00:05:35.270 For example, begin with one request running a thousand times. Ensure the test runs for a sufficient duration, at least a few minutes. It's also important to understand that each request may not represent a unique visitor. Depending on the assets your front-end application loads, a single visitor could generate multiple requests.
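
A warm-up and ramp-up sequence along those lines might look like this; the URL and request counts are placeholders:

```
# -c sets concurrent requests, -n sets the total number of requests
ab -c 1 -n 1000 http://staging.example.com/

# then step the concurrency up and compare results between runs
ab -c 10 -n 5000 http://staging.example.com/
ab -c 50 -n 10000 http://staging.example.com/
```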
00:05:52.860 Another element to consider is the impact of browser caching. A return visitor may generate fewer requests. It's vital to note that tools like Apache Bench do not execute JavaScript or render HTML, so the latency reported will reflect only parts of the user experience.
00:06:18.440 Once you've run your load tests, you can analyze the output from Apache Bench. The results will show the percentage of requests served within specific timeframes. As you analyze these initial results, ensure that the average latency aligns with your expectations based on your production server response times.
00:06:51.070 It's worth noting that load testing can feel like a black box. If you start plugging random numbers into it without a proper understanding, you may get good results but not know why. Therefore, having a reasonable hypothesis about your system performance is essential. For instance, if your load test indicates that 99% of requests are served within 100 milliseconds, but your production shows that the same metrics fall around 650 milliseconds, there's likely an issue with your load testing setup.
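
For reference, the relevant part of Apache Bench's report looks roughly like the excerpt below; the numbers are invented purely to show the shape of the output, not results from the talk:

```
Requests per second:    152.31 [#/sec] (mean)
Time per request:       65.656 [ms] (mean)

Percentage of the requests served within a certain time (ms)
  50%     60
  90%     95
  99%    140
 100%    312 (longest request)
```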
00:07:43.690 Common pitfalls in load testing include dealing with error pages, especially if you're testing against a staging server that requires basic authentication. You will need to include those credentials in your Apache Bench command; otherwise, you are just testing your server's ability to return a 401 error.
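
Apache Bench takes basic-auth credentials with the -A flag; the username, password, and host below are placeholders:

```
# -A supplies Basic Authentication credentials as username:password
ab -A staging_user:secret -c 10 -n 1000 http://staging.example.com/
```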
00:08:08.880 Another frequent issue is encountering 500 error pages or redirects, as Apache Bench will not follow those and log them as non-200 requests. You can identify these problems by checking your Apache Bench results to see if non-200 responses are greater than zero.
00:08:50.580 If you're getting 500 errors before any load is applied, you're likely experiencing one of these issues. While your test runs, be sure to monitor your server logs to gain insight into the responses coming back. It's also essential to differentiate non-200 responses from failed requests.
00:09:26.800 Apache Bench will check the content length of the first request, and if it changes dynamically in subsequent requests—due to session values or any other variable—those will show up as failed requests. If you verify with your logs that everything appears fine, you can generally ignore them.
00:10:14.450 In later versions of Apache Bench, you can add a flag to accept variable document lengths. Once you feel confident about your load testing setup and there are no errors, you can begin increasing the volume.
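
In recent versions of Apache Bench, the flag in question is -l, which stops responses of varying length from being counted as failures; the URL is a placeholder:

```
# -l: accept variable document lengths instead of logging them as failed requests
ab -l -c 10 -n 1000 http://staging.example.com/campaigns/some-campaign
```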
00:10:35.630 As the transaction load increases, it's important to understand how this affects user experience, so let's talk about queueing as we ramp up the load. The way the average response time across the site climbs under load is explained by Little's Law, a fundamental principle of queueing theory.
00:11:03.670 Little's Law states that the average number of customers in the system (L) equals the average arrival rate (λ) multiplied by the average time each customer spends in the system (W). To put this into a practical context, think of a single cashier: the total number of customers waiting depends on how quickly the cashier can check each one out.
00:11:53.490 If the cashier becomes slower at processing requests, and customers continue to enter the line at the same rate, the wait times for everyone will increase significantly. Similarly, if you have a web server processing requests, as you add more requests to the queue, you'll see the average response time climbing.
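
To make the formula concrete, here is a small worked example with invented numbers:

```
L = λ × W
λ = 20 requests arriving per second
W = 0.5 seconds spent in the system per request
L = 20 × 0.5 = 10 requests in the system on average

If processing slows so that W doubles to 1 second at the same arrival rate,
L doubles to 20 requests sitting in the system.
```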
00:12:12.080 Thinking about your web stack, it typically comprises several components, including a reverse proxy server, an application server, and a database. These can exist on different machines or the same machine but will often share resources. When you're processing numerous requests under load, the average response time can severely impact usability if the server cannot keep up.
00:12:47.180 A proxy server will only let a request wait for so long; if the load is too high, requests time out and users receive error messages. It's also important to remember that raising the number of requests your application server can accept does not guarantee better performance under heavy load: it can simply mean a longer queue of requests overwhelming the system.
00:13:02.720 If requests are piling up at the application server level, it impacts overall server performance and can lead to unnecessary resource consumption for requests from clients who have already abandoned the site. This follows the recommendation in queuing theory that a single queue for multiple workers is more efficient when job times are inconsistent.
00:13:30.720 As you consider how to handle increased loads, be aware that while it may be tempting to raise timeout thresholds on the proxy server to decrease your error rate, this may not yield the best user experience. Users would prefer to see a timeout message quickly rather than wait indefinitely for a page to load.
00:14:22.480 Remember that load won't only affect the application; the operating system can be impacted too, especially when you load test on a different machine. Configurations typically in place on a production machine might not be present in a staging environment, leading to issues with resource allocation and management. Each incoming connection at the proxy server is tracked as a socket, identified by an IP address and port number, and each open socket consumes a file handle on Linux servers.
00:15:17.690 If you run too many simultaneous requests, you may encounter errors indicating that there are too many open files. Defaults for file handles are often quite low, leading to rapid exhaustion. You can check what your current limits are and potentially adjust them based on the recommendations of your environment.
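
On a Linux box, one way to inspect and raise the open-file limit is sketched below; 65536 is an illustrative value, and the right limit depends on your environment:

```
# show the current per-process open file limit
ulimit -n

# raise the soft limit for the current shell session
ulimit -n 65536

# persistent, system-wide limits live in /etc/security/limits.conf on many distributions
```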
00:15:55.470 Another common issue that arises from load testing is TCP/IP port exhaustion. Each machine has a finite number of network ports available for application use, and these ephemeral ports may be depleted under significant load. You can consider modifying the `TIME_WAIT` duration to allow those ports to be recycled sooner, thus preventing potential bottlenecks.
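
On Linux these settings are exposed through sysctl; the commands below are a sketch for a dedicated load testing box, not a production recommendation:

```
# the ephemeral port range available for outgoing connections
sysctl net.ipv4.ip_local_port_range

# count connections currently stuck in TIME_WAIT
ss -tan state time-wait | wc -l

# allow sockets in TIME_WAIT to be reused for new outgoing connections
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
```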
00:16:53.350 When doing load testing, remember that your application behaves uniquely in a real-world context compared to a sterile testing environment. For instance, during testing, the number of database queries, their complexity, and user interactions must mirror your actual usage if you want reliable results.
00:17:30.760 If you're expecting high user activity, ensure that your test includes generating that same activity—consider running a script alongside your load test to simulate a realistic environment. Additionally, keep in mind how external requests can affect response times, as blocking requests to external payment providers during high traffic could lead to further latency in user experiences.
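
One simple way to generate that extra activity is a throwaway loop that posts write traffic alongside the ab run in another terminal; the endpoint and payload here are entirely hypothetical:

```
# fire a small write request every second while the load test runs
while true; do
  curl -s -o /dev/null -X POST -d "amount=3" http://staging.example.com/contributions
  sleep 1
done
```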
00:18:12.700 By now, you should feel comfortable understanding the lifecycle of a web request within your stack and monitoring your logs for errors. Setting up a reasonable load testing environment enables you to gauge your app's limits effectively without being constrained by the testing infrastructure.
00:18:45.540 From here, you can begin using additional tools to better understand any performance bottlenecks in your application. A good starting point is to investigate the resource limits of the machine where your proxy and application server reside.
00:19:26.540 Using utilities like `top` or `htop` can help you visualize CPU and memory usage during load testing. Keep in mind that on multi-core systems the reported CPU percentage for a process can exceed 100%, since each core contributes up to 100%.
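
For example, on Linux you can limit top to the application server processes and toggle the per-core view; the process name below is an assumption:

```
# watch only the unicorn master and workers; press 1 inside top to show each core
top -p "$(pgrep -d',' unicorn)"
```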
00:20:10.360 Always remember that the resources in your hosting environment are finite, and over-allocating workers can degrade performance quickly. Watch how your application actually behaves under realistic load and tune worker counts accordingly, rather than discovering you have run out of resources mid-incident.
00:20:37.150 You will also want to consider pointing your application to an external database if possible to relieve some stress during testing, ensuring that it doesn’t interfere with your production environment.
00:21:05.750 Transaction tracing can provide valuable insights during testing and can identify issues similar to what real user logs would produce under load in a live environment.
00:21:32.740 Consider using built-in database tools, like MySQL's slow query log, to monitor the performance of your database under load conditions. This helps not only to identify which queries could slow down the application but also to ensure that critical paths are optimized and resources utilized efficiently.
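
A rough sketch of enabling the slow query log for the duration of a test, run through the mysql client; the half-second threshold is illustrative, and these global settings reset when the server restarts:

```
# log any query slower than half a second while the load test runs
mysql -u root -p -e "SET GLOBAL slow_query_log = 'ON'; SET GLOBAL long_query_time = 0.5;"

# find out where the log file is being written
mysql -u root -p -e "SHOW VARIABLES LIKE 'slow_query_log_file';"
```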
00:22:22.250 Lastly, while Apache Bench is a go-to tool, several alternatives are available that can give you more flexibility and better test scenarios. Tools like Siege allow you to configure batch requests while Bees with Machine Guns enables spinning up multiple EC2 instances for distributed load testing.
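
As one example, a Siege run roughly equivalent to the ab commands above might look like this; the URL, user count, and file name are placeholders:

```
# 25 concurrent simulated users for one minute, in benchmark mode (no think time)
siege -b -c 25 -t 1M http://staging.example.com/

# or drive a batch of URLs from a file to mix read and write traffic
siege -b -c 25 -t 1M -f urls.txt
```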
00:22:51.620 In conclusion, each application environment has its specifics. Your application might be hosted on Heroku, using Puma instead of Unicorn, or dealing with entirely different challenges. Load testing can help you uncover areas of curiosity, allowing you to shine a light into dark and scary places within your infrastructure.
00:23:30.880 Thank you, everyone, for being a fantastic audience today. I will tweet out a link to my slides if you're interested in taking a closer look. You can find me on Twitter at @practice_cactus.