Don't Forget the Network: Your App is Slower Than You Think

RailsConf 2016

by André Arko

In the talk titled "Don't Forget the Network: Your App is Slower Than You Think," André Arko emphasizes the critical yet often overlooked aspects of web application performance that stem from network considerations. He discusses how the user experience can be negatively impacted even when response times appear satisfactory. Arko draws on his extensive experience with Ruby and web application development to highlight several key points:

Understanding User Experience: Users may experience delays that developers do not consider, which can stem from various network and routing issues.
The Significance of Routing: Routing is a crucial component of web applications. He notes that the time taken by routing layers can significantly affect the total response time, which is often underestimated by developers.
Metrics and Measurements: Many developers do not measure the time spent in routing layers or understand how different metrics relate to user experience. Tools like New Relic may not capture the complete picture of user waiting times, emphasizing the need for holistic request time measurement.
Impact of Server Environments: Arko points out the importance of being aware of how server conditions, including resource contention and garbage collection pauses, can affect application performance. He encourages utilizing profiling tools to gain insights into these issues.
The Role of Monitoring in Performance Optimization: Proactive monitoring can not only help identify performance bottlenecks across different geographical regions but also optimize resources based on specific application needs.
Misleading Metrics: Averages can often mask the reality of performance outliers. He advocates for using percentile metrics and understanding the distribution of data instead of solely relying on averages.
Alerts and Alert Fatigue: Setting up alerts based on deviations from the known functioning baseline rather than just averages can help prevent alert fatigue and keep the focus on genuine performance issues.

In conclusion, André Arko stresses that while developers often focus on code efficiency, understanding the network's role and measuring the overall user experience is essential for creating performant web applications. As the talk wraps up, he invites further discussion on the topic, highlighting its relevance for developers dedicated to enhancing user experiences.

00:00:09.769 Thank you for coming to my talk. That's very kind and generous of you to listen to me discuss some important topics.

00:00:15.870 My talk is titled "Don't Forget the Network: Your App is Slower Than You Think." I'm going to talk about some things you probably haven't considered about how people use your application.

00:00:28.580 I'm going to explore how users may be experiencing your application in ways that are worse than you might think. I apologize in advance if my talk makes you feel bad for your users, so brace yourselves.

00:00:41.340 Before I dive in, let me introduce myself. My name is André Arko, and I'm involved in nearly all things Ruby. While this slide shows an older avatar of me, I promise to get that fixed before posting the slides on Speaker Deck.

00:00:58.079 I authored the third edition of a book called "The Ruby Way," which I'm quite proud of. I learned Ruby from the very first edition of this book, which was my favorite, even though I couldn't recommend it at the time as it covered Ruby 1.8.

00:01:18.360 Now, the third edition covers Ruby versions 2.2 and 2.3. I recommend buying it because in a couple of years, you can use it to prop up your monitor just as I do with my copy of the second edition.

00:01:36.119 I work at Cloud City Development, where we specialize in mobile and web application development from scratch. However, I also join teams that need a senior-level developer to assist with their Rails or frontend applications.

00:01:49.770 I've been involved in many projects, and if this talk makes you feel like you could use some assistance, please feel free to talk to me later. One of the other things I work on is something called Bundler.

00:02:09.539 I've been a part of the Bundler project for a long time, and it has been a great experience to work on open source and engage with every aspect of the Ruby community. People use Bundler in ways I would never have imagined, and I get to help solve their problems.

00:02:37.819 We've put in considerable effort to make it relatively easy to start contributing to open source through Bundler compared to many other projects. If you're interested in contributing to open source, please talk to me later or tweet at me, and I would love to help you get started.

00:02:56.630 Lastly, I spend some time on Ruby Together, which is a non-profit trade association for Ruby developers and companies. Ruby Together pays developers to work on Bundler and RubyGems, ensuring that when you run 'bundle install,' it actually works.

00:03:17.000 Without support from companies and individuals, services like RubyGems.org would struggle to stay operational. We need funding to maintain the servers and keep everything running smoothly.

00:03:36.000 Thanks to the generosity of companies like Stripe, Basecamp, New Relic, and Airbnb, we can afford to keep everything functional. We've managed to keep RubyGems.org online without interruption for the past year, but as usage continues to grow, we need more companies to contribute.

00:04:09.829 Now, let's discuss the network connection and how it makes your app slower than you think. Routing is an essential aspect of your application, even if you haven't considered it before.

00:04:16.389 At one point, there was a widely shared article on Rap Genius's blog arguing that the Heroku router was ineffective. Unfortunate as that may be, routing is an integral part of your application whether you're on Heroku or not, and it can hurt performance.

00:04:43.719 Let's talk about how routing functions in your application's infrastructure. It's responsible for taking requests from the outside world and forwarding them through your infrastructure until they reach your Rails app server.

00:05:01.900 Once it's processed, the server responds, but then that response has to travel back through a variety of proxy servers before it reaches the user.

00:05:20.919 So, how does this all work? You may not have thought about this before, especially in development, where routing seems like a non-issue.

00:05:38.979 But in production, multiple app servers handle requests from various locations, creating a more complex environment. Every additional layer in this process adds time to what users see, which you might never notice while working on your laptop.

00:06:27.699 Let's have a quick raise of hands: how many of you know how long your routing layer takes? Based on my experience asking this question at various talks, I usually see only a few hands raised.

00:06:55.390 It's an important question to consider, as the end user's experience is directly influenced by the total time spent in your routing layer. When a person uses your web app, each request passes through your routing layer twice: once on the way in and once on the way back out.

00:07:21.220 However, none of that time is included in metrics you might observe from tools like New Relic, which complicates understanding actual user waiting time.

00:07:49.030 You may look at your New Relic graphs and feel satisfied seeing quick response times like 250 milliseconds without recognizing how much additional time might be added to that before users experience a response.

00:08:08.310 It is vital to comprehend potential traffic surges that could overwhelm your routing layer and the challenge of queueing requests effectively.

00:08:34.630 In many Rails applications, where some requests are quick while others take significantly longer, you risk high latency during busy periods.

00:09:01.090 You may discover a confusing situation where fast queries hit a timeout while they should be running smoothly, leaving you to wonder why.

00:09:30.360 New Relic does provide some insight into this through features like Queue Tracking, where you can set a header to show when the request began compared to when the server ultimately processed it.

00:10:05.960 You should observe how much time requests spend waiting for server availability. In many cases, you may find that the total user waiting time is not even being measured.
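A minimal sketch of the queue-time idea described above, written as a Rack-style middleware. The talk does not name the header or its format; `X-Request-Start` with a value of `t=<Unix epoch milliseconds>` is an assumption here, and the `QueueTimeLogger` class, its `clock`, and its `sink` are hypothetical names for illustration. Check what your own proxy actually sends before relying on this.

```ruby
# Rack-style middleware that estimates how long a request waited between the
# front-end proxy and the app server.
# ASSUMPTION: the proxy sets "X-Request-Start: t=<Unix epoch milliseconds>".
class QueueTimeLogger
  HEADER = "HTTP_X_REQUEST_START" # Rack env key for the X-Request-Start header

  # clock and sink are injectable so the middleware can be tested without
  # real time or real logging.
  def initialize(app,
                 clock: -> { Time.now.to_f * 1000.0 },
                 sink: ->(ms) { warn "queue_time_ms=#{ms.round(1)}" })
    @app = app
    @clock = clock
    @sink = sink
  end

  def call(env)
    queue_ms = self.class.queue_time_ms(env[HEADER], @clock.call)
    @sink.call(queue_ms) if queue_ms
    @app.call(env)
  end

  # Returns milliseconds between the proxy timestamp and now, or nil when the
  # header is missing or not in the assumed format. Negative values (clock
  # skew between machines) are clamped to zero.
  def self.queue_time_ms(header_value, now_ms)
    match = /\At=(\d+(?:\.\d+)?)\z/.match(header_value.to_s)
    return nil unless match

    [now_ms - match[1].to_f, 0.0].max
  end
end
```

Because the proxy and the app server usually run on different machines, the result is only as accurate as their clock synchronization, so it is better read as a trend than as an exact figure.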

00:10:34.420 Ultimately, it's crucial to measure holistic request times. You want to truly understand the overall experience of users on the internet.

00:10:55.620 One effective strategy I recommend is to create a Rails controller that returns an empty string and use it in combination with monitoring services to explore how the infrastructure affects response delays.
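As a rough illustration of the empty-response endpoint, here is a sketch written as a plain Rack-compatible app so that it runs without Rails. The names `EmptyPing` and `time_ms` are hypothetical; in a Rails app the equivalent would be a controller action that calls `head :ok` or `render plain: ""`.

```ruby
# A Rack-compatible app that does no work and returns an empty body.
# Anything an outside monitor measures beyond the (near-zero) in-app time is
# spent in the network, the routing layer, and request queues.
EmptyPing = lambda do |_env|
  [200, { "content-type" => "text/plain", "content-length" => "0" }, [""]]
end

# Helper: wall-clock milliseconds taken by a block, using a monotonic clock.
def time_ms
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000.0
end

status, _headers, body = EmptyPing.call({})
in_app_ms = time_ms { EmptyPing.call({}) }
puts "status=#{status} body=#{body.join.inspect} in_app_ms=#{in_app_ms.round(3)}"
```

Pointing an external monitoring service at an endpoint like this from several regions gives a baseline for infrastructure overhead, separate from anything your application code does.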

00:11:25.800 You might find that certain geographical locations impact performance significantly, prompting the consideration of a CDN for slower regions.

00:12:06.160 It is often difficult to ascertain how your application's performance varies across regions and even whether or not it meets the needs of your business.

00:12:50.000 So, keeping track of these metrics can facilitate informed decisions and help enhance user experiences.

00:13:11.520 Now, let's talk about servers. I assume that if you have an application deployed, you also have servers in place: you've either bought and racked them yourself or rented virtual machines.

00:13:51.240 Regardless of your setup, it's essential to understand what might be happening on those servers, as they run a multitude of processes that you might not be aware of.

00:14:19.090 It's crucial to know how the environment in which your application runs impacts user experience. A common concern across various programming languages is how garbage collection pauses can add delays.

00:14:50.800 For instance, if your Ruby application pauses due to garbage collection, you need to understand its duration and its overall impact.

00:15:22.840 Tools like GC profiling can provide insights, but it's vital to understand that there are other reasons for code pauses that can complicate matters.
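One way to get at the GC numbers mentioned above is Ruby's built-in `GC::Profiler`. The sketch below is illustrative only: `gc_time_for` is a hypothetical helper, and the string-allocating block is a stand-in for real application work.

```ruby
# Report how much time a block spent in garbage collection, as measured by
# Ruby's built-in GC::Profiler, alongside the block's total elapsed time.
def gc_time_for
  GC::Profiler.enable
  GC::Profiler.clear
  runs_before = GC.count
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  {
    elapsed_s: elapsed,                  # total wall-clock time of the block
    gc_s: GC::Profiler.total_time,       # seconds attributed to GC by the profiler
    gc_runs: GC.count - runs_before      # number of GC runs during the block
  }
ensure
  GC::Profiler.disable
end

# Stand-in workload that allocates many short-lived strings.
report = gc_time_for { 200_000.times.map { |i| "string #{i}" } }
puts format("elapsed=%.4fs gc=%.4fs runs=%d",
            report[:elapsed_s], report[:gc_s], report[:gc_runs])
```

The profiler adds some overhead of its own, so it is probably better suited to sampling a subset of requests than to being left on everywhere.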

00:15:41.670 Drawing from experiences shared by developers at places like Paper Trail, I found a clever approach. By starting another thread to track elapsed time during these pauses, you can understand the overhead effect on performance.

00:16:03.530 Monitoring how long your application spends not executing can offer a clearer picture of resource usage and help identify potential issues.
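The timing-thread approach can be sketched roughly as follows. This is not the original implementation referred to in the talk; `PauseMonitor` and its parameters are hypothetical. The idea is that a background thread asks to sleep for a fixed interval and records how much longer than that it actually took to wake up. A large overshoot suggests the process was not running (GC, a busy host, or contention for Ruby's global VM lock).

```ruby
# Background thread that sleeps for a fixed interval and records by how much
# each wake-up overshot the requested interval.
class PauseMonitor
  attr_reader :max_overshoot_ms, :samples

  def initialize(interval_s: 0.01)
    @interval_s = interval_s
    @max_overshoot_ms = 0.0
    @samples = 0
    @running = false
  end

  def start
    @running = true
    @thread = Thread.new do
      while @running
        before = Process.clock_gettime(Process::CLOCK_MONOTONIC)
        sleep @interval_s
        slept = Process.clock_gettime(Process::CLOCK_MONOTONIC) - before
        # Time beyond the requested interval is time this thread could not run.
        overshoot_ms = [(slept - @interval_s) * 1000.0, 0.0].max
        @max_overshoot_ms = overshoot_ms if overshoot_ms > @max_overshoot_ms
        @samples += 1
      end
    end
    self
  end

  def stop
    @running = false
    @thread&.join
    self
  end
end

monitor = PauseMonitor.new.start
200_000.times.map { |i| "work #{i}" } # stand-in for real application work
sleep 0.05                            # let the monitor tick a few times
monitor.stop
puts "samples=#{monitor.samples} max_overshoot_ms=#{monitor.max_overshoot_ms.round(2)}"
```

In a real service you would likely report the overshoot to your metrics system rather than keep only the maximum, so that pauses can be correlated with slow requests.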

00:16:23.880 Different virtual machine setups can also introduce additional complexities, especially when running processes within layers of VMs, which can hinder responsiveness.

00:16:49.590 If you're sharing resources with co-tenants who are resource-intensive, you may not be aware of the problems they create for your application.

00:17:20.660 You may also face IO bandwidth constraints from competing applications. This highlights the importance of accurately tracking performance under different server conditions.

00:17:50.850 By taking a proactive approach to monitoring and recognizing the potential impact of resource contention, you position yourself to address performance issues effectively.

00:18:18.320 Just as Netflix benchmarks its EC2 instances to make sure they perform adequately, you can use an understanding of your own server performance to optimize your costs and service delivery.

00:18:44.190 Understanding the specific needs of your application can help you determine whether CPU, memory, disk, or network bottlenecks are impacting performance.

00:19:10.490 As your application scales, being aware of your metrics can significantly influence your resource allocation strategies.

00:19:49.660 Clearly, you should be measuring things you haven't captured before, which brings us to the subject of metrics themselves.

00:20:18.660 Fortunately, the Ruby community has a solid reputation for metrics collection and use, and services like New Relic simplify this process.

00:20:55.090 Tracking metrics is crucial; without them, your production environment operates like a black box, making it impossible to determine the quality of user experiences.

00:21:18.810 The importance of metrics really stood out for me during a talk by Coda Hale at GitHub in 2009 where he emphasized that the core of our work is to deliver business value.

00:21:49.290 To do so, we need to effectively measure and evaluate performance; otherwise, we risk failing to meet user or business expectations.

00:22:23.660 While metrics are important, they can create a false sense of understanding if they are not interpreted properly. An average, for example, can obscure what is actually happening and lead to misconceptions.

00:22:57.660 Averages can mislead due to human tendencies to assume they perfectly represent a distribution, which isn't always the case. This matters significantly in applications.

00:23:20.190 Real-world metrics often deviate sharply from the expected normal distribution. In operational metrics across multiple servers, averages may obscure critical performance insights.

00:23:43.490 Better awareness comes from looking beyond averages: percentile metrics show how the distribution actually behaves and let you draw conclusions about outliers.

00:24:06.000 Visualizing your metrics offers a clearer perspective, as various datasets might look visually very different despite sharing average values.
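To make the point about averages concrete, here is a small sketch with made-up latency data (the numbers are invented for illustration, not taken from the talk). It uses the nearest-rank definition of a percentile, which is one of several common definitions.

```ruby
def mean(values)
  values.sum.to_f / values.size
end

# Nearest-rank percentile: the smallest value such that at least pct percent
# of the samples are less than or equal to it.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Invented data: 98 requests at 100 ms and 2 requests at 5000 ms.
latencies_ms = [100] * 98 + [5000] * 2

puts "mean = #{mean(latencies_ms)} ms"            # 198.0
puts "p50  = #{percentile(latencies_ms, 50)} ms"  # 100
puts "p99  = #{percentile(latencies_ms, 99)} ms"  # 5000
```

The mean of 198 ms describes no request that actually happened: most users waited 100 ms, and a few waited five seconds. The median and the 99th percentile together tell that story; the average alone hides it.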

00:24:34.790 I urge you not to depend solely on averages for alerts. Alerts should notify you of deviations from your known functioning baseline, not just tell you when an average has moved.

00:25:01.920 In summary, the network plays a critical role in the overall performance of your application. Developers often overlook its significance, focusing instead on the code without considering network factors.

00:25:39.680 After deploying your application, remember that the user experience is essential. Regardless of processing efficiency, users ultimately care about the total time it takes for their requests to be fulfilled.

00:26:08.680 In any case, once you stop alerting on averages, establishing what normal operation looks like for your system is key to avoiding alert fatigue.

00:26:49.540 The best advice I've received involves identifying your system's average baseline and adjusting alerts to signal deviations from that baseline.
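A baseline-relative alert might look something like the sketch below. Everything in it is an assumption for illustration: the helper names, the choice of the median of recent p95 readings as the baseline, and the 50% tolerance. The right baseline and threshold depend on your own system.

```ruby
def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# True when the current reading differs from the historical baseline by more
# than the given fraction of that baseline, in either direction.
# ASSUMPTION: baseline = median of recent readings; tolerance = 50%.
def deviates_from_baseline?(current, history, tolerance: 0.5)
  baseline = median(history)
  (current - baseline).abs > tolerance * baseline
end

# Invented history of per-minute p95 response times, in milliseconds.
p95_history_ms = [240, 250, 260, 245, 255]

puts deviates_from_baseline?(300, p95_history_ms) # false: within 50% of 250
puts deviates_from_baseline?(400, p95_history_ms) # true: much slower than usual
puts deviates_from_baseline?(100, p95_history_ms) # true: suspiciously fast
```

Checking both directions is deliberate: a sudden drop in response time can mean requests are failing fast rather than getting faster.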

00:27:22.230 Understanding the metrics that are specific to your application and your business allows for more nuanced alerts and less unnecessary noise.

00:27:58.490 As we conclude, if you have any further questions, I'm happy to discuss these ideas further. Thank you for your time!