Statistically Optimal API Timeouts

by Daniel Ackerman

The video titled "Statistically Optimal API Timeouts," presented by Daniel Ackerman at RubyConf 2019, delves into the best practices surrounding API timeout strategies. It specifically addresses how to determine optimal retry times for API requests, moving beyond arbitrary guesses to data-driven decisions.

The talk begins with Ackerman explaining the importance of timeouts and their correct implementation within Ruby applications. His goal is to equip developers with the knowledge to improve their API interactions by accurately assessing and optimizing timeout values. Key points of the talk include:

  • Understanding Timeouts: Ackerman provides an overview of timeout mechanisms and the circumstances under which they are necessary, emphasizing that proper timeout policies can lead to improved API efficiency.
  • Historical Context: He references how timeout values are often chosen randomly, challenging the audience to consider how these decisions might be better informed by mathematical analysis.
  • Statistical Insights: The speaker illustrates the concept of statistical distributions with a visual representation of cumulative distribution functions (CDFs), highlighting how different distributions affect response times. Two scenarios are compared in which the gap between the 80th and 95th percentiles of response times differs markedly, leading to the conclusion that timeout strategies can be tuned based on these metrics.
  • Practical Application: Ackerman emphasizes that adjusting timeout settings is a straightforward process, often requiring only updates to configuration files rather than extensive code changes. This makes optimal timeout strategies accessible in existing projects (a brief sketch of the idea follows this list).
  • Ruby Library: The presentation wraps up with a discussion about an open-source Ruby library that Ackerman developed to facilitate the calculation of optimal API timeouts, presenting a practical tool for attendees to employ in their projects.
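
The summary does not name the library or show its interface, so the following is only a minimal sketch of the underlying idea, assuming you already have a sample of observed response times: compute a percentile from the data and treat it as a candidate timeout. The sample array, the `percentile` helper, and the closing rule of thumb are all illustrative, not the talk's actual formula.

```ruby
# Illustrative sketch only (not the library presented in the talk).

# Hypothetical response times in seconds, e.g. pulled from logs or metrics.
response_times = [0.9, 1.4, 2.0, 2.6, 3.1, 3.5, 3.8, 4.0, 9.5, 10.1]

# Nearest-rank percentile: smallest sample with at least pct% of values at or below it.
def percentile(samples, pct)
  sorted = samples.sort
  rank = (pct / 100.0 * sorted.length).ceil - 1
  sorted[rank.clamp(0, sorted.length - 1)]
end

p80 = percentile(response_times, 80)  # => 4.0 for this toy sample
p95 = percentile(response_times, 95)  # => 10.1

# Made-up heuristic: if the tail is long relative to the bulk of requests,
# time out near the 80th percentile and retry; otherwise just wait longer.
timeout = (p95 - p80) > p80 ? p80 : p95
puts "candidate timeout: #{timeout}s"
```

In practice the samples would come from production metrics rather than a literal array, and the chosen value would land in configuration rather than in request code.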

In conclusion, Ackerman's talk provides insights into the statistical analysis of API timeout policies, offering a framework to improve how developers approach these crucial design decisions. By the end of the presentation, viewers will have a deeper understanding of the mathematical basis behind timeout configuration and actionable strategies to optimize their API interactions.

00:00:13.120 Hey, I'm Daniel Ackerman, and I'm here to present my talk on Statistically Optimal API Timeouts. This talk will cover optimal retry policies, specifically focusing on how long you should wait before retrying an API request.
00:00:17.039 A timeout is essentially the amount of time you're willing to wait for a response before giving up and retrying the request. In this discussion, I will provide a bit of background on the problem, some information about myself, and what inspired me to take on this topic.
00:00:27.359 Next, I will explain why one should use timeouts, the canonical situations where they apply, and how to implement them correctly in Ruby. I will also touch briefly on how to evaluate the effectiveness of your current timeout strategy. Finally, we will delve into timeout optimization, and I'll introduce a Ruby library I developed to calculate optimal timeouts.
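
The transcript does not include the code from this part of the talk; as a point of reference, a minimal sketch of setting timeouts correctly with Ruby's standard Net::HTTP client might look like the following (the endpoint and the two-second values are placeholders, not recommendations from the talk).

```ruby
require "net/http"
require "uri"

uri = URI("https://api.example.com/status") # placeholder endpoint

http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
http.open_timeout = 2 # seconds to wait for the TCP connection to open
http.read_timeout = 2 # seconds to wait for each read of the response

begin
  response = http.get(uri.request_uri)
  puts response.code
rescue Net::OpenTimeout, Net::ReadTimeout => e
  # A timed-out request is the signal to retry (or fail fast).
  warn "request timed out: #{e.class}"
end
```

Setting per-operation timeouts on the client is generally preferred over wrapping calls in Timeout.timeout, which can interrupt a thread at unsafe points.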
00:00:42.640 To set the stage, I'd like to start with a quote by Descartes, a philosopher from the 1600s. He stated that 'mathematics is a more powerful instrument of knowledge than any that has been bequeathed to us by human agency.' I believe that starting with a quote is essential for a mathematically-based talk as it helps frame our mindset.
00:01:01.680 Now, a bit about myself: I am from Austin, Texas, where the weather is quite hot. Currently, I'm enjoying the cooler weather and working as a software engineer at Braintree on the Pay with PayPal team. My passion for mathematics has significantly influenced my motivation for this talk.
00:01:51.600 Let’s get into my inspiration for this topic. We all encounter timeouts in our APIs, and a common question arises: how did someone decide on that timeout value? Often, these timeout values appear to be selected arbitrarily. For example, you'll find some APIs set at seven seconds for a timeout, and it's unclear how that figure was derived.
00:02:02.880 Sometimes there’s a suggestion to retry at the 95th percentile. However, these claims are often not grounded in quantitative analysis. This raises the question: why do we choose these specific values? As someone with a mathematical background, tackling this problem seemed like a rewarding and practical challenge.
00:02:25.760 To illustrate my point, let's look at a quick pictorial argument involving two distributions. In the cumulative distribution function (CDF), we can see the likelihood of receiving a response within a certain timeframe. In the left distribution, about 80% of responses are received within 4 seconds, while the 95th percentile is around 10 seconds.
00:03:05.520 Contrasting this, in the right distribution the 80th percentile is already around 10 seconds, and the 95th percentile is only marginally higher at 10.4 seconds. As a mathematician, it's evident to me that the left distribution leaves room for optimization: setting the timeout near the 80th percentile and retrying could shorten the overall request-and-retry cycle.
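
To make the pictorial argument concrete, here is a small simulation sketch under assumptions that are mine, not the talk's: latencies drawn from a toy bimodal distribution shaped roughly like the left-hand CDF (about 80% of responses within 4 seconds, the rest near 10), and a single retry that is independent of the first attempt.

```ruby
# Toy model of the left-hand CDF: ~80% of responses finish within 4s,
# the remaining ~20% take 8-10s. Numbers are illustrative only.
def sample_latency
  rand < 0.8 ? 1.0 + rand * 3.0 : 8.0 + rand * 2.0
end

# Policy 1: no timeout, wait however long the response takes.
def wait_it_out
  sample_latency
end

# Policy 2: give up after `timeout` seconds and retry once
# (optimistically assumes the retry is independent of the first attempt).
def timeout_then_retry(timeout)
  first = sample_latency
  first <= timeout ? first : timeout + sample_latency
end

trials = 100_000
no_timeout_avg = trials.times.sum { wait_it_out } / trials
retry_avg      = trials.times.sum { timeout_then_retry(4.0) } / trials

puts format("no timeout:      %.2fs average", no_timeout_avg)
puts format("timeout at ~p80: %.2fs average", retry_avg)
```

Rerunning the same sketch with a distribution shaped like the right-hand CDF (80th and 95th percentiles both near 10 seconds) makes the retry policy stop paying off, which is exactly the distinction the two CDFs are meant to show.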
00:03:44.720 The simplicity of changing timeout settings is a real advantage. It typically doesn't require infrastructural modifications; often it's as easy as tweaking a configuration file. Improving the response time of many downstream requests could, in turn, improve the overall responsiveness of your API or service, which would be incredibly beneficial.
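
As an illustration of the configuration-only change (the file name, keys, and values below are hypothetical):

```ruby
require "net/http"
require "yaml"

# Hypothetical config/timeouts.yml:
#
#   payments_api:
#     read_timeout_seconds: 4   # lowered from 10 toward the ~80th percentile
#
# Retuning the timeout later means editing this file, not the request code.
timeouts = YAML.load_file("config/timeouts.yml")

http = Net::HTTP.new("payments.example.com", 443)
http.use_ssl = true
http.read_timeout = timeouts.fetch("payments_api").fetch("read_timeout_seconds")
```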