Who reordered my code?!

http://rubykaigi.org/2016/presentations/pitr_ch.html

There is a hidden problem waiting as Ruby becomes 3x faster and starts to support parallel computation - reordering by JIT compilers and CPUs.
In this talk, we’ll start by trying to optimize a few simple Ruby snippets. We’ll play the role of a JIT and a CPU and order operations as the rules of the system allow. Then we add a second thread to the snippets and watch it as it breaks horribly.
In the second part, we’ll fix the unwanted reorderings by introducing a memory model to Ruby. We’ll discuss in detail how it fixes the snippets and how it can be used to write faster code for parallel execution.

Petr Chalupa / @pitr_ch
He is a core maintainer of concurrent-ruby, where he has contributed concurrent abstractions and a synchronisation layer, providing volatile and atomic instance variables. He is part of the team working on JRuby+Truffle Ruby implementation at Oracle Labs. He is a happy Ruby user for 10 years.

RubyKaigi 2016

00:00:00.140 Thank you for having me. My name is Petr Chalupa, and I work for Oracle Labs, which is a research group within Oracle. As mentioned earlier, I am the maintainer of concurrent-ruby, and I also work on the JRuby+Truffle Ruby implementation.

00:00:05.970 Before I start, since I will also be talking about JRuby stuff, I want to clarify that the JRuby+Truffle project is primarily a research initiative and not an official Oracle product. Therefore, please don't make any investment decisions based on what I present today. Any opinions expressed are my own.

00:00:29.369 I will begin with the most dangerous part of my talk, which is a live example using the Dekker's algorithm. This algorithm is used to create a critical section by mutually excluding two threads. It works as follows: there are two flags, initially set to false, which each thread uses to indicate its intention to enter the critical section by setting its flag to true. If a thread sees that the other flag is false, it means that the other thread does not intend to enter, allowing it to enter the critical section safely.

00:01:02.420 However, if the other flag is true, it means that both threads are trying to enter the critical section at the same time. In that case, the algorithm requires one of the threads to enter a waiting state. I wrapped this algorithm in a loop and added some checks to see if it really works. I will first run it on an implementation without JIT compilation enabled. In this case, it won’t encounter any issues because the code is interpreted, leading to a few iterations executing as expected. But if I enable the compiler with Graal, it will eventually fail due to both threads trying to access the critical section simultaneously.

00:01:49.570 This can lead to unexpected results, as indicated by an error message that appears when one of the checks fails. The algorithm is broken due to the code being reordered, which is the topic of this talk, and that is what we will investigate.

00:02:21.790 We will start by examining the circumstances under which reordering occurs, what implications it has, and whether we want to embrace or reject it. We will discuss how to manage the issue of reordering if we choose to embrace it and whether it offers any practical benefits.

00:02:49.870 Let me start by outlining some emerging goals for Ruby. The first and foremost is increased performance, with the aim of making Ruby three times faster. Achieving this goal for Ruby 3 will involve the adoption of JIT compilation alongside techniques like Truffle and Graal, with the intention of running the language as fast as possible.

00:03:09.310 The second goal centers around parallel execution, allowing us to execute code across multiple threads. As most modern processors have multiple cores, fully utilizing this capability is important. While JRuby supports certain forms of parallel execution, MRI (Matz's Ruby Interpreter) has limitations.

00:03:42.030 We aim to develop high-level concurrency libraries, making it easier to utilize this parallelism. There have been discussions around actors, channels, and streams, but concrete implementations are still unclear. Based on my experience working on concurrency abstractions in concurrent-ruby for several years, there are still unanswered questions. For instance, how do we formulate more effective concurrent data structures, like concurrent hashmaps, that can be accessed safely from different threads?

00:04:01.800 Also, how do we create new concurrency abstractions to solve various problems? This talk will partially address these topics.

00:04:45.260 Now, let's talk about where we observe reordering. Two conditions must be met: we need a fast implementation, and the likelihood of errors during execution increases with speed. Furthermore, the implementation must support parallel execution. For Ruby to be fast, it is essential to have speculative optimizations combined with dynamic compilation and parallel execution.

00:05:31.410 Speculative means the implementation can make assumptions about stability within method bodies, such as constant values remaining unchanged. The second condition, optimizing, implies utilizing advanced techniques seen in compilers like GCC or V8 for enhanced execution speed. Dynamic compilation refers to just-in-time compilation, which is insufficient on its own; a highly dynamic language like Ruby requires the ability to revert from optimized code back to the interpreter if any assumptions fail, such as changes in constant values.

00:06:35.039 Ideally, a fast Ruby implementation would also need to employ speculative optimization and dynamic compilation techniques, allowing the code to execute in parallel. JRuby on Graal uses technologies that facilitate self-optimizing abstract syntax tree interpreters. This means the interpreter can dynamically rewrite itself based on the executing code, producing highly optimized machine code for Ruby methods.

00:07:40.650 The first source of reordering concerns compilers that optimize code transformations to enhance runtime performance. A compiler assumes that code runs in a single-threaded environment. It reorders operations while ensuring that results remain consistent. Even if the compiler does not reorder operations, the processor may do so, leveraging its capabilities to execute instructions more efficiently.

00:08:25.309 To illustrate this, let's revisit the Dekker's algorithm. The operations governing flags can be reordered by the compiler because the reads and writes are independent of each other, potentially causing critical section entry issues. Such reordering can lead to the failure of the synchronization algorithm.

00:09:01.790 Another source of reordering concerns the use of futures. A future represents a value to be computed in the future, with a straightforward API. When one thread sets a value, another thread can retrieve it. However, if the API is compiled, the reading thread might store a temporary value and not read from the original instance variable, leading to issues when the value is actually needed.

00:10:07.590 We can also demonstrate another form of reordering using the Dekker's algorithm, assuming a simple implementation template without the cache. In a scenario without out-of-order execution capabilities, both threads may read false values from memory due to reliance on store buffers. This would lead to both threads entering the critical section, confirming that the algorithm fails again.

00:11:04.370 Lastly, the instruction processor itself can contribute to reordering issues, especially in memory models without optimizations in place. If the code attempts to read independent variable values, the processor can reorder their execution, potentially leading to undesirable outcomes.

00:12:23.510 We now need to address the question of where we might become confused about code execution: it could stem from the compiler, cache mechanisms, or the processor. Yet, regardless of the source, we should prioritize understanding our original code's intended order.

00:12:48.300 Although these reordering mechanisms can appear unfriendly, they ultimately serve to optimize execution speed, increasing performance. However, we must find a balance.

00:13:01.800 To address these challenges, we must implement effective memory models for shared variables. These models inform the compiler that certain variables are shared and must be treated accordingly, resulting in no half-updated values during executions. This contrasts with our traditional model where we cannot guarantee the sequence of updates. It leads us to consider sequential consistency.

00:14:47.750 This approach stipulates that the outcome of any execution corresponds to an execution order that represents all thread operations. It ensures that individual processor operations maintain the order specified in their programs, allowing for predictable and reliable shared data interactions.

00:16:54.690 By defining shared variables and determining their communication roles, we can ensure correct operation while keeping optimizations in place for non-shared variables.

00:18:23.680 We can also utilize this approach to build shared variables which can operate under sequential consistency. By marking shared variables as such, we create a structure that helps maintain consistent reads and writes without interfering with caching or compilation.

00:19:18.660 In programming scenarios, we can simplify program ordering by implementing sequential consistency. This methodology enables the reasoning on variable interactions efficiently.

00:20:02.110 We finally conclude that a solid memory model equipped with shared variables can ease the construction of libraries and data structures for Ruby, directly contributing to its future robustness.

00:20:42.970 I have presented just a simple example of a counter, but the principles discussed allow for the development of more complex constructs, like promises or broader concurrency tools. Libraries such as concurrent-ruby illustrate this effectively, but users should still rely on established libraries rather than create everything from scratch.

00:21:21.060 Thank you very much for your attention. I will be happy to answer any questions.