
Summarized using AI

In The Loop

Lourens Naudé • August 11, 2011

In the talk titled 'In The Loop,' Lourens Naudé discusses the Reactor Pattern and its relevance in event-driven programming, particularly within infrastructure technologies such as Nginx, EventMachine, 0mq, and Redis. The presentation aims to elucidate the often misunderstood concepts surrounding event-driven systems, focusing on key technical aspects such as system calls, file descriptor behavior, event loop internals, and buffer management.

The agenda of the talk encompasses several crucial areas:

  • System Calls and I/O: The session begins with an exploration of system calls and how they enable user processes to access resources managed by the kernel, emphasizing the differences between blocking, non-blocking, and asynchronous I/O.
  • The Reactor Pattern: Naudé delves into how the pattern supports event-driven architectures, particularly in scenarios where maximizing throughput is essential and less focus is placed on individual request processing speeds.
  • Practical Examples: The example of a trading application that uses the Financial Information Exchange (FIX) protocol illustrates the discussed points, showcasing the complexities of handling I/O for multiple clients.
  • File Descriptors: Naudé explains how file descriptors work within the context of both blocking and non-blocking I/O, underlining key operational differences and performance implications.
  • Event Loop Mechanism: An infinite loop mechanism is introduced as a core component of the Reactor Pattern, which manages changes in the state of file descriptors and executes domain logic through an efficient event handling system.
  • Performance Considerations: The talk warns against common pitfalls such as excessive blocking times and inefficient handling of callbacks that can impact overall application responsiveness and client servicing.
  • Best Practices: Key recommendations include properly scheduling work in ticks to retain responsiveness, optimizing protocol parsing, and embracing non-blocking operations to avoid hindering the event loop’s performance.

Overall, Lourens Naudé's presentation emphasizes the importance of understanding the intricacies of the Reactor Pattern, particularly in high-throughput scenarios, while raising awareness of the potential challenges that come with event-driven programming. Developers are encouraged to learn the nuances of I/O operations and adopt efficient practices to leverage this powerful programming paradigm effectively.

In The Loop
Lourens Naudé • August 11, 2011

The Reactor Pattern's present in a lot of production infrastructure (Nginx, EventMachine, 0mq, Redis), yet not very well understood by developers and systems fellas alike. In this talk we'll have a look at what code is doing at a lower level and also how underlying subsystems affect your event driven services.

  • Below the surface: system calls, file descriptor behavior, event loop internals and buffer management
  • Evented patterns: handler types, deferrables and half-sync/half-async work
  • Anti-patterns: on-CPU time, blocking system calls and protocol gotchas


LoneStarRuby Conf 2011

00:00:00.539 [Laughter]
00:00:19.160 A couple of years ago, I had the opportunity to contribute to a trading platform built in the Ruby language. Most of the hard lessons learned there will drive the content for this talk today.
00:00:24.480 For our agenda, we're going to first look at system calls and how we access resources from user processes. We will eventually delve into file descriptors, their behavior, and the fundamental differences between blocking, non-blocking, and asynchronous I/O.
00:00:36.600 With the reactive pattern, we'll explore how that facilitates event-driven I/O, and then we will move on to a section where we examine patterns, gotchas, and best practices applicable to both new users and advanced deployments.
00:00:49.739 As with most things in life, you can never run before you walk. All the patterns presented here are framework-agnostic, meaning they're transferable to Node.js or any evented framework. For brevity's sake, we'll narrow the focus to reading, writing, and connecting.
00:01:06.479 Could I have a rough headcount of how many people are using any of these technologies in production at the moment? And how many are developing against them? Who's just here to learn? Alright, seems fair.
00:01:18.420 The sweet spot for the reactor pattern is soft real-time systems where throughput is much more important than processing time. For any given process, there will be a significant amount of in-flight I/O, and the load pattern for these processes involves very little CPU time per user request. Our goal is to drive I/O through a single core and maximize its utilization for tasks rather than waiting on any I/O.
00:01:45.959 However, it's not about the speed of a single client or request, which is what you often find debated on mailing lists, blog posts, and Twitter. Framing a move to an event-driven framework as a single-request speedup misses the point: the objective is to maintain acceptable response times for a much higher number of clients, preferably with the same or less infrastructure.
00:02:13.980 The example application we'll use throughout this talk is the relationship between a broker and different trading terminals. The protocol we'll focus on is the Financial Information Exchange (FIX), a widely-used and compact protocol in the financial services industry designed to facilitate functionalities like high-frequency trading.
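Since FIX comes up throughout the talk, here is a minimal, hedged sketch of its tag=value framing in Ruby: fields are tag=value pairs delimited by the SOH control character. The field numbers and sample message are illustrative only; real FIX messages also carry body lengths, sequence numbers, and checksums.

```ruby
# FIX fields are delimited by the SOH (\x01) control character.
SOH = "\x01"

# Split a FIX-style message into a { tag_number => value } hash.
def parse_fix(message)
  message.split(SOH).each_with_object({}) do |field, fields|
    tag, value = field.split("=", 2)
    fields[tag.to_i] = value
  end
end

# 8=BeginString, 35=MsgType (D: new order), 55=Symbol, 38=OrderQty
msg = ["8=FIX.4.2", "35=D", "55=ACME", "38=100"].join(SOH)
p parse_fix(msg)  # => {8=>"FIX.4.2", 35=>"D", 55=>"ACME", 38=>"100"}
```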
00:02:39.540 This protocol is essentially a tagged value protocol, which allows us to pack a tremendous amount of domain information into a small chunk of data. Now, jumping back to the system level: on one hand, we have the kernel, which mediates access to everything that user programs might need—be it I/O, memory, or CPU.
00:03:22.260 Our user code is anything you can deploy that runs in user mode, primarily for security and stability reasons. We access resources from our user code through the kernel using system calls. On the surface a system call looks just like the function calls you're familiar with: for instance, we attempt to read 4000 bytes into a buffer from file descriptor number five.
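In Ruby terms, a hedged sketch of that read: IO#sysread issues a single read(2) system call, bypassing Ruby's internal buffering. The file here is a stand-in for any readable descriptor.

```ruby
# IO#sysread maps directly onto the read(2) system call.
File.open("/etc/hosts") do |io|
  buf = io.sysread(4000)  # one system call; may return fewer bytes
  puts "read #{buf.bytesize} bytes from fd #{io.fileno}"
end
```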
00:03:54.780 A more effective way to explain this is through a diagram showing two Ruby processes attempting to read from disk via the kernel. The best definition I can provide for a system call is that it is a protected function call from user space to kernel space, intended to interact with system resources. An important characteristic is that it offers a uniform API across diverse systems, such as modern Unix-like systems (Linux, BSD, etc.).
00:04:40.979 Despite this uniformity, the behavior of the same function can differ between platforms. For example, the same file operation might not behave identically on different operating systems. Some systems that don't support certain functions natively may still achieve their objectives with slower emulation. Performance-wise, there are two critical points to consider: system calls are significantly slower than normal function calls—often 20 times or more—even for very simple operations.
00:05:10.920 Moreover, there is a definite context switch involved when performing these calls, which involves switching out memory references and CPU registers. Perhaps most importantly for us is the fact that system calls can block. If you do a read call on a network socket while awaiting data from a downstream client on a slow network, you will block indefinitely until the expected data arrives.
00:05:24.060 Now, let's discuss file descriptors. When we invoke the open system call for a local file with the intent to read, we receive a numerical reference that associates with various types of resources including local files, sockets, directories, and pipes. The API for this is always consistent, allowing programmers to work primarily with that numerical reference without concerning themselves with the underlying device.
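A quick illustration of that numerical reference in Ruby: IO.sysopen wraps the open system call and hands back the raw descriptor as a plain integer.

```ruby
# open(2) returns a small integer; Ruby exposes it via IO.sysopen.
fd = IO.sysopen("/etc/hosts")  # => e.g. 7, a plain Integer
io = IO.new(fd)                # re-wrap it to get Ruby's IO API back
puts "kernel gave us descriptor #{fd}"
io.close
```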
00:05:43.320 The numerical handle represents something allocated within kernel space. For network transfers, we are concerned with buffered I/O, which involves having a buffer in both kernel and user space. By default, all file descriptors are blocking, meaning if we intend to read a chunk of data from a descriptor but the current state of the buffer is insufficient, we will block until enough data is available.
00:06:17.100 To visualize this, imagine a function system call attempting to read 4000 bytes back into a buffer, but the kernel buffer currently holds only 1000 bytes. Until the buffer is full with the necessary 4000 bytes, we will remain in a blocked state. Non-blocking I/O, on the other hand, enables us to alter the behavior of file descriptors by setting a non-blocking flag on the socket. This means a read request will only be initiated if the system call will not block; otherwise, an error condition will arise.
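A sketch of that behavior in Ruby: read_nonblock attempts the read with the non-blocking flag set and raises instead of blocking when the kernel buffer is empty. Host and port are illustrative.

```ruby
require "socket"

sock = TCPSocket.new("example.com", 80)
sock.write("GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
begin
  chunk = sock.read_nonblock(4096)  # returns at once if data is buffered
  puts chunk.byteslice(0, 64)
rescue IO::WaitReadable             # EAGAIN: the call would have blocked
  IO.select([sock])                 # sleep until the descriptor is readable
  retry
end
```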
00:06:46.680 However, non-blocking mode is merely a guideline. If you invoke a system call that would block, you still pay for the context switch, and not all devices support non-blocking semantics; local files, for instance, effectively ignore the non-blocking flag. It's also essential not to confuse non-blocking I/O with asynchronous I/O, a common mistake.
00:07:02.580 The irony is that one of the best implementations of asynchronous I/O can be found on the Windows operating system, specifically through a subsystem called Windows I/O Completion Ports. In contrast, Linux has an extension of the POSIX specification known as the real-time extensions. As we look at asynchronous I/O, several points are worth noting: it is generally not applicable to network I/O at all, and it often skips double buffering, meaning that once we allocate a buffer for reading in user space, we instruct the kernel to do the work and it hands the data back to us once complete.
00:07:44.880 This transfer does not occur from the kernel's buffer to the user buffer. Another issue is that asynchronous I/O primarily supports file I/O, rendering it practically useless for network connections; in practice it is popular mainly for database systems. Now, let's recap: non-blocking I/O is a guideline, and the non-blocking flag is essentially an indication that guides how user space should attempt its reads. Moving forward, we'll stick to the terms blocking and non-blocking I/O and generally avoid referring to async I/O altogether.
00:08:41.520 Remember that the primary use case of the reactor pattern is increased throughput, which presents us with numerous challenges: we must manage a significant number of file descriptors and be able to respond to their state changes, knowing which subset is readable and writable, all while executing our domain logic and maintaining acceptable response times across all clients. The focus is not on servicing a single request but rather on managing many clients simultaneously.
00:09:49.800 At a high level, the reactor pattern can be visualized simply as an infinite loop in which we perform two kinds of work: registering new file descriptors as connections arrive and, most critically, reacting to notifications from the kernel that a descriptor has work ready or has hit an error condition, such as a dropped connection.
00:10:03.720 A better way to comprehend this is through a diagram depicting a continuous loop, cycling through the various read and write states of the registered descriptors. Who is familiar with `select` or `epoll`? Good. Multiplexed I/O exists to solve a specific problem: non-blocking I/O on its own is exceedingly inefficient.
00:10:34.560 Retrying non-blocking calls until they succeed effectively turns into polling. Multiplexed I/O instead lets us track tens of thousands of file descriptors: we hand them to the kernel to monitor and are notified of any state changes. It's crucial to distinguish between kernel space and user space here, since the multiplexing machinery itself lives entirely within the kernel.
00:11:31.920 In this example, our application in user space maintains multiple file descriptors registered with the multiplexing framework. If we add 200 file descriptors, our first step is registering them. Then, as we proceed with work, we loop continuously, occasionally being awakened by the multiplexer when state changes occur.
00:12:22.500 The multiplexer may notify us that of the 200 descriptors registered, perhaps 10 are ready to read and another 20 to write. At that point, we can identify which subset requires our attention and effectively respond to those needs.
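A toy version of that loop using Ruby's IO.select, the portable ancestor of epoll and kqueue: register descriptors, sleep until the kernel reports readiness, and service only the ready subset. The port and the echo "domain logic" are made up for illustration.

```ruby
require "socket"

server  = TCPServer.new(9000)
clients = []

loop do
  readable, = IO.select([server, *clients])  # block until something is ready
  readable.each do |io|
    if io == server
      clients << server.accept               # new connection: register it
    else
      begin
        io.write(io.read_nonblock(4096))     # "domain logic": echo it back
      rescue EOFError, Errno::ECONNRESET
        clients.delete(io)                   # peer hung up: deregister
        io.close
      end
    end
  end
end
```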
00:12:44.520 It’s crucial to recognize that these notifications or state changes—readiness for reading and writing—are events handled by our user code through callbacks or event handlers. However, a reactor pattern solely delivers I/O concurrency; it does not ensure that any user code executes concurrently. All such code operates on the same thread within the context of the while loop.
00:13:15.150 Therefore, callbacks should consume minimal CPU time per request. When too much time is concentrated on a particular callback, it denies service to current clients and limits response availability for others. Each iteration through the event loop is termed a 'tick'. Typically, several thousand ticks can be achieved per second, with every tick presenting an opportunity for work; hence they should use minimal CPU time.
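A small EventMachine demonstration of that single-threaded point: one callback that sleeps stalls every callback queued behind it.

```ruby
require "eventmachine"

# Both timers share the reactor thread: the sleeping callback blocks the
# loop, so the second timer fires roughly two seconds late, not at 100 ms.
EM.run do
  EM.add_timer(0)   { sleep 2 }               # a badly behaved callback
  EM.add_timer(0.1) { puts "late!"; EM.stop } # starved by the one above
end
```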
00:14:07.640 Before we transition to the next section, I’d like to allocate a minute or so for Q&A regarding this lower-level content. Are there any questions at this point?
00:14:53.640 Audience question: so the handling code runs within the tick. Does a tick cover both the user code and the work the reactor does itself?
00:15:15.180 Indeed, that's correct. The tick comprises both the work that needs to be executed and the handling of requests. When we receive notification about an available read, the read happens in user space within the context of the while loop, in conjunction with the callbacks.
00:15:47.160 The first point of confusion is differentiating between the various operating layers. For our purposes the stack looks like this: the multiplexer at the bottom feeds the dispatch handler, which in turn links to our application handlers and domain logic at the top.
00:16:09.000 The dispatch handler within the reactor framework is responsible for responding to low-level events, directly driven by the multiplexer: handling incoming read notifications, parsing and processing data, buffering transfers, and so on. Data received over the wire is converted into Ruby strings on receipt, and converted back when data is sent out.
00:16:38.999 This defines a command-driven interface: when the multiplexer instructs us to perform work, it needs to happen as quickly as possible. Application handlers, by contrast, are responsible for high-level events such as connection establishment and received data, and it is in their callbacks that complex protocol parsing and high-level business logic live.
00:17:07.080 One additional point: as the reactor surfaces new events, we should extend these handler interfaces for better encapsulation. Avoid implementing complex business logic directly inside callbacks; instead, delegate those tasks to dedicated, encapsulated handlers, as sketched below.
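A hedged sketch of that delegation using EventMachine's connection API; OrderHandler is our own invented class standing in for real domain logic.

```ruby
require "eventmachine"

class OrderHandler
  def handle(message)
    # parse the message and run business logic here
  end
end

class BrokerConnection < EM::Connection
  def post_init
    @handler = OrderHandler.new  # one handler per connection
  end

  def receive_data(data)         # low-level event from the reactor
    @handler.handle(data)        # delegate immediately; stay cheap here
  end
end

# EM.run { EM.start_server("0.0.0.0", 9000, BrokerConnection) }
```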
00:17:51.300 Now, discussing testing: you need not rely on a transport for testing. At the top of our implementation, we have a connection callback definition. When we receive data, we enqueue it elsewhere and do not need to depend on the framework to manage this during testing.
00:18:20.760 In this context, we can easily stub out these interfaces in our tests: we simulate a data-receipt event and exercise the data processing independently, as in the sketch below.
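For example, a minimal Minitest sketch: the handler is driven directly with the same bytes receive_data would pass it, so no socket or running reactor is needed. RecordingHandler is a hand-rolled stub.

```ruby
require "minitest/autorun"

class RecordingHandler
  attr_reader :messages

  def initialize
    @messages = []
  end

  def handle(message)
    @messages << message
  end
end

class DataReceiptTest < Minitest::Test
  def test_simulated_data_receipt
    handler = RecordingHandler.new
    handler.handle("35=D\x0155=ACME\x01")  # what receive_data would pass in
    assert_equal ["35=D\x0155=ACME\x01"], handler.messages
  end
end
```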
00:19:02.880 This brings us to the confusion surrounding the synchronous and asynchronous layers: events and state changes on file descriptors happen at the kernel level, but they are ultimately handled in user code. We have two main options for ensuring clients are continuously serviced. One is to use a background thread pool, as provided by frameworks like EventMachine; this lets us defer CPU-bound work without affecting other clients needing service (see the sketch below).
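A sketch of the thread-pool option using EventMachine's EM.defer, which runs the first block on an internal thread pool and the callback back on the reactor thread; the summation stands in for real CPU-bound or blocking work.

```ruby
require "eventmachine"

EM.run do
  work     = proc { (1..5_000_000).reduce(:+) }         # off the reactor thread
  callback = proc { |sum| puts "sum: #{sum}"; EM.stop } # back on the loop
  EM.defer(work, callback)
end
```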
00:19:48.240 Our preferred order within each tick would be to handle I/O first, making sure clients are serviced, and only then take on deferred user work. By routing callbacks this way, we can schedule workloads to run in later ticks, ensuring balanced servicing across all connections.
00:20:35.640 However, be cautious about consuming too much CPU time within that loop. Everything runs on a single core of what is likely a multi-core system, so blocking the loop means no other client gets serviced; there is no preemptive time-sharing to fall back on. Common pitfalls are tight loops that grind through work in one go, starving a multitude of clients of service.
00:21:58.740 A better approach is to tick the work: perform it over a series of loop iterations, handling a small unit each time so the event loop can still service other clients in between, and cycle through the queued work until it's fully drained. A sketch follows.
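A sketch of that ticking approach with EM.next_tick: drain a queue in small, bounded slices, yielding the loop between slices. The queue contents and slice size are illustrative.

```ruby
require "eventmachine"

def drain_in_ticks(queue)
  return EM.stop if queue.empty?          # done: shut the sketch down
  queue.shift(100).each { |n| n * n }     # one small unit of work per tick
  EM.next_tick { drain_in_ticks(queue) }  # reschedule the remainder
end

EM.run { drain_in_ticks((1..10_000).to_a) }
```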
00:22:43.440 Next, protocol parsers. Mongrel, for instance, ships a native extension for efficient HTTP header parsing, because parsing happens on both ends of the data pipeline and its cost is paid on every message. An inefficient parser taxes every byte that crosses the wire, so my advice is to use optimized C extensions for protocol handling.
00:23:33.840 Another aspect is name resolution during connections. If a connection to slowbroker.net takes 20 seconds to resolve, the entire event loop is blocked for that whole time. Rather than allowing this, we should use an asynchronous resolver that performs the lookup without holding up other clients in the meantime; one approach is sketched below.
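One hedged way to do this with stdlib pieces: push the lookup onto EventMachine's thread pool with EM.defer and connect once the address is known. (Later EventMachine versions also ship an async DNS resolver; this sketch deliberately avoids depending on its exact API.) The hostname is made up.

```ruby
require "eventmachine"
require "resolv"

EM.run do
  EM.defer(
    proc { Resolv.getaddress("slowbroker.example") rescue nil }, # may take seconds
    proc { |ip| puts "resolved: #{ip.inspect}"; EM.stop }        # back on the loop
  )
end
```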
00:24:33.720 Similarly, when talking to HTTP endpoints from within the event loop, use async HTTP client APIs and make sure keep-alive is employed so you don't pay connection setup overhead on every request. Remember too that not all file descriptors support non-blocking I/O, which becomes problematic when working with cloud storage such as EBS volumes, which tend to exhibit variable performance characteristics.
00:25:19.920 If we read from local files with the non-blocking flag set, it's crucial to remember that unpredictable system load can still introduce delays, because the flag is effectively ignored for regular files. I've previously worked on a Ruby extension built on the libuv library that performs reads and writes on background threads, maintaining high concurrency without blocking calls landing in the event loop.
00:26:43.160 The libraries you use must also be event-driven; it's pointless to pair a high-performance reactor with blocking calls. Standard synchronous MySQL queries are a major example: a single query can make us wait for seconds, and during that time our reactor loop isn't admitting new requests.
00:27:29.880 Conversely, executing queries through an asynchronous interface lets the loop keep servicing other clients while the database works; a hedged sketch follows.
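A hedged sketch using the mysql2 gem's EventMachine client, which returns a deferrable rather than blocking the reactor; credentials and the query are placeholders, and the exact API may vary across gem versions.

```ruby
require "mysql2/em"

EM.run do
  client   = Mysql2::EM::Client.new(host: "localhost", username: "app")
  deferred = client.query("SELECT SLEEP(2)")   # the reactor keeps spinning
  deferred.callback { |rows| puts "rows: #{rows.to_a.size}"; EM.stop }
  deferred.errback  { |err|  warn err.message; EM.stop }
end
```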
00:28:40.260 It's equally important not to stray into blocking territory through Ruby's memory management. In a garbage-collected language, sloppy allocation inside callbacks becomes a performance hurdle: GC pauses range anywhere from milliseconds to over a second, depending on heap size. Keep allocation patterns in callbacks efficient, and use low-overhead connection proxies to transmit data between descriptors directly under the reactor's management.
00:29:23.370 Buffer sizes also play a significant role. Since most network services ride on TCP, reads typically arrive in chunks of around 4 KB, something protocol parsers must stay cognizant of. Boundary-based parsing, demarcating messages with headers, footers, or delimiters, simplifies this and lets large volumes of data be processed swiftly; a buffering sketch follows.
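A minimal sketch of delimiter-based framing in user space: because TCP delivers arbitrary chunks, the parser buffers partial messages and emits only complete frames. The class is our own; EventMachine ships a BufferedTokenizer with much the same shape.

```ruby
class FrameBuffer
  def initialize(delimiter)
    @delimiter = delimiter
    @buffer    = ""
  end

  # Feed a raw chunk; get back every frame it completes.
  def extract(chunk)
    @buffer << chunk
    frames  = @buffer.split(@delimiter, -1)
    @buffer = frames.pop  # trailing partial frame stays buffered
    frames
  end
end

fb = FrameBuffer.new("\n")
p fb.extract("one\ntwo\nthr")  # => ["one", "two"]
p fb.extract("ee\n")           # => ["three"]
```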
00:30:17.700 To summarize, this pattern excels in time-sharing for I/O-bound services; however, using it with applications requiring extensive CPU processing may lead to pitiful performance and blocking issues. Event-driven systems using Rails may complicate things since standard Rails rendering could incur a 200 ms block on requests, posing an issue in a reactive context.
00:31:28.740 Though many adopt such patterns in production regardless, that often results in complications and represents a mismatch between the workload and the technology. Be aware of system calls and their costs, and I urge you to understand the differences between blocking, non-blocking, and asynchronous I/O, as these concepts significantly impact how you develop.
00:32:12.300 Moreover, always aim to schedule work in ticks as new requests arrive, maintaining responsiveness across all connected clients. Finally, design against interfaces first: whether data is funneled in by the framework in production or by stubs under test, your handlers shouldn't care.
00:32:53.750 Now, if anyone has further questions regarding the presentation, please feel free to ask!
00:33:09.150 We are seeking Rails, Ruby, or QA engineers at my company, Wildfire Interactive. If you're interested in something fun and exciting, please approach me in the hall or apply via the URL provided above.
00:33:20.880 If you've enjoyed this talk, even if it was a bit too rapid at times, feel free to follow me on Twitter at @methodmissing or check out my GitHub at methodmissing. Thank you.
00:33:39.220 Thank you.