
Descent into Darkness: Understanding Your System's Binary Interface Is the Only Way Out

by Joe Damato

In the talk titled "Descent into Darkness: Understanding Your System's Binary Interface Is the Only Way Out," Joe Damato addresses the challenges faced by Ruby developers regarding memory management and performance optimization. His discussion revolves around the issues of reference leaks in Ruby, how they affect memory usage, and strategies to diagnose and mitigate these leaks through low-level programming techniques.

Key Points Discussed:
- Introduction to Memory Management Issues: Damato introduces the concept of memory leaks in Ruby, explaining how retaining references to objects prevents the garbage collector (GC) from freeing memory, thus impacting application performance.
- Understanding the Application Binary Interface (ABI): He explains the significance of ABI, particularly regarding 64-bit architecture, mentioning relevant tools like nm, objdump, and readelf, which aid developers in examining binaries and understanding how data is managed at a low level.
- Implementation of Memory Profiling: The main tool presented is Memprof, a Ruby gem for memory profiling that allows developers to pinpoint where memory leaks occur in their applications without the need for extensive Ruby patches.
- Low-Level Assembly Code Insight: Damato delves into assembly language, illustrating how to manipulate function calls within the Ruby binary to track memory allocation and deallocation. He highlights the difference between AT&T and Intel assembly syntax, emphasizing the need for proper understanding of registers and calling conventions.
- Using the Procedure Linkage Table (PLT): The talk elaborates on the PLT's importance for resolving function addresses in shared libraries and how Memprof interacts with this aspect for precise memory tracking.
- Practical Examples and Outputs: As he walks the audience through practical examples, he discusses how Memprof can display insights into memory usage, helping developers address leaks and optimize performance in Ruby applications.

Conclusion and Takeaways:
- Joe Damato concludes with a call for collaboration on Memprof and encourages developers to explore low-level programming to better understand and optimize their Ruby applications. The talk highlights the importance of memory management in application performance and offers tools and methodologies for addressing memory-related issues in Ruby effectively.

This technical discussion is aimed at Ruby developers looking to deepen their understanding of memory management and utilize advanced techniques for software optimization.

00:00:15.360 Welcome to this talk titled "Descent into Darkness: Understanding Your System's Binary Interface Is the Only Way Out." We will discuss what that means shortly.
00:00:21.359 So, who am I? I'm Joe Damato. I live in San Francisco, used to work at VMware, and went to CMU. Some of the projects I've worked on include the first version of Memprof that you saw in the web demo yesterday, some ltrace patches, and some MRI thread improvement patches.
00:00:28.240 I maintain a blog at timetobleed.com— that's the actual URL if you don't believe me. On Twitter, you can find me as Just Joe Damato. Alright, enough of that. We only have 30 minutes right now, so I probably won't have enough time to cover everything, but I'll do my best.
00:00:47.039 I do need to say welcome to flight school; we've just got to roll right now. I have no idea why this talk was accepted, since there are only about five lines of Ruby code in it. But before we actually get started, I need to introduce you to a really good friend of mine: Satan.
00:01:05.280 The reason I need to introduce my buddy Satan is that this talk is basically about how being evil is totally awesome. In other words, you shouldn't actually do any of the things I'm talking about in this talk ever— unless you do it in a VM, as you could seriously destroy your system. However, if you get it working right, amazing things are possible, which is what we're going to demonstrate.
00:01:31.600 Okay, so what's the problem? Basically, my Ruby process is 700 megabytes, and I want to know why. I don't know if you saw the talk yesterday where they were discussing object allocations being free, but I was having a seizure in the corner over there.
00:01:48.000 So we have Ruby— it's large and slow, and we want it to be small and fast. The problem is it's really easy to leak references in your Ruby code. If you leak a reference to an object, it causes that object, or any objects that it references, to stick around in memory forever.
00:02:05.680 Let me show you a picture of what that means. As long as someone is holding a reference to the object at the top, in gray, this entire tree of objects cannot be freed. That can add up to a lot of memory really fast. Importantly, every GC cycle will scan this tree of objects and burn CPU cycles checking if they can be freed, over and over again. You might say, "Hey, memory and CPUs are cheap; I don’t really care, I'm not scared." But I would remind you that Ruby's GC is a naive stop-the-world mark-sweep GC, meaning that more objects sticking around in memory increases the length of GC runs.
00:02:48.000 Longer GC runs mean less time your app has to run Ruby code, and that's bad. So we want to eliminate leaked references to reduce the length of GC runs, allowing us to run more Ruby app code, which makes everybody happy.
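To make that cost argument concrete, here is a minimal sketch in C of a naive mark phase. This is not Ruby's GC code, just the shape of the problem: a single retained root means every object reachable from it gets visited on every single collection.

```c
#include <stddef.h>

/* Hypothetical node type: not Ruby's RVALUE, just enough structure
   to show why a leaked root is expensive. */
typedef struct node {
    int marked;
    struct node *children[2];
} node;

/* A naive mark phase visits every reachable object on every GC run,
   so one leaked reference to the root keeps the whole subtree both
   alive and repeatedly scanned. */
static void mark(node *n) {
    if (n == NULL || n->marked)
        return;
    n->marked = 1;
    for (size_t i = 0; i < 2; i++)
        mark(n->children[i]);
}
```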
00:03:41.760 But how can we track down these reference leaks? For anyone who knows me, I'm lazy and don’t want to do unnecessary work, especially if I can make someone else do it for me. This means I don’t want to apply patches to Ruby or rebuild it, as those tasks are hard, and I want to maintain binary compatibility with my extensions.
00:04:03.840 I simply want this to work as a gem, requiring just the Ruby gem to do memory profiling. Anything beyond that is too much work. Luckily, my friend Satan has my back, and we can do some pretty tricky things to get this going.
00:04:14.879 AMD64 is just a processor specification; both Intel and AMD have implemented it. If you see AMD64, it applies to Intel CPUs as well; it's just the name of the spec. So what is an ABI, an application binary interface? According to Wikipedia, it describes the low-level interface between a program and the operating system or another application. Within it are details like data type alignment, calling conventions, and object file formats.
00:05:09.919 As for where these ABIs are written down, there's the System V ABI, which is quite extensive at 271 pages, plus architecture-specific supplements. The remarkable aspect of the AMD64 supplement is that it references the 32-bit supplement with some modifications. If you're looking to do low-level work on the 64-bit x86 architecture, you need to read all three of these documents.
00:05:47.760 Now let's go through some of this information quickly. We need a few binary-inspection tools to get things going. For example, "nm" is available by default on OS X and Linux, and it dumps a symbol table. "objdump" disassembles many different object formats, while "readelf" is specific to Linux and dumps ELF-specific information about a binary, which we'll discuss shortly.
00:06:10.479 Here's a sample usage of "nm" on the Ruby binary. It displays the symbol values and associated names. Similarly, "objdump" with the option "-d" allows us to disassemble Ruby to gain further insights. The disassembled output shows offsets, opcodes, instructions, and useful metadata.
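As a quick illustration of what those tools report, here is a throwaway C file (my own example, not one from the talk) together with the commands you would run against it.

```c
/* Build and inspect with the tools just mentioned:
     gcc -O2 -c demo.c
     nm demo.o         # symbol values, types, and names, e.g. a 'D' entry
                       # for the_answer (data) and a 'T' entry for
                       # get_answer (text/code)
     objdump -d demo.o # disassembles the .text section, showing offsets,
                       # opcodes, and instructions
*/
long the_answer = 42;

long get_answer(void) {
    return the_answer;
}
```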
00:06:55.680 As we proceed, we need to familiarize ourselves with registers: small, fast storage locations on the CPU, each with a specific role.
00:07:00.800 For example, the "rax" register holds the return value from a function, and "rip" points to the currently executing instruction. You can refer to registers in parts, such as "eax," which corresponds to the lower 32 bits of "rax".
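A small example of those two facts (a sketch assuming gcc on x86-64, not something from the slides): the return value travels in rax, and writing eax writes the low 32 bits of that same register.

```c
/* Compile with `gcc -O2 -c retval.c` and disassemble with
   `objdump -d retval.o`; the body comes out as roughly:
       mov $0x2a,%eax
       ret
   The caller reads the result from rax; because 42 fits in 32 bits,
   the compiler writes eax, the lower half of rax. */
long answer(void) {
    return 42;
}
```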
00:07:41.280 It’s important to note that there are two different syntaxes for assembly: AT&T (or GAS) syntax and Intel syntax. The GNU tools default to GAS, but you can switch to Intel syntax if you wish. The assembly examples I show will be in AT&T syntax, and yes, a lot of assembly is coming up.
00:09:10.000 Let's discuss how data movement works. In AT&T syntax, the operand order is source, then destination: for instance, moving an immediate zero into "rbx", or moving one register into another. One restriction to note is that the source and destination cannot both be memory operands.
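Here is a tiny illustration of that source-then-destination order using GCC inline assembly (my example, not from the talk); the function just shuffles a value through rbx and back out.

```c
/* AT&T syntax: the source operand comes first, the destination second. */
long att_moves(long x) {
    long y;
    __asm__ ("movq $0, %%rbx\n\t"   /* immediate -> register          */
             "movq %1, %%rbx\n\t"   /* register  -> register          */
             "movq %%rbx, %0"       /* register  -> register (output) */
             : "=r" (y)
             : "r" (x)
             : "rbx");
    return y;
}
```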
00:10:15.760 There are various ways to call functions on a 64-bit CPU, but for this talk we care about two: indirect calls through an absolute address and direct calls. I know this sounds complex, but I will clarify it visually shortly.
00:10:50.240 Regarding calling conventions, integer arguments are passed in registers from left to right, starting with "rdi" for the first argument. This arrangement has to comply with the ABI, which also requires that the end of the argument area be 16-byte aligned.
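A compact way to see the convention (a sketch, not the talk's example): compile this and disassemble it, and you can watch the six integer arguments arrive in rdi, rsi, rdx, rcx, r8, and r9, with the result leaving in rax.

```c
/* Under the System V AMD64 ABI, the first six integer/pointer arguments
   are passed in rdi, rsi, rdx, rcx, r8, r9 (left to right), and the
   return value comes back in rax. */
long sum6(long a, long b, long c, long d, long e, long f) {
    /* a=rdi, b=rsi, c=rdx, d=rcx, e=r8, f=r9 */
    return a + b + c + d + e + f;   /* result placed in rax */
}
```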
00:11:05.920 To move forward, I'm going to walk through some assembly code alongside its corresponding C code. On one side, you'll see Intel syntax, while on the other side will be AT&T syntax, with C code displayed at the bottom.
00:11:44.000 We begin with the two assembly instructions that set up the stack frame, saving the old base pointer and capturing the current stack pointer. Following that, the setup moves an argument into its proper slot on the stack. From there, the instructions match your C code directly, such as initializing local variables to zero.
00:12:13.440 Next, we might be setting a value in a register and performing addition. The second instruction can seem obscure, but it's generated by GCC for specific optimizations, though ultimately it doesn't affect the output.
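For readers following along without the slides, here is a function of roughly the shape being described (an assumed reconstruction, not the talk's actual example), with the prologue, argument spill, zero-initialized local, and addition called out.

```c
/* Compiled without optimization (gcc -O0), the assembly for this starts
   with the frame setup (push %rbp; mov %rsp,%rbp), spills the argument
   into the stack frame, zero-initializes the local, then performs the add. */
int add_example(int x) {
    int y = 0;     /* e.g. movl $0x0,-0x4(%rbp)              */
    y = x + 1;     /* load x from the frame, add 1, store y  */
    return y;      /* result moved into eax before returning */
}
```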
00:12:59.200 Now let's look at ELF objects— they can be executables or libraries. Each shared library loaded into a process has a set of independent data structures that work together using the runtime dynamic linker. This modular setup requires headers to describe the object, including segment and section types.
00:13:45.920 Memprof utilizes "libelf" to gather useful information as it navigates through ELF objects. The text segment is where the code resides, and the Procedure Linkage Table (PLT) is essential for resolving absolute function addresses, especially since shared objects can be mapped at arbitrary addresses.
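To show the flavor of that libelf work, here is a minimal section walker (an illustration in the spirit of what Memprof does, not its actual code); it just prints every section name, which is where you would spot .text, .plt, and .got.plt.

```c
#include <fcntl.h>
#include <gelf.h>
#include <stdio.h>
#include <unistd.h>

/* Walk an ELF file's section headers with libelf and print their names.
   Build with: gcc walk_sections.c -lelf */
int main(int argc, char **argv) {
    if (argc < 2 || elf_version(EV_CURRENT) == EV_NONE)
        return 1;

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0)
        return 1;

    Elf *elf = elf_begin(fd, ELF_C_READ, NULL);
    if (elf == NULL)
        return 1;

    size_t shstrndx;
    elf_getshdrstrndx(elf, &shstrndx);

    Elf_Scn *scn = NULL;
    while ((scn = elf_nextscn(elf, scn)) != NULL) {
        GElf_Shdr shdr;
        if (gelf_getshdr(scn, &shdr) != NULL)
            printf("%s\n", elf_strptr(elf, shstrndx, shdr.sh_name));
    }

    elf_end(elf);
    close(fd);
    return 0;
}
```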
00:14:31.680 As we move forward, the PLT is vital for locating functions in shared libraries at runtime. Shared libraries are position independent, meaning they can be loaded at any address. Interestingly, all of this ties back to Ruby.
00:14:57.200 We now have the necessary tools to delve deeper into Ruby's internals and track object allocation within its virtual machine. The key function called when an object is allocated is "rb_newobj". As one can imagine, if I intend to rewrite this while it executes, it's going to be a wild ride.
00:15:30.479 We want to track memory allocations and deallocations, and I aim to do this through a Ruby gem. First, we need to identify when an object is allocated, which can be achieved by scanning the Ruby binary in memory and rewriting calls to "rb_newobj" to call our handler instead.
00:15:54.560 This disassembly of Ruby shows the procedure we need to follow for locating the call operations for "rb_newobj". The objective is to adjust the displacement so that these calls redirect to our custom function.
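Here is a heavily simplified sketch of that rewriting step (an illustration, not Memprof's actual code): an x86-64 near call is the 0xE8 opcode followed by a 32-bit displacement relative to the next instruction, so redirecting a call means recomputing that displacement.

```c
#include <stdint.h>
#include <string.h>

/* Scan a region of machine code for near calls (0xE8 + rel32) whose target
   is `original` and repoint them at `replacement`. A real implementation
   also has to mprotect() the page writable first and be careful about bytes
   that merely look like 0xE8 inside other instructions. */
static void rewrite_calls(uint8_t *text, size_t len,
                          void *original, void *replacement) {
    for (size_t i = 0; i + 5 <= len; i++) {
        if (text[i] != 0xE8)
            continue;

        int32_t disp;
        memcpy(&disp, text + i + 1, sizeof(disp));   /* unaligned-safe read  */
        uint8_t *next_insn = text + i + 5;           /* rel32 is relative to */
                                                     /* the next instruction */
        if (next_insn + disp != (uint8_t *)original)
            continue;

        int64_t new_disp = (uint8_t *)replacement - next_insn;
        if (new_disp < INT32_MIN || new_disp > INT32_MAX)
            continue;                                /* target too far away  */

        disp = (int32_t)new_disp;
        memcpy(text + i + 1, &disp, sizeof(disp));
    }
}
```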
00:16:21.679 The key is that if Ruby is built with the non-shared option, we can overwrite the calls directly. If it's built shared, which is common for many distributions, we have to go through the PLT instead.
00:17:09.200 Under the PLT, each entry initially bounces to the dynamic linker. The linker fills in the global offset table with the real address at runtime, allowing subsequent calls to bypass the linker and jump directly to the required function.
00:17:41.920 To manage this, I will hook the global offset table and redirect the entry for "rb_newobj" to instead collect our internal data.
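The GOT version is even simpler in principle; here is the core of the idea (an assumed shape, not Memprof's code), where got_slot stands for the .got.plt entry you located beforehand, for instance by walking the .rela.plt relocations with libelf.

```c
/* Swap a global offset table entry so every call made through the PLT
   lands in `replacement` instead. If the GOT has been made read-only
   (full RELRO), the page must be mprotect()ed writable first. */
static void *hook_got_slot(void **got_slot, void *replacement) {
    void *original = *got_slot;  /* whatever the dynamic linker put there */
    *got_slot = replacement;     /* future PLT calls now hit our handler  */
    return original;             /* returned so the handler can chain to  */
                                 /* the real function                     */
}
```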
00:18:11.200 Interestingly, while I’m doing this, I can ensure that symbols remain resolvable as long as I don’t modify the existing symbols in the memory space.
00:18:49.200 Next, we must find out when objects are freed by the Ruby VM, which happens through the "add_freelist" function. However, this function can be inlined by the compiler, making capturing its calls tricky.
00:19:41.280 Since "add_freelist" updates the free list, I'll simply track those updates by scanning for the relevant mov instructions that write to the free list and overwriting them. This approach allows me to monitor whenever the list gets altered.
00:20:56.960 But here's the catch— I can’t simply drop a call instruction at an arbitrary place because it might violate the ABI conditions. Instead, I need an assembly stub to prepare the environment for calling my handler.
00:21:55.200 Through careful assembly, I will align the stack and save the necessary registers before redirecting execution to my handler. Once the operations are completed, the execution will return to the original point, maintaining the program state.
00:22:48.800 I've crafted a stub that allows for hooking the free list update correctly, even ensuring we handle the corresponding return paths and state. This setup is effective for tracking memory allocations and deallocations in Ruby.
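Here is a sketch of what such a stub can look like (an illustration of the idea, not Memprof's actual stub). It assumes the patched site reaches the stub through a call instruction, so a return address is already on the stack; a real stub also has to re-execute whatever instruction it displaced and worry about the SSE registers and flags.

```c
/* Hypothetical C handler the stub calls into. */
void freelist_hook(void) {
    /* record the freelist update here */
}

__asm__(
    ".text\n"
    ".globl freelist_tramp\n"
    "freelist_tramp:\n"
    "    push %rax\n"           /* preserve the caller-saved registers    */
    "    push %rcx\n"
    "    push %rdx\n"
    "    push %rsi\n"
    "    push %rdi\n"
    "    push %r8\n"
    "    push %r9\n"
    "    push %r10\n"
    "    push %r11\n"
    "    push %rbp\n"
    "    mov  %rsp, %rbp\n"     /* remember the possibly unaligned rsp    */
    "    and  $-16, %rsp\n"     /* force ABI-required 16-byte alignment   */
    "    call freelist_hook\n"  /* hand control to the C handler          */
    "    mov  %rbp, %rsp\n"     /* undo the alignment fixup               */
    "    pop  %rbp\n"
    "    pop  %r11\n"
    "    pop  %r10\n"
    "    pop  %r9\n"
    "    pop  %r8\n"
    "    pop  %rdi\n"
    "    pop  %rsi\n"
    "    pop  %rdx\n"
    "    pop  %rcx\n"
    "    pop  %rax\n"
    "    ret\n"                 /* resume where the patched code left off */
);
```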
00:23:29.000 Let me show you how the output of this process looks. By requiring Memprof in your Ruby environment and starting it properly, you’re able to see detailed insights into memory usage, allowing you to identify exactly where the leaks occur.
00:24:38.960 Additionally, with Memprof, you can track specific blocks, leveraging various middleware options to gather statistics on object counts during requests in web applications.
00:25:22.880 At memprof.com, you can explore the web interface that integrates with the output generated by the gem. You'll find summaries of memory allocations and leaks, helping you debug and optimize Ruby applications.
00:26:09.520 It's worth noting that Memprof currently operates on specific platforms and Ruby versions— primarily 64-bit Linux and specific macOS environments. However, ongoing efforts are being made to accommodate more variations and improve compatibility.
00:27:00.000 I deeply appreciate everyone’s contribution that made this possible. I rely on tools like RVM to test different Ruby versions, and I encourage you all to explore and utilize Memprof.
00:27:45.440 If anyone is interested in the code or collaborating, I can direct you to the GitHub repository for Memprof. Special thanks to those who made significant contributions and to Wayne for his testing setup.
00:28:34.240 I’d like to open the floor to questions now. If anyone has inquiries regarding this process or our methodology, feel free to reach out.
00:29:31.680 Thank you for your attention!
00:29:42.000 Yes, connecting to advanced memory access issues, this approach aligns with NX bit compatibility as well.
00:29:55.200 As a final point, good educational resources for understanding low-level programming can aid in grasping these concepts.
00:30:06.960 For further insights on adapting similar methods in other environments, options exist within different programming contexts— feel free to explore!
00:30:38.720 Thank you once again, and I look forward to your feedback.
00:30:57.120 Any remaining questions?
00:31:08.720 Well, that's it! I'll see you around.