00:00:15.360
Welcome to this talk titled "Descent into Darkness: Understanding Your System's Binary Interface Is the Only Way Out." We will discuss what that means shortly.
00:00:21.359
So, who am I? I'm Joe Damato. I live in San Francisco, used to work at VMware, and went to CMU. Some of the projects I've worked on include the first version of Memprof that you saw in the web demo yesterday, some ltrace patches, and some MRI thread improvement patches.
00:00:28.240
I maintain a blog at timetobleed.com. That's the actual URL, if you don't believe me. On Twitter, I'm just joedamato. Alright, enough of that. We only have 30 minutes, so I probably won't have time to cover everything, but I'll do my best.
00:00:47.039
I need to say welcome to flight school— we just got to roll right now. I have no idea why this talk was accepted since there are only about five lines of Ruby code in it. But before we actually get started, I need to introduce you to a really good friend of mine— Satan.
00:01:05.280
The reason I need to introduce my buddy Satan is that this talk is basically about how being evil is totally awesome. In other words, you shouldn't actually do any of the things I'm talking about in this talk ever— unless you do it in a VM, as you could seriously destroy your system. However, if you get it working right, amazing things are possible, which is what we're going to demonstrate.
00:01:31.600
Okay, so what's the problem? Basically, my Ruby process is 700 megabytes, and I want to know why. I don't know if you guys saw the talk yesterday that said object allocations are free, but I was having a seizure in the corner over there.
00:01:48.000
So we have Ruby— it's large and slow, and we want it to be small and fast. The problem is it's really easy to leak references in your Ruby code. If you leak a reference to an object, it causes that object, or any objects that it references, to stick around in memory forever.
00:02:05.680
Let me show you a picture of what that means. As long as someone is holding a reference to the object at the top, in gray, this entire tree of objects cannot be freed. That can add up to a lot of memory really fast. Importantly, every GC cycle will scan this tree of objects and burn CPU cycles checking if they can be freed, over and over again. You might say, "Hey, memory and CPUs are cheap; I don’t really care, I'm not scared." But I would remind you that Ruby's GC is a naive stop-the-world mark-sweep GC, meaning that more objects sticking around in memory increases the length of GC runs.
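The picture described above can be approximated with a toy mark phase. This is a minimal Python sketch, not Ruby's actual GC; the `Obj` class, the `mark` function, and the object names are all made up for illustration. It shows that one forgotten reference to the top object keeps the whole tree reachable, and that the collector has to walk every one of those objects on each cycle.

```python
class Obj:
    """A toy heap object that holds references to other objects."""
    def __init__(self, name):
        self.name = name
        self.refs = []

def mark(roots):
    """Return the names of all objects reachable from the given roots."""
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj.name in marked:
            continue
        marked.add(obj.name)
        stack.extend(obj.refs)   # follow every outgoing reference
    return marked

# Build a small tree: top -> a, b and a -> c (hypothetical objects).
top, a, b, c = Obj("top"), Obj("a"), Obj("b"), Obj("c")
top.refs = [a, b]
a.refs = [c]

leaked_roots = [top]           # a forgotten reference, e.g. a global cache
live = mark(leaked_roots)      # the mark phase visits all four objects
after_fix = mark([])           # drop the reference and nothing is reachable
```

With the leaked root in place, `live` contains all four names; once the root is gone, `after_fix` is empty and a sweep could reclaim the whole tree.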
00:02:48.000
Longer GC runs mean less time your app has to run Ruby code, and that's bad. So we want to eliminate leaked references to reduce the length of GC runs, allowing us to run more Ruby app code, which makes everybody happy.
00:03:41.760
But how can we track down these reference leaks? For anyone who knows me, I'm lazy and don’t want to do unnecessary work, especially if I can make someone else do it for me. This means I don’t want to apply patches to Ruby or rebuild it, as those tasks are hard, and I want to maintain binary compatibility with my extensions.
00:04:03.840
I simply want this to work as a gem, requiring just the Ruby gem to do memory profiling. Anything beyond that is too much work. Luckily, my friend Satan has my back, and we can do some pretty tricky things to get this going.
00:04:14.879
AMD64 is just a processor specification. Both Intel and AMD have implemented it. If you see AMD64, it applies to Intel CPUs as well; it's just the name of the spec. So what is an ABI, an application binary interface? According to Wikipedia, it describes the low-level interface between a program and the operating system or another application. It covers details like data type alignment, calling conventions, and object file formats.
00:05:09.919
Where do you find these ABIs? There's the System V ABI, which is quite extensive at 271 pages. In addition, there are architecture-specific supplements. The remarkable thing about the AMD64 supplement is that it refers back to the 32-bit supplement with some modifications. If you want to do low-level work on 64-bit x86, you need to read all three of these documents.
00:05:47.760
Now let's go through some of this information quickly. We need to use some advanced tools to get things running. For example, "nm" is available by default on OS X and Linux, and it dumps a symbol table. "objdump" disassembles many different object formats, while "readelf" is specific to Linux and dumps ELF-specific information about a binary, which we'll discuss shortly.
00:06:10.479
Here's a sample usage of "nm" on the Ruby binary. It displays the symbol values and associated names. Similarly, "objdump" with the option "-d" allows us to disassemble Ruby to gain further insights. The disassembled output shows offsets, opcodes, instructions, and useful metadata.
00:06:55.680
As we proceed, we need to familiarize ourselves with registers, small fast memory spaces on CPUs, which have specific roles.
00:07:00.800
For example, the "rax" register holds the return value from a function, and "rip" points to the currently executing instruction. You can refer to registers in parts, such as "eax," which corresponds to the lower 32 bits of "rax".
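The way a register can be referred to in parts can be modeled with bit masks. The Python sketch below is only an illustration; the function name `sub_registers` is made up, and the `ax`, `al`, and `ah` views are standard x86 names that the talk does not mention explicitly.

```python
MASK64 = (1 << 64) - 1

def sub_registers(rax):
    """Return the narrower views that alias a 64-bit rax value."""
    rax &= MASK64
    return {
        "rax": rax,                  # full 64 bits
        "eax": rax & 0xFFFFFFFF,     # lower 32 bits
        "ax": rax & 0xFFFF,          # lower 16 bits
        "al": rax & 0xFF,            # lowest byte
        "ah": (rax >> 8) & 0xFF,     # second-lowest byte
    }

views = sub_registers(0x1122334455667788)
```

For that input, the `eax` view is `0x55667788`, the lower half of the 64-bit value.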
00:07:41.280
It’s important to note that there are two different syntaxes for assembly: AT&T (or GAS) syntax and Intel syntax. The GNU tools default to GAS, but you can switch to Intel syntax if you wish. The assembly examples I show will be in AT&T syntax, and yes, a lot of assembly is coming up.
00:09:10.000
Let's discuss how data movement works. In AT&T syntax, operands are written source first, then destination. For instance, you can move the immediate value zero into "rbx", or move one register into another. Note that the source and destination cannot both be memory.
00:10:15.760
There are various ways to call functions on a 64-bit CPU, but for this talk we care about two: an indirect call through an absolute address, and a direct call whose target is encoded as a displacement from the next instruction. I know this sounds complex, but I'll clarify it visually shortly.
00:10:50.240
Regarding calling conventions, arguments are passed in registers from left to right, starting with "rdi" for the first argument. The ABI also requires the stack to be properly aligned (to 16 bytes on AMD64) at the end of the argument area.
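As a rough illustration of the left-to-right register assignment, here is a Python sketch that covers only integer and pointer arguments under the System V AMD64 convention. The function name `assign_integer_args` is made up; floating-point arguments and structs passed by value follow other rules that this sketch ignores.

```python
# Integer/pointer argument registers, in order, per the System V AMD64 ABI.
INT_ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

def assign_integer_args(args):
    """Split arguments into those passed in registers and those on the stack."""
    in_regs = dict(zip(INT_ARG_REGS, args))
    # Anything past the sixth argument is passed on the stack
    # (pushed right to left, so the seventh ends up at the lowest address).
    on_stack = list(args[len(INT_ARG_REGS):])
    return in_regs, on_stack

regs, stack = assign_integer_args([1, 2, 3, 4, 5, 6, 7, 8])
```

Here the first argument lands in `rdi`, the sixth in `r9`, and the last two spill to the stack.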
00:11:05.920
To move forward, I'm going to walk through some assembly code alongside its corresponding C code. On one side, you'll see Intel syntax, while on the other side will be AT&T syntax, with C code displayed at the bottom.
00:11:44.000
We begin with the two assembly instructions saving the old stack pointer. Following that, the setup involves moving an argument into the proper stack location. The instructions will match your C code directly, such as initializing local variables to zero.
00:12:13.440
Next, we might be setting a value in a register and performing addition. The second instruction can seem obscure, but it's generated by GCC for specific optimizations, though ultimately it doesn't affect the output.
00:12:59.200
Now let's look at ELF objects— they can be executables or libraries. Each shared library loaded into a process has a set of independent data structures that work together using the runtime dynamic linker. This modular setup requires headers to describe the object, including segment and section types.
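To give a feel for the headers that describe an ELF object, the Python sketch below decodes the fixed 64-byte ELF64 file header with the standard `struct` module. It is only an illustration and parses a synthetic header built in the example itself, not a real binary; the function name and the returned dictionary keys are made up.

```python
import struct

# ELF64 file header: e_ident[16], e_type, e_machine, e_version, e_entry,
# e_phoff, e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum,
# e_shentsize, e_shnum, e_shstrndx (64 bytes, little-endian here).
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")
ET_NAMES = {2: "executable", 3: "shared object"}
EM_X86_64 = 62

def parse_elf64_header(data):
    """Decode a few fields from the start of a little-endian ELF64 object."""
    fields = ELF64_HEADER.unpack_from(data)
    ident, e_type, e_machine = fields[0], fields[1], fields[2]
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF object")
    return {
        "class_64bit": ident[4] == 2,
        "little_endian": ident[5] == 1,
        "type": ET_NAMES.get(e_type, "other"),
        "is_x86_64": e_machine == EM_X86_64,
        "section_header_offset": fields[6],
        "section_count": fields[12],
    }

# A synthetic header (all values made up) describing a shared object.
ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
sample = ELF64_HEADER.pack(ident, 3, EM_X86_64, 1, 0, 64, 0x1000, 0,
                           64, 56, 0, 64, 5, 4)
info = parse_elf64_header(sample)
```

The section header offset and count recovered here are what a tool would follow next to find individual sections.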
00:13:45.920
Memprof uses "libelf" to gather useful information as it walks through ELF objects. The .text section is where the code resides, and the Procedure Linkage Table (PLT) is essential for resolving absolute function addresses, since shared objects can be mapped at arbitrary addresses.
00:14:31.680
The PLT is how functions in shared libraries are located at runtime. Shared libraries are position independent, meaning they can be loaded at any address. And yes, all of this ties back to Ruby.
00:14:57.200
We have the necessary tools to delve deeper into Ruby's internals and track object allocation within its virtual machine. The key function, called whenever an object is allocated, is "rb_newobj". As you can imagine, if I intend to rewrite calls to it while the program executes, it's going to be a wild ride.
00:15:30.479
We want to track memory allocations and deallocations, and I aim to do this through a Ruby gem. First, we need to identify when an object is allocated, which can be achieved by scanning the Ruby binary in memory and rewriting calls to "rb_newobj" to call our handler instead.
00:15:54.560
This disassembly of Ruby shows what we need to look for: the call instructions that target "rb_newobj". The objective is to adjust the displacement in each one so that these calls redirect to our custom function.
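The displacement arithmetic can be sketched in Python on a byte buffer. This is an illustration under simplifying assumptions, not Memprof's code: it handles only the 5-byte `call rel32` form (opcode `0xE8`), it scans bytes naively rather than disassembling (so a real tool would have to guard against `0xE8` appearing inside another instruction), and all addresses and function names are hypothetical.

```python
import struct

CALL_REL32 = 0xE8   # opcode of the 5-byte direct call
CALL_LEN = 5

def call_target(code, base, offset):
    """Decode the target of the `call rel32` located at code[offset]."""
    (rel,) = struct.unpack_from("<i", code, offset + 1)
    # The displacement is relative to the address of the next instruction.
    return base + offset + CALL_LEN + rel

def redirect_calls(code, base, old_target, new_target):
    """Rewrite every direct call to old_target so it calls new_target."""
    patched = 0
    for offset in range(len(code) - CALL_LEN + 1):
        if code[offset] != CALL_REL32:
            continue
        if call_target(code, base, offset) != old_target:
            continue
        rel = new_target - (base + offset + CALL_LEN)
        # pack_into raises struct.error if rel does not fit in 32 bits.
        struct.pack_into("<i", code, offset + 1, rel)
        patched += 1
    return patched

BASE = 0x400000                  # hypothetical load address of the code
OLD, NEW = 0x400100, 0x400800    # hypothetical allocator and handler
OTHER = 0x400200                 # some unrelated function
code = bytearray(
    b"\x90" + b"\xe8" + struct.pack("<i", OLD - (BASE + 6)) +
    b"\x90" + b"\xe8" + struct.pack("<i", OTHER - (BASE + 12))
)
count = redirect_calls(code, BASE, OLD, NEW)
```

Only the first call is rewritten; the call to the unrelated function keeps its original displacement. The 32-bit limit on the displacement is also why the handler has to live within 2 GB of the call site for this form of call to reach it.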
00:16:21.679
The key is that if Ruby is built non-shared, the calls are direct and we can overwrite them in place. If it's built shared (which is common for many distributions), the calls go through the PLT, so we need a different approach.
00:17:09.200
In the PLT, each entry starts out pointing at a stub that leads into the dynamic linker. The linker fills in the table with addresses at runtime, so subsequent calls bypass the linker and jump directly to the required function.
00:17:41.920
To manage this, I hook the global offset table and redirect the entry that would reach "rb_newobj" so that it calls our handler instead, which collects our internal data.
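Lazy binding and the table hook can be modeled at a very high level in Python. This toy model has nothing to do with real machine code: the `ToyProcess` class, `hook_got`, and the `allocate_object` stand-in are all invented for illustration. It only captures the control flow, where the first call goes through a resolver, later calls hit the filled-in table entry, and overwriting that entry diverts every later call to a handler.

```python
class ToyProcess:
    """Models PLT-style calls backed by a lazily filled offset table."""
    def __init__(self, symbols):
        self.symbols = symbols       # what the "dynamic linker" can resolve
        self.got = {}                # filled in on first use
        self.resolver_calls = 0

    def call_via_plt(self, name, *args):
        if name not in self.got:     # first call: go through the resolver
            self.resolver_calls += 1
            self.got[name] = self.symbols[name]
        return self.got[name](*args) # later calls jump straight through

def hook_got(process, name, handler):
    """Point the table entry at handler, which receives the original."""
    original = process.got.get(name, process.symbols[name])
    process.got[name] = lambda *args: handler(original, *args)

def allocate_object():               # stand-in for the VM's allocator
    return object()

allocations = []

def tracking_handler(original, *args):
    obj = original(*args)            # still perform the real allocation
    allocations.append(id(obj))      # record it for the profiler
    return obj

proc = ToyProcess({"allocate_object": allocate_object})
proc.call_via_plt("allocate_object")   # resolved lazily
proc.call_via_plt("allocate_object")   # table hit, no resolver
hook_got(proc, "allocate_object", tracking_handler)
proc.call_via_plt("allocate_object")   # now goes through the handler
```

The resolver runs once, and only the call made after the hook is recorded.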
00:18:11.200
Interestingly, while I’m doing this, I can ensure that symbols remain resolvable as long as I don’t modify the existing symbols in the memory space.
00:18:49.200
Next, we must find out when objects are freed by the Ruby VM, which happens in the "add_freelist" function. However, this function can be inlined by the compiler, which makes capturing calls to it tricky.
00:19:41.280
Since "add_freelist" updates the free list, I'll simply track those updates by scanning for the "mov" instructions that target the free list and overwriting them. This approach lets me monitor when the list gets altered.
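The scanning step might look roughly like the Python sketch below. It is an assumption-laden illustration, not Memprof's scanner: it recognizes only one encoding, `mov %rax, disp32(%rip)` (bytes `48 89 05` followed by a 32-bit displacement), whereas a compiler may store from other registers with different bytes. The addresses and the function name are hypothetical.

```python
import struct

MOV_RAX_TO_RIP_REL = b"\x48\x89\x05"   # mov %rax, disp32(%rip)
MOV_LEN = 7                            # 3 opcode bytes + 4 displacement bytes

def find_stores_to(code, base, target_addr):
    """Return addresses of rip-relative stores of rax to target_addr."""
    hits = []
    offset = code.find(MOV_RAX_TO_RIP_REL)
    while offset != -1 and offset + MOV_LEN <= len(code):
        (disp,) = struct.unpack_from("<i", code, offset + 3)
        # rip-relative operands are relative to the next instruction.
        if base + offset + MOV_LEN + disp == target_addr:
            hits.append(base + offset)
        offset = code.find(MOV_RAX_TO_RIP_REL, offset + 1)
    return hits

BASE = 0x400000          # hypothetical load address of the code
FREELIST = 0x600500      # hypothetical address of the free list pointer
OTHER_VAR = 0x600600     # some unrelated global
code = (
    b"\x90" + MOV_RAX_TO_RIP_REL + struct.pack("<i", FREELIST - (BASE + 8)) +
    b"\x90" + MOV_RAX_TO_RIP_REL + struct.pack("<i", OTHER_VAR - (BASE + 16))
)
hits = find_stores_to(code, BASE, FREELIST)
```

Only the first store is reported, because the second one writes to a different address.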
00:20:56.960
But here's the catch— I can’t simply drop a call instruction at an arbitrary place because it might violate the ABI conditions. Instead, I need an assembly stub to prepare the environment for calling my handler.
00:21:55.200
Through careful assembly, I will align the stack and save the necessary registers before redirecting execution to my handler. Once the operations are completed, the execution will return to the original point, maintaining the program state.
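The stack alignment part is plain arithmetic, sketched here in Python. The function names and the sample stack pointer are made up. Because the stack grows toward lower addresses, aligning means rounding the stack pointer down, which in assembly is typically a single `and` with a mask.

```python
def align_stack_down(rsp, alignment=16):
    """Round a stack pointer down to a power-of-two alignment."""
    return rsp & ~(alignment - 1)

def padding_needed(rsp, alignment=16):
    """Bytes the stub must skip to reach an aligned stack pointer."""
    return rsp - align_stack_down(rsp, alignment)

aligned = align_stack_down(0x7FFFFFFFE3C8)   # hypothetical rsp value
pad = padding_needed(0x7FFFFFFFE3C8)
```

A stub has to remember the original stack pointer, or the padding, so it can restore it before returning to the interrupted code.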
00:22:48.800
I've crafted a stub that allows for hooking the free list update correctly, even ensuring we handle the corresponding return paths and state. This setup is effective for tracking memory allocations and deallocations in Ruby.
00:23:29.000
Let me show you how the output of this process looks. By requiring Memprof in your Ruby environment and starting it properly, you’re able to see detailed insights into memory usage, allowing you to identify exactly where the leaks occur.
00:24:38.960
Additionally, with Memprof, you can track specific blocks, leveraging various middleware options to gather statistics on object counts during requests in web applications.
00:25:22.880
At memprof.com, you can explore the web interface that integrates well with the outputs generated by the gem. You'll find summaries of memory allocation and leaks, helping to debug and optimize Ruby applications.
00:26:09.520
It's worth noting that Memprof currently operates on specific platforms and Ruby versions— primarily 64-bit Linux and specific macOS environments. However, ongoing efforts are being made to accommodate more variations and improve compatibility.
00:27:00.000
I deeply appreciate everyone’s contribution that made this possible. I rely on tools like RVM to test different Ruby versions, and I encourage you all to explore and utilize Memprof.
00:27:45.440
If anyone is interested in the code or collaborating, I can direct you to the GitHub repository for Memprof. Special thanks to those who made significant contributions and to Wayne for his testing setup.
00:28:34.240
I’d like to open the floor to questions now. If anyone has inquiries regarding this process or our methodology, feel free to reach out.
00:29:31.680
Thank you for your attention!
00:29:42.000
Yes, connecting to advanced memory access issues, this approach aligns with NX bit compatibility as well.
00:29:55.200
As a final point, good educational resources for understanding low-level programming can aid in grasping these concepts.
00:30:06.960
For further insights on adapting similar methods in other environments, options exist within different programming contexts— feel free to explore!
00:30:38.720
Thank you once again, and I look forward to your feedback.
00:30:57.120
Any remaining questions?
00:31:08.720
Well, that's it! I'll see you around.